Pravin-Jalodiya/Token_Healing_Algorithm
Token Healing Script

This Python script aims to enhance the readability and coherence of text. The algorithm consists of three main steps:

  1. Spelling and grammar correction
  2. Addition of missing words
  3. Removal of duplicate/extra words

Installation

Make sure you have Python 3.x installed on your system.

To run this script, you need to install the following dependencies:

  1. gingerit: A Python library for spelling and grammar correction.
  2. torch: The PyTorch library for deep learning.
  3. transformers: The Transformers library for natural language processing.
  4. spacy: A Python library for natural language processing.

You can install the required dependencies using the following command:

pip install gingerit torch transformers spacy
python -m spacy download en_core_web_sm

Usage

Make sure you have an active internet connection (the gingerit parser sends requests to an online grammar-checking service).

To use the token healing script, follow these steps:

  1. Run the "Token_healing_script.py" file in your IDE or from the command line
  2. The program will prompt you to enter the text (type or paste the text you want to process)
  3. The program will then process the input and print the enhanced text

Note: If you want the program to insert one or more missing words, place the placeholder "[MISSING]" in the input text wherever a word should be inserted. The program predicts the most suitable word and replaces each "[MISSING]" with it (see the sample output for a concrete example).
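To illustrate the placeholder convention, here is a minimal sketch (an assumption about the internal mechanics, not the script's exact code) of how each user-facing "[MISSING]" marker can be rewritten to the "[MASK]" token that a BERT-style masked language model expects:

```python
import re

def missing_to_mask(text: str) -> str:
    """Replace each [MISSING] placeholder with BERT's [MASK] token."""
    return re.sub(r"\[MISSING\]", "[MASK]", text)

print(missing_to_mask("can you give [MISSING] a glass of [MISSING]?"))
# → can you give [MASK] a glass of [MASK]?
```

The model is then asked to predict the most probable word at each [MASK] position, and the predictions are substituted back into the text.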

Working

Imports:

  • torch: This is the main library used for deep learning and tensor computations.
  • spacy: A popular library used for natural language processing (NLP) tasks.
  • string: Provides a collection of string constants, such as punctuation characters.
  • logging: Used for configuring the logging system to control the verbosity of the transformers library.
  • re: The regular expression module for pattern matching and string manipulation.
  • GingerIt from the gingerit library: A parser for spelling and grammar correction.
  • BertTokenizer and BertForMaskedLM from the transformers library: Components for using the BERT (Bidirectional Encoder
    Representations from Transformers) model for masked language modeling.

Logging Configuration:

  • A logging configuration is set to suppress logs generated by the transformers library, reducing the verbosity of the output.
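One way to achieve this (a minimal sketch; the script may configure logging differently) is to raise the log level of the transformers logger so that only errors are emitted:

```python
import logging

# Suppress informational and warning logs emitted by the transformers
# library; only messages at ERROR level or above will be shown.
logging.getLogger("transformers").setLevel(logging.ERROR)
```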

Initialization:

  • parser: An instance of GingerIt is created, which will be used for spelling and grammar correction.
  • BERT Model Setup: model_name specifies the name of the pre-trained BERT model to be used (bert-base-uncased); tokenizer initializes a BERT tokenizer from that model; model loads the pre-trained BERT model for masked language modeling and sets it to evaluation mode.
  • Spacy Model Setup: The English language model (en_core_web_sm) is loaded into the nlp object.

Function Definitions:

  • correct_spelling_and_grammar: This function takes a text as input and uses the parser object to parse and correct the spelling and grammar of the text. The corrected result is returned.
  • add_missing_words: This function takes a text as input and processes it with BERT. Each "[MISSING]" placeholder in the input marks a position where a word should be inserted; the function swaps the placeholder for BERT's mask token, the model predicts the most likely word at that position, and the prediction is substituted back into the text. The function returns the text with the missing words filled in.
  • remove_duplicate_words: This function takes a text as input and utilizes a regular expression pattern to identify consecutive duplicate words, including those with punctuation. It removes the duplicates and returns the deduplicated text.
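As an illustration of the deduplication step, a consecutive-duplicate filter along these lines could look as follows (a sketch; the script's actual regex pattern may differ):

```python
import re

def remove_duplicate_words(text: str) -> str:
    """Collapse runs of the same consecutive word (case-insensitive),
    keeping the first occurrence and any trailing punctuation."""
    # (\w+) captures a word; (?:\s+\1\b)+ matches one or more repeats of it.
    pattern = r"\b(\w+)(?:\s+\1\b)+"
    return re.sub(pattern, r"\1", text, flags=re.IGNORECASE)

print(remove_duplicate_words("I I I am thirsty thirsty."))
# → I am thirsty.
```

With re.IGNORECASE the backreference also matches case-insensitively, so "The the" collapses as well; only consecutive repeats are removed, not legitimate repeated words elsewhere in the sentence.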

User Input:

  • The code prompts the user to enter a text.

Text Processing Pipeline:

  • corrected_spelling_grammar_text: The user's input text is processed by the correct_spelling_and_grammar function to correct any spelling and grammar errors.
  • final_text_with_missing_words: The resulting text from the previous step is processed by the add_missing_words function, which identifies missing words using BERT and replaces them with predicted tokens.
  • deduplicated_final_text: The text from the previous step is processed by the remove_duplicate_words function, which removes consecutive duplicate words.
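Put together, the pipeline amounts to chaining the three functions in order. The sketch below uses stand-in (identity) implementations, since the real functions call external models and services:

```python
def correct_spelling_and_grammar(text: str) -> str:
    # Stand-in: the real function delegates to the GingerIt parser.
    return text

def add_missing_words(text: str) -> str:
    # Stand-in: the real function fills [MISSING] slots with BERT predictions.
    return text

def remove_duplicate_words(text: str) -> str:
    # Stand-in: the real function strips consecutive duplicate words.
    return text

def enhance(text: str) -> str:
    """Run the three-step token healing pipeline in order."""
    corrected = correct_spelling_and_grammar(text)
    filled = add_missing_words(corrected)
    return remove_duplicate_words(filled)
```

The ordering matters: grammar correction runs first so that BERT sees well-formed context around each placeholder, and deduplication runs last so it can also catch any repeats the earlier steps introduce.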

Output:

  • The enhanced text, after going through the text processing pipeline, is printed as "Enhanced text".

Sample output

Sample input 1 : I will go to the office yesterday!

Sample output 1 : I went to the office yesterday!


Sample input 2 : I I I am veY thirsty thirsty.can you plese give guve [MISSING] a glass of [MISSING]? the [MISSING] fox ased the rabbit. I dont have water, the [MISSING] replied.

Sample output 2 : I am very thirsty. Can you please give me a glass of water? The white fox asked the rabbit. I don't have water, the rabbit replied.


Feel free to modify the code as needed and experiment with different inputs to observe the effects of the token healing algorithm.

Note: The current implementation uses the BERT model and may require significant computational resources. Adjusting the max_length parameter and using smaller models or alternative approaches can help optimize the performance if needed.

For any questions or issues, please feel free to contact me at [email protected].
