This Python script aims to enhance the readability and coherence of text. The algorithm consists of three main steps:
- Spelling and grammar correction
- Addition of missing words
- Removal of duplicate/extra words
Make sure you have Python 3.x installed on your system.
To run this script, you need to install the following dependencies:
- gingerit: A Python library for spelling and grammar correction.
- torch: The PyTorch library for deep learning.
- transformers: The Transformers library for natural language processing.
- spacy: A Python library for natural language processing.
You can install the required dependencies using the following commands:
pip install gingerit torch transformers spacy
python -m spacy download en_core_web_sm
Make sure you have an active internet connection (the pre-trained models are downloaded the first time the script runs).
To use the token healing script, follow these steps:
- Run the "Token_healing_script.py" file in your IDE
- The program will prompt you to enter the text (type or paste the text you want to process)
- Then the program will process the input and output the enhanced text
Note: If you want the program to add one or more missing words, type "[MISSING]" in the input text wherever you want a word to be inserted automatically. The program predicts the most suitable word and replaces "[MISSING]" with it (see the sample output below for a better understanding).
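A terminal session could look roughly like this (the exact prompt wording and the predicted words depend on the script and the model, so treat this as illustrative only):

```
$ python Token_healing_script.py
Enter the text: I want [MISSING] glass of water water.
Enhanced text: I want a glass of water.
```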
- torch: This is the main library used for deep learning and tensor computations.
- spacy: A popular library used for natural language processing (NLP) tasks.
- string: Provides a collection of string constants, such as punctuation characters.
- logging: Used for configuring the logging system to control the verbosity of the transformers library.
- re: The regular expression module for pattern matching and string manipulation.
- GingerIt from the gingerit library: A parser for spelling and grammar correction.
- BertTokenizer and BertForMaskedLM from the transformers library: Components for utilizing the BERT (Bidirectional Encoder Representations from Transformers) model for masked language modeling.
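Taken together, the import section of Token_healing_script.py presumably looks something like the sketch below (a minimal reconstruction from the list above, not a verbatim copy of the script):

```python
import logging
import re
import string

import torch
import spacy
from gingerit.gingerit import GingerIt
from transformers import BertTokenizer, BertForMaskedLM
```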
- A logging configuration is set to suppress logs generated by the transformers library, reducing the verbosity of the output.
- parser: An instance of GingerIt is created, which will be used for spelling and grammar correction.
- BERT Model Setup:
  - model_name: the name of the pre-trained BERT model to use (bert-base-uncased).
  - tokenizer: a BERT tokenizer initialized from the specified pre-trained model.
  - model: a pre-trained BERT model for masked language modeling, set to evaluation mode.
- Spacy Model Setup: The English language model (en_core_web_sm) is loaded into the nlp object.
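Based on that description, the setup can be sketched roughly as follows (logger name, model name, and variable names are taken from the description above; the real script may differ in details):

```python
# Suppress transformers logs to keep the output clean.
logging.getLogger("transformers").setLevel(logging.ERROR)

# GingerIt parser used for spelling and grammar correction.
parser = GingerIt()

# Pre-trained BERT model and tokenizer for masked language modeling.
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
model.eval()  # evaluation mode: inference only, no training

# spaCy English pipeline.
nlp = spacy.load("en_core_web_sm")
```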
- correct_spelling_and_grammar: This function takes a text as input and uses the parser object to parse and correct the spelling and grammar of the text. The corrected result is returned.
- add_missing_words: This function takes a text as input and uses BERT to fill in missing words. It locates each "[MISSING]" placeholder in the input text, has the BERT masked language model predict the most suitable word for that position, and substitutes the prediction back into the text. The function returns the text with the missing words added.
- remove_duplicate_words: This function takes a text as input and utilizes a regular expression pattern to identify consecutive duplicate words, including those with punctuation. It removes the duplicates and returns the deduplicated text.
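A self-contained sketch of what these three functions could look like is given below. It follows the behaviour described above, but the regular expression, the max_length default, and the masked-token handling are reasonable guesses rather than the script's exact code:

```python
def correct_spelling_and_grammar(text):
    # GingerIt returns a dict; the 'result' key holds the corrected text.
    return parser.parse(text)["result"]


def add_missing_words(text, max_length=512):
    # Swap the user-facing placeholder for BERT's mask token.
    masked_text = text.replace("[MISSING]", tokenizer.mask_token)
    inputs = tokenizer(masked_text, return_tensors="pt",
                       truncation=True, max_length=max_length)
    with torch.no_grad():
        logits = model(**inputs).logits

    # Replace every mask position with the highest-scoring vocabulary token.
    token_ids = inputs["input_ids"][0].tolist()
    mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].tolist()
    for pos in mask_positions:
        token_ids[pos] = int(logits[0, pos].argmax())

    # Note: decoding through the uncased tokenizer lowercases the text;
    # the actual script may restore casing differently.
    return tokenizer.decode(token_ids, skip_special_tokens=True)


def remove_duplicate_words(text):
    # Collapse consecutive repeats of the same word (case-insensitive),
    # e.g. "I I I am" -> "I am".
    return re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
```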
- The code prompts the user to enter a text.
- corrected_spelling_grammar_text: The user's input text is processed by the correct_spelling_and_grammar function to correct any spelling and grammar errors.
- final_text_with_missing_words: The resulting text from the previous step is processed by the add_missing_words function, which fills each "[MISSING]" placeholder with a word predicted by BERT.
- deduplicated_final_text: The text from the previous step is processed by the remove_duplicate_words function, which removes consecutive duplicate words.
- The enhanced text, after going through the full processing pipeline, is printed under the label "Enhanced text".
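The main flow then simply chains the three functions, roughly as below (the prompt string and output label wording are assumptions based on the description above):

```python
if __name__ == "__main__":
    text = input("Enter the text: ")

    corrected_spelling_grammar_text = correct_spelling_and_grammar(text)
    final_text_with_missing_words = add_missing_words(corrected_spelling_grammar_text)
    deduplicated_final_text = remove_duplicate_words(final_text_with_missing_words)

    print("Enhanced text:", deduplicated_final_text)
```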

Sample input 2 : I I I am veY thirsty thirsty.can you plese give guve [MISSING] a glass of [MISSING]? the [MISSING] fox ased the rabbit. I dont have water, the [MISSING] replied.
Sample output 2 : I am very thirsty. Can you please give me a glass of water? The white fox asked the rabbit. I don't have water, the rabbit replied.

Feel free to modify the code as needed and experiment with different inputs to observe the effects of the token healing algorithm.
Note: The current implementation uses the BERT model and may require significant computational resources. Adjusting the max_length parameter and using smaller models or alternative approaches can help optimize the performance if needed.
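For example, one option (not part of the current script) is to load a smaller masked language model from the Hugging Face hub, such as distilbert-base-uncased, which has roughly 40% fewer parameters than bert-base-uncased:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "distilbert-base-uncased"  # lighter alternative to bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()
```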
For any questions or issues, please feel free to contact me at [email protected].