Pravin-Jalodiya/Token_Healing_Algorithm
Token Healing Script

This Python script aims to enhance the readability and coherence of text. The algorithm consists of three main steps:

  1. Spelling and grammar correction
  2. Addition of missing words
  3. Removal of duplicate/extra words

Installation

Make sure you have Python 3.x installed on your system.

To run this script, you need to install the following dependencies:

  1. gingerit: A Python library for spelling and grammar correction.
  2. torch: The PyTorch library for deep learning.
  3. transformers: The Transformers library for natural language processing.
  4. spacy: A Python library for natural language processing.

You can install the required dependencies using the following command:

pip install gingerit torch transformers spacy
python -m spacy download en_core_web_sm

Usage

Make sure you have an active internet connection (the gingerit parser sends requests to an online grammar-checking service).

To use the token healing script, follow these steps:

  1. Run the "Token_healing_script.py" file in your IDE or from the command line
  2. The program will prompt you to enter the text (type or paste the text you want to process)
  3. The program will then process the input and print the enhanced text

Note: If you want the program to insert one or more missing words, place the placeholder "[MISSING]" in the input text wherever a word should be inserted. The program predicts the most suitable word and replaces each "[MISSING]" with it (see the sample output for a concrete example).
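To illustrate the placeholder convention, here is a minimal sketch (an assumption about the internal mechanics, not the script's exact code) of how each user-facing "[MISSING]" marker can be rewritten to the "[MASK]" token that a BERT-style masked language model expects:

```python
import re

def missing_to_mask(text: str) -> str:
    """Replace each [MISSING] placeholder with BERT's [MASK] token."""
    return re.sub(r"\[MISSING\]", "[MASK]", text)

print(missing_to_mask("can you give [MISSING] a glass of [MISSING]?"))
# → can you give [MASK] a glass of [MASK]?
```

The model is then asked to predict the most probable word at each [MASK] position, and the predictions are substituted back into the text.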

Working

Imports:

  • torch: This is the main library used for deep learning and tensor computations.
  • spacy: A popular library used for natural language processing (NLP) tasks.
  • string: Provides a collection of string constants, such as punctuation characters.
  • logging: Used for configuring the logging system to control the verbosity of the transformers library.
  • re: The regular expression module for pattern matching and string manipulation.
  • GingerIt from the gingerit library: A parser for spelling and grammar correction.
  • BertTokenizer and BertForMaskedLM from the transformers library: Components for using the BERT (Bidirectional Encoder
    Representations from Transformers) model for masked language modeling.

Logging Configuration:

  • A logging configuration is set to suppress logs generated by the transformers library, reducing the verbosity of the output.
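One way to achieve this (a minimal sketch; the script may configure logging differently) is to raise the log level of the transformers logger so that only errors are emitted:

```python
import logging

# Suppress informational and warning logs emitted by the transformers
# library; only messages at ERROR level or above will be shown.
logging.getLogger("transformers").setLevel(logging.ERROR)
```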

Initialization:

  • parser: An instance of GingerIt is created, which will be used for spelling and grammar correction.
  • BERT Model Setup: model_name specifies the name of the pre-trained BERT model to be used (bert-base-uncased); tokenizer initializes a BERT tokenizer from that model; model loads the pre-trained BERT model for masked language modeling and sets it to evaluation mode.
  • Spacy Model Setup: The English language model (en_core_web_sm) is loaded into the nlp object.

Function Definitions:

  • correct_spelling_and_grammar: This function takes a text as input and uses the parser object to parse and correct the spelling and grammar of the text. The corrected result is returned.
  • add_missing_words: This function takes a text as input and processes it with BERT. Each "[MISSING]" placeholder in the input marks a position where a word should be inserted; the function swaps the placeholder for BERT's mask token, the model predicts the most likely word at that position, and the prediction is substituted back into the text. The function returns the text with the missing words filled in.
  • remove_duplicate_words: This function takes a text as input and utilizes a regular expression pattern to identify consecutive duplicate words, including those with punctuation. It removes the duplicates and returns the deduplicated text.
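As an illustration of the deduplication step, a consecutive-duplicate filter along these lines could look as follows (a sketch; the script's actual regex pattern may differ):

```python
import re

def remove_duplicate_words(text: str) -> str:
    """Collapse runs of the same consecutive word (case-insensitive),
    keeping the first occurrence and any trailing punctuation."""
    # (\w+) captures a word; (?:\s+\1\b)+ matches one or more repeats of it.
    pattern = r"\b(\w+)(?:\s+\1\b)+"
    return re.sub(pattern, r"\1", text, flags=re.IGNORECASE)

print(remove_duplicate_words("I I I am thirsty thirsty."))
# → I am thirsty.
```

With re.IGNORECASE the backreference also matches case-insensitively, so "The the" collapses as well; only consecutive repeats are removed, not legitimate repeated words elsewhere in the sentence.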

User Input:

  • The code prompts the user to enter a text.

Text Processing Pipeline:

  • corrected_spelling_grammar_text: The user's input text is processed by the correct_spelling_and_grammar function to correct any spelling and grammar errors.
  • final_text_with_missing_words: The resulting text from the previous step is processed by the add_missing_words function, which identifies missing words using BERT and replaces them with predicted tokens.
  • deduplicated_final_text: The text from the previous step is processed by the remove_duplicate_words function, which removes consecutive duplicate words.
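Put together, the pipeline amounts to chaining the three functions in order. The sketch below uses stand-in (identity) implementations, since the real functions call external models and services:

```python
def correct_spelling_and_grammar(text: str) -> str:
    # Stand-in: the real function delegates to the GingerIt parser.
    return text

def add_missing_words(text: str) -> str:
    # Stand-in: the real function fills [MISSING] slots with BERT predictions.
    return text

def remove_duplicate_words(text: str) -> str:
    # Stand-in: the real function strips consecutive duplicate words.
    return text

def enhance(text: str) -> str:
    """Run the three-step token healing pipeline in order."""
    corrected = correct_spelling_and_grammar(text)
    filled = add_missing_words(corrected)
    return remove_duplicate_words(filled)
```

The ordering matters: grammar correction runs first so that BERT sees well-formed context around each placeholder, and deduplication runs last so it can also catch any repeats the earlier steps introduce.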

Output:

  • The enhanced text, after going through the text processing pipeline, is printed as "Enhanced text".

Sample output

Sample input 1 : I will go to the office yesterday!

Sample output 1 : I went to the office yesterday!


Sample input 2 : I I I am veY thirsty thirsty.can you plese give guve [MISSING] a glass of [MISSING]? the [MISSING] fox ased the rabbit. I dont have water, the [MISSING] replied.

Sample output 2 : I am very thirsty. Can you please give me a glass of water? The white fox asked the rabbit. I don't have water, the rabbit replied.


Feel free to modify the code as needed and experiment with different inputs to observe the effects of the token healing algorithm.

Note: The current implementation uses the BERT model and may require significant computational resources. Adjusting the max_length parameter and using smaller models or alternative approaches can help optimize the performance if needed.

For any questions or issues, please feel free to contact me at [email protected].
