DEV Community

datatoinfinity

Spell Checker-Predicting Correct Word-NLP-Part 2

 def spell_checker(word, count=5):
     output = []
     suggested_words = edit(word)
     for wrd in suggested_words:
         if wrd in word_probability.keys():
             output.append([wrd, word_probability[wrd]])
     return list(pd.DataFrame(output, columns=['word', 'prob']).sort_values(by='prob', ascending=False).head(count)['word'].values)

Let's break it down step by step.

 def spell_checker(word,count=5): 
  • Defines a function called spell_checker.
  • word is the misspelled word you want to correct.
  • count is the number of top suggestions to return (default: 5).
 output=[] 
  • Initializes an empty list to store valid suggested words with their probabilities.
 suggested_words=edit(word) 
  • Calls the edit() function which is defined earlier.

     def edit(word):
         return set(insert(word) + delete(word) + swap(word) + replace(word))

  • This returns a set of every string that is one edit (insert, delete, swap, or replace) away from the input word. Most of these strings are not real words.

  • Example: for "lve" the set includes 'love', 'live', 'lave', ...
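
The four helpers that edit() combines are not shown in this part. Here is a minimal sketch of what they might look like, assuming a lowercase English alphabet and list return values (so that + concatenates them, as in edit()). The helper bodies and the splits() function are illustrative and may differ from the earlier definitions:

```python
import string

LETTERS = string.ascii_lowercase  # assumption: lowercase English letters only


def splits(word):
    """Hypothetical helper: all (left, right) splits of word."""
    return [(word[:i], word[i:]) for i in range(len(word) + 1)]


def insert(word):
    # Add one letter at every possible position.
    return [l + c + r for l, r in splits(word) for c in LETTERS]


def delete(word):
    # Drop one character at every position.
    return [l + r[1:] for l, r in splits(word) if r]


def swap(word):
    # Exchange two adjacent characters.
    return [l + r[1] + r[0] + r[2:] for l, r in splits(word) if len(r) > 1]


def replace(word):
    # Substitute one character with every letter.
    return [l + c + r[1:] for l, r in splits(word) if r for c in LETTERS]


def edit(word):
    return set(insert(word) + delete(word) + swap(word) + replace(word))


candidates = edit("lve")
print("love" in candidates, "live" in candidates)  # True True
print("lvee" in candidates)  # True: non-words are generated too
```

Because non-words such as "lvee" end up in the set, the candidates still have to be filtered against a dictionary before they are useful.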

 for wrd in suggested_words:
     if wrd in word_probability.keys():
         output.append([wrd, word_probability[wrd]])

What happens here:

  • Loops through each wrd in the set of suggested words.
  • Checks whether wrd is a real word, i.e. whether it is a key in word_probability, which is built from the big.txt corpus.
  • If it is, appends the pair [wrd, probability] to the output list.
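
The filter relies on word_probability, a dictionary mapping each known word to its relative frequency. Here is a minimal sketch of how such a dictionary could be built and used, assuming a tiny inline string in place of big.txt and a hand-written candidate set in place of edit(word):

```python
import re
from collections import Counter

# Assumption: a toy corpus standing in for big.txt.
corpus = "I love to live. They live and love. We love it."

words = re.findall(r"[a-z]+", corpus.lower())
counts = Counter(words)
total = sum(counts.values())

# Relative frequency of each word in the corpus.
word_probability = {w: c / total for w, c in counts.items()}

# Assumption: a few candidates of the kind edit("lve") would produce.
suggested_words = {"love", "live", "lave", "lvee"}

output = []
for wrd in suggested_words:
    if wrd in word_probability:  # same test as `in word_probability.keys()`
        output.append([wrd, word_probability[wrd]])

print(sorted(output))  # only 'live' and 'love' survive the filter
```

Testing membership on the dictionary itself behaves the same as testing on .keys() and is the more common idiom.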

Example:

If 'love' and 'live' are in the corpus with probabilities 0.0042 and 0.0021, output becomes:

 [['love', 0.0042], ['live', 0.0021], ...]

The last line ranks these pairs and returns the best words:

 return list(pd.DataFrame(output, columns=['word', 'prob']).sort_values(by='prob', ascending=False).head(count)['word'].values)

  1. pd.DataFrame(output, columns=['word', 'prob'])

     Converts the list of [word, prob] pairs into a pandas DataFrame:

        word    prob
     0  love  0.0042
     1  live  0.0021

  2. .sort_values(by='prob', ascending=False)

     Sorts the DataFrame so the most frequent (most likely correct) words come first.

  3. .head(count)

     Selects the top count rows (default = 5).

  4. ['word'].values and list(...)

     Extracts just the "word" column as a NumPy array and converts it to a plain Python list.

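
The same ranking can be done without pandas. Here is a sketch that swaps the DataFrame chain for the built-in sorted(); the top_words name and the probabilities are made up for illustration:

```python
# Illustrative [word, prob] pairs; the numbers are invented.
output = [["live", 0.0021], ["love", 0.0042], ["lave", 0.0001]]


def top_words(pairs, count=5):
    """Return up to `count` words, highest probability first."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:count]]


print(top_words(output))           # ['love', 'live', 'lave']
print(top_words(output, count=2))  # ['love', 'live']
print(top_words([]))               # []
```

For a handful of candidates this avoids building a DataFrame, though the pandas version is convenient if you want to inspect the probabilities as a table.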
 spell_checker('famili') 

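edit() only generates strings that are one edit away, so only corpus words at that distance can be suggested. 'family' qualifies (replace the final i with y); words such as familiar, fail, or famine are two or more edits from 'famili' and cannot appear. Depending on the corpus, you might get:

 ['family']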
