datatoinfinity

Spell Checker - Finding probability distribution-NLP

Finding Probability Distribution

Kaggle Dataset for Spelling Corrector

You need to download big.txt from the Kaggle dataset, or create a notebook with the dataset attached.

 import re

 with open('/kaggle/input/spelling/big.txt', 'r') as fd:
     lines = fd.readlines()

 words = []
 for line in lines:
     words += line.split(' ')

 len(words)

 Output: 1164968

Explanation:

  1. Open /kaggle/input/spelling/big.txt for reading.
  2. fd.readlines() reads every line from the file and returns a list of strings. Example: lines = [ "I love NLP", "Spell checkers are helpful", "Python is powerful" ]
  3. Iterate through that list with for line in lines:
  4. words += line.split(' ') splits each line into words, using a space as the separator, and appends them to the words list.
  5. len(words) returns the number of items in the words list.
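The steps above can be sketched with a tiny inline sample instead of big.txt (the sample sentences are an assumption, not taken from the file):

```python
# Stand-in for fd.readlines(): a list of lines, newlines included.
lines = ["I love NLP\n", "Spell checkers are helpful\n"]

words = []
for line in lines:
    words += line.split(' ')  # split on single spaces only

print(len(words))  # 7
print(words[2])    # 'NLP\n' — the newline stays attached to the last word
```

Note how the trailing '\n' survives the split; this is exactly the problem addressed in the next section.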
 import re

 with open('/kaggle/input/spelling/big.txt', 'r') as fd:
     lines = fd.readlines()

 words = []
 for line in lines:
     words += re.findall(r'\w+', line)

 len(words)

 Output: 1115585

re.findall(r'\w+', line) does the same job — it splits the line into words and returns them as a list. So the question arises: what's the difference?

line.split(' ')

This separates the line on single spaces, but all other characters stay attached to the words — newlines, punctuation like '(', ')', '.', '#', and even empty strings end up in the list.

 print(words[:100]) 
 Output: ['The','Project','Gutenberg', 'EBook','of','The','Adventures', 'of','Sherlock','Holmes\n','by','Sir','Arthur','Conan','Doyle\n', '(#15','in','our','series','by','Sir','Arthur','Conan','Doyle)\n', '\n','Copyright','laws','are','changing','all','over','the', 'world.','Be','sure','to','check','the\n','copyright','laws', 'for','your','country','before','downloading','or','redistributing\n','this','or','any','other','Project','Gutenberg','eBook.\n', '\n','This','header','should','be','the','first','thing','seen', 'when','viewing','this','Project\n','Gutenberg','file.','', 'Please','do','not','remove','it.','','Do','not','change','or', 'edit','the\n','header','without','written','permission.\n', '\n','Please','read','the','"legal','small','print,"','and', 'other','information','about','the\n','eBook','and'] 

re.findall(r'\w+',line)

\w+ matches any word made up of:

  • Letters (A–Z, a–z)
  • Numbers (0–9)
  • Underscore _

It will not match any character outside this set.

 print(words[:30]) 
 Output: ['The','Project','Gutenberg','EBook','of','The','Adventures','of', 'Sherlock','Holmes','by','Sir','Arthur','Conan','Doyle','15', 'in','our','series','by','Sir','Arthur','Conan','Doyle','Copyright','laws','are','changing','all','over'] 
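The difference is easy to see on a single line. The sample line below is a hypothetical illustration (not taken verbatim from big.txt):

```python
import re

line = "Doyle)\n(#15 in our series."

# split(' ') keeps punctuation and newlines glued to the words
print(line.split(' '))           # ['Doyle)\n(#15', 'in', 'our', 'series.']

# \w+ keeps only runs of letters, digits, and underscores
print(re.findall(r'\w+', line))  # ['Doyle', '15', 'in', 'our', 'series']
```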

Let's check how many unique words there are:

 print(len(words))
 vocab = list(set(words))
 print(len(vocab))

 Output:
 1115585
 38160
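The set() deduplication can be checked on a toy list (the sample words are an assumption):

```python
words = ["the", "cat", "saw", "the", "cat"]
vocab = list(set(words))  # set() drops duplicates

print(len(words))  # 5 — total tokens
print(len(vocab))  # 3 — unique words: 'the', 'cat', 'saw'
```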

Finding Probability Distribution

It tells us how frequently each word occurs — in other words, the probability of drawing that word from the corpus.

 words.count('the') 
 Output: 79809 

There are 79809 occurrences of 'the' in the list. Do remember to use the lower() function, otherwise 'The' and 'the' are counted as different words:

 for line in lines:
     words += re.findall(r'\w+', line.lower())
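A toy text (an assumption, not from big.txt) shows why lower() matters for the counts:

```python
import re

text = "The cat saw the dog. The dog ran."

words_cased = re.findall(r'\w+', text)
words_lower = re.findall(r'\w+', text.lower())

print(words_cased.count('the'))  # 1 — 'The' is counted separately
print(words_lower.count('the'))  # 3 — all spellings merged
```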

Probability Distribution

 len(words)/words.count('the')

 Output: 13.978185417684722

Note that this expression is the reciprocal of the probability — on average, about one word in every 14 is 'the'. The probability itself is words.count('the')/len(words) ≈ 0.0715.
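As a quick sanity check, we can plug in the two numbers reported above:

```python
# Values taken from the outputs above
count_the = 79809   # words.count('the')
total = 1115585     # len(words)

p_the = count_the / total
print(p_the)              # probability of 'the', roughly 0.0715
print(total / count_the)  # its reciprocal, roughly 13.978
```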

Word counts for the first 10 words of the vocabulary:

 word_probability = {}
 for word in vocab[:10]:
     print(word, words.count(word))

 Output:
 susan 1
 Tillage 1
 shortly 21
 enlivened 2
 1720 1
 victors 3
 shipments 2
 Go 100
 constitution 63
 blur 1

If we want the probability of every word:

 from tqdm import tqdm

 word_probability = {}
 for word in tqdm(vocab):
     word_probability[word] = words.count(word) / len(words)
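Calling words.count(word) inside the loop rescans the whole list once per vocabulary word, which is slow on a million-word corpus. A faster alternative sketch uses collections.Counter to count every word in a single pass (the toy word list here is an assumption):

```python
from collections import Counter

words = ["the", "cat", "saw", "the", "dog"]

counts = Counter(words)   # one pass over the corpus
total = len(words)
word_probability = {w: c / total for w, c in counts.items()}

print(word_probability["the"])  # 0.4
```

The probabilities still sum to 1, and the result matches the loop above — it is just computed in O(n) instead of O(n × |vocab|).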
