datatoinfinity

Spell Checker - Finding probability distribution-NLP

Finding Probability Distribution

Kaggle Dataset for Spelling Corrector

You need to download big.txt from the Kaggle dataset, or create a notebook with the dataset attached.

 import re

 with open('/kaggle/input/spelling/big.txt', 'r') as fd:
     lines = fd.readlines()

 words = []
 for line in lines:
     words += line.split(' ')

 len(words)

 Output: 1164968

Explanation:

  1. Open /kaggle/input/spelling/big.txt for reading.
  2. fd.readlines() reads every line from the file and returns a list of strings. Example: lines = [ "I love NLP", "Spell checkers are helpful", "Python is powerful" ]
  3. Iterate through that list with for line in lines:
  4. words += line.split(' ') splits each line into words, using a space as the separator, and appends them to the words list.
  5. len(words) returns the number of items in the words list.
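The steps above can be sketched with a tiny inline sample instead of big.txt (the sample sentences are an assumption, not taken from the file):

```python
# Stand-in for fd.readlines(): a list of lines, newlines included.
lines = ["I love NLP\n", "Spell checkers are helpful\n"]

words = []
for line in lines:
    words += line.split(' ')  # split on single spaces only

print(len(words))  # 7
print(words[2])    # 'NLP\n' — the newline stays attached to the last word
```

Note how the trailing '\n' survives the split; this is exactly the problem addressed in the next section.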
 import re

 with open('/kaggle/input/spelling/big.txt', 'r') as fd:
     lines = fd.readlines()

 words = []
 for line in lines:
     words += re.findall(r'\w+', line)

 len(words)

 Output: 1115585

re.findall(r'\w+', line) does the same job — it splits the line into words and returns them as a list. So the question arises: what's the difference?

line.split(' ')

This separates the line on single spaces, but all other characters stay attached to the words — newlines, punctuation like '(', ')', '.', '#', and even empty strings end up in the list.

 print(words[:100]) 
 Output: ['The','Project','Gutenberg', 'EBook','of','The','Adventures', 'of','Sherlock','Holmes\n','by','Sir','Arthur','Conan','Doyle\n', '(#15','in','our','series','by','Sir','Arthur','Conan','Doyle)\n', '\n','Copyright','laws','are','changing','all','over','the', 'world.','Be','sure','to','check','the\n','copyright','laws', 'for','your','country','before','downloading','or','redistributing\n','this','or','any','other','Project','Gutenberg','eBook.\n', '\n','This','header','should','be','the','first','thing','seen', 'when','viewing','this','Project\n','Gutenberg','file.','', 'Please','do','not','remove','it.','','Do','not','change','or', 'edit','the\n','header','without','written','permission.\n', '\n','Please','read','the','"legal','small','print,"','and', 'other','information','about','the\n','eBook','and'] 

re.findall(r'\w+',line)

\w+ matches any word made up of:

  • Letters (A–Z, a–z)
  • Numbers (0–9)
  • Underscore _

It will not match any character outside this set.

 print(words[:30]) 
 Output: ['The','Project','Gutenberg','EBook','of','The','Adventures','of', 'Sherlock','Holmes','by','Sir','Arthur','Conan','Doyle','15', 'in','our','series','by','Sir','Arthur','Conan','Doyle','Copyright','laws','are','changing','all','over'] 
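The difference is easy to see on a single line. The sample line below is a hypothetical illustration (not taken verbatim from big.txt):

```python
import re

line = "Doyle)\n(#15 in our series."

# split(' ') keeps punctuation and newlines glued to the words
print(line.split(' '))           # ['Doyle)\n(#15', 'in', 'our', 'series.']

# \w+ keeps only runs of letters, digits, and underscores
print(re.findall(r'\w+', line))  # ['Doyle', '15', 'in', 'our', 'series']
```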

Let's check how many unique words there are:

 print(len(words))
 vocab = list(set(words))
 print(len(vocab))

 Output:
 1115585
 38160
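The set() deduplication can be checked on a toy list (the sample words are an assumption):

```python
words = ["the", "cat", "saw", "the", "cat"]
vocab = list(set(words))  # set() drops duplicates

print(len(words))  # 5 — total tokens
print(len(vocab))  # 3 — unique words: 'the', 'cat', 'saw'
```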

Finding Probability Distribution

It tells us how frequently each word occurs — in other words, the probability of drawing that word from the corpus.

 words.count('the') 
 Output: 79809 

There are 79809 occurrences of 'the' in the list. Do remember to use the lower() function, otherwise 'The' and 'the' are counted as different words:

 for line in lines:
     words += re.findall(r'\w+', line.lower())
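A toy text (an assumption, not from big.txt) shows why lower() matters for the counts:

```python
import re

text = "The cat saw the dog. The dog ran."

words_cased = re.findall(r'\w+', text)
words_lower = re.findall(r'\w+', text.lower())

print(words_cased.count('the'))  # 1 — 'The' is counted separately
print(words_lower.count('the'))  # 3 — all spellings merged
```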

Probability Distribution

 len(words)/words.count('the')

 Output: 13.978185417684722

Note that this expression is the reciprocal of the probability — on average, about one word in every 14 is 'the'. The probability itself is words.count('the')/len(words) ≈ 0.0715.
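As a quick sanity check, we can plug in the two numbers reported above:

```python
# Values taken from the outputs above
count_the = 79809   # words.count('the')
total = 1115585     # len(words)

p_the = count_the / total
print(p_the)              # probability of 'the', roughly 0.0715
print(total / count_the)  # its reciprocal, roughly 13.978
```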

Word counts for the first 10 words of the vocabulary:

 word_probability = {}
 for word in vocab[:10]:
     print(word, words.count(word))

 Output:
 susan 1
 Tillage 1
 shortly 21
 enlivened 2
 1720 1
 victors 3
 shipments 2
 Go 100
 constitution 63
 blur 1

If we want the probability of every word:

 from tqdm import tqdm

 word_probability = {}
 for word in tqdm(vocab):
     word_probability[word] = words.count(word) / len(words)
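Calling words.count(word) inside the loop rescans the whole list once per vocabulary word, which is slow on a million-word corpus. A faster alternative sketch uses collections.Counter to count every word in a single pass (the toy word list here is an assumption):

```python
from collections import Counter

words = ["the", "cat", "saw", "the", "dog"]

counts = Counter(words)   # one pass over the corpus
total = len(words)
word_probability = {w: c / total for w, c in counts.items()}

print(word_probability["the"])  # 0.4
```

The probabilities still sum to 1, and the result matches the loop above — it is just computed in O(n) instead of O(n × |vocab|).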
