Finding Probability Distribution
Kaggle Dataset for Spelling Corrector
You need to download big.txt, or create a notebook from the dataset.
```python
import re

with open('/kaggle/input/spelling/big.txt', 'r') as fd:
    lines = fd.readlines()

words = []
for line in lines:
    words += line.split(' ')

len(words)
```
Output: 1164968
Explanation:
- Load the file `/kaggle/input/spelling/big.txt`.
- `fd.readlines()` reads all lines from the file and returns a list of strings. Example: `lines = ["I love NLP", "Spell checkers are helpful", "Python is powerful"]`.
- Iterate through that list with `for line in lines:`.
- `words += line.split(' ')` splits each line into words, using a space as the separator, and appends them to the list.
- `len(words)` returns the number of words in the `words` list.
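As a quick sanity check, the same pipeline can be run on the three example lines above (a minimal sketch; the sample lines are made up):

```python
# Toy stand-in for fd.readlines(): each line keeps its trailing newline
lines = ["I love NLP\n", "Spell checkers are helpful\n", "Python is powerful\n"]

words = []
for line in lines:
    words += line.split(' ')  # split on single spaces only

print(words)      # note tokens like 'NLP\n' keep the newline character
print(len(words))
```

Notice that tokens such as `'NLP\n'` still carry the newline, which is exactly the problem the next snippet solves.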
```python
import re

with open('/kaggle/input/spelling/big.txt', 'r') as fd:
    lines = fd.readlines()

words = []
for line in lines:
    words += re.findall(r'\w+', line)

len(words)
```
Output: 1115585
`re.findall(r'\w+', line)` does the same thing: it splits the line into words and returns them as a list. Now the question arises: what's the difference?

`line.split(' ')`

This separates the line on spaces, but the resulting tokens also keep other characters such as newlines, '\', '*', '.', '&', etc.
print(words[:100])
Output: ['The','Project','Gutenberg', 'EBook','of','The','Adventures', 'of','Sherlock','Holmes\n','by','Sir','Arthur','Conan','Doyle\n', '(#15','in','our','series','by','Sir','Arthur','Conan','Doyle)\n', '\n','Copyright','laws','are','changing','all','over','the', 'world.','Be','sure','to','check','the\n','copyright','laws', 'for','your','country','before','downloading','or','redistributing\n','this','or','any','other','Project','Gutenberg','eBook.\n', '\n','This','header','should','be','the','first','thing','seen', 'when','viewing','this','Project\n','Gutenberg','file.','', 'Please','do','not','remove','it.','','Do','not','change','or', 'edit','the\n','header','without','written','permission.\n', '\n','Please','read','the','"legal','small','print,"','and', 'other','information','about','the\n','eBook','and']
`re.findall(r'\w+', line)`

`\w+` matches any word made up of:
- Letters (A–Z, a–z)
- Numbers (0–9)
- Underscore _
It will not match any characters other than these.
print(words[:30])
Output: ['The','Project','Gutenberg','EBook','of','The','Adventures','of', 'Sherlock','Holmes','by','Sir','Arthur','Conan','Doyle','15', 'in','our','series','by','Sir','Arthur','Conan','Doyle','Copyright','laws','are','changing','all','over']
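The difference is easy to see side by side on a single line (a small sketch using a made-up line in the style of the output above):

```python
import re

line = "Doyle (#15 in our series)\n"

tokens_split = line.split(' ')           # keeps punctuation and the trailing newline
tokens_re = re.findall(r'\w+', line)     # keeps only letters, digits, underscores

print(tokens_split)  # ['Doyle', '(#15', 'in', 'our', 'series)\n']
print(tokens_re)     # ['Doyle', '15', 'in', 'our', 'series']
```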
Let's check how many unique words there are:
```python
print(len(words))
vocab = list(set(words))
print(len(vocab))
```
Output:
1115585
38160
Finding Probability Distribution
It tells us how frequently a word is repeated, i.e. the probability of drawing that word from the corpus.
words.count('the')
Output: 79809
There are 79809 occurrences of 'the' in the list. Remember to use the lower() function, because otherwise 'The' and 'the' are counted as different words.
```python
words = []
for line in lines:
    words += re.findall(r'\w+', line.lower())
```
Probability Distribution

The probability of a word is its count divided by the total number of words. The reciprocal of that tells us, on average, one in how many words is 'the':

```python
len(words) / words.count('the')
```

Output: 13.978185417684722

So roughly one in every 14 words is 'the', i.e. P('the') ≈ 1/14 ≈ 0.07.
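On a tiny made-up corpus, the relationship between this ratio and the word's probability is easy to verify:

```python
words = ['the', 'cat', 'sat', 'on', 'the', 'mat']

p_the = words.count('the') / len(words)   # probability: 2 / 6
print(p_the)

print(len(words) / words.count('the'))    # reciprocal: one 'the' per 3 words
```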
Probability distribution for the first 10 words:
```python
word_probability = {}
for word in vocab[:10]:
    print(word, words.count(word))
```
Output:
susan 1
Tillage 1
shortly 21
enlivened 2
1720 1
victors 3
shipments 2
Go 100
constitution 63
blur 1
If we want the probability of every word:
```python
from tqdm import tqdm

word_probability = {}
for word in tqdm(vocab):
    word_probability[word] = words.count(word) / len(words)
```
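Calling `words.count()` inside the loop rescans the whole list once per vocabulary word, which is slow for ~38k unique words. A single-pass alternative with `collections.Counter` (a sketch, not the notebook's original code; the small `words` list here is a stand-in for the corpus) gives the same probabilities much faster:

```python
from collections import Counter

# Stand-in for the corpus word list built above
words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the']

counts = Counter(words)                  # one pass over the whole list
total = len(words)
word_probability = {word: count / total for word, count in counts.items()}

print(word_probability['the'])           # 3 occurrences out of 7 words
```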