
Mahmoud Sehsah

Getting Started with Natural Language Toolkit (NLTK)

Introduction

NLTK (Natural Language Toolkit) is one of the most popular Python libraries for working with human language data, i.e., text. This tutorial will guide you through the installation process, basic concepts, and some key functionalities of NLTK.

Link for the Notebook

1. Installation

First, you need to install NLTK. You can do this easily using pip. In your command line (Terminal, Command Prompt, etc.), enter the following command (in a Jupyter notebook, prefix it with ! so it runs as a shell command):

pip install nltk
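As a quick sanity check, you can import the package and print its version:

import nltk

# A version string (e.g., '3.8.1') confirms the install worked.
print(nltk.__version__)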

2. Understanding the Role of nltk.download() in NLTK Setup

NLTK ships its datasets and pretrained models (corpora, tokenizers, taggers) separately from the library itself. Calling nltk.download() opens an interactive downloader that fetches these resources and keeps them up to date, which eases setup.

import nltk
nltk.download()
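The interactive downloader is convenient for browsing, but in scripts it is usually easier to fetch resources by name. Here is a sketch that grabs everything this tutorial uses:

import nltk

# Download only the resources used later in this tutorial.
for resource in [
    'punkt',                       # sentence and word tokenizer models
    'stopwords',                   # common-word lists for many languages
    'wordnet',                     # lexical database used by the lemmatizer
    'averaged_perceptron_tagger',  # part-of-speech tagger model
    'maxent_ne_chunker',           # named-entity chunker model
    'words',                       # word list used by the NE chunker
]:
    nltk.download(resource)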

3. Tokenization

Tokenization is the process of splitting a text into meaningful units, such as words or sentences.

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hello there! How are you? I hope you're learning a lot from this tutorial."

# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)

# Word tokenization
words = word_tokenize(text)
print(words)
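With the punkt models installed, the output should look roughly like this; note how word tokenization splits punctuation and the contraction "you're" into separate tokens:

# Expected output (approximately):
# ['Hello there!', 'How are you?', "I hope you're learning a lot from this tutorial."]
# ['Hello', 'there', '!', 'How', 'are', 'you', '?', 'I', 'hope', 'you', "'re",
#  'learning', 'a', 'lot', 'from', 'this', 'tutorial', '.']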


4. Part-of-Speech (POS) Tagging

POS tagging means labeling words with their part of speech (noun, verb, adjective, etc.).

from nltk import pos_tag
from nltk.tokenize import word_tokenize

words = word_tokenize("I am learning NLP with NLTK")
pos_tags = pos_tag(words)
print(pos_tags)
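The tags come from the Penn Treebank tagset: PRP is a personal pronoun, VBP a present-tense verb, VBG a gerund, and so on. If a tag is unfamiliar, NLTK can describe it for you; this sketch assumes the 'tagsets' resource has been downloaded:

import nltk

nltk.download('tagsets')       # documentation for the tagsets used by nltk.help
nltk.help.upenn_tagset('VBG')  # prints the definition and examples for the VBG tag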


5. Stopwords

Stopwords are common words that are usually removed from text because they carry little meaningful information.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

words = word_tokenize("Hello there! How are you? I hope you're learning a lot from this tutorial.")
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]
print(filtered_words)
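One caveat: the English stopword list is all lowercase, so capitalized tokens such as 'How' and 'I' survive the filter above. Continuing from the snippet, a common fix is to compare lowercased tokens:

# Lowercase each token before checking, so 'How', 'I', etc. are filtered too.
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)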


6. Stemming

Stemming is a process of stripping suffixes from words to extract the base or root form, known as the 'stem'. For example, the stem of the words 'waiting', 'waited', and 'waits' is 'wait'.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
sentence = "It's important to be waiting patiently when you're learning to code."
words = word_tokenize(sentence)
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
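Because stemming applies fixed suffix-stripping rules with no dictionary lookup, the result is not always a real word. A quick illustration with the Porter stemmer:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
# The 'wait' variants from the text all collapse to the same stem:
print([ps.stem(w) for w in ['waiting', 'waited', 'waits']])  # ['wait', 'wait', 'wait']
# But the stem is not always a dictionary word:
print(ps.stem('easily'))      # 'easili'
print(ps.stem('university'))  # 'univers'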


7. Lemmatization

Lemmatization is the process of reducing a word to its base or dictionary form, known as the 'lemma'. Unlike stemming, lemmatization uses a vocabulary (WordNet) and the word's part of speech to produce a meaningful base form. For instance, 'is', 'are', and 'am' are all lemmatized to 'be' when tagged as verbs.

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')  # pass download_dir='/path/to/nltk_data' if your data lives outside the default location

lemmatizer = WordNetLemmatizer()
sentence = "The leaves on the ground were raked by the gardener, who was also planting bulbs for the coming spring."
words = word_tokenize(sentence)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
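By default, WordNetLemmatizer.lemmatize() treats every word as a noun, which is why verbs such as 'was' and 'were' pass through unchanged above. Supplying the pos argument ('v' for verb) produces the behavior described earlier:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('are'))           # 'are' - treated as a noun by default
print(lemmatizer.lemmatize('are', pos='v'))  # 'be'
print(lemmatizer.lemmatize('was', pos='v'))  # 'be'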


8. Frequency Distribution

A frequency distribution records how often each vocabulary item occurs in a text.

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

words = word_tokenize("I need to write a very, very simple sentence")
fdist = FreqDist(words)
print(fdist.most_common(1))
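A FreqDist also behaves like a dictionary of counts, so you can query individual tokens or ask for the top n items:

print(fdist['very'])         # 2 - 'very' occurs twice in the sentence
print(fdist.most_common(3))  # the three most frequent tokens with their counts
print(fdist.N())             # total number of tokens counted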


9. Named Entity Recognition (NER)

NER is used to identify entities like names, locations, dates, etc., in the text.

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "I will travel to Spain"

# Tokenize the sentence
words = word_tokenize(sentence)

# Part-of-speech tagging
pos_tags = pos_tag(words)

# Named entity recognition
named_entities = ne_chunk(pos_tags)

# Print named entities
print(named_entities)
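ne_chunk() returns an nltk.Tree in which each recognized entity is a labeled subtree; here 'Spain' should be labeled GPE (geopolitical entity). To extract the entities programmatically you can walk the tree, as in this minimal sketch:

from nltk.tree import Tree

# Collect (entity text, entity label) pairs from the chunked result.
entities = [
    (' '.join(token for token, tag in subtree.leaves()), subtree.label())
    for subtree in named_entities
    if isinstance(subtree, Tree)
]
print(entities)  # e.g., [('Spain', 'GPE')]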

