Python - Lemmatization Approaches with Examples
Last Updated : 23 Jul, 2025
Lemmatization is the process of reducing words to their base or dictionary form (lemma). Unlike stemming, which simply cuts off word endings, it uses a full vocabulary and linguistic rules to ensure accurate word reduction. For example:
- meeting → meet
- was → be
- mice → mouse
Let's explore several popular Python libraries for performing lemmatization.
1. WordNet
WordNet is a large lexical database of the English language and one of the earliest resources used for lemmatization in Python. It groups words into sets of synonyms (synsets) that are related to each other. WordNet is available through the NLTK (Natural Language Toolkit) library and is widely used for text preprocessing tasks.
For installation run the following command:
!pip install nltk
Let's see an example:
Python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
word = "meeting"
lemma = lemmatizer.lemmatize(word, pos='v')
print(f"Lemmatized Word: {lemma}")
Output:
Lemmatized Word: meet
2. WordNet with POS Tagging
By default, the WordNet lemmatizer assumes every word is a noun. For more accurate lemmatization, especially of verbs and adjectives, Part-of-Speech (POS) tagging is required. POS tagging tells the lemmatizer whether a word is a noun, verb or adjective. Let's see an example to understand this better:
Python
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
sentence = "The dogs are running"
tokens = word_tokenize(sentence)
tagged = pos_tag(tokens)

# Treat verb tags (VB, VBD, VBG, ...) as verbs; everything else as nouns
lemmatized_words = [
    lemmatizer.lemmatize(word, pos='v' if tag.startswith('V') else 'n')
    for word, tag in tagged
]
print(lemmatized_words)
Output:
['The', 'dog', 'be', 'run']
3. TextBlob
TextBlob is a simpler library built on top of NLTK and Pattern. It provides a convenient API to perform common NLP tasks like lemmatization. TextBlob’s lemmatization is easy to use and requires minimal setup.
For installation run the following command:
!pip install textblob
!python -m textblob.download_corpora
Let's see an example:
Python
from textblob import Word

word = Word("running")
print(word.lemmatize("v"))
Output:
run
4. TextBlob with POS Tagging
Using POS tagging with TextBlob ensures that words are lemmatized accurately. By default, TextBlob treats every word as a noun, so for verbs and adjectives POS tagging can significantly improve lemmatization accuracy. Let's see an example:
Python
from textblob import TextBlob

sentence = "The dogs barking"
blob = TextBlob(sentence)

# Lemmatize as a verb only where the POS tag marks a verb (VB, VBD, VBG, ...)
lemmatized_words = [
    word.lemmatize('v') if tag.startswith('VB') else word
    for word, tag in blob.tags
]
print(f"Lemmatized Sentence: {' '.join(lemmatized_words)}")
Output:
Lemmatized Sentence: The dogs bark
5. SpaCy
spaCy is one of the most powerful NLP libraries in Python, known for its speed and ease of use. It provides pre-trained models for tokenization, lemmatization, POS tagging and more. spaCy's lemmatization is highly accurate and works well with complex sentence structures.
For installation run the following command:
pip install spacy
python -m spacy download en_core_web_sm
Let's see an example:
Python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("The cats are sitting")

for token in doc:
    print(token.text, token.lemma_)
Output:
The the
cats cat
are be
sitting sit
6. Gensim
Gensim is widely used for topic modeling and document similarity over large text corpora. Its own lemmatization utility relied on the Pattern library and was removed in Gensim 4.0, so the usual approach today is to tokenize with gensim and lemmatize the tokens with NLTK's WordNetLemmatizer. This combination is well suited to large-scale text processing.
Installation:
!pip install gensim nltk
Let's see an example:
Python
import nltk
from nltk.stem import WordNetLemmatizer
from gensim.utils import simple_preprocess

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
text = "The cats are running and the dogs were barking."

# simple_preprocess lowercases, strips punctuation and tokenizes
tokens = simple_preprocess(text)
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]

print("Original Tokens:", tokens)
print("Lemmatized Tokens:", lemmatized_tokens)
Output:
Original Tokens: ['the', 'cats', 'are', 'running', 'and', 'the', 'dogs', 'were', 'barking']
Lemmatized Tokens: ['the', 'cat', 'are', 'running', 'and', 'the', 'dog', 'were', 'barking']
Note that without POS information the lemmatizer treats every token as a noun, so verb forms such as "running" and "were" pass through unchanged.
With these techniques, lemmatization can be performed easily in Python and applied to real-world text-processing projects.