Lemmatization is the process of reducing words to their base or dictionary form (lemma). Unlike stemming, which simply cuts off word endings, it uses a full vocabulary and linguistic rules to ensure accurate word reduction. For example, a stemmer reduces "studies" to the non-word "studi", while a lemmatizer correctly produces "study".
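A minimal sketch of this difference, assuming NLTK is installed, comparing NLTK's PorterStemmer with the WordNetLemmatizer on the same words:
Python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops suffixes blindly; lemmatization looks the word up
for word in ["studies", "running", "ate"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos='v'))
Here stemming turns "studies" into "studi" and leaves "ate" untouched, while lemmatization yields "study" and "eat".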
Let's explore several popular Python libraries for performing lemmatization:
1. WordNet
WordNet is a large lexical database of the English language and the basis of one of the earliest lemmatization methods in Python. It groups words into sets of synonyms (synsets) which are related to each other. WordNet is part of the NLTK (Natural Language Toolkit) library and is widely used for text preprocessing tasks.
For installation run the following command:
!pip install nltk
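As a quick look at the synsets mentioned above, here is a small sketch (the word "meeting" is just an illustrative choice):
Python
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')

# Each synset groups together word senses that are synonymous
for syn in wordnet.synsets("meeting")[:3]:
    print(syn.name(), "-", syn.definition())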
Let's see an example:
Python
import nltk
from nltk.stem import WordNetLemmatizer

# Download the WordNet data needed by the lemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

word = "meeting"
# pos='v' tells the lemmatizer to treat the word as a verb
lemma = lemmatizer.lemmatize(word, pos='v')
print(f"Lemmatized Word: {lemma}")
Output:
meet
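The pos argument drives this result. As a small sketch, the same word maps to different lemmas depending on the part of speech you pass:
Python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# "meeting" is a valid noun, so as a noun it stays unchanged
print(lemmatizer.lemmatize("meeting", pos='n'))  # meeting
# As a verb it reduces to its base form
print(lemmatizer.lemmatize("meeting", pos='v'))  # meet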
2. WordNet with POS Tagging
By default, the WordNet Lemmatizer assumes words are nouns. For more accurate lemmatization, especially of verbs and adjectives, Part-of-Speech (POS) tagging is required. POS tagging tells the lemmatizer whether the word is a noun, verb or adjective. Let's see an example to understand this better:
Python
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download the tokenizer and tagger models along with WordNet
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

sentence = "The dogs are running"
tokens = word_tokenize(sentence)
tagged = pos_tag(tokens)  # e.g. [('The', 'DT'), ('dogs', 'NNS'), ...]

# Treat words tagged as verbs (V*) as verbs, everything else as nouns
lemmatized_words = [lemmatizer.lemmatize(word, pos='v' if tag.startswith('V') else 'n')
                    for word, tag in tagged]
print(lemmatized_words)
Output:
['The', 'dog', 'be', 'run']
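The example above only distinguishes verbs from nouns. A common refinement, sketched below, maps all four Penn Treebank tag families to WordNet POS codes so adjectives and adverbs are handled too (penn_to_wordnet is a hypothetical helper name, not an NLTK function):
Python
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

def penn_to_wordnet(tag):
    # Map Penn Treebank tag prefixes to WordNet POS codes,
    # defaulting to noun just like the lemmatizer itself
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tagged = pos_tag(word_tokenize("The striped bats were hanging upside down"))
print([lemmatizer.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged])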
3. TextBlob
TextBlob is a simpler library built on top of NLTK and Pattern. It provides a convenient API to perform common NLP tasks like lemmatization. TextBlob's lemmatization is easy to use and requires minimal setup.
For installation run the following command:
!pip install textblob
Let's see an example:
Python
from textblob import Word

word = Word("running")
# "v" tells TextBlob to lemmatize the word as a verb
print(word.lemmatize("v"))
Output:
run
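A sketch of extending this to a whole sentence: blob.words is a list of Word objects, each of which supports lemmatize() (this assumes the TextBlob corpora have been downloaded, e.g. via python -m textblob.download_corpora):
Python
from textblob import TextBlob

blob = TextBlob("The cats are running")
# Each item in blob.words is a Word, so it can be lemmatized directly
print([word.lemmatize('v') for word in blob.words])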
4. TextBlob with POS Tagging
Using POS tagging with TextBlob ensures that words are lemmatized accurately. By default, TextBlob treats every word as a noun, so for verbs and adjectives POS tagging can significantly improve lemmatization accuracy. Let's see an example:
Python
from textblob import TextBlob

sentence = "The dogs barking"
blob = TextBlob(sentence)

# Lemmatize only the words tagged as verbs (VB*); keep the rest as-is
lemmatized_words = [word.lemmatize('v') if tag.startswith('VB') else word
                    for word, tag in blob.tags]
print(f"Lemmatized Sentence: {' '.join(lemmatized_words)}")
Output:
Lemmatized Sentence: The dogs bark
5. spaCy
spaCy is one of the most powerful NLP libraries in Python, known for its speed and ease of use. It provides pre-trained models for tokenization, lemmatization, POS tagging and more. spaCy's lemmatization is highly accurate and works well with complex sentence structures.
For installation run the following command:
pip install spacy
python -m spacy download en_core_web_sm
Let's see an example:
Python
import spacy

# Load the small English pipeline (tokenizer, tagger, lemmatizer, ...)
nlp = spacy.load('en_core_web_sm')

doc = nlp("The cats are sitting")
for token in doc:
    print(token.text, token.lemma_)
Output:
The the
cats cat
are be
sitting sit
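Because spaCy's pipeline already performs POS tagging, no manual tag mapping is needed. A small sketch printing each token's inferred POS next to its lemma:
Python
import spacy

nlp = spacy.load('en_core_web_sm')

# token.pos_ is the POS the model inferred; the lemmatizer uses it internally
for token in nlp("The dogs were barking at the striped cats"):
    print(token.text, token.pos_, token.lemma_)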
6. Gensim
Gensim is widely used for topic modeling and document similarity in large text corpora. Its original lemmatize utility relied on the Pattern library and focused on tokens like nouns, verbs, adjectives and adverbs, but it was removed in Gensim 4.x. A common approach today, used in the example below, is to tokenize with Gensim's simple_preprocess and lemmatize the tokens with NLTK's WordNetLemmatizer; this combination is suitable for large-scale text processing.
Installation:
!pip install gensim nltk
Let's see an example:
Python
import nltk
from nltk.stem import WordNetLemmatizer
from gensim.utils import simple_preprocess

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

text = "The cats are running and the dogs were barking."
# simple_preprocess lowercases the text and splits it into tokens
tokens = simple_preprocess(text)

# Without a POS argument, the lemmatizer treats every token as a noun
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
print("Original Tokens:", tokens)
print("Lemmatized Tokens:", lemmatized_tokens)
Output:
Original Tokens: ['the', 'cats', 'are', 'running', 'and', 'the', 'dogs', 'were', 'barking']
Lemmatized Tokens: ['the', 'cat', 'are', 'running', 'and', 'the', 'dog', 'were', 'barking']
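Note that the verbs ('are', 'running', 'were', 'barking') are left untouched because the lemmatizer defaulted to treating every token as a noun. One way to fix this, sketched below, is to POS-tag the tokens from simple_preprocess before lemmatizing (wn_pos is just an illustrative mapping):
Python
import nltk
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from gensim.utils import simple_preprocess

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

lemmatizer = WordNetLemmatizer()
tokens = simple_preprocess("The cats are running and the dogs were barking.")

# Tag each token, then map the tag's first letter to a WordNet POS code
wn_pos = {'V': wordnet.VERB, 'J': wordnet.ADJ, 'R': wordnet.ADV}
lemmas = [lemmatizer.lemmatize(word, wn_pos.get(tag[0], wordnet.NOUN))
          for word, tag in pos_tag(tokens)]
print(lemmas)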
With these techniques, we can easily perform lemmatization in Python and use it in real-world projects.