Last Updated: 17 Jul, 2025
Lemmatization is an important text pre-processing technique in Natural Language Processing (NLP) that reduces words to their base form, known as a "lemma." For example, the lemma of "running" is "run" and the lemma of "better" is "good." Unlike stemming, which simply removes prefixes or suffixes, lemmatization considers the word's meaning and part of speech (POS) and ensures that the base form is a valid word. This makes lemmatization more accurate, as it avoids generating non-dictionary words.
Lemmatization is important in NLP because it normalizes different inflected forms of a word to a single base form, which reduces vocabulary size and helps downstream tasks such as search, text classification and information retrieval treat related words consistently.
There are different techniques to perform lemmatization, each with its own advantages and use cases:
1. Rule-Based Lemmatization
In rule-based lemmatization, predefined rules are applied to a word to remove suffixes and get the root form. This approach works well for regular words but may not handle irregularities well.
For example:
Rule: For regular verbs ending in "-ed," remove the "-ed" suffix.
Example: "walked" -> "walk"
While this method is simple and interpretable, it doesn't account for irregular word forms such as "better", which should be lemmatized to "good".
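A minimal, illustrative sketch of the idea is shown below: a hand-written Python function with a few suffix rules (a toy example, not a real library lemmatizer), which handles regular forms like "walked" but leaves irregular forms like "better" untouched.
Python
# Toy rule-based lemmatizer: apply a few hand-written suffix rules.
# This is an illustrative sketch, not a production lemmatizer.
def rule_based_lemma(word):
    if word.endswith("ies") and len(word) > 4:          # "studies" -> "study"
        return word[:-3] + "y"
    if word.endswith("ed") and len(word) > 3:           # "walked" -> "walk"
        return word[:-2]
    if word.endswith("ing") and len(word) > 4:          # "walking" -> "walk"
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss"):  # "cats" -> "cat"
        return word[:-1]
    return word

print(rule_based_lemma("walked"))   # walk
print(rule_based_lemma("better"))   # better (irregular form is not handled)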
2. Dictionary-Based Lemmatization
It uses a predefined dictionary or lexicon such as WordNet to look up the base form of a word. This method is more accurate than rule-based lemmatization because it accounts for exceptions and irregular words.
For example:
- 'running' -> 'run'
- 'better' -> 'good'
- 'went' -> 'go'
"I was running to become a better athlete and then I went home," -> "I was run to become a good athlete and then I go home."
By using dictionaries like WordNet, this method can handle a wide range of words effectively, especially in languages with well-established dictionaries.
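The sketch below illustrates the lookup idea with a tiny hand-built dictionary standing in for a full lexicon such as WordNet; the entries and the helper function are illustrative examples, not an actual resource or API.
Python
# Toy dictionary-based lemmatizer: look each word up in a lexicon
# and fall back to the word itself if it is not listed.
lemma_dict = {            # stand-in for a full lexicon such as WordNet
    "running": "run",
    "better": "good",
    "went": "go",
}

def dictionary_lemma(word):
    return lemma_dict.get(word.lower(), word)

sentence = "I was running to become a better athlete and then I went home"
print(" ".join(dictionary_lemma(w) for w in sentence.split()))
# I was run to become a good athlete and then I go home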
3. Machine Learning-Based Lemmatization
It uses algorithms trained on large datasets to automatically identify the base form of words. This approach is highly flexible and can handle irregular words and linguistic nuances better than the rule-based and dictionary-based methods.
For example:
A trained model may deduce that "went" corresponds to "go" even though the suffix removal rule doesn't apply. Similarly, for "happier" the model deduces "happy" as the lemma.
Machine learning-based lemmatizers are more adaptive and can generalize across different word forms, which makes them ideal for complex tasks involving diverse vocabularies.
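As a sketch of this style of lemmatization, the snippet below uses spaCy (assuming the spacy package and its small English model en_core_web_sm are installed). Strictly speaking, spaCy's lemmatizer combines lookup tables and rules, but it is driven by a statistically trained POS tagger, so irregular forms such as "went" and "children" are resolved from context rather than by suffix stripping.
Python
import spacy

# Requires: pip install spacy  and  python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")   # small English pipeline (assumed installed)

doc = nlp("She went home after the children were running around.")
print([(token.text, token.lemma_) for token in doc])
# e.g. ('went', 'go'), ('children', 'child'), ('were', 'be'), ('running', 'run')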
Implementation of Lemmatization in Python
For more details regarding these techniques refer to: Python - Lemmatization Approaches with Examples
Let's see step by step how lemmatization works in Python:
Step 1: Installing NLTK and Downloading Necessary Resources
In Python, the NLTK library provides an easy and efficient way to implement lemmatization. First, we need to install the NLTK library and download the necessary datasets like WordNet and the punkt tokenizer.
Python
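# Install the NLTK library (the "!" prefix is for notebook environments;
# in a terminal, run "pip install nltk" instead)
!pip install nltk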
Now let's import the library and download the necessary datasets.
Python
import nltk

nltk.download('punkt_tab')                        # tokenizer models used by word_tokenize
nltk.download('wordnet')                          # WordNet lexical database used by the lemmatizer
nltk.download('omw-1.4')                          # Open Multilingual WordNet data
nltk.download('averaged_perceptron_tagger_eng')   # English POS tagger used by pos_tag
Step 2: Lemmatizing Text with NLTK
Now we can tokenize the text and apply lemmatization using NLTK's WordNetLemmatizer.
Python
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Tokenize the sentence and lemmatize each token (default POS is noun)
text = "The cats were running faster than the dogs."
tokens = word_tokenize(text)
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]

print(f"Original Text: {text}")
print(f"Lemmatized Words: {lemmatized_words}")
Output:

Original Text: The cats were running faster than the dogs.
Lemmatized Words: ['The', 'cat', 'were', 'running', 'faster', 'than', 'the', 'dog', '.']
In this output, we can see that only the noun forms were lemmatized: "cats" became "cat" and "dogs" became "dog", while "running" and "faster" were left unchanged because the lemmatizer treats every word as a noun by default.
To improve the accuracy of lemmatization, it's important to specify the correct Part of Speech (POS) for each word. By default, NLTK's WordNetLemmatizer assumes that every word is a noun when no POS tag is provided.
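As a quick sketch of this default behaviour, reusing the WordNetLemmatizer from above:
Python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))           # 'running' (treated as a noun by default)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run' (correct lemma when tagged as a verb)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good' (irregular adjective resolved via WordNet)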
For example, applying POS tagging to a whole sentence:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

sentence = "The children are running towards a better place."
tokens = word_tokenize(sentence)
tagged_tokens = pos_tag(tokens)   # Penn Treebank tags, e.g. ('running', 'VBG')

def get_wordnet_pos(tag):
    """Map a Penn Treebank tag to the WordNet POS expected by the lemmatizer."""
    if tag.startswith('J'):
        return 'a'   # adjective
    elif tag.startswith('V'):
        return 'v'   # verb
    elif tag.startswith('N'):
        return 'n'   # noun
    elif tag.startswith('R'):
        return 'r'   # adverb
    else:
        return 'n'   # default to noun

lemmatized_sentence = []
for word, tag in tagged_tokens:
    # Keep auxiliary verbs as they are so the sentence stays readable
    if word.lower() in ['is', 'am', 'are']:
        lemmatized_sentence.append(word)
    else:
        lemmatized_sentence.append(lemmatizer.lemmatize(word, get_wordnet_pos(tag)))

print("Original Sentence: ", sentence)
print("Lemmatized Sentence: ", ' '.join(lemmatized_sentence))
Output:

Original Sentence:  The children are running towards a better place.
Lemmatized Sentence:  The child are run towards a good place .
In this improved version, each token is tagged with its part of speech and the tag is mapped to a WordNet POS before lemmatization, so "children" becomes "child", "running" becomes "run" and "better" becomes "good".
Let's see some key advantages of lemmatization:
- It produces valid dictionary words, unlike stemming, which can generate non-words.
- It reduces vocabulary size by mapping inflected forms to a single base form.
- It preserves meaning by taking the word's context and POS into account.
- It improves the performance of downstream NLP tasks such as search, text classification and information retrieval.