Last Updated : 21 Jul, 2025
Text summarization has become increasingly important as massive amounts of textual data are generated daily, making the ability to extract key information quickly essential. Sumy is a Python library designed specifically for automatic text summarization, providing multiple algorithms to tackle this challenge effectively.
Sumy for Text Summarization
Sumy brings several advantages that make it useful for a wide range of text summarization tasks. The library supports multiple summarization algorithms, including Luhn, Edmundson, LSA, LexRank and KL-Sum, giving you the flexibility to choose the approach that best fits your data. It integrates with other NLP libraries and requires minimal setup, making it accessible even for beginners. The library handles large documents efficiently and can be customized to meet specific summarization requirements.
Setting Up Sumy
Getting Sumy up and running is straightforward. We can install it from PyPI using pip:
pip install sumy
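To verify the installation, a quick sanity check (Sumy exposes a package-level version string):
import sumy
print(sumy.__version__)  # prints the installed version string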
Text Preprocessing
Before summarizing, let's look at the text preprocessing techniques needed to prepare a document for summarization. Sumy provides built-in capabilities to prepare text for effective summarization.
Tokenization with Sumy
Tokenization breaks down text into manageable units such as sentences or words. This process helps the summarization algorithms understand text structure and meaning more effectively.
from sumy.nlp.tokenizers import Tokenizer
import nltk
nltk.download('punkt')      # sentence tokenizer models
nltk.download('punkt_tab')  # required by newer NLTK releases

# Create tokenizer for English ("english", not "en", so NLTK can locate its punkt models)
tokenizer = Tokenizer("english")

# Sample text
text = """Machine learning is transforming industries worldwide.
Companies are investing heavily in AI research and development.
The future of technology depends on these advancements."""

# Tokenize into sentences
sentences = tokenizer.to_sentences(text)

# Display tokenized words for each sentence
for sentence in sentences:
    words = tokenizer.to_words(sentence)
    print(words)
Output:
('Machine', 'learning', 'is', 'transforming', 'industries', 'worldwide')
('Companies', 'are', 'investing', 'heavily', 'in', 'AI', 'research', 'and', 'development')
('The', 'future', 'of', 'technology', 'depends', 'on', 'these', 'advancements')
Stemming for Word Normalization
Stemming reduces words to their root forms, helping algorithms recognize that words like "running", "runs" and "ran" are variations of the same concept.
from sumy.nlp.stemmers import Stemmer

# Create stemmer for English ("english", not "en", matching NLTK's Snowball stemmer names)
stemmer = Stemmer("english")

# Test stemming on various words
test_words = ["programming", "developer", "coding", "algorithms"]
for word in test_words:
    stemmed = stemmer(word)
    print(f"{word} -> {stemmed}")
Output:
programming -> program
developer -> develop
coding -> code
algorithms -> algorithm
Summarization Algorithms in Sumy
Sumy provides several algorithms, each with a different approach to identifying important sentences. Let's explore the most effective ones.
1. Luhn Summarizer: Frequency-Based Approach
The Luhn algorithm ranks sentences based on the frequency of significant words. It identifies important terms by filtering out stop words and favours sentences containing these high-frequency terms.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk

nltk.download('punkt_tab')

def luhn_summarize(text, sentence_count=2):
    # Parse the input text
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize summarizer with stemmer
    summarizer = LuhnSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Generate summary
    summary = summarizer(parser.document, sentence_count)
    return summary

# Test with sample text
sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
These technologies are revolutionizing industries from healthcare to finance.
The potential applications of AI seem limitless as research continues to advance.
"""

summary = luhn_summarize(sample_text, 2)
for sentence in summary:
    print(sentence)
Output:
Machine learning algorithms form the backbone of most AI applications today. Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
2. Edmundson Summarizer: Customizable Word Weighting
The Edmundson algorithm allows fine-tuned control over summarization by using bonus words (emphasized), stigma words (de-emphasized) and null words (ignored).
from sumy.summarizers.edmundson import EdmundsonSummarizer

def edmundson_summarize(text, sentence_count=2, bonus_words=None, stigma_words=None):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize summarizer
    summarizer = EdmundsonSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Set null words (ignored entirely)
    summarizer.null_words = get_stop_words("english")

    # Set custom word weights
    if bonus_words:
        summarizer.bonus_words = bonus_words
    if stigma_words:
        summarizer.stigma_words = stigma_words

    summary = summarizer(parser.document, sentence_count)
    return summary

# Customize summarization focus
bonus_words = ["intelligence", "learning", "algorithms"]
stigma_words = ["simple", "basic"]

sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
These technologies are revolutionizing industries from healthcare to finance.
The potential applications of AI seem limitless as research continues to advance.
"""

summary = edmundson_summarize(sample_text, 2, bonus_words, stigma_words)
for sentence in summary:
    print(sentence)
Output:
Artificial intelligence represents a paradigm shift in how machines process information. Machine learning algorithms form the backbone of most AI applications today.
3. LSA Summarizer: Semantic Understanding
Latent Semantic Analysis (LSA) goes beyond simple word frequency by capturing relationships and context between terms. This approach often produces more coherent and contextually accurate summaries. The code is given below:
from sumy.summarizers.lsa import LsaSummarizer

def lsa_summarize(text, sentence_count=2):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize LSA summarizer
    summarizer = LsaSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    summary = summarizer(parser.document, sentence_count)
    return summary

sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
These technologies are revolutionizing industries from healthcare to finance.
The potential applications of AI seem limitless as research continues to advance.
"""

summary = lsa_summarize(sample_text, 2)
for sentence in summary:
    print(sentence)
Output:
Artificial intelligence represents a paradigm shift in how machines process information. Modern AI systems can learn from data, recognize patterns and make decisions with minimal human intervention.
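Sumy's other built-in summarizers mentioned earlier, LexRank and KL-Sum, follow the same pattern. A minimal sketch reusing the imports and sample_text from the examples above (summarize_with is an illustrative helper, not part of Sumy's API):
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.kl import KLSummarizer

def summarize_with(summarizer_class, text, sentence_count=2):
    # Generic helper: parse the text, then run any Sumy summarizer over it
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = summarizer_class(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")
    return summarizer(parser.document, sentence_count)

for cls in (LexRankSummarizer, KLSummarizer):
    print(f"--- {cls.__name__} ---")
    for sentence in summarize_with(cls, sample_text, 2):
        print(sentence)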
Performance Considerations
Time Complexity: Luhn and Edmundson rely on simple word-frequency counts, so they scale roughly linearly with document length. LSA performs a singular value decomposition of the term-sentence matrix, which is noticeably more expensive on long documents, and LexRank builds a sentence-similarity graph that grows quadratically with the number of sentences.
Space Complexity: the frequency-based methods need little more than token counts, while LSA and LexRank must hold a term-sentence or similarity matrix in memory.
Sumy works well for well-structured prose such as news articles, reports and documentation, where key sentences can be extracted verbatim.
But it has limitations too: all of its algorithms are extractive, so they can only select existing sentences rather than paraphrase them, and quality degrades on conversational or poorly structured text. A quick timing comparison follows this section.
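To compare runtimes on your own documents, an illustrative timing harness reusing the summarize_with helper sketched above (Edmundson is omitted because it requires bonus, stigma and null words to be set first):
import time

# Time each summarizer on the sample text (wall-clock, single run)
for cls in (LuhnSummarizer, LsaSummarizer, LexRankSummarizer, KLSummarizer):
    start = time.perf_counter()
    summarize_with(cls, sample_text, 2)
    elapsed = time.perf_counter() - start
    print(f"{cls.__name__}: {elapsed * 1000:.1f} ms")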
The key to effective summarization with Sumy lies in understanding your text's characteristics and choosing the algorithm that best matches your requirements. Experimenting with different approaches and sentence counts will help you find the optimal configuration for your use case, as in the quick sweep below.
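For example, a sweep over sentence counts using the lsa_summarize helper defined earlier:
# Try increasingly long summaries to find the right level of detail
for count in (1, 2, 3):
    print(f"--- {count}-sentence summary ---")
    for sentence in lsa_summarize(sample_text, count):
        print(sentence)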