Text Summarization with Sumy: A Complete Guide

Last Updated: 21 Jul, 2025

Text summarization has become increasingly important as massive amounts of textual data are generated daily, making the ability to extract key information quickly more valuable than ever. Sumy is a Python library designed specifically for automatic text summarization, providing multiple algorithms to tackle this challenge effectively.

Sumy for Text Summarization

Sumy brings several advantages that make it useful for a wide range of text summarization tasks. The library supports multiple summarization algorithms, including Luhn, Edmundson, LSA, LexRank and KL-Sum, giving you the flexibility to choose the approach that best fits your data. It integrates with other NLP libraries and requires minimal setup, making it accessible even for beginners. It also handles large documents efficiently and can be customized to meet specific summarization requirements.

Setting Up Sumy

Getting Sumy up and running is straightforward. We can install it from PyPI using pip:

pip install sumy
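
Once installed, every algorithm in Sumy follows the same basic workflow: parse the text into a document, create a summarizer and ask it for a number of sentences. Here is a minimal end-to-end sketch using the LexRank summarizer mentioned above; the sample text and sentence count are arbitrary illustrations, and on older NLTK versions you may need to download 'punkt' instead of 'punkt_tab':

Python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
import nltk
nltk.download('punkt_tab')  # use 'punkt' on older NLTK versions

text = ("Sumy is a Python library for automatic text summarization. "
        "It offers several extractive algorithms behind one interface. "
        "You parse a document, pick a summarizer and request a sentence count.")

# Parse the raw string into Sumy's document model
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# LexRank ranks sentences by graph-based centrality
summarizer = LexRankSummarizer()

# Print a one-sentence summary
for sentence in summarizer(parser.document, 1):
    print(sentence)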

Text Preprocessing

Before summarizing, let's look at the text preprocessing steps required to prepare a document or text. Sumy provides built-in capabilities to get text ready for effective summarization.

Tokenization with Sumy

Tokenization breaks text down into manageable units such as sentences or words. This process helps the summarization algorithms understand text structure and meaning more effectively.

Python
from sumy.nlp.tokenizers import Tokenizer
import nltk
nltk.download('punkt_tab')  # use 'punkt' on older NLTK versions

# Create tokenizer for English (Sumy expects the full language name)
tokenizer = Tokenizer("english")

# Sample text
text = """Machine learning is transforming industries worldwide. 
          Companies are investing heavily in AI research and development. 
          The future of technology depends on these advancements."""

# Tokenize into sentences
sentences = tokenizer.to_sentences(text)

# Display tokenized words for each sentence
for sentence in sentences:
    words = tokenizer.to_words(sentence)
    print(words)

Output:

('Machine', 'learning', 'is', 'transforming', 'industries', 'worldwide')
('Companies', 'are', 'investing', 'heavily', 'in', 'AI', 'research', 'and', 'development')
('The', 'future', 'of', 'technology', 'depends', 'on', 'these', 'advancements')

Stemming for Word Normalization

Stemming reduces words to their root forms, helping algorithms recognize that words like "running", "runs" and "ran" are variations of the same concept.

Python
from sumy.nlp.stemmers import Stemmer

# Create stemmer for English
stemmer = Stemmer("english")

# Test stemming on various words
test_words = ["programming", "developer", "coding", "algorithms"]

for word in test_words:
    stemmed = stemmer(word)
    print(f"{word} -> {stemmed}")

Output:

programming -> program
developer -> develop
coding -> code
algorithms -> algorithm

Summarization Algorithms in Sumy

Sumy provides several algorithms, each taking a different approach to identifying important sentences. Let's explore three of the most widely used.

1. Luhn Summarizer: Frequency-Based Approach

The Luhn algorithm ranks sentences based on the frequency of significant words. It identifies important terms by filtering out stop words and focuses on sentences containing these high-frequency terms.

Python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk
nltk.download('punkt_tab')

def luhn_summarize(text, sentence_count=2):
    # Parse the input text
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    
    # Initialize summarizer with stemmer
    summarizer = LuhnSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")
    
    # Generate summary
    summary = summarizer(parser.document, sentence_count)
    return summary

# Test with sample text
sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information. 
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention. 
Machine learning algorithms form the backbone of most AI applications today. 
Deep learning, a subset of machine learning, uses neural networks to solve complex problems. 
These technologies are revolutionizing industries from healthcare to finance. 
The potential applications of AI seem limitless as research continues to advance.
"""

summary = luhn_summarize(sample_text, 2)
for sentence in summary:
    print(sentence)

Output:

Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.

2. Edmundson Summarizer: Customizable Word Weighting

The Edmundson algorithm allows fine-tuned control over summarization by using bonus words (emphasized), stigma words (de-emphasized) and null words (ignored).

Python
from sumy.summarizers.edmundson import EdmundsonSummarizer

def edmundson_summarize(text, sentence_count=2, bonus_words=None, stigma_words=None):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize summarizer
    summarizer = EdmundsonSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Set null words
    summarizer.null_words = get_stop_words("english")

    # Set custom word weights; Edmundson requires non-empty bonus and
    # stigma word sets, so fall back to a harmless placeholder
    summarizer.bonus_words = bonus_words if bonus_words else [""]
    summarizer.stigma_words = stigma_words if stigma_words else [""]

    summary = summarizer(parser.document, sentence_count)
    return summary

# Customize summarization focus
bonus_words = ["intelligence", "learning", "algorithms"]
stigma_words = ["simple", "basic"]

sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
These technologies are revolutionizing industries from healthcare to finance.
The potential applications of AI seem limitless as research continues to advance.
"""

summary = edmundson_summarize(sample_text, 2, bonus_words, stigma_words)
for sentence in summary:
    print(sentence)

Output:

Artificial intelligence represents a paradigm shift in how machines process information.
Machine learning algorithms form the backbone of most AI applications today.

3. LSA Summarizer: Semantic Understanding

Latent Semantic Analysis (LSA) goes beyond simple word frequency by modeling the relationships and context between terms. This approach often produces more coherent and contextually accurate summaries.

Python
from sumy.summarizers.lsa import LsaSummarizer

def lsa_summarize(text, sentence_count=2):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    
    # Initialize LSA summarizer
    summarizer = LsaSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")
    
    summary = summarizer(parser.document, sentence_count)
    return summary

sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
These technologies are revolutionizing industries from healthcare to finance.
The potential applications of AI seem limitless as research continues to advance.
"""

summary = lsa_summarize(sample_text, 2)
for sentence in summary:
    print(sentence)

Output:

Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
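
The two remaining algorithms named earlier, LexRank (shown in the quick-start sketch above) and KL-Sum, follow exactly the same pattern. As a sketch reusing the imports and sample_text from the previous examples, KL-Sum greedily picks sentences that keep the summary's word distribution close to the document's:

Python
from sumy.summarizers.kl import KLSummarizer

def kl_summarize(text, sentence_count=2):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # KL-Sum minimizes the divergence between the word
    # distributions of the summary and the full document
    summarizer = KLSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    return summarizer(parser.document, sentence_count)

for sentence in kl_summarize(sample_text, 2):
    print(sentence)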

Performance Considerations

Time Complexity: The frequency-based summarizers (Luhn, Edmundson) scale roughly linearly with the number of words and sentences, since they only count terms and score sentences. LSA is the most expensive: it builds a term-sentence matrix and computes a singular value decomposition, whose cost grows quickly with document size.

Space Complexity: Luhn and Edmundson need little memory beyond word-frequency tables. LSA must hold the full term-sentence matrix (and its decomposition) in memory, so very long documents with large vocabularies can become memory-heavy.
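
These are rough characterizations; to check them on your own data, you can time each summarizer on the same text. Below is a minimal sketch reusing the helper functions defined above; absolute numbers depend on your machine and input:

Python
import time

# Compare wall-clock time of the three summarizers on the same text
algorithms = [
    ("Luhn", lambda: luhn_summarize(sample_text, 2)),
    ("Edmundson", lambda: edmundson_summarize(sample_text, 2, bonus_words, stigma_words)),
    ("LSA", lambda: lsa_summarize(sample_text, 2)),
]

for name, run in algorithms:
    start = time.perf_counter()
    run()
    print(f"{name}: {time.perf_counter() - start:.4f}s")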

Practical Applications and Limitations

Sumy works well for:

- News articles, reports and other well-structured prose where key sentences can stand on their own (see the sketch after this list for summarizing a web page directly)
- Quickly triaging large document collections before a deeper read
- Lightweight pipelines, since the frequency-based algorithms are fast and need little setup

But it has limitations too:

- It is purely extractive: it selects existing sentences and cannot paraphrase or merge ideas
- Coherence depends on the source text, since extracted sentences may lack connecting context
- Results vary by algorithm and domain, so some experimentation is usually needed
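
As a practical example, Sumy ships an HtmlParser that can fetch and strip a web page for you. Here is a minimal sketch reusing the earlier imports; the URL is a placeholder, so substitute any article page:

Python
from sumy.parsers.html import HtmlParser

# Placeholder URL: replace with the page you want to summarize
url = "https://en.wikipedia.org/wiki/Automatic_summarization"

parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LsaSummarizer(Stemmer("english"))
summarizer.stop_words = get_stop_words("english")

for sentence in summarizer(parser.document, 3):
    print(sentence)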

Choosing the Right Algorithm

Algorithm | Best For | Advantages | Disadvantages | When to Use
--- | --- | --- | --- | ---
LSA | General-purpose summarization | Captures semantics, produces coherent summaries, handles synonyms | Computationally intensive and memory-heavy | Default choice for most applications
Luhn | Quick, frequency-based summaries | Fast, lightweight and easy to implement | Limited semantic understanding; may overlook context | Resource-constrained environments
Edmundson | Domain-specific content | Customizable weighting; adapts well to specific domains | Requires manual tuning; complex to set up | Specialized domains with predefined key terms
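
If you need to switch algorithms at runtime, a small lookup table mapping names to Sumy's summarizer classes keeps things tidy. This registry and helper are a sketch of my own, not part of Sumy's API, and they reuse the parser imports from the earlier examples; Edmundson is omitted because it also requires bonus/stigma word lists:

Python
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# Hypothetical registry of interchangeable summarizers
SUMMARIZERS = {
    "luhn": LuhnSummarizer,
    "lsa": LsaSummarizer,
    "lexrank": LexRankSummarizer,
}

def summarize(text, algorithm="lsa", sentence_count=2):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = SUMMARIZERS[algorithm](Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")
    return summarizer(parser.document, sentence_count)

for sentence in summarize(sample_text, "luhn", 2):
    print(sentence)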

The key to effective summarization with Sumy lies in understanding your text's characteristics and choosing the algorithm that best matches your requirements. Experimenting with different approaches and sentence counts will help you find the optimal configuration for your use case.


