Last Updated : 21 Jul, 2025
Text summarization has become increasingly important as massive amounts of textual data are generated daily, making the ability to extract key information quickly essential. Sumy is a Python library designed specifically for automatic text summarization, providing multiple algorithms to tackle this challenge effectively.
Sumy for Text Summarization
Sumy brings several advantages that make it useful for a wide range of text summarization tasks. The library supports multiple summarization algorithms, including Luhn, Edmundson, LSA, LexRank and KL-Sum, giving you the flexibility to choose the approach that best fits your data. It integrates with other NLP libraries and requires minimal setup, making it accessible even for beginners. The library handles large documents efficiently and can be customized to meet specific summarization requirements.
Setting Up Sumy
Getting Sumy up and running is straightforward. We can install it from PyPI using pip:
pip install sumy
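To verify the installation, a quick sanity check (Sumy exposes a package-level version string):
import sumy
print(sumy.__version__)  # prints the installed version string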
Text Preprocessing
Before summarizing, let's look at the text preprocessing techniques needed to prepare a document for summarization. Sumy provides built-in capabilities to prepare text for effective summarization.
Tokenization with Sumy
Tokenization breaks down text into manageable units such as sentences or words. This process helps the summarization algorithms understand text structure and meaning more effectively.
from sumy.nlp.tokenizers import Tokenizer
import nltk
nltk.download('punkt')      # sentence tokenizer models
nltk.download('punkt_tab')  # required by newer NLTK releases

# Create tokenizer for English ("english", not "en", so NLTK can locate its punkt models)
tokenizer = Tokenizer("english")

# Sample text
text = """Machine learning is transforming industries worldwide.
Companies are investing heavily in AI research and development.
The future of technology depends on these advancements."""

# Tokenize into sentences
sentences = tokenizer.to_sentences(text)

# Display tokenized words for each sentence
for sentence in sentences:
    words = tokenizer.to_words(sentence)
    print(words)
Output:
('Machine', 'learning', 'is', 'transforming', 'industries', 'worldwide')
('Companies', 'are', 'investing', 'heavily', 'in', 'AI', 'research', 'and', 'development')
('The', 'future', 'of', 'technology', 'depends', 'on', 'these', 'advancements')
Stemming for Word Normalization
Stemming reduces words to their root forms, helping algorithms recognize that words like "running", "runs" and "ran" are variations of the same concept.
from sumy.nlp.stemmers import Stemmer

# Create stemmer for English ("english", not "en", matching NLTK's Snowball stemmer names)
stemmer = Stemmer("english")

# Test stemming on various words
test_words = ["programming", "developer", "coding", "algorithms"]
for word in test_words:
    stemmed = stemmer(word)
    print(f"{word} -> {stemmed}")
Output:
programming -> program
developer -> develop
coding -> code
algorithms -> algorithm
Summarization Algorithms in Sumy
Sumy provides several algorithms, each with a different approach to identifying important sentences. Let's explore the most effective ones.
1. Luhn Summarizer: Frequency-Based Approach
The Luhn algorithm ranks sentences based on the frequency of significant words. It identifies important terms by filtering out stop words and favours sentences containing these high-frequency terms.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk

nltk.download('punkt_tab')

def luhn_summarize(text, sentence_count=2):
    # Parse the input text
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize summarizer with stemmer
    summarizer = LuhnSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Generate summary
    summary = summarizer(parser.document, sentence_count)
    return summary

# Test with sample text
sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
These technologies are revolutionizing industries from healthcare to finance.
The potential applications of AI seem limitless as research continues to advance.
"""

summary = luhn_summarize(sample_text, 2)
for sentence in summary:
    print(sentence)
Output:
Machine learning algorithms form the backbone of most AI applications today. Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
2. Edmundson Summarizer: Customizable Word Weighting
The Edmundson algorithm allows fine-tuned control over summarization by using bonus words (emphasized), stigma words (de-emphasized) and null words (ignored).
from sumy.summarizers.edmundson import EdmundsonSummarizer

def edmundson_summarize(text, sentence_count=2, bonus_words=None, stigma_words=None):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize summarizer
    summarizer = EdmundsonSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Set null words (ignored entirely)
    summarizer.null_words = get_stop_words("english")

    # Set custom word weights
    if bonus_words:
        summarizer.bonus_words = bonus_words
    if stigma_words:
        summarizer.stigma_words = stigma_words

    summary = summarizer(parser.document, sentence_count)
    return summary

# Customize summarization focus
bonus_words = ["intelligence", "learning", "algorithms"]
stigma_words = ["simple", "basic"]

sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
These technologies are revolutionizing industries from healthcare to finance.
The potential applications of AI seem limitless as research continues to advance.
"""

summary = edmundson_summarize(sample_text, 2, bonus_words, stigma_words)
for sentence in summary:
    print(sentence)
Output:
Artificial intelligence represents a paradigm shift in how machines process information. Machine learning algorithms form the backbone of most AI applications today.
3. LSA Summarizer: Semantic Understanding
Latent Semantic Analysis (LSA) goes beyond simple word frequency by capturing relationships and context between terms. This approach often produces more coherent and contextually accurate summaries. The code is given below:
from sumy.summarizers.lsa import LsaSummarizer

def lsa_summarize(text, sentence_count=2):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize LSA summarizer
    summarizer = LsaSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    summary = summarizer(parser.document, sentence_count)
    return summary

sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
These technologies are revolutionizing industries from healthcare to finance.
The potential applications of AI seem limitless as research continues to advance.
"""

summary = lsa_summarize(sample_text, 2)
for sentence in summary:
    print(sentence)
Output:
Artificial intelligence represents a paradigm shift in how machines process information. Modern AI systems can learn from data, recognize patterns and make decisions with minimal human intervention.
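Sumy's other built-in summarizers mentioned earlier, LexRank and KL-Sum, follow the same pattern. A minimal sketch reusing the imports and sample_text from the examples above (summarize_with is an illustrative helper, not part of Sumy's API):
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.kl import KLSummarizer

def summarize_with(summarizer_class, text, sentence_count=2):
    # Generic helper: parse the text, then run any Sumy summarizer over it
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = summarizer_class(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")
    return summarizer(parser.document, sentence_count)

for cls in (LexRankSummarizer, KLSummarizer):
    print(f"--- {cls.__name__} ---")
    for sentence in summarize_with(cls, sample_text, 2):
        print(sentence)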
Performance Considerations
Time Complexity: Luhn and Edmundson rely on simple word-frequency counts, so they scale roughly linearly with document length. LSA performs a singular value decomposition of the term-sentence matrix, which is noticeably more expensive on long documents, and LexRank builds a sentence-similarity graph that grows quadratically with the number of sentences.
Space Complexity: the frequency-based methods need little more than token counts, while LSA and LexRank must hold a term-sentence or similarity matrix in memory.
Sumy works well for well-structured prose such as news articles, reports and documentation, where key sentences can be extracted verbatim.
But it has limitations too: all of its algorithms are extractive, so they can only select existing sentences rather than paraphrase them, and quality degrades on conversational or poorly structured text. A quick timing comparison follows this section.
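To compare runtimes on your own documents, an illustrative timing harness reusing the summarize_with helper sketched above (Edmundson is omitted because it requires bonus, stigma and null words to be set first):
import time

# Time each summarizer on the sample text (wall-clock, single run)
for cls in (LuhnSummarizer, LsaSummarizer, LexRankSummarizer, KLSummarizer):
    start = time.perf_counter()
    summarize_with(cls, sample_text, 2)
    elapsed = time.perf_counter() - start
    print(f"{cls.__name__}: {elapsed * 1000:.1f} ms")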
The key to effective summarization with Sumy lies in understanding your text's characteristics and choosing the algorithm that best matches your requirements. Experimenting with different approaches and sentence counts will help you find the optimal configuration for your use case, as in the quick sweep below.
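For example, a sweep over sentence counts using the lsa_summarize helper defined earlier:
# Try increasingly long summaries to find the right level of detail
for count in (1, 2, 3):
    print(f"--- {count}-sentence summary ---")
    for sentence in lsa_summarize(sample_text, count):
        print(sentence)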