Text classification is a fundamental task in Natural Language Processing (NLP) that involves assigning predefined categories or labels to textual data. It has a wide range of applications, including spam detection, sentiment analysis, topic categorization, and language identification.
Logistic Regression Working for Text Classification
Logistic Regression is a statistical method used for binary classification problems, and it can also be extended to handle multi-class classification. When applied to text classification, the goal is to predict the category or class of a given text document based on its features. Text classification with logistic regression involves three main steps, sketched in the example after the list:
1. Text Representation: Raw text is converted into a numeric form the model can work with, such as a bag-of-words or TF-IDF representation.
2. Feature Extraction: Each document becomes a feature vector, typically with one entry per vocabulary token holding its count or weight.
3. Logistic Regression Model: The model learns a weight for each feature and applies the sigmoid function to turn the weighted sum into a class probability.
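Below is a minimal, self-contained sketch of these three steps on a toy corpus. The example messages, labels, and variable names are illustrative only and are not part of the dataset used later in this article.
Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus with illustrative labels: 1 = spam, 0 = ham
texts = [
    "win a free prize now",
    "free cash offer claim today",
    "see you at lunch",
    "meeting moved to monday",
]
labels = [1, 1, 0, 0]

# Steps 1 and 2: represent each message as a bag-of-words count vector
vec = CountVectorizer()
features = vec.fit_transform(texts)  # sparse matrix, one row per message

# Step 3: logistic regression learns one weight per token and applies the sigmoid
clf = LogisticRegression()
clf.fit(features, labels)

# Probability that a new message belongs to the spam class
print(clf.predict_proba(vec.transform(["free prize waiting for you"]))[:, 1])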
We'll use the popular SMS Spam Collection dataset, which consists of SMS (Short Message Service) messages labeled as either "ham" (non-spam) or "spam" based on their content. The implementation classifies text messages into these two categories, spam (unwanted messages) and ham (legitimate messages), using a logistic regression model. The process is broken down into several key steps:
Step 1. Import Libraries
The first step involves importing the necessary libraries.
Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
Step 2. Load and Prepare the Data
Load the dataset, rename the original v1 and v2 columns to meaningful names, and map the text labels to numeric values (0 for ham, 1 for spam).
Python
data = pd.read_csv('/content/spam.csv', encoding='latin-1')
data.rename(columns={'v1': 'label', 'v2': 'text'}, inplace=True)
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
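Before vectorizing, it can help to confirm the frame looks as expected. The optional check below is a sketch that assumes the renaming above has already run; the extra columns present in the CSV depend on which export of the dataset you have.
Python
# Optional sanity checks on the loaded frame
print(data.shape)
print(data['label'].value_counts())  # class balance: far more ham than spam

# Keep only the two columns the rest of the tutorial uses
data = data[['label', 'text']]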
Step 3. Text Vectorization
Convert text data into a numeric format using CountVectorizer, which transforms the text into a sparse matrix of token counts.
Python
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['text'])
y = data['label']
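Continuing with the variables just created, an optional inspection makes the "sparse matrix of token counts" concrete: each row corresponds to one message and each column to one token from the learned vocabulary.
Python
# Shape of the document-term matrix: (number of messages, vocabulary size)
print(X.shape)

# The fitted vocabulary maps each token to its column index
print(len(vectorizer.vocabulary_))

# Counts for the first message, densified and truncated just for display
print(X[0].toarray()[0, :20])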
Step 4. Split Data into Training and Testing Sets
Divide the dataset into training and testing sets to evaluate the model's performance on unseen data.
Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
Step 5. Train the Logistic Regression Model
Create and train the logistic regression model using the training set.
Python
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
Output:
LogisticRegression(random_state=42)
Step 6. Model Evaluation
Use the trained model to make predictions on the test set, then evaluate its accuracy and confusion matrix to better understand its performance.
Python
y_pred = model.predict(X_test)
print("Accuracy_score" ,accuracy_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(f"[[{cm[0,0]} {cm[0,1]}]")
print(f" [{cm[1,0]} {cm[1,1]}]]")
Output:
The model is roughly 97% accurate on unseen data. The confusion matrix breaks this down further, showing how many ham and spam messages were classified correctly and how many fell into each type of error.
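Because ham messages heavily outnumber spam in this dataset, accuracy alone can hide weak spam recall. As an optional follow-up, reusing y_test and y_pred from above, a per-class report shows precision, recall, and F1-score for each label:
Python
from sklearn.metrics import classification_report

# Per-class metrics; 'ham' and 'spam' follow the 0/1 encoding used earlier
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))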
To simplify the use of this model for predicting the category of new messages, we create a function that takes a text input and classifies it as spam or ham.
Python
def classify_message(model, vectorizer, message):
    # Vectorize the incoming message with the already-fitted CountVectorizer
    message_vect = vectorizer.transform([message])
    prediction = model.predict(message_vect)
    # Labels were encoded as ham = 0 and spam = 1
    return "spam" if prediction[0] == 1 else "ham"
message = "Congratulations! You've won a free ticket to Bahamas!"
print(classify_message(model, vectorizer, message))
Output:
spam
This function first vectorizes the input text using the previously fitted CountVectorizer, then predicts the category with the trained logistic regression model, and finally returns the prediction as a human-readable label.
This experiment demonstrates that logistic regression is a powerful tool for classifying text, even with a simple bag-of-words approach. Using the SMS Spam Collection dataset, we achieved an accuracy of roughly 97%, showing that the model successfully learned to distinguish between spam and legitimate text messages based on word patterns.