Last Updated : 12 Feb, 2025
When working with datasets, we often encounter categorical data, which needs to be converted into a numerical format before machine learning algorithms can process it. For example, in a cars dataset, a column representing car brands ("Toyota", "Honda", "Ford") or colors ("Red", "Blue", "Green") is categorical data. One common method for this conversion is Label Encoding.
In this article, we briefly explain the concept of label encoding and show how to implement it in Python.
Label Encoding
Label Encoding is a technique for converting categorical columns into numerical ones so that they can be used by machine learning models that accept only numerical data. It assigns a unique integer to each category in the data and is an important pre-processing step in a machine learning project.
Example of Label Encoding
Suppose a dataset has a column Height with the values Tall, Medium, and Short. To convert this categorical column into a numerical one, we apply label encoding to it. After encoding, the Height column contains the values 0, 1, and 2, where 0 is the label for Tall, 1 is the label for Medium, and 2 is the label for Short.
Height    Height (encoded)
Tall      0
Medium    1
Short     2

How to Perform Label Encoding in Python
We will apply Label Encoding to the target column of the Iris dataset, species. It contains three species: Iris-setosa, Iris-versicolor, and Iris-virginica.
Python
import pandas as pd

# Load the Iris dataset and list the distinct values of the target column
df = pd.read_csv('../../data/Iris.csv')
df['species'].unique()
Output:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
After applying Label Encoding with LabelEncoder(), each categorical value is replaced with an integer.
Python
from sklearn import preprocessing

# Fit the encoder on the species column and replace each label with its integer code
label_encoder = preprocessing.LabelEncoder()
df['species'] = label_encoder.fit_transform(df['species'])
df['species'].unique()
Output:
array([0, 1, 2], dtype=int64)
Advantages of Label Encoding
1. Label Encoding is straightforward to use. It requires little preprocessing because it directly converts each unique category into a numeric value. We don't need to create additional features or complex transformations. For example, if you have categories like ["Red", "Green", "Blue"], Label Encoding simply assigns integers like [0, 1, 2] without extra steps.
2. Label Encoding works well for ordinal data, where the order of categories is meaningful (e.g., Low, Medium, High). When the integers follow that order (Low = 0, Medium = 1, High = 2), the numerical representation preserves the relationship between categories, which helps the model understand their ranking or progression. Note that scikit-learn's LabelEncoder assigns integers in sorted order of the labels, so the intended ranking may have to be specified explicitly. Because no extra columns are created, the encoding is also efficient in such cases.
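The ordinal case can be sketched as follows. This is a minimal illustration, not part of the Iris example: the list `levels` and the dictionary `rank` are made-up names and values. It shows that LabelEncoder numbers the classes in sorted order, while an explicit mapping keeps the ranking Low < Medium < High.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordinal values, made up for illustration.
levels = ["Low", "High", "Medium", "Low"]

# LabelEncoder numbers the classes in sorted order, not by rank.
encoder = LabelEncoder()
alphabetical_codes = encoder.fit_transform(levels)
print(encoder.classes_.tolist())     # ['High', 'Low', 'Medium']
print(alphabetical_codes.tolist())   # [1, 0, 2, 1]

# An explicit mapping (assumed ranking) preserves Low < Medium < High.
rank = {"Low": 0, "Medium": 1, "High": 2}
ordinal_codes = pd.Series(levels).map(rank)
print(ordinal_codes.tolist())        # [0, 2, 1, 0]
```

If the order matters to the model, the explicit mapping is usually the safer choice, since the integers then reflect the ranking rather than the spelling of the labels.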
Label Encoding also has a limitation. If the encoded values imply a relationship that does not exist (e.g., Red = 0 and Blue = 2 might suggest Red < Blue), the model may incorrectly interpret the data as ordinal. For such unordered categories, consider using One-Hot Encoding instead.
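As a rough sketch of the One-Hot Encoding alternative, the snippet below uses pandas' get_dummies on a small made-up color column. The DataFrame `colors` and its values are assumptions for illustration only. Each category becomes its own 0/1 column, so no ordering between colors is implied.

```python
import pandas as pd

# Hypothetical nominal column, made up for illustration.
colors = pd.DataFrame({"color": ["Red", "Blue", "Green", "Red"]})

# One 0/1 indicator column per category; no order between colors is implied.
one_hot = pd.get_dummies(colors, columns=["color"], dtype=int)
print(one_hot.columns.tolist())   # ['color_Blue', 'color_Green', 'color_Red']
print(one_hot.iloc[0].tolist())   # [0, 0, 1]  (first row is "Red")
```

The trade-off is that the number of columns grows with the number of categories, which is why Label Encoding remains attractive when the categories are genuinely ordered.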
Label Encoding is an essential technique for preprocessing categorical data in machine learning. It's simple, efficient, and works well for ordinal data. However, be cautious of its limitations and use other encoding techniques like One-Hot Encoding when necessary.