Last Updated : 28 Jul, 2025
Outlier detection is an important task in data analysis: identifying outliers helps us understand the data better and can improve the accuracy of our models. One common technique for detecting outliers is the Z-score, a statistical measurement that describes how far a data point is from the mean, expressed in standard deviations. It tells us whether a data point is above or below the mean and by how many standard deviations it deviates from it.
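Z-Score Formula:
Z = \frac{X - \mu}{\sigma}
Where:
X is the value of the data point
\mu is the mean of the dataset
\sigma is the standard deviation of the dataset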
Outliers can be detected using Z-scores as follows:
Commonly, data points with a Z-score greater than 3 or less than -3 are considered outliers as they lie more than 3 standard deviations away from the mean. This threshold can be adjusted based on the dataset and the specific needs of the analysis.
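The formula and the threshold rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the helper names `z_scores` and `flag_outliers` and the sample `readings` are made up for this example, and the population standard deviation (`statistics.pstdev`) is assumed for \sigma.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the Z-score of every value: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("all values are identical; Z-scores are undefined")
    return [(x - mu) / sigma for x in values]

def flag_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical readings: eleven similar values and one extreme value
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 40]
print(flag_outliers(readings))
```

Passing a different `threshold` (for example 2.5) is one way to adjust the rule to the dataset at hand.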
Why Use Z-Score for Outlier Detection?
The Z-score is useful for detecting outliers in normally distributed data, as it highlights data points that lie far from the mean. For a normal distribution, shown below, approximately:
68% of the data points lie within +/- 1 standard deviation of the mean
95% of the data points lie within +/- 2 standard deviations of the mean
99.7% of the data points lie within +/- 3 standard deviations of the mean
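These percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8) and is only a quick sanity check; the helper name `coverage` is introduced here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within +/- {k} standard deviation(s): {coverage(k):.1%}")
```

Under the normality assumption, only about 0.3% of points are expected beyond +/- 3 standard deviations, which is why 3 is a common cutoff.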
[Figure: normal distribution]
For example, suppose a survey asked people how many children they have, and the responses were:
1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2
Here, the value 15 is clearly an outlier as it deviates significantly from the other data points. The Z-Score for this data point will be much higher than the rest, showing it as an anomaly.
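As a quick check of this claim, the Z-scores of the survey responses can be computed directly. This is a minimal sketch assuming NumPy and the population standard deviation (NumPy's default, `ddof=0`); the variable names are chosen for illustration.

```python
import numpy as np

children = np.array([1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2])

# Z = (X - mean) / standard deviation, applied to every response at once
z = (children - children.mean()) / children.std()

for value, score in zip(children.tolist(), z.tolist()):
    print(f"{value:>2} -> Z = {score:+.2f}")
```

The response 15 gets a Z-score of about 3.67, while every other response stays within half a standard deviation of the mean.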
Let's see why it works so well:
Let’s see the steps of detecting outliers using the Z-Score method in Python.
Step 1: Importing Necessary Libraries
We will import numpy, pandas, scipy and matplotlib for calculating the Z-scores and visualizing the outliers.
Python
import numpy as np
import pandas as pd
from scipy.stats import zscore
import matplotlib.pyplot as plt
Step 2: Creating the Dataset
For this example, we will use sample data and convert this into a pandas DataFrame.
Python
data = [5, 2, 4.5, 4, 3, 2, 6, 20, 9, 2.5, 3.5, 4.75, 6.5, 2.5, 8, 1]
df = pd.DataFrame(data, columns=['Value'])
Step 3: Calculating the Z-Scores
Now, we calculate the Z-scores for this dataset using the zscore function from scipy.stats.
Python
df['Z-score'] = zscore(df['Value'])
print(df)
Output:
[Figure: Calculating the Z-Scores]
This gives us the Z-score for each data point in the dataset.
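One detail worth knowing when reading these values: `scipy.stats.zscore` uses the population standard deviation by default (`ddof=0`), whereas pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), so a manual calculation with pandas can give slightly different numbers. The sketch below, which uses NumPy only and the same sample data, illustrates how the choice affects the Z-score of the largest value; the helper name `z_of_max` is hypothetical.

```python
import numpy as np

data = np.array([5, 2, 4.5, 4, 3, 2, 6, 20, 9, 2.5, 3.5, 4.75, 6.5, 2.5, 8, 1])

def z_of_max(values, ddof):
    """Z-score of the largest value for a given degrees-of-freedom correction."""
    return (values.max() - values.mean()) / values.std(ddof=ddof)

print(f"ddof=0 (population std): {z_of_max(data, 0):.2f}")
print(f"ddof=1 (sample std):     {z_of_max(data, 1):.2f}")
```

With this data both conventions place 20 above the threshold of 3 (roughly 3.37 versus 3.26), but on small datasets the difference can decide whether a point is flagged.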
Step 4: Identifying Outliers
Next, we'll identify the data points that have a Z-score greater than 3 or less than -3, which are commonly considered outliers.
Python
outliers = df[df['Z-score'].abs() > 3]
print(outliers)
Output:
[Figure: Identifying Outliers]
Step 5: Visualizing the Data
To better understand the outliers, let's create a scatter plot to visualize the dataset and highlight the outliers.
Python
plt.figure(figsize=(10, 6))
# Plot every point's Z-score against its row index
plt.scatter(df.index, df['Z-score'], label='Data Points')
# Overplot the flagged outliers in red
plt.scatter(outliers.index, outliers['Z-score'], color='red', label='Outlier')
plt.xlabel('Index')
plt.ylabel('Z-score')
plt.title('Z-score of Each Data Point')
plt.legend()
plt.grid(True)
plt.show()
Output:
[Figure: Visualizing the Data]
In this case, the value 20 is an outlier because its Z-score (about 3.37) exceeds the threshold of 3 and is far higher than that of any other value in the dataset.
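Since the threshold can be adjusted and outliers usually need to be handled once they are found, the steps of this walkthrough can be wrapped in a small helper. This is a sketch under stated assumptions rather than a standard API: the function name `split_outliers` and its parameters are hypothetical, and Z-scores are computed with the population standard deviation (`ddof=0`) to match the default of `scipy.stats.zscore`.

```python
import pandas as pd

def split_outliers(df, column, threshold=3.0):
    """Return (outliers, cleaned) DataFrames based on the absolute Z-score of `column`."""
    values = df[column]
    z = (values - values.mean()) / values.std(ddof=0)
    is_outlier = z.abs() > threshold
    return df[is_outlier], df[~is_outlier]

data = [5, 2, 4.5, 4, 3, 2, 6, 20, 9, 2.5, 3.5, 4.75, 6.5, 2.5, 8, 1]
df = pd.DataFrame(data, columns=['Value'])

outliers, cleaned = split_outliers(df, 'Value')
print(outliers['Value'].tolist())   # values flagged as outliers
print(len(cleaned))                 # number of rows kept
```

Whether flagged rows should be dropped, capped or simply inspected depends on the analysis; the helper only separates them.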
Best Practices for Using Z-Score for Outlier Detection
While the Z-score method is effective, there are a few important considerations:
By applying the Z-score method, we can quickly identify and deal with outliers, which can improve the accuracy of our data analysis and statistical models.