Last Updated : 26 Jul, 2025
Outliers are data points that deviate significantly from the other data points in a dataset. They can arise from a variety of factors such as measurement errors, rare events or natural variations in the data. If left unchecked, they can distort data analysis, skew statistical results and hurt machine learning model performance. In this article, we'll see how to detect and handle outliers in Python using various techniques to improve the quality and reliability of our data.
Common Causes of Outliers
Understanding the causes of outliers helps in finding the best approach to handle them. Common causes include measurement errors, rare events and natural variations in the data.
Outliers can create significant issues in data analysis and machine learning, which makes handling them important: they can distort analysis, skew statistical results and degrade model performance.
By removing or otherwise handling outliers, we prevent these issues and get more accurate analysis and predictions.
Methods for Detecting and Removing Outliers
There are several ways to detect and handle outliers in Python. We can use visualization techniques or statistical methods depending on the nature of our data. Each method serves a different purpose and suits specific types of data. Here we will use the Pandas, Seaborn and Matplotlib libraries on the Diabetes dataset, which ships with the Scikit-learn library.
1. Visualizing and Removing Outliers Using Box Plots
A boxplot is an effective way to visualize the distribution of data using quartiles; points outside the "whiskers" of the plot are considered outliers. Boxplots provide a quick way to see where the data is concentrated and where potential outliers lie.
Python
from sklearn.datasets import load_diabetes
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the diabetes dataset and wrap its features in a DataFrame
diabetes = load_diabetes()
column_name = diabetes.feature_names
df_diabetics = pd.DataFrame(diabetes.data, columns=column_name)

# Draw a boxplot of the 'bmi' column
sns.boxplot(df_diabetics['bmi'])
plt.title('Boxplot of BMI')
plt.show()
Output:
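In the boxplot, outliers appear as points outside the whiskers. These values are much higher or lower than the rest of the data. For example, bmi values above 0.12 could be identified as outliers. (The features in this dataset are mean-centered and scaled, which is why BMI takes small values around zero.)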
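The whisker rule can also be evaluated numerically, which gives a threshold without reading it off the plot. The sketch below assumes the common default of whiskers at 1.5 times the IQR (the default in Matplotlib and seaborn) and uses a small made-up array rather than the diabetes data; all variable names are illustrative.

```python
import numpy as np

# Made-up sample: mostly clustered values plus two extreme ones.
values = np.array([2.0, 2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 3.0, 6.5, -1.5])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: points beyond these are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
fliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers end at the most extreme observations still inside the fences.
print("whiskers:", inside.min(), inside.max())
print("points outside the whiskers:", fliers)
```

The whisker ends are actual observations, so they usually sit slightly inside the fences rather than exactly on them.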
To remove outliers, we can define a threshold value and filter the data.
Python
def removal_box_plot(df, column, threshold):
    removed_outliers = df[df[column] <= threshold]
    sns.boxplot(removed_outliers[column])
    plt.title(f'Box Plot without Outliers of {column}')
    plt.show()
    return removed_outliers

threshold_value = 0.12
no_outliers = removal_box_plot(df_diabetics, 'bmi', threshold_value)
Output:
2. Visualizing and Removing Outliers Using Scatter Plots
Scatter plots help visualize the relationship between two variables. They are used when we have paired numerical data, or when the dependent variable has multiple values for each value of the independent variable. Outliers appear as points far from the main cluster of data.
Python
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df_diabetics['bmi'], df_diabetics['bp'])
ax.set_xlabel('BMI')
ax.set_ylabel('Blood Pressure')
plt.title('Scatter Plot of BMI vs Blood Pressure')
plt.show()
Output:
Looking at the graph, we can see that most of the data points sit in the bottom left corner, while a few points lie at the opposite end, in the top right corner. Those points in the top right corner can be regarded as outliers.
Here’s how we can remove the outliers identified visually from the scatter plot.
import numpy as np

# Positions of the rows flagged as outliers from visual inspection
outlier_indices = np.where((df_diabetics['bmi'] > 0.12) & (df_diabetics['bp'] < 0.8))
# Positions match index labels here because df_diabetics has the default 0..n-1 index
no_outliers = df_diabetics.drop(outlier_indices[0])

fig, ax_no_outliers = plt.subplots(figsize=(6, 4))
ax_no_outliers.scatter(no_outliers['bmi'], no_outliers['bp'])
ax_no_outliers.set_xlabel('BMI')
ax_no_outliers.set_ylabel('Blood Pressure')
plt.show()
Output:
This removes rows where BMI > 0.12 and BP < 0.8, conditions derived from visual inspection. Note that every BP value in this scaled dataset is well below 0.8, so in practice the filter depends on the BMI condition alone.
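A filter like this can also be written with a boolean mask instead of positional indices. np.where returns row positions, while DataFrame.drop expects index labels, and the two coincide only for the default 0..n-1 index. The following is a minimal sketch on a made-up DataFrame with a non-default index; the data and the thresholds (chosen to pick out a "top right" region) are illustrative assumptions, not values taken from the diabetes dataset.

```python
import pandas as pd

# Made-up data with a non-default index (labels are not 0..n-1).
df = pd.DataFrame(
    {"bmi": [0.01, 0.15, -0.02, 0.13, 0.03],
     "bp": [0.02, 0.10, -0.01, 0.09, 0.00]},
    index=[10, 11, 12, 13, 14],
)

# Boolean mask marking the rows judged to be outliers (illustrative thresholds).
is_outlier = (df["bmi"] > 0.12) & (df["bp"] > 0.08)

# Keep everything that is not flagged; no index arithmetic involved.
no_outliers = df[~is_outlier]
print(no_outliers)
```

Because the mask is aligned with the DataFrame's own index, this form keeps working after rows have been filtered, sorted or re-indexed.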
3. Z-Score Method for Outlier Detection
The Z-score, also called the standard score, measures how far a data point is from the mean in terms of standard deviations. If the absolute Z-score exceeds a given threshold (commonly 3), the data point is considered an outlier.
Z = \frac{x - \mu}{\sigma}
Where x is the data point, \mu is the mean of the data and \sigma is its standard deviation.
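As a worked example of the formula, the sketch below computes Z-scores by hand with NumPy on a small made-up sample. It uses the population standard deviation (ddof=0), which is also the default of scipy.stats.zscore. The threshold of 2 is an assumption made only because the sample is tiny.

```python
import numpy as np

# Made-up sample with one value far from the rest.
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 12.0, 40.0])

mu = x.mean()
sigma = x.std()          # population standard deviation (ddof=0)
z = (x - mu) / sigma     # one Z-score per data point

print(np.round(z, 2))
print("flagged:", x[np.abs(z) > 2])
```

After standardizing, the Z-scores have mean 0 and standard deviation 1, so the threshold is expressed in the same units for any column.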
Here we calculate the Z-scores for the 'age' column in the DataFrame df_diabetics using the zscore function from the SciPy stats module. The resulting array z contains the absolute Z-scores for each data point in the 'age' column, showing how many standard deviations each value is from the mean.
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(df_diabetics['age']))
print(z)
Output:
Next we choose an outlier threshold, which is generally 3.0, since about 99.7% of the data points lie within \pm3 standard deviations of the mean under a Gaussian distribution.
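The coverage figures behind such thresholds can be checked with Python's standard library. This sketch assumes the data is exactly normally distributed, which real columns only approximate, and compares thresholds of 2 and 3 standard deviations.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Fraction of a normal distribution lying within k standard deviations of the mean.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (2, 3)}

for k, within in coverage.items():
    print(f"within {k} sd: {within:.4f}, outside: {1 - within:.4f}")
```

A lower threshold flags a noticeably larger share of the data, so the choice trades sensitivity against the risk of discarding ordinary values.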
Here we use a stricter threshold and remove rows where the Z value is greater than 2.
import numpy as np
threshold_z = 2
outlier_indices = np.where(z > threshold_z)[0]
no_outliers = df_diabetics.drop(outlier_indices)
print("Original DataFrame Shape:", df_diabetics.shape)
print("DataFrame Shape after Removing Outliers:", no_outliers.shape)
Output:
Original DataFrame Shape: (442, 10)
DataFrame Shape after Removing Outliers: (426, 10)
4. Interquartile Range (IQR) Method
The IQR (interquartile range) method is a widely used and reliable technique for detecting outliers. It is robust to skewed data, identifies extreme values based on quartiles and is one of the most trusted approaches in research. The IQR is calculated as the difference between the third quartile (Q3) and the first quartile (Q1):
IQR = Q3 - Q1
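Syntax: numpy.percentile(arr, n, axis=None, out=None)
Parameters: arr is the input array, n is the percentile to compute (between 0 and 100), axis is the axis along which the percentile is computed and out is an optional array in which to store the result.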
Here we calculate the interquartile range (IQR) for the 'bmi' column in the DataFrame df_diabetics. We first find the first quartile (Q1) and third quartile (Q3) using the midpoint method, then compute the IQR as the difference between Q3 and Q1, which measures the spread of the middle 50% of the data in the 'bmi' column.
Q1 = np.percentile(df_diabetics['bmi'], 25, method='midpoint')
Q3 = np.percentile(df_diabetics['bmi'], 75, method='midpoint')
IQR = Q3 - Q1
print(IQR)
Output:
0.06520763046978838
To flag outliers, we define lower and upper bounds that sit 1.5*IQR outside the quartiles, i.e.:
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR
The IQR is scaled up by half (new_IQR = IQR + 0.5*IQR) so that, for Gaussian data, the bounds fall at roughly 2.7 standard deviations on either side of the mean.
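The 2.7 standard deviation figure can be reproduced with a short calculation. The sketch below assumes a standard normal distribution and uses only the standard library; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Quartiles of the standard normal distribution, in units of sigma.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Share of a normal distribution that falls outside the two bounds.
outside = std_normal.cdf(lower) + (1 - std_normal.cdf(upper))

print(f"bounds: {lower:.3f} sigma to {upper:.3f} sigma")
print(f"fraction flagged under normality: {outside:.4f}")
```

Under this assumption well under 1% of values are flagged, so a column where many more points fall outside the bounds is likely skewed or heavy-tailed rather than full of errors.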
Python
upper = Q3+1.5*IQR
upper_array = np.array(df_diabetics['bmi'] >= upper)
print("Upper Bound:", upper)
print(upper_array.sum())
lower = Q1-1.5*IQR
lower_array = np.array(df_diabetics['bmi'] <= lower)
print("Lower Bound:", lower)
print(lower_array.sum())
Output:
Now let's detect and remove outliers using the interquartile range (IQR).
Here we use the IQR method on the 'bmi' column of the diabetes dataset. The code calculates the upper and lower limits based on the IQR, identifies the outlier indices from Boolean arrays and then removes the corresponding rows from the DataFrame, which leaves a DataFrame with the outliers excluded. The before and after shapes of the DataFrame are printed for comparison.
Python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

# Load the dataset into a DataFrame
diabetes = load_diabetes()
column_name = diabetes.feature_names
df_diabetes = pd.DataFrame(diabetes.data, columns=column_name)
print("Old Shape: ", df_diabetes.shape)

# Quartiles, IQR and the 1.5*IQR bounds for 'bmi'
Q1 = df_diabetes['bmi'].quantile(0.25)
Q3 = df_diabetes['bmi'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR

# Row positions outside the bounds (equal to index labels for the default index)
upper_array = np.where(df_diabetes['bmi'] >= upper)[0]
lower_array = np.where(df_diabetes['bmi'] <= lower)[0]

# Drop the flagged rows
df_diabetes.drop(index=upper_array, inplace=True)
df_diabetes.drop(index=lower_array, inplace=True)
print("New Shape: ", df_diabetes.shape)
Output:
Old Shape: (442, 10)
New Shape: (439, 10)
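The IQR steps can be wrapped in a small reusable helper. The function name remove_outliers_iqr and its k parameter are hypothetical, introduced here only for illustration, and the data is made up. This is a sketch that returns a filtered copy instead of modifying the DataFrame in place.

```python
import pandas as pd


def remove_outliers_iqr(df, column, k=1.5):
    """Return a copy of df without rows whose `column` lies on or beyond the IQR bounds."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep values strictly inside the bounds, so rows with
    # value <= lower or value >= upper are removed.
    keep = (df[column] > lower) & (df[column] < upper)
    return df[keep].copy()


# Made-up data: one unusually large and one unusually small value.
sample = pd.DataFrame({"bmi": [0.01, 0.02, 0.00, -0.01, 0.03, 0.02, 0.50, -0.40]})
cleaned = remove_outliers_iqr(sample, "bmi")
print(sample.shape, "->", cleaned.shape)
```

Returning a copy leaves the original data available for comparison, and the boolean mask does not depend on the DataFrame having a default integer index.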
With outlier detection and removal, we ensure that our data is clean, reliable and ready to provide valuable insights, setting the foundation for robust analysis and accurate models.