Showing content from https://www.geeksforgeeks.org/data-science/detect-and-remove-the-outliers-using-python/ below:

Detect and Remove the Outliers using Python

Last Updated : 26 Jul, 2025

Outliers are data points that deviate significantly from other data points in a dataset. They can arise from a variety of factors such as measurement errors, rare events or natural variations in the data. If left unchecked, they can distort data analysis, skew statistical results and hurt machine learning model performance. In this article, we’ll see how to detect and handle outliers in Python using various techniques to improve the quality and reliability of our data.

Common Causes of Outliers

Understanding the causes of outliers helps in finding the best approach to handle them. Some common causes include:

Measurement errors: Errors during data collection or from instruments can result in extreme values that don't reflect the underlying data distribution.
Sampling errors: Outliers can arise if the sample we collected isn’t representative of the population we're studying.
Natural variability: Certain data points naturally fall outside the expected range especially in datasets with inherently high variability.
Data entry errors: Mistakes made during manual data entry such as incorrect values or typos can create outliers.
Experimental errors: Outliers can occur due to equipment malfunctions, environmental factors or unaccounted variables in experiments.
Sampling from multiple populations: Combining data from distinct populations with different characteristics can create outliers if researchers don't properly segment the datasets.
Intentional outliers: Sometimes outliers are deliberately introduced into datasets for testing purposes to evaluate the robustness of models or algorithms.

Need for Outlier Removal

Outliers can create significant issues in data analysis and machine learning which makes their removal important:

Skewed Statistical Measures: Outliers can distort the mean, standard deviation and correlation values. For example an extreme value can make the mean unrepresentative of the actual data which leads to incorrect conclusions.
Reduced Model Accuracy: Outliers can influence machine learning models, especially those sensitive to extreme values such as linear regression. They may cause the model to focus too much on these rare events, reducing its ability to generalize to new, unseen data.
Misleading Visualizations: Outliers can stretch the scale of charts and graphs, making it difficult to interpret the main data trends. For example, when visualizing a dataset with a few extreme values, it can be hard to see meaningful patterns in the majority of the data.

By removing or handling outliers, we prevent these issues and ensure more accurate analysis and predictions.
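As a small illustration of the first point, the sketch below uses a made-up list of values (not taken from any dataset in this article) to show how one extreme value moves the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical readings; the last value added below is an artificial outlier.
clean = [10, 11, 12, 13, 14]
with_outlier = clean + [100]

mean_clean = mean(clean)
mean_outlier = mean(with_outlier)
median_clean = median(clean)
median_outlier = median(with_outlier)

print("Mean without / with outlier:  ", mean_clean, "/", round(mean_outlier, 2))
print("Median without / with outlier:", median_clean, "/", median_outlier)
```

The mean more than doubles while the median barely moves, which is why robust statistics such as the median and quartiles are often preferred when outliers are suspected.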

Methods for Detecting and Removing Outliers

There are several ways to detect and handle outliers in Python. We can use visualization techniques or statistical methods depending on the nature of our data. Each method serves a different purpose and is suited to specific types of data. Here we will use the Pandas, Seaborn and Matplotlib libraries on the Diabetes dataset, which ships with the scikit-learn library.

1. Visualizing and Removing Outliers Using Box Plots

A boxplot is an effective way to visualize the distribution of data using quartiles. Points outside the "whiskers" of the plot are considered outliers. Box plots provide a quick way to see where the data is concentrated and where potential outliers lie.

Python


from sklearn.datasets import load_diabetes
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Diabetes dataset and wrap the features in a DataFrame
diabetes = load_diabetes()

column_name = diabetes.feature_names
df_diabetics = pd.DataFrame(diabetes.data, columns=column_name)

# Draw a box plot of the 'bmi' column
sns.boxplot(df_diabetics['bmi'])
plt.title('Boxplot of BMI')
plt.show()

Output:

Box Plot

In the boxplot, outliers appear as points outside the whiskers. These values are much higher or lower than the rest of the data. For example, bmi values above 0.12 could be identified as outliers.

To remove outliers, we can define a threshold value and filter the data.

Python


def removal_box_plot(df, column, threshold):
    # Keep only the rows whose value does not exceed the threshold
    removed_outliers = df[df[column] <= threshold]

    sns.boxplot(removed_outliers[column])
    plt.title(f'Box Plot without Outliers of {column}')
    plt.show()
    return removed_outliers


# Upper cut-off chosen by looking at the box plot above
threshold_value = 0.12

no_outliers = removal_box_plot(df_diabetics, 'bmi', threshold_value)

Output:

Box Plot

2. Visualizing and Removing Outliers Using Scatter Plots

Scatter plots help visualize relationships between two variables. They are used when we have paired numerical data, including cases where the dependent variable takes several values for the same value of the independent variable. Outliers appear as points far from the main cluster of data.

Python


# Plot BMI against blood pressure to look for isolated points
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df_diabetics['bmi'], df_diabetics['bp'])
ax.set_xlabel('BMI')
ax.set_ylabel('Blood Pressure')
plt.title('Scatter Plot of BMI vs Blood Pressure')
plt.show()

Output:

Visualizing Using Scatter Plots

From the graph we can see that most of the data points sit in the bottom left corner, while a few points lie at the opposite end, in the top right corner. Those points in the top right corner can be regarded as outliers.

Here’s how we can remove the outliers identified visually from the scatter plot.

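np.where(): Used to find the positions (integer indices) of the rows where the condition is true.
(df_diabetics['bmi'] > 0.12) & (df_diabetics['bp'] < 0.8): Selects rows where 'bmi' is greater than 0.12 and 'bp' is less than 0.8. Because the features in this dataset are mean-centered and scaled to small values, the 'bp' condition holds for every row, so the 'bmi' condition does the actual filtering.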

Python


import numpy as np

# Positions of the rows that satisfy the outlier condition
outlier_indices = np.where((df_diabetics['bmi'] > 0.12) & (df_diabetics['bp'] < 0.8))

# The positions match the index labels here because df_diabetics has the default 0..n-1 index
no_outliers = df_diabetics.drop(outlier_indices[0])

fig, ax_no_outliers = plt.subplots(figsize=(6, 4))
ax_no_outliers.scatter(no_outliers['bmi'], no_outliers['bp'])
ax_no_outliers.set_xlabel('BMI')
ax_no_outliers.set_ylabel('Blood Pressure')
plt.show()

Output:

Removing Outliers Using Scatter Plots

This removes the rows where BMI > 0.12 and BP < 0.8, conditions derived from visual inspection.
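One detail worth keeping in mind: np.where() returns integer positions, while DataFrame.drop() expects index labels. The two coincide only when the DataFrame has the default 0..n-1 index. A minimal sketch of a boolean-mask alternative follows, using a small hypothetical DataFrame (the values and index labels are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index labels are NOT 0..n-1,
# as can happen after earlier filtering without reset_index().
df = pd.DataFrame(
    {"bmi": [0.01, 0.15, 0.02, 0.13], "bp": [0.02, 0.05, 0.01, 0.03]},
    index=[10, 11, 12, 13],
)

# Boolean mask: True for rows flagged as outliers.
is_outlier = df["bmi"] > 0.12

# Keeping the complement of the mask works regardless of the index labels.
no_outliers = df[~is_outlier]
print(no_outliers)

# np.where gives positions, which are not valid labels for this frame.
positions = np.where(is_outlier)[0]
labels = df.index[positions].tolist()
print("Positions:", positions.tolist(), "Labels:", labels)
```

Filtering with the mask avoids the position/label mismatch entirely; if np.where() is preferred, the positions can be translated to labels with df.index[positions] before calling drop().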

3. Z-Score Method for Outlier Detection

The Z-score is also called the standard score. It measures how far a data point is from the mean, in units of standard deviations. If the absolute Z-score exceeds a given threshold (commonly 3), the data point is considered an outlier.

Z-score = \frac{x - \mu}{\sigma}

Where:

x = data point
μ = mean
σ = standard deviation
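The formula can be applied directly with NumPy. The sketch below uses a made-up sample and the population standard deviation (ddof=0), which is also the default used by scipy.stats.zscore.

```python
import numpy as np

# Hypothetical sample; 30 is deliberately far from the rest.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 30.0])

mu = x.mean()
sigma = x.std()          # population standard deviation (ddof=0)
z = (x - mu) / sigma     # Z-score of every point

print(np.round(z, 2))
print("Index of the largest |z|:", np.abs(z).argmax())
```

By construction the Z-scores have mean 0 and standard deviation 1, and the value 30 receives the largest absolute score.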

Here we calculate the Z-scores for the 'age' column of the DataFrame df_diabetics using the zscore function from the SciPy stats module. The resulting array z contains the absolute Z-score of each data point in the 'age' column, which shows how many standard deviations each value is from the mean.

Python


from scipy import stats
import numpy as np

# Absolute Z-score of every value in the 'age' column
z = np.abs(stats.zscore(df_diabetics['age']))
print(z)

Output:

Z scores

Next, an outlier threshold is chosen. A common choice is 3.0, because about 99.7% of the data points in a Gaussian distribution lie within \pm3 standard deviations of the mean.
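The coverage figure can be checked with the standard library. This is a small sketch; normal_coverage is a helper name introduced here for illustration, and the result applies only if the data is approximately Gaussian.

```python
import math

def normal_coverage(k):
    """Probability that a Gaussian value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k} standard deviations: {normal_coverage(k):.4f}")
```

For k = 3 this gives about 0.9973, and for k = 2 about 0.9545, so a threshold of 2 flags roughly 4.6% of Gaussian data compared with about 0.3% for a threshold of 3.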

For illustration, let's use a stricter threshold and remove the rows where the absolute Z-score is greater than 2.

np.where(): Used to find the positions (indices) in the Z-score array where the condition is true.
z > threshold_z: Checks for outliers in the 'age' column where the absolute Z-score exceeds the defined threshold (typically 2 or 3).
threshold_z = 2: The cutoff value used to identify outliers. Data points with an absolute Z-score greater than 2 are considered outliers.

Python


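import numpy as np

threshold_z = 2

# Positions where the absolute Z-score exceeds the threshold
outlier_indices = np.where(z > threshold_z)[0]
# Positions equal index labels here because of the default 0..n-1 index
no_outliers = df_diabetics.drop(outlier_indices)
print("Original DataFrame Shape:", df_diabetics.shape)
print("DataFrame Shape after Removing Outliers:", no_outliers.shape)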

Output:

Original DataFrame Shape: (442, 10)
DataFrame Shape after Removing Outliers: (426, 10)

4. Interquartile Range (IQR) Method

The IQR (interquartile range) method is a widely used and reliable technique for detecting outliers. Because it is based on quartiles rather than the mean and standard deviation, it is robust to skewed data and to the extreme values themselves, and it is widely trusted in research. The IQR is calculated as the difference between the third quartile (Q3) and the first quartile (Q1):

IQR = Q3 - Q1

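Syntax: numpy.percentile(a, q, axis=None, out=None, ..., method='linear')

Parameters:

a: Input array.
q: Percentile (or sequence of percentiles) to compute, between 0 and 100.
method: How the percentile is estimated when it falls between two data points (default 'linear').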
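The choice of estimation method only matters when the requested percentile falls between two data points. The sketch below compares the default linear interpolation with the midpoint method on a tiny made-up array; the `method` keyword assumes NumPy 1.22 or later.

```python
import numpy as np

data = np.array([1, 2, 3, 4])

# Default: linear interpolation between the two nearest values.
q1_linear = np.percentile(data, 25)
# 'midpoint': average of the two nearest values.
q1_midpoint = np.percentile(data, 25, method="midpoint")

print(q1_linear)    # 1.75
print(q1_midpoint)  # 1.5
```

On large datasets the two estimates are usually very close, so the choice rarely changes which rows are flagged.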

Here we calculate the interquartile range (IQR) for the 'bmi' column of the DataFrame df_diabetics. The code first finds the first quartile (Q1) and third quartile (Q3) using the midpoint method, then calculates the IQR as the difference between Q3 and Q1, which measures the spread of the middle 50% of the data in the 'bmi' column.

Python


# First and third quartiles of 'bmi' (the method keyword needs NumPy 1.22+)
Q1 = np.percentile(df_diabetics['bmi'], 25, method='midpoint')
Q3 = np.percentile(df_diabetics['bmi'], 75, method='midpoint')
IQR = Q3 - Q1
print(IQR)

Output:

0.06520763046978838

To flag outliers, we define an upper and a lower bound around the dataset's normal range, placed 1.5*IQR beyond the quartiles:

upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR

The multiplier 1.5 (the IQR scaled up by half of itself, new_IQR = IQR + 0.5*IQR) places the bounds at roughly 2.7 standard deviations from the mean when the data follows a Gaussian distribution.
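The 2.7 figure can be reproduced for a standard Gaussian with the standard library. This is a sketch under the assumption of perfectly Gaussian data; real datasets will deviate from it.

```python
from statistics import NormalDist

std_normal = NormalDist()              # mean 0, standard deviation 1

q3 = std_normal.inv_cdf(0.75)          # about 0.674
q1 = -q3                               # by symmetry
iqr = q3 - q1                          # about 1.349

upper = q3 + 1.5 * iqr                 # about 2.698
outside = 2 * (1 - std_normal.cdf(upper))

print(f"Upper bound: {upper:.3f} standard deviations")
print(f"Share of Gaussian data outside both bounds: {outside:.4f}")
```

Under this assumption, fewer than 1% of points fall outside the bounds, so the rule flags only a small tail of well-behaved data.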

Python


# Upper bound and number of values on or above it
upper = Q3+1.5*IQR
upper_array = np.array(df_diabetics['bmi'] >= upper)
print("Upper Bound:", upper)
print(upper_array.sum())

# Lower bound and number of values on or below it
lower = Q1-1.5*IQR
lower_array = np.array(df_diabetics['bmi'] <= lower)
print("Lower Bound:", lower)
print(lower_array.sum())

Output:

Upper and Lower bounds

Now let's detect and remove outliers using the interquartile range (IQR).

Here we use the IQR method to detect and remove outliers in the 'bmi' column of the diabetes dataset. The code calculates the upper and lower limits from the IQR, finds the positions of the rows on or beyond those limits, and removes the corresponding rows from the DataFrame. The shapes of the DataFrame before and after removal are printed for comparison.

Python


import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

# Load the dataset into a fresh DataFrame
diabetes = load_diabetes()

column_name = diabetes.feature_names
df_diabetes = pd.DataFrame(diabetes.data, columns=column_name)
print("Old Shape:", df_diabetes.shape)

# Quartiles, IQR and the 1.5*IQR bounds for 'bmi'
Q1 = df_diabetes['bmi'].quantile(0.25)
Q3 = df_diabetes['bmi'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR

# Positions of rows on or beyond each bound (equal to the labels of the default index)
upper_array = np.where(df_diabetes['bmi'] >= upper)[0]
lower_array = np.where(df_diabetes['bmi'] <= lower)[0]

# Drop the flagged rows in place
df_diabetes.drop(index=upper_array, inplace=True)
df_diabetes.drop(index=lower_array, inplace=True)

print("New Shape:", df_diabetes.shape)

Output:

Old Shape: (442, 10)
New Shape: (439, 10)
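The same steps can be wrapped in a reusable helper. The sketch below is one possible way to do it: remove_outliers_iqr and its parameter k are names introduced here for illustration, and the sample data is made up.

```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Return a copy of df without rows whose `column` value lies on or beyond the IQR bounds."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep values strictly inside the bounds, matching the >= / <= tests used above.
    keep = (df[column] > lower) & (df[column] < upper)
    return df[keep].copy()

# Hypothetical data: 50 is far from the other values.
sample = pd.DataFrame({"value": [1, 2, 3, 4, 5, 6, 7, 8, 50]})
cleaned = remove_outliers_iqr(sample, "value")
print(sample.shape, "->", cleaned.shape)
```

Because it filters with a boolean mask and returns a copy, the helper does not depend on the index labels and leaves the original DataFrame unchanged.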

With outlier detection and removal, we make our data cleaner and more reliable, laying the foundation for robust analysis and accurate models.
