Last Updated : 21 Jul, 2025
Missing values are a common challenge in machine learning and data analysis. They occur when certain data points are missing for specific variables in a dataset. These gaps in information can take the form of blank cells, null values or special symbols like "NA", "NaN" or "unknown." If not addressed properly, missing values can harm the accuracy and reliability of our models. They can reduce the sample size, introduce bias and make it difficult to apply certain analysis techniques that require complete data. Efficiently handling missing values is important to ensure our machine learning models produce accurate and unbiased results. In this article, we'll see more about the methods and strategies to deal with missing data effectively.
[Figure: Missing Values]
Importance of Handling Missing Values
Handling missing values is important for ensuring the accuracy and reliability of data analysis and machine learning models. Key reasons include:
Missing values can introduce several challenges in data analysis including:
Data can be missing from a dataset for several reasons and understanding the cause is important for selecting the most effective way to handle it. Common reasons for missing data include:
By identifying the reason behind the missing data, we can better assess its impact (for example, whether it introduces bias or otherwise distorts the analysis) and select a suitable handling method such as imputation or removal.
Types of Missing Values
Missing values in a dataset can be categorized into three main types, each with different implications for how they should be handled:
Detecting and managing missing data is important for data analysis. Let's see some useful functions for detecting, removing and replacing null values in Pandas DataFrame.
| Function | Description |
|---|---|
| .isnull() | Identifies missing values in a Series or DataFrame; returns True where a value is missing. |
| .notnull() | Opposite of .isnull(); returns True for non-missing values and False for missing values. |
| .info() | Displays a DataFrame summary including data types, memory usage and the non-null count of each column. |
| .isna() | Alias of .isnull(); returns True for missing data and False for valid data. |
| dropna() | Removes rows or columns with missing values, with customizable options for axis and threshold. |
| fillna() | Fills missing values with a specified value (like the mean or median). Forward and backward fill are done with ffill() and bfill(); the older method argument of fillna() is deprecated in recent pandas versions. |
| replace() | Replaces specified values in the DataFrame, useful for correcting or standardizing data. |
| drop_duplicates() | Removes duplicate rows based on specified columns. |
| unique() | Finds unique values in a Series. |

For more detail refer to Working with Missing Data in Pandas
Representation of Missing Values in Datasets
Missing values can be represented by blank cells, specific values like "NA" or codes. It's important to use consistent and documented representation to ensure transparency and ease in data handling.
Common representations include:
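Placeholder strings are ordinary text to pandas, so the detection functions do not flag them until they are converted to NaN. The following is a minimal sketch of that normalization step. It assumes a small hypothetical DataFrame (`raw`) in which "NA", "unknown" and the empty string all stand for a missing entry; the column names and the placeholder list are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three different placeholders all mean "missing"
raw = pd.DataFrame({
    "City": ["Miami", "NA", "Houston", "unknown"],
    "Marks": ["85", "", "78", "NA"],
})

# Before normalization pandas sees ordinary strings, so nothing is flagged
missing_before = int(raw.isnull().sum().sum())

# Map every placeholder to NaN so the detection functions can find them
placeholders = ["NA", "unknown", ""]
clean = raw.replace(placeholders, np.nan)

# 'Marks' was stored as text; convert it to a numeric column
clean["Marks"] = pd.to_numeric(clean["Marks"])

# Per-column count of missing values after normalization
missing_after = clean.isnull().sum()

print("Missing before:", missing_before)
print(missing_after)
```

When the data comes from a file, the `na_values` parameter of `pd.read_csv()` can apply the same mapping at load time, which keeps the representation consistent from the start.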
Depending on the nature of the data and the missingness, several strategies can help maintain the integrity of our analysis. Let's see some of the most effective methods to handle missing values.
Before moving to various strategies, let's first create a Sample Dataframe so that we can use it for different methods.
Creating a Sample Dataframe
Here we will be using the Pandas and NumPy libraries.
Python
import pandas as pd
import numpy as np
data = {
    'School ID': [101, 102, 103, np.nan, 105, 106, 107, 108],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry'],
    'Address': ['123 Main St', '456 Oak Ave', '789 Pine Ln', '101 Elm St', np.nan, '222 Maple Rd', '444 Cedar Blvd', '555 Birch Dr'],
    'City': ['Los Angeles', 'New York', 'Houston', 'Los Angeles', 'Miami', np.nan, 'Houston', 'New York'],
    'Subject': ['Math', 'English', 'Science', 'Math', 'History', 'Math', 'Science', 'English'],
    'Marks': [85, 92, 78, 89, np.nan, 95, 80, 88],
    'Rank': [2, 1, 4, 3, 8, 1, 5, 3],
    'Grade': ['B', 'A', 'C', 'B', 'D', 'A', 'C', 'B']
}
df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)
Output: [Figure: Creating a Sample Dataframe]
1. Removing Rows with Missing Values
Removing rows with missing values is a simple and straightforward method to handle missing data, used when we want to keep our analysis clean and minimize complexity.
Advantages:
Disadvantages:
In this example, we remove rows with missing values from the original DataFrame (df) using the dropna() method and then display the cleaned DataFrame (df_cleaned). Because the sample data has missing entries in three different rows, only five of the eight rows remain.
Python
df_cleaned = df.dropna()
print("\nDataFrame after removing rows with missing values:")
print(df_cleaned)
Output: [Figure: Removing Rows with Missing Values]
2. Imputation Methods
Imputation involves replacing missing values with estimated values. This approach is beneficial when we want to preserve the dataset’s sample size and avoid losing data points. However, it's important to note that the accuracy of the imputed values may not always be reliable.
Let's see some common imputation methods:
2.1 Mean, Median and Mode Imputation:
This method involves replacing missing values with the mean, median or mode of the relevant variable. It's a simple approach but it doesn't account for the relationships between variables.
In this example, we apply these imputation techniques to the 'Marks' column of the DataFrame (df). The code fills the missing value with the mean, median and mode of the existing values in that column and prints each result for comparison. Because every mark in the sample is distinct, mode() returns all of them in sorted order and .iloc[0] selects the smallest one (78).
mean_imputation = df['Marks'].fillna(df['Marks'].mean())
median_imputation = df['Marks'].fillna(df['Marks'].median())
mode_imputation = df['Marks'].fillna(df['Marks'].mode().iloc[0])
print("\nImputation using Mean:")
print(mean_imputation)
print("\nImputation using Median:")
print(median_imputation)
print("\nImputation using Mode:")
print(mode_imputation)
Output: [Figure: Mean, Median and Mode Imputation]
Advantages:
Disadvantages:
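Mean and median only apply to numeric columns, while the mode also works for categorical ones, and the mean is more sensitive to outliers than the median. The sketch below illustrates both points. It assumes a small hypothetical DataFrame (`students`) with a deliberately extreme value in 'Marks'; none of these values come from the sample DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing categorical and one missing numeric value
students = pd.DataFrame({
    "City": ["Houston", "New York", np.nan, "Houston", "Miami"],
    "Marks": [85.0, 92.0, 78.0, np.nan, 300.0],  # 300 is a deliberate outlier
})

# Categorical column: fill with the most frequent value
city_mode = students["City"].mode().iloc[0]
students["City"] = students["City"].fillna(city_mode)

# Numeric column: compare the two candidate fill values
marks_mean = students["Marks"].mean()      # pulled upward by the outlier
marks_median = students["Marks"].median()  # barely affected by the outlier

# Fill with the median, the more robust choice for this skewed column
students["Marks"] = students["Marks"].fillna(marks_median)

print("Mode of City:", city_mode)
print("Mean:", marks_mean, "Median:", marks_median)
print(students)
```

Whether the mean or the median is the better fill value depends on the distribution of the column, so inspecting both before imputing is a reasonable habit.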
2.2 Forward and Backward Fill
Forward and backward fill techniques are used to replace missing values by filling them with the nearest non-missing values from the same column. This is useful when there’s an inherent order or sequence in the data.
The ffill() and bfill() methods select the filling direction. Older code passes method='ffill' or method='bfill' to fillna() instead, but that argument is deprecated in recent pandas versions.
forward_fill = df['Marks'].ffill()
backward_fill = df['Marks'].bfill()
print("\nForward Fill:")
print(forward_fill)
print("\nBackward Fill:")
print(backward_fill)
Output: [Figure: Forward and Backward Fill]
Advantages:
Disadvantages:
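Forward and backward fill behave differently at the edges of a column and across runs of consecutive gaps. The following is a minimal sketch of those cases, assuming a hypothetical ordered Series (`readings`) with gaps at the start, in the middle and at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with gaps at the start, middle and end
readings = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

forward = readings.ffill()           # leading NaN stays: nothing earlier to copy
backward = readings.bfill()          # trailing NaN stays: nothing later to copy
limited = readings.ffill(limit=1)    # fill at most one value per run of gaps
combined = readings.ffill().bfill()  # forward fill, then patch the leading gap

print("Forward: ", forward.tolist())
print("Backward:", backward.tolist())
print("Limited: ", limited.tolist())
print("Combined:", combined.tolist())
```

The `limit` argument is useful when a long run of gaps should not be filled from a single stale observation.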
Note:
- Forward fill uses the last valid observation to fill missing values.
- Backward fill uses the next valid observation to fill missing values.
3. Interpolation Techniques
Interpolation is a technique used to estimate missing values based on the values of surrounding data points. Unlike simpler imputation methods (e.g., mean, median, mode), interpolation uses the relationship between neighboring values to make more informed estimations.
The interpolate() method in pandas supports several interpolation methods; two common ones are linear and quadratic. The quadratic method requires SciPy to be installed.
linear_interpolation = df['Marks'].interpolate(method='linear')
quadratic_interpolation = df['Marks'].interpolate(method='quadratic')
print("\nLinear Interpolation:")
print(linear_interpolation)
print("\nQuadratic Interpolation:")
print(quadratic_interpolation)
Output: [Figure: Interpolation techniques]
Advantages:
Disadvantages:
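To make the arithmetic of linear interpolation concrete, here is a minimal sketch on two hypothetical Series (`values` and `uneven`, neither taken from the sample DataFrame). It also shows that method='linear' treats observations as equally spaced, whereas method='index' takes the numeric index into account.

```python
import numpy as np
import pandas as pd

# Hypothetical evenly spaced series with a two-value gap between 10 and 40
values = pd.Series([10.0, np.nan, np.nan, 40.0])
linear = values.interpolate(method="linear")  # equal steps: 10, 20, 30, 40

# Hypothetical observations taken at uneven positions 0, 1, 3 and 6
uneven = pd.Series([0.0, np.nan, np.nan, 60.0], index=[0, 1, 3, 6])
by_position = uneven.interpolate(method="linear")  # ignores the index values
by_index = uneven.interpolate(method="index")      # weights by index distance

print("Linear:     ", linear.tolist())
print("By position:", by_position.tolist())
print("By index:   ", by_index.tolist())
```

For data recorded at irregular intervals, the index-aware variant is usually closer to what "a straight line between two points" is meant to express.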
Note:
- Linear interpolation assumes a straight line between two adjacent non-missing values.
- Quadratic interpolation fits a second-order (quadratic) curve through the surrounding non-missing values, so it can follow curvature that a straight line misses.
Impact of Handling Missing Values
Handling missing values effectively is important to ensure the accuracy and reliability of our findings.
Let's see some key impacts of handling missing values:
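One way to see these impacts is to compare summary statistics after different handling choices. The following is a minimal sketch that reuses the 'Marks' values from the sample DataFrame; the comparison table (`summary`) is an illustrative construction, not part of the original example.

```python
import numpy as np
import pandas as pd

# 'Marks' values from the sample DataFrame, with one missing entry
marks = pd.Series([85, 92, 78, 89, np.nan, 95, 80, 88], name="Marks")

dropped = marks.dropna()                  # removal: one observation is lost
mean_filled = marks.fillna(marks.mean())  # imputation: sample size is kept

# Compare sample size, mean and spread for the two strategies
summary = pd.DataFrame(
    {
        "rows": [len(dropped), len(mean_filled)],
        "mean": [dropped.mean(), mean_filled.mean()],
        "std": [dropped.std(), mean_filled.std()],
    },
    index=["dropna", "mean fill"],
)

print(summary)
```

Mean imputation keeps all eight rows and leaves the mean unchanged, but it lowers the standard deviation because the filled value sits exactly at the center of the distribution. Removal keeps the spread of the observed values but shrinks the sample.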
Effectively handling missing values is important for maintaining data integrity, improving model performance and ensuring reliable analysis. By carefully choosing appropriate strategies for imputation or removal, we increase the quality of our data, minimize bias and maximize the accuracy of our findings.