Source: https://www.geeksforgeeks.org/machine-learning/ml-handling-missing-values/

ML | Handling Missing Values


Last Updated : 21 Jul, 2025

Missing values are a common challenge in machine learning and data analysis. They occur when certain data points are absent for specific variables in a dataset. These gaps can take the form of blank cells, null values or special symbols like "NA", "NaN" or "unknown". If not addressed properly, missing values can harm the accuracy and reliability of our models: they reduce the sample size, introduce bias and make it difficult to apply analysis techniques that require complete data. Handling missing values efficiently is therefore important to ensure our machine learning models produce accurate and unbiased results. In this article, we'll look at methods and strategies for dealing with missing data effectively.

Importance of Handling Missing Values

Handling missing values is important for ensuring the accuracy and reliability of data analysis and machine learning models. Key reasons include:

  1. Preserving sample size: dropping every incomplete row can shrink the dataset and weaken statistical power.
  2. Avoiding bias: if values are not missing at random, ignoring the gaps skews the results.
  3. Algorithm requirements: many estimators simply refuse to fit on data containing NaN.
  4. Reliable evaluation: metrics computed on inconsistently cleaned data are hard to compare.

Challenges Posed by Missing Values

Missing values can introduce several challenges in data analysis, including:

  1. Reduced sample size and statistical power when incomplete rows are dropped.
  2. Biased estimates when the missingness is systematic rather than random.
  3. Incompatibility with algorithms and functions that require complete input.
  4. Distorted summary statistics such as means, variances and correlations.

Reasons Behind Missing Values in the Dataset

Data can be missing from a dataset for several reasons, and understanding the cause is important for selecting the most effective way to handle it. Common reasons include:

  1. Data entry or recording errors, where values were never captured or were later lost.
  2. Equipment or sensor failures during automated data collection.
  3. Non-response in surveys, where participants skip sensitive or optional questions.
  4. Merging datasets from different sources that do not cover the same fields.

By identifying the reason behind the missing data, we can better assess its impact, whether it is introducing bias or merely reducing the sample, and select the proper handling method, such as imputation or removal.

Types of Missing Values

Missing values in a dataset can be categorized into three main types each with different implications for how they should be handled:

  1. Missing Completely at Random (MCAR): In this case, the missing data is completely random and unrelated to any other variable in the dataset. The absence of data points occurs without any systematic pattern such as a random technical failure or data omission.
  2. Missing at Random (MAR): The missingness is related to other observed variables but not to the value of the missing data itself. For example, if younger individuals are more likely to skip a particular survey question, the missingness can be explained by age but not by the content of the missing data.
  3. Missing Not at Random (MNAR): Here, the probability of missing data is related to the value of the missing data itself. For example, people with higher incomes may be less likely to report their income, leading to a direct connection between the missingness and the value of the missing data.
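To make the distinction concrete, the three mechanisms can be simulated by deleting values from the same toy column under different rules. This is a minimal sketch; the age and income columns, the probabilities and the thresholds are all invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=1000),
    "income": rng.normal(50_000, 15_000, size=1000),
})

# MCAR: every income value has the same 10% chance of going missing.
mcar = df["income"].mask(rng.random(1000) < 0.10)

# MAR: younger respondents are more likely to skip the income question;
# missingness depends on the observed "age" column, not on income itself.
mar = df["income"].mask((df["age"] < 30) & (rng.random(1000) < 0.40))

# MNAR: high earners are more likely to withhold income; missingness
# depends on the very value that goes missing.
mnar = df["income"].mask((df["income"] > 65_000) & (rng.random(1000) < 0.60))

print(mcar.isna().sum(), mar.isna().sum(), mnar.isna().sum())
```

Only the MCAR gaps can be safely treated as random noise; for MAR and MNAR the choice of handling method should take the mechanism into account.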
Methods for Identifying Missing Data

Detecting and managing missing data is important for data analysis. Let's see some useful functions for detecting, removing and replacing null values in Pandas DataFrame.

.isnull() : Identifies missing values in a Series or DataFrame, returning True where a value is missing.

.notnull() : Opposite of .isnull(); returns True for non-missing values and False for missing values.

.isna() : Alias of .isnull(); returns True for missing data and False for valid data.

.info() : Displays a DataFrame summary including data types, memory usage and the count of non-null values per column.

.dropna() : Removes rows or columns with missing values, with customizable options such as axis, how, thresh and subset.

.fillna() : Fills missing values with a specified value (such as the mean or median) or a fill strategy (forward/backward fill).

.replace() : Replaces specified values in the DataFrame; useful for converting sentinel codes like -999 into NaN.

.drop_duplicates() : Removes duplicate rows, optionally based on a subset of columns.

.unique() : Returns the unique values in a Series, which helps spot placeholder strings like "unknown".

For more detail refer to Working with Missing Data in Pandas
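As a quick sketch of how these functions combine in practice, isnull() chained with sum() gives a per-column count of gaps (the tiny DataFrame below is invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Marks": [85, np.nan, 78, np.nan],
    "Grade": ["B", "A", np.nan, "B"],
})

print(df.isnull().sum())         # missing entries per column
print(df.isnull().sum().sum())   # total missing entries in the DataFrame
print(df.notnull().all(axis=1))  # True for rows that are fully populated
```

Running this on a real dataset is usually the first step before choosing between removal and imputation.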

Representation of Missing Values in Datasets

Missing values can be represented by blank cells, specific values like "NA", or codes. It's important to use a consistent, documented representation to keep data handling transparent.

Common representations include:

  1. Blank Cells: Empty cells in data tables or spreadsheets are used to signify missing values. This is common in many data formats like CSVs.
  2. Specific Values: Commonly used placeholders for missing data include "NA", "NaN", "NULL" or arbitrary sentinel values like -999. It’s important to choose a standardized value and document its meaning to prevent confusion.
  3. Codes or Flags: In some cases, non-numeric codes or flags (e.g. "MISSING", "UNKNOWN") are used to mark missing data. These can help distinguish between different types of missingness or categorize missing data by its origin.
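When loading a file, pandas can be told up front which placeholder strings to treat as missing via the na_values parameter of read_csv. A small sketch, with the CSV content and the -999 sentinel assumed for illustration:

```python
from io import StringIO

import pandas as pd

csv_data = StringIO(
    "name,score\n"
    "Alice,85\n"
    "Bob,NA\n"
    "Carol,-999\n"
    "Dan,unknown\n"
)

# "NA" is recognised by default; add this dataset's own sentinels.
df = pd.read_csv(csv_data, na_values=["-999", "unknown"])
print(df)
print(df["score"].isna().sum())
```

Converting all sentinels to NaN at load time means every later step (isnull(), dropna(), fillna()) sees a single, consistent representation.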
Strategies for Handling Missing Values in Data Analysis

Depending on the nature of the data and the missingness, several strategies can help maintain the integrity of our analysis. Let's see some of the most effective methods to handle missing values.

Before moving to various strategies, let's first create a Sample Dataframe so that we can use it for different methods.

Creating a Sample Dataframe

Here we will be using the Pandas and NumPy libraries.

Python
import pandas as pd
import numpy as np

data = {
    'School ID': [101, 102, 103, np.nan, 105, 106, 107, 108],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry'],
    'Address': ['123 Main St', '456 Oak Ave', '789 Pine Ln', '101 Elm St', np.nan, '222 Maple Rd', '444 Cedar Blvd', '555 Birch Dr'],
    'City': ['Los Angeles', 'New York', 'Houston', 'Los Angeles', 'Miami', np.nan, 'Houston', 'New York'],
    'Subject': ['Math', 'English', 'Science', 'Math', 'History', 'Math', 'Science', 'English'],
    'Marks': [85, 92, 78, 89, np.nan, 95, 80, 88],
    'Rank': [2, 1, 4, 3, 8, 1, 5, 3],
    'Grade': ['B', 'A', 'C', 'B', 'D', 'A', 'C', 'B']
}

df = pd.DataFrame(data)
print("Sample DataFrame:")

print(df)

Output:

(Output: the sample DataFrame, with NaN in the School ID, Address, City and Marks columns.)

1. Removing Rows with Missing Values

Removing rows with missing values is the simplest way to handle missing data, and is appropriate when only a small fraction of rows is affected and we want to keep the analysis clean.

Advantages:

  1. Simple to apply and easy to explain.
  2. Leaves only complete cases, so no estimated values can distort the analysis.

Disadvantages:

  1. Can discard a large share of the data when missing values are widespread.
  2. Introduces bias if the dropped rows differ systematically from the kept ones.

In this example, we are removing rows with missing values from the original DataFrame (df) using the dropna() method and then displaying the cleaned DataFrame (df_cleaned).

Python
df_cleaned = df.dropna()

print("\nDataFrame after removing rows with missing values:")
print(df_cleaned)

Output:

(Output: the cleaned DataFrame, containing only the rows with no missing values.)

2. Imputation Methods

Imputation involves replacing missing values with estimated values. This approach is beneficial when we want to preserve the dataset’s sample size and avoid losing data points. However, it's important to note that the accuracy of the imputed values may not always be reliable.

Let's see some common imputation methods:

2.1 Mean, Median and Mode Imputation:

This method involves replacing missing values with the mean, median or mode of the relevant variable. It's a simple approach but it doesn't account for the relationships between variables.

In this example, we demonstrate imputation for the 'Marks' column of the DataFrame (df). We fill the missing value with the mean, the median and the mode of the existing values in that column, then print each result for comparison.

Python
mean_imputation = df['Marks'].fillna(df['Marks'].mean())
median_imputation = df['Marks'].fillna(df['Marks'].median())
mode_imputation = df['Marks'].fillna(df['Marks'].mode().iloc[0])

print("\nImputation using Mean:")
print(mean_imputation)

print("\nImputation using Median:")
print(median_imputation)

print("\nImputation using Mode:")
print(mode_imputation)

Output:

Mean, Median and Mode Imputation

Advantages:

  1. Fast, simple and applicable to both numeric (mean, median) and categorical (mode) columns.
  2. Preserves the sample size.

Disadvantages:

  1. Ignores relationships between variables and shrinks the variance of the imputed column.
  2. The mean is sensitive to outliers; the median or mode may be preferable for skewed data.
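One way to soften the drawback that plain mean imputation ignores relationships between variables, while staying within pandas, is to impute with a group-wise mean instead of the global one. A minimal sketch; the Subject/Marks toy data here is invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Subject": ["Math", "Math", "Science", "Science"],
    "Marks":   [80.0, np.nan, 60.0, 70.0],
})

# Fill each missing mark with the mean of its own subject rather than
# the overall mean, preserving between-group differences.
group_mean = df.groupby("Subject")["Marks"].transform("mean")
df["Marks"] = df["Marks"].fillna(group_mean)
print(df)
```

Here the missing Math mark is filled with the Math average (80.0) instead of the overall average, which would have pulled it toward the Science scores.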

2.2 Forward and Backward Fill

Forward and backward fill techniques are used to replace missing values by filling them with the nearest non-missing values from the same column. This is useful when there’s an inherent order or sequence in the data.

The method parameter in fillna() lets you specify the filling strategy, though recent pandas versions prefer the dedicated ffill() and bfill() methods.

Python
forward_fill = df['Marks'].ffill()    # propagate the last valid value forward
backward_fill = df['Marks'].bfill()   # propagate the next valid value backward

print("\nForward Fill:")
print(forward_fill)

print("\nBackward Fill:")
print(backward_fill)

Output:

Forward and Backward Fill

Advantages:

  1. Preserves the local structure of sequential data such as time series.
  2. Requires no statistical modelling.

Disadvantages:

  1. Assumes adjacent observations are similar, which may not hold across large gaps.
  2. Leading values stay missing with forward fill and trailing values with backward fill.

Note: In pandas 2.1 and later, fillna(method=...) is deprecated; the dedicated Series.ffill() and Series.bfill() methods are the recommended replacements.

3. Interpolation Techniques

Interpolation is a technique used to estimate missing values based on the values of surrounding data points. Unlike simpler imputation methods (e.g mean, median, mode), interpolation uses the relationship between neighboring values to make more informed estimations.

The interpolate() method in pandas supports several strategies; two common ones are linear and quadratic interpolation.

Python
linear_interpolation = df['Marks'].interpolate(method='linear')
quadratic_interpolation = df['Marks'].interpolate(method='quadratic')

print("\nLinear Interpolation:")
print(linear_interpolation)

print("\nQuadratic Interpolation:")
print(quadratic_interpolation)

Output:

Interpolation techniques

Advantages:

  1. Uses the trend of neighbouring values, so estimates are usually closer to reality than a single global statistic.
  2. Well suited to ordered data such as time series.

Disadvantages:

  1. Assumes a smooth relationship between neighbouring points, which noisy data may violate.
  2. Higher-order methods such as quadratic interpolation can overshoot and produce implausible values.

Note: Non-linear strategies such as method='quadratic' delegate to SciPy, so SciPy must be installed; plain method='linear' works with pandas alone.
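Interpolation also extends naturally to time-indexed data: method='time' weights each estimate by the actual gap between timestamps, which matters when observations are unevenly spaced. A small sketch with an invented daily series:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2025-01-01", periods=5, freq="D")
s = pd.Series([10.0, np.nan, np.nan, 22.0, 26.0], index=idx)

# With evenly spaced daily timestamps this coincides with linear
# interpolation; with irregular timestamps the two would differ.
print(s.interpolate(method="time"))
```

The two gaps are filled with 14.0 and 18.0, the values that lie proportionally between the surrounding observations.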

Impact of Handling Missing Values

Handling missing values effectively is important to ensure the accuracy and reliability of our findings.

Let's see some key impacts of handling missing values:

  1. Improved data quality: A cleaner dataset with fewer missing values is more reliable for analysis and model training.
  2. Enhanced model performance: Properly handling missing values helps models perform better by training on complete data, leading to more accurate predictions.
  3. Preservation of Data Integrity: Imputing or removing missing values ensures consistency and accuracy in the dataset, maintaining its integrity for further analysis.
  4. Reduced bias: Addressing missing values prevents bias in analysis, ensuring a more accurate representation of the underlying patterns in the data.

Effectively handling missing values is important for maintaining data integrity, improving model performance and ensuring reliable analysis. By carefully choosing appropriate strategies for imputation or removal, we increase the quality of our data, minimize bias and maximize the accuracy of our findings.


