When working with data in Pandas, one common task is removing duplicate rows so that datasets stay clean and accurate. The drop_duplicates() method in Pandas is designed to make this quick and easy: it removes duplicate rows from a DataFrame based on either all columns or a chosen subset of them.
By default, drop_duplicates() scans the entire DataFrame, retains the first occurrence of each row and removes any duplicates that follow. In this article, we will see how to use the drop_duplicates() method, with examples.
Let's start with a basic example to see how drop_duplicates() works.
Python
import pandas as pd

# A small DataFrame with one exact duplicate row (index 0 and index 2)
data = {
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Drop exact duplicates, keeping the first occurrence of each row
df_cleaned = df.drop_duplicates()

print("\nModified DataFrame (no duplicates):")
print(df_cleaned)
Output:
Original DataFrame:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago

Modified DataFrame (no duplicates):
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
This example shows how duplicate rows are removed while the first occurrence of each row is retained by dataframe.drop_duplicates().
Syntax:
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
Parameters:
1. subset: Specifies the column or columns to check for duplicates. If not provided, all columns are considered.
2. keep: Determines which duplicate to keep: 'first' (the default) keeps the first occurrence, 'last' keeps the last occurrence and False drops every duplicated row. The three options are compared in the sketch below.
3. inplace: If True, it modifies the original DataFrame directly. If False (default), it returns a new DataFrame.
4. ignore_index: If True, the resulting rows are relabelled 0, 1, ..., n-1. Defaults to False.
Return type: Returns a new DataFrame with duplicates removed, or None when inplace=True.
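To see the keep options and the inplace return value side by side, here is a minimal sketch using a small made-up DataFrame:
Python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 25]
})

# keep='first' (default): index 0 survives, index 2 is dropped
print(df.drop_duplicates(keep='first'))

# keep='last': index 2 survives, index 0 is dropped
print(df.drop_duplicates(keep='last'))

# keep=False: both duplicated rows are dropped, only Bob remains
print(df.drop_duplicates(keep=False))

# With inplace=True the method returns None and modifies df itself
result = df.drop_duplicates(inplace=True)
print(result)  # None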
Examples
Below are some examples of the dataframe.drop_duplicates() method:
1. Dropping Duplicates Based on Specific Columns
We can target duplicates in specific columns using the subset parameter. This is useful when only some columns are relevant for identifying duplicates.
Python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'SF', 'Chicago']
})

# Check for duplicates in the "Name" column only
df_cleaned = df.drop_duplicates(subset=["Name"])
print(df_cleaned)
Output:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
Here, duplicates are removed based only on the Name column; Age and City are ignored when identifying duplicates.
2. Keeping the Last Occurrence of Duplicates
By default, drop_duplicates() retains the first occurrence of duplicates. If we want to keep the last occurrence instead, we can use keep='last'.
Python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# Keep the last occurrence of each duplicated row instead of the first
df_cleaned = df.drop_duplicates(keep='last')
print(df_cleaned)
Output:
    Name  Age     City
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago
Here, the last occurrence of the duplicated Alice row (index 2) is kept and the first occurrence (index 0) is removed.
3. Dropping All Duplicates
If we want to remove every row that has a duplicate, i.e. retain only completely unique rows, we can set keep=False.
Python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# keep=False drops every row that appears more than once
df_cleaned = df.drop_duplicates(keep=False)
print(df_cleaned)
Output:
    Name  Age     City
1    Bob   30       LA
3  David   40  Chicago
With keep=False, both occurrences of the Alice row are removed, leaving only the rows that are unique across all columns.
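Before dropping anything, it can also be useful to inspect which rows Pandas treats as duplicates. Here is a minimal sketch using the related DataFrame.duplicated() method, which returns a boolean mask, on the same data:
Python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# keep=False marks every row that has a duplicate anywhere as True
mask = df.duplicated(keep=False)
print(mask)

# These are exactly the rows drop_duplicates(keep=False) would remove
print(df[mask])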
4. Modifying the Original DataFrame Directly
If we'd like to modify the DataFrame in place, without creating a new DataFrame, we can set inplace=True.
Python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# Modify df itself; drop_duplicates returns None here
df.drop_duplicates(inplace=True)
print(df)
Output:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
Using inplace=True modifies the original DataFrame directly, so there is no need to assign the result to a new variable.
5. Dropping Duplicates Based on Partially Identical Columns
Sometimes duplicates are not exact rows but have identical values in certain columns. For example, after merging datasets we may want to drop rows that share the same values in a subset of columns.
Python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30],
    "City": ["NY", "LA", "NY", "Chicago", "LA"]
}
df = pd.DataFrame(data)

# Rows count as duplicates when their Name and City values match
df_cleaned = df.drop_duplicates(subset=["Name", "City"])
print(df_cleaned)
Output:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
Here, duplicates are removed based on the Name and City columns, leaving only unique combinations of Name and City.
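One detail in all the examples above: surviving rows keep their original index labels (0, 1, 3 and so on). If a fresh 0-based index is wanted, drop_duplicates() also accepts an ignore_index parameter (available from pandas 1.0 onwards), sketched here:
Python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# ignore_index=True relabels the result 0, 1, 2, ... after dropping
df_cleaned = df.drop_duplicates(ignore_index=True)
print(df_cleaned)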
By mastering the drop_duplicates() method, we can ensure that our datasets are clean and reliable, allowing us to draw accurate insights and make informed decisions.