Return boolean Series denoting duplicate rows.
Considering certain columns is optional.
Only consider certain columns for identifying duplicates, by default use all of the columns.
Determines which duplicates (if any) to mark.
first
: Mark duplicates as True
except for the first occurrence.
last
: Mark duplicates as True
except for the last occurrence.
False : Mark all duplicates as True
.
Boolean series for each duplicated rows.
Examples
Consider dataset containing ramen rating.
>>> df = pd.DataFrame({ ... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'], ... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'], ... 'rating': [4, 4, 3.5, 15, 5] ... }) >>> df brand style rating 0 Yum Yum cup 4.0 1 Yum Yum cup 4.0 2 Indomie cup 3.5 3 Indomie pack 15.0 4 Indomie pack 5.0
By default, for each set of duplicated values, the first occurrence is set on False and all others on True.
>>> df.duplicated() 0 False 1 True 2 False 3 False 4 False dtype: bool
By using âlastâ, the last occurrence of each set of duplicated values is set on False and all others on True.
>>> df.duplicated(keep='last') 0 True 1 False 2 False 3 False 4 False dtype: bool
By setting keep
on False, all duplicates are True.
>>> df.duplicated(keep=False) 0 True 1 True 2 False 3 False 4 False dtype: bool
To find duplicates on specific column(s), use subset
.
>>> df.duplicated(subset=['brand']) 0 False 1 True 2 False 3 True 4 True dtype: bool
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4