Find duplicate rows in a Dataframe based on all or selected columns in python

To find duplicate rows in a pandas DataFrame based on all or selected columns, you can use the duplicated() method.

Let's see how you can find duplicate rows in a DataFrame:

  1. Based on all columns:

    • To find duplicate rows based on all columns:

      import pandas as pd

      # Sample dataframe
      df = pd.DataFrame({
          'A': [1, 2, 3, 2],
          'B': [4, 5, 6, 5],
          'C': [7, 8, 9, 8]
      })

      duplicate_rows = df[df.duplicated()]
      print(duplicate_rows)
  2. Based on selected columns:

    • To find duplicate rows based on specific columns, use the subset parameter:

      duplicate_rows_based_on_A_B = df[df.duplicated(subset=['A', 'B'])]
      print(duplicate_rows_based_on_A_B)

By default, duplicated() uses keep='first', which leaves the first occurrence unmarked and flags subsequent occurrences as duplicates. Use keep='last' to do the opposite: keep the last occurrence and flag the earlier ones. Use keep=False to flag every occurrence of a duplicate as True.

Example:

# Mark all occurrences of duplicates
all_duplicate_rows = df[df.duplicated(keep=False)]
print(all_duplicate_rows)
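For contrast, here is a short sketch of keep='last' using the same sample DataFrame as above; with this setting the earlier occurrence (index 1) is flagged and the last one (index 3) is kept:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 2],
    'B': [4, 5, 6, 5],
    'C': [7, 8, 9, 8]
})

# keep='last': earlier occurrences are flagged as duplicates,
# so the last row in each duplicate group survives the mask
last_kept = df[df.duplicated(keep='last')]
print(last_kept)  # shows the row at index 1
```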

To drop duplicates, you can use the drop_duplicates() method and similarly use the subset parameter to specify columns and the keep parameter to specify which occurrences to keep.
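A minimal sketch of dropping duplicates, reusing the sample DataFrame from earlier; both the subset and keep parameters work the same way here as they do for duplicated():

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 2],
    'B': [4, 5, 6, 5],
    'C': [7, 8, 9, 8]
})

# Drop duplicates across all columns (keeps the first occurrence by default)
deduped = df.drop_duplicates()

# Drop duplicates on selected columns, keeping the last occurrence
deduped_ab = df.drop_duplicates(subset=['A', 'B'], keep='last')

print(deduped)
print(deduped_ab)
```

Note that drop_duplicates() returns a new DataFrame and does not modify df in place unless you pass inplace=True.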

