How to drop duplicates and keep one in PySpark dataframe

In PySpark, you can use the dropDuplicates method of a DataFrame to drop duplicate rows. By default, this method considers all columns to determine duplicates. You can also specify a subset of columns to consider for identifying duplicates.

Here's how you can drop duplicates and keep only one instance:

  1. Setup: First, make sure you have PySpark installed and set up:

    pip install pyspark 
  2. Drop Duplicates:

    from pyspark.sql import SparkSession

    # Create a Spark session
    spark = SparkSession.builder.appName("dropDuplicates").getOrCreate()

    # Sample data
    data = [
        ("John", "Doe", 29),
        ("Jane", "Doe", 22),
        ("John", "Doe", 29),  # Duplicate
        ("Mike", "Smith", 25)
    ]
    columns = ["first_name", "last_name", "age"]

    # Create DataFrame
    df = spark.createDataFrame(data, columns)

    # Drop duplicates across all columns
    df_no_duplicates = df.dropDuplicates()
    df_no_duplicates.show()
  3. Drop Duplicates Based on Specific Columns:

    If you want only specific columns to be considered when identifying duplicates, pass those column names as a list to the dropDuplicates method.

    # Drop duplicates based on 'first_name' and 'last_name' columns only
    df_no_name_duplicates = df.dropDuplicates(subset=['first_name', 'last_name'])
    df_no_name_duplicates.show()
  4. Keep a Specific Duplicate:

    If you need to keep a specific duplicate (for example, the row with the maximum or minimum value of another column), you can use the orderBy method combined with dropDuplicates. Note, however, that Spark does not strictly guarantee which row dropDuplicates keeps after a shuffle, so this pattern may behave unpredictably on large, partitioned data.

    # Suppose we had another column 'timestamp', and we want to keep the latest record
    data_with_timestamp = [
        ("John", "Doe", 29, "2022-01-01"),
        ("Jane", "Doe", 22, "2022-01-02"),
        ("John", "Doe", 29, "2022-01-03"),  # This should be retained
        ("Mike", "Smith", 25, "2022-01-04")
    ]
    columns_with_timestamp = ["first_name", "last_name", "age", "timestamp"]
    df_with_timestamp = spark.createDataFrame(data_with_timestamp, columns_with_timestamp)

    # Drop duplicates based on name but keep the latest timestamp
    df_no_name_duplicates_latest = df_with_timestamp.orderBy(
        "timestamp", ascending=False
    ).dropDuplicates(subset=['first_name', 'last_name'])
    df_no_name_duplicates_latest.show()

By following these methods, you can drop duplicates in a PySpark DataFrame and control which rows are retained.
