Removing duplicate rows based on a specific column in a PySpark DataFrame

In PySpark, you can use the dropDuplicates method or the distinct method to remove duplicate rows from a DataFrame. dropDuplicates lets you deduplicate based on one or more specific columns, while distinct always considers all columns.

Here is an example of how to remove duplicate rows based on a specific column:

from pyspark.sql import SparkSession

# Initialize a SparkSession
spark = SparkSession.builder.appName("duplicateRows").getOrCreate()

# Sample data
data = [("John", 1), ("John", 2), ("Anna", 1), ("Anna", 1)]

# Create DataFrame
df = spark.createDataFrame(data, ["Name", "ID"])

# Show original DataFrame
df.show()

# Drop duplicate rows based on the "Name" column
df_deduped = df.dropDuplicates(["Name"])

# Show DataFrame after removing duplicates
df_deduped.show()

The dropDuplicates method keeps one row for each unique combination of values in the specified columns and removes the rest. This is often described as keeping the first occurrence, but without an explicit ordering Spark does not guarantee which of the duplicate rows survives. If you do not specify any columns, duplicates are determined using all columns.
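If you need control over which duplicate is kept, for example the row with the highest ID per name, one common approach is a window function with row_number. Here is a minimal sketch; the choice of "ID" as the ordering column is an assumption for illustration:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank rows within each "Name" group, ordering by "ID" descending
# (the ordering column is an assumption; use whichever column defines "keep")
w = Window.partitionBy("Name").orderBy(col("ID").desc())

# Keep only the top-ranked row per group, then drop the helper column
df_top = (
    df.withColumn("rn", row_number().over(w))
      .filter(col("rn") == 1)
      .drop("rn")
)

df_top.show()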

Alternatively, if you want to remove completely duplicate rows (where all column values are the same), you can use distinct:

# Drop completely duplicate rows
df_distinct = df.distinct()

# Show DataFrame after removing completely duplicate rows
df_distinct.show()
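As noted above, calling dropDuplicates without a column list behaves the same way; this minimal sketch should produce the same result as distinct:

# dropDuplicates() with no arguments also deduplicates across all columns
df_all = df.dropDuplicates()
df_all.show()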

Note that distinct takes no arguments and always considers all columns when determining duplicates. To deduplicate based on specific columns while keeping the remaining columns, you still need dropDuplicates.
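If you only need the unique values of certain columns and can discard the rest, selecting those columns first and then applying distinct also works. A minimal sketch, reusing the df from above:

# Unique values of the "Name" column only (other columns are discarded)
names = df.select("Name").distinct()
names.show()

# Stop the SparkSession now that all examples are done
spark.stop()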

