Remove duplicates from a dataframe in PySpark

Removing duplicates from a DataFrame in PySpark can be easily done using the dropDuplicates() method. This method considers rows as duplicates if the values in all columns match. You can also specify a subset of columns to consider for identifying duplicates.

Here's a step-by-step guide on how to do it:

Step 1: Create a PySpark Session

First, you need to create a SparkSession if you haven't already:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Remove Duplicates") \
    .getOrCreate()

Step 2: Create a DataFrame

Create a DataFrame with which you'll work. If you already have a DataFrame, you can skip this step.

data = [("John", 30), ("Jane", 25), ("John", 30), ("Mike", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()

This will show:

+----+---+
|Name|Age|
+----+---+
|John| 30|
|Jane| 25|
|John| 30|
|Mike| 35|
+----+---+

Step 3: Remove Duplicates

To remove duplicates, use the dropDuplicates() method.

  • Remove All Duplicates:

    df_no_duplicates = df.dropDuplicates()
    df_no_duplicates.show()
  • Remove Duplicates Based on Specific Columns:

    If you want to remove duplicates based on specific columns, pass a list of column names to dropDuplicates.

    df_no_duplicates = df.dropDuplicates(['Name'])
    df_no_duplicates.show()

Step 4: Review the Result

The resulting DataFrame df_no_duplicates will have the duplicates removed.

Additional Notes

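  • When several rows are duplicates of each other, dropDuplicates() keeps one of them. If you deduplicate on a subset of columns, do not rely on the first occurrence being the one kept: Spark does not guarantee row order across partitions, so the surviving row can vary. If you need a specific row, choose it explicitly with an ordering.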
  • Be aware that dropDuplicates() is a transformation operation in Spark and it will only be executed when an action (like show(), count(), etc.) is called due to Spark's lazy evaluation.
  • Remember to properly manage your SparkSession, especially in a shared or production environment.

This method is a straightforward way to handle duplicates in PySpark and is commonly used in data cleaning and preprocessing stages of a data pipeline.

