How to drop duplicates and keep one in PySpark dataframe

In PySpark, you can use the dropDuplicates method of a DataFrame to drop duplicate rows. By default, this method considers all columns to determine duplicates. You can also specify a subset of columns to consider for identifying duplicates.

Here's how you can drop duplicates and keep only one instance:

  1. Setup: First, make sure you have PySpark installed and set up:

    pip install pyspark 
  2. Drop Duplicates:

    from pyspark.sql import SparkSession

    # Create a Spark session
    spark = SparkSession.builder.appName("dropDuplicates").getOrCreate()

    # Sample data
    data = [
        ("John", "Doe", 29),
        ("Jane", "Doe", 22),
        ("John", "Doe", 29),  # Duplicate
        ("Mike", "Smith", 25)
    ]
    columns = ["first_name", "last_name", "age"]

    # Create DataFrame
    df = spark.createDataFrame(data, columns)

    # Drop duplicates across all columns
    df_no_duplicates = df.dropDuplicates()
    df_no_duplicates.show()
  3. Drop Duplicates Based on Specific Columns:

    If you want only specific columns to be considered when identifying duplicates, pass those column names as a list to the dropDuplicates method.

    # Drop duplicates based on 'first_name' and 'last_name' columns only
    df_no_name_duplicates = df.dropDuplicates(subset=['first_name', 'last_name'])
    df_no_name_duplicates.show()
  4. Keep a Specific Duplicate:

    If you need to keep a specific duplicate (for example, the row with the maximum or minimum value of another column), you can use the orderBy method combined with dropDuplicates. Note, however, that Spark does not strictly guarantee which row dropDuplicates keeps after a shuffle, so this pattern may behave unpredictably on large, partitioned data.

    # Suppose we had another column 'timestamp', and we want to keep the latest record
    data_with_timestamp = [
        ("John", "Doe", 29, "2022-01-01"),
        ("Jane", "Doe", 22, "2022-01-02"),
        ("John", "Doe", 29, "2022-01-03"),  # This should be retained
        ("Mike", "Smith", 25, "2022-01-04")
    ]
    columns_with_timestamp = ["first_name", "last_name", "age", "timestamp"]
    df_with_timestamp = spark.createDataFrame(data_with_timestamp, columns_with_timestamp)

    # Drop duplicates based on name but keep the latest timestamp
    df_no_name_duplicates_latest = df_with_timestamp.orderBy(
        "timestamp", ascending=False
    ).dropDuplicates(subset=['first_name', 'last_name'])
    df_no_name_duplicates_latest.show()

By following these methods, you can drop duplicates in a PySpark DataFrame and control which rows are retained.
