Removing duplicate rows based on a specific column in a PySpark DataFrame

In PySpark, you can use the dropDuplicates method or the distinct method to remove duplicate rows from a DataFrame. dropDuplicates lets you deduplicate based on one or more specific columns, while distinct always considers all columns.

Here is an example of how to remove duplicate rows based on a specific column:

from pyspark.sql import SparkSession

# Initialize a SparkSession
spark = SparkSession.builder.appName("duplicateRows").getOrCreate()

# Sample data
data = [("John", 1), ("John", 2), ("Anna", 1), ("Anna", 1)]

# Create DataFrame
df = spark.createDataFrame(data, ["Name", "ID"])

# Show original DataFrame
df.show()

# Drop duplicate rows based on the "Name" column
df_deduped = df.dropDuplicates(["Name"])

# Show DataFrame after removing duplicates
df_deduped.show()

The dropDuplicates method keeps one row for each unique combination of values in the specified columns and removes the rest. This is often described as keeping the first occurrence, but without an explicit ordering Spark does not guarantee which of the duplicate rows survives. If you do not specify any columns, duplicates are determined using all columns.
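If you need control over which duplicate is kept, for example the row with the highest ID per name, one common approach is a window function with row_number. Here is a minimal sketch; the choice of "ID" as the ordering column is an assumption for illustration:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank rows within each "Name" group, ordering by "ID" descending
# (the ordering column is an assumption; use whichever column defines "keep")
w = Window.partitionBy("Name").orderBy(col("ID").desc())

# Keep only the top-ranked row per group, then drop the helper column
df_top = (
    df.withColumn("rn", row_number().over(w))
      .filter(col("rn") == 1)
      .drop("rn")
)

df_top.show()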

Alternatively, if you want to remove completely duplicate rows (where all column values are the same), you can use distinct:

# Drop completely duplicate rows
df_distinct = df.distinct()

# Show DataFrame after removing completely duplicate rows
df_distinct.show()
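As noted above, calling dropDuplicates without a column list behaves the same way; this minimal sketch should produce the same result as distinct:

# dropDuplicates() with no arguments also deduplicates across all columns
df_all = df.dropDuplicates()
df_all.show()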

Note that distinct takes no arguments and always considers all columns when determining duplicates. To deduplicate based on specific columns while keeping the remaining columns, you still need dropDuplicates.
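If you only need the unique values of certain columns and can discard the rest, selecting those columns first and then applying distinct also works. A minimal sketch, reusing the df from above:

# Unique values of the "Name" column only (other columns are discarded)
names = df.select("Name").distinct()
names.show()

# Stop the SparkSession now that all examples are done
spark.stop()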

