Remove duplicates from a dataframe in PySpark

In PySpark, you can remove duplicates from a DataFrame using the dropDuplicates() transformation. It returns a new DataFrame with duplicate rows removed, considering either all columns or only the subset of columns you specify. Here's how you can do it:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

# Sample data
data = [
    ("John", 25),
    ("Alice", 30),
    ("John", 25),   # Duplicate
    ("Bob", 28),
    ("Alice", 30),  # Duplicate
    ("Eve", 22)
]

# Create a DataFrame
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

# Remove duplicates based on all columns
distinct_df = df.dropDuplicates()

# Remove duplicates based on selected columns (e.g., "name")
distinct_name_df = df.dropDuplicates(["name"])

# Show the results
print("Original DataFrame:")
df.show()

print("DataFrame with duplicates removed:")
distinct_df.show()

print("DataFrame with duplicates removed based on name:")
distinct_name_df.show()

# Stop the Spark session
spark.stop()

In this example:

  1. We create a Spark session.
  2. We define sample data and create a DataFrame df.
  3. We use dropDuplicates() without specifying any columns, which removes rows that are completely identical (equivalently, distinct(); see the note after this list).
  4. We use dropDuplicates(["name"]) to remove rows with duplicate names, keeping one row per name. Because the data is processed in parallel across partitions, Spark does not guarantee which of the duplicate rows is kept.
  5. We show the original DataFrame and the DataFrames with duplicates removed.
  6. We stop the Spark session.
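
As an aside, for full-row deduplication PySpark also offers distinct(), which is equivalent to dropDuplicates() with no arguments. A minimal sketch (the "DistinctExample" app name is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DistinctExample").getOrCreate()
df = spark.createDataFrame([("John", 25), ("John", 25), ("Eve", 22)], ["name", "age"])

# distinct() and dropDuplicates() with no arguments produce the same result
df.distinct().show()
df.dropDuplicates().show()

spark.stop()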

Remember that dropDuplicates() does not modify the original DataFrame; DataFrames are immutable, so it always returns a new DataFrame with the duplicates removed. If you want to keep working with the deduplicated data under the original name, reassign the result, as shown below.
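
A one-line sketch of that reassignment (reusing df from the example above):

# Rebind df to the deduplicated DataFrame; the original is no longer referenced
df = df.dropDuplicates()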

Make sure to adjust the column names and criteria based on your actual DataFrame structure and requirements.

Examples

  1. "How to remove duplicate rows in a PySpark DataFrame?"

    • This query demonstrates removing duplicate rows from a DataFrame.
    !pip install pyspark  # install PySpark first if needed (e.g., in a notebook)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

    data = [("John", 25), ("Anna", 22), ("John", 25), ("Mike", 30)]
    df = spark.createDataFrame(data, ["Name", "Age"])

    unique_df = df.dropDuplicates()  # Removes duplicate rows
    unique_df.show()                 # Displays the unique rows
  2. "Remove duplicates from a specific column in PySpark?"

    • This query explores removing duplicates based on a specific column in a DataFrame.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

    data = [("John", 25), ("Anna", 22), ("John", 30), ("Mike", 30)]
    df = spark.createDataFrame(data, ["Name", "Age"])

    unique_df = df.dropDuplicates(["Name"])  # Remove duplicates based on "Name"
    unique_df.show()                         # Displays unique rows based on "Name"
  3. "Remove duplicates while preserving a specific order in PySpark?"

    • This query discusses removing duplicates while preserving a specific order in the DataFrame.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

    data = [("John", 25), ("Anna", 22), ("John", 30), ("Mike", 30)]
    df = spark.createDataFrame(data, ["Name", "Age"])

    # Add an index column to record the original row order
    df = df.withColumn("index", F.monotonically_increasing_id())

    # Deduplicate on "Name", restore the original order, then drop the helper column
    unique_df = df.dropDuplicates(["Name"]).orderBy("index").drop("index")
    unique_df.show()  # Displays unique rows in the original order
  4. "How to remove duplicates based on multiple columns in PySpark?"

    • This query shows removing duplicates based on multiple columns in the DataFrame.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

    data = [("John", 25), ("Anna", 22), ("John", 25), ("Mike", 30)]
    df = spark.createDataFrame(data, ["Name", "Age"])

    unique_df = df.dropDuplicates(["Name", "Age"])  # Remove duplicates based on both columns
    unique_df.show()                                # Displays the unique rows
  5. "Remove duplicates with a condition in PySpark?"

    • This query discusses removing duplicates based on a condition or business logic in PySpark; a window-function variant is sketched after the snippet below.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

    data = [("John", 25), ("Anna", 22), ("John", 25), ("Mike", 30)]
    df = spark.createDataFrame(data, ["Name", "Age"])

    # Business rule: keep only rows with Age <= 25, then remove exact duplicates
    df_filtered = df.filter(df["Age"] <= 25)
    unique_df = df_filtered.dropDuplicates()
    unique_df.show()  # Displays the unique rows that satisfy the condition
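
    If the business rule should instead decide which duplicate to keep, a common pattern is row_number() over a window. A minimal sketch that keeps the highest age per name (the rule is purely illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

    data = [("John", 25), ("Anna", 22), ("John", 30), ("Mike", 30)]
    df = spark.createDataFrame(data, ["Name", "Age"])

    # Rank rows within each name, highest age first
    w = Window.partitionBy("Name").orderBy(F.col("Age").desc())

    # Keep the top-ranked row per name, then drop the helper column
    unique_df = (df.withColumn("rn", F.row_number().over(w))
                   .filter(F.col("rn") == 1)
                   .drop("rn"))
    unique_df.show()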
  6. "Remove duplicates but keep one instance in PySpark?"

    • This query addresses removing duplicates but keeping only one instance of each duplicated row.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

    data = [("John", 25), ("Anna", 22), ("John", 25), ("Mike", 30)]
    df = spark.createDataFrame(data, ["Name", "Age"])

    unique_df = df.dropDuplicates()  # Keeps exactly one instance of each duplicated row
    unique_df.show()
  7. "Remove duplicate rows from a large PySpark DataFrame?"

    • This query explores strategies for removing duplicates from large DataFrames in PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

    # Example data standing in for a large DataFrame with potential duplicates
    data = [("John", 25), ("Anna", 22), ("John", 25), ("Mike", 30)]
    df = spark.createDataFrame(data, ["Name", "Age"])

    # dropDuplicates() scales to large DataFrames, but it triggers a shuffle,
    # so deduplicate as early in the pipeline as possible and only on the
    # columns you actually need
    unique_df = df.dropDuplicates()
    unique_df.show()
  8. "Remove duplicates based on case-insensitive conditions in PySpark?"

    • This query discusses removing duplicates based on case-insensitive comparisons in PySpark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

    data = [("John", 25), ("john", 30), ("Mike", 30)]
    df = spark.createDataFrame(data, ["Name", "Age"])

    # Compare names in lower case so "John" and "john" count as duplicates,
    # then drop the helper column to keep the original casing
    unique_df = (df.withColumn("name_lower", F.lower("Name"))
                   .dropDuplicates(["name_lower"])
                   .drop("name_lower"))
    unique_df.show()
  9. "How to remove duplicates and count their occurrences in PySpark?"

    • This query explores removing duplicates and counting the occurrences of unique rows in PySpark; a short follow-up for listing only the duplicated rows appears after the snippet.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

    data = [("John", 25), ("Anna", 22), ("John", 25), ("Mike", 30)]
    df = spark.createDataFrame(data, ["Name", "Age"])

    unique_df = df.dropDuplicates(["Name", "Age"])  # Unique rows
    count_df = df.groupBy("Name", "Age").count()    # Occurrences of each unique row

    unique_df.show()  # Displays unique rows
    count_df.show()   # Displays counts for each unique row
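
    To list only the rows that occur more than once (reusing df from the block above):

    from pyspark.sql import functions as F

    # Keep groups whose count exceeds 1, i.e. the duplicated rows
    dup_df = df.groupBy("Name", "Age").count().filter(F.col("count") > 1)
    dup_df.show()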
