Apache Spark - Calculate the mode of a PySpark DataFrame column?

In PySpark, calculating the mode (the most frequently occurring value) of a DataFrame column takes some aggregation and filtering. Spark versions before 3.4 have no built-in function to compute the mode (Spark 3.4 added pyspark.sql.functions.mode), but you can achieve the same result with a combination of transformations.

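If you are running Spark 3.4 or later, the built-in mode aggregate (pyspark.sql.functions.mode) computes this directly; a minimal sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BuiltinMode").getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["category"])

# mode() is a built-in aggregate function since Spark 3.4
df.agg(F.mode("category").alias("mode")).show()

On earlier versions, the manual approach below works everywhere.
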
Here's a step-by-step approach to calculate the mode of a PySpark DataFrame column:

  1. Create a DataFrame: Let's assume you have a DataFrame with some data in it.

  2. Group By and Count: Group the DataFrame by the column whose mode you want to find, then count the number of occurrences for each unique value.

  3. Find the Maximum Count: Using the result from step 2, determine the maximum count.

  4. Filter for Mode: Filter the results to get the value(s) with the maximum count, which represents the mode(s).

Code Example

Here's an example that demonstrates how to calculate the mode for a DataFrame column:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create a SparkSession
spark = SparkSession.builder.appName("CalculateMode").getOrCreate()

# Sample DataFrame
data = [("a",), ("b",), ("a",), ("c",), ("a",), ("b",)]
df = spark.createDataFrame(data, ["category"])

# Step 1: Group by the column and count the occurrences
df_count = df.groupBy("category").count()

# Step 2: Find the maximum count (this is the mode frequency)
max_count = df_count.agg(F.max("count")).collect()[0][0]

# Step 3: Filter to keep the value(s) with the maximum count
mode_df = df_count.filter(F.col("count") == max_count)

# Show the mode(s)
mode_df.show()

This code snippet:

  • Creates a SparkSession and a sample DataFrame with a column named category.
  • Groups the data by the category column and counts the occurrences.
  • Finds the maximum count to identify the most frequent value(s).
  • Filters the grouped DataFrame to get the value(s) that match the maximum count, which represents the mode.
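
Because the final step filters on the maximum count rather than taking a single row, it returns every value that ties for the highest frequency, so a multimodal column yields all of its modes. Approaches that call .first() on a sorted result, as in several of the examples below, pick just one of any tied values, arbitrarily.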

Examples

  1. PySpark: Calculate Mode of a DataFrame Column

    • Use groupBy and count to determine the most frequent value in a column.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import desc

    spark = SparkSession.builder.appName("CalculateMode").getOrCreate()

    # Create a sample DataFrame
    df = spark.createDataFrame([
        (1, "apple"), (2, "banana"), (3, "apple"),
        (4, "orange"), (5, "banana"), (6, "banana")
    ], ["id", "fruit"])

    # Group by the column and count occurrences
    mode_df = df.groupBy("fruit").count().orderBy(desc("count"))

    # Get the mode (most frequent value)
    mode = mode_df.first()["fruit"]
    print("Mode of the 'fruit' column:", mode)
  2. PySpark: Calculate Mode for Multiple Columns

    • Determine the mode for multiple columns in a DataFrame.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import desc

    spark = SparkSession.builder.appName("CalculateMultipleModes").getOrCreate()

    # Create a DataFrame with multiple columns
    df = spark.createDataFrame([
        (1, "apple", "red"), (2, "banana", "yellow"), (3, "apple", "green"),
        (4, "orange", "orange"), (5, "banana", "yellow"), (6, "banana", "yellow")
    ], ["id", "fruit", "color"])

    # Group by each column and count occurrences
    mode_fruit = df.groupBy("fruit").count().orderBy(desc("count")).first()["fruit"]
    mode_color = df.groupBy("color").count().orderBy(desc("count")).first()["color"]

    print("Mode of the 'fruit' column:", mode_fruit)
    print("Mode of the 'color' column:", mode_color)
  3. PySpark: Calculate Mode with a Window Function

    • Use a window function to get the mode within a specific window or partition.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, desc, row_number
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("WindowedMode").getOrCreate()

    # Create a DataFrame with partitions
    df = spark.createDataFrame([
        (1, "apple", "group1"), (2, "banana", "group1"), (3, "apple", "group2"),
        (4, "orange", "group2"), (5, "banana", "group1"), (6, "banana", "group2")
    ], ["id", "fruit", "group"])

    # Window definition: rank fruits within each group by descending count
    window = Window.partitionBy("group").orderBy(desc("count"))

    # Calculate the mode within each group
    mode_df = df.groupBy("group", "fruit").count().withColumn("rank", row_number().over(window))
    mode_per_group = mode_df.filter(col("rank") == 1).select("group", "fruit").collect()

    for row in mode_per_group:
        print(f"Mode for group {row['group']} is: {row['fruit']}")
  4. PySpark: Calculate Mode Using UDF (User-Defined Function)

    • Create a UDF to calculate the mode of an array column, row by row.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    from collections import Counter

    spark = SparkSession.builder.appName("ModeUDF").getOrCreate()

    # Create a DataFrame where each row holds a list of values
    df = spark.createDataFrame([
        (1, ["apple", "banana", "apple"]),
        (2, ["orange", "apple", "orange"]),
        (3, ["banana", "banana", "apple"])
    ], ["id", "fruits"])

    # Define a UDF that returns the most common element of a list
    def calculate_mode(fruit_list):
        return Counter(fruit_list).most_common(1)[0][0]

    mode_udf = udf(calculate_mode, StringType())

    # Apply the UDF to get the mode for each row
    mode_df = df.withColumn("mode_fruit", mode_udf("fruits"))
    mode_df.show()
  5. PySpark: Approximate the Mode with approxQuantile for Large Datasets

    • approxQuantile computes quantiles, not frequencies, and works only on numeric columns. The approximate median it returns is a cheap stand-in for the mode only when the distribution is roughly symmetric and unimodal.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ApproximateMode").getOrCreate()

    # approxQuantile requires a numeric column
    df = spark.createDataFrame([
        (1, 10.0), (2, 20.0), (3, 10.0),
        (4, 30.0), (5, 20.0), (6, 20.0)
    ], ["id", "value"])

    # Approximate the median with a 10% relative error; for a roughly
    # symmetric, unimodal distribution this is close to the mode
    quantiles = df.stat.approxQuantile("value", [0.5], 0.1)
    print("Approximate median (stand-in for the mode):", quantiles[0])
  6. PySpark: Calculate Mode with SQL Queries

    • Use Spark SQL to determine the mode of a DataFrame column.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SQLMode").getOrCreate()

    # Create a DataFrame
    df = spark.createDataFrame([
        (1, "apple"), (2, "banana"), (3, "apple"),
        (4, "orange"), (5, "banana"), (6, "banana")
    ], ["id", "fruit"])

    # Register the DataFrame as a temporary view
    df.createOrReplaceTempView("fruits")

    # Use a SQL query to find the mode (LIMIT 1 keeps one value even on ties)
    mode_sql = spark.sql("""
        SELECT fruit, COUNT(fruit) AS count
        FROM fruits
        GROUP BY fruit
        ORDER BY count DESC
        LIMIT 1
    """)
    mode_sql.show()
  7. PySpark: Calculate Mode for a Grouped DataFrame

    • Group the DataFrame and calculate the mode for each group.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("GroupedMode").getOrCreate()

    # Create a DataFrame with groups
    df = spark.createDataFrame([
        (1, "apple", "group1"), (2, "banana", "group1"), (3, "apple", "group2"),
        (4, "orange", "group2"), (5, "banana", "group1"), (6, "banana", "group2")
    ], ["id", "fruit", "group"])

    # Count occurrences per (group, fruit), then keep the row with the
    # highest count in each group; max over a struct compares count first
    counts = df.groupBy("group", "fruit").count()
    mode_per_group = (
        counts.groupBy("group")
              .agg(F.max(F.struct("count", "fruit")).alias("mode"))
              .select("group", F.col("mode.fruit").alias("fruit"))
    )
    mode_per_group.show()
  8. PySpark: Calculate Mode for a Column with Null Values

    • Handle null values when calculating the mode for a column.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, desc, when

    spark = SparkSession.builder.appName("HandleNullMode").getOrCreate()

    # Create a DataFrame with null values
    df = spark.createDataFrame([
        (1, None), (2, "banana"), (3, "apple"),
        (4, "orange"), (5, "banana"), (6, "banana")
    ], ["id", "fruit"])

    # Replace null values with a placeholder before calculating the mode;
    # without .otherwise() every non-null value would become null
    df = df.withColumn("fruit", when(col("fruit").isNull(), "unknown").otherwise(col("fruit")))

    # Alternatively, drop nulls instead: df = df.na.drop(subset=["fruit"])

    # Calculate the mode
    mode_df = df.groupBy("fruit").count().orderBy(desc("count"))
    mode = mode_df.first()["fruit"]
    print("Mode of the 'fruit' column, handling null values:", mode)
  9. PySpark: Calculate Mode for a Sorted DataFrame

    • Calculate the mode for a DataFrame that has been sorted by another column. Note that groupBy ignores row order, so the result is the same as on the unsorted DataFrame.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import desc

    spark = SparkSession.builder.appName("SortedMode").getOrCreate()

    # Create a DataFrame with multiple columns
    df = spark.createDataFrame([
        (1, "apple", 10), (2, "banana", 20), (3, "apple", 30),
        (4, "orange", 40), (5, "banana", 50), (6, "banana", 60)
    ], ["id", "fruit", "value"])

    # Sort by another column; aggregation does not depend on row order
    sorted_df = df.orderBy("value", ascending=True)

    # Calculate the mode after sorting
    mode_df = sorted_df.groupBy("fruit").count().orderBy(desc("count"))
    mode = mode_df.first()["fruit"]
    print("Mode of the 'fruit' column in a sorted DataFrame:", mode)
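
Finally, if your cluster runs Spark 3.4 or later, the per-group mode from examples 3 and 7 collapses to the built-in aggregate; a minimal sketch, reusing the grouped data from example 7:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BuiltinGroupedMode").getOrCreate()
df = spark.createDataFrame([
    (1, "apple", "group1"), (2, "banana", "group1"), (3, "apple", "group2"),
    (4, "orange", "group2"), (5, "banana", "group1"), (6, "banana", "group2")
], ["id", "fruit", "group"])

# mode() works as a grouped aggregate since Spark 3.4
df.groupBy("group").agg(F.mode("fruit").alias("mode_fruit")).show()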
