
DataFrame - Is there a way in PySpark to count unique values?

In PySpark, you can chain a DataFrame's distinct() and count() methods to count the unique values in a column. Here's an example:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("UniqueValuesCount").getOrCreate()

# Sample DataFrame
data = [("Alice", 1), ("Bob", 2), ("Alice", 3), ("Charlie", 4), ("Bob", 5)]
columns = ["Name", "Value"]
df = spark.createDataFrame(data, columns)

# Count unique values in the "Name" column
unique_values_count = df.select("Name").distinct().count()
print("Unique values count:", unique_values_count)

In this example, we select the "Name" column, call distinct() to drop duplicate values, and then count() to count what remains. Adjust the column name to match your DataFrame's structure.

For the sample data above, this prints "Unique values count: 3", since the distinct names are Alice, Bob, and Charlie.
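
On very large datasets, an exact distinct count can be expensive. PySpark also offers approx_count_distinct, a HyperLogLog-based aggregate that trades a small, configurable error margin for speed. A minimal sketch, rebuilding the sample DataFrame from above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct

spark = SparkSession.builder.appName("UniqueValuesCount").getOrCreate()
data = [("Alice", 1), ("Bob", 2), ("Alice", 3), ("Charlie", 4), ("Bob", 5)]
df = spark.createDataFrame(data, ["Name", "Value"])

# rsd is the maximum allowed relative standard deviation (default 0.05);
# on data this small the result is exact, but on large data it may differ slightly
approx_count = (
    df.agg(approx_count_distinct("Name", rsd=0.05).alias("approx_unique"))
      .collect()[0]["approx_unique"]
)
print("Approximate unique values count:", approx_count)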

Examples

  1. "PySpark count unique values in a DataFrame column"

    • Code:
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import countDistinct

      spark = SparkSession.builder.appName("example").getOrCreate()
      df = spark.read.csv("your_data.csv", header=True)

      unique_count = (
          df.select("your_column")
            .agg(countDistinct("your_column").alias("unique_count"))
            .collect()[0]["unique_count"]
      )
      print("Unique count:", unique_count)
    • Description: Uses the countDistinct function to count the unique values in a specific column of a PySpark DataFrame.
  2. "PySpark count unique values for each column in DataFrame"

    • Code:
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col

      spark = SparkSession.builder.appName("example").getOrCreate()
      df = spark.read.csv("your_data.csv", header=True)

      unique_counts = [
          df.select(col(column_name)).distinct().count()
          for column_name in df.columns
      ]
      print("Unique counts for each column:", dict(zip(df.columns, unique_counts)))
    • Description: Iterates through each column in the DataFrame and counts the number of unique values for each; a single-pass alternative is sketched after this list.
  3. "PySpark count distinct values and their occurrences"

    • Code:
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import count

      spark = SparkSession.builder.appName("example").getOrCreate()
      df = spark.read.csv("your_data.csv", header=True)

      distinct_counts = (
          df.groupBy("your_column")
            .agg(count("*").alias("occurrences"))
            .collect()
      )
      print("Distinct values and their occurrences:", distinct_counts)
    • Description: Groups the DataFrame by a specific column, counts the occurrences of each distinct value, and prints the result.
  4. "PySpark DataFrame count unique values with condition"

    • Code:
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col, countDistinct

      spark = SparkSession.builder.appName("example").getOrCreate()
      df = spark.read.csv("your_data.csv", header=True)

      unique_count_with_condition = (
          df.filter(col("your_column") > 0)
            .agg(countDistinct("your_column").alias("unique_count"))
            .collect()[0]["unique_count"]
      )
      print("Unique count with condition:", unique_count_with_condition)
    • Description: Applies a condition to count unique values only for rows that meet a specific criterion.
  5. "PySpark count distinct values in multiple columns"

    • Code:
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col

      spark = SparkSession.builder.appName("example").getOrCreate()
      df = spark.read.csv("your_data.csv", header=True)

      selected_columns = ["column1", "column2", "column3"]
      unique_counts_multiple_columns = [
          df.select(col(column_name)).distinct().count()
          for column_name in selected_columns
      ]
      print("Unique counts for selected columns:",
            dict(zip(selected_columns, unique_counts_multiple_columns)))
    • Description: Counts the number of unique values for specified multiple columns in a PySpark DataFrame.
  6. "PySpark count distinct values and percentage"

    • Code:
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col, count

      spark = SparkSession.builder.appName("example").getOrCreate()
      df = spark.read.csv("your_data.csv", header=True)

      # Compute the total row count once, then derive each group's share of it
      total_rows = df.count()
      distinct_counts_percentage = (
          df.groupBy("your_column")
            .agg(count("*").alias("occurrences"))
            .withColumn("percentage", col("occurrences") / total_rows * 100)
            .collect()
      )
      print("Distinct values, occurrences, and percentage:", distinct_counts_percentage)
    • Description: Calculates the percentage of occurrences for each distinct value in a specific column.
  7. "PySpark DataFrame count unique values excluding nulls"

    • Code:
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col, countDistinct

      spark = SparkSession.builder.appName("example").getOrCreate()
      df = spark.read.csv("your_data.csv", header=True)

      # countDistinct already ignores nulls; the explicit filter makes the intent visible
      unique_count_excluding_nulls = (
          df.select("your_column")
            .filter(col("your_column").isNotNull())
            .agg(countDistinct("your_column").alias("unique_count"))
            .collect()[0]["unique_count"]
      )
      print("Unique count excluding nulls:", unique_count_excluding_nulls)
    • Description: Counts unique values in a column excluding null values.
  8. "PySpark count distinct values using SQL expression"

    • Code:
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("example").getOrCreate()
      df = spark.read.csv("your_data.csv", header=True)
      df.createOrReplaceTempView("temp_table")

      distinct_values_sql = spark.sql(
          "SELECT COUNT(DISTINCT your_column) AS unique_count FROM temp_table"
      ).collect()[0]["unique_count"]
      print("Unique count using SQL expression:", distinct_values_sql)
    • Description: Uses a SQL expression to count the distinct values in a specific column.
  9. "PySpark count unique values in a DataFrame and show top N values"

    • Code:
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import count, desc

      spark = SparkSession.builder.appName("example").getOrCreate()
      df = spark.read.csv("your_data.csv", header=True)

      top_n_values = (
          df.groupBy("your_column")
            .agg(count("*").alias("occurrences"))
            .orderBy(desc("occurrences"))
            .limit(5)
            .collect()
      )
      print("Top 5 unique values and their occurrences:", top_n_values)
    • Description: Counts unique values and displays the top N values along with their occurrences.
  10. "PySpark DataFrame count unique values using broadcast join"

    • Code:
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import broadcast, countDistinct

      spark = SparkSession.builder.appName("example").getOrCreate()
      df = spark.read.csv("your_data.csv", header=True)

      # Hypothetical small lookup DataFrame of values to keep; broadcast ships it
      # to every executor so the join avoids shuffling the large DataFrame
      values_of_interest = spark.createDataFrame([("a",), ("b",)], ["your_column"])

      unique_count_with_broadcast_join = (
          df.join(broadcast(values_of_interest), on="your_column")
            .agg(countDistinct("your_column").alias("unique_count"))
            .collect()[0]["unique_count"]
      )
      print("Unique count using broadcast join:", unique_count_with_broadcast_join)
    • Description: Broadcasts a small lookup DataFrame (hypothetical here) of values of interest to every executor, joins it against the large DataFrame without a shuffle, and counts the distinct matching values.
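
As noted in example 2, a Python loop over df.columns launches one Spark job per column, which gets slow on wide DataFrames. A minimal single-pass sketch, assuming the same your_data.csv placeholder, that computes every column's distinct count in one aggregation:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.csv("your_data.csv", header=True)

# One agg with a countDistinct per column scans the data in a single job;
# note that countDistinct skips nulls, unlike distinct().count()
row = df.agg(*[countDistinct(c).alias(c) for c in df.columns]).collect()[0]
print("Unique counts for each column:", row.asDict())

The trade-off is a wider aggregation plan, but for most DataFrames one scan beats one job per column.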
