How to find the sum of Particular Column in PySpark Dataframe

In PySpark, to find the sum of a particular column in a DataFrame, you can use the agg() function along with the sum() function available in the pyspark.sql.functions module.
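As a quick preview, the core pattern looks like this (a minimal sketch, assuming a DataFrame df with a numeric column named value):

    from pyspark.sql.functions import sum as spark_sum  # aliased so it doesn't shadow Python's built-in sum

    total = df.agg(spark_sum("value")).collect()[0][0]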

Here's a step-by-step guide:

  1. Set up PySpark:

    Make sure you've set up PySpark properly in your environment.
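    If PySpark isn't installed yet, one common way to get it (assuming a pip-based Python environment) is:

        pip install pyspark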

  2. Example:

    Let's assume you have a DataFrame df with a column named value, and you want to find the sum of this column.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import sum  # note: this shadows Python's built-in sum()

    # Create a Spark session
    spark = SparkSession.builder.appName("SumColumn").getOrCreate()

    # Sample DataFrame
    data = [("A", 10), ("B", 20), ("A", 15), ("C", 30)]
    columns = ["name", "value"]
    df = spark.createDataFrame(data, columns)

    # Display the original DataFrame
    df.show()

    # Find the sum of the 'value' column
    total = df.agg(sum("value").alias("Total_Sum")).collect()[0]["Total_Sum"]
    print(f"Total Sum of 'value' column: {total}")

    # Stop the Spark session
    spark.stop()
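    With the sample data above, running the script prints something like the following (row order from df.show() is not guaranteed, but the total is 10 + 20 + 15 + 30 = 75):

        +----+-----+
        |name|value|
        +----+-----+
        |   A|   10|
        |   B|   20|
        |   A|   15|
        |   C|   30|
        +----+-----+

        Total Sum of 'value' column: 75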

In the above example:

  • We create a sample DataFrame df.
  • We use agg() with sum() to aggregate the value column into a single total.
  • We extract the result from the aggregated DataFrame using collect().

The output will show the original DataFrame and then print the total sum of the value column.
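If the agg()/collect() chain feels indirect, a couple of equivalent one-liners exist (both use standard PySpark APIs; the spark_sum alias is again used to avoid shadowing the built-in sum):

    from pyspark.sql.functions import sum as spark_sum

    # select() produces a single-row DataFrame; first() pulls that Row to the driver
    total = df.select(spark_sum("value")).first()[0]

    # agg() also accepts a column-name-to-function mapping
    total = df.agg({"value": "sum"}).first()[0]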

