Find Minimum, Maximum, and Average Value of PySpark Dataframe column


To find the minimum, maximum, and average values of a PySpark DataFrame column, you can use the aggregation functions provided by PySpark: min(), max(), and avg(). These live in pyspark.sql.functions (note that importing min and max directly shadows Python's built-ins of the same name, so keep them scoped to Spark code).

Here's a step-by-step guide:

1. Set Up PySpark:

Ensure you have PySpark set up. If not, install it with:

pip install pyspark

2. Sample DataFrame:

Let's first create a sample DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import min, max, avg

# Create a Spark session
spark = SparkSession.builder.appName("Aggregations").getOrCreate()

# Sample DataFrame
data = [("John", 29), ("Sara", 30), ("Mike", 25), ("Anna", 28)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()

3. Find Minimum, Maximum, and Average:

a. Minimum:

min_value = df.agg(min("Age")).collect()[0][0]
print(f"Minimum Age: {min_value}")

b. Maximum:

max_value = df.agg(max("Age")).collect()[0][0]
print(f"Maximum Age: {max_value}")

c. Average:

avg_value = df.agg(avg("Age")).collect()[0][0]
print(f"Average Age: {avg_value:.2f}")

4. Aggregating All at Once:

You can also aggregate multiple metrics in a single command:

agg_results = df.agg(
    min("Age").alias("Min_Age"),
    max("Age").alias("Max_Age"),
    avg("Age").alias("Avg_Age")
).collect()[0]

print(f"Minimum Age: {agg_results['Min_Age']}")
print(f"Maximum Age: {agg_results['Max_Age']}")
print(f"Average Age: {agg_results['Avg_Age']:.2f}")

With this approach, a single agg() call computes all the metrics in one Spark job instead of three separate ones, so it is the preferred pattern when you need several statistics at once. It also extends naturally to aggregating multiple columns simultaneously.

