This is a continuation of my previous posts:
Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)
Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)
Glue DynamicFrame vs Spark DataFrame
Let's compare them using the Parquet files I created in part 1.
Data Read Speed Comparison
We will read a single large Parquet file and a highly partitioned Parquet file.
```python
with timer('df'):
    dyf = glueContext.create_dynamic_frame.from_options(
        "s3",
        {"paths": ["s3://.../parquet-chunk-high/"]},
        "parquet",
    )
    print(dyf.count())

with timer('df partition'):
    dyf = glueContext.create_dynamic_frame.from_options(
        "s3",
        {"paths": ["s3://.../parquet-partition-high/"]},
        "parquet",
    )
    print(dyf.count())
```
```
324917265
[df] done in 125.9965 s
324917265
[df partition] done in 55.9798 s
```
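The `timer` context manager used above isn't shown in this post; a minimal sketch that produces the same style of output (the name and implementation are my assumption, not the original code) could look like this:

```python
# Hypothetical sketch of the `timer` helper; the original
# implementation is not shown in the post.
import time
from contextlib import contextmanager

@contextmanager
def timer(name):
    start = time.time()
    yield
    # Prints e.g. "[df] done in 125.9965 s"
    print(f"[{name}] done in {time.time() - start:.4f} s")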
DynamicFrame is too slow...
Summary
- Based on part 1 (Reading Speed Comparison), spark.read took 27.1 s for the single large file and 36.3 s for the highly partitioned file, so DynamicFrame is quite slow in comparison.
- Interestingly, with DynamicFrame, reading the partitioned data is faster than reading the single large Parquet file, which is the opposite of the spark.read results.
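For reference, the spark.read measurement cited from part 1 follows the same pattern as the DynamicFrame code above. This is a sketch, not the exact part-1 code: the `...` in the bucket paths is a placeholder as in the original, and it assumes a Glue job or Spark session where `spark` and the `timer` helper are already available.

```python
# Sketch of the part-1 spark.read counterpart, for comparison
# with the DynamicFrame reads above. Paths are placeholders.
with timer('spark.read'):
    df = spark.read.parquet("s3://.../parquet-chunk-high/")
    print(df.count())

with timer('spark.read partition'):
    df = spark.read.parquet("s3://.../parquet-partition-high/")
    print(df.count())
```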