Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads

1 Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads
Ahsan Javed Awan, EMJD-DC (KTH-UPC) (https://www.kth.se/profile/ajawan/)
Mats Brorsson (KTH), Eduard Ayguade (UPC and BSC), Vladimir Vlassov (KTH)

2 Motivation: Why should we care about architecture support?
[Figure: Data Growing Faster Than Technology. Taken from Babak's slides]

3 Motivation Cont... Our Goal: Improve the node-level performance through architecture support.
[Figure labels: Phoenix++, Metis, Ostrich, etc.; Hadoop, Spark, Flink, etc. Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/]

4 Our Approach: What are the major performance bottlenecks?
● Performance characterization of in-memory data analytics on a modern cloud server, in 5th International IEEE Conference on Big Data and Cloud Computing, 2015 (Best Paper Award).
● How Data Volume Affects Spark Based Data Analytics on a Scale-up Server, in 6th International Workshop on Big Data Benchmarks, Performance Optimization and Emerging Hardware (BPOE), held in conjunction with VLDB 2015, Hawaii, USA.
– Limited to batch processing workloads only
– Does not consider the velocity aspect of big data
– Experiments are based on an older version of Spark

5 Our Approach: What are the remaining questions?
● Does micro-architectural performance remain consistent across batch and stream processing workloads?
● How do DataFrames compare to RDDs at the micro-architectural level?
● How does data velocity affect micro-architectural performance?

6 Progress Meeting 12-12-14 Which Scale-out Framework? [Picture Courtesy: Amir H. Payberah]
● Tuning of Spark internal parameters
● Tuning of JVM parameters (heap size, etc.)
● Micro-architecture level analysis using hardware performance counters

7 Our Approach Which benchmarks?

8 Our Hardware Configuration: Which Machine? Intel's Ivy Bridge server, with Hyper-Threading and Turbo Boost disabled.

9 Does micro-architectural performance remain consistent? Stream processing is micro-architecturally similar to batch processing in Spark.

10 Cont.. Stream processing is micro-architecturally similar to batch processing in Spark

11 Cont.. Streaming workloads with similar Spark transformations have different micro-architectural behavior

14 Cont..

| Workload | Spark Transformation | Input data rate | Window size (s) | Working set with 2 s sampling interval |
|----------|----------------------|-----------------|-----------------|----------------------------------------|
| WWc | FlatMap, Map, ReduceByKeyAndWindow | 10^4 | 30 | 15 x 10^4 |
| CSpc | FlatMap, Map, CountByValueAndWindow | 10^4 | 10 | 5 x 10^4 |
| CErpz | FlatMap, Map, Window, GroupByKey | 10^4 | 30 | 15 x 10^4 |
| CAuC | FlatMap, Map, Window, GroupByKey, Count | 10^4 | 10 | 5 x 10^4 |
| Tpt | FlatMap, ReduceByKeyAndWindow, Transform | 10^1 | 60 | 30 x 10^1 |

Micro-batch size determines the micro-architectural behavior of stream processing workloads with similar Spark transformations.
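The working-set values on slide 14 appear to follow from the other columns as input data rate times window size divided by the 2 s sampling interval. The slide does not state this formula; it is inferred here because it reproduces all five rows. The sketch below recomputes the column under that assumption. The helper names `working_set` and `working_sets` are hypothetical, and the slide gives no unit for the input data rate, so the results are plain counts.

```python
# Recompute the "working set" column of slide 14.
# Assumption (not stated on the slide): working set =
#   input data rate * window size / sampling interval.

SAMPLING_INTERVAL_S = 2  # "2s sampling interval" from the table header

# (workload, input data rate, window size in seconds), copied from the table
WORKLOADS = [
    ("WWc", 10**4, 30),
    ("CSpc", 10**4, 10),
    ("CErpz", 10**4, 30),
    ("CAuC", 10**4, 10),
    ("Tpt", 10**1, 60),
]


def working_set(rate, window_s, interval_s=SAMPLING_INTERVAL_S):
    """Hypothetical helper: items held in one window under the assumed formula."""
    return rate * window_s // interval_s


def working_sets():
    """Map each workload name to its computed working set."""
    return {name: working_set(rate, window) for name, rate, window in WORKLOADS}


if __name__ == "__main__":
    for name, size in working_sets().items():
        print(f"{name}: {size}")
```

Under this reading, WWc and CErpz share one working-set size and CSpc and CAuC share another, which matches the slide's point that micro-batch size, not the choice of transformations, drives the behavior.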

15 Do DataFrames perform better than RDDs at the micro-architectural level?
DataFrames exhibit:
● 25% fewer back-end bound stalls
● 64% fewer DRAM-bound stalled cycles
● 25% less bandwidth (BW) consumption
● 10% less starvation of execution resources
DataFrames have better micro-architectural performance than RDDs.

16 How does data velocity affect micro-architectural performance? Better CPU utilization at higher data velocity.

17 Cont..
● Higher instruction retirement at higher data velocity
● Higher L1-bound stalls at higher data velocity
● Less starvation at higher data velocity
● Higher BW consumption at higher data velocity

18 Our Approach Conclusion
● Batch processing and stream processing have the same micro-architectural behavior in Spark if the only difference between the two implementations is micro-batching.
● Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.
● If the input data rates are small, stream processing workloads are front-end bound. At larger input data rates, however, the front-end bound stalls are reduced and instruction retirement improves.

20 Our Approach List of Papers
● Performance characterization of in-memory data analytics on a modern cloud server, in 5th International IEEE Conference on Big Data and Cloud Computing, 2015 (Best Paper Award).
● How Data Volume Affects Spark Based Data Analytics on a Scale-up Server, in 6th International Workshop on Big Data Benchmarks, Performance Optimization and Emerging Hardware (BPOE), held in conjunction with VLDB 2015, Hawaii, USA.
● Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads (accepted to BDCloud 2016).
● Node Architecture Implications for In-Memory Data Analytics in Scale-in Clusters (accepted to IEEE BDCAT 2016).
● Implications of In-Memory Data Analytics with Apache Spark on Near Data Computing Architectures (under submission).
