Performance Characterization and Optimization of In-Memory Data Analytics on a Scale-up Server Ahsan Javed Awan
2 About me? Timeline 1988–2017: Born in Pakistan; B.E. MTS, NUST, Pakistan; EMECS, TUKL, Germany; EMECS, UoS, UK; Lecturer, NUST, Pakistan; EMJD-DC, KTH/SICS, Sweden; PhD Intern, Recore, Netherlands; EMJD-DC, UPC/BSC, Spain; PhD Intern, IBM Research, Japan; Research Assistant/Associate, Imperial College London, UK
3 What is the focus of this talk ? 3 ● Performance Characterization and Optimization of In-Memory Data Analytics on a Scale-up Server, PhD thesis, Ahsan Javed Awan (ISBN: 978-91-7729-584-6) ● Identifying the potential of Near Data Processing for Apache Spark, in ACM Memory Systems Symposium, 2017. ● Node Architecture Implications for In-Memory Data Analytics in Scale-in Clusters, in IEEE/ACM Conference on Big Data Computing, Applications and Technologies, 2016. ● Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads, in IEEE Conference on Big Data and Cloud Computing, 2016. ● How Data Volume Affects Spark Based Data Analytics on a Scale-up Server, in 6th Workshop on Big Data Benchmarks, Performance Optimization and Emerging Hardware (BPOE), held in conjunction with VLDB 2015, Hawaii, USA. ● Performance characterization of in-memory data analytics on a modern cloud server, in IEEE Conference on Big Data and Cloud Computing, 2015 (Best Paper Award).
4 The thesis statement ? 4 Scale-out big data processing frameworks like Apache Spark fail to fully exploit the potential of modern off-the-shelf commodity machines (scale-up servers), which therefore need to be augmented with programmable accelerators near memory and near storage.
5 Where does this thesis fit in ? 5 ● Clearing the clouds, ASPLOS' 12 ● Characterizing data analysis workloads, IISWC' 13 ● Understanding the behavior of in-memory computing workloads, IISWC' 14 ● Exponential increase in core count. ● A mismatch between the characteristics of emerging big data workloads and the underlying hardware. ● Newer promising technologies (Hybrid Memory Cubes, NVRAM, etc.)
6 Cont... 6 Improve the node-level performance of scale-out frameworks like Apache Spark. Phoenix++, Metis, Ostrich, etc. (single-node) vs. Hadoop, Spark, Flink, etc. (scale-out). Focus of thesis. *Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/
7 Cont... Quantification of mismatch between scale-out big data processing frameworks and scale-up servers Architectural Impact on the performance of big data processing frameworks Exploiting Near Data Processing to boost the performance of big data processing frameworks
8 What is the methodology used? 8 ● Empirical studies of representative benchmarks on a dual-socket server. ● Implications of the performance numbers for future server architectures. ● Relied on vendor-provided performance analysis tools, combined with modeling to obtain estimates. ● Some numbers are taken from previous studies in the literature.
9 Which Scale-out Framework ? 9 [Picture Courtesy: Amir H. Payberah]
10 Which Machine ? 10
11 Which Benchmarks ? 11 Spark-Core Word Count, Grep, Sort, Naive Bayes Spark-SQL Join, Aggregation, Difference, Order By, Cross Product Spark-MLlib K-Means, Support Vector Machines, Logistic Regression, Linear Regression, Decision Trees, Sparse Naive Bayes Graph-X Page Rank, Connected Components, Triangle Counting Spark-Streaming Networked Word Count, Stateful Word Count, Count Min Sketch, Hyper Log Log, Windowed Word Count, Streaming K-Means
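To make the Spark-Core workloads concrete, here is a minimal pure-Python sketch of the Word Count benchmark's map/reduce structure. This is an illustrative stand-in, not the Spark API: `map_phase` plays the role of Spark's `flatMap`/`map` stage and `reduce_phase` that of `reduceByKey`.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, as a Spark map/flatMap stage would.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Sum counts per key, as reduceByKey would.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark scales up", "spark scales out"]
print(reduce_phase(map_phase(lines)))
```

In Spark this pipeline is distributed over executors; the characterization in this thesis looks at how that distributed execution behaves on a single scale-up node.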
12 The summary of work? 12
Problems Identified: Work Time Inflation; Poor Multi-core Scalability of data analytics with Spark; Thread Level Load Imbalance; Wait Time on I/O; GC overhead; DRAM Bound Latency.
Solutions Proposed: NUMA Awareness; Hyper-Threaded Cores; No next-line prefetchers; Lower DRAM speed; Future Hybrid Node with ISP + 2D PIM; Choice of GC algorithm; Multiple Small Executors.
https://databricks.com/session/near-data-computing-architectures-apache-spark-challenges-opportunities/
https://spark-summit.org/eu-2016/events/performance-characterization-of-apache-spark-on-scale-up-servers/
13 What will I focus on in detail? 13
Problems Identified: Work Time Inflation; Poor Multi-core Scalability of data analytics with Spark; Thread Level Load Imbalance; Wait Time on I/O; GC overhead; DRAM Bound Latency.
Solutions Proposed: NUMA Awareness; Hyper-Threaded Cores; No next-line prefetchers; Lower DRAM speed; Exploiting Near Data Processing; Choice of GC algorithm; Multiple Small Executors.
https://databricks.com/session/near-data-computing-architectures-apache-spark-challenges-opportunities/
https://spark-summit.org/eu-2016/events/performance-characterization-of-apache-spark-on-scale-up-servers/
14 Do Spark workloads have good multi-core scalability ? Spark scales poorly in the scale-up configuration
15 Is there work-time inflation ? K-means (Km)
16 Is File I/O detrimental to performance ? The fraction of file I/O in Sort increases by 25x when the input data is increased by 4x
17 Are workloads DRAM Bound ? Poor instruction retirement due to frequent DRAM accesses
18 Exploiting NDP/Moving compute closer to data ? 18 Loh et al. A processing in memory taxonomy and a case for studying fixed-function PIM. In Workshop on Near-Data Processing (WoNDP), 2013. 1. Processing in Memory 2. In-Storage Processing. Improve performance by reducing costly data movement back and forth between the CPUs and memory
19 Trends of Integrating NVM in the System Architecture ? 19 Chang et al. A limits study of benefits from nanostore-based future data-centric system architectures. In Computing Frontiers 2012
20 Can Spark workloads benefit from Near data processing ? 20 Host CPU PIM device ISP device Project: Night-King
21 The case for in-storage processing ? 21 Grep (Gp), K-means (Km), Windowed Word Count (Wwc)
22 The case for 2D integrated PIM instead of 3D Stacked PIM ? 22 M. Radulovic et al. Another Trip to the Wall: How Much Will Stacked DRAM Benefit HPC?
23 A refined hypothesis based on workload characterization ? 23 ● Spark workloads that are not iterative and have a high ratio of I/O wait time to CPU time, like join, aggregation, filter, word count and sort, are ideal candidates for ISP. ● Spark workloads with a low ratio of I/O wait time to CPU time, like stream processing and iterative graph processing workloads, are bound by the latency of frequent DRAM accesses and are ideal candidates for 2D integrated PIM. ● Spark workloads that are iterative and have a moderate ratio of I/O wait time to CPU time, like K-means, have both I/O-bound and memory-bound phases and hence will benefit from hybrid 2D integrated PIM and ISP. ● To satisfy the varying compute demands of Spark workloads, we envision an NDC architecture with programmable-logic-based hybrid ISP and 2D integrated PIM.
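The placement rule in the hypothesis can be sketched as a small decision function. The two features (iterativeness and the I/O-wait-time / CPU-time ratio) come from the slide; the numeric thresholds below are illustrative assumptions, not values from the thesis.

```python
def ndp_placement(iterative, io_wait_ratio):
    """Suggest an NDP placement from the two workload features used in the
    refined hypothesis. The 0.5 / 0.1 cut-offs are illustrative only."""
    if not iterative and io_wait_ratio > 0.5:
        return "ISP"                  # I/O bound: join, sort, word count
    if io_wait_ratio < 0.1:
        return "2D PIM"               # DRAM-latency bound: streaming, graphs
    return "hybrid ISP + 2D PIM"      # mixed phases: e.g. K-means

print(ndp_placement(iterative=True, io_wait_ratio=0.3))
```

A real classifier would be driven by measured profiles (I/O wait and DRAM-bound stall fractions) rather than fixed thresholds.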
24 How to test the refined hypothesis ? 24 ● Simulation approach: very slow for big data applications :( ● Modeling approach: over-estimated numbers :( ● Emulation approach: a lot of development effort :( How about a combination of modeling and partial emulation ?
25 Can existing tightly coupled servers be used as emulators ? 25
26 Our System Design ? 26
27 Which programming model ? 27 Iterative MapReduce *Source: JudyQiu-Talk-IIT-Nov-4-2011
28 Which workloads ? 28 K-means and SGD. Mahajan et al. TABLA: A unified template-based framework for accelerating statistical machine learning
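Both workloads fit the iterative MapReduce model from the previous slide. As a minimal sketch (pure Python on 1-D points, standing in for Spark's cached-RDD iteration, not the accelerator implementation), K-means repeats a map step (assign each point to its nearest centroid) and a reduce step (average each cluster) until convergence:

```python
def kmeans_1d(points, centroids, iterations=10):
    """Iterative MapReduce sketch of K-means on 1-D points."""
    for _ in range(iterations):
        # Map: pair each point with the index of its nearest centroid.
        pairs = [(min(range(len(centroids)),
                      key=lambda i: abs(p - centroids[i])), p)
                 for p in points]
        # Reduce: new centroid = mean of the points assigned to it.
        for i in range(len(centroids)):
            assigned = [p for c, p in pairs if c == i]
            if assigned:
                centroids[i] = sum(assigned) / len(assigned)
    return centroids

print(kmeans_1d([1.0, 1.2, 9.0, 9.2], [0.0, 10.0]))
```

SGD has the same shape: a map over (mini-batches of) training examples computing gradients, and a reduce that sums them into a model update, which is why one template can serve both.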
29 Design Parameters ? 29 ● Assumption 01: Training data, model and intermediate data fit in the FPGA internal memory and are kept there across iterations. ● Assumption 02: Model and intermediate data fit in the FPGA internal memory, but the training data does not fit inside the FPGA and is kept in FPGA-attached external DDR3 memory. ● Assumption 03: Training data does not fit in the FPGA external memory, but the model fits inside the FPGA. ● Assumption 04: Training data does not fit in the FPGA external memory but fits in system memory, and the model does not fit inside the FPGA memory.
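The four assumptions form a decision tree over two sizing questions: where does the training data fit, and does the model fit on-chip? A small helper makes that mapping explicit (a simplification for illustration; the slide's Assumption 01 also requires the intermediate data to fit on-chip):

```python
def design_assumption(training_fits_internal, training_fits_ddr3,
                      model_fits_fpga):
    """Map the memory-sizing questions to one of the four design assumptions."""
    if model_fits_fpga:
        if training_fits_internal:
            return 1  # everything on-chip, kept across iterations
        if training_fits_ddr3:
            return 2  # training data streamed from FPGA-attached DDR3
        return 3      # training data larger than FPGA external memory
    return 4          # model off-chip, training data in system memory
```

Each scenario implies a different data-movement cost per iteration, which is what the design-space exploration quantifies.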
30 Our programmable accelerators ? 30
31 Advantages of the design ? 31 ● Template-based design to support generality. ● The number of mappers and reducers can be instantiated based on the FPGA card. ● The General Sequencer is a finite state machine whose states can be varied to suit a diverse set of workloads. ● Mappers and reducers can be programmed in C/C++ and synthesized using High-Level Synthesis. ● Supports hardware acceleration of a diverse set of workloads.
32 Let's show some numbers ? 32 ~9x
33 What are the opportunities ? 33 K-means (Km) Conservatively, scale-up servers augmented with near-data accelerators can improve ● the performance of Spark MLlib by 4x
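An end-to-end figure like this is bounded by Amdahl's law: only the offloaded fraction of the application benefits from the accelerator. The sketch below shows the arithmetic; the 80% offload fraction is an illustrative assumption (not a measured value), paired with the ~9x kernel speedup from the earlier slide.

```python
def overall_speedup(offload_fraction, accel_speedup):
    # Amdahl's law: the non-offloaded fraction runs at original speed.
    return 1.0 / ((1.0 - offload_fraction) + offload_fraction / accel_speedup)

# Hypothetical: offload 80% of the work at a 9x kernel speedup.
print(round(overall_speedup(0.8, 9.0), 2))  # ~3.46x end to end
```

This is why a ~9x kernel speedup translating into a ~4x application-level MLlib improvement is a plausible, conservative estimate.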
34 What can High-Level Synthesis (Xilinx SDSoC Tool Chain) do ? 34 20x 10x The High-Level Synthesis approach has the potential to solve the programmability issues of NDP
35 What are the challenges? 35 ● How to design the best hybrid CPU + FPGA ML workloads ? ● How to attain peak performance on the CPU side ? ● How to attain peak performance on the FPGA side ? ● How to balance load between CPU and FPGA ? ● How to hide communication between the JVM and the FPGA ? ● How to attain peak CAPI bandwidth consumption ? ● How to design clever ML workload accelerators using HLS tools ?
36 A Quick Summary ? 36
Problems Identified: Work Time Inflation; Poor Multi-core Scalability of data analytics with Spark; Thread Level Load Imbalance; Wait Time on I/O; GC overhead; DRAM Bound Latency.
Solutions Proposed: NUMA Awareness; Hyper-Threaded Cores; No next-line prefetchers; Lower DRAM speed; Future Hybrid Node with ISP + 2D PIM; Choice of GC algorithm; Multiple Small Executors.
https://databricks.com/session/near-data-computing-architectures-apache-spark-challenges-opportunities/
https://spark-summit.org/eu-2016/events/performance-characterization-of-apache-spark-on-scale-up-servers/
37 Cont.. 37 Scale-out big data processing frameworks like Apache Spark fail to fully exploit the potential of modern off-the-shelf commodity machines (scale-up servers) and require modern servers to be augmented with programmable accelerators near-memory and near-storage Performance Characterization and Optimization of In-Memory Data Analytics on a Scale-up Server, PhD thesis, Ahsan Javed Awan (ISBN: 978-91-7729-584-6)
38 That's all for now ? 38 Email: ajawan@kth.se Profile: www.kth.se/profile/ajawan/ https://se.linkedin.com/in/ahsanjavedawan THANK YOU
39 What are the limitations of my work ? 39 ● Assumption 01: Apache Spark is here to stay as the state of the art for the foreseeable future. ● Assumption 02: In the big data analytics domain, synthetic benchmarks/dwarfs should be given preference over real-world workloads. ● Assumption 03: SSDs will stick around despite the availability of terabyte-scale DRAMs. ● Assumption 04: Tools in the high-level synthesis domain are mature enough to support programmable accelerators near DRAM and NVRAM.
40 Is GC detrimental to scalability of Spark applications ? 40 GC time does not scale linearly at larger datasets
41 How about using multiple small executors instead of a single large executor ? Multiple small executors can provide up to a 36% performance gain
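The same total resources can be carved into executors in different ways with standard `spark-submit` flags. The core/memory numbers below are illustrative, not the thesis's measured configuration; the intuition is that several smaller JVM heaps suffer shorter GC pauses and map better onto NUMA sockets than one large heap.

```shell
# Single large executor: one big JVM heap, long GC pauses.
spark-submit --master yarn --num-executors 1 \
  --executor-cores 24 --executor-memory 96g app.jar

# Multiple small executors: same total cores and memory,
# smaller heaps, shorter GC pauses, better NUMA locality.
spark-submit --master yarn --num-executors 4 \
  --executor-cores 6 --executor-memory 24g app.jar
```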
42 Does NUMA awareness help Spark applications ? 42 NUMA awareness results in a 10% speed-up on average
43 Is Hyper-Threading Effective ? Hyper-threading reduces the DRAM-bound stalls by 50%
44 How effective are existing data prefetchers ? Disabling next-line prefetchers can improve the performance by 15%
45 How effective are the cache-aware optimizations in Project Tungsten ? DataFrames exhibit 25% fewer back-end bound stalls, 64% fewer DRAM-bound stalled cycles, 25% less bandwidth consumption and 10% less starvation of execution resources. DataFrames have better micro-architectural performance than RDDs
46 Is there thread-level load imbalance ?
