Performance Characterization and Optimization of In-Memory Data Analytics on a Scale-up Server Ahsan Javed Awan
2 About me? Timeline 1988–2017: Born in Pakistan; B.E. MTS, NUST, Pakistan; EMECS, TUKL, Germany; EMECS, UoS, UK; Lecturer, NUST, Pakistan; EMJD-DC, KTH/SICS, Sweden; PhD Intern, Recore, Netherlands; EMJD-DC, UPC/BSC, Spain; PhD Intern, IBM Research, Japan; Research Assistant/Associate, Imperial College London, UK
3 What is the focus of this talk ? 3 ● Performance Characterization and Optimization of In-Memory Data Analytics on a Scale-up Server, PhD thesis, Ahsan Javed Awan (ISBN: 978-91-7729-584-6) ● Identifying the potential of Near Data Processing for Apache Spark, in ACM Memory Systems Symposium, 2017. ● Node Architecture Implications for In-Memory Data Analytics in Scale-in Clusters, in IEEE/ACM Conference on Big Data Computing, Applications and Technologies, 2016. ● Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads, in IEEE Conference on Big Data and Cloud Computing, 2016. ● How Data Volume Affects Spark Based Data Analytics on a Scale-up Server, in 6th Workshop on Big Data Benchmarks, Performance Optimization and Emerging Hardware (BPOE), held in conjunction with VLDB 2015, Hawaii, USA. ● Performance characterization of in-memory data analytics on a modern cloud server, in IEEE Conference on Big Data and Cloud Computing, 2015 (Best Paper Award).
4 The thesis statement ? 4 Scale-out big data processing frameworks like Apache Spark fail to fully exploit the potential of modern off-the-shelf commodity machines (scale-up servers), which therefore need to be augmented with programmable accelerators near memory and near storage.
5 Where does this thesis fit in ? 5 ● Clearing the clouds, ASPLOS' 12 ● Characterizing data analysis workloads, IISWC' 13 ● Understanding the behavior of in-memory computing workloads, IISWC' 14 ● Exponential increase in core count. ● A mismatch between the characteristics of emerging big data workloads and the underlying hardware. ● Newer promising technologies (Hybrid Memory Cubes, NVRAM, etc.)
6 Cont... 6 Improve the node-level performance of scale-out frameworks like Apache Spark. Phoenix++, Metis, Ostrich, etc. (single-node) vs. Hadoop, Spark, Flink, etc. (scale-out). Focus of thesis. *Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/
7 Cont... Quantification of mismatch between scale-out big data processing frameworks and scale-up servers Architectural Impact on the performance of big data processing frameworks Exploiting Near Data Processing to boost the performance of big data processing frameworks
8 What is the methodology used? 8 ● Empirical studies of representative benchmarks on a dual-socket server. ● Implications of the performance numbers for future server architectures. ● Relied on vendor-provided performance analysis tools, combined with modeling to obtain estimates. ● Some numbers are taken from previous studies in the literature.
9 Which Scale-out Framework ? 9 [Picture Courtesy: Amir H. Payberah]
10 Which Machine ? 10
11 Which Benchmarks ? 11 Spark-Core Word Count, Grep, Sort, Naive Bayes Spark-SQL Join, Aggregation, Difference, Order By, Cross Product Spark-MLlib K-Means, Support Vector Machines, Logistic Regression, Linear Regression, Decision Trees, Sparse Naive Bayes Graph-X Page Rank, Connected Components, Triangle Counting Spark-Streaming Networked Word Count, Stateful Word Count, Count Min Sketch, Hyper Log Log, Windowed Word Count, Streaming K-Means
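To make the Spark-Core workloads concrete, here is a minimal pure-Python sketch of the Word Count benchmark's map/reduce structure. This is an illustrative stand-in, not the Spark API: `map_phase` plays the role of Spark's `flatMap`/`map` stage and `reduce_phase` that of `reduceByKey`.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, as a Spark map/flatMap stage would.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Sum counts per key, as reduceByKey would.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark scales up", "spark scales out"]
print(reduce_phase(map_phase(lines)))
```

In Spark this pipeline is distributed over executors; the characterization in this thesis looks at how that distributed execution behaves on a single scale-up node.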
12 The summary of work? 12
Problems Identified: Work Time Inflation; Poor Multi-core Scalability of data analytics with Spark; Thread Level Load Imbalance; Wait Time on I/O; GC overhead; DRAM Bound Latency.
Solutions Proposed: NUMA Awareness; Hyper-Threaded Cores; No next-line prefetchers; Lower DRAM speed; Future Hybrid Node with ISP + 2D PIM; Choice of GC algorithm; Multiple Small Executors.
https://databricks.com/session/near-data-computing-architectures-apache-spark-challenges-opportunities/
https://spark-summit.org/eu-2016/events/performance-characterization-of-apache-spark-on-scale-up-servers/
13 What will I focus on in detail? 13
Problems Identified: Work Time Inflation; Poor Multi-core Scalability of data analytics with Spark; Thread Level Load Imbalance; Wait Time on I/O; GC overhead; DRAM Bound Latency.
Solutions Proposed: NUMA Awareness; Hyper-Threaded Cores; No next-line prefetchers; Lower DRAM speed; Exploiting Near Data Processing; Choice of GC algorithm; Multiple Small Executors.
https://databricks.com/session/near-data-computing-architectures-apache-spark-challenges-opportunities/
https://spark-summit.org/eu-2016/events/performance-characterization-of-apache-spark-on-scale-up-servers/
14 Do Spark workloads have good multi-core scalability ? Spark scales poorly in the scale-up configuration
15 Is there work-time inflation ? K-means (Km)
16 Is File I/O detrimental to performance ? The fraction of file I/O in Sort increases by 25x when the input data is increased by 4x
17 Are workloads DRAM Bound ? Poor instruction retirement due to frequent DRAM accesses
18 Exploiting NDP/Moving compute closer to data ? 18 Loh et al. A processing in memory taxonomy and a case for studying fixed-function PIM. In Workshop on Near-Data Processing (WoNDP), 2013. 1. Processing in Memory 2. In-Storage Processing. Improve performance by reducing costly data movement back and forth between the CPUs and memory
19 Trends of Integrating NVM in the System Architecture ? 19 Chang et al. A limits study of benefits from nanostore-based future data-centric system architectures. In Computing Frontiers 2012
20 Can Spark workloads benefit from Near data processing ? 20 Host CPU PIM device ISP device Project: Night-King
21 The case for in-storage processing ? 21 Grep (Gp), K-means (Km), Windowed Word Count (Wwc)
22 The case for 2D integrated PIM instead of 3D Stacked PIM ? 22 M. Radulovic et al. Another Trip to the Wall: How Much Will Stacked DRAM Benefit HPC?
23 A refined hypothesis based on workload characterization ? 23 ● Spark workloads that are not iterative and have a high ratio of I/O wait time to CPU time, like join, aggregation, filter, word count and sort, are ideal candidates for ISP. ● Spark workloads with a low ratio of I/O wait time to CPU time, like stream processing and iterative graph processing workloads, are bound by the latency of frequent DRAM accesses and are ideal candidates for 2D integrated PIM. ● Spark workloads that are iterative and have a moderate ratio of I/O wait time to CPU time, like K-means, have both I/O-bound and memory-bound phases and hence will benefit from hybrid 2D integrated PIM and ISP. ● To satisfy the varying compute demands of Spark workloads, we envision an NDC architecture with programmable-logic-based hybrid ISP and 2D integrated PIM.
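The placement rule in the hypothesis can be sketched as a small decision function. The two features (iterativeness and the I/O-wait-time / CPU-time ratio) come from the slide; the numeric thresholds below are illustrative assumptions, not values from the thesis.

```python
def ndp_placement(iterative, io_wait_ratio):
    """Suggest an NDP placement from the two workload features used in the
    refined hypothesis. The 0.5 / 0.1 cut-offs are illustrative only."""
    if not iterative and io_wait_ratio > 0.5:
        return "ISP"                  # I/O bound: join, sort, word count
    if io_wait_ratio < 0.1:
        return "2D PIM"               # DRAM-latency bound: streaming, graphs
    return "hybrid ISP + 2D PIM"      # mixed phases: e.g. K-means

print(ndp_placement(iterative=True, io_wait_ratio=0.3))
```

A real classifier would be driven by measured profiles (I/O wait and DRAM-bound stall fractions) rather than fixed thresholds.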
24 How to test the refined hypothesis ? 24 ● Simulation approach: very slow for big data applications :( ● Modeling approach: over-estimated numbers :( ● Emulation approach: a lot of development effort :( How about a combination of modeling and partial emulation ?
25 Can existing tightly coupled servers be used as emulators ? 25
26 Our System Design ? 26
27 Which programming model ? 27 Iterative MapReduce *Source: JudyQiu-Talk-IIT-Nov-4-2011
28 Which workloads ? 28 K-means and SGD. Mahajan et al. TABLA: A unified template-based framework for accelerating statistical machine learning
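Both workloads fit the iterative MapReduce model from the previous slide. As a minimal sketch (pure Python on 1-D points, standing in for Spark's cached-RDD iteration, not the accelerator implementation), K-means repeats a map step (assign each point to its nearest centroid) and a reduce step (average each cluster) until convergence:

```python
def kmeans_1d(points, centroids, iterations=10):
    """Iterative MapReduce sketch of K-means on 1-D points."""
    for _ in range(iterations):
        # Map: pair each point with the index of its nearest centroid.
        pairs = [(min(range(len(centroids)),
                      key=lambda i: abs(p - centroids[i])), p)
                 for p in points]
        # Reduce: new centroid = mean of the points assigned to it.
        for i in range(len(centroids)):
            assigned = [p for c, p in pairs if c == i]
            if assigned:
                centroids[i] = sum(assigned) / len(assigned)
    return centroids

print(kmeans_1d([1.0, 1.2, 9.0, 9.2], [0.0, 10.0]))
```

SGD has the same shape: a map over (mini-batches of) training examples computing gradients, and a reduce that sums them into a model update, which is why one template can serve both.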
29 Design Parameters ? 29 ● Assumption 01: Training data, model and intermediate data fit in the FPGA internal memory and are kept there across iterations. ● Assumption 02: Model and intermediate data fit in the FPGA internal memory, but the training data does not fit inside the FPGA and is kept in FPGA-attached external DDR3 memory. ● Assumption 03: Training data does not fit in the FPGA external memory, but the model fits inside the FPGA. ● Assumption 04: Training data does not fit in the FPGA external memory but fits in system memory, and the model does not fit inside the FPGA memory.
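The four assumptions form a decision tree over two sizing questions: where does the training data fit, and does the model fit on-chip? A small helper makes that mapping explicit (a simplification for illustration; the slide's Assumption 01 also requires the intermediate data to fit on-chip):

```python
def design_assumption(training_fits_internal, training_fits_ddr3,
                      model_fits_fpga):
    """Map the memory-sizing questions to one of the four design assumptions."""
    if model_fits_fpga:
        if training_fits_internal:
            return 1  # everything on-chip, kept across iterations
        if training_fits_ddr3:
            return 2  # training data streamed from FPGA-attached DDR3
        return 3      # training data larger than FPGA external memory
    return 4          # model off-chip, training data in system memory
```

Each scenario implies a different data-movement cost per iteration, which is what the design-space exploration quantifies.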
30 Our programmable accelerators ? 30
31 Advantages of the design ? 31 ● Template-based design to support generality. ● The number of mappers and reducers can be instantiated based on the FPGA card. ● The General Sequencer is a finite state machine whose states can be varied to suit a diverse set of workloads. ● Mappers and reducers can be programmed in C/C++ and synthesized using High-Level Synthesis. ● Supports hardware acceleration of a diverse set of workloads.
32 Let's show some numbers ? 32 ~9x
33 What are the opportunities ? 33 K-means (Km) Conservatively, scale-up servers augmented with near-data accelerators can improve ● the performance of Spark MLlib by 4x
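An end-to-end figure like this is bounded by Amdahl's law: only the offloaded fraction of the application benefits from the accelerator. The sketch below shows the arithmetic; the 80% offload fraction is an illustrative assumption (not a measured value), paired with the ~9x kernel speedup from the earlier slide.

```python
def overall_speedup(offload_fraction, accel_speedup):
    # Amdahl's law: the non-offloaded fraction runs at original speed.
    return 1.0 / ((1.0 - offload_fraction) + offload_fraction / accel_speedup)

# Hypothetical: offload 80% of the work at a 9x kernel speedup.
print(round(overall_speedup(0.8, 9.0), 2))  # ~3.46x end to end
```

This is why a ~9x kernel speedup translating into a ~4x application-level MLlib improvement is a plausible, conservative estimate.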
34 What can High-Level Synthesis (Xilinx SDSoC Tool Chain) do ? 34 20x 10x The High-Level Synthesis approach has the potential to solve the programmability issues of NDP
35 What are the challenges? 35 ● How to design the best hybrid CPU + FPGA ML workloads ? ● How to attain peak performance on the CPU side ? ● How to attain peak performance on the FPGA side ? ● How to balance load between CPU and FPGA ? ● How to hide communication between the JVM and the FPGA ? ● How to attain peak CAPI bandwidth consumption ? ● How to design clever ML workload accelerators using HLS tools ?
36 A Quick Summary ? 36
Problems Identified: Work Time Inflation; Poor Multi-core Scalability of data analytics with Spark; Thread Level Load Imbalance; Wait Time on I/O; GC overhead; DRAM Bound Latency.
Solutions Proposed: NUMA Awareness; Hyper-Threaded Cores; No next-line prefetchers; Lower DRAM speed; Future Hybrid Node with ISP + 2D PIM; Choice of GC algorithm; Multiple Small Executors.
https://databricks.com/session/near-data-computing-architectures-apache-spark-challenges-opportunities/
https://spark-summit.org/eu-2016/events/performance-characterization-of-apache-spark-on-scale-up-servers/
37 Cont.. 37 Scale-out big data processing frameworks like Apache Spark fail to fully exploit the potential of modern off-the-shelf commodity machines (scale-up servers) and require modern servers to be augmented with programmable accelerators near-memory and near-storage Performance Characterization and Optimization of In-Memory Data Analytics on a Scale-up Server, PhD thesis, Ahsan Javed Awan (ISBN: 978-91-7729-584-6)
38 That's all for now ? 38 Email: ajawan@kth.se Profile: www.kth.se/profile/ajawan/ https://se.linkedin.com/in/ahsanjavedawan THANK YOU
39 What are the limitations of my work ? 39 ● Assumption 01: Apache Spark is here to stay as the state of the art for the foreseeable future. ● Assumption 02: In the big data analytics domain, synthetic benchmarks/dwarfs should be given preference over real-world workloads. ● Assumption 03: SSDs will stick around despite the availability of terabyte-scale DRAMs. ● Assumption 04: Tools in the high-level synthesis domain are mature enough to support programmable accelerators near DRAM and NVRAM.
40 Is GC detrimental to scalability of Spark applications ? 40 GC time does not scale linearly at larger datasets
41 How about using multiple small executors instead of a single large executor ? Multiple small executors can provide up to a 36% performance gain
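The same total resources can be carved into executors in different ways with standard `spark-submit` flags. The core/memory numbers below are illustrative, not the thesis's measured configuration; the intuition is that several smaller JVM heaps suffer shorter GC pauses and map better onto NUMA sockets than one large heap.

```shell
# Single large executor: one big JVM heap, long GC pauses.
spark-submit --master yarn --num-executors 1 \
  --executor-cores 24 --executor-memory 96g app.jar

# Multiple small executors: same total cores and memory,
# smaller heaps, shorter GC pauses, better NUMA locality.
spark-submit --master yarn --num-executors 4 \
  --executor-cores 6 --executor-memory 24g app.jar
```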
42 Does NUMA awareness help Spark applications ? 42 NUMA awareness results in a 10% speed-up on average
43 Is Hyper-Threading Effective ? Hyper-threading reduces the DRAM-bound stalls by 50%
44 How effective are existing data prefetchers ? Disabling next-line prefetchers can improve the performance by 15%
45 How effective are the cache-aware optimizations in Project Tungsten ? DataFrames exhibit 25% fewer back-end bound stalls, 64% fewer DRAM-bound stalled cycles, 25% less bandwidth consumption and 10% less starvation of execution resources. DataFrames have better micro-architectural performance than RDDs
46 Is there thread-level load imbalance ?
