Big Data Meets HPC – Exploiting HPC Technologies for Accelerating Big Data Processing
Keynote Talk at HPCAC-Switzerland (Mar 2016)
by
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Introduction to Big Data Applications and Analytics
• Big Data has become one of the most important elements of business analytics
• Provides groundbreaking opportunities for enterprise information management and decision making
• The amount of data is exploding; companies are capturing and digitizing more information than ever
• The rate of information growth appears to be exceeding Moore's Law
4V Characteristics of Big Data
• Commonly accepted 3V's of Big Data: Volume, Velocity, Variety
  – Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf
• 4/5V's of Big Data – 3V + Veracity, Value
Courtesy: http://api.ning.com/files/tRHkwQN7s-Xz5cxylXG004GLGJdjoPd6bVfVBwvgu*F5MwDDUCiHHdmBW-JTEz0cfJjGurJucBMTkIUNdL3jcZT8IPfNWfN9/dv1.jpg
Data Generation in Internet Services and Applications
• Webpages (content, graph)
• Clicks (ad, page, social)
• Users (OpenID, FB Connect, etc.)
• e-mails (Hotmail, Y!Mail, Gmail, etc.)
• Photos, Movies (Flickr, YouTube, Video, etc.)
• Cookies / tracking info (see Ghostery)
• Installed apps (Android Market, App Store, etc.)
• Location (Latitude, Loopt, Foursquare, Google Now, etc.)
• User-generated content (Wikipedia & co, etc.)
• Ads (display, text, DoubleClick, Yahoo, etc.)
• Comments (Discuss, Facebook, etc.)
• Reviews (Yelp, Y!Local, etc.)
• Social connections (LinkedIn, Facebook, etc.)
• Purchase decisions (Netflix, Amazon, etc.)
• Instant messages (YIM, Skype, Gtalk, etc.)
• Search terms (Google, Bing, etc.)
• News articles (BBC, NYTimes, Y!News, etc.)
• Blog posts (Tumblr, Wordpress, etc.)
• Microblogs (Twitter, Jaiku, Meme, etc.)
• Link sharing (Facebook, Delicious, Buzz, etc.)
Number of apps in the Apple App Store, Android Market, Blackberry, and Windows Phone (2013)
• Android Market: ~1200K
• Apple App Store: ~1000K
Courtesy: http://dazeinfo.com/2014/07/10/apple-inc-aapl-ios-google-inc-goog-android-growth-mobile-ecosystem-2014/
Velocity of Big Data – How Much Data Is Generated Every Minute on the Internet?
• The global Internet population grew 18.5% from 2013 to 2015 and now represents 3.2 billion people.
Courtesy: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/
Not Only in Internet Services – Big Data in Scientific Domains
• Scientific data management, analysis, and visualization
• Application examples
  – Climate modeling
  – Combustion
  – Fusion
  – Astrophysics
  – Bioinformatics
• Data-intensive tasks
  – Run large-scale simulations on supercomputers
  – Dump data to parallel storage systems
  – Collect experimental / observational data
  – Move experimental / observational data to analysis sites
  – Visual analytics – help understand data visually
Typical Solutions or Architectures for Big Data Analytics
• Hadoop: http://hadoop.apache.org
  – The most popular framework for Big Data analytics
  – HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.
• Spark: http://spark-project.org
  – Provides primitives for in-memory cluster computing; jobs can load data into memory and query it repeatedly
• Storm: http://storm-project.net
  – A distributed real-time computation system for real-time analytics, online machine learning, continuous computation, etc.
• S4: http://incubator.apache.org/s4
  – A distributed system for processing continuous unbounded streams of data
• GraphLab: http://graphlab.org
  – Consists of a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API
• Web 2.0: RDBMS + Memcached (http://memcached.org)
  – Memcached: a high-performance, distributed memory object caching system
Big Data Processing with Hadoop Components
• Major components included in this tutorial:
  – MapReduce (Batch)
  – HBase (Query)
  – HDFS (Storage)
  – RPC (Inter-process communication)
• The underlying Hadoop Distributed File System (HDFS) is used by both MapReduce and HBase
• The model scales, but the large amount of communication during intermediate phases can be further optimized
[Figure: Hadoop framework stack – User Applications over MapReduce and HBase, which sit on HDFS and Hadoop Common (RPC)]
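As an aside for readers less familiar with the MapReduce component named above, the sketch below is a minimal word-count job written against the stock Apache Hadoop 2.x Java API. It is only an illustration of the batch model (map, shuffle, reduce); the RDMA-enhanced designs discussed later in this deck keep this user-facing API unchanged, so application code of this shape is assumed to run unmodified on top of them.

```java
// Minimal Hadoop 2.x MapReduce word-count job (stock Apache API; illustration only).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // map output is shuffled to reducers
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```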
Spark Architecture Overview
• An in-memory data-processing framework
  – Iterative machine learning jobs
  – Interactive data analytics
  – Scala-based implementation
  – Standalone, YARN, Mesos
• Scalable and communication-intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs
  – Sockets-based communication
[Figure: Driver (SparkContext) and Master coordinating Worker nodes over HDFS, with ZooKeeper]
http://spark.apache.org
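To make the "wide dependency / shuffle" point above concrete, here is a minimal sketch using the stock Spark 1.x Java API (input and output paths are placeholders). The reduceByKey step repartitions the RDD by key, which is exactly the MapReduce-like shuffle traffic that the RDMA-based Spark designs later in this deck target.

```java
// Minimal Spark 1.x (Java API) job whose reduceByKey step triggers a shuffle
// across wide RDD dependencies. Paths are placeholders.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ShuffleSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("ShuffleSketch");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input");   // narrow: one partition per block
    JavaRDD<String> words = lines.flatMap(l -> Arrays.asList(l.split(" ")));
    JavaPairRDD<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));

    // Wide dependency: every output partition may read from every input partition,
    // so map output is repartitioned (shuffled) over the network before reduction.
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

    counts.saveAsTextFile("hdfs:///tmp/output");
    sc.stop();
  }
}
```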
Memcached Architecture
• Distributed caching layer
  – Allows aggregation of spare memory from multiple nodes
  – General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network-intensive
[Figure: clients and Memcached servers, each with main memory, CPUs, SSD, and HDD, connected by high-performance networks]
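The "cache database queries" usage pattern is the classic cache-aside loop sketched below. It uses the spymemcached Java client purely as an assumed illustration (the HiBD RDMA-Memcached package described later targets the C libMemcached API); host name, TTL, and loadFromDatabase() are placeholders. Every lookup costs at least one network round trip, which is why typical deployments are so network-intensive.

```java
// Cache-aside lookup: check Memcached first, fall back to the database on a miss,
// then populate the cache. Host, TTL, and loadFromDatabase() are placeholders.
import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;

public class CacheAside {
  private static final int TTL_SECONDS = 300;     // illustrative expiry

  public static String lookup(MemcachedClient mc, String key) throws Exception {
    Object cached = mc.get(key);                  // one network round trip to a cache server
    if (cached != null) {
      return (String) cached;                     // cache hit
    }
    String value = loadFromDatabase(key);         // miss penalty: query the backend DB
    mc.set(key, TTL_SECONDS, value);              // populate the cache for later readers
    return value;
  }

  private static String loadFromDatabase(String key) {
    return "value-for-" + key;                    // stand-in for a real MySQL query
  }

  public static void main(String[] args) throws Exception {
    MemcachedClient mc = new MemcachedClient(AddrUtil.getAddresses("cachehost:11211"));
    System.out.println(lookup(mc, "user:42"));
    mc.shutdown();
  }
}
```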
Data Management and Processing on Modern Clusters
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
  – Front-end data accessing and serving (online)
    • Memcached + DB (e.g., MySQL), HBase
  – Back-end data analytics (offline)
    • HDFS, MapReduce, Spark
[Figure: Internet-facing front-end tier (web servers, Memcached + DB (MySQL), NoSQL DB (HBase)) for data accessing and serving, feeding a back-end tier running HDFS, MapReduce, and Spark data analytics apps/jobs]
Open Standard InfiniBand Networking Technology
• Introduced in Oct 2000
• High-performance data transfer
  – Inter-processor communication and I/O
  – Low latency (<1.0 microsec), high bandwidth (up to 12.5 GBytes/sec -> 100 Gbps), and low CPU utilization (5-10%)
• Multiple operations
  – Send/Recv
  – RDMA Read/Write
  – Atomic operations (very unique)
    • Enable high-performance and scalable implementations of distributed locks, semaphores, and collective communication operations
• Leading to big changes in designing
  – HPC clusters
  – File systems
  – Cloud computing systems
  – Grid computing systems
Interconnects and Protocols in the OpenFabrics Stack
[Figure: protocol stack options from the Application / Middleware interface down to adapter and switch]
• Sockets interface
  – TCP/IP over 1/10/40/100 GigE (Ethernet adapter and switch)
  – TCP/IP with hardware offload over 10/40 GigE-TOE
  – IPoIB over InfiniBand adapter and switch
  – RSockets over InfiniBand adapter and switch
  – SDP (RDMA) over InfiniBand adapter and switch
• Verbs interface
  – User-space iWARP over Ethernet switch
  – User-space RoCE (RDMA) over Ethernet switch
  – Native IB (RDMA) over InfiniBand adapter and switch
How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?
• What are the major bottlenecks in current Big Data processing middleware (e.g., Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?
Bring HPC and Big Data processing into a "convergent trajectory"!
Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Figure: layered challenge stack]
• Applications
• Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached)
• Programming models (Sockets) – other protocols? Upper-level changes?
• Communication and I/O library: point-to-point communication, threaded models and synchronization, QoS, fault tolerance, I/O and file systems, virtualization, benchmarks
• Networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs); commodity computing system architectures (multi- and many-core architectures and accelerators); storage technologies (HDD, SSD, and NVMe-SSD)
Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?
• Sockets are not designed for high performance
  – Stream semantics are often a mismatch for upper layers
  – Zero-copy is not available for non-blocking sockets
• Current design: Application -> Sockets -> 1/10/40/100 GigE network
• Our approach: Application -> OSU design -> Verbs interface -> 10/40/100 GigE or InfiniBand
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
  – HDFS and Memcached micro-benchmarks
• RDMA for Apache HBase and Impala (upcoming)
• Available for InfiniBand and RoCE
• http://hibd.cse.ohio-state.edu
• User base: 155 organizations from 20 countries
• More than 15,500 downloads from the project site
RDMA for Apache Hadoop 2.x Distribution
• High-performance design of Hadoop over RDMA-enabled interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
  – Enhanced HDFS with in-memory and heterogeneous storage
  – High-performance design of MapReduce over Lustre
  – Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
  – Easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.9
  – Based on Apache Hadoop 2.7.1
  – Compliant with Apache Hadoop 2.7.1, HDP 2.3.0.0 and CDH 5.6.0 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • Different file systems with disks and SSDs, and Lustre
  – http://hibd.cse.ohio-state.edu
RDMA for Apache Spark Distribution
• High-performance design of Spark over RDMA-enabled interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Spark
  – RDMA-based data shuffle and SEDA-based shuffle architecture
  – Non-blocking and chunk-based data transfer
  – Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
  – Based on Apache Spark 1.5.1
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • RAM disks, SSDs, and HDD
  – http://hibd.cse.ohio-state.edu
RDMA for Memcached Distribution
• High-performance design of Memcached over RDMA-enabled interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for the Memcached and libMemcached components
  – High-performance design of SSD-assisted hybrid memory
  – Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.4
  – Based on Memcached 1.4.24 and libMemcached 1.0.18
  – Compliant with libMemcached APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • SSD
  – http://hibd.cse.ohio-state.edu
OSU HiBD Micro-Benchmark (OHB) Suite – HDFS & Memcached
• Micro-benchmarks for the Hadoop Distributed File System (HDFS)
  – Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency (RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT) Benchmark
  – Support benchmarking of
    • Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH) HDFS
• Micro-benchmarks for Memcached
  – Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark
• Current release: 0.8
• http://hibd.cse.ohio-state.edu
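To give a feel for what a sequential-write-latency measurement of this kind does, the sketch below times a simple sequential write through the public HDFS FileSystem API. It is explicitly not the OHB implementation; file size, write granularity, and path are illustrative only.

```java
// Sketch of an HDFS sequential-write timing loop using the public FileSystem API.
// NOT the OHB benchmark code; sizes and paths are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSeqWriteLatency {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/benchmarks/ohb-sketch/seq-write.dat");

    byte[] buffer = new byte[4 * 1024];                // 4 KB writes
    long totalBytes = 64L * 1024 * 1024;               // 64 MB file

    long start = System.nanoTime();
    try (FSDataOutputStream out = fs.create(path, true)) {
      for (long written = 0; written < totalBytes; written += buffer.length) {
        out.write(buffer);                             // sequential writes through the DFS client
      }
      out.hsync();                                     // ensure data reaches the DataNodes
    }
    long elapsedUs = (System.nanoTime() - start) / 1000;

    System.out.printf("Wrote %d MB in %d us (%.2f MB/s)%n",
        totalBytes >> 20, elapsedUs, (totalBytes / 1048576.0) / (elapsedUs / 1e6));
    fs.delete(path, false);
    fs.close();
  }
}
```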
Different Modes of RDMA for Apache Hadoop 2.x
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation, for better fault tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance, in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)
Design Overview of HDFS with RDMA
• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges the Java-based HDFS with the communication library written in native code
• Design features
  – RDMA-based HDFS write
  – RDMA-based HDFS replication
  – Parallel replication support
  – On-demand connection setup
  – InfiniBand/RoCE support
[Figure: Applications -> HDFS (write / others) -> Java socket interface over 1/10/40/100 GigE, IPoIB network, or OSU design via Java Native Interface (JNI) -> Verbs over RDMA-capable networks (IB, iWARP, RoCE)]
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H)
• Design features
  – Three modes
    • Default (HHH)
    • In-Memory (HHH-M)
    • Lustre-Integrated (HHH-L)
  – Policies to efficiently utilize the heterogeneous storage devices
    • RAM, SSD, HDD, Lustre
  – Eviction/promotion based on data usage pattern
  – Hybrid replication
  – Lustre-Integrated mode:
    • Lustre-based fault tolerance
[Figure: Applications -> Triple-H data placement policies (hybrid replication, eviction/promotion) over heterogeneous storage: RAM disk, SSD, HDD, Lustre]
N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015
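Triple-H's placement and eviction policies are its own design and are not reproduced here. As a rough, stock-Hadoop analogue of steering data toward faster or slower media, the sketch below uses the standard HDFS storage-policy API available since Hadoop 2.6 (policy names per the Apache documentation; directory paths are placeholders). It is offered only to make the "heterogeneous storage" idea tangible, not as Triple-H itself.

```java
// Analogue only: stock HDFS storage policies used to hint heterogeneous placement.
// Triple-H's RAM/SSD/HDD/Lustre policies are a separate, richer design.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StoragePolicyHint {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("expects an HDFS deployment");
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    Path hotDir = new Path("/data/intermediate");     // placeholder paths
    Path coldDir = new Path("/data/archive");
    dfs.mkdirs(hotDir);
    dfs.mkdirs(coldDir);

    // LAZY_PERSIST writes one replica to RAM disk first; COLD keeps blocks on archival disks.
    dfs.setStoragePolicy(hotDir, "LAZY_PERSIST");
    dfs.setStoragePolicy(coldDir, "COLD");

    System.out.println("storage policies set on " + hotDir + " and " + coldDir);
    dfs.close();
  }
}
```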
Design Overview of MapReduce with RDMA
• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges the Java-based MapReduce with the communication library written in native code
• Design features
  – RDMA-based shuffle
  – Prefetching and caching of map output
  – Efficient shuffle algorithms
  – In-memory merge
  – On-demand shuffle adjustment
  – Advanced overlapping
    • map, shuffle, and merge
    • shuffle, merge, and reduce
  – On-demand connection setup
  – InfiniBand/RoCE support
[Figure: Applications -> MapReduce (Job Tracker, Task Tracker, Map, Reduce) -> Java socket interface over 1/10/40/100 GigE, IPoIB network, or OSU design via Java Native Interface (JNI) -> Verbs over RDMA-capable networks (IB, iWARP, RoCE)]
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014
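The shuffle phase these features target is the same copy/merge stage that stock Hadoop exposes through job configuration. The snippet below tunes two standard shuffle knobs of stock Hadoop 2.x (values are illustrative); enabling the RDMA shuffle plugin itself goes through the RDMA-Hadoop package's own settings, which are not shown here.

```java
// Stock Hadoop 2.x shuffle-related knobs (illustrative values). The RDMA shuffle
// design replaces the transport underneath this same copy/merge stage.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuning {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 256);               // map-side sort buffer (MB)
    conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 20);  // concurrent fetchers per reducer
    conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f); // heap share for shuffle buffers
    return Job.getInstance(conf, "shuffle-tuning-sketch");
  }

  public static void main(String[] args) throws Exception {
    Job job = configure();
    System.out.println("sort buffer: "
        + job.getConfiguration().get("mapreduce.task.io.sort.mb") + " MB");
  }
}
```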
Advanced Overlapping among Different Phases
• A hybrid approach to achieve the maximum possible overlapping in MapReduce across all phases, compared to other approaches
  – Efficient shuffle algorithms
  – Dynamic and efficient switching
  – On-demand shuffle adjustment
[Figure: timelines for the default architecture, enhanced overlapping with in-memory merge, and advanced hybrid overlapping]
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014
Design Overview of Hadoop RPC with RDMA
• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges the Java-based RPC with the communication library written in native code
• Design features
  – JVM-bypassed buffer management
  – RDMA or send/recv based adaptive communication
  – Intelligent buffer allocation and adjustment for serialization
  – On-demand connection setup
  – InfiniBand/RoCE support
[Figure: Applications -> Hadoop RPC (default design or OSU design) -> Java socket interface over 1/10/40/100 GigE, IPoIB network, or Java Native Interface (JNI) -> Verbs over RDMA-capable networks (IB, iWARP, RoCE)]
X. Lu, N. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance Design of Hadoop RPC with RDMA over InfiniBand, Int'l Conference on Parallel Processing (ICPP '13), October 2013
Performance Benefits – RandomWriter & TeraGen on TACC Stampede
• Cluster with 32 nodes and a total of 128 maps
• RandomWriter
  – 3-4x improvement over IPoIB for 80-120 GB file size
• TeraGen
  – 4-5x improvement over IPoIB for 80-120 GB file size
[Figure: execution time (s) vs. data size (80, 100, 120 GB) for RandomWriter (reduced by 3x) and TeraGen (reduced by 4x), IPoIB (FDR) vs. OSU-IB (FDR)]
Performance Benefits – Sort & TeraSort on TACC Stampede
• Sort with a single HDD per node (cluster with 32 nodes, 128 maps and 64 reduces)
  – 40-52% improvement over IPoIB for 80-120 GB data
• TeraSort with a single HDD per node (cluster with 32 nodes, 128 maps and 57 reduces)
  – 42-44% improvement over IPoIB for 80-120 GB data
[Figure: execution time (s) vs. data size (80, 100, 120 GB), IPoIB (FDR) vs. OSU-IB (FDR); Sort reduced by 52%, TeraSort reduced by 44%]
Evaluation of HHH and HHH-L with Applications
• MR-MSPolygraph on OSU RI with 1,000 maps
  – HHH-L reduces the execution time by 79% over Lustre, 30% over HDFS
• CloudBurst on TACC Stampede
  – With HHH: 19% improvement over HDFS (HDFS (FDR): 60.24 s; HHH (FDR): 48.3 s)
[Figure: MR-MSPolygraph execution time (s) vs. concurrent maps per host (4, 6, 8) for HDFS, Lustre, and HHH-L; reduced by 79%]
Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio)
• For 200 GB TeraGen on 32 nodes
  – Spark-TeraGen: HHH has 2.4x improvement over Tachyon; 2.3x over HDFS-IPoIB (QDR)
  – Spark-TeraSort: HHH has 25.2% improvement over Tachyon; 17% over HDFS-IPoIB (QDR)
[Figure: TeraGen and TeraSort execution time (s) vs. cluster size : data size (8:50, 16:100, 32:200 GB) for IPoIB (QDR), Tachyon, and OSU-IB (QDR); reduced by 2.4x (TeraGen) and 25.2% (TeraSort)]
N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData '15, October 2015
Design Overview of Shuffle Strategies for MapReduce over Lustre
• Design features
  – Two shuffle approaches
    • Lustre read based shuffle
    • RDMA based shuffle
  – Hybrid shuffle algorithm to take benefit from both shuffle approaches
  – Dynamically adapts to the better shuffle approach for each shuffle request, based on profiling values for each Lustre read operation
  – In-memory merge and overlapping of different phases are kept similar to the RDMA-enhanced MapReduce design
[Figure: maps write intermediate data to the Lustre intermediate data directory; reducers obtain it via Lustre read or RDMA, followed by in-memory merge/sort and reduce]
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015
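The core of the hybrid algorithm is a per-request choice driven by observed Lustre read cost. The sketch below is only a schematic illustration of that kind of adaptive switch; the window size, threshold comparison, and all names are invented for illustration and do not reproduce the published algorithm.

```java
// Schematic illustration of an adaptive shuffle-approach switch driven by profiled
// Lustre read latency. Invented thresholds and names; not the published design.
import java.util.ArrayDeque;
import java.util.Deque;

public class HybridShuffleChooser {
  enum Approach { LUSTRE_READ, RDMA }

  private static final int WINDOW = 32;              // recent samples to average
  private final Deque<Double> lustreReadMicros = new ArrayDeque<>();
  private final double rdmaEstimateMicros;           // assumed cost of an RDMA fetch

  public HybridShuffleChooser(double rdmaEstimateMicros) {
    this.rdmaEstimateMicros = rdmaEstimateMicros;
  }

  /** Record the measured latency of one Lustre read performed during shuffle. */
  public synchronized void recordLustreRead(double micros) {
    if (lustreReadMicros.size() == WINDOW) {
      lustreReadMicros.removeFirst();
    }
    lustreReadMicros.addLast(micros);
  }

  /** Pick the cheaper approach for the next shuffle request based on recent profiling. */
  public synchronized Approach choose() {
    if (lustreReadMicros.isEmpty()) {
      return Approach.LUSTRE_READ;                   // no profiling data yet: default choice
    }
    double avg = lustreReadMicros.stream()
        .mapToDouble(Double::doubleValue).average().orElse(0);
    return avg <= rdmaEstimateMicros ? Approach.LUSTRE_READ : Approach.RDMA;
  }
}
```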
Performance Improvement of MapReduce over Lustre on TACC Stampede
• Local disk is used as the intermediate data directory
• For 500 GB Sort on 64 nodes
  – 44% improvement over IPoIB (FDR)
• For 640 GB Sort on 128 nodes
  – 48% improvement over IPoIB (FDR)
[Figure: job execution time (sec) for 300-500 GB sorts and for scaling runs from 20 GB on 4 nodes to 640 GB on 128 nodes, IPoIB (FDR) vs. OSU-IB (FDR); reduced by 44% and 48%]
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014
Case Study – Performance Improvement of MapReduce over Lustre on SDSC Gordon
• Lustre is used as the intermediate data directory
• For 80 GB Sort on 8 nodes
  – 34% improvement over IPoIB (QDR)
• For 120 GB TeraSort on 16 nodes
  – 25% improvement over IPoIB (QDR)
[Figure: job execution time (sec) vs. data size for IPoIB (QDR), OSU-Lustre-Read (QDR), OSU-RDMA-IB (QDR), and OSU-Hybrid-IB (QDR); reduced by 34% (Sort) and 25% (TeraSort)]
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)
HBase-RDMA Design Overview
• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges the Java-based HBase with the communication library written in native code
[Figure: Applications -> HBase -> Java socket interface over 1/10/40/100 GigE, IPoIB networks, or OSU-IB design via Java Native Interface (JNI) -> IB Verbs over RDMA-capable networks (IB, iWARP, RoCE)]
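For context, the sketch below shows a minimal Put and Get with the standard HBase 1.x Java client; these are the same client-side operations that the YCSB read-write workloads on the following slides exercise. Table and column names are placeholders, and since the RDMA design sits below this API, application code of this shape is assumed to stay unchanged.

```java
// Minimal HBase 1.x client Put/Get against a placeholder table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();        // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("usertable"))) {  // placeholder table

      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field0"), Bytes.toBytes("value0"));
      table.put(put);                                         // write path (Put latency)

      Get get = new Get(Bytes.toBytes("user1"));
      Result result = table.get(get);                         // read path (Get latency)
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field0"));
      System.out.println("read back: " + Bytes.toString(value));
    }
  }
}
```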
HBase – YCSB Read-Write Workload
• HBase Get latency (QDR, 10GigE)
  – 64 clients: 2.0 ms; 128 clients: 3.5 ms
  – 42% improvement over IPoIB for 128 clients
• HBase Put latency (QDR, 10GigE)
  – 64 clients: 1.9 ms; 128 clients: 3.5 ms
  – 40% improvement over IPoIB for 128 clients
[Figure: read and write latency (us) vs. number of clients (8-128) for 10GigE, IPoIB (QDR), and OSU-IB (QDR)]
J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12
HBase – YCSB Get Latency and Throughput on SDSC Comet
• HBase Get average latency (FDR)
  – 4 client threads: 38 us
  – 59% improvement over IPoIB for 4 client threads
• HBase Get total throughput
  – 4 client threads: 102 Kops/sec
  – 2.4x improvement over IPoIB for 4 client threads
[Figure: Get average latency (ms) and total throughput (Kops/sec) vs. number of client threads (1-4), IPoIB (FDR) vs. OSU-IB (FDR); 59% and 2.4x]
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)
Design Overview of Spark with RDMA
• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges the Scala-based Spark with the communication library written in native code
• Design features
  – RDMA based shuffle
  – SEDA-based plugins
  – Dynamic connection management and sharing
  – Non-blocking data transfer
  – Off-JVM-heap buffer management
  – InfiniBand/RoCE support
[Figure: Spark applications (Scala/Java/Python) -> Spark (Scala/Java) tasks, BlockManager and BlockFetcherIterator -> Java NIO shuffle server/fetcher (default), Netty shuffle server/fetcher (optional), or RDMA shuffle server/fetcher (plug-in) -> Java sockets over 1/10 Gig Ethernet/IPoIB (QDR/FDR) network, or the RDMA-based shuffle engine (Java/JNI) over native InfiniBand (QDR/FDR)]
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014
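The SortBy/GroupBy results on the next slide exercise exactly the shuffle path that these server/fetcher plug-ins replace. The sketch below shows shuffle-heavy operations of that kind with the stock Spark 1.x Java API; data sizes and partition counts are placeholders, and this is not the HiBD benchmark code.

```java
// Shuffle-heavy Spark operations of the kind exercised by the SortBy/GroupBy tests.
// Sizes and partition counts are placeholders; not the benchmark code.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ShuffleOps {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("ShuffleOps"));

    // Synthetic (key, value) data spread over many partitions.
    List<Tuple2<Integer, Integer>> data = new ArrayList<>();
    Random rnd = new Random(42);
    for (int i = 0; i < 1_000_000; i++) {
      data.add(new Tuple2<>(rnd.nextInt(10_000), i));
    }
    JavaPairRDD<Integer, Integer> pairs = sc.parallelizePairs(data, 128);

    // Both operations repartition data by key, pushing map output through the
    // shuffle server/fetcher layer shown above (NIO, Netty, or the RDMA plug-in).
    long sortedCount = pairs.sortByKey().count();
    long groupedKeys = pairs.groupByKey().count();

    System.out.println("sorted records: " + sortedCount + ", grouped keys: " + groupedKeys);
    sc.stop();
  }
}
```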
Performance Evaluation on SDSC Comet – SortBy/GroupBy
• InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536M 1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
  – SortBy: total time reduced by up to 80% over IPoIB (56 Gbps)
  – GroupBy: total time reduced by up to 57% over IPoIB (56 Gbps)
[Figure: SortByTest and GroupByTest total time (sec) vs. data size (64, 128, 256 GB) on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; 80% and 57% reductions]
Performance Evaluation on SDSC Comet – HiBench Sort/TeraSort
• InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536M 1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
  – Sort: total time reduced by 38% over IPoIB (56 Gbps)
  – TeraSort: total time reduced by 15% over IPoIB (56 Gbps)
[Figure: Sort and TeraSort total time (sec) vs. data size (64, 128, 256 GB) on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; 38% and 15% reductions]
Performance Evaluation on SDSC Comet – HiBench PageRank
• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
  – 32 nodes / 768 cores: total time reduced by 37% over IPoIB (56 Gbps)
  – 64 nodes / 1536 cores: total time reduced by 43% over IPoIB (56 Gbps)
[Figure: PageRank total time (sec) vs. data size (Huge, BigData, Gigantic) on 32 worker nodes / 768 cores and 64 worker nodes / 1536 cores, IPoIB vs. RDMA; 37% and 43% reductions]
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)
Memcached-RDMA Design
• Server and client perform a negotiation protocol
  – The master thread assigns clients to an appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
• All other Memcached data structures are shared among RDMA and sockets worker threads
• The Memcached server can serve both socket and verbs clients simultaneously
• Memcached applications need not be modified; the verbs interface is used if available
[Figure: sockets and RDMA clients connect to a master thread, which hands them off to sockets or verbs worker threads that share the memory slabs and item data]
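The sketch below is a generic illustration of the master-thread/worker-thread hand-off pattern described above, using plain TCP sockets, a fixed worker pool, and a shared in-memory map as a stand-in for the shared slab/item structures. It is not the OSU implementation; the verbs-based worker threads and the negotiation protocol are not shown, and the port and toy text protocol are placeholders.

```java
// Generic master/worker hand-off: an accept loop (master) assigns each client to a
// worker thread that serves it against shared data. Illustrates the dispatch pattern only.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MasterWorkerServer {
  // Shared key-value store, analogous to the shared slab/item structures.
  private static final Map<String, String> STORE = new ConcurrentHashMap<>();

  public static void main(String[] args) throws Exception {
    ExecutorService workers = Executors.newFixedThreadPool(4);   // worker threads
    try (ServerSocket master = new ServerSocket(11300)) {        // placeholder port
      while (true) {
        Socket client = master.accept();                         // master: accept the client
        workers.submit(() -> serve(client));                     // bind client to a worker
      }
    }
  }

  private static void serve(Socket client) {
    try (Socket c = client;
         BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()));
         PrintWriter out = new PrintWriter(c.getOutputStream(), true)) {
      String line;
      while ((line = in.readLine()) != null) {                   // toy protocol: "set k v" / "get k"
        String[] parts = line.split(" ");
        if (parts.length == 3 && parts[0].equals("set")) {
          STORE.put(parts[1], parts[2]);
          out.println("STORED");
        } else if (parts.length == 2 && parts[0].equals("get")) {
          out.println(STORE.getOrDefault(parts[1], "NOT_FOUND"));
        } else {
          out.println("ERROR");
        }
      }
    } catch (Exception e) {
      // client disconnected or protocol error; resources closed by try-with-resources
    }
  }
}
```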
RDMA-Memcached Performance (FDR Interconnect)
• Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR)
• Memcached Get latency
  – 4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us
  – 2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us
  – Latency reduced by nearly 20x
• Memcached throughput (4 bytes)
  – 4080 clients: OSU-IB 556 Kops/sec; IPoIB 233 Kops/sec
  – Nearly 2x improvement in throughput
[Figure: Get latency (us) vs. message size (1 B - 4 KB) and throughput (thousands of transactions per second) vs. number of clients (16-4080), OSU-IB (FDR) vs. IPoIB (FDR)]
Micro-benchmark Evaluation for OLDP Workloads
• Illustration with a Read-Cache-Read access pattern using a modified mysqlslap load-testing tool
• Memcached-RDMA can
  – improve query latency by up to 66% over IPoIB (32 Gbps)
  – improve throughput by up to 69% over IPoIB (32 Gbps)
[Figure: latency (sec) and throughput (Kq/s) vs. number of clients (64-400), Memcached-IPoIB (32Gbps) vs. Memcached-RDMA (32Gbps); latency reduced by 66%]
D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS '15
Performance Benefits of Hybrid Memcached (Memory + SSD) on SDSC Gordon
• ohb_memlat & ohb_memthr latency & throughput micro-benchmarks
• Memcached-RDMA can
  – improve query latency by up to 70% over IPoIB (32 Gbps)
  – improve throughput by up to 2x over IPoIB (32 Gbps)
  – add no overhead from the hybrid mode when all data fits in memory
[Figure: average latency (us) vs. message size (bytes) and throughput (million trans/sec) vs. number of clients (64-1024) for IPoIB (32Gbps), RDMA-Mem (32Gbps), and RDMA-Hybrid (32Gbps); 2x throughput]
Performance Evaluation on IB FDR + SATA/NVMe SSDs
• Memcached latency test with Zipf distribution; server with 1 GB memory; 32 KB key-value pair size; total size of data accessed is 1 GB (when data fits in memory) and 1.5 GB (when data does not fit in memory)
• When data fits in memory: RDMA-Mem/Hybrid gives a 5x improvement over IPoIB-Mem
• When data does not fit in memory: RDMA-Hybrid gives 2x-2.5x improvement over IPoIB/RDMA-Mem
[Figure: Set/Get latency (us) broken down into slab allocation (SSD write), cache check+load (SSD read), cache update, server response, client wait, and miss penalty, for IPoIB-Mem, RDMA-Mem, RDMA-Hybrid-SATA, and RDMA-Hybrid-NVMe, with data fitting and not fitting in memory]
Accelerating Hybrid Memcached with RDMA, Non-blocking Extensions and SSDs
• RDMA-accelerated communication for Memcached Get/Set
• Hybrid 'RAM+SSD' slab management for higher data retention
• Non-blocking API extensions
  – memcached_(iset/iget/bset/bget/test/wait)
  – Achieve near in-memory speeds while hiding bottlenecks of network and SSD I/O
  – Ability to exploit communication/computation overlap
  – Optional buffer re-use guarantees
• Adaptive slab manager with different I/O schemes for higher throughput
[Figure: client with blocking and non-blocking API flows through the libMemcached library and the RDMA-enhanced communication library; server with the RDMA-enhanced communication library over a hybrid slab manager (RAM+SSD)]
D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, IPDPS, May 2016
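The proposed iset/iget/bset/bget extensions are libMemcached-level C APIs and are not reproduced here. As a rough analogue of the communication/computation overlap they enable, the sketch below uses the spymemcached Java client's asynchronous get (an assumption for illustration) to issue a request, keep computing, and collect the result only when needed.

```java
// Rough analogue of a non-blocking get: issue the request, overlap it with local
// computation, then wait for the result. Uses spymemcached's asyncGet (assumed
// available); the OSU iget/bget C extensions themselves are not shown here.
import java.util.concurrent.TimeUnit;

import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;
import net.spy.memcached.internal.GetFuture;

public class OverlapGet {
  public static void main(String[] args) throws Exception {
    MemcachedClient mc = new MemcachedClient(AddrUtil.getAddresses("cachehost:11211"));
    mc.set("sensor:1", 300, "42.0").get();               // seed a value (blocking, setup only)

    GetFuture<Object> pending = mc.asyncGet("sensor:1"); // non-blocking: request is in flight

    double local = 0;                                    // overlap: useful computation proceeds
    for (int i = 1; i <= 1_000_000; i++) {
      local += 1.0 / i;
    }

    Object value = pending.get(1, TimeUnit.SECONDS);     // wait only when the result is needed
    System.out.println("computed " + local + ", fetched " + value);
    mc.shutdown();
  }
}
```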
Performance Evaluation with Non-Blocking Memcached API
• When data does not fit in memory: non-blocking Memcached Set/Get API extensions can achieve
  – >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
  – >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid
• When data fits in memory: non-blocking extensions perform similarly to RDMA-Mem/RDMA-Hybrid and give >3.6x improvement over IPoIB-Mem
[Figure: Set/Get average latency (us) broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory), for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b]
H = Hybrid Memcached over SATA SSD; Opt = Adaptive slab manager; Block = Default blocking API; NonB-i = Non-blocking iset/iget API; NonB-b = Non-blocking bset/bget API w/ buffer re-use guarantee
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)
Accelerating I/O Performance of Big Data Analytics through an RDMA-based Key-Value Store
• Design features
  – Memcached-based burst-buffer system
    • Hides the latency of parallel file system access
    • Reads from local storage and Memcached
  – Data locality achieved by writing data to local storage
  – Different approaches of integration with the parallel file system to guarantee fault tolerance
[Figure: Map/Reduce tasks in the application issue I/O through an I/O forwarding module to the DataNode's local disk (data locality), the Memcached-based burst-buffer system, and Lustre (fault tolerance)]
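Schematically, the burst-buffer read path is a cache-aside loop in front of the parallel file system. The sketch below illustrates that idea using the spymemcached client and the Hadoop FileSystem API purely as stand-ins (an assumption for illustration, not the published design); host, paths, block size, and TTL are placeholders.

```java
// Schematic burst-buffer read path: try the key-value layer first, fall back to the
// parallel file system on a miss, then cache the block. Stand-in APIs and placeholders only.
import java.net.URI;

import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BurstBufferRead {
  private static final int BLOCK_SIZE = 512 * 1024;          // 512 KB cached blocks
  private static final int TTL_SECONDS = 600;

  public static byte[] readBlock(MemcachedClient kv, FileSystem pfs,
                                 String file, long blockIndex) throws Exception {
    String key = file + "#" + blockIndex;
    byte[] cached = (byte[]) kv.get(key);                    // fast path: burst-buffer hit
    if (cached != null) {
      return cached;
    }
    byte[] block = new byte[BLOCK_SIZE];                     // slow path: parallel file system
    try (FSDataInputStream in = pfs.open(new Path(file))) {
      in.readFully(blockIndex * BLOCK_SIZE, block);          // positioned read of one block
    }
    kv.set(key, TTL_SECONDS, block);                         // stage for subsequent readers
    return block;
  }

  public static void main(String[] args) throws Exception {
    MemcachedClient kv = new MemcachedClient(AddrUtil.getAddresses("bbhost:11211"));
    FileSystem pfs = FileSystem.get(URI.create("file:///lustre/project"), new Configuration());
    byte[] block = readBlock(kv, pfs, "file:///lustre/project/input.dat", 0);
    System.out.println("read " + block.length + " bytes");
    kv.shutdown();
  }
}
```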
Evaluation with PUMA Workloads
• Gains on OSU RI with our approach (Mem-bb) on 24 nodes
  – SequenceCount: 34.5% over Lustre, 40% over HDFS
  – RankedInvertedIndex: 27.3% over Lustre, 48.3% over HDFS
  – HistogramRating: 17% over Lustre, 7% over HDFS
[Figure: execution time (s) for SeqCount, RankedInvIndex, and HistoRating with HDFS (32Gbps), Lustre (32Gbps), and Mem-bb (32Gbps); 40%, 48.3%, and 17% gains]
N. S. Islam, D. Shankar, X. Lu, M. W. Rahman, and D. K. Panda, Accelerating I/O Performance of Big Data Analytics with RDMA-based Key-Value Store, ICPP '15, September 2015
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)
Are the Current Benchmarks Sufficient for Big Data?
• The current benchmarks provide some performance behavior
• However, they do not provide any information to the designer/developer on:
  – What is happening at the lower layer?
  – Where are the benefits coming from?
  – Which design is leading to benefits or bottlenecks?
  – Which component in the design needs to be changed, and what will be its impact?
  – Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?
Challenges in Benchmarking of RDMA-based Designs
[Figure: the same layered stack as before – applications, Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached), programming models (Sockets) vs. other RDMA protocols, the communication and I/O library, and the underlying networking, storage, and commodity system architectures – annotated with "current benchmarks" at the upper layers, "no benchmarks" at the RDMA-protocol layer, and the open question of how to correlate the two]
OSU MPI Micro-Benchmarks (OMB) Suite
• A comprehensive suite of benchmarks to
  – Compare performance of different MPI libraries on various networks and systems
  – Validate low-level functionalities
  – Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
  – MPI-2 one-sided
  – Collectives
  – GPU-aware data movement
  – OpenSHMEM (point-to-point and collectives)
  – UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack
Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next-Generation Big Data Systems and Applications
[Figure: the layered stack again, now annotated with applications-level benchmarks at the top and micro-benchmarks at the communication and I/O library layer, connected in an iterative refinement loop down to RDMA protocols, networking, storage, and commodity system architectures]
OSU HiBD Benchmarks (OHB)
• HDFS benchmarks
  – Sequential Write Latency (SWL) Benchmark
  – Sequential Read Latency (SRL) Benchmark
  – Random Read Latency (RRL) Benchmark
  – Sequential Write Throughput (SWT) Benchmark
  – Sequential Read Throughput (SRT) Benchmark
• Memcached benchmarks
  – Get Benchmark
  – Set Benchmark
  – Mixed Get/Set Benchmark
• Available as a part of OHB 0.8
• MapReduce and RPC micro-benchmarks: to be released
N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012
D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5, 2014
X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013
On-going and Future Plans of the OSU High-Performance Big Data (HiBD) Project
• Upcoming releases of RDMA-enhanced packages will support
  – HBase
  – Impala
• Upcoming releases of the OSU HiBD Micro-Benchmarks (OHB) will support
  – MapReduce
  – RPC
• Advanced designs with upper-level changes and optimizations
  – Memcached with non-blocking API
  – HDFS + Memcached-based burst buffer
Concluding Remarks
• Discussed challenges in accelerating Hadoop, Spark and Memcached
• Presented initial designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, Spark, and Memcached
• Results are promising
• Many other open issues still need to be solved
• Will enable the Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
Funding Acknowledgments
• Funding support by [sponsor logos]
• Equipment support by [vendor logos]
Personnel Acknowledgments
• Current Students: A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
• Past Students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
• Past Research Scientist: S. Sur
• Current Research Scientists: H. Subramoni, X. Lu
• Current Senior Research Associate: K. Hamidouche
• Current Post-Docs: J. Lin, D. Banerjee
• Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
• Current Programmer: J. Perkins
• Past Programmers: D. Bureddy
• Current Research Specialist: M. Arnold
Second International Workshop on High-Performance Big Data Computing (HPBDC)
• HPBDC 2016 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2016), Chicago, Illinois, USA, May 27th, 2016
• Keynote Talk: Dr. Chaitanya Baru, Senior Advisor for Data Science, National Science Foundation (NSF); Distinguished Scientist, San Diego Supercomputer Center (SDSC)
• Panel Moderator: Jianfeng Zhan (ICT/CAS)
• Panel Topic: Merge or Split: Mutual Influence between Big Data and HPC Techniques
• Six regular research papers and two short research papers
• http://web.cse.ohio-state.edu/~luxi/hpbdc2016
• HPBDC 2015 was held in conjunction with ICDCS '15: http://web.cse.ohio-state.edu/~luxi/hpbdc2015
Thank You!
panda@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
