Big Data Meets HPC – Exploiting HPC Technologies for Accelerating Big Data Processing
Keynote Talk at HPCAC-Switzerland (Mar 2016)
by
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Introduction to Big Data Applications and Analytics
• Big Data has become one of the most important elements of business analytics
• Provides groundbreaking opportunities for enterprise information management and decision making
• The amount of data is exploding; companies are capturing and digitizing more information than ever
• The rate of information growth appears to be exceeding Moore's Law
4V Characteristics of Big Data
• Commonly accepted 3V's of Big Data: Volume, Velocity, Variety
  – Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf
• 4/5V's of Big Data – 3V + Veracity, Value
Courtesy: http://api.ning.com/files/tRHkwQN7s-Xz5cxylXG004GLGJdjoPd6bVfVBwvgu*F5MwDDUCiHHdmBW-JTEz0cfJjGurJucBMTkIUNdL3jcZT8IPfNWfN9/dv1.jpg
Data Generation in Internet Services and Applications
• Webpages (content, graph)
• Clicks (ad, page, social)
• Users (OpenID, FB Connect, etc.)
• e-mails (Hotmail, Y!Mail, Gmail, etc.)
• Photos, Movies (Flickr, YouTube, Video, etc.)
• Cookies / tracking info (see Ghostery)
• Installed apps (Android Market, App Store, etc.)
• Location (Latitude, Loopt, Foursquare, Google Now, etc.)
• User-generated content (Wikipedia & co, etc.)
• Ads (display, text, DoubleClick, Yahoo, etc.)
• Comments (Discuss, Facebook, etc.)
• Reviews (Yelp, Y!Local, etc.)
• Social connections (LinkedIn, Facebook, etc.)
• Purchase decisions (Netflix, Amazon, etc.)
• Instant messages (YIM, Skype, Gtalk, etc.)
• Search terms (Google, Bing, etc.)
• News articles (BBC, NYTimes, Y!News, etc.)
• Blog posts (Tumblr, Wordpress, etc.)
• Microblogs (Twitter, Jaiku, Meme, etc.)
• Link sharing (Facebook, Delicious, Buzz, etc.)
Number of apps in the Apple App Store, Android Market, Blackberry, and Windows Phone (2013)
• Android Market: ~1200K
• Apple App Store: ~1000K
Courtesy: http://dazeinfo.com/2014/07/10/apple-inc-aapl-ios-google-inc-goog-android-growth-mobile-ecosystem-2014/
Velocity of Big Data – How Much Data Is Generated Every Minute on the Internet?
• The global Internet population grew 18.5% from 2013 to 2015 and now represents 3.2 billion people.
Courtesy: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/
Not Only in Internet Services – Big Data in Scientific Domains
• Scientific data management, analysis, and visualization
• Application examples
  – Climate modeling
  – Combustion
  – Fusion
  – Astrophysics
  – Bioinformatics
• Data-intensive tasks
  – Run large-scale simulations on supercomputers
  – Dump data to parallel storage systems
  – Collect experimental / observational data
  – Move experimental / observational data to analysis sites
  – Visual analytics – help understand data visually
Typical Solutions or Architectures for Big Data Analytics
• Hadoop: http://hadoop.apache.org
  – The most popular framework for Big Data analytics
  – HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.
• Spark: http://spark-project.org
  – Provides primitives for in-memory cluster computing; jobs can load data into memory and query it repeatedly
• Storm: http://storm-project.net
  – A distributed real-time computation system for real-time analytics, online machine learning, continuous computation, etc.
• S4: http://incubator.apache.org/s4
  – A distributed system for processing continuous unbounded streams of data
• GraphLab: http://graphlab.org
  – Consists of a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API
• Web 2.0: RDBMS + Memcached (http://memcached.org)
  – Memcached: a high-performance, distributed memory object caching system
Big Data Processing with Hadoop Components
• Major components included in this tutorial:
  – MapReduce (Batch)
  – HBase (Query)
  – HDFS (Storage)
  – RPC (Inter-process communication)
• The underlying Hadoop Distributed File System (HDFS) is used by both MapReduce and HBase
• The model scales, but the large amount of communication during intermediate phases can be further optimized
[Figure: Hadoop framework stack – User Applications over MapReduce and HBase, which sit on HDFS and Hadoop Common (RPC)]
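As an aside for readers less familiar with the MapReduce component named above, the sketch below is a minimal word-count job written against the stock Apache Hadoop 2.x Java API. It is only an illustration of the batch model (map, shuffle, reduce); the RDMA-enhanced designs discussed later in this deck keep this user-facing API unchanged, so application code of this shape is assumed to run unmodified on top of them.

```java
// Minimal Hadoop 2.x MapReduce word-count job (stock Apache API; illustration only).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // map output is shuffled to reducers
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```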
Spark Architecture Overview
• An in-memory data-processing framework
  – Iterative machine learning jobs
  – Interactive data analytics
  – Scala-based implementation
  – Standalone, YARN, Mesos
• Scalable and communication-intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs
  – Sockets-based communication
[Figure: Driver (SparkContext) and Master coordinating Worker nodes over HDFS, with ZooKeeper]
http://spark.apache.org
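To make the "wide dependency / shuffle" point above concrete, here is a minimal sketch using the stock Spark 1.x Java API (input and output paths are placeholders). The reduceByKey step repartitions the RDD by key, which is exactly the MapReduce-like shuffle traffic that the RDMA-based Spark designs later in this deck target.

```java
// Minimal Spark 1.x (Java API) job whose reduceByKey step triggers a shuffle
// across wide RDD dependencies. Paths are placeholders.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ShuffleSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("ShuffleSketch");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input");   // narrow: one partition per block
    JavaRDD<String> words = lines.flatMap(l -> Arrays.asList(l.split(" ")));
    JavaPairRDD<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));

    // Wide dependency: every output partition may read from every input partition,
    // so map output is repartitioned (shuffled) over the network before reduction.
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

    counts.saveAsTextFile("hdfs:///tmp/output");
    sc.stop();
  }
}
```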
Memcached Architecture
• Distributed caching layer
  – Allows aggregation of spare memory from multiple nodes
  – General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network-intensive
[Figure: clients and Memcached servers, each with main memory, CPUs, SSD, and HDD, connected by high-performance networks]
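The "cache database queries" usage pattern is the classic cache-aside loop sketched below. It uses the spymemcached Java client purely as an assumed illustration (the HiBD RDMA-Memcached package described later targets the C libMemcached API); host name, TTL, and loadFromDatabase() are placeholders. Every lookup costs at least one network round trip, which is why typical deployments are so network-intensive.

```java
// Cache-aside lookup: check Memcached first, fall back to the database on a miss,
// then populate the cache. Host, TTL, and loadFromDatabase() are placeholders.
import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;

public class CacheAside {
  private static final int TTL_SECONDS = 300;     // illustrative expiry

  public static String lookup(MemcachedClient mc, String key) throws Exception {
    Object cached = mc.get(key);                  // one network round trip to a cache server
    if (cached != null) {
      return (String) cached;                     // cache hit
    }
    String value = loadFromDatabase(key);         // miss penalty: query the backend DB
    mc.set(key, TTL_SECONDS, value);              // populate the cache for later readers
    return value;
  }

  private static String loadFromDatabase(String key) {
    return "value-for-" + key;                    // stand-in for a real MySQL query
  }

  public static void main(String[] args) throws Exception {
    MemcachedClient mc = new MemcachedClient(AddrUtil.getAddresses("cachehost:11211"));
    System.out.println(lookup(mc, "user:42"));
    mc.shutdown();
  }
}
```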
Data Management and Processing on Modern Clusters
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
  – Front-end data accessing and serving (online)
    • Memcached + DB (e.g., MySQL), HBase
  – Back-end data analytics (offline)
    • HDFS, MapReduce, Spark
[Figure: Internet-facing front-end tier (web servers, Memcached + DB (MySQL), NoSQL DB (HBase)) for data accessing and serving, feeding a back-end tier running HDFS, MapReduce, and Spark data analytics apps/jobs]
Open Standard InfiniBand Networking Technology
• Introduced in Oct 2000
• High-performance data transfer
  – Inter-processor communication and I/O
  – Low latency (<1.0 microsec), high bandwidth (up to 12.5 GBytes/sec -> 100 Gbps), and low CPU utilization (5-10%)
• Multiple operations
  – Send/Recv
  – RDMA Read/Write
  – Atomic operations (very unique)
    • Enable high-performance and scalable implementations of distributed locks, semaphores, and collective communication operations
• Leading to big changes in designing
  – HPC clusters
  – File systems
  – Cloud computing systems
  – Grid computing systems
Interconnects and Protocols in the OpenFabrics Stack
[Figure: protocol stack options from the Application / Middleware interface down to adapter and switch]
• Sockets interface
  – TCP/IP over 1/10/40/100 GigE (Ethernet adapter and switch)
  – TCP/IP with hardware offload over 10/40 GigE-TOE
  – IPoIB over InfiniBand adapter and switch
  – RSockets over InfiniBand adapter and switch
  – SDP (RDMA) over InfiniBand adapter and switch
• Verbs interface
  – User-space iWARP over Ethernet switch
  – User-space RoCE (RDMA) over Ethernet switch
  – Native IB (RDMA) over InfiniBand adapter and switch
How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?
• What are the major bottlenecks in current Big Data processing middleware (e.g., Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?
Bring HPC and Big Data processing into a "convergent trajectory"!
Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Figure: layered challenge stack]
• Applications
• Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached)
• Programming models (Sockets) – other protocols? Upper-level changes?
• Communication and I/O library: point-to-point communication, threaded models and synchronization, QoS, fault tolerance, I/O and file systems, virtualization, benchmarks
• Networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs); commodity computing system architectures (multi- and many-core architectures and accelerators); storage technologies (HDD, SSD, and NVMe-SSD)
Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?
• Sockets are not designed for high performance
  – Stream semantics are often a mismatch for upper layers
  – Zero-copy is not available for non-blocking sockets
• Current design: Application -> Sockets -> 1/10/40/100 GigE network
• Our approach: Application -> OSU design -> Verbs interface -> 10/40/100 GigE or InfiniBand
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
  – HDFS and Memcached micro-benchmarks
• RDMA for Apache HBase and Impala (upcoming)
• Available for InfiniBand and RoCE
• http://hibd.cse.ohio-state.edu
• User base: 155 organizations from 20 countries
• More than 15,500 downloads from the project site
RDMA for Apache Hadoop 2.x Distribution
• High-performance design of Hadoop over RDMA-enabled interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
  – Enhanced HDFS with in-memory and heterogeneous storage
  – High-performance design of MapReduce over Lustre
  – Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
  – Easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.9
  – Based on Apache Hadoop 2.7.1
  – Compliant with Apache Hadoop 2.7.1, HDP 2.3.0.0 and CDH 5.6.0 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • Different file systems with disks and SSDs, and Lustre
  – http://hibd.cse.ohio-state.edu
RDMA for Apache Spark Distribution
• High-performance design of Spark over RDMA-enabled interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Spark
  – RDMA-based data shuffle and SEDA-based shuffle architecture
  – Non-blocking and chunk-based data transfer
  – Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
  – Based on Apache Spark 1.5.1
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • RAM disks, SSDs, and HDD
  – http://hibd.cse.ohio-state.edu
RDMA for Memcached Distribution
• High-performance design of Memcached over RDMA-enabled interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for the Memcached and libMemcached components
  – High-performance design of SSD-assisted hybrid memory
  – Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.4
  – Based on Memcached 1.4.24 and libMemcached 1.0.18
  – Compliant with libMemcached APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • SSD
  – http://hibd.cse.ohio-state.edu
OSU HiBD Micro-Benchmark (OHB) Suite – HDFS & Memcached
• Micro-benchmarks for the Hadoop Distributed File System (HDFS)
  – Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency (RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT) Benchmark
  – Support benchmarking of
    • Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH) HDFS
• Micro-benchmarks for Memcached
  – Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark
• Current release: 0.8
• http://hibd.cse.ohio-state.edu
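To give a feel for what a sequential-write-latency measurement of this kind does, the sketch below times a simple sequential write through the public HDFS FileSystem API. It is explicitly not the OHB implementation; file size, write granularity, and path are illustrative only.

```java
// Sketch of an HDFS sequential-write timing loop using the public FileSystem API.
// NOT the OHB benchmark code; sizes and paths are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSeqWriteLatency {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/benchmarks/ohb-sketch/seq-write.dat");

    byte[] buffer = new byte[4 * 1024];                // 4 KB writes
    long totalBytes = 64L * 1024 * 1024;               // 64 MB file

    long start = System.nanoTime();
    try (FSDataOutputStream out = fs.create(path, true)) {
      for (long written = 0; written < totalBytes; written += buffer.length) {
        out.write(buffer);                             // sequential writes through the DFS client
      }
      out.hsync();                                     // ensure data reaches the DataNodes
    }
    long elapsedUs = (System.nanoTime() - start) / 1000;

    System.out.printf("Wrote %d MB in %d us (%.2f MB/s)%n",
        totalBytes >> 20, elapsedUs, (totalBytes / 1048576.0) / (elapsedUs / 1e6));
    fs.delete(path, false);
    fs.close();
  }
}
```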
Different Modes of RDMA for Apache Hadoop 2.x
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation, for better fault tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance, in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)
Design Overview of HDFS with RDMA
• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges the Java-based HDFS with the communication library written in native code
• Design features
  – RDMA-based HDFS write
  – RDMA-based HDFS replication
  – Parallel replication support
  – On-demand connection setup
  – InfiniBand/RoCE support
[Figure: Applications -> HDFS (write / others) -> Java socket interface over 1/10/40/100 GigE, IPoIB network, or OSU design via Java Native Interface (JNI) -> Verbs over RDMA-capable networks (IB, iWARP, RoCE)]
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H)
• Design features
  – Three modes
    • Default (HHH)
    • In-Memory (HHH-M)
    • Lustre-Integrated (HHH-L)
  – Policies to efficiently utilize the heterogeneous storage devices
    • RAM, SSD, HDD, Lustre
  – Eviction/promotion based on data usage pattern
  – Hybrid replication
  – Lustre-Integrated mode:
    • Lustre-based fault tolerance
[Figure: Applications -> Triple-H data placement policies (hybrid replication, eviction/promotion) over heterogeneous storage: RAM disk, SSD, HDD, Lustre]
N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015
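Triple-H's placement and eviction policies are its own design and are not reproduced here. As a rough, stock-Hadoop analogue of steering data toward faster or slower media, the sketch below uses the standard HDFS storage-policy API available since Hadoop 2.6 (policy names per the Apache documentation; directory paths are placeholders). It is offered only to make the "heterogeneous storage" idea tangible, not as Triple-H itself.

```java
// Analogue only: stock HDFS storage policies used to hint heterogeneous placement.
// Triple-H's RAM/SSD/HDD/Lustre policies are a separate, richer design.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StoragePolicyHint {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("expects an HDFS deployment");
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    Path hotDir = new Path("/data/intermediate");     // placeholder paths
    Path coldDir = new Path("/data/archive");
    dfs.mkdirs(hotDir);
    dfs.mkdirs(coldDir);

    // LAZY_PERSIST writes one replica to RAM disk first; COLD keeps blocks on archival disks.
    dfs.setStoragePolicy(hotDir, "LAZY_PERSIST");
    dfs.setStoragePolicy(coldDir, "COLD");

    System.out.println("storage policies set on " + hotDir + " and " + coldDir);
    dfs.close();
  }
}
```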
Design Overview of MapReduce with RDMA
• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges the Java-based MapReduce with the communication library written in native code
• Design features
  – RDMA-based shuffle
  – Prefetching and caching of map output
  – Efficient shuffle algorithms
  – In-memory merge
  – On-demand shuffle adjustment
  – Advanced overlapping
    • map, shuffle, and merge
    • shuffle, merge, and reduce
  – On-demand connection setup
  – InfiniBand/RoCE support
[Figure: Applications -> MapReduce (Job Tracker, Task Tracker, Map, Reduce) -> Java socket interface over 1/10/40/100 GigE, IPoIB network, or OSU design via Java Native Interface (JNI) -> Verbs over RDMA-capable networks (IB, iWARP, RoCE)]
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014
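The shuffle phase these features target is the same copy/merge stage that stock Hadoop exposes through job configuration. The snippet below tunes two standard shuffle knobs of stock Hadoop 2.x (values are illustrative); enabling the RDMA shuffle plugin itself goes through the RDMA-Hadoop package's own settings, which are not shown here.

```java
// Stock Hadoop 2.x shuffle-related knobs (illustrative values). The RDMA shuffle
// design replaces the transport underneath this same copy/merge stage.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuning {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 256);               // map-side sort buffer (MB)
    conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 20);  // concurrent fetchers per reducer
    conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f); // heap share for shuffle buffers
    return Job.getInstance(conf, "shuffle-tuning-sketch");
  }

  public static void main(String[] args) throws Exception {
    Job job = configure();
    System.out.println("sort buffer: "
        + job.getConfiguration().get("mapreduce.task.io.sort.mb") + " MB");
  }
}
```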
Advanced Overlapping among Different Phases
• A hybrid approach to achieve the maximum possible overlapping in MapReduce across all phases, compared to other approaches
  – Efficient shuffle algorithms
  – Dynamic and efficient switching
  – On-demand shuffle adjustment
[Figure: timelines for the default architecture, enhanced overlapping with in-memory merge, and advanced hybrid overlapping]
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014
Design Overview of Hadoop RPC with RDMA
• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges the Java-based RPC with the communication library written in native code
• Design features
  – JVM-bypassed buffer management
  – RDMA or send/recv based adaptive communication
  – Intelligent buffer allocation and adjustment for serialization
  – On-demand connection setup
  – InfiniBand/RoCE support
[Figure: Applications -> Hadoop RPC (default design or OSU design) -> Java socket interface over 1/10/40/100 GigE, IPoIB network, or Java Native Interface (JNI) -> Verbs over RDMA-capable networks (IB, iWARP, RoCE)]
X. Lu, N. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance Design of Hadoop RPC with RDMA over InfiniBand, Int'l Conference on Parallel Processing (ICPP '13), October 2013
Performance Benefits – RandomWriter & TeraGen on TACC Stampede
• Cluster with 32 nodes and a total of 128 maps
• RandomWriter
  – 3-4x improvement over IPoIB for 80-120 GB file size
• TeraGen
  – 4-5x improvement over IPoIB for 80-120 GB file size
[Figure: execution time (s) vs. data size (80, 100, 120 GB) for RandomWriter (reduced by 3x) and TeraGen (reduced by 4x), IPoIB (FDR) vs. OSU-IB (FDR)]
Performance Benefits – Sort & TeraSort on TACC Stampede
• Sort with a single HDD per node (cluster with 32 nodes, 128 maps and 64 reduces)
  – 40-52% improvement over IPoIB for 80-120 GB data
• TeraSort with a single HDD per node (cluster with 32 nodes, 128 maps and 57 reduces)
  – 42-44% improvement over IPoIB for 80-120 GB data
[Figure: execution time (s) vs. data size (80, 100, 120 GB), IPoIB (FDR) vs. OSU-IB (FDR); Sort reduced by 52%, TeraSort reduced by 44%]
Evaluation of HHH and HHH-L with Applications
• MR-MSPolygraph on OSU RI with 1,000 maps
  – HHH-L reduces the execution time by 79% over Lustre, 30% over HDFS
• CloudBurst on TACC Stampede
  – With HHH: 19% improvement over HDFS (HDFS (FDR): 60.24 s; HHH (FDR): 48.3 s)
[Figure: MR-MSPolygraph execution time (s) vs. concurrent maps per host (4, 6, 8) for HDFS, Lustre, and HHH-L; reduced by 79%]
Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio)
• For 200 GB TeraGen on 32 nodes
  – Spark-TeraGen: HHH has 2.4x improvement over Tachyon; 2.3x over HDFS-IPoIB (QDR)
  – Spark-TeraSort: HHH has 25.2% improvement over Tachyon; 17% over HDFS-IPoIB (QDR)
[Figure: TeraGen and TeraSort execution time (s) vs. cluster size : data size (8:50, 16:100, 32:200 GB) for IPoIB (QDR), Tachyon, and OSU-IB (QDR); reduced by 2.4x (TeraGen) and 25.2% (TeraSort)]
N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData '15, October 2015
Design Overview of Shuffle Strategies for MapReduce over Lustre
• Design features
  – Two shuffle approaches
    • Lustre read based shuffle
    • RDMA based shuffle
  – Hybrid shuffle algorithm to take benefit from both shuffle approaches
  – Dynamically adapts to the better shuffle approach for each shuffle request, based on profiling values for each Lustre read operation
  – In-memory merge and overlapping of different phases are kept similar to the RDMA-enhanced MapReduce design
[Figure: maps write intermediate data to the Lustre intermediate data directory; reducers obtain it via Lustre read or RDMA, followed by in-memory merge/sort and reduce]
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015
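The core of the hybrid algorithm is a per-request choice driven by observed Lustre read cost. The sketch below is only a schematic illustration of that kind of adaptive switch; the window size, threshold comparison, and all names are invented for illustration and do not reproduce the published algorithm.

```java
// Schematic illustration of an adaptive shuffle-approach switch driven by profiled
// Lustre read latency. Invented thresholds and names; not the published design.
import java.util.ArrayDeque;
import java.util.Deque;

public class HybridShuffleChooser {
  enum Approach { LUSTRE_READ, RDMA }

  private static final int WINDOW = 32;              // recent samples to average
  private final Deque<Double> lustreReadMicros = new ArrayDeque<>();
  private final double rdmaEstimateMicros;           // assumed cost of an RDMA fetch

  public HybridShuffleChooser(double rdmaEstimateMicros) {
    this.rdmaEstimateMicros = rdmaEstimateMicros;
  }

  /** Record the measured latency of one Lustre read performed during shuffle. */
  public synchronized void recordLustreRead(double micros) {
    if (lustreReadMicros.size() == WINDOW) {
      lustreReadMicros.removeFirst();
    }
    lustreReadMicros.addLast(micros);
  }

  /** Pick the cheaper approach for the next shuffle request based on recent profiling. */
  public synchronized Approach choose() {
    if (lustreReadMicros.isEmpty()) {
      return Approach.LUSTRE_READ;                   // no profiling data yet: default choice
    }
    double avg = lustreReadMicros.stream()
        .mapToDouble(Double::doubleValue).average().orElse(0);
    return avg <= rdmaEstimateMicros ? Approach.LUSTRE_READ : Approach.RDMA;
  }
}
```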
Performance Improvement of MapReduce over Lustre on TACC Stampede
• Local disk is used as the intermediate data directory
• For 500 GB Sort on 64 nodes
  – 44% improvement over IPoIB (FDR)
• For 640 GB Sort on 128 nodes
  – 48% improvement over IPoIB (FDR)
[Figure: job execution time (sec) for 300-500 GB sorts and for scaling runs from 20 GB on 4 nodes to 640 GB on 128 nodes, IPoIB (FDR) vs. OSU-IB (FDR); reduced by 44% and 48%]
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014
Case Study – Performance Improvement of MapReduce over Lustre on SDSC Gordon
• Lustre is used as the intermediate data directory
• For 80 GB Sort on 8 nodes
  – 34% improvement over IPoIB (QDR)
• For 120 GB TeraSort on 16 nodes
  – 25% improvement over IPoIB (QDR)
[Figure: job execution time (sec) vs. data size for IPoIB (QDR), OSU-Lustre-Read (QDR), OSU-RDMA-IB (QDR), and OSU-Hybrid-IB (QDR); reduced by 34% (Sort) and 25% (TeraSort)]
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)
HBase-RDMA Design Overview
• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges the Java-based HBase with the communication library written in native code
[Figure: Applications -> HBase -> Java socket interface over 1/10/40/100 GigE, IPoIB networks, or OSU-IB design via Java Native Interface (JNI) -> IB Verbs over RDMA-capable networks (IB, iWARP, RoCE)]
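For context, the sketch below shows a minimal Put and Get with the standard HBase 1.x Java client; these are the same client-side operations that the YCSB read-write workloads on the following slides exercise. Table and column names are placeholders, and since the RDMA design sits below this API, application code of this shape is assumed to stay unchanged.

```java
// Minimal HBase 1.x client Put/Get against a placeholder table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();        // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("usertable"))) {  // placeholder table

      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field0"), Bytes.toBytes("value0"));
      table.put(put);                                         // write path (Put latency)

      Get get = new Get(Bytes.toBytes("user1"));
      Result result = table.get(get);                         // read path (Get latency)
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field0"));
      System.out.println("read back: " + Bytes.toString(value));
    }
  }
}
```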
HBase – YCSB Read-Write Workload
• HBase Get latency (QDR, 10GigE)
  – 64 clients: 2.0 ms; 128 clients: 3.5 ms
  – 42% improvement over IPoIB for 128 clients
• HBase Put latency (QDR, 10GigE)
  – 64 clients: 1.9 ms; 128 clients: 3.5 ms
  – 40% improvement over IPoIB for 128 clients
[Figure: read and write latency (us) vs. number of clients (8-128) for 10GigE, IPoIB (QDR), and OSU-IB (QDR)]
J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12
HBase – YCSB Get Latency and Throughput on SDSC Comet
• HBase Get average latency (FDR)
  – 4 client threads: 38 us
  – 59% improvement over IPoIB for 4 client threads
• HBase Get total throughput
  – 4 client threads: 102 Kops/sec
  – 2.4x improvement over IPoIB for 4 client threads
[Figure: Get average latency (ms) and total throughput (Kops/sec) vs. number of client threads (1-4), IPoIB (FDR) vs. OSU-IB (FDR); 59% and 2.4x]
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)
Design Overview of Spark with RDMA
• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges the Scala-based Spark with the communication library written in native code
• Design features
  – RDMA based shuffle
  – SEDA-based plugins
  – Dynamic connection management and sharing
  – Non-blocking data transfer
  – Off-JVM-heap buffer management
  – InfiniBand/RoCE support
[Figure: Spark applications (Scala/Java/Python) -> Spark (Scala/Java) tasks, BlockManager and BlockFetcherIterator -> Java NIO shuffle server/fetcher (default), Netty shuffle server/fetcher (optional), or RDMA shuffle server/fetcher (plug-in) -> Java sockets over 1/10 Gig Ethernet/IPoIB (QDR/FDR) network, or the RDMA-based shuffle engine (Java/JNI) over native InfiniBand (QDR/FDR)]
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014
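The SortBy/GroupBy results on the next slide exercise exactly the shuffle path that these server/fetcher plug-ins replace. The sketch below shows shuffle-heavy operations of that kind with the stock Spark 1.x Java API; data sizes and partition counts are placeholders, and this is not the HiBD benchmark code.

```java
// Shuffle-heavy Spark operations of the kind exercised by the SortBy/GroupBy tests.
// Sizes and partition counts are placeholders; not the benchmark code.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ShuffleOps {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("ShuffleOps"));

    // Synthetic (key, value) data spread over many partitions.
    List<Tuple2<Integer, Integer>> data = new ArrayList<>();
    Random rnd = new Random(42);
    for (int i = 0; i < 1_000_000; i++) {
      data.add(new Tuple2<>(rnd.nextInt(10_000), i));
    }
    JavaPairRDD<Integer, Integer> pairs = sc.parallelizePairs(data, 128);

    // Both operations repartition data by key, pushing map output through the
    // shuffle server/fetcher layer shown above (NIO, Netty, or the RDMA plug-in).
    long sortedCount = pairs.sortByKey().count();
    long groupedKeys = pairs.groupByKey().count();

    System.out.println("sorted records: " + sortedCount + ", grouped keys: " + groupedKeys);
    sc.stop();
  }
}
```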
Performance Evaluation on SDSC Comet – SortBy/GroupBy
• InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536M 1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
  – SortBy: total time reduced by up to 80% over IPoIB (56 Gbps)
  – GroupBy: total time reduced by up to 57% over IPoIB (56 Gbps)
[Figure: SortByTest and GroupByTest total time (sec) vs. data size (64, 128, 256 GB) on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; 80% and 57% reductions]
Performance Evaluation on SDSC Comet – HiBench Sort/TeraSort
• InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536M 1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
  – Sort: total time reduced by 38% over IPoIB (56 Gbps)
  – TeraSort: total time reduced by 15% over IPoIB (56 Gbps)
[Figure: Sort and TeraSort total time (sec) vs. data size (64, 128, 256 GB) on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; 38% and 15% reductions]
Performance Evaluation on SDSC Comet – HiBench PageRank
• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
  – 32 nodes / 768 cores: total time reduced by 37% over IPoIB (56 Gbps)
  – 64 nodes / 1536 cores: total time reduced by 43% over IPoIB (56 Gbps)
[Figure: PageRank total time (sec) vs. data size (Huge, BigData, Gigantic) on 32 worker nodes / 768 cores and 64 worker nodes / 1536 cores, IPoIB vs. RDMA; 37% and 43% reductions]
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)
Memcached-RDMA Design
• Server and client perform a negotiation protocol
  – The master thread assigns clients to an appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
• All other Memcached data structures are shared among RDMA and sockets worker threads
• The Memcached server can serve both socket and verbs clients simultaneously
• Memcached applications need not be modified; the verbs interface is used if available
[Figure: sockets and RDMA clients connect to a master thread, which hands them off to sockets or verbs worker threads that share the memory slabs and item data]
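The sketch below is a generic illustration of the master-thread/worker-thread hand-off pattern described above, using plain TCP sockets, a fixed worker pool, and a shared in-memory map as a stand-in for the shared slab/item structures. It is not the OSU implementation; the verbs-based worker threads and the negotiation protocol are not shown, and the port and toy text protocol are placeholders.

```java
// Generic master/worker hand-off: an accept loop (master) assigns each client to a
// worker thread that serves it against shared data. Illustrates the dispatch pattern only.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MasterWorkerServer {
  // Shared key-value store, analogous to the shared slab/item structures.
  private static final Map<String, String> STORE = new ConcurrentHashMap<>();

  public static void main(String[] args) throws Exception {
    ExecutorService workers = Executors.newFixedThreadPool(4);   // worker threads
    try (ServerSocket master = new ServerSocket(11300)) {        // placeholder port
      while (true) {
        Socket client = master.accept();                         // master: accept the client
        workers.submit(() -> serve(client));                     // bind client to a worker
      }
    }
  }

  private static void serve(Socket client) {
    try (Socket c = client;
         BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()));
         PrintWriter out = new PrintWriter(c.getOutputStream(), true)) {
      String line;
      while ((line = in.readLine()) != null) {                   // toy protocol: "set k v" / "get k"
        String[] parts = line.split(" ");
        if (parts.length == 3 && parts[0].equals("set")) {
          STORE.put(parts[1], parts[2]);
          out.println("STORED");
        } else if (parts.length == 2 && parts[0].equals("get")) {
          out.println(STORE.getOrDefault(parts[1], "NOT_FOUND"));
        } else {
          out.println("ERROR");
        }
      }
    } catch (Exception e) {
      // client disconnected or protocol error; resources closed by try-with-resources
    }
  }
}
```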
RDMA-Memcached Performance (FDR Interconnect)
• Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR)
• Memcached Get latency
  – 4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us
  – 2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us
  – Latency reduced by nearly 20x
• Memcached throughput (4 bytes)
  – 4080 clients: OSU-IB 556 Kops/sec; IPoIB 233 Kops/sec
  – Nearly 2x improvement in throughput
[Figure: Get latency (us) vs. message size (1 B - 4 KB) and throughput (thousands of transactions per second) vs. number of clients (16-4080), OSU-IB (FDR) vs. IPoIB (FDR)]
Micro-benchmark Evaluation for OLDP Workloads
• Illustration with a Read-Cache-Read access pattern using a modified mysqlslap load-testing tool
• Memcached-RDMA can
  – improve query latency by up to 66% over IPoIB (32 Gbps)
  – improve throughput by up to 69% over IPoIB (32 Gbps)
[Figure: latency (sec) and throughput (Kq/s) vs. number of clients (64-400), Memcached-IPoIB (32Gbps) vs. Memcached-RDMA (32Gbps); latency reduced by 66%]
D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS '15
Performance Benefits of Hybrid Memcached (Memory + SSD) on SDSC Gordon
• ohb_memlat & ohb_memthr latency & throughput micro-benchmarks
• Memcached-RDMA can
  – improve query latency by up to 70% over IPoIB (32 Gbps)
  – improve throughput by up to 2x over IPoIB (32 Gbps)
  – add no overhead from the hybrid mode when all data fits in memory
[Figure: average latency (us) vs. message size (bytes) and throughput (million trans/sec) vs. number of clients (64-1024) for IPoIB (32Gbps), RDMA-Mem (32Gbps), and RDMA-Hybrid (32Gbps); 2x throughput]
Performance Evaluation on IB FDR + SATA/NVMe SSDs
• Memcached latency test with Zipf distribution; server with 1 GB memory; 32 KB key-value pair size; total size of data accessed is 1 GB (when data fits in memory) and 1.5 GB (when data does not fit in memory)
• When data fits in memory: RDMA-Mem/Hybrid gives a 5x improvement over IPoIB-Mem
• When data does not fit in memory: RDMA-Hybrid gives 2x-2.5x improvement over IPoIB/RDMA-Mem
[Figure: Set/Get latency (us) broken down into slab allocation (SSD write), cache check+load (SSD read), cache update, server response, client wait, and miss penalty, for IPoIB-Mem, RDMA-Mem, RDMA-Hybrid-SATA, and RDMA-Hybrid-NVMe, with data fitting and not fitting in memory]
Accelerating Hybrid Memcached with RDMA, Non-blocking Extensions and SSDs
• RDMA-accelerated communication for Memcached Get/Set
• Hybrid 'RAM+SSD' slab management for higher data retention
• Non-blocking API extensions
  – memcached_(iset/iget/bset/bget/test/wait)
  – Achieve near in-memory speeds while hiding bottlenecks of network and SSD I/O
  – Ability to exploit communication/computation overlap
  – Optional buffer re-use guarantees
• Adaptive slab manager with different I/O schemes for higher throughput
[Figure: client with blocking and non-blocking API flows through the libMemcached library and the RDMA-enhanced communication library; server with the RDMA-enhanced communication library over a hybrid slab manager (RAM+SSD)]
D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, IPDPS, May 2016
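The proposed iset/iget/bset/bget extensions are libMemcached-level C APIs and are not reproduced here. As a rough analogue of the communication/computation overlap they enable, the sketch below uses the spymemcached Java client's asynchronous get (an assumption for illustration) to issue a request, keep computing, and collect the result only when needed.

```java
// Rough analogue of a non-blocking get: issue the request, overlap it with local
// computation, then wait for the result. Uses spymemcached's asyncGet (assumed
// available); the OSU iget/bget C extensions themselves are not shown here.
import java.util.concurrent.TimeUnit;

import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;
import net.spy.memcached.internal.GetFuture;

public class OverlapGet {
  public static void main(String[] args) throws Exception {
    MemcachedClient mc = new MemcachedClient(AddrUtil.getAddresses("cachehost:11211"));
    mc.set("sensor:1", 300, "42.0").get();               // seed a value (blocking, setup only)

    GetFuture<Object> pending = mc.asyncGet("sensor:1"); // non-blocking: request is in flight

    double local = 0;                                    // overlap: useful computation proceeds
    for (int i = 1; i <= 1_000_000; i++) {
      local += 1.0 / i;
    }

    Object value = pending.get(1, TimeUnit.SECONDS);     // wait only when the result is needed
    System.out.println("computed " + local + ", fetched " + value);
    mc.shutdown();
  }
}
```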
Performance Evaluation with Non-Blocking Memcached API
• When data does not fit in memory: non-blocking Memcached Set/Get API extensions can achieve
  – >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
  – >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid
• When data fits in memory: non-blocking extensions perform similarly to RDMA-Mem/RDMA-Hybrid and give >3.6x improvement over IPoIB-Mem
[Figure: Set/Get average latency (us) broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory), for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b]
H = Hybrid Memcached over SATA SSD; Opt = Adaptive slab manager; Block = Default blocking API; NonB-i = Non-blocking iset/iget API; NonB-b = Non-blocking bset/bget API w/ buffer re-use guarantee
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)
Accelerating I/O Performance of Big Data Analytics through an RDMA-based Key-Value Store
• Design features
  – Memcached-based burst-buffer system
    • Hides the latency of parallel file system access
    • Reads from local storage and Memcached
  – Data locality achieved by writing data to local storage
  – Different approaches of integration with the parallel file system to guarantee fault tolerance
[Figure: Map/Reduce tasks in the application issue I/O through an I/O forwarding module to the DataNode's local disk (data locality), the Memcached-based burst-buffer system, and Lustre (fault tolerance)]
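Schematically, the burst-buffer read path is a cache-aside loop in front of the parallel file system. The sketch below illustrates that idea using the spymemcached client and the Hadoop FileSystem API purely as stand-ins (an assumption for illustration, not the published design); host, paths, block size, and TTL are placeholders.

```java
// Schematic burst-buffer read path: try the key-value layer first, fall back to the
// parallel file system on a miss, then cache the block. Stand-in APIs and placeholders only.
import java.net.URI;

import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BurstBufferRead {
  private static final int BLOCK_SIZE = 512 * 1024;          // 512 KB cached blocks
  private static final int TTL_SECONDS = 600;

  public static byte[] readBlock(MemcachedClient kv, FileSystem pfs,
                                 String file, long blockIndex) throws Exception {
    String key = file + "#" + blockIndex;
    byte[] cached = (byte[]) kv.get(key);                    // fast path: burst-buffer hit
    if (cached != null) {
      return cached;
    }
    byte[] block = new byte[BLOCK_SIZE];                     // slow path: parallel file system
    try (FSDataInputStream in = pfs.open(new Path(file))) {
      in.readFully(blockIndex * BLOCK_SIZE, block);          // positioned read of one block
    }
    kv.set(key, TTL_SECONDS, block);                         // stage for subsequent readers
    return block;
  }

  public static void main(String[] args) throws Exception {
    MemcachedClient kv = new MemcachedClient(AddrUtil.getAddresses("bbhost:11211"));
    FileSystem pfs = FileSystem.get(URI.create("file:///lustre/project"), new Configuration());
    byte[] block = readBlock(kv, pfs, "file:///lustre/project/input.dat", 0);
    System.out.println("read " + block.length + " bytes");
    kv.shutdown();
  }
}
```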
Evaluation with PUMA Workloads
• Gains on OSU RI with our approach (Mem-bb) on 24 nodes
  – SequenceCount: 34.5% over Lustre, 40% over HDFS
  – RankedInvertedIndex: 27.3% over Lustre, 48.3% over HDFS
  – HistogramRating: 17% over Lustre, 7% over HDFS
[Figure: execution time (s) for SeqCount, RankedInvIndex, and HistoRating with HDFS (32Gbps), Lustre (32Gbps), and Mem-bb (32Gbps); 40%, 48.3%, and 17% gains]
N. S. Islam, D. Shankar, X. Lu, M. W. Rahman, and D. K. Panda, Accelerating I/O Performance of Big Data Analytics with RDMA-based Key-Value Store, ICPP '15, September 2015
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – Memcached (Basic, Hybrid and Non-blocking APIs)
  – HDFS + Memcached-based Burst Buffer
  – OSU HiBD Benchmarks (OHB)
Are the Current Benchmarks Sufficient for Big Data?
• The current benchmarks provide some performance behavior
• However, they do not provide any information to the designer/developer on:
  – What is happening at the lower layer?
  – Where are the benefits coming from?
  – Which design is leading to benefits or bottlenecks?
  – Which component in the design needs to be changed, and what will be its impact?
  – Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?
Challenges in Benchmarking of RDMA-based Designs
[Figure: the same layered stack as before – applications, Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached), programming models (Sockets) vs. other RDMA protocols, the communication and I/O library, and the underlying networking, storage, and commodity system architectures – annotated with "current benchmarks" at the upper layers, "no benchmarks" at the RDMA-protocol layer, and the open question of how to correlate the two]
OSU MPI Micro-Benchmarks (OMB) Suite
• A comprehensive suite of benchmarks to
  – Compare performance of different MPI libraries on various networks and systems
  – Validate low-level functionalities
  – Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
  – MPI-2 one-sided
  – Collectives
  – GPU-aware data movement
  – OpenSHMEM (point-to-point and collectives)
  – UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack
Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next-Generation Big Data Systems and Applications
[Figure: the layered stack again, now annotated with applications-level benchmarks at the top and micro-benchmarks at the communication and I/O library layer, connected in an iterative refinement loop down to RDMA protocols, networking, storage, and commodity system architectures]
OSU HiBD Benchmarks (OHB)
• HDFS benchmarks
  – Sequential Write Latency (SWL) Benchmark
  – Sequential Read Latency (SRL) Benchmark
  – Random Read Latency (RRL) Benchmark
  – Sequential Write Throughput (SWT) Benchmark
  – Sequential Read Throughput (SRT) Benchmark
• Memcached benchmarks
  – Get Benchmark
  – Set Benchmark
  – Mixed Get/Set Benchmark
• Available as a part of OHB 0.8
• MapReduce and RPC micro-benchmarks: to be released
N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012
D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5, 2014
X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013
On-going and Future Plans of the OSU High-Performance Big Data (HiBD) Project
• Upcoming releases of RDMA-enhanced packages will support
  – HBase
  – Impala
• Upcoming releases of the OSU HiBD Micro-Benchmarks (OHB) will support
  – MapReduce
  – RPC
• Advanced designs with upper-level changes and optimizations
  – Memcached with non-blocking API
  – HDFS + Memcached-based burst buffer
Concluding Remarks
• Discussed challenges in accelerating Hadoop, Spark and Memcached
• Presented initial designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, Spark, and Memcached
• Results are promising
• Many other open issues still need to be solved
• Will enable the Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
Funding Acknowledgments
• Funding support by [sponsor logos]
• Equipment support by [vendor logos]
Personnel Acknowledgments
• Current Students: A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
• Past Students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
• Past Research Scientist: S. Sur
• Current Research Scientists: H. Subramoni, X. Lu
• Current Senior Research Associate: K. Hamidouche
• Current Post-Docs: J. Lin, D. Banerjee
• Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
• Current Programmer: J. Perkins
• Past Programmers: D. Bureddy
• Current Research Specialist: M. Arnold
Second International Workshop on High-Performance Big Data Computing (HPBDC)
• HPBDC 2016 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2016), Chicago, Illinois, USA, May 27th, 2016
• Keynote Talk: Dr. Chaitanya Baru, Senior Advisor for Data Science, National Science Foundation (NSF); Distinguished Scientist, San Diego Supercomputer Center (SDSC)
• Panel Moderator: Jianfeng Zhan (ICT/CAS)
• Panel Topic: Merge or Split: Mutual Influence between Big Data and HPC Techniques
• Six regular research papers and two short research papers
• http://web.cse.ohio-state.edu/~luxi/hpbdc2016
• HPBDC 2015 was held in conjunction with ICDCS '15: http://web.cse.ohio-state.edu/~luxi/hpbdc2015
Thank You!
panda@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
