In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode
Sandeep Deshmukh, Ashish Tadose
Project Status: Apex in Apache Incubation Stage
Mentor List
• Ted Dunning: Apache Member, MapR
• Alan Gates: Apache Member, Hortonworks
• Taylor Goetz: Apache Member, Hortonworks
• Justin Mclean: Apache Member, Class Software
• Chris Nauroth: Apache Member, Hortonworks
• Hitesh Shah: Apache Member, Hortonworks
Apache Apex (Incubating) Committer List
• Open-sourced in July 2015
• Over 50 committers already, and growing
Apex Platform Overview
[Figure: platform overview diagram; label: Enterprise Edition]
Application Programming Model: Directed Acyclic Graph (DAG)
• A Stream is a sequence of data tuples
• An Operator takes one or more input streams, performs computations and emits one or more output streams
• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
[Figure: operators connected by streams of tuples, ending in an output stream]
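The programming model above can be illustrated with a small, language-neutral sketch. It is not the Apex API: `Operator`, `Dag`, `add_operator`, `add_stream` and `run` are hypothetical names chosen for this example, and the sketch runs in a single thread with no partitioning, so it only shows how tuples flow along streams between operators.

```python
from collections import defaultdict, deque


class Operator:
    """Hypothetical stand-in for an operator: takes one tuple, emits zero or more."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn  # business logic: tuple -> iterable of output tuples

    def process(self, tup):
        return list(self.fn(tup))


class Dag:
    """Operators are the nodes; streams are the directed edges between them."""

    def __init__(self):
        self.operators = {}
        self.streams = defaultdict(list)  # upstream name -> downstream names

    def add_operator(self, op):
        self.operators[op.name] = op
        return op

    def add_stream(self, source, sink):
        self.streams[source.name].append(sink.name)

    def run(self, entry, tuples):
        """Push tuples through the graph; collect what the terminal operators emit."""
        results = defaultdict(list)
        queue = deque((entry.name, t) for t in tuples)
        while queue:
            name, tup = queue.popleft()
            for out in self.operators[name].process(tup):
                downstream = self.streams.get(name, [])
                if not downstream:
                    results[name].append(out)  # no outgoing stream: final output
                for nxt in downstream:
                    queue.append((nxt, out))
        return dict(results)


# Three operators chained by two streams: parse -> keep_long -> upper
dag = Dag()
parse = dag.add_operator(Operator("parse", lambda line: line.split()))
keep_long = dag.add_operator(Operator("keep_long", lambda w: [w] if len(w) > 3 else []))
upper = dag.add_operator(Operator("upper", lambda w: [w.upper()]))
dag.add_stream(parse, keep_long)
dag.add_stream(keep_long, upper)

output = dag.run(parse, ["in memory data grid", "apex and geode"])
print(output)
```

In a real deployment each operator would have several parallel instances, which this single-threaded sketch deliberately leaves out.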
Apex Component Overview
[Figure: component diagram. Labels: Hadoop Edge Node with DT RTS Management Server (CLI, REST API); Hadoop Nodes running YARN Containers, one hosting the Apex App Master and others hosting Streaming Containers with threads (Thread1 to Thread-N) running operators (Op1, Op2, Op3). Legend: Part of Community Edition.]
Apex Features
• Native Hadoop integration
• Partitioning and scaling out
• Advanced windowing support
• Stateful fault tolerance
• Processing semantics
• Compute locality
• Dynamic updates
Apache Apex-Malhar
IMC Components in Apex
• Processing data in motion
• Preventing data loss: buffer server
• In-memory data stores for querying data
Why In-Memory Computing?
[Figure: typical latencies]
Why In-Memory Computing?
"In-memory computing will have a long-term, disruptive impact by radically changing users' expectations, application design principles, products' architectures and vendors' strategies."
"RAM is the new disk, disk the new tape."
"In-memory computing is the future of computing. It offers massive benefits, not only in TCO reduction but across all four value dimensions: performance, process innovation, simplification and flexibility."
In-Memory Data Grid (IMDG)
What are IMDGs?
• IMDGs host data in memory and distribute it across a cluster of commodity servers
• The main access patterns are key/value access, MapReduce, various forms of HPC-like processing, and limited distributed querying and indexing capabilities
Why are they important?
• Performance - using RAM is faster than using disk
• Extremely high availability of data - by keeping it in memory and in a highly distributed cluster
• Data structure - using a key/value store allows greater flexibility for the application developer; the object store is similar in interface to a typical concurrent hash map
• Scalable data partitioning
• Transactional ACID support
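The key/value partitioning and availability points above can be sketched in a few lines. This is a toy model under stated assumptions, not how Geode or any particular IMDG places data: `MiniGrid`, its round-robin choice of owner nodes and the `redundant_copies` parameter are all hypothetical, and each "server" is just a dictionary in the same process.

```python
import zlib


class MiniGrid:
    """Toy in-memory data grid: each key lives on a primary node plus redundant copies."""

    def __init__(self, node_names, redundant_copies=1):
        self.order = list(node_names)
        self.nodes = {name: {} for name in self.order}  # node name -> local key/value map
        self.live = set(self.order)
        self.copies = redundant_copies

    def _owners(self, key):
        # A stable hash, so the same key always maps to the same nodes.
        start = zlib.crc32(str(key).encode("utf-8")) % len(self.order)
        count = min(1 + self.copies, len(self.order))
        return [self.order[(start + i) % len(self.order)] for i in range(count)]

    def put(self, key, value):
        for name in self._owners(key):
            if name in self.live:
                self.nodes[name][key] = value

    def get(self, key):
        # Read from the first live owner that still holds the entry.
        for name in self._owners(key):
            if name in self.live and key in self.nodes[name]:
                return self.nodes[name][key]
        return None

    def fail(self, name):
        """Simulate losing a server together with everything it held in memory."""
        self.live.discard(name)
        self.nodes[name].clear()


grid = MiniGrid(["server-1", "server-2", "server-3"], redundant_copies=1)
grid.put("order:42", {"total": 99})

primary = grid._owners("order:42")[0]
grid.fail(primary)            # the primary copy is gone...
print(grid.get("order:42"))   # ...but the redundant copy still answers the read
```

The interface is intentionally close to a hash map (`put` and `get`), which is the access pattern the slide describes; rebalancing after a failure is left out.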
High-Level Architecture - Geode
Geode Features
Core Features
• Linear scalability & latency-minimizing data distribution
• Performance-optimized persistence - high availability & durability
• Configurable consistency - region types { partitioned, replicated & local }
• Distributed transactions
• Cluster resilience & failover
Advanced Features
• Server Function Execution - send computation to data
• Asynchronous Events - deliver events to a receiver without impacting the write path
• Continuous Queries & Client subscriptions - useful for refreshing the client cache
Application Patterns
• Caching for speed and scale – Read-through, Write-through, Write-behind
• Geode as the OLTP system of record – Data in-memory for low latency, on disk for durability
• Parallel compute engine
• Real-time analytics
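The three caching patterns named above differ in when the backing store is touched. The sketch below is a generic illustration and does not use Geode's own cache loader or writer interfaces; `BackingStore`, `Cache`, the `write_behind` flag and `flush` are hypothetical names, and the "database" is a dictionary that counts its reads and writes.

```python
class BackingStore:
    """Hypothetical slow system of record, e.g. a database."""

    def __init__(self):
        self.rows = {}
        self.reads = 0
        self.writes = 0

    def load(self, key):
        self.reads += 1
        return self.rows.get(key)

    def save(self, key, value):
        self.writes += 1
        self.rows[key] = value


class Cache:
    def __init__(self, store, write_behind=False):
        self.store = store
        self.write_behind = write_behind
        self.data = {}
        self.pending = {}  # dirty entries waiting to be flushed (write-behind only)

    def get(self, key):
        # Read-through: on a miss, load from the store and keep a copy.
        if key not in self.data:
            value = self.store.load(key)
            if value is None:
                return None
            self.data[key] = value
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        if self.write_behind:
            self.pending[key] = value    # deferred; repeated writes to a key coalesce
        else:
            self.store.save(key, value)  # write-through: store updated synchronously

    def flush(self):
        """Write-behind: push queued updates to the store, return how many were written."""
        flushed = len(self.pending)
        for key, value in self.pending.items():
            self.store.save(key, value)
        self.pending.clear()
        return flushed


store = BackingStore()
store.rows["a"] = 1

write_through = Cache(store)
write_through.get("a")
write_through.get("a")            # second read is served from memory
write_through.put("b", 2)         # one synchronous write to the store

write_behind = Cache(store, write_behind=True)
write_behind.put("c", 1)
write_behind.put("c", 2)          # overwrites the queued value for "c"
write_behind.put("d", 3)
writes_before_flush = store.writes
flushed = write_behind.flush()
print(store.reads, writes_before_flush, flushed, store.writes)
```

A production write-behind queue would flush asynchronously on a timer or batch size and would need to survive a crash; here `flush` is called by hand to keep the example deterministic.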
Geode Features: reads with consistent latency and CPU
• Scaled from 256 clients and 2 servers to 1280 clients and 10 servers
• Partitioned region with redundancy and 1K data size
[Figure: speedup, latency (ms) and CPU % plotted against the number of server hosts (2 to 10)]
Geode 3.5-4.5X Faster Than Cassandra for YCSB
Roadmap
• HDFS persistence
• Off-heap storage
• Lucene indexes
• Spark integration
• Cloud Foundry service
…and other ideas from the Geode community!
Streaming meets In-Memory Data Grid
Apex + Geode
Apex operator checkpointing in Geode store
• Better latency for checkpoint operations than HDFS checkpointing
• Makes the Apex DAG a complete in-memory pipeline
• https://issues.apache.org/jira/browse/APEXCORE-283
Write Apex data streams to Geode store
• Apex output operator implementation which writes data to a Geode region
• Use cases
  • Ingest streaming data in Geode for further processing
  • Store data processed by the Apex pipeline in Geode to serve user queries
• https://malhar.atlassian.net/projects/MLHR/issues/MLHR-1942
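The checkpointing idea can be sketched as saving a snapshot of operator state under an (operator id, window id) key and restoring the latest snapshot after a failure. This is a conceptual illustration only, not the implementation tracked in APEXCORE-283: `CountingOperator` and `KeyValueCheckpointStore` are hypothetical, a plain dictionary stands in for a Geode region, and `copy.deepcopy` stands in for real serialization.

```python
import copy


class CountingOperator:
    """Hypothetical stateful operator: counts the words it has seen."""

    def __init__(self):
        self.counts = {}

    def process(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1


class KeyValueCheckpointStore:
    """Keeps operator snapshots in a key/value map (a dict stands in for a region)."""

    def __init__(self):
        self.region = {}

    def save(self, operator_id, window_id, operator):
        # deepcopy stands in for serializing the operator's state.
        self.region[(operator_id, window_id)] = copy.deepcopy(operator)

    def latest_window(self, operator_id):
        windows = [w for (op, w) in self.region if op == operator_id]
        return max(windows) if windows else None

    def load(self, operator_id, window_id):
        return copy.deepcopy(self.region[(operator_id, window_id)])


windows = {1: ["a", "b"], 2: ["a"], 3: ["c", "a"]}
store = KeyValueCheckpointStore()
op = CountingOperator()

for window_id in sorted(windows):
    for word in windows[window_id]:
        op.process(word)
    if window_id % 2 == 0:  # checkpoint every second window
        store.save("counter", window_id, op)

# Simulate losing the operator after window 3: restore, then replay what came later.
last = store.latest_window("counter")
restored = store.load("counter", last)
for window_id in sorted(w for w in windows if w > last):
    for word in windows[window_id]:
        restored.process(word)

print(last, restored.counts)
```

Recovery here relies on being able to replay the tuples that arrived after the last checkpoint, which is the role the slides assign to the buffer server.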
Questions?
Thank You
