In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Sandeep Deshmukh, Ashish Tadose

Project Status
• Apex is in the Apache incubation stage
Mentor List
• Ted Dunning: Apache Member, MapR
• Alan Gates: Apache Member, Hortonworks
• Taylor Goetz: Apache Member, Hortonworks
• Justin Mclean: Apache Member, Class Software
• Chris Nauroth: Apache Member, Hortonworks
• Hitesh Shah: Apache Member, Hortonworks

Apache Apex (Incubating): Committer List
• Open-sourced in July 2015
• Over 50 committers already, and growing

Apex Platform Overview Enterprise Edition

Application Programming Model: Directed Acyclic Graph (DAG)
• A Stream is a sequence of data tuples
• An Operator takes one or more input streams, performs computations and emits one or more output streams
• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
Diagram: a chain of operators connected by streams of tuples, ending in an output stream
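The programming model above can be illustrated with a small sketch. This is plain Python rather than the Apex Java API: streams are modelled as iterators of tuples, operators as generator functions, and the "DAG" is a simple linear chain. All names (`parse`, `keep_large`, `to_upper`, `run_pipeline`) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical model: a stream is an iterator of tuples, and an operator
# is a function that consumes one stream and emits another.
Record = Tuple[str, int]


def parse(lines: Iterable[str]) -> Iterator[Record]:
    """Operator 1: turn 'word, count' text lines into (word, count) tuples."""
    for line in lines:
        word, count = line.split(",")
        yield (word.strip(), int(count))


def keep_large(records: Iterable[Record], threshold: int = 2) -> Iterator[Record]:
    """Operator 2: drop tuples whose count is below the threshold."""
    for word, count in records:
        if count >= threshold:
            yield (word, count)


def to_upper(records: Iterable[Record]) -> Iterator[Record]:
    """Operator 3: upper-case the word in each tuple."""
    for word, count in records:
        yield (word.upper(), count)


def run_pipeline(source: Iterable, operators: List[Callable]) -> list:
    """Wire the operators into a linear chain and drain the output stream."""
    stream = source
    for operator in operators:
        stream = operator(stream)
    return list(stream)


result = run_pipeline(["apex, 3", "geode, 1", "yarn, 5"],
                      [parse, keep_large, to_upper])
print(result)
```

Each operator only sees the tuples arriving on its input stream, which is what lets a real engine run many single-threaded instances of the same operator in parallel.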

Apex Component Overview
Diagram:
• Hadoop Edge Node: DT RTS Management Server (interfaces shown: CLI, REST API)
• Hadoop Node: YARN Container running the Apex App Master
• Hadoop Nodes: YARN Containers running Streaming Containers, each with threads (Thread1 to Thread-N) hosting operators Op1, Op2, Op3
• Legend: Part of Community Edition

Apex Features
• Native Hadoop Integration
• Partitioning and Scaling out
• Advanced Windowing Support
• Stateful Fault-tolerance
• Processing Semantics
• Compute Locality
• Dynamic updates

IMC Components in Apex
• Processing data in motion
• Preventing data loss: buffer server
• In-memory data stores for querying data

Why In-Memory Computing? Typical latencies

Why In-Memory Computing?
• In-memory computing will have a long-term, disruptive impact by radically changing users' expectations, application design principles, product architectures and vendors' strategies
• RAM is the new disk, disk the new tape
• In-memory computing is the future of computing. It offers massive benefits, not only in TCO reduction but across all four value dimensions: performance, process innovation, simplification and flexibility

In-Memory Data Grid (IMDG)
What are IMDGs?
• IMDGs host data in memory and distribute it across a cluster of commodity servers
• The main access patterns are key/value access, MapReduce, various forms of HPC-like processing, and limited distributed querying and indexing capabilities
Why are they important?
• Performance: using RAM is faster than using disk
• Extremely high availability of data: achieved by keeping it in memory and in a highly distributed cluster
• Data structure: using a key/value store allows greater flexibility for the application developer. The object store is similar in interface to a typical concurrent hash map
• Scalable data partitioning
• Transactional ACID support
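To make the partitioning and availability points concrete, here is a minimal single-process sketch of a hash-partitioned key/value store with redundant copies. It is not how Geode or any particular IMDG is implemented; the class `PartitionedStore`, its methods and the `failed` parameter are hypothetical, and each "node" is just a Python dict.

```python
import zlib


class PartitionedStore:
    """Toy key/value store that spreads keys over nodes with redundant copies."""

    def __init__(self, num_nodes: int, redundancy: int = 1):
        if redundancy >= num_nodes:
            raise ValueError("redundancy must be smaller than the number of nodes")
        # Each dict stands in for the memory of one server in the cluster.
        self.nodes = [dict() for _ in range(num_nodes)]
        self.redundancy = redundancy

    def _owners(self, key: str) -> list:
        """Return the primary node index followed by the redundant copies."""
        # crc32 gives a hash that is stable across runs, unlike built-in hash().
        primary = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return [(primary + i) % len(self.nodes) for i in range(self.redundancy + 1)]

    def put(self, key: str, value) -> None:
        """Write the value to the primary and to every redundant copy."""
        for node in self._owners(key):
            self.nodes[node][key] = value

    def get(self, key: str, failed=()):
        """Read from the first owner that is not in the set of failed nodes."""
        for node in self._owners(key):
            if node not in failed and key in self.nodes[node]:
                return self.nodes[node][key]
        return None


store = PartitionedStore(num_nodes=4, redundancy=1)
store.put("user:1", "alice")
primary = store._owners("user:1")[0]
# The value is still readable when the primary node is treated as failed.
print(store.get("user:1", failed={primary}))
```

The interface is deliberately close to a hash map (`put`/`get`), while the hashing step decides which servers actually hold each entry.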

High Level Architecture - Geode

Geode Features
Core Features
• Linear scalability and latency-minimizing data distribution
• Performance-optimized persistence: high availability and durability
• Configurable consistency: region types { partitioned, replicated & local }
• Distributed transactions
• Cluster resilience and failover
Advanced Features
• Server Function Execution: send computation to data
• Asynchronous Events: deliver events to a receiver without impacting the write path
• Continuous Queries & Client subscriptions: useful for refreshing the client cache

Application Patterns
• Caching for speed and scale: read-through, write-through, write-behind
• Geode as the OLTP system of record: data in memory for low latency, on disk for durability
• Parallel compute engine
• Real-time analytics
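The three caching patterns named above differ only in when the backing store is touched. The sketch below shows them in one small class; it does not use the Geode API, and `PatternCache`, `write_behind`, `pending` and `flush` are hypothetical names, with a plain dict standing in for the backing database.

```python
class PatternCache:
    """Toy cache in front of a backing store (a dict standing in for a database)."""

    def __init__(self, backend: dict, write_behind: bool = False):
        self.backend = backend
        self.data = {}
        self.write_behind = write_behind
        self.pending = []  # writes queued for the backend in write-behind mode

    def get(self, key):
        """Read-through: on a cache miss, load the value from the backend."""
        if key not in self.data and key in self.backend:
            self.data[key] = self.backend[key]
        return self.data.get(key)

    def put(self, key, value) -> None:
        """Write-through updates the backend now; write-behind queues the write."""
        self.data[key] = value
        if self.write_behind:
            self.pending.append((key, value))
        else:
            self.backend[key] = value

    def flush(self) -> None:
        """Apply queued write-behind updates to the backend in order."""
        for key, value in self.pending:
            self.backend[key] = value
        self.pending.clear()


database = {"a": 1}

through = PatternCache(database)
through.get("a")          # read-through load from the backend
through.put("b", 2)       # write-through: the backend is updated immediately

behind = PatternCache(database, write_behind=True)
behind.put("c", 3)        # the backend is not updated yet
behind.flush()            # now the queued write reaches the backend
```

In a real deployment the flush would typically run asynchronously and in batches; here it is an explicit call so the ordering is easy to follow.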

Geode Features: Reads with Consistent Latency and CPU
• Scaled from 256 clients and 2 servers to 1280 clients and 10 servers
• Partitioned region with redundancy and 1K data size
Chart: speedup, latency (ms) and CPU % plotted against the number of server hosts (2 to 10)

Geode 3.5-4.5X Faster Than Cassandra for YCSB

Roadmap
• HDFS persistence
• Off-heap storage
• Lucene indexes
• Spark integration
• Cloud Foundry service
…and other ideas from the Geode community!

Streaming meets In Memory Data Grid

Apex + Geode
Apex Operator check-pointing in Geode store
• Better latency for checkpoint operations than HDFS check-pointing
• Makes the Apex DAG a complete in-memory pipeline
• https://issues.apache.org/jira/browse/APEXCORE-283
Write Apex data streams to Geode store
• Apex output operator implementation which writes data to a Geode region
• Use cases
  • Ingest streaming data into Geode for further processing
  • Store data processed by the Apex pipeline in the Geode store to serve user queries
• https://malhar.atlassian.net/projects/MLHR/issues/MLHR-1942
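The operator check-pointing idea can be sketched as follows. This is an illustration only, not the design tracked in APEXCORE-283 and not the Apex or Geode API: a plain dict stands in for a Geode region, and `CountingOperator`, `process_window`, `checkpoint` and `restore` are hypothetical names.

```python
import copy


class CountingOperator:
    """Toy stateful operator that counts words and checkpoints to a key/value store."""

    def __init__(self, operator_id: str, store: dict):
        self.operator_id = operator_id
        self.store = store      # dict standing in for an in-memory data grid region
        self.counts = {}
        self.window_id = 0

    def process_window(self, window_id: int, words) -> None:
        """Process all tuples of one streaming window and update the state."""
        for word in words:
            self.counts[word] = self.counts.get(word, 0) + 1
        self.window_id = window_id

    def checkpoint(self) -> None:
        """Save a snapshot of the state, keyed by operator id."""
        snapshot = {"window_id": self.window_id, "counts": self.counts}
        self.store[self.operator_id] = copy.deepcopy(snapshot)

    @classmethod
    def restore(cls, operator_id: str, store: dict) -> "CountingOperator":
        """Rebuild an operator from its last checkpoint, if one exists."""
        operator = cls(operator_id, store)
        saved = store.get(operator_id)
        if saved is not None:
            operator.window_id = saved["window_id"]
            operator.counts = copy.deepcopy(saved["counts"])
        return operator


region = {}
op = CountingOperator("op1", region)
op.process_window(1, ["a", "b", "a"])
op.checkpoint()
op.process_window(2, ["a"])   # this work is lost if the container fails now

# After a failure, a new instance resumes from the checkpointed window
# and replays the windows that followed it.
recovered = CountingOperator.restore("op1", region)
recovered.process_window(2, ["a"])
```

Because the snapshot lives in memory rather than on a distributed file system, saving and restoring it avoids disk round trips, which is the latency benefit the slide refers to.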
