Tachyon: An Open Source Memory-Centric Distributed Storage System

Haoyuan Li, Tachyon Nexus (haoyuan@tachyonnexus.com)
September 30, 2015 @ Strata and Hadoop World NYC 2015

Outline
• Open Source
• Introduction to Tachyon
• New Features
• Getting Involved

History
• Started at UC Berkeley AMPLab
  – From summer 2012
  – Same lab produced Apache Spark and Apache Mesos
• Open sourced
  – April 2013
  – Apache License 2.0
  – Latest release: Version 0.7.1 (August 2015)
• Deployed at > 100 companies

Contributors Growth (contributors per release)
• v0.1 (Dec '12): 1
• v0.2 (Apr '13): 3
• v0.3 (Oct '13): 15
• v0.4 (Feb '14): 30
• v0.5 (Jul '14): 46
• v0.6 (Mar '15): 70
• v0.7 (Jul '15): 111

Contributors Growth
• > 150 contributors (a 3x increase since the last Strata NYC)
• > 50 organizations

Contributors Growth
One of the Fastest Growing Big Data Open Source Projects

Thanks to Contributors and Users!

One Tachyon Production Deployment Example
• Baidu (dominant search engine in China, ~50 billion USD market cap)
• Framework: SparkSQL
• Under Storage: Baidu's File System
• Storage Media: MEM + HDD
• 100+ nodes deployment
• 1PB+ managed space
• 30x performance improvement

Tachyon is an Open Source Memory-Centric Distributed Storage System

Performance Trend: Memory is Fast
• RAM throughput increasing exponentially
• Disk throughput increasing slowly
Memory locality is key to interactive response times

Price Trend: Memory is Cheaper (source: jcmit.com)

Missing a Solution for the Storage Layer

A Use Case Example with Spark
• Fast, in-memory data processing framework
  – Keep one in-memory copy inside the JVM
  – Track lineage of operations used to derive data
  – Upon failure, use lineage to recompute data
[Figure: Lineage Tracking across a chain of map, filter, map, join, and reduce operations]
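The lineage idea on this slide can be illustrated with a toy sketch. This is not Spark's or Tachyon's API: the `Dataset` class and its `map`, `filter`, `compute`, and `lose_cache` methods are hypothetical names invented for illustration. Each dataset remembers its parent and the operation used to derive it, keeps at most one in-memory copy, and recomputes that copy from its lineage if the copy is lost.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Dataset:
    """Hypothetical dataset that records how it was derived (its lineage)."""
    name: str
    parent: Optional["Dataset"] = None
    transform: Optional[Callable[[List[int]], List[int]]] = None
    source: Optional[List[int]] = None  # only set for root datasets
    cache: Optional[List[int]] = None   # the single in-memory copy

    def map(self, fn: Callable[[int], int]) -> "Dataset":
        # Record the operation instead of running it immediately.
        return Dataset(f"map({self.name})", parent=self,
                       transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "Dataset":
        return Dataset(f"filter({self.name})", parent=self,
                       transform=lambda rows: [r for r in rows if pred(r)])

    def compute(self) -> List[int]:
        # Use the cached copy if present; otherwise walk the lineage upward.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source or [])
            else:
                self.cache = self.transform(self.parent.compute())
        return self.cache

    def lose_cache(self) -> None:
        # Simulate a failure that wipes the in-memory copy.
        self.cache = None


root = Dataset("input", source=[1, 2, 3, 4, 5, 6])
scaled = root.map(lambda x: x * 10)
result = scaled.filter(lambda x: x > 20)

before = result.compute()
scaled.lose_cache()
result.lose_cache()
after = result.compute()  # rebuilt from lineage, not from a saved replica
```

In this sketch, recovery costs recomputation time rather than a second stored copy, which is the trade-off the slide describes.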

Issue 1
Data sharing is the bottleneck in the analytics pipeline: slow writes to disk
[Figure: Spark Job1 and Spark Job2 each hold blocks 1 and 3 in their own Spark memory block manager; HDFS / Amazon S3 holds blocks 1 to 4. Storage engine and execution engine run in the same process (slow writes).]

Issue 1
Data sharing is the bottleneck in the analytics pipeline: slow writes to disk
[Figure: a Spark job holds blocks 1 and 3 in its Spark memory block manager, alongside a Hadoop MR job on YARN; HDFS / Amazon S3 holds blocks 1 to 4. Storage engine and execution engine run in the same process (slow writes).]

Issue 1 resolved with Tachyon
Memory-speed data sharing among jobs in different frameworks
[Figure: a Spark job and a Hadoop MR job on YARN share blocks 1, 3, and 4 held in Tachyon's memory; HDFS / Amazon S3 on disk holds blocks 1 to 4. Diagram label: execution engine & storage engine same process (fast writes).]

Issue 2
Cache loss when process crashes
[Figure: a Spark task holds blocks 1 and 3 in its Spark memory block manager; HDFS / Amazon S3 holds blocks 1 to 4. Execution engine and storage engine run in the same process.]

Issue 2
Cache loss when process crashes
[Figure: the Spark task crashes while blocks 1 and 3 are in its Spark memory block manager; HDFS / Amazon S3 holds blocks 1 to 4.]

Issue 2
Cache loss when process crashes
[Figure: after the crash, the in-memory blocks are gone; only blocks 1 to 4 in HDFS / Amazon S3 remain.]

Issue 2 resolved with Tachyon
Keep in-memory data safe, even when a job crashes.
[Figure: a Spark task with its Spark memory block manager; blocks 1, 3, and 4 are held in Tachyon's memory, and HDFS / Amazon S3 holds blocks 1 to 4.]

Issue 2 resolved with Tachyon
Keep in-memory data safe, even when a job crashes.
[Figure: the Spark task crashes; blocks 1, 3, and 4 remain in Tachyon's memory, and HDFS / Amazon S3 on disk still holds blocks 1 to 4.]

Issue 3
In-memory Data Duplication & Java Garbage Collection
[Figure: Spark Job1 and Spark Job2 each hold their own copies of blocks 1 and 3 in a Spark memory block manager; HDFS / Amazon S3 holds blocks 1 to 4. Execution engine and storage engine run in the same process (duplication & GC).]

Issue 3 resolved with Tachyon
No in-memory data duplication, much less GC
[Figure: Spark Job1 and Spark Job2 share a single copy of blocks 1, 3, and 4 held in Tachyon's memory; HDFS / Amazon S3 on disk holds blocks 1 to 4.]

Previously Mentioned
• A memory-centric storage architecture
• Push lineage down to storage layer

Tachyon Memory-Centric Architecture


1) Ecosystem: enable new workloads on any storage; work with the framework of your choice.

2) Tachyon runs in production environments, both in the cloud and on premises.

Use Case: Baidu
• Framework: SparkSQL
• Under Storage: Baidu's File System
• Storage Media: MEM + HDD
• 100+ nodes deployment
• 1PB+ managed space
• 30x performance improvement

Use Case: a SaaS Company
• Framework: Impala
• Under Storage: S3
• Storage Media: MEM + SSD
• 15x performance improvement

Use Case: an Oil Company
• Framework: Spark
• Under Storage: GlusterFS
• Storage Media: MEM only
• Analyzing data in traditional storage

Use Case: a SaaS Company
• Framework: Spark
• Under Storage: S3
• Storage Media: SSD only
• Elastic Tachyon deployment

What if data size exceeds memory capacity?

3) Tiered Storage: Tachyon Manages More Than DRAM
[Figure: storage tiers MEM, SSD, HDD; upper tiers are faster, lower tiers have higher capacity]

Configurable Storage Tiers
• MEM only
• MEM + HDD
• SSD only

4) Pluggable Data Management Policy
• Evict stale data to lower tier
• Promote hot data to upper tier
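A minimal sketch of how eviction and promotion across tiers could work, under stated assumptions. It does not reflect Tachyon's actual implementation or configuration: the `TieredStore` class, its methods, the `choose_victim` hook, and the capacities are all hypothetical. The default policy here is least recently used (LRU), chosen only as one plausible example of a pluggable policy.

```python
from collections import OrderedDict


def lru_victim(tier):
    """Default policy (an assumption): evict the least recently used block."""
    return next(iter(tier))


class TieredStore:
    """Toy tiered block store; tier 0 is the fastest, the last tier the largest."""

    def __init__(self, tiers, choose_victim=lru_victim):
        # tiers: list of (name, capacity_in_blocks), ordered fastest first.
        self.names = [name for name, _ in tiers]
        self.capacity = [cap for _, cap in tiers]
        # Each tier maps block_id -> data, least recently used first.
        self.blocks = [OrderedDict() for _ in tiers]
        self.choose_victim = choose_victim  # the pluggable part

    def _insert(self, level, block_id, data):
        if level >= len(self.blocks):
            return  # dropped from the lowest tier in this toy model
        tier = self.blocks[level]
        tier[block_id] = data
        tier.move_to_end(block_id)  # mark as most recently used
        if len(tier) > self.capacity[level]:
            # Evict a stale block to the next lower tier.
            stale_id = self.choose_victim(tier)
            self._insert(level + 1, stale_id, tier.pop(stale_id))

    def write(self, block_id, data):
        for tier in self.blocks:
            tier.pop(block_id, None)  # drop any older copy first
        self._insert(0, block_id, data)

    def read(self, block_id):
        for tier in self.blocks:
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(0, block_id, data)  # promote hot data to the top tier
                return data
        return None

    def tier_of(self, block_id):
        for name, tier in zip(self.names, self.blocks):
            if block_id in tier:
                return name
        return None


store = TieredStore([("MEM", 2), ("SSD", 2), ("HDD", 4)])
for block in ("b1", "b2", "b3"):
    store.write(block, block.upper())
tier_before_read = store.tier_of("b1")  # b1 was pushed out of MEM by b3
store.read("b1")
tier_after_read = store.tier_of("b1")   # reading b1 promoted it back to MEM
```

Passing a different `choose_victim` function swaps the eviction policy without touching the store itself, which is the sense of "pluggable" assumed in this sketch.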

More Features
• 7) Remote Write Support
• 8) Easy deployment with Mesos and YARN
• 9) Initial Security Support
• 10) One Command Cluster Deployment
• 11) Metrics Reporting for Clients, Workers, and Master

12) More Under Storage Support

Memory-Centric Distributed Storage
Welcome to try, contact, and collaborate!
JIRA: New Contributor Tasks

• Team consists of Tachyon creators and top contributors
• Series A ($7.5 million) from Andreessen Horowitz
• Committed to Tachyon Open Source

Strata NYC 2015
• Visit us at booth #P18.
• Check out other Tachyon-related talks.
  – First-ever scalable, distributed deep learning architecture using Spark and Tachyon
    • Christopher Nguyen (Adatao, Inc.), Vu Pham (Adatao, Inc.)
    • 2:05pm–2:45pm Thursday, 10/01/2015
  – Faster time to insight using Spark, Tachyon, and Zeppelin
    • Nirmal Ranganathan (Rackspace Hosting)
    • 2:05pm–2:45pm Thursday, 10/01/2015

• Try Tachyon: http://tachyon-project.org
• Develop Tachyon: https://github.com/amplab/tachyon
• Meet Friends: http://www.meetup.com/Tachyon
• Get News: http://goo.gl/mwB2sX
• Tachyon Nexus: http://www.tachyonnexus.com
• Contact us: haoyuan@tachyonnexus.com
