Tachyon: An Open Source Memory-Centric Distributed Storage System
Haoyuan Li, Tachyon Nexus
haoyuan@tachyonnexus.com
September 30, 2015 @ Strata and Hadoop World NYC 2015
Outline
•  Open Source
•  Introduction to Tachyon
•  New Features
•  Getting Involved
History
•  Started at UC Berkeley AMPLab
   –  From summer 2012
   –  Same lab produced Apache Spark and Apache Mesos
•  Open sourced
   –  April 2013
   –  Apache License 2.0
   –  Latest release: Version 0.7.1 (August 2015)
•  Deployed at > 100 companies
Contributors Growth
Release   Date      Contributors
v0.1      Dec '12   1
v0.2      Apr '13   3
v0.3      Oct '13   15
v0.4      Feb '14   30
v0.5      Jul '14   46
v0.6      Mar '15   70
v0.7      Jul '15   111
Contributors Growth
•  > 150 contributors (a 3x increase since the last Strata NYC)
•  > 50 organizations
One of the fastest-growing big data open source projects
Thanks to Contributors and Users!
One Tachyon Production Deployment Example
•  Baidu (dominant search engine in China, ~ 50 billion USD market cap)
•  Framework: SparkSQL
•  Under Storage: Baidu’s File System
•  Storage Media: MEM + HDD
•  100+ nodes deployment
•  1PB+ managed space
•  30x Performance Improvement
Tachyon is an Open Source Memory-Centric Distributed Storage System
Why Tachyon?
Performance Trend: Memory is Fast
•  RAM throughput increasing exponentially
•  Disk throughput increasing slowly
Memory-locality is key to interactive response times
Price Trend: Memory is Cheaper (source: jcmit.com)
Realized by many…
Is the Problem Solved?
Missing a Solution for the Storage Layer
A Use Case Example with Spark
•  Fast, in-memory data processing framework
   –  Keep one in-memory copy inside JVM
   –  Track lineage of operations used to derive data
   –  Upon failure, use lineage to recompute data
[Diagram: lineage tracking across map, filter, map, join, and reduce operations]
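The lineage idea above can be illustrated with a small, self-contained sketch. This is not Spark's or Tachyon's API. The `Dataset` class, its method names, and the `computations` counter are hypothetical names chosen for illustration. Each dataset records its parent and the operation that produced it, so a lost in-memory copy can be rebuilt by replaying that operation.

```python
from typing import Callable, List, Optional


class Dataset:
    """Toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["Dataset"] = None,
                 op: Optional[Callable[[List[int]], List[int]]] = None):
        self.source = source    # durable input; only set on a root dataset
        self.parent = parent    # dataset this one was derived from
        self.op = op            # operation applied to the parent's data
        self.cache: Optional[List[int]] = None  # in-memory copy, can be lost
        self.computations = 0   # how many times the data was (re)built

    def derive(self, op: Callable[[List[int]], List[int]]) -> "Dataset":
        """Record a new dataset whose lineage is `op` applied to this one."""
        return Dataset(parent=self, op=op)

    def data(self) -> List[int]:
        """Return the cached copy, rebuilding it from lineage if missing."""
        if self.cache is None:
            if self.parent is None:
                if self.source is None:
                    raise ValueError("root dataset needs a source")
                self.cache = list(self.source)
            else:
                self.cache = self.op(self.parent.data())
            self.computations += 1
        return self.cache

    def lose_cache(self) -> None:
        """Simulate a failure that wipes the in-memory copy."""
        self.cache = None


root = Dataset(source=[1, 2, 3, 4])
doubled = root.derive(lambda xs: [x * 2 for x in xs])        # map
large = doubled.derive(lambda xs: [x for x in xs if x > 4])  # filter

first = large.data()      # computed once along the whole chain
large.lose_cache()        # simulated failure
doubled.lose_cache()
recovered = large.data()  # rebuilt by replaying the recorded operations
```

Only the datasets that lost their copy are recomputed. `root` still holds its cached data, so it is read once.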
Issue 1
Data sharing is the bottleneck in the analytics pipeline: slow writes to disk
[Diagram: Spark Job1 and Spark Job2 each keep blocks 1 and 3 in their own Spark memory block manager and share data through HDFS / Amazon S3 (blocks 1-4). Storage engine & execution engine same process (slow writes).]
[Diagram: a Spark job (Spark memory block manager with blocks 1 and 3) and a Hadoop MR job (YARN) share data through HDFS / Amazon S3 (blocks 1-4). Storage engine & execution engine same process (slow writes).]
Issue 1 resolved with Tachyon
Memory-speed data sharing among jobs in different frameworks
[Diagram: a Spark job and a Hadoop MR job (YARN) share blocks 1, 3, and 4 held in memory by Tachyon, on top of HDFS / Amazon S3 (blocks 1-4). Execution engine & storage engine same process (fast writes).]
Issue 2
Cache loss when process crashes
[Diagram, three steps: (1) a Spark task keeps blocks 1 and 3 in the Spark memory block manager, on top of HDFS / Amazon S3 (blocks 1-4); (2) the process crashes; (3) only the blocks in HDFS / Amazon S3 remain. Execution engine & storage engine same process.]
Issue 2 resolved with Tachyon
Keep in-memory data safe, even when a job crashes.
[Diagram, two steps: (1) a Spark task works on blocks 1, 3, and 4 held in memory by Tachyon, on top of HDFS / Amazon S3 (blocks 1-4); (2) after the crash, the in-memory blocks in Tachyon are still there. Execution engine & storage engine same process.]
Issue 3
In-memory data duplication & Java garbage collection
[Diagram: Spark Job1 and Spark Job2 each hold their own copies of blocks 1 and 3 in a Spark memory block manager, on top of HDFS / Amazon S3 (blocks 1-4). Execution engine & storage engine same process (duplication & GC).]
Issue 3 resolved with Tachyon
No in-memory data duplication, much less GC
[Diagram: Spark Job1 and Spark Job2 use blocks 1, 3, and 4 held in memory by Tachyon, on top of HDFS / Amazon S3 (blocks 1-4). Execution engine & storage engine same process (no duplication & GC).]
Previously Mentioned
•  A memory-centric storage architecture
•  Push lineage down to storage layer
Tachyon Memory-Centric Architecture [two diagram slides]
Lineage in Tachyon [diagram slide]
1) Ecosystem: enable new workloads on any storage; work with the framework of your choice.
2) Tachyon runs in production environments, both in the cloud and on premises.
Use Case: Baidu
•  Framework: SparkSQL
•  Under Storage: Baidu’s File System
•  Storage Media: MEM + HDD
•  100+ nodes deployment
•  1PB+ managed space
•  30x Performance Improvement
Use Case: a SaaS Company
•  Framework: Impala
•  Under Storage: S3
•  Storage Media: MEM + SSD
•  15x Performance Improvement
Use Case: an Oil Company
•  Framework: Spark
•  Under Storage: GlusterFS
•  Storage Media: MEM only
•  Analyzing data in traditional storage
Use Case: a SaaS Company
•  Framework: Spark
•  Under Storage: S3
•  Storage Media: SSD only
•  Elastic Tachyon deployment
What if data size exceeds memory capacity?
3) Tiered Storage: Tachyon Manages More Than DRAM
[Diagram: tiers MEM, SSD, HDD; upper tiers are faster, lower tiers have higher capacity]
Configurable Storage Tiers
•  MEM only
•  MEM + HDD
•  SSD only
4) Pluggable Data Management Policy
•  Evict stale data to lower tier
•  Promote hot data to upper tier
Pin Data in Memory
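The tiering behavior described above (evict stale data downward, promote hot data upward, pin data in memory) can be sketched in a few lines. This is a toy model, not Tachyon's implementation or API. The `TieredStore` class and its methods are hypothetical, capacities are counted in blocks, and least-recently-used eviction is an assumed policy standing in for whatever pluggable policy a deployment chooses.

```python
from collections import OrderedDict
from typing import Dict, List, Optional, Set


class TieredStore:
    """Toy tiered block store; tiers are listed fastest first (e.g. MEM, SSD, HDD)."""

    def __init__(self, tiers: List[str], capacity: Dict[str, int]):
        self.tiers = tiers
        self.capacity = capacity  # maximum number of blocks per tier
        # Each tier keeps its blocks ordered from least to most recently used.
        self.blocks: Dict[str, OrderedDict] = {t: OrderedDict() for t in tiers}
        self.pinned: Set[str] = set()

    def pin(self, block_id: str) -> None:
        """Exclude a block from eviction."""
        self.pinned.add(block_id)

    def tier_of(self, block_id: str) -> Optional[str]:
        for tier in self.tiers:
            if block_id in self.blocks[tier]:
                return tier
        return None

    def _insert(self, level: int, block_id: str, data: bytes) -> None:
        if level >= len(self.tiers):
            return  # fell off the lowest tier; this toy model just drops it
        tier = self.tiers[level]
        store = self.blocks[tier]
        if len(store) >= self.capacity[tier]:
            # Assumed policy: evict the least recently used unpinned block.
            victim = next((b for b in store if b not in self.pinned), None)
            if victim is None:
                # Everything here is pinned, so the newcomer goes one tier down.
                self._insert(level + 1, block_id, data)
                return
            self._insert(level + 1, victim, store.pop(victim))
        store[block_id] = data

    def write(self, block_id: str, data: bytes) -> None:
        old = self.tier_of(block_id)
        if old is not None:
            del self.blocks[old][block_id]
        self._insert(0, block_id, data)

    def read(self, block_id: str) -> bytes:
        tier = self.tier_of(block_id)
        if tier is None:
            raise KeyError(block_id)
        data = self.blocks[tier].pop(block_id)
        self._insert(0, block_id, data)  # promote hot data toward the top tier
        return data


store = TieredStore(["MEM", "SSD", "HDD"], {"MEM": 2, "SSD": 2, "HDD": 4})
store.write("a", b"A")
store.pin("a")           # "a" stays in MEM
store.write("b", b"B")
store.write("c", b"C")   # MEM is full; "b" is evicted to SSD because "a" is pinned
store.read("b")          # "b" is promoted back to MEM; "c" moves down to SSD
```

In a real deployment, data leaving the lowest tier would still be available from the under storage system. The sketch omits that layer for brevity.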
5) Transparent Naming
6) Unified Namespace
More Features
•  7) Remote Write Support
•  8) Easy deployment with Mesos and YARN
•  9) Initial Security Support
•  10) One Command Cluster Deployment
•  11) Metrics Reporting for Clients, Workers, and Master
12) Support for More Under Storage Systems
Reported Tachyon Usage
Memory-Centric Distributed Storage
Welcome to try, contact, and collaborate!
JIRA New Contributor Tasks
•  Team consists of Tachyon creators, top contributors
•  Series A ($7.5 million) from Andreessen Horowitz
•  Committed to Tachyon Open Source
Strata NYC 2015
•  Welcome to visit us at our booth #P18.
•  Check out other Tachyon related talks.
   –  First-ever scalable, distributed deep learning architecture using Spark and Tachyon
      •  Christopher Nguyen (Adatao, Inc.), Vu Pham (Adatao, Inc.)
      •  2:05pm–2:45pm Thursday, 10/01/2015
   –  Faster time to insight using Spark, Tachyon, and Zeppelin
      •  Nirmal Ranganathan (Rackspace Hosting)
      •  2:05pm–2:45pm Thursday, 10/01/2015
•  Try Tachyon: http://tachyon-project.org
•  Develop Tachyon: https://github.com/amplab/tachyon
•  Meet Friends: http://www.meetup.com/Tachyon
•  Get News: http://goo.gl/mwB2sX
•  Tachyon Nexus: http://www.tachyonnexus.com
•  Contact us: haoyuan@tachyonnexus.com
