Solving Big Data Problems: Storage to the Rescue?
John Webster, Evaluator Group

2015 Data Storage Innovation Conference. © Insert Your Company Name. All Rights Reserved. 04/10/15

2. Agenda
  • Big Data Analytics Storage Maxims
  • The Fundamental JBOD and DAS Architecture
  • Overview of Disk-based Alternatives
  • What are the Advantages and Disadvantages?
  • The Solid State and In-memory Alternatives
  • Summary and Q&A
  Note: References to specific vendors and products are used as real-world examples and do not imply an endorsement.

3. Big Data Storage Maxim #1
  • Deliver storage performance at large scale and at low cost, all at the same time (think early-stage Google, Facebook, Twitter)

4. Big Data Storage Maxim #2
  • Minimize the “distance” between processing and data storage

5. Big Data Storage Maxim #3
  • Big Data analytics is dominated by open source

6. Big Data Storage Maxim #4
  • Big Data analytics software developers manage data at the clustered server level
  • Storage vendors manage data at the storage system level

7. Shared Nothing, Asymmetrical Distributed Computing
  [Diagram: a control node and nodes 1 through n, each with its own DAS, connected by the network layer]
  • Network layer: 1 Gb Ethernet
  • Compute layer: commodity servers
  • Storage layer: 6-12 disks in each server, typically JBOD
  • Scale to thousands of nodes
  • Only the Ethernet network is shared
  • In Hadoop, Control = NameNode; Node 1, 2, … = DataNode
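
The shared-nothing layout pairs naturally with Maxim #2: work is sent to the node whose local disks already hold the data, and the shared Ethernet is used only as a fallback. The following is a minimal sketch of that idea, not Hadoop's actual scheduler. The function names, node names, and the "free slots" model are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


def pick_node(block_replicas: List[str], free_slots: Dict[str, int]) -> Optional[str]:
    """Choose a node for a task that reads one block.

    Prefer a node that already stores a replica of the block (data-local read
    from DAS); otherwise fall back to any node with a free slot, which implies
    a remote read over the shared network.
    """
    for node in block_replicas:
        if free_slots.get(node, 0) > 0:
            return node
    for node, slots in free_slots.items():
        if slots > 0:
            return node
    return None


def schedule(blocks: Dict[str, List[str]], free_slots: Dict[str, int]) -> Dict[str, str]:
    """Assign one task per block; returns a block -> node mapping.

    Blocks that cannot be placed because no slots remain are left out.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    placement: Dict[str, str] = {}
    for block_id, replicas in blocks.items():
        node = pick_node(replicas, slots)
        if node is None:
            break  # cluster is full; remaining tasks would have to wait
        slots[node] -= 1
        placement[block_id] = node
    return placement
```

Calling `schedule` with a block whose replica nodes are all busy shows the fallback path: the task still runs, but on a node that must pull the block across the network.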

8. Apache Hadoop: A Platform for All Applications?
  [Diagram: Hadoop platform stack. Source: Hortonworks]
  • Presentation & Application: enable both existing and new applications to provide value to the organization
  • Operations: empower existing operations and security tools to manage Hadoop
  • Data Access: engines for Batch, Script, SQL, Online, Real-Time, In-Memory, and others (MapReduce, Pig, Hive, HBase, Accumulo, Storm, Spark); Metadata Management: HCatalog
  • Data Management: Multitenant Processing: YARN (Hadoop Operating System); Storage: HDFS (Hadoop Distributed File System)
  • Data Integration & Governance: Data Workflow and Data Lifecycle (Falcon); Real-time and Batch Ingest (Flume, Sqoop, WebHDFS, NFS)
  • Security: Authentication, Authorization, Accountability, Data Protection across Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
  • Operations: Provision, Manage & Monitor (Ambari); Scheduling (Oozie)
  • Environment: Linux, Windows, On Premise, Virtualize, Commodity HW, Appliance, Cloud/Hosted

9. HDFS as a Persistent Storage Layer
  Advantages
  • Storage performance at large scale and low cost
  • Minimize distance between data and compute
  • Node failures tolerated
  • Open source
  Disadvantages
  • Hadoop NameNode lacks active/active failover (i.e., it’s a SPOF)
  • For data integrity and protection, HDFS creates three full clone copies of data
    - 3x the storage for each file: slow and inefficient
    - If all three copies are corrupted, you’re still hosed (reload and start over)
  • No storage tiering (recognition of different storage types now available in 2.3)
  • Limited ways to respond to corporate security and data governance policies
  • Data in/out processes can take longer than the actual query process
  • What is the single source of the truth?
  • Inability to disaggregate storage from compute so that the two can be scaled independently
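
The "3x the storage" point is simple arithmetic, and it is worth making explicit when sizing a cluster. The sketch below assumes the three-copy replication described on the slide; the function names and the example cluster dimensions are hypothetical, and real clusters also reserve space for temporary data and OS overhead, which this ignores.

```python
def raw_capacity_needed(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw disk (TB) consumed when every file is stored as N full copies."""
    return logical_tb * replication_factor


def usable_capacity(nodes: int, disks_per_node: int, disk_tb: float,
                    replication_factor: int = 3) -> float:
    """Logical data (TB) that fits on a JBOD cluster under N-way replication."""
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb / replication_factor
```

For example, under these assumptions a 10-node cluster with 12 disks of 4 TB each has 480 TB of raw disk, and `usable_capacity(10, 12, 4.0)` shows how much of that remains for logical data once every file is stored three times.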

10. Shared Storage as Primary Storage
  [Diagram: a control node and nodes 1 through n in the compute layer, connected through the network layer to a shared storage layer]
  • Storage layer: SAN or NAS, but more commonly scale-out NAS

11. Shared Storage as Secondary Storage
  [Diagram: a control node and nodes 1 through n in the compute layer, connected through the network layer to a shared storage layer]
  • Storage layer: SAN/NAS/Object Storage

12. Hadoop on Scale-out Storage
  • Scale-out storage replaces node-level DAS
  • HDFS implemented as an “over the wire” protocol or CDMI interface to the underlying FS
  • NameNode SPOF eliminated
  • Decoupled storage and compute layers
  • Data services, data protection, and DR provided by storage-resident services
  • Examples include EMC Isilon, IBM Elastic Storage, Ceph
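
One way to see what decoupling storage from compute buys is to compare node counts for the same workload. This is a rough sketch under stated assumptions: each coupled node contributes a fixed amount of both storage and cores, while the decoupled layers are sized separately. All function names, parameters, and figures are hypothetical, and the model ignores cost, replication, and network effects.

```python
import math
from typing import Tuple


def coupled_nodes(needed_tb: float, needed_cores: int,
                  tb_per_node: float, cores_per_node: int) -> int:
    """Nodes required when every node brings both disks and CPUs.

    The larger of the two requirements dictates the cluster size, so the
    other resource is over-provisioned.
    """
    by_storage = math.ceil(needed_tb / tb_per_node)
    by_compute = math.ceil(needed_cores / cores_per_node)
    return max(by_storage, by_compute)


def decoupled_nodes(needed_tb: float, needed_cores: int,
                    tb_per_storage_node: float,
                    cores_per_compute_node: int) -> Tuple[int, int]:
    """(compute_nodes, storage_nodes) when each layer scales on its own."""
    compute = math.ceil(needed_cores / cores_per_compute_node)
    storage = math.ceil(needed_tb / tb_per_storage_node)
    return compute, storage
```

With a storage-heavy workload, the coupled cluster ends up buying servers purely for their disks, while the decoupled design adds only storage nodes. Whether that is cheaper depends on the price of the shared storage, which this sketch does not model.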

13. Shared Primary/Secondary Storage
  Advantages
  • Addresses the enterprise storage management requirements
    - Data protection/disaster recovery/business continuance
    - Data governance/compliance/archiving
    - Single source of the truth
  Disadvantages
  • Additional cost
  • Potential performance impact
  • Using a vendor-specific solution introduces proprietary data/storage management software

14. What About SSD?
  [Diagram: a control node and nodes 1 through n, each with its own DAS, connected by the network layer]
  • Network layer: 10+ Gb Ethernet
  • Compute layer: commodity servers
  • Storage layer: SSD in/attached to each server
  • Scale to thousands of nodes
  • Only the Ethernet network is shared
  • In Hadoop, Control = NameNode; Node 1, 2, … = DataNode

15. What About SSD?
  [Diagram: a control node and nodes 1 through n in the compute layer, connected through the network layer to a shared storage layer]
  • Storage layer: scale-out flash storage

16. What About In-Memory Computing?
  Tachyon
  • UC Berkeley AMPLab project
  • “Reliable, memory-centric storage for Big Data Analytics clusters” (i.e., memory as persistent data store across cluster nodes)
  • One in-memory data copy inside JVM; uses operation “lineage” to recompute data after a failure
  • Initial use in Apache Spark environments
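
The lineage idea can be illustrated in a few lines: instead of keeping extra copies, each dataset remembers which parent and which operation produced it, so a lost in-memory copy can be rebuilt on demand. The sketch below is a toy model of that principle only. It is not Tachyon's or Spark's API, and the class name, methods, and counter are hypothetical.

```python
from typing import Any, Callable, Iterable, List, Optional


class LineageDataset:
    """A dataset that holds one in-memory copy plus the recipe to rebuild it."""

    def __init__(self, source: Optional[Iterable[Any]] = None,
                 parent: Optional["LineageDataset"] = None,
                 transform: Optional[Callable[[Any], Any]] = None) -> None:
        self._source = list(source) if source is not None else None  # durable input
        self._parent = parent        # lineage: where the data came from
        self._transform = transform  # lineage: how it was derived
        self._cache: Optional[List[Any]] = None  # the single in-memory copy
        self.computations = 0        # how many times the data was (re)built

    def map(self, fn: Callable[[Any], Any]) -> "LineageDataset":
        """Record a derived dataset without computing it yet."""
        return LineageDataset(parent=self, transform=fn)

    def get(self) -> List[Any]:
        """Return the data, recomputing from lineage if the copy is missing."""
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source or [])
            else:
                self._cache = [self._transform(x) for x in self._parent.get()]
            self.computations += 1
        return self._cache

    def evict(self) -> None:
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._cache = None
```

Evicting a derived dataset and calling `get()` again returns the same values, at the cost of one more computation. The trade-off is recomputation time against the memory that additional replicas would consume.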

17. What About In-Memory Computing?
  Apache Ignite
  • In-memory “data fabric”
  • Distributed in-memory platform for computing and transacting on large-scale data sets in real time
  • “Orders of magnitude faster than possible with traditional disk-based or flash technologies.”
  • Tier -1 storage?
  • Originated as GridGain Data Fabric
  In-Memory Computing Summit, 6/29-30, imcsummit.org

18. Summary and Q&A
  • The need for a longer-term, persistent storage layer is now recognized
  • For Hadoop, HDFS may or may not be that storage layer
  • Enterprise storage architects and administrators will be more directly involved in managing Big Data analytics storage over time
  • Now is the time to research and understand the options
