ANALYZING IOT DATA IN APACHE SPARK ACROSS DATA CENTERS AND CLOUD
By Karthikeyan Nagalingam and Nilesh Bagad
Agenda
• IoT Data Management Challenges
• NetApp Data Fabric Architecture for Big Data
• IoT Customer Use Cases on Data Fabric
• Q&A
IoT Data Management Challenges
IoT Data Flow
Edge:
1. Data is created
2. Data is analyzed in real time
3. Data is aggregated and sent to the core
Core (data center, cloud, geo-distributed data lake):
1. Data is stored
2. Data is analyzed
3. Data is protected
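The edge step "data is aggregated and sent to the core" can be sketched as below. This is only an illustration, not the presenters' pipeline: the reading shape `(device_id, value)`, the summary fields, and the sample values are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def aggregate_readings(readings):
    """Collapse raw (device_id, value) readings into one summary per device."""
    by_device = defaultdict(list)
    for device_id, value in readings:
        by_device[device_id].append(value)
    return {
        device_id: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for device_id, values in by_device.items()
    }


# Hypothetical readings collected at one edge site.
readings = [("sensor-a", 20.0), ("sensor-a", 22.0), ("sensor-b", 5.0)]
summaries = aggregate_readings(readings)
```

Sending one summary per device instead of every raw reading is one way an edge site could cut the volume it transports to the core.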
IoT Data Management Challenges
Stages: Collect, Transport, Store, Analyze, Protect
• Flexibility and Agility
• Cost
• Data Protection
NetApp Data Fabric Architecture for Big Data
The NetApp Data Fabric
Helping customers unleash data to address their business imperatives:
• HARNESS the power of the hybrid cloud
• BUILD a next-generation data center
• MODERNIZE storage through data management
[Diagram labels: Public Cloud, Multi-Cloud, Next-Gen Data Center, Enterprise IT, NetApp Data Fabric]
Introducing Data Fabric Building Blocks for Analytics
[Diagram labels: NetApp Private Storage, AFF A200, ONTAP, Applications, ExpressRoute, Direct Connect, SnapMirror®, S3, Cold Data, StorageGRID, FabricPool]
In-Place Analytics: Enable big data analytics on NFSv3 data
Config 1: NFS as storage
[Diagram labels: Apache Spark Cluster, NetApp FAS NFS Connector, NFS, AFF A200]
Config 2: HDFS and NFS in a single Spark cluster
[Diagram labels: Apache Spark Cluster, NetApp FAS NFS Connector, HDFS, NFS, AFF A200]
Key Benefits
• Avoid moving data to HDFS; reduce replicas
• Scale compute and storage independently
• Enterprise data protection
• Hybrid cloud deployment
• Hortonworks certified
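Config 2 implies that a single job may reference inputs on both file systems. A minimal sketch of sorting input paths by file system follows, assuming NFS exports are addressed with an `nfs://` URI scheme next to the usual `hdfs://` one. The host names, ports, and paths are hypothetical.

```python
from urllib.parse import urlparse


def group_by_scheme(paths):
    """Group input URIs by scheme so each file system's inputs are listed together."""
    groups = {}
    for path in paths:
        # Paths without a scheme are treated as local files.
        scheme = urlparse(path).scheme or "file"
        groups.setdefault(scheme, []).append(path)
    return groups


# Hypothetical inputs for one job that reads from both HDFS and NFS.
inputs = [
    "hdfs://namenode.example:8020/iot/raw",
    "nfs://filer.example:2049/iot/archive",
    "hdfs://namenode.example:8020/iot/curated",
]
grouped = group_by_scheme(inputs)
```

A check like this can confirm, before a job is submitted, which datasets stay in place on NFS and which still live in HDFS.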
Analytics with Data Fabric
[Diagram labels: On Premises: Apache Spark Cluster, NetApp FAS NFS Connector, NFS, ONTAP, AFF A200. Cloud: HDInsight Spark Cluster, Databricks or EMR, each with NetApp FAS NFS Connector. NetApp Private Storage, ExpressRoute, Direct Connect, SnapMirror®]
IoT Customer Use Cases on Data Fabric
Customer Scenario: Broadcasting Provider
• IoT data is received in AWS and analyzed using an Apache Spark cluster in AWS
• Data management challenges:
  – How to back up 10 TB of data without putting load on the cluster?
  – How to protect the data to on-premises storage?
An Architecture for Processing IoT Data Ingested in Cloud: Backup and DR to On Premises
[Diagram labels: Kafka, Events via REST API, Spark Cluster, EBS, NetApp FAS NFS Connector, Apache Spark Cluster, NFS, ONTAP, AFF A200, On Premises]
Customer Scenario: IT Service Provider
• IoT data is received in AWS and analyzed using an Apache Spark cluster in AWS
• Data management challenges:
  – How to reduce the solution cost?
  – How to consume analytics services in the data center and in multiple clouds?
An Architecture for Processing IoT Data Ingested in Cloud: Multi-Cloud Connectivity
[Diagram labels: Kafka, Events via REST API, Spark Cluster, Analytics Service, Apache Spark Cluster, NetApp FAS NFS Connector, NFS, ONTAP, AFF A200, On Premises, NetApp Private Storage, ExpressRoute, Direct Connect, SnapMirror®]
Customer Scenario: Insurance Company
• IoT data is received on premises and analyzed using a Cloudera Spark cluster across data center and cloud
• Data management challenges:
  – How to leverage cloud compute for analytics?
  – How to consume legacy data (7 PB) for analytics?
An Architecture for Processing IoT Data Ingested On Premises: DR in Cloud; Analytics Across Data Centers
[Diagram labels: Kafka, Events via REST API, Spark Cluster, HDInsight, Databricks, NetApp FAS NFS Connector, NFS, ONTAP, AFF A200, On Premises, NetApp Private Storage, ExpressRoute, Direct Connect, SnapMirror®]
Customer Scenario: Large Bank
• IoT data is received on premises and stored in a Hadoop data lake. The data must be backed up for compliance
• Data management challenges:
  – How to reduce the backup window and optimize solution cost?
  – How to minimize the impact on analytics performance during backup?
Use Case: Backup for IoT Data
[Diagram labels, Current (typical): HDFS (DAS), HDFS (snap), distcp -update, Backup HDFS (DAS) or HDFS (SAN), Edge, Tape, Archive]
[Diagram labels, Proposed, NetApp Solution A: HDFS (SAN), NetApp (snap), FlexClone, NFS Export, NetBackup (Files)]
[Diagram labels, Proposed, NetApp Solution B: HDFS (DAS), HDFS (snap), distcp -update -diff, run lots of parallel mappers, NFS Connector, NFS (Files), SnapMirror, FlexClone, Backup, Edge, Archive]
NetApp Backup Solution A
• NetApp Snapshot backup
• Backup archival
• Cloud compatible
NetApp Backup Solution B
• Hadoop native support
• Offload backup operation
• Enterprise management
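The `distcp -update -diff` step in Solution B can be sketched as a small command builder. This is an illustration under assumptions: the snapshot names, cluster URIs, and mapper count are hypothetical, and the function only builds the argument list without running it. `-diff` takes two snapshot names from the source directory and must be combined with `-update`; `-m` caps the number of parallel map tasks. Stock Hadoop documents `-diff` for HDFS sources and targets, so both URIs here are HDFS; whether it works against an NFS Connector target is not confirmed by this sketch.

```python
def distcp_diff_command(from_snapshot, to_snapshot, source, target, max_maps=None):
    """Build the argument list for an incremental, snapshot-diff DistCp run."""
    cmd = ["hadoop", "distcp", "-update", "-diff", from_snapshot, to_snapshot]
    if max_maps is not None:
        # -m sets the maximum number of simultaneous copy mappers.
        cmd += ["-m", str(max_maps)]
    cmd += [source, target]
    return cmd


# Hypothetical snapshot names and cluster addresses.
cmd = distcp_diff_command(
    "snap1",
    "snap2",
    "hdfs://prod-nn.example:8020/iot",
    "hdfs://backup-nn.example:8020/iot",
    max_maps=64,
)
```

The list can be handed to `subprocess.run(cmd, check=True)` on a host with the Hadoop client installed. Copying only the changes between two snapshots is what shortens the backup window compared with a plain `distcp -update` scan.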
Customer Scenario: Online Music Distribution
• Large Hadoop data lake implementation on premises with multiple Spark clusters
• Data management challenges:
  – How to make data available to dev/test teams?
  – How to build a new cluster in minutes from an existing cluster?
Use Case: Dev/Test for IoT Data
[Diagram labels: On Premises: Production Apache Spark Cluster, Development Apache Spark Cluster, QA/Test Apache Spark Cluster, each with NetApp FAS NFS Connector over NFS; ONTAP, AFF A200; FlexClone copies for Development and QA/Test]
Customer Scenario: Online Marketplace
• Run analytics on archival data in an object store
• Data management challenges:
  – How to run Hadoop jobs against data in the object store?
  – How to archive Hadoop data at the primary or a secondary site?
Analytics with NetApp StorageGRID: In-place analytics with StorageGRID
[Diagram labels: On Premises: Apache Spark Cluster, NetApp FAS NFS Connector, NFS, ONTAP, AFF A200, StorageGRID, S3, S3A. Secondary Site: AFF A200, StorageGRID, S3. Direct Connect]
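The S3A label in the diagram refers to Hadoop's S3A file system client, which Spark can point at any S3-compatible endpoint. A minimal sketch of the Spark settings involved follows. The property names are standard Hadoop S3A options carried through Spark's `spark.hadoop.` prefix; the endpoint URL and credentials are placeholders, and nothing here is specific to the presenters' setup.

```python
def s3a_spark_conf(endpoint, access_key, secret_key):
    """Return Spark settings that direct the S3A client at a custom S3 endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style requests are commonly needed for non-AWS S3 endpoints.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


# Placeholder endpoint and credentials, for illustration only.
conf = s3a_spark_conf("https://storagegrid.example:8082", "ACCESS_KEY", "SECRET_KEY")
```

Each key/value pair can be passed to `SparkSession.builder.config(key, value)`, after which the job reads archived data through `s3a://bucket/prefix` paths without first copying it into HDFS.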
Spark Performance: HiBench, Spark Scala WordCount
• Input dataset size: 1.5 TB
• ONTAP: 52% better than JBOD

Type    Hadoop Worker Nodes    Drives per Node    Number of Storage Arrays
JBOD    6                      12                 NA
ONTAP   6                      6                  1

[Chart: Spark Scala WordCount, throughput in MB/sec, JBOD vs ONTAP (higher is better)]
[Chart: Spark Scala WordCount, duration in seconds, JBOD vs ONTAP (lower is better)]
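The two chart metrics are related through the dataset size: throughput is the input size divided by the job duration. The sketch below shows that arithmetic and how a "percent better" figure could be derived. The durations are hypothetical and are not the measured results from the slide; the slide also does not say which metric the 52% figure refers to.

```python
def throughput_mb_per_sec(dataset_mb, duration_sec):
    """Average throughput of a job that processed dataset_mb in duration_sec."""
    return dataset_mb / duration_sec


def percent_improvement(baseline, candidate):
    """Relative gain of candidate over baseline for a higher-is-better metric."""
    return (candidate - baseline) / baseline * 100.0


DATASET_MB = 1_500_000  # 1.5 TB input, using decimal units

# Hypothetical durations, NOT the values measured in the benchmark.
baseline_tp = throughput_mb_per_sec(DATASET_MB, 400)
candidate_tp = throughput_mb_per_sec(DATASET_MB, 300)
gain = percent_improvement(baseline_tp, candidate_tp)
```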
Key Takeaways
Flexibility and Agility
• On-demand analytics with hybrid cloud and multi-cloud deployments
• Rapid provisioning of clusters for test/dev environments
Lower Cost
• Add storage capacity without adding compute nodes
• One copy vs 3 copies of data for HDFS
• Data tiering with FabricPool
Enterprise Data Protection
• Efficient backup, DR, and archival
Further Resources
• Please visit us at Booth #512
• Visit our Big Data website: www.netapp.com/bigdata
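The "one copy vs 3 copies" point is simple capacity arithmetic, sketched below. The 1.5 TB figure is reused from the benchmark slide purely as an example, and the calculation deliberately ignores RAID parity, compression, and other storage-side overheads or savings.

```python
def raw_capacity_tb(dataset_tb, copies):
    """Raw capacity consumed when a dataset is stored as a number of full copies."""
    return dataset_tb * copies


dataset_tb = 1.5  # example dataset size

hdfs_raw_tb = raw_capacity_tb(dataset_tb, 3)    # HDFS with 3 replicas
single_copy_tb = raw_capacity_tb(dataset_tb, 1)  # one copy on shared storage
saved_tb = hdfs_raw_tb - single_copy_tb
```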
Q & A
Thank You. Nilesh Bagad: nileshb@netapp.com Karthikeyan Nagalingam: nkarthik@netapp.com

Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp Data Fabric and NetApp Private Storage with Nilesh Bagad and Karthikeyan Nagalingam
