The Data Lake Engine. Spark + AI Summit 2020. Data Science Across Data Sources with Apache Arrow
Dremio is the Data Lake Engine Company. Tomer Shiran, Co-Founder & CPO, Dremio, tomer@dremio.com. Background: Powering the cloud data lakes of the world's leading companies across all industries. Over $100M raised.
Your Data Lake is Exploding, Yet Your Data Remains Inaccessible. Data lakes are becoming the primary place that data lands: >100% YoY S3 data growth (1); >50% of data will live on cloud data lake storage by 2025 (2). But consuming the data is too slow and too difficult for SQL data consumers working against S3 or ADLS. 1) Estimate based on historical growth https://aws.amazon.com/blogs/aws/amazon-s3-growth-for-2011-now-762-billion-objects/ 2) Estimate based on trends around cloud migration plus growth in semi-structured and unstructured data
Data Movement is the Typical Workaround for Data Lake Storage. Between data lake storage (ADLS, S3) and its consumers (BI users, SQL, data scientists) sit three layers, each decreasing data scope and flexibility: 1) Brittle & complex ETL/ELT; 2) Proprietary & expensive DW/data marts; 3) Proliferating cubes, BI extracts, & aggregation tables.
The alternative to 1) brittle & complex ETL/ELT, 2) proprietary & expensive DW/data marts, and 3) proliferating cubes, BI extracts, & aggregation tables: BI users, SQL, and data scientists query data lake storage (ADLS or S3) directly with 4-100X performance.
What is Apache Arrow? Columnar in-memory representation (row-based vs. column-based layout); many language bindings; broad industry adoption.
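The row-based versus column-based contrast on this slide can be sketched with the Python standard library alone. This is not Arrow's actual memory format or API. The field names (`id`, `price`, `qty`) and the helper `to_columns` are hypothetical, chosen only to show why a scan over one field is cheaper when each field lives in its own contiguous typed buffer.

```python
from array import array

# Three records in row-based form: each record keeps all of its fields together.
rows = [
    {"id": 1, "price": 9.5, "qty": 2},
    {"id": 2, "price": 4.0, "qty": 10},
    {"id": 3, "price": 7.25, "qty": 4},
]


def to_columns(records):
    """Pivot row records into one contiguous typed buffer per field."""
    return {
        "id": array("q", (r["id"] for r in records)),        # 64-bit integers
        "price": array("d", (r["price"] for r in records)),  # 64-bit floats
        "qty": array("q", (r["qty"] for r in records)),      # 64-bit integers
    }


columns = to_columns(rows)

# In columnar form, a scan over one field touches a single contiguous buffer...
total_qty_columnar = sum(columns["qty"])
# ...while in row form it has to visit every record and pick the field out.
total_qty_rows = sum(r["qty"] for r in rows)

print(total_qty_columnar, total_qty_rows)
```

Both sums agree; the difference is in memory access pattern, which is what makes the columnar form friendlier to vectorized execution.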
10+ Downloads per Month
Apache Arrow Gandiva Improves CPU Efficiency
✓ A standalone C++ library for efficient evaluation of arbitrary SQL expressions on Arrow vectors using runtime code generation in LLVM
✓ Expressions are compiled to LLVM bytecode (IR), optimized & translated to machine code
✓ Gandiva enables vectorized execution with Intel SIMD instructions
Diagram: a SQL expression enters the Gandiva compiler (IR Builder, then Optimize, with pre-compiled functions (.bc)), which produces a vectorized execution kernel that reads an input Arrow buffer and writes an output Arrow buffer.
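The idea of compiling an expression once and then running the resulting kernel over whole columns can be illustrated with a toy in pure Python. This is only an analogy: it generates Python source rather than LLVM IR, it uses no SIMD, and `compile_projection` is a hypothetical helper, not part of Gandiva's API.

```python
def compile_projection(expr, names):
    """Generate a kernel for `expr` once; the kernel then runs over whole columns.

    `expr` is a Python expression over the column names in `names`.
    It must come from a trusted source, because it is passed to exec().
    """
    params = ", ".join(f"col_{n}" for n in names)
    # Trailing comma so that a single column still unpacks from zip()'s 1-tuples.
    targets = ", ".join(names) + ","
    source = (
        f"def kernel({params}):\n"
        f"    return [{expr} for ({targets}) in zip({params})]\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["kernel"]


# Compile once, then apply to column buffers of any length.
kernel = compile_projection("a + b * 2", ["a", "b"])
projected = kernel([1, 2, 3], [10, 20, 30])

# A CASE-like expression compiled the same way.
case_kernel = compile_projection("x if x > 1 else 0", ["x"])
cased = case_kernel([0, 1, 2, 3])

print(projected, cased)
```

The expression is parsed and compiled a single time, so the per-row cost is just the generated loop body. That is the property the slide attributes to Gandiva's generated machine code.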
4.5x-90x Faster than Java-based Code Generation

Test                 Project time (secs)   Project time (secs)   Improvement
                     with Java JIT         with Gandiva LLVM
Sum                  3.805                 0.558                 6.8x
Project 5 columns    8.681                 1.689                 5.13x
Project 10 columns   24.923                3.476                 7.74x
CASE-10              4.308                 0.925                 4.66x
CASE-100             1361                  15.187                89.6x
Dremio's Arrow-based Columnar Cloud Cache (C3) Accelerates I/O
✓ Columnar cloud cache (C3) automatically provides NVMe-level I/O performance when reading from S3/ADLS
✓ Arrow persistence enables granular caching as Arrow buffers in local engine NVMe
✓ Bypass data deserialization and decompression
✓ Enables high-concurrency, low-latency BI workloads on cloud data lake storage
Diagram: an XL engine and an M engine read from AWS S3; every executor in each engine has local NVMe holding C3 with Apache Arrow persistence.
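A read-through cache of already-decoded column chunks, as the bullets describe, might be sketched as follows. This is not Dremio's C3 implementation. The class `ColumnChunkCache`, the (file, column, row group) key, the LRU eviction policy, and the stand-in `fake_remote_read` are all assumptions made for illustration.

```python
from collections import OrderedDict


class ColumnChunkCache:
    """Read-through LRU cache keyed by (path, column, row group)."""

    def __init__(self, fetch_and_decode, capacity):
        self._fetch_and_decode = fetch_and_decode  # slow path: remote read + decode
        self._capacity = capacity
        self._chunks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, path, column, row_group):
        key = (path, column, row_group)
        if key in self._chunks:
            # Hit: return the decoded buffer, skipping the remote read and decode.
            self.hits += 1
            self._chunks.move_to_end(key)  # mark as most recently used
            return self._chunks[key]
        self.misses += 1
        chunk = self._fetch_and_decode(path, column, row_group)
        self._chunks[key] = chunk
        if len(self._chunks) > self._capacity:
            self._chunks.popitem(last=False)  # evict the least recently used chunk
        return chunk


def fake_remote_read(path, column, row_group):
    # Hypothetical stand-in for an S3/ADLS read plus decompression and decoding.
    return [f"{path}:{column}:{row_group}:{i}" for i in range(3)]


cache = ColumnChunkCache(fake_remote_read, capacity=2)
cache.get("s3://bucket/t.parquet", "price", 0)  # miss: fetched and decoded
cache.get("s3://bucket/t.parquet", "price", 0)  # hit: served from the cache
cache.get("s3://bucket/t.parquet", "qty", 0)    # miss: a different column
print(cache.hits, cache.misses)
```

Caching at column-chunk granularity means a query that touches two columns of a wide table only pays for, and only occupies cache space with, those two columns.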
The Open Data Platform. Layers: Storage, Data, Compute, Client. Storage: AWS S3, ADLS, HDFS. Data: File formats: Text | JSON | Parquet | ORC; Table formats: Glue | Hive Metastore | Delta Lake | Iceberg. Compute engines and client workloads: Interactive SQL & BI, Data science & batch, Occasional SQL, Athena, EMR, Batch processing.
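We Need Fast, Industry-Standard Data Exchange. Layers: Storage, Data, Compute, Client. Storage: AWS S3, ADLS, HDFS. Data: File formats: Text | JSON | Parquet | ORC; Table formats: Glue | Hive Metastore | Delta Lake | Iceberg. Compute engines and client workloads: Interactive SQL & BI, Data science & batch, Occasional SQL, Athena, EMR, Batch processing. The diagram marks four numbered data-exchange points (1-4) on this stack.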
Arrow Flight is an Arrow-based RPC Interface
✓ High-performance wire protocol
✓ Parallel streams of Arrow buffers are transferred
✓ Delivers on the interoperability promise of Apache Arrow
✓ Client-cluster and cluster-cluster communication
Diagram: Arrow Flight streams feeding a dataframe.
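Arrow Flight Python Client

    import pyarrow.flight as flt

    sql = "SELECT 1"  # placeholder query; the slide leaves `sql` undefined
    # Recent pyarrow releases take the location in the FlightClient constructor
    c = flt.FlightClient(("localhost", 47470))
    # Describe the query, then ask the server where the result streams live
    fd = flt.FlightDescriptor.for_command(sql)
    fi = c.get_flight_info(fd)
    # Read the first endpoint's stream into an Arrow table
    ticket = fi.endpoints[0].ticket
    df = c.do_get(ticket).read_all()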
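The client code on this slide reads only `endpoints[0]`. When a server returns several endpoints, the "parallel streams" idea amounts to fetching each one concurrently and concatenating the partitions. The sketch below shows that pattern with standard-library stand-ins so that it runs without a Flight server. `read_endpoint` and the dictionary-shaped endpoints are hypothetical; a real client would call `do_get` with each endpoint's ticket inside that function instead.

```python
from concurrent.futures import ThreadPoolExecutor


def read_endpoint(endpoint):
    # Hypothetical stand-in for reading one Flight stream to completion.
    # Each endpoint yields one partition of the overall result.
    return list(endpoint["rows"])


def read_all_endpoints(endpoints, max_workers=4):
    """Fetch every endpoint concurrently and concatenate in endpoint order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, whichever finishes first.
        partitions = list(pool.map(read_endpoint, endpoints))
    return [row for partition in partitions for row in partition]


endpoints = [{"rows": [1, 2]}, {"rows": [3]}, {"rows": [4, 5, 6]}]
result = read_all_endpoints(endpoints)
print(result)
```

Threads are a reasonable fit here because each fetch is I/O bound; the order-preserving `map` keeps the combined result deterministic.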
Client-Cluster Communication
Cluster-Cluster Communication
Demo
Q&A. The Data Lake Engine
Dremio is the Data Lake Engine: A Next-Generation Data Lake Query Engine for Live, Interactive Analytics Directly on Data Lake Storage. Accelerate Business: 100X BI query speed; 4X ad-hoc query speed; 0 cubes, extracts, or aggregation tables. Reduce Cost & Risk: 10x lower AWS EC2 / Azure VM spend for same performance; 0 lock-in, loss of control, and duplication of data. Diagram: data users (BI users, SQL, data scientists) query the Data Lake Engine, which sits on data lake storage (ADLS or S3) with optional external sources.
