Apache Arrow and the Future of Data Frames
Wes McKinney
ACM TechTalks, July 8, 2020
Wes McKinney
• Director of Ursa Labs, a not-for-profit development group working on Apache Arrow
• Created the Python pandas project (~2008); lead developer/maintainer until 2013
• PMC member of Apache Arrow and Apache Parquet; ASF Member
• Wrote Python for Data Analysis (1st ed. 2012, 2nd ed. 2017)
• Formerly: Two Sigma, Cloudera, DataPad, AQR
Career Theme
Programming interfaces for data preparation, analytics, and feature engineering
Ursa Labs (https://ursalabs.org)
● Apache Arrow-powered data science tools
● Funded by corporate partners
● Built in collaboration with RStudio
What exactly is a data frame?
Is it...
• A data structure?
• Or an API?
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
One definition...

A "data frame" is
… a programming interface
… for expressing data manipulations
… on tabular datasets
… in a general programming language
… and whose primary modality is analytical
Data frames address many analytical workloads that are either not possible in, or not well served by, traditional SQL-based systems
Compared with SQL-based systems

Data frames often
… use imperative / procedural constructs
… offer access to internal structure
… expose operations outside of traditional relational algebra
… have stateful semantics
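A small sketch (a hypothetical example using pandas, not code from the talk) of operations that are routine in data frame APIs but fall outside traditional relational algebra:

```python
import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({"city": ["NYC", "SF", "NYC"], "sales": [100, 80, 120]})

# Imperative, stateful mutation: add a derived column in place
df["sales_pct"] = df["sales"] / df["sales"].sum()

# Access to internal structure: the underlying NumPy array
values = df["sales"].to_numpy()

# A non-relational operation: transpose rows and columns
transposed = df.T
```

None of these steps map cleanly onto a SQL SELECT statement, which is part of what distinguishes the data frame modality.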
Data structures: different approaches
• In R, the “data frame” data structure is part of the language
• Other projects implement their own (e.g. pandas)
• Some projects may not use any data structures at all (e.g. compiling operations to SQL)
Computational engines
• Algorithms implemented against project-specific data structures
• Coupling between API and algorithms is common
• New data frame libraries must often develop new implementations
Data Access
• Data must be “loaded” (deserialized) into data structures in order to be manipulated / analyzed
• Each data source requires a data-structure-specific “loader” implementation
• Examples: network protocols, file formats, etc.
Many-layered inefficiencies
• Most data frames are effectively “islands” with a hard serialization barrier
• Many non-reusable implementations of the same algorithms
• Limited collaboration across projects and programming languages
https://towardsdatascience.com/the-great-csv-showdown-julia-vs-python-vs-r-aa77376fb96
What do we want?
• Reduce or remove serialization overhead associated with data access
• Reuse algorithms and IO utilities across data frame projects
• Promote collaboration and unify developer efforts
Data as the “new API”
Apache Arrow
● Open source community project launched in 2016
● Intersection of database systems, big data, and data science tools
● Purpose: language-independent open standards and libraries to accelerate and simplify in-memory computing
● https://github.com/apache/arrow
Personal motivations
● Fix interoperability problems with other data processing systems
● Standardize the data structures used in data frame implementations
● Promote collaboration and code reuse across libraries and programming languages
Defragmenting Data
Not just any data structures
● Address fundamental computational issues in “legacy” data structures:
○ Limited data types
○ Excessive memory consumption
○ Poor processing efficiency for non-numeric types
○ No accommodation of larger-than-memory datasets
Apache Arrow Project Overview
● Language-agnostic in-memory columnar format for analytical query engines and data frames
● Binary protocol for IPC / RPC
● “Batteries included” development platform for building data processing applications
2020 Development Status
● 17 major releases
● Over 500 unique contributors
● Over 50M package installs in 2019
● ASF roster: 52 committers, 29 PMC members
● 11 programming languages represented
Upcoming Arrow 1.0.0 release
● Columnar format and protocol formally declared stable, with backward/forward compatibility guarantees
● Libraries moving to Semantic Versioning
Arrow and the Future of Data Frames
● As more data sources offer Arrow-based data access, it will make sense to process Arrow in situ rather than converting to some other data structure
● Analytical systems will generally grow more efficient the more “Arrow-native” they become
Arrow Columnar Format and Binary Protocol
Arrow’s Columnar Memory Format
• Runtime memory format for analytical query processing
• Ideal companion to columnar storage like Apache Parquet
• “Fully shredded” columnar layout; supports flat and nested schemas
• Organized for cache-efficient access on CPUs and GPUs
• Optimized for data locality, SIMD, and parallel processing
• Accommodates both random access and scan workloads
Arrow Binary Protocol
• Record batch: an ordered collection of named arrays
• Streaming wire format for transferring datasets between address spaces
• Intended for both IPC / shared memory and RPC use cases
[Diagram: sender transmits a stream of messages to a receiver: SCHEMA, then DICTIONARY batches, then RECORD BATCH messages ...]
Encapsulated protocol (“IPC”) messages
• Serialization wire format suitable for stream-based parsing
• Each message: metadata size (or end-of-stream marker), “Message” Flatbuffer metadata (see format/Message.fbs), padding, then the body
• Metadata contains the locations within the body needed to reconstruct the data structures
Columnar Format Future Directions
• In-memory encoding, compression, sparseness (e.g. run-length encoding)
• Expansion of logical types
• See the mailing list discussions; we need your feedback!
Some success stories
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
https://medium.com/google-cloud/announcing-google-cloud-bigquery-version-1-17-0-1fc428512171
https://www.snowflake.com/blog/fetching-query-results-from-snowflake-just-got-a-lot-faster-with-apache-arrow/
https://www.influxdata.com/blog/apache-arrow-parquet-flight-and-their-ecosystem-are-a-game-changer-for-olap/
Inside Arrow development
Arrow C++ development platform (red in the original diagram means planned / under construction)
• Allocators and buffers; columnar data structures and builders
• File format interfaces: Parquet, CSV, JSON, ORC, Avro
• Binary IPC protocol; Flight RPC; CUDA interop
• Compute kernels; Gandiva (LLVM expression compiler); embeddable query engine
• IO / filesystem platform: local FS, mmap, HDFS, AWS S3, GCP, Azure
• Plasma shared-memory object store; compressor interfaces
• Multithreading runtime; Datasets framework; data frame interface
• … and much more
Arrow C++ Platform
[Diagram: layered view with the Datasets framework and query processing built on the core data platform and multi-core work scheduler, connected by Arrow Flight RPC to network and storage]
Example: use in R libraries

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

• dplyr verbs can be translated to Arrow computation graphs, executed by a parallel runtime
• R expressions can be JIT-compiled with LLVM
• `flights` can be a massive Arrow dataset
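For readers coming from Python, the dplyr pipeline above can be sketched in pandas (a hypothetical translation with made-up data; the original example uses the nycflights13 flights dataset, which is not reproduced here):

```python
import pandas as pd

# Made-up stand-in for the nycflights13 flights data
flights = pd.DataFrame({
    "year":      [2013, 2013, 2013, 2013],
    "month":     [1, 1, 2, 2],
    "day":       [1, 1, 1, 1],
    "arr_delay": [40.0, 30.0, 5.0, None],
    "dep_delay": [20.0, 50.0, 2.0, 4.0],
})

# group_by + summarise(mean, na.rm = TRUE) + filter
result = (
    flights.groupby(["year", "month", "day"], as_index=False)
           .agg(arr=("arr_delay", "mean"), dep=("dep_delay", "mean"))
           .query("arr > 30 or dep > 30")
)
```

The point of the slide is that such declarative pipelines, whatever their host language, can be lowered to Arrow computation graphs rather than executed eagerly against language-specific structures.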
Arrow Flight RPC Framework
Arrow Flight Overview
• A gRPC-based framework for defining custom data services that send and receive Arrow columnar data natively
• Uses Protocol Buffers v3 for the client protocol
• Pluggable command execution layer and authentication
• Low-level gRPC optimizations to avoid unnecessary serialization
Arrow Flight - Parallel Get
[Diagram: the client calls GetFlightInfo on a planner and receives a FlightInfo describing endpoints; it then issues parallel DoGet calls to the data nodes, each of which streams FlightData back ...]
Arrow Flight - Efficient gRPC transport
[Diagram: the client issues DoGet to a data node and receives a FlightData stream of record batches ...]
• Data is transported in a Protocol Buffer, but reads can be made zero-copy by writing a custom gRPC “deserializer”
Demo: Build simple Flight service in Python
Getting involved
• Join the mailing list: dev@arrow.apache.org
• Development: https://github.com/apache/arrow
