https://ursalabs.org ● Build cross-language, portable computational libraries for data science ● Grow Apache Arrow ecosystem ● Funding and employment for full-time open source developers ● Not-for-profit, funded by multiple corporations Ursa Labs Mission
Strategic Partnership Model
• • • •
Up to 80-90% of CPU cycles spent on de/serialization Life without Arrow Life with Arrow No de/serialization
• • • •
Arrow C++ Platform Multi-core Work Scheduler Core Data Platform Query Engine Datasets Framework Arrow Flight RPC Network Storage
● Columnar format objects and utilities ● Memory management and generic IO ● Binary protocol / serialization functions ● Memory-mapping and zero-copy “parsing” ● Integration testing Arrow Core
● Fast read and write of multi-file datasets ● Read only the parts of the dataset relevant to your analysis (“predicate pushdown”) C++ Datasets File Formats Storage Systems CSV
• • • •
Arrow Flight RPC (Messaging) ● Efficient client-server dataset interchange ● Focused on gRPC (Google’s messaging framework), but may support other transports in future ● It’s fast… really fast ○ Upwards 3GB/s server-to-client on localhost
Arrow for R ● Rcpp-based bindings ● https://github.com/apache/arrow/tree/master/r ● Goal: enable R package developers to leverage Arrow ecosystem for better performance and scalability
Arrow format vs. R data.frame ● Type-independent representation of NA values (bits vs. special values) ● Better computational efficiency for strings ● Naturally chunk-based (vs. large contiguous allocations) ● Supports a much wider variety of data types, including nested data (JSON-like)
• • • • •
• • • •
• • • • • • • •
flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) Can be a massive Arrow dataset
flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) dplyr verbs can be translated to Arrow computation graphs, executed by parallel runtime Can be a massive Arrow dataset
flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) dplyr verbs can be translated to Arrow computation graphs, executed by parallel runtime R expressions can be JIT-compiled with LLVM Can be a massive Arrow dataset
Keep up to date at https://arrow.apache.org https://ursalabs.org https://wesmckinney.com Thanks

Apache Arrow: Leveling Up the Data Science Stack

  • 2.
    https://ursalabs.org ● Build cross-language,portable computational libraries for data science ● Grow Apache Arrow ecosystem ● Funding and employment for full-time open source developers ● Not-for-profit, funded by multiple corporations Ursa Labs Mission
  • 3.
  • 4.
  • 5.
    Up to 80-90%of CPU cycles spent on de/serialization Life without Arrow Life with Arrow No de/serialization
  • 6.
  • 7.
    Arrow C++ Platform Multi-coreWork Scheduler Core Data Platform Query Engine Datasets Framework Arrow Flight RPC Network Storage
  • 8.
    ● Columnar formatobjects and utilities ● Memory management and generic IO ● Binary protocol / serialization functions ● Memory-mapping and zero-copy “parsing” ● Integration testing Arrow Core
  • 9.
    ● Fast readand write of multi-file datasets ● Read only the parts of the dataset relevant to your analysis (“predicate pushdown”) C++ Datasets File Formats Storage Systems CSV
  • 10.
  • 11.
    Arrow Flight RPC(Messaging) ● Efficient client-server dataset interchange ● Focused on gRPC (Google’s messaging framework), but may support other transports in future ● It’s fast… really fast ○ Upwards 3GB/s server-to-client on localhost
  • 12.
    Arrow for R ●Rcpp-based bindings ● https://github.com/apache/arrow/tree/master/r ● Goal: enable R package developers to leverage Arrow ecosystem for better performance and scalability
  • 13.
    Arrow format vs.R data.frame ● Type-independent representation of NA values (bits vs. special values) ● Better computational efficiency for strings ● Naturally chunk-based (vs. large contiguous allocations) ● Supports a much wider variety of data types, including nested data (JSON-like)
  • 14.
  • 15.
  • 16.
  • 17.
    flights %>% group_by(year, month,day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) Can be a massive Arrow dataset
  • 18.
    flights %>% group_by(year, month,day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) dplyr verbs can be translated to Arrow computation graphs, executed by parallel runtime Can be a massive Arrow dataset
  • 19.
    flights %>% group_by(year, month,day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) dplyr verbs can be translated to Arrow computation graphs, executed by parallel runtime R expressions can be JIT-compiled with LLVM Can be a massive Arrow dataset
  • 20.
    Keep up todate at https://arrow.apache.org https://ursalabs.org https://wesmckinney.com Thanks