Apache Arrow: Leveling Up the Data Science Stack

Ursa Labs Mission
https://ursalabs.org
● Build cross-language, portable computational libraries for data science
● Grow the Apache Arrow ecosystem
● Funding and employment for full-time open source developers
● Not-for-profit, funded by multiple corporations

Life without Arrow
● Up to 80-90% of CPU cycles spent on de/serialization

Life with Arrow
● No de/serialization

Arrow C++ Platform
● Multi-core Work Scheduler
● Core Data Platform
● Query Engine
● Datasets Framework
● Arrow Flight RPC
● Network
● Storage

Arrow Core
● Columnar format objects and utilities
● Memory management and generic IO
● Binary protocol / serialization functions
● Memory-mapping and zero-copy “parsing”
● Integration testing
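The "memory-mapping and zero-copy parsing" bullet can be illustrated with a minimal sketch. It uses only the Python standard library, not the Arrow library, and the file name `column.bin` and its layout (a bare run of native-endian int64 values with no metadata) are assumptions made for illustration; a real Arrow file carries schema and framing metadata as well.

```python
import mmap
import os
import tempfile
from array import array

# Write a column of int64 values to disk in exactly its in-memory layout.
# (Hypothetical file and layout, chosen only for this sketch.)
column = array("q", range(1_000))
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    f.write(column.tobytes())

# Map the file and reinterpret its bytes as int64 values.
# No parsing step runs and no copy of the column is made; the OS
# pages data in only when an element is actually touched.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
    with memoryview(mapped) as raw, raw.cast("q") as view:
        first, last = view[0], view[-1]
        total = sum(view[i] for i in range(10, 20))
```

Because the on-disk bytes already match the in-memory representation, "reading" the column reduces to pointing at it, which is the idea behind the bullet above.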

C++ Datasets
● Fast read and write of multi-file datasets
● Read only the parts of the dataset relevant to your analysis (“predicate pushdown”)
Diagram labels: File Formats, Storage Systems, CSV
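One simple form of "read only the relevant parts" is partition pruning: when a multi-file dataset is laid out in directories named after a column value, a filter on that column can skip whole files without opening them. The sketch below is standard-library Python, not the Arrow Datasets API; the `year=...` directory layout, the file name `part-0.csv`, and the helper names `write_dataset` and `scan` are all hypothetical.

```python
import csv
import os
import tempfile

def write_dataset(root, rows_by_year):
    """Write one CSV file per year under root/year=<value>/ (hypothetical layout)."""
    for year, rows in rows_by_year.items():
        part_dir = os.path.join(root, f"year={year}")
        os.makedirs(part_dir)
        with open(os.path.join(part_dir, "part-0.csv"), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["month", "arr_delay"])
            writer.writerows(rows)

def scan(root, year_predicate):
    """Return (rows, files_opened), reading only partitions that pass the predicate."""
    rows, files_opened = [], 0
    for part_dir in sorted(os.listdir(root)):
        year = int(part_dir.split("=", 1)[1])
        if not year_predicate(year):
            continue  # pruned using the directory name alone; the file is never opened
        files_opened += 1
        with open(os.path.join(root, part_dir, "part-0.csv"), newline="") as f:
            for rec in csv.DictReader(f):
                rows.append({
                    "year": year,
                    "month": int(rec["month"]),
                    "arr_delay": float(rec["arr_delay"]),
                })
    return rows, files_opened

root = tempfile.mkdtemp()
write_dataset(root, {
    2017: [(1, 12.0), (2, 35.5)],
    2018: [(1, 40.0)],
    2019: [(3, 5.0), (4, 31.0)],
})
rows, files_opened = scan(root, lambda year: year >= 2019)
```

Here three files exist but only one is opened. Pushing predicates further down, for example into per-file statistics, follows the same principle of deciding what to skip before reading it.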

Arrow Flight RPC (Messaging)
● Efficient client-server dataset interchange
● Focused on gRPC (Google’s messaging framework), but may support other transports in the future
● It’s fast… really fast
  ○ Upwards of 3 GB/s server-to-client on localhost

Arrow for R
● Rcpp-based bindings
● https://github.com/apache/arrow/tree/master/r
● Goal: enable R package developers to leverage the Arrow ecosystem for better performance and scalability

Arrow format vs. R data.frame
● Type-independent representation of NA values (bits vs. special values)
● Better computational efficiency for strings
● Naturally chunk-based (vs. large contiguous allocations)
● Supports a much wider variety of data types, including nested data (JSON-like)
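The "bits vs. special values" point can be made concrete with a small sketch in standard-library Python. It mimics the idea of a validity bitmap (one bit per slot, 1 meaning "value present", least-significant bit first) stored next to the data; the helper names `pack_validity`, `is_valid`, and `null_count` are hypothetical and this is not the Arrow library itself.

```python
from array import array

def pack_validity(values):
    """Return (bitmap, data); bit i of bitmap is 1 when values[i] is not None."""
    bitmap = bytearray((len(values) + 7) // 8)
    data = array("q", [0] * len(values))  # int64 slots; null slots hold an arbitrary 0
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant bit first
            data[i] = v
    return bytes(bitmap), data

def is_valid(bitmap, i):
    """True when slot i holds a value rather than a null."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)

def null_count(bitmap, length):
    """Count null slots among the first `length` slots."""
    return length - sum(is_valid(bitmap, i) for i in range(length))

# -2147483648 is the sentinel R uses for NA in integer vectors. With a
# separate bitmap it is an ordinary value, and the same null
# representation works for every data type.
bitmap, data = pack_validity([7, None, -2147483648, 3])
```

Because nullness lives outside the data buffer, no value in any type's range has to be reserved to mean "missing".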




flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

● flights can be a massive Arrow dataset
● dplyr verbs can be translated to Arrow computation graphs, executed by a parallel runtime
● R expressions can be JIT-compiled with LLVM
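A toy sketch of what "translating verbs to a computation graph" could look like: each verb only records a plan node, and nothing runs until the plan is collected, which is the point at which a real engine could optimize and parallelize it. This is standard-library Python with a hypothetical `LazyQuery` class and made-up sample rows; it is not how dplyr or Arrow implement the translation, and it executes the plan serially.

```python
from collections import defaultdict

class LazyQuery:
    """Records verbs as plan nodes; no data is touched until collect()."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self.plan = plan

    def _add(self, node):
        return LazyQuery(self._rows, self.plan + (node,))

    def group_by(self, *keys):
        return self._add(("group_by", keys))

    def summarise_mean(self, **outputs):
        # outputs maps a new column name to the source column to average
        return self._add(("summarise_mean", outputs))

    def filter(self, predicate):
        return self._add(("filter", predicate))

    def collect(self):
        """Walk the recorded plan and execute it (serially, in this sketch)."""
        rows, keys = list(self._rows), ()
        for op, arg in self.plan:
            if op == "group_by":
                keys = arg
            elif op == "summarise_mean":
                groups = defaultdict(list)
                for r in rows:
                    groups[tuple(r[k] for k in keys)].append(r)
                rows = []
                for key, members in groups.items():
                    out = dict(zip(keys, key))
                    for new, src in arg.items():
                        # Skip missing values, like na.rm = TRUE
                        vals = [m[src] for m in members if m[src] is not None]
                        out[new] = sum(vals) / len(vals) if vals else None
                    rows.append(out)
            elif op == "filter":
                rows = [r for r in rows if arg(r)]
        return rows

# Made-up sample rows standing in for the flights table.
flights = [
    {"year": 2013, "month": 1, "day": 1, "arr_delay": 50.0, "dep_delay": 20.0},
    {"year": 2013, "month": 1, "day": 1, "arr_delay": None, "dep_delay": 10.0},
    {"year": 2013, "month": 1, "day": 2, "arr_delay": 5.0, "dep_delay": 4.0},
]
query = (
    LazyQuery(flights)
    .group_by("year", "month", "day")
    .summarise_mean(arr="arr_delay", dep="dep_delay")
    .filter(lambda r: r["arr"] > 30 or r["dep"] > 30)
)
result = query.collect()
```

Until `collect()` is called, `query.plan` is just a description of the work, so the same pipeline could in principle be handed to a different executor without changing the user's code.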

Thanks
Keep up to date at
● https://arrow.apache.org
● https://ursalabs.org
● https://wesmckinney.com
