Ursa Labs builds cross-language libraries like Apache Arrow for data science. Arrow provides a columnar data format and utilities for efficient serialization, IO, and querying across programming languages. Ursa Labs contributes to Arrow and funds open source developers to grow the Arrow ecosystem. Their goal is to reduce the CPU time spent on data serialization and enable faster data analysis in languages like R.
● Ursa Labs focuses on building data science libraries, improving the Apache Arrow ecosystem, and funding open-source developers.
● Arrow minimizes CPU time spent on serialization and deserialization, enables fast data reads and writes, and supports high-speed transfer methods like Arrow Flight RPC.
● Arrow enhances R's performance through native bindings and supports more data types than traditional R data frames.
● dplyr verbs in R can interface with Arrow's computation frameworks for efficient data analysis.
● Resources are provided for further information on Apache Arrow and Ursa Labs.
Arrow C++ Platform
● Multi-core Work Scheduler
● Core Data Platform
● Query Engine
● Datasets Framework
● Arrow Flight RPC
● Network Storage
Arrow Core
● Columnar format objects and utilities
● Memory management and generic IO
● Binary protocol / serialization functions
● Memory-mapping and zero-copy “parsing”
● Integration testing
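A minimal sketch of the serialization and zero-copy-read workflow, assuming the arrow R package is installed (the file name is just a temp file):

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Serialize with the Arrow IPC file format (Feather V2)
tf <- tempfile(fileext = ".feather")
write_feather(df, tf)

# Reading back can memory-map the file, so columns are
# "parsed" zero-copy rather than re-decoded
df2 <- read_feather(tf)
```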
C++ Datasets Framework
● Fast read and write of multi-file datasets
● Read only the parts of the dataset relevant to your analysis (“predicate pushdown”)
● Works across file formats (e.g. CSV) and storage systems
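A sketch of predicate pushdown over a partitioned multi-file dataset, assuming a recent arrow R package with Datasets support (the partitioned directory here is a temp directory created for illustration):

```r
library(arrow)
library(dplyr)

# Write a small Parquet dataset partitioned by year (hive-style dirs)
dir <- tempfile()
write_dataset(data.frame(year = c(2019L, 2019L, 2020L), x = 1:3),
              dir, partitioning = "year")

# Only partitions matching the filter are scanned ("predicate pushdown");
# hive-style partitions are detected from the directory names
ds <- open_dataset(dir)
res <- ds %>%
  filter(year == 2019) %>%
  select(x) %>%
  collect()
```

With a dataset of many large Parquet files, the same pipeline touches only the files and row groups the filter allows.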
Arrow Flight RPC (Messaging)
● Efficient client-server dataset interchange
● Focused on gRPC (Google’s messaging framework), but may support other transports in future
● It’s fast… really fast
○ Upwards of 3 GB/s server-to-client on localhost
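A client-side sketch of dataset interchange over Flight, assuming the arrow R package's Flight bindings (which wrap pyarrow via reticulate) and a Flight server already listening on a hypothetical localhost port:

```r
library(arrow)

# Connect to a hypothetical Flight server on localhost:8089
client <- flight_connect(host = "localhost", port = 8089)

# Upload a data frame under a path, then fetch it back as Arrow data
flight_put(client, data = mtcars, path = "uploaded/mtcars")
tbl <- flight_get(client, path = "uploaded/mtcars")
```

The gRPC transport streams Arrow record batches directly, which is where the multi-GB/s throughput comes from.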
Arrow for R
● Rcpp-based bindings
● https://github.com/apache/arrow/tree/master/r
● Goal: enable R package developers to leverage the Arrow ecosystem for better performance and scalability
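A sketch of basic usage of the bindings, assuming the arrow R package; the Parquet file is a temp file written for illustration:

```r
library(arrow)

tf <- tempfile(fileext = ".parquet")
write_parquet(data.frame(x = 1:3), tf)

# as_data_frame = FALSE keeps the data as an Arrow Table,
# deferring conversion into R vectors until it is needed
tbl <- read_parquet(tf, as_data_frame = FALSE)
df  <- as.data.frame(tbl)
```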
Arrow format vs. R data.frame
● Type-independent representation of NA values (a validity bitmap vs. type-specific special values)
● Better computational efficiency for strings
● Naturally chunk-based (vs. large contiguous allocations)
● Supports a much wider variety of data types, including nested data (JSON-like)
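Two of these differences can be seen directly, in a minimal sketch assuming the arrow R package:

```r
library(arrow)

# NAs are tracked in a validity bitmap, the same way for every type,
# rather than via R's type-specific sentinel values
a <- Array$create(c(1L, NA, 3L))
a$null_count   # count of NAs recorded in the bitmap

# Nested (JSON-like) data is a first-class column type
nested <- Array$create(list(1:2, 3:5))
nested$type    # a list type with no direct data.frame equivalent
```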
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

● flights can be a massive Arrow dataset
● dplyr verbs can be translated to Arrow computation graphs, executed by a parallel runtime
● R expressions can be JIT-compiled with LLVM
Keep up to date at
● https://arrow.apache.org
● https://ursalabs.org
● https://wesmckinney.com
Thanks