Online statistical analysis using transducers and sketch algorithms
simon@metabase.com • @sbelak
Metabase ❤
github.com/metabase/metabase
• Open source analytics tool
• Building a “data scientist in a box”
• Hundreds to billions of rows
• Some DBs optimised for analytics, some not
Transducers at a glance
• Transducers decomplect the recursion mechanism, the transformation, building the output, and the access mechanism (example below)
• 3 user-facing “protocols”: xf, transducer, and CollReduce
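For instance (an illustrative sketch, not the slide's own code), the same transducer can be driven by different output-building and input-access machinery:

;; one transducer, many ways to build the output / access the input
(def xform (comp (filter odd?) (map inc)))

(transduce xform + 0 (range 10))   ; reduce straight to a value  => 30
(into [] xform (range 10))         ; build a collection          => [2 4 6 8 10]
(sequence xform (range 10))        ; lazily                      => (2 4 6 8 10)
;; ... or attach it to a core.async channel: (chan 1 xform)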
xf and transducer
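A minimal sketch of the two shapes (names are mine): an xf is a reducing function with init/step/completion arities; a transducer takes a reducing function and returns a decorated one.

;; xf: a reducing function with init / completion / step arities
(defn mean-rf
  ([] [0.0 0])                               ; init: [sum count]
  ([[sum n]] (if (pos? n) (/ sum n) 0.0))    ; completion
  ([[sum n] x] [(+ sum x) (inc n)]))         ; step

;; transducer: takes a reducing function, returns a wrapped one
(defn scale [factor]
  (fn [rf]
    (fn
      ([] (rf))
      ([acc] (rf acc))
      ([acc x] (rf acc (* factor x))))))

(transduce (scale 2) mean-rf [1 2 3])  ;=> 4.0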
Composing transducers
1. comp xfs
2. xf and transducer
3. github.com/henrygarner/redux (post-complete, fuse)
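A sketch of the three approaches (the reducing functions and sample data are mine; redux's fuse and post-complete combine reducing functions):

(require '[redux.core :as redux])

;; some reducing functions to combine
(defn count-rf ([] 0) ([acc] acc) ([acc _] (inc acc)))
(defn sum-rf   ([] 0) ([acc] acc) ([acc x] (+ acc x)))

;; 1. comp of transducers
(def xform (comp (remove nil?) (filter number?)))

;; 2. pair a transducer with a reducing function at the call site
(transduce xform sum-rf [1 nil 2 3])  ;=> 6

;; 3. combine reducing functions themselves: fuse runs several rfs over one
;;    pass, post-complete transforms the finished result
(def summary
  (redux/post-complete
   (redux/fuse {:n count-rf :sum sum-rf})
   (fn [{:keys [n sum]}]
     {:n n :sum sum :mean (when (pos? n) (/ sum n))})))

(transduce xform summary [1 nil 2 3])  ;=> {:n 3, :sum 6, :mean 2}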
On-line/streaming analysis
Many batch algorithms can be turned into online ones:
• Parallelize independent computations
• Find a recursive relation (see the sketch below)
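For example, the arithmetic mean has a simple recursive relation, so it can be computed online as a reducing function (an illustrative sketch):

;; running mean: mean_n = mean_(n-1) + (x_n - mean_(n-1)) / n
(defn online-mean
  ([] {:n 0 :mean 0.0})
  ([{:keys [mean]}] mean)
  ([{:keys [n mean]} x]
   (let [n' (inc n)]
     {:n n' :mean (+ mean (/ (- x mean) n'))})))

(transduce identity online-mean [1 2 3 4])  ;=> 2.5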
github.com/MastodonC/kixi.stats
• Count
• (Arithmetic) mean
• Geometric mean
• Harmonic mean
• Median
• Variance
• Interquartile range
• Standard deviation
• Standard error
• Skewness
• Kurtosis
• Covariance
• Covariance matrix
• Correlation
• Correlation matrix
• Simple linear regression
• Standard error of the mean
• Standard error of the estimate
• Standard error of the prediction
• …
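kixi.stats exposes these as reducing functions meant to be used with transduce; a minimal usage sketch (the sample data shape is assumed):

(require '[kixi.stats.core :as kixi])

(def data [{:x 2 :y 1} {:x 3 :y 4} {:x 5 :y 5} {:x 5 :y 7} {:x 6 :y 8}])

(transduce identity kixi/standard-deviation [2 3 5 5 6])
(transduce (map :x) kixi/mean data)
(transduce identity (kixi/correlation :x :y) data)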
Single-pass analysis
Using transducers is worth it for the composition alone
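Composed this way, several statistics fall out of a single pass over the data; a sketch, assuming kixi.stats and redux on the classpath and an illustrative data shape:

(require '[kixi.stats.core :as kixi]
         '[redux.core :as redux])

(def data [{:x 1.0} {:x 2.0} {:x 4.0}])

(transduce (map :x)
           (redux/fuse {:mean     kixi/mean
                        :variance kixi/variance
                        :sd       kixi/standard-deviation})
           data)
;; => {:mean ..., :variance ..., :sd ...}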
Annoyances
• Can only transduce one coll at a time
• Always have to pass in an xf
• Functions that may or may not return a transducer (depending on arity) are error-prone
Sketch algorithms
Idea: summarise your data with a compact data structure and query that structure instead of the raw data
Histograms
Histogram construction
1. Pick a number of buckets K
2. For each incoming value:
   1. If a bucket for it already exists, increment its count
   2. Else, add a new bucket with count = 1
3. If there are > K buckets, find the two closest buckets and merge them
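A minimal sketch of that construction (helper names are mine; buckets are [centroid count] pairs kept sorted by centroid):

(defn- closest-pair-idx
  "Index i such that buckets i and i+1 have the smallest centroid gap."
  [buckets]
  (->> (map (fn [[c1 _] [c2 _]] (- c2 c1)) buckets (rest buckets))
       (map-indexed vector)
       (apply min-key second)
       first))

(defn- merge-buckets [[c1 n1] [c2 n2]]
  [(/ (+ (* c1 n1) (* c2 n2)) (+ n1 n2)) (+ n1 n2)])

(defn insert-value
  "Insert x into a histogram of at most k buckets."
  [k buckets x]
  (let [i       (first (keep-indexed #(when (== (first %2) x) %1) buckets))
        buckets (if i
                  (update-in buckets [i 1] inc)                 ; existing bucket
                  (vec (sort-by first (conj buckets [x 1]))))]  ; new bucket
    (if (<= (count buckets) k)
      buckets
      (let [i (closest-pair-idx buckets)]                       ; merge closest pair
        (vec (concat (subvec buckets 0 i)
                     [(merge-buckets (buckets i) (buckets (inc i)))]
                     (subvec buckets (+ i 2))))))))

;; build a histogram in one reduce:
;; (reduce (partial insert-value 32) [] xs)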
Nice property: merge
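Histograms built independently, e.g. on separate chunks or separate machines, can be combined afterwards; a sketch, reusing the helpers from the construction above:

(defn merge-histograms
  "Merge two bucket vectors, then shrink back to at most k buckets."
  [k h1 h2]
  (loop [buckets (vec (sort-by first (concat h1 h2)))]
    (if (<= (count buckets) k)
      buckets
      (let [i (closest-pair-idx buckets)]
        (recur (vec (concat (subvec buckets 0 i)
                            [(merge-buckets (buckets i) (buckets (inc i)))]
                            (subvec buckets (+ i 2)))))))))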
Estimating values
• Assume the bin mean is also its median
• Do weighted interpolations
• Often we can be precise up to the two bounding buckets
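A rough sketch of such an estimate (my own helper, operating on the [centroid count] buckets above): walk the cumulative counts and interpolate between the two buckets that bound the target rank.

(defn estimate-quantile
  "Approximate the q-th quantile (0 <= q <= 1) from [centroid count] buckets."
  [buckets q]
  (let [total  (reduce + (map second buckets))
        target (* q total)]
    (loop [[[c1 n1] & more] buckets, cum 0]
      (let [cum' (+ cum n1)]
        (cond
          (empty? more)    c1                           ; last bucket: its centroid
          (>= cum' target) (let [[c2 _] (first more)    ; interpolate towards the
                                 frac   (/ (- target cum) n1)] ; next centroid
                             (+ c1 (* frac (- c2 c1))))
          :else            (recur more cum'))))))

;; (estimate-quantile buckets 0.5) — approximate median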
Nice property II: decouples data collection from computation
github.com/bigmlcom/histogram
Aside: transducers are a good way to wrap Java/imperative construction
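For example (an illustration using a JDK class rather than the slide's own code), a mutable Java accumulator fits naturally into the init/step/complete arities of a reducing function; bigml's histogram can be wrapped in the same style:

(defn summary-stats
  "Reducing function around java.util.DoubleSummaryStatistics."
  ([] (java.util.DoubleSummaryStatistics.))               ; init: fresh mutable object
  ([^java.util.DoubleSummaryStatistics acc]               ; complete: freeze into a map
   {:count (.getCount acc) :min (.getMin acc)
    :max (.getMax acc) :mean (.getAverage acc)})
  ([^java.util.DoubleSummaryStatistics acc x]             ; step: imperative insert
   (.accept acc (double x))
   acc))

(transduce (filter number?) summary-stats [1 2 "x" 3.5])
;; => {:count 3, :min 1.0, :max 3.5, :mean ...}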
Having distributions readily available is great
Sampling trick
Time slices
What about categorical data?
Count–min sketch
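The idea, in a minimal toy sketch of my own (a real implementation, e.g. stream-lib's, uses proper pairwise-independent hash functions): d rows of w counters, each item hashed to one counter per row; the estimate is the minimum over its counters, so it can over-count but never under-count.

(defn cm-sketch [d w]
  {:w w :rows (vec (repeat d (vec (repeat w 0))))})

(defn- cell [w seed x]
  (mod (hash [seed x]) w))          ; toy seeded hash

(defn cm-add [{:keys [w rows] :as sk} x]
  (assoc sk :rows
         (vec (map-indexed (fn [seed row] (update row (cell w seed x) inc))
                           rows))))

(defn cm-count
  "Estimated count of x: the minimum over x's counters."
  [{:keys [w rows]} x]
  (apply min (map-indexed (fn [seed row] (row (cell w seed x))) rows)))

(-> (cm-sketch 4 1024)
    (cm-add "clojure") (cm-add "clojure") (cm-add "scala")
    (cm-count "clojure"))  ;=> 2 (barring collisions)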
Often approximations are good enough
github.com/addthis/stream-lib
Takeaways
• Transducers are not only performant but also a good modularization protocol
• You don’t realise how often you want a distribution until you have it readily available
• Often approximations are good enough
• You can get surprisingly far on a single machine
Questions
simon@metabase.com • @sbelak
