Online statistical analysis using transducers and sketch algorithms
simon@metabase.com • @sbelak
Metabase ❤
github.com/metabase/metabase
• Open source analytics tool
• Building a “data scientist in a box”
• Hundreds to billions of rows
• Some DBs optimised for analytics, some not
Transducers at a glance
• Transducers decomplect the recursion mechanism, the transformation, building the output, and the access mechanism (example below)
• 3 user-facing “protocols”: xf, transducer, and CollReduce
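For instance (an illustrative sketch, not the slide's own code), the same transducer can be driven by different output-building and input-access machinery:

;; one transducer, many ways to build the output / access the input
(def xform (comp (filter odd?) (map inc)))

(transduce xform + 0 (range 10))   ; reduce straight to a value  => 30
(into [] xform (range 10))         ; build a collection          => [2 4 6 8 10]
(sequence xform (range 10))        ; lazily                      => (2 4 6 8 10)
;; ... or attach it to a core.async channel: (chan 1 xform)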
xf and transducer
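A minimal sketch of the two shapes (names are mine): an xf is a reducing function with init/step/completion arities; a transducer takes a reducing function and returns a decorated one.

;; xf: a reducing function with init / completion / step arities
(defn mean-rf
  ([] [0.0 0])                               ; init: [sum count]
  ([[sum n]] (if (pos? n) (/ sum n) 0.0))    ; completion
  ([[sum n] x] [(+ sum x) (inc n)]))         ; step

;; transducer: takes a reducing function, returns a wrapped one
(defn scale [factor]
  (fn [rf]
    (fn
      ([] (rf))
      ([acc] (rf acc))
      ([acc x] (rf acc (* factor x))))))

(transduce (scale 2) mean-rf [1 2 3])  ;=> 4.0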
Composing transducers
1. comp xfs
2. xf and transducer
3. github.com/henrygarner/redux (post-complete, fuse)
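A sketch of the three approaches (the reducing functions and sample data are mine; redux's fuse and post-complete combine reducing functions):

(require '[redux.core :as redux])

;; some reducing functions to combine
(defn count-rf ([] 0) ([acc] acc) ([acc _] (inc acc)))
(defn sum-rf   ([] 0) ([acc] acc) ([acc x] (+ acc x)))

;; 1. comp of transducers
(def xform (comp (remove nil?) (filter number?)))

;; 2. pair a transducer with a reducing function at the call site
(transduce xform sum-rf [1 nil 2 3])  ;=> 6

;; 3. combine reducing functions themselves: fuse runs several rfs over one
;;    pass, post-complete transforms the finished result
(def summary
  (redux/post-complete
   (redux/fuse {:n count-rf :sum sum-rf})
   (fn [{:keys [n sum]}]
     {:n n :sum sum :mean (when (pos? n) (/ sum n))})))

(transduce xform summary [1 nil 2 3])  ;=> {:n 3, :sum 6, :mean 2}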
On-line/streaming analysis
Many batch algorithms can be turned into online ones:
• Parallelize independent computations
• Find a recursive relation (see the sketch below)
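For example, the arithmetic mean has a simple recursive relation, so it can be computed online as a reducing function (an illustrative sketch):

;; running mean: mean_n = mean_(n-1) + (x_n - mean_(n-1)) / n
(defn online-mean
  ([] {:n 0 :mean 0.0})
  ([{:keys [mean]}] mean)
  ([{:keys [n mean]} x]
   (let [n' (inc n)]
     {:n n' :mean (+ mean (/ (- x mean) n'))})))

(transduce identity online-mean [1 2 3 4])  ;=> 2.5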
github.com/MastodonC/kixi.stats
• Count
• (Arithmetic) mean
• Geometric mean
• Harmonic mean
• Median
• Variance
• Interquartile range
• Standard deviation
• Standard error
• Skewness
• Kurtosis
• Covariance
• Covariance matrix
• Correlation
• Correlation matrix
• Simple linear regression
• Standard error of the mean
• Standard error of the estimate
• Standard error of the prediction
• …
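kixi.stats exposes these as reducing functions meant to be used with transduce; a minimal usage sketch (the sample data shape is assumed):

(require '[kixi.stats.core :as kixi])

(def data [{:x 2 :y 1} {:x 3 :y 4} {:x 5 :y 5} {:x 5 :y 7} {:x 6 :y 8}])

(transduce identity kixi/standard-deviation [2 3 5 5 6])
(transduce (map :x) kixi/mean data)
(transduce identity (kixi/correlation :x :y) data)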
Single-pass analysis
Using transducers is worth it for the composition alone
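Composed this way, several statistics fall out of a single pass over the data; a sketch, assuming kixi.stats and redux on the classpath and an illustrative data shape:

(require '[kixi.stats.core :as kixi]
         '[redux.core :as redux])

(def data [{:x 1.0} {:x 2.0} {:x 4.0}])

(transduce (map :x)
           (redux/fuse {:mean     kixi/mean
                        :variance kixi/variance
                        :sd       kixi/standard-deviation})
           data)
;; => {:mean ..., :variance ..., :sd ...}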
Annoyances
• Can only transduce one coll at a time
• Always have to pass in an xf
• Functions that may or may not return a transducer (depending on arity) are error-prone
Sketch algorithms
Idea: summarise your data with a compact data structure and query that structure instead of the raw data
Histograms
Histogram construction
1. Pick a number of buckets K
2. For each incoming value:
   1. If a bucket for it already exists, increment its count
   2. Else, add a new bucket with count = 1
3. If there are > K buckets, find the two closest buckets and merge them
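A minimal sketch of that construction (helper names are mine; buckets are [centroid count] pairs kept sorted by centroid):

(defn- closest-pair-idx
  "Index i such that buckets i and i+1 have the smallest centroid gap."
  [buckets]
  (->> (map (fn [[c1 _] [c2 _]] (- c2 c1)) buckets (rest buckets))
       (map-indexed vector)
       (apply min-key second)
       first))

(defn- merge-buckets [[c1 n1] [c2 n2]]
  [(/ (+ (* c1 n1) (* c2 n2)) (+ n1 n2)) (+ n1 n2)])

(defn insert-value
  "Insert x into a histogram of at most k buckets."
  [k buckets x]
  (let [i       (first (keep-indexed #(when (== (first %2) x) %1) buckets))
        buckets (if i
                  (update-in buckets [i 1] inc)                 ; existing bucket
                  (vec (sort-by first (conj buckets [x 1]))))]  ; new bucket
    (if (<= (count buckets) k)
      buckets
      (let [i (closest-pair-idx buckets)]                       ; merge closest pair
        (vec (concat (subvec buckets 0 i)
                     [(merge-buckets (buckets i) (buckets (inc i)))]
                     (subvec buckets (+ i 2))))))))

;; build a histogram in one reduce:
;; (reduce (partial insert-value 32) [] xs)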
Nice property: merge
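Histograms built independently, e.g. on separate chunks or separate machines, can be combined afterwards; a sketch, reusing the helpers from the construction above:

(defn merge-histograms
  "Merge two bucket vectors, then shrink back to at most k buckets."
  [k h1 h2]
  (loop [buckets (vec (sort-by first (concat h1 h2)))]
    (if (<= (count buckets) k)
      buckets
      (let [i (closest-pair-idx buckets)]
        (recur (vec (concat (subvec buckets 0 i)
                            [(merge-buckets (buckets i) (buckets (inc i)))]
                            (subvec buckets (+ i 2)))))))))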
Estimating values
• Assume the bin mean is also its median
• Do weighted interpolations
• Often we can be precise up to the two bounding buckets
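A rough sketch of such an estimate (my own helper, operating on the [centroid count] buckets above): walk the cumulative counts and interpolate between the two buckets that bound the target rank.

(defn estimate-quantile
  "Approximate the q-th quantile (0 <= q <= 1) from [centroid count] buckets."
  [buckets q]
  (let [total  (reduce + (map second buckets))
        target (* q total)]
    (loop [[[c1 n1] & more] buckets, cum 0]
      (let [cum' (+ cum n1)]
        (cond
          (empty? more)    c1                           ; last bucket: its centroid
          (>= cum' target) (let [[c2 _] (first more)    ; interpolate towards the
                                 frac   (/ (- target cum) n1)] ; next centroid
                             (+ c1 (* frac (- c2 c1))))
          :else            (recur more cum'))))))

;; (estimate-quantile buckets 0.5) — approximate median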
Nice property II: decouples data collection from computation
github.com/bigmlcom/histogram
Aside: transducers are a good way to wrap Java/imperative construction
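For example (an illustration using a JDK class rather than the slide's own code), a mutable Java accumulator fits naturally into the init/step/complete arities of a reducing function; bigml's histogram can be wrapped in the same style:

(defn summary-stats
  "Reducing function around java.util.DoubleSummaryStatistics."
  ([] (java.util.DoubleSummaryStatistics.))               ; init: fresh mutable object
  ([^java.util.DoubleSummaryStatistics acc]               ; complete: freeze into a map
   {:count (.getCount acc) :min (.getMin acc)
    :max (.getMax acc) :mean (.getAverage acc)})
  ([^java.util.DoubleSummaryStatistics acc x]             ; step: imperative insert
   (.accept acc (double x))
   acc))

(transduce (filter number?) summary-stats [1 2 "x" 3.5])
;; => {:count 3, :min 1.0, :max 3.5, :mean ...}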
Having distributions readily available is great
Sampling trick
Time slices
What about categorical data?
Count–min sketch
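The idea, in a minimal toy sketch of my own (a real implementation, e.g. stream-lib's, uses proper pairwise-independent hash functions): d rows of w counters, each item hashed to one counter per row; the estimate is the minimum over its counters, so it can over-count but never under-count.

(defn cm-sketch [d w]
  {:w w :rows (vec (repeat d (vec (repeat w 0))))})

(defn- cell [w seed x]
  (mod (hash [seed x]) w))          ; toy seeded hash

(defn cm-add [{:keys [w rows] :as sk} x]
  (assoc sk :rows
         (vec (map-indexed (fn [seed row] (update row (cell w seed x) inc))
                           rows))))

(defn cm-count
  "Estimated count of x: the minimum over x's counters."
  [{:keys [w rows]} x]
  (apply min (map-indexed (fn [seed row] (row (cell w seed x))) rows)))

(-> (cm-sketch 4 1024)
    (cm-add "clojure") (cm-add "clojure") (cm-add "scala")
    (cm-count "clojure"))  ;=> 2 (barring collisions)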
Often approximations are good enough
github.com/addthis/stream-lib
Takeaways
• Transducers are not only performant but also a good modularization protocol
• You don’t realise how often you want a distribution until you have it readily available
• Often approximations are good enough
• You can get surprisingly far on a single machine
Questions
simon@metabase.com • @sbelak
