Distributed Machine Learning using Apache Spark from the Browser Devoxx Belgium 2015, Antwerpen
Outline: ● Distributed computing ● What is Machine Learning? ● Spark for machine learning? ● Spark MLlib by examples ● Spark and other libraries ● Wrap up
Data Fellas: Andy Petrella (Maths, Geospatial, Distributed Computing, Spark Notebook, Spark/Scala trainer, Machine Learning) and Xavier Tordoir (Physics, Bioinformatics, Distributed Computing, Scala (& Perl) trainer, Spark, Machine Learning)
Distributed Computing Why you must care, by Data Fellas Andy Petrella & Xavier Tordoir
Computing. Uba ga! Traditionally, tasks are entirely performed on a single computer using three main resources: Processing Power, Memory, and Storage.
Computing. Oh no! Hence performance is limited in time (Processing Power) and in space (Memory, Storage).
Distributing. Interesting. Distributed computing: [...] A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. [...] Ref: https://en.wikipedia.org/wiki/Distributed_computing
Consequences. Oh no! Algorithms have to work on data partitions and with partial results: the entire dataset cannot be accessed at once.
New resource! Damned. The network joins Processing Power (time), Memory and Storage (space) as a resource, and it will impact performance...
Distributing. Oops, did it again. [Diagram: four nodes, each with Processing, Memory and Storage, connected over the network]
Drawback: Partitions. Huh? [Same cluster diagram: the data is split into partitions across the nodes]
Drawback: Partitions. Hey, you sank my node! [Same cluster diagram, with one node failing: BOOM]
Advantage: Elastic scaling. Ouch, my rack. What if this cluster happens to not be big enough? [Cluster diagram: four nodes with Processing, Memory and Storage over the network]
Advantage: Elastic scaling. That's more reasonable. [Diagram: a second identical cluster is added and connected over the network]
What about HPC? Yeah! HPC: computationally intensive applications. Model: specialized hardware (CPU/GPU) and network, orchestrated by a scheduler that gathers their computing power and memory.
What about HPC? Got no money and no time. Drawbacks: ● Costs and upgrades come in large blocks ● Decoupled storage: storage latency = no streaming / no iteration
Iterate. Do that, baby! Why process data if not to model? Machine learning is iterative (streaming & batch). Data is aggregated in the form of a model (parameters). Data changes little, the model is small.
Iterate. You gotta be kidding. [Cluster diagram: each iteration moves lots of data between Storage and Processing/Memory, again and again...]
Summary. Interesting. Distributed computing allows cost-effective parallelism. Efficiency requires distributed storage, colocated with the processing units. What about programming models?
Distributed storage. Partitions! HDFS: the Apache implementation of the Google File System ● Natural fit for distributed storage ● Works as a service. Other chunked sources: ● Apache Cassandra, S3, Tachyon, ...
Distributed storage: Split. The client puts /data/f256.txt (256 Mb, replication factor 2); the Name Node coordinates Data Nodes 1 to 4.
Distributed storage: Split. The Name Node splits the file into 64 Mb blocks.
Distributed storage: Everywhere. Each 64 Mb block is written as a part file (e.g. /data/f256.txt/part-r-00000) to a Data Node.
Distributed storage: Everywhere. All 64 Mb blocks end up spread across the Data Nodes.
Distributed storage: Replicate. With replication factor 2, every 64 Mb block is copied to a second Data Node.
Map Reduce, High Level Execution (the rocket's base): the data is loaded as data partitions.
Map Reduce, High Level Execution (the rocket's engines): Map and Pair; each data partition goes through a mapper.
Map Reduce, High Level Execution (the rocket's trunk): Shuffle; pairs are grouped by key (GroupByKey).
Map Reduce, High Level Execution (the rocket's cockpit): Reduce; the values for each key are reduced.
Map Reduce, High Level Execution (the rocket's tip): Results; we collect the results.
Map Reduce, High Level Execution: To infinity and beyond! The whole #!: data partitions → mappers → GroupByKey → reducers → results.
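As a concrete illustration of that pipeline (not from the deck, just the classic example): a minimal word count in Spark's Scala API, assuming a SparkContext sc is already available as in the Spark Notebook; the input path is a placeholder.

    // Load the data as partitions (path is a placeholder)
    val lines = sc.textFile("hdfs:///data/f256.txt")

    // Map and Pair: emit one (word, 1) pair per word
    val pairs = lines.flatMap(_.split("\\s+")).map(word => (word, 1))

    // Shuffle by key and Reduce: sum the counts for each word
    val counts = pairs.reduceByKey(_ + _)

    // Collect (part of) the results back to the driver
    counts.take(10).foreach(println)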
Map Reduce, Matrix-Vector Product: How about word count? [Diagram: a matrix multiplied by a vector]
Map Reduce, Matrix-Vector Product: Back to school...
Map Reduce, Matrix-Vector Product: Wait, that's maths.
Map Reduce, Matrix-Vector Product: Where is the RAT? Store the matrix as an ordered set of elements; the vector V is loaded in memory as an ordered array. Map function: each matrix element is mapped onto a product.
Map Reduce, Matrix-Vector Product: OK... I take over. MAP: each element m_ij is multiplied by v_j and keyed by its row i.
Map Reduce, Matrix-Vector Product: just a sum... REDUCE: the products of each row are summed, giving one entry of the result vector.
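The same idea sketched in Spark's Scala API (a hedged illustration, not the speakers' code): the matrix is a distributed dataset of (row, col, value) entries, the vector is small enough to broadcast, the map emits one product per element and the reduce is just a sum per row.

    // Matrix as (row, col, value) entries; V broadcast to every worker
    val matrix = sc.parallelize(Seq((0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)))
    val v = sc.broadcast(Array(10.0, 100.0))

    // MAP: each matrix element becomes a (row, m_ij * v_j) pair
    val products = matrix.map { case (i, j, m) => (i, m * v.value(j)) }

    // REDUCE: just a sum of the products for each row
    val result = products.reduceByKey(_ + _).collect().sortBy(_._1)
    // result: Array((0, 210.0), (1, 430.0))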
Map Reduce, Summary. Summary == Reduce? A simple abstraction of computations (Map and Reduce) using a simple abstraction of data (key-value pairs).
Map Reduce Summary So what? Brings transparent: ● parallelization ● distribution ● fault tolerance
Why Apache Spark: MapReduce on steroids. Man... finally! Uses ● the functional paradigm ● lazy computations. Creates dependencies between task definitions and optimizes execution.
Why Apache Spark: MapReduce on steroids. Almost forgot that one. Can cache data in memory or on the local file system: far less IO and network traffic.
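In code this looks roughly as follows (illustrative sketch, names and path assumed, sc available): transformations such as textFile and map are lazy, and cache() keeps the computed partitions in memory so iterations do not re-read the source.

    // Nothing is computed yet: textFile and map are lazy transformations
    val features = sc.textFile("hdfs:///data/points.csv")
                     .map(_.split(',').map(_.toDouble))
                     .cache() // keep the parsed partitions in memory

    // The first action materializes the RDD and fills the cache
    println(features.count())

    // Later iterations reuse the cached partitions: far less IO and network
    for (i <- 1 to 10) {
      val total = features.map(v => v.sum).sum()
      println(s"iteration $i, total = $total")
    }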
What is Machine learning? Why you must care, by Data Fellas Andy Petrella & Xavier Tordoir
you cannot prove a vague theory is wrong […] Also, if the process of computing the consequences is indefinite, then with a little skill any experimental result can be made to look like the expected consequences. —Richard Feynman [1964] What is Machine Learning? Science with data Surely You’re Joking Mr…
What is Machine Learning? Overview. The 2nd law neither... ● Modelling without first principles...
What is Machine Learning? Overview. Take that, Newton... ● Modelling without first principles... Machine learning is what you do with a Learning Machine.
What is Machine Learning? Overview. With some "a priori" knowledge. ● Modelling without first principles... ● Modelling dependencies from the data.
What is Machine Learning? Learning Machine... You still need a domain expert... like me! ● What is the problem? ● Hypothesis? ● Data generation process? ● Collection and preprocessing ● Interpretation
What is Machine Learning? Overview. Machine learning is what you do with a Learning Machine. ● Estimate dependencies from data. [Diagram: a Generator draws samples x; the System responds with y; the Learning Machine observes (x, y) and produces its estimate ỹ]
What is Machine Learning? Overview. I like them so much in LaTeX2e. ● Estimate dependencies from data ● Minimize a risk functional over the set of functions, given the data. [Same Generator / System / Learning Machine diagram]
What is Machine Learning? Supervised learning. L(y, f(x, w)) = (y - f(x, w))^2... WTF? ● Regression: continuous output ○ Risk = prediction error ● Classification: categorical output ○ Risk = probability of misclassification
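The formula the speech bubble mangles, written out (standard formulation, added for readability): learning minimizes the expected loss of f(x, w) over the unknown data distribution, with squared error as the typical regression loss and misclassification as the typical classification loss.

    % Risk functional: expected loss over the unknown joint distribution p(x, y)
    R(w) = \int L\big(y, f(x, w)\big)\, p(x, y)\, dx\, dy

    % Typical losses: squared error (regression), 0/1 misclassification (classification)
    L\big(y, f(x, w)\big) = \big(y - f(x, w)\big)^2
    \qquad
    L\big(y, f(x, w)\big) = \mathbb{1}\big[\, y \neq f(x, w) \,\big]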
What is Machine Learning? Unsupervised learning: no output. I like clusters, especially with roasted nuts. ● Clustering ○ Risk = distortion error (distances to cluster centers) ● Density estimation (probability densities)
What is Machine Learning? Bias - Variance, Regression illustration Playtime! Notebook!
What is Machine Learning? Inductive principle. In principle, it should work. An inductive principle tells you what to do to get from finite data to a model. [Diagram: Finite Data → Inductive Principle → Model]
What is Machine Learning? Inductive principle. In principle, it should work. Empirical risk minimization. [Finite Data → Model] • The function class is not defined • The loss is not defined • The optimization procedure is not defined
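Written out for reference (standard form, not from the slides): empirical risk minimization replaces the expectation over p(x, y) with the average loss over the n observed samples.

    % Empirical risk over the finite sample (x_1, y_1), ..., (x_n, y_n)
    R_{\mathrm{emp}}(w) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i, w)\big)
    \qquad
    \hat{w} = \arg\min_{w} R_{\mathrm{emp}}(w)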
What is Machine Learning? Inductive principle. In principle, it should work. Regularization. [Finite Data → Model] • Control on penalty strength • Penalize complexity / encode a priori knowledge
What is Machine Learning? Inductive principle. In principle, it should work. Early stopping rules. [Finite Data → Model] • Iterative optimization • Depends on initial parameters and the algorithm • Used for neural networks • Penalizes along the optimization path
What is Machine Learning? Inductive principle. In principle, it should work. Structural risk minimization. [Finite Data → Model] • Analytic bounds on the risk based on the empirical risk
What is Machine Learning? Inductive principle. In principle, it should work. Bayesian inference. [Finite Data → Model] • Explicit a priori probabilities • Learn mixtures • Hard multidimensional integrations...
What is Machine Learning? Curse of dimensionality. In principle, it should work. We want to control complexity. • Smoothness constraint in a neighborhood
What is Machine Learning? Curse of dimensionality. In principle, it should work. Data density is key... [Diagram: Finite Data in a space → Inductive Principle → Model Complexity]
What is Machine Learning? Curse of dimensionality. In principle, it should work. Data density is key... e.g. ● 1-D, one point per 0.1 m => 10 points/m ● 2-D, one point per 0.1 m in each direction => 100 points/m^2 ● d-D, one point per 0.1 m in each direction => 10^d points/m^d. The same smoothness requires lots of data in high-dimensional spaces.
What is Machine Learning? Curse of dimensionality. In principle, it should work. Sampling is hard... To capture a 10% sample, the sampling window must span this fraction of each dimension's range: ● 1-D => 0.1 ● 2-D => 0.31 ● 10-D => 0.79 => local estimates from samples are difficult.
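One way to read those numbers (my derivation, assuming uniformly spread data): a hypercube that captures a fraction p of the data must span p^{1/d} of each dimension's range.

    % Edge length needed to capture a fraction p of the data in d dimensions
    \ell(p, d) = p^{1/d}
    \qquad
    \ell(0.1, 1) = 0.10, \quad
    \ell(0.1, 2) = 0.1^{1/2} \approx 0.316, \quad
    \ell(0.1, 10) = 0.1^{1/10} \approx 0.794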
What is Machine Learning? Curse of dimensionality. In principle, it should work. Data points are closer to the edges... Each data point "sees" itself as an outlier => predictions require lots of extrapolation.
What is Machine Learning? Curse of dimensionality In principle, it should work. Samples must increase exponentially … or model complexity must be controlled
What is Machine Learning? Regularization in more detail. In principle, it should work. Data-driven penalized risk minimization.
What is Machine Learning? Regularization in more detail. In principle, it should work. Loss functions.
What is Machine Learning? Regularization in more detail. In principle, it should work. Regularizers: L2 (ridge), L1 (lasso), Elastic net.
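The formulas behind these three slides, written out in the usual notation (the deck showed them as images; these are the standard definitions, with one common parameterization of the elastic net):

    % Penalized empirical risk: data fit plus a penalty weighted by lambda
    \hat{w} = \arg\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i, w)\big) + \lambda\, \Omega(w)

    % Regularizers
    \Omega_{L2}(w) = \tfrac{1}{2} \lVert w \rVert_2^2                                        % ridge
    \qquad
    \Omega_{L1}(w) = \lVert w \rVert_1                                                        % lasso
    \qquad
    \Omega_{EN}(w) = \alpha \lVert w \rVert_1 + \tfrac{1 - \alpha}{2} \lVert w \rVert_2^2     % elastic net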
What is Machine Learning? Regularization in more detail. In principle, it should work. Optimization (there comes the fun...). Which algorithm can find a minimum in a distributed fashion? Convex optimization methods (for linear methods): ● Gradient descent ● Stochastic gradient descent ● Limited-memory BFGS (L-BFGS)
What is Machine Learning? Regularization in more detail. In principle, it should work. Optimization (there comes the fun...). Gradient descent ● Efficient steps, but needs to read through the whole dataset at every step
What is Machine Learning? Regularization in more detail. In principle, it should work. Optimization (there comes the fun...). Stochastic gradient descent ● Samples the data for each step, but converges very slowly
What is Machine Learning? Regularization in more detail. In principle, it should work. Optimization (there comes the fun...). L-BFGS ● Estimates second-order (quadratic) information by keeping several previous gradients in memory ● Fast convergence
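The update rules being compared, written out (standard forms, added for clarity): gradient descent averages the gradient over all n points at every step, while SGD uses one sampled point (or a small batch) per step.

    % Gradient descent: full pass over the data per step, step size gamma
    w_{t+1} = w_t - \gamma \, \frac{1}{n} \sum_{i=1}^{n} \nabla_w L\big(y_i, f(x_i, w_t)\big)

    % Stochastic gradient descent: one randomly sampled point i_t per step
    w_{t+1} = w_t - \gamma_t \, \nabla_w L\big(y_{i_t}, f(x_{i_t}, w_t)\big)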
What is Machine Learning? Model selection. All work and no play makes Jack a dull boy. Model complexity control: resampling. Selecting the right lambda... ...to minimize the prediction risk.
What is Machine Learning? Model selection Enough theory boy! The universe
What is Machine Learning? Model selection Enough theory boy! Our data
What is Machine Learning? Model selection Enough theory boy! Our data Learning Set (70%) validation set (30%)
What is Machine Learning? Model selection. Nice flag. K-fold cross validation, K = 4: the data is split into 4 folds, each used once as the validation set.
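A minimal K-fold sketch with Spark MLlib's Scala API (Spark 1.x), assuming an RDD[LabeledPoint] called data and a hypothetical list of candidate lambdas; MLUtils.kFold splits the data into (training, validation) pairs.

    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.mllib.regression.{LabeledPoint, RidgeRegressionWithSGD}
    import org.apache.spark.rdd.RDD

    def selectLambda(data: RDD[LabeledPoint], lambdas: Seq[Double]): Double = {
      val folds = MLUtils.kFold(data, 4, 42) // K = 4 folds, fixed seed
      val avgErrors = lambdas.map { lambda =>
        val errors = folds.map { case (training, validation) =>
          val model = RidgeRegressionWithSGD.train(training, 100, 1.0, lambda)
          // mean squared error on the held-out fold
          validation.map(p => math.pow(p.label - model.predict(p.features), 2)).mean()
        }
        (lambda, errors.sum / errors.length)
      }
      avgErrors.minBy(_._2)._1 // the lambda with the lowest average validation error
    }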
MLLib A library to learn them all...
Distributed computing framework Large Scale Data Processing engine What is Apache Spark? I play BIG!
Distributed computing framework Large Scale Data Processing engine ● SQL & Dataframes ● Streaming ● Graph Processing ● Machine Learning With all colors! What is Apache Spark?
Distributed computing framework Large Scale Data Processing engine ● Optimize memory usage (FAST) ● Optimize computation execution (Complex tasks) ● Easy programming model Let the brain do the work... What is Apache Spark?
Distributed computing framework Large Scale Data Processing engine ● Interactive ● @ any scale Breed mixin’ What is Apache Spark?
MLLib Spark In principle, it should work. Intro to Spark… notebook
MLlib & Spark. In principle, it should work. Intro to Spark... notebook. So we've seen: ● Basics of Spark data manipulation ● MLlib data representation ● Linear regression ● Regularization and k-fold cross validation. What else is there?
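Condensed into a few lines of Spark MLlib Scala (Spark 1.x), assuming a SparkContext sc and a CSV laid out as label,feature1,feature2,...; the path is a placeholder.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // MLlib data representation: a label plus a feature vector per observation
    val points = sc.textFile("hdfs:///data/train.csv").map { line =>
      val values = line.split(',').map(_.toDouble)
      LabeledPoint(values.head, Vectors.dense(values.tail))
    }.cache()

    // Linear regression trained with stochastic gradient descent (100 iterations)
    val model = LinearRegressionWithSGD.train(points, 100)

    // Mean squared error on the training set
    val mse = points.map(p => math.pow(p.label - model.predict(p.features), 2)).mean()
    println(s"training MSE = $mse")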
MLLib Spark In principle, it should work. Basic statistics Classification and regression Collaborative filtering Clustering Dimensionality reduction Feature extraction and transformation Frequent pattern mining Evaluation metrics … http://spark.apache.org/docs/latest/mllib-guide.html
MLlib for Genomics? ADAM + MLlib (mixture K-Means+RF) Playtime! Some more examples
Genomics The data So… that’s what separates us huh?
1000 genomes: http://www.1000genomes.org/ ~1000 samples ~30M Genotypes per sample (features) Genomics The data Please, don’t mind the colors...
1000 genomes: http://www.1000genomes.org/ ~1000 samples Few samples => Machine Learning Genomics The data Woooow, really, you must be kidding me… ahahahahah
1000 genomes: http://www.1000genomes.org/ ~1000 samples ~30M Genotypes per sample (features) Few samples => Machine Learning Lots of Data => Distributed computing Genomics The data Oh… damned… hum huh
MLlib for Genomics? ADAM + MLlib (mixture K-Means+RF) Playtime! Notebook!
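The clustering half of that mixture, roughly sketched with MLlib's KMeans (Scala, Spark 1.x); how the ~30M genotypes per sample are encoded into a numeric vector is elided here, so samples and encodeGenotypes below are hypothetical.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // One dense feature vector per sample (encoding step not shown)
    val genotypeVectors = samples.map(s => Vectors.dense(encodeGenotypes(s))).cache()

    // Cluster the ~1000 samples into a handful of populations
    val model = KMeans.train(genotypeVectors, 5, 20) // k = 5, 20 iterations

    // Within-cluster sum of squared distances: the distortion risk from earlier
    println(s"cost = ${model.computeCost(genotypeVectors)}")
    model.clusterCenters.foreach(println)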
What else? Old and new players are now integrating with Spark (and Scala)
Spark ML Pipeline. Higher API. Integrated with DataFrames. Offers an API to create shareable/reusable Pipeline constructions (PCA, ...).
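A small sketch of what such a pipeline looks like in the spark.ml Scala API (Spark 1.5 era); the training and test DataFrames and their text/label columns are assumptions, the stages are spark.ml's own.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Stages: tokenize text, hash words into feature vectors, fit a classifier
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

    // The whole chain becomes one shareable, reusable object
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)

    // The fitted pipeline applies the same preprocessing and model to new data
    val predictions = model.transform(test).select("text", "prediction")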
Keystone (KeystoneML). Higher API. Like Pipeline, but type-safe; chainable API (andThen-friendly).
H2O. Higher API. In-memory implementation of "Map-Reduce"; highly optimised structures for the JVM; blazing fast convergent models.
DL4J Spark ML Higher API
Intel Data Analytics Acceleration Library DAAL (Intel) Higher API
Declarative large-scale machine learning optimization based on data and cluster characteristics System ML (IBM) Higher API
Nitro's Extremely Exciting Deep Learning Engine MLP, RBM, LSTM and more to come Needle Higher API
H2O Sparkling Water & Deep Learning on genomics. Water in fire. Learning structures using the H2O Deep Learning algorithm, integrated in Spark, in a Notebook, on an EC2 cluster. http://h2o.ai/product/sparkling-water/
H2O Sparkling Water: in-memory data exchange. I remember things better when I remember them twice.
Wrap up what we hope you have learned
Distributed computing for machine learning. I am ready. Data is exploding; distributed technologies are maturing; scale up and down, with interactivity.
Distributed ML on Spark: what is available. What are my options, by the way? Libraries: Spark MLlib, H2O, DL4J, Needle. Platforms: EC2, GCE, URIKA-XA, Cloudera, MapR, Hortonworks. Storage and messaging: HDFS, Cassandra (C*), Kafka.
Shar3 (Data Fellas). "Create" a cluster; find sources (context, quality, semantics, ...); connect to sources (structure, schema/types, ...); create the distributed data pipeline/model; tune accuracy; tune performance; write results to sinks; access layer; user access. [Diagram: each step involves a mix of ops, data, sci and web roles]
Shar3 (Data Fellas): Discovery, Analysis, Production, Distribution, Rendering. Catalog, Project Generator, Micro Service / Binary format, Schema for output, Metadata.
That’s all folks Thanks for listening/staying Poke us on Twitter or via http://data-fellas.guru @DataFellas @Shar3_Fellas @SparkNotebook @Xtordoir & @Noootsab Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra (form → @DataFellas) Check also @TypeSafe: http://t.co/o1Bt6dQtgH

Distributed Machine Learning 101 using Apache Spark from a Browser, Devoxx Belgium 2015