Distributed Machine Learning using Apache Spark from the Browser Devoxx Belgium 2015, Antwerpen
Outline: ● Distributed computing ● What is Machine Learning? ● Spark for machine learning? ● Spark MLlib by examples ● Spark and other libraries ● Wrap up
Data Fellas: Andy Petrella (Maths, Geospatial, Distributed Computing, Spark Notebook, Spark/Scala trainer, Machine Learning) and Xavier Tordoir (Physics, Bioinformatics, Distributed Computing, Scala (& Perl) trainer, Spark, Machine Learning)
Distributed Computing Why you must care, by Data Fellas Andy Petrella & Xavier Tordoir
Computing. Uba ga! Traditionally, tasks are entirely performed on a single computer using three main resources: Processing Power, Memory, and Storage.
Computing. Oh no! Hence performance is limited in time (Processing Power) and in space (Memory, Storage).
Distributing. Interesting. Distributed computing: [...] A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. [...] Ref: https://en.wikipedia.org/wiki/Distributed_computing
Consequences. Oh no! Algorithms have to work on data partitions and with partial results: the entire dataset cannot be accessed at once.
New resource! Damned. The network joins Processing Power (time), Memory and Storage (space) as a resource, and it will impact performance...
Distributing. Oops, did it again. [Diagram: four nodes, each with Processing, Memory and Storage, connected over the network]
Drawback: Partitions. Huh? [Same cluster diagram: the data is split into partitions across the nodes]
Drawback: Partitions. Hey, you sank my node! [Same cluster diagram, with one node failing: BOOM]
Advantage: Elastic scaling. Ouch, my rack. What if this cluster happens to not be big enough? [Cluster diagram: four nodes with Processing, Memory and Storage over the network]
Advantage: Elastic scaling. That's more reasonable. [Diagram: a second identical cluster is added and connected over the network]
What about HPC? Yeah! HPC: computationally intensive applications. Model: specialized hardware (CPU/GPU) and network, orchestrated by a scheduler that gathers their computing power and memory.
What about HPC? Got no money and no time. Drawbacks: ● Costs and upgrades come in large blocks ● Decoupled storage: storage latency = no streaming / no iteration
Iterate. Do that, baby! Why process data if not to model? Machine learning is iterative (streaming & batch). Data is aggregated in the form of a model (parameters). Data changes little, the model is small.
Iterate. You gotta be kidding. [Cluster diagram: each iteration moves lots of data between Storage and Processing/Memory, again and again...]
Summary. Interesting. Distributed computing allows cost-effective parallelism. Efficiency requires distributed storage, colocated with the processing units. What about programming models?
Distributed storage. Partitions! HDFS: the Apache implementation of the Google File System ● Natural fit for distributed storage ● Works as a service. Other chunked sources: ● Apache Cassandra, S3, Tachyon, ...
Distributed storage: Split. The client puts /data/f256.txt (256 Mb, replication factor 2); the Name Node coordinates Data Nodes 1 to 4.
Distributed storage: Split. The Name Node splits the file into 64 Mb blocks.
Distributed storage: Everywhere. Each 64 Mb block is written as a part file (e.g. /data/f256.txt/part-r-00000) to a Data Node.
Distributed storage: Everywhere. All 64 Mb blocks end up spread across the Data Nodes.
Distributed storage: Replicate. With replication factor 2, every 64 Mb block is copied to a second Data Node.
Map Reduce, High Level Execution (the rocket's base): the data is loaded as data partitions.
Map Reduce, High Level Execution (the rocket's engines): Map and Pair; each data partition goes through a mapper.
Map Reduce, High Level Execution (the rocket's trunk): Shuffle; pairs are grouped by key (GroupByKey).
Map Reduce, High Level Execution (the rocket's cockpit): Reduce; the values for each key are reduced.
Map Reduce, High Level Execution (the rocket's tip): Results; we collect the results.
Map Reduce, High Level Execution: To infinity and beyond! The whole #!: data partitions → mappers → GroupByKey → reducers → results.
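As a concrete illustration of that pipeline (not from the deck, just the classic example): a minimal word count in Spark's Scala API, assuming a SparkContext sc is already available as in the Spark Notebook; the input path is a placeholder.

    // Load the data as partitions (path is a placeholder)
    val lines = sc.textFile("hdfs:///data/f256.txt")

    // Map and Pair: emit one (word, 1) pair per word
    val pairs = lines.flatMap(_.split("\\s+")).map(word => (word, 1))

    // Shuffle by key and Reduce: sum the counts for each word
    val counts = pairs.reduceByKey(_ + _)

    // Collect (part of) the results back to the driver
    counts.take(10).foreach(println)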
Map Reduce, Matrix-Vector Product: How about word count? [Diagram: a matrix multiplied by a vector]
Map Reduce, Matrix-Vector Product: Back to school...
Map Reduce, Matrix-Vector Product: Wait, that's maths.
Map Reduce, Matrix-Vector Product: Where is the RAT? Store the matrix as an ordered set of elements; the vector V is loaded in memory as an ordered array. Map function: each matrix element is mapped onto a product.
Map Reduce, Matrix-Vector Product: OK... I take over. MAP: each element m_ij is multiplied by v_j and keyed by its row i.
Map Reduce, Matrix-Vector Product: just a sum... REDUCE: the products of each row are summed, giving one entry of the result vector.
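The same idea sketched in Spark's Scala API (a hedged illustration, not the speakers' code): the matrix is a distributed dataset of (row, col, value) entries, the vector is small enough to broadcast, the map emits one product per element and the reduce is just a sum per row.

    // Matrix as (row, col, value) entries; V broadcast to every worker
    val matrix = sc.parallelize(Seq((0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)))
    val v = sc.broadcast(Array(10.0, 100.0))

    // MAP: each matrix element becomes a (row, m_ij * v_j) pair
    val products = matrix.map { case (i, j, m) => (i, m * v.value(j)) }

    // REDUCE: just a sum of the products for each row
    val result = products.reduceByKey(_ + _).collect().sortBy(_._1)
    // result: Array((0, 210.0), (1, 430.0))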
Map Reduce, Summary. Summary == Reduce? A simple abstraction of computations (Map and Reduce) using a simple abstraction of data (key-value pairs).
Map Reduce Summary So what? Brings transparent: ● parallelization ● distribution ● fault tolerance
Why Apache Spark: MapReduce on steroids. Man... finally! Uses ● the functional paradigm ● lazy computations. Creates dependencies between task definitions and optimizes execution.
Why Apache Spark: MapReduce on steroids. Almost forgot that one. Can cache data in memory or on the local file system: far less IO and network traffic.
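In code this looks roughly as follows (illustrative sketch, names and path assumed, sc available): transformations such as textFile and map are lazy, and cache() keeps the computed partitions in memory so iterations do not re-read the source.

    // Nothing is computed yet: textFile and map are lazy transformations
    val features = sc.textFile("hdfs:///data/points.csv")
                     .map(_.split(',').map(_.toDouble))
                     .cache() // keep the parsed partitions in memory

    // The first action materializes the RDD and fills the cache
    println(features.count())

    // Later iterations reuse the cached partitions: far less IO and network
    for (i <- 1 to 10) {
      val total = features.map(v => v.sum).sum()
      println(s"iteration $i, total = $total")
    }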
What is Machine learning? Why you must care, by Data Fellas Andy Petrella & Xavier Tordoir
you cannot prove a vague theory is wrong […] Also, if the process of computing the consequences is indefinite, then with a little skill any experimental result can be made to look like the expected consequences. —Richard Feynman [1964] What is Machine Learning? Science with data Surely You’re Joking Mr…
What is Machine Learning? Overview. The 2nd law neither... ● Modelling without first principles...
What is Machine Learning? Overview. Take that, Newton... ● Modelling without first principles... Machine learning is what you do with a Learning Machine.
What is Machine Learning? Overview. With some "a priori" knowledge. ● Modelling without first principles... ● Modelling dependencies from the data.
What is Machine Learning? Learning Machine... You still need a domain expert... like me! ● What is the problem? ● Hypothesis? ● Data generation process? ● Collection and preprocessing ● Interpretation
What is Machine Learning? Overview. Machine learning is what you do with a Learning Machine. ● Estimate dependencies from data. [Diagram: a Generator draws samples x; the System responds with y; the Learning Machine observes (x, y) and produces its estimate ỹ]
What is Machine Learning? Overview. I like them so much in LaTeX2e. ● Estimate dependencies from data ● Minimize a risk functional over the set of functions, given the data. [Same Generator / System / Learning Machine diagram]
What is Machine Learning? Supervised learning. L(y, f(x, w)) = (y - f(x, w))^2... WTF? ● Regression: continuous output ○ Risk = prediction error ● Classification: categorical output ○ Risk = probability of misclassification
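The formula the speech bubble mangles, written out (standard formulation, added for readability): learning minimizes the expected loss of f(x, w) over the unknown data distribution, with squared error as the typical regression loss and misclassification as the typical classification loss.

    % Risk functional: expected loss over the unknown joint distribution p(x, y)
    R(w) = \int L\big(y, f(x, w)\big)\, p(x, y)\, dx\, dy

    % Typical losses: squared error (regression), 0/1 misclassification (classification)
    L\big(y, f(x, w)\big) = \big(y - f(x, w)\big)^2
    \qquad
    L\big(y, f(x, w)\big) = \mathbb{1}\big[\, y \neq f(x, w) \,\big]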
What is Machine Learning? Unsupervised learning: no output. I like clusters, especially with roasted nuts. ● Clustering ○ Risk = distortion error (distances to cluster centers) ● Density estimation (probability densities)
What is Machine Learning? Bias - Variance, Regression illustration Playtime! Notebook!
What is Machine Learning? Inductive principle. In principle, it should work. An inductive principle tells you what to do to get from finite data to a model. [Diagram: Finite Data → Inductive Principle → Model]
What is Machine Learning? Inductive principle. In principle, it should work. Empirical risk minimization. [Finite Data → Model] • The function class is not defined • The loss is not defined • The optimization procedure is not defined
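Written out for reference (standard form, not from the slides): empirical risk minimization replaces the expectation over p(x, y) with the average loss over the n observed samples.

    % Empirical risk over the finite sample (x_1, y_1), ..., (x_n, y_n)
    R_{\mathrm{emp}}(w) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i, w)\big)
    \qquad
    \hat{w} = \arg\min_{w} R_{\mathrm{emp}}(w)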
What is Machine Learning? Inductive principle. In principle, it should work. Regularization. [Finite Data → Model] • Control on penalty strength • Penalize complexity / encode a priori knowledge
What is Machine Learning? Inductive principle. In principle, it should work. Early stopping rules. [Finite Data → Model] • Iterative optimization • Depends on initial parameters and the algorithm • Used for neural networks • Penalizes along the optimization path
What is Machine Learning? Inductive principle. In principle, it should work. Structural risk minimization. [Finite Data → Model] • Analytic bounds on the risk based on the empirical risk
What is Machine Learning? Inductive principle. In principle, it should work. Bayesian inference. [Finite Data → Model] • Explicit a priori probabilities • Learn mixtures • Hard multidimensional integrations...
What is Machine Learning? Curse of dimensionality. In principle, it should work. We want to control complexity. • Smoothness constraint in a neighborhood
What is Machine Learning? Curse of dimensionality. In principle, it should work. Data density is key... [Diagram: Finite Data in a space → Inductive Principle → Model Complexity]
What is Machine Learning? Curse of dimensionality. In principle, it should work. Data density is key... e.g. ● 1-D, one point per 0.1 m => 10 points/m ● 2-D, one point per 0.1 m in each direction => 100 points/m^2 ● d-D, one point per 0.1 m in each direction => 10^d points/m^d. The same smoothness requires lots of data in high-dimensional spaces.
What is Machine Learning? Curse of dimensionality. In principle, it should work. Sampling is hard... To capture a 10% sample, the sampling window must span this fraction of each dimension's range: ● 1-D => 0.1 ● 2-D => 0.31 ● 10-D => 0.79 => local estimates from samples are difficult.
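One way to read those numbers (my derivation, assuming uniformly spread data): a hypercube that captures a fraction p of the data must span p^{1/d} of each dimension's range.

    % Edge length needed to capture a fraction p of the data in d dimensions
    \ell(p, d) = p^{1/d}
    \qquad
    \ell(0.1, 1) = 0.10, \quad
    \ell(0.1, 2) = 0.1^{1/2} \approx 0.316, \quad
    \ell(0.1, 10) = 0.1^{1/10} \approx 0.794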
What is Machine Learning? Curse of dimensionality. In principle, it should work. Data points are closer to the edges... Each data point "sees" itself as an outlier => predictions require lots of extrapolation.
What is Machine Learning? Curse of dimensionality In principle, it should work. Samples must increase exponentially … or model complexity must be controlled
What is Machine Learning? Regularization in more detail. In principle, it should work. Data-driven penalized risk minimization.
What is Machine Learning? Regularization in more detail. In principle, it should work. Loss functions.
What is Machine Learning? Regularization in more detail. In principle, it should work. Regularizers: L2 (ridge), L1 (lasso), Elastic net.
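The formulas behind these three slides, written out in the usual notation (the deck showed them as images; these are the standard definitions, with one common parameterization of the elastic net):

    % Penalized empirical risk: data fit plus a penalty weighted by lambda
    \hat{w} = \arg\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i, w)\big) + \lambda\, \Omega(w)

    % Regularizers
    \Omega_{L2}(w) = \tfrac{1}{2} \lVert w \rVert_2^2                                        % ridge
    \qquad
    \Omega_{L1}(w) = \lVert w \rVert_1                                                        % lasso
    \qquad
    \Omega_{EN}(w) = \alpha \lVert w \rVert_1 + \tfrac{1 - \alpha}{2} \lVert w \rVert_2^2     % elastic net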
What is Machine Learning? Regularization in more detail. In principle, it should work. Optimization (there comes the fun...). Which algorithm can find a minimum in a distributed fashion? Convex optimization methods (for linear methods): ● Gradient descent ● Stochastic gradient descent ● Limited-memory BFGS (L-BFGS)
What is Machine Learning? Regularization in more detail. In principle, it should work. Optimization (there comes the fun...). Gradient descent ● Efficient steps, but needs to read through the whole dataset at every step
What is Machine Learning? Regularization in more detail. In principle, it should work. Optimization (there comes the fun...). Stochastic gradient descent ● Samples the data for each step, but converges very slowly
What is Machine Learning? Regularization in more detail. In principle, it should work. Optimization (there comes the fun...). L-BFGS ● Estimates second-order (quadratic) information by keeping several previous gradients in memory ● Fast convergence
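The update rules being compared, written out (standard forms, added for clarity): gradient descent averages the gradient over all n points at every step, while SGD uses one sampled point (or a small batch) per step.

    % Gradient descent: full pass over the data per step, step size gamma
    w_{t+1} = w_t - \gamma \, \frac{1}{n} \sum_{i=1}^{n} \nabla_w L\big(y_i, f(x_i, w_t)\big)

    % Stochastic gradient descent: one randomly sampled point i_t per step
    w_{t+1} = w_t - \gamma_t \, \nabla_w L\big(y_{i_t}, f(x_{i_t}, w_t)\big)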
What is Machine Learning? Model selection. All work and no play makes Jack a dull boy. Model complexity control: resampling. Selecting the right lambda... ...to minimize the prediction risk.
What is Machine Learning? Model selection Enough theory boy! The universe
What is Machine Learning? Model selection Enough theory boy! Our data
What is Machine Learning? Model selection Enough theory boy! Our data Learning Set (70%) validation set (30%)
What is Machine Learning? Model selection. Nice flag. K-fold cross validation, K = 4: the data is split into 4 folds, each used once as the validation set.
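A minimal K-fold sketch with Spark MLlib's Scala API (Spark 1.x), assuming an RDD[LabeledPoint] called data and a hypothetical list of candidate lambdas; MLUtils.kFold splits the data into (training, validation) pairs.

    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.mllib.regression.{LabeledPoint, RidgeRegressionWithSGD}
    import org.apache.spark.rdd.RDD

    def selectLambda(data: RDD[LabeledPoint], lambdas: Seq[Double]): Double = {
      val folds = MLUtils.kFold(data, 4, 42) // K = 4 folds, fixed seed
      val avgErrors = lambdas.map { lambda =>
        val errors = folds.map { case (training, validation) =>
          val model = RidgeRegressionWithSGD.train(training, 100, 1.0, lambda)
          // mean squared error on the held-out fold
          validation.map(p => math.pow(p.label - model.predict(p.features), 2)).mean()
        }
        (lambda, errors.sum / errors.length)
      }
      avgErrors.minBy(_._2)._1 // the lambda with the lowest average validation error
    }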
MLLib A library to learn them all...
Distributed computing framework Large Scale Data Processing engine What is Apache Spark? I play BIG!
Distributed computing framework Large Scale Data Processing engine ● SQL & Dataframes ● Streaming ● Graph Processing ● Machine Learning With all colors! What is Apache Spark?
Distributed computing framework Large Scale Data Processing engine ● Optimize memory usage (FAST) ● Optimize computation execution (Complex tasks) ● Easy programming model Let the brain do the work... What is Apache Spark?
Distributed computing framework Large Scale Data Processing engine ● Interactive ● @ any scale Breed mixin’ What is Apache Spark?
MLLib Spark In principle, it should work. Intro to Spark… notebook
MLlib & Spark. In principle, it should work. Intro to Spark... notebook. So we've seen: ● Basics of Spark data manipulation ● MLlib data representation ● Linear regression ● Regularization and k-fold cross validation. What else is there?
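Condensed into a few lines of Spark MLlib Scala (Spark 1.x), assuming a SparkContext sc and a CSV laid out as label,feature1,feature2,...; the path is a placeholder.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // MLlib data representation: a label plus a feature vector per observation
    val points = sc.textFile("hdfs:///data/train.csv").map { line =>
      val values = line.split(',').map(_.toDouble)
      LabeledPoint(values.head, Vectors.dense(values.tail))
    }.cache()

    // Linear regression trained with stochastic gradient descent (100 iterations)
    val model = LinearRegressionWithSGD.train(points, 100)

    // Mean squared error on the training set
    val mse = points.map(p => math.pow(p.label - model.predict(p.features), 2)).mean()
    println(s"training MSE = $mse")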
MLLib Spark In principle, it should work. Basic statistics Classification and regression Collaborative filtering Clustering Dimensionality reduction Feature extraction and transformation Frequent pattern mining Evaluation metrics … http://spark.apache.org/docs/latest/mllib-guide.html
MLlib for Genomics? ADAM + MLlib (mixture K-Means+RF) Playtime! Some more examples
Genomics The data So… that’s what separates us huh?
1000 genomes: http://www.1000genomes.org/ ~1000 samples ~30M Genotypes per sample (features) Genomics The data Please, don’t mind the colors...
1000 genomes: http://www.1000genomes.org/ ~1000 samples Few samples => Machine Learning Genomics The data Woooow, really, you must be kidding me… ahahahahah
1000 genomes: http://www.1000genomes.org/ ~1000 samples ~30M Genotypes per sample (features) Few samples => Machine Learning Lots of Data => Distributed computing Genomics The data Oh… damned… hum huh
MLlib for Genomics? ADAM + MLlib (mixture K-Means+RF) Playtime! Notebook!
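The clustering half of that mixture, roughly sketched with MLlib's KMeans (Scala, Spark 1.x); how the ~30M genotypes per sample are encoded into a numeric vector is elided here, so samples and encodeGenotypes below are hypothetical.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // One dense feature vector per sample (encoding step not shown)
    val genotypeVectors = samples.map(s => Vectors.dense(encodeGenotypes(s))).cache()

    // Cluster the ~1000 samples into a handful of populations
    val model = KMeans.train(genotypeVectors, 5, 20) // k = 5, 20 iterations

    // Within-cluster sum of squared distances: the distortion risk from earlier
    println(s"cost = ${model.computeCost(genotypeVectors)}")
    model.clusterCenters.foreach(println)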
What else? Old and new players are now integrating with Spark (and Scala)
Spark ML Pipeline. Higher API. Integrated with DataFrames. Offers an API to create shareable/reusable Pipeline constructions (PCA, ...).
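A small sketch of what such a pipeline looks like in the spark.ml Scala API (Spark 1.5 era); the training and test DataFrames and their text/label columns are assumptions, the stages are spark.ml's own.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Stages: tokenize text, hash words into feature vectors, fit a classifier
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

    // The whole chain becomes one shareable, reusable object
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)

    // The fitted pipeline applies the same preprocessing and model to new data
    val predictions = model.transform(test).select("text", "prediction")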
Keystone (KeystoneML). Higher API. Like Pipeline, but type-safe; chainable API (andThen-friendly).
H2O. Higher API. In-memory implementation of "Map-Reduce"; highly optimised structures for the JVM; blazing fast convergent models.
DL4J Spark ML Higher API
Intel Data Analytics Acceleration Library DAAL (Intel) Higher API
Declarative large-scale machine learning optimization based on data and cluster characteristics System ML (IBM) Higher API
Nitro's Extremely Exciting Deep Learning Engine MLP, RBM, LSTM and more to come Needle Higher API
H2O Sparkling Water & Deep Learning on genomics. Water in fire. Learning structures using the H2O Deep Learning algorithm, integrated in Spark, in a Notebook, on an EC2 cluster. http://h2o.ai/product/sparkling-water/
H2O Sparkling Water: in-memory data exchange. I remember things better when I remember them twice.
Wrap up what we hope you have learned
Distributed computing for machine learning. I am ready. Data is exploding; distributed technologies are maturing; scale up and down, with interactivity.
Distributed ML on Spark: what is available. What are my options, by the way? Libraries: Spark MLlib, H2O, DL4J, Needle. Platforms: EC2, GCE, URIKA-XA, Cloudera, MapR, Hortonworks. Storage and messaging: HDFS, Cassandra (C*), Kafka.
Shar3 (Data Fellas). "Create" a cluster; find sources (context, quality, semantics, ...); connect to sources (structure, schema/types, ...); create the distributed data pipeline/model; tune accuracy; tune performance; write results to sinks; access layer; user access. [Diagram: each step involves a mix of ops, data, sci and web roles]
Shar3 (Data Fellas): Discovery, Analysis, Production, Distribution, Rendering. Catalog, Project Generator, Micro Service / Binary format, Schema for output, Metadata.
That’s all folks Thanks for listening/staying Poke us on Twitter or via http://data-fellas.guru @DataFellas @Shar3_Fellas @SparkNotebook @Xtordoir & @Noootsab Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra (form → @DataFellas) Check also @TypeSafe: http://t.co/o1Bt6dQtgH

Distributed Machine Learning 101 using Apache Spark from a Browser, Devoxx Belgium 2015