Foundations for Scaling ML in Apache Spark Joseph K. Bradley August 14, 2016 ® ™
Who am I? Apache Spark committer & PMC member Software Engineer @ Databricks Machine Learning Department @ Carnegie Mellon 2
•  General engine for big data computing •  Fast •  Easy to use •  APIs in Python, Scala, Java & R 3 Apache Spark Spark	SQL Streaming MLlib GraphX Largest cluster: 8000 Nodes (Tencent) Open source •  Apache Software Foundation •  1000+ contributors •  200+ companies & universities
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO Source: Slide 5 of Spark Community Update
MLlib: Spark’s ML library 5 0 500 1000 v0.8 v0.9 v1.0 v1.1 v1.2 v1.3 v1.4 v1.5 v1.6 v2.0 commits/release Learning tasks Classification Regression Recommendation Clustering Frequent itemsets Data utilities Featurization Statistics Linear algebra Workflow utilities Model import/export Pipelines DataFrames Cross validation Goals Scale-out ML Standard library Extensible API
MLlib: original design RDDs Challenges for scalability 6
Resilient Distributed Datasets (RDDs) 7 Map Reduce master
Resilient Distributed Datasets (RDDs) 8
Resilient Distributed Datasets (RDDs) 9 Resiliency •  Lineage •  Caching & checkpointing
ML on RDDs: the good Flexible: GLMs, trees, matrix factorization, etc. Scalable: E.g., Alternating Least Squares on Spotify data •  50+ million users x 30+ million songs •  50 billion ratings Cost ~ $10 •  32 r3.8xlarge nodes (spot instances) •  For rank 10 with 10 iterations, ~1 hour running time. 10
ML on RDDs: the challenges Partitioning •  Data partitioning impacts performance. •  E.g., for Alternating Least Squares Lineage •  Iterative algorithms à long RDD lineage •  Solvable via careful caching and checkpointing JVM •  Garbage collection (GC) •  Boxed types 11
MLlib: current status DataFrame & Dataset integration Pipelines API 12
Spark DataFrames & Datasets 13 dept age name Bio 48 H	Smith CS 34 A	Turing Bio 43 B	Jones Chem 61 M	Kennedy Data grouped into named columns DSL for common tasks •  Project, filter, aggregate, join, … •  100+ functions available •  User-Defined Functions (UDFs) data.groupBy(“dept”).avg(“age”) Datasets: Strongly typed DataFrames
DataFrame optimizations Catalyst query optimizer Project Tungsten • Memory management • Code generation 14 Predicate pushdown Join selection … Off-heap Avoid JVM GC Compressed format Combine operations into single, efficient code blocks
ML Pipelines •  DataFrames: unified ML dataset API •  Flexible types •  Add & remove columns during Pipeline execution 15
Load data Feature extracIon Original dataset 16 PredicIve model EvaluaIon Text Label I bought the game... 4 Do NOT bother try... 1 this shirt is aweso... 5 never got it. Seller... 1 I ordered this to... 3
Extract features Feature extracIon Original dataset 17 PredicIve model EvaluaIon Text Label Words Features I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...]
Fit a model Feature extracIon Original dataset 18 PredicIve model EvaluaIon Text Label Words Features Prediction Probability I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8 Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6 this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9 never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7 I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7
Evaluate Feature extracIon Original dataset 19 PredicIve model EvaluaIon Text Label Words Features Prediction Probability I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8 Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6 this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9 never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7 I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7
ML Pipelines DataFrames: unified ML dataset API •  Flexible types •  Add & remove columns during Pipeline execution •  Materialize columns lazily •  Inspect intermediate results 20
Under the hood: optimizations Current use of DataFrames •  API •  Transformations & predictions 21 Feature transformation & model prediction are phrased as User- Defined Functions (UDFs) à Catalyst query optimizer à Tungsten memory management + code generation
MLlib: future scaling DataFrames for training Potential benefits •  Spilling to disk •  Catalyst •  Tungsten Challenges remaining 22
Implementing ML on DataFrames 23 Map Reduce master
Scalability DataFrames automatically spill to disk à Classic pain point of RDDs 24 java.lang.OutOfMemoryError Goal: Smoothly scale, without custom per-algorithm optimizations
Catalyst in ML Key idea: automatic query (ML algorithm) optimization •  DataFrame operations are lazy. •  Express entire algorithm as DataFrame operations. •  Let Catalyst reorganize the algorithm, data, etc. à Fewer manual optimizations 25
Tungsten in ML Tungsten: off-heap memory management •  Avoids JVM GC •  Uses efficient storage formats •  Code generation 26 Issue in ML: object creation during each iteration Issue in ML: Array[(Int,Double,Double)] Issue in ML: Volcano iterator model in MR/RDDs
Prototyping ML on DataFrames Currently: •  Belief propagation •  Connected components Current challenges: •  DataFrame query plans do not have iteration as a top-level concept •  ML/Graph-specific optimizations for Catalyst query planner Eventual goal: Port all ML algorithms to run on top of DataFrames à speed & scalability 27
To summarize... MLlib on RDDs •  Required custom optimizations MLlib with a DataFrame-based API •  Friendly API •  Improvements for prediction MLlib on DataFrames •  Potential for even greater scaling for training •  Simpler for non-experts to write new algorithms 28
Get started Get involved •  JIRA http://issues.apache.org •  mailing lists http://spark.apache.org •  Github http://github.com/apache/spark •  Spark Packages http://spark-packages.org Learn more •  New in Apache Spark 2.0 http://databricks.com/blog/2016/06/01 •  MOOCs on EdX http://databricks.com/spark/training 29 Try out Apache Spark 2.0 in Databricks Community Edition http://databricks.com/ce Many thanks to the community for contributions & support!
Databricks Founded by the creators of Apache Spark Offers hosted service •  Spark on EC2 •  Notebooks •  Visualizations •  Cluster management •  Scheduled jobs 30 We’re hiring!
Thank you! Twitter: @jkbatcmu

Foundations for Scaling ML in Apache Spark

  • 1.
    Foundations for ScalingML in Apache Spark Joseph K. Bradley August 14, 2016 ® ™
  • 2.
    Who am I? ApacheSpark committer & PMC member Software Engineer @ Databricks Machine Learning Department @ Carnegie Mellon 2
  • 3.
    •  General enginefor big data computing •  Fast •  Easy to use •  APIs in Python, Scala, Java & R 3 Apache Spark Spark SQL Streaming MLlib GraphX Largest cluster: 8000 Nodes (Tencent) Open source •  Apache Software Foundation •  1000+ contributors •  200+ companies & universities
  • 4.
    NOTABLE USERS THATPRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO Source: Slide 5 of Spark Community Update
  • 5.
    MLlib: Spark’s MLlibrary 5 0 500 1000 v0.8 v0.9 v1.0 v1.1 v1.2 v1.3 v1.4 v1.5 v1.6 v2.0 commits/release Learning tasks Classification Regression Recommendation Clustering Frequent itemsets Data utilities Featurization Statistics Linear algebra Workflow utilities Model import/export Pipelines DataFrames Cross validation Goals Scale-out ML Standard library Extensible API
  • 6.
  • 7.
    Resilient Distributed Datasets(RDDs) 7 Map Reduce master
  • 8.
  • 9.
    Resilient Distributed Datasets(RDDs) 9 Resiliency •  Lineage •  Caching & checkpointing
  • 10.
    ML on RDDs:the good Flexible: GLMs, trees, matrix factorization, etc. Scalable: E.g., Alternating Least Squares on Spotify data •  50+ million users x 30+ million songs •  50 billion ratings Cost ~ $10 •  32 r3.8xlarge nodes (spot instances) •  For rank 10 with 10 iterations, ~1 hour running time. 10
  • 11.
    ML on RDDs:the challenges Partitioning •  Data partitioning impacts performance. •  E.g., for Alternating Least Squares Lineage •  Iterative algorithms à long RDD lineage •  Solvable via careful caching and checkpointing JVM •  Garbage collection (GC) •  Boxed types 11
  • 12.
    MLlib: current status DataFrame& Dataset integration Pipelines API 12
  • 13.
    Spark DataFrames &Datasets 13 dept age name Bio 48 H Smith CS 34 A Turing Bio 43 B Jones Chem 61 M Kennedy Data grouped into named columns DSL for common tasks •  Project, filter, aggregate, join, … •  100+ functions available •  User-Defined Functions (UDFs) data.groupBy(“dept”).avg(“age”) Datasets: Strongly typed DataFrames
  • 14.
    DataFrame optimizations Catalyst queryoptimizer Project Tungsten • Memory management • Code generation 14 Predicate pushdown Join selection … Off-heap Avoid JVM GC Compressed format Combine operations into single, efficient code blocks
  • 15.
    ML Pipelines •  DataFrames:unified ML dataset API •  Flexible types •  Add & remove columns during Pipeline execution 15
  • 16.
    Load data Feature extracIon Original dataset 16 PredicIve model EvaluaIon Text Label Ibought the game... 4 Do NOT bother try... 1 this shirt is aweso... 5 never got it. Seller... 1 I ordered this to... 3
  • 17.
    Extract features Feature extracIon Original dataset 17 PredicIve model EvaluaIon Text LabelWords Features I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...]
  • 18.
    Fit a model Feature extracIon Original dataset 18 PredicIve model EvaluaIon TextLabel Words Features Prediction Probability I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8 Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6 this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9 never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7 I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7
  • 19.
    Evaluate Feature extracIon Original dataset 19 PredicIve model EvaluaIon Text Label WordsFeatures Prediction Probability I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8 Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6 this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9 never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7 I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7
  • 20.
    ML Pipelines DataFrames: unifiedML dataset API •  Flexible types •  Add & remove columns during Pipeline execution •  Materialize columns lazily •  Inspect intermediate results 20
  • 21.
    Under the hood:optimizations Current use of DataFrames •  API •  Transformations & predictions 21 Feature transformation & model prediction are phrased as User- Defined Functions (UDFs) à Catalyst query optimizer à Tungsten memory management + code generation
  • 22.
    MLlib: future scaling DataFramesfor training Potential benefits •  Spilling to disk •  Catalyst •  Tungsten Challenges remaining 22
  • 23.
    Implementing ML onDataFrames 23 Map Reduce master
  • 24.
    Scalability DataFrames automatically spillto disk à Classic pain point of RDDs 24 java.lang.OutOfMemoryError Goal: Smoothly scale, without custom per-algorithm optimizations
  • 25.
    Catalyst in ML Keyidea: automatic query (ML algorithm) optimization •  DataFrame operations are lazy. •  Express entire algorithm as DataFrame operations. •  Let Catalyst reorganize the algorithm, data, etc. à Fewer manual optimizations 25
  • 26.
    Tungsten in ML Tungsten:off-heap memory management •  Avoids JVM GC •  Uses efficient storage formats •  Code generation 26 Issue in ML: object creation during each iteration Issue in ML: Array[(Int,Double,Double)] Issue in ML: Volcano iterator model in MR/RDDs
  • 27.
    Prototyping ML onDataFrames Currently: •  Belief propagation •  Connected components Current challenges: •  DataFrame query plans do not have iteration as a top-level concept •  ML/Graph-specific optimizations for Catalyst query planner Eventual goal: Port all ML algorithms to run on top of DataFrames à speed & scalability 27
  • 28.
    To summarize... MLlib onRDDs •  Required custom optimizations MLlib with a DataFrame-based API •  Friendly API •  Improvements for prediction MLlib on DataFrames •  Potential for even greater scaling for training •  Simpler for non-experts to write new algorithms 28
  • 29.
    Get started Get involved • JIRA http://issues.apache.org •  mailing lists http://spark.apache.org •  Github http://github.com/apache/spark •  Spark Packages http://spark-packages.org Learn more •  New in Apache Spark 2.0 http://databricks.com/blog/2016/06/01 •  MOOCs on EdX http://databricks.com/spark/training 29 Try out Apache Spark 2.0 in Databricks Community Edition http://databricks.com/ce Many thanks to the community for contributions & support!
  • 30.
    Databricks Founded by thecreators of Apache Spark Offers hosted service •  Spark on EC2 •  Notebooks •  Visualizations •  Cluster management •  Scheduled jobs 30 We’re hiring!
  • 31.