Combining Machine Learning frameworks with Apache Spark Tim Hunter Hadoop Summit June 2016
About me Apache Spark contributor (since Spark 0.6) Software Engineer @ Databricks Ph.D. in Machine Learning @ UC Berkeley 2
Founded by the team who created Apache Spark Offers a hosted service: - Apache Spark in the cloud - Notebooks - Cluster management - Production environment About Databricks 3
Apache Spark The most active open-source project in big data
Large-scale machine learning on Apache Spark Spark MLlib
MLlib’s Mission MLlib’s mission is to make practical machine learning easy and scalable. • Easy to build machine learning applications • Capable of learning from large-scale datasets • Easy to integrate into existing workflows 6
Algorithm Coverage (list based on Spark 2.0)
Classification: Logistic regression • Naive Bayes • Streaming logistic regression • Linear SVMs • Decision trees • Random forests • Gradient-boosted trees • Multilayer perceptron
Regression: Ordinary least squares • Ridge regression • Lasso • Isotonic regression • Decision trees • Random forests • Gradient-boosted trees • Streaming linear methods • Generalized Linear Models
Frequent itemsets: FP-growth • PrefixSpan
Clustering: Gaussian mixture models • K-Means • Streaming K-Means • Latent Dirichlet Allocation • Power Iteration Clustering • Bisecting K-Means
Statistics: Pearson correlation • Spearman correlation • Online summarization • Chi-squared test • Kernel density estimation • Kolmogorov–Smirnov test • Online hypothesis testing • Survival analysis
Linear algebra: Local dense & sparse vectors & matrices • Normal equation for least squares • Distributed matrices (block-partitioned, row, indexed row, coordinate) • Matrix decompositions
Recommendation: Alternating Least Squares
Feature extraction & selection: Word2Vec • Chi-Squared selection • Hashing term frequency • Inverse document frequency • Normalizer • Standard scaler • Tokenizer • One-Hot Encoder • StringIndexer • VectorIndexer • VectorAssembler • Binarizer • Bucketizer • ElementwiseProduct • PolynomialExpansion • Quantile discretizer • SQL transformer
Also: Model import/export • Pipelines 7
Outline • ML workflows are complex • Distributing single-machine ML frameworks: embedding with Spark • Unified cross-language ML pipelines with MLlib 8
ML workflows are complex • Specify the pipeline • Re-run on new data • Inspect the results • Tune the parameters • Usually, each step of a pipeline is easier with one framework 9
ML Workflows are Complex 10 Train model 1 Evaluate Datasource 1 Datasource 2 Datasource 3 Extract features Extract features Feature transform 1 Feature transform 2 Feature transform 3 Train model 2 Ensemble
Existing tools • Scikit-learn – Excellent documentation – Standard for Python • R – Lots of packages available • Pandas – Very easy to use • A lot of investment in tooling and education – How to integrate big data with these tools? 11
Common misconceptions • Spark is for big data only • Spark can only work with dedicated, distributed libraries 12
Spark as a scheduler • A lot of tasks in ML are "embarrassingly parallel" • Use Spark for data management and for scheduling 13
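A minimal sketch of this pattern, assuming a hypothetical single-machine function `score_configuration` (it could wrap scikit-learn, R, or TensorFlow internally); Spark only schedules the independent tasks and collects the results:

```python
from pyspark import SparkContext

def score_configuration(params):
    """Train and evaluate one model locally on an executor (placeholder logic)."""
    learning_rate, hidden_units = params
    # Replace this with a real single-machine training call.
    return {"params": params, "score": 1.0 / (1.0 + learning_rate * hidden_units)}

sc = SparkContext(appName="spark-as-a-scheduler")

# Small grid of independent, CPU-heavy tasks: one Spark task per configuration.
grid = [(lr, h) for lr in (0.01, 0.1, 1.0) for h in (32, 64, 128)]

results = (sc.parallelize(grid, numSlices=len(grid))
             .map(score_configuration)
             .collect())

best = max(results, key=lambda r: r["score"])
```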
One example: learning digits • Learning task: given a set of images, recognize the digits • Standard benchmark dataset in computer vision, built by NIST (MNIST) 14
Training Deep Learning algorithms • Training a neural network is hard: • It is a sequential procedure (present one image after the other to learn from) • It can be sensitive to noise and the order of images: robustness analysis is critical • Tuning the training parameters (descent rate, batch sizes, etc.) is very important: otherwise, learning is too slow or gets stuck in a local minimum. A lot of heuristics are used in practice. 15
TensorFlow as a training library • Many algorithms and frameworks exist for this task; we will use TensorFlow, from Google: • Popular choice for neural network training and deep learning • Competitive performance • Easy to experiment with • Python interface makes it easy to integrate with Spark 16
Distributing TensorFlow computations • Even if TF is used as a single-machine library, we get speedups from Spark 17 Distributed Cross Validation ... Best Model Model #1 Training Model #2 Training Model #3 Training
Distributing TensorFlow computations 18 Distributed Cross Validation ... Best Model Model #4 Training Model #6 Training Model #3 Training Model #1 Training Model #5 Training Model #2 Training
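A sketch of the distributed grid search pictured in these diagrams, assuming a hypothetical `train_mnist(data, params)` that runs one single-machine TensorFlow training job and returns a validation accuracy; neither that function nor `load_mnist_locally` is part of any released library:

```python
from pyspark import SparkContext

sc = SparkContext(appName="distributed-tf-grid-search")

# Load the digit images once on the driver and ship them to every executor.
mnist = load_mnist_locally()          # hypothetical loader
mnist_bc = sc.broadcast(mnist)

param_grid = [{"learning_rate": lr, "hidden_units": h}
              for lr in (0.001, 0.01, 0.1)
              for h in (64, 256, 1024)]

def evaluate(params):
    # Each Spark task trains one TensorFlow model locally, on one machine.
    accuracy = train_mnist(mnist_bc.value, params)   # hypothetical trainer
    return (accuracy, params)

scores = sc.parallelize(param_grid, len(param_grid)).map(evaluate).collect()
best_accuracy, best_params = max(scores, key=lambda t: t[0])
```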
Results • Running a 2-layer neural network, testing different update rates and layer sizes 19 [Bar chart comparing 1 node, 2 nodes, and 13 nodes; vertical axis from 0 to 12000]
Embedding deep learning in Spark • Best known algorithms are essentially sequential during training • Careful selection of training parameters is critical • Spark can help iterate quickly to find a good set of parameters 20
Managing ML workflows with Spark 21
A data scientist’s wish list: • Run original code on a production environment • Use distributed data sources • Use familiar APIs and libraries • Distribute ML workload piece by piece • Only distribute as needed • Easily switch between local & distributed settings 22
Example: sentiment analysis 23 Given a review (text), predict the user’s rating. Data from https://snap.stanford.edu/data/web-Amazon.html
ML Workflow 24 Train model Evaluate Load data Extract features Review: This product doesn't seem to be made to last… Rating: 2 feature_vector: [0.1 -1.3 0.23 … -0.74] rating: 2.0 Regression: (review: String) => Double
Load Data 25 Data sources for DataFrames, built-in and external: { JSON }, JDBC, LIBSVM, and more … Train model Evaluate Load data Extract features
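A sketch of this step with the DataFrame readers (Spark 2.0 style); the S3 path and column names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sentiment-analysis").getOrCreate()

# Built-in JSON data source; each record is assumed to carry at least
# a `review` string and a numeric `rating`.
reviews = spark.read.json("s3n://my-bucket/amazon-reviews/*.json")

# The same reader API covers the other sources on the slide, e.g.:
#   spark.read.format("jdbc").options(url=..., dbtable=...).load()
#   spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

reviews.printSchema()
```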
Extract Features words: [this, product, doesn't, seem, to, …] feature_vector: [0.1 -1.3 0.23 … -0.74] Review: This product doesn't seem to be made to last… Rating: 2 Prediction: 3.0 Train model Evaluate Load data Tokenizer Hashed Term Frequency
Extract Features words: [this, product, doesn't, seem, to, …] feature_vector: [0.1 -1.3 0.23 … -0.74] Review: This product doesn't seem to be made to last… Rating: 2 Prediction: 3.0 Linear regression Evaluate Load data Tokenizer Hashed Term Frequency
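The feature-extraction and model stages above map directly onto the DataFrame-based Pipelines API; a sketch, with column names assumed from the earlier slides:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.regression import LinearRegression

tokenizer = Tokenizer(inputCol="review", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="rating")

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

# `reviews` is the DataFrame loaded above; the fitted PipelineModel maps a
# raw review string to a predicted rating (a Double, as on slide 24).
model = pipeline.fit(reviews)
predictions = model.transform(reviews)
```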
Our ML workflow 28 Cross Validation Model Training Feature Extraction regularization parameter: {0.0, 0.1, ...}
Cross validation 29 Cross Validation ... Best Model Model #1 Training Model #2 Training Feature Extraction Model #3 Training
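A sketch of this cross-validation step: sweep the regularization parameter of the linear model defined above and keep the best fitted pipeline (the grid values are illustrative):

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

# Candidate values for the regularization parameter from slide 28.
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.0, 0.01, 0.1, 1.0])
              .build())

evaluator = RegressionEvaluator(labelCol="rating", metricName="rmse")

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=3)

cv_model = cv.fit(reviews)          # trains one model per parameter value per fold
best_model = cv_model.bestModel     # the fitted Pipeline with the best regParam
```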
Example 30
MLlib in production ML Persistence 31
A data scientist’s wish list: • Run original code on a production environment • Use distributed data sources • Use familiar APIs and libraries • Distribute ML workload piece by piece • Only distribute as needed • Easily switch between local & distributed settings 32
DataFrame-based API for MLlib a.k.a. the "Pipelines" API, with utilities for constructing ML Pipelines In 2.0, the DataFrame-based API will become the primary API for MLlib. • Voted on by the community • org.apache.spark.ml, pyspark.ml The RDD-based API will enter maintenance mode. • Still maintained with bug fixes, but no new features • org.apache.spark.mllib, pyspark.mllib 33
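For orientation, the two APIs live in separate packages; a small illustration in Python:

```python
# DataFrame-based ("Pipelines") API -- primary API from Spark 2.0 on.
from pyspark.ml.classification import LogisticRegression

# RDD-based API -- maintenance mode: bug fixes only, no new features.
from pyspark.mllib.classification import LogisticRegressionWithSGD
```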
Why ML persistence? 34 Data Science Software Engineering Prototype (Python/R) Create model Re-implement model for production (Java) Deploy model
Why ML persistence? 35 Data Science Software Engineering Prototype (Python/R) Create Pipeline • Extract raw features • Transform features • Select key features • Fit multiple models • Combine results to make prediction • Extra implementation work • Different code paths • Synchronization overhead Re-implement Pipeline for production (Java) Deploy Pipeline
With ML persistence... 36 Data Science Software Engineering Prototype (Python/R) Create Pipeline Persist model or Pipeline: model.save(“s3n://...”) Load Pipeline (Scala/Java) Model.load(“s3n://…”) Deploy in production
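A sketch of the save/load round trip in Python (Spark 2.0); the S3 path is a placeholder, and the same saved directory can be loaded from Scala or Java with PipelineModel.load:

```python
from pyspark.ml import PipelineModel

model_path = "s3n://my-bucket/models/sentiment-pipeline"

# Data-science side: persist the best fitted pipeline from the cross-validation.
cv_model.bestModel.save(model_path)

# Production side: reload the fitted pipeline and apply it to new reviews.
loaded_model = PipelineModel.load(model_path)
predictions = loaded_model.transform(reviews)
```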
ML persistence status 37 [Diagram: an example workflow — text preprocessing, feature generation, generalized linear regression, model tuning — shown both as an unfitted Pipeline (the "recipe") and a fitted Model (the "result"); single-model persistence was already supported in MLlib's RDD-based API]
ML persistence status Near-complete coverage in all Spark language APIs • Scala & Java: complete (29 feature transformers, 21 models) • Python: complete except for 2 algorithms • R: complete for existing APIs Single underlying implementation of models Exchangeable data format • JSON for metadata • Parquet for model data (coefficients, etc.) 38
A data scientist’s wish list: • Run original code on a production environment • Directly apply learned pipelines • Use MLlib as export format • Use distributed data sources • Builtin Spark conversions • Use familiar APIs and libraries • Distribute ML workload piece by piece • Easy to distribute the most common ML tasks 39
What’s next? Prioritized items on the 2.1 roadmap JIRA (SPARK-15581): • Critical feature completeness for the DataFrame-based API – Multiclass logistic regression – Statistics • Python API parity & R API expansion • Scaling & speed tuning for key algorithms: trees & ensembles GraphFrames • Release for Spark 2.0 • Speed improvements (join elimination, connected components) 40
Get started • Get involved via the roadmap JIRA (SPARK-15581) + mailing lists • Download the notebook for this talk: http://dbricks.co/1UfvAH9 • ML persistence blog post: http://databricks.com/blog/2016/05/31 41 Try out the Apache Spark 2.0 preview release: http://databricks.com/try
Thank you! spark.apache.org spark-packages.org databricks.com

Editor's Notes

  • #2 Thanks. I would like to discuss the integration between Apache Spark and single-machine ML frameworks and tools, such as …
  • #3 Can I have a show of hands of many people in the audience have used scikit-learn, or R or Pandas?
  • #5 More than 1000 contributors in Jan 2016
  • #13 Spark is a very simple tool to accelerate your computations, even if you do not have big data. Spark integrates well with existing libraries: it is easy to use as a simple scheduler, and easy to write small bindings for the critical parts of libraries.
  • #22 I am demonstrating from Databricks: mounted S3 buckets, Parquet data files.
  • #23 Be able to extract a smaller amount of data from the production storage system. Slowly start distributing the system, piece by piece. Easily switch between local and distributed. Keep familiar APIs or even the same tools. Show how the distribution happens.
  • #24 Reviews scraped from Amazon.
  • #25 ML algorithms like to have numerical vectors
  • #29 Model training / tuning. Regularization: a parameter that controls how the linear model does on unseen data. There is no single good value for the regularization parameter; one common method to find one is to try out different values. This technique is called cross-validation (CV): you split your training data into two sets, one used to learn the model with a given regularization parameter, and another to evaluate how well it does with that parameter.
  • #32 Note: Recap verbally before this.
  • #33 Be able to extract a smaller amount of data from the production storage system. Slowly start distributing the system, piece by piece. Easily switch between local and distributed. Keep familiar APIs or even the same tools. Show how the distribution happens.
  • #37 Note this is loading into Spark.
  • #38 Saving & loading ML types: models, both unfitted ("recipe") & fitted; complex Pipelines, both unfitted ("workflow") & fitted.
  • #39 (DEMO)