Combining Machine Learning frameworks with Apache Spark Tim Hunter Hadoop Summit June 2016
About me Apache Spark contributor (since Spark 0.6) Software Engineer @ Databricks Ph.D. in Machine Learning @ UC Berkeley 2
Founded by the team who created Apache Spark Offers a hosted service: - Apache Spark in the cloud - Notebooks - Cluster management - Production environment About Databricks 3
Apache Spark The most active open-source project in big data
Large-scale machine learning on Apache Spark Spark MLlib
MLlib’s Mission MLlib’s mission is to make practical machine learning easy and scalable. • Easy to build machine learning applications • Capable of learning from large-scale datasets • Easy to integrate into existing workflows 6
Algorithm Coverage (list based on Spark 2.0)
Classification: Logistic regression • Naive Bayes • Streaming logistic regression • Linear SVMs • Decision trees • Random forests • Gradient-boosted trees • Multilayer perceptron
Regression: Ordinary least squares • Ridge regression • Lasso • Isotonic regression • Decision trees • Random forests • Gradient-boosted trees • Streaming linear methods • Generalized Linear Models
Frequent itemsets: FP-growth • PrefixSpan
Clustering: Gaussian mixture models • K-Means • Streaming K-Means • Latent Dirichlet Allocation • Power Iteration Clustering • Bisecting K-Means
Statistics: Pearson correlation • Spearman correlation • Online summarization • Chi-squared test • Kernel density estimation • Kolmogorov–Smirnov test • Online hypothesis testing • Survival analysis
Linear algebra: Local dense & sparse vectors & matrices • Normal equation for least squares • Distributed matrices (block-partitioned, row, indexed row, coordinate) • Matrix decompositions
Recommendation: Alternating Least Squares
Feature extraction & selection: Word2Vec • Chi-Squared selection • Hashing term frequency • Inverse document frequency • Normalizer • Standard scaler • Tokenizer • One-Hot Encoder • StringIndexer • VectorIndexer • VectorAssembler • Binarizer • Bucketizer • ElementwiseProduct • PolynomialExpansion • Quantile discretizer • SQL transformer
Also: Model import/export • Pipelines 7
Outline • ML workflows are complex • Distributing single-machine ML frameworks: embedding with Spark • Unified cross-language ML pipelines with MLlib 8
ML workflows are complex • Specify the pipeline • Re-run on new data • Inspect the results • Tune the parameters • Usually, each step of a pipeline is easier with one framework 9
ML Workflows are Complex 10 Train model 1 Evaluate Datasource 1 Datasource 2 Datasource 3 Extract features Extract features Feature transform 1 Feature transform 2 Feature transform 3 Train model 2 Ensemble
Existing tools • Scikit-learn – Excellent documentation – Standard for Python • R – Lots of packages available • Pandas – Very easy to use • A lot of investment in tooling and education – How to integrate big data with these tools? 11
Common misconceptions • Spark is for big data only • Spark can only work with dedicated, distributed libraries 12
Spark as a scheduler • A lot of tasks in ML are "embarrassingly parallel" • Use Spark for data management and for scheduling 13
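A minimal sketch of this pattern, assuming a hypothetical single-machine function `score_configuration` (it could wrap scikit-learn, R, or TensorFlow internally); Spark only schedules the independent tasks and collects the results:

```python
from pyspark import SparkContext

def score_configuration(params):
    """Train and evaluate one model locally on an executor (placeholder logic)."""
    learning_rate, hidden_units = params
    # Replace this with a real single-machine training call.
    return {"params": params, "score": 1.0 / (1.0 + learning_rate * hidden_units)}

sc = SparkContext(appName="spark-as-a-scheduler")

# Small grid of independent, CPU-heavy tasks: one Spark task per configuration.
grid = [(lr, h) for lr in (0.01, 0.1, 1.0) for h in (32, 64, 128)]

results = (sc.parallelize(grid, numSlices=len(grid))
             .map(score_configuration)
             .collect())

best = max(results, key=lambda r: r["score"])
```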
One example: learning digits • Learning task: given a set of images, recognize the digits • Standard benchmark dataset in computer vision, built by NIST (MNIST) 14
Training Deep Learning algorithms • Training a neural network is hard: • It is a sequential procedure (present one image after the other to learn from) • It can be sensitive to noise and the order of images: robustness analysis is critical • Tuning the training parameters (descent rate, batch sizes, etc.) is very important: otherwise, learning is too slow or gets stuck in a local minimum. A lot of heuristics are used in practice. 15
TensorFlow as a training library • Many algorithms and frameworks exist for this task; we will use TensorFlow, from Google: • Popular choice for neural network training and deep learning • Competitive performance • Easy to experiment with • Python interface makes it easy to integrate with Spark 16
Distributing TensorFlow computations • Even if TF is used as a single-machine library, we get speedups from Spark 17 Distributed Cross Validation ... Best Model Model #1 Training Model #2 Training Model #3 Training
Distributing TensorFlow computations 18 Distributed Cross Validation ... Best Model Model #4 Training Model #6 Training Model #3 Training Model #1 Training Model #5 Training Model #2 Training
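A sketch of the distributed grid search pictured in these diagrams, assuming a hypothetical `train_mnist(data, params)` that runs one single-machine TensorFlow training job and returns a validation accuracy; neither that function nor `load_mnist_locally` is part of any released library:

```python
from pyspark import SparkContext

sc = SparkContext(appName="distributed-tf-grid-search")

# Load the digit images once on the driver and ship them to every executor.
mnist = load_mnist_locally()          # hypothetical loader
mnist_bc = sc.broadcast(mnist)

param_grid = [{"learning_rate": lr, "hidden_units": h}
              for lr in (0.001, 0.01, 0.1)
              for h in (64, 256, 1024)]

def evaluate(params):
    # Each Spark task trains one TensorFlow model locally, on one machine.
    accuracy = train_mnist(mnist_bc.value, params)   # hypothetical trainer
    return (accuracy, params)

scores = sc.parallelize(param_grid, len(param_grid)).map(evaluate).collect()
best_accuracy, best_params = max(scores, key=lambda t: t[0])
```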
Results • Running a 2-layer neural network, testing different update rates and layer sizes 19 [Bar chart comparing 1 node, 2 nodes, and 13 nodes; vertical axis from 0 to 12000]
Embedding deep learning in Spark • Best known algorithms are essentially sequential during training • Careful selection of training parameters is critical • Spark can help iterate quickly to find a good set of parameters 20
Managing ML workflows with Spark 21
A data scientist’s wish list: • Run original code on a production environment • Use distributed data sources • Use familiar APIs and libraries • Distribute ML workload piece by piece • Only distribute as needed • Easily switch between local & distributed settings 22
Example: sentiment analysis 23 Given a review (text), predict the user’s rating. Data from https://snap.stanford.edu/data/web-Amazon.html
ML Workflow 24 Train model Evaluate Load data Extract features Review: This product doesn't seem to be made to last… Rating: 2 feature_vector: [0.1 -1.3 0.23 … -0.74] rating: 2.0 Regression: (review: String) => Double
Load Data 25 Data sources for DataFrames, built-in and external: { JSON }, JDBC, LIBSVM, and more … Train model Evaluate Load data Extract features
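A sketch of this step with the DataFrame readers (Spark 2.0 style); the S3 path and column names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sentiment-analysis").getOrCreate()

# Built-in JSON data source; each record is assumed to carry at least
# a `review` string and a numeric `rating`.
reviews = spark.read.json("s3n://my-bucket/amazon-reviews/*.json")

# The same reader API covers the other sources on the slide, e.g.:
#   spark.read.format("jdbc").options(url=..., dbtable=...).load()
#   spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

reviews.printSchema()
```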
Extract Features words: [this, product, doesn't, seem, to, …] feature_vector: [0.1 -1.3 0.23 … -0.74] Review: This product doesn't seem to be made to last… Rating: 2 Prediction: 3.0 Train model Evaluate Load data Tokenizer Hashed Term Frequency
Extract Features words: [this, product, doesn't, seem, to, …] feature_vector: [0.1 -1.3 0.23 … -0.74] Review: This product doesn't seem to be made to last… Rating: 2 Prediction: 3.0 Linear regression Evaluate Load data Tokenizer Hashed Term Frequency
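The feature-extraction and model stages above map directly onto the DataFrame-based Pipelines API; a sketch, with column names assumed from the earlier slides:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.regression import LinearRegression

tokenizer = Tokenizer(inputCol="review", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="rating")

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

# `reviews` is the DataFrame loaded above; the fitted PipelineModel maps a
# raw review string to a predicted rating (a Double, as on slide 24).
model = pipeline.fit(reviews)
predictions = model.transform(reviews)
```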
Our ML workflow 28 Cross Validation Model Training Feature Extraction regularization parameter: {0.0, 0.1, ...}
Cross validation 29 Cross Validation ... Best Model Model #1 Training Model #2 Training Feature Extraction Model #3 Training
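A sketch of this cross-validation step: sweep the regularization parameter of the linear model defined above and keep the best fitted pipeline (the grid values are illustrative):

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

# Candidate values for the regularization parameter from slide 28.
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.0, 0.01, 0.1, 1.0])
              .build())

evaluator = RegressionEvaluator(labelCol="rating", metricName="rmse")

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=3)

cv_model = cv.fit(reviews)          # trains one model per parameter value per fold
best_model = cv_model.bestModel     # the fitted Pipeline with the best regParam
```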
Example 30
MLlib in production ML Persistence 31
A data scientist’s wish list: • Run original code on a production environment • Use distributed data sources • Use familiar APIs and libraries • Distribute ML workload piece by piece • Only distribute as needed • Easily switch between local & distributed settings 32
DataFrame-based API for MLlib a.k.a. the "Pipelines" API, with utilities for constructing ML Pipelines In 2.0, the DataFrame-based API will become the primary API for MLlib. • Voted on by the community • org.apache.spark.ml, pyspark.ml The RDD-based API will enter maintenance mode. • Still maintained with bug fixes, but no new features • org.apache.spark.mllib, pyspark.mllib 33
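For orientation, the two APIs live in separate packages; a small illustration in Python:

```python
# DataFrame-based ("Pipelines") API -- primary API from Spark 2.0 on.
from pyspark.ml.classification import LogisticRegression

# RDD-based API -- maintenance mode: bug fixes only, no new features.
from pyspark.mllib.classification import LogisticRegressionWithSGD
```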
Why ML persistence? 34 Data Science Software Engineering Prototype (Python/R) Create model Re-implement model for production (Java) Deploy model
Why ML persistence? 35 Data Science Software Engineering Prototype (Python/R) Create Pipeline • Extract raw features • Transform features • Select key features • Fit multiple models • Combine results to make prediction • Extra implementation work • Different code paths • Synchronization overhead Re-implement Pipeline for production (Java) Deploy Pipeline
With ML persistence... 36 Data Science Software Engineering Prototype (Python/R) Create Pipeline Persist model or Pipeline: model.save(“s3n://...”) Load Pipeline (Scala/Java) Model.load(“s3n://…”) Deploy in production
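A sketch of the save/load round trip in Python (Spark 2.0); the S3 path is a placeholder, and the same saved directory can be loaded from Scala or Java with PipelineModel.load:

```python
from pyspark.ml import PipelineModel

model_path = "s3n://my-bucket/models/sentiment-pipeline"

# Data-science side: persist the best fitted pipeline from the cross-validation.
cv_model.bestModel.save(model_path)

# Production side: reload the fitted pipeline and apply it to new reviews.
loaded_model = PipelineModel.load(model_path)
predictions = loaded_model.transform(reviews)
```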
ML persistence status 37 [Diagram: an example workflow — text preprocessing, feature generation, generalized linear regression, model tuning — shown both as an unfitted Pipeline (the "recipe") and a fitted Model (the "result"); single-model persistence was already supported in MLlib's RDD-based API]
ML persistence status Near-complete coverage in all Spark language APIs • Scala & Java: complete (29 feature transformers, 21 models) • Python: complete except for 2 algorithms • R: complete for existing APIs Single underlying implementation of models Exchangeable data format • JSON for metadata • Parquet for model data (coefficients, etc.) 38
A data scientist’s wish list: • Run original code on a production environment • Directly apply learned pipelines • Use MLlib as export format • Use distributed data sources • Builtin Spark conversions • Use familiar APIs and libraries • Distribute ML workload piece by piece • Easy to distribute the most common ML tasks 39
What’s next? Prioritized items on the 2.1 roadmap JIRA (SPARK-15581): • Critical feature completeness for the DataFrame-based API – Multiclass logistic regression – Statistics • Python API parity & R API expansion • Scaling & speed tuning for key algorithms: trees & ensembles GraphFrames • Release for Spark 2.0 • Speed improvements (join elimination, connected components) 40
Get started • Get involved via the roadmap JIRA (SPARK-15581) + mailing lists • Download the notebook for this talk: http://dbricks.co/1UfvAH9 • ML persistence blog post: http://databricks.com/blog/2016/05/31 41 Try out the Apache Spark 2.0 preview release: http://databricks.com/try
Thank you! spark.apache.org spark-packages.org databricks.com

Editor's Notes

  • #2 Thanks. I would like to discuss the integration between Apache Spark and single-machine ML frameworks and tools, such as …
  • #3 Can I have a show of hands of many people in the audience have used scikit-learn, or R or Pandas?
  • #5 More than 1000 contributors in Jan 2016
  • #13 Spark is a very simple tool to accelerate your computations, even if you do not have big data. Spark integrates well with existing libraries: it is easy to use as a simple scheduler, and easy to write small bindings for the critical parts of libraries.
  • #22 I am demonstrating from Databricks: mounted S3 buckets, Parquet data files.
  • #23 Be able to extract a smaller amount of data from the production storage system. Slowly start distributing the system, piece by piece. Easily switch between local and distributed. Keep familiar APIs or even the same tools. Show how the distribution happens.
  • #24 Reviews scraped from Amazon.
  • #25 ML algorithms like to have numerical vectors
  • #29 Model training / tuning. Regularization: a parameter that controls how the linear model does on unseen data. There is no single good value for the regularization parameter; one common method to find one is to try out different values. This technique is called cross-validation (CV): you split your training data into two sets, one used to learn the model with a given regularization parameter, and another to evaluate how well it does with that parameter.
  • #32 Note: Recap verbally before this.
  • #33 Be able to extract a smaller amount of data from the production storage system. Slowly start distributing the system, piece by piece. Easily switch between local and distributed. Keep familiar APIs or even the same tools. Show how the distribution happens.
  • #37 Note this is loading into Spark.
  • #38 Saving & loading ML types: models, both unfitted ("recipe") & fitted; complex Pipelines, both unfitted ("workflow") & fitted.
  • #39 (DEMO)