MLeap: Deploy Spark ML Pipelines to Production API Servers

MLeap: Scaling Machine Learning From Research to Production
https://github.com/combust/mleap
Twitter: @combustml

Intros
- Hollin Wilkins
- Mikhail Semeniuk

Our Talk in 3 Parts
1. What is MLeap?
   ○ Problem Statement + Architecture of serialization format and execution engine + Benchmarks
2. Future of MLeap/Product Roadmap
   ○ Beyond Spark and JVM
3. Demo: Train and deploy a streaming model to an API server with MLeap-Serving

Initially
It took too long to get Spark models into production

Original MLeap Requirements
- Has to eliminate re-coding of feature pipelines and models when moving from research to production
- Serving/inference system has to be fast: sub-20ms at worst
- Should require a minimal amount of new code from the researcher to add new features/models
- Should be a lightweight library that users/organizations can customize as they see fit

Now
Solve for more than just Spark … we’ll talk about this later

New MLeap Requirements
- Inference needs to happen outside of the JVM (train in Spark, execute on an embedded device)
- Should support other popular ML frameworks like Scikit-Learn and TensorFlow

MLeap Architecture (high-level)
- A Serialization Framework for Machine Learning Pipelines
- An Execution Engine for Machine Learning Pipelines

From Research to Production in 3 Steps
1. Continue to write your ML pipelines and train your models in Spark
2. Serialize your entire feature pipeline and model(s) to an MLeap bundle, called bundle.ml
3. Load the serialized pipeline into MLeap Serving and execute it via a REST API, without any dependency on the SparkContext

MLeap Bundle (a.k.a. bundle.ml)

bundle.ml: Serialization Framework (bundle 1)
- Continuous Features: Vector Assembler -> Continuous Feature Vector -> Standard Scaler -> Scaled Continuous Feature Vector
- Categorical Features: String Indexer (x2) -> OneHotEncoder (x2) -> Vector Assembler -> Categorical Feature Vector
- Model: Vector Assembler -> Linear Regression

bundle.ml: Serialization Framework (bundle 2)
- Continuous Features: Vector Assembler -> Continuous Feature Vector -> Standard Scaler -> Scaled Continuous Feature Vector
- Categorical Features: String Indexer (x2) -> OneHotEncoder (x2) -> Vector Assembler -> Categorical Feature Vector
- Model: Vector Assembler -> Random Forest Regression
- Additional stage: PCA
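To make the stages in these pipeline diagrams concrete, here is a minimal plain-Python sketch of what executing such a pipeline on one row could look like. It does not use MLeap or Spark; the function names, the column names (borrowed from the LeapFrame slide), and every parameter value (labels, mean, standard deviation, coefficients, intercept) are assumptions made up for illustration.

```python
# Illustrative stand-ins for the stages named on the slides.
# None of this is MLeap code; all parameter values are invented.

def string_index(value, labels):
    """String Indexer: map a category string to its integer index."""
    return labels.index(value)

def one_hot(index, size):
    """OneHotEncoder: encode an integer index as a one-hot list of floats."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def standard_scale(vector, means, stds):
    """Standard Scaler: subtract the mean and divide by the standard deviation."""
    return [(x - m) / s for x, m, s in zip(vector, means, stds)]

def assemble(*parts):
    """Vector Assembler: concatenate several vectors into one feature vector."""
    features = []
    for part in parts:
        features.extend(part)
    return features

def linear_regression(features, coefficients, intercept):
    """Linear Regression: intercept plus the dot product of features and coefficients."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Hypothetical fitted parameters, i.e. what a serialized bundle would need to carry.
PARAMS = {
    "room_type_labels": ["House", "Apartment"],
    "means": [1000.0],
    "stds": [200.0],
    "coefficients": [50.0, 20.0, 10.0],
    "intercept": 100.0,
}

def predict(row, params=PARAMS):
    """Run one row through the continuous branch, the categorical branch, and the model."""
    continuous = assemble([float(row["square_feet"])])
    scaled = standard_scale(continuous, params["means"], params["stds"])
    index = string_index(row["room_type"], params["room_type_labels"])
    categorical = assemble(one_hot(index, len(params["room_type_labels"])))
    features = assemble(scaled, categorical)
    return linear_regression(features, params["coefficients"], params["intercept"])

print(predict({"square_feet": 1200, "room_type": "House"}))
```

Everything `predict` needs lives in `PARAMS`, which is the point of serializing the whole feature pipeline together with the model: the serving side only needs the fitted parameters and a small execution engine.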

bundle.ml: Structure
- Bundle.json: Root-level metadata about the pipeline (version, names, etc.)
- Model.json: Data required to execute the model (coefficients, decision trees, intercepts, string lookups, etc.)
- Node.json: Connects input/output data for models to a LeapFrame (features for a logistic regression, prediction field for a random forest, etc.)
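The three-file split can be sketched with a toy bundle written to and read back from a temporary directory. The file names follow the slide (lower-cased here); the one-directory-per-node layout and every JSON field name are assumptions for illustration and are not the actual bundle.ml schema.

```python
import json
import os
import tempfile

def write_json(path, payload):
    """Write one JSON file, creating parent directories as needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as handle:
        json.dump(payload, handle, indent=2)

def write_bundle(root):
    """Write a toy bundle: one root file plus one directory per pipeline node."""
    # Root-level metadata about the pipeline (hypothetical fields).
    write_json(os.path.join(root, "bundle.json"),
               {"name": "demo_pipeline", "version": "0.0.1",
                "nodes": ["linear_regression"]})
    node_dir = os.path.join(root, "linear_regression")
    # Data required to execute the model (hypothetical fields and values).
    write_json(os.path.join(node_dir, "model.json"),
               {"op": "linear_regression",
                "coefficients": [50.0, 20.0, 10.0],
                "intercept": 100.0})
    # Wiring between the model and the frame's columns (hypothetical fields).
    write_json(os.path.join(node_dir, "node.json"),
               {"name": "linear_regression",
                "inputs": {"features": "features"},
                "outputs": {"prediction": "prediction"}})

def read_bundle(root):
    """Load the root metadata, then the model and node files of every listed node."""
    with open(os.path.join(root, "bundle.json")) as handle:
        bundle = json.load(handle)
    nodes = {}
    for name in bundle["nodes"]:
        node = {}
        for part in ("model", "node"):
            with open(os.path.join(root, name, part + ".json")) as handle:
                node[part] = json.load(handle)
        nodes[name] = node
    return bundle, nodes

def demo():
    """Round-trip the toy bundle through a temporary directory."""
    with tempfile.TemporaryDirectory() as root:
        write_bundle(root)
        return read_bundle(root)
```

Keeping model data separate from the input/output wiring means the same fitted model could be reconnected to differently named columns without touching its coefficients.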

Custom Transformers
[Diagram: Custom Spark Transformer; Custom MLeap Transformer; Bundle Spark Serializer; Bundle MLeap Serializer; MLeap Bundle]
- Define your model and node conversions
- Bundle.ML handles serializing as either JSON or Protobuf
- All transformers in MLeap are implemented in this way
- Custom MLeap TFs: Unary/Binary, SVM, Imputer Logic
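As a rough illustration of "define your model and node conversions", the sketch below implements a toy imputer in plain Python with one conversion for its model data and one for its input/output wiring, then round-trips both through JSON. The class, method names, and JSON fields are hypothetical and do not reflect MLeap's actual serializer interfaces.

```python
import json

class Imputer:
    """Toy custom transformer: replaces a missing value with a fixed fill value."""

    def __init__(self, input_col, output_col, fill_value):
        self.input_col = input_col
        self.output_col = output_col
        self.fill_value = fill_value

    def transform(self, row):
        """Return a copy of the row with the imputed output column added."""
        out = dict(row)
        value = row.get(self.input_col)
        out[self.output_col] = self.fill_value if value is None else value
        return out

    def model_to_dict(self):
        """Model conversion: the data needed to execute the transformer."""
        return {"op": "imputer", "fill_value": self.fill_value}

    def node_to_dict(self):
        """Node conversion: which columns the transformer reads and writes."""
        return {"input": self.input_col, "output": self.output_col}

    @classmethod
    def from_dicts(cls, model, node):
        """Rebuild the transformer from its two serialized parts."""
        return cls(node["input"], node["output"], model["fill_value"])

def round_trip(transformer):
    """Serialize both parts to JSON text and rebuild the transformer from them."""
    model_json = json.dumps(transformer.model_to_dict())
    node_json = json.dumps(transformer.node_to_dict())
    return Imputer.from_dicts(json.loads(model_json), json.loads(node_json))

restored = round_trip(Imputer("avg_rating", "avg_rating_imputed", 0.9))
print(restored.transform({"avg_rating": None}))
```

Once the two conversions exist, the choice of wire format (JSON here, Protobuf as the slide also mentions) is independent of the transformer's own logic.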

Core Concepts: Data Frame (LeapFrame)

| square_feet (Int) | room_type (string) | avg_rating (double) | is_special (bool) |
| 1200 | House | .93 | true |
| 800 | Apartment | .90 | false |

1. Schema defines names and types of columns
2. Rows to hold data
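The schema-plus-rows idea can be sketched in a few lines of plain Python using the example data from the slide. This is not the LeapFrame API; the `Frame` class and its methods are hypothetical, and Python types stand in for the slide's Int/string/double/bool.

```python
class Frame:
    """Toy data frame: a schema of (name, type) pairs plus rows of values."""

    def __init__(self, schema, rows):
        self.schema = list(schema)
        self.rows = [tuple(row) for row in rows]
        for row in self.rows:
            self._check(row)

    def _check(self, row):
        """Reject rows whose length or value types do not match the schema."""
        if len(row) != len(self.schema):
            raise ValueError("row length does not match schema")
        for (name, expected), value in zip(self.schema, row):
            # Exact type match, so a bool is not accepted where an int is declared.
            if type(value) is not expected:
                raise TypeError(
                    f"{name}: expected {expected.__name__}, got {type(value).__name__}")

    def column(self, name):
        """Return all values of one column, in row order."""
        index = [n for n, _ in self.schema].index(name)
        return [row[index] for row in self.rows]

    def with_column(self, name, col_type, fn):
        """Return a new frame with one extra column computed from each row."""
        names = [n for n, _ in self.schema]
        new_rows = [row + (fn(dict(zip(names, row))),) for row in self.rows]
        return Frame(self.schema + [(name, col_type)], new_rows)

listings = Frame(
    schema=[("square_feet", int), ("room_type", str),
            ("avg_rating", float), ("is_special", bool)],
    rows=[(1200, "House", 0.93, True), (800, "Apartment", 0.90, False)],
)

# A transformer adds an output column derived from existing columns.
larger = listings.with_column("is_large", bool, lambda r: r["square_feet"] > 1000)
print(larger.column("is_large"))
```

Each transformer reading named columns and appending a new one is what lets a node's input/output description connect a model to the frame.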

[Diagram: MLeap Bundles; MLeap-JVM; MLeap-RS; Cloud APIs; IoT Devices]

MLeap <> Scikit-Learn
- Preprocessing (Scalers, Label/OneHot Encoder)
- Base Models (Linear/Logistic)
- Dimension Reduction (PCA)
- Other: PCA, RF, GBRT
- Pipelines/Feature Unions
[Diagram: MLeap Bundles; MLeap-JVM]

MLeap <> TensorFlow
- Spark Feature Pipeline: TensorFlow Transformer (TF via JNI)
- MLeap Serving: TF via JNI

MLeap Benchmarks
[Benchmark charts: Default MLeap Transformer (Random Forest); MLeap Row Transformer (Random Forest)]

MLeap Deployed in Prod
- MLeap: 4-15ms
- Spark: 0.2s-1s

MLeap Roadmap
MLeap <> Spark
- Full Streaming Support
- Fully typed transformer pipelines
MLeap <> Scikit
- Scikit <> Spark Transformer Parity
- Protobuf Serialization
- Deserialization
MLeap <> Rust
- MIR (mid-level intermediate representation)
- JIT/Interpreted Mode

Grow the Community
14 Contributors, 21k Lines of Code, Started 2016
- Become a contributor!
- File an issue report
- Write a cool demo using MLeap
- Discuss the future of MLeap
- Chat with us about your use case
- Let us help if you run into any problems
Deployed With MLeap
- Looking for Product Managers for MLeap
- Share your MLeap success story
- Write a blog post about your MLeap project

MLeap Rust: CPU, CUDA/HIP, OpenCL
MLeap Rust: Python, Obj-C/Swift, Go, Ruby, C#

Demo
[Diagram: Apartment Listing Data; Train Spark Feature Pipeline + Linear Regression Model; Feature Pipeline Bundle; MLeap Serving; Client API; Streaming Listing Data; Transform Features; Stream to Socket; Online Linear Regression]

Thank You!
- Hollin Wilkins, e: hollin@combust.ml, tw: @hollinwilkins
- Mikhail Semeniuk, e: mikhail@combust.ml, tw: @mikhailsemeniuk
https://github.com/combust/mleap
https://gitter.im/combust/mleap
https://twitter.com/combustml
