Making Data Science Scalable: Lessons Learned from Building ML Platforms
May 16, 2019
Laurenz Wuttke, Till Döhmen
“Orbital ATK Antares Launch (101410280027HQ)” by NASA HQ PHOTO is licensed under CC BY-NC-ND 2.0
About us
Till Döhmen • Data Scientist / Software Engineer • Working on RecSys & AutoML Platform
Laurenz Wuttke • Data Scientist & Founder of datasolut • Working on RecSys & Feature Stores • Blog: www.mlguide.de
Why do we need Scalability? Rising… • Number of Contributors • Number of Use Cases • Volume and Velocity of Data • Complexity of Models • Number of End-Users • Frequency of Updates
What is an ML Platform? • A company-wide environment that supports Data Scientists in their daily work • Data Preparation • Modelling • Evaluation • Deployment • Model Monitoring • etc. • It is built to scale in multiple dimensions with growing demands
ML is extremely technical (Source: “Hidden Technical Debt in Machine Learning Systems” by D. Sculley et al., 2015)
ML Platforms are developing quickly (e.g. Facebook’s FBLearner)
5 Lessons Learned
#1: Data Science in silos is bad
Data Science Silos • Notebook instances on various (local) machines • No proper processes defined → A pipeline jungle makes machine learning very inefficient and hard to maintain, track, and scale, and makes it hard to meet business expectations!
#2: Feature stores should be at the heart of every ML Platform
Feature Stores • Central data layer for Machine Learning features • Quality-tested & curated • Highly automated processes • Efficiency for Data Science teams (data preparation often takes ~80% of the workload) → Focus on building models (see the sketch below)
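To make the idea concrete: with a feature store, a Data Scientist assembles a training set by joining pre-computed, curated feature tables instead of rebuilding ETL from raw data for every project. The snippet below is a minimal sketch using pandas; the file names, feature tables, and the customer_id join key are illustrative assumptions, not the API of any particular feature store product.

```python
import pandas as pd

# Hypothetical curated feature tables exported from a feature store
# (table names and the customer_id join key are illustrative assumptions).
demographics = pd.read_parquet("feature_store/customer_demographics.parquet")
transactions = pd.read_parquet("feature_store/transaction_aggregates.parquet")
labels = pd.read_parquet("feature_store/churn_labels.parquet")

# Assemble a training set by joining quality-tested features on a shared key
# instead of re-implementing ETL and cleaning steps for every project.
training_set = (
    labels
    .merge(demographics, on="customer_id", how="left")
    .merge(transactions, on="customer_id", how="left")
)

X = training_set.drop(columns=["customer_id", "churned"])
y = training_set["churned"]
```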
(Diagram) Data Engineering: ETL processes, data transformation, data cleaning • Feature Engineering • Data Science: models & visualizations
Old way… Source: Logical Clocks AB
With a Feature Store… Source: Logical Clocks AB
(Chart) Data Science project costs: resources needed vs. number of features in the Feature Store
#3: AutoML works great if you have a feature store
AutoML • AutoML is advancing at a rapid pace • Algorithm selection • Hyperparameter tuning • Model stacking • (Feature generation & selection) • (Neural architecture search) • Usually works only on “flat” tables (see the sketch below)
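As one concrete example of AutoML on a “flat” feature table, the sketch below uses TPOT (one of several AutoML libraries); the demo dataset and the search settings are illustrative assumptions. Algorithm selection, hyperparameter tuning, and pipeline stacking all happen inside fit().

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier  # one example of an AutoML library

# A "flat" table of numeric features, as most AutoML tools expect
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# TPOT searches over preprocessing steps, algorithms, and hyperparameters
automl = TPOTClassifier(generations=5, population_size=20,
                        scoring="roc_auc", random_state=42, verbosity=2)
automl.fit(X_train, y_train)

print("Test ROC AUC:", automl.score(X_test, y_test))
automl.export("best_pipeline.py")  # export the winning pipeline as scikit-learn code
```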
AutoML • Add feature generation to your AutoML pipeline • Don’t be too afraid of crazy black-box models; packages like SHAP can help with interpretability (see the sketch below) • But models are not optimized for runtime • (Diagram: Feature Generation, Feature Selection, AutoML)
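A minimal sketch of how SHAP can help interpret a black-box model; the gradient-boosting model and the demo dataset are illustrative assumptions.

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

# Train an example "black box" model on a flat feature table
data = load_breast_cancer()
model = GradientBoostingClassifier(random_state=42).fit(data.data, data.target)

# TreeExplainer computes SHAP values efficiently for tree-based models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data)

# Global view: which features drive predictions, and in which direction
shap.summary_plot(shap_values, data.data, feature_names=data.feature_names)
```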
#4: Treat Data Science (ML) Projects more like Software Development Projects
Source: “Hidden Technical Debt in Machine Learning Systems” by D. Sculley et al., 2015
ML Lifecycle: Requirements/Ideation → Data Acquisition → Data Preparation → Model Design → Experimenting → Model Training/Optimization → Evaluation → Integration → Testing/QA → Deployment → Maintenance
Software Dev. Lifecycle: Requirements → Design → Implementation → Integration/Build → Testing/QA → Deployment → Maintenance
Is ML really like Software Dev.? • ML feels more like debugging • Experimentation-heavy • Notebooks are the preferred mode of development • Not easy to version-control • Not easy to deploy
Model Tracking • We need a way to keep track of experiments • Models • Parameters • Evaluation results • Other artifacts (data) • Tools like MLflow or DVC facilitate that • DVC is more git-like, MLflow is explicit in-code → Build up a (central) Model Repository (see the sketch below)
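A minimal sketch of experiment tracking with MLflow; the model, parameters, and metric are illustrative assumptions. Each run records parameters, metrics, and the model artifact, which together form the basis of a central model repository.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
params = {"n_estimators": 200, "max_depth": 5}  # illustrative hyperparameters

with mlflow.start_run(run_name="rf-baseline"):
    mlflow.log_params(params)

    model = RandomForestClassifier(**params, random_state=42).fit(X, y)
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    mlflow.log_metric("cv_roc_auc", auc)

    # Store the trained model so it can later be promoted from the model repository
    mlflow.sklearn.log_model(model, "model")
```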
(Diagram) The ML lifecycle mapped onto the software development lifecycle: Software Development is supported by Source Code Management and Continuous Integration / Continuous Delivery; Machine Learning additionally needs a Feature Store and a Model Repository alongside Source Code Management and CI/CD.
CI/CD • A long-established practice in software development • We can use CI/CD software to • Schedule training/evaluation jobs • Run automatic tests • Integrate our models into e.g. a Docker container • Ship our deployments to the production environment • Provide mechanisms for failover etc. (see the sketch below)
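To illustrate the “run automatic tests before shipping” step, a CI/CD pipeline could call a small quality-gate script like the hypothetical sketch below; the artifact paths and the acceptance threshold are assumptions. If the script exits non-zero, the pipeline stops before the Docker image is built and deployed.

```python
"""Hypothetical quality gate a CI/CD pipeline could run after training."""
import sys

import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

MIN_AUC = 0.80  # illustrative acceptance threshold

# Artifacts produced by the training job earlier in the pipeline (assumed paths)
model = joblib.load("artifacts/candidate_model.joblib")
holdout = pd.read_parquet("artifacts/holdout.parquet")

scores = model.predict_proba(holdout.drop(columns=["label"]))[:, 1]
auc = roc_auc_score(holdout["label"], scores)
print(f"Holdout ROC AUC: {auc:.3f}")

if auc < MIN_AUC:
    # Non-zero exit code fails the CI job and blocks deployment
    sys.exit(f"Candidate model below threshold ({MIN_AUC}); blocking deployment.")
```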
Unit Testing • (Automated) testing & QA should be in place for production systems • Example test cases: • Modelling/infrastructure code for bugs • Training process with predefined data • Significant changes of data in the Feature Store • Significant changes in model output • Testing of data is challenging and an open problem; start simple (see the sketch below)
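A minimal pytest-style sketch of such test cases; the train_model function and the small predefined dataset are hypothetical stand-ins for the project’s real training code and test data.

```python
import numpy as np
import pytest
from sklearn.linear_model import LogisticRegression


def train_model(X, y):
    """Hypothetical stand-in for the project's training code."""
    return LogisticRegression(max_iter=1000).fit(X, y)


@pytest.fixture
def tiny_dataset():
    # Small, predefined dataset with a known, easily learnable pattern
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] > 0).astype(int)
    return X, y


def test_training_runs_on_predefined_data(tiny_dataset):
    X, y = tiny_dataset
    model = train_model(X, y)
    assert hasattr(model, "predict")


def test_model_output_has_not_changed_significantly(tiny_dataset):
    X, y = tiny_dataset
    model = train_model(X, y)
    # Guard against silent regressions in the training process
    assert model.score(X, y) > 0.9
```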
Monitoring • Score distributions (may) change over time (chart: histogram over score bins, Week 1 vs. Week 4) • Validate & track your model performance constantly • Retrain (automatically) on new data if needed (see the sketch below)
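One common way to quantify the kind of score-distribution shift shown in the chart is the Population Stability Index (PSI) computed over score bins; a minimal sketch, where the simulated score samples and the 0.2 alert threshold are illustrative assumptions.

```python
import numpy as np


def population_stability_index(expected, actual, bins=10):
    """Compare two score distributions binned over [0, 1]; higher PSI = more drift."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero / log(0) for empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


# Illustrative example: model scores from Week 1 vs. Week 4
rng = np.random.default_rng(1)
week1_scores = rng.beta(2, 5, size=10_000)
week4_scores = rng.beta(2, 3, size=10_000)

psi = population_stability_index(week1_scores, week4_scores)
print(f"PSI: {psi:.3f}")
if psi > 0.2:  # a common rule of thumb; the retraining threshold is a project decision
    print("Significant score drift detected; consider retraining.")
```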
#5: A Cloud-based Infrastructure makes it easy to get started
Cloud vs. On-Premise
Summary
Summary • Don’t work in silos • Create a feature store • Keep track of your models • Make use of AutoML where applicable • Use Cloud Infrastructure if you want to start quickly • Build your own ML Platform
Requirements / Ideation Data Acquisitio n Experimenti ng Training / Optimizin g Testing/ QA Maintenanc e Model Design Model Training Evaluation Integration Deployment Data Preparatio n ML Platform Data/Feature Management Model Management CI/CD ML Platform AutoML Unit Testing Moni- toring Cloud / On-Premise Infrastructure Docker- ization
Thank you! • Questions… • You can find us on LinkedIn: https://www.linkedin.com/in/tdoehmen/ https://www.linkedin.com/in/laurenz-wuttke/
Links • https://engineering.linkedin.com/blog/2019/01/scaling-machine-learning-productivity-at-linkedin • https://databricks.com/session/zipline-airbnbs-machine-learning-data-management-platform • https://eng.uber.com/michelangelo/ • https://www.logicalclocks.com/feature-store/
