Making Data Science Scalable: Lessons Learned from Building ML Platforms
May 16, 2019
Laurenz Wuttke, Till Döhmen
“Orbital ATK Antares Launch (101410280027HQ)” by NASA HQ PHOTO is licensed under CC BY-NC-ND 2.0
About us
Till Döhmen • Data Scientist / Software Engineer • Working on RecSys & AutoML Platform
Laurenz Wuttke • Data Scientist & Founder of datasolut • Working on RecSys & Feature Stores • Blog: www.mlguide.de
Why do we need Scalability? Rising… • Number of Contributors • Number of Use Cases • Volume and Velocity of Data • Complexity of Models • Number of End-Users • Frequency of Updates
What is an ML Platform? • A company-wide environment that supports Data Scientists in their daily work • Data Preparation • Modelling • Evaluation • Deployment • Model Monitoring • etc. • It is built to scale in multiple dimensions with growing demands
ML is extremely technical (Source: “Hidden Technical Debt in Machine Learning Systems” by D. Sculley et al., 2015)
ML Platforms are developing quickly (e.g. Facebook’s FBLearner)
5 Lessons Learned
#1: Data Science in silos is bad
Data Science Silos • Notebook instances on various (local) machines • No proper processes defined → A pipeline jungle makes machine learning very inefficient and hard to maintain, track, and scale, and makes it hard to meet business expectations!
#2: Feature stores should be at the heart of every ML Platform
Feature Stores • Central data layer for Machine Learning features • Quality-tested & curated • Highly automated processes • Efficiency for Data Science teams (data preparation often takes ~80% of the workload) → Focus on building models (see the sketch below)
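To make the idea concrete: with a feature store, a Data Scientist assembles a training set by joining pre-computed, curated feature tables instead of rebuilding ETL from raw data for every project. The snippet below is a minimal sketch using pandas; the file names, feature tables, and the customer_id join key are illustrative assumptions, not the API of any particular feature store product.

```python
import pandas as pd

# Hypothetical curated feature tables exported from a feature store
# (table names and the customer_id join key are illustrative assumptions).
demographics = pd.read_parquet("feature_store/customer_demographics.parquet")
transactions = pd.read_parquet("feature_store/transaction_aggregates.parquet")
labels = pd.read_parquet("feature_store/churn_labels.parquet")

# Assemble a training set by joining quality-tested features on a shared key
# instead of re-implementing ETL and cleaning steps for every project.
training_set = (
    labels
    .merge(demographics, on="customer_id", how="left")
    .merge(transactions, on="customer_id", how="left")
)

X = training_set.drop(columns=["customer_id", "churned"])
y = training_set["churned"]
```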
(Diagram) Data Engineering: ETL processes, data transformation, data cleaning • Feature Engineering • Data Science: models & visualizations
Old way… Source: Logical Clocks AB
With a Feature Store… Source: Logical Clocks AB
(Chart) Data Science project costs: resources needed vs. number of features in the Feature Store
#3: AutoML works great if you have a feature store
AutoML • AutoML is advancing at a rapid pace • Algorithm selection • Hyperparameter tuning • Model stacking • (Feature generation & selection) • (Neural architecture search) • Usually works only on “flat” tables (see the sketch below)
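As one concrete example of AutoML on a “flat” feature table, the sketch below uses TPOT (one of several AutoML libraries); the demo dataset and the search settings are illustrative assumptions. Algorithm selection, hyperparameter tuning, and pipeline stacking all happen inside fit().

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier  # one example of an AutoML library

# A "flat" table of numeric features, as most AutoML tools expect
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# TPOT searches over preprocessing steps, algorithms, and hyperparameters
automl = TPOTClassifier(generations=5, population_size=20,
                        scoring="roc_auc", random_state=42, verbosity=2)
automl.fit(X_train, y_train)

print("Test ROC AUC:", automl.score(X_test, y_test))
automl.export("best_pipeline.py")  # export the winning pipeline as scikit-learn code
```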
AutoML • Add feature generation to your AutoML pipeline • Don’t be too afraid of crazy black-box models; packages like SHAP can help with interpretability (see the sketch below) • But models are not optimized for runtime • (Diagram: Feature Generation, Feature Selection, AutoML)
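A minimal sketch of how SHAP can help interpret a black-box model; the gradient-boosting model and the demo dataset are illustrative assumptions.

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

# Train an example "black box" model on a flat feature table
data = load_breast_cancer()
model = GradientBoostingClassifier(random_state=42).fit(data.data, data.target)

# TreeExplainer computes SHAP values efficiently for tree-based models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data)

# Global view: which features drive predictions, and in which direction
shap.summary_plot(shap_values, data.data, feature_names=data.feature_names)
```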
#4: Treat Data Science (ML) Projects more like Software Development Projects
Source: “Hidden Technical Debt in Machine Learning Systems” by D. Sculley et al., 2015
ML Lifecycle: Requirements/Ideation → Data Acquisition → Data Preparation → Model Design → Experimenting → Model Training/Optimization → Evaluation → Integration → Testing/QA → Deployment → Maintenance
Software Dev. Lifecycle: Requirements → Design → Implementation → Integration/Build → Testing/QA → Deployment → Maintenance
Is ML really like Software Dev.? • ML feels more like debugging • Experimentation-heavy • Notebooks are the preferred mode of development • Not easy to version-control • Not easy to deploy
Model Tracking • We need a way to keep track of experiments • Models • Parameters • Evaluation results • Other artifacts (data) • Tools like MLflow or DVC facilitate that • DVC is more git-like, MLflow is explicit in-code → Build up a (central) Model Repository (see the sketch below)
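A minimal sketch of experiment tracking with MLflow; the model, parameters, and metric are illustrative assumptions. Each run records parameters, metrics, and the model artifact, which together form the basis of a central model repository.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
params = {"n_estimators": 200, "max_depth": 5}  # illustrative hyperparameters

with mlflow.start_run(run_name="rf-baseline"):
    mlflow.log_params(params)

    model = RandomForestClassifier(**params, random_state=42).fit(X, y)
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    mlflow.log_metric("cv_roc_auc", auc)

    # Store the trained model so it can later be promoted from the model repository
    mlflow.sklearn.log_model(model, "model")
```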
(Diagram) The ML lifecycle mapped onto the software development lifecycle: Software Development is supported by Source Code Management and Continuous Integration / Continuous Delivery; Machine Learning additionally needs a Feature Store and a Model Repository alongside Source Code Management and CI/CD.
CI/CD • A long-established practice in software development • We can use CI/CD software to • Schedule training/evaluation jobs • Run automatic tests • Integrate our models into e.g. a Docker container • Ship our deployments to the production environment • Provide mechanisms for failover etc. (see the sketch below)
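To illustrate the “run automatic tests before shipping” step, a CI/CD pipeline could call a small quality-gate script like the hypothetical sketch below; the artifact paths and the acceptance threshold are assumptions. If the script exits non-zero, the pipeline stops before the Docker image is built and deployed.

```python
"""Hypothetical quality gate a CI/CD pipeline could run after training."""
import sys

import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

MIN_AUC = 0.80  # illustrative acceptance threshold

# Artifacts produced by the training job earlier in the pipeline (assumed paths)
model = joblib.load("artifacts/candidate_model.joblib")
holdout = pd.read_parquet("artifacts/holdout.parquet")

scores = model.predict_proba(holdout.drop(columns=["label"]))[:, 1]
auc = roc_auc_score(holdout["label"], scores)
print(f"Holdout ROC AUC: {auc:.3f}")

if auc < MIN_AUC:
    # Non-zero exit code fails the CI job and blocks deployment
    sys.exit(f"Candidate model below threshold ({MIN_AUC}); blocking deployment.")
```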
Unit Testing • (Automated) testing & QA should be in place for production systems • Example test cases: • Modelling/infrastructure code for bugs • Training process with predefined data • Significant changes of data in the Feature Store • Significant changes in model output • Testing of data is challenging and an open problem; start simple (see the sketch below)
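A minimal pytest-style sketch of such test cases; the train_model function and the small predefined dataset are hypothetical stand-ins for the project’s real training code and test data.

```python
import numpy as np
import pytest
from sklearn.linear_model import LogisticRegression


def train_model(X, y):
    """Hypothetical stand-in for the project's training code."""
    return LogisticRegression(max_iter=1000).fit(X, y)


@pytest.fixture
def tiny_dataset():
    # Small, predefined dataset with a known, easily learnable pattern
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] > 0).astype(int)
    return X, y


def test_training_runs_on_predefined_data(tiny_dataset):
    X, y = tiny_dataset
    model = train_model(X, y)
    assert hasattr(model, "predict")


def test_model_output_has_not_changed_significantly(tiny_dataset):
    X, y = tiny_dataset
    model = train_model(X, y)
    # Guard against silent regressions in the training process
    assert model.score(X, y) > 0.9
```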
Monitoring • Score distributions (may) change over time (chart: histogram over score bins, Week 1 vs. Week 4) • Validate & track your model performance constantly • Retrain (automatically) on new data if needed (see the sketch below)
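One common way to quantify the kind of score-distribution shift shown in the chart is the Population Stability Index (PSI) computed over score bins; a minimal sketch, where the simulated score samples and the 0.2 alert threshold are illustrative assumptions.

```python
import numpy as np


def population_stability_index(expected, actual, bins=10):
    """Compare two score distributions binned over [0, 1]; higher PSI = more drift."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero / log(0) for empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


# Illustrative example: model scores from Week 1 vs. Week 4
rng = np.random.default_rng(1)
week1_scores = rng.beta(2, 5, size=10_000)
week4_scores = rng.beta(2, 3, size=10_000)

psi = population_stability_index(week1_scores, week4_scores)
print(f"PSI: {psi:.3f}")
if psi > 0.2:  # a common rule of thumb; the retraining threshold is a project decision
    print("Significant score drift detected; consider retraining.")
```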
#5: A Cloud-based Infrastructure makes it easy to get started
Cloud vs. On-Premise
Summary
Summary • Don’t work in silos • Create a feature store • Keep track of your models • Make use of AutoML where applicable • Use Cloud Infrastructure if you want to start quickly • Build your own ML Platform
Requirements / Ideation Data Acquisitio n Experimenti ng Training / Optimizin g Testing/ QA Maintenanc e Model Design Model Training Evaluation Integration Deployment Data Preparatio n ML Platform Data/Feature Management Model Management CI/CD ML Platform AutoML Unit Testing Moni- toring Cloud / On-Premise Infrastructure Docker- ization
Thank you! • Questions… • You can find us on LinkedIn: https://www.linkedin.com/in/tdoehmen/ https://www.linkedin.com/in/laurenz-wuttke/
Links • https://engineering.linkedin.com/blog/2019/01/scaling-machine-learning-productivity-at-linkedin • https://databricks.com/session/zipline-airbnbs-machine-learning-data-management-platform • https://eng.uber.com/michelangelo/ • https://www.logicalclocks.com/feature-store/
