Applying Software Engineering Practices for the Data Science & ML Lifecycle
Data Works Summit, San Jose 2018
Sriram Srinivasan – Architect, IBM Data Science & Machine Learning, Cloud Private for Data, IBM Data Science Experience
© 2018 IBM Corporation
Overview
• Enterprises, as usual, want a quick return on investments in Data Science
  o But with a shrinking dev -> prod cycle, applying new techniques and making course corrections on a continuous basis is the norm
• Data Science & Machine Learning are increasingly cross-team endeavors
  o Data Engineers, Business Analysts, DBAs and Data Stewards are frequently involved in the lifecycle
• The “Cloud” has been a great influence
  o Economies of scale with regard to infrastructure cost and quick re-assignment of resources are expected
  o Automation, APIs, repeatability, reliability and elasticity are essential for “operations”
• Data Science & ML need to exhibit the same maturity as other enterprise-class apps:
  o Compliance & regulations remain a critical mandate for large Enterprises to adhere to
  o Security, auditability and governance need to be in place from the start, not an afterthought
Data Scientist Concerns
• Where is the data I need to drive business insights?
  o I don’t want to do all the plumbing – connecting to databases, Hadoop, etc.
• How do I collaborate and share my work with others?
• What visualization techniques exist to tell my story?
• How do I bring my familiar R/Python libraries to this new Data Science platform?
  o How do I use the latest libraries/techniques or newer versions?
• How do I procure compute resources for my experimentation?
  o With specialized compute such as GPUs
• How are my Machine Learning Models performing & how do I improve them?
• I have this Machine Learning Model – how do I help deploy it in production?
Data Science Experience – access to libraries & tools, an ever-growing list
• Multiple programming languages – Python, R, Scala...
  o Modern Data Scientists are programmers/software developers too!
• Build your favorite libraries or experiment with new ones
  o Modularization via packages & dependency management are problems just as with any software development
• Publish apps and expose APIs – share & collaborate
• Work with a variety of data sources and technologies, easily
• Machine Learning Environments, Deep Learning Environments, SPSS Modeler, ...
Challenges for the Enterprise
• Ensure secure data access & auditability – for governance and compliance
  o Control and curate access to data and to all open-source libraries used
• Explainability and reproducibility of machine learning activities
  o Improve trust in analytics and predictions
• Efficient collaboration and versioning of all source, sample data and models
  o Easy teaming with accountability
• Establish continuous-integration practices, just as with any Enterprise software
  o Agility in delivery and problem resolution in production
• Publish/share and identify provenance/lineage with confidence
  o Visibility and access control
• Effective resource utilization and the ability to scale out on demand
  o Guarantee SLAs for production work; balance resources amongst different data scientists’ and machine learning practitioners’ workloads
Goal: Operationalize Data Science!
5 tenets for operationalizing data science
• Analytics-Ready Data – Managed
• Trusted – Quality, Provenance and Explainability
• Resilient – At Scale & Always On
• Measurable – Monitor + Measure
• Evolution – Deliver & Improve
Where’s my data? (Analytics-Ready Data – Managed)
• access to data, with techniques to track & deal with sensitive content
• data virtualization
• automatable pipelines for data preparation and transformations
Need: an Enterprise Catalog & Data Integration capabilities
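To make the “automatable pipelines” point concrete, below is a minimal sketch of a repeatable data-preparation step using scikit-learn; the column names, transformations and CSV source are hypothetical illustrations, not part of the original material.

```python
# A minimal sketch of an automatable data-preparation pipeline.
# Column names and transformations are hypothetical.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

NUMERIC_COLS = ["tenure_months", "monthly_charges"]        # hypothetical
CATEGORICAL_COLS = ["contract_type", "payment_method"]     # hypothetical

prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), NUMERIC_COLS),
    ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_COLS),
])

def prepare(csv_path: str):
    """Run the same, repeatable preparation step in notebooks or batch jobs."""
    df = pd.read_csv(csv_path)
    return prep.fit_transform(df)
```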
How can I convince you to use this model? How was the model built? (Trusted – Quality, Provenance and Explainability)
• provenance of data used to train & test
• lineage of the model – transforms, features & labels
• model explainability – algorithm, size of data, compute resources used to train & test, evaluation thresholds, repeatability
Need: an enterprise Catalog for Analytics & Model assets
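As one possible way to capture this kind of provenance and explainability information, the hedged sketch below writes a lineage “sidecar” next to a serialized model; the field names and file layout are assumptions for illustration, not a prescribed catalog format.

```python
# A minimal sketch of recording model lineage metadata next to the serialized
# model so it can later be published to an enterprise catalog.
# File names and fields are assumptions for illustration.
import json, subprocess, datetime
from joblib import dump

def save_with_lineage(model, path, train_data_ref, features, label, metrics):
    dump(model, path)                                   # e.g. models/churn.joblib
    lineage = {
        "algorithm": type(model).__name__,
        "trained_at": datetime.datetime.utcnow().isoformat() + "Z",
        "training_data": train_data_ref,                # e.g. a table or HDFS URI
        "features": features,
        "label": label,
        "evaluation": metrics,                          # e.g. {"auc": 0.87}
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode().strip(),
    }
    with open(path + ".lineage.json", "w") as f:
        json.dump(lineage, f, indent=2)
```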
Dependable for your business (Resilient – At Scale & Always On)
ML infused in real-time, critical business processes:
• reliable & performant for (re-)training
• highly available, low-latency model serving in real time, even with sophisticated data prep
• outage-free model/version upgrades in production
Must have: a platform for elasticity, reliability & load-balancing
Is the model still good enough? (Measurable – Monitor + Measure)
• latency metrics for real-time scoring
• frequent accuracy evaluations with thresholds
• health monitoring for model decay
Desired: continuous model evaluations & instrumentation
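A minimal sketch of such a threshold-based accuracy evaluation, of the kind a scheduled job could run to detect model decay, is shown below; the metric (AUC), threshold value and data source are illustrative assumptions.

```python
# A minimal sketch of a threshold-based accuracy check against fresh labeled data.
# Assumes a binary classifier exposing predict_proba; threshold is hypothetical.
import pandas as pd
from joblib import load
from sklearn.metrics import roc_auc_score

ACCURACY_THRESHOLD = 0.80   # hypothetical value, agreed with the business

def evaluate(model_path: str, labeled_batch_csv: str, label_col: str) -> bool:
    """Return True if the model is still good enough on a fresh labeled batch."""
    model = load(model_path)
    df = pd.read_csv(labeled_batch_csv)
    X, y = df.drop(columns=[label_col]), df[label_col]
    auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    print(f"evaluation AUC={auc:.3f} threshold={ACCURACY_THRESHOLD}")
    return auc >= ACCURACY_THRESHOLD    # below threshold -> flag for retraining
```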
Growth & Maturity (Evolution – Deliver & Improve)
• versioning: champion/challenger, experimentation and hyper-parameter tuning
• process efficiencies: automated re-training and auto-deployments (with curation & approvals)
Must-have: delta deployments & outage-free upgrades
A git- and Docker/Kubernetes-based approach, from lessons learnt during the implementation of:
• IBM Data Science Experience Local & Desktop – https://www.ibm.com/products/data-science-experience
• IBM Cloud Private for Data – http://ibm.biz/data4ai
Part 1: Establish a way to organize Models, scripts & all other assets
• A “Data Science Project”
  o is just a folder of assets grouped together
  o contains “models” (say .pickle/.joblib or R objects, with metadata)
  o scripts
    - for data wrangling and transformations
    - used for training & testing, evaluations, batch scoring
  o interactive notebooks and apps (R Shiny etc.)
  o sample data sets & references to remote data sources
  o perhaps even your own home-grown libraries & modules
  o is a git repository
    - why? Easy to share, version & publish – across different teams or users with different roles
    - track the history of commit changes, set up approval practices & versions
• Familiar concept – Projects exist in most IDEs & tools
• Open & Portable – even works on your laptop
• Sharable & Versionable – courtesy of .git
• Use Projects with all tools/IDEs and services – one container for all artifacts & for tracking dependencies
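A minimal sketch of bootstrapping such a project as a git repository is shown below; the folder names and project name are illustrative assumptions, not a mandated layout.

```python
# A minimal sketch of creating a Data Science Project skeleton as a git repo.
# Folder names and the project name are hypothetical.
import subprocess
from pathlib import Path

def init_project(root: str):
    root_path = Path(root)
    for sub in ["models", "scripts", "notebooks", "data/samples", "lib"]:
        (root_path / sub).mkdir(parents=True, exist_ok=True)
    (root_path / "README.md").write_text("# Data Science Project\n")
    subprocess.run(["git", "init"], cwd=root_path, check=True)
    subprocess.run(["git", "add", "-A"], cwd=root_path, check=True)
    subprocess.run(["git", "commit", "-m", "initial project skeleton"],
                   cwd=root_path, check=True)

init_project("churn-predictor")   # hypothetical project name
```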
Part 2: Provide reproducible Environments
• Enabled by Docker & Kubernetes
  o A Docker image represents a “Runtime” – and essentially enables repeatability by other users
    - For example, a Python 2.7 Runtime with Anaconda 4.x, or an R 3.4.3 Runtime environment
    - Can also include IDEs or tooling, such as Jupyter/JupyterLab, Zeppelin or RStudio – exposed via an http address
    - Allows many different types and versions of packages to be surfaced; compatibility of package versions can be maintained, avoiding typical package-conflict issues
  o Docker containers with Project volumes provide for reproducible compute environments
  o With Kubernetes – orchestrate & scale out compute, and scale for users; automate via Kube Cron jobs
• Example – a JupyterLab container: port forwarding + auth, with the Project (.git repo) mounted as a volume, or just git cloned & pushed when work is done
• Example – a scoring service for a specific version of a Model (e.g. “Churn predictor v2”): Scoring Server pods behind a Kube svc with an Auth-Proxy – port forwarding, load balancing, and replicas created for scale
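As an illustration of the “runtime container with a mounted Project volume” idea, here is a minimal sketch using the Docker SDK for Python; the image name, port and paths are assumptions and do not reflect the product’s actual images.

```python
# A minimal sketch of launching a JupyterLab runtime container with the
# project (a git repo on disk) mounted as a volume.
# Image name, port and paths are hypothetical.
import docker

client = docker.from_env()
container = client.containers.run(
    "jupyter/scipy-notebook:latest",             # hypothetical runtime image
    detach=True,
    ports={"8888/tcp": 8888},                    # forward the notebook port
    volumes={"/home/alice/churn-predictor":      # hypothetical project path
             {"bind": "/home/jovyan/work", "mode": "rw"}},
)
print(container.name, container.status)
```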
Part 3: A dev-ops process & an Enterprise Catalog
• Establish a “Release”-to-production mechanism
  o git tag the state of a Data Science project to identify a version that is a release candidate
  o take that released project tag through a conventional dev -> stage -> production pipeline
  o an update to the release simply translates to a “pull” from a new git tag
  o stand up Docker containers in Kubernetes (deployments + svc) to deploy Data Science artifacts and expose Web Services, or simply use Kube Jobs to run scripts as needed
• A catalog for metadata and lineage
  o all asset metadata is recorded in this repository, including all data assets in the Enterprise
    - enables tracking of relationships between assets – including references to data sources such as relational tables/views or HDFS files, etc.
    - manage Projects and versions in development, Releases in production
    - track APIs/URL end-points and Apps being exposed (and their consumers)
  o establish policies for governance & compliance (beyond just access control)
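A minimal sketch of this git-tag based release flow, driven from Python via subprocess, is shown below; the tag naming scheme and remote name are assumptions.

```python
# A minimal sketch of the git-tag based release flow described above.
# Tag scheme ("release-<version>") and remote name ("origin") are assumptions.
import subprocess

def cut_release(project_dir: str, version: str) -> str:
    """Tag the current state of the project as a release candidate and push it."""
    tag = f"release-{version}"
    subprocess.run(["git", "tag", "-a", tag, "-m", f"Release candidate {version}"],
                   cwd=project_dir, check=True)
    subprocess.run(["git", "push", "origin", tag], cwd=project_dir, check=True)
    return tag

def update_production(release_dir: str, tag: str):
    """In production, an update is simply a fetch + checkout of the new tag."""
    subprocess.run(["git", "fetch", "--tags", "origin"], cwd=release_dir, check=True)
    subprocess.run(["git", "checkout", tag], cwd=release_dir, check=True)
```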
Summary: The Governed Data Science lifecycle is a team sport
• CDO (Data Steward) – Organizes: the Data & Analytics Asset Enterprise Catalog, lineage, governance of data & models, audits & policies; enables model explainability
• Data Engineer – Collects: builds data lakes and warehouses, gets data ready for analytics
• Data Scientist – Analyzes: explores & shapes data, trains models
• App Developer – Infuses ML in real-time apps & business processes
• Exec – Sets goals & measures results; provides the problem statement or target opportunity
Development & prototyping: finds data -> explores & understands data -> collects, preps & persists data -> extracts features for ML -> trains models -> experiments -> refines features; lineage/relationships recorded in the Governance Catalog
Production (Admin/Ops): deploy & monitor accuracy -> trusted predictions -> auto-retrain & upgrade
• Secure & Governed
• Monitor & Instrument
• High Throughput – load balance & scale
• Reliable Deployments – outage-free upgrades
Backup: IBM Cloud Private for Data & Data Science Experience Local
Build & Collaborate
• Collaborate within git-backed projects
• Sample data
• Understand data distributions & profile
Understand, Analyze, Train ML models
• Jupyter notebook Environment – Python 2.7/3.5 with Anaconda, Scala, R
• RStudio Environment – with > 300 packages, R Markdown, R Shiny
• Zeppelin notebook – with Python 2.7 and Anaconda
The Data Scientist analyzes, experiments, trains models, evaluates model accuracy, and publishes Apps & models:
• explores & understands data and distributions – ML feature engineering, visualizations, notebooks, model training, dashboards, apps
• features and lineage recorded in the Governance Catalog
• models, scoring services & apps published to the Catalog
Self-service Compute Environments
• Servers/IDEs – lifecycle easily controlled by each Data Scientist
• Self-serve reservations of compute resources; on-demand or leased compute
• Worker compute resources – for batch jobs run on-demand or on a schedule
• Environments are essentially Kubernetes pods – with High Availability & compute scale-out baked in (load-balancing/auto-scaling is planned for a future sprint)
Extend – Roll your own Environments
• Add libs/packages to the existing Jupyter, RStudio and Zeppelin IDE Environments, or introduce new Job “Worker” environments – https://content-dsxlocal.mybluemix.net/docs/content/local/images.html
• DSX Local also provides a Docker Registry (replicated for HA); these images are managed by DSX and used to help build out custom Environments
• Plug-n-play extensibility; reproducibility, courtesy of Docker images
Automate – Jobs can be triggered on-demand or on a schedule, such as for model evaluations, batch scoring, or even continuous (re-)training
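Below is a minimal sketch of the kind of batch-scoring script such an on-demand or scheduled job could invoke; the model format (joblib), file paths and CLI arguments are illustrative assumptions.

```python
# A minimal sketch of a batch-scoring script for an on-demand or scheduled
# "Worker" job. Assumes the input columns match the model's expected features;
# model format and paths are hypothetical.
import sys
import pandas as pd
from joblib import load

def score_batch(model_path: str, input_csv: str, output_csv: str):
    model = load(model_path)
    batch = pd.read_csv(input_csv)
    batch["prediction"] = model.predict(batch)
    batch.to_csv(output_csv, index=False)
    print(f"scored {len(batch)} rows -> {output_csv}")

if __name__ == "__main__":
    # e.g. python score_batch.py models/churn.joblib new_customers.csv scored.csv
    score_batch(sys.argv[1], sys.argv[2], sys.argv[3])
```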
Deploy, monitor and manage
• Monitor models through a dashboard – model versioning, evaluation history
• Publish versions of models, supporting the dev/stage/production paradigm
• Monitor scalability through the cluster dashboard
• Adapt scalability by redistributing compute/memory/disk resources
Deployment manager – Project Releases
• Project releases are deployed & (delta-)updatable from the current git tag
• Develop in one DSX Local instance & deploy/manage in another (or the same one)
• Easy support for hybrid use cases – develop & train on-prem, deploy in the cloud (or vice versa)
Bring in a new “release” to production – new Releases can come:
• from a “Source” Project in the same cluster
• from a “Source” Project pulled from GitHub/Bitbucket
• from a “Source” Project created from a .tar.gz package
Expose an ML model via a REST API
• pick a version to expose (multiple deployments are possible too)
• replicas for load balancing; optionally reserve compute
• a scoring end-point, with the model pre-loaded into memory inside the scoring containers
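For illustration, here is a minimal sketch of the kind of scoring service such a deployment wraps, with the model pre-loaded into memory at start-up; Flask, the model path and the request payload shape are assumptions, not the product’s actual serving implementation.

```python
# A minimal sketch of a scoring service: the model is loaded once into memory
# when the container starts, and a REST end-point scores incoming JSON records.
# Flask, the model path and the payload shape are hypothetical.
from flask import Flask, jsonify, request
from joblib import load
import pandas as pd

app = Flask(__name__)
MODEL = load("models/churn_predictor_v2.joblib")    # hypothetical path; loaded once

@app.route("/v2/score", methods=["POST"])
def score():
    records = request.get_json()                    # e.g. [{"tenure_months": 12, ...}]
    X = pd.DataFrame(records)
    preds = MODEL.predict(X).tolist()
    return jsonify({"predictions": preds})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)              # replicas/load-balancing via Kube svc
```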
Expose Python and R scripts as a Web Service
• Custom scripts can be externalized as a REST service – say, for custom prediction functions
