Agile Data Science with Scala
by @DataFellas
Xavier Tordoir (xtordoir@data-fellas.guru, @xtordoir)
Andy Petrella (noootsab@data-fellas.guru, @noootsab)

Data Fellas
● Andy Petrella: Maths, Geospatial, Distributed Computing, Spark Notebook, Trainer Spark/Scala, Machine Learning
● Xavier Tordoir: Physics, Bioinformatics, Distributed Computing, Scala (& Perl) trainer, Spark, Machine Learning

© Data Fellas SPRL 2016
Lineup
● Pipeline: productizing Data Science
● Demo of Distributed Pipeline (Spark, Mesos, Akka, Cassandra, Kafka, Spark Notebook)
● Why Micro Services?
● Painful points:
  ○ Data science is Discontiguous
  ○ Context Lost in Translation
● Solution: Data Fellas’ Agile Data Science Toolkit
So if you’re not sure you want to stay...

Pipeline: Productizing Data Science
● Modelling
  ○ Finding Data
  ○ Parsing structures
  ○ Cleaning (Reducing)
  ○ Learning
  ○ Predicting
● Coding
  ○ Connect PROD data
  ○ Tuning training parameters
  ○ Create Prediction Service
  ○ Generate Deployable
● Deploying
  ○ Connect to PROD infrastructure
  ○ Integration with existing env
  ○ Allocate (schedule) resources
  ○ Ensure availability

Distributed Data Science Demo: All-In Spark Notebooks
● Get data: Source → Kafka
● Prepare View: Kafka → Cassandra
● Train Model: Cassandra → ML...
● Create Server: Cassandra/ML/... → Akka Http
● Create Client: Json → Html Form, Chart, table, ...

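The shape of the demo stages above can be sketched in plain Scala. This is only an illustration under loud assumptions: there is no Kafka, Cassandra, Spark or Akka Http here, every stage is an in-memory function, and all names (`PipelineSketch`, `Event`, `ViewRow`, `Model`, `ingest`, `prepareView`, `train`, `predict`) are hypothetical and not taken from the demo notebooks. The "model" is just a global mean used as a fallback.

```scala
// Hypothetical in-memory stand-ins for the demo stages (no Kafka/Cassandra/Spark/Akka involved).
object PipelineSketch {
  final case class Event(key: String, value: Double)   // a raw record coming from the source
  final case class ViewRow(key: String, mean: Double)  // a row of the prepared view
  final case class Model(globalMean: Double)           // deliberately trivial "model"

  // "Get data": parse "key,value" lines into events, silently skipping malformed lines.
  def ingest(lines: Seq[String]): Seq[Event] =
    lines.flatMap { line =>
      line.split(",") match {
        case Array(k, v) => scala.util.Try(Event(k.trim, v.trim.toDouble)).toOption
        case _           => None
      }
    }

  // "Prepare view": aggregate events into one row per key (mean value), sorted by key.
  def prepareView(events: Seq[Event]): Seq[ViewRow] =
    events.groupBy(_.key).toSeq.sortBy(_._1).map { case (k, es) =>
      ViewRow(k, es.map(_.value).sum / es.size)
    }

  // "Train model": here, just the mean of the per-key means.
  def train(view: Seq[ViewRow]): Model =
    Model(if (view.isEmpty) 0.0 else view.map(_.mean).sum / view.size)

  // "Create server": answer a request from the view, falling back to the model for unknown keys.
  def predict(view: Seq[ViewRow], model: Model)(key: String): Double =
    view.find(_.key == key).map(_.mean).getOrElse(model.globalMean)
}
```

A possible use: `val view = PipelineSketch.prepareView(PipelineSketch.ingest(lines))`, then `PipelineSketch.predict(view, PipelineSketch.train(view))("a")`. Each function boundary corresponds to one arrow on the slide, which is where a real deployment would put a Kafka topic, a Cassandra table or an HTTP endpoint.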
Bad Pipeline: Targeting Dashboard
Modelling → Coding → Deploying → Dashboard
»»» Data Scientist focusing on the dashboard/report instead of content
● breaks reusability of data
● time wasted on learning viz instead of increasing accuracy (or velocity)
● monolithic instead of service oriented

Extended Pipeline: Micro Services
Stages: Modelling, Coding, Deploying, Creating Services, Integrating Application
● Creating Services
  ○ Abstracts access to prepared views
  ○ Exposes Prediction capabilities
  ○ Highly horizontally scalable
  ○ Scaling micro services cluster → cheaper than computing cluster
● Integrating Application
  ○ Customer integration
  ○ Can be any technologies
  ○ Can even be another pipeline!

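One way to read "abstracts access to prepared views" and "exposes prediction capabilities" is as a small service contract that consumers code against. The sketch below is an assumption-laden illustration in plain Scala, not the toolkit's generated services: `ServiceSketch`, `PredictionService`, `InMemoryService` and `score` are hypothetical names, the view is a `Map`, and the model is a hand-written linear function.

```scala
object ServiceSketch {
  // The contract consumers depend on; storage and model details stay hidden behind it.
  trait PredictionService {
    def view(key: String): Option[Double]       // access to a prepared view
    def predict(features: Seq[Double]): Double  // prediction capability
  }

  // One hypothetical implementation: an in-memory view plus a linear model.
  final class InMemoryService(views: Map[String, Double], weights: Seq[Double], bias: Double)
      extends PredictionService {
    def view(key: String): Option[Double] = views.get(key)

    def predict(features: Seq[Double]): Double = {
      require(features.size == weights.size, "feature/weight size mismatch")
      features.zip(weights).map { case (x, w) => x * w }.sum + bias
    }
  }

  // A consumer (an application, or another pipeline) that only sees the trait.
  def score(service: PredictionService, key: String): Option[Double] =
    service.view(key).map(v => service.predict(Seq(v)))
}
```

Because `score` depends only on the trait, the in-memory implementation could be swapped for an HTTP client in any technology without touching the consumer, which is the decoupling the slide argues for.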
Painful points: Data science is Discontiguous
Stages: Collecting, Modelling, Coding, Deploying, Creating Services, Integrating Application
People involved: Scientist, Data Eng., Ops. Eng., Web Eng., Customers
➔ Highly heterogeneous environment
➔ Too many friction areas
➔ Time to market too long
Frictions:
➔ No integration
➔ Error prone
➔ Schedule delays
Result: Lack of Agility

Painful points: Context Lost in Translation
Flow: Data Lake → Input Data → Processing → Machine Learning Model → Output Data → Application
● No contextual discovery
● No quality info
● No lineage (origin of the data)
● Link to process and input discarded
● Huge gap in architecture: binary and schema aware serving layer
● Accuracy depends on concealed quality of inputs
● No schema! Hard and long integration, poor satisfaction
● Moreover: No backward links → no agility and no context awareness
Result: Lack of Reproducibility

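To make the lineage and schema points concrete, here is a minimal sketch of a dataset wrapper that carries its schema and the list of steps that produced it, so the output keeps a backward link to its inputs. It is an illustration only: `LineageSketch`, `Traced`, `source` and `step` are hypothetical names, and this is not how the Data Fellas toolkit records context.

```scala
object LineageSketch {
  // Data plus the context that is usually lost: column names and the steps applied so far.
  final case class Traced[A](data: Seq[A], schema: Seq[String], lineage: List[String]) {
    // Apply a named transformation and append it to the lineage;
    // newSchema describes the columns of the result.
    def step[B](name: String, newSchema: Seq[String])(f: Seq[A] => Seq[B]): Traced[B] =
      Traced(f(data), newSchema, lineage :+ name)
  }

  // Start a lineage from a named source.
  def source[A](name: String, schema: Seq[String], data: Seq[A]): Traced[A] =
    Traced(data, schema, List(s"source:$name"))
}
```

For example, calling `source(...)` and then chaining `.step("drop-sentinel", ...)` and `.step("mean-temp", ...)` yields a result whose `lineage` lists the source and both steps in order, so a consumer can see where a value came from and which schema it follows.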
Data Fellas… Agile Data Science Toolkit
Our Approach: Agile Data Science Toolkit
Automatic Semantics Engine + Autogenerated Microservices + Integrated End-to-End Environment = Huge gain in Time and Reliability
● Components: Notebook, Computing Cluster, Access Layer, Knowledge Base, Consumers, Customers
● Exposes database, learning models, stream sources, notebooks, ...
● data, type, process, lineage, usage
● Easy to Release
● Easy to (Re)Use
● Artifacts:
  ○ Notebook
  ○ Version Control (Git)
  ○ Spark Job Project (SBT)
  ○ Service Projects (SBT)
  ○ Metadata (Doc, Logic, Schema, ...)
  ○ Catalog (ElasticSearch)
  ○ Deployable (Jar, Docker)
  ○ Repository (Nexus, Docker Repo, Pypi, Gem Server)
  ○ Client Projects (Node.Js, Java, Scala, Python, Ruby)
  ○ Publishable (NPM, Jar, Pip/EasyInstall, Gem)
● Roles: scientist, data Engineer, ops Engineer

Agile Data Science Toolkit: In a nutshell
Data Fellas… Announcements!!!
O’Reilly Online seminar
Growing: We’re Hiring!
http://www.data-fellas.guru/#skillsjobs
Q/A
References
● http://www.data-fellas.guru/
● http://spark-notebook.io/
● https://github.com/andypetrella/spark-notebook/
● https://gitter.im/andypetrella/spark-notebook
Come see us at Strata (London at least): we have two talks :-)
