What is a Distributed Data Science Pipeline, and How to Build One with Apache Spark and Friends. By @DataFellas @Noootsab, 23rd Nov. ‘15 @YaJUG
Outline ● (Legacy) Data Science Pipeline/Product ● What changed since then ● Distributed Data Science (today) ● Challenges ● Going beyond (productivity)
Data Fellas, a 6-month-old Belgian startup. Andy Petrella @noootsab: maths, geospatial, distributed computing, @SparkNotebook, Spark/Scala trainer, machine learning. Xavier Tordoir @xtordoir: physics, bioinformatics, distributed computing, Scala (& Perl), Spark trainer, machine learning.
(Legacy) Data Science Pipeline, or the so-called Data Product: static results; a lot of information lost in translation; sounds like waterfall; an ETL look and feel. Sampling → Modelling → Tuning → Report → Interpret.
(Legacy) Data Science Pipeline, or the so-called Data Product: a single machine! CPU-bound, memory-bound, or resampling because of small-ish data. Sampling → Modelling → Tuning → Report → Interpret.
Our world today (no, it wasn’t better before). Facts: data gets bigger, or, more precisely, the number of available sources explodes; data gets faster (and faster): just consider watching Netflix on 4G.
Consequences: sampling becomes HARD (or results will be too big...); reports are ephemeral and offer only a restricted view.
Consequences (continued): interpretation ⇒ too SLOW to get real ROI out of the overall system. How to work around that?
The needs are: an alerting system instead of descriptive charts; more accurate results, meaning more (or harder) models (e.g. Deep Learning); more data, a constant data flow; online interactions under control (e.g. direct feedback).
So, we need... Distributed Systems.
Distributed Data Science System/Platform/SDK/Pipeline/Product/… whatever you call it: “Create” cluster → Find available sources (context, content, quality, semantics, …) → Connect to sources (structure, schema/types, …) → Create the distributed data pipeline/model → Tune accuracy → Tune performance → Write results to sinks → Access layer → User access.
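The staged flow just listed can be sketched in a deliberately toy, single-machine form; plain Scala collections stand in for Spark RDDs/DataFrames here, and every name below is illustrative rather than a real Shar3 or Spark API:

```scala
// Toy, non-distributed sketch of the pipeline stages: connect -> model -> sink.
// In a real setup each stage would run over a cluster; here Seq stands in.
object PipelineSketch {
  // "Connect to sources": parse raw records into typed values, dropping junk.
  def connect(raw: Seq[String]): Seq[Double] =
    raw.flatMap(s => scala.util.Try(s.toDouble).toOption)

  // "Create distributed data pipeline/Model": a trivial model, the mean.
  def model(xs: Seq[Double]): Double = xs.sum / xs.size

  // "Write results to Sinks": here, just render the result for an access layer.
  def sink(result: Double): String = s"mean=$result"

  def run(raw: Seq[String]): String = sink(model(connect(raw)))
}
```

The point is only that the pipeline is a composition of stages, each of which can be tuned or swapped independently.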
YO! Aren’t we talking about “Big” Data? Fast Data? So could all results really be neither big nor fast? Actually, results are themselves becoming “Big” Data! Fast Data!
How have we accessed data since the ’90s? Remember SOA? → SERVICES! Nowadays, we’re talking about microservices. Here we are: one service for one result.
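A minimal sketch of “one service for one result”, assuming nothing beyond the JDK’s built-in `HttpServer` (the path and JSON payload below are invented for illustration):

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress

object ResultService {
  // Hypothetical precomputed result that a pipeline would have written to a sink.
  val result = """{"pipeline":"variant-analysis","score":0.87}"""

  // Start an HTTP server exposing that single result as JSON on /result.
  def start(port: Int): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext("/result", new HttpHandler {
      def handle(exchange: HttpExchange): Unit = {
        val bytes = result.getBytes("UTF-8")
        exchange.getResponseHeaders.add("Content-Type", "application/json")
        exchange.sendResponseHeaders(200, bytes.length)
        exchange.getResponseBody.write(bytes)
        exchange.close()
      }
    })
    server.start()
    server
  }
}
```

In practice such services would be generated and deployed per result, not hand-written like this.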
C’mon, charts/tables cannot be the only views offered to customers/clients, right? We need to open the capabilities to UIs (dashboards), connectors (third parties), other services (“SOA”) … and OTHER pipelines!
What about productivity? Streamlining the development lifecycle is most welcome. Every step of the pipeline above involves a mix of roles: ops, data engineers, data scientists, and web developers.
What about productivity? ➔ A longer production line ➔ More constraints (resource sharing, time, …) ➔ More people ➔ More skills. Overlook these points and, sooner or later, you’ll get kicked. So, how do we get: ● results coming fast enough whilst keeping accuracy high? ● responsiveness to external/unpredictable events?
Warning Team Fight: seen by members
Warning Team Fight: seen by managers
Warning Team Fight: seen by employers
Warning Team Fight: seen by customers
What about productivity? At Data Fellas, we think we need Interactivity and Reactivity to tighten the frontiers (within the team and in time). Hence, Data Fellas ● extends the Spark Notebook (interactivity) ● builds the Shar3 product around it (integrated reactivity).
Concepts of Data Fellas’ Shar3: Shareable and Streamlined Data Science. Analysis → Production → Distribution → Rendering → Discovery. Catalog · Project Generator · Micro Service / Binary Format · Schema for Output · Metadata.
Using Shar3, yeah o/ Let’s take this example, with some buddies from Datastax (Joel Jacobson @joeljacobson, Simon Ambridge @stratman1958), Mesosphere (Michael Hausenblas @mhausenblas), Typesafe (Iulian Dragos @jaguarul), and Data Fellas (Xavier Tordoir @xtordoir, and me).
What do we need to do now? ● Deploy ● Connect the dots ● Track ● Scale (BOTH the jobs and the services).
From notebook to SBT project to Docker to Marathon: SNB → SBT/JAR → Docker → Marathon.
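The last hop of that chain is just an app definition posted to Marathon’s REST API; a sketch of such a definition (the image name, resources and ports are invented for illustration) might look like:

```json
{
  "id": "/shar3/result-service",
  "instances": 2,
  "cpus": 0.5,
  "mem": 512,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "registry.example.com/result-service:0.1.0",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 8080, "hostPort": 0 }
      ]
    }
  }
}
```

Marathon then keeps the requested number of Dockerized service instances running on the Mesos cluster.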
From Notebook (SNB) ● to output ● to Avro.
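Describing the notebook’s output “to Avro” amounts to attaching a schema to each result record; a hypothetical schema (field names invented) could be:

```json
{
  "type": "record",
  "name": "PipelineResult",
  "namespace": "fellas.shar3.example",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "score", "type": "double" },
    { "name": "computedAt", "type": "long" }
  ]
}
```

The schema is what lets downstream services, BI tools, and other pipelines consume the result without guessing its structure.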
From Notebook ● to Avro ● to service ● to SBT ● to Docker ● to Marathon: SNB → SBT/JAR → Docker → Marathon.
From Notebook (SNB) ● to Avro ● to Tableau ● or QlikView ● or D3.js ● or …
So we have all this information available: ● the notebook’s markdown text ● the notebook’s code/model ● data sources ● outputs/sinks ● output services ● the Avro schema. Shouldn’t it all be reused???
Variant Analysis.
There is a service!
Let’s use it…
What was the process?
Fine, and the output is in C*.
Let’s check what’s in there.
Not what I need; let’s ADAPT.
That’s all folks! Thanks for listening/staying. Poke us on @DataFellas @Shar3_Fellas @SparkNotebook @Xtordoir & @Noootsab. Now @TypeSafe: http://t.co/o1Bt6dQtgH. If you wanna learn more about the different tools, join us @ O’Reilly. Follow-up coming soon on http://NoETL.org (HI5 to @ChiefScientist for that).