What is a Distributed Data Science Pipeline, and How to Build One with Apache Spark and Friends. By @DataFellas @Noootsab, 23rd Nov. ‘15 @YaJUG
Outline ● (Legacy) Data Science Pipeline/Product ● What changed since then ● Distributed Data Science (today) ● Challenges ● Going beyond (productivity)
Data Fellas, a 6-month-old Belgian startup. Andy Petrella @noootsab: maths, geospatial, distributed computing, @SparkNotebook, Spark/Scala trainer, machine learning. Xavier Tordoir @xtordoir: physics, bioinformatics, distributed computing, Scala (& Perl), Spark trainer, machine learning.
(Legacy) Data Science Pipeline, or the so-called Data Product: static results; a lot of information lost in translation; sounds like waterfall; an ETL look and feel. Sampling → Modelling → Tuning → Report → Interpret.
(Legacy) Data Science Pipeline, or the so-called Data Product: a single machine! CPU-bound, memory-bound, or resampling because of small-ish data. Sampling → Modelling → Tuning → Report → Interpret.
Our world today (no, it wasn’t better before). Facts: data gets bigger, or, more precisely, the number of available sources explodes; data gets faster (and faster): just consider watching Netflix on 4G.
Consequences: sampling becomes HARD (or results will be too big...); reports are ephemeral and offer only a restricted view.
Consequences (continued): interpretation ⇒ too SLOW to get real ROI out of the overall system. How to work around that?
The needs are: an alerting system instead of descriptive charts; more accurate results, meaning more (or harder) models (e.g. Deep Learning); more data, a constant data flow; online interactions under control (e.g. direct feedback).
So, we need... Distributed Systems.
Distributed Data Science System/Platform/SDK/Pipeline/Product/… whatever you call it: “Create” cluster → Find available sources (context, content, quality, semantics, …) → Connect to sources (structure, schema/types, …) → Create the distributed data pipeline/model → Tune accuracy → Tune performance → Write results to sinks → Access layer → User access.
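The staged flow just listed can be sketched in a deliberately toy, single-machine form; plain Scala collections stand in for Spark RDDs/DataFrames here, and every name below is illustrative rather than a real Shar3 or Spark API:

```scala
// Toy, non-distributed sketch of the pipeline stages: connect -> model -> sink.
// In a real setup each stage would run over a cluster; here Seq stands in.
object PipelineSketch {
  // "Connect to sources": parse raw records into typed values, dropping junk.
  def connect(raw: Seq[String]): Seq[Double] =
    raw.flatMap(s => scala.util.Try(s.toDouble).toOption)

  // "Create distributed data pipeline/Model": a trivial model, the mean.
  def model(xs: Seq[Double]): Double = xs.sum / xs.size

  // "Write results to Sinks": here, just render the result for an access layer.
  def sink(result: Double): String = s"mean=$result"

  def run(raw: Seq[String]): String = sink(model(connect(raw)))
}
```

The point is only that the pipeline is a composition of stages, each of which can be tuned or swapped independently.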
YO! Aren’t we talking about “Big” Data? Fast Data? So could all results really be neither big nor fast? Actually, results are themselves becoming “Big” Data! Fast Data!
How have we accessed data since the ’90s? Remember SOA? → SERVICES! Nowadays, we’re talking about microservices. Here we are: one service for one result.
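A minimal sketch of “one service for one result”, assuming nothing beyond the JDK’s built-in `HttpServer` (the path and JSON payload below are invented for illustration):

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress

object ResultService {
  // Hypothetical precomputed result that a pipeline would have written to a sink.
  val result = """{"pipeline":"variant-analysis","score":0.87}"""

  // Start an HTTP server exposing that single result as JSON on /result.
  def start(port: Int): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext("/result", new HttpHandler {
      def handle(exchange: HttpExchange): Unit = {
        val bytes = result.getBytes("UTF-8")
        exchange.getResponseHeaders.add("Content-Type", "application/json")
        exchange.sendResponseHeaders(200, bytes.length)
        exchange.getResponseBody.write(bytes)
        exchange.close()
      }
    })
    server.start()
    server
  }
}
```

In practice such services would be generated and deployed per result, not hand-written like this.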
C’mon, charts/tables cannot be the only views offered to customers/clients, right? We need to open the capabilities to UIs (dashboards), connectors (third parties), other services (“SOA”) … and OTHER pipelines!
What about productivity? Streamlining the development lifecycle is most welcome. Every step of the pipeline above involves a mix of roles: ops, data engineers, data scientists, and web developers.
What about productivity? ➔ A longer production line ➔ More constraints (resource sharing, time, …) ➔ More people ➔ More skills. Overlook these points and, sooner or later, you’ll get kicked. So, how do we get: ● results coming fast enough whilst keeping accuracy high? ● responsiveness to external/unpredictable events?
Warning Team Fight: seen by members
Warning Team Fight: seen by managers
Warning Team Fight: seen by employers
Warning Team Fight: seen by customers
What about productivity? At Data Fellas, we think we need Interactivity and Reactivity to tighten the frontiers (within the team and in time). Hence, Data Fellas ● extends the Spark Notebook (interactivity) ● builds the Shar3 product around it (integrated reactivity).
Concepts of Data Fellas’ Shar3: Shareable and Streamlined Data Science. Analysis → Production → Distribution → Rendering → Discovery. Catalog · Project Generator · Micro Service / Binary Format · Schema for Output · Metadata.
Using Shar3, yeah o/ Let’s take this example, with some buddies from Datastax (Joel Jacobson @joeljacobson, Simon Ambridge @stratman1958), Mesosphere (Michael Hausenblas @mhausenblas), Typesafe (Iulian Dragos @jaguarul), and Data Fellas (Xavier Tordoir @xtordoir, and me).
What do we need to do now? ● Deploy ● Connect the dots ● Track ● Scale (BOTH the jobs and the services).
From notebook to SBT project to Docker to Marathon: SNB → SBT/JAR → Docker → Marathon.
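The last hop of that chain is just an app definition posted to Marathon’s REST API; a sketch of such a definition (the image name, resources and ports are invented for illustration) might look like:

```json
{
  "id": "/shar3/result-service",
  "instances": 2,
  "cpus": 0.5,
  "mem": 512,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "registry.example.com/result-service:0.1.0",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 8080, "hostPort": 0 }
      ]
    }
  }
}
```

Marathon then keeps the requested number of Dockerized service instances running on the Mesos cluster.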
From Notebook (SNB) ● to output ● to Avro.
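Describing the notebook’s output “to Avro” amounts to attaching a schema to each result record; a hypothetical schema (field names invented) could be:

```json
{
  "type": "record",
  "name": "PipelineResult",
  "namespace": "fellas.shar3.example",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "score", "type": "double" },
    { "name": "computedAt", "type": "long" }
  ]
}
```

The schema is what lets downstream services, BI tools, and other pipelines consume the result without guessing its structure.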
From Notebook ● to Avro ● to service ● to SBT ● to Docker ● to Marathon: SNB → SBT/JAR → Docker → Marathon.
From Notebook (SNB) ● to Avro ● to Tableau ● or QlikView ● or D3.js ● or …
So we have all this information available: ● the notebook’s markdown text ● the notebook’s code/model ● data sources ● outputs/sinks ● output services ● the Avro schema. Shouldn’t it all be reused???
Variant Analysis.
There is a service!
Let’s use it…
What was the process?
Fine, and the output is in C*.
Let’s check what’s in there.
Not what I need; let’s ADAPT.
That’s all folks! Thanks for listening/staying. Poke us on @DataFellas @Shar3_Fellas @SparkNotebook @Xtordoir & @Noootsab. Now @TypeSafe: http://t.co/o1Bt6dQtgH. If you wanna learn more about the different tools, join us @ O’Reilly. Follow-up coming soon on http://NoETL.org (HI5 to @ChiefScientist for that).