Data science lifecycle with Apache Flink and Apache Zeppelin (incubating) Flink Forward Moon moon@nflabs.com NFLabs www.nflabs.com
Content 1. Data science lifecycle 2. Zeppelin for data science 3. Zeppelin and Flink 4. Project Roadmap
Data science lifecycle
Data Science: process https://en.wikipedia.org/wiki/Data_analysis
Data Science: tools MLlib
Data Science: people Engineer Data Scientist DevOps Business http://aarondavis.design/
Zeppelin for data science
Project Timeline http://zeppelin.incubator.apache.org
• 12.2012 Commercial product for data analysis
• 10.2013 Open sourced a single feature
• 08.2014 Started getting adoption
• 12.2014 ASF Incubation
Hadoop Landscape Cloudera-ML ML-base MRQL Shark ?
Commercial Product 12.2012
Zeppelin 10.2013
Zeppelin 08.2014
Third-party Products 10.2014
Apache Incubation Proposal 11.2014
Acceptance by Incubator 23.12.2014
Current Status
• 1 release
• 68 contributors worldwide
• 722 stars on GitHub
• 300/900 emails at users/dev @i.a.o
Interactive Notebooks
Interactive Visualization
Multiple Backends
Zeppelin & Friends Z-Manager ZeppelinHub …⋯ Collaboration/Sharing Packaging & Deployment Zeppelin + Full stack on a cloud Packages Backend Integration
Online Viewer
Deployment https://github.com/hortonworks-gallery/ambari-zeppelin-service
Deployment
As a Service
Before Cloudera-ML ML-base MRQL Shark ?
After Cloudera-ML ML-base MRQL Shark
Flink integration
• Integrated through Interpreter
• Interpreter: data processing system abstraction in Zeppelin
Interpreter http://zeppelin.incubator.apache.org/docs/development/writingzeppelininterpreter.html
Writing an Interpreter
public abstract void open();
public abstract void close();
public abstract InterpreterResult interpret(String st, InterpreterContext context);
public abstract void cancel(InterpreterContext context);
public abstract int getProgress(InterpreterContext context);
public abstract List<String> completion(String buf, int cursor);
public abstract FormType getFormType();
public Scheduler getScheduler();
Legend: Must have / Good to have / Advanced
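To make the interface above concrete, here is a minimal sketch of an interpreter that upper-cases its input. It does not depend on Zeppelin: `InterpreterResult`, `InterpreterContext` and `Interpreter` below are simplified, hypothetical stand-ins for the real Zeppelin classes, and `getFormType()` and `getScheduler()` are left out to keep the sketch short. `EchoInterpreter` is an invented name, not something shipped with Zeppelin.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class InterpreterSketch {

    // Hypothetical stand-in for Zeppelin's InterpreterResult: a status code plus a message.
    static final class InterpreterResult {
        enum Code { SUCCESS, ERROR }
        final Code code;
        final String message;

        InterpreterResult(Code code, String message) {
            this.code = code;
            this.message = message;
        }
    }

    // Hypothetical stand-in for Zeppelin's InterpreterContext (carries nothing here).
    static final class InterpreterContext { }

    // Reduced copy of the abstract methods listed on the slide.
    abstract static class Interpreter {
        public abstract void open();
        public abstract void close();
        public abstract InterpreterResult interpret(String st, InterpreterContext context);
        public abstract void cancel(InterpreterContext context);
        public abstract int getProgress(InterpreterContext context);
        public abstract List<String> completion(String buf, int cursor);
    }

    // Toy interpreter: returns the paragraph text in upper case.
    static final class EchoInterpreter extends Interpreter {
        private boolean opened = false;

        @Override public void open() { opened = true; }

        @Override public void close() { opened = false; }

        @Override public InterpreterResult interpret(String st, InterpreterContext context) {
            if (!opened) {
                return new InterpreterResult(InterpreterResult.Code.ERROR, "interpreter is not open");
            }
            return new InterpreterResult(InterpreterResult.Code.SUCCESS, st.toUpperCase(Locale.ROOT));
        }

        // Nothing runs in the background, so there is nothing to cancel.
        @Override public void cancel(InterpreterContext context) { }

        // Progress is reported as a percentage; this toy never has partial progress.
        @Override public int getProgress(InterpreterContext context) { return 0; }

        // No auto completion candidates.
        @Override public List<String> completion(String buf, int cursor) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        EchoInterpreter interpreter = new EchoInterpreter();
        interpreter.open();
        InterpreterResult result = interpreter.interpret("hello zeppelin", new InterpreterContext());
        System.out.println(result.code + ": " + result.message);
        interpreter.close();
    }
}
```

A real interpreter would extend Zeppelin's own `Interpreter` class and start the backend (for Flink, the Scala shell and its execution environment) in `open()`.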
Flink Interpreter https://github.com/apache/incubator-zeppelin/blob/master/flink/src/main/java/org/apache/zeppelin/flink/FlinkInterpreter.java
Zeppelin Server <-> (Thrift) <-> Interpreter JVM process: Flink Interpreter, FlinkILoop, ExecutionEnvironment
Using interpreter: Configure, Bind, Use
Using interpreter Use different interpreters in the same notebook
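In a notebook, a paragraph picks its interpreter with a leading `%name` (for example `%flink` or `%md`); a paragraph without a prefix goes to the default interpreter. The sketch below illustrates that selection step only. It is not Zeppelin's implementation: `ParagraphRouter`, `bind` and `run` are hypothetical names, and interpreters are modelled as plain functions from text to text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class ParagraphRouter {

    // Interpreters bound to this (hypothetical) notebook, keyed by name without '%'.
    private final Map<String, Function<String, String>> bound = new LinkedHashMap<>();
    private final String defaultName;

    public ParagraphRouter(String defaultName) {
        this.defaultName = defaultName;
    }

    public void bind(String name, Function<String, String> interpreter) {
        bound.put(name, interpreter);
    }

    // Reads an optional leading "%name", then hands the rest of the text to that interpreter.
    public String run(String paragraph) {
        String name = defaultName;
        String body = paragraph;
        String trimmed = paragraph.trim();
        if (trimmed.startsWith("%")) {
            String[] parts = trimmed.split("\\s+", 2);
            name = parts[0].substring(1);
            body = parts.length > 1 ? parts[1] : "";
        }
        Function<String, String> interpreter = bound.get(name);
        if (interpreter == null) {
            throw new IllegalArgumentException("No interpreter bound for %" + name);
        }
        return interpreter.apply(body);
    }

    public static void main(String[] args) {
        ParagraphRouter notebook = new ParagraphRouter("flink");
        notebook.bind("flink", code -> "[flink] " + code);
        notebook.bind("md", text -> "[markdown] " + text);

        System.out.println(notebook.run("%md # Word count results"));
        System.out.println(notebook.run("%flink env.fromElements(1, 2, 3)"));
        System.out.println(notebook.run("env.fromElements(4, 5, 6)"));
    }
}
```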
Display System
Flink Interpreter, Other Interpreter -> Zeppelin Server -> (Websocket, REST) -> Zeppelin webapp
Display types: Text, Html, Table, Angular
Display System Select display system through output
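Selecting the display system through output means the first token of the printed result names the renderer: output that starts with `%table` is shown as a table (tab-separated columns, one row per line, header row first), and output that starts with `%html` is rendered as HTML. The helper below is a small sketch that builds such a `%table` string; `TableOutput` and `toTable` are hypothetical names, not part of Zeppelin.

```java
import java.util.Arrays;
import java.util.List;

public class TableOutput {

    // Builds "%table " + header row + data rows, tab-separated, one row per line.
    public static String toTable(List<String> header, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("%table ");
        sb.append(String.join("\t", header)).append('\n');
        for (List<String> row : rows) {
            sb.append(String.join("\t", row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String output = toTable(
            Arrays.asList("word", "count"),
            Arrays.asList(
                Arrays.asList("flink", "3"),
                Arrays.asList("zeppelin", "2")));
        // Printed from a paragraph, this output would be rendered by the table display.
        System.out.print(output);
    }
}
```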
Built-in scheduler: runs your notebook on a schedule given as a cron expression.
Flexible layout
DEMO
Flink Integration
• ZeppelinContext: access to Zeppelin-provided features
  - Dynamic form
  - Angular display system
• Dependency loading
• Auto completion
• Cancel
• Get progress information
Thank you Q & A Moon moon@nflabs.com NFLabs www.nflabs.com http://zeppelin.incubator.apache.org/
Project roadmap
Multi-tenancy
Two approaches
1. Implement authentication, ACL inside of Zeppelin https://github.com/apache/incubator-zeppelin/pull/53
2. Run Zeppelin on top of Docker http://github.com/NFLabs/z-manager
Zeppelin for organizations
An Engineer engineer by http://aarondavis.design/
A Team engineer by http://aarondavis.design/
An Organization engineer by http://aarondavis.design/
That’s too many! engineer by http://aarondavis.design/
What is the problem? Too much: Install Configure Cluster resources
Solution? We have containers + reverse proxy
Z Manager PoC httpd + mod_php nginx Linux box engineer by http://aarondavis.design/ 2 days, bash + php :(
Z Manager PoC
Z Manager http://github.com/NFLabs/z-manager
• Apache 2.0 License
• Containerized deployment per user
• Reverse proxy
• Single binary
• Simple web application
* Z Manager SGA to ASF coming
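The combination of one container per user and a reverse proxy comes down to a lookup: the proxy maps the user in an incoming request to the address of that user's Zeppelin container. The sketch below shows only this routing step, under assumptions that are not from the slides: the user name is taken from the first path segment, and `UserRouteTable`, `register` and `resolve` are hypothetical names. It is not how Z Manager itself (written in Go) is implemented.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class UserRouteTable {

    // user name -> base address of that user's container (assumed layout).
    private final Map<String, URI> upstreams = new ConcurrentHashMap<>();

    public void register(String user, URI upstream) {
        upstreams.put(user, upstream);
    }

    // Maps "/<user>/<rest>" to "<upstream of user>/<rest>", or empty if the user is unknown.
    public Optional<URI> resolve(String requestPath) {
        String[] parts = requestPath.split("/", 3); // "", user, rest
        if (parts.length < 2 || parts[1].isEmpty()) {
            return Optional.empty();
        }
        URI base = upstreams.get(parts[1]);
        if (base == null) {
            return Optional.empty();
        }
        String rest = parts.length == 3 ? parts[2] : "";
        return Optional.of(base.resolve("/" + rest));
    }

    public static void main(String[] args) {
        UserRouteTable routes = new UserRouteTable();
        // Example addresses only; each user gets a separate container port.
        routes.register("alice", URI.create("http://127.0.0.1:8081"));
        routes.register("bob", URI.create("http://127.0.0.1:8082"));

        System.out.println(routes.resolve("/alice/api/notebook"));
        System.out.println(routes.resolve("/carol/api/notebook"));
    }
}
```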
Z Manager Auto-update engineer by http://aarondavis.design/ Linux box go + react :) Z Manager process
Z Manager
Helium
People do similar work with different data New visualization Model & Algorithm Data process pipeline engineer by http://aarondavis.design/
Package and distribute work New visualization Model & Algorithm Data process pipeline Pkg Repo engineer by http://aarondavis.design/
Helium https://s.apache.org/helium Platform for data analytics applications on top of Apache Zeppelin
Helium Application = View + Algorithm; Zeppelin provided Resources
Resources: Data, Computing
• Any Java object
  - Result of last execution
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin