Powering a Startup with Apache Spark
Kevin Kim, Between (VCNC) • kevin@between.us • #EUent8
Seoul, South Korea
Gangnam, Hongdae, Itaewon, Myungdong
2011: 100 beta users
2012: 1.0 release, 2M downloads
2013: 5M downloads, global launches
2014: Between 2.0, 10M downloads
2015: Between 3.0
2016: Monetization starts, 20M downloads
2017: Global expansion, new business, team of 60
Kevin Kim
• From Seoul, South Korea
• Co-founder; used to be a product developer
• Now a data analyst, engineer, and team leader
• Founder of the Korea Spark User Group
• Committer and PMC member of Apache Zeppelin
Between Data Team
Intro to the Between Data Team
• 4 data engineers
– Manager: engineer with a broad stack of knowledge and experience
– Junior engineer: used to be a server engineer
– Senior engineer: has extensive experience and skills
– Data engineer: used to be a top-level Android developer
• Hiring a data analyst and a machine learning expert
What the Between Data Team Does
• Analysis
– Service monitoring
– Analyze the usage of new features and build product strategies
• Data infrastructure
– Build and manage infrastructure
– Spark, Zeppelin, AWS, BI tools, etc.
• Third-party management
– Mobile attribution tools for marketing (Kochava, Tune, AppsFlyer, etc.)
– Google Analytics, Firebase, etc.
– Ad networks
What the Between Data Team Does (cont.)
• Machine learning study & research
– For the next business model
• Team support
– To build business, product, and monetization strategies
• Performance marketing analysis
– Monitoring the effectiveness of marketing budgets
• Product development
– Improving client performance, server architecture, etc.
7 PM ~ 12: Sunset @ Between Office
Technologies
Requirements
• Big data
– 2 TB/day of log data from millions of DAU
– 20M users
• Small team
– A team of 4 needs to support 50 people
• Tiny budget
– The company is just past BEP (break-even point)
• We need a very efficient tech stack!
The Way We Work
• Use Apache Spark as a general processing engine
• Scriptify everything with Apache Zeppelin
• Heavy use of AWS and Spot Instances to cut cost
• Careful selection of BI dashboard tools
Apache Spark as a General Engine
• Definitely the best way to deal with big data (as you all know!)
• Its performance and agility exactly meet startup requirements
– We have used Spark since 2014
• A great match for cloud services, especially Spot Instances
– Exploits the bursty nature of cloud capacity (see the sketch below)
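As a rough illustration of Spark as the general processing engine, here is a minimal Scala batch job of the kind run daily; the S3 paths and the `event` field are hypothetical stand-ins, not Between's actual buckets or schema.

```scala
import org.apache.spark.sql.SparkSession

object DailyLogJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-log-aggregation")
      .getOrCreate()

    // Read one day's raw JSON logs straight from S3 (path is hypothetical).
    val logs = spark.read.json("s3a://example-log-bucket/2017/10/25/*.json.gz")

    // Count events per type -- the kind of simple metric a daily batch produces.
    logs.groupBy("event").count()
      .write.parquet("s3a://example-output/daily-event-counts/2017-10-25")

    spark.stop()
  }
}
```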
Scriptify Everything with Zeppelin
• We do everything on Zeppelin!
• Daily batch tasks as Spark scripts, run by the Zeppelin scheduler (sketch below)
• Ad hoc analysis
• Cluster control scripts
• The world's first user of Zeppelin!
• More than 200 Zeppelin notebooks
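Inside Zeppelin, such a job lives in a %spark paragraph triggered by the Zeppelin scheduler. A minimal sketch, assuming a hypothetical log bucket and `userId` field; in Zeppelin, `spark` (the SparkSession) and `z` (the ZeppelinContext) come predefined:

```scala
// A Zeppelin %spark paragraph run daily by the Zeppelin scheduler.
val yesterday = java.time.LocalDate.now().minusDays(1).toString

// Count distinct users in yesterday's raw logs (path and field hypothetical).
val dau = spark.read
  .json(s"s3a://example-log-bucket/$yesterday/*.json.gz")
  .select("userId").distinct().count()

// z.show renders a DataFrame as a table or chart inside the note.
z.show(spark.sql(s"SELECT '$yesterday' AS date, $dau AS dau"))
```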
AWS Cloud
• Spot Instances are my friend!
– We mostly use Spot Instances for analysis
– Only 10~20% of the cost of on-demand instances
• Dynamic cluster launch with auto scaling
– Launch clusters automatically for batch analysis
– Manually launch more clusters from Zeppelin with an auto-scale script (see the sketch below)
– Automatically shrink clusters when there is no usage
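A cluster control script might request Spot capacity through the AWS SDK for Java, called from Scala. This is only a hedged sketch of that idea; the AMI ID, instance type, bid price, and count are placeholder assumptions, not the team's actual settings:

```scala
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
import com.amazonaws.services.ec2.model.{LaunchSpecification, RequestSpotInstancesRequest}
import scala.collection.JavaConverters._

object SpotLauncher {
  def main(args: Array[String]): Unit = {
    // Uses the default AWS credential chain (env vars, instance profile, etc.).
    val ec2 = AmazonEC2ClientBuilder.defaultClient()

    val spec = new LaunchSpecification()
      .withImageId("ami-12345678")   // hypothetical Spark-worker AMI
      .withInstanceType("r4.xlarge") // hypothetical instance type

    val request = new RequestSpotInstancesRequest()
      .withSpotPrice("0.10")         // bid around 10~20% of on-demand
      .withInstanceCount(5)
      .withLaunchSpecification(spec)

    for (r <- ec2.requestSpotInstances(request).getSpotInstanceRequests.asScala)
      println(s"requested: ${r.getSpotInstanceRequestId}")
  }
}
```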
BI Dashboard Tools
• Use Zeppelin as a dashboard via Spark SQL, shared with ZEPL (sketch below)
• Holistics (holistics.io) or Dash (plot.ly/products/dash/)
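To sketch the dashboard workflow: one Zeppelin paragraph registers the batch output as a SQL view, and a following %sql paragraph charts it and can be published via ZEPL. Paths, table, and column names are hypothetical:

```scala
// %spark paragraph: expose the daily batch output to SQL.
val counts = spark.read.parquet("s3a://example-output/daily-event-counts/*")
counts.createOrReplaceTempView("daily_event_counts")

// A second paragraph would then be plain Spark SQL, which Zeppelin renders
// as a bar or line chart:
//   %sql
//   SELECT date, event, count FROM daily_event_counts ORDER BY date
```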
Questions & Challenges
RDD API or DataFrame API?
• Spark now has two very different styles of API
– The programmatic RDD API
– The SQL-like DataFrame/Dataset API
• For many simple ad hoc queries
– DataFrames work well
• For more complex, deep-dive analytic questions
– RDDs work well
• For a while we have mostly used RDDs, with DataFrames for ML or simple ad hoc tasks (compare the two below)
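To make the trade-off concrete, here is the same hypothetical question, events per user, answered with both APIs; the path and `userId` field are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-vs-dataframe").getOrCreate()
    val path = "s3a://example-log-bucket/2017/10/25/*.json.gz" // hypothetical

    // DataFrame API: concise, SQL-like, optimized by Catalyst.
    // A good fit for the many simple ad hoc queries.
    spark.read.json(path).groupBy("userId").count().show()

    // RDD API: arbitrary Scala per record, a better fit when a deep-dive
    // analysis needs logic that does not map cleanly onto SQL operators.
    val byUser = spark.read.json(path).rdd
      .map(row => (row.getAs[String]("userId"), 1L))
      .reduceByKey(_ + _)
    byUser.take(10).foreach(println)

    spark.stop()
  }
}
```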
Sushi or Cooked Data?
• Keep data in as raw a form as possible: fresh data!
– ETL pipelines usually cause trouble and increase management cost
– The Sushi Principle (Joseph & Robert at Strata)
– Drastically reduces operation & management cost
– Apache Spark is a great tool for extracting insight from raw data (see the sketch below)
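A minimal sketch of the Sushi Principle in practice, assuming hypothetical paths and fields: query the raw JSON logs directly and let Spark infer the schema on read, with no ETL step to build or maintain.

```scala
// Query raw, unprocessed logs directly in a Zeppelin %spark paragraph.
val raw = spark.read.json("s3a://example-log-bucket/2017/10/*/*.json.gz")
raw.printSchema()   // schema-on-read: nothing to pre-cook or keep in sync

raw.createOrReplaceTempView("raw_events")
spark.sql(
  """SELECT event, COUNT(DISTINCT userId) AS users
    |FROM raw_events
    |GROUP BY event
    |ORDER BY users DESC""".stripMargin
).show()
```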
To Hire a Data Analyst or Not?
• The expected skill set for a data analyst is…
– Excel, SQL, R, …
• And these skills are not expected…
– Programmatic APIs like Spark's RDD API
– Cooking raw data
• We prefer data engineers with analytic skills
• We may need to add some ETL tasks to work with data analysts
Better, Faster Team Support?
• Better: Zeppelin is great for analyzing data, but not enough for sharing it with the team
– We have very few alternatives
– Use BI dashboard tools more?
– Still looking for a good way
• Faster: launching a Spark cluster takes a few minutes
– Not bad, but we want it faster
– Google BigQuery or AWS Athena?
– A SQL database fed by ETL?
Future Plans?
• Prepare for an exploding number of data operations!
– The team is growing, the business is growing
– Number of tasks
– Number of third-party data products
– Communication cost
• Operations with machine learning & deep learning
– A better way to manage task & data flow
Let's wrap up…
What Matters for Us
• Team support
– Each team should see the right data and make good decisions from it
– Regular meetings, fast responses to ad hoc data requests
– Ultimately, everything we do should be tied to the company's business
• Technical leadership
– Technical investments in the competence of both the company and individuals
– Working at Between should be the best experience for each individual
• Social impact
– Does our work have a valuable impact on society?
– Open source, activity in the community
How Is Apache Spark Powering a Startup?
• One great tool for general purposes
– Daily batch tasks
– Agile, ad hoc analysis
– Drawing dashboards
– Many more…
• Helps save time and reduce the cost of data operations
• A great experience for engineers and analysts
• Sharing know-how to and from the community
Working as a Data Engineer at a Startup
• Fascinating, fast evolution of tech
• Requires hard work and labor
• Data work shines only when it is understood and used by teammates
(Images: Two Peasants Digging, Vincent van Gogh; Two Men Digging, Jean-François Millet)
Thank you!
