Using Scalding for Data-Driven Product Development Sasha Ovsankin LinkedIn Presented to Scala By The Bay Aug 9, 2014
/summary Data-Driven Product Development
/summary Data-Driven Product Development Scalding = Hadoop + Scala
/data-driven Your Service
/data-driven Your Service Value
/data-driven Your Service Value Data
/data-driven Your Amazing Service Value Data
/data-driven/linkedin “Online” World • Web Applications • NoSQL Data Stores • “Offline” World (Hadoop) • HDFS • Hadoop Jobs • Tracking/logging • Analytics • Data Products • Messaging • Message delivery • Databases
/linkedin/big-data/links • “LinkedIn Big Data Ecosystem” – http://lnkd.in/big-data-ecosystem • Grid Operations – http://lnkd.in/gridops2013
/scalding http://github.com/twitter/scalding • Scala-based DSL for Map/Reduce jobs • Built on Cascading, a stable and mature Hadoop framework • Uses an API similar to Scala collections:
  import com.twitter.scalding._

  // Counts how often each word occurs in the input text.
  class WordCountJob(args : Args) extends Job(args) {
    TextLine( args("input") )
      // Split each line on runs of whitespace, emitting one 'word per token.
      .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
      .groupBy('word) { _.size }
      .write( Tsv( args("output") ) )
  }
• Succinct and powerful • High level of abstraction
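To make the "similar to Scala collections" point concrete, here is a minimal sketch of the same word count written against plain in-memory Scala collections. It is not Scalding code and does not run on Hadoop; the object name WordCountLocal and the method wordCount are hypothetical, chosen only for illustration.

```scala
// Hypothetical in-memory analogue of WordCountJob, using only the Scala standard library.
object WordCountLocal {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+""").toSeq)   // one element per whitespace-separated token
      .filter(_.nonEmpty)                  // drop empty tokens produced by leading whitespace
      .groupBy(identity)                   // group identical words together
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

The flatMap / groupBy / size shape is the same as in the Scalding job; what changes is that Scalding plans those steps as Map/Reduce jobs over files instead of evaluating them in memory.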
/data-driven/problem/scaling • Problem: Scaling • Solution – Distributed processing – High-level description of algorithms – Functional programming
…/solution/scalding
../problem/complexity • Problem: Complexity • Solution – Consistent way of organizing data • Self-describing data formats (Avro) • File organization – Type safety – Modularization
…/solution/scalding
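As an illustration of the type safety and modularization bullets, the sketch below models a record as a case class and splits a flow into small, separately testable steps. It uses plain Scala collections as a stand-in for Scalding pipes; the PageView record, its fields, and the step names are assumptions made up for this example, not LinkedIn's actual schema or code.

```scala
// Hypothetical record type; the field names are illustrative only.
final case class PageView(memberId: Long, pageKey: String)

object PageViewSteps {
  // Step 1: parse tab-separated lines of the form "<memberId>\t<pageKey>",
  // silently dropping lines that do not match that shape.
  def parse(lines: Seq[String]): Seq[PageView] =
    lines.flatMap { line =>
      line.split("\t") match {
        case Array(id, key) if id.nonEmpty && id.forall(_.isDigit) =>
          Some(PageView(id.toLong, key))
        case _ => None
      }
    }

  // Step 2: count views per page. The compiler checks that pageKey exists and is a String.
  def viewsPerPage(views: Seq[PageView]): Map[String, Int] =
    views.groupBy(_.pageKey).map { case (key, vs) => key -> vs.size }

  // The flow is the composition of the two steps.
  def countViews(lines: Seq[String]): Map[String, Int] =
    viewsPerPage(parse(lines))
}
```

Because each step has an explicit input and output type, a renamed or retyped field shows up as a compile error rather than as a failed job on the cluster.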
/linkedin/hadoop/practices • All online data end up in HDFS – Avro encoding is standard • Production Process – CI/Automatic Build • More info forthcoming – Production Review – Operations and Monitoring • More info at http://lnkd.in/gridops2013 • Result: Thousands of jobs running in production • More info at http://lnkd.in/big-data-ecosystem
../solution/scala/killer-argument • Map & reduce -- primitives
  scala> (1 to 1000) map { x => x * x } reduce { _ + _ }
  res20: Int = 333833500
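The same two primitives are enough to sketch the shape of a Map/Reduce job locally. The following is a minimal in-memory model, assuming nothing beyond the Scala standard library; LocalMapReduce and mapReduce are hypothetical names, and the groupBy step only imitates the shuffle that Hadoop performs between the map and reduce phases.

```scala
// Hypothetical single-machine model of a Map/Reduce job.
object LocalMapReduce {
  def mapReduce[A, K, V](input: Seq[A])(mapper: A => Seq[(K, V)])(reducer: (V, V) => V): Map[K, V] =
    input
      .flatMap(mapper)   // map phase: each record emits zero or more (key, value) pairs
      .groupBy(_._1)     // stand-in for the shuffle: collect all pairs that share a key
      .map { case (key, pairs) =>
        // reduce phase: fold the values of one key; groups are never empty, so reduce is safe
        key -> pairs.map(_._2).reduce(reducer)
      }
}
```

For example, LocalMapReduce.mapReduce(1 to 4)(i => Seq((i % 2 == 0, i * i)))(_ + _) sums the squares of the odd and the even numbers separately.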
/linkedin/scalding/status • Started >1 year ago • Thousands of production LOC written in Scalding by our team – Pretty happy with readability, maintainability and tooling support • Dozens of flows are currently in production, and counting • Created Scalding user group • Growing interest • Learning: – Scala[Scalding] < Scala[ _ ]
/summary Data-Driven Product Development Scalding = Hadoop + Scala
/linkedin/join-us • Work on unique and interesting problems • Be part of a great engineering community • Use the latest tools and technologies • Help connect the world’s professionals to help them become more productive and successful • We are looking for amazing people interested in Software Engineering and Data Science – http://linkedin.com/careers Questions?
