Using Scalding for Data Driven Product Development at LinkedIn

Using Scalding for Data-Driven Product Development
Sasha Ovsankin, LinkedIn
Presented to Scala By The Bay, Aug 9, 2014


/summary Data-Driven Product Development Scalding = Hadoop + Scala

/data-driven Your Service Value

/data-driven Your Service Value Data

/data-driven Your Amazing Service Value Data

/data-driven/linkedin [Diagram. Labels: “Online” World; Web Applications; NoSQL Data Stores; “Offline” World (Hadoop); HDFS; Hadoop Jobs; Tracking/logging; Analytics; Data Products; Messaging; Message delivery; Databases]

/linkedin/big-data/links • “LinkedIn Big Data Ecosystem” – http://lnkd.in/big-data-ecosystem • Grid Operations – http://lnkd.in/gridops2013

/scalding http://github.com/twitter/scalding
• Scala-based DSL for Map/Reduce jobs
• Built on Cascading, a stable and mature Hadoop framework
• Uses an API similar to Scala collections:

    import com.twitter.scalding._

    // Count the occurrences of each word in the input text
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }

• Succinct and powerful
• High level of abstraction
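
To make the "similar to Scala collections" point concrete, here is a minimal in-memory sketch of the same word count using only the Scala standard library. It is not Scalding code and does not run on Hadoop; the names WordCountLocal and wordCount are made up for this illustration.

```scala
object WordCountLocal {
  // In-memory analogue of the word count job: tokenize, group, count.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+"""))   // split each line on runs of whitespace
      .filter(_.nonEmpty)            // a line with leading whitespace yields an empty token
      .groupBy(identity)             // group equal words together
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or not to be", "that is the question"))
    // Print one tab-separated "word<TAB>count" line per word, sorted by word
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

The pipeline has the same shape as the job on the slide (flatMap, then groupBy, then size); the difference is that the Scalding version describes a distributed job over files rather than an in-memory computation.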

/data-driven/problem/scaling • Problem: Scaling • Solution – Distributed processing – High-level description of algorithms – Functional programming
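
A rough sketch of why functional map and reduce steps lend themselves to distributed processing: each partition can be processed independently, and the partial results can be combined in any order when the reduce operation is associative and commutative. The object PartitionedSum and its methods are hypothetical, and the "partitions" here are just in-memory chunks standing in for machines.

```scala
object PartitionedSum {
  // Work done independently on one partition: map each element, then reduce locally.
  def partial(partition: Seq[Int]): Long =
    partition.map(x => x.toLong * x).sum

  // Split the data into chunks, process each chunk on its own, then combine.
  // Addition is associative and commutative, so chunk boundaries and order do not matter.
  def sumOfSquares(data: Seq[Int], partitions: Int): Long = {
    require(partitions > 0, "partitions must be positive")
    val chunkSize = math.max(1, data.size / partitions)
    data.grouped(chunkSize).map(partial).sum
  }

  def main(args: Array[String]): Unit = {
    val single = (1 to 1000).map(x => x.toLong * x).sum
    val split  = sumOfSquares(1 to 1000, partitions = 8)
    println(s"single pass: $single, partitioned: $split")
  }
}
```

Because the code only says what to compute per element and how to combine results, a framework is free to decide where each chunk is processed.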

../problem/complexity • Problem: Complexity • Solution – Consistent way of organizing data • Self-describing data formats (Avro) • File organization – Type safety – Modularization
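
As an illustration of the type-safety bullet, the sketch below models records as a case class, so a misspelled field name or a mismatched type is rejected by the compiler instead of failing at run time. The PageView record and its fields are hypothetical and are not taken from the talk; plain Scala collections stand in for a real pipeline, and Avro is not used here.

```scala
object TypedRecords {
  // Hypothetical record type, for illustration only
  final case class PageView(memberId: Long, page: String, durationMs: Long)

  // Total time spent per page. Writing `_.pgae` or summing `_.page`
  // would be a compile error rather than a failed job.
  def totalTimePerPage(views: Seq[PageView]): Map[String, Long] =
    views
      .groupBy(_.page)
      .map { case (page, vs) => page -> vs.map(_.durationMs).sum }

  def main(args: Array[String]): Unit = {
    val views = Seq(
      PageView(1L, "/home", 100L),
      PageView(2L, "/home", 50L),
      PageView(1L, "/jobs", 30L)
    )
    totalTimePerPage(views).toSeq.sortBy(_._1).foreach(println)
  }
}
```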

/linkedin/hadoop/practices • All online data end up in HDFS – Avro encoding is standard • Production Process – CI/Automatic Build • More info forthcoming – Production Review – Operations and Monitoring • More info at http://lnkd.in/gridops2013 • Result: Thousands of jobs running in production • More info at http://lnkd.in/big-data-ecosystem

../solution/scala/killer-argument
• Map & reduce are primitives:

    scala> import scala.math.pow
    scala> (1 to 1000) map { pow(_, 2).toInt } reduce { _ + _ }
    res20: Int = 333833500

/linkedin/scalding/status • Started >1 year ago • Thousands of production LOC written in Scalding by our team – Pretty happy with readability, maintainability and tooling support • Dozens of flows are currently in production, and counting • Created Scalding user group • Growing interest • Learning: – Scala[Scalding] < Scala[ _ ]

/linkedin/join-us • Work on unique and interesting problems • Be part of a great engineering community • Use the latest tools and technologies • Help connect the world’s professionals and make them more productive and successful • We are looking for amazing people interested in Software Engineering and Data Science – http://linkedin.com/careers Questions?
