Design Patterns for Large-Scale Real-Time Learning

Sean Owen / Director of Data Science / Cloudera

What We Talk About When We Talk About Data Science

www.quora.com/Data-Science/What-is-the-difference-between-a-data-scientist-and-a-statistician

Data Science Is Exploratory Analytics?
www.tc.umn.edu/~zief0002/Comparing-Groups/blog.html
thenextweb.com/microsoft/2013/07/08/microsoft-brings-the-office-store-to-22-new-markets-adds-power-bi-an-intelligence-tool-to-office-365/

Example:
• Search, ML over Patient Data
• MapReduce for indexing, learning
• HBase for storage and fast access
• Also: Storm for incremental update
• And: relational DB for most recent derived data
• API façade for input; API for querying learning
(Slide labels: Engineering, Machine Learning)
engineering.cerner.com/2013/02/near-real-time-processing-over-hadoop-and-hbase/

Adding Operational Analytics

Data Science Will Be Operational Analytics

I Built A Model. Now What?
Collect Input → Build Model → Query Model → Repeat

I Built A Model On Hadoop. Now What?
Collect Input → ? → Build Model → ? → Query Model → ? → Repeat

www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg

Gaps to fill, and Goals
• Model Building
  • Large-scale
  • Continuous
  • Apache Hadoop™-based
  • Few, good algorithms
• Model Serving
  • Real-time query
  • Real-time update
• Algorithms
  • Parallelizable
  • Updateable
  • Works on diverse input
• Interoperable
  • PMML model format
  • Simple REST API
  • Open source

Large-Scale or Real-Time?
Large-Scale, Offline, Batch vs Real-Time, Online, Streaming
Why Don’t We Have Both? λ!

Lambda Architecture
• Batch, Stream Processing are different
• Tackle separately in 2+ Layers
• Batch Layer: offline, asynchronous
• Serving / Speed Layer: real-time, incremental, approximate
… λ?
jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting
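
The split between a batch layer and a speed layer can be illustrated with a toy example. The following is a minimal sketch, not the architecture of any particular system: the class names (BatchLayer, SpeedLayer, LambdaCounter) are hypothetical, an in-memory list stands in for durable storage, and a simple event count stands in for a model. A count happens to be exact; a real speed layer would typically hold an approximate, incrementally updated model.

```python
from collections import Counter


class BatchLayer:
    """Offline layer: recomputes a complete view from all stored input."""

    def __init__(self):
        self.log = []  # hypothetical stand-in for durable storage

    def append(self, event):
        self.log.append(event)

    def build_view(self):
        # Full recomputation over everything seen so far.
        return Counter(self.log)


class SpeedLayer:
    """Real-time layer: incremental state for input newer than the batch view."""

    def __init__(self):
        self.delta = Counter()

    def update(self, event):
        self.delta[event] += 1

    def reset(self):
        self.delta.clear()


class LambdaCounter:
    """Answers queries by merging the last batch view with the recent delta."""

    def __init__(self):
        self.batch = BatchLayer()
        self.speed = SpeedLayer()
        self.view = Counter()

    def add(self, event):
        self.batch.append(event)   # stored for the next batch run
        self.speed.update(event)   # visible to queries immediately

    def rebuild(self):
        self.view = self.batch.build_view()
        self.speed.reset()         # the new view now covers these events

    def query(self, key):
        return self.view[key] + self.speed.delta[key]
```

New input is queryable right away through the speed layer, and each call to rebuild() replaces the accumulated delta with a freshly computed batch view.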

Two Layers
• Computation Layer
  • Java-based server process
  • Client of Hadoop 2.x
  • Periodically builds “generation” from recent data and past model
  • Baby-sits MapReduce* jobs (or, locally in-core)
  • Publishes models
• Serving Layer
  • Apache Tomcat™-based server process
  • Consumes models from HDFS (or local FS)
  • Serves queries from model in memory
  • Updates from new input
  • Also writes input to HDFS
  • Replicas for scale
* Apache Spark later

Collaborative Filtering : ALS
• Alternating Least Squares
• Latent-factor model
• Accepts implicit or explicit feedback
• Real-time update via fold-in of input
• No cold-start
• Parallelizable
(Figure: factor matrices X and Yᵀ)
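
One common way to "fold in" a new user without rebuilding the whole model is to hold the item factors fixed and solve a small regularized least-squares problem for that user's vector. The sketch below shows that generic technique in plain Python. It is an assumption for illustration and not necessarily the update rule used by the system presented here; the function names and the reg parameter are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def fold_in_user(item_factors, ratings, reg=0.1):
    """Estimate a user vector from fixed item factors.

    item_factors: dict item_id -> list of k floats (rows of Y)
    ratings:      dict item_id -> observed strength for this user
    Solves (Y_r^T Y_r + reg * I) x = Y_r^T r over the rated items only;
    reg must be positive so the system is always solvable.
    """
    k = len(next(iter(item_factors.values())))
    A = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for item, r in ratings.items():
        y = item_factors[item]
        for i in range(k):
            b[i] += r * y[i]
            for j in range(k):
                A[i][j] += y[i] * y[j]
    return solve(A, b)


def predict(user_vec, item_vec):
    """Predicted strength is the dot product of the two factor vectors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))
```

Because only a k-by-k system is solved, the cost of an update depends on the number of latent factors and rated items, not on the size of the full data set.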

Clustering : k-means++
• Well-known and understood
• Parallelizable
• Clusters updateable
cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
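
As a rough illustration of the two ideas on this slide, the sketch below shows k-means++ seeding (each new center is drawn with probability proportional to its squared distance from the nearest existing center) and a simple online update that moves the nearest centroid toward a new point. The function names are hypothetical, and the running-mean update is one straightforward way to keep clusters updateable, assumed here for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, rng=None):
    """Choose k initial centers using k-means++ seeding."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            break  # every point coincides with a center; nothing left to pick
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def update_cluster(centers, counts, point):
    """Assign point to its nearest center and shift that center's running mean.

    centers: list of mutable lists of floats; counts: points seen per cluster.
    Returns the index of the cluster that absorbed the point.
    """
    idx = min(range(len(centers)), key=lambda i: sq_dist(centers[i], point))
    counts[idx] += 1
    n = counts[idx]
    centers[idx] = [c + (x - c) / n for c, x in zip(centers[idx], point)]
    return idx
```

A point that is already a center has weight zero and so is never drawn again, which is why the seeds tend to spread out across the data.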

Classification / Regression : RDF
• Random Decision Forests
• Ensemble method
• Numeric, categorical features and target
• Very parallel
• Nodes updateable
• Works well on many problems
(Figure: example decision tree with tests age > 30, female?, income > 20000 and Yes / No outcomes)
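
To make the ensemble idea concrete, here is a minimal sketch of classifying an example with a few hand-written decision trees and taking a majority vote. The three trees are hypothetical: they reuse the tests from the slide's figure (age > 30, female?, income > 20000) but their shapes are invented for illustration, and no training is shown.

```python
from collections import Counter

# A tree is either a leaf label (a str) or a tuple (test, if_true, if_false),
# where test is a function from an example dict to bool.


def classify(tree, example):
    """Walk one tree from the root to a leaf and return the leaf label."""
    while not isinstance(tree, str):
        test, if_true, if_false = tree
        tree = if_true if test(example) else if_false
    return tree


def forest_vote(trees, example):
    """Classify with every tree and return the most common label."""
    votes = Counter(classify(t, example) for t in trees)
    return votes.most_common(1)[0][0]


# Hypothetical trees built from the tests shown in the figure.
tree_a = (lambda e: e["age"] > 30, "Yes",
          (lambda e: e["female"], "Yes", "No"))
tree_b = (lambda e: e["income"] > 20000, "Yes", "No")
tree_c = (lambda e: e["female"],
          (lambda e: e["income"] > 20000, "Yes", "No"),
          "No")
FOREST = [tree_a, tree_b, tree_c]
```

Each tree can be evaluated independently of the others, which is what makes the method easy to parallelize.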

PMML
• Predictive Modeling Markup Language
• XML-based format for predictive models
• Standardized by Data Mining Group (www.dmg.org)
• Wide tool support
  <PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
    <Header copyright="www.dmg.org"/>
    <DataDictionary numberOfFields="5">
      <DataField name="temperature" optype="continuous" dataType="double"/>
      …
    </DataDictionary>
    <TreeModel modelName="golfing" functionName="classification">
      <MiningSchema>
        <MiningField name="temperature"/>
        …
      </MiningSchema>
      <Node score="will play">
        <Node score="will play">
          <SimplePredicate field="outlook" operator="equal" value="sunny"/>
          …
        </Node>
      </Node>
    </TreeModel>
  </PMML>
www.dmg.org/v4-1/TreeModel.html
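
Because PMML is plain XML, a model can be read with any XML library. The sketch below parses a small tree model with Python's xml.etree.ElementTree and scores a record against it. The document in PMML_DOC is a hypothetical, cut-down model written for this example (it does not fill in the elided parts of the slide's snippet), and the evaluator is a deliberate simplification: it understands only True and "equal" SimplePredicate elements and returns the score of the deepest matching node, ignoring the other evaluation options the PMML specification defines.

```python
import xml.etree.ElementTree as ET

NS = {"pmml": "http://www.dmg.org/PMML-4_1"}

# Hypothetical minimal model, for illustration only.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <Header copyright="www.dmg.org"/>
  <DataDictionary numberOfFields="2">
    <DataField name="outlook" optype="categorical" dataType="string"/>
    <DataField name="whatIdo" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="golfing" functionName="classification">
    <MiningSchema>
      <MiningField name="outlook"/>
      <MiningField name="whatIdo" usageType="predicted"/>
    </MiningSchema>
    <Node score="will play">
      <True/>
      <Node score="will play">
        <SimplePredicate field="outlook" operator="equal" value="sunny"/>
      </Node>
      <Node score="no play">
        <SimplePredicate field="outlook" operator="equal" value="rain"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""


def field_names(pmml_text):
    """List the names declared in the DataDictionary."""
    root = ET.fromstring(pmml_text)
    fields = root.findall("pmml:DataDictionary/pmml:DataField", NS)
    return [f.get("name") for f in fields]


def matches(node, record):
    """True if the node's predicate holds for the record (simplified)."""
    pred = node.find("pmml:SimplePredicate", NS)
    if pred is None:
        return node.find("pmml:True", NS) is not None
    return (pred.get("operator") == "equal"
            and record.get(pred.get("field")) == pred.get("value"))


def score(pmml_text, record):
    """Return the score of the deepest tree node whose predicate matches."""
    root = ET.fromstring(pmml_text)
    node = root.find("pmml:TreeModel/pmml:Node", NS)
    if node is None or not matches(node, record):
        return None
    while True:
        children = node.findall("pmml:Node", NS)
        child = next((c for c in children if matches(c, record)), None)
        if child is None:
            return node.get("score")
        node = child
```

For example, score(PMML_DOC, {"outlook": "rain"}) walks from the root node to the second child and returns its score attribute.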

HTTP REST API
• Convention for RPC-like request / response
• HTTP verbs, transport
• GET : query
• POST : add input
• Easy from browser, CLI, Java, Python, Scala, etc.
  GET /recommend/jwills
  HTTP/1.1 200 OK
  Content-Type: text/plain
  "Ray LaMontagne",0.951
  "Fleet Foxes",0.7905
  "The National",0.688
  "Shearwater",0.3017
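
A client for an API of this shape needs little more than an HTTP GET and a CSV parser. The following is a minimal sketch using only the Python standard library. The /recommend/ path and the quoted-name, score response format come from the slide; the function names and the base_url argument are hypothetical, and the server address is something the caller would have to supply.

```python
import csv
import io
from urllib.parse import quote
from urllib.request import urlopen


def parse_recommendations(body):
    """Parse a text/plain body of '"item",score' lines into (item, score) pairs."""
    rows = csv.reader(io.StringIO(body))
    return [(row[0], float(row[1])) for row in rows if row]


def recommend(base_url, user_id):
    """GET {base_url}/recommend/{user_id} and parse the response.

    base_url is an assumption: the address of a running serving layer,
    for example "http://localhost:8080".
    """
    url = f"{base_url}/recommend/{quote(user_id, safe='')}"
    with urlopen(url) as resp:
        return parse_recommendations(resp.read().decode("utf-8"))
```

Using the csv module rather than splitting on commas keeps item names that themselves contain commas intact, since the names arrive quoted.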

Wish List
• Revamp workflow
  • Oozie?
  • Spark / Crunch-like API, not raw M/R
• De-emphasize model building
  • Well-solved
  • Bring your own
• Emphasize integration
  • PMML, etc.
• More component-ized
  • Less black-box service
• More “push” options
  • Flume?
  • Kafka?
• “Pull” options
  • Hive / Impala ?

Open Source
github.com/cloudera/oryx
100% Apache License 2.0
