Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using Apache APIs: Kafka, Spark, HBase

© 2017 MapR Technologies Data Collect Process Store Spark Streami ng Analyze HBase SQL ML Model Stream Input Spark Streami ng Stream Enriched Use Case: Real-Time Analysis of Geographically Clustered Vehicles

© 2017 MapR Technologies Fast data Pipeline for Uber Data Using Apache APIs: Kafka, Spark, Hba •  Why combine Machine Learning with Streaming Events? •  Introduction to Machine Learning and Spark •  Introduction to Kafka & Spark Streaming •  ISpark Streaming and NoSQL HBase Note: this code example is from me, only the data is from Uber

© 2017 MapR Technologies Why combine Streaming Events with Machine Learning? Fraud detection Smart Machinery Utility Smart Meters Home Automation Networks Manufacturing Security Systems Patient Monitoring

© 2017 MapR Technologies Why combine IOT/Streaming Events with Machine Learning? •  Audi and Daimler deep learning for autonomous vehicles –  Using MapR platform to scale deep learning efforts https://mapr.com/company/press-releases/norcom-selects-mapr-deep- learning/

© 2017 MapR Technologies Why combine IOT with Machine Learning? •  Monitoring devices combined with ML can provide alerts for Sepsis, which is one of the leading causes for death in hospitals –  http://www.computerweekly.com/news/450422258/Putting-sepsis-algorithms-into-electronic- patient-records

© 2017 MapR Technologies Why combine IOT with Machine Learning? •  A Stanford team has shown that a machine-learning model can identify heart arrhythmias from an electrocardiogram (ECG) better than an expert –  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-to-play-doctor/

© 2017 MapR Technologies Why combine IOT/Streaming Events with Machine Learning? •  ML & location data: –  identify behavior patterns and trends: –  telecom, travel, marketing... •  http://www.cisco.com/c/en/us/solutions/ industries/smart-connected-communities.html

© 2017 MapR Technologies Why combine IOT with Machine Learning? •  Uber Near Realtime Price Surging –  https://www.slideshare.net/ConfluentInc/kafka-uber- the-worlds-realtime-transit-infrastructure-aaron- schildkrout •  (however this example code is mine) NEAR REALTIME PRICE SURGING

© 2017 MapR Technologies ML Discovery Model Building Model Training/ Building Training Set Test Model Predictions Test Set Evaluate Results Historical Data Deployed Model Insights Data Discovery, Model Creation Production Feature Extraction Feature Extraction ●  Churn Modelling Uber trips Stream TopicUber trips New Data

© 2017 MapR Technologies What is Supervised Machine Learning? Supervised •  Classification –  Naïve Bayes –  SVM –  Random Decision Forests •  Regression –  Linear –  Logistic Machine Learning Unsupervised •  Clustering –  K-means •  Dimensionality reduction –  Principal Component Analysis –  SVD

© 2017 MapR Technologies Debit Card Fraud Example •  What are we trying to predict? –  This is the Label or Target outcome: –  Fraud or Not Fraud •  What are the “if questions” or properties we can use to predict? –  These are the Features: –  Is the amount spent today > historical average? –  Unusual region for card history ? –  Known merchant or not ?

© 2017 MapR Technologies Decision Tree For Classification •  Tree of decisions about features •  IF THEN ELSE questions using features at each tree node •  Answers branch to child nodes Is the amount spent in 24 hours > average Is the number of states used from > 2 Are there multiple Purchases today from risky merchants? YES NO NoYES Fraud 90% Not Fraud 50% Fraud 90% Not Fraud 30% YES No

© 2017 MapR Technologies What is Unsupervised Machine Learning? Machine Learning Unsupervised •  Clustering –  K-means •  Dimensionality reduction –  Principal Component Analysis –  SVD Supervised •  Classification –  Naïve Bayes –  SVM –  Random Decision Forests •  Regression –  Linear –  Logistic

© 2017 MapR Technologies Unsupervised Algorithms use Unlabeled data Customer GroupsBuild ModelTrain Algorithm Finds patterns New Customer Purchase Data Use Model Similar Customer Group Contains patterns Recognizes patterns Customer purchase data

© 2017 MapR Technologies Clustering: Definition •  Unsupervised learning task •  Groups objects into clusters of high similarity –  Search results grouping –  Grouping of customers, patients –  Text categorization –  recommendations •  Anomaly detection: find what’s not similar

© 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) 2.  Assign all points to nearest centroid x x x x x

© 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) 2.  Assign all points to nearest centroid 3.  Update centroids to center of points x x x x x

© 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) 2.  Assign all points to nearest centroid 3.  Update centroids to center of points 4.  Repeat until conditions met x x x x x

© 2017 MapR Technologies Apache Spark Streaming •  Task scheduling •  Memory Management •  Fault recovery •  Interacting with storage systems •  DataFrame API •  Catalyst Optimizer •  Processing of live streams •  Micro-batching •  Machine Learning •  Multiple types of ML algorithms •  Graph processing •  Graph parallel computations Distributed Parallel Cluster computing Programming Framework

© 2017 MapR Technologies Spark Distributed Datasets Dataset W Executor P4 W Executor P1 P3 W Executor P2 partitioned Partition 1 8213034705, 95, 2.927373, jake7870, 0…… Partition 2 8213034705, 115, 2.943484, Davidbresler2, 1…. Partition 3 8213034705, 100, 2.951285, gladimacowgirl, 58… Partition 4 8213034705, 117, 2.998947, daysrus, 95…. •  Read only collection of typed objects Dataset[T] •  Partitioned across a cluster •  Operated on in parallel •  in memory can be Cached

© 2017 MapR Technologies Example loading a Dataset val df: Dataset[Uber] = spark.read.option("inferSchema", "false").schema(schema).csv(“data/uber.csv").as[Uber] df.cache df.count Worker Worker Worker Driver Block 1 Block 2 Block 3

© 2017 MapR Technologies Example: Worker Worker Worker Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 results results results val df: Dataset[Uber] = spark.read.option("inferSchema", "false").schema(schema).csv(“data/uber.csv").as[Uber] df.cache df.count res9: Long = 829275

© 2017 MapR Technologies Example Worker Worker Worker Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 val df: Dataset[Uber] = spark.read.option("inferSchema", "false").schema(schema).csv(“data/uber.csv").as[Uber] df.cache df.count df.show

© 2017 MapR Technologies Example: Log Mining Worker Worker Worker Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 Driver Process from Cache Process from Cache Process from Cache Cached, does not have to read from file again val df: Dataset[Uber] = spark.read.option("inferSchema", "false").schema(schema).csv(“data/uber.csv").as[Uber] df.cache df.count df.show

© 2017 MapR Technologies Example: Worker Worker Worker Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 Driver results results results val df: Dataset[Uber] = spark.read.option("inferSchema", "false").schema(schema).csv(“data/uber.csv").as[Uber] df.cache df.count df.show

© 2017 MapR Technologies Example: Worker Worker Worker Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 Driver Cache your data è Faster Results val df: Dataset[Uber] = spark.read.option("inferSchema", "false").schema(schema).csv(“data/uber.csv").as[Uber] df.cache df.count df.show

© 2017 MapR Technologies Spark Use Cases Iterative Algorithms on large amounts of data Some Algorithms that need iterations •  Clustering (K-Means) •  Linear Regression •  Graph Algorithms (e.g., PageRank) •  Alternating Least Squares ALS Some Example Use Cases: •  Anomaly detection •  Classification •  Recommendations

© 2017 MapR Technologies Part 1: Spark Machine Learning •  End to End Application for Monitoring Uber Data using Spark ML •  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine- learning-streaming-and-kafka-api-part-1/

© 2017 MapR Technologies Uber Data •  Date/Time: The date and time of the Uber pickup •  Lat: The latitude of the Uber pickup •  Lon: The longitude of the Uber pickup •  Base: The TLC base company affiliated with the Uber pickup The Data Records are in CSV format. An example line is shown below: •  2014-08-01 00:00:00,40.729,-73.9422,B02598

© 2017 MapR Technologies Load the data into a Dataframe: Define the Schema case class Uber(dt: String, lat: Double, lon: Double, base: String) val schema = StructType(Array( StructField("dt", TimestampType, true), StructField("lat", DoubleType, true), StructField("lon", DoubleType, true), StructField("base", StringType, true) )) Input Comma Separated Values: datetime, lattitude, longitude, base 2014-08-01 00:00:00,40.729,-73.9422,B02598

© 2017 MapR Technologies Dataset merged with Dataframe •  in Spark 2.0, DataFrame APIs merged with Datasets APIs •  A Dataset is a collection of typed objects (SQL and functions) •  A DataFrame is a Dataset of generic Row objects (SQL)

© 2017 MapR Technologies Data Frame Load data Load the data into a Dataframe val schema = StructType(Array( StructField("dt", TimestampType, true), StructField("lat", DoubleType, true), StructField("lon", DoubleType, true), StructField("base", StringType, true) )) val df = spark.read.option("inferSchema", "false").schema(schema) .csv("/user/user01/data/uber.csv") df.show

© 2017 MapR Technologies Dataset Load data Load the data into a Dataset case class Uber(dt: String, lat: Double, lon: Double, base: String) extends Serializable val schema = StructType(Array( StructField("dt", TimestampType, true), StructField("lat", DoubleType, true), StructField("lon", DoubleType, true), StructField("base", StringType, true) )) val df = spark.read.option("inferSchema", "false").schema(schema) .csv("/user/user01/data/uber.csv") .as[Uber] df.show

© 2017 MapR Technologies Uber Example •  What are the “if questions” or properties we can use to group? –  These are the Features: –  We will group by Lattitude, longitude •  Use Spark SQL to analyze: Day of the week, time, rush hour … NEAR REALTIME PRICE SURGING

© 2017 MapR Technologies Extract the Features Image reference O’Reilly Learning Spark + + ̶+ ̶ ̶ Feature Vectors Model Featurization Training Model Evaluation Best Model Training Data + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ Feature Vectors are vectors of numbers representing the value for each feature

© 2017 MapR Technologies Use VectorAssembler to put features in vector column val featureCols = Array("lat", "lon") val assembler = new VectorAssembler() .setInputCols(featureCols) .setOutputCol("features") val df2 = assembler.transform(df) Data Frame Load data transform DataFrame + Features

© 2017 MapR Technologies Data Frame Load data transform Estimator val kmeans = new KMeans() .setK(8) .setFeaturesCol("features") .setMaxIter(5) Create Kmeans Estimator, Set Features DataFrame + Features

© 2017 MapR Technologies Data Frame Load data transform Estimator model.clusterCenters.foreach(println) [40.76930621976264,-73.96034885367698] [40.67562793272868,-73.79810579052476] [40.68848772848041,-73.9634449047477] [40.78957777777776,-73.14270740740741] [40.32418330308531,-74.18665245009073] [40.732808848486286,-74.00150153727878] [40.75396549974632,-73.57692359208531] [40.901700842900674,-73.868760398198] Clusters from fitted model DataFrame + Features fit fitted model input

© 2017 MapR Technologies fitted model Transform new data, adds column with Clusters transform features val clusters = model.transform(newdata) prediction DataFrame + Features DataFrame + Features + prediciton

© 2017 MapR Technologies fitted model Save the model to distributed file system save model.write.overwrite().save("/path/savemodel") Use later val sameModel = KMeansModel.load("/user/user01/data/savemodel") DataFrame + Features

© 2017 MapR Technologies Part 2: MapR Event Streams with Kafka API and Spark Streaming •  End to End Application for Monitoring Uber Data using Spark ML •  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine- learning-streaming-and-kafka-api-part-2/

© 2017 MapR Technologies Organize Data into Topics with MapR Streams Topics Organize Events into Categories and Decouple Producers from Consumers Consumers MapR Cluster Topic: Pressure Topic: Temperature Topic: Warnings Consumers Consumers Kafka API Kafka API

© 2017 MapR Technologies Scalable Messaging with MapR Streams Server 1 Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Server 2 Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Server 3 Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Topics are partitioned for throughput and scalability

© 2017 MapR Technologies Scalable Messaging with MapR Streams Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Producers are load balanced between partitions Kafka API

© 2017 MapR Technologies Scalable Messaging with MapR Streams Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Consumers Consumers Consumers Consumer groups can read in parallel Kafka API

© 2017 MapR Technologies Partition is like a Queue Consumers MapR Cluster Topic: Admission / Server 1 Topic: Admission / Server 2 Topic: Admission / Server 3 Consumers Consumers Partition 1 New Messages are appended to the end Partition 2 Partition 3 6 5 4 3 2 1 3 2 1 5 4 3 2 1 Producers Producers Producers New Message 6 5 4 3 2 1 Old Message

© 2017 MapR Technologies Events are delivered in the order they are received, like a queue messages are delivered in the order they are received MapR Cluster 6 5 4 3 2 1 Consumer groupProducers Read cursors Consumer group

© 2017 MapR Technologies Unlike a queue, events are persisted even after they’re delivered Messages remain on the partition, available to other consumers MapR Cluster (1 Server) Topic: Warning Partition 1 3 2 1 Unread Events Get Unread 3 2 1 Client Library ConsumerPoll

© 2017 MapR Technologies Machine Learning Logistics Input Data + Actual Delay Input Data + Predictions Consumer withML Model 2 Consumer withML Model 1 Decoy results Consumer Consumer withML Model 3 Consumer Stream Archive Stream Scores Stream Input SQL SQL Real time Data Stream Input Delayed data Input Data + Predictions + Actual Delay Real Time dashboard + Historical Analysis

© 2017 MapR Technologies Collect Data Process the Data with Spark Streaming and Spark Machine Learning Process Data Stream Topic •  Extension of the core Spark AP •  scalable, high-throughput, fault- tolerant stream processing

© 2017 MapR Technologies ML Discovery Model Building Model Training/ Building Training Set Test Model Predictions Test Set Evaluate Results Historical Data Deployed Model Insights Data Discovery, Model Creation Production Feature Extraction Feature Extraction Uber trips Stream TopicUber trips New Data

© 2017 MapR Technologies Use Case: Real-Time Analysis of Geographically Clustered Vehicles Uber trip data enrich with K-means Cluster location Stream Topic Stream Topic Spark Streaming Spark Streaming Write to MapR-DB SQL

© 2017 MapR Technologies Use Case: Time Series Data Uber trip data Stream Topic 2014-08-01 00:00:00, 40.729,-73.9422,B02598 {"dt":"2014-08-01 00:00:00.0”, "lat":40.3495,"lon":-74.0667, "base":"B02682","cluster":5} Enrich with K-means cluster id Spark Streaming read Stream Topic

© 2017 MapR Technologies Create a DStream DStream: a sequence of RDDs representing a stream of data val messagesDStream = KafkaUtils.createDirectStream[String,String] (ssc, LocationStrategies.PreferConsistent,consumerStrategy) // get message values from key,value and parse to Uber objects val uDStream = linesDStream.map(_.value()) batch time 0 to 1 batch time 1 to 2 batch time 2 to 3 dStream Stored in memory as an RDD

© 2017 MapR Technologies Parse message txt to Uber Object and convert to DataFrame uDStream.foreachRDD{ rdd => // get cluster centers and add to df // send to Topic } ssc.start() ssc.awaitTermination()

© 2017 MapR Technologies Part 3: Realtime Dashboard using Vert.x •  End to End Application for Monitoring Uber Data using Spark ML •  https://mapr.com/blog/monitoring-uber-with-spark-streaming-kafka-and- vertx/

© 2017 MapR Technologies Creating the Vertx EventBus •  create an instance of the vertx.EventBus object •  add an onopen listener, which registers an event bus handler for the address “dashboard.” •  handler will receive all messages published to the “dashboard” address

© 2017 MapR Technologies MapR-DB (HBase API) is Designed to Scale Key Range xxxx xxxx Key Range xxxx xxxx Key Range xxxx xxxx Fast Reads and Writes by Key! Data is automatically partitioned by Key Range! Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val

© 2017 MapR Technologies Store Lots of Data with NoSQL MapR-DB bottleneck Storage ModelRDBMS MapR-DB Normalized schema à Joins for queries can cause bottleneck De-Normalized schema à Data that is read together is stored together Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val

© 2017 MapR Technologies Spark Hbase streamBulkPut •  HBaseContext streamBulkPut method parameters: •  message value DStream, the TableName to write to, function to convert the Dstream values to HBase put records.

© 2017 MapR Technologies Use Case: Real-Time Data Pipelines Input Data + Actual Delay Input Data + Predictions Consumer withML Model 2 Consumer withML Model 1 Decoy results Consumer Consumer withML Model 3 Consumer Stream Archive Stream Scores Stream Input SQL SQL Real time Flight Data Stream Input Actual Delay Input Data + Predictions + Actual Delay Real Time dashboard + Historical Analysis

© 2017 MapR Technologies …helping you put data technology to work ●  Find answers ●  Ask technical questions ●  Join on-demand training course discussions ●  Follow release announcements ●  Share and vote on product ideas ●  Find Meetup and event listings Connect with fellow Apache Hadoop and Spark professionals community.mapr.com

© 2017 MapR Technologies Stream Processing Building a Complete Data Architecture MapR File System (MapR-XD) MapR Converged Data Platform MapR Database (MapR-DB) MapR Event Streams Sources/Apps Bulk Processing

Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using Apache APIs: Kafka, Spark, HBase

More Related Content

What's hot

Similar to Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using Apache APIs: Kafka, Spark, HBase

More from Carol McDonald

Recently uploaded

Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using Apache APIs: Kafka, Spark, HBase