Big data streaming Willem Meints
Microservices & analytics
Event bus Microservices • Multiple smaller services that scale independently • Each service has its own data store • Data flows between services through the event bus
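The idea on this slide can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a real message broker: `EventBus`, `OrderService`, `InvoiceService` and the `order_placed` topic are hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class EventBus:
    """Hypothetical in-memory stand-in for a real event bus."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # Deliver the event to every handler registered for this topic.
        for handler in self._subscribers[topic]:
            handler(event)


class OrderService:
    """Owns its own data store and announces changes on the bus."""

    def __init__(self, bus: EventBus) -> None:
        self.bus = bus
        self.orders: Dict[str, float] = {}  # private store, not shared

    def place_order(self, order_id: str, amount: float) -> None:
        self.orders[order_id] = amount
        self.bus.publish("order_placed", {"order_id": order_id, "amount": amount})


class InvoiceService:
    """Builds its own view of the data purely from events."""

    def __init__(self, bus: EventBus) -> None:
        self.invoices: Dict[str, dict] = {}  # private store, not shared
        bus.subscribe("order_placed", self.on_order_placed)

    def on_order_placed(self, event: dict) -> None:
        self.invoices[event["order_id"]] = {"amount": event["amount"], "paid": False}


bus = EventBus()
order_service = OrderService(bus)
invoice_service = InvoiceService(bus)
order_service.place_order("o-1", 25.0)
```

Neither service reads the other's store; the invoice service only knows about the order because the event passed through the bus.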
Rapids Rivers Lakes
Data analytics challenges with microservices • A complete picture is there, but spread over a vast landscape • Most data doesn’t come in a database • Data changes rapidly
Exploring some scenarios
Scenario 1: Get an annual sales report • The goal is to get a complete picture of the situation • Data based on business events
Orders Invoices Event bus Data analytics Data Lake
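A rough sketch of what the analytics step in scenario 1 could look like once the business events have landed in the data lake. The event shape (`type`, `timestamp`, `amount`) and the `invoice_paid` event name are assumptions made for this example; the slides do not specify them.

```python
from collections import defaultdict
from datetime import datetime


def annual_sales(events):
    """Sum paid invoice amounts per calendar year from raw business events."""
    totals = defaultdict(float)
    for event in events:
        # Only paid invoices count as sales in this sketch (an assumption).
        if event["type"] != "invoice_paid":
            continue
        year = datetime.fromisoformat(event["timestamp"]).year
        totals[year] += event["amount"]
    return dict(totals)


# Hypothetical events as they might be stored in the data lake.
lake = [
    {"type": "order_placed", "timestamp": "2016-03-01T10:00:00", "amount": 100.0},
    {"type": "invoice_paid", "timestamp": "2016-03-05T09:30:00", "amount": 100.0},
    {"type": "invoice_paid", "timestamp": "2016-11-20T14:00:00", "amount": 50.0},
    {"type": "invoice_paid", "timestamp": "2017-01-02T08:00:00", "amount": 75.0},
]

report = annual_sales(lake)
```

Because the report is computed from the stored events rather than from any one service's database, it can combine orders and invoices without coupling to either service.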
Scenario 2: Detect anomalies • The goal is to detect anomalies on the website and prevent abuse • Machine learning needed to detect the anomalies • Data based on the data lake
Click stream collector Event bus Data analytics Data Lake Model
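To make the flow in scenario 2 concrete, here is a deliberately simple sketch: a "model" is fitted on historical data from the data lake and then applied to new click-stream values. A z-score threshold is used as a stand-in for the machine learning model mentioned on the slide; the function names, the requests-per-minute feature and the threshold of 3 are all assumptions for illustration.

```python
from statistics import mean, stdev


def train(history):
    """Fit a trivial model (mean and sample standard deviation) on historical values."""
    return {"mean": mean(history), "stdev": stdev(history)}


def is_anomaly(model, value, threshold=3.0):
    """Flag values that lie more than `threshold` standard deviations from the mean."""
    if model["stdev"] == 0:
        return value != model["mean"]
    return abs(value - model["mean"]) / model["stdev"] > threshold


# Hypothetical requests-per-minute for one client, read from the data lake.
history = [10, 12, 11, 9, 10, 13, 11, 10]
model = train(history)

# New values arriving from the click stream collector.
normal_traffic = is_anomaly(model, 12)
suspicious_traffic = is_anomaly(model, 200)
```

The split matters more than the statistics: training happens on the lake (slow, complete data), scoring happens on the stream (fast, one value at a time).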
Analytics tools
vs
Event bus Data processing tool Distributed database Alerting Dashboarding Flow control logic Cluster Manager
The Azure based solution
Azure Event Hub HDInsight Azure Data Lake Alerting Dashboarding Azure App Services Cluster Manager
Demo A short introduction to Apache Spark
Spark SQL Spark Streaming Machine Learning GraphX Apache Spark Core
Resilient Distributed Datasets Resilient Distributed Dataset Partition Record Record Partition Record Record
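The structure in this diagram (a dataset made of partitions, each holding records) can be modelled in plain Python. This is a conceptual sketch only and does not use the Spark API; the round-robin split is an assumption, and Spark applies its own partitioning strategies.

```python
def partition(records, num_partitions):
    """Split records into partitions (round-robin, chosen here for simplicity)."""
    parts = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        parts[index % num_partitions].append(record)
    return parts


def map_partitions(parts, fn):
    """Apply fn to every record, one partition at a time.

    Each partition is processed independently, which is what lets a
    cluster hand different partitions to different worker nodes.
    """
    return [[fn(record) for record in part] for part in parts]


dataset = partition([1, 2, 3, 4], num_partitions=2)
scaled = map_partitions(dataset, lambda x: x * 10)
```

Because a transformation like `map_partitions` only needs the input partition and the function, a lost partition can be recomputed from its source, which is the intuition behind the "resilient" in the name.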
Stream Batches Processed data Streams with Spark Lists of RDDs
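The stream-to-batches step can be illustrated with a small pure-Python sketch: incoming events are grouped by a fixed batch interval, and each batch is then processed as an ordinary collection. The function name and the `(timestamp, payload)` event shape are assumptions for the example; this is not Spark Streaming code.

```python
def micro_batches(events, interval_seconds):
    """Group (timestamp, payload) events into batches of a fixed interval.

    Intervals without events are skipped in this sketch to keep it short.
    """
    batches = {}
    for timestamp, payload in events:
        batch_index = int(timestamp // interval_seconds)
        batches.setdefault(batch_index, []).append(payload)
    return [batches[index] for index in sorted(batches)]


# Hypothetical events: seconds since the stream started, plus a payload.
events = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (4.1, "d")]

batches = micro_batches(events, interval_seconds=2)

# Each batch is processed like a small static dataset.
counts = [len(batch) for batch in batches]
```

A longer interval produces fewer, larger batches, which adds latency but gives each processing step more data to work with.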
Demo Deploying Spark to Azure using HDInsight
Azure Event Hubs • Capable of streaming large volumes of data • SDK available in many languages • Ruby • Python • Java/Scala • C# • Apache Spark
How does an Azure Event Hub work? Partition Partition Partition Consumer group Consumer group
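The relationship between partitions and consumer groups can be sketched as a toy model. This is not the Azure SDK: `EventHubModel` and its methods are hypothetical, and the CRC32-based key hashing is an assumption used only to show that a given partition key always maps to the same partition.

```python
import zlib


class EventHubModel:
    """Toy model: an append-only log per partition, one read position per consumer group."""

    def __init__(self, partition_count):
        self.partitions = [[] for _ in range(partition_count)]
        self.offsets = {}  # (consumer_group, partition_index) -> next offset to read

    def send(self, partition_key, body):
        # Same key -> same partition, so events for one key stay in order.
        index = zlib.crc32(partition_key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(body)
        return index

    def receive(self, consumer_group, partition_index):
        # Each consumer group tracks its own position, independent of other groups.
        key = (consumer_group, partition_index)
        start = self.offsets.get(key, 0)
        events = self.partitions[partition_index][start:]
        self.offsets[key] = start + len(events)
        return events


hub = EventHubModel(partition_count=3)
first = hub.send("user-1", "click-1")
second = hub.send("user-1", "click-2")

dashboard_events = hub.receive("dashboard", first)
dashboard_again = hub.receive("dashboard", first)
alerting_events = hub.receive("alerting", first)
```

Reading does not remove events: the `dashboard` group has consumed both clicks, yet the `alerting` group still sees them from the start, which is why several applications can process the same stream independently.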
Demo Using Azure Event Hub with Spark
Tips for going to production • When using streams, always have n+1 worker nodes • More partitions = more speed • Longer intervals are slower, but sometimes better
Thanks! Willem Meints Technical Evangelist/Microsoft MVP @willem_meints

Big data streaming with Apache Spark on Azure