Reactive data analysis with vert.x

Reactive Data Analysis with Vert.x GERALD MÜCKE, @GMUECKE 1

@gmuecke About me  IT Consultant & Java Specialist at DevCon5 (CH)  Focal Areas  Tool-assisted quality assurance  Performance (-testing, -analysis, -tooling)  Operational Topics (APM, Monitoring)  Twitter: @gmuecke 2

@gmuecke What is Big Data? 3
  Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.
  Variety: The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.
  Velocity: In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
  Variability: Inconsistency of the data set can hamper processes to handle and manage it.
  Veracity: The data quality of captured data can vary greatly, affecting the accurate analysis.
  Source: https://en.wikipedia.org/wiki/Big_Data#Characteristics

@gmuecke Fast Data Processing  Database  File  Network (Stream) 4

@gmuecke The Starting Point  Customer stores response time measurements of test runs in MongoDB  Lots of Data  Timestamp & Value  No Proper Visualization 5

@gmuecke What are timeseries data?  A set of datapoints, each with a timestamp and a value 6 [Chart: value plotted over time]

@gmuecke What is MongoDB?  MongoDB  NoSQL database with focus on scale  JSON as data representation  No HTTP endpoint (TCP based Wire Protocol)  Aggregation framework for complex queries  Provides an Async Driver 7

@gmuecke What is Grafana?  A Service for Visualizing Time Series Data  Open Source  Backend written in Go  Frontend based on Angular  Dashboards & Alerts 8

@gmuecke Grafana Architecture 9
  Grafana Server: implemented in Go; persistence for settings and dashboards; offers a proxy for service calls
  [Diagram components: Browser with Angular UI and Data Source Plugins, Grafana Server with Proxy, Data Sources (DB)]

@gmuecke Datasources for Grafana 10
  Grafana Server: implemented in Go; persistence for settings and dashboards; offers a proxy for service calls
  Data Source Plugin: Angular, JavaScript
  [Diagram components: Browser with Angular UI and Data Source Plugin, Grafana Server, Datasource reached via HTTP]

@gmuecke Connect Angular Directly to Mongo? 11

@gmuecke From 2 Tier to 3 Tier 12
  2 Tier: Grafana (Angular), MongoDB
  3 Tier: Grafana (Angular), Datasource Service (reached via HTTP), MongoDB (reached via Mongo Wire Protocol)

@gmuecke Start Simple 13
  SimpleJsonDatasource (Plugin) with 3 service endpoints:
  /search: Labels, the names of available timeseries
  /annotations: Annotations, textual markers
  /query: Query, the actual time series data
  https://github.com/grafana/simple-json-datasource

@gmuecke /search Format 14
  Request
    { "target" : "select metric", "refId" : "E" }
  Response (an array of strings)
    [ "Metric Name 1", "Metric Name 2" ]

@gmuecke /annotations Format 15
  Request
    {
      "annotation" : {
        "name" : "Test",
        "iconColor" : "rgba(255, 96, 96, 1)",
        "datasource" : "Simple Example DS",
        "enable" : true,
        "query" : "{\"name\":\"Timeseries A\"}"
      },
      "range" : { "from" : "2016-06-13T12:23:47.387Z", "to" : "2016-06-13T12:24:19.217Z" },
      "rangeRaw" : { "from" : "2016-06-13T12:23:47.387Z", "to" : "2016-06-13T12:24:19.217Z" }
    }
  Response
    [
      {
        "annotation" : {
          "name" : "Test",
          "iconColor" : "rgba(255, 96, 96, 1)",
          "datasource" : "Simple Example DS",
          "enable" : true,
          "query" : "{\"name\":\"Timeseries A\"}"
        },
        "time" : 1465820629774,
        "title" : "Marker",
        "tags" : [ "Tag 1", "Tag 2" ]
      }
    ]

@gmuecke /query Format 16
  Request
    {
      "panelId" : 1,
      "maxDataPoints" : 1904,
      "format" : "json",
      "range" : { "from" : "2016-06-13T12:23:47.387Z", "to" : "2016-06-13T12:24:19.217Z" },
      "rangeRaw" : { "from" : "2016-06-13T12:23:47.387Z", "to" : "2016-06-13T12:24:19.217Z" },
      "interval" : "20ms",
      "targets" : [ { "target" : "Timeseries A", "refId" : "A" } ]
    }
  Response (each datapoint is [value, timestamp in epoch milliseconds])
    [
      {
        "target" : "Timeseries A",
        "datapoints" : [
          [1936, 1465820629774],
          [2105, 1465820632673],
          [4187, 1465820635570],
          [30001, 1465820645243]
        ]
      },
      { "target" : "Timeseries B", "datapoints" : [ ] }
    ]
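The /query response above pairs each value with its timestamp, value first. A minimal sketch in plain Java of rendering one series in that shape follows. The class and method names are hypothetical, no JSON library is used, and the series name is assumed to need no JSON escaping; a real service would more likely use a JSON API.

```java
/** Hypothetical helper: renders one series in the /query response shape. */
final class QueryResponseFormatter {

    /**
     * @param target series name (assumed to contain no characters that need JSON escaping)
     * @param points rows of {value, timestampMillis}, the order used in the response format
     */
    static String formatSeries(String target, long[][] points) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"target\":\"").append(target).append("\",\"datapoints\":[");
        for (int i = 0; i < points.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            // One datapoint: [value, timestamp]
            sb.append('[').append(points[i][0]).append(',').append(points[i][1]).append(']');
        }
        return sb.append("]}").toString();
    }
}
```

A series without datapoints, such as "Timeseries B" above, comes out with an empty `datapoints` array.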

@gmuecke Structure of the Source Data 17
    {
      "_id" : ObjectId("56375bc54f3c4caedfe68aca"),
      "t" : {
        "eDesc" : "some description",
        "eId" : "56375ae24f3c4caedfe68a07",
        "name" : "some name",
        "profile" : "I01",
        "rnId" : "56375b694f3c4caedfe68aa0",
        "rnStatus" : "PASSED",
        "uId" : "anonymous"
      },
      "n" : {
        "begin" : NumberLong("1446468494689"),
        "value" : NumberLong(283)
      }
    }

@gmuecke Custom Datasource  Should be  Lightweight  Fast / Performant  Simple 18

@gmuecke Microservice?  Options for implementation  Java EE Microservice (e.g. Wildfly Swarm)  Spring Boot Microservice  Vert.x Microservice  Node.js  ... 19

@gmuecke The Alternative Options 20
  Node.js: Single Threaded; Child Worker Processes; JavaScript Only; Not the best choice for heavy computation
  Spring / Java EE: Multithreaded; Clusterable; Java Only; Solid Workhorses, cumbersome at times

@gmuecke Why Vert.x?  High Performance  Low Memory Footprint  Few Dependencies  Polyglot  Scalable 21

@gmuecke But first, some basics 22

@gmuecke Vert.x is a Library for  Asynchronous  Non-Blocking  Reactive  Polyglot  Microservices 23

@gmuecke Asynchronous vs. Synchronous 24 © Jason Lee / Reuters

@gmuecke Non-blocking vs. Blocking 25 © Fritz Geller-Grimm © Dontworry

@gmuecke Reactive vs. Non-Reactive  Responsive  Resilient  Elastic  Message Driven 26

@gmuecke Polyglot vs. Monoglot 27 © Kjp993 © Jacquie Wingate

@gmuecke Microservice vs. Monolith 28 The weaver https://www.amazon.com/Wenger-16999-Swiss-Knife-Giant/dp/B001DZTJRQ

@gmuecke Verticles  Contain your processing code  Provide actor-like concurrency  Send/Receive messages  Verticles are the unit of deployment 30

@gmuecke Event Loop 31 [Diagram: the event loop dispatching events and I/O to three verticles]

@gmuecke Event Loop 32 Photo: Andreas Praefcke

@gmuecke Event Loop and Verticles 33 Photo: RokerHRO 3rd Floor, Verticle A 2nd Floor, Verticle B 1st Floor, Verticle C

@gmuecke Event Bus 36 [Diagram: three verticles exchanging a message over the event bus]

@gmuecke Event Bus 37  https://www.youtube.com/watch?v=Kr_4yLhIJ_I Disclaimer: I am not affiliated with Heineken. I simply liked the commercial. Nevertheless: Drink responsibly!

@gmuecke Multi-Reactor 38 [Diagram components: CPU with four cores, Verticles, Eventbus, Other Vert.x Instance, Browser]

@gmuecke Event & Worker Verticles 39 [Diagram: event-driven verticles and worker verticles, each group shown with a thread pool]

@gmuecke Implementing the datasource  Http Verticle  Routing requests & sending responses  Verticles querying the DB  Searching timeseries labels  Annotation  Timeseries data points  Optional Verticles for Post Processing 41

@gmuecke What is the challenge? 42
  Optimization: Queries can be optimized; Large datasets have to be searched, read and transported; Source data cannot be modified vs. data redundancy
  Sizing: How to size the analysis system without knowing the query times? How to size thread pools or database pools if most of the queries will take between 100 ms and 30 s?

Analysing Data from a Database 43

@gmuecke Datasource Architecture 44 [Diagram components: HTTP Request, HTTP Service, Eventbus, Timeseries, Labels, Annotations, DB, HTTP Response; runs on CPU]

@gmuecke Step 1 – The naive approach  Find all datapoints within range 45

@gmuecke Datasource Architecture 46 [Diagram components: HTTP Request, HTTP Service, Eventbus, Query Database, DB, HTTP Response; runs on CPU]

@gmuecke Step 2 – Split Request  Split request into chunks (#chunks = #cores)  Use multiple verticle instances in parallel (#instances = #cores)? 47
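Splitting a request into one chunk per core can be sketched as below. This is an illustration in plain Java, not the talk's implementation: the class name `RangeSplitter` and the half-open `{start, endExclusive}` convention are assumptions.

```java
/** Hypothetical helper: splits a query time range into contiguous chunks. */
final class RangeSplitter {

    /**
     * Splits [from, to) into {@code chunks} contiguous sub-ranges.
     * Returns rows of {start, endExclusive}; the last row always ends at {@code to}.
     * Assumes (to - from) * chunks fits into a long, which holds for epoch-millisecond ranges.
     */
    static long[][] split(long from, long to, int chunks) {
        if (chunks < 1 || to < from) {
            throw new IllegalArgumentException("need chunks >= 1 and from <= to");
        }
        long span = to - from;
        long[][] result = new long[chunks][2];
        for (int i = 0; i < chunks; i++) {
            result[i][0] = from + span * i / chunks;
            result[i][1] = from + span * (i + 1) / chunks;
        }
        return result;
    }
}
```

To follow the "#chunks = #cores" rule, the chunk count could be taken from `Runtime.getRuntime().availableProcessors()`.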

@gmuecke Datasource Architecture 48 [Diagram components: HTTP Request, HTTP Service, Split/Merge Request, Eventbus, 4× Query Database, DB, HTTP Response; runs on CPU]
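The split/merge idea can be illustrated without Vert.x. The sketch below uses `java.util.concurrent.CompletableFuture` as a stand-in for the event bus and the query verticles, which is an assumption made to keep the example self-contained; the class name `ScatterGather` and the `query` function are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;

/** Hypothetical scatter/gather helper; stands in for split request, parallel queries, merge. */
final class ScatterGather {

    /**
     * Runs {@code query} for every chunk {start, endExclusive} in parallel and
     * concatenates the partial results in chunk order.
     */
    static List<long[]> queryAll(long[][] chunks, BiFunction<Long, Long, List<long[]>> query) {
        List<CompletableFuture<List<long[]>>> futures = new ArrayList<>();
        for (long[] chunk : chunks) {
            final long start = chunk[0];
            final long end = chunk[1];
            // Scatter: each chunk is queried on a pool thread.
            futures.add(CompletableFuture.supplyAsync(() -> query.apply(start, end)));
        }
        List<long[]> merged = new ArrayList<>();
        for (CompletableFuture<List<long[]>> future : futures) {
            // Gather: joining in chunk order keeps the merged series sorted by time.
            merged.addAll(future.join());
        }
        return merged;
    }
}
```

Note that `join()` blocks the calling thread. That is acceptable in this sketch, but code running on a Vert.x event loop would compose the asynchronous results instead of blocking.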

@gmuecke Step 3 – Aggregate Datapoints  Use Mongo Aggregation Pipeline  Reduce Datapoints returned to service 49
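To show what reducing the datapoints means, the sketch below performs a comparable reduction in plain Java: datapoints are grouped into fixed time buckets and each bucket is replaced by its average. It does not use the Mongo aggregation pipeline. The choice of the average as aggregate, the class name `Downsampler`, and the `{timestamp, value}` row layout are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: reduces datapoints to one averaged point per time bucket. */
final class Downsampler {

    /**
     * @param points         rows of {timestampMillis, value}
     * @param intervalMillis bucket width, must be positive
     * @return one {bucketStart, average} row per non-empty bucket, ordered by time;
     *         the average uses integer division and therefore truncates
     */
    static List<long[]> averagePerBucket(long[][] points, long intervalMillis) {
        TreeMap<Long, long[]> buckets = new TreeMap<>(); // bucketStart -> {sum, count}
        for (long[] p : points) {
            long bucketStart = p[0] - Math.floorMod(p[0], intervalMillis);
            long[] acc = buckets.computeIfAbsent(bucketStart, k -> new long[2]);
            acc[0] += p[1];
            acc[1]++;
        }
        List<long[]> result = new ArrayList<>();
        for (Map.Entry<Long, long[]> e : buckets.entrySet()) {
            result.add(new long[] { e.getKey(), e.getValue()[0] / e.getValue()[1] });
        }
        return result;
    }
}
```

Doing the same grouping inside the database means only the reduced rows have to be transported to the service.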

@gmuecke Step 4 – Percentiles (CPU)  Fetch all data  Calculate percentiles in service 51
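Calculating a percentile in the service could look like the following sketch. It uses one common textbook rule: sort the values, compute the index k/100 · n, round a fractional index up and take that value, or average the value at a whole index with its successor. Whether the talk's code uses exactly this rule is an assumption, and the class name `Percentiles` is hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper: k-th percentile using the "index = k/100 * n" textbook rule. */
final class Percentiles {

    /**
     * @param values unsorted sample, must not be empty
     * @param k      percentile between 1 and 99
     */
    static double percentile(long[] values, int k) {
        if (values.length == 0 || k < 1 || k > 99) {
            throw new IllegalArgumentException("need a non-empty sample and 1 <= k <= 99");
        }
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        long scaled = (long) k * sorted.length; // k * n, i.e. 100 times the index
        if (scaled % 100 != 0) {
            // Fractional index: round up and take the value at that 1-based rank.
            int rank = (int) (scaled / 100) + 1;
            return sorted[rank - 1];
        }
        // Whole index: average the value at that rank with the next one.
        int rank = (int) (scaled / 100);
        return (sorted[rank - 1] + sorted[rank]) / 2.0;
    }
}
```

This approach needs the complete sample in memory, which is why it requires fetching all data first.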

@gmuecke Step 4 – Percentiles (DB)  Build aggregation pipeline to calculate percentiles  Algorithm, see http://www.dummies.com/education/math/statistics/how-to-calculate-percentiles-in-statistics/ 52

@gmuecke Step 5 - Postprocessing  Apply additional computation on the result from the database 54

@gmuecke Datasource Architecture (final) 55 [Diagram components: HTTP Request, HTTP Service, Split Request, Eventbus, 4× Query Database, DB, Eventbus, 4× Post Process, Merge Result, HTTP Response; runs on two CPUs]

@gmuecke Adding more stats & calculation  Push calculation to the DB if possible  Add more workers / nodes for complex (post-) processing  Aggregate results before post-processing  DB performance is king 56

@gmuecke Let’s read a large data file  Datafile is large (> 1GB)  Every line of the file is a datapoint  The first 10 characters are a timestamp  The dataset is sorted  The datapoints are not equally distributed  Grafana reads ~1900 datapoints per chart request 59

@gmuecke The Challenges (pick one) 60
  How to randomly access 1900 datapoints without reading the entire file into memory? Answer: Index + Lazy refinement
  How to read a huge file efficiently into memory? Answer: Index + Lazy load

@gmuecke Let’s build an index  Indexes can be built using a tree data structure  Node: Timestamp  Leaf: offset position in file or the datapoint  Red-Black Trees provide fast access  Read/Insert O(log n)  Space O(n) 61 © Cburnett, Wikipedia

@gmuecke  java.util.TreeMap is a red-black tree based implementation*  TreeMap<Long,Long> index = new TreeMap<>(); 63
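A minimal sketch of how such a `TreeMap<Long,Long>` index can answer "where in the file do I start reading for this timestamp?" follows. The wrapper class `OffsetIndex` and its method names are hypothetical; `floorEntry` is the standard `TreeMap` call that returns the greatest key less than or equal to the argument.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical wrapper around a timestamp -> file offset index. */
final class OffsetIndex {

    private final TreeMap<Long, Long> index = new TreeMap<>();

    /** Records that the datapoint with this timestamp starts at this file offset. */
    void put(long timestamp, long offset) {
        index.put(timestamp, offset);
    }

    /**
     * Returns the offset of the last indexed datapoint at or before {@code timestamp},
     * or 0 (start of file) if no such entry exists.
     */
    long startOffsetFor(long timestamp) {
        Map.Entry<Long, Long> entry = index.floorEntry(timestamp);
        return entry == null ? 0L : entry.getValue();
    }
}
```

Because the dataset is sorted, reading forward from the returned offset is enough to reach the requested timestamp.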

@gmuecke How to build an index (fast)?  Read datapoint from offset positions  Build a partial index 64 Dataset
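Building a partial index by sampling the file at fixed offsets could be sketched as follows. Several details are assumptions: the class name `PartialIndexBuilder`, the `blockSize` parameter, ASCII content, and a timestamp written as 10 decimal digits at the start of each line. `RandomAccessFile.readLine()` is unbuffered, which is acceptable here because only a few lines per block are read.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.TreeMap;

/** Hypothetical helper: samples a sorted data file to build a coarse timestamp -> offset index. */
final class PartialIndexBuilder {

    /**
     * Jumps through the file in steps of {@code blockSize} bytes and indexes the first
     * complete line found after each jump.
     */
    static TreeMap<Long, Long> build(String path, long blockSize) throws IOException {
        if (blockSize < 1) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        TreeMap<Long, Long> index = new TreeMap<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            for (long offset = 0; offset < length; offset += blockSize) {
                file.seek(offset);
                if (offset > 0) {
                    file.readLine(); // skip the rest of the line the jump landed in
                }
                long lineStart = file.getFilePointer();
                String line = file.readLine();
                if (line == null || line.length() < 10) {
                    continue; // end of file or a line too short to carry a timestamp
                }
                // Assumption: the first 10 characters are decimal digits.
                index.putIfAbsent(Long.parseLong(line.substring(0, 10)), lineStart);
            }
        }
        return index;
    }
}
```

A smaller block size gives a finer index at the cost of a longer startup and a larger heap.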

@gmuecke On next query  Locate Block  Refine Block  Update Index 65
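The three steps on this slide can be sketched as one lookup method: locate the block through the existing index, scan forward inside that block, and store the position that was found so later queries start closer to their target. This is one possible reading of "refine", not the talk's implementation. The class name `LazyIndex`, ASCII content, and the 10-digit timestamp format are assumptions.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical lazily refined index over a sorted data file. */
final class LazyIndex {

    private final String path;
    private final TreeMap<Long, Long> index; // timestamp -> file offset, refined on use

    LazyIndex(String path, TreeMap<Long, Long> index) {
        this.path = path;
        this.index = index;
    }

    /** Returns the offset of the first datapoint with a timestamp >= {@code timestamp}. */
    long locate(long timestamp) throws IOException {
        // Locate block: start at the closest indexed position at or before the timestamp.
        Map.Entry<Long, Long> floor = index.floorEntry(timestamp);
        long offset = floor == null ? 0L : floor.getValue();
        if (floor != null && floor.getKey() == timestamp) {
            return offset;
        }
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offset);
            while (true) {
                long lineStart = file.getFilePointer();
                String line = file.readLine();
                if (line == null) {
                    return lineStart; // no such datapoint: end of file
                }
                if (line.length() < 10) {
                    continue; // line too short to carry a timestamp
                }
                // Refine block: scan line by line (assumes 10 decimal digits per timestamp).
                long ts = Long.parseLong(line.substring(0, 10));
                if (ts >= timestamp) {
                    index.put(ts, lineStart); // Update index with the refined position.
                    return lineStart;
                }
            }
        }
    }
}
```

Repeated queries for timestamps that have been indexed this way are answered from the `TreeMap` alone, without touching the file.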

@gmuecke Datasource Architecture (again) 66 [Diagram components: HTTP Request, HTTP Service, Split Request, Eventbus, 4× Read File, Dataset, Eventbus, 4× Post Process, Merge Result, HTTP Response; runs on two CPUs]

@gmuecke Tradeoffs Block Size Index Size Startup Time Heap Size Request Size 67

@gmuecke Takeaways  Vert.x is  Reactive, Non-Blocking, Asynchronous, Scalable  Running on the JVM  Polyglot  Fun  A valid choice for data stream processing 68 Source code on: https://github.com/gmuecke/grafana-vertx-datasource

Thank you! FEEDBACK APPRECIATED! 69
