Lambda architecture

Lambda Architecture
A solution for Big Data
Mario A. Santini

A solution born at Twitter
Nathan Marz, author of Big Data: http://www.manning.com/marz/

When is big really big?
● OpenStreetMap.org: ~1.5 M users, ~2.2 nodes (http://j.mp/OSM-stats)
● Wikipedia: 32 M pages, 20 M users (http://en.wikipedia.org/wiki/Wikipedia:Statistics)
● Facebook: 1.3 G users (http://www.statisticbrain.com/facebook-statistics/)
● Twitter: 645 M users (http://www.statisticbrain.com/twitter-statistics/)
● But also:
– Monitoring systems
– Any near real time system

Lambda
query = function(allData);
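The idea that every query is a function of all the data can be sketched in a few lines of Python. This is only an illustration: the `ALL_DATA` records and the `pageviews_per_url` query are hypothetical and do not come from the slides.

```python
from collections import Counter

# Hypothetical master dataset: one immutable record per page view.
ALL_DATA = [
    {"url": "/home", "user": "ann"},
    {"url": "/about", "user": "bob"},
    {"url": "/home", "user": "cid"},
]


def pageviews_per_url(all_data):
    """A query written as a pure function of the entire dataset."""
    return Counter(record["url"] for record in all_data)


result = pageviews_per_url(ALL_DATA)
print(result["/home"])
```

Running such a function over the full dataset on every request is too slow at scale, which is the reason the layers on the following slides precompute views.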

[Diagram: Input data → Lambda Architecture → query]

[Diagram: All Data → Batch Layer → Batch Views → Serving Layer → Query]

Batch Layer
● Stores an immutable input data set
● Continuously computes the batch views
● Simple & distributed

Serving Layer
● Indexes the batch views
● Provides access to the batch views
● Updated by the batch layer
● A trivial read-only database:
– Quick
– Very simple
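A minimal sketch of how the batch layer and the serving layer could fit together, assuming a page-view count as the batch view. The names `build_batch_view`, `ServingLayer` and `swap_in` are hypothetical and are not the API of any real tool.

```python
from collections import Counter
from types import MappingProxyType


def build_batch_view(master_dataset):
    """Batch layer: recompute the view from scratch over all stored records."""
    return Counter(record["url"] for record in master_dataset)


class ServingLayer:
    """Read-only, key-indexed access to the most recent batch view."""

    def __init__(self):
        self._view = MappingProxyType({})

    def swap_in(self, batch_view):
        # Called by the batch layer only: replaces the whole view at once.
        self._view = MappingProxyType(dict(batch_view))

    def get(self, key, default=0):
        # Clients can read but have no way to write through the proxy.
        return self._view.get(key, default)


master = [{"url": "/home"}, {"url": "/home"}, {"url": "/about"}]
serving = ServingLayer()
serving.swap_in(build_batch_view(master))
print(serving.get("/home"))
```

Because the view is replaced wholesale and never updated in place, the serving side needs no random writes, which is what keeps it quick and very simple.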

Batch Layer + Serving Layer
● Robust and fault tolerant
● Scalable
● General
● Extensible
● Allow ad hoc queries
● Minimal maintenance
● Debuggable

What's missing?
While the batch layer computes the query on the full data set, a pretty big chunk of new data has arrived and been stored. Should we wait a couple of hours to query this data?

Speed Layer
[Diagram: New Data → Speed Layer → Near real time views → Query]

All together now!
[Diagram: All Data → Batch Layer → Batch Views → Serving Layer → Query; New Data → Speed Layer → Near real time views → Query]

How should all this mess work?
● All new data are sent to both the batch layer and the speed layer (data are raw, immutable and append-only)
● The batch layer continuously recomputes the query functions over the whole dataset to produce the batch views
● The serving layer indexes the batch views
● In the end, the data in the batch views are a couple of hours old

How should all this mess work? (continued)
● The speed layer processes only the new data
● It uses a fast read/write database and incremental processing algorithms
● It produces the near real time views
● A query merges the results from the real time views and the batch views
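The flow described on these two slides can be sketched as a toy, single-process model. The class `LambdaSketch` and its methods are hypothetical names, and the sketch assumes the batch run is instantaneous; in a real system it takes hours, so the real time view must keep every record that arrives while the run is in progress.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of the three layers, counting page views per URL."""

    def __init__(self):
        self.master = []                 # raw, immutable, append-only data
        self.batch_view = Counter()      # recomputed from all the data
        self.realtime_view = Counter()   # updated incrementally

    def ingest(self, record):
        # New data goes to both the batch layer and the speed layer.
        self.master.append(record)
        self.realtime_view[record["url"]] += 1

    def run_batch(self):
        # Recompute over the whole dataset. Assumption: the run is instant,
        # so the real time view can simply be reset afterwards.
        self.batch_view = Counter(r["url"] for r in self.master)
        self.realtime_view = Counter()

    def query(self, url):
        # Merge the batch result with the near real time result.
        return self.batch_view[url] + self.realtime_view[url]


system = LambdaSketch()
system.ingest({"url": "/home"})
system.ingest({"url": "/home"})
system.run_batch()
system.ingest({"url": "/home"})   # arrives after the last batch run
print(system.query("/home"))
```

The merge is a plain sum here only because counts are additive; other queries need their own merge logic.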

Batch Layer - tools
● Hadoop
– YARN: a framework for job scheduling and cluster management
– Map / Reduce: a way to process huge amounts of data in parallel, based on YARN
– HDFS: a distributed file system with high-throughput access to application data
– And even more...
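To make the Map / Reduce idea concrete, here is a word count written in plain Python that imitates the map, shuffle and reduce phases in a single process. It does not use the Hadoop API; all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big views"])
print(counts)
```

In a real cluster the map and reduce calls run on many machines in parallel, since each line and each key can be processed independently.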

Serving Layer - tools
● ElephantDB
– Read-only database, very small, very fast
● Here we need anything that has the same features
● Cloudera Impala

Speed Layer - tools
● Storm project
– Very fast distributed computation system
● Apache HBase
● MongoDB

Query - tools
● Cloudera Impala
