Lambda Architecture
A solution for Big Data
Mario A. Santini
A solution born at Twitter
Nathan Marz, author of Big Data: http://www.manning.com/marz/
How big is big?
● OpenStreetMap.org: ~1.5 M users, ~2.2 nodes (http://j.mp/OSM-stats)
● Wikipedia: 32 M pages, 20 M users (http://en.wikipedia.org/wiki/Wikipedia:Statistics)
● Facebook: 1.3 G users (http://www.statisticbrain.com/facebook-statistics/)
● Twitter: 645 M users (http://www.statisticbrain.com/twitter-statistics/)
● But also:
  – Monitoring systems
  – Any near real time system
Lambda
query = function(allData);
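The idea that a query is a function of all the data can be sketched in a few lines of Python. This is only an illustration: the page-view records, the `ALL_DATA` list and the `pageviews_query` function are hypothetical names that do not appear in the slides.

```python
from collections import Counter

# Hypothetical master data set: one raw record per page view.
ALL_DATA = [
    {"url": "/home", "user": "alice"},
    {"url": "/home", "user": "bob"},
    {"url": "/about", "user": "alice"},
]


def pageviews_query(all_data, url):
    """Answer the query by scanning every record: query = function(all data)."""
    counts = Counter(record["url"] for record in all_data)
    return counts[url]


home_views = pageviews_query(ALL_DATA, "/home")
```

Scanning everything on each query is correct but slow on large data sets, which is why the following slides precompute views instead.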
[Diagram: Input data → Lambda Architecture → Query]
[Diagram: All Data → Batch Layer → Batch Views → Serving Layer → Query]
Batch Layer
● Stores an immutable input data set
● Continuously computes the batch views
● Simple & distributed
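A minimal sketch of these two responsibilities, assuming an in-memory list stands in for the distributed store. `MasterDataset` and `compute_batch_view` are hypothetical names, and counting page views is just an example query.

```python
from collections import Counter


class MasterDataset:
    """Append-only store: records can be added, never updated or deleted."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Keep a frozen copy so later changes by the caller cannot alter history.
        self._records.append(tuple(sorted(record.items())))

    def scan(self):
        # Hand out fresh dicts; the stored tuples stay untouched.
        for frozen in self._records:
            yield dict(frozen)


def compute_batch_view(master):
    """Recompute the whole view from scratch over the full data set."""
    return Counter(record["url"] for record in master.scan())


master = MasterDataset()
master.append({"url": "/home", "user": "alice"})
master.append({"url": "/home", "user": "bob"})
batch_view = compute_batch_view(master)
```

Because the view is always rebuilt from the raw records, a bug in `compute_batch_view` can be fixed and the view regenerated without losing any data.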
Serving Layer
● Indexes the batch views
● Provides access to the batch views
● Updated by the batch layer
● A trivial read-only database:
  – Quick
  – Very simple
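The serving layer described above can be sketched as a read-only lookup table that the batch layer replaces wholesale. The `ServingLayer` class and its `load` and `get` methods are hypothetical names chosen for this illustration, not the API of any real serving database.

```python
from types import MappingProxyType


class ServingLayer:
    """Read-only, key-indexed access to the latest batch view."""

    def __init__(self):
        self._view = MappingProxyType({})

    def load(self, batch_view):
        # Called by the batch layer: swap in a complete new view in one step.
        self._view = MappingProxyType(dict(batch_view))

    def get(self, key, default=0):
        # Readers only ever do indexed lookups; there is no write path.
        return self._view.get(key, default)


serving = ServingLayer()
serving.load({"/home": 2, "/about": 1})
home_count = serving.get("/home")
```

Since there are no random writes, the store needs no locking or conflict handling, which is what keeps it quick and very simple.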
Batch Layer + Serving Layer
● Robust and fault tolerant
● Scalable
● General
● Extensible
● Allows ad hoc queries
● Minimal maintenance
● Debuggable
What's missing?
While the batch layer computes the query on the full data set, a pretty big chunk of data has just arrived and been stored. Should we wait a couple of hours to query this data?
Speed Layer
[Diagram: New Data → Speed Layer → Near real time views → Query]
All together now!
[Diagram: All Data → Batch Layer → Batch Views → Serving Layer → Query; New Data → Speed Layer → Near real time views → Query]
How should all this mess work?
● All new data is sent to both the batch layer and the speed layer (the data is raw, immutable and append-only)
● The batch layer continuously precomputes the query functions over the whole data set to produce the batch views
● The serving layer indexes the batch views
● In the end, the data in the batch views is a couple of hours old
How should all this mess work? (continued)
● The speed layer processes only the new data
● It uses a fast read/write database and incremental processing algorithms
● It produces the near real time views
● Queries are resolved by merging the results from the real time views and the batch views
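The whole flow can be sketched in a single process, using a page-view count as the example query. Everything here is hypothetical: the `LambdaPipeline` class and its method names are made up for illustration. One assumption goes beyond the slides: after a batch run has absorbed all the data, the real time view is simply discarded.

```python
from collections import Counter


class LambdaPipeline:
    """Toy single-process model of the batch, serving and speed layers."""

    def __init__(self):
        self.master = []                # raw, append-only input of the batch layer
        self.batch_view = Counter()     # precomputed view, exposed by the serving layer
        self.realtime_view = Counter()  # incremental view kept by the speed layer

    def ingest(self, record):
        # Every new record goes to both layers.
        self.master.append(record)
        self.realtime_view[record["url"]] += 1  # incremental update

    def run_batch(self):
        # Recompute the batch view over the whole data set.
        self.batch_view = Counter(r["url"] for r in self.master)
        # Assumption: data now covered by the batch view leaves the speed layer.
        self.realtime_view = Counter()

    def query(self, url):
        # Merge the (older) batch result with the (recent) real time result.
        return self.batch_view[url] + self.realtime_view[url]


pipeline = LambdaPipeline()
pipeline.ingest({"url": "/home"})
pipeline.ingest({"url": "/home"})
pipeline.run_batch()
pipeline.ingest({"url": "/home"})  # arrives after the batch run
total = pipeline.query("/home")
```

In a real deployment the batch run takes hours and new data keeps arriving while it runs, so expiring the real time view needs more care than this reset suggests.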
Batch Layer - tools
● Hadoop
  – YARN: framework for job scheduling and cluster management
  – Map / Reduce: a way to process huge amounts of data in parallel, based on YARN
  – HDFS: distributed file system with high-throughput access to application data
  – And even more...
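To give a feel for the Map / Reduce model, here is a plain-Python, in-process imitation of its three phases. It does not use Hadoop or any Hadoop API. The function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical, and a real job would run the phases in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(record):
    # Emit one (key, value) pair per input record.
    yield (record["url"], 1)


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Combine all values of one key into a single result.
    return key, sum(values)


def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


view = run_job([{"url": "/home"}, {"url": "/about"}, {"url": "/home"}])
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the work can be split across many machines, which is what makes the model suitable for computing batch views.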
Serving Layer - tools
● ElephantDB
  – Read-only database, very small, very fast
● Here we need anything that has the same features
● Cloudera Impala
Speed Layer - tools
● Storm project
  – Very fast distributed computation system
● Apache HBase
● MongoDB
Query - tools
● Cloudera Impala
