Hadoop Architecture Approaches
Miraj Godha
June 5, 2015
Table of Contents

EXECUTIVE SUMMARY
Big data Classification
Hadoop-based architecture approaches
    Data Lake
    Lambda
    Choosing the correct architecture
Data Lake Architecture
    Generic Data lake Architecture
        Steps Involved
Lambda Architecture
    Batch Layer
    Serving layer
    Speed layer
    Generic Lambda Architecture
References
EXECUTIVE SUMMARY

Apache Hadoop didn't disrupt the datacenter; the data did. Soon after corporate IT functions within enterprises adopted large-scale systems to manage data, the Enterprise Data Warehouse (EDW) emerged as the logical home of all enterprise data. Today, every enterprise has a Data Warehouse that serves to model and capture the essence of the business from its enterprise systems. The explosion of new types of data in recent years – from inputs such as the web and connected devices, or just sheer volumes of records – has put tremendous pressure on the EDW. In response to this disruption, an increasing number of organizations have turned to Apache Hadoop to help manage the enormous increase in data whilst maintaining the coherence of the Data Warehouse. This POV discusses Apache Hadoop and its capabilities as a data platform and data-processing engine: how the core of Hadoop and its surrounding ecosystem meets the enterprise requirements to integrate alongside the Data Warehouse and other enterprise data systems as part of a modern data architecture – a step on the journey toward delivering an enterprise 'Data Lake' or a Lambda Architecture (immutable data + views).

An enterprise data lake provides the following core benefits to an enterprise: new efficiencies for data architecture through a significantly lower cost of storage and through optimization of data-processing workloads such as data transformation and integration; and new opportunities for the business through flexible 'schema-on-read' access to all enterprise data, and through multi-use, multi-workload data processing on the same sets of data, from batch to real time.

Apache Hadoop provides both reliable storage (HDFS) and a processing system (MapReduce) for large data sets across clusters of computers. MapReduce is a batch query processor targeted at long-running background processes. Hadoop can handle Volume. But to handle Velocity, we need real-time processing tools that can compensate for the high latency of batch systems and serve the most recent data continuously, as new data arrives and older data is progressively integrated into the batch framework. The answer to that problem is the Lambda Architecture.
Big data Classification

Big data workloads can be classified along six dimensions:

• Processing Type: Batch; Near Real Time; Real Time + Batch
• Processing Methodology: Descriptive; Diagnostic; Predictive; Prescriptive
• Data Frequency: On demand; Continuous; Real Time; Batch
• Data Type: Transactional; Historical; Master data; Metadata
• Content Format: Structured; Unstructured (images, text, videos, documents, emails, etc.); Semi-structured (XML, JSON)
• Data Sources: Machine generated; Web & social media; IoT; Human generated; Transactional data; Via other data providers
It's helpful to look at the characteristics of big data along certain lines — for example, how the data is collected, analyzed, and processed. Once the data and its processing are classified, they can be matched with the appropriate big data analysis architecture:

• Processing type - Whether the data is analyzed in real time or batched for later analysis. Give careful consideration to choosing the analysis type, since it affects several other decisions about products, tools, hardware, data sources, and expected data frequency. A mix of both types ('near real time' or 'micro batch') may also be required by the use case.
• Processing methodology - The type of technique to be applied for processing data (e.g., predictive, analytical, ad-hoc query, and reporting). Business requirements determine the appropriate processing methodology, and a combination of techniques can be used. The choice of processing methodology helps identify the appropriate tools and techniques to be used in your big data solution.
• Data frequency and size - How much data is expected and at what frequency it arrives. Knowing frequency and size helps determine the storage mechanism, storage format, and the necessary preprocessing tools. Data frequency and size depend on data sources:
  • On demand, as with social media data
  • Continuous feed, real-time (weather data, transactional data)
  • Time series (time-based data)
• Data type - Type of data to be processed — transactional, historical, master data, and others. Knowing the data type helps segregate the data in storage.
• Content format - Format of incoming data — structured (RDBMS, for example), unstructured (audio, video, and images, for example), or semi-structured. Format determines how the incoming data needs to be processed and is key to choosing tools and techniques and defining a solution from a business perspective.
• Data source - Sources of data (where the data is generated) — web and social media, machine-generated, human-generated, etc. Identifying all the data sources helps determine the scope from a business perspective.
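The classification is easiest to apply when it is written down explicitly per workload. Below is a minimal sketch in Python, with entirely hypothetical names and a deliberately crude rule, of capturing these six dimensions for a workload and deriving a first-cut architecture suggestion; the rule encoded here (simultaneous real-time + batch views point to Lambda) anticipates the comparison in the next section.

```python
# Hypothetical sketch: capture the six classification dimensions per workload.
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    processing_type: str   # "batch", "near-real-time", "real-time + batch"
    methodology: str       # "descriptive", "diagnostic", "predictive", "prescriptive"
    data_frequency: str    # "on-demand", "continuous", "time-series"
    data_type: str         # "transactional", "historical", "master", "metadata"
    content_format: str    # "structured", "semi-structured", "unstructured"
    data_source: str       # "machine", "web/social", "IoT", "human", "provider"

def suggest_architecture(profile: WorkloadProfile) -> str:
    # Crude rule of thumb: needing real-time and batch views at once
    # points to Lambda; everything else is well served by a data lake.
    if profile.processing_type == "real-time + batch":
        return "Lambda"
    return "Data Lake"

clickstream = WorkloadProfile("real-time + batch", "descriptive", "continuous",
                              "transactional", "semi-structured", "web/social")
print(suggest_architecture(clickstream))  # -> Lambda
```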
Hadoop-based architecture approaches

Data Lake
A data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop.

Lambda
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latency of MapReduce.
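To make "the two view outputs may be joined before presentation" concrete, here is a minimal, technology-agnostic sketch in plain Python (all names and numbers are illustrative): a precomputed batch view plus a small realtime delta, merged at query time.

```python
# The batch view holds results computed over the master dataset up to the
# last batch run; the realtime view holds only the delta since then.
batch_view = {"page_a": 10_000, "page_b": 7_500}   # from the batch layer
realtime_view = {"page_a": 42, "page_c": 7}        # delta from the speed layer

def query(page: str) -> int:
    # Join the two view outputs at presentation time.
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(query("page_a"))  # 10042: historical total plus the recent delta
```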
Choosing the correct architecture

Simultaneous access to real-time and batch data
  Data Lake: A data lake can use real-time processing technologies like Storm to return real-time results, but in such a scenario historical results cannot be made available. If technologies like Spark are used to process real-time and historical data together on request, response times to clients can be significantly longer than with a Lambda architecture.
  Lambda: The Lambda architecture's serving layer merges the output of the batch layer and the speed layer before returning the results of user queries. As data has already been processed into views at both layers, the response time is significantly lower.

Latency
  Data Lake: Latency is high compared to Lambda, as real-time data needs to be processed together with historical data on demand or as part of a batch.
  Lambda: Low-latency real-time results are produced by the speed layer, and batch results are pre-processed in the batch layer. On request the two result sets are simply merged, resulting in low latency for real-time processing.

Ease of data governance
  Data Lake: The term data lake was coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities.
  Lambda: The Lambda architecture's serving layer gives access to processed and analyzed data. As users get access to processed data directly, this can lead to top-down data governance issues.

Updates in source data
  Data Lake: As a data lake stores only raw data, updates are simply appended to the raw data, which makes it difficult for business users to write business logic that ensures the latest updated records are considered in calculations.
  Lambda: Batch views are always computed from scratch in the Lambda architecture. As a result, updates are easily incorporated into the computed views on each batch reprocessing cycle.

Fault tolerance against human errors
  Data Lake: Data scientists or business users running business logic on raw data in the data lake may make human errors. Recovering from those errors is not difficult, as it is just a matter of re-running the logic; however, the reprocessing time for large datasets can cause delays.
  Lambda: The Lambda architecture assures fault tolerance not only against hardware failures but also against human errors. Recomputing the views from the raw data in the batch layer on every cycle ensures that human errors in business logic are never cascaded to a point where they are unrecoverable.

Ease of use for business users
  Data Lake: Data is stored in raw format, with data definitions, and is sometimes groomed to make it digestible by data management tools. At times it is difficult for business users to use the data as-is.
  Lambda: Data is processed and served ready-made from the serving layer, which makes life easy for business users.

Accuracy of real-time results
  Data Lake: In every scenario, users accessing data from the data lake have access to the immutable raw data, so they can perform exact computations and always get accurate results.
  Lambda: In scenarios where real-time calculations need access to historical data, which the speed layer does not have, the Lambda architecture returns estimated results. For example, a mean value cannot be computed exactly unless the whole of the historical data and the real-time data are referenced in one pass; in such a scenario the serving layer returns an estimate.

Infrastructure cost
  Data Lake: A data lake processes data as and when needed, so the cluster cost can be much lower than with Lambda. Moreover, it persists only the raw data, whereas the Lambda architecture persists both the raw data and the processed data, which adds storage cost.
  Lambda: The Lambda data-processing life cycle is designed so that as soon as one batch cycle finishes, a new cycle starts that includes the recently ingested data. Simultaneously, the speed layer is continuously processing the real-time data, implying continuously high cluster utilization.

OLAP
  Data Lake: Unlike data marts, which are optimized for data analysis by storing only some attributes and dropping data below the level of aggregation, a data lake is designed to retain all attributes, especially when you do not yet know what the scope of the data or its use will be.
  Lambda: As Lambda exposes processed views from the serving layer, not all attributes of the data are necessarily available to data scientists for analytical queries.

Historical data reference for processing
  Data Lake: OLAP and OLTP queries access the raw or groomed data directly from the data lake, making it feasible to reference historical data while processing data for a given time interval.
  Lambda: The speed layer has no reference to the historical data stored in the batch layer, which makes it difficult to run queries that refer to historical data. For example, 'unique count' queries cannot return correct results from the speed layer alone. 'Average'-style calculations, however, can be done easily at the serving layer by combining the results returned from the speed and batch layers on the fly (see the sketch after this comparison).

Slowly changing dimensions
  Data Lake: Although the data lake has records of changed dimension attributes, extra business logic needs to be written by business users to cater for them.
  Lambda: The Lambda architecture can easily cater for slowly changing dimensions by creating surrogate keys alongside natural keys whenever a change in dimension attributes is detected during the batch layer's processing cycle.

Slowly changing facts
  Data Lake: In a data lake both versions of a fact are available for users to look at, which leads to good analytical results when the fact's life cycle is an attribute in the business logic for data analytics.
  Lambda: Although it is easy to change facts in the Lambda architecture, doing so loses the fact's life-cycle information. As the previous state of a slowly changing fact is not available to data scientists, analytical queries may not give the desired results on the views exposed by the serving layer.

Frequently changing business logic
  Data Lake: Changes must be made in the processing code, but there is no clear solution for how historically processed data should be handled.
  Lambda: As data is reprocessed from scratch every cycle, the historical-data problem is resolved automatically even if the business logic changes frequently.

Implementation life cycle
  Data Lake: A data lake is fast to implement, as it eliminates the dependency on upfront data modeling.
  Lambda: Processing logic needs to be implemented at both the batch and speed layers, leading to significantly longer implementation time compared to a data lake.

Adding new data sources
  Data Lake: Very easy to add.
  Lambda: New sources need to be incorporated into the processing layers and require code changes.
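The 'historical data reference' comparison above hinges on which aggregates can be merged across layers. The sketch below (plain Python, illustrative values) shows why an average merges exactly when each layer keeps partial sums and counts, while a unique count cannot be merged from two plain counts, because the layers may have seen overlapping users.

```python
# Partial state kept by each layer (illustrative numbers).
batch = {"sum": 9_000.0, "count": 900, "uniques": {"u1", "u2", "u3"}}
speed = {"sum": 110.0, "count": 10, "uniques": {"u3", "u4"}}

# Average: merging (sum, count) pairs gives the exact global mean.
mean = (batch["sum"] + speed["sum"]) / (batch["count"] + speed["count"])

# Unique count: adding two plain counts double-counts u3; an exact answer
# needs the underlying sets (or an approximate sketch such as HyperLogLog).
wrong = len(batch["uniques"]) + len(speed["uniques"])   # 5 (overcounts)
exact = len(batch["uniques"] | speed["uniques"])        # 4 (set union)

print(mean, wrong, exact)
```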
Data Lake Architecture

"If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
— James Dixon (Pentaho CTO)

Much of today's research and decision making is based on knowledge and insight that can be gained from analyzing and contextualizing the vast (and growing) amount of "open" or "raw" data. The concept that the large number of data sources available today facilitates analyses on combinations of heterogeneous information that would not be achievable via "siloed" data maintained in warehouses is very powerful. The term data lake has been coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities.

A data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop. Unlike traditional warehouses, the format of the data is not described (that is, its schema is not available) until the data is needed. By delaying the categorization of data from the point of entry to the point of use, analytical operations that transcend the rigid format of an adopted schema become possible. Query and search operations on the data can be performed using traditional database technologies (when structured), as well as via alternate means such as indexing and NoSQL derivatives. A minimal schema-on-read sketch appears after the lists below.

Key Features
• Stores raw data – single source of truth
• Data accessible to anyone authorized
• Polyglot persistence
• Supports multiple applications & workloads
• Low-cost, high-performance storage
• Flexible, easy-to-use data organization
• Self-service for end users
• More flexible to answer new questions
• Easy to add new data sources
• Loosely coupled architecture – enables flexibility of analysis
• Eliminates the dependency on upfront data modeling – thereby fast to implement
• Storage is highly optimized, as raw data is stored

Disadvantages
• High latency for a composite analysis view of both real-time and historical data
• Raw data lacks relational structure, which is unfriendly for on-the-fly business analytics
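As a concrete illustration of schema-on-read, here is a minimal sketch in plain Python (the file layout, field names, and helper are hypothetical): the raw JSON-lines data stays untouched in the lake, and each consumer applies its own schema only at the point of use.

```python
import json

def read_with_schema(path, schema):
    """Parse raw JSON-lines records, projecting and casting only the fields
    this particular consumer cares about (schema applied at read time,
    not at ingest time)."""
    with open(path) as f:
        for line in f:
            raw = json.loads(line)
            # Fields absent from a raw record simply come back as None.
            yield {field: (cast(raw[field]) if field in raw else None)
                   for field, cast in schema.items()}

# Two consumers, two schemas, one copy of the raw data in the lake.
clicks_schema = {"user_id": str, "ts": float}            # clickstream analysis
audit_schema = {"user_id": str, "ip": str, "ts": float}  # security audit

# for record in read_with_schema("lake/raw/clicks.jsonl", clicks_schema):
#     ...
```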
In a practical sense, a data lake is characterized by three key attributes:

• Collect everything: A data lake contains all data, both raw sources over extended periods of time as well as any processed data.
• Dive in anywhere: A data lake enables users across multiple business units to refine, explore and enrich data on their terms.
• Flexible access: A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.
Generic Data lake Architecture

[Figure: generic data lake architecture. Data sources (desktop & mobile, social media and cloud, operational systems, Internet of Things) feed an ingestion tier at real-time, micro-batch, and mega-batch frequencies. A unified data management tier (data management, data access, schematic metadata, data grooming) sits over HDFS storage holding raw and processed data, both structured and unstructured. A processing tier provides workflow management, in-memory processing, and MapReduce/Hive/MPP engines. A query interface exposes SQL, NoSQL, and external storage; centralized management covers system monitoring and system management. Outputs: real-time insights, interactive insights, batch insights, and flexible actions.]
Steps Involved
• Procuring data – The process of obtaining data and metadata and preparing them for eventual inclusion in a data lake.
• Obtaining data – Physically transferring the data from the source to the data lake.
• Describing data – A data scientist searching a data lake for useful data must be able to find the data relevant to his or her need, which requires metadata about the data. Schematic metadata for a data set includes information about how the data is formatted and about its schema (a sketch follows this list).
• Grooming data – Usually the raw data itself is made consumable by analytics applications; in some scenarios, however, a grooming process uses the schematic metadata to transform raw data into data that can be processed by standard data management tools.
• Provisioning data – The authentication and authorization policies by which consumers take data out of the data lake.
• Preserving data – Managing a data lake also requires attention to maintenance issues such as staleness, expiration, decommissioning and renewal.
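Below is a minimal sketch of what the schematic metadata recorded in the 'describing data' step might look like (all field names and values are illustrative, not a standard): enough for a data scientist to find the data set and parse it without opening the raw files.

```python
# Hypothetical metadata record for one data set in the lake.
dataset_metadata = {
    "name": "web_clickstream_2015_06",
    "source": "web servers",            # provenance, from the procurement step
    "format": "json-lines, gzip",       # how the data is formatted
    "schema": {                         # what the fields mean
        "user_id": "string",
        "url": "string",
        "ts": "unix epoch seconds (float)",
    },
    "groomed": False,                   # raw; no grooming applied yet
    "expires": "2017-06-05",            # input to the preservation step
}
```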
Lambda Architecture

The Lambda architecture is split into three layers: the batch layer, the serving layer, and the speed layer.

1. Batch layer (Apache Hadoop)
2. Serving layer (Cloudera Impala, Spark)
3. Speed layer (Storm, Spark, Apache HBase, Cassandra)

Key Features
• Low-latency simultaneous analysis of the (near) real-time information extracted from a continuous inflow of data, alongside persistent analysis of a massive volume of data
• Fault tolerant not only against hardware failure but against human error too
• Mistakes are corrected by re-computations
• Storage is highly optimized, as raw data is stored
Batch Layer

The batch layer is responsible for two things. The first is to store the immutable, constantly growing master dataset (HDFS), and the second is to compute arbitrary views from this dataset (MapReduce). Computing the views is a continuous operation: when new data arrives, it will be aggregated into the views when they are recomputed during the next MapReduce iteration.

The views should be computed from the entire dataset, and therefore the batch layer is not expected to update the views frequently. Depending on the size of your dataset and cluster, each iteration could take hours.

Serving layer

The output from the batch layer is a set of flat files containing the precomputed views. The serving layer is responsible for indexing and exposing the views so that they can be queried. However, the batch and serving layers alone do not satisfy any realtime requirement, because MapReduce is (by design) high latency and it can take a few hours for new data to be represented in the views and propagated to the serving layer. This is why we need the speed layer.

Speed layer

In essence the speed layer is the same as the batch layer in that it computes views from the data it receives. The speed layer is needed to compensate for the high latency of the batch layer, and it does this by computing realtime views in Storm. The realtime views contain only the delta results that supplement the batch views.

Whilst the batch layer is designed to continuously recompute the batch views from scratch, the speed layer uses an incremental model whereby the realtime views are incremented as and when new data is received. What's clever about the speed layer is that the realtime views are intended to be transient: as soon as the data propagates through the batch and serving layers, the corresponding results in the realtime views can be discarded. This is referred to as "complexity isolation", meaning that the most complex part of the architecture is pushed into the layer whose results are only temporary. A conceptual sketch of the two update models appears at the end of this section.

[Figure: timeline of batch and realtime views. Realtime views are discarded once the data they contain is represented in a batch view.]

Disadvantages
• Maintaining two copies of code that must produce the same result in two complex distributed systems
• Could return estimated or approximate results
• Expensive full recomputation is required for fault tolerance
• Requires high cluster up-time, as batch data needs to be processed continuously
• Requires more implementation time, as duplicate code needs to be written in separate technologies to process real-time and batch data
• Time taken to process a batch grows linearly with the volume of data accumulated in the master dataset
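The following is a conceptual sketch in plain Python (illustrative names; a real system would use HDFS, MapReduce and Storm as above) of the two update models: the batch layer recomputes its view from the entire immutable master dataset each cycle, while the speed layer increments a transient realtime view per event and discards it once a batch cycle has absorbed that data.

```python
from collections import defaultdict

master_dataset = []               # immutable, append-only (HDFS in practice)
batch_view = {}                   # recomputed from scratch each batch cycle
realtime_view = defaultdict(int)  # incremental and transient

def new_event(page):
    master_dataset.append(page)   # every event lands in the master dataset
    realtime_view[page] += 1      # and increments the realtime view

def run_batch_cycle():
    global batch_view
    # Recompute from the whole dataset: a human error in this logic is fixed
    # by correcting the code and letting the next cycle rerun from raw data.
    batch_view = {p: master_dataset.count(p) for p in set(master_dataset)}
    # Complexity isolation: deltas now covered by the batch view are dropped.
    # (Real systems keep overlapping realtime views to cover the batch run's
    # own duration; this sketch ignores that subtlety.)
    realtime_view.clear()

def query(page):
    return batch_view.get(page, 0) + realtime_view[page]

new_event("page_a"); new_event("page_a"); new_event("page_b")
run_batch_cycle()
new_event("page_a")               # arrives after the batch cycle
print(query("page_a"))            # 3: batch view (2) + realtime delta (1)
```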
Generic Lambda Architecture

[Figure: generic Lambda architecture. Incoming data streams are dispatched to both the batch layer and the speed layer. The batch layer stores all data (HDFS) and precomputes batch views and summarized data with MR/Hive/Pig; the speed layer (Storm or Spark) processes the streams and maintains incremental, near-real-time views via stream summarization. The serving layer (data management & access) holds the precomputed batch views alongside the real-time views, and queries merge the two at request time.]
References

• http://www.ibm.com/developerworks/library/bd-archpatterns1/
• http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper2.pdf
• https://en.wikipedia.org/wiki/Lambda_architecture
• http://voltdb.com/blog/simplifying-complex-lambda-architecture
• http://en.wiktionary.org/wiki/data_lake
