Agenda for Today
❖ Traditional Way of Processing
❖ Big Data Growth Drivers
❖ Problems Associated with Big Data
❖ Hadoop: Solution to the Big Data Problem
❖ What is Hadoop?
❖ HDFS
❖ MapReduce
❖ Hadoop Ecosystem
❖ Hadoop Case Study – Orbitz
Traditional Way of Processing
Hadoop Case Study – Sears (the traditional way)
Data flow: Instrumentation → Collection → storage-only grid (original raw data) → ETL compute grid → RDBMS (aggregated data, mostly append) → BI reports + interactive apps
Limitations:
▪ Premature data death: 90% of the ~2 PB of data is archived away
▪ Moving data to the compute doesn't scale
▪ The original high-fidelity raw data can't be explored
▪ A meagre 10% of the ~2 PB of data is available for BI
Hadoop Case Study – Orbitz Worldwide
▪ Users perform 1.5 million flight and 1 million hotel searches on Orbitz.com every day
▪ This produces 500 GB of log data per day, which flows through processing into the warehouse
Big Data Growth Drivers
Data Generated Every Minute!
IoT: 50 Billion Devices by 2020
▪ Rapid adoption of digital infrastructure: 5x faster than electricity & telephony
▪ Tablets, laptops, phones: roughly 6 connected things per person
▪ Plus sensors, smart objects, and clustered device systems
Chart: world population in billions — 2003: 6.307, 2008: 6.721, 2010: 6.894, 2015: 7.347, 2020: 7.83 — against smart objects growing toward 50 billion, with an inflection point where connected devices outnumber people
What is Big Data?
What is Big Data?
“23 exabytes of information were recorded and replicated in 2002. We now record and transfer that much information every 7 days.”
“Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.”
Problems Associated with Big Data
Big Data Problems
1. Storing huge and exponentially growing datasets:
▪ More data has been generated in the past 2 years than in all of previous human history
▪ By 2020, total digital data will grow to 44 zettabytes, or approximately 44 trillion gigabytes
▪ By 2020, about 1.7 MB of new information will be created every second for every person
Big Data Problems
2. Processing data having complex structures:
▪ Structured: organized data format; fixed data schema; e.g., RDBMS data
▪ Semi-structured: partially organized data; lacks the formal structure of a data model; e.g., XML & JSON files
▪ Unstructured: unorganized data with an unknown schema; e.g., multimedia files
Big Data Problems
3. Processing data faster:
➢ Data is growing at a much faster rate than disk read/write speeds (source: Tom's Hardware)
➢ Bringing huge amounts of data from the slave nodes to a single computation unit (the master) becomes a bottleneck
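As a rough back-of-the-envelope illustration (assuming a sustained throughput of about 100 MB/s per disk, a typical commodity-drive figure): scanning 1 TB sequentially takes roughly 1,000,000 MB ÷ 100 MB/s ≈ 10,000 s, i.e., close to 3 hours, whereas the same scan spread across 100 machines each reading its local chunk in parallel takes roughly 100 s. This is why Hadoop moves computation to the data rather than data to the computation.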
Before moving to the solution for the Big Data problems, let us first understand: what is a DFS?
DFS – Distributed File System
▪ Stores and manages data, i.e., files or folders, across multiple computers or servers
▪ Provides the abstraction of a single, large file system
Before DFS: Server1/accounts, Server2/finance, Server3/customer, Server4/reports
After DFS consolidation: Edureka/accounts, Edureka/finance, Edureka/customer, Edureka/reports
Why DFS?
Hadoop: Solution to the Big Data Problem
What is Hadoop?
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion across a Hadoop cluster (master + slaves):
▪ HDFS (Storage): allows us to dump any kind of data across the cluster
▪ MapReduce (Processing): allows parallel processing of the data stored in HDFS
Hadoop to the Rescue: How?
Hadoop to the Rescue
Problem 1: Storing huge and exponentially growing datasets
Solution: HDFS
▪ The storage unit of Hadoop
▪ A Distributed File System
▪ Divides files (input data) into smaller chunks and stores them across the cluster, e.g., a 512 MB file becomes four 128 MB blocks
▪ Scalable as per requirement
Hadoop to the Rescue
Problem 2: Storing unstructured data
Solution: HDFS
▪ Allows storing any kind of data: structured, semi-structured, or unstructured
▪ Follows WORM (Write Once, Read Many)
▪ No schema validation is done while dumping data
Hadoop to the Rescue
Problem 3: Processing data faster
Solution: Hadoop MapReduce
▪ Provides parallel processing of the data present in HDFS
▪ Processes data locally, i.e., each node works on the part of the data stored on it (illustratively, a file that takes 4 hrs. to process on one machine finishes in about 1 hr. when processed in parallel across four nodes)
Hadoop Core Components
Hadoop Distributed File System
Topics: HDFS Components | HDFS Architecture | HDFS Blocks | Rack Awareness | HDFS Read/Write Mechanism
HDFS Components
NameNode:
▪ Master daemon
▪ Maintains and manages the DataNodes
▪ Records metadata, e.g., the location of stored blocks, file sizes, permissions, hierarchy, etc.
▪ Receives heartbeats and block reports from all DataNodes
DataNode:
▪ Slave daemon
▪ Stores the actual data
▪ Serves read and write requests from clients
HDFS Architecture
Let Us Talk About: How Is Data Stored in HDFS?
HDFS Blocks
• Each file is stored in HDFS as blocks
• The default block size is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x)
• For example, a 248 MB file example.txt is stored in HDFS as one 128 MB block plus one 120 MB block
Question: How many blocks will be created if a file of size 514 MB is copied to HDFS?
Answer: 5 blocks — four full 128 MB blocks (512 MB) plus one 2 MB block; the last block occupies only as much disk space as the remaining data.
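A minimal sketch of the block-count arithmetic (plain Java, no Hadoop dependency; the 128 MB default is taken from the slide above):

```java
public class BlockCount {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // HDFS default in Hadoop 2.x

    // Number of HDFS blocks needed for a file: ceiling of fileSize / blockSize.
    static long blocksFor(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long fileSize = 514L * 1024 * 1024; // the 514 MB file from the question
        System.out.println(blocksFor(fileSize)); // prints 5
    }
}
```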
How Are Blocks Placed in HDFS?
Hadoop Architecture: Rack Awareness
The NameNode places block replicas across racks to balance fault tolerance against network bandwidth. With the default replication factor of 3, the first replica is placed on the writer's node (or a random node), the second on a node in a different rack, and the third on a different node within that same remote rack — no DataNode holds more than one replica of a block, and no rack holds more than two.
HDFS Read/Write Mechanism
HDFS Write Mechanism – Pipeline Setup: the client asks the NameNode for the target DataNodes for a block and chains them into a write pipeline.
HDFS Write Mechanism – Writing a Block: the client streams the block to the first DataNode, which forwards it to the second, which forwards it to the third.
HDFS Write Mechanism – Acknowledgement: once each replica is written, acknowledgements flow back through the pipeline to the client.
HDFS Multi-Block Write Mechanism
Replicas of a given block are written in series through the pipeline, but different blocks are written in parallel:
For Block A: 1A -> 2A -> 3A -> 4A
For Block B: 1B -> 2B -> 3B -> 4B -> 5B -> 6B
HDFS Read Mechanism: the client fetches the block locations from the NameNode and then reads the blocks directly from the nearest DataNodes, in parallel.
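A minimal sketch of writing and reading a file through the HDFS Java client (org.apache.hadoop.fs.FileSystem); the NameNode address and file path are placeholders:

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/edureka/example.txt"); // placeholder path

            // Write: the client streams data; HDFS splits it into blocks
            // and replicates each block through the DataNode pipeline.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client gets the block locations from the NameNode
            // and pulls the bytes directly from the DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```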
Hadoop YARN (MapReduce 2.0)
Topics: Anatomy of MapReduce | MapReduce Example | YARN Components | MapReduce Job Workflow
Anatomy of MapReduce: input splits feed map tasks that emit intermediate (key, value) pairs; the framework shuffles and sorts these by key; reduce tasks then aggregate each key's values into the final output.
MapReduce Example: the classic word count — the map phase tokenizes each input line into (word, 1) pairs, and the reduce phase sums the counts for each word, as sketched below.
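A minimal sketch of the word-count job using the Hadoop MapReduce Java API (mapper and reducer only; a driver that submits the job is shown after the job workflow below):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts that the shuffle grouped under each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```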
YARN Components
ResourceManager:
▪ Cluster-level resource manager
▪ Long-lived; runs on high-quality hardware
NodeManager:
▪ One per DataNode
▪ Monitors resources on the DataNode
ApplicationMaster:
▪ One per application
▪ Short-lived
▪ Coordinates and manages MapReduce jobs
▪ Negotiates with the ResourceManager to schedule tasks
Container:
▪ Created by the NodeManager when requested
▪ Allocates a certain amount of resources (memory, CPU, etc.) on a slave node
MapReduce Job Workflow
MapReduce Application Workflow (diagram):
1. The MR code in the client JVM runs the job
2. The job is submitted to the ResourceManager on the RM node
3. The client gets an application ID
4.1/4.2 The RM has a NodeManager start a container and launches the ApplicationMaster JVM in it
5. The AM allocates resources from the RM
6.1/6.2 NodeManagers start task containers and launch YARN child JVMs in them
7. The MR tasks run in the YARN child JVMs
MapReduce Application Workflow
Execution sequence:
1. Client submits an application
2. RM allocates a container to start the AM
3. AM registers with the RM
4. AM asks the RM for containers
5. AM notifies the NMs to launch the containers
6. Application code is executed in the containers
7. Client contacts the RM/AM to monitor the application's status
8. AM unregisters with the RM
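A minimal sketch of a driver that kicks off this sequence (step 1) for the word-count classes above; the input and output paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/edureka/input"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/user/edureka/output")); // placeholder

        // Submits the job to the ResourceManager (steps 1-2) and blocks
        // until the ApplicationMaster reports completion (step 7).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```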
Hadoop Ecosystem
Hadoop Case Study – Orbitz Worldwide
Hadoop Case Study – Orbitz Worldwide
Challenges:
▪ The existing data infrastructure could not store and process the data generated by users every day: 500 GB of log data per day from 1.5 million flight and 1 million hotel searches
▪ Enhancing the existing infrastructure was very expensive
Hadoop Case Study – Orbitz Worldwide
Requirement:
▪ An efficient, long-term storage system that can store any kind of data
▪ An analytical tool for making important business decisions
▪ Cost-effective
Solution: Apache Hadoop
▪ Open-source framework used to store and process huge data sets
▪ Easily scalable as per the need
▪ Comes with various analytical tools
What is Apache Hive?
Hadoop Case Study – Orbitz Worldwide
Apache Hive is a data-warehousing tool on top of Hadoop that lets you perform analytics on huge datasets using HiveQL, a query language very similar to SQL. Hive compiles HiveQL queries into MapReduce jobs that run over data stored in HDFS.
Example: SELECT name FROM cust_details; run against a table with rows (John, 24), (Mike, 37), (Ashley, 29) returns the name column: John, Mike, Ashley.
Hadoop Case Study – Orbitz Worldwide
Pipeline: (1) large amounts of unstructured log data are generated every day → (2) HDFS stores it, since it can hold any type of data → (3–4) MapReduce processes the data in parallel and outputs structured data → (5–6) the structured data is queried with Hive Query Language → (7) fancy queries analyze a hotel's position in the search bar using the log data → (8) the result is an analytical report
Hadoop Case Study – Orbitz Worldwide
Hadoop deployment: local log data is loaded into the Hadoop cluster, MapReduce processes the uncleaned data, and the cleaned output becomes the source for Apache Hive; the analyst queries Hive and receives the results.
Let us see an example to understand the Hadoop & Hive implementation at Orbitz.
Hadoop Case Study – Orbitz Worldwide
Types of Website Logs
1. Impression List:
▪ Contains the ranking of each hotel in the search bar, along with the session ID of the visitor who clicked on it
▪ Format: (session_id, hotel_id, position, rate)
Hadoop Case Study – Orbitz Worldwide
2. WebTrends Log:
▪ Contains the details of customers who booked a hotel on the website
▪ Format: (session_id, visitors_ip, hotel_id, booking_date, number_of_guests, booking_time)
Hadoop Case Study – Orbitz Worldwide
Query for analyzing a hotel's position in the search bar on the website
Flow: website log (uncleaned) → MapReduce (data cleaning) → HDFS → Hive (analytics)
Steps:
▪ Clean the website log data using MapReduce
▪ Load the cleaned data into Hive
▪ Compare the ranking of a hotel in the search list with its booking frequency using a Hive query, as sketched below
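A minimal sketch of that final step, assuming the cleaned logs have been loaded into Hive tables named impression_list and webtrends_log with the column layouts shown above (the table names, JDBC URL, and credentials are hypothetical; the Hive JDBC driver must be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HotelPositionAnalysis {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, and database are placeholders.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement()) {

            // Join impressions to bookings on session and hotel, then count
            // bookings per search-bar position to compare rank vs. bookings.
            ResultSet rs = stmt.executeQuery(
                "SELECT i.position, COUNT(*) AS bookings " +
                "FROM impression_list i " +
                "JOIN webtrends_log w " +
                "  ON i.session_id = w.session_id AND i.hotel_id = w.hotel_id " +
                "GROUP BY i.position " +
                "ORDER BY i.position");

            while (rs.next()) {
                System.out.println(rs.getInt("position") + "\t" + rs.getLong("bookings"));
            }
        }
    }
}
```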
Hadoop Case Study – Orbitz Worldwide
Accomplishments with Hadoop:
▪ The performance of the previous methodology can now be compared with the Hadoop implementation
▪ Months' worth of data is archived easily
▪ Extracting and processing the logs took 109 m 14 s with the earlier process, versus only 25 m 58 s with MapReduce
▪ Various metrics for analytics, previously tedious to compute, are now derived easily
Thank You…
Questions/Queries/Feedback
