Hadoop Tutorial
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Big Data Use Cases
Big Data Use Cases: 1. US Primary Election Analysis  2. Market Analysis for a US Cab Start-Up
Use Case 1: US Primary Election Analysis
US Election: STEP 1: Primary & Caucuses → STEP 2: National Conventions → STEP 3: General Elections → STEP 4: Electoral College
US Primary Election PROBLEM STATEMENT: In the 2016 US primary election, Hillary Clinton was nominated over Bernie Sanders by the Democratic Party, while Donald Trump was nominated by the Republican Party to contest the presidency. As an analyst, you have been tasked with understanding the demographic factors that led to the wins of Hillary Clinton and Donald Trump in the primaries, so that their next initiatives and campaigns can be planned.
US Primary Election Datasets: as a data analyst you have two datasets available, the US Primary Election dataset and the US County Demographic Features (county-wise) dataset.
US Primary Election Dataset fields:
• state: US state name
• state_abbreviation: abbreviation of the US state
• county: county name within the state
• fips: Federal Information Processing Standards (FIPS) county code, which uniquely identifies each county
• party: party contesting the primary (Republican or Democrat)
• candidate: candidate contesting the primary for that party
• votes: number of votes gained by the candidate
• fraction_votes: votes gained by the candidate divided by the total votes gained by the party
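A minimal Spark (Scala) sketch of loading this dataset and summarising votes per candidate; the HDFS path and file name (primary_results.csv) are assumptions for illustration, not taken from the deck:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("PrimaryElection").getOrCreate()

// Assumed HDFS location of the primary election CSV
val primary = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///user/edureka/primary_results.csv")

primary.printSchema()

// Total votes per candidate across all counties, highest first
primary.groupBy("party", "candidate")
  .agg(sum("votes").as("total_votes"))
  .orderBy(desc("total_votes"))
  .show()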
US County Demographic Features Dataset, details include: population, 2014 estimate; population, 2010 (April 1) estimates base; population percent change, April 1, 2010 to July 1, 2014; population, 2010; persons under 5 years, percent, 2014; persons under 18 years, percent, 2014; persons 65 years and over, percent, 2014; female persons, percent, 2014; white alone, percent, 2014; …
US Election Solution Strategy: 1. US Primary Election dataset → 2. Store the data in HDFS → 3. Process the data using Spark components → 4. Transform the data using Spark SQL → 5. Cluster the data using Spark MLlib (K-Means) → 6. Visualize the result using Zeppelin
Visualization of Result
Use Case 2: Market Analysis for US Cab Start-Ups
Market Analysis for US Cab Start-Ups PROBLEM STATEMENT: A US cab-service start-up wants to meet demand optimally and maximize profit. They have hired you as a data analyst to interpret the available Uber dataset and find the beehive (high-density) customer pick-up points and peak hours, so that demand can be met profitably.
Uber Dataset:
• Date/Time – pickup date & time
• Lat – latitude of the pickup
• Lon – longitude of the pickup
• Base – TLC base code
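A minimal Spark (Scala) sketch of loading this dataset and deriving the pickup hour for peak-hour analysis; the HDFS path, file name and timestamp format are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("UberPickups").getOrCreate()

// Assumed HDFS path; column names match the slide (Date/Time, Lat, Lon, Base)
val uber = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///user/edureka/uber_pickups.csv")

// Timestamp format is assumed; derive the pickup hour to spot peak hours
val pickups = uber
  .withColumn("pickup_ts", to_timestamp(col("Date/Time"), "M/d/yyyy H:mm:ss"))
  .withColumn("hour", hour(col("pickup_ts")))

pickups.groupBy("hour").count().orderBy(desc("count")).show(24)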
Market Analysis for US Cab Start-Ups Solution Strategy: 1. Uber pick-up locations dataset (Lat, Lon) → 2. Store the data in HDFS → 3. Transform the dataset → 4. K-Means clustering on latitude & longitude → Predictions (a sketch of the clustering step follows below)
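A minimal Spark MLlib (Scala) sketch of step 4, clustering pickups by latitude and longitude; it assumes the pickups DataFrame from the previous sketch, and k = 8 is an arbitrary illustrative choice:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

// Assemble Lat/Lon into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("Lat", "Lon"))
  .setOutputCol("features")
val features = assembler.transform(pickups)

// k = 8 candidate pick-up hubs is an assumption; tune k for the real data
val kmeans = new KMeans().setK(8).setSeed(1L).setFeaturesCol("features")
val model = kmeans.fit(features)

// The cluster centres are the candidate "beehive" pick-up points
model.clusterCenters.foreach(println)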
Let Us Know What It Takes…
Fundamentals Road Map: Introduction to Hadoop & Spark → HDFS (Hadoop Storage) → YARN (Hadoop Processing) → Apache Spark → K-Means & Zeppelin → Solution of Use Cases
Introduction to Hadoop & Spark
Introduction to Hadoop & Spark
Hadoop is a framework that allows you to store and process large data sets in a parallel and distributed fashion.
❖ Hadoop has two core components:
▪ HDFS: lets you dump any kind of data across the cluster
▪ YARN: allows parallel processing of the data stored in HDFS
Apache Spark is an open-source cluster-computing framework for real-time processing.
❖ Provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
❖ Runs on top of YARN and extends the YARN model to efficiently use more types of computations
Spark Complementing Hadoop
Spark & Hadoop, challenges addressed: faster analytics, cost optimization, avoiding duplication.
1. Spark processes data up to 100 times faster than MapReduce
2. Spark applications can run on YARN, leveraging the Hadoop cluster
3. Apache Spark can use HDFS as its storage
Combining Spark's strengths (high processing speed, advanced analytics and broad integration support) with Hadoop's low-cost operation on commodity hardware gives the best results.
Big Data Use Cases
Big Data Use Cases Solution Architecture: store big data on HDFS and process it through the YARN framework; tools used for processing include MapReduce, Apache Hive, Apache Spark and Kafka.
Introduction to Hadoop & Spark → HDFS (Hadoop Storage)
HDFS
❖ HDFS stands for Hadoop Distributed File System
❖ HDFS is the storage unit of Hadoop
❖ HDFS creates an abstraction layer over the distributed storage resources (NameNode, Secondary NameNode and DataNodes), so the whole of HDFS can be seen as a single unit
NameNode
• Master daemon
• Maintains and manages the DataNodes
• Records metadata, e.g. the location of stored blocks, file sizes, permissions, hierarchy, etc.
• Receives heartbeats and block reports from all the DataNodes
Secondary NameNode
• Checkpointing is the process of combining the edit logs with the FsImage
• Allows faster failover, as we have a backup of the metadata
• Checkpointing happens periodically (default: 1 hour)
DataNode
• Slave daemons
• Store the actual data
• Serve read and write requests
HDFS Architecture in Detail (diagram: clients issue metadata ops to the NameNode, which records metadata such as file name and replica count, e.g. /hdfs/foo/data, 3, …, and drives block ops and replication; clients read and write blocks directly to DataNodes spread across Rack 1 and Rack 2)
HDFS Block & Replication
HDFS Data Block
• Each file is stored on HDFS as blocks
• The default size of each block is 128 MB
• Example: a 380 MB file example.txt is split into Block 1 (128 MB), Block 2 (128 MB) and Block 3 (124 MB)
• How many blocks will be created if a file of size 500 MB is copied to HDFS? Four: 128 MB + 128 MB + 128 MB + 116 MB
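A tiny Scala sketch of the block-split arithmetic above, using the default 128 MB block size; the helper name is purely illustrative:

// Split a file of the given size (in MB) into HDFS-style blocks of up to 128 MB
def hdfsBlocks(fileSizeMb: Int, blockSizeMb: Int = 128): Seq[Int] = {
  val fullBlocks = Seq.fill(fileSizeMb / blockSizeMb)(blockSizeMb)
  val remainder = fileSizeMb % blockSizeMb
  if (remainder > 0) fullBlocks :+ remainder else fullBlocks
}

println(hdfsBlocks(380)) // 128, 128, 124 -> 3 blocks
println(hdfsBlocks(500)) // 128, 128, 128, 116 -> 4 blocks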
HDFS Block Replication
• Each data block is replicated (three times by default, i.e. replication factor = 3) and the replicas are distributed across different DataNodes
• Example: a 248 MB file is split into Block 1 (128 MB) and Block 2 (120 MB), and each block is replicated across the DataNodes
Rack Awareness
• The rack awareness algorithm reduces latency and provides fault tolerance when placing block replicas
• It states that the first replica of a block is stored on the local rack and the next two replicas are stored on a different (remote) rack
Rack Awareness (diagram: replica placement across Rack 1, Rack 2 and Rack 3)
HDFS Fault Tolerance
If a DataNode fails, its data blocks can be recovered and retrieved from the replicas stored on the other DataNodes (replication factor = 3).
Start Hadoop Daemons
1. ./sbin/start-all.sh : starts all the Hadoop daemons (HDFS & YARN)
2. ./sbin/stop-all.sh : stops all the Hadoop daemons
3. jps : lists the daemons running on your machine
Writing & Deleting a File in Hadoop
1. hdfs dfs -put /test.txt / : copies a file from the local file system to HDFS
2. hdfs dfs -ls / : lists the HDFS files/directories
3. hdfs dfs -rm /test.txt : deletes the file from HDFS
Introduction to Hadoop & Spark → HDFS (Hadoop Storage) → YARN (Hadoop Processing)
What is YARN?
• Hadoop 2.0 introduced a new framework, YARN (Yet Another Resource Negotiator), which provides the ability to run non-MapReduce applications
• It provides a paradigm for parallel processing over Hadoop
• The YARN framework is responsible for integrating different tools, such as Spark, Hive and Pig, with Hadoop
ResourceManager
• Receives the processing requests
• Passes the requests to the corresponding NodeManagers
NodeManager
• Installed on every DataNode
• Responsible for the execution of tasks on every DataNode
YARN Architecture in Detail
YARN Workflow
Application Submission in YARN (client mode): 1. Run job → 2. Submit job to the ResourceManager → 3. Get an application ID → 4. The ResourceManager has a NodeManager start a container and launch the AppMaster → 5. The AppMaster asks the ResourceManager to allocate resources → 6. A NodeManager starts a container and launches the YARN child task JVM → 7. The MR task executes
YARN Application Workflow
1. Client submits an application
2. RM allocates a container to start the AM
3. AM registers with the RM
4. AM asks the RM for containers
5. AM notifies the NM to launch the containers
6. Application code is executed in the containers
7. Client contacts the RM/AM to monitor the application's status
8. AM unregisters with the RM
Hadoop Cluster Architecture = HDFS + YARN
Hadoop Cluster Architecture (diagram: master and slave daemons of HDFS and YARN)
Hadoop Cluster Hardware Specification
Hadoop Cluster Hardware Specification
• NameNode: 64 GB RAM, 1 TB hard disk, Xeon processor with 8 cores, 3 x 10 Gb/s Ethernet, 64-bit CentOS/Linux, redundant power supply
• Secondary NameNode: 32 GB RAM, 1 TB hard disk, Xeon processor with 4 cores, 3 x 10 Gb/s Ethernet, 64-bit CentOS/Linux, redundant power supply
• DataNode: 16 GB RAM, 6 x 2 TB hard disks, Xeon processor with 2 cores, 3 x 10 Gb/s Ethernet, 64-bit CentOS/Linux
Real-Time Hadoop Cluster Deployment
Hadoop Cluster: Facebook Use Case
• 21 PB of storage in a single HDFS cluster
• 2000 machines per cluster, 12 TB of data per machine
• 1200 machines with 8 cores each + 800 machines with 16 cores each
• 32 GB of RAM and 15 MapReduce tasks per machine
• In total, more than 21 PB of configured storage capacity, larger than the previously known Yahoo! cluster of 14 PB
Hadoop Cluster: Spotify Use Case
• Uses Hadoop for generating music recommendations
• 1650-node cluster with ~65 PB of storage and 70 TB of RAM
• 43,000 virtualised cores
• 25,000+ daily Hadoop jobs
Introduction to Hadoop & Spark → HDFS (Hadoop Storage) → YARN (Hadoop Processing) → Apache Spark
Spark Core Components
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:
▪ Memory management and fault recovery
▪ Scheduling, distributing and monitoring jobs on a cluster
▪ Interacting with storage systems
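A minimal Scala sketch of Spark Core in action, creating a session and letting the engine schedule a simple distributed computation; the app name and the local master are assumptions for trying it on one machine (on a cluster this would typically be YARN):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkCoreSketch")
  .master("local[*]") // local mode only for illustration
  .getOrCreate()
val sc = spark.sparkContext

// Distribute a small collection and let Spark Core schedule the work across 8 partitions
val rdd = sc.parallelize(1 to 1000000, numSlices = 8)
val sumOfSquares = rdd.map(x => x.toLong * x).reduce(_ + _)
println(s"Sum of squares: $sumOfSquares")

spark.stop()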
Spark Architecture
Spark SQL
• Spark SQL integrates relational processing with Spark's functional programming
• It supports various data sources (e.g. CSV, JSON and JDBC via the Data Source API) and makes it possible to weave SQL queries with code transformations through the DataFrame API, the DataFrame DSL, and Spark SQL/HQL
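A minimal Scala sketch of weaving SQL with code transformations; the file name (county_facts.csv) and the column names are assumptions, not taken from the deck:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSqlSketch").getOrCreate()

// Read a CSV data source into a DataFrame (header and schema inference assumed)
val counties = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///user/edureka/county_facts.csv")

// Register it as a temporary view and mix SQL with DataFrame transformations
counties.createOrReplaceTempView("counties")
val big = spark.sql("SELECT state, county, population FROM counties WHERE population > 1000000")
big.orderBy(big("population").desc).show(10)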
Start Spark Daemons
1. ./sbin/start-all.sh : starts all the Spark daemons (Master & Worker)
2. jps : lists the daemons running on your machine
3. ./bin/spark-shell : starts the Spark shell
Introduction to Hadoop & Spark → HDFS (Hadoop Storage) → YARN (Hadoop Processing) → Apache Spark → K-Means & Zeppelin
K-Means Clustering
The process by which objects are classified into a predefined number of groups, so that objects in different groups are as dissimilar as possible while objects within a group are as similar as possible (e.g. a total population partitioned into Group 1, Group 2, Group 3 and Group 4).
▪ The objects in group 1 should be as similar to each other as possible
▪ But there should be a clear difference between an object in group 1 and an object in group 2
▪ The attributes of the objects determine which objects should be grouped together
K-Means Clustering
▪ Consider a comparison of gross monthly income vs. current balance (each ranging from low to high):
Example Cluster 1: high balance, low income
Example Cluster 2: high income, low balance
The objects in Cluster 1 share similar characteristics (high balance and low income), and likewise the objects in Cluster 2 share the same characteristics (high income and low balance).
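A small Spark MLlib (Scala) sketch of this two-feature example; it assumes an existing SparkSession (e.g. the spark-shell), and the income/balance values are made up purely for illustration:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

// Made-up (income, balance) points forming two natural groups
val df = spark.createDataFrame(Seq(
  (1, Vectors.dense(9000.0, 500.0)),  // high income, low balance
  (2, Vectors.dense(9500.0, 700.0)),
  (3, Vectors.dense(8800.0, 400.0)),
  (4, Vectors.dense(2000.0, 9000.0)), // low income, high balance
  (5, Vectors.dense(1800.0, 9500.0)),
  (6, Vectors.dense(2200.0, 8700.0))
)).toDF("id", "features")

// k = 2: one cluster per group in the slide's example
val model = new KMeans().setK(2).setSeed(42L).fit(df)
model.clusterCenters.foreach(println) // two centres, one per cluster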
Example
▪ The plot of students in an area is given below. I need to find specific locations to build schools in this area so that the students don't have to travel far.
Example
▪ Using k-means clustering, we get the output shown: the students grouped into clusters, whose centres are candidate school locations.
Apache Zeppelin
What is Zeppelin?
▪ A completely open, web-based notebook that enables interactive data analytics
▪ It brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop & Spark
▪ Various languages are supported via Zeppelin language interpreters
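A sketch of what a Zeppelin paragraph might look like when exploring a Spark DataFrame through the %spark (Scala) interpreter; the DataFrame and HDFS path are assumptions:

%spark
// This paragraph runs in Zeppelin's Spark (Scala) interpreter.
// z is the ZeppelinContext; z.show renders a DataFrame with Zeppelin's built-in charting.
val results = spark.read
  .option("header", "true")
  .csv("hdfs:///user/edureka/primary_results.csv") // assumed path
z.show(results.limit(100))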
Introduction to Hadoop & Spark → HDFS (Hadoop Storage) → YARN (Hadoop Processing) → Apache Spark → K-Means & Zeppelin → Solution of Use-case 1
US County Solution: store the data in HDFS → analyze the data using Spark SQL → visualize the data using Zeppelin
US Election Solution Strategy: 1. US Primary Election dataset → 2. Store the data in HDFS → 3. Process the data using Spark components → 4. Transform the data using Scala & Spark SQL → 5. Cluster the data using Spark MLlib (K-Means) → 6. Visualize the result using Zeppelin (a sketch of this pipeline follows below)
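A condensed Spark (Scala) sketch of this strategy; the file names, join on the FIPS code, cluster count and the demographic column names are assumptions made for illustration, not the deck's actual solution code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

val spark = SparkSession.builder().appName("UsPrimaryElection").getOrCreate()

// 1-2. Both datasets are assumed to already sit in HDFS at these (assumed) paths
val primary = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("hdfs:///user/edureka/primary_results.csv")
val facts = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("hdfs:///user/edureka/county_facts.csv")

// 3-4. Transform with Spark SQL: join each candidate's county result with county demographics on the FIPS code
primary.createOrReplaceTempView("primary")
facts.createOrReplaceTempView("facts")
val joined = spark.sql(
  "SELECT p.candidate, p.fraction_votes, f.* FROM primary p JOIN facts f ON p.fips = f.fips")

// 5. Cluster counties with K-Means on a few demographic columns (these column names are placeholders)
val assembler = new VectorAssembler()
  .setInputCols(Array("median_income", "pct_white", "pct_college"))
  .setOutputCol("features")
val model = new KMeans().setK(5).setSeed(7L).fit(assembler.transform(joined))

// 6. Cluster assignments and centres would then be visualized in a Zeppelin notebook
model.clusterCenters.foreach(println)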
Introduction to Hadoop & Spark → HDFS (Hadoop Storage) → YARN (Hadoop Processing) → Apache Spark → K-Means & Zeppelin → Solution of Use-case 2
Market Analysis for US Cab Start-Ups Solution Strategy: 1. Uber pick-up locations dataset (Lat, Lon) → 2. Store the data in HDFS → 3. Transform the dataset → 4. K-Means clustering on latitude & longitude → Predictions
Edureka LMS
LMS: Getting Started
LMS: Pre-Recorded Session
LMS: Course Content
LMS: Projects