Hadoop Tutorial
Copyright © 2017, edureka and/or its affiliates. All rights reserved.
Big Data Use Cases
Big Data Use Cases: 1. US Primary Election Analysis  2. Market Analysis for a US Cab Start-Up
Use Case 1: US Primary Election Analysis
US Election: STEP 1: Primary & Caucuses → STEP 2: National Conventions → STEP 3: General Elections → STEP 4: Electoral College
US Primary Election PROBLEM STATEMENT: In the 2016 US primary election, Hillary Clinton was nominated over Bernie Sanders by the Democratic Party, while Donald Trump was nominated by the Republican Party to contest the presidency. As an analyst, you have been tasked with understanding the demographic factors that led to the wins of Hillary Clinton and Donald Trump in the primaries, so that their next initiatives and campaigns can be planned.
US Primary Election Datasets: as a data analyst you have two datasets available, the US Primary Election dataset and the US County Demographic Features (county-wise) dataset.
US Primary Election Dataset fields:
• state: US state name
• state_abbreviation: abbreviation of the US state
• county: county name within the state
• fips: Federal Information Processing Standards (FIPS) county code, which uniquely identifies each county
• party: party contesting the primary (Republican or Democrat)
• candidate: candidate contesting the primary for that party
• votes: number of votes gained by the candidate
• fraction_votes: votes gained by the candidate divided by the total votes gained by the party
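A minimal Spark (Scala) sketch of loading this dataset and summarising votes per candidate; the HDFS path and file name (primary_results.csv) are assumptions for illustration, not taken from the deck:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("PrimaryElection").getOrCreate()

// Assumed HDFS location of the primary election CSV
val primary = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///user/edureka/primary_results.csv")

primary.printSchema()

// Total votes per candidate across all counties, highest first
primary.groupBy("party", "candidate")
  .agg(sum("votes").as("total_votes"))
  .orderBy(desc("total_votes"))
  .show()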
US County Demographic Features Dataset, details include: population, 2014 estimate; population, 2010 (April 1) estimates base; population percent change, April 1, 2010 to July 1, 2014; population, 2010; persons under 5 years, percent, 2014; persons under 18 years, percent, 2014; persons 65 years and over, percent, 2014; female persons, percent, 2014; white alone, percent, 2014; …
US Election Solution Strategy: 1. US Primary Election dataset → 2. Store the data in HDFS → 3. Process the data using Spark components → 4. Transform the data using Spark SQL → 5. Cluster the data using Spark MLlib (K-Means) → 6. Visualize the result using Zeppelin
Visualization of Result
Use Case 2: Market Analysis for US Cab Start-Ups
Market Analysis for US Cab Start-Ups PROBLEM STATEMENT: A US cab-service start-up wants to meet demand optimally and maximize profit. They have hired you as a data analyst to interpret the available Uber dataset and find the beehive (high-density) customer pick-up points and peak hours, so that demand can be met profitably.
Uber Dataset:
• Date/Time – pickup date & time
• Lat – latitude of the pickup
• Lon – longitude of the pickup
• Base – TLC base code
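A minimal Spark (Scala) sketch of loading this dataset and deriving the pickup hour for peak-hour analysis; the HDFS path, file name and timestamp format are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("UberPickups").getOrCreate()

// Assumed HDFS path; column names match the slide (Date/Time, Lat, Lon, Base)
val uber = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///user/edureka/uber_pickups.csv")

// Timestamp format is assumed; derive the pickup hour to spot peak hours
val pickups = uber
  .withColumn("pickup_ts", to_timestamp(col("Date/Time"), "M/d/yyyy H:mm:ss"))
  .withColumn("hour", hour(col("pickup_ts")))

pickups.groupBy("hour").count().orderBy(desc("count")).show(24)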
Market Analysis for US Cab Start-Ups Solution Strategy: 1. Uber pick-up locations dataset (Lat, Lon) → 2. Store the data in HDFS → 3. Transform the dataset → 4. K-Means clustering on latitude & longitude → Predictions (a sketch of the clustering step follows below)
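A minimal Spark MLlib (Scala) sketch of step 4, clustering pickups by latitude and longitude; it assumes the pickups DataFrame from the previous sketch, and k = 8 is an arbitrary illustrative choice:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

// Assemble Lat/Lon into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("Lat", "Lon"))
  .setOutputCol("features")
val features = assembler.transform(pickups)

// k = 8 candidate pick-up hubs is an assumption; tune k for the real data
val kmeans = new KMeans().setK(8).setSeed(1L).setFeaturesCol("features")
val model = kmeans.fit(features)

// The cluster centres are the candidate "beehive" pick-up points
model.clusterCenters.foreach(println)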
Let Us Know What It Takes…
Fundamentals Road Map: Introduction to Hadoop & Spark → HDFS (Hadoop Storage) → YARN (Hadoop Processing) → Apache Spark → K-Means & Zeppelin → Solution of Use Cases
Introduction to Hadoop & Spark
Introduction to Hadoop & Spark
Hadoop is a framework that allows you to store and process large data sets in a parallel and distributed fashion.
❖ Hadoop has two core components:
▪ HDFS: lets you dump any kind of data across the cluster
▪ YARN: allows parallel processing of the data stored in HDFS
Apache Spark is an open-source cluster-computing framework for real-time processing.
❖ Provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
❖ Runs on top of YARN and extends the YARN model to efficiently use more types of computations
Spark Complementing Hadoop
Spark & Hadoop, challenges addressed: faster analytics, cost optimization, avoiding duplication.
1. Spark processes data up to 100 times faster than MapReduce
2. Spark applications can run on YARN, leveraging the Hadoop cluster
3. Apache Spark can use HDFS as its storage
Combining Spark's strengths (high processing speed, advanced analytics and broad integration support) with Hadoop's low-cost operation on commodity hardware gives the best results.
Big Data Use Cases
Big Data Use Cases Solution Architecture: store big data on HDFS and process it through the YARN framework; tools used for processing include MapReduce, Apache Hive, Apache Spark and Kafka.
Introduction to Hadoop & Spark → HDFS (Hadoop Storage)
HDFS
❖ HDFS stands for Hadoop Distributed File System
❖ HDFS is the storage unit of Hadoop
❖ HDFS creates an abstraction layer over the distributed storage resources (NameNode, Secondary NameNode and DataNodes), so the whole of HDFS can be seen as a single unit
NameNode
• Master daemon
• Maintains and manages the DataNodes
• Records metadata, e.g. the location of stored blocks, file sizes, permissions, hierarchy, etc.
• Receives heartbeats and block reports from all the DataNodes
Secondary NameNode
• Checkpointing is the process of combining the edit logs with the FsImage
• Allows faster failover, as we have a backup of the metadata
• Checkpointing happens periodically (default: 1 hour)
DataNode
• Slave daemons
• Store the actual data
• Serve read and write requests
HDFS Architecture in Detail (diagram: clients issue metadata ops to the NameNode, which records metadata such as file name and replica count, e.g. /hdfs/foo/data, 3, …, and drives block ops and replication; clients read and write blocks directly to DataNodes spread across Rack 1 and Rack 2)
HDFS Block & Replication
HDFS Data Block
• Each file is stored on HDFS as blocks
• The default size of each block is 128 MB
• Example: a 380 MB file example.txt is split into Block 1 (128 MB), Block 2 (128 MB) and Block 3 (124 MB)
• How many blocks will be created if a file of size 500 MB is copied to HDFS? Four: 128 MB + 128 MB + 128 MB + 116 MB
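A tiny Scala sketch of the block-split arithmetic above, using the default 128 MB block size; the helper name is purely illustrative:

// Split a file of the given size (in MB) into HDFS-style blocks of up to 128 MB
def hdfsBlocks(fileSizeMb: Int, blockSizeMb: Int = 128): Seq[Int] = {
  val fullBlocks = Seq.fill(fileSizeMb / blockSizeMb)(blockSizeMb)
  val remainder = fileSizeMb % blockSizeMb
  if (remainder > 0) fullBlocks :+ remainder else fullBlocks
}

println(hdfsBlocks(380)) // 128, 128, 124 -> 3 blocks
println(hdfsBlocks(500)) // 128, 128, 128, 116 -> 4 blocks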
HDFS Block Replication
• Each data block is replicated (three times by default, i.e. replication factor = 3) and the replicas are distributed across different DataNodes
• Example: a 248 MB file is split into Block 1 (128 MB) and Block 2 (120 MB), and each block is replicated across the DataNodes
Rack Awareness
• The rack awareness algorithm reduces latency and provides fault tolerance when placing block replicas
• It states that the first replica of a block is stored on the local rack and the next two replicas are stored on a different (remote) rack
Rack Awareness (diagram: replica placement across Rack 1, Rack 2 and Rack 3)
HDFS Fault Tolerance
If a DataNode fails, its data blocks can be recovered and retrieved from the replicas stored on the other DataNodes (replication factor = 3).
Start Hadoop Daemons
1. ./sbin/start-all.sh : starts all the Hadoop daemons (HDFS & YARN)
2. ./sbin/stop-all.sh : stops all the Hadoop daemons
3. jps : lists the daemons running on your machine
Writing & Deleting a File in Hadoop
1. hdfs dfs -put /test.txt / : copies a file from the local file system to HDFS
2. hdfs dfs -ls / : lists the HDFS files/directories
3. hdfs dfs -rm /test.txt : deletes the file from HDFS
Introduction to Hadoop & Spark → HDFS (Hadoop Storage) → YARN (Hadoop Processing)
What is YARN?
• Hadoop 2.0 introduced a new framework, YARN (Yet Another Resource Negotiator), which provides the ability to run non-MapReduce applications
• It provides a paradigm for parallel processing over Hadoop
• The YARN framework is responsible for integrating different tools, such as Spark, Hive and Pig, with Hadoop
ResourceManager
• Receives the processing requests
• Passes the requests to the corresponding NodeManagers
NodeManager
• Installed on every DataNode
• Responsible for the execution of tasks on every DataNode
YARN Architecture in Detail
YARN Workflow
Application Submission in YARN (client mode): 1. Run job → 2. Submit job to the ResourceManager → 3. Get an application ID → 4. The ResourceManager has a NodeManager start a container and launch the AppMaster → 5. The AppMaster asks the ResourceManager to allocate resources → 6. A NodeManager starts a container and launches the YARN child task JVM → 7. The MR task executes
YARN Application Workflow
1. Client submits an application
2. RM allocates a container to start the AM
3. AM registers with the RM
4. AM asks the RM for containers
5. AM notifies the NM to launch the containers
6. Application code is executed in the containers
7. Client contacts the RM/AM to monitor the application's status
8. AM unregisters with the RM
Hadoop Cluster Architecture = HDFS + YARN
Hadoop Cluster Architecture (diagram: master and slave daemons of HDFS and YARN)
Hadoop Cluster Hardware Specification
Hadoop Cluster Hardware Specification
• NameNode: 64 GB RAM, 1 TB hard disk, Xeon processor with 8 cores, 3 x 10 Gb/s Ethernet, 64-bit CentOS/Linux, redundant power supply
• Secondary NameNode: 32 GB RAM, 1 TB hard disk, Xeon processor with 4 cores, 3 x 10 Gb/s Ethernet, 64-bit CentOS/Linux, redundant power supply
• DataNode: 16 GB RAM, 6 x 2 TB hard disks, Xeon processor with 2 cores, 3 x 10 Gb/s Ethernet, 64-bit CentOS/Linux
Real-Time Hadoop Cluster Deployment
Hadoop Cluster: Facebook Use Case
• 21 PB of storage in a single HDFS cluster
• 2000 machines per cluster, 12 TB of data per machine
• 1200 machines with 8 cores each + 800 machines with 16 cores each
• 32 GB of RAM and 15 MapReduce tasks per machine
• In total, more than 21 PB of configured storage capacity, larger than the previously known Yahoo! cluster of 14 PB
Hadoop Cluster: Spotify Use Case
• Uses Hadoop for generating music recommendations
• 1650-node cluster with ~65 PB of storage and 70 TB of RAM
• 43,000 virtualised cores
• 25,000+ daily Hadoop jobs
Introduction to Hadoop & Spark → HDFS (Hadoop Storage) → YARN (Hadoop Processing) → Apache Spark
Spark Core Components
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:
▪ Memory management and fault recovery
▪ Scheduling, distributing and monitoring jobs on a cluster
▪ Interacting with storage systems
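A minimal Scala sketch of Spark Core in action, creating a session and letting the engine schedule a simple distributed computation; the app name and the local master are assumptions for trying it on one machine (on a cluster this would typically be YARN):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkCoreSketch")
  .master("local[*]") // local mode only for illustration
  .getOrCreate()
val sc = spark.sparkContext

// Distribute a small collection and let Spark Core schedule the work across 8 partitions
val rdd = sc.parallelize(1 to 1000000, numSlices = 8)
val sumOfSquares = rdd.map(x => x.toLong * x).reduce(_ + _)
println(s"Sum of squares: $sumOfSquares")

spark.stop()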
Spark Architecture
Spark SQL
• Spark SQL integrates relational processing with Spark's functional programming
• It supports various data sources (e.g. CSV, JSON and JDBC via the Data Source API) and makes it possible to weave SQL queries with code transformations through the DataFrame API, the DataFrame DSL, and Spark SQL/HQL
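A minimal Scala sketch of weaving SQL with code transformations; the file name (county_facts.csv) and the column names are assumptions, not taken from the deck:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSqlSketch").getOrCreate()

// Read a CSV data source into a DataFrame (header and schema inference assumed)
val counties = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///user/edureka/county_facts.csv")

// Register it as a temporary view and mix SQL with DataFrame transformations
counties.createOrReplaceTempView("counties")
val big = spark.sql("SELECT state, county, population FROM counties WHERE population > 1000000")
big.orderBy(big("population").desc).show(10)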
Start Spark Daemons
1. ./sbin/start-all.sh : starts all the Spark daemons (Master & Worker)
2. jps : lists the daemons running on your machine
3. ./bin/spark-shell : starts the Spark shell
Introduction to Hadoop & Spark → HDFS (Hadoop Storage) → YARN (Hadoop Processing) → Apache Spark → K-Means & Zeppelin
K-Means Clustering
The process by which objects are classified into a predefined number of groups, so that objects in different groups are as dissimilar as possible while objects within a group are as similar as possible (e.g. a total population partitioned into Group 1, Group 2, Group 3 and Group 4).
▪ The objects in group 1 should be as similar to each other as possible
▪ But there should be a clear difference between an object in group 1 and an object in group 2
▪ The attributes of the objects determine which objects should be grouped together
K-Means Clustering
▪ Consider a comparison of gross monthly income vs. current balance (each ranging from low to high):
Example Cluster 1: high balance, low income
Example Cluster 2: high income, low balance
The objects in Cluster 1 share similar characteristics (high balance and low income), and likewise the objects in Cluster 2 share the same characteristics (high income and low balance).
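A small Spark MLlib (Scala) sketch of this two-feature example; it assumes an existing SparkSession (e.g. the spark-shell), and the income/balance values are made up purely for illustration:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

// Made-up (income, balance) points forming two natural groups
val df = spark.createDataFrame(Seq(
  (1, Vectors.dense(9000.0, 500.0)),  // high income, low balance
  (2, Vectors.dense(9500.0, 700.0)),
  (3, Vectors.dense(8800.0, 400.0)),
  (4, Vectors.dense(2000.0, 9000.0)), // low income, high balance
  (5, Vectors.dense(1800.0, 9500.0)),
  (6, Vectors.dense(2200.0, 8700.0))
)).toDF("id", "features")

// k = 2: one cluster per group in the slide's example
val model = new KMeans().setK(2).setSeed(42L).fit(df)
model.clusterCenters.foreach(println) // two centres, one per cluster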
Example
▪ The plot of students in an area is given below. I need to find specific locations to build schools in this area so that the students don't have to travel far.
Example
▪ Using k-means clustering, we get the output shown: the students grouped into clusters, whose centres are candidate school locations.
Apache Zeppelin
What is Zeppelin?
▪ A completely open, web-based notebook that enables interactive data analytics
▪ It brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop & Spark
▪ Various languages are supported via Zeppelin language interpreters
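A sketch of what a Zeppelin paragraph might look like when exploring a Spark DataFrame through the %spark (Scala) interpreter; the DataFrame and HDFS path are assumptions:

%spark
// This paragraph runs in Zeppelin's Spark (Scala) interpreter.
// z is the ZeppelinContext; z.show renders a DataFrame with Zeppelin's built-in charting.
val results = spark.read
  .option("header", "true")
  .csv("hdfs:///user/edureka/primary_results.csv") // assumed path
z.show(results.limit(100))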
Introduction to Hadoop & Spark → HDFS (Hadoop Storage) → YARN (Hadoop Processing) → Apache Spark → K-Means & Zeppelin → Solution of Use-case 1
US County Solution: store the data in HDFS → analyze the data using Spark SQL → visualize the data using Zeppelin
US Election Solution Strategy: 1. US Primary Election dataset → 2. Store the data in HDFS → 3. Process the data using Spark components → 4. Transform the data using Scala & Spark SQL → 5. Cluster the data using Spark MLlib (K-Means) → 6. Visualize the result using Zeppelin (a sketch of this pipeline follows below)
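A condensed Spark (Scala) sketch of this strategy; the file names, join on the FIPS code, cluster count and the demographic column names are assumptions made for illustration, not the deck's actual solution code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

val spark = SparkSession.builder().appName("UsPrimaryElection").getOrCreate()

// 1-2. Both datasets are assumed to already sit in HDFS at these (assumed) paths
val primary = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("hdfs:///user/edureka/primary_results.csv")
val facts = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("hdfs:///user/edureka/county_facts.csv")

// 3-4. Transform with Spark SQL: join each candidate's county result with county demographics on the FIPS code
primary.createOrReplaceTempView("primary")
facts.createOrReplaceTempView("facts")
val joined = spark.sql(
  "SELECT p.candidate, p.fraction_votes, f.* FROM primary p JOIN facts f ON p.fips = f.fips")

// 5. Cluster counties with K-Means on a few demographic columns (these column names are placeholders)
val assembler = new VectorAssembler()
  .setInputCols(Array("median_income", "pct_white", "pct_college"))
  .setOutputCol("features")
val model = new KMeans().setK(5).setSeed(7L).fit(assembler.transform(joined))

// 6. Cluster assignments and centres would then be visualized in a Zeppelin notebook
model.clusterCenters.foreach(println)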
Introduction to Hadoop & Spark → HDFS (Hadoop Storage) → YARN (Hadoop Processing) → Apache Spark → K-Means & Zeppelin → Solution of Use-case 2
Market Analysis for US Cab Start-Ups Solution Strategy: 1. Uber pick-up locations dataset (Lat, Lon) → 2. Store the data in HDFS → 3. Transform the dataset → 4. K-Means clustering on latitude & longitude → Predictions
Edureka LMS
LMS: Getting Started
LMS: Pre-Recorded Session
LMS: Course Content
LMS: Projects