www.edureka.co/big-data-and-hadoop | EDUREKA HADOOP CERTIFICATION TRAINING

Agenda
1. 5 V's of Big Data
2. Problems with Big Data
3. Hadoop as a Solution
4. What is Hadoop?
5. HDFS
6. YARN
7. MapReduce
8. Hadoop Ecosystem
5 V's of Big Data
• Volume: data is being generated at an alarming rate
• Variety: different kinds of data are being generated from various sources
• Velocity: data is being generated at an accelerating speed
• Value: a mechanism to bring the correct meaning out of the data
• Veracity: uncertainty and inconsistencies in the data
Problems with Big Data Processing
Problems with Big Data
• Storing huge and exponentially growing datasets
• Processing data having complex structure (structured, unstructured, semi-structured)
• Bringing a huge amount of data to the computation unit becomes a bottleneck
So for this Big Data problem statement, Hadoop emerged as a solution.

What is Hadoop?
Hadoop
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
• HDFS (Storage): allows us to dump any kind of data across the cluster
• MapReduce (Processing): allows parallel processing of the data stored in HDFS
Hadoop as a Solution
1. Storing exponentially growing huge datasets: HDFS, the storage unit of Hadoop, is a Distributed File System
2. Storing unstructured data: HDFS allows us to store any kind of data, be it structured, semi-structured or unstructured
3. Processing data faster: Hadoop provides parallel processing of the data present in HDFS, and allows processing data locally, i.e. each node works with the part of the data stored on it
Hadoop Distributed File System (HDFS)
HDFS
▪ Storage unit of Hadoop
▪ Distributed File System: a Master Node (NameNode) and Slave Nodes (DataNodes)
▪ Divides files (input data) into smaller chunks and stores them across the cluster
▪ Scales horizontally as per requirement
▪ Stores any kind of data
▪ No schema validation is done while dumping data
HDFS Blocks
• HDFS stores data in the form of blocks (e.g. file.xml is split into 128 MB blocks that move to the HDFS cluster)
• The block size can be configured based on requirements
Note: The default block size is 128 MB
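The splitting described above can be sketched in plain Python. This is an illustration, not Hadoop code; the 128 MB constant mirrors the default block size mentioned in the note (configurable in Hadoop via `dfs.blocksize`):

```python
# Illustrative sketch: how a file is carved into fixed-size HDFS blocks.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of the given size occupies."""
    blocks = []
    remaining = file_size_bytes
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block;
# the last block only takes as much space as it needs.
sizes = split_into_blocks(300 * 1024 * 1024)
print([s // (1024 * 1024) for s in sizes])  # [128, 128, 44]
```

Note that the final block is smaller than 128 MB: HDFS does not pad the last block of a file to the full block size.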
NameNode
• Master daemon
• Maintains and manages the DataNodes
• Records metadata, e.g. location of the blocks stored, the size of the files, permissions, hierarchy, etc.
• Receives heartbeats and block reports from all the DataNodes
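The last two bullets can be made concrete with a toy registry. This is a conceptual sketch, not Hadoop source: the class name, the timeout value, and the data layout are all invented for illustration of how heartbeats and block reports let a master track block locations and liveness:

```python
# Toy NameNode-like registry: block reports fill in block locations,
# heartbeats keep a DataNode "live", silence marks it dead.
HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative, not the Hadoop default

class NameNodeSketch:
    def __init__(self):
        self.block_locations = {}  # block id -> set of DataNode ids
        self.last_heartbeat = {}   # DataNode id -> last heartbeat time

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode, block_ids):
        for block in block_ids:
            self.block_locations.setdefault(block, set()).add(datanode)

    def live_datanodes(self, now):
        return {d for d, t in self.last_heartbeat.items()
                if now - t < HEARTBEAT_TIMEOUT}

nn = NameNodeSketch()
nn.heartbeat("dn1", now=0.0)
nn.heartbeat("dn2", now=0.0)
nn.block_report("dn1", ["blk_1"])
print(nn.live_datanodes(now=5.0))   # both alive: {'dn1', 'dn2'} (set order varies)
nn.heartbeat("dn1", now=15.0)
print(nn.live_datanodes(now=20.0))  # dn2 went silent: {'dn1'}
```

In real HDFS the NameNode uses missed heartbeats to declare a DataNode dead and then re-replicates the blocks it held to the remaining live nodes.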
DataNode
▪ Slave daemons
▪ Store the actual data
▪ Serve read and write requests
Secondary NameNode
• Checkpointing is the process of combining the edit logs with the FsImage
• Allows faster failover, as we have a backup of the metadata
• Checkpointing happens periodically (default: 1 hour)
Hadoop Distributed File System: checkpointing flow
• First time: the fsImage is copied from the NameNode to the Secondary NameNode
• During a checkpoint: the editLog is copied to the Secondary NameNode and merged with the fsImage into a final FsImage, while the NameNode temporarily records incoming changes in a new editLog
• The final FsImage is copied back to the NameNode, which continues with the new editLog
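The merge step of checkpointing can be sketched with a toy metadata model. This is an assumption-laden illustration, not Hadoop's on-disk format: here the fsImage is simply a dict snapshot of the namespace and the edit log is a list of operations:

```python
# Sketch of checkpointing: replay the edit log on top of the fsImage
# snapshot to produce a new fsImage, then start with an empty edit log.
def apply_edit(fsimage, edit):
    op, path = edit[0], edit[1]
    if op == "create":
        fsimage[path] = edit[2]   # e.g. the file's size in MB
    elif op == "delete":
        fsimage.pop(path, None)
    return fsimage

def checkpoint(fsimage, edit_log):
    """Merge the edit log into the fsImage (what the Secondary NameNode
    does) and return the new fsImage plus a fresh, empty edit log."""
    new_image = dict(fsimage)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []

image = {"/data/a.txt": 128}
edits = [("create", "/data/b.txt", 64), ("delete", "/data/a.txt")]
image, edits = checkpoint(image, edits)
print(image)  # {'/data/b.txt': 64}
```

The payoff is the one stated on the slide: after a crash, the NameNode only has to replay the short post-checkpoint edit log instead of every edit since startup, so failover is faster.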
YARN (Yet Another Resource Negotiator)
YARN
ResourceManager
• Receives the processing requests
• Passes the parts of the requests to the corresponding NodeManagers
NodeManager
• Installed on every DataNode
• Responsible for the execution of tasks on every single DataNode
YARN Architecture
• The client submits jobs to the ResourceManager; NodeManagers report node status, and ApplicationMasters send resource requests and MapReduce status back to it
• The ResourceManager has two components: the Scheduler and the ApplicationsManager
• Each NodeManager hosts two components: ApplicationMasters and Containers
YARN Architecture (contd.)
ApplicationsManager
• Accepts the job submissions
• Negotiates the container for executing the application-specific ApplicationMaster, and monitors its progress
ApplicationMaster
• ApplicationMasters are daemons which reside on the DataNodes
• Communicates with the containers for the execution of tasks on each DataNode
Hadoop Architecture: the Bigger Picture
Hadoop Architecture
• HDFS layer: NameNode, Secondary NameNode and DataNodes
• YARN layer: ResourceManager, NodeManagers (each running Containers and App Masters) and a JobHistory Server
• Each slave machine runs both a DataNode (storage) and a NodeManager (processing), so tasks run where the data lives
MapReduce
MapReduce
MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.
MapReduce Job Workflow (word count example)
• Input: IND ENG AUS NZ / AUS IND SL NZ / NZ ENG AUS IND
• Splitting: the input is divided into splits: [IND, ENG, AUS, NZ], [AUS, IND, SL, NZ], [NZ, ENG, AUS, IND]
• Mapping: each word is emitted as a (key, 1) pair, e.g. (IND, 1), (ENG, 1), (AUS, 1), (NZ, 1), ...
• Shuffling: pairs are grouped by key: IND, (1, 1, 1); ENG, (1, 1); AUS, (1, 1, 1); NZ, (1, 1, 1); SL, (1)
• Reducing: the values for each key are summed: IND, 3; ENG, 2; AUS, 3; NZ, 3; SL, 1
• Final result: IND: 3, ENG: 2, AUS: 3, NZ: 3, SL: 1
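The workflow above can be mimicked in a few lines of plain Python. This is a single-process sketch, not the Hadoop MapReduce API: in a real job the map tasks run in parallel across the cluster and the framework performs the shuffle, but the per-phase logic is the same:

```python
# Word count as explicit map / shuffle / reduce phases.
from collections import defaultdict

def map_phase(split):
    """Mapping: emit a (word, 1) pair for every word in the split."""
    return [(word, 1) for word in split]

def shuffle(mapped):
    """Shuffling: group all emitted pairs by key."""
    grouped = defaultdict(list)
    for word, one in mapped:
        grouped[word].append(one)
    return grouped

def reduce_phase(grouped):
    """Reducing: sum the grouped values for each key."""
    return {word: sum(ones) for word, ones in grouped.items()}

splits = [["IND", "ENG", "AUS", "NZ"],
          ["AUS", "IND", "SL", "NZ"],
          ["NZ", "ENG", "AUS", "IND"]]
mapped = [pair for split in splits for pair in map_phase(split)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'IND': 3, 'ENG': 2, 'AUS': 3, 'NZ': 3, 'SL': 1}
```

Because every (word, 1) pair with the same key ends up in the same group, each reduce call is independent, which is what lets Hadoop run reducers in parallel on different nodes.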
Hadoop Ecosystem

What Is Hadoop | Hadoop Tutorial For Beginners | Edureka