Introduction to Hadoop
• Apache initiated the project to develop a framework for storing and processing Big Data. • Doug Cutting and Michael J. Cafarella, the creators, named the framework Hadoop. • It consists of two components: • one for storing data in blocks across the cluster, and • the other for computation at each individual cluster node, in parallel with the others.
• Hadoop components are written in Java, with parts of the native code in C. • The command-line utilities are written in shell scripts. • Hadoop is a computing environment in which input data is stored, processed, and the results stored back. • The complete system consists of a scalable, distributed set of clusters. • The infrastructure consists of cloud-based clusters. • The Hadoop platform provides a low-cost Big Data platform, which is open source and can use cloud services.
• Hadoop enables distributed processing of large datasets (above 10 million bytes) across clusters of computers using a programming model called MapReduce. • The system characteristics are: scalable, self-manageable, self-healing, and a distributed file system. • Scalable: can be scaled up by adding storage and processing units as per requirements. • Self-manageable: storage and processing resources are created, used, scheduled, and reduced or increased by the system itself. • Self-healing: in case of faults, they are taken care of by the system itself.
Hadoop core components
• Hadoop Common: this module contains the libraries and utilities required by the other Hadoop modules. • Hadoop Common provides various components and interfaces for the distributed file system and general input/output. This includes serialization, Java RPC and file-based data structures. • HDFS: a Java-based distributed file system that can store all kinds of data on the disks of the cluster. • MapReduce v1: the software programming model in Hadoop using Mapper and Reducer. v1 processes large sets of data in parallel and in batches. • YARN (Yet Another Resource Negotiator): software for managing computing resources (scheduling and handling resource requests).
Features of Hadoop • Fault-efficient, scalable, flexible and modular design. • The system provides high scalability: servers handle larger data by adding new nodes. • Hadoop is helpful in storing, managing, processing and analysing Big Data. • The modular design makes the system flexible. • One can add or replace components with ease. • Modularity allows replacing a component with a different software tool.
• Robust design of HDFS • Execution of Big Data applications continues even when an individual server or cluster node fails. • This is because Hadoop provisions for backup (each block is replicated at least three times).
• Store and process Big Data: processes Big Data of 3V characteristics. • Distributed cluster computing model with data locality: • Processes Big Data at high speed as the application tasks and sub-tasks are submitted to the DataNodes (servers). • One can achieve more computing power by increasing the number of computing nodes. • The processing splits across multiple DataNodes, which gives fast processing and aggregated results.
• Hardware fault-tolerant: • A fault does not affect data and application processing. • If a node goes down, another node takes care of the remaining work. • This is because multiple copies of all data blocks are replicated automatically.
• Open-source framework: • Open-source access and cloud services enable large data stores. • Hadoop uses a cluster of multiple inexpensive servers or the cloud. • Java and Linux based: • Hadoop uses Java interfaces. • Hadoop runs on Linux but has its own set of supporting shell commands.
Hadoop ecosystem components
• The four layers are as follows: • Distributed storage layer. • Resource-manager layer for scheduling and executing job or application sub-tasks. • Processing-framework layer, consisting of Mapper and Reducer for the MapReduce process flow. • APIs at the application-support layer.
Hadoop streaming • HDFS with the MapReduce- and YARN-based system enables parallel processing of large datasets. • Spark provides in-memory processing of data, thus improving the processing speed. • Flink is emerging as a powerful tool; it improves overall performance as it provides a single run-time for streaming as well as batch processing.
Hadoop pipes • Hadoop Pipes is the C++ interface to MapReduce. • Pipes: data streams into the system at the Mapper input, and aggregated results flow out at the outputs. • Pipes does not use standard I/O when communicating with the Mapper and Reducer code.
Hadoop Distributed File System • HDFS is a core component of Hadoop. • HDFS is designed to run on a cluster of computers and servers at cloud-based utility services. • HDFS stores data in a distributed manner in order to compute fast. • HDFS stores data in any format, regardless of schema.
HDFS Data storage • The Hadoop data-store concept implies storing the data across a number of clusters. • Each cluster has a number of data stores, called racks. • Each rack holds a number of DataNodes. • Each DataNode has a large number of data blocks. • The racks distribute across a cluster. • The nodes have processing and storage capabilities. • The nodes hold the data in data blocks to run the application tasks. • The data blocks replicate by default on at least three DataNodes, in the same or remote racks. • The data block default size is 64 MB.
• Hadoop HDFS features are as follows: • Create, append, delete, rename and attribute-modification functions. • The content of an individual file cannot be modified or replaced, but it can be appended with new data at the end of the file. • Write once, but read many times during usage and processing. • The average file size can be more than 500 MB.
Hadoop cluster example Consider data storage for university students. Each student's data, stuData, is in a file of size less than 64 MB. A data block stores the full file data for a student stuData_idN, where N = 1 to 500. 1) How will the files of each student be distributed in a Hadoop cluster? How many students' data can be stored in one cluster? Assume that each rack has two DataNodes, each with 64 GB of memory. Assume that the cluster consists of 120 racks and thus 240 DataNodes. 2) What is the total memory capacity of the cluster in TB and of the DataNodes in each rack? 3) Show the distributed blocks for students with ID = 96 and ID = 1025. Assume the default replication factor of 3. 4) What will change when a stuData file size is <= 128 MB?
Solution Q.1 • The data block default size is 64 MB. Each student's file size is less than 64 MB. Therefore, one data block suffices for each student file. • A data block resides in a DataNode. Assume for simplicity that each rack has two DataNodes, each of memory capacity 64 GB. • Each node can thus store 64 GB / 64 MB = 1024 data blocks = 1024 student files. • Each rack can thus store 2 x 64 GB / 64 MB = 2048 data blocks = 2048 student files. Each data block replicates three times by default among the DataNodes. • Therefore, the number of students whose data can be stored in the cluster = (number of racks x number of files per rack) / 3 = 120 x 2048 / 3 = 81920. • Therefore, a maximum of 81920 stuData_idN files can be distributed per cluster, with N = 1 to 81920.
Solution Q.2 • Total memory capacity of the cluster = 120 racks x 128 GB = 15360 GB = 15 TB. • Total memory capacity of each DataNode in a rack = 1024 x 64 MB = 64 GB.
Solution Q.3 • Each of the two files, stuData_id96 and stuData_id1025, occupies one data block, and each block replicates on three DataNodes in the same or remote racks, giving three copies of each student's block distributed across the cluster.
Solution Q.4 • With a file size of up to 128 MB, each student file needs two data blocks, so each node can hold half the number of student files.
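The capacity arithmetic in Solutions Q.1 and Q.2 can be reproduced with a few lines of code. Below is a minimal sketch, assuming the figures stated in the example (64 MB blocks, 64 GB per DataNode, 2 DataNodes per rack, 120 racks, replication factor 3); the class and variable names are illustrative only.

```java
public class ClusterCapacity {
    public static void main(String[] args) {
        long blockSizeMB = 64;          // default HDFS block size in the example
        long nodeCapacityGB = 64;       // memory capacity per DataNode
        int nodesPerRack = 2;
        int racks = 120;
        int replication = 3;

        long blocksPerNode = (nodeCapacityGB * 1024) / blockSizeMB;      // 1024
        long blocksPerRack = nodesPerRack * blocksPerNode;               // 2048
        long studentFiles = (racks * blocksPerRack) / replication;       // 81920

        long clusterCapacityGB = racks * nodesPerRack * nodeCapacityGB;  // 15360 GB = 15 TB

        System.out.println("Blocks per node       : " + blocksPerNode);
        System.out.println("Blocks per rack       : " + blocksPerRack);
        System.out.println("Student files (rep 3) : " + studentFiles);
        System.out.println("Cluster capacity (GB) : " + clusterCapacityGB);
    }
}
```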
Hadoop Physical Organisation
• HDFS uses NameNodes and DataNodes. • A NameNode stores the files' metadata. Metadata gives information about the files of the user applications. • A DataNode stores the actual data files in the data blocks. • A few nodes in a Hadoop cluster act as NameNodes. • These nodes are termed MasterNodes, or simply masters.
MasterNodes • The masters have a different configuration, supporting high DRAM and processing power. • The masters have much less local storage. DataNodes: • The majority of the nodes in a Hadoop cluster act as DataNodes and TaskTrackers. • These nodes are referred to as slave nodes, or slaves. • The slaves have lots of disk storage and a moderate amount of processing capability and DRAM. • Slaves are responsible for storing the data and processing the computation tasks submitted by the clients.
• Clients, as the users, run applications with the help of the Hadoop ecosystem projects. • E.g. Hive, Mahout and Pig are ecosystem projects. • In small to medium sized clusters, a single MasterNode provides HDFS, MapReduce and HBase using threads.
• A few nodes in a Hadoop cluster act as NameNodes. • Nodes are termed MasterNodes or SlaveNodes. • MasterNode: • Fundamentally plays the role of a coordinator. • Receives client connections, maintains the description of the global file-system namespace and the allocation of file blocks. • It also monitors the state of the system in order to detect any failure.
• The masters consist of three components: NameNode, SecondaryNode and JobTracker. • The NameNode stores all the file-system related information, such as: • which part of the cluster a file section is stored in, • the last access time of each file, • user permissions, i.e. which user has access to the file. • SecondaryNode: an alternate for the NameNode. • The secondary node keeps a copy of the NameNode metadata. • The stored metadata can thus be rebuilt easily in case of NameNode failure. • JobTracker: • coordinates the parallel processing of data. • ZooKeeper: functions as a centralized repository for distributed applications. • ZooKeeper provides synchronization, serialization and coordination services. • It enables a distributed system to function as a single unit.
HDFS Commands • -mkdir: $ hdfs dfs -mkdir /user/stu_files_dir • creates the directory named stu_files_dir. • -put: $ hdfs dfs -put stuData_id96 /user/stu_files_dir • copies the file for the student with id 96 into stu_files_dir. • -ls: assume all files in the directory are to be listed: $ hdfs dfs -ls /user/stu_files_dir • -cp: assume stuData_id96 is to be copied from stu_files_dir to the new students' directory newstu_files_dir: $ hdfs dfs -cp /user/stu_files_dir/stuData_id96 /user/newstu_files_dir
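The same operations can also be performed programmatically through Hadoop's Java FileSystem API. Below is a minimal sketch, assuming the default configuration on the classpath points at the cluster; the paths simply mirror the shell examples above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStudentFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -mkdir /user/stu_files_dir
        fs.mkdirs(new Path("/user/stu_files_dir"));

        // Equivalent of: hdfs dfs -put stuData_id96 /user/stu_files_dir
        fs.copyFromLocalFile(new Path("stuData_id96"),
                             new Path("/user/stu_files_dir/stuData_id96"));

        // Equivalent of: hdfs dfs -ls /user/stu_files_dir
        for (FileStatus status : fs.listStatus(new Path("/user/stu_files_dir"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```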
MapReduce framework and programming model • MapReduce is a programming model for distributed computing. • Mapper means the software that does the assigned task after organizing the data blocks, imported using the keys. • A key is specified in the command line of the Mapper. • The command maps the key to the data which an application uses.
• Reducer means the software that reduces the mapped data by using an aggregation, query or user-specified function. The Reducer provides a concise, cohesive response for the application. • An aggregation function is a function that groups the values of multiple rows together to give a single value of more significant meaning or measurement. • E.g. functions such as count, sum, maximum, minimum, deviation and standard deviation. • A querying function is a function that finds the desired values. • E.g. a function for finding the best student of a class, who has shown the best performance in the examination.
Features of the MapReduce framework: • Provides automatic parallelization and distribution of computation over several processors. • Processes data stored on distributed clusters of DataNodes and racks. • Allows processing large amounts of data in parallel. • Provides scalability for the use of a large number of servers. • Provides the MapReduce batch-oriented programming model in Hadoop version 1. • Provides additional processing modes • E.g. queries, graph databases, streaming data, messages, OLAP.
• Shown below is a MapReduce example to count the frequency of each word in a given input text. Our input text is, “Big data comes in various formats. This data can be stored in multiple data servers.”
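A minimal Java sketch of such a word-count job is given below, assuming the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce); the class names and the input/output paths are illustrative only.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in the input split
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word (aggregation function = sum)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

For the sample text above, the Reducer output would include counts such as (data, 3) and (in, 2), with the remaining tokens counted once each.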
MapReduce programming model • A MapReduce program can be written in several languages, including Java, C++ (via Pipes) or Python. • A MapReduce program performs mapping to compute over the data and convert the data into other data sets. • After the Mapper computation finishes, the Reducer function collects the result of the Map and generates the final output result. • A MapReduce program can be applied to any type of data (structured or unstructured).
Hadoop YARN • It is a resource-management platform. It manages computing resources. • The platform is responsible for providing computational resources such as CPUs, memory and network I/O. • It separates the resource-management and processing components. • It enables the running of multi-threaded applications.
Hadoop 2 Execution Model
• The figure shows the YARN components: Client, Resource Manager (RM), Node Manager (NM), Application Master (AM) and containers. • List of actions: • A MasterNode has two components: 1. Job History Server, 2. Resource Manager (RM). • The client node submits an application request to the RM. The RM is the master; one RM exists per cluster. • The RM keeps information about all the slave NMs: their location (rack awareness) and the number of resources (data blocks and servers) they have.
• An NM creates the Application Master (AM) instance, the AMI. The AMI estimates the resources required for running the application program. • The AMI initializes itself and registers with the RM. Multiple AMIs can be created within an AM. • All active NMs send controlling signals periodically to the RM, signalling their presence. • Each NM includes several containers for use by the sub-tasks of the application.
Hadoop Ecosystem Tools • Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate among themselves and maintain shared data with robust synchronization techniques. • The common services provided by ZooKeeper are as follows − • Naming service − identifying the nodes in a cluster by name. It is similar to DNS, but for nodes. • Configuration management − latest and up-to-date configuration information of the system for a joining node. • Cluster management − joining/leaving of a node in a cluster and node status in real time. • Leader election − electing a node as leader for coordination purposes. • Locking and synchronization service − locking the data while modifying it. This mechanism helps in automatic failure recovery while connecting to other distributed applications like Apache HBase. • Highly reliable data registry − availability of data even when one or a few nodes are down.
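As an illustration of the coordination services listed above, the sketch below uses ZooKeeper's standard Java client to write and read a small piece of shared configuration data. It is a minimal sketch; the connect string, znode path and payload are assumptions for the example.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble (the host:port is an assumed example)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();   // wait until the session is established

        String path = "/demo-config";           // example znode holding shared configuration
        byte[] data = "replication=3".getBytes();

        // Create the znode if it does not exist yet
        if (zk.exists(path, false) == null) {
            zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any joining node can read the latest configuration from the same znode
        byte[] stored = zk.getData(path, false, null);
        System.out.println("Shared config: " + new String(stored));

        zk.close();
    }
}
```

Locks, leader election and cluster-membership tracking are built from the same primitives (znodes plus watches).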
Oozie • Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment. • It allows combining multiple complex jobs to be run in sequential order to achieve a bigger task. • Within a sequence of tasks, two or more jobs can also be programmed to run parallel to each other.
• The following three types of jobs are common in Oozie − • Oozie Workflow Jobs − these are represented as Directed Acyclic Graphs (DAGs) to specify a sequence of actions to be executed. • Oozie Coordinator Jobs − these consist of workflow jobs triggered by time and data availability. • Oozie Bundle − this can be referred to as a package of multiple coordinator and workflow jobs.
• Oozie provides for the following: • integrates multiple jobs in a sequential manner, • stores and supports Hadoop jobs for MapReduce, Hive, Pig and Sqoop, • runs workflow jobs based on time and data triggers, • manages the batch coordinator for the applications.
Sqoop • Sqoop is a tool used to transfer bulk data between Hadoop and external datastores, such as relational databases (MS SQL Server, MySQL).
• Sqoop Import • The import tool imports individual tables from an RDBMS to HDFS. • Each row in a table is treated as a record in HDFS. • All records are stored as text data in text files, or as binary data in Avro and Sequence files. • Sqoop Export • The export tool exports a set of files from HDFS back to an RDBMS. • The files given as input to Sqoop contain records, which are called rows in a table. • Those are read and parsed into a set of records, delimited with a user-specified delimiter.
Flume • Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating and transporting large amounts of streaming data, such as log data and events, from various web servers to a centralized data store. • It is a highly reliable, distributed and configurable tool that is principally designed to transfer streaming data from various sources to HDFS.
• Data generators (such as Facebook, Twitter) generate data which gets collected by individual Flume agents running on them. • Thereafter, a data collector (which is also an agent) collects the data from the agents; it is aggregated and pushed into a centralized store such as HDFS or HBase.
• Flume Agent • An agent is an independent process (JVM) in Flume. It receives the data (events) from clients or other agents and forwards it to its next destination (sink or agent).
• As shown in the diagram, a Flume agent contains three main components, namely source, channel, and sink. • A source is the component of an agent which receives data from the data generators and transfers it to one or more channels in the form of Flume events. • A channel is a transient store which receives the events from the source and buffers them till they are consumed by sinks. It acts as a bridge between the sources and the sinks. Example − JDBC channel, file-system channel, memory channel, etc. • A sink stores the data into centralized stores like HBase and HDFS. It consumes the data (events) from the channels and delivers it to the destination. The destination of the sink might be another agent or the central stores. Example − HDFS sink.
Ambari • Apache Ambari is an open-source administration tool deployed on top of Hadoop clusters. • It is responsible for keeping track of the running applications and their status. • Apache Ambari can be described as a web-based management tool that manages, monitors and provisions the health of Hadoop clusters.
Features of Ambari • Simplifies installation, configuration and management. • Enables easy, efficient, repeatable and automated creation of clusters. • Manages and monitors scalable clusters. • Enables detection of faulty node links. • Provides extensibility and customizability.
HBase • HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). • HBase provides a fault-tolerant way of storing sparse data sets, which are common in many Big Data use cases. • HBase is a scalable, distributed NoSQL database. • HBase provides real-time access to read or write data in HDFS.
Components of HBase • There are two HBase components, namely HBase Master and RegionServer. • HBase Master: it is not part of the actual data storage, but negotiates load balancing across all RegionServers. • Maintains and monitors the Hadoop cluster. • Performs administration (an interface for creating, updating and deleting tables). • Controls the failover. • The HMaster handles DDL operations.
• RegionServer: it is the worker node which handles read, write, update and delete requests from clients. • The RegionServer process runs on every node in the Hadoop cluster. • The RegionServer runs on the HDFS DataNode.
Storage mechanism in HBase • HBase is a column-oriented database and the tables in it are sorted by row. • The table schema defines only column families, which are the key-value pairs. • A table can have multiple column families, and each column family can have any number of columns.
• In short, in HBase: • a table is a collection of rows, • a row is a collection of column families, • a column family is a collection of columns, • a column is a collection of key-value pairs.
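This data model maps directly onto HBase's Java client API. Below is a minimal sketch, assuming a table named "students" with a column family "info" has already been created; the table, family and qualifier names are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseStudentExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("students"))) {

            // Write: row key "stuData_id96", column family "info", column "name"
            Put put = new Put(Bytes.toBytes("stuData_id96"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("A. Student"));
            table.put(put);

            // Read the same cell back (real-time read over HDFS-backed storage)
            Get get = new Get(Bytes.toBytes("stuData_id96"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}
```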
Features of HBase • HBase is linearly scalable. • It has automatic failure support. • It provides consistent reads and writes. • It integrates with Hadoop, both as a source and a destination. • It has an easy Java API for clients. • It provides data replication across clusters.
Hive • Hive is an ETL and data-warehousing tool used to query or analyze large datasets stored within the Hadoop ecosystem. • Hive has three main functions: data summarization, querying, and analysis of unstructured and semi-structured data in Hadoop.
Features of Hive • It stores the schema in a database and the processed data in HDFS. • It is designed for OLAP. • It provides an SQL-type language for querying, called HiveQL or HQL. • It is familiar, fast, scalable and extensible.
• User Interface • Hive is a data-warehouse infrastructure software that can create interaction between the user and HDFS. • The user interfaces that Hive supports are Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server). • Metastore • Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
• HiveQL Process Engine • HiveQL is similar to SQL for querying the schema information in the Metastore. • It is one of the replacements for the traditional approach of writing a MapReduce program. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it. • Execution Engine • The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. • The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce. • HDFS or HBase • The Hadoop Distributed File System or HBase are the data-storage techniques used to store data in the file system.
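From a client program, HiveQL queries are typically submitted through Hive's JDBC driver (HiveServer2). Below is a minimal sketch; the JDBC URL, credentials and the assumed "students" table are examples only, and Hive translates the HiveQL into MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (shipped with Hive)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed connection details for the example
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement()) {

            // A simple summarization query over an assumed 'students' table
            ResultSet rs = stmt.executeQuery(
                    "SELECT dept, COUNT(*) FROM students GROUP BY dept");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " : " + rs.getLong(2));
            }
        }
    }
}
```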
Pig • Apache Pig is an abstraction over MapReduce. • It is a tool/platform used to analyze large sets of data, representing them as data flows. • Pig is generally used with Hadoop; • we can perform all the data-manipulation operations in Hadoop using Pig.
Features of Pig • Rich set of operators − it provides many operators to perform operations like join, sort, filter, etc. • Ease of programming − Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL. • Optimization opportunities − the tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language. • Extensibility − using the existing operators, users can develop their own functions to read, process and write data. • UDFs − Pig provides the facility to create User-Defined Functions in other programming languages such as Java and to invoke or embed them in Pig scripts. • Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
• To perform a particular task using Pig, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, Embedded); a sketch of the Embedded mode follows below. • After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.
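The Embedded execution mechanism runs Pig Latin from inside a Java program via the PigServer class. Below is a minimal sketch, assuming local execution mode and an illustrative comma-separated input file "students.txt"; the aliases, schema and output path are assumptions.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for a quick test; ExecType.MAPREDUCE would run on the cluster
        PigServer pigServer = new PigServer(ExecType.LOCAL);

        // Pig Latin statements registered one by one (parsed into a logical plan / DAG)
        pigServer.registerQuery(
            "students = LOAD 'students.txt' USING PigStorage(',') "
            + "AS (id:int, name:chararray, dept:chararray);");
        pigServer.registerQuery("grouped = GROUP students BY dept;");
        pigServer.registerQuery(
            "counts = FOREACH grouped GENERATE group, COUNT(students);");

        // Triggers compilation into MapReduce jobs and writes the result out
        pigServer.store("counts", "dept_counts_out");
        pigServer.shutdown();
    }
}
```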
• Parser • Initially the Pig Scripts are handled by the Parser. • It checks the syntax of the script, does type checking, and other miscellaneous checks. • The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. • In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.
• Optimizer • The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown. • Compiler • The compiler compiles the optimized logical plan into a series of MapReduce jobs. • Execution engine • Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and these jobs are executed on Hadoop, producing the desired results.
Mahout • Apache Mahout is an open source project that is primarily used for creating scalable machine learning algorithms. It implements popular machine learning techniques such as: • Recommendation • Classification • Clustering
Features of Mahout • The algorithms of Mahout are written on top of Hadoop, so it works well in a distributed environment. • Mahout uses the Apache Hadoop library to scale effectively in the cloud. • Mahout offers the coder a ready-to-use framework for doing data-mining tasks on large volumes of data. • Mahout lets applications analyze large sets of data effectively and quickly. • Includes several MapReduce-enabled clustering implementations such as k-means, fuzzy k-means, Canopy, Dirichlet and Mean-Shift. • Supports Distributed Naive Bayes and Complementary Naive Bayes classification implementations. • Comes with distributed fitness-function capabilities for evolutionary programming. • Includes matrix and vector libraries.
