HADOOP ECOSYSTEM
In the previous blog on the Hadoop Tutorial, we discussed Hadoop, its features and its core components. The next step is to understand the Hadoop Ecosystem, an essential topic before you start working with Hadoop. This Hadoop Ecosystem blog will familiarize you with the Big Data frameworks used across the industry and required for Hadoop Certification.
• HDFS -> Hadoop Distributed File System
• YARN -> Yet Another Resource Negotiator
• MapReduce -> Data processing using programming
• Spark -> In-memory data processing
• Pig, Hive -> Data processing services using SQL-like queries
• HBase -> NoSQL database
• Mahout, Spark MLlib -> Machine learning
• Apache Drill -> SQL on Hadoop
• ZooKeeper -> Cluster management
HDFS  Hadoop Distributed File System is the core component or you can say, the backbone of Hadoop Ecosystem.  HDFS is the one, which makes it possible to store different types of large data sets (i.e. structured, unstructured and semi structured data).  HDFS creates a level of abstraction over the resources, from where we can see the whole HDFS as a single unit.  It helps us in storing our data across various nodes and maintaining the log file about the stored data (metadata).
• HDFS has two core components: the NameNode and the DataNodes.
◦ The NameNode is the master node and does not store the actual data. It holds the metadata, much like a log file or a table of contents. It therefore needs relatively little storage but high computational resources.
◦ All of your data is stored on the DataNodes, which therefore need more storage. DataNodes are commodity hardware (like your laptops and desktops) in the distributed environment, which is why Hadoop solutions are very cost-effective.
◦ When writing data, the client always communicates with the NameNode first; the NameNode then tells the client which DataNodes to use for storing and replicating the data (see the write sketch below).
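To make the write path concrete, here is a minimal sketch using the standard Hadoop FileSystem Java API. The NameNode address and file path are illustrative assumptions, not values from this blog.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; the client talks to the NameNode,
        // which decides where the blocks and their replicas are placed.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            // The actual bytes are streamed to the DataNodes chosen by the NameNode.
            out.writeUTF("Hello, HDFS!");
        }
    }
}
```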
YARN  Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing activities by allocating resources and scheduling tasks.  It has two major components, i.e. ResourceManager and NodeManager. ◦ ResourceManager is again a main node in the processing department. ◦ It receives the processing requests, and then passes the parts of requests to corresponding NodeManagers accordingly, where the actual processing takes place.
◦ A NodeManager is installed on every DataNode and is responsible for executing tasks on that node.
• The ResourceManager has two components: the Scheduler and the ApplicationsManager.
◦ Scheduler: based on your application's resource requirements, the Scheduler runs scheduling algorithms and allocates resources.
◦ ApplicationsManager: it accepts job submissions, negotiates the first container (i.e. the DataNode environment where a process executes) for running the application-specific ApplicationMaster, and monitors its progress. ApplicationMasters are the daemons that reside on DataNodes and communicate with containers to execute tasks on each DataNode.
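As a small illustration of how clients talk to the ResourceManager, here is a hedged sketch using the YarnClient API to list the applications the ResourceManager currently knows about. It assumes a valid yarn-site.xml is on the classpath.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Ask the ResourceManager for the applications it is tracking.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " -> " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```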
MAPREDUCE  It is the core component of processing in a Hadoop Ecosystem as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that processes large data sets using distributed and parallel algorithms inside Hadoop environment.  In a MapReduce program, Map() and Reduce() are two functions. ◦ The Map function performs actions like filtering, grouping and sorting. ◦ While Reduce function aggregates and summarizes the result produced by map function. ◦ The result generated by the Map function is a key value pair (K, V) which acts as the input for Reduce function.
APACHE PIG  But don’t be shocked when I say that at the PIG has two parts: Pig Latin, the language and the pig runtime, for the execution environment. You can better understand it as Java and JVM.  It supports pig latin language, which has SQL like command structure.  As everyone does not belong from a programming background. So, Apache PIG relieves them. You might be curious to know how?  Well, I will tell you an interesting fact:  10 line of pig latin = approx. 200 lines of Map- Reduce Java code  back end of Pig job, a map-reduce job executes.
HIVE:  Facebook created HIVE for people who are fluent with SQL. Thus, HIVE makes them feel at home while working in a Hadoop Ecosystem.  Basically, HIVE is a data warehousing component which performs reading, writing and managing large data sets in a distributed environment using SQL-like interface.  HIVE + SQL = HQL  The query language of Hive is called Hive Query Language(HQL), which is very similar like SQL.  It has 2 basic components: Hive Command Line and JDBC/ODBC driver.  The Hive Command line interface is used to execute HQL commands.  While, Java Database Connectivity (JDBC) and Object Database Connectivity (ODBC) is used to establish connection from data storage.  Secondly, Hive is highly scalable. As, it can serve both the purposes, i.e. large data set processing (i.e. Batch query processing) and uery processing).  It supports all primitive data types of SQL.  You can use predefined functions, or write tailored user defined functions (UDF) also to accomplish your specific needs.  As an alternative, you may go to this comprehensive video tutorial where each tool present in Hadoop Ecosystem has been discussed:  Hadoop Ecosystem | Edureka
APACHE MAHOUT
• Now, let us talk about Mahout, which is renowned for machine learning. Mahout provides an environment for creating scalable machine learning applications.
• Collaborative filtering: Mahout mines user behaviors, their patterns and their characteristics, and based on that it predicts and makes recommendations to users. The typical use case is an e-commerce website (see the sketch after this list).
• Clustering: it organizes similar data into groups; for example, articles can be grouped into blogs, news, research papers, etc.
• Classification: it means classifying and categorizing data into various sub-categories; for example, articles can be categorized into blogs, news, essays, research papers and other categories.
• Frequent item set mining: here Mahout checks which objects are likely to appear together and makes suggestions if one of them is missing. For example, a cell phone and a cover are generally bought together, so if you search for a cell phone it will also recommend the cover and cases.
• Mahout provides a command line to invoke various algorithms, along with a predefined library of built-in algorithms for different use cases.
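As a hedged sketch of the collaborative-filtering use case, here is a user-based recommender built with Mahout's classic Taste API. The ratings.csv file (userID,itemID,rating rows), the neighborhood size and the target user ID are illustrative assumptions.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical CSV of userID,itemID,rating lines.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Find users whose rating patterns correlate with the target user.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        // Recommend items liked by similar users but not yet rated by user 1.
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> items = recommender.recommend(1L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " scored " + item.getValue());
        }
    }
}
```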
APACHE SPARK  Apache Spark is a framework for real time data analytics in a distributed computing environment.  The Spark is written in Scala and was originally developed at the University of California, Berkeley.  It executes in-memory computations to increase speed of data processing over Map- Reduce.  It is 100x faster than Hadoop for large scale data processing by exploiting in-memory computations and other
APACHE HBASE  HBase is an open source, non-relational distributed database. In other words, it is a NoSQL database.  It supports all types of data and that is why, it’s capable of handling anything and everything inside a Hadoop ecosystem.  It is modelled after Google’s BigTable, which is a distributed storage system designed to cope up with large data sets.  The HBase was designed to run on top of HDFS and provides BigTable like capabilities.  It gives us a fault tolerant way of storing sparse data, which is common in most Big Data use cases.  The HBase is written in Java, whereas HBase applications can be written in REST, Avro and Thrift APIs.
APACHE DRILL  As the name suggests, Apache Drill is used to drill into any kind of data. It’s an open source application which works with distributed environment to analyze large data sets.  It is a replica of Google Dremel.  It supports different kinds NoSQL databases and file systems, which is a powerful feature of Drill. For example: Azure Blob Storage, Google Cloud Storage, HBase, MongoDB, MapR-DB HDFS, MapR-FS, Amazon S3, Swift, NAS and local files.  So, basically the main aim behind Apache Drill is to provide scalability so that we can process petabytes and exabytes of data efficiently (or you can say in minutes).  The main power of Apache Drill lies in combining a variety of data stores just by using a single query.  Apache Drill basically follows the ANSI SQL.  It has a powerful scalability factor in supporting millions of users and serve their query requests over large scale data.
