Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
The document provides an overview of the Hadoop ecosystem and its various components such as HDFS, YARN, MapReduce, Spark, Pig, Hive, and more, which are essential for storing, processing, and analyzing big data. Each component is described with its functionality, including data ingestion by Flume and Sqoop, machine learning capabilities through Mahout, and job scheduling with Oozie. Additional details on related tools like Zookeeper and Ambari for cluster management are also included.
HDFS (Hadoop Distributed File System): Storage
- Stores different types of large data sets (i.e. structured, unstructured and semi-structured data)
- Creates a level of abstraction over the underlying resources, so the whole of HDFS can be seen as a single unit
- Stores data across various nodes and maintains a log (metadata) about the stored data
- Has two core components: the NameNode, which holds the metadata, and the DataNodes, which store the actual data
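To make the "single unit" abstraction concrete, here is a minimal sketch of writing a file through the HDFS Java FileSystem API; the NameNode address (hdfs://localhost:9000) and the file path are assumptions for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/hello.txt"); // hypothetical path
            // The client asks the NameNode for metadata and the DataNodes
            // for actual blocks, but sees a single file system throughout
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("hello hdfs");
            }
            System.out.println("Replication factor: "
                + fs.getFileStatus(path).getReplication());
        }
    }
}
```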
YARN (Yet Another Resource Negotiator)
- Performs all processing activities by allocating resources and scheduling tasks
- Two services: ResourceManager and NodeManager
- ResourceManager: manages cluster resources and schedules applications running on top of YARN
- NodeManager: manages containers and monitors resource utilization in each container
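As a sketch of how a client talks to the ResourceManager, the snippet below lists the applications YARN currently knows about via the YarnClient API; it assumes a reachable cluster configured through the default YarnConfiguration:

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppsDemo {
    public static void main(String[] args) throws Exception {
        // YarnClient talks to the ResourceManager, which tracks
        // cluster resources and the applications running on them
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());
        client.start();
        for (ApplicationReport app : client.getApplications()) {
            System.out.println(app.getApplicationId() + " "
                + app.getYarnApplicationState());
        }
        client.stop();
    }
}
```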
MapReduce: Data Processing Using Programming
- The core processing component of the Hadoop ecosystem
- Helps in writing applications that process large data sets using distributed and parallel algorithms
- A MapReduce program is built from two functions: Map() and Reduce()
- The Map function performs actions like filtering, grouping and sorting
- The Reduce function aggregates and summarizes the results produced by the Map function
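The classic word-count job illustrates the split between the two functions: the Map side tokenizes each line and emits (word, 1) pairs, and the Reduce side sums them per word. A minimal sketch, with input and output paths passed on the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: split each input line into words and emit (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```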
PIG: Data Processing Service using Query
- Pig has two parts: Pig Latin (the language) and the Pig runtime (the execution environment)
- One line of Pig Latin is roughly equivalent to 100 lines of MapReduce code
- The compiler internally converts Pig Latin into MapReduce jobs
- Provides a platform for building data flows for ETL (Extract, Transform and Load)
- Pig first loads the data, then performs functions like grouping, filtering, joining and sorting, and finally dumps the results to the screen or stores them in HDFS
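The same load/group/store flow can be driven from Java through Pig's PigServer API, which compiles the embedded Pig Latin to MapReduce behind the scenes. A sketch assuming a hypothetical local input file input.txt:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlowDemo {
    public static void main(String[] args) throws Exception {
        // LOCAL mode for illustration; use ExecType.MAPREDUCE on a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Load, transform, group: the compiler turns these lines into MapReduce
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        // Store the result (lands on HDFS when running in MAPREDUCE mode)
        pig.store("counts", "wordcount_out");
    }
}
```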
HIVE: Data Processing Service using Query
- A data warehousing component which analyzes data sets in a distributed environment using an SQL-like interface
- The query language of Hive is called Hive Query Language (HQL)
- Two basic components: the Hive command line and the JDBC/ODBC driver
- Supports user-defined functions (UDFs) to accomplish specific needs
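Because Hive exposes a JDBC driver (one of the two components above), it can be queried from Java like any SQL database. A sketch assuming a HiveServer2 on localhost:10000 and a hypothetical sales table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // HQL looks like SQL; Hive compiles it into distributed jobs
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString("category") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```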
Mahout: Machine Learning
- Provides an environment for creating machine learning applications
- Performs collaborative filtering, clustering and classification
- Provides a command line to invoke various algorithms
- Ships with a library of inbuilt algorithms for different use cases
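As a sketch of the collaborative-filtering use case, the snippet below uses Mahout's Taste recommender API; the input file ratings.csv (one userID,itemID,rating triple per line) and the user ID are assumptions:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutCfDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical ratings file: userID,itemID,rating per line
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 item recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```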
Spark: In-memory Data Processing
- A framework for real-time data analytics in a distributed computing environment
- Written in Scala and originally developed at the University of California, Berkeley
- Executes in-memory computations to increase the speed of data processing over MapReduce
- Up to 100x faster than Hadoop MapReduce for large-scale data processing by exploiting in-memory computation
- Comes packed with high-level libraries: MLlib, GraphX, SQL + DataFrames and Streaming
- Supports various languages such as R, SQL, Python, Scala and Java
- Integrates seamlessly into complex workflows
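A minimal sketch of the SQL + DataFrames library using Spark's Java API; the input file events.json and its type field are assumptions for illustration:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlDemo {
    public static void main(String[] args) {
        // local[*] runs Spark in-process for illustration
        SparkSession spark = SparkSession.builder()
            .appName("SparkSqlDemo")
            .master("local[*]")
            .getOrCreate();
        // Hypothetical input: one JSON event per line with a "type" field
        Dataset<Row> events = spark.read().json("events.json");
        // The aggregation runs as an in-memory distributed computation
        events.groupBy("type").count().show();
        spark.stop();
    }
}
```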
HBase: NoSQL Database
- An open-source, non-relational, distributed database (a NoSQL database)
- Supports all types of data, which is why it is capable of handling anything and everything inside a Hadoop ecosystem
- Modelled after Google's BigTable
- Gives us a fault-tolerant way of storing sparse data
- Written in Java; HBase applications can be accessed through REST, Avro and Thrift APIs
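A sketch of the HBase Java client writing and reading one cell; the users table and its info column family are assumed to already exist:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             // Assumes a table 'users' with column family 'info' already exists
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key, column family, qualifier, value
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);
            // Read it back
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```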
Apache Drill: SQL on Hadoop
- An open-source application which works in a distributed environment to analyze large data sets
- Follows ANSI SQL
- Supports different kinds of NoSQL databases and file systems, for example Azure Blob Storage, Google Cloud Storage, HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Swift, NAS and local files
- Can combine data from a variety of data stores using a single query
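Drill also ships a JDBC driver, so one ANSI SQL query can reach into raw files or NoSQL stores directly. A sketch assuming a Drillbit on localhost and a hypothetical JSON file reachable through the dfs storage plugin:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillJdbcDemo {
    public static void main(String[] args) throws Exception {
        // Connects directly to a Drillbit; a "jdbc:drill:zk=..." URL via ZooKeeper also works
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement();
             // Queries a raw JSON file with SQL, no schema definition needed
             ResultSet rs = stmt.executeQuery(
                 "SELECT * FROM dfs.`/data/sample.json` LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```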
Oozie: Job Scheduler
- Oozie is the job scheduler of the Hadoop ecosystem
- Two kinds of Oozie jobs: Oozie Workflow and Oozie Coordinator
- Oozie Workflow: a sequential set of actions to be executed
- Oozie Coordinator: jobs which are triggered when data becomes available to them, or on a time schedule
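A sketch of submitting an Oozie Workflow from Java through the OozieClient API; the server URL, the HDFS application path and the jobTracker/nameNode parameters (whose names depend on the workflow definition) are assumptions:

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitDemo {
    public static void main(String[] args) throws Exception {
        // Assumed Oozie server URL
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");
        Properties conf = oozie.createConfiguration();
        // HDFS directory containing workflow.xml (hypothetical path)
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/demo/app");
        conf.setProperty("nameNode", "hdfs://localhost:9000");
        conf.setProperty("jobTracker", "localhost:8032");
        // Submit and start the workflow, then poll its status once
        String jobId = oozie.run(conf);
        System.out.println("Workflow " + jobId + ": "
            + oozie.getJobInfo(jobId).getStatus());
    }
}
```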
Apache Flume: Data Ingesting Service
- Ingests unstructured and semi-structured data into HDFS
- Helps in collecting, aggregating and moving large amounts of data
- Can ingest online streaming data from various sources such as network traffic, social media, email messages and log files into HDFS
Apache Sqoop: Data Ingesting Service
- Another data ingesting service
- Sqoop can import structured data from an RDBMS into HDFS, and export it back
- Flume, by contrast, only ingests unstructured or semi-structured data into HDFS
Apache Solr and Lucene: Searching and Indexing
- Two services used for searching and indexing in the Hadoop ecosystem
- Apache Lucene is a Java-based search library, which also provides features such as spell checking
- Apache Lucene is the engine; Apache Solr is a complete application built around Lucene
- Solr uses the Apache Lucene Java search library for searching and indexing
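A sketch of indexing and searching one document via SolrJ, Solr's Java client; the core name demo and the document fields are assumptions:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrDemo {
    public static void main(String[] args) throws Exception {
        // Assumes a Solr core named "demo" on the default port
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/demo").build()) {
            // Index one document (Lucene does the actual indexing underneath)
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "Hadoop ecosystem overview");
            solr.add(doc);
            solr.commit();
            // Search it back with a Lucene-syntax query
            long hits = solr.query(new SolrQuery("title:hadoop"))
                            .getResults().getNumFound();
            System.out.println("Matching documents: " + hits);
        }
    }
}
```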
ZooKeeper: Coordinator
- An open-source server which enables highly reliable distributed coordination
- Apache ZooKeeper coordinates the various Hadoop services in a distributed environment
- Performs synchronization, configuration maintenance, grouping and naming
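A sketch of the configuration-maintenance use case with the plain ZooKeeper Java client: store a small value in a znode and read it back. The ensemble address and the znode path are assumptions:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Assumed ZooKeeper ensemble address
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(); // wait until the session is established
        // Store a small piece of configuration in a znode (hypothetical path)
        zk.create("/demo-config", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Any client in the cluster now reads the same value
        System.out.println(new String(zk.getData("/demo-config", false, null)));
        zk.close();
    }
}
```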
Apache Ambari: Cluster Manager
- Software for provisioning, managing and monitoring Apache Hadoop clusters
- Gives us a step-by-step process for installing Hadoop services
- Handles the configuration of Hadoop services
- Provides a central management service for starting, stopping and re-configuring Hadoop services
- Monitors the health and status of the Hadoop cluster
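Ambari's central management service is also reachable over its REST API. Below is a sketch that lists the clusters Ambari manages, assuming the default endpoint on port 8080 and the default admin:admin credentials:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class AmbariClustersDemo {
    public static void main(String[] args) throws Exception {
        // Assumed Ambari server address and default admin:admin credentials
        String auth = Base64.getEncoder()
            .encodeToString("admin:admin".getBytes());
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8080/api/v1/clusters"))
            .header("Authorization", "Basic " + auth)
            .header("X-Requested-By", "ambari") // header Ambari expects from API clients
            .GET()
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        // JSON body describing the clusters Ambari manages
        System.out.println(response.body());
    }
}
```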