www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Agenda for today’s Session Hadoop Ecosystem  HDFS -> Hadoop Distributed File System  YARN -> Yet Another Resource Negotiator  MapReduce -> Data processing using programming  Spark -> In-memory Data Processing  PIG, HIVE-> Data Processing Services using Query (SQL-like)  HBase -> NoSQL Database  Mahout, Spark MLlib -> Machine Learning  Apache Drill -> SQL on Hadoop  Zookeeper -> Managing Cluster  Oozie -> Job Scheduling  Flume, Sqoop -> Data Ingesting Services  Solr & Lucene -> Searching & Indexing  Ambari -> Provision, Monitor and Maintain cluster
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Hadoop Ecosystem
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HDFS (Hadoop Distributed File System) - Storage  Stores different types of large data sets (i.e. structured, unstructured and semi structured data)  HDFS creates a level of abstraction over the resources, from where we can see the whole HDFS as a single unit  Stores data across various nodes and maintains the log file about the stored data (metadata)  HDFS has two core components, i.e. NameNode and DataNode
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING YARN (Yet Another Resource Negotiator)  Performs all your processing activities by allocating resources and scheduling tasks  Two services: ResourceManager and NodeManager  ResourceManager: Manages resources and schedule applications running on top of YARN  NodeManager: Manages containers and monitors resource utilization in each container
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING MapReduce: Data Processing Using Programming  Core component in a Hadoop Ecosystem for processing  Helps in writing applications that processes large data sets using distributed and parallel algorithms  In a MapReduce program, Map() and Reduce() are two functions  Map function performs actions like filtering, grouping and sorting  Reduce function aggregates and summarizes the result produced by map function
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING PIG: Data Processing Service using Query  PIG has two parts: Pig Latin, the language and the pig runtime, for the execution environment  1 line of pig latin = approx. 100 lines of Map-Reduce job  The compiler internally converts pig latin to MapReduce  It gives you a platform for building data flow for ETL (Extract, Transform and Load)  PIG first loads the data, then performs various functions like grouping, filtering, joining, sorting, etc. and finally dumps the data on the screen or stores in HDFS.
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING HIVE: Data Processing Service using Query  A data warehousing component which analyses data sets in a distributed environment using SQL-like interface  The query language of Hive is called Hive Query Language(HQL)  2 basic components: Hive Command Line and JDBC/ODBC driver  Supports user defined functions (UDF) to accomplish specific needs
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Mahout: Machine Learning  Provides an environment for creating machine learning applications  It performs collaborative filtering, clustering and classification  Provides a command line to invoke various algorithms.  It has a predefined set of library which already contains different inbuilt algorithms for different use cases.
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Spark: In-memory Data Processing  A framework for real time data analytics in a distributed computing environment.  Written in Scala and was originally developed at the University of California, Berkeley.  It executes in-memory computations to increase speed of data processing over Map-Reduce.  100x faster than Hadoop for large scale data processing by exploiting in-memory computations
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Spark: In-memory Data Processing  Spark comes packed with high-level libraries  Provides various services like MLlib, GraphX, SQL + Data Frames, Streaming services  Supports various languages like R, SQL, Python, Scala, Java  Seamlessly integrates in complex workflow
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Hbase: NoSQL Database  An open source, non-relational distributed database - a NoSQL database  Supports all types of data and that is why, it’s capable of handling anything and everything  It is modelled after Google’s BigTable  It gives us a fault tolerant way of storing sparse data  It is written in Java, and HBase applications can be written in REST, Avro and Thrift APIs
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Drill: SQL on Hadoop  An open source application which works with distributed environment to analyze large data sets  Follows the ANSI SQL  Supports different kinds NoSQL databases and file systems  For example: Azure Blob Storage, Google Cloud Storage, HBase, MongoDB, MapR-DB HDFS, MapR-FS, Amazon S3, Swift, NAS and local files  Combines a variety of data stores just by using a single query
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Oozie: Job Scheduler  Oozie is a job scheduler in Hadoop ecosystem  Two kinds of Oozie jobs: Oozie workflow and Oozie Coordinator  Oozie workflow: Sequential set of actions to be executed  Oozie Coordinator: Oozie jobs which are triggered when the data is made available to it or even triggered based on time
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Flume: Data Ingesting Service  Ingests unstructured and semi-structured data into HDFS.  It helps us in collecting, aggregating and moving large amount of data sets.  It helps us to ingest online streaming data from various sources like network traffic, social media, email messages, log files etc. in HDFS.
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Sqoop: Data Ingesting Service  Another data ingesting service  Sqoop can import as well as export structured data from RDBMS  Flume only ingests unstructured data or semi- structured data into HDFS
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Solr and Lucene  Two services which are used for searching and indexing in Hadoop Ecosystem  Apache Lucene is based on Java, which also helps in spell checking  Apache Lucene is the engine, Apache Solr is a complete application built around Lucene  Solr uses Apcahe Lucene Java search library for searching and indexing
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING ZooKeeper: Coordinator  An open-source server which enables highly reliable distributed coordination  Apache Zookeeper coordinates with various Hadoop services in a distributed environment  Performs synchronization, configuration maintenance, grouping and naming
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Apache Ambari: Cluster Manager  Software for provisioning, managing and monitoring Apache Hadoop clusters  Gives us step by step process for installing Hadoop services  Handles configuration of Hadoop services  Provides a central management service for starting, stopping and re-configuring Hadoop services  Monitors health and status of the Hadoop cluster
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Learning Resources  Hadoop Tutorial: www.edureka.co/blog/hadoop-tutorial  HDFS Ecosystem: www.edureka.co/blog/hadoop-ecosystem
www.edureka.co/big-data-and-hadoopEDUREKA HADOOP CERTIFICATION TRAINING Thank You … Questions/Queries/Feedback

Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka