EDUREKA HADOOP CERTIFICATION TRAINING (www.edureka.co/big-data-and-hadoop)
Agenda for Today's Session
• Introduction to Apache Pig
• Pig vs MapReduce
• Twitter Case Study on Apache Pig
• Apache Pig Architecture
• Pig Components
• Pig Data Model & Operators
• Running Pig Commands and Pig Scripts (Log Analysis)
MapReduce Way
In MapReduce, you need to write a program to process the data, typically in Java, or in a language such as Python through Hadoop Streaming.
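To give a feel for what "writing a program" means here, the sketch below imitates the map, shuffle, and reduce phases of a word count in plain Python. It runs on one machine and does not use Hadoop at all; the function names and the sample lines are made up for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: collect all values that were emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducer: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Chain the three phases together."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical input lines.
print(word_count(["pig latin", "pig on hadoop"]))
# {'pig': 2, 'latin': 1, 'on': 1, 'hadoop': 1}
```

Even for a task this small, the programmer has to spell out each phase by hand, which is the overhead that higher-level tools aim to remove.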
What if you come from a non-programming background? Are your Hadoop days over before they even started?
No need to worry at all! The Hadoop ecosystem includes several tools that do not require a programming background, and in today's session I will tell you about one of them.
Apache Pig
• An open-source, high-level dataflow system
• Introduced by Yahoo
• Provides an abstraction over MapReduce
• Two main components: the Pig Latin language and the Pig Execution
Fun fact: 10 lines of Pig Latin ≈ 200 lines of MapReduce Java code
Why go for Pig when MapReduce is there?
• 1/20 the lines of code
• 1/16 the development time
[Figure: two bar charts comparing MapReduce and Pig, one of them showing development time in minutes]
Apache Pig vs MapReduce
Apache Pig:
• High-level data flow tool
• No need to write complex programs
• Built-in support for data operations such as joins, filters, ordering, and sorting
• Provides nested data types such as tuples, bags, and maps
MapReduce:
• Low-level data processing paradigm
• You need to write programs in Java, Python, etc.
• Performing data operations such as joins takes a great deal of hand-written code
• Nested data types are not available
Some More Reasons to Choose Apache Pig
Why Apache Pig?
• Provides common data operations (filters, joins, ordering, etc.) and nested data types (tuples, bags, and maps) that are missing from MapReduce
• Open source and actively supported by a community of developers
• Can take any data: structured, semi-structured, or unstructured
• Data flow language: reads like a series of steps
• Extensible by UDFs (user-defined functions) written in Java, Python, JavaScript, or Ruby
• An ad-hoc way of creating and executing MapReduce jobs on very large data sets
• Easy to learn, easy to write, and easy to read
Twitter Case Study
• Twitter's data was growing at an accelerating rate (10 TB/day).
• Twitter therefore decided to move the archived data to HDFS and adopt Hadoop to extract business value from it.
• The major aim was to analyse the data stored in Hadoop and produce insights on a daily, weekly, or monthly basis.
Let me talk about one of the insights they wanted: how many tweets are stored per user in the given tweet tables?
High Level Implementation
Twitter Database (Tweet Table, User Table) → HDFS
1. Put the tables on HDFS
2. Load the data in Pig
3. Process the data in Pig and store the result back on HDFS
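The processing step for the tweets-per-user question can be pictured with a small Python stand-in. This is only a sketch of the logic, not Twitter's actual pipeline: the column layouts, the sample rows, and the choice to report users without tweets as 0 are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stand-ins for the two tables; the column layout is assumed.
TWEET_TABLE = [  # (tweet_id, user_id, text)
    (1, 101, "first tweet"),
    (2, 102, "hello"),
    (3, 101, "second tweet"),
]
USER_TABLE = [  # (user_id, user_name)
    (101, "alice"),
    (102, "bob"),
    (103, "carol"),
]


def tweets_per_user(tweets, users):
    """Return (user_name, tweet_count) rows, one per user, sorted by name."""
    # Group the tweets by user_id and count each group. In Pig Latin this is
    # roughly a GROUP ... BY followed by FOREACH ... GENERATE group, COUNT(...).
    counts = Counter(user_id for _tweet_id, user_id, _text in tweets)
    # Join the counts with the user table; users without tweets get 0
    # (an assumption of this sketch).
    return sorted((name, counts.get(user_id, 0)) for user_id, name in users)


print(tweets_per_user(TWEET_TABLE, USER_TABLE))
# [('alice', 2), ('bob', 1), ('carol', 0)]
```

In Pig the same result takes only a handful of Pig Latin statements, and the engine turns them into MapReduce jobs that run over the tables stored on HDFS.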
Detailed Implementation Flow
[Figure: implementation flow diagram with steps numbered 1 to 7]
Apache Pig Architecture
Apache Pig Components
• Pig Latin: made up of a series of operations or transformations that are applied to the input data to produce output
• Pig Execution:
  • Script: contains Pig commands in a file (.pig)
  • Grunt: interactive shell for running Pig commands
  • Embedded: provisioning a Pig script in Java
Apache Pig Running Modes
You can run Apache Pig in two modes:
• MapReduce mode: the default mode, which requires access to a Hadoop cluster and an HDFS installation. The input and output in this mode reside on HDFS.
  Command: pig
• Local mode: needs only a single machine; everything is installed and run using the local host and the local file system. Local mode is selected with the -x flag. The input and output in this mode reside on the local file system.
  Command: pig -x local
Before moving on to the practical, let us understand the data model in Pig.
Pig Data Model
Data model types: Atom, Tuple, Bag, Map
Pig Data Model: Tuple and Bag
• A tuple is an ordered set of fields, and each field may have a different data type.
  Example of a tuple: (1, Linkin Park, 7, California)
• A bag is a collection of tuples. The tuples may be entire rows of a table or subsets of rows, so they need not all have the same number of fields.
  Example of a bag: {(Linkin Park, 7, California), (Metallica, 8), (Megadeth, Los Angeles)}
Pig Data Model: Map and Atom
• A map is a set of key-value pairs used to represent data elements.
  Example of maps: [band#Linkin Park, members#7], [band#Metallica, members#8]
• Atoms are single values of the basic data types found in most languages: int, long, float, double, chararray (string), and bytearray (byte[]).
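If you already know a little Python, the four types can be pictured with its built-in containers. This is only an analogy, not how Pig stores data internally, and the variable names are made up for illustration.

```python
# Atom: a single scalar value.
atom = "Linkin Park"

# Tuple: an ordered set of fields; each field may have its own type.
band_tuple = (1, "Linkin Park", 7, "California")

# Bag: a collection of tuples; the tuples may differ in length.
# A list is used here so that duplicate tuples would be kept.
band_bag = [
    ("Linkin Park", 7, "California"),
    ("Metallica", 8),
    ("Megadeth", "Los Angeles"),
]

# Map: key-value pairs, written [band#Linkin Park, members#7] in Pig.
band_map = {"band": "Linkin Park", "members": 7}

# Fields of a tuple are addressed by position ($0, $1, ... in Pig Latin).
print(band_tuple[1])        # Linkin Park
# Values of a map are looked up by key.
print(band_map["members"])  # 7
# The tuples in the bag have 3, 2, and 2 fields respectively.
print([len(t) for t in band_bag])  # [3, 2, 2]
```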
Pig Operators
Operator    Description
LOAD        Loads data from the local file system or HDFS into Pig
FOREACH     Generates data transformations based on columns of data
FILTER      Selects tuples from a relation based on a condition
JOIN        Joins two or more relations based on a common column
ORDER BY    Sorts a relation based on one or more fields
STORE       Saves results to the local file system or HDFS
DISTINCT    Removes duplicate tuples from a relation
GROUP       Groups together the tuples that have the same group key (key field)
COGROUP     Works like GROUP, but is used when multiple relations are involved
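To make the operator descriptions concrete, here is a rough Python imitation of what each relational operator does to a small in-memory relation. It is a sketch of the semantics only (LOAD and STORE are left out because they just read and write files); the relations, field layout, and helper names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical relations: (name, city, score) and (city, region).
students = [
    ("asha", "Pune", 82),
    ("ravi", "Delhi", 67),
    ("meena", "Pune", 91),
    ("ravi", "Delhi", 67),  # duplicate row
]
regions = [("Pune", "west"), ("Delhi", "north"), ("Goa", "west")]

# DISTINCT: remove duplicate tuples (sorted only to get a stable order).
distinct = sorted(set(students))

# FILTER: keep the tuples that satisfy a condition.
passed = [row for row in distinct if row[2] >= 70]

# FOREACH ... GENERATE: project or transform columns.
names_and_scores = [(name, score) for name, _city, score in distinct]

# ORDER BY: sort on one or more fields (here: score, descending).
by_score = sorted(distinct, key=itemgetter(2), reverse=True)

# JOIN: pair up tuples from two relations whose key fields match.
joined = [s + r for s in distinct for r in regions if s[1] == r[0]]


def group_by(relation, key_index):
    """GROUP: collect the tuples that share a key into one bag per key."""
    ordered = sorted(relation, key=itemgetter(key_index))
    return {
        key: list(rows)
        for key, rows in groupby(ordered, key=itemgetter(key_index))
    }


def cogroup(left, left_key, right, right_key):
    """COGROUP: group several relations by key, one bag per relation."""
    left_groups = group_by(left, left_key)
    right_groups = group_by(right, right_key)
    keys = sorted(set(left_groups) | set(right_groups))
    return {k: (left_groups.get(k, []), right_groups.get(k, [])) for k in keys}


by_city = group_by(distinct, 1)
both = cogroup(distinct, 1, regions, 0)
print(by_city["Pune"])  # [('asha', 'Pune', 82), ('meena', 'Pune', 91)]
print(both["Goa"])      # ([], [('Goa', 'west')])
```

Notice the difference between JOIN and COGROUP in this sketch: the join drops Goa because no student lives there, while the cogroup keeps the key and pairs it with an empty bag.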
Let us execute a few Pig commands in the Grunt shell.
Analysing Logs Using Apache Pig
• There is an application that processes sample class recordings.
• A log file records all the events that happen while the application is running.
• We will analyse this log file to find out which types of events it contains and the count of each event type.
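Before writing the Pig script, it can help to see the analysis on a handful of lines. The Python sketch below counts events by type; the log layout (date, time, event type, message) and the sample lines are assumptions, since the actual log file is not shown on the slides.

```python
import re
from collections import Counter

# Hypothetical log lines; the real log format is not shown in the slides.
SAMPLE_LOG = """\
2016-05-01 10:00:01 INFO recording started
2016-05-01 10:00:05 WARN low disk space
2016-05-01 10:00:09 INFO recording stopped
2016-05-01 10:00:12 ERROR upload failed
"""

# Assumed layout: date, time, event type in capitals, free-text message.
LINE_PATTERN = re.compile(r"^\S+ \S+ (?P<event>[A-Z]+) .*$")


def count_events(log_text):
    """Return a dict mapping each event type to its number of occurrences."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:  # skip lines that do not fit the assumed layout
            counts[match.group("event")] += 1
    return dict(counts)


print(count_events(SAMPLE_LOG))
# {'INFO': 2, 'WARN': 1, 'ERROR': 1}
```

The Pig version follows the same shape: load the log lines, extract the event type from each line, group by that type, and count each group.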
Create and Run a Pig Script to Analyze the Logs
Learning Resources
• Hadoop Tutorial: www.edureka.co/blog/hadoop-tutorial
• Pig Tutorial: https://www.edureka.co/blog/pig-tutorial
• Operators in Pig: https://www.edureka.co/blog/operators-in-apache-pig/
Thank You … Questions/Queries/Feedback

Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka