Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
The document outlines a training session on Apache Pig, an open-source high-level dataflow system that simplifies data processing tasks in Hadoop without requiring extensive programming knowledge. It compares Apache Pig and MapReduce, highlighting Pig's ease of use, reduced coding effort, and built-in support for common data operations. Additionally, it includes a case study on Twitter's use of Hadoop for data analysis and discusses Pig's architecture, data models, operators, and practical applications in log analysis.
Overview of Edureka's Hadoop Certification Training and the agenda including topics like Apache Pig and its architecture.
Discusses the programming skills required for MapReduce and addresses the concerns of non-programmers by pointing to alternative tools in the Hadoop ecosystem.
Introduction to Apache Pig, its components such as Pig Latin, and its advantages over MapReduce, including easy syntax and fewer lines of code. Explains Twitter's data growth, the decision to use Hadoop for data analysis, and the process of archiving tweets in HDFS.
Details about the architecture of Apache Pig and its components including Pig Latin and Pig Execution.
Explains the two modes to run Apache Pig: MapReduce Mode and Local Mode, detailing their operational differences.
Introduces Pig's data models including Tuples, Bags, and Maps, with examples to illustrate the concepts.
Describes various Pig operators and their functionalities such as LOAD, FILTER, JOIN, and ORDER BY.
Demonstrates executing Pig commands, analyzing log files using Pig, and creating a script to process logs. Provides links to additional learning resources on Hadoop and Pig, followed by a thank-you note for queries.
www.edureka.co/big-data-and-hadoop | EDUREKA HADOOP CERTIFICATION TRAINING

Agenda for today's session:
- Entry of Apache Pig
- Pig vs MapReduce
- Twitter Case Study on Apache Pig
- Apache Pig Architecture
- Pig Components
- Pig Data Model & Operators
- Running Pig Commands and Pig Scripts (Log Analysis)
Apache Pig is an open-source, high-level dataflow system introduced by Yahoo. It provides an abstraction over MapReduce and has two main components: the Pig Latin language and the Pig execution environment. Fun fact: 10 lines of Pig Latin are roughly equivalent to 200 lines of a MapReduce Java program.
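As a sketch of that abstraction, here is the classic word-count job, which takes a few lines of Pig Latin instead of a couple of hundred lines of MapReduce Java (the file paths are illustrative, not from the original deck):

```pig
-- Load a text file from HDFS, one line per record (path is hypothetical)
lines = LOAD '/data/input.txt' AS (line:chararray);

-- Split each line into individual words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words together and count each group
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;

-- Write the (word, count) pairs back to HDFS
STORE counts INTO '/data/wordcount_output';
```

Each relation name on the left of `=` is just an alias for the result of a transformation; Pig compiles the whole dataflow into MapReduce jobs behind the scenes.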
Apache Pig vs MapReduce:
- Pig is a high-level data-flow tool; MapReduce is a low-level data-processing paradigm.
- With Pig there is no need to write complex programs; with MapReduce you need to write programs in Java, Python, etc.
- Pig has built-in support for data operations like joins, filters, ordering, and sorting; performing the same operations in MapReduce is a humongous task.
- Pig provides nested data types like tuples, bags, and maps, which MapReduce does not have.
Why Apache Pig?
- Provides common data operations (filters, joins, ordering, etc.) and nested data types (tuples, bags, and maps) that are missing from MapReduce.
- Open source and actively supported by a community of developers.
- Can take any data: structured, semi-structured, or unstructured.
- A data-flow language that reads like a series of steps; an ad-hoc way of creating and executing MapReduce jobs on very large data sets.
- Easy to learn, easy to write, and easy to read.
- Extensible through UDFs (User Defined Functions), which can be written in languages such as Java, Python, JavaScript, or Ruby.
Twitter Case Study: Twitter's data was growing at an accelerating rate (around 10 TB/day), so Twitter decided to move its archived data to HDFS and adopt Hadoop to extract business value from it. The major aim was to analyse the data stored in Hadoop and produce multiple insights on a daily, weekly, or monthly basis. One insight they wanted to know: how many tweets are stored per user in the given tweet tables?
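That tweets-per-user question maps naturally onto a GROUP-and-COUNT in Pig Latin. A minimal sketch, assuming a comma-delimited tweet table with hypothetical field names and HDFS paths (Twitter's real schema is not given in the deck):

```pig
-- Load the tweet table from HDFS (path and schema are assumptions)
tweets = LOAD '/twitter/tweet_table'
         USING PigStorage(',')
         AS (user_id:long, tweet_id:long, text:chararray);

-- Group all tweets belonging to the same user
by_user = GROUP tweets BY user_id;

-- Count the tweets in each user's bag
tweet_counts = FOREACH by_user GENERATE group AS user_id, COUNT(tweets) AS num_tweets;

-- Store the result back on HDFS, as in the high-level flow
STORE tweet_counts INTO '/twitter/tweets_per_user';
```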
High-Level Implementation: the Twitter database holds a tweet table and a user table. The flow is: (1) put the tables on HDFS, (2) load the data into Pig, and (3) process the data in Pig and store the result back on HDFS.
Apache Pig Components: Pig Latin and Pig Execution. Pig Latin is made up of a series of operations or transformations that are applied to the input data to produce output. Pig Execution offers three ways to run that Pig Latin: Script (Pig commands stored in a .pig file), Grunt (an interactive shell for running Pig commands), and Embedded (running Pig scripts from within a Java program).
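The three invocation styles can be sketched from the command line as follows (the script file name is hypothetical; embedded mode uses the `PigServer` Java API rather than the shell):

```shell
# Script mode: run the commands stored in a .pig file
pig myscript.pig

# Grunt mode: launch the interactive shell, then type Pig Latin at the grunt> prompt
pig

# Embedded mode: no shell command; a Java program drives Pig via org.apache.pig.PigServer
```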
Pig Running Modes: you can run Apache Pig in two modes.
- MapReduce Mode: the default mode, which requires access to a Hadoop cluster and an HDFS installation. Input and output in this mode live on HDFS. Command: pig
- Local Mode: runs on a single machine using the local host and local file system, selected with the -x flag. Input and output in this mode live on the local file system. Command: pig -x local
Pig Data Model – Tuple and Bag. A tuple is an ordered set of fields, and each field may hold a different data type. Example of a tuple: (1, Linkin Park, 7, California). A bag is an unordered collection of tuples, and these tuples may be subsets of rows or entire rows of a table. Example of a bag: {(Linkin Park, 7, California), (Metallica, 8), (Mega Death, Los Angeles)}
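Tuples and bags show up naturally when loading and grouping data. A small sketch using a hypothetical bands file (the file name and delimiter are assumptions):

```pig
-- Each input record becomes one tuple, e.g. (1, Linkin Park, 7, California)
bands = LOAD 'bands.txt' USING PigStorage(',')
        AS (id:int, name:chararray, members:int, origin:chararray);

-- Grouping collects the tuples for each key into a bag
by_origin = GROUP bands BY origin;

-- Each record of by_origin is now (origin, {bag of band tuples})
DUMP by_origin;
```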
Pig Data Model – Map and Atom. A map is a set of key-value pairs used to represent data elements. Examples of maps: [band#Linkin Park, members#7], [band#Metallica, members#8]. Atoms are the basic data types found in most languages: string, int, float, long, double, char[], byte[].
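In Pig Latin a map field is declared with `map[]` and values are looked up by key with the `#` operator. A sketch, assuming each input line holds one map literal such as `[band#Linkin Park,members#7]` (file name is hypothetical):

```pig
-- Load records whose single field is a map of band attributes
band_maps = LOAD 'band_maps.txt' AS (info:map[]);

-- Pull a value out of the map by its key
names = FOREACH band_maps GENERATE info#'band' AS band_name;

DUMP names;
```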
Pig Operators:
- LOAD – loads data from the local file system or HDFS into Pig
- FOREACH – generates data transformations based on columns of data
- FILTER – selects tuples from a relation based on a condition
- JOIN – joins relations based on a column
- ORDER BY – sorts a relation based on one or more fields
- STORE – saves results to the local file system or HDFS
- DISTINCT – removes duplicate tuples from a relation
- GROUP – groups together tuples with the same group key (key field)
- COGROUP – the same as GROUP, but used when multiple relations are involved
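Most of these operators can be seen together in one short dataflow. A sketch chaining them over two hypothetical input files (schemas, paths, and the 4-member filter are all illustrative):

```pig
-- LOAD two relations (comma-delimited files; schemas are assumptions)
bands  = LOAD 'bands.txt'  USING PigStorage(',') AS (id:int, name:chararray, members:int, origin:chararray);
albums = LOAD 'albums.txt' USING PigStorage(',') AS (band_id:int, title:chararray, year:int);

-- FILTER: keep only bands with at least 4 members
big = FILTER bands BY members >= 4;

-- JOIN the two relations on the shared key
joined = JOIN big BY id, albums BY band_id;

-- FOREACH: project just the columns we care about
titles = FOREACH joined GENERATE big::name AS name, albums::title AS title;

-- DISTINCT: drop duplicate (name, title) tuples
uniq = DISTINCT titles;

-- ORDER BY: sort alphabetically by band name
sorted = ORDER uniq BY name;

-- STORE the result to the file system
STORE sorted INTO 'band_albums_out';
```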
Analysing Logs Using Apache Pig: consider an application that processes sample class recordings, and a log file that records all the events occurring while the application runs. We will analyse this log file to determine the types of events it contains and the count of each event type.
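The event-count analysis follows the same GROUP-and-COUNT pattern used in the Twitter example. A sketch, assuming a comma-delimited log line of the form "timestamp,event_type,details" (the real log format is not shown in the deck):

```pig
-- Load the application log (path and schema are assumptions)
logs = LOAD '/logs/app.log' USING PigStorage(',')
       AS (ts:chararray, event:chararray, details:chararray);

-- Group all log lines that share the same event type
by_event = GROUP logs BY event;

-- Count the lines in each group, giving one (event, count) tuple per event type
event_counts = FOREACH by_event GENERATE group AS event, COUNT(logs) AS cnt;

-- Print the distinct event types and how often each occurred
DUMP event_counts;
```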