Big data analytics with Apache Hadoop

BIG DATA ANALYTICS WITH APACHE- HADOOP “Big Data: A Revolution that Will Transform How We Live, Work, and Think” -Viktor Mayer-Schönberger and Kenneth Cukier

Team Members Abhishek Kumar : Y11UC010 Sachin Mittal : Y11UC189 Subodh Rawani : Y11UC230 Suman Saurabh : Y11UC231

Contents 1. What is Big Data ?  Definition  Turning Data to Value: 5v’s 2. Big Data Analytics 3. Big Data and Hadoop  History of Hadoop  About Apache Hadoop  Key Features of Hadoop 4. Hadoop and MapReduce  About MapReduce  MapReduce Architecture  MapReduce Functionality  MapReduce Examples

Definition “Data is the oil of the 21st century, and analytics is the combustion engine” -Peter Sondergaard, Senior Vice President, Gartner Research “Big- Data are high volume, high velocity and high variety of information assets that require new form of processing to enable enhanced decision making insight discovery & process optimisation.” “It is a subjective term, what involves is analysis of data from multiple sources and is joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide”. -Tom White in Hadoop the Definitive Guide Big Data is fuelled by two things: • The increasing ‘datafication’ of the world, allows to generate new data at frightening rates. • Technological advancement to harness those large and complex data and perform analysis using improved techniques.

Big data describes the exponential growth and availability of data, both structured and unstructured. This data are from everywhere: Climate Sensors, Social Media post, Digital files, Buy/Sell transaction records, Cell phone GPS signal and others.

Statistics of Data Generated Big Data in Today’s Business and Technology Environment  235 Terabytes of data has been collected by the U.S. Library of Congress in April 2011. (Source)  Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data. (Source)  Walmart handles more than 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes of data. (Source)  More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide. (Source)  In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day. (Source) The Rapid Growth of Unstructured Data  YouTube users upload 48 hours of new video every minute of the day. (Source)  Brands and organizations on Facebook receive 34,722 Likes every minute of the day. (Source)  Twitter’s sees roughly 175 million tweets every day, and has more than 465 million accounts. (Source)  In late 2011, IDC Digital Universe published a report indicating that some 1.8 zettabytes of data will be created that year. (Source)  In other words, the amount of data in the world today is equal to:  Every person in the world having more than 215m high- resolution MRI scans a day.  More than 200bn HD movies – which would take a person 47m years to watch.

Turning Big Data into Value: 5V’s The Digital Era gives unprecedented amounts of data in terms of Volume, Velocity, Variety and Veracity and properly channelling them to Value. Value Volume: Refers to the Terabytes, Petabytes as well as Zettabytes of data generated every second. Velocity: Speed at which new data is generated every second. E.g. Google, Twitter, Facebook Variety: Different formats data such as text, images, video, video and so on can be stored and processed rather than only Relational Databases. Veracity: Trustworthiness of the data. E.g. Twitter data with hash tags, abbreviations, typos and colloquial speech as well as the reliability and accuracy of content. Though not reliable can also be processed. Value: Having access to big data is no good unless we can turn it into value.

The ‘Datafication’ of our World; •Activities •Conversations •Words •Voice •Social Media •Browser logs •Photos •Videos •Sensors •Etc. Volume Veracity Variety Velocity Analysis Analysing Big Data: •Text analytics •Sentiment analysis •Face recognition •Voice analytics •Movement analytics •Etc. Value Copied from: © 2014 Advanced Performance Institute, BWMC Ltd. New technologies in Distributed Systems and Cloud Computing together with the latest software and analysis approaches allow us to store and process data to Value at massive rate.

Some Big Data Use Case By Industry Telecommunications Network analytics Location-based services Retail Merchandise optimization Supply-Chain Management Banking Fraud Detection Trade Surveillance Media Click- Fraud Prevention Social Graph Analysis Energy Smart Meter Analytics Distribution load forecasting Manufacturing Customer Care Call Centers Customer Relationship Public Threats Detection Cyber Security Healthcare Clinical Trails data Analysis Supply Chain Management Insurance Catastrophe Modelling Claims Fraud

Challenges of big data  How to store and protect Big data?  How to organize and catalog the data that you have backed up?  How to keep costs low while ensuring that all the critical data is available you need it.  Analytical Challenges  Human Resources and Manpower  Technical Challenges  Privacy and Security

Why Big-Data Analytics? • Understand existing data resource. • Process them and uncover pattern, correlations and other useful information that can be used to make better decisions. • With big data analytics, data scientists and others can analyse huge volumes of data that conventional analytics and business intelligence solutions can't touch.

Traditional vs. Big Data Approaches IT Structures the data to answer that question IT Delivers a platform to enable creative discovery Business Explores what questions could be asked Business Users Determine what question to ask Monthly sales reports Profitability analysis Customer surveys Brand sentiment Product strategy Maximum asset utilization Big Data Approach Iterative & Exploratory Analysis Traditional Approach Structured & Repeatable Analysis

Tools Employed For Data Analytics • NoSQL Databases: MongoDB, Cassandra, Hbase, Hypertable. • Storage: S3, Hadoop Distributed File System • Servers: EC2, Google App Engine, Heroku • MapReduce: Hadoop, Hive, Pig, Cascading, S4, MapR • Processing: R, Yahoo! Pipes, Solr/Lucene, BigSheets,

Practical Examples of Data Analytics To better understand and target customers, companies expand their traditional data sets with social media data, browser, text analytics or sensor data to get a more complete picture of their customers. The big objective, in many cases, is to create predictive models. Using big data, Telecom companies can now better predict customer churn; retailers can predict what products will sell, and car insurance companies understand how well their customers actually drive. Better understand and target customers . The computing power of big data analytics enables us to find new cures and better understand and predict disease patterns. We can use all the data from smart watches and wearable devices to better understand links between lifestyles and diseases. Big data analytics also allow us to monitor and predict epidemics and disease outbreaks, simply by listening to what people are saying, i.e. “Feeling rubbish today - in bed with a cold” or searching for on the Internet. Improving Health Copied from: © 2014 Advanced Performance Institute, BWMC Ltd.

Practical Examples of Data Analytics Security services use big data analytics to foil terrorist plots and detect cyber attacks. Police forces use big data tools to catch criminals and even predict criminal activity and credit card companies use big data analytics it to detect fraudulent transactions. Improving Security and Law Enforcement. Big data is used to improve many aspects of our cities and countries. For example, it allows cities to optimize traffic flows based on real time traffic information as well as social media and weather data. A number of cities are currently using big data analytics with the aim of turning themselves into Smart Cities, where the transport infrastructure and utility processes are all joined up. Where a bus would wait for a delayed train and where traffic signals predict traffic volumes and operate to minimize jams. Improving and Optimizing Cities and Countries Copied from: © 2014 Advanced Performance Institute, BWMC Ltd.

Brief history of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project. Nutch was started in 2002, and a working crawler and search system quickly emerged. However their architecture wouldn’t scale to the billions of pages on the Web. In 2003 Google published paper on Google’s Distributed Filesystem (GFS) which was being used in production at Google. Hence in 2004 they implemented Nutch Distributed Filesystem (NDFS) using GFS architecture that would solve their storage needs for very large files generated as a part of the web crawl and indexing process. In 2004, Google published the paper that introduced MapReduce to the world. NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop.

Apache Hadoop  Framework for the distributed processing of large data sets across clusters of computers using simple programming models.  Designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.  Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.

Key Features of Hadoop 1. Flexible 2. Scalable 3. Building more efficient data economy 4. Cost Effective 5. Fault Tolerant

1) Flexible 1. Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. 2. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide. 3. We can develop Map- Reduce programs on Linux, Windows, OS-X in any language like Python, R, C++, Perl, Ruby, etc.

2) Scalable Scalability is one of the primary forces driving popularity and adoption of the Apache Hadoop project. A typical use case for Hadoop is an emerging Web site starting to run a five-node. New nodes can be added as needed, and added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top. 1. Yahoo reportedly ran numerous clusters having 4000+ nodes with four 1 TB drives per node, 15 PB of total storage capacity. 2. Facebook’s 2000-node warehouse cluster is provisioned for 21 PB of total storage capacity. Extrapolating the announced growth rate, its namespace should have close to 200 million objects by now. 3. eBay runs a 700-node cluster. Each node has 24 TB of local disk storage, 72 GB of RAM, and a 12-core CPU. Total cluster size is 16 PB. It is configured to run 26,000 MapReduce tasks simultaneously.

3) Building more efficient data economy Data is the new currency of the modern world. Businesses that successfully maximize its value will have a decisive impact on their own value and on their customers success. Apache Hadoop allows businesses to create highly scalable and cost- efficient data stores. It offers data value at unprecedented scale.

4) Cost Effective Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data. It's a cost-effective alternative to a conventional extract, transform, and load (ETL) process that extracts data from different systems, converts it into a structure suitable for analysis and reporting, and loads into database.

5) Fault tolerant When you lose a node, the system redirects work to another location of the data and continues processing without missing a fright beat. When any node becomes non-functional, then the node present nearby ie. Supernode which is near completion or has already completed its task reassigns itself to the task of that faulty node, The description of which is present in the shared memory. Therefore a faulty node does not have to wait for the Master node to notice about its non- functionality and hence reduce execution time in case any of the node gets faulty.

HDFS Architecture HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. HDFS clusters consist of a NameNode that manages the file system metadata and DataNodes that store the actual data. Uses: • Storage of large imported files from applications outside of the Hadoop ecosystem. • Staging of imported files to be processed by Hadoop applications.

Hive connects the gap between SQL based RDBMS and NoSQL based Hadoop. Datasets from HDFS and HBase can be mapped onto Hive from which queries can be written in an SQL like language called HiveQL. Though Hive may not be the perfect panacea for complex operations, it reduces the difficulty of having to write MapReduce jobs if a programmer knows SQL.. •Hbase: • Hive: Inspired by Google’s BigTable, HBase is a NoSQL distributed column- oriented database that runs on top of HDFS on which random read/write can be performed. HBase enables you to store and retrieve random data in near real-time. It can also be combined with MapReduce to ease bulk operations such as indexing or analysis. •Pig: Apache Pig uses the data flow language Pig Latin. Pig supports relational operations such as join, group and aggregate and it can be scaled across multiple servers simultaneously. Time intensive ETL operations, analytics on sample data, running complex tasks that collates multiple data sources are some of the use cases that can be handled using Pig.

Flume is a distributed system that aggregates streaming data from different sources and adds them to a centralized datastore for Hadoop cluster such as HDFS. Flume facilitates data aggregation which involves importing and processing data for computation into HDFS or storage in databases. • Sqoop: •Flume: Sqoop is the latest Hadoop framework to get enlisted in Bossie award for open source big data tools. Sqoop enables two-way import/export of bulk data between HDFS/Hive/HBase and relational or structured databases. Unlike Flume, Sqoop helps in data transfer of structured datasets. • Mahout: Mahout is a suite of scalable machine learning libraries implemented on top of MapReduce. Commercial use cases of machine learning include predictive analysis via collaborative filtering, clustering and classification. Product/service recommendations, investigative data mining, statistical analysis are some of its generic use cases.

MapReduce  MapReduce is a programming paradigm for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.  The framework is divided into two parts:  Map, allows to parcels out work to different nodes in the distributed cluster.  Reduce, collates the work and resolves the results into a single value.  MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. Master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks.  Although the Hadoop framework is implemented in Java, MapReduce applications can be written in Python, Ruby, R, C++. Eg. Hadoop Streaming, Hadoop Pipes.

Hadoop- MapReduce Architecture

Map Reduce core functionality (I) • Data flow beyond the two key pieces (map and reduce): • Input reader – divides input into appropriate size splits which get assigned to a Map function. • Map function – maps file data to smaller, intermediate <key, value> pairs. • Compare function – input for Reduce is pulled from the Map intermediate output and sorted according to the compare function. • Reduce function – takes intermediate values and reduces to a smaller solution handed back to the framework. • Output writer – writes file output

How MapReduce Works User to do list:  Indicate • input/output files • M: number of map tasks • R: number of reduce tasks • W: number of machines  Write map and reduce functions  Submit the job  Input files are split into M pieces on distributed file system • Typically ~ 64 MB blocks  Intermediate files created from map tasks are written to local disk  A sorted and shuffled output is sent to reduce framework (combiner is also used in most of the cases).  Output files are written to distributed file system.

MAP Reduce Examples 1. WordCount ( Reads the text file and counts how often words occur ). 2. TopN ( To find top-n used words of a text file ).

1. WordCount Reads text files and counts how often each word occur. The input and the output are text files, Need three classes: • WordCount.java: Driver class with main function • WordMapper.java: Mapper class with map method • SumReducer.java: Reducer class with reduce method

WordCount Example (Contd.) WordMapper.java Mapper class with map function For the given sample input assuming two map nodes The sample input is distributed to the maps the first map emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1> The second map emits: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

WordCount Example (Contd.) SumReducer.java Reducer class with reduce function For the input from two Mappers the reduce method just sums up the values, which are the occurrence counts for each key Thus the output of the job is: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

WordCount (Driver) Job job = Job.getInstance(new Configuration()); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.setJarByClass(WordCount.class); job.submit(); } } import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.out.println("usage: [input] [output]"); System.exit(-1); }

public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.out.println("usage: [input] [output]"); System.exit(-1); } Job job = Job.getInstance(new Configuration()); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.setJarByClass(WordCount.class); job.submit(); } } Check Input and Output files WordCount (Driver)

Set output (key, value) types public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.out.println("usage: [input] [output]"); System.exit(-1); } Job job = Job.getInstance(new Configuration()); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.setJarByClass(WordCount.class); job.submit(); } } WordCount (Driver)

Set Mapper/Reducer classes public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.out.println("usage: [input] [output]"); System.exit(-1); } Job job = Job.getInstance(new Configuration()); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.setJarByClass(WordCount.class); job.submit(); } } WordCount (Driver)

Set Input/Output format classes public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.out.println("usage: [input] [output]"); System.exit(-1); } Job job = Job.getInstance(new Configuration()); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.setJarByClass(WordCount.class); job.submit(); } } WordCount (Driver)

Set Input/Output paths public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.out.println("usage: [input] [output]"); System.exit(-1); } Job job = Job.getInstance(new Configuration()); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.setJarByClass(WordCount.class); job.submit(); } } WordCount (Driver)

Set Driver class public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.out.println("usage: [input] [output]"); System.exit(-1); } Job job = Job.getInstance(new Configuration()); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.setJarByClass(WordCount.class); job.submit(); } } WordCount (Driver)

Submit the job to the master node public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.out.println("usage: [input] [output]"); System.exit(-1); } Job job = Job.getInstance(new Configuration()); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.setJarByClass(WordCount.class); job.submit(); } } WordCount (Driver)

WordMapper (Mapper class) import java.io.IOException; import java.util.StringTokenizer; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Mapper; public class WordMapper extends Mapper<Object, Text, Text, IntWritable> { private Text word = new Text(); private final static IntWritable one = new IntWritable(1); @Override public void map(Object key, Text value, Context contex) throws IOException, InterruptedException { // Break line into words for processing StringTokenizer wordList = new StringTokenizer(value.toString()); while (wordList.hasMoreTokens()) { word.set(wordList.nextToken()); contex.write(word, one); } } }

public class WordMapper extends Mapper<Object, Text, Text, IntWritable> { private Text word = new Text(); private final static IntWritable one = new IntWritable(1); @Override public void map(Object key, Text value, Context contex) throws IOException, InterruptedException { // Break line into words for processing StringTokenizer wordList = new StringTokenizer(value.toString()); while (wordList.hasMoreTokens()) { word.set(wordList.nextToken()); contex.write(word, one); } } } Extends mapper class with input/ output keys and values WordMapper (Mapper class)

public class WordMapper extends Mapper<Object, Text, Text, IntWritable> { private Text word = new Text(); private final static IntWritable one = new IntWritable(1); @Override public void map(Object key, Text value, Context contex) throws IOException, InterruptedException { // Break line into words for processing StringTokenizer wordList = new StringTokenizer(value.toString()); while (wordList.hasMoreTokens()) { word.set(wordList.nextToken()); contex.write(word, one); } } } Output (key, value) typesWordMapper (Mapper class)

public class WordMapper extends Mapper<Object, Text, Text, IntWritable> { private Text word = new Text(); private final static IntWritable one = new IntWritable(1); @Override public void map(Object key, Text value, Context contex) throws IOException, InterruptedException { // Break line into words for processing StringTokenizer wordList = new StringTokenizer(value.toString()); while (wordList.hasMoreTokens()) { word.set(wordList.nextToken()); contex.write(word, one); } } } Input (key, value) types Output as Context type WordMapper (Mapper class)

Read words from each line of the input file public class WordMapper extends Mapper<Object, Text, Text, IntWritable> { private Text word = new Text(); private final static IntWritable one = new IntWritable(1); @Override public void map(Object key, Text value, Context contex) throws IOException, InterruptedException { // Break line into words for processing StringTokenizer wordList = new StringTokenizer(value.toString()); while (wordList.hasMoreTokens()) { word.set(wordList.nextToken()); contex.write(word, one); } } } WordMapper (Mapper class)

Count each word public class WordMapper extends Mapper<Object, Text, Text, IntWritable> { private Text word = new Text(); private final static IntWritable one = new IntWritable(1); @Override public void map(Object key, Text value, Context contex) throws IOException, InterruptedException { // Break line into words for processing StringTokenizer wordList = new StringTokenizer(value.toString()); while (wordList.hasMoreTokens()) { word.set(wordList.nextToken()); contex.write(word, one); } } } WordMapper (Mapper class)

Shuffler/Sorter Maps emit (key, value) pairs Shuffler/Sorter of Hadoop framework Sort (key, value) pairs by key Then, append the value to make (key, list of values) pair For example, The first, second maps emit: <Hello, 1> <World, 1> <Bye, 1> <World, 1> <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1> Shuffler produces and it becomes the input of the reducer <Bye, 1>, <Goodbye, 1>, <Hadoop, <1,1>>, <Hello, <1, 1>>, <World, <1,1>>

SumReducer (Reducer class) import java.io.IOException; import java.util.Iterator; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Reducer; public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable totalWordCount = new IntWritable(); @Override public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int wordCount = 0; Iterator<IntWritable> it=values.iterator(); while (it.hasNext()) { wordCount += it.next().get(); } totalWordCount.set(wordCount); context.write(key, totalWordCount); } }

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable totalWordCount = new IntWritable(); @Override public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int wordCount = 0; Iterator<IntWritable> it=values.iterator(); while (it.hasNext()) { wordCount += it.next().get(); } totalWordCount.set(wordCount); context.write(key, totalWordCount); } } Extends Reducer class with input/ output keys and valuesSumReducer (Reducer class)

Set output value type SumReducer (Reducer class) public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable totalWordCount = new IntWritable(); @Override public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int wordCount = 0; Iterator<IntWritable> it=values.iterator(); while (it.hasNext()) { wordCount += it.next().get(); } totalWordCount.set(wordCount); context.write(key, totalWordCount); } }

Set input (key, list of values) type and output as Context class SumReducer (Reducer class) public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable totalWordCount = new IntWritable(); @Override public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int wordCount = 0; Iterator<IntWritable> it=values.iterator(); while (it.hasNext()) { wordCount += it.next().get(); } totalWordCount.set(wordCount); context.write(key, totalWordCount); } }

For each word, Count/sum the number of values SumReducer (Reducer class) public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable totalWordCount = new IntWritable(); @Override public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int wordCount = 0; Iterator<IntWritable> it=values.iterator(); while (it.hasNext()) { wordCount += it.next().get(); } totalWordCount.set(wordCount); context.write(key, totalWordCount); } }

For each word, Total count becomes the value SumReducer (Reducer class) public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable totalWordCount = new IntWritable(); @Override public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int wordCount = 0; Iterator<IntWritable> it=values.iterator(); while (it.hasNext()) { wordCount += it.next().get(); } totalWordCount.set(wordCount); context.write(key, totalWordCount); } }

Reducer Input: Shuffler produces and it becomes the input of the reducer <Bye, 1>, <Goodbye, 1>, <Hadoop, <1,1>>, <Hello, <1, 1>>, <World, <1,1>> Output <Bye, 1>, <Goodbye, 1>, <Hadoop, 2>, <Hello, 2>, <World, 2> SumReducer

Map() The Mapper implementation, via the map method, processes one line at a time, as provided by the specified TextInputFormat. It then splits the line into tokens separated by whitespaces, via the StringTokenizer, and emits a key-value pair of < <word>, 1>. For asample input the first map emits: < Deer, 1> < Beer, 1> < River, 1> The second map emits: < Car, 1> < River, 1> < Car, 1> Map() and Reduce() The output of the first map: < Deer, 1> < Beer, 1> < River, 1> The output of the second map: < Car, 2> < River, 1>

Map() and Reduce() (Continued) Reducer() The Reducer implementation, via the reduce method just sums up the values, which are the occurence counts for each key (i.e. words in this example).

2. TopN  We want to find top-n used words of a text file: “Flatland” by E. Abbot.  The input and the output are text files,  Need three classes  TopN.java  Driver class with main function  TopNMapper.java  Mapper class with map method  TopNReducer.java  Reducer class with reduce method

TopN(Driver) Configuration conf = new Configuration(); String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs(); if (otherArgs.length != 2) { System.err.println("Usage: TopN <in> <out>"); System.exit(2); } Job job = Job.getInstance(conf); job.setJobName("Top N"); job.setJarByClass(TopN.class); job.setMapperClass(TopNMapper.class); //job.setCombinerClass(TopNReducer.class); job.setReducerClass(TopNReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.util.GenericOptionsParser; import java.io.IOException; import java.util.*; public class TopN { public static void main(String[] args) throws Exception {

TopNMapper /** * The mapper reads one line at the time, splits it into an array of single words and emits every * word to the reducers with the value of 1. */ public static class TopNMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); private String tokens = "[_|$#<>^=[]*/,;,.-:()?!"']"; @Override public void map(Object key, Text value, Context context) throws IOException, InterruptedException { String cleanLine = value.toString().toLowerCase().replaceAll(tokens, " "); StringTokenizer itr = new StringTokenizer(cleanLine); while (itr.hasMoreTokens()) { word.set(itr.nextToken().trim()); context.write(word, one); } } }

TopNReducer /** * The reducer retrieves every word and puts it into a Map: if the word already exists in the * map, increments its value, otherwise sets it to 1. */ public static class TopNReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private Map<Text, IntWritable> countMap = new HashMap<>(); @Override public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { //computes the number of occurrences of a single word int sum = 0; for (IntWritable val : values) { sum += val.get(); } // puts the number of occurrences of this word into the map. // We need to create another Text object because the Text instance // we receive is the same for all the words countMap.put(new Text(key), new IntWritable(sum)); }

@Override protected void cleanup(Context context) throws IOException, InterruptedException { Map<Text, IntWritable> sortedMap = sortByValues(countMap); int counter = 0; for (Text key : sortedMap.keySet()) { if (counter++ == 20) { break; } context.write(key, sortedMap.get(key)); } } }

TopN- Results The 2286 Of 1634 And 1098 That 499 You 429 Not 317 But 279 For 267 By 317 In shuffle and sort phase, the partioner will send every single word (the key) with the value “1” to the reducers. All these network transmissions can be minimized if we reduce the data locally the data that the mapper will emit. This is obtained by Combiner.

TopNCombiner /** * The combiner retrieves every word and puts it into a Map: if the word already exists in the * map, increments its value, otherwise sets it to 1. */ public static class TopNCombiner extends Reducer<Text, IntWritable, Text, IntWritable> { @Override public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { // computes the number of occurrences of a single word int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } }

Hadoop Output: With and Without Combiner Without Combiner ->  Map input records = 4239  Map output records = 37817  Map output bytes = 359621  Input split bytes = 118  Combine input records = 0  Combine output records = 0  Reduce input groups = 4987  Reduce shuffle bytes = 435261  Reduce input records = 37817  Reduce output records = 20 With Combiner ->  Map input records = 4239  Map output records = 37817  Map output bytes = 359621  Input split bytes = 116  Combine input records = 37817  Combine output records = 20  Reduce input groups = 20  Reduce shuffle bytes = 194  Reduce input records = 20  Reduce output records = 20

Advantages and Disadvantages of using Combiner Advantages -> Network transmission are minimized. Disadvantages -> Hadoop doesn’t guarantee the execution of a combiner: it can be executed 0,1 or multiple times on the same input. Key-value pairs emitted from mapper are stored in local file system, and execution of combiner can cause extensive IO operations.

Sources  http://wikibon.org/blog/big-data-statistics/  https://en.wikipedia.org/wiki/Big_data  http://blog.qburst.com/2014/08/hadoop-big-data-analytics-tools/

Big data analytics with Apache Hadoop

More Related Content

What's hot

Viewers also liked

Similar to Big data analytics with Apache Hadoop

Recently uploaded

In this document

Big data analytics with Apache Hadoop