Java MapReduce Programming on Apache Hadoop Aaron T. Myers, aka ATM with thanks to Sandy Ryza
Introductions ● Software Engineer/Tech Lead for HDFS at Cloudera ● Committer/PMC Member on the Apache Hadoop project ● My work focuses primarily on HDFS and Hadoop security
What is MapReduce? ● A distributed programming paradigm
What is a distributed programming paradigm? Help!
What is a distributed programming paradigm?
Distributed Systems are Hard ● Monitoring ● RPC protocols, serialization ● Fault tolerance ● Deployment ● Scheduling/Resource Management
Writing Data Parallel Programs Should Not Be
MapReduce to the Rescue ● You specify map(...) and reduce(...) functions ○ map = (list(k, v) -> list(k, v)) ○ reduce = (k, list(v) -> k, v) ● The framework does the rest ○ Split up the data ○ Run several mappers over the splits ○ Shuffle the data around for the reducers ○ Run several reducers ○ Store the final results
Map
[Diagram: the input data "apple apple banana a happy airplane airplane on the runway runway apple runway rumple on the apple" is divided into map inputs, each fed to one of several map() calls; every map() emits a (word, 1) pair per token (apple - 1, apple - 1, banana - 1, a - 1, happy - 1, airplane - 1, on - 1, the - 1, runway - 1, ...), and the map outputs then flow into the shuffle.]
Reduce
[Diagram: the shuffle groups the map outputs by key (a - 1; airplane - 1; apple - 1, 1, 1, 1; banana - 1; on - 1, 1; runway - 1, 1, 1; rumple - 1; the - 1, 1), each group is fed to its own reduce() call, and each reduce() sums its values to produce the final output: a - 1, airplane - 1, apple - 4, banana - 1, on - 2, runway - 3, rumple - 1, the - 2.]
What is (Core) Hadoop? ● An open source platform for storing, processing, and analyzing enormous amounts of data ● Consists of… ○ A distributed file system (HDFS) ○ An implementation of the Map/Reduce paradigm (Hadoop MapReduce) ● Written in Java!
What is Hadoop? Traditional Operating System Storage: File System Execution/Scheduling: Processes
What is Hadoop? Hadoop (Distributed operating system) Storage: Hadoop Distributed File System (HDFS) Execution/Scheduling: MapReduce
HDFS (briefly) ● Distributed file system that runs on all nodes in the cluster ○ Co-located with Hadoop MapReduce daemons ● Looks like a pretty normal Unix file system ○ hadoop fs -ls /user/atm/ ○ hadoop fs -cp /user/atm/data.txt /user/atm/data2.txt ○ hadoop fs -rm /user/atm/data.txt ○ … ● Don’t use the normal Java File API ○ Instead use org.apache.hadoop.fs.FileSystem API
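For example, reading a file back out of HDFS from Java looks like this; a minimal sketch, assuming the Hadoop configuration files on the classpath point at the cluster (the path reuses /user/atm/data.txt from the slide above):

// Minimal sketch: reading an HDFS file via the FileSystem API.
// Assumes the default file system in the loaded Configuration is HDFS.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);        // connects to the default FS
    Path path = new Path("/user/atm/data.txt");  // path from the slide above
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(path)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}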
Writing MapReduce programs in Java ● Interface to MapReduce in Hadoop is Java API ● WordCount!
Word Count Map Function

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();  // reused across calls to avoid allocation

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);  // emit (word, 1) for every token
    }
  }
}
Word Count Reduce Function

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();  // add up the 1s emitted for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count Driver
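A minimal sketch of the driver, assuming the same old org.apache.hadoop.mapred API as the mapper and reducer above, with input and output paths taken from the command line:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    // Types of the (key, value) pairs emitted by the reducer (and, since
    // they match here, by the mapper as well).
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);  // submit the job and wait for completion
  }
}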
InputFormats ● TextInputFormat ○ Each line becomes <LongWritable, Text> = <byte offset in file, whole line> ● KeyValueTextInputFormat ○ Splits lines on delimiter into Text key and Text value ● SequenceFileInputFormat ○ Reads key/value pairs from SequenceFile, a Hadoop format ● DBInputFormat ○ Uses JDBC to connect to a database ● Many more, or write your own!
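Choosing a non-default InputFormat is one line in the driver; a sketch using KeyValueTextInputFormat with the old API (the separator property name below is the Hadoop 1.x one and is an assumption worth verifying against your version):

// Sketch: configuring KeyValueTextInputFormat on the JobConf above.
conf.setInputFormat(org.apache.hadoop.mapred.KeyValueTextInputFormat.class);
// Hadoop 1.x property name (assumption); the default separator is tab.
conf.set("key.value.separator.in.input.line", "\t");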
Serialization ● Writables ○ Native to Hadoop ○ Implement serialization for higher level structures yourself ● Avro ○ Extensible ○ Cross-language ○ Handles serialization of higher level structures for you ● And others… ○ Parquet, Thrift, etc.
Writables

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Note: to use a type as a *key*, implement WritableComparable instead.
public class MyNumberAndStringWritable implements Writable {

  private int number;
  private String str;

  public void write(DataOutput out) throws IOException {
    // Serialize the fields in a fixed order...
    out.writeInt(number);
    out.writeUTF(str);
  }

  public void readFields(DataInput in) throws IOException {
    // ...and deserialize them in exactly the same order.
    number = in.readInt();
    str = in.readUTF();
  }
}
Avro

protocol MyMapReduceObjects {
  record MyNumberAndString {
    string str;
    int number;
  }
}
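Compiling that IDL (e.g. with the Avro Maven plugin) produces a specific-record Java class; a one-liner sketch of building an instance, assuming the generated builder API:

// Sketch: using the Java class Avro generates from the record above.
MyNumberAndString rec = MyNumberAndString.newBuilder()
    .setStr("apple")
    .setNumber(4)
    .build();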
Testing MapReduce Programs ● First, write unit tests (duh) with MRUnit ● LocalJobRunner ○ Runs job in single process ● Single-node cluster (Cloudera VM!) ○ Multiple processes on the same machine ● On the real cluster
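Of these, LocalJobRunner needs only configuration; a sketch using the Hadoop 1.x property names (an assumption; newer releases use mapreduce.framework.name=local):

// Sketch: forcing a job onto LocalJobRunner, Hadoop 1.x property names.
conf.set("mapred.job.tracker", "local");  // run the whole job in one local JVM
conf.set("fs.default.name", "file:///");  // use the local FS instead of HDFS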
MRUnit

@Test
public void testMapper() throws IOException {
  MapDriver<LongWritable, Text, Text, IntWritable> mapDriver =
      new MapDriver<LongWritable, Text, Text, IntWritable>(new WordCountMapper());
  String line = "apple banana banana carrot";
  mapDriver.withInput(new LongWritable(0), new Text(line));
  mapDriver.withOutput(new Text("apple"), new IntWritable(1));
  mapDriver.withOutput(new Text("banana"), new IntWritable(1));
  mapDriver.withOutput(new Text("banana"), new IntWritable(1));
  mapDriver.withOutput(new Text("carrot"), new IntWritable(1));
  mapDriver.runTest();
}
MRUnit

@Test
public void testReducer() throws IOException {
  ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver =
      new ReduceDriver<Text, IntWritable, Text, IntWritable>(new WordCountReducer());
  reduceDriver.withInput(new Text("apple"),
      Arrays.asList(new IntWritable(1), new IntWritable(2)));
  reduceDriver.withOutput(new Text("apple"), new IntWritable(3));
  reduceDriver.runTest();
}
Counters

Map-Reduce Framework
  Map input records=183
  Map output records=183
  Map output bytes=533563
  Map output materialized bytes=534190
  Input split bytes=144
  Combine input records=0
  Combine output records=0
  Reduce input groups=183
  Reduce shuffle bytes=0
  Reduce input records=183
  Reduce output records=183
  Spilled Records=366
  Shuffled Maps =0
  Failed Shuffles=0
  Merged Map outputs=0
  GC time elapsed (ms)=7
  CPU time spent (ms)=0
  Physical memory (bytes) snapshot=0
  Virtual memory (bytes) snapshot=0
File System Counters
  FILE: Number of bytes read=1844866
  FILE: Number of bytes written=1927344
  FILE: Number of read operations=0
  FILE: Number of large read operations=0
  FILE: Number of write operations=0
File Input Format Counters
  Bytes Read=655137
File Output Format Counters
  Bytes Written=537484
Counters

if (record.isUgly()) {
  context.getCounter("Ugly Record Counters", "Ugly Records").increment(1);
}
Counters

Map-Reduce Framework
  Map input records=183
  Map output records=183
  Map output bytes=533563
  Map output materialized bytes=534190
  Input split bytes=144
  Combine input records=0
  Combine output records=0
  Reduce input groups=183
  Reduce shuffle bytes=0
  Reduce input records=183
  Reduce output records=183
  Spilled Records=366
  Shuffled Maps =0
  Failed Shuffles=0
  Merged Map outputs=0
  GC time elapsed (ms)=7
  CPU time spent (ms)=0
  Physical memory (bytes) snapshot=0
  Virtual memory (bytes) snapshot=0
File System Counters
  FILE: Number of bytes read=1844866
  FILE: Number of bytes written=1927344
  FILE: Number of read operations=0
  FILE: Number of large read operations=0
  FILE: Number of write operations=0
File Input Format Counters
  Bytes Read=655137
File Output Format Counters
  Bytes Written=537484
Ugly Record Counters
  Ugly Records=1024
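Back in the driver, the counter can be read once the job completes; a sketch assuming the new org.apache.hadoop.mapreduce API (the same API as the Context used in the increment above):

// Sketch: reading the custom counter from the driver (new mapreduce API).
Job job = Job.getInstance(conf, "wordcount");
// ... configure mapper, reducer, and input/output paths ...
job.waitForCompletion(true);
long uglyRecords = job.getCounters()
    .findCounter("Ugly Record Counters", "Ugly Records")
    .getValue();
System.out.println("Ugly Records = " + uglyRecords);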
Distributed Cache We need some data and libraries on all the nodes.
Distributed Cache
[Diagram: files are placed in HDFS and registered with the Distributed Cache; each node pulls down a local copy, which all of that node's map or reduce tasks then read.]
Distributed Cache

In our driver:

DistributedCache.addCacheFile(new URI("/some/path/to/ourfile.txt"), conf);

In our mapper or reducer:

private Path[] localFiles;  // local paths of the cached files

@Override
public void setup(Context context) throws IOException, InterruptedException {
  Configuration conf = context.getConfiguration();
  localFiles = DistributedCache.getLocalCacheFiles(conf);
}
Java Technologies Built on MapReduce
Crunch ● Library on top of MapReduce that makes it easy to write pipelines of jobs in Java ● Contains capabilities like joins and aggregation functions to save programmers from writing these for each job
Crunch

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Aggregate;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PCollection<String> words = lines.parallelDo("my splitter",
        new DoFn<String, String>() {
          public void process(String line, Emitter<String> emitter) {
            // Split on whitespace and emit each word.
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        }, Writables.strings());
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    pipeline.run();
  }
}
Mahout ● Machine Learning on Hadoop ○ Collaborative Filtering ○ User and Item based recommenders ○ K-Means, Fuzzy K-Means clustering ○ Dirichlet process clustering ○ Latent Dirichlet Allocation ○ Singular value decomposition ○ Parallel Frequent Pattern mining ○ Complementary Naive Bayes classifier ○ Random forest decision tree based classifier
Non-Java technologies that use MapReduce ● Hive ○ SQL -> M/R translator, metadata manager ● Pig ○ Scripting DSL -> M/R translator ● Distcp ○ HDFS tool to bulk copy data from one HDFS cluster to another
Thanks! ● Questions?
