Big Data & Hadoop
D. Praveen Kumar, Research Scholar (Full-Time)
Department of Computer Science & Engineering, YSREC of Yogi Vemana University, Proddatur, Kadapa Dt., A.P., India
Bapatla Engineering College, Bapatla, Guntur, November 8, 2016
Outline:
  1. Introduction
  2. Hadoop Installation
  3. Hadoop Configuration
  4. Starting & Stopping
  5. Map Reduce
  6. Execution
Guests = 4:
  - Transportation from the railway station to your home: one auto or car is sufficient.
  - Food: Mom can prepare food or snacks without difficulty.
  - Accommodation: your house is sufficient.
  - Facilities: the beds, bathrooms, water, and TV you already use are enough.
  - You can talk to each other, crack jokes, and keep everyone happy.
  - Expenditure: nearly Rs. 1,000/-
Guests = 100:
  - Transportation: 25 autos or cars, or two buses
  - Food: catering
  - Accommodation: a lodge
  - Facilities: AC, TV, and all other facilities
  - Maintenance: somewhat difficult
  - Expenditure: nearly Rs. 90,000/-
Guests = 10,000:
  - Transportation: 2,500 autos or 500 buses
  - Food: catering
  - Accommodation: all the lodges, function halls, and cottages in the town
  - Facilities: AC, TV, and all other facilities are somewhat difficult to provide
  - Maintenance: more difficult
  - Expenditure: nearly Rs. 2,00,000/-
Guests = 10,000,000:
  - Transportation: how many autos?
  - Food: ?
  - Accommodation: ?
  - Facilities: ?
  - Maintenance: ?
  - Cost: ?
Problems:
  - The same problems arise in a computing environment.
  - It is difficult to handle a huge and ever-growing amount of data.
  - The data cannot be processed with only a few machines.
  - Distributing large data sets is difficult.
  - Constructing online or offline models is very difficult.
Solution: A single solution to all these problems is
What is Big Data?
  - Big data refers to voluminous amounts of structured or unstructured data that organizations can potentially mine and analyze.
  - Big data is a huge collection of large data sets characterized by
Big Data Platforms and Analytical Software
Hadoop: Here we go with Why?
Hadoop:
  - Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
  - The base Apache Hadoop framework is composed of the following modules:
      - Hadoop Common: libraries and utilities needed by the other Hadoop modules
      - Hadoop Distributed File System (HDFS): a distributed file system that stores data
      - Hadoop YARN: a resource-management platform
      - Hadoop MapReduce: large-scale data processing
Hadoop Components
Requirements:
  - Necessary:
      - Java >= 7
      - ssh
      - Linux OS (Ubuntu >= 14.04)
      - Hadoop framework
  - Optional:
      - Eclipse
      - Internet connection
Java 7 Installation:
  - Hadoop requires a working Java installation. Java 1.7 or later is recommended.
  - Use one of the following commands to install Java on Linux:
      sudo apt-get install openjdk-7-jdk
    or
      sudo apt-get install default-jdk
Java PATH Setup:
  - We need to set the Java path.
  - Open the .bashrc file located in the home directory:
      gedit ~/.bashrc
  - Add the line below at the end:
      export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Installation & Configuration of SSH:
  - Hadoop requires SSH (Secure Shell) access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.
  - Install SSH using the following command:
      sudo apt-get install ssh
  - Next, generate a DSA SSH key with an empty passphrase for the user, and authorize it for login:
      ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
      cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Download & Extract Hadoop:
  - Download Hadoop from the Apache download mirrors:
      http://mirror.fibergrid.in/apache/hadoop/common/
  - Extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop.
      $ cd /usr/local
      $ sudo tar xzf hadoop-2.7.2.tar.gz
      $ sudo mv hadoop-2.7.2 hadoop
Add Hadoop configuration in .bashrc:
  - Add the following Hadoop settings to the .bashrc file in the home directory:
      export HADOOP_INSTALL=/usr/local/hadoop
      export PATH=$PATH:$HADOOP_INSTALL/bin
      export PATH=$PATH:$HADOOP_INSTALL/sbin
      export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
      export HADOOP_HDFS_HOME=$HADOOP_INSTALL
      export HADOOP_COMMON_HOME=$HADOOP_INSTALL
      export YARN_HOME=$HADOOP_INSTALL
      export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
      export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
Create the tmp, NameNode & DataNode directories:
  - Run the command below to create the NameNode directory:
      mkdir -p /usr/local/hadoopdata/hdfs/namenode
  - Run the command below to create the DataNode directory:
      mkdir -p /usr/local/hadoopdata/hdfs/datanode
  - Run the commands below to create the Hadoop tmp directory (hadoop1 is the user account in this example):
      sudo mkdir -p /app/hadoop/tmp
      sudo chown hadoop1:hadoop1 /app/hadoop/tmp
      sudo chmod 750 /app/hadoop/tmp
Files to Configure:
  - The following are the files we need to configure:
      - core-site.xml
      - hadoop-env.sh
      - mapred-site.xml
      - hdfs-site.xml
Add properties in /usr/local/hadoop/etc/hadoop/core-site.xml:
  - Add the following snippets between the <configuration> ... </configuration> tags in the core-site.xml file.
  - Add the property below to specify the location of the tmp directory:
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/app/hadoop/tmp</value>
      </property>
  - Add the property below to specify the location of the default file system and its port number:
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54310</value>
      </property>
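To make the structure of these properties concrete, here is a minimal sketch in plain Java that reads name/value pairs out of a configuration snippet and splits the file system address into scheme, host, and port. It uses only the JDK's XML parser and java.net.URI, not Hadoop's own configuration classes; the class name CoreSiteSketch and its methods are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Hadoop.
class CoreSiteSketch {

    // Collect every <property> as a name -> value entry, in document order.
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList props = doc.getElementsByTagName("property");
            Map<String, String> result = new LinkedHashMap<String, String>();
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                result.put(name, value);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration", e);
        }
    }

    // Split a file system address such as hdfs://localhost:54310 into its parts.
    static String describe(String address) {
        URI uri = URI.create(address);
        return "scheme=" + uri.getScheme() + " host=" + uri.getHost() + " port=" + uri.getPort();
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>hadoop.tmp.dir</name><value>/app/hadoop/tmp</value></property>"
                + "<property><name>fs.default.name</name><value>hdfs://localhost:54310</value></property>"
                + "</configuration>";
        Map<String, String> props = readProperties(xml);
        System.out.println(props.get("hadoop.tmp.dir"));
        System.out.println(describe(props.get("fs.default.name")));
    }
}
```

Running main prints /app/hadoop/tmp followed by scheme=hdfs host=localhost port=54310, which shows why the value must be written without spaces around the colons.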
Add properties in /usr/local/hadoop/etc/hadoop/hadoop-env.sh:
  - Uncomment the JAVA_HOME line and give the correct path for Java:
      export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Add property in /usr/local/hadoop/etc/hadoop/mapred-site.xml:
  - In this file we add the host name and port that the MapReduce job tracker runs at.
  - Add the following in mapred-site.xml:
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:54311</value>
      </property>
Add properties in ... etc/hadoop/hdfs-site.xml:
  - In the file hdfs-site.xml add the following.
  - Set the replication factor:
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
  - Specify the NameNode directory:
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoopdata/hdfs/namenode</value>
      </property>
  - Specify the DataNode directory (the property name is dfs.datanode.data.dir):
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoopdata/hdfs/datanode</value>
      </property>
Formatting the HDFS filesystem via the NameNode:
  - The first step in starting up your Hadoop installation is formatting the Hadoop file system.
  - This is needed only the first time you set up a Hadoop installation.
  - Do not format a running Hadoop filesystem, as you will lose all the data currently in HDFS.
  - To format the filesystem, run the command:
      hadoop namenode -format
Starting single-node cluster:
  - Run the command:
      start-all.sh
  - This will start up a NameNode, SecondaryNameNode, DataNode, ResourceManager, and NodeManager on your machine.
  - A nifty tool for checking whether the expected Hadoop processes are running is jps:
      hadoop1@hadoop1:/usr/local/hadoop$ jps
      2598 NameNode
      3112 ResourceManager
      3523 Jps
      2917 SecondaryNameNode
      2727 DataNode
      3242 NodeManager
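The jps check can also be automated. The sketch below, in plain Java, takes the text printed by jps and reports which of the five daemons listed above are missing. It assumes each jps line has the form "pid name", as in the sample output; the class name JpsCheckSketch is a hypothetical name for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Hadoop.
class JpsCheckSketch {

    // The daemons that start-all.sh is expected to launch on a single node.
    static final List<String> EXPECTED = Arrays.asList(
            "NameNode", "SecondaryNameNode", "DataNode", "ResourceManager", "NodeManager");

    // Return the expected daemons that do not appear in the jps output.
    static List<String> missingDaemons(String jpsOutput) {
        Set<String> running = new HashSet<String>();
        for (String line : jpsOutput.split("\\r?\\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                running.add(parts[1]); // parts[0] is the process id
            }
        }
        List<String> missing = new ArrayList<String>(EXPECTED);
        missing.removeAll(running);
        return missing;
    }

    public static void main(String[] args) {
        String sample = "2598 NameNode\n3112 ResourceManager\n3523 Jps\n"
                + "2917 SecondaryNameNode\n2727 DataNode\n3242 NodeManager";
        System.out.println(missingDaemons(sample)); // prints []
    }
}
```

Names are compared as whole words, so a running SecondaryNameNode is never mistaken for a running NameNode.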
Stopping your single-node cluster:
  - To stop all the daemons running on your machine, run the command:
      stop-all.sh
  - The output will look like this:
      stopping NodeManager
      localhost: stopping ResourceManager
      stopping NameNode
      localhost: stopping DataNode
      localhost: stopping SecondaryNameNode
Map-Reduce Framework:
  - MapReduce is a programming paradigm.
  - It relies on two basic functions, Map and Reduce.
  - MapReduce is used to manage many large-scale computations.
  - The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
  - This lets the framework schedule tasks effectively on the nodes where the data is already present.
Map-Reduce Computation Steps:
  - The key-value pairs from each Map task are collected by a master controller and sorted by key.
  - The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
  - The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way.
  - The manner of combination of values is determined by the code written by the user for the Reduce function.
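The computation steps above can be sketched in plain Java, without Hadoop, as three small functions run in one process: map emits (word, 1) pairs, shuffle groups the values by sorted key, and reduce sums the values of each key. This is only an in-memory illustration of the data flow, not how Hadoop distributes the work; the class name MapReduceSketch and the sample input lines are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory illustration; not part of Hadoop.
class MapReduceSketch {

    // Map step: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<String, Integer>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle step: group all values by key; the TreeMap keeps keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> pair : pairs) {
            List<Integer> values = grouped.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: combine all the values of one key, here by summing them.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run the three steps in order over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<Map.Entry<String, Integer>>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> e : shuffle(all).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data hadoop", "hadoop map reduce", "big hadoop");
        System.out.println(run(lines)); // prints {big=2, data=1, hadoop=3, map=1, reduce=1}
    }
}
```

In a real cluster each map call would run on a different node and the shuffle would move data over the network, but the contract is the same: every value for a given key reaches exactly one reduce call.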
Hadoop - MapReduce
Hadoop - MapReduce (Word Count) Example
MapReduce - WordCountMapper:
  - In the WordCountMapper class we perform the following operations:
      - Read a line from the file
      - Split the line into words
      - Assign a count of 1 to each word
WordCountMapper source code:
    public static class WordCountMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        // Reused for every emitted pair instead of allocating new objects.
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Called once per input line; emits (word, 1) for each token in it.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
MapReduce - WordCountReducer:
  - In the WordCountReducer class we perform the following operations:
      - Sum the list of values
      - Assign the sum to the corresponding word
WordCountReducer source code:
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        // Called once per word with all the counts emitted for that word.
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
WordCountJob:
    public class WordCountJob {

        // WordCountMapper and WordCountReducer are declared as static nested
        // classes, so they are assumed to be placed inside this class.

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Job.getInstance replaces the deprecated new Job(conf, name) constructor.
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountJob.class);
            job.setMapperClass(WordCountMapper.class);
            // The reducer doubles as a combiner that pre-sums counts on the map side.
            job.setCombinerClass(WordCountReducer.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // args[0] is the input path, args[1] is the output path.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
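The job registers WordCountReducer as both the combiner and the reducer. That is safe for word count because addition gives the same total whether the counts are summed in one pass or pre-summed per map task first. The plain-Java sketch below illustrates this with made-up counts for a single word; the class name CombinerSketch and the per-task lists are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of why a summing reducer can act as a combiner.
class CombinerSketch {

    // The same logic as the reducer: sum a list of counts.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Without a combiner: the reducer sees every single count from every map task.
    static int reduceDirect(List<List<Integer>> countsPerMapTask) {
        List<Integer> all = new ArrayList<Integer>();
        for (List<Integer> task : countsPerMapTask) {
            all.addAll(task);
        }
        return sum(all);
    }

    // With a combiner: each map task pre-sums its counts, the reducer sums the partials.
    static int reduceWithCombiner(List<List<Integer>> countsPerMapTask) {
        List<Integer> partials = new ArrayList<Integer>();
        for (List<Integer> task : countsPerMapTask) {
            partials.add(sum(task));
        }
        return sum(partials);
    }

    public static void main(String[] args) {
        // Made-up counts for one word, emitted by two separate map tasks.
        List<List<Integer>> counts = Arrays.asList(
                Arrays.asList(1, 1, 1),
                Arrays.asList(1, 1));
        System.out.println(reduceDirect(counts) + " " + reduceWithCombiner(counts)); // prints 5 5
    }
}
```

The combiner version sends two partial sums to the reducer instead of five ones, which is the reduction in shuffled data that a combiner is meant to provide. An operation such as averaging would not give the same result when applied this way, so a reducer is not always a valid combiner.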
Imports to include:
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;
Execution of Hadoop Program in Eclipse:
  - Step 1:
      1. Start Hadoop in a terminal using the command:
           $ start-all.sh
      2. Use the jps command to check whether all Hadoop services have started.
  - Step 2: Open Eclipse.
  - Step 3: Go to File ⇒ New ⇒ Project.
      - Select Java Project and click the Next button.
      - Enter the project name and click the Finish button.
Continue...
  - Step 4: Eclipse creates the project and shows it in the side panel.
      1. Right-click on the project ⇒ New ⇒ Class.
      2. Enter the name of the class and then click Finish.
      3. Write the MapReduce program in that class.
  - Step 5: Write the Java program.
Continue...
  - Step 6: Importing JAR files
      1. Right-click on the project and select Properties (Alt+Enter).
      2. Select Java Build Path ⇒ click on Libraries, then click on Add External JARs.
      3. Select the JARs from the following Hadoop library directories:
           /usr/local/hadoop/share/hadoop/common/lib
           /usr/local/hadoop/share/hadoop/hdfs/lib
           /usr/local/hadoop/share/hadoop/httpfs/lib
           /usr/local/hadoop/share/hadoop/mapreduce/lib
           /usr/local/hadoop/share/hadoop/yarn/lib
           /usr/local/hadoop/share/hadoop/tools/
Continue...
  - Step 7: Set the input file path
      1. Create a folder in the home directory.
      2. Copy the text files into that folder.
      3. Note the path of the input folder.
  - Step 8: Set the input and output paths
      1. Right-click on the source ⇒ Run As ⇒ Run Configurations ⇒ Arguments.
      2. Enter your input and output paths, separated by a single space.
      3. Click on Run.
Thank you

Hadoop Installation, Configuration, and MapReduce Program

  • 1.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Big Data & Hadoop D. Praveen Kumar Research Scholar (Full-Time) Department of Computer Science & Engineering YSREC of Yogi Vemana University, Proddatur Kadapa Dt., A. P, India November 8, 2016 Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 1 / 43
  • 2.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution 1 Introduction 2 Hadoop Installation 3 Hadoop Configuration 4 Starting & Stopping 5 Map Reduce 6 Execution Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 2 / 43
  • 3.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution GUESTS =4 Transportation from railway station to your home( one Auto/car is sufficient) mom can prepare food or snacks without risk. Your house is sufficient for Accommodation. Facilities like bed, bathrooms, water and TV are provided which you use. You can talk to each other and crack jokes and you can make them happy Expenditure is nearly Rs.1000/- Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 3 / 43
  • 4.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution GUESTS =100 Transportation = 25 autos/car or two buses Food = catering. Accommodation = Lodge. Facilities = AC, TV, and all other facilities Maintenance= somewhat difficult Expenditure =nearly Rs. 90,000/- Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 4 / 43
  • 5.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution GUESTS =10000 Transportation = 2500 autos or 500 buses Food = catering. Accommodation = all Lodges, function halls and cottages in the town. Facilities = AC, TV, and all other facilities are somewhat difficult to provide. Maintenance= more difficult Expenditure =nearly Rs. 2,00,000/- Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 5 / 43
  • 6.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution GUESTS =10000000 Transportation=how many autos=? Food =? Accommodation =? Facilities =? Maintenance=? Cost =? Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 6 / 43
  • 7.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Problems Same we assume in computing environment Difficult to handle a huge and ever growing amount of data Processing of data can not be possible with few machines distributing large data sets is difficult Construction of online or offline models are very difficult Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 7 / 43
  • 8.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Solution A single solution to all these problems is Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 8 / 43
  • 9.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution What is Big Data? Big data refers to voluminous amounts of structured or unstructured data that organizations can potentially mine and analyze. Big data is huge amount of large data sets characterized by Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 9 / 43
  • 10.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Big Data Platforms and Analytical Software Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 10 / 43
  • 11.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Hadoop Here we go with Why ? Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 11 / 43
  • 12.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Hadoop Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. The base Apache Hadoop framework is composed of the following modules: Hadoop Common contains libraries and utilities needed by other Hadoop modules Hadoop Distributed File System (HDFS) a distributed file-system that stores data Hadoop YARN a resource-management platform Hadoop MapReduce for large scale data processing. Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 12 / 43
  • 13.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Hadoop Components Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 13 / 43
  • 14.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Requirements Necessary Java >= 7 ssh Linux OS (Ubuntu >= 14.04) Hadoop framework Optional Eclipse Internet connection Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 14 / 43
  • 15.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Java 7 & Installation Hadoop requires a working Java installation. However, using java 1.7 or more is recommended. Following command is used to install java in linux platform sudo apt-get install openjdk-7-jdk (or) sudo apt-get install default-jdk Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 15 / 43
  • 16.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Java PATH Setup We need to set JAVA path Open the .bashrc file located in home directory gedit ~/.bashrc Add below line at the end: export JAVA HOME=/usr/lib/jvm/java−7−openjdk−amd64 Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 16 / 43
  • 17.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Installation & Configuration of SSH Hadoop requires SSH(Secure Shell) access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it. Install SSH using following command sudo apt-get install ssh First, we have to generate DSA an SSH key for user. ssh-keygen -t dsa -P ’’ -f ~ /.ssh/id dsa cat ~ /.ssh/id dsa.pub >> ~ /.ssh/authorized keys Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 17 / 43
  • 18.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Download & Extract Hadoop Download Hadoop from the Apache Download Mirrors http://mirror.fibergrid.in/apache/hadoop/common/ Extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. $ cd /usr/local $ sudo tar xzf hadoop-2.7.2.tar.gz $ sudo mv hadoop-2.7.2 hadoop Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 18 / 43
  • 19.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Add Hadoop configuration in .bashrc Add Hadoop configuration in .bashrc in home directory. export HADOOP INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP INSTALL/bin export PATH=$PATH:$HADOOP INSTALL/sbin export HADOOP MAPRED HOME=$HADOOP INSTALL export HADOOP HDFS HOME=$HADOOP INSTALL export HADOOP COMMON HOME=$HADOOP INSTALL export YARN HOME=$HADOOP INSTALL export HADOOP COMMON LIB NATIVE DIR=$HADOOP INSTALL/lib/native export HADOOP OPTS="-Djava.library.path=$HADOOP INSTALL/lib" Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 19 / 43
  • 20.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Create temp file, DataNode & NameNode Execute below commands to create NameNode mkdir -p /usr/local/hadoopdata/hdfs/namenode Execute below commands to create DataNode mkdir -p /usr/local/hadoopdata/hdfs/datanode Execute below code to create the tmp directory in hadoop sudo mkdir -p /app/hadoop/tmp sudo chown hadoop1:hadoop1 /app/hadoop/tmp sudo chmod 750 /app/hadoop/tmp Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 20 / 43
  • 21.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Files to Configure The following are the files we need to configure core-site.xml hadoop-env.sh mapred-site.xml hdfs-site.xml Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 21 / 43
  • 22.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Add properties in /usr/local/hadoop/etc/core-site.xml Add the following snippets between the < configuration > ... < /configuration > tags in the core-site.xml file. Add below property to specify the location of tmp < property > < name > hadoop.tmp.dir < /name > < value > /app/hadoop/tmp < /value > < /property > Add below property to specify the location of default file system and its port number. < property > < name > fs.default.name < /name > < value > hdfs : //localhost : 54310 < /value > < /property > Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 22 / 43
  • 23.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Add properties in /usr/local/hadoop/etc/hadoop-env.sh Un-Comment the JAVA HOME and Give Correct Path For Java. export JAVA HOME=/usr/lib/jvm/java-7-openjdk-amd64 Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 23 / 43
  • 24.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Add property in /usr/local/hadoop/etc/hadoop/mapred-site.xml In file we add The host name and port that the MapReduce job tracker runs at. Add following in mapred-site.xml : < property > < name > mapred.job.tracker < /name > < value > localhost : 54311 < /value > < /property > Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 24 / 43
  • 25.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Add properties in ... etc/hadoop/hdfs-site.xml In file hdfs-site.xml add following: Add replication factor < property > < name > dfs.replication < /name > < value > 1 < /value > < /property > Specify the NameNode < property > < name > dfs.namenode.name.dir < /name > < value > file : /usr/local/hadoopdata/hdfs/namenode < /value > < /property > Specify the DataNode < property > < name > dfs.datanode.name.dir < /name > < value > file : /usr/local/hadoopdata/hdfs/datanode < /value > < /property > Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 25 / 43
  • 26.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Formatting the HDFS filesystem via the NameNode The first step to starting up your Hadoop installation is Formatting the Hadoop file system We need to do this the first time you set up a Hadoop. Do not format a running Hadoop filesystem as you will lose all the data currently in HDFS To format the filesystem, run the command hadoop namenode -format Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 26 / 43
  • 27.
    Outline Introduction HadoopInstallation Hadoop Configuration Starting & Stopping Map Reduce Execution Starting single-node cluster Run the command: start-all.sh This will startup a NameNode,SecondaryNameNode, DataNode, ResourceManager and a NodeManager on your machine. A nifty tool for checking whether the expected Hadoop processes are running is jps hadoop1@hadoop1:/usr/local/hadoop$ jps 2598 NameNode 3112 ResourceManager 3523 Jps 2917 SecondaryNameNode 2727 DataNode 3242 NodeManager Bapatla Engineering College, Bapatla, Guntur Big Data & Hadoop November 8, 2016 Slide: 27 / 43
  • 28.
    Stopping your single-node cluster
    To stop all the daemons running on your machine, run the command:
        stop-all.sh
    The output will look like this:
        stopping NodeManager
        localhost: stopping ResourceManager
        stopping NameNode
        localhost: stopping DataNode
        localhost: stopping SecondaryNameNode
  • 29.
    Map-Reduce Framework
    - MapReduce is a programming paradigm that relies on two functions, Map and Reduce.
    - It is used to manage many large-scale computations.
    - The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
    - The framework can schedule tasks effectively on the nodes where the data is already present.
  • 30.
    Map-Reduce Computation Steps
    - The key-value pairs from each Map task are collected by a master controller and sorted by key.
    - The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
    - The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way.
    - The manner of combination of values is determined by the code written by the user for the Reduce function.
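The computation steps above can be imitated in memory without Hadoop. The sketch below is an illustration only, in plain Java: the class MapReduceSketch and its methods are hypothetical names, a single TreeMap stands in for the master controller's sort-and-group step, and word count is assumed as the combination rule.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** In-memory imitation of the map, group-by-key, and reduce steps (no Hadoop involved). */
final class MapReduceSketch {

    /** Map step: emit a (word, 1) pair for every token in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(itr.nextToken(), 1));
        }
        return pairs;
    }

    /** Collect step: gather all pairs and group their values by key, in sorted key order. */
    static TreeMap<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<Integer>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: combine all values of one key; here the combination is a sum. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs the three steps over the given lines and returns word -> count. */
    static TreeMap<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groupByKey(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {big=2, data=1, hadoop=1}
        System.out.println(wordCount(Arrays.asList("big data", "big hadoop")));
    }
}
```

In a real cluster the grouped keys would be divided among several Reduce tasks; here one loop plays that role, which is enough to show that every value for a given key reaches the same reduce call.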
  • 31.
    Hadoop - MapReduce
  • 32.
    Hadoop - MapReduce (Word Count) Example
  • 33.
    MapReduce - WordCountMapper
    In the WordCountMapper class we perform the following operations:
    - Read a line from the file
    - Split the line into words
    - Assign a count of 1 to each word
  • 34.
    WordCountMapper source code
        // Declared static, so it must be nested inside an enclosing class (for example WordCountJob).
        public static class WordCountMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            // Called once per input line: emits (word, 1) for every token in the line.
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }
  • 35.
    MapReduce - WordCountReducer
    In the WordCountReducer class we perform the following operations:
    - Sum the list of values
    - Assign the sum to the corresponding word
  • 36.
    WordCountReducer source code
        // Declared static, so it must be nested inside an enclosing class (for example WordCountJob).
        public static class WordCountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            // Called once per word: sums all the counts emitted for that word.
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
  • 37.
    WordCountJob
        public class WordCountJob {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Job.getInstance replaces the deprecated new Job(conf, "word count") constructor.
                Job job = Job.getInstance(conf, "word count");
                job.setJarByClass(WordCountJob.class);
                job.setMapperClass(WordCountMapper.class);
                // The reducer doubles as a combiner to pre-sum counts on the map side.
                job.setCombinerClass(WordCountReducer.class);
                job.setReducerClass(WordCountReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                // args[0] is the input path, args[1] is the output path.
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }
  • 38.
    Import statements to include
        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.util.GenericOptionsParser;
  • 39.
    Execution of Hadoop Program in Eclipse
    Step 1:
      1. Start Hadoop in a terminal using the command: $ start-all.sh
      2. Use the jps command to check whether all Hadoop services have started.
    Step 2: Open Eclipse.
    Step 3: Go to File ⇒ New ⇒ Project.
      - Select Java Project and click the Next button.
      - Enter a project name and click the Finish button.
  • 40.
    Continue...
    Step 4: Eclipse creates the project and lists it in the side panel.
      1. Right-click on the project ⇒ New ⇒ Class.
      2. Enter the name of the class and then click Finish.
      3. Write the MapReduce program in that class.
    Step 5: Write the Java program.
  • 41.
    Continue...
    Step 6: Importing JAR files
      1. Right-click on the project and select Properties (Alt+Enter).
      2. Select Java Build Path ⇒ click on Libraries, then click on Add External JARs.
      3. Select the JARs from the following Hadoop library directories (paths are case-sensitive; the installation directory used earlier is /usr/local/hadoop):
         /usr/local/hadoop/share/hadoop/common/libs
         /usr/local/hadoop/share/hadoop/hdfs/libs
         /usr/local/hadoop/share/hadoop/httpfs/libs
         /usr/local/hadoop/share/hadoop/mapreduce/libs
         /usr/local/hadoop/share/hadoop/yarn/libs
         /usr/local/hadoop/share/hadoop/tools/
      In a stock Hadoop 2.x layout the JAR subdirectories are typically named lib rather than libs, so check the names in your installation.
  • 42.
    Continue...
    Step 7: Set the input file path
      1. Create a folder in your home directory.
      2. Copy the text files into it.
      3. Note the path of this input folder.
    Step 8: Set the input and output paths
      1. Right-click on the source ⇒ Run As ⇒ Run Configurations ⇒ Arguments.
      2. Enter your input path and output path, separated by a single space.
      3. Click on Run.
  • 43.
    Thank you