BigData - Introduction Walter Dal Mut – walter.dalmut@gmail.com @walterdalmut - @corleycloud - @upcloo
Whoami • Walter Dal Mut – Electronic Engineer (Polytechnic of Turin) – Startupper • Corley S.r.l. – consultancy on PHP, AWS, websites, distributed and scalable systems, load and stress tests – http://www.corley.it/ • UpCloo Ltd. – scalable CMS – http://www.upcloo.com/ • Social: @walterdalmut - @corleycloud - @upcloo • Web: walterdalmut.com – corley.it – upcloo.com – www.xitecms.it • Contact: walter.dalmut@gmail.com
Wait a moment… What is it? • Big data is a collection of data sets so large and complex that it becomes difficult to process them with on-hand database management tools or traditional data processing applications • Big data challenges: • Capture • Curation • Storage • Search • Sharing • Transfer • Analysis • Visualization
Common scenario • Data arrive at fast rates • Data are typically unstructured or only partially structured • Log files • JSON/BSON structures • Data are stored without any kind of aggregation
What are we looking for? • Scalable Storage • In 2010 Facebook claimed to have 21 PB of storage • 30 PB in 2011 • 100 PB in 2012 • 500 PB in 2013 • Massively Parallel Processing • In 2012 we can process exabytes (10^18 bytes) of data in a reasonable amount of time • The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000-core Linux cluster and produces data that is used in every Yahoo! Web search query • Reasonable Cost • The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240 (not including bandwidth)
What is Apache Hadoop? • Apache Hadoop is an open-source software framework that supports data-intensive distributed applications • It was derived from Google’s MapReduce and Google File System (GFS) papers • Hadoop implements a computational paradigm named MapReduce • Hadoop architecture: • Hadoop kernel • MapReduce • Hadoop Distributed File System (HDFS) • Other projects built on top of Hadoop: • Hive, Pig • HBase • ZooKeeper • etc.
What is MapReduce? [Diagram: the input is split into slices (SLICE 0–7); each slice is handled by a MAP task; the intermediate results are grouped and passed to REDUCE tasks (R0–R2), which produce the final output.]
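A compact way to see the paradigm (notation borrowed from the original MapReduce paper, not from these slides): the user writes two functions, map: (k1, v1) → list(k2, v2) and reduce: (k2, list(v2)) → list(v2); the framework splits the input into slices, applies map to every record, groups the intermediate pairs by key, and feeds each (key, list of values) group to a reduce task.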
Word Count with MapReduce • Example: for the input "foo bar foo", the map phase emits (foo, 1), (bar, 1), (foo, 1); the pairs are grouped by word and the reduce phase sums them, producing (foo, 2) and (bar, 1)
The Hadoop Ecosystem [Layer diagram] • Data Mining: Mahout • Data Access: Sqoop • Client Access: Hive, Pig • Data Storage: HDFS, HBase • Data Processing: MapReduce • Coordination: ZooKeeper • Network • Everything runs on the JVM (Java Virtual Machine), on top of an operating system (RedHat, Ubuntu, etc.) and the hardware
Getting Started with Apache Hadoop • Hadoop is a framework • We need to write our own software: • a Map • a Reduce • put it all together • WordCount MapReduce with Hadoop • http://wiki.apache.org/hadoop/WordCount
(MAP) WordCount MapReduce with Hadoop

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // emit a (word, 1) pair for every token in the input line
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
(RED) WordCount MapReduce with Hadoop

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // sum all the 1s emitted by the mappers for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
WordCount MapReduce with Hadoop

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
}
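To actually run the job, the Map, Reduce and main classes are typically packaged into a jar and submitted with the hadoop launcher; a minimal sketch, where the jar and class names (wordcount.jar, WordCount) are illustrative:

hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output

The last two arguments are the input and output paths picked up as args[0] and args[1] in main(); the output directory must not already exist.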
Scripting Languages on top of Hadoop
Apache Hive – Data Warehouse • Hive is a data warehouse system for Hadoop • It facilitates easy data summarization, ad-hoc queries and analysis of large datasets stored in Hadoop-compatible file systems • Queries are expressed in an SQL-like language called HiveQL • Custom MapReduce code can be plugged in when needed
Apache Hive Features • SELECTs and FILTERs • SELECT * FROM table_name; • SELECT * FROM table_name tn WHERE tn.weekday = 12; • GROUP BY • SELECT weekday, COUNT(*) FROM table_name GROUP BY weekday; • JOINs • SELECT a.* FROM a JOIN b ON (a.id = b.id); • More (multi-table inserts, streaming) • Example: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-SimpleExampleUseCases
A real example with a tab-separated file

CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'ml-data/u.data' OVERWRITE INTO TABLE u_data;

SELECT COUNT(*) FROM u_data;
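Once the table is loaded, plain HiveQL queries are compiled into MapReduce jobs behind the scenes; a small sketch of an analysis query on the same MovieLens data (the query is illustrative, not taken from the slides):

SELECT movieid, AVG(rating) AS avg_rating
FROM u_data
GROUP BY movieid
ORDER BY avg_rating DESC
LIMIT 10;

This returns the ten highest-rated movies without writing a single line of Java.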
Apache Pig • It is a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs • Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs • Ease of programming • Optimization opportunities: tasks are optimized automatically, allowing the user to focus on semantics rather than efficiency • Extensibility
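To give a feel for the language, here is a minimal Pig Latin sketch of the same word count shown earlier in Java (the input file name is illustrative):

-- load the text, split it into words, group by word and count
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS occurrences;
DUMP counts;

Five lines replace the Mapper, Reducer and driver classes; Pig compiles the script into an equivalent sequence of MapReduce jobs.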
Apache Pig – Log Flow analysis • A common scenario is to collect logs and analyse them as a post-processing step • We will cover Apache2 (web server) log analysis • See the AWS Elastic MapReduce + Apache Pig example • http://aws.amazon.com/articles/2729 • Apache log line: • 122.161.184.193 - - [21/Jul/2009:13:14:17 -0700] "GET /rss.pl HTTP/1.1" 200 35942 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.21022; InfoPath.2; .NET CLR 3.5.30729; .NET CLR 3.0.30618; OfficeLiveConnector.1.3; OfficeLivePatch.1.3; MSOffice 12)"
Log Analysis [Common Log Format (CLF)] • More information about CLF: • http://httpd.apache.org/docs/2.2/logs.html • Interesting log fields: • Remote Address (IP) • Remote Log Name • User • Time • Request • Status • Bytes • Referrer • Browser
Getting started with Apache Pig • First of all we need to load the logs from an HDFS-compatible source (for example AWS S3) • Create a table starting from the raw log lines • Create the MapReduce program (the Pig script) in order to analyse the data and store the results
Load logs from S3 and create a table

-- EXTRACT is provided by the piggybank UDF library, so it has to be
-- registered and defined first (the jar location depends on the installation)
REGISTER piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();

RAW_LOGS = LOAD 's3://bucket/path' USING TextLoader AS (line:chararray);

LOGS_BASE = FOREACH RAW_LOGS GENERATE
  FLATTEN(
    EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
  ) AS (
    remoteAddr:    chararray,
    remoteLogname: chararray,
    user:          chararray,
    time:          chararray,
    request:       chararray,
    status:        int,
    bytes_string:  chararray,
    referrer:      chararray,
    browser:       chararray
  );
LOGS_BASE is a table • Example row: • remoteAddr: 72.14.194.1 • remoteLogname: - • user: - • time: 21/Jul/2009:18:04:54 -0700 • request: GET /gwidgets/alexa.xml HTTP/1.1 • status: 200 • bytes: 2969 • referrer: - • browser: FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)
Data Analysis

-- keep only the referrer field
REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;

-- keep only requests referred by Bing or Google
FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*' OR referrer matches '.*google.*';

-- extract the search terms from the query string (q=...)
SEARCH_TERMS = FOREACH FILTERED GENERATE FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) AS terms:chararray;
SEARCH_TERMS_FILTERED = FILTER SEARCH_TERMS BY NOT $0 IS NULL;

-- count the occurrences of each term and sort by frequency
SEARCH_TERMS_COUNT = FOREACH (GROUP SEARCH_TERMS_FILTERED BY $0) GENERATE $0, COUNT($1) AS num;
SEARCH_TERMS_COUNT_SORTED = ORDER SEARCH_TERMS_COUNT BY num DESC;
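Pig evaluates lazily: the statements above only describe the data flow, and nothing runs until a result is requested. A minimal closing step (the output path is illustrative, not from the slides):

-- write the sorted (term, count) pairs back to S3; DUMP would print them to the console instead
STORE SEARCH_TERMS_COUNT_SORTED INTO 's3://bucket/output';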
Thanks for Listening • Any questions?

Editor's Notes

  • #16 In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis.