Big data processing using Cloudera Quickstart

Big Data Processing Using Cloudera Quickstart with a Docker Container
June 2016
Dr. Thanachart Numnonda, IMC Institute, thanachart@imcinstitute.com
Modified from the original version by Danairat T., Certified Java Programmer, TOGAF – Silver, danairat@gmail.com

Thanachart Numnonda, thanachart@imcinstitute.com, June 2016, Hadoop Workshop using Cloudera on Amazon EC2

Outline
● Launch AWS EC2 Instance
● Install Docker on Ubuntu
● Pull Cloudera QuickStart to Docker
● HDFS
● Hive
● Pig
● Impala
● Spark
● Spark SQL
● Spark Streaming

Cloudera VM
This lab uses an EC2 virtual server on AWS to install Cloudera. However, you can also use the Cloudera QuickStart VM, which can be downloaded from: http://www.cloudera.com/content/www/en-us/downloads.html

Hands-On: Launch a virtual server on EC2 Amazon Web Services
(Note: You can skip this section if you use your own computer or another cloud service.)

Virtual Server
This lab uses an EC2 virtual server to install a Cloudera cluster, with the following features:
● Ubuntu Server 14.04 LTS
● Four m3.xlarge: 4 vCPU, 15 GB memory, 80 GB SSD
● Security group: default
● Key pair: imchadoop

Select the EC2 service and click Launch Instance

Select an Amazon Machine Image (AMI): Ubuntu Server 14.04 LTS (PV)

Choose the m3.xlarge instance type

Add Storage: 80 GB

Name the instance

Select an existing security group > Default

Click Launch and choose imchadoop as the key pair

Review the instances and rename one instance as master; click Connect for instructions on connecting to the instance

Connect to the instance from Mac/Linux

You can also view details of the instance, such as its Public IP and Private IP

Connect to the instance from Windows using PuTTY

Connect to the instance

Hands-On: Installing Cloudera Quickstart on Docker Container

Installation Steps
● Update OS
● Install Docker
● Pull Cloudera Quickstart
● Run Cloudera Quickstart
● Run Cloudera Manager

Update OS (Ubuntu)
● Command: sudo apt-get update

Docker Installation
● Command: sudo apt-get install docker.io

Pull Cloudera Quickstart
● Command: sudo docker pull cloudera/quickstart:latest

Show docker images
● Command: sudo docker images

Run Cloudera Quickstart
● Command: sudo docker run --hostname=quickstart.cloudera --privileged=true -t -i [OPTIONS] [IMAGE] /usr/bin/docker-quickstart
● Example: sudo docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 cloudera/quickstart /usr/bin/docker-quickstart
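The [OPTIONS] slot in the command above is where port mappings go. As a rough sketch, the Python helper below assembles the same command line from a list of (host port, container port) pairs, so that further -p mappings can be added in one place. The function name docker_run_command and its parameters are made up for this illustration; only the flags shown on the slide are used.

```python
def docker_run_command(image, ports, hostname="quickstart.cloudera"):
    """Build the argument list for the `docker run` command shown above.

    `ports` is a list of (host_port, container_port) pairs; each pair
    becomes one `-p host:container` option. Helper name and parameters
    are illustrative assumptions, not part of Docker or Cloudera.
    """
    args = [
        "sudo", "docker", "run",
        f"--hostname={hostname}",
        "--privileged=true",
        "-t", "-i",
    ]
    for host_port, container_port in ports:
        args += ["-p", f"{host_port}:{container_port}"]
    args += [image, "/usr/bin/docker-quickstart"]
    return args


# Reproduce the example from the slide: publish only port 8888 (Hue).
command = " ".join(docker_run_command("cloudera/quickstart", [(8888, 8888)]))
print(command)
```

Passing more pairs in the ports list adds one -p option per pair; which extra ports are worth publishing depends on the services you want to reach from outside the container.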

Finding the EC2 instance's DNS

Login to Hue
http://ec2-54-173-154-79.compute-1.amazonaws.com:8888

Hands-On: Importing/Exporting Data to HDFS

HDFS
● Default storage for the Hadoop cluster
● Data is distributed and replicated over multiple machines
● Designed to handle very large files with streaming data access patterns
● NameNode/DataNode
● Master/slave architecture (1 master, 'n' slaves)
● Large files are split into blocks (64 MB by default, but configurable) spread across all the nodes
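To make the block and replication idea concrete, here is a small back-of-the-envelope sketch in Python. It uses the 64 MB block size quoted above. The replication factor of 3 is an assumption (a commonly used HDFS setting, not stated on the slide), and the 200 MB file size is an arbitrary example.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted on the slide
REPLICATION = 3      # assumption: common HDFS setting, not from the slides


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION):
    """Estimate how a file is split into blocks and replicated."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only holds whatever is left over.
    last_block_mb = file_size_mb - (blocks - 1) * block_size_mb
    return {
        "blocks": blocks,
        "last_block_mb": last_block_mb,
        "block_replicas": blocks * replication,
        "raw_storage_mb": file_size_mb * replication,
    }


footprint = hdfs_footprint(200)   # hypothetical 200 MB file
print(footprint)
```

Under these assumptions a 200 MB file becomes four blocks (three full ones and one of 8 MB), stored as twelve block replicas across the DataNodes.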

HDFS Architecture
Source: Hadoop, Shashwat Shriparv

Data Replication in HDFS
Source: Hadoop, Shashwat Shriparv

How does HDFS work?
Source: Introduction to Apache Hadoop-Pig, PrashantKommireddi

Review files in Hadoop HDFS using the File Browser

Create new directories named input and output

Upload a local file to HDFS

Hands-On: Connect to a master node via SSH

Docker commands:
● docker images
● docker ps
● docker attach id
● docker kill id
● Exit from a container
– exit (exits and kills the running container)
– Ctrl-P, Ctrl-Q (exits without killing the running container)

SSH Login to a master node

Hadoop syntax for HDFS

Install wget
● Command: yum install wget

Download an example text file
Make your own directory on the master node to avoid mixing your files with others'
$ mkdir guest1
$ cd guest1
$ wget https://s3.amazonaws.com/imcbucket/input/pg2600.txt

Upload Data to Hadoop
$ hadoop fs -ls /user/cloudera/input
$ hadoop fs -rm /user/cloudera/input/*
$ hadoop fs -put pg2600.txt /user/cloudera/input/
$ hadoop fs -ls /user/cloudera/input
Note: you log in as ubuntu, so you need a sudo command to switch user to hdfs

Lecture: Understanding Map Reduce Processing
[Diagram: Client; Name Node and Job Tracker; three Data Nodes, each with a Task Tracker; Map and Reduce]

Before MapReduce…
● Large scale data processing was difficult!
– Managing hundreds or thousands of processors
– Managing parallelization and distribution
– I/O scheduling
– Status and monitoring
– Fault/crash tolerance
● MapReduce provides all of these, easily!
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html

MapReduce Framework
Source: www.bigdatauniversity.com

How Map and Reduce Work Together
● Map returns information
● Reduce accepts information
● Reduce applies a user-defined function to reduce the amount of data

Map Abstraction
● Inputs a key/value pair
– Key is a reference to the input value
– Value is the data set on which to operate
● Evaluation
– Function defined by user
– Applies to every value in the input
● Might need to parse input
● Produces a new list of key/value pairs
– Can be a different type from the input pair
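The Map abstraction can be sketched in a few lines of plain Python (no Hadoop involved). In this illustration the input key is a line offset and the value is the line text. The function name map_word_count and the sample records are invented for the example.

```python
def map_word_count(key, value):
    """User-defined map function.

    key:   a reference to the input value (here, a line offset; unused)
    value: the data to operate on (here, one line of text)
    Returns a new list of key/value pairs, one (word, 1) per word.
    """
    return [(word.lower(), 1) for word in value.split()]


# Hypothetical input records: (line offset, line text)
records = [(0, "the cat sat"), (12, "The cat ran")]

# The framework applies the map function to every input pair.
intermediate = [pair
                for key, value in records
                for pair in map_word_count(key, value)]
print(intermediate)
```

Note that the output pairs (string, integer) have a different type from the input pairs (integer, string), which is what the last bullet above allows.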

Reduce Abstraction
● Starts with intermediate key/value pairs
● Ends with finalized key/value pairs
● Starting pairs are sorted by key
● Iterator supplies the values for a given key to the Reduce function

Reduce Abstraction
● Typically a function that:
– Starts with a large number of key/value pairs
● One key/value for each word in all files being grepped (including multiple entries for the same word)
– Ends with very few key/value pairs
● One key/value for each unique word across all the files, with the number of instances summed into this entry
● Broken up so a given worker works with input of the same key
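A minimal Python sketch of the Reduce side, again without Hadoop: the intermediate pairs are sorted by key, and an iterator hands all values of one key to the reduce function. The names reduce_sum and intermediate, and the sample pairs, are assumptions made for this illustration.

```python
from itertools import groupby
from operator import itemgetter


def reduce_sum(key, values):
    """User-defined reduce function: sum all counts seen for one key."""
    return (key, sum(values))


# Hypothetical intermediate pairs, as a word-count map phase might emit them.
intermediate = [("the", 1), ("cat", 1), ("sat", 1),
                ("the", 1), ("cat", 1), ("ran", 1)]

# Starting pairs are sorted by key so equal keys become adjacent.
sorted_pairs = sorted(intermediate, key=itemgetter(0))

# groupby yields one (key, iterator) per distinct key; the iterator
# supplies that key's values to the reduce function.
final = [reduce_sum(key, (value for _, value in group))
         for key, group in groupby(sorted_pairs, key=itemgetter(0))]
print(final)
```

Six intermediate pairs shrink to four final pairs, one per unique word. In a real cluster the grouping step also decides which worker receives which keys.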

Why is this approach better?
● Creates an abstraction for dealing with complex overhead
– The computations are simple, the overhead is messy
● Removing the overhead makes programs much smaller and thus easier to use
– Less testing is required as well. The MapReduce libraries can be assumed to work properly, so only user code needs to be tested
● Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation

Hands-On: Writing your own Map Reduce Program

Example MapReduce: WordCount

Running Map Reduce Program
$ cd /guest1
$ wget https://dl.dropboxusercontent.com/u/12655380/wordcount.jar
$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/input/* /user/cloudera/output/wordcount

Reviewing MapReduce Job in Hue

Reviewing MapReduce Output Result

Lecture: Understanding Hive

Introduction
A Petabyte Scale Data Warehouse Using Hadoop
Hive was developed by Facebook and is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL.

Danairat T., danairat@gmail.com; Thanachart N., thanachart@imcinstitute.com, April 2015, Big Data Hadoop Workshop

What Hive is NOT
Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.).

Sample HiveQL
The query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following queries:
SELECT * FROM t WHERE t.c = 'xyz'
SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)
SELECT t1.c1, count(1) FROM t1 GROUP BY t1.c1
Source: hive.apache.org
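To give a feel for what "a sequence of map/reduce jobs" means for such queries, the Python sketch below mimics two of them by hand. It is only a conceptual model, not what the Hive compiler actually generates. The table contents are made up, and the filter query is adapted to use column c1 with the value 'a'.

```python
from collections import defaultdict

# Hypothetical rows of table t1 (made-up data for illustration).
t1 = [
    {"c1": "a", "c2": 10},
    {"c1": "b", "c2": 20},
    {"c1": "a", "c2": 30},
]

# SELECT * FROM t1 WHERE c1 = 'a'
# A filter needs no grouping, so a map phase alone can serve it:
# each mapper keeps only the matching rows of its input split.
filtered = [row for row in t1 if row["c1"] == "a"]

# SELECT c1, count(1) FROM t1 GROUP BY c1
# Map: emit (group key, 1) for every row.
mapped = [(row["c1"], 1) for row in t1]
# Shuffle: bring all values of the same key together.
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)
# Reduce: aggregate the values of each key.
grouped = sorted((key, sum(values)) for key, values in shuffled.items())

print(filtered)
print(grouped)
```

The GROUP BY column plays the role of the map output key, and the aggregate function plays the role of the reduce function.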

System Architecture and Components
● Metastore: stores the metadata.
● Query compiler and execution engine: converts SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop.
● SerDe and ObjectInspectors: programmable interfaces and implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record and translates it into a Java object that Hive can manipulate. The Serializer, in turn, takes a Java object that Hive has been working with and turns it into something that Hive can write to HDFS or another supported system.
● UDF and UDAF: programmable interfaces and implementations for user-defined functions (scalar and aggregate functions).
● Clients: command-line client similar to the MySQL command line.
Source: hive.apache.org
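The SerDe idea can be illustrated with a toy example in Python. Hive's real SerDe interfaces are Java classes; the class below is only a sketch of the concept for a delimited text row. The class name DelimitedSerDe, the column names, and the tab delimiter are assumptions for this example.

```python
class DelimitedSerDe:
    """Toy serializer/deserializer for one delimited text record.

    Conceptual sketch only: it does not implement Hive's Java SerDe API.
    """

    def __init__(self, columns, delimiter="\t"):
        self.columns = columns
        self.delimiter = delimiter

    def deserialize(self, line):
        # String representation of a record -> object the engine can use.
        return dict(zip(self.columns, line.split(self.delimiter)))

    def serialize(self, record):
        # In-memory object -> something that can be written to storage.
        return self.delimiter.join(str(record[c]) for c in self.columns)


serde = DelimitedSerDe(["id", "country"])   # hypothetical schema
row = serde.deserialize("1\tThailand")
line = serde.serialize(row)
print(row, repr(line))
```

Deserializing and then serializing a row returns the original line, which is the round trip a SerDe is expected to support.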

Architecture Overview
[Diagram: Hive CLI, Mgmt. Web UI and Thrift API (DDL, Queries, Browsing); Hive QL (Parser, Planner, Execution); SerDe (Thrift, Jute, JSON, ...); MetaStore; Map Reduce; HDFS]
Source: hive.apache.org

Hive Metastore
The Hive Metastore is a repository that keeps all Hive metadata: table and partition definitions. By default, Hive stores its metadata in Derby DB.

Hive Built-in Functions
Return Type | Function Name (Signature) | Description
BIGINT | round(double a) | Returns the rounded BIGINT value of the double.
BIGINT | floor(double a) | Returns the maximum BIGINT value that is equal to or less than the double.
BIGINT | ceil(double a) | Returns the minimum BIGINT value that is equal to or greater than the double.
double | rand(), rand(int seed) | Returns a random number (that changes from row to row). Specifying the seed makes the generated random number sequence deterministic.
string | concat(string A, string B, ...) | Returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them.
string | substr(string A, int start) | Returns the substring of A starting from the start position till the end of string A. For example, substr('foobar', 4) results in 'bar'.
string | substr(string A, int start, int length) | Returns the substring of A starting from the start position with the given length, e.g. substr('foobar', 4, 2) results in 'ba'.
string | upper(string A) | Returns the string resulting from converting all characters of A to upper case, e.g. upper('fOoBaR') results in 'FOOBAR'.
string | ucase(string A) | Same as upper.
string | lower(string A) | Returns the string resulting from converting all characters of A to lower case, e.g. lower('fOoBaR') results in 'foobar'.
string | lcase(string A) | Same as lower.
string | trim(string A) | Returns the string resulting from trimming spaces from both ends of A, e.g. trim(' foobar ') results in 'foobar'.
string | ltrim(string A) | Returns the string resulting from trimming spaces from the beginning (left-hand side) of A. For example, ltrim(' foobar ') results in 'foobar '.
string | rtrim(string A) | Returns the string resulting from trimming spaces from the end (right-hand side) of A. For example, rtrim(' foobar ') results in ' foobar'.
string | regexp_replace(string A, string B, string C) | Returns the string resulting from replacing all substrings in A that match the Java regular expression B with C. For example, regexp_replace('foobar', 'oo|ar', '') returns 'fb'.
string | from_unixtime(int unixtime) | Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
string | to_date(string timestamp) | Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
int | year(string date) | Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
int | month(string date) | Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
int | day(string date) | Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
string | get_json_object(string json_string, string path) | Extracts a JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted object. Returns null if the input JSON string is invalid.
Source: hive.apache.org
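One detail that is easy to miss in the table is that substr counts positions from 1, not from 0. The Python sketch below mimics that behaviour for positive start positions and reproduces the substr and regexp_replace examples from the table. The helper name hive_substr is invented for this illustration and is not a Hive or Python API.

```python
import re


def hive_substr(text, start, length=None):
    """Mimic the 1-based substr(A, start[, length]) described above.

    Only positive start positions are handled in this sketch.
    """
    begin = start - 1   # convert the 1-based position to a 0-based index
    if length is None:
        return text[begin:]
    return text[begin:begin + length]


tail = hive_substr("foobar", 4)          # like substr('foobar', 4)
middle = hive_substr("foobar", 4, 2)     # like substr('foobar', 4, 2)
# Like regexp_replace('foobar', 'oo|ar', ''): drop every match of the pattern.
replaced = re.sub("oo|ar", "", "foobar")
print(tail, middle, replaced)
```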

Hive Aggregate Functions
Return Type | Aggregation Function Name (Signature) | Description
BIGINT | count(*), count(expr), count(DISTINCT expr[, expr_.]) | count(*): returns the total number of retrieved rows, including rows containing NULL values. count(expr): returns the number of rows for which the supplied expression is non-NULL. count(DISTINCT expr[, expr]): returns the number of rows for which the supplied expression(s) are unique and non-NULL.
DOUBLE | sum(col), sum(DISTINCT col) | Returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
DOUBLE | avg(col), avg(DISTINCT col) | Returns the average of the elements in the group or the average of the distinct values of the column in the group.
DOUBLE | min(col) | Returns the minimum value of the column in the group.
DOUBLE | max(col) | Returns the maximum value of the column in the group.
Source: hive.apache.org
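The three flavours of count differ only in how they treat NULL and duplicate values. A small Python sketch, with None standing in for NULL and a made-up column c, shows the difference:

```python
# Hypothetical table with one column "c"; None plays the role of NULL.
rows = [{"c": "x"}, {"c": None}, {"c": "x"}, {"c": "y"}]

# count(*): every retrieved row counts, NULL or not.
count_star = len(rows)

# count(c): only rows where the expression is non-NULL.
count_expr = sum(1 for row in rows if row["c"] is not None)

# count(DISTINCT c): unique non-NULL values only.
count_distinct = len({row["c"] for row in rows if row["c"] is not None})

print(count_star, count_expr, count_distinct)
```

For this sample the three counts are 4, 3 and 2.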

Running Hive
Hive Shell
● Interactive: hive
● Script: hive -f myscript
● Inline: hive -e 'SELECT * FROM mytable'
Source: hive.apache.org

Hive Commands
Source: hortonworks.com

Hive Tables
● Managed: CREATE TABLE
– LOAD: file moved into Hive's data warehouse directory
– DROP: both data and metadata are deleted
● External: CREATE EXTERNAL TABLE
– LOAD: no file moved
– DROP: only metadata deleted
– Use when sharing data between Hive and Hadoop applications, or when you want to use multiple schemas on the same data

Hive External Table
Dropping an external table using Hive:
● Hive will delete the metadata from the metastore
● Hive will NOT delete the HDFS file
● You need to delete the HDFS file manually
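The difference between dropping a managed and an external table can be modelled with a toy example in Python. This is only a simulation on the local file system: a dictionary stands in for the metastore and temporary directories stand in for HDFS locations. The class ToyWarehouse and the table names are invented for the illustration.

```python
import os
import shutil
import tempfile


class ToyWarehouse:
    """Toy model of DROP TABLE for managed vs. external tables."""

    def __init__(self):
        self.metadata = {}   # table name -> (data location, is_external)

    def create_table(self, name, location, external=False):
        self.metadata[name] = (location, external)

    def drop_table(self, name):
        location, external = self.metadata.pop(name)   # metadata always goes
        if not external:
            shutil.rmtree(location)                    # managed: data goes too


root = tempfile.mkdtemp()
managed_dir = os.path.join(root, "managed_tbl")
external_dir = os.path.join(root, "external_tbl")
os.makedirs(managed_dir)
os.makedirs(external_dir)

warehouse = ToyWarehouse()
warehouse.create_table("managed_tbl", managed_dir)
warehouse.create_table("external_tbl", external_dir, external=True)
warehouse.drop_table("managed_tbl")
warehouse.drop_table("external_tbl")

managed_data_exists = os.path.exists(managed_dir)
external_data_exists = os.path.exists(external_dir)
print(managed_data_exists, external_data_exists)

shutil.rmtree(root)   # the "manual delete" of the leftover external data
```

After both drops the metadata is empty, the managed data is gone, and the external data is still there until someone removes it by hand.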

Java JDBC for Hive
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        try {
            // Load the Hive JDBC driver
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            // Driver class is not on the classpath: report and stop
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        String tableName = "testHiveDriverTable";
        stmt.executeQuery("drop table " + tableName);
        ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }
        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            // print column name and type separated by a tab
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }
    }
}

HiveQL and MySQL Comparison
Source: hortonworks.com

HiveQL and MySQL Query Comparison
Source: hortonworks.com

Hands-On: Loading Data using Hive

Start Hive
Quit from Hive:
hive> quit;

Create Hive Table
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html

Reviewing Hive Table in HDFS

Alter and Drop Hive Table
hive> alter table test_tbl add columns (remarks STRING);
hive> describe test_tbl;
OK
id int
country string
remarks string
Time taken: 0.077 seconds
hive> drop table test_tbl;
OK
Time taken: 0.9 seconds
See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html

Preparing Large Dataset
http://grouplens.org/datasets/movielens/

MovieLens Dataset
1) Type command > wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
2) Type command > yum install unzip
3) Type command > unzip ml-100k.zip
4) Type command > more ml-100k/u.user

Moving dataset to HDFS
1) Type command > cd ml-100k
2) Type command > hadoop fs -mkdir /user/cloudera/movielens
3) Type command > hadoop fs -put u.user /user/cloudera/movielens
4) Type command > hadoop fs -ls /user/cloudera/movielens

CREATE & SELECT Table

Bay Area Bike Share (BABS)
http://www.bayareabikeshare.com/open-data

Preparing the bike data
$ wget https://s3.amazonaws.com/babs-open-data/babs_open_data_year_1.zip
$ unzip babs_open_data_year_1.zip
$ cd 201402_babs_open_data/
$ hadoop fs -put 201402_trip_data.csv /user/cloudera
$ hadoop fs -ls /user/cloudera

Importing CSV Data with the Metastore App
The BABS data set contains 4 CSVs with data for stations, trips, rebalancing (availability), and weather. We will import the trips dataset using Metastore Tables.

Select: Create a new table from a file

Name a table and select a file

Choose Delimiter

Define Column Types

Create Table: Done

Starting Hive Editor

Find the top 10 most popular start stations based on the trip data
SELECT startterminal, startstation, COUNT(1) AS count
FROM trip
GROUP BY startterminal, startstation
ORDER BY count DESC
LIMIT 10
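For readers who think more easily in code than in SQL, the same "group, count, sort descending, take the top N" logic can be sketched in Python with collections.Counter. The trip tuples below are made-up sample data, not rows from the BABS files.

```python
from collections import Counter

# Hypothetical (startterminal, startstation) pairs, one per trip.
trips = [
    (70, "Station A"),
    (50, "Station B"),
    (70, "Station A"),
    (61, "Station C"),
    (70, "Station A"),
    (50, "Station B"),
]

# GROUP BY startterminal, startstation + COUNT(1)
counts = Counter(trips)

# ORDER BY count DESC LIMIT 10
top_stations = counts.most_common(10)
print(top_stations)
```

Each result entry pairs the grouping columns with their count, just like one output row of the query.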

Find the total number of trips and average duration (in minutes) of those trips, grouped by hour
SELECT hour, COUNT(1) AS trips, ROUND(AVG(duration) / 60) AS avg_duration
FROM (
  SELECT CAST(SPLIT(SPLIT(t.startdate, ' ')[1], ':')[0] AS INT) AS hour,
         t.duration AS duration
  FROM `bikeshare`.`trips` t
  WHERE t.startterminal = 70 AND t.duration IS NOT NULL
) r
GROUP BY hour
ORDER BY hour ASC;
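The nested SPLIT expression is the least obvious part of this query: it takes the time portion after the space and then the hour before the colon. The Python sketch below walks through the same steps on made-up rows. The "month/day/year hour:minute" date format and all values are assumptions for the illustration.

```python
from collections import defaultdict

# Hypothetical trip rows; durations are in seconds, None stands for NULL.
trips = [
    {"startdate": "8/29/2013 14:13", "startterminal": 70, "duration": 600},
    {"startdate": "8/29/2013 14:42", "startterminal": 70, "duration": 1200},
    {"startdate": "8/29/2013 9:08", "startterminal": 70, "duration": 300},
    {"startdate": "8/29/2013 9:30", "startterminal": 50, "duration": 999},
    {"startdate": "8/30/2013 9:15", "startterminal": 70, "duration": None},
]

durations_by_hour = defaultdict(list)
for trip in trips:
    # WHERE t.startterminal = 70 AND t.duration IS NOT NULL
    if trip["startterminal"] == 70 and trip["duration"] is not None:
        # CAST(SPLIT(SPLIT(startdate, ' ')[1], ':')[0] AS INT)
        hour = int(trip["startdate"].split(" ")[1].split(":")[0])
        durations_by_hour[hour].append(trip["duration"])

# GROUP BY hour ORDER BY hour ASC, with COUNT and ROUND(AVG(duration) / 60)
result = [
    (hour, len(durations), round(sum(durations) / len(durations) / 60))
    for hour, durations in sorted(durations_by_hour.items())
]
print(result)
```

Each tuple is (hour, number of trips, average duration in minutes). The trip from terminal 50 and the trip with a missing duration are filtered out before grouping.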

Lecture: Understanding Pig

Introduction
A high-level platform for creating MapReduce programs using Hadoop
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig Components
● Two components
– Language (Pig Latin)
– Compiler
● Two execution environments
– Local: pig -x local
– Distributed: pig -x mapreduce

Running Pig
● Script: pig myscript
● Command line (Grunt): pig
● Embedded: writing a Java program

Pig Latin

Pig Execution Stages
Source: Introduction to Apache Hadoop-Pig, PrashantKommireddi

Why Pig?
● Makes writing Hadoop jobs easier
– 5% of the code, 5% of the time
– You don't need to be a programmer to write Pig scripts
● Provides the major functionality required for data warehousing and analytics
– Load, Filter, Join, Group By, Order, Transform
● Users can write custom UDFs (User Defined Functions)
Source: Introduction to Apache Hadoop-Pig, PrashantKommireddi

Pig vs. Hive

Hands-On: Running a Pig script

Starting Pig Command Line
$ pig -x mapreduce
2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53
2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hdadmin/pig_1375327740024.log
2013-08-01 10:29:00,066 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hdadmin/.pigbootup not found
2013-08-01 10:29:00,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>

Writing a Pig Script for wordcount
A = load '/user/cloudera/input/*';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into '/user/cloudera/output/wordcountPig';
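To see what each relation in the script holds, the Python sketch below mirrors the statements A to D one by one on two made-up input lines. It is a simplification: splitting on whitespace stands in for TOKENIZE, and plain lists and dictionaries stand in for Pig bags.

```python
# A = load ...: one record per input line (made-up sample lines).
A = ["the quick brown fox", "the lazy dog"]

# B = foreach A generate flatten(TOKENIZE(...)) as word:
# tokenize every line and flatten the result into one record per word.
B = [word for line in A for word in line.split()]

# C = group B by word: each group holds the bag of records for one word.
C = {}
for word in B:
    C.setdefault(word, []).append(word)

# D = foreach C generate COUNT(B), group: (count, word) per group.
D = [(len(bag), word) for word, bag in C.items()]

print(D)
```

As in the Pig script, the count comes first and the word second in each output record.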

Lecture: Understanding Impala

Introduction
Open source massively parallel processing (MPP) SQL query engine
Cloudera Impala is a query engine that runs on Apache Hadoop. Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation.

What is Impala?
● General-purpose SQL engine
● Real-time queries in Apache Hadoop
● Open source under the Apache License
● Runs directly within Hadoop
● High performance
– C++ instead of Java
– Runtime code generator
– Roughly 4-100x Hive

Impala Overview
● Impala daemons run on HDFS nodes
● Statestore (for cluster metadata) vs. Metastore (for database metadata)
● Queries run on "relevant" nodes
● Supports common HDFS file formats
● Submit queries via Hue/Beeswax
● Not fault tolerant

Impala Architecture

Start Impala Query Editor

Update the list of tables/metadata by executing the command invalidate metadata

Restart Impala Query Editor and refresh the table list

Find the top 10 most popular start stations based on the trip data: Using Impala
SELECT startterminal, startstation, COUNT(1) AS count
FROM trip
GROUP BY startterminal, startstation
ORDER BY count DESC
LIMIT 10

Lecture: Understanding Spark

Introduction
● A fast and general engine for large-scale data processing
● An open source big data processing framework built around speed, ease of use, and sophisticated analytics
● Spark enables applications in Hadoop clusters to run up to 100 times faster than MapReduce when working in memory, and up to 10 times faster even when running on disk

What is Spark?
● Framework for distributed processing
● In-memory, fault-tolerant data structures
● Flexible APIs in Scala, Java, Python, SQL, R
● Open source

Why Spark?
● Handles petabytes of data
● Significantly faster than MapReduce
● Simple and intuitive APIs
● General framework
– Runs anywhere
– Handles (most) any I/O
– Interoperable libraries for specific use cases

Source: Jump start into Apache Spark and Databricks

Spark: History
● Started at AMPLab, UC Berkeley
● Created by Matei Zaharia (PhD thesis)
● Maintained by the Apache Software Foundation
● Commercial support by Databricks

Spark Platform

Spark Platform
Source: MapR Academy

Source: MapR Academy

Source: TRAINING Intro to Apache Spark - Brian Clapper

What is an RDD?
● Resilient: if the data in memory (or on a node) is lost, it can be recreated
● Distributed: data is chunked into partitions and stored in memory across the cluster
● Dataset: initial data can come from a table or be created programmatically

RDD
● Fault tolerant
● Immutable
● Three methods for creating an RDD:
– Parallelizing an existing collection
– Referencing a dataset
– Transforming an existing RDD
● Types of files supported:
– Text files
– SequenceFiles
– Hadoop InputFormat

RDD Creation
hdfsData = sc.textFile("hdfs://data.txt")
Source: Pspark: A brain-friendly introduction

RDD: Operations
● Transformations: transformations are lazy (not computed immediately)
● Actions: by default, a transformed RDD is recomputed each time an action is run on it
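
The difference between lazy transformations and actions can be illustrated with a plain-Python analogy. This is not Spark code: lazy_map and the calls list are hypothetical names used only to show when work actually happens.

```python
def lazy_map(func, source):
    """Build a lazy pipeline step: nothing runs until the result is iterated."""
    return lambda: (func(x) for x in source())

calls = []  # records every time the mapped function really runs

def square(x):
    calls.append(x)
    return x * x

numbers = lambda: iter([1, 2, 3])
squared = lazy_map(square, numbers)   # "transformation": no work happens here
calls_before_action = len(calls)

first_total = sum(squared())          # "action": triggers the computation
second_total = sum(squared())         # a second action recomputes everything
print(calls_before_action, first_total, second_total, len(calls))
```

The function runs six times for two actions over three elements, which is the recomputation that persisting an RDD is meant to avoid.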

Directed Acyclic Graph (DAG)

What happens when an action is executed
Source: Spark Fundamentals I, Big Data University

What happens when an action is executed

Spark: Transformation

Single RDD Transformation
Source: Jspark: A brain-friendly introduction

Multiple RDD Transformation
Source: Jspark: A brain-friendly introduction

Pair RDD Transformation
Source: Jspark: A brain-friendly introduction

Spark: Actions

Spark: Persistence

Accumulators
● Similar to a MapReduce “Counter”
● A global variable to track metrics about your Spark program for debugging
● Reasoning: executors do not communicate with each other
● Values are sent back to the driver

Broadcast Variables
● Similar to a MapReduce “Distributed Cache”
● Sends read-only values to worker nodes
● Great for lookup tables, dictionaries, etc.

DataFrame
● A distributed collection of rows organized into named columns
● An abstraction for selecting, filtering, aggregating, and plotting structured data
● Previously known as SchemaRDD

SparkSQL
● Creating and running Spark programs faster
– Write less code
– Read less data
– Let the optimizer do the hard work

Source: Jump start into Apache Spark and Databricks

Stream Process Architecture
Source: MapR Academy

Spark Streaming Architecture
Source: MapR Academy

Processing Spark DStreams
Source: MapR Academy

Use Case: Time Series Data
Source: MapR Academy

Use Case
Source: http://www.insightdataengineering.com/

What is MLlib?
Source: MapR Academy

MLlib algorithms and utilities
Source: MapR Academy

Hadoop + Spark
Source: MapR Academy

Recommended Books

Hands-On: Spark Programming

Functional tools in Python
● map
● filter
● reduce
● lambda
● itertools
– chain, flatmap

Map
>>> a = [1, 2, 3]
>>> def add1(x): return x + 1
>>> map(add1, a)
Result: [2, 3, 4]

Filter
>>> a = [1, 2, 3, 4]
>>> def isOdd(x): return x % 2 == 1
>>> filter(isOdd, a)
Result: [1, 3]

Reduce
>>> a = [1, 2, 3, 4]
>>> def add(x, y): return x + y
>>> reduce(add, a)
Result: 10

lambda
>>> (lambda x: x + 1)(3)
Result: 4
>>> map((lambda x: x + 1), [1, 2, 3])
Result: [2, 3, 4]
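
The results shown on these slides are those of Python 2, where map and filter return lists and reduce is a builtin. A minimal sketch of the same tools under Python 3 follows, together with a flat map built on itertools.chain; flat_map is a hypothetical helper name, not a standard library function.

```python
from functools import reduce          # reduce is no longer a builtin in Python 3
from itertools import chain

a = [1, 2, 3, 4]

# map and filter return lazy iterators in Python 3, so wrap them in list().
plus_one = list(map(lambda x: x + 1, a))
odds = list(filter(lambda x: x % 2 == 1, a))
total = reduce(lambda x, y: x + y, a)

def flat_map(func, items):
    """Apply func to each item and flatten the resulting iterables by one level."""
    return list(chain.from_iterable(map(func, items)))

words = flat_map(lambda line: line.split(" "), ["to be", "or not"])
print(plus_one, odds, total, words)
```

The flat_map helper mirrors what flatMap does on an RDD: each input element may produce zero or more output elements.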

Exercises

More exercises

Start Spark-shell
$ spark-shell

Testing SparkContext
Spark-context
scala> sc

Spark Program in Scala: WordCount
scala> val file = sc.textFile("hdfs:///user/cloudera/input/pg2600.txt")
scala> val wc = file.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> wc.saveAsTextFile("hdfs:///user/cloudera/output/wordcountScala")

WordCount output

Spark Program in Python: WordCount
$ pyspark
>>> from operator import add
>>> file = sc.textFile("hdfs:///user/cloudera/input/pg2600.txt")
>>> wc = file.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
>>> wc.saveAsTextFile("hdfs:///user/cloudera/output/wordcountPython")
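
Before running the job on the full text, the flatMap, map, and reduceByKey steps can be tried on a few lines in plain Python. The sketch below is a local stand-in rather than Spark code; word_count and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain
from operator import add

def word_count(lines):
    """Local stand-in for flatMap -> map -> reduceByKey on a list of lines."""
    words = chain.from_iterable(line.split(" ") for line in lines)   # flatMap
    pairs = ((word, 1) for word in words)                            # map
    counts = defaultdict(int)
    for word, n in pairs:                                            # reduceByKey(add)
        counts[word] = add(counts[word], n)
    return dict(counts)

sample = ["war and peace", "peace and war and peace"]  # hypothetical input lines
counts = word_count(sample)
print(sorted(counts.items()))
```

Splitting on a single space yields empty strings for blank lines and repeated spaces; calling split() with no argument avoids that if empty tokens are not wanted.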

Spark Program in Python: Random
>>> import random
>>> flips = 1000000
>>> # lazy eval
>>> coins = xrange(flips)
>>> heads = sc.parallelize(coins).map(lambda i: random.random()).filter(lambda r: r < 0.51).count()

Transformations
>>> nums = sc.parallelize([1, 2, 3])
>>> squared = nums.map(lambda x: x * x)
>>> even = squared.filter(lambda x: x % 2 == 0)
>>> evens = nums.flatMap(lambda x: range(x))

Actions
>>> nums = sc.parallelize([1, 2, 3])
>>> nums.collect()
>>> nums.take(2)
>>> nums.count()
>>> nums.reduce(lambda x, y: x + y)
>>> nums.saveAsTextFile("hdfs:///user/cloudera/output/test")

Key-Value Operations
>>> pet = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
>>> pet2 = pet.reduceByKey(lambda x, y: x + y)
>>> pet3 = pet.groupByKey()
>>> pet4 = pet.sortByKey()
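
What the three key-value operations return can be previewed with plain-Python analogues on the same pet data. The helpers reduce_by_key and group_by_key are hypothetical local functions, not the Spark API; they only mimic the results on a single machine.

```python
from collections import defaultdict

pet = [("cat", 1), ("dog", 1), ("cat", 2)]

def reduce_by_key(pairs, func):
    """Merge the values of each key with func, like reduceByKey."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

def group_by_key(pairs):
    """Collect all values of each key into a list, like groupByKey."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

pet2 = reduce_by_key(pet, lambda x, y: x + y)
pet3 = group_by_key(pet)
pet4 = sorted(pet, key=lambda kv: kv[0])   # sortByKey orders by key only
print(pet2, pet3, pet4)
```

In Spark the results stay distributed as RDDs until an action such as collect() is called; here they are ordinary dicts and lists.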

Spark Program: Toy_data.txt
Upload the data to HDFS
$ wget https://s3.amazonaws.com/imcbucket/data/toy_data.txt
$ hadoop fs -put toy_data.txt /user/cloudera/input
Start pyspark
$ pyspark

Spark Program: Find Big Spenders
>>> file_rdd = sc.textFile("hdfs:///user/cloudera/input/toy_data.txt")
>>> import json
>>> json_rdd = file_rdd.map(lambda x: json.loads(x))
>>> big_spenders = json_rdd.map(lambda x: tuple((x.keys()[0], int(x.values()[0]))))
>>> big_spenders.reduceByKey(lambda x, y: x + y).filter(lambda x: x[1] > 5).collect()
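
The logic of the big-spenders job can be sketched locally. The code above implies that each line of toy_data.txt is a JSON object with a single name-to-amount entry; the sample lines below are made up to match that assumed shape, and big_spenders is a hypothetical helper.

```python
import json

# Hypothetical lines in the shape the code above expects:
# one JSON object per line with a single name -> amount entry.
lines = ['{"Jim": 2}', '{"Ann": 4}', '{"Jim": 5}', '{"Bob": 1}', '{"Ann": 3}']

def big_spenders(raw_lines, threshold=5):
    """Sum the amounts per name and keep names whose total exceeds threshold."""
    totals = {}
    for raw in raw_lines:
        record = json.loads(raw)
        name, amount = next(iter(record.items()))  # works in Python 2 and 3
        totals[name] = totals.get(name, 0) + int(amount)
    return sorted((name, total) for name, total in totals.items() if total > threshold)

result = big_spenders(lines)
print(result)
```

Taking the first item with next(iter(...)) sidesteps x.keys()[0], which only works in Python 2 because dict views are not indexable in Python 3.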

Project: Flight

Flight Details Data
http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

Flight Details Data
http://stat-computing.org/dataexpo/2009/the-data.html

Data Description

Snapshot of Dataset

FiveThirtyEight
http://projects.fivethirtyeight.com/flights/

Spark Program: Upload Flight Delay Data
Upload the data to HDFS
$ wget https://s3.amazonaws.com/imcbucket/data/flights/2008.csv
$ hadoop fs -put 2008.csv /user/cloudera/input

Spark Program: Navigating Flight Delay Data
>>> airline = sc.textFile("hdfs:///user/cloudera/input/2008.csv")
>>> airline.take(2)

Spark Program: Preparing Data
>>> header_line = airline.first()
>>> header_list = header_line.split(',')
>>> airline_no_header = airline.filter(lambda row: row != header_line)
>>> airline_no_header.first()
>>> def make_row(row):
...     row_list = row.split(',')
...     d = dict(zip(header_list, row_list))
...     return d
...
>>> airline_rows = airline_no_header.map(make_row)
>>> airline_rows.take(5)
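
The header-to-dict step can be tried locally on a few lines. This sketch swaps in Python's csv module instead of split(','), since csv also copes with quoted fields that contain commas; the column subset and the values are made up, and rows_as_dicts is a hypothetical helper.

```python
import csv
import io

# Hypothetical three-line extract; the column subset and values are made up.
raw = "Origin,Dest,ArrDelay,DepDelay\nIAD,TPA,-14,8\nIND,BWI,NA,NA\n"

def rows_as_dicts(text):
    """Drop the header line and turn every other line into a dict keyed by column name."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return [dict(zip(header, fields)) for fields in reader]

rows = rows_as_dicts(raw)
print(rows[0]["Dest"], rows[1]["ArrDelay"])
```

All values are still strings at this point, which is why a conversion helper is needed before computing delays.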

Spark Program: Define convert_float function
>>> def convert_float(value):
...     try:
...         x = float(value)
...         return x
...     except ValueError:
...         return 0
...
>>>

Spark Program: Finding best/worst airports
>>> destination_rdd = airline_rows.map(lambda row: (row['Dest'], convert_float(row['ArrDelay'])))
>>> origin_rdd = airline_rows.map(lambda row: (row['Origin'], convert_float(row['DepDelay'])))
>>> destination_rdd.take(2)
>>> origin_rdd.take(2)

Spark Program: Finding best/worst airports
>>> import numpy as np
>>> mean_delays_dest = destination_rdd.groupByKey().mapValues(lambda delays: np.mean(delays.data))
>>> mean_delays_dest.sortBy(lambda t: t[1], ascending=True).take(10)
>>> mean_delays_dest.sortBy(lambda t: t[1], ascending=False).take(10)
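
The mean-delay calculation can be sketched without Spark or NumPy by keeping a running sum and count per airport. The sample pairs are made up and mean_delay_by_airport is a hypothetical helper; convert_float follows the same idea as the helper defined above.

```python
from collections import defaultdict

def convert_float(value):
    """Same idea as the slide's helper: unparsable values count as 0."""
    try:
        return float(value)
    except ValueError:
        return 0.0

# Hypothetical (Dest, ArrDelay) pairs as raw strings.
pairs = [("TPA", "-14"), ("BWI", "34"), ("TPA", "2"), ("BWI", "NA"), ("JAX", "11")]

def mean_delay_by_airport(records):
    sums = defaultdict(lambda: [0.0, 0])       # airport -> [total delay, count]
    for airport, delay in records:
        sums[airport][0] += convert_float(delay)
        sums[airport][1] += 1
    return {airport: total / count for airport, (total, count) in sums.items()}

means = mean_delay_by_airport(pairs)
best = sorted(means.items(), key=lambda t: t[1])                   # smallest mean delay first
worst = sorted(means.items(), key=lambda t: t[1], reverse=True)    # largest mean delay first
print(best[0], worst[0])
```

Counting "NA" as a delay of 0 pulls the mean toward zero, so filtering such rows out first may be preferable for a real analysis.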

Hands-On: Spark SQL

SparkSQL

Spark Program: Upload Data
Upload the data to HDFS
$ wget https://s3.amazonaws.com/imcbucket/data/events.txt
$ wget https://s3.amazonaws.com/imcbucket/data/meals.txt
$ hadoop fs -put events.txt /user/cloudera/input
$ hadoop fs -put meals.txt /user/cloudera/input

Spark SQL: Preparing data
>>> meals_rdd = sc.textFile("hdfs:///user/cloudera/input/meals.txt")
>>> events_rdd = sc.textFile("hdfs:///user/cloudera/input/events.txt")
>>> header_meals = meals_rdd.first()
>>> header_events = events_rdd.first()
>>> meals_no_header = meals_rdd.filter(lambda row: row != header_meals)
>>> events_no_header = events_rdd.filter(lambda row: row != header_events)
>>> meals_json = meals_no_header.map(lambda row: row.split(';')).map(lambda row_list: dict(zip(header_meals.split(';'), row_list)))
>>> events_json = events_no_header.map(lambda row: row.split(';')).map(lambda row_list: dict(zip(header_events.split(';'), row_list)))

Spark SQL: Preparing data
>>> import json
>>> def type_conversion(d, columns):
...     for c in columns:
...         d[c] = int(d[c])
...     return d
...
>>> meal_typed = meals_json.map(lambda j: json.dumps(type_conversion(j, ['meal_id', 'price'])))
>>> event_typed = events_json.map(lambda j: json.dumps(type_conversion(j, ['meal_id', 'userid'])))

Spark SQL: Create DataFrame
>>> meals_dataframe = sqlContext.jsonRDD(meal_typed)
>>> events_dataframe = sqlContext.jsonRDD(event_typed)
>>> meals_dataframe.head()
>>> meals_dataframe.printSchema()

Spark SQL: Running SQL Query
>>> meals_dataframe.registerTempTable('meals')
>>> events_dataframe.registerTempTable('events')
>>> sqlContext.sql("SELECT * FROM meals LIMIT 5").collect()
>>> meals_dataframe.take(5)

Spark SQL: More complex query
>>> sqlContext.sql("""
... SELECT type, COUNT(type) AS cnt FROM
... meals
... INNER JOIN
... events on meals.meal_id = events.meal_id
... WHERE
... event = 'bought'
... GROUP BY
... type
... ORDER BY cnt DESC
... """).collect()
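
The join query itself can be checked on a handful of rows with Python's built-in sqlite3 module. This is only a local stand-in for Spark SQL: the column names follow the slides, but the rows are made up and the table definitions are an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE meals  (meal_id INTEGER, type TEXT, price INTEGER);
    CREATE TABLE events (meal_id INTEGER, userid INTEGER, event TEXT);
""")
# Hypothetical rows; only the column names follow the slides.
conn.executemany("INSERT INTO meals VALUES (?, ?, ?)",
                 [(1, "french", 10), (2, "italian", 8), (3, "french", 12)])
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(1, 100, "bought"), (3, 101, "bought"), (2, 102, "bought"),
                  (2, 103, "share"), (1, 104, "like")])

# Same query as above: count purchases per meal type, most popular first.
rows = conn.execute("""
    SELECT type, COUNT(type) AS cnt
    FROM meals
    INNER JOIN events ON meals.meal_id = events.meal_id
    WHERE event = 'bought'
    GROUP BY type
    ORDER BY cnt DESC
""").fetchall()
conn.close()
print(rows)
```

Only events with event = 'bought' survive the WHERE clause, so the "share" and "like" rows do not contribute to the counts.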

Hands-On: WordCount using Spark Streaming

Start Spark-shell with extra memory

WordCount using Spark Streaming
scala> :paste
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import StorageLevel._
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("localhost", 8585, MEMORY_ONLY)
val wordsFlatMap = lines.flatMap(_.split(" "))
val wordsMap = wordsFlatMap.map(w => (w, 1))
val wordCount = wordsMap.reduceByKey((a, b) => (a + b))
wordCount.print
ssc.start
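
The streaming job above counts words separately for each 2-second batch. A plain-Python sketch of that per-batch behavior follows; it involves no sockets or Spark, and the batches list and count_words_per_batch helper are assumptions made for illustration.

```python
from collections import Counter

def count_words_per_batch(batches):
    """Count words independently inside each micro-batch, as the DStream code does."""
    results = []
    for lines in batches:
        words = [w for line in lines for w in line.split(" ")]
        results.append(dict(Counter(words)))
    return results

# Hypothetical input: lines received during three consecutive 2-second batches.
batches = [["hello spark", "hello"], [], ["spark streaming"]]
per_batch = count_words_per_batch(batches)
print(per_batch)
```

Counts are not carried over from one batch to the next; keeping a running total across batches would need a stateful operation such as updateStateByKey in Spark Streaming.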

Running the netcat server in another window (for example, $ nc -lk 8585, to match the port used in the streaming code)

Hands-On: Streaming Twitter data

Create a new Twitter App
Log in to your Twitter account @ twitter.com

Create a new Twitter App (cont.)
Create a new Twitter App @ apps.twitter.com

Create a new Twitter App (cont.)
Enter all the details in the application:

Create a new Twitter App (cont.)
Your application will be created:

Create a new Twitter App (cont.)
Click on Keys and Access Tokens:

Create a new Twitter App (cont.)
Your access token has been created:

Download the third-party libraries
$ wget http://central.maven.org/maven2/org/apache/spark/spark-streaming-twitter_2.10/1.2.0/spark-streaming-twitter_2.10-1.2.0.jar
$ wget http://central.maven.org/maven2/org/twitter4j/twitter4j-stream/4.0.2/twitter4j-stream-4.0.2.jar
$ wget http://central.maven.org/maven2/org/twitter4j/twitter4j-core/4.0.2/twitter4j-core-4.0.2.jar
Run Spark-shell
$ spark-shell --jars spark-streaming-twitter_2.10-1.2.0.jar,twitter4j-stream-4.0.2.jar,twitter4j-core-4.0.2.jar

Running Spark commands
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.streaming.twitter._
import twitter4j.auth._
import twitter4j.conf._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val ssc = new StreamingContext(sc, Seconds(10))
val cb = new ConfigurationBuilder

Running Spark commands
// Replace the four values below with the keys and tokens of your own Twitter app
cb.setDebugEnabled(true).setOAuthConsumerKey("MjpswndxVj27ylnpOoSBrnfLX").setOAuthConsumerSecret("QYmuBO1smD5Yc3zE0ZF9ByCgeEQxnxUmhRVCisAvPFudYVjC4a").setOAuthAccessToken("921172807-EfMXJj6as2dFECDH1vDe5goyTHcxPrF1RIJozqgx").setOAuthAccessTokenSecret("HbpZEVip3D5j80GP21a37HxA4y10dH9BHcgEFXUNcA9xy")
val auth = new OAuthAuthorization(cb.build)
val tweets = TwitterUtils.createStream(ssc, Some(auth))
val status = tweets.map(status => status.getText)
status.print
ssc.checkpoint("hdfs:///user/cloudera/data/tweets")
ssc.start
ssc.awaitTermination

Thank you
www.imcinstitute.com
www.facebook.com/imcinstitute
