Welcome to Hands-on Session on Big Data Processing Using Apache Spark
reachus@cloudxlab.com | CloudxLab.com
+1 419 665 3276 (US) | +91 803 959 1464 (IN)
Agenda
1. Apache Spark Introduction
2. CloudxLab Introduction
3. Introduction to RDD (Resilient Distributed Datasets)
4. Loading data into an RDD
5. RDD Operations: Transformations
6. RDD Operations: Actions
7. Hands-on demos using CloudxLab
8. Questions and Answers
Hands-On Objective: Compute the word frequency of a text file stored in HDFS (Hadoop Distributed File System) using Apache Spark.
Welcome to CloudxLab Session
A cloud-based lab for students to gain hands-on experience in Big Data technologies such as Hadoop and Spark.
• Learn Through Practice
• Real Environment
• Connect From Anywhere
• Connect From Any Device
• Centralized Data Sets
• No Installation
• No Compatibility Issues
• 24x7 Support
About the Instructor
2015        CloudxLab: Founded CloudxLab, a big data learning platform.
2014        KnowBigData: Founded KnowBigData.
2012-2014   Amazon: Built high-throughput systems for the Amazon.com site using an in-house NoSQL store.
2011-2012   InMobi: Built a recommender system that churns through 200 TB of data.
2006-2011   tBits Global: Founded tBits Global; built an enterprise-grade document management system.
2002-2006   D.E. Shaw: Built big data systems before the term was coined.
2002        IIT Roorkee: Finished B.Tech.
Apache Spark
A fast and general engine for large-scale data processing.
• Really fast MapReduce: 100x faster than Hadoop MapReduce in memory, 10x faster on disk
• Builds on similar paradigms as MapReduce
• Integrated with Hadoop
Spark Architecture
• Libraries: SQL, Streaming, MLlib, GraphX, SparkR
• Languages: Java, Python, Scala
• Spark Core
• Cluster managers: Standalone, Amazon EC2, Hadoop YARN, Apache Mesos
• Storage: HDFS, HBase, Hive, Tachyon, ...
Getting Started - Launching the Console
Option 1: Open CloudxLab.com to get your login/password, then log in to the console.
Option 2: Run locally
• Download Spark from http://spark.apache.org/downloads.html
• Install Python
• (Optional) Install Hadoop
Then run pyspark.
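In the interactive pyspark shell a SparkContext is already available as sc. If you are instead running a standalone script, a minimal sketch would create one explicitly; the master URL "local[*]" and the app name below are illustrative assumptions, not required values:

    # Sketch: creating a SparkContext outside the pyspark shell.
    # Inside the pyspark shell this is unnecessary -- `sc` already exists.
    from pyspark import SparkContext

    sc = SparkContext(master="local[*]", appName="wordcount-demo")
    print(sc.version)   # confirm the context is up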
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET (RDD)
A collection of elements partitioned across the cluster.
• An RDD can be persisted in memory
• RDDs automatically recover from node failures
• Can hold any data type; a special pair-RDD type exists for key-value data
• Supports two types of operations: transformations and actions (see the sketch below)
• RDDs are read-only (immutable)
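To make the transformation/action distinction concrete, here is a minimal sketch (the numbers are arbitrary sample data): a transformation such as map() only describes a new RDD, while an action such as collect() triggers the actual computation.

    nums = sc.parallelize([1, 2, 3, 4])      # create an RDD
    squares = nums.map(lambda x: x * x)      # transformation: lazy, nothing runs yet
    print(squares.collect())                 # action: computation happens here -> [1, 4, 9, 16]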
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET
Convert an existing array into an RDD:
    myarray = sc.parallelize([1, 3, 5, 6, 19, 21])
Load a file from HDFS:
    lines = sc.textFile('/data/mr/wordcount/input/big.txt')
Check the first 10 lines:
    lines.take(10)   # triggers the actual execution: loads the file and returns 10 lines
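As a quick sanity check on the loaded RDD, a couple of other small actions can be run (a sketch, using the same lines RDD as above):

    print(lines.count())   # total number of lines in the file
    print(lines.first())   # just the first line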
SPARK - TRANSFORMATIONS
Transformations build a new RDD from an existing one. They are evaluated lazily; intermediate results can be kept around with persist() or cache().
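A small sketch of caching, assuming the lines RDD loaded earlier: the first action pays the cost of reading the file, later actions reuse the in-memory copy.

    lines.cache()            # equivalent to persist() with the default memory-only storage level
    print(lines.count())     # first action: reads the file and caches the partitions
    print(lines.count())     # second action: served from memory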
SPARK - TRANSFORMATIONS
• map(func): Return a new distributed dataset formed by passing each element of the source through the function func. Analogous to FOREACH in Pig.
• filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
• flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
• groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
See more: sample, union, intersection, distinct, groupByKey, reduceByKey, sortByKey, join
https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html
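A minimal sketch exercising a few of these transformations on a tiny in-memory dataset (the sample values are made up for illustration):

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))          # map: transform every element
    only_a  = pairs.filter(lambda kv: kv[0] == "a")             # filter: keep matching elements
    letters = sc.parallelize(["x y", "z"]).flatMap(str.split)   # flatMap: one line -> many words
    grouped = pairs.groupByKey()                                # groupByKey: (K, Iterable<V>)

    print(only_a.collect())                                     # [('a', 1), ('a', 3)]
    print(sorted((k, list(v)) for k, v in grouped.collect()))   # [('a', [1, 3]), ('b', [2])]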
SPARK - Break the lines into words
Define a function to convert a line into words:
    def toWords(mystr):
        wordsArr = mystr.split()
        return wordsArr
Execute the flatMap() transformation:
    words = lines.flatMap(toWords)
Check the first 10 words:
    words.take(10)   # triggers the actual execution and returns the first 10 words
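As an aside, the same step can be written without a named helper, using a lambda (equivalent to toWords above):

    words = lines.flatMap(lambda line: line.split())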
SPARK - Cleaning the data
Define a function to clean a word and convert it to a key-value pair:
    import re
    def cleanKV(mystr):
        mystr = mystr.lower()
        mystr = re.sub("[^0-9a-z]", "", mystr)   # strip characters that are not alphanumeric
        return (mystr, 1)                        # return a tuple: (word, count)
Execute the map() transformation:
    cleanWordsKV = words.map(cleanKV)   # passing the cleanKV function as an argument
Check the first 10 word pairs:
    cleanWordsKV.take(10)   # triggers the actual execution and returns the first 10 pairs
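One side effect of the cleaning step is that tokens made entirely of punctuation become empty strings. If that matters, an optional extra step (not part of the original flow) can drop them:

    cleanWordsKV = words.map(cleanKV).filter(lambda kv: kv[0] != "")   # drop empty words produced by cleaning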
SPARK - ACTIONS
Actions return a value to the driver program.
SPARK - ACTIONS
• reduce(func): Aggregate the elements of the dataset using a function that takes 2 arguments and returns one. The function must be commutative and associative so it can run in parallel.
• count(): Return the number of elements in the dataset.
• collect(): Return all elements of the dataset as an array at the driver. Use only for small outputs.
• take(n): Return an array with the first n elements of the dataset. Not parallel.
See more: first(), takeSample(), takeOrdered(), saveAsTextFile(path), reduceByKey()
https://spark.apache.org/docs/1.5.0/api/python/pyspark.html#pyspark.RDD
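A quick sketch of these actions on a small RDD (sample numbers chosen arbitrarily):

    nums = sc.parallelize([5, 1, 4, 2, 3])

    print(nums.reduce(lambda x, y: x + y))   # 15 -- commutative/associative aggregation
    print(nums.count())                      # 5
    print(nums.collect())                    # all elements at the driver: [5, 1, 4, 2, 3]
    print(nums.take(2))                      # first 2 elements: [5, 1]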
SPARK - Compute the word counts
Define an aggregation function:
    def sum(x, y):       # note: this shadows Python's built-in sum()
        return x + y
Execute the reduceByKey() transformation:
    wordsCount = cleanWordsKV.reduceByKey(sum)
Check the first 10 counts:
    wordsCount.take(10)   # triggers the actual execution and returns the first 10 (word, count) pairs
Save the result:
    wordsCount.saveAsTextFile("mynewdirectory")
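To eyeball the most frequent words rather than an arbitrary first 10, a small follow-up using takeOrdered (not part of the original slide) could be:

    top10 = wordsCount.takeOrdered(10, key=lambda kv: -kv[1])   # 10 most frequent words
    print(top10)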
Word count walkthrough
INPUT: "After taking a shot with his bow, the archer took a bow."
words = lines.flatMap(toWords)
  -> After, taking, a, shot, with, his, bow,, the, archer, took, a, bow.
cleanWordsKV = words.map(cleanKV)
  -> (after,1) (taking,1) (a,1) (shot,1) (with,1) (his,1) (bow,1) (the,1) (archer,1) (took,1) (a,1) (bow,1)
wordsCount = cleanWordsKV.reduceByKey(sum)
  -> (after,1) (taking,1) (a,2) (shot,1) (with,1) (his,1) (bow,2) (the,1) (archer,1) (took,1)
wordsCount.saveAsTextFile("mynewdirectory")
  -> SAVE TO HDFS file
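Putting the whole pipeline together, here is a minimal end-to-end sketch of the word count covered above (the input path and output directory are the ones used in the earlier slides; run it in the pyspark shell, where the SparkContext line is unnecessary, or submit it with spark-submit):

    import re
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount")              # already available as `sc` in the pyspark shell

    def toWords(line):
        return line.split()

    def cleanKV(word):
        word = re.sub("[^0-9a-z]", "", word.lower())    # keep lowercase letters and digits only
        return (word, 1)

    lines = sc.textFile('/data/mr/wordcount/input/big.txt')
    wordsCount = (lines.flatMap(toWords)
                       .map(cleanKV)
                       .reduceByKey(lambda x, y: x + y))
    wordsCount.saveAsTextFile("mynewdirectory")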
Thank you.
+1 419 665 3276 (US) | +91 803 959 1464 (IN)
reachus@cloudxlab.com
Subscribe to our YouTube channel for the latest videos: https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA
