twitter: @rabbitonweb, email: paul.szulc@gmail.com Apache Spark Workshops by Paweł Szulc
twitter: @rabbitonweb, email: paul.szulc@gmail.com Before we start Make sure you’ve installed: JDK, Scala, SBT Clone project: https://github.com/rabbitonweb/spark-workshop Run `sbt compile` on it to fetch all dependencies
twitter: @rabbitonweb, email: paul.szulc@gmail.com What are we going to cover?
twitter: @rabbitonweb, email: paul.szulc@gmail.com What is Apache Spark?
twitter: @rabbitonweb, email: paul.szulc@gmail.com Apache Spark
twitter: @rabbitonweb, email: paul.szulc@gmail.com Apache Spark “Apache Spark™ is a fast and general engine for large-scale data processing.”
twitter: @rabbitonweb, email: paul.szulc@gmail.com Why?
twitter: @rabbitonweb, email: paul.szulc@gmail.com Why? buzzword: Big Data
twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is like...
twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is like... “Big Data is like teenage sex:
twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is like... “Big Data is like teenage sex: everyone talks about it,
twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is like... “Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it,
twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is like... “Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it,
twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is like... “Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it”
twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is all about...
twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is all about... ● well, the data :)
twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is all about... ● well, the data :) ● It is said that 2.5 exabytes (2.5×10^18) of data is being created around the world every single day
twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is all about... “Every two days, we generate as much information as we did from the dawn of civilization until 2003” -- Eric Schmidt Former CEO Google
twitter: @rabbitonweb, email: paul.szulc@gmail.com source: http://papyrus.greenville.edu/2014/03/selfiesteem/
twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is all about... ● well, the data :) ● It is said that 2.5 exabytes (2.5×10^18) of data is being created around the world every single day
twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is all about... ● well, the data :) ● It is said that 2.5 exabytes (2.5×10^18) of data is being created around the world every single day ● It is a volume at which standard tools and methods of analysis no longer work
twitter: @rabbitonweb, email: paul.szulc@gmail.com Challenges of Big Data ● The gathering ● Processing and discovery ● Presenting it to the business ● Hardware and network failures
twitter: @rabbitonweb, email: paul.szulc@gmail.com What was before?
twitter: @rabbitonweb, email: paul.szulc@gmail.com To the rescue MAP REDUCE
twitter: @rabbitonweb, email: paul.szulc@gmail.com To the rescue MAP REDUCE “'MapReduce' is a framework for processing parallelizable problems across huge datasets using a cluster, taking into consideration scalability and fault-tolerance”
twitter: @rabbitonweb, email: paul.szulc@gmail.com MapReduce - phases (1) MapReduce is composed of a sequence of two phases:
twitter: @rabbitonweb, email: paul.szulc@gmail.com MapReduce - phases (1) MapReduce is composed of a sequence of two phases: 1. Map
twitter: @rabbitonweb, email: paul.szulc@gmail.com MapReduce - phases (1) MapReduce is composed of a sequence of two phases: 1. Map 2. Reduce
twitter: @rabbitonweb, email: paul.szulc@gmail.com MapReduce - phases (2) MapReduce is composed of a sequence of two phases: 1. Map 2. Reduce
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word Count ● The “Hello World” of the Big Data world. ● For an initial input of multiple lines, extract all words with their number of occurrences To be or not to be Let it be Be me It must be Let it be be 6 to 2 let 2 or 1 not 1 must 1 me 1
twitter: @rabbitonweb, email: paul.szulc@gmail.com Input To be or not to be Let it be Be me It must be Let it be
twitter: @rabbitonweb, email: paul.szulc@gmail.com Input Splitting To be or not to be Let it be Be me It must be Let it be To be or not to be Let it be It must be Let it be Be me
twitter: @rabbitonweb, email: paul.szulc@gmail.com Input Splitting Mapping To be or not to be Let it be Be me It must be Let it be To be or not to be Let it be It must be Let it be Be me to 1 be 1 or 1 not 1 to 1 be 1 let 1 it 1 be 1 be 1 me 1 let 1 it 1 be 1 it 1 must 1 be 1
twitter: @rabbitonweb, email: paul.szulc@gmail.com Input Splitting Mapping Shuffling To be or not to be Let it be Be me It must be Let it be To be or not to be Let it be It must be Let it be Be me to 1 be 1 or 1 not 1 to 1 be 1 let 1 it 1 be 1 be 1 me 1 let 1 it 1 be 1 it 1 must 1 be 1 be 1 be 1 be 1 be 1 be 1 be 1 to 1 to 1 or 1 not 1 let 1 let 1 must 1 me 1
twitter: @rabbitonweb, email: paul.szulc@gmail.com Input Splitting Mapping Shuffling To be or not to be Let it be Be me It must be Let it be To be or not to be Let it be It must be Let it be Be me to 1 be 1 or 1 not 1 to 1 be 1 let 1 it 1 be 1 be 1 me 1 let 1 it 1 be 1 it 1 must 1 be 1 be 1 be 1 be 1 be 1 be 1 be 1 to 1 to 1 or 1 not 1 let 1 let 1 must 1 me 1 EXPENSIVE
twitter: @rabbitonweb, email: paul.szulc@gmail.com Input Splitting Mapping Shuffling To be or not to be Let it be Be me It must be Let it be To be or not to be Let it be It must be Let it be Be me to 1 be 1 or 1 not 1 to 1 be 1 let 1 it 1 be 1 be 1 me 1 let 1 it 1 be 1 it 1 must 1 be 1 be 1 be 1 be 1 be 1 be 1 be 1 to 1 to 1 or 1 not 1 let 1 let 1 must 1 me 1
twitter: @rabbitonweb, email: paul.szulc@gmail.com Input Splitting Mapping Shuffling Reducing To be or not to be Let it be Be me It must be Let it be To be or not to be Let it be It must be Let it be Be me to 1 be 1 or 1 not 1 to 1 be 1 let 1 it 1 be 1 be 1 me 1 let 1 it 1 be 1 it 1 must 1 be 1 be 1 be 1 be 1 be 1 be 1 be 1 to 1 to 1 or 1 not 1 let 1 let 1 must 1 me 1 be 6 to 2 or 1 not 1 let 2 must 1 me 1
twitter: @rabbitonweb, email: paul.szulc@gmail.com Input Splitting Mapping Shuffling Reducing Final result To be or not to be Let it be Be me It must be Let it be To be or not to be Let it be It must be Let it be Be me to 1 be 1 or 1 not 1 to 1 be 1 let 1 it 1 be 1 be 1 me 1 let 1 it 1 be 1 it 1 must 1 be 1 be 1 be 1 be 1 be 1 be 1 be 1 to 1 to 1 or 1 not 1 let 1 let 1 must 1 me 1 be 6 to 2 or 1 not 1 let 2 must 1 me 1 be 6 to 2 let 2 or 1 not 1 must 1 me 1
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count - pseudo-code function map(String name, String document): for each word w in document: emit (w, 1)
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count - pseudo-code function map(String name, String document): for each word w in document: emit (w, 1) function reduce(String word, Iterator partialCounts): sum = 0 for each pc in partialCounts: sum += ParseInt(pc) emit (word, sum)
twitter: @rabbitonweb, email: paul.szulc@gmail.com Map-Reduce-Map-Reduce-Map-Red..
twitter: @rabbitonweb, email: paul.szulc@gmail.com Why?
twitter: @rabbitonweb, email: paul.szulc@gmail.com Why Apache Spark? We have had an open-source MapReduce implementation (Hadoop) running successfully for the last 12 years. Why bother?
twitter: @rabbitonweb, email: paul.szulc@gmail.com Problems with Map Reduce 1. MapReduce provides a difficult programming model for developers
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count - revisited function map(String name, String document): for each word w in document: emit (w, 1) function reduce(String word, Iterator partialCounts): sum = 0 for each pc in partialCounts: sum += ParseInt(pc) emit (word, sum)
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count: Hadoop implementation public class WordCount { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } } public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class);
twitter: @rabbitonweb, email: paul.szulc@gmail.com Problems with Map Reduce 1. MapReduce provides a difficult programming model for developers
twitter: @rabbitonweb, email: paul.szulc@gmail.com Problems with Map Reduce 1. MapReduce provides a difficult programming model for developers 2. It suffers from a number of performance issues
twitter: @rabbitonweb, email: paul.szulc@gmail.com Performance issues ● Every step must be expressed as a Map-Reduce pair
twitter: @rabbitonweb, email: paul.szulc@gmail.com Performance issues ● Every step must be expressed as a Map-Reduce pair ● Output of every phase is saved to disk
twitter: @rabbitonweb, email: paul.szulc@gmail.com Performance issues ● Every step must be expressed as a Map-Reduce pair ● Output of every phase is saved to disk ● Iterative algorithms go through the IO path again and again
twitter: @rabbitonweb, email: paul.szulc@gmail.com Performance issues ● Every step must be expressed as a Map-Reduce pair ● Output of every phase is saved to disk ● Iterative algorithms go through the IO path again and again ● Poor (key, value) API - even a basic join requires expensive code
twitter: @rabbitonweb, email: paul.szulc@gmail.com Problems with Map Reduce 1. MapReduce provides a difficult programming model for developers 2. It suffers from a number of performance issues
twitter: @rabbitonweb, email: paul.szulc@gmail.com Problems with Map Reduce 1. MapReduce provides a difficult programming model for developers 2. It suffers from a number of performance issues 3. While batch-mode analysis is still important, reacting to events as they arrive has become more important (MapReduce lacks support for “almost” real-time processing)
twitter: @rabbitonweb, email: paul.szulc@gmail.com Apache Spark to the rescue
twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Cluster (Standalone, Yarn, Mesos)
twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos)
twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) SPARK API: 1. Scala 2. Java 3. Python
twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) SPARK API: 1. Scala 2. Java 3. Python Master
twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master
twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt”
twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master)
twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf)
twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) Executor 1 Executor 2 Executor 3
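The build-up above, put together as one runnable driver program - a minimal sketch only; the local[*] master, the application name and the sample data are assumptions made for the example, not part of the slides:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM using all cores - convenient for the workshop;
    // against a real cluster you would point it at the master, e.g. "spark://host:port".
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("spark-workshop")   // an application name is required as well
    val sc = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 100)
    println(numbers.count())          // count() is an action, so it actually runs a job

    sc.stop()
  }
}
```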
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - the definition
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - the definition RDD stands for resilient distributed dataset
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - the definition RDD stands for resilient distributed dataset Resilient - if data is lost, it can be recreated
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - the definition RDD stands for resilient distributed dataset Resilient - if data is lost, it can be recreated Distributed - stored on nodes across the cluster
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - the definition RDD stands for resilient distributed dataset Resilient - if data is lost, it can be recreated Distributed - stored on nodes across the cluster Dataset - the initial data comes from a file or can be created programmatically
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("hdfs://logs.txt")
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("hdfs://logs.txt") From Hadoop Distributed File System
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("hdfs://logs.txt") From Hadoop Distributed File System This is the RDD
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("/home/rabbit/logs.txt") From local file system (must be available on executors) This is the RDD
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.parallelize(List(1, 2, 3, 4)) Programmatically from a collection of elements This is the RDD
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt")
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase)
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) Creates a new RDD
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”))
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”)) And yet another RDD
twitter: @rabbitonweb, email: paul.szulc@gmail.com Getting started with Spark https://github.com/rabbitonweb/spark-workshop /src/main/scala/sw/ex1/ /src/main/resources/all-shakespeare.txt
twitter: @rabbitonweb, email: paul.szulc@gmail.com Getting started with Spark https://github.com/rabbitonweb/spark-workshop ● Make it a love story: Print out all lines that have both Juliet & Romeo in it http://spark.apache.org/docs/latest/api/scala/index.html#org.
twitter: @rabbitonweb, email: paul.szulc@gmail.com Getting started with Spark https://github.com/rabbitonweb/spark-workshop ● Make it a love story: Print out all lines that have both Juliet & Romeo in it ● Would be nice to have a REPL
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”)) And yet another RDD
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”)) And yet another RDD Performance Alert?!?!
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - Operations 1. Transformations a. Map b. Filter c. FlatMap d. Sample e. Union f. Intersect g. Distinct h. GroupByKey i. …. 2. Actions a. Reduce b. Collect c. Count d. First e. Take(n) f. TakeSample g. SaveAsTextFile h. ….
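The split matters because transformations are lazy while actions are eager. A small sketch, assuming an existing SparkContext sc and a logs.txt input file:

```scala
// Transformations only describe a new RDD; no data is touched yet.
val lines  = sc.textFile("logs.txt")
val errors = lines.filter(_.contains("error"))   // still nothing computed

// Actions trigger the actual computation and return a value to the driver.
println(errors.count())                          // runs a job
println(errors.take(5).mkString("\n"))           // runs another job
```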
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”))
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”)) val numberOfErrors = errors.count
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”)) val numberOfErrors = errors.count This will trigger the computation
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”)) val numberOfErrors = errors.count This will trigger the computation This will be the calculated value (a Long)
twitter: @rabbitonweb, email: paul.szulc@gmail.com Example 2 - other actions https://github.com/rabbitonweb/spark-workshop /src/main/scala/ex2/
twitter: @rabbitonweb, email: paul.szulc@gmail.com Exercise 2: ● Save results of your calculations as text file
twitter: @rabbitonweb, email: paul.szulc@gmail.com Exercise 2: ● Save results of your calculations as text file ● Hint: saveAsTextFile
twitter: @rabbitonweb, email: paul.szulc@gmail.com Exercise 2: ● Save results of your calculations as text file ● Hint: saveAsTextFile ● Why is the output so weird?
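A hint for that last question - a sketch assuming the errors RDD from the earlier example: saveAsTextFile writes a directory rather than a single file, with one part-NNNNN file per partition, because every partition is written in parallel by its own task.

```scala
errors.saveAsTextFile("out")   // creates out/ with part-00000, part-00001, ... (plus a _SUCCESS marker)

// If a single output file is really needed (small results only!), shrink to one partition first:
errors.coalesce(1).saveAsTextFile("out-single")
```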
twitter: @rabbitonweb, email: paul.szulc@gmail.com Partitions? A partition represents a subset of the data within your distributed collection.
twitter: @rabbitonweb, email: paul.szulc@gmail.com Partitions? A partition represents a subset of the data within your distributed collection. The number of partitions is tightly coupled with the level of parallelism.
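A few ways to inspect and influence partitioning - a sketch, assuming an existing SparkContext sc:

```scala
val logs = sc.textFile("logs.txt")        // partition count derived from the input splits
println(logs.partitions.length)           // how many partitions (and therefore tasks) there are

val logs8 = sc.textFile("logs.txt", 8)    // request a minimum of 8 partitions up front
val fewer = logs8.coalesce(2)             // fewer partitions, without a shuffle
val more  = logs8.repartition(16)         // more (or rebalanced) partitions, with a shuffle
```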
Partitions evaluation val counted = sc.textFile(..).count
Partitions evaluation val counted = sc.textFile(..).count node 1 node 2 node 3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Task = partition + calculation Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) Executor 1 Executor 2 Executor 3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Task = partition + calculation Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) Executor 1 Executor 2 Executor 3 HDFS, GlusterFS, locality
twitter: @rabbitonweb, email: paul.szulc@gmail.com Task = partition + calculation Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Task = partition + calculation Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1T2T3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Task = partition + calculation Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2T3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Task = partition + calculation Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Example 3 - working with key-value https://github.com/rabbitonweb/spark-workshop /src/main/scala/ex3/
twitter: @rabbitonweb, email: paul.szulc@gmail.com Exercise 3 - working with key-value ● Change sw.ex3.Startings to sort the result by key ● Write a program that answers the following: which character do lines start with most often in all-shakespeare.txt
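One possible shape of a solution - only a sketch, assuming that the first character of each non-empty line is what counts, not the official exercise answer:

```scala
val allShakespeare = sc.textFile("src/main/resources/all-shakespeare.txt")

val byFirstChar = allShakespeare
  .filter(_.trim.nonEmpty)
  .map(line => (line.charAt(0), 1))
  .reduceByKey(_ + _)

byFirstChar.sortByKey().collect().foreach(println)            // result sorted by key
println(byFirstChar.sortBy(_._2, ascending = false).first())  // most frequent starting character
```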
twitter: @rabbitonweb, email: paul.szulc@gmail.com Pipeline
twitter: @rabbitonweb, email: paul.szulc@gmail.com Pipeline map
twitter: @rabbitonweb, email: paul.szulc@gmail.com Pipeline map count
twitter: @rabbitonweb, email: paul.szulc@gmail.com Pipeline map count task
twitter: @rabbitonweb, email: paul.szulc@gmail.com But what if... val startings = allShakespeare .filter(_.trim != "") .groupBy(_.charAt(0)) .mapValues(_.size) .reduceByKey { case (acc, length) => acc + length }
twitter: @rabbitonweb, email: paul.szulc@gmail.com But what if... filter val startings = allShakespeare .filter(_.trim != "") .groupBy(_.charAt(0)) .mapValues(_.size) .reduceByKey { case (acc, length) => acc + length }
twitter: @rabbitonweb, email: paul.szulc@gmail.com And now what? filter val startings = allShakespeare .filter(_.trim != "") .groupBy(_.charAt(0)) .mapValues(_.size) .reduceByKey { case (acc, length) => acc + length }
twitter: @rabbitonweb, email: paul.szulc@gmail.com And now what? filter mapValues val startings = allShakespeare .filter(_.trim != "") .groupBy(_.charAt(0)) .mapValues(_.size) .reduceByKey { case (acc, length) => acc + length }
twitter: @rabbitonweb, email: paul.szulc@gmail.com And now what? filter val startings = allShakespeare .filter(_.trim != "") .groupBy(_.charAt(0)) .mapValues(_.size) .reduceByKey { case (acc, length) => acc + length }
twitter: @rabbitonweb, email: paul.szulc@gmail.com Shuffling filter groupBy val startings = allShakespeare .filter(_.trim != "") .groupBy(_.charAt(0)) .mapValues(_.size) .reduceByKey { case (acc, length) => acc + length }
twitter: @rabbitonweb, email: paul.szulc@gmail.com Shuffling filter groupBy mapValues val startings = allShakespeare .filter(_.trim != "") .groupBy(_.charAt(0)) .mapValues(_.size) .reduceByKey { case (acc, length) => acc + length }
twitter: @rabbitonweb, email: paul.szulc@gmail.com Shuffling filter groupBy mapValues reduceByKey val startings = allShakespeare .filter(_.trim != "") .groupBy(_.charAt(0)) .mapValues(_.size) .reduceByKey { case (acc, length) => acc + length }
twitter: @rabbitonweb, email: paul.szulc@gmail.com Shuffling filter groupBy mapValues reduceByKey
twitter: @rabbitonweb, email: paul.szulc@gmail.com Shuffling filter groupBy mapValues reduceByKey task
twitter: @rabbitonweb, email: paul.szulc@gmail.com Shuffling filter groupBy mapValues reduceByKey task Wait for calculations on all partitions before moving on
twitter: @rabbitonweb, email: paul.szulc@gmail.com Shuffling filter groupBy mapValues reduceByKey task
twitter: @rabbitonweb, email: paul.szulc@gmail.com Shuffling filter groupBy mapValues reduceByKey task Data flying around the cluster
twitter: @rabbitonweb, email: paul.szulc@gmail.com Shuffling filter groupBy mapValues reduceByKey task
twitter: @rabbitonweb, email: paul.szulc@gmail.com Shuffling filter groupBy mapValues reduceByKey task task
twitter: @rabbitonweb, email: paul.szulc@gmail.com Shuffling filter groupBy mapValues reduceByKey
twitter: @rabbitonweb, email: paul.szulc@gmail.com Stage filter groupBy mapValues reduceByKey stage1
twitter: @rabbitonweb, email: paul.szulc@gmail.com Stage filter groupBy mapValues reduceByKey stage1 stage2
twitter: @rabbitonweb, email: paul.szulc@gmail.com Directed Acyclic Graph
twitter: @rabbitonweb, email: paul.szulc@gmail.com Exercise 4 - DAG that ● Open sw.ex3.Ex4.scala ● You will find three programs: ○ StagesStagesA ○ StagesStagesB ○ StagesStagesC ● Can you tell what the DAG will look like for all three?
twitter: @rabbitonweb, email: paul.szulc@gmail.com Directed Acyclic Graph val startings = allShakespeare .filter(_.trim != "") .map(line => (line.charAt(0), line)) .mapValues(_.size) .reduceByKey { case (acc, length) => acc + length } println(startings.toDebugString)
twitter: @rabbitonweb, email: paul.szulc@gmail.com Directed Acyclic Graph val startings = allShakespeare .filter(_.trim != "") .map(line => (line.charAt(0), line)) .mapValues(_.size) .reduceByKey { case (acc, length) => acc + length } println(startings.toDebugString) (2) ShuffledRDD[5] at reduceByKey at Ex3.scala:18 [] +-(2) MapPartitionsRDD[4] at mapValues at Ex3.scala:17 [] | MapPartitionsRDD[3] at map at Ex3.scala:16 [] | MapPartitionsRDD[2] at filter at Ex3.scala:15 [] | src/main/resources/all-shakespeare.txt MapPartitionsRDD[1] | src/main/resources/all-shakespeare.txt HadoopRDD[0] at textFile
twitter: @rabbitonweb, email: paul.szulc@gmail.com Directed Acyclic Graph val startings = allShakespeare .filter(_.trim != "") .groupBy(_.charAt(0)) .mapValues(_.size) .reduceByKey { case (acc, length) => acc + length } println(startings.toDebugString) (2) MapPartitionsRDD[6] at reduceByKey at Ex3.scala:42 | MapPartitionsRDD[5] at mapValues at Ex3.scala:41 | ShuffledRDD[4] at groupBy at Ex3.scala:40 +-(2) MapPartitionsRDD[3] at groupBy at Ex3.scala:40 | MapPartitionsRDD[2] at filter at Ex3.scala:39 | src/main/resources/all-shakespeare.txt MapPartitionsRDD[1] | src/main/resources/all-shakespeare.txt HadoopRDD[0]
twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - the definition RDD stands for resilient distributed dataset Resilient - if data is lost, it can be recreated Distributed - stored on nodes across the cluster Dataset - the initial data comes from a file or can be created programmatically
twitter: @rabbitonweb, email: paul.szulc@gmail.com What about Resilience? RDD stands for resilient distributed dataset Resilient - if data is lost, it can be recreated Distributed - stored on nodes across the cluster Dataset - the initial data comes from a file or can be created programmatically
twitter: @rabbitonweb, email: paul.szulc@gmail.com Resilience Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Resilience Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1T2T3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Resilience Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2T3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Resilience Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Resilience Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1T2T3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Resilience Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2T3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Resilience Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Resilience Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 (DEAD) T1 T2
twitter: @rabbitonweb, email: paul.szulc@gmail.com Resilience Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 T1 T2
twitter: @rabbitonweb, email: paul.szulc@gmail.com Resilience Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs. txt”) println(logs.count()) Executor 1 Executor 2 T1 T2 T3
twitter: @rabbitonweb, email: paul.szulc@gmail.com Exercise 5 - The Big Data problem ● Write a Word Count program using Spark ● Use all-shakespeare.txt as input To be or not to be Let it be Be me It must be Let it be be 6 to 2 let 2 or 1 not 1 must 1 me 1
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines Scala solution
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) Scala solution
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split("""\W+""")).toSeq Scala solution
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split("""\W+""")).toSeq .groupBy(identity) Scala solution
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split("""\W+""")).toSeq .groupBy(identity) .map { case (word, group) => (word, group.size) } Scala solution
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split("""\W+""")).toSeq .groupBy(identity) .map { case (word, group) => (word, group.size) } val wc = new SparkContext(“local”, “Word Count”).textFile(args(0)) Scala solution Spark solution (in Scala language)
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split("""\W+""")).toSeq .groupBy(identity) .map { case (word, group) => (word, group.size) } val wc = new SparkContext(“local”, “Word Count”).textFile(args(0)) .map(line => line.toLowerCase) Scala solution Spark solution (in Scala language)
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split("""\W+""")).toSeq .groupBy(identity) .map { case (word, group) => (word, group.size) } val wc = new SparkContext(“local”, “Word Count”).textFile(args(0)) .map(line => line.toLowerCase) .flatMap(line => line.split("""\W+""")) Scala solution Spark solution (in Scala language)
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split("""\W+""")).toSeq .groupBy(identity) .map { case (word, group) => (word, group.size) } val wc = new SparkContext(“local”, “Word Count”).textFile(args(0)) .map(line => line.toLowerCase) .flatMap(line => line.split("""\W+""")) .groupBy(identity) Scala solution Spark solution (in Scala language)
twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split("""\W+""")).toSeq .groupBy(identity) .map { case (word, group) => (word, group.size) } val wc = new SparkContext(“local”, “Word Count”).textFile(args(0)) .map(line => line.toLowerCase) .flatMap(line => line.split("""\W+""")) .groupBy(identity) .map { case (word, group) => (word, group.size) } Scala solution Spark solution (in Scala language)
twitter: @rabbitonweb, email: paul.szulc@gmail.com But that solution has a major flaw
twitter: @rabbitonweb, email: paul.szulc@gmail.com But that solution has a major flaw ● Flaw: groupBy
twitter: @rabbitonweb, email: paul.szulc@gmail.com But that solution has a major flaw ● Flaw: groupBy ● But before we understand it, we have to: ○ instantiate a Standalone cluster ○ understand how the cluster works ○ talk about serialization (& uber jar!) ○ see the Spark UI ○ talk about Spark configuration
twitter: @rabbitonweb, email: paul.szulc@gmail.com But that solution has a major flaw ● Flaw: groupBy ● But before we understand it, we have to: ○ instantiate a Standalone cluster ○ understand how the cluster works ○ talk about serialization ○ see the Spark UI ○ talk about Spark configuration ● http://spark.apache.org/docs/latest/configuration.html
twitter: @rabbitonweb, email: paul.szulc@gmail.com But that solution has a major flaw ● What can we do about it? ● Something spooky: let’s see Spark code!
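For word count specifically, the usual way around the flaw is to avoid shipping every single (word, 1) pair across the network: reduceByKey combines counts locally on each partition before the shuffle. A sketch only, assuming the same input argument and local SparkContext as in the solution above:

```scala
val wc = new SparkContext("local", "Word Count").textFile(args(0))
  .map(_.toLowerCase)
  .flatMap(_.split("""\W+"""))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // partial sums per partition, then a much smaller shuffle
```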
twitter: @rabbitonweb, email: paul.szulc@gmail.com Mid-term exam ● Given all-shakespeare.txt ● Given name popularity in male-names.txt & female-names.txt ● Show how popular a given name is nowadays & how many times it occurs in Shakespeare ● Result: key-value pair (key: name, value: pair) E.g. Romeo is mentioned 340 times in Shakespeare and is nowadays the 688th most popular name, so the result will be: (romeo,(688,340))
What is a RDD?
What is a RDD? Resilient Distributed Dataset
... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski What is a RDD?
node 1 ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski node 2 node 3 What is a RDD?
node 1 ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski node 2 node 3 ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski What is a RDD?
What is a RDD?
What is a RDD? RDD needs to hold 3 chunks of information in order to do its work:
What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent
What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned
What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
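Those three pieces of information map almost one-to-one onto the core of the RDD class in Spark's sources - a simplified paraphrase of its shape, not the real class:

```scala
import org.apache.spark.{Dependency, Partition, TaskContext}

abstract class SimplifiedRDD[T] {
  // 1. pointer to its parent(s)
  protected def getDependencies: Seq[Dependency[_]]
  // 2. how its internal data is partitioned
  protected def getPartitions: Array[Partition]
  // 3. how to evaluate its internal data, one partition at a time
  def compute(split: Partition, context: TaskContext): Iterator[T]
}
```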
What is a partition? A partition represents a subset of the data within your distributed collection.
What is a partition? A partition represents a subset of the data within your distributed collection. override def getPartitions: Array[Partition] = ???
What is a partition? A partition represents a subset of the data within your distributed collection. override def getPartitions: Array[Partition] = ??? How this subset is defined depends on the type of the RDD
example: HadoopRDD val journal = sc.textFile(“hdfs://journal/*”)
example: HadoopRDD val journal = sc.textFile(“hdfs://journal/*”) How is HadoopRDD partitioned?
example: HadoopRDD val journal = sc.textFile(“hdfs://journal/*”) How is HadoopRDD partitioned? In HadoopRDD a partition corresponds exactly to a file chunk (block) in HDFS
example: HadoopRDD 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
example: HadoopRDD class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging { ... override def getPartitions: Array[Partition] = { val jobConf = getJobConf() SparkHadoopUtil.get.addCredentials(jobConf) val inputFormat = getInputFormat(jobConf) if (inputFormat.isInstanceOf[Configurable]) { inputFormat.asInstanceOf[Configurable].setConf(jobConf) } val inputSplits = inputFormat.getSplits(jobConf, minPartitions) val array = new Array[Partition](inputSplits.size) for (i <- 0 until inputSplits.size) { array(i) = new HadoopPartition(id, i, inputSplits(i)) } array }
example: MapPartitionsRDD val journal = sc.textFile(“hdfs://journal/*”) val fromMarch = journal.filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
example: MapPartitionsRDD val journal = sc.textFile(“hdfs://journal/*”) val fromMarch = journal.filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } How is MapPartitionsRDD partitioned?
example: MapPartitionsRDD val journal = sc.textFile(“hdfs://journal/*”) val fromMarch = journal.filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } How is MapPartitionsRDD partitioned? MapPartitionsRDD inherits partition information from its parent RDD
example: MapPartitionsRDD class MapPartitionsRDD[U: ClassTag, T: ClassTag](...) extends RDD[U](prev) { ... override def getPartitions: Array[Partition] = firstParent[T].partitions
What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
RDD parent sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300) .foreach(println)
RDD parent sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph HadoopRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph HadoopRDD ShuffledRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Two types of parent dependencies: 1. narrow dependency 2. wide dependency
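The kind of dependency an RDD has on its parent can be inspected at runtime - a small sketch, assuming an existing SparkContext sc:

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency}

val nums    = sc.parallelize(1 to 100, 4)
val doubled = nums.map(_ * 2)          // narrow: each partition depends on exactly one parent partition
val grouped = doubled.groupBy(_ % 10)  // wide: a partition may need data from all parent partitions

println(doubled.dependencies.head.isInstanceOf[NarrowDependency[_]])         // true
println(grouped.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])  // true
```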
Directed acyclic graph HadoopRDD ShuffeledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
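The dependency type of each RDD in the chain can be inspected directly. A small sketch, assuming a SparkContext `sc` is available (the key-extraction lambda is just an illustration):

// Narrow dependencies (e.g. OneToOneDependency) can be pipelined inside a single stage;
// a ShuffleDependency (wide) forces a stage boundary and a shuffle.
val mapped  = sc.textFile("hdfs://journal/*").map(line => (line.take(10), line))
val grouped = mapped.groupByKey()

println(mapped.dependencies.head)    // OneToOneDependency -> narrow
println(grouped.dependencies.head)   // ShuffleDependency  -> wide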
Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Tasks
Stage 1 Stage 2 Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
What is an RDD? An RDD needs to hold 3 chunks of information in order to do its work: 1. a pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { }
Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { } .collect()
Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { } .collect() action
Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { } .collect() action Actions are implemented using sc.runJob method
Running Job aka materializing DAG

/**
 * Run a function on a given set of partitions in an RDD and return the results as an array.
 */
def runJob[T, U](
    rdd: RDD[T],
    partitions: Seq[Int],
    func: Iterator[T] => U,
  ): Array[U]
Running Job aka materializing DAG

/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
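Any custom action can be built the same way: ship a function over each partition's iterator with runJob and combine the per-partition results on the driver. A hedged sketch (the helper name maxLineLength is made up for illustration):

// Returns the length of the longest line in the RDD: one partial result per partition,
// combined on the driver.
def maxLineLength(rdd: org.apache.spark.rdd.RDD[String]): Int = {
  val perPartition: Array[Int] =
    rdd.sparkContext.runJob(rdd, (it: Iterator[String]) => it.foldLeft(0)((m, line) => m max line.length))
  if (perPartition.isEmpty) 0 else perPartition.max
}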
Multiple jobs for single action

/**
 * Take the first num elements of the RDD. It works by first scanning one partition, and use the
 * results from that partition to estimate the number of additional partitions needed to satisfy
 * the limit.
 */
def take(num: Int): Array[T] = {
  (….)
  val left = num - buf.size
  val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
  (….)
  res.foreach(buf ++= _.take(num - buf.size))
  partsScanned += numPartsToTry
  (….)
  buf.toArray
}
Let's test what we've learned
Towards efficiency

val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) =>
    LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }

scala> events.toDebugString
(4) MapPartitionsRDD[22] at filter at <console>:50 []
 |  MapPartitionsRDD[21] at map at <console>:49 []
 |  ShuffledRDD[20] at groupBy at <console>:48 []
 +-(6) HadoopRDD[17] at textFile at <console>:47 []

events.count
Stage 1

val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) =>
    LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }

[diagram: Stage 1 tasks running in parallel on node 1, node 2 and node 3]
Everyday I’m Shuffling
Everyday I'm Shuffling

val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) =>
    LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }

[diagram: groupBy shuffles intermediate data between node 1, node 2 and node 3]
Stage 2

val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) =>
    LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }

[diagram: Stage 2 tasks (map and filter over the shuffled groups) running on node 1, node 2 and node 3]
Let's refactor

val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) =>
    LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
Let's refactor

val events = sc.textFile("hdfs://journal/*")
  .filter { e =>
    LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
Let's refactor

val events = sc.textFile("hdfs://journal/*")
  .filter { e =>
    LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }

[diagram: filtering happens before the shuffle, so less data moves between node 1, node 2 and node 3]
Let's refactor

val events = sc.textFile("hdfs://journal/*")
  .filter { e =>
    LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .groupBy(extractDate _, 6)
  .map { case (date, events) => (date, events.size) }

[diagram: groupBy with an explicit number of partitions (6) spread across node 1, node 2 and node 3]
Let's refactor

val events = sc.textFile("hdfs://journal/*")
  .filter { e =>
    LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .map( e => (extractDate(e), e))
  .combineByKey(
    e => 1,
    (counter: Int, e: String) => counter + 1,
    (c1: Int, c2: Int) => c1 + c2)
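To make the three arguments of combineByKey concrete, here is a small self-contained sketch, assuming only a running SparkContext `sc` (e.g. from spark-shell); the in-memory sample stands in for the journal lines:

// Counting events per date with combineByKey -- no grouped collections are ever materialised.
val sample = sc.parallelize(Seq(
  ("2015-03-02", "UserInit"),
  ("2015-03-02", "UserLogin"),
  ("2015-03-05", "FirstName")))

val countsPerDate = sample.combineByKey(
  (event: String) => 1,                        // createCombiner: first event of a date in a partition
  (count: Int, event: String) => count + 1,    // mergeValue: next event of that date, same partition
  (c1: Int, c2: Int) => c1 + c2)               // mergeCombiners: merge counts across partitions

countsPerDate.collect().foreach(println)       // e.g. (2015-03-02,2) and (2015-03-05,1)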
A bit more about partitions

val events = sc.textFile("hdfs://journal/*") // here small number of partitions, let's say 4
  .repartition(256) // note, this will cause a shuffle
  .map( e => (extractDate(e), e))
A bit more about partitions

val events = sc.textFile("hdfs://journal/*") // here a lot of partitions, let's say 1024
  .filter { e =>
    LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .coalesce(64) // this will NOT shuffle
  .map( e => (extractDate(e), e))
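The effect is easy to verify from the shell. A minimal sketch, assuming a SparkContext `sc`; the partition counts in the comments are illustrative:

val raw = sc.textFile("hdfs://journal/*")
println(raw.partitions.length)            // whatever the input splits gave us

val wider    = raw.repartition(256)       // full shuffle: data redistributed into 256 partitions
val narrower = raw.coalesce(64)           // no shuffle: existing partitions merged locally

println(wider.partitions.length)          // 256
println(narrower.partitions.length)       // 64 (assuming raw had at least 64 partitions)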
What is an RDD? An RDD needs to hold 3 chunks of information in order to do its work: 1. a pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
What is an RDD? An RDD needs to hold 3 + 2 chunks of information in order to do its work: 1. a pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data 4. data locality 5. partitioner
Data Locality: HDFS example [diagram: journal log lines stored as HDFS blocks spread across node 1, node 2 and node 3]
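Spark exposes this locality information through preferredLocations: for a HadoopRDD these are the hosts that physically hold each HDFS block, and the scheduler tries to run the corresponding task there. A quick sketch, assuming `sc` is available:

val journal = sc.textFile("hdfs://journal/*")
journal.partitions.foreach { p =>
  // for each partition, print the hosts that already store its data
  println(s"partition ${p.index} -> ${journal.preferredLocations(p).mkString(", ")}")
}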
What is an RDD? An RDD needs to hold 3 + 2 chunks of information in order to do its work: 1. a pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data 4. data locality 5. partitioner
Spark performance - shuffle optimization map groupBy
Spark performance - shuffle optimization map groupBy join
Spark performance - shuffle optimization map groupBy join Optimization: shuffle avoided if data is already partitioned
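In practice this means pre-partitioning (and caching) a data set with an explicit partitioner so that later joins and aggregations can reuse that partitioning. A hedged sketch, assuming `sc`, a hypothetical second data set under hdfs://labels/*, and that the first 10 characters of a line are its date key:

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(64)

val eventsByDate = sc.textFile("hdfs://journal/*")
  .map(line => (line.take(10), line))
  .partitionBy(partitioner)     // one shuffle now...
  .cache()

val labelsByDate = sc.textFile("hdfs://labels/*")
  .map(line => (line.take(10), line))
  .partitionBy(partitioner)

// ...and none for eventsByDate here: both sides are already hash-partitioned the same way,
// so the join can reuse the existing partitioning instead of shuffling again.
val joined = eventsByDate.join(labelsByDate)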
twitter: @rabbitonweb, email: paul.szulc@gmail.com Example 6 - Using partitioner ● sw/ex6/Ex6.scala
Spark performance - shuffle optimization map groupBy map
Spark performance - shuffle optimization map groupBy map join
Spark performance - shuffle optimization map groupBy mapValues
Spark performance - shuffle optimization map groupBy mapValues join
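The difference between map and mapValues can be checked directly on the partitioner field: map may change the key, so Spark drops the partitioner; mapValues cannot, so it is preserved and a following join or groupBy can skip the shuffle. A small sketch, assuming `sc` and an illustrative date-prefix key:

import org.apache.spark.HashPartitioner

val byDate = sc.textFile("hdfs://journal/*")
  .map(line => (line.take(10), line))
  .partitionBy(new HashPartitioner(64))

println(byDate.partitioner)                                      // Some(HashPartitioner@...)
println(byDate.map { case (k, v) => (k, v.length) }.partitioner) // None -- partitioner lost
println(byDate.mapValues(_.length).partitioner)                  // Some(HashPartitioner@...) -- preserved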
twitter: @rabbitonweb, email: paul.szulc@gmail.com Example 6 - Using partitioner ● sw/ex6/Ex6.scala ● How can I know which transformations preserve partitioner?
twitter: @rabbitonweb, email: paul.szulc@gmail.com Exercise 6 - Can I be better? ● Open sw/ex6/Ex6.scala ● Program ‘Join’ is not performing well ○ Can you tell why? ○ What should be done to fix it?
twitter: @rabbitonweb, email: paul.szulc@gmail.com Caching map groupBy
twitter: @rabbitonweb, email: paul.szulc@gmail.com Caching map groupBy filter
twitter: @rabbitonweb, email: paul.szulc@gmail.com Caching ● .persist() & .cache() methods map groupBy persist
twitter: @rabbitonweb, email: paul.szulc@gmail.com Caching ● .persist() & .cache() methods map groupBy persist filter
twitter: @rabbitonweb, email: paul.szulc@gmail.com Caching - where is it cached? ● How cache is stored depends on storage level
twitter: @rabbitonweb, email: paul.szulc@gmail.com Caching - where is it cached? ● How cache is stored depends on storage level ● Levels: ○ MEMORY_ONLY ○ MEMORY_AND_DISK ○ MEMORY_ONLY_SER ○ MEMORY_AND_DISK_SER ○ DISK_ONLY ○ OFF_HEAP
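cache() is simply persist() with the default MEMORY_ONLY level; any other level is chosen explicitly through persist(StorageLevel...). A hedged sketch of caching the grouped journal with serialised, spill-to-disk storage, assuming `sc` and the extractDate helper used in these slides:

import org.apache.spark.storage.StorageLevel

val grouped = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .persist(StorageLevel.MEMORY_AND_DISK_SER)   // keep serialised blocks in memory, spill to disk if needed

grouped.count()   // first action materialises and caches the groups
grouped.count()   // later actions reuse the cache instead of re-reading HDFS and re-shuffling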
twitter: @rabbitonweb, email: paul.szulc@gmail.com Caching ● .persist() & .cache() methods ● caching is fault-tolerant! map groupBy persist filter
twitter: @rabbitonweb, email: paul.szulc@gmail.com Caching - when it can be useful? http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - caching
twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - vs Hadoop (3) “(...) we decided to participate in the Sort Benchmark (...), an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by (...) Hadoop (...) cluster of 2100 nodes. Using Spark on 206 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All (...) without using Spark’s in-memory cache.”
twitter: @rabbitonweb, email: paul.szulc@gmail.com Example 7 - Save me maybe ● /sw/ex7/Ex7.scala
twitter: @rabbitonweb, email: paul.szulc@gmail.com Checkpointing ● .checkpoint() map groupBy filter
twitter: @rabbitonweb, email: paul.szulc@gmail.com Checkpointing ● .checkpoint() checkpoint filter
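Checkpointing writes an RDD's data to reliable storage and truncates its lineage, so recovery no longer has to replay the whole DAG from the source. A minimal sketch, assuming `sc`, the extractDate helper and a writable HDFS directory (the path is made up):

sc.setCheckpointDir("hdfs://checkpoints/spark-workshop")   // must be a reliable, writable location

val grouped = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .cache()             // usually cached as well, so checkpointing does not recompute the groups

grouped.checkpoint()   // only marks the RDD; the data is written on the next action
grouped.count()        // materialises both the cache and the checkpoint files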
twitter: @rabbitonweb, email: paul.szulc@gmail.com And that is all folks! Pawel Szulc Email: paul.szulc@gmail.com Twitter: @rabbitonweb Blog: http://rabbitonweb.com

Apache Spark workshop