Scrap Your MapReduce - Apache Spark
Lightning-fast cluster computing
Rahul Kavale (rahulkav@thoughtworks.com), Unmesh Joshi (uvjoshi@thoughtworks.com)
Some properties of “Big Data”
• Big data is inherently immutable: it is not supposed to be updated once generated.
• Write operations are mostly coarse-grained.
• Commodity hardware makes more sense for storing and processing data at this scale, so the data is distributed across a cluster of many such machines.
• That distributed nature makes the programming complicated.
Brush up on Hadoop concepts
• Distributed storage => HDFS
• Cluster manager => YARN
• Fault tolerance => achieved via replication
• Job scheduling => the scheduler in YARN
• Mapper, Reducer, Combiner
[HDFS architecture diagram: http://hadoop.apache.org/docs/r1.2.1/images/hdfsarchitecture.gif]
MapReduce Programming Model
[Image: https://twitter.com/francesc/status/507942534388011008]
[Image: http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop]
[Image: http://www.slideshare.net/JimArgeropoulos/hadoop-101-32661121]
MapReduce pain points
• Considerable latency
• Only Map and Reduce phases to work with
• Non-trivial to test
• Results in complex workflows
• Not suitable for iterative processing
Immutability and the MapReduce model
• HDFS storage is immutable or append-only.
• The MapReduce model fails to exploit the immutable nature of the data.
• Intermediate results are persisted to disk, resulting in a huge amount of IO and a serious performance hit.
Wouldn’t it be very nice if we could have
• Low latency
• A programmer-friendly programming model
• A unified ecosystem
• Fault tolerance and other typical distributed-system properties
• Easily testable code
• And, of course, open source :)
What is Apache Spark?
• A cluster computing engine
• Abstracts the storage and cluster management
• Unified interfaces to data
• APIs in Scala, Python, Java, R*
Where does it fit in the existing Big Data ecosystem?
http://www.kdnuggets.com/2014/06/yarn-all-rage-hadoop-summit.html
Why should you care about Apache Spark?
• Abstracts the underlying storage
• Abstracts cluster management
• Easy programming model
• Very easy to test the code
• Highly performant
• Petabyte sort record: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
• Offers in-memory caching of data
• Specialized applications:
  • GraphX for graph processing
  • Spark Streaming
  • MLlib for machine learning
  • Spark SQL
• Data exploration via the Spark shell
Programming model for Apache Spark
Word Count example

val file = spark.textFile("input path")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey((a, b) => a + b)
counts.saveAsTextFile("destination path")
Comparing the example with MapReduce (see the sketch below)
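The original slide's content is not preserved in this transcript. For contrast, a word count in classic Hadoop MapReduce needs a separate mapper and reducer class plus driver boilerplate; below is a minimal sketch of the equivalent written against the org.apache.hadoop.mapreduce API in Scala (class names are illustrative):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.collection.JavaConverters._

// Emits (word, 1) for every token in an input line.
class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").foreach { token =>
      word.set(token)
      context.write(word, one)
    }
}

// Sums the counts for each word; also usable as a combiner.
class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  private val result = new IntWritable()

  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    result.set(values.asScala.map(_.get).sum)
    context.write(key, result)
  }
}

Even before the driver code that wires these classes into a Job, the same logic that Spark expresses in five lines is spread across two classes and a shuffle boundary.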
Spark Shell Demo
• SparkContext
• RDD
• RDD operations
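The live demo itself is not captured in the transcript; a rough sketch of what such a spark-shell session might look like (the shell provides a SparkContext bound to sc):

// spark-shell provides a SparkContext as sc
val numbers = sc.parallelize(1 to 1000)      // create an RDD from a local collection
val evens   = numbers.filter(_ % 2 == 0)     // transformation: nothing runs yet
evens.count()                                // action: triggers the actual computation
evens.take(5)                                // action: fetch a few elements to the driver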
RDD
• RDD stands for Resilient Distributed Dataset.
• It is the basic abstraction in Spark.
• The equivalent of a distributed collection.
• The interface makes the distributed nature of the underlying data transparent.
• An RDD is immutable.
• It can be created by (see the sketch below):
  • parallelising a collection,
  • transforming an existing RDD by applying a transformation function,
  • reading from a persistent data store like HDFS.
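A minimal sketch of the three creation routes (the HDFS path is a placeholder):

val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))    // 1. parallelise a local collection
val fromTransform  = fromCollection.map(_ * 2)          // 2. transform an existing RDD
val fromStorage    = sc.textFile("hdfs:///some/input")  // 3. read from a persistent store like HDFS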
RDDs are lazily evaluated. An RDD has two types of operations:
• Transformations: build up a DAG of transformations to be applied to the RDD; nothing is evaluated yet.
• Actions: evaluate the DAG of transformations.
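The laziness is easy to see in code: transformations only record work to be done, and the input is not touched until an action runs. A small sketch:

val lines  = sc.textFile("hdfs:///some/input")     // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))     // still nothing: only the DAG grows
val firstN = errors.take(10)                       // action: now the file is actually scanned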
RDD operations

Transformations
map(f: T ⇒ U): RDD[T] ⇒ RDD[U]
filter(f: T ⇒ Bool): RDD[T] ⇒ RDD[T]
flatMap(f: T ⇒ Seq[U]): RDD[T] ⇒ RDD[U]
sample(fraction: Float): RDD[T] ⇒ RDD[T] (deterministic sampling)
union(): (RDD[T], RDD[T]) ⇒ RDD[T]
join(): (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
groupByKey(): RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
reduceByKey(f: (V, V) ⇒ V): RDD[(K, V)] ⇒ RDD[(K, V)]
partitionBy(p: Partitioner[K]): RDD[(K, V)] ⇒ RDD[(K, V)]
Actions
count(): RDD[T] ⇒ Long
collect(): RDD[T] ⇒ Seq[T]
reduce(f: (T, T) ⇒ T): RDD[T] ⇒ T
lookup(k: K): RDD[(K, V)] ⇒ Seq[V] (on hash/range-partitioned RDDs)
save(path: String): outputs the RDD to a storage system, e.g. HDFS
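A short sketch combining a few of the operations above on pair RDDs, as it would run in the spark-shell (the data is made up for illustration):

val sales  = sc.parallelize(Seq(("apples", 3), ("oranges", 2), ("apples", 5)))
val prices = sc.parallelize(Seq(("apples", 1.5), ("oranges", 0.8)))

val totals = sales.reduceByKey(_ + _)   // transformation: (apples, 8), (oranges, 2)
val joined = totals.join(prices)        // transformation: (word, (count, price))
joined.collect()                        // action: brings the results to the driver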
Job Execution
Spark execution in the context of YARN
http://kb.cnblogs.com/page/198414/
Fault tolerance via lineage
Example lineage, traced from the final RDD back to the source data (each RDD remembers the parent it was derived from):
MappedRDD → FilteredRDD → FlatMappedRDD → MappedRDD → HadoopRDD
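Because every RDD records how it was derived from its parents, a lost partition can be recomputed by replaying that lineage from the source data. The lineage of any RDD can be inspected with toDebugString (the exact RDD class names printed depend on the Spark version):

val counts = sc.textFile("hdfs:///some/input")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

println(counts.toDebugString)   // prints the chain of parent RDDs back to the source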
Testing
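The testing slide's code is not preserved in this transcript. A minimal sketch of one common approach, running a local-mode SparkContext inside an ordinary ScalaTest suite (the names and the ScalaTest style are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class WordCountSpec extends FunSuite {
  test("counts words") {
    // local[*] runs Spark inside the test JVM, no cluster required
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("test"))
    try {
      val counts = sc.parallelize(Seq("a b", "a"))
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)
        .collectAsMap()

      assert(counts("a") == 2)
      assert(counts("b") == 1)
    } finally {
      sc.stop()
    }
  }
}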
Why is Spark more performant than MapReduce?
Reduced IO
• No disk IO between phases, since the phases themselves are pipelined
• No network IO unless a shuffle is required
No mandatory shuffle
• Programs are not bounded by map and reduce phases
• No mandatory shuffle-and-sort step (see the sketch below)
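For example, a chain of narrow transformations is pipelined inside a single stage, and data is only shuffled where a wide operation such as reduceByKey actually requires it. A sketch (paths are placeholders):

val counts = sc.textFile("hdfs:///some/input")
  .flatMap(_.split(" "))    // narrow: pipelined, no intermediate disk or network IO
  .filter(_.nonEmpty)       // narrow: still the same stage
  .map((_, 1))              // narrow
  .reduceByKey(_ + _)       // wide: this is the only point where a shuffle happens

counts.saveAsTextFile("hdfs:///some/output")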
In-memory caching of data
• In-memory caching is optional (see the sketch below)
• The DAG engine can apply certain optimisations because, when an action is called, it already knows the full set of transformations that have to be applied
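Caching is opt-in: calling cache() (or persist()) marks an RDD to be kept in memory once an action has computed it, so later actions reuse it without re-reading the source. A sketch:

val lines = sc.textFile("hdfs:///some/input").cache()   // mark for in-memory caching

lines.count()                               // first action: reads from HDFS and populates the cache
lines.filter(_.contains("ERROR")).count()   // reuses the cached partitions, no re-read from HDFS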
Questions?
Thank You!
