Apache Spark Fundamentals and MLlib
Jens Fisseler
Introducing Apache Spark
Apache Spark is a fast and general engine for large-scale data processing:
• Cluster computing
• In-memory execution model
• Rich set of transformations
Spark Components
[Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX sit on top of Spark Core, which runs on the Standalone Scheduler, YARN, or Mesos]
Resilient Distributed Datasets
• Immutable distributed collection of objects
• Stored in memory or on disk
• Automatically rebuilt in case of failure
• Diverse set of lazy parallel transformations
Demo #1: Basic RDD operations
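A minimal sketch of the kind of code the demo covers, assuming a local Spark installation; the application name and the input path data/lines.txt are placeholders, not part of the original demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup: a local SparkContext and a made-up input file.
val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
val sc   = new SparkContext(conf)

val lines     = sc.textFile("data/lines.txt")      // RDD[String], built lazily
val words     = lines.flatMap(_.split("\\s+"))     // transformation: still lazy
val longWords = words.filter(_.length > 5)         // transformation: still lazy

println(longWords.count())                         // action: triggers execution
```

The later sketches reuse this `sc` and `spark` session (or the ones provided by the spark-shell).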
RDD Transformations
• map[U](f: (T) => U): RDD[U]
• flatMap[U](f: (T) => Seq[U]): RDD[U]
• filter(f: (T) => Boolean): RDD[T]
• union(other: RDD[T]): RDD[T]
• intersection(other: RDD[T]): RDD[T]
• distinct(): RDD[T]
• groupBy[K](f: (T) => K): RDD[(K, Iterable[T])]
• keyBy[K](f: (T) => K): RDD[(K, T)]
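A short, invented illustration of how a few of these transformations compose; each call only describes a new RDD and computes nothing until an action runs:

```scala
// Illustrative data; nothing is computed until an action is called.
val nums = sc.parallelize(1 to 10)                 // RDD[Int]

val squares     = nums.map(n => n * n)             // map
val evens       = squares.filter(_ % 2 == 0)       // filter
val byLastDigit = squares.groupBy(_ % 10)          // RDD[(Int, Iterable[Int])]
val keyed       = nums.keyBy(_ % 3)                // RDD[(Int, Int)]
val combined    = evens.union(nums).distinct()     // set-like operations
```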
Key-Value RDDs
• mapValues[U](f: (V) => U): RDD[(K, U)]
• flatMapValues[U](f: (V) => Seq[U]): RDD[(K, U)]
• reduceByKey(f: (V, V) => V): RDD[(K, V)]
• foldByKey(zeroValue: V)(f: (V, V) => V): RDD[(K, V)]
• combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
• groupByKey(): RDD[(K, Iterable[V])]
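A word-count-flavoured sketch (data invented here) of the most common key-value operations; reduceByKey and combineByKey aggregate locally before shuffling, whereas groupByKey ships all values per key:

```scala
// Invented (key, value) pairs for illustration.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1)))

val counts  = pairs.reduceByKey(_ + _)             // ("a",2), ("b",1), ("c",1)
val doubled = counts.mapValues(_ * 2)
val grouped = pairs.groupByKey()                   // RDD[(String, Iterable[Int])]

// combineByKey generalizes the others: per-key (sum, count) in one pass.
val sumCount = pairs.combineByKey(
  (v: Int)                       => (v, 1),
  (acc: (Int, Int), v: Int)      => (acc._1 + v, acc._2 + 1),
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)
)
```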
Actions on RDDs
• count(): Long
• first(): T
• take(num: Int): Array[T]
• takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
• collect(): Array[T]
• reduce(f: (T, T) => T): T
• aggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
• saveAsObjectFile(path: String): Unit
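Unlike transformations, every call below returns a plain Scala value (or writes output) and therefore triggers actual computation; the numbers are again invented:

```scala
val nums = sc.parallelize(1 to 100)

val n     = nums.count()                                  // 100
val top3  = nums.takeOrdered(3)(Ordering[Int].reverse)    // Array(100, 99, 98)
val total = nums.reduce(_ + _)                            // 5050

// aggregate: sum and count in a single pass, starting from a zero value.
val (sum, cnt) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),
  (a, b)   => (a._1 + b._1, a._2 + b._2)
)
println(sum.toDouble / cnt)                               // mean = 50.5
```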
Representing RDDs /1
RDD is a (low-level) interface:
1. Set of partitions
2. List of dependencies on parent RDDs
3. Function for computing a partition given its parent(s)
4. (Optional) partitioner
5. (Optional) preferred location(s)
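Most of these pieces can be inspected directly on any RDD; a small sketch with local data and an arbitrarily chosen partition count (the compute function, item 3, is part of the internal RDD API and not called directly):

```scala
val rdd = sc.parallelize(1 to 100, numSlices = 4)

println(rdd.partitions.length)                      // 1. set of partitions (4 here)
println(rdd.map(_ + 1).dependencies)                // 2. dependencies on the parent RDD
println(rdd.partitioner)                            // 4. optional partitioner (None here)
println(rdd.preferredLocations(rdd.partitions(0)))  // 5. preferred locations (empty locally)
```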
Representing RDDs /2
[Diagram: lineage graph of RDDs A–G split into Stage 1, Stage 2, and Stage 3, connected by groupBy, map, union, and join transformations]
Clustering
[Diagram: the Driver Program with its SparkContext connects via the Cluster Manager to Worker Nodes; each Worker Node runs an Executor holding a Cache and executing Tasks]
Structured RDDs
• (Semi-)structured tabular data
• Table of named columns and rows
• Structure facilitates optimization
Demo #2: Datasets and DataFrames
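A sketch of what the demo might look like; the Person case class and the sample rows are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-demo").master("local[*]").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)

val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDS()  // Dataset[Person]
val df     = people.toDF()                                       // DataFrame = Dataset[Row]

df.printSchema()
df.where($"age" > 40).show()
```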
DataFrame and Dataset Transformations
• select(cols: Column*): DataFrame
• join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
• where(condition: Column): Dataset[T]
• groupBy(cols: Column*): RelationalGroupedDataset
• cube(cols: Column*): RelationalGroupedDataset
• rollup(cols: Column*): RelationalGroupedDataset
• drop(col: Column): DataFrame
• withColumn(colName: String, col: Column): DataFrame
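Continuing the hypothetical Person DataFrame from the previous sketch, a few of these transformations chained together:

```scala
import org.apache.spark.sql.functions._

val stats = df
  .withColumn("decade", (col("age") / 10).cast("int") * 10)  // derived column
  .groupBy(col("decade"))
  .agg(count("*").as("people"), avg("age").as("mean_age"))   // aggregation per group

stats.show()
```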
Column functions
• Aggregate: average, count, first, last, sum, variance, standard deviation
• Arrays: element check, size, sorting
• Datetime: current date-time, date field extraction, arithmetic
• Math and logic: arithmetic, boolean, trigonometric, exponentiation, logarithm, n-th root, rounding
• Text: formatting, conversion, prefix/suffix tests, regular expressions
• User-defined functions: anything
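A few of these function families in use on the same hypothetical DataFrame, plus a user-defined function (the UDF body is made up):

```scala
import org.apache.spark.sql.functions._

val shout = udf((s: String) => s.toUpperCase + "!")   // user-defined function

df.select(
  upper(col("name")),            // text
  round(col("age") / 7.0, 1),    // math
  current_date(),                // datetime
  shout(col("name"))             // UDF applied to a column
).show()
```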
Actions on Datasets
• count(): Long
• first(): T
• take(n: Int): Array[T]
• reduce(func: (T, T) => T): T
• collect(): Array[T]
• show(numRows: Int): Unit
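As with RDDs, these are the calls that actually execute the query plan; applied here to the DataFrame from the earlier sketches:

```scala
println(df.count())          // Long
println(df.first())          // first Row
df.take(2).foreach(println)  // Array of at most two rows
df.show(10)                  // prints up to 10 rows as a formatted table
```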
Machine Learning with MLlib
• DataFrame-based API (since Spark 2.0)
• Algorithms for classification, regression, clustering, collaborative filtering
• Feature extraction, transformation, dimensionality reduction, and selection
• Pipelines for machine learning workflows
Demo #3: Linear regression with MLlib
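A compact sketch of an MLlib linear-regression pipeline; the CSV path and the column names (x1, x2, label) are placeholders, not taken from the original demo:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Hypothetical input: a CSV with feature columns x1, x2 and a numeric label.
val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/points.csv")

// Assemble raw columns into the single vector column MLlib estimators expect.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")

val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")

val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)
val model = new Pipeline().setStages(Array(assembler, lr)).fit(train)

model.transform(test).select("label", "prediction").show(5)
```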
Learning More
• http://spark.apache.org/
• Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. Learning Spark. O’Reilly Media, 2015.
• Petar Zečević, Marko Bonaći. Spark in Action. Manning Publications Co., 2017.
• MOOCs
