Apache Spark Fundamentals and MLlib
Jens Fisseler
Introducing Apache Spark
Apache Spark is a fast and general engine for large-scale data processing:
• Cluster computing
• In-memory execution model
• Rich set of transformations
Spark Components
[Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX sit on top of Spark Core, which runs on the Standalone Scheduler, YARN, or Mesos]
Resilient Distributed Datasets
• Immutable distributed collection of objects
• Stored in memory or on disk
• Automatically rebuilt in case of failure
• Diverse set of lazy parallel transformations
Demo #1: Basic RDD operations
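A minimal sketch of the kind of code the demo covers, assuming a local Spark installation; the application name and the input path data/lines.txt are placeholders, not part of the original demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup: a local SparkContext and a made-up input file.
val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
val sc   = new SparkContext(conf)

val lines     = sc.textFile("data/lines.txt")      // RDD[String], built lazily
val words     = lines.flatMap(_.split("\\s+"))     // transformation: still lazy
val longWords = words.filter(_.length > 5)         // transformation: still lazy

println(longWords.count())                         // action: triggers execution
```

The later sketches reuse this `sc` and `spark` session (or the ones provided by the spark-shell).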
RDD Transformations
• map[U](f: (T) => U): RDD[U]
• flatMap[U](f: (T) => Seq[U]): RDD[U]
• filter(f: (T) => Boolean): RDD[T]
• union(other: RDD[T]): RDD[T]
• intersection(other: RDD[T]): RDD[T]
• distinct(): RDD[T]
• groupBy[K](f: (T) => K): RDD[(K, Iterable[T])]
• keyBy[K](f: (T) => K): RDD[(K, T)]
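A short, invented illustration of how a few of these transformations compose; each call only describes a new RDD and computes nothing until an action runs:

```scala
// Illustrative data; nothing is computed until an action is called.
val nums = sc.parallelize(1 to 10)                 // RDD[Int]

val squares     = nums.map(n => n * n)             // map
val evens       = squares.filter(_ % 2 == 0)       // filter
val byLastDigit = squares.groupBy(_ % 10)          // RDD[(Int, Iterable[Int])]
val keyed       = nums.keyBy(_ % 3)                // RDD[(Int, Int)]
val combined    = evens.union(nums).distinct()     // set-like operations
```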
Key-Value RDDs
• mapValues[U](f: (V) => U): RDD[(K, U)]
• flatMapValues[U](f: (V) => Seq[U]): RDD[(K, U)]
• reduceByKey(f: (V, V) => V): RDD[(K, V)]
• foldByKey(zeroValue: V)(f: (V, V) => V): RDD[(K, V)]
• combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
• groupByKey(): RDD[(K, Iterable[V])]
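A word-count-flavoured sketch (data invented here) of the most common key-value operations; reduceByKey and combineByKey aggregate locally before shuffling, whereas groupByKey ships all values per key:

```scala
// Invented (key, value) pairs for illustration.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1)))

val counts  = pairs.reduceByKey(_ + _)             // ("a",2), ("b",1), ("c",1)
val doubled = counts.mapValues(_ * 2)
val grouped = pairs.groupByKey()                   // RDD[(String, Iterable[Int])]

// combineByKey generalizes the others: per-key (sum, count) in one pass.
val sumCount = pairs.combineByKey(
  (v: Int)                       => (v, 1),
  (acc: (Int, Int), v: Int)      => (acc._1 + v, acc._2 + 1),
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)
)
```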
Actions on RDDs
• count(): Long
• first(): T
• take(num: Int): Array[T]
• takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
• collect(): Array[T]
• reduce(f: (T, T) => T): T
• aggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
• saveAsObjectFile(path: String): Unit
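Unlike transformations, every call below returns a plain Scala value (or writes output) and therefore triggers actual computation; the numbers are again invented:

```scala
val nums = sc.parallelize(1 to 100)

val n     = nums.count()                                  // 100
val top3  = nums.takeOrdered(3)(Ordering[Int].reverse)    // Array(100, 99, 98)
val total = nums.reduce(_ + _)                            // 5050

// aggregate: sum and count in a single pass, starting from a zero value.
val (sum, cnt) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),
  (a, b)   => (a._1 + b._1, a._2 + b._2)
)
println(sum.toDouble / cnt)                               // mean = 50.5
```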
Representing RDDs /1
RDD is a (low-level) interface:
1. Set of partitions
2. List of dependencies on parent RDDs
3. Function for computing a partition given its parent(s)
4. (Optional) partitioner
5. (Optional) preferred location(s)
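Most of these pieces can be inspected directly on any RDD; a small sketch with local data and an arbitrarily chosen partition count (the compute function, item 3, is part of the internal RDD API and not called directly):

```scala
val rdd = sc.parallelize(1 to 100, numSlices = 4)

println(rdd.partitions.length)                      // 1. set of partitions (4 here)
println(rdd.map(_ + 1).dependencies)                // 2. dependencies on the parent RDD
println(rdd.partitioner)                            // 4. optional partitioner (None here)
println(rdd.preferredLocations(rdd.partitions(0)))  // 5. preferred locations (empty locally)
```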
Representing RDDs /2
[Diagram: lineage graph of RDDs A–G split into Stage 1, Stage 2, and Stage 3, connected by groupBy, map, union, and join transformations]
Clustering
[Diagram: the Driver Program with its SparkContext connects via the Cluster Manager to Worker Nodes; each Worker Node runs an Executor holding a Cache and executing Tasks]
Structured RDDs
• (Semi-)structured tabular data
• Table of named columns and rows
• Structure facilitates optimization
Demo #2: Datasets and DataFrames
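A sketch of what the demo might look like; the Person case class and the sample rows are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-demo").master("local[*]").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)

val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDS()  // Dataset[Person]
val df     = people.toDF()                                       // DataFrame = Dataset[Row]

df.printSchema()
df.where($"age" > 40).show()
```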
DataFrame and Dataset Transformations
• select(cols: Column*): DataFrame
• join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
• where(condition: Column): Dataset[T]
• groupBy(cols: Column*): RelationalGroupedDataset
• cube(cols: Column*): RelationalGroupedDataset
• rollup(cols: Column*): RelationalGroupedDataset
• drop(col: Column): DataFrame
• withColumn(colName: String, col: Column): DataFrame
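Continuing the hypothetical Person DataFrame from the previous sketch, a few of these transformations chained together:

```scala
import org.apache.spark.sql.functions._

val stats = df
  .withColumn("decade", (col("age") / 10).cast("int") * 10)  // derived column
  .groupBy(col("decade"))
  .agg(count("*").as("people"), avg("age").as("mean_age"))   // aggregation per group

stats.show()
```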
Column functions
• Aggregate: average, count, first, last, sum, variance, standard deviation
• Arrays: element check, size, sorting
• Datetime: current date-time, date field extraction, arithmetic
• Math and logic: arithmetic, boolean, trigonometric, exponentiation, logarithm, n-th root, rounding
• Text: formatting, conversion, prefix/suffix tests, regular expressions
• User-defined functions: anything
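A few of these function families in use on the same hypothetical DataFrame, plus a user-defined function (the UDF body is made up):

```scala
import org.apache.spark.sql.functions._

val shout = udf((s: String) => s.toUpperCase + "!")   // user-defined function

df.select(
  upper(col("name")),            // text
  round(col("age") / 7.0, 1),    // math
  current_date(),                // datetime
  shout(col("name"))             // UDF applied to a column
).show()
```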
Actions on Datasets
• count(): Long
• first(): T
• take(n: Int): Array[T]
• reduce(func: (T, T) => T): T
• collect(): Array[T]
• show(numRows: Int): Unit
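As with RDDs, these are the calls that actually execute the query plan; applied here to the DataFrame from the earlier sketches:

```scala
println(df.count())          // Long
println(df.first())          // first Row
df.take(2).foreach(println)  // Array of at most two rows
df.show(10)                  // prints up to 10 rows as a formatted table
```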
Machine Learning with MLlib
• DataFrame-based API (since Spark 2.0)
• Algorithms for classification, regression, clustering, collaborative filtering
• Feature extraction, transformation, dimensionality reduction, and selection
• Pipelines for machine learning workflows
Demo #3: Linear regression with MLlib
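A compact sketch of an MLlib linear-regression pipeline; the CSV path and the column names (x1, x2, label) are placeholders, not taken from the original demo:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Hypothetical input: a CSV with feature columns x1, x2 and a numeric label.
val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/points.csv")

// Assemble raw columns into the single vector column MLlib estimators expect.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")

val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")

val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)
val model = new Pipeline().setStages(Array(assembler, lr)).fit(train)

model.transform(test).select("label", "prediction").show(5)
```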
Learning More
• http://spark.apache.org/
• Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. Learning Spark. O’Reilly Media, 2015.
• Petar Zečević, Marko Bonaći. Spark in Action. Manning Publications Co., 2017.
• MOOCs
