Intro to Apache Spark
Marius Soutier, Freelance Software Engineer, @mariussoutier
Clustered In-Memory Computation
Motivation

Problem: running computations on Big Data®

• Classical data architectures break down
• RDBMS can't handle large amounts of data well
• Most RDBMS can't handle multiple input formats
• Most NoSQL stores don't offer analytics
The 3 Vs of Big Data

• Volume: 100s of GB, TB, PB
• Variety: structured, unstructured, semi-structured
• Velocity: sensors, realtime "Fast Data"
Hadoop (1)

• The de-facto standard for running computations on large amounts of diverse data is Hadoop
• Hadoop consists of:
  • HDFS, a distributed, fault-tolerant file system
  • Map/Reduce, a model for parallelizable computations pioneered by Google
• Hadoop is typically run on a (large) cluster of non-virtualized commodity hardware
Hadoop (2)

• However, Map/Reduce jobs are batch jobs with high latency
• Not suitable for interactive queries, real-time analytics, or machine learning
• Pure Map/Reduce is hard to develop and maintain
Enter Spark

Spark is a framework for clustered in-memory data processing.
Apache Spark (1)

• Developed at UC Berkeley, released in 2010
• Apache top-level project since February 2014; current version is 1.2.1 / 1.3.0
• USP: uses cluster-wide available memory to speed up computations
• Very active community
Apache Spark (2)

• Written in Scala (& Akka), with APIs for Java and Python
• Programming model is a collection pipeline* instead of Map/Reduce (see the sketch below)
• Supports batch, streaming, and interactive processing, or all combined, using a unified API

* http://martinfowler.com/articles/collection-pipeline/
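To get a feel for the collection-pipeline style, here is a minimal sketch on an ordinary local Scala collection (no Spark involved); the data is made up, but Spark's RDD API mirrors exactly these combinators across a cluster:

// Word count in collection-pipeline style on a plain local List:
val lines = List("to be", "or not", "to be")

val counts: Map[String, Int] = lines
  .flatMap(_.toLowerCase.split("\\s+")) // tokenize every line into words
  .groupBy(identity)                    // bucket equal words together
  .mapValues(_.size)                    // count the size of each bucket
// counts == Map("to" -> 2, "be" -> 2, "or" -> 1, "not" -> 1)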
Spark Ecosystem

• Spark Core (on top of Tachyon)
• Spark SQL / Spark Hive
• BlinkDB: approximate SQL
• Spark Streaming
• MLlib: machine learning
• GraphX
• SparkR
(BlinkDB, GraphX, and SparkR were alpha components at the time)
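As a taste of one ecosystem component, here is a hedged Spark SQL sketch against the Spark 1.x API; the file path, table name, and query are placeholders, and `sc` is an existing SparkContext:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Infer a schema from a JSON file (one JSON object per line):
val people = sqlContext.jsonFile("/tmp/people.json")
people.registerTempTable("people")

// Query it with plain SQL:
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)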
Spark is a framework for clustered in-memory data processing.
Spark is a platform for data-driven products.
RDD

• Base abstraction: Resilient Distributed Dataset (RDD)
• Essentially a distributed collection of objects
• Can be cached in memory or on disk (see the sketch below)
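A minimal sketch of the caching options, assuming an existing SparkContext `sc`; the paths are placeholders:

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
val inMemory = sc.textFile("/tmp/events.log").cache()

// persist() allows other storage levels, e.g. spilling to disk
// when the dataset does not fit into cluster memory:
val spillable = sc.textFile("/tmp/archive.log").persist(StorageLevel.MEMORY_AND_DISK)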
RDD Word Count

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc = new SparkContext() // configuration is read from system properties

val input: RDD[String] = sc.textFile("/tmp/word.txt")

val words: RDD[(String, Long)] = input
  .flatMap(line => line.toLowerCase.split("\\s+")) // tokenize each line into words
  .map(word => word -> 1L)                         // pair each word with a count of 1
  .cache()                                         // keep the tokenized words in memory

val wordCountsRdd: RDD[(String, Long)] = words
  .reduceByKey(_ + _) // sum the counts per word
  .sortByKey()        // order results alphabetically

val wordCounts: Array[(String, Long)] = wordCountsRdd.collect()
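Note that collect() pulls the complete result onto the driver, so it is only appropriate for small results; for large outputs you would typically write partitions out in parallel instead, e.g.:

// Hypothetical output path; writes one part file per partition:
wordCountsRdd.saveAsTextFile("/tmp/word-counts")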
Cluster

[Diagram: Driver (SparkContext) → Master → Workers, each running an Executor with Tasks]

• A Spark app (driver) builds a DAG from RDD operations
• The DAG is split into tasks that are executed by the workers
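The driver typically wires itself to the master via a SparkConf; a minimal sketch, where the master URL and app name are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("word-count")
  .setMaster("spark://master:7077") // standalone master; "local[*]" for local testing

val sc = new SparkContext(conf)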
Example Architecture

[Diagram: Input → HDFS / Message Queue → Spark Streaming + Spark Batch Jobs → Spark SQL → Real-Time Dashboard, Interactive SQL, Analytics & Reports]
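To sketch the streaming half of such an architecture (hedged, Spark 1.x Streaming API): a socket source stands in here for the message queue, and the batch interval is arbitrary:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-example")
val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

// Placeholder source; in the architecture above this would be the message queue:
val events = ssc.socketTextStream("localhost", 9999)

events.count().print() // e.g. print the number of events per batch

ssc.start()
ssc.awaitTermination()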
Demo

Questions?
