Introduction to Spark
Eric Eijkelenboom - UserReport - userreport.com
• What is Spark and why should I care?
• Architecture and programming model
• Examples
• Mini demo
• Related projects
RTFM
• A general-purpose computation framework that leverages distributed memory
• More flexible than MapReduce (it supports general execution graphs)
• Linear scalability and fault tolerance
• It supports a rich set of higher-level tools, including:
  • Shark (Hive on Spark) and Spark SQL
  • MLlib for machine learning
  • GraphX for graph processing
  • Spark Streaming
Who cares?
Limitations of MapReduce
• Slow due to serialisation & replication
• Inefficient for iterative computing & interactive querying
[Diagram: Input → Map → Reduce → Output, where each iteration runs as a separate job that writes its result to HDFS and reads it back from HDFS in the next iteration]
Leveraging memory
[Diagram: with MapReduce, every iteration reads its input from HDFS and writes its output back to HDFS; Spark instead keeps the working set in memory across iterations]
• Not tied to the 2-stage MapReduce paradigm
1. Extract a working set
2. Cache it
3. Query it repeatedly
So, Spark is…
• In-memory analytics, many times faster than Hadoop/Hive
• Designed for running iterative algorithms & interactive querying
• Highly compatible with Hadoop's storage APIs
• Can run on your existing Hadoop cluster setup
• Programming in Scala, Python or Java
Spark stack
Architecture
[Diagram: a Spark Driver (master) coordinates work through a Cluster Manager; each Spark Worker runs alongside an HDFS Datanode, holding cached RDD partitions next to the HDFS blocks]
• Supported cluster managers:
  • YARN
  • Mesos
  • Standalone
Programming model
• Resilient Distributed Datasets (RDDs) are the basic building blocks
• Distributed collections of objects, cached in-memory across cluster nodes
• Automatically rebuilt on failure
• RDD operations:
  • Transformations: create new RDDs from existing ones
  • Actions: return a value to the master node after running a computation on the dataset
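For example (a minimal sketch in the Scala API, where sc is the SparkContext provided by the shell or your application):

    val nums = sc.parallelize(1 to 1000)      // distribute a local collection
    val squares = nums.map(n => n * n)        // transformation: lazy, returns a new RDD
    val evens = squares.filter(_ % 2 == 0)    // transformation: still nothing computed
    val total = evens.count()                 // action: runs the job, returns a value to the driver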
As you know…
• … Hadoop is a distributed system for counting words
• Here is how it's done in Spark (sketched below)
• Blue code: Spark operations
• Red code: functions (closures) that get passed to the cluster automatically
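The code image from the slide is not preserved here; the following is a minimal Scala sketch of the word count it shows (the HDFS paths are placeholders). The chained operations are the Spark operations; the inline functions are the closures shipped to the cluster:

    val counts = sc.textFile("hdfs://...")     // placeholder input path
      .flatMap(line => line.split(" "))        // split each line into words
      .map(word => (word, 1))                  // pair each word with a count of 1
      .reduceByKey(_ + _)                      // sum the counts per word
    counts.saveAsTextFile("hdfs://...")        // action: triggers the computation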
Text search
• In-memory text search: calling cache() keeps the RDD in memory for faster reuse (sketched below)
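A minimal sketch of the pattern (the log path and query strings are hypothetical):

    val errors = sc.textFile("hdfs://.../logs")   // hypothetical log directory
      .filter(_.startsWith("ERROR"))
    errors.cache()                                // keep the working set in memory

    // Repeated queries now hit the cached RDD instead of re-reading HDFS
    val mysqlErrors = errors.filter(_.contains("MySQL")).count()
    val phpErrors = errors.filter(_.contains("PHP")).count()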
Logistic regression
• 100 GB of data on a 100-node cluster
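The slide's code is not preserved; below is a sketch in the spirit of the classic Spark logistic regression example, using Breeze vectors (the input path, data format, feature count and iteration count are all assumptions). The point is that the cached RDD is reused on every iteration:

    import breeze.linalg.DenseVector
    import scala.math.exp

    case class Point(x: DenseVector[Double], y: Double)

    // Assumed format: label followed by feature values, space-separated
    def parsePoint(line: String): Point = {
      val nums = line.split(' ').map(_.toDouble)
      Point(DenseVector(nums.tail), nums.head)
    }

    val points = sc.textFile("hdfs://.../points").map(parsePoint).cache()

    val numFeatures = 10                            // assumption
    var w = DenseVector.zeros[Double](numFeatures)
    for (_ <- 1 to 100) {
      // Each pass reuses the in-memory RDD: no HDFS re-read per iteration
      val gradient = points.map { p =>
        p.x * ((1.0 / (1.0 + exp(-p.y * (w dot p.x))) - 1.0) * p.y)
      }.reduce(_ + _)
      w -= gradient
    }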
Easy unit testing
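One way to do this (a hypothetical ScalaTest example: Spark runs in local mode inside the test, so no cluster is needed):

    import org.apache.spark.SparkContext
    import org.scalatest.FunSuite

    class WordCountSuite extends FunSuite {
      test("counts words") {
        val sc = new SparkContext("local", "test")   // in-process Spark
        try {
          val counts = sc.parallelize(Seq("a b a"))
            .flatMap(_.split(" "))
            .map((_, 1))
            .reduceByKey(_ + _)
            .collectAsMap()
          assert(counts("a") === 2)
        } finally {
          sc.stop()                                  // always release the context
        }
      }
    }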
Spark shell
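An illustrative interactive session (the count depends on your own input file):

    $ ./bin/spark-shell
    scala> val lines = sc.textFile("README.md")
    scala> lines.filter(_.contains("Spark")).count()
    res0: Long = ...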
Mini demo
Hive on Spark = Shark
• A large-scale data warehouse system, just like Hive
• Highly compatible with Hive (HQL, metastore, serialization formats and UDFs)
• Built on top of Spark, and thus a faster execution engine
• Can create in-memory materialized tables (cached tables)
• Cached tables use columnar storage instead of raw storage
Shark
• Shark uses the existing Hive client and metastore
MLlib
• Machine learning library based on Spark
• Supports a range of machine learning algorithms, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and more
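For instance, a small sketch of k-means clustering with MLlib (the input path and parameters are made up):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Assumed input: one point per line, space-separated coordinates
    val data = sc.textFile("hdfs://.../points")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    val model = KMeans.train(data, 10, 20)   // k = 10 clusters, 20 iterations
    model.clusterCenters.foreach(println)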
Spark Streaming
• Write streaming applications in the same way as batch applications
• Reuse code between batch processing and streaming
• Write more than analytics applications:
  • Join streams against historical data
  • Run ad-hoc queries on stream state
Spark Streaming
• Count tweets on a sliding window (sketched below)
• Find words with a higher frequency than historic data
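A sketch of the sliding-window count (this assumes the external spark-streaming-twitter module and configured Twitter credentials):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    val ssc = new StreamingContext(sc, Seconds(1))
    val tweets = TwitterUtils.createStream(ssc, None)

    // Count hashtags over a 60-second window, sliding every 10 seconds
    val counts = tweets
      .flatMap(_.getText.split(" ").filter(_.startsWith("#")))
      .map((_, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()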
GraphX: graph computing
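A tiny, hypothetical example: build a graph from vertex and edge RDDs and run PageRank on it:

    import org.apache.spark.graphx._

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph = Graph(vertices, edges)

    // PageRank, iterated until convergence within the given tolerance
    val ranks = graph.pageRank(0.001).vertices
    ranks.join(vertices).collect().foreach(println)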