Introduction to Spark
Eric Eijkelenboom - UserReport - userreport.com
• What is Spark and why should I care?
• Architecture and programming model
• Examples
• Mini demo
• Related projects
RTFM
• A general-purpose computation framework that leverages distributed memory
• More flexible than MapReduce (it supports general execution graphs)
• Linear scalability and fault tolerance
• It supports a rich set of higher-level tools, including:
  • Shark (Hive on Spark) and Spark SQL
  • MLlib for machine learning
  • GraphX for graph processing
  • Spark Streaming
Who cares?
Limitations of MapReduce
• Slow due to serialisation & replication
• Inefficient for iterative computing & interactive querying
[Diagram: Input → Map → Reduce → Output, where each iteration runs as a separate job that writes its result to HDFS and reads it back from HDFS in the next iteration]
Leveraging memory
[Diagram: with MapReduce, every iteration reads its input from HDFS and writes its output back to HDFS; Spark instead keeps the working set in memory across iterations]
• Not tied to the 2-stage MapReduce paradigm
1. Extract a working set
2. Cache it
3. Query it repeatedly
So, Spark is…
• In-memory analytics, many times faster than Hadoop/Hive
• Designed for running iterative algorithms & interactive querying
• Highly compatible with Hadoop's storage APIs
• Can run on your existing Hadoop cluster setup
• Programming in Scala, Python or Java
Spark stack
Architecture
[Diagram: a Spark Driver (master) coordinates work through a Cluster Manager; each Spark Worker runs alongside an HDFS Datanode, holding cached RDD partitions next to the HDFS blocks]
• Supported cluster managers:
  • YARN
  • Mesos
  • Standalone
Programming model
• Resilient Distributed Datasets (RDDs) are the basic building blocks
• Distributed collections of objects, cached in-memory across cluster nodes
• Automatically rebuilt on failure
• RDD operations:
  • Transformations: create new RDDs from existing ones
  • Actions: return a value to the master node after running a computation on the dataset
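For example (a minimal sketch in the Scala API, where sc is the SparkContext provided by the shell or your application):

    val nums = sc.parallelize(1 to 1000)      // distribute a local collection
    val squares = nums.map(n => n * n)        // transformation: lazy, returns a new RDD
    val evens = squares.filter(_ % 2 == 0)    // transformation: still nothing computed
    val total = evens.count()                 // action: runs the job, returns a value to the driver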
As you know…
• … Hadoop is a distributed system for counting words
• Here is how it's done in Spark (sketched below)
• Blue code: Spark operations
• Red code: functions (closures) that get passed to the cluster automatically
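The code image from the slide is not preserved here; the following is a minimal Scala sketch of the word count it shows (the HDFS paths are placeholders). The chained operations are the Spark operations; the inline functions are the closures shipped to the cluster:

    val counts = sc.textFile("hdfs://...")     // placeholder input path
      .flatMap(line => line.split(" "))        // split each line into words
      .map(word => (word, 1))                  // pair each word with a count of 1
      .reduceByKey(_ + _)                      // sum the counts per word
    counts.saveAsTextFile("hdfs://...")        // action: triggers the computation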
Text search
• In-memory text search: calling cache() keeps the RDD in memory for faster reuse (sketched below)
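A minimal sketch of the pattern (the log path and query strings are hypothetical):

    val errors = sc.textFile("hdfs://.../logs")   // hypothetical log directory
      .filter(_.startsWith("ERROR"))
    errors.cache()                                // keep the working set in memory

    // Repeated queries now hit the cached RDD instead of re-reading HDFS
    val mysqlErrors = errors.filter(_.contains("MySQL")).count()
    val phpErrors = errors.filter(_.contains("PHP")).count()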
Logistic regression
• 100 GB of data on a 100-node cluster
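The slide's code is not preserved; below is a sketch in the spirit of the classic Spark logistic regression example, using Breeze vectors (the input path, data format, feature count and iteration count are all assumptions). The point is that the cached RDD is reused on every iteration:

    import breeze.linalg.DenseVector
    import scala.math.exp

    case class Point(x: DenseVector[Double], y: Double)

    // Assumed format: label followed by feature values, space-separated
    def parsePoint(line: String): Point = {
      val nums = line.split(' ').map(_.toDouble)
      Point(DenseVector(nums.tail), nums.head)
    }

    val points = sc.textFile("hdfs://.../points").map(parsePoint).cache()

    val numFeatures = 10                            // assumption
    var w = DenseVector.zeros[Double](numFeatures)
    for (_ <- 1 to 100) {
      // Each pass reuses the in-memory RDD: no HDFS re-read per iteration
      val gradient = points.map { p =>
        p.x * ((1.0 / (1.0 + exp(-p.y * (w dot p.x))) - 1.0) * p.y)
      }.reduce(_ + _)
      w -= gradient
    }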
Easy unit testing
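One way to do this (a hypothetical ScalaTest example: Spark runs in local mode inside the test, so no cluster is needed):

    import org.apache.spark.SparkContext
    import org.scalatest.FunSuite

    class WordCountSuite extends FunSuite {
      test("counts words") {
        val sc = new SparkContext("local", "test")   // in-process Spark
        try {
          val counts = sc.parallelize(Seq("a b a"))
            .flatMap(_.split(" "))
            .map((_, 1))
            .reduceByKey(_ + _)
            .collectAsMap()
          assert(counts("a") === 2)
        } finally {
          sc.stop()                                  // always release the context
        }
      }
    }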
Spark shell
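An illustrative interactive session (the count depends on your own input file):

    $ ./bin/spark-shell
    scala> val lines = sc.textFile("README.md")
    scala> lines.filter(_.contains("Spark")).count()
    res0: Long = ...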
Mini demo
Hive on Spark = Shark
• A large-scale data warehouse system, just like Hive
• Highly compatible with Hive (HQL, metastore, serialization formats and UDFs)
• Built on top of Spark, and thus a faster execution engine
• Can create in-memory materialized tables (cached tables)
• Cached tables use columnar storage instead of raw storage
Shark
• Shark uses the existing Hive client and metastore
MLlib
• Machine learning library based on Spark
• Supports a range of machine learning algorithms, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and more
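For instance, a small sketch of k-means clustering with MLlib (the input path and parameters are made up):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Assumed input: one point per line, space-separated coordinates
    val data = sc.textFile("hdfs://.../points")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    val model = KMeans.train(data, 10, 20)   // k = 10 clusters, 20 iterations
    model.clusterCenters.foreach(println)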
Spark Streaming
• Write streaming applications in the same way as batch applications
• Reuse code between batch processing and streaming
• Write more than analytics applications:
  • Join streams against historical data
  • Run ad-hoc queries on stream state
Spark Streaming
• Count tweets on a sliding window (sketched below)
• Find words with a higher frequency than historic data
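A sketch of the sliding-window count (this assumes the external spark-streaming-twitter module and configured Twitter credentials):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    val ssc = new StreamingContext(sc, Seconds(1))
    val tweets = TwitterUtils.createStream(ssc, None)

    // Count hashtags over a 60-second window, sliding every 10 seconds
    val counts = tweets
      .flatMap(_.getText.split(" ").filter(_.startsWith("#")))
      .map((_, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()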
GraphX: graph computing
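A tiny, hypothetical example: build a graph from vertex and edge RDDs and run PageRank on it:

    import org.apache.spark.graphx._

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph = Graph(vertices, edges)

    // PageRank, iterated until convergence within the given tolerance
    val ranks = graph.pageRank(0.001).vertices
    ranks.join(vertices).collect().foreach(println)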