Study Notes: Apache Spark

Summary of Apache Spark Original Papers: 1. “Spark: Cluster Computing with Working Sets” by Matei Zaharia, et al. Hotcloud 2010. 2. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” by Matei Zaharia, et al. NSDI 2012.

Motivation • MapReduce is great, but, • There are applications where you iterate on the same set of data, e.g., for many iterations { old_data = new_data; new_data = func(old_data); } • Problem: The body of each iteration can be described as a MapReduce task, where the inputs and final outputs are GFS files. There is redundant work storing the output data to GFS and then reading it out again in the next iteration. • Idea: We can provide a mode to cache the final outputs in memory if possible. o Challenge: but if the machine crashes, we lose the outputs, can we recover? o Solution: store the lineage of the data so that they can be reconstructed as needed (e.g., if they get lost or insufficient memory).

Motivation • Spark’s goal was to generalize MapReduce to support new apps within same engine oMapReduce problems can be expressed in Spark too. oWhere Spark shines and MapReduce does not: applications that need to reuse a working set of data across multiple parallel operations • Two reasonably small additions allowed the previous specialized models to be expressed within Spark: ofast data sharing ogeneral DAGs

Motivation Some key points about Spark: • handles batch, interactive, and real-time within a single framework • native integration with Java, Python, Scala. oHas APIs written in these languages • programming at a higher level of abstraction • more general: map/reduce is just one set of supported constructs

Use Example • We’ll run Spark’s interactive shell… ./bin/spark-shell • Then from the “scala>” REPL prompt, let’s create some data… val data = 1 to 10000 • Create an RDD based on that data… val distData = sc.parallelize(data) • Then use a filter to select values less than 10… distData.filter(_ < 10).collect()

Resilient Distributed Datasets (RDD) • Represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. • RDDs can only be created through deterministic operations (aka transformations) on: oEither data in stable storage:  Any file stored in HDFS  Or other storage systems supported by Hadoop oOr other RDDs • A program cannot reference an RDD that it cannot reconstruct after a failure

Resilient Distributed Datasets (RDD) • Two types of operations on RDDs: transformations and actions o Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map and filter). Transformations create a new dataset from an existing one  transformations are lazy (not computed immediately)  instead they remember the transformations applied to some base dataset  The transformed RDD gets recomputed when an action is run on it (default) o They can then use these RDDs in actions, which are operations that return a value to the application or export data to a storage system. • However, an RDD can be persisted into storage in memory or disk o Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset – often making future actions more than 10x faster o The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it o Note that by default, the base RDD (the original data in stable storage) is not loaded into RAM because the useful data after transformation might be only a small fraction (small enough to fit into memory).

RDD Implementation Common interface of each RDD: • A set of partitions: atomic pieces of the dataset • A set of dependencies on parent RDDs: one RDD can have multiple parents o narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD. o wide dependencies: multiple child partitions may depend on a parent. Requires the shuffle operation. • A function for computing the dataset based on its parents • Metadata about its partitioning scheme • Metadata about its data placement, e.g., perferredLocations(p) returns a list of nodes where partition p can be accessed faster due to data locality

Narrow vs Wide Dependencies • Narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. Wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation. • Recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need be recomputed, and re-computation can be done in parallel

Job Scheduling on RDDs • Similar to Dryad, but takes data locality into account • When user runs an action, scheduler builds a DAG of stages to execute. Each stage contains as many pipelined transformations with narrow dependencies as possible. • Stage boundaries are the shuffle operations required for wide dependencies • Scheduler launches tasks to compute missing partitions from each stage. Tasks are assigned to machines based on data locality using delay scheduling. • For wide dependencies, intermediate records are materialized on the nodes holding parent partitions (similar to the intermediate map outputs of MapReduce) to simplify fault recovery.

Checkpointing • Although lineage can always be used to recover RDDs after a failure, checkpointing can be helpful for RDDs with long lineage chains containing wide dependencies. • For RDDs with narrow dependencies on data in stable storage, checkpointing is not worthwhile. Reconstruction can be done in parallel for these RDDs, at a fraction of the cost of replicating the whole RDD.

RDD vs Distributed Shared Memory (DSM) • Previous frameworks that support data reuse, e.g., Pregel and Piccolo. o Perform data sharing implicitly for these patterns o Specialized frameworks; do not provide abstractions for more general reuse o Programming interface supports fine-grained updates (reads and writes to each memory location): fault-tolerance requires expensive replication of data across machines or logging of updates across machines • RDD: o Only coarse-grained transformations (e.g., map, filter and join): apply the same operation to many data item. Note that reads on RDDs can still be fine-grained. o Fault-tolerance only requires logging the transformation used to build a dataset instead of the actual data o RDDs are not suitable for applications that make asynchronous fine-grained updates to shared state.

RDD vs Distributed Shared Memory (DSM) • Other benefits of RDDs: oRDDs are immutable. A system can mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce. oIn bulk operations, a runtime can schedule tasks based on data locality to improve performance oRDDs degrade gracefully when there is not enough memory to store them. An LRU eviction policy is used at the level of RDDs. A partition from the least recently accessed RDD is evicted to make room for a newly computed RDD partition. This is user-configurable via the “persistence priority” for each RDD.

Debugging RDDs • One can reconstruct the RDDs later from the lineage and let the user query them interactively • One can re-run any task from the job in a single-process debugger by recomputing the RDD partitions it depends on. • Similar to the replay debuggers but without the capturing/recording overhead.

// load error messages from a log into memory; then interactively search for // various patterns // base RDD val lines = sc.textFile("hdfs://...") // transformed RDDs! val errors = lines.filter(_.startsWith("ERROR")) val messages = errors.map(_.split("t")).map(r => r(1)) messages.cache() // action 1 messages.filter(_.contains("mysql")).count() // action 2 messages.filter(_.contains("php")).count() RDD Example

// load error messages from a log into memory; then interactively search for // various patterns // base RDD val lines = sc.textFile("hdfs://...") // transformed RDDs! val errors = lines.filter(_.startsWith("ERROR")) val messages = errors.map(_.split("t")).map(r => r(1)) messages.cache() // action 1 messages.filter(_.contains("mysql")).count() // action 2 messages.filter(_.contains("php")).count() RDD

// load error messages from a log into memory; then interactively search for // various patterns // base RDD val lines = sc.textFile("hdfs://...") // transformed RDDs! val errors = lines.filter(_.startsWith("ERROR")) val messages = errors.map(_.split("t")).map(r => r(1)) messages.cache() // action 1 messages.filter(_.contains("mysql")).count() // action 2 messages.filter(_.contains("php")).count() RDD RDD RDD RDD Transformations

// load error messages from a log into memory; then interactively search for // various patterns // base RDD val lines = sc.textFile("hdfs://...") // transformed RDDs! val errors = lines.filter(_.startsWith("ERROR")) val messages = errors.map(_.split("t")).map(r => r(1)) messages.cache() // action 1 messages.filter(_.contains("mysql")).count() // action 2 messages.filter(_.contains("php")).count() RDD RDD RDD RDD Transformations Value Actions

Shared Variables • Broadcast variables let programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks oFor example, to give every node a copy of a large input dataset efficiently • Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost

Shared Variables • Accumulators are variables that can only be “added” to through an associative operation. oUsed to implement counters and sums, efficiently in parallel • Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend for new types • Only the driver program can read an accumulator’s value, not the workers oEach accumulator is given a unique ID upon creation oEach worker creates a separate copy of the accumulator oWorker sends a message to driver about the updates to the accumulator

Study Notes: Apache Spark

More Related Content

What's hot

Viewers also liked

Similar to Study Notes: Apache Spark

Recently uploaded

Study Notes: Apache Spark

Editor's Notes