The document provides an overview of Apache Spark internals and Resilient Distributed Datasets (RDDs). It discusses:
- RDDs are Spark's fundamental data structure: immutable distributed collections to which transformations such as map and filter are applied.
- RDDs track their lineage (dependency graph) to support fault tolerance. Transformations create new RDDs, while actions trigger computation.
- Operations on RDDs include narrow transformations such as map, which require no data shuffling, and wide transformations such as join, which do.
- The RDD abstraction lets Spark's scheduler optimize execution through techniques such as pipelining and cache reuse.
Overview of Apache Spark including goals, motivations, and general features. Emphasizes fault tolerance, low latency, and simplicity in computing.
Details on how a Spark application is structured, including simple application examples, components, RDD graph, and stages of job execution.
Describes various deployment modes of Spark including Local, Standalone, and YARN. Highlights benefits of Spark architecture over Hadoop.
Defines RDDs, their properties, operations, examples of transformations and dependencies in RDDs, emphasizing fault tolerance and efficient computation.
Explains types of transformations on RDDs, examples of transformations like map and filter, and the lazily evaluated nature of transformations.
Detailed walk-through of the Spark Word Count example including the execution plan, DAG creation, and shuffle phase during processing.
Explains the caching mechanism in Spark including storage modules, different storage levels, and how caching improves performance.
Overview of Spark schedulers including DAG and task schedulers, their roles in managing resources, scheduling tasks, and handling job execution.
Details the shuffling process in Spark, types of shuffling, and their implementations, highlighting performance optimizations.
Outline of the data transfer interface used during shuffling in Spark, including transport clients and servers.
Introduction and Motivations: What is Apache Spark
Project goals
- Generality: diverse workloads, operators, job sizes
- Low latency: sub-second
- Fault tolerance: faults are the norm, not the exception
- Simplicity: often comes from generality
Introduction and Motivations: Motivations
Software engineering point of view
- Hadoop code base is huge
- Contributions/extensions to Hadoop are cumbersome
- Java-only hinders wide adoption, but Java support is fundamental
System/framework point of view
- Unified pipeline
- Simplified data flow
- Faster processing speed
Data abstraction point of view
- New fundamental abstraction: RDD
- Easy to extend with new operators
- More descriptive computing model
Introduction and Motivations: Hadoop: No Unified Vision
- Sparse modules
- Diversity of APIs
- Higher operational costs
Introduction and Motivations: SPARK: Descriptive Computing Model

  val file = sc.textFile("hdfs://...")

  val counts = file.flatMap(line => line.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)

  counts.saveAsTextFile("hdfs://...")

- Organize computation into multiple stages in a processing pipeline
- Transformations apply user code to distributed data in parallel
- Actions assemble the final output of an algorithm from distributed data
Introduction and Motivations: Faster Processing Speed
- Let's focus on iterative algorithms
- Spark is faster thanks to the simplified data flow
- We avoid materializing data on HDFS after each iteration
Example: k-means algorithm, one iteration
  HDFS Read -> Map (assign sample to closest centroid) -> GroupBy (Centroid_ID) -> network shuffle -> Reduce (compute new centroids) -> HDFS Write
Introduction and Motivations: Code Base
- 2012 (version 0.6.x): 20,000 lines of code
- 2014 (branch-1.0): 50,000 lines of code
Anatomy of a Spark Application
Anatomy of a Spark Application: A Very Simple Application Example

  val sc = new SparkContext("spark://...", "MyJob", home, jars)

  val file = sc.textFile("hdfs://...")           // This is an RDD

  val errors = file.filter(_.contains("ERROR"))  // This is an RDD

  errors.cache()

  errors.count()                                 // This is an action
Anatomy of a Spark Application: Spark Applications, the Big Picture
There are two ways to manipulate data in Spark:
- Use the interactive shell, i.e., the REPL
- Write standalone applications, i.e., driver programs
Anatomy of a Spark Application: Spark Components (details)
Anatomy of a Spark Application: The RDD Graph, Dataset vs. Partition Views
Anatomy of a Spark Application: Data Locality
Data locality principle
- Same as for Hadoop MapReduce
- Avoid network I/O; workers should manage local data
Data locality and caching
- First run: data is not in cache, so use the HadoopRDD's locality preferences (from HDFS)
- Second run: the FilteredRDD is in cache, so use its locations
- If something falls out of cache, go back to HDFS
Anatomy of a Spark Application: Lifetime of a Job in Spark
- RDD Objects: rdd1.join(rdd2).groupBy(...).filter(...) builds the operator DAG
- DAG Scheduler: splits the DAG into stages of tasks, and submits each stage and its tasks as they become ready
- Task Scheduler: launches tasks via the cluster manager (Master), and retries failed and straggler tasks
- Worker (Block Manager, threads): executes tasks, stores and serves blocks
Anatomy of a Spark Application: In Summary
Our example application (a jar file):
- Creates a SparkContext, which is the core component of the driver
- Creates an input RDD from a file in HDFS
- Manipulates the input RDD by applying a filter(f: T => Boolean) transformation
- Invokes the action count() on the transformed RDD
The DAG Scheduler:
- Gets RDDs, functions to run on each partition, and a listener for results
- Builds Stages of Task objects (code + preferred location)
- Submits Tasks to the Task Scheduler as they become ready
- Resubmits failed Stages
The Task Scheduler:
- Launches Tasks on executors
- Relaunches failed Tasks
- Reports to the DAG Scheduler
Spark Deployments: Spark Deployment Modes
The Spark framework can adopt several cluster managers:
- Local mode
- Standalone mode
- Apache Mesos
- Hadoop YARN
General workflow:
- A Spark application creates a SparkContext, which initializes the DriverProgram
- Registers with the ClusterManager
- Asks for resources to allocate Executors
- Schedules Task execution
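As a rough illustration (not from the original slides), the master URL handed to SparkConf is what selects the deployment mode; the application name and host addresses below are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}

  // Hypothetical application name and hosts; only the master URL scheme matters here.
  val conf = new SparkConf()
    .setAppName("MyJob")
    .setMaster("local[4]")             // Local mode: 4 threads in one JVM
    // .setMaster("spark://host:7077") // Standalone cluster manager
    // .setMaster("mesos://host:5050") // Apache Mesos
    // .setMaster("yarn-client")       // Hadoop YARN (Spark 1.x style)

  val sc = new SparkContext(conf)      // Driver registers with the cluster manager and obtains Executors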
Spark Deployments: Worker Nodes and Executors
Worker nodes are machines that run executors:
- They host one or multiple Workers
- One JVM (= 1 UNIX process) per Worker
- Each Worker can spawn one or more Executors
Executors run tasks:
- They run in a child JVM (= 1 UNIX process)
- They execute one or more tasks using threads in a ThreadPool
Spark Deployments: Comparison to Hadoop MapReduce
Hadoop MapReduce:
- One Task per UNIX process (JVM), more if JVM reuse is enabled
- MultiThreadedMapper, an advanced feature to have threads in Map Tasks
- → Short-lived Executor, with one large Task
Spark:
- Tasks run in one or more threads, within a single UNIX process (JVM)
- The Executor process is statically allocated to a worker, even with no threads running
- → Long-lived Executor, with many small Tasks
Spark Deployments: Benefits of the Spark Architecture
Isolation
- Applications are completely isolated
- Task scheduling is per application
Low overhead
- Task setup cost is that of spawning a thread, not a process: 10-100 times faster
- Small tasks mitigate the effects of data skew
Sharing data
- Applications cannot share data in memory natively
- Use an external storage service like Tachyon
Resource allocation
- Static process provisioning for executors, even without active tasks
- Dynamic provisioning is under development
Resilient Distributed Datasets
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. USENIX Symposium on Networked Systems Design and Implementation, 2012.
Resilient Distributed Datasets: What is an RDD
- RDDs are partitioned, locality-aware, distributed collections
- RDDs are immutable
- RDDs are data structures that either point to a direct data source (e.g., HDFS) or apply transformations to their parent RDD(s) to generate new data elements
Computations on RDDs
- Represented by lazily evaluated lineage DAGs composed of chained RDDs
Resilient Distributed Datasets: RDD Abstraction
Overall objective
- Support a wide array of operators (more than just Map and Reduce)
- Allow arbitrary composition of such operators
- Simplify scheduling
- Avoid modifying the scheduler for each operator
→ The question is: how to capture dependencies in a general way?
Resilient Distributed Datasets: RDD Interfaces
- Set of partitions ("splits"): much like in Hadoop MapReduce, each RDD is associated with (input) partitions
- List of dependencies on parent RDDs: this is completely new with respect to Hadoop MapReduce
- Function to compute a partition given its parents: this is the "user-defined code" we referred to when discussing the Mapper and Reducer classes in Hadoop
- Optional preferred locations: used to enforce data locality
- Optional partitioning info (Partitioner): helpful in "advanced" scenarios in which you want to pay attention to the behavior of the shuffle mechanism
Resilient Distributed Datasets: FilteredRDD
- partitions = same as the parent RDD
- dependencies = one-to-one on the parent
- compute(partition) = compute the parent and filter it
- preferredLocations(partition) = none (ask the parent)
- partitioner = none
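A minimal sketch of how such a FilteredRDD could be written against the interface above; this is a simplification for illustration, not the actual Spark source (which also handles closure cleaning and other details):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.{Partition, Partitioner, TaskContext}
  import scala.reflect.ClassTag

  // Each interface field from the previous slide maps onto one member of the class.
  class FilteredRDD[T: ClassTag](prev: RDD[T], f: T => Boolean)
      extends RDD[T](prev) {                                    // dependencies: one-to-one on the parent
    override protected def getPartitions: Array[Partition] =
      prev.partitions                                           // partitions: same as the parent
    override def compute(split: Partition, context: TaskContext): Iterator[T] =
      prev.iterator(split, context).filter(f)                   // compute: compute the parent, then filter
    override val partitioner: Option[Partitioner] = None        // partitioner: none, as stated on the slide
  }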
Resilient Distributed Datasets: JoinedRDD
- partitions = one per reduce task
- dependencies = shuffle on each parent
- compute(partition) = read and join shuffled data
- preferredLocations(partition) = none
- partitioner = HashPartitioner(numTasks), so Spark knows this data is hashed
Resilient Distributed Datasets: Dependency Types (2)
Narrow dependencies
- Each partition of the parent RDD is used by at most one partition of the child RDD
- Tasks can be executed locally and we don't have to shuffle (e.g., map, flatMap, filter, sample)
Wide dependencies
- Multiple child partitions may depend on one partition of the parent RDD
- This means we have to shuffle data, unless the parents are hash-partitioned (e.g., sortByKey, reduceByKey, groupByKey, cogroup, join, cartesian)
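As an illustrative check (not on the slide), an RDD's dependencies can be inspected to see whether a transformation introduced a narrow or a shuffle dependency; sample data is arbitrary and sc is the SparkContext from the earlier examples:

  import org.apache.spark.{OneToOneDependency, ShuffleDependency}

  val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val mapped  = pairs.mapValues(_ + 1)     // narrow: one-to-one dependency on the parent
  val reduced = pairs.reduceByKey(_ + _)   // wide: shuffle dependency on the parent

  mapped.dependencies.head.isInstanceOf[OneToOneDependency[_]]       // true
  reduced.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]] // true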
Resilient Distributed Datasets: Dependency Types, Optimizations
Benefits of lazy evaluation: the DAG Scheduler optimizes Stages and Tasks before submitting them to the Task Scheduler
- Pipelining narrow dependencies within a Stage
- Join plan selection based on partitioning
- Cache reuse
Resilient Distributed Datasets: Operations on RDDs, Transformations
Transformations
- A set of operations on an RDD that define how it should be transformed
- As in relational algebra, applying a transformation to an RDD yields a new RDD (because RDDs are immutable)
- Transformations are lazily evaluated, which allows optimizations to take place before execution
Examples (not exhaustive)
- map(func), flatMap(func), filter(func)
- groupByKey()
- reduceByKey(func), mapValues(func), distinct(), sortByKey()
- join(other), union(other)
- sample()
Resilient Distributed Datasets: Operations on RDDs, Actions
Actions
- Apply transformation chains on RDDs, eventually performing some additional operations (e.g., counting)
- Some actions only store data to an external data source (e.g., HDFS); others fetch data from the RDD (and its transformation chain) upon which the action is applied, and convey it to the driver
Examples (not exhaustive)
- reduce(func)
- collect(), first(), take(), foreach(func)
- count(), countByKey()
- saveAsTextFile()
Resilient Distributed Datasets: Operations on RDDs, Final Notes
Look at return types!
- Return type is an RDD → transformation
- Return type is a built-in Scala/Java type such as Int, Long, List<Object>, Array<Object> → action
Caching is a transformation
- Hints to keep the RDD in memory after its first evaluation
Transformations depend on the RDD "flavor"
- PairRDD
- SchemaRDD
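For instance (an illustrative snippet, not from the slides), the Scala types alone reveal which calls are transformations and which are actions:

  import org.apache.spark.rdd.RDD

  val nums: RDD[Int] = sc.parallelize(1 to 10)

  val doubled: RDD[Int] = nums.map(_ * 2)            // returns an RDD: transformation (lazy)
  val evens: RDD[Int]   = doubled.filter(_ % 4 == 0) // still an RDD: transformation (lazy)

  val total: Long        = evens.count()             // built-in type: action, triggers the job
  val asArray: Array[Int] = evens.collect()          // built-in type: action, fetches data to the driver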
Resilient Distributed Datasets: RDD Code Snippet
SparkContext
- The main entity responsible for setting up a job
- Contains the SparkConf, the scheduler, and the entry point for running jobs (runJob)
Dependencies
- Input RDD(s)
Resilient Distributed Datasets: RDD.map Operation Snippet
- map: RDD[T] → RDD[U]
- Produces a MappedRDD
- For each element in a partition, apply the function f
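Roughly, and as a simplified sketch rather than the exact Spark source (closure cleaning and other details omitted), map wraps the parent in a new RDD whose compute applies f to the parent's partition iterator:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.{Partition, TaskContext}
  import scala.reflect.ClassTag

  class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
      extends RDD[U](prev) {                                   // narrow, one-to-one dependency on the parent
    override protected def getPartitions: Array[Partition] = prev.partitions
    override def compute(split: Partition, context: TaskContext): Iterator[U] =
      prev.iterator(split, context).map(f)                     // apply f to each element of the partition
  }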
Resilient Distributed Datasets: RDD Iterator Code Snippet
The method to go through an RDD and apply a function f:
- First, check the local cache
- If not found, compute the RDD
Storage levels
- Disk
- Memory
- Off-heap (e.g., external memory stores like Tachyon)
- De-serialized
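Approximately, and only as a paraphrase of the Spark 1.x logic rather than a verbatim copy, the cache-or-compute behavior of RDD.iterator looks like this:

  // Paraphrased sketch (Spark 1.x): serve the partition from cache when a storage
  // level is set, otherwise compute it (or read it back from a checkpoint).
  //
  //   final def iterator(split: Partition, context: TaskContext): Iterator[T] =
  //     if (storageLevel != StorageLevel.NONE)
  //       SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)  // check the cache first
  //     else
  //       computeOrReadCheckpoint(split, context)                                     // otherwise compute the partition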
Resilient Distributed Datasets: Making RDDs from Local Collections
- Converts a local (on the driver) Seq[T] into an RDD[T]
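For example (illustrative usage, not the deck's snippet), SparkContext.parallelize performs this conversion; the number of partitions is an optional second argument:

  val localSeq: Seq[Int] = 1 to 1000
  val distributed: org.apache.spark.rdd.RDD[Int] = sc.parallelize(localSeq, 8)  // 8 partitions
  distributed.partitions.length  // 8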
Resilient Distributed Datasets: HadoopRDD Code Snippet
- Reading HDFS data as <key, value> records
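As a usage-level illustration (the deck's snippet shows the HadoopRDD internals instead), a HadoopRDD is what sc.textFile and sc.hadoopFile create under the hood; the paths below are placeholders:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.TextInputFormat

  // textFile is a thin wrapper that drops the key (the byte offset) and keeps the line:
  val lines = sc.textFile("hdfs://namenode/path/to/input")

  // The underlying <key, value> view, via the old Hadoop InputFormat API:
  val records = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://namenode/path/to/input")
  records.map { case (offset, line) => (offset.get, line.toString) }  // copy out of the reused Writables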
Resilient Distributed Datasets: Common Transformations, map
- map(f: T => U): returns a MappedRDD[U] by applying f to each element
Resilient Distributed Datasets: Common Transformations, flatMap
- flatMap(f: T => TraversableOnce[U]): returns a FlatMappedRDD[U] by first applying f to each element, then flattening the results
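A quick illustration of the difference between the two (arbitrary sample data):

  val lines = sc.parallelize(Seq("to be", "or not"))

  lines.map(_.split(" ")).collect()
  // Array(Array("to", "be"), Array("or", "not"))   -- one output element per input element

  lines.flatMap(_.split(" ")).collect()
  // Array("to", "be", "or", "not")                 -- the results are flattened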
Spark Word Count: a Detailed Example
Spark Word Count: the Driver

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._

  val sc = new SparkContext("spark://...", "MyJob", "spark home", "additional jars")

Driver and SparkContext
- The SparkContext initializes the application driver; the driver then registers the application with the cluster manager and gets a list of executors
- From then on, the driver takes full control of the Spark job
Spark Word Count: the Code

  val lines  = sc.textFile("input")
  val words  = lines.flatMap(_.split(" "))
  val ones   = words.map(_ -> 1)
  val counts = ones.reduceByKey(_ + _)
  val result = counts.collectAsMap()

RDD lineage
- The lineage DAG is built on the driver side, with data source RDD(s) and transformation RDD(s) created by transformations
Job submission
- An action triggers the DAG scheduler to submit a job
Spark Word Count: the DAG
Directed acyclic graph
- Built from the RDD lineage
DAG scheduler
- Transforms the DAG into stages and turns each partition of a stage into a single task
- Decides what to run
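As an illustrative aside (not on the slide), the lineage behind this DAG can be printed from the driver with toDebugString on the counts RDD defined above; the exact output format varies between Spark versions:

  println(counts.toDebugString)
  // Typical shape of the output (illustrative only):
  //   ShuffledRDD[4] at reduceByKey ...
  //    +-(1) MapPartitionsRDD[3] at map ...
  //       |  MapPartitionsRDD[2] at flatMap ...
  //       |  input MapPartitionsRDD[1] at textFile ...
  //       |  input HadoopRDD[0] at textFile ...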
Spark Word Count: the Execution Plan
Spark tasks
- A serialized RDD lineage DAG plus the closures of the transformations
- Run by Spark executors
Task scheduling
- The driver-side task scheduler launches tasks on executors according to resource and locality constraints
- The task scheduler decides where to run tasks
Spark Word Count: the Shuffle Phase

  val lines  = sc.textFile("input")
  val words  = lines.flatMap(_.split(" "))
  val ones   = words.map(_ -> 1)
  val counts = ones.reduceByKey(_ + _)
  val result = counts.collectAsMap()

The reduceByKey transformation
- Induces the shuffle phase; in particular, we have a wide dependency
- As in Hadoop MapReduce, intermediate <key, value> pairs are stored on the local file system
Automatic combiners!
- The reduceByKey transformation implements map-side combiners to pre-aggregate data
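To make the combiner behavior concrete (an illustrative rewrite, not from the slides), reduceByKey is roughly equivalent to a combineByKey call whose merge functions run both map-side (within a partition, before the shuffle) and reduce-side (across partitions, after the shuffle):

  // Roughly equivalent to ones.reduceByKey(_ + _):
  val combined = ones.combineByKey(
    (v: Int) => v,                  // createCombiner: first value seen for a key in a partition
    (acc: Int, v: Int) => acc + v,  // mergeValue: map-side pre-aggregation within a partition
    (a: Int, b: Int) => a + b       // mergeCombiners: reduce-side merge after the shuffle
  )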
Caching and Storage: Spark's Storage Module
The storage module
- Accesses (I/O) "external" data sources: HDFS, local disk, RAM, remote data access through the network
- Caches RDDs using a variety of "storage levels"
Main components
- The Cache Manager: uses the Block Manager to perform caching
- The Block Manager: a distributed key/value store
Caching and Storage: Class Diagram of the Caching Component
Caching and Storage: How Caching Works
- Frequently used RDDs can be stored in memory
- Deciding which RDDs to cache is an art!
- One method, one shortcut: persist() and cache()
SparkContext keeps track of cached RDDs
- Uses a data structure called persistentRdds
- Maintains references to cached RDDs, and eventually calls the garbage collector
- Time-stamp based invalidation using TimeStampedWeakValueHashMap[A, B]
Caching and Storage: How Caching Works (continued)
Caching and Storage: The Block Manager
A "write-once" key-value store
- One per worker node
- No updates; data is immutable
Main tasks
- Serves shuffle data (over local or remote connections) and cached RDDs
- Tracks the "storage level" (RAM, disk) for each block
- Spills data to disk if memory is insufficient
- Handles data replication, if required
Caching and Storage: Storage Levels
- The Block Manager can hold data in various storage tiers
- org.apache.spark.storage.StorageLevel contains flags to indicate which tier to use
- Configuration is manual, in the application; deciding the storage level to use for RDDs is not trivial
Available storage tiers
- RAM (default option): if the RDD doesn't fit in memory, some partitions will not be cached (and will be re-computed when needed)
- Tachyon (off the Java heap): reduces garbage-collection overhead; the crash of an executor no longer leads to cached data loss
- Disk
Data format
- Serialized or as Java objects
- Replicated partitions
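For illustration (not from the slides), the storage level is chosen per RDD through persist(); cache() is shorthand for the memory-only default, and the input path below is a placeholder:

  import org.apache.spark.storage.StorageLevel

  val parsed = sc.textFile("hdfs://...").map(_.split(","))

  // Pick exactly one level per RDD; Spark rejects changing it after it has been set.
  parsed.persist(StorageLevel.MEMORY_ONLY)          // default; cache() is shorthand for this
  // parsed.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions that do not fit to disk
  // parsed.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized in RAM: smaller, but more CPU to read
  // parsed.persist(StorageLevel.OFF_HEAP)          // off-heap store (Tachyon, in Spark 1.x)
  // parsed.persist(StorageLevel.MEMORY_AND_DISK_2) // the _2 suffix keeps two replicas of each partition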
Resource Allocation: Spark Schedulers
Two main scheduler components, executed by the driver
- The DAG scheduler
- The Task scheduler
Objectives
- Gain a broad understanding of how Spark submits applications
- Understand how Stages and Tasks are built, and how they are optimized
- Understand the interaction among the various other Spark components
Resource Allocation: The DAG Scheduler
Stage-oriented scheduling
- Computes a DAG of stages for each job in the application (Lines 10-14, details in Lines 15-27)
- Keeps track of which RDD and stage outputs are materialized
- Determines an optimal schedule, minimizing the number of stages
- Submits stages, as sets of Tasks (TaskSets), to the Task scheduler (Line 26)
Data locality principle
- Uses "preferred location" information (optionally) attached to each RDD (Line 20)
- Packages this information into Tasks and sends it to the Task scheduler
Manages Stage failures
- Failure type: loss of (intermediate) shuffle output files
- Failed stages are resubmitted
- NOTE: Task failures are handled by the Task scheduler, which simply resubmits them if they can be recomputed with no dependency on previous output
Resource Allocation: The DAG Scheduler, Implementation Details
Implemented as an event queue
- Uses a daemon thread to handle various kinds of events (Line 6): JobSubmitted, JobCancelled, CompletionEvent
- The thread "sweeps" the queue and routes each event to the corresponding handler
What happens when a job is submitted to the DAGScheduler?
- A JobWaiter object is created
- A JobSubmitted event is fired
- The daemon thread blocks and waits for a job result (Lines 3-4)
Resource Allocation: The DAG Scheduler, Implementation Details (2)
Who handles the JobSubmitted event?
- A specific handler called handleJobSubmitted (Line 6)
Walk-through of the JobSubmitted handler
- Creates a new job, called ActiveJob
- The new job starts with only one stage, corresponding to the last stage of the job upon which an action is called (Lines 8-9)
- Uses the dependency information to produce additional stages:
  Shuffle dependency: create a new map stage (Line 16)
  Narrow dependency: pipe it into a single stage (getMissingParentStages)
Resource Allocation: More About Stages
What is a DAG?
- A directed acyclic graph of stages
- Stage boundaries are determined by the shuffle phase
- Stages are run in topological order
Definition of a Stage
- A set of independent tasks
- All tasks of a stage apply the same function
- All tasks of a stage have the same dependency type
- All tasks in a stage belong to a TaskSet
Stage types
- Shuffle Map Stage: the stage's task results are inputs for another stage
- Result Stage: the tasks compute the final action that initiated the job (e.g., count(), save(), etc.)
Resource Allocation: The Task Scheduler
Task-oriented scheduling
- Schedules tasks for a single SparkContext
- Submits task sets produced by the DAG Scheduler
- Retries failed tasks
- Takes care of stragglers with speculative execution
- Produces events for the DAG Scheduler
Implementation details
- The Task scheduler creates a TaskSetManager to wrap the TaskSet from the DAG scheduler (Line 28)
- The TaskSetManager class keeps track of each task's status, retries failed tasks, and imposes data locality using delay scheduling (Lines 29-30)
- Message passing is implemented using Actors, specifically the Akka framework
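As an illustrative aside (not on the slide), both the locality delay and speculative execution are user-tunable through configuration; the values below are arbitrary examples:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.locality.wait", "3000")  // how long delay scheduling waits for a local slot (milliseconds in Spark 1.x)
    .set("spark.speculation", "true")    // re-launch suspected straggler tasks speculatively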
Resource Allocation: Running Tasks on Executors
Executors run two kinds of tasks
- ResultTask: applies the action to the RDD once it has been computed, alongside all of its dependencies (Line 19)
- ShuffleMapTask: uses the Block Manager to store shuffle output via the ShuffleWriter (Lines 23-24)
- The shuffle-read component depends on the type of the RDD, which is determined by its compute function and the transformation applied to it
Data Shuffling: The Spark Shuffle Mechanism
Same concept as in Hadoop MapReduce, involving:
- Storage of "intermediate" results on the local file system
- Partitioning of "intermediate" data
- Serialization / de-serialization
- Pulling data over the network
Transformations requiring a shuffle phase
- groupByKey(), reduceByKey(), sortByKey(), distinct()
Various types of shuffle
- Hash shuffle
- Consolidated hash shuffle
- Sort-based shuffle
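In Spark 1.x the shuffle implementation is pluggable and selected through configuration; an illustrative sketch (the default moved from hash to sort over the 1.x series, and the consolidation flag applies to the hash shuffle only):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.shuffle.manager", "sort")           // or "hash"; sort-based is the default from Spark 1.2 on
    .set("spark.shuffle.consolidateFiles", "true")  // hash shuffle only: consolidate map output files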
Data Shuffling: The Spark Shuffle Mechanism, an Illustration
Data aggregation
- Defined in the ShuffleMapTask
- Two methods available:
  AppendOnlyMap: in-memory hash-table combiner
  ExternalAppendOnlyMap: memory + disk hash-table combiner
- Disk writes are batched to increase throughput
Data Shuffling: The Spark Shuffle Mechanism, Implementation Details
Pluggable components
- Shuffle Manager: registered in SparkEnv, configured through SparkConf
- Shuffle Writer: tracks "intermediate data" for the MapOutputTracker
- Shuffle Reader: pull-based mechanism used by the ShuffleRDD
- Shuffle Block Manager: maps the logical partitioning to the physical layout of data
Data Shuffling: The Hash Shuffle Mechanism
- Map tasks write their output to multiple files
- Assume m map tasks and r reduce tasks; then there are m × r shuffle files, as well as in-memory buffers (for batching writes)
- Be careful about storage space requirements!
- The buffer size must not be too big with many tasks, but not too small either, otherwise throughput decreases
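A small worked example with illustrative numbers (not from the slides): with m = 1,000 map tasks and r = 1,000 reduce tasks, hash shuffle produces m × r = 1,000,000 shuffle files across the job; if each open file keeps a 32 KB write buffer and an executor runs 16 map tasks concurrently, that executor alone holds about 16 × 1,000 × 32 KB ≈ 500 MB of buffers.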
Data Shuffling: The Consolidated Hash Shuffle Mechanism
- Addresses the buffer-size problem
- Executor view vs. task view: buckets are consolidated into a single file
- Hence: F = C × r files and buffers, where C is the number of task threads within an Executor
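Continuing the illustrative numbers from the previous example: with C = 16 task threads per executor and r = 1,000 reduce tasks, each executor now writes into F = C × r = 16,000 files and buffers over its lifetime, rather than one file per (map task, reduce task) pair.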
Data Shuffling: The Sort-based Shuffle Mechanism
- Implements the Hadoop-style shuffle mechanism: a single shuffle file per map task, plus an index file to locate the "buckets"
- Very beneficial for write throughput, as more disk writes can be batched
Sorting mechanism
- Pluggable external sorter
- Degenerates to hash shuffle if no sorting is required
Data Shuffling: Data Transfer, Implementation Details
BlockTransferService
- General interface for the ShuffleFetcher
- Uses the BlockDataManager to get local data
Shuffle Client
- Manages and wraps the "client side", setting up the TransportContext and TransportClient
- TransportContext: manages the transport layer
- TransportServer: streaming server
- TransportClient: fetches consecutive chunks