Spark Marco Serafini COMPSCI 532 Lecture 4
Goals
• Support for iterative jobs
• Reuse of intermediate results without writing to disk
• Lineage-based fault tolerance
  • Does not require checkpointing all intermediate results
Resilient Distributed Datasets (RDDs)
• Collection of records (serialized objects)
• Read-only
• Created through deterministic transformations from
  • Data in storage
  • Other RDDs
• Lineage of an RDD
  • Sequence of transformations that created it
  • Replayed (in parallel) from persisted data to recreate a lost RDD
• Caching an RDD: keeping it in memory for later reuse
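The lineage idea above can be sketched in a few lines of plain Python. This is a toy `MiniRDD` class (not the real Spark API): each RDD remembers its parent and the deterministic transformation that created it, so a lost result can be recomputed by replaying the lineage from the base data, while `persist` keeps a copy in memory.

```python
# Toy sketch of an RDD with lineage-based recomputation (not real Spark).
class MiniRDD:
    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # base data in "storage"
        self.parent = parent      # parent RDD in the lineage
        self.fn = fn              # deterministic transformation
        self.cache = None         # optional in-memory copy

    def map(self, fn):
        # Record the transformation; nothing is computed here.
        return MiniRDD(parent=self, fn=fn)

    def compute(self):
        if self.cache is not None:        # cached: no recomputation
            return self.cache
        if self.parent is None:           # base RDD: read from storage
            return list(self.source)
        # Replay the transformation on the parent's (recomputed) data.
        return [self.fn(x) for x in self.parent.compute()]

    def persist(self):
        self.cache = self.compute()
        return self

base = MiniRDD(source=[1, 2, 3])
doubled = base.map(lambda x: 2 * x).map(lambda x: x + 1)
print(doubled.compute())   # [3, 5, 7]
```

Because transformations are deterministic and the base data is persisted, `compute()` can be called again at any time (e.g. after a failure) and yields the same result.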
Spark terminology
• Driver
  • Process executing the application code
  • Sends RDD transformations and actions to workers
• Workers
  • Host partitions of RDDs
  • Execute transformations
More terminology
• Task
  • Unit of work sent to one executor
  • A partition of a fundamental operator, such as map or reduce
• Stage
  • Set of parallel tasks, one task per partition
  • Can include multiple pipelined tasks with no shuffling
• Job
  • Parallel computation consisting of multiple stages
  • Spawned in response to an action
More terminology
• Shuffle
  • Data transfer among workers
• Partition
  • Chunk of an RDD, processed by one worker thread
• Transformation
  • Function that produces a new RDD but no output
  • Lazily evaluated
• Action
  • Function that returns output
  • Triggers evaluation
Spark API
Spark Computation
• Driver executes the application code
• Workers execute only transformations
  • Driver sends closures to workers
• Lazy evaluation
  • Driver records transformations without executing them
  • It builds a Directed Acyclic Graph (DAG) of transformations
  • Executed only when output is required by an action
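Lazy evaluation can be illustrated with a toy dataset class (again, a sketch rather than the real Spark API): transformations only append to a recorded plan, and nothing runs until the action `collect` is called, which then executes the whole recorded chain.

```python
# Sketch of lazy evaluation: transformations build a plan, an action runs it.
log = []   # records which operators actually execute

class LazyDataset:
    def __init__(self, data, plan=()):
        self.data, self.plan = data, plan

    def map(self, fn):                  # transformation: recorded, not run
        return LazyDataset(self.data, self.plan + (("map", fn),))

    def filter(self, pred):             # transformation: recorded, not run
        return LazyDataset(self.data, self.plan + (("filter", pred),))

    def collect(self):                  # action: triggers evaluation
        out = list(self.data)
        for op, fn in self.plan:
            log.append(op)
            if op == "filter":
                out = [x for x in out if fn(x)]
            else:
                out = [fn(x) for x in out]
        return out

ds = LazyDataset(range(5)).map(lambda x: x * x).filter(lambda x: x > 4)
print(log)           # [] -- nothing has executed yet
print(ds.collect())  # [9, 16]
print(log)           # ['map', 'filter'] -- executed only at the action
```

In real Spark the recorded plan is the DAG of transformations, which also lets the scheduler group pipelined operators into stages before anything runs.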
Closures
• Driver serializes functions to be executed by workers
  • It computes a “closure” and sends it to the workers
• Closure includes all objects referenced by the function
• Careful with references, or you might send huge closures!
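The "huge closure" pitfall can be seen directly in Python, which is what PySpark serializes under the hood (the `Config` class and field names below are made up for illustration). A function that references a large object drags the whole object into its closure; copying out just the field it needs keeps the closure small.

```python
# Sketch of the closure-capture pitfall (hypothetical Config object).
class Config:
    def __init__(self):
        self.threshold = 10
        self.huge_table = list(range(1_000_000))  # not needed by the task

cfg = Config()

def make_bad(config):
    # Captures the entire config object, huge_table included.
    return lambda x: x > config.threshold

def make_good(config):
    # Copy only the field the task needs before building the function.
    t = config.threshold
    return lambda x: x > t

bad, good = make_bad(cfg), make_good(cfg)

# Both behave identically...
print(bad(11), good(11))   # True True
# ...but __closure__ shows what each would drag along when serialized:
# bad captures the whole Config, good captures just the int 10.
print([c.cell_contents for c in good.__closure__])   # [10]
```

In PySpark the same principle applies: assign `t = cfg.threshold` on the driver before using `t` inside the lambda passed to a transformation.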
Narrow vs. Wide Operators
• Narrow dependency
  • Executes on data local to the same worker
  • Can be pipelined locally
  • Faster recovery (only local re-execution needed)
  • Example: map → filter
• Wide dependency
  • Requires a shuffle, which marks the end of a stage
  • Complex recovery (may involve multiple workers)
  • Example: map → groupByKey (if not already partitioned by that key)
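The contrast can be made concrete with explicit partitions in plain Python (a sketch, not Spark's implementation). The narrow chain runs entirely inside each partition, while `groupByKey` must first repartition every record by key, which is what the shuffle does over the network in Spark.

```python
from collections import defaultdict

# Two input partitions of key-value pairs (as they might sit on two workers).
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: filter then map are pipelined within each partition;
# no record ever crosses a partition boundary.
narrow = [[(k, v * 10) for k, v in part if v > 1] for part in partitions]
print(narrow)   # [[('b', 20)], [('a', 30), ('c', 40)]]

# Wide: groupByKey needs a shuffle, because records with the same key
# may live in different partitions. Every record is rerouted to an
# output partition chosen by hashing its key.
def shuffle_by_key(parts, n_out=2):
    out = [defaultdict(list) for _ in range(n_out)]
    for part in parts:
        for k, v in part:
            out[hash(k) % n_out][k].append(v)   # network transfer in Spark
    return [dict(d) for d in out]

grouped = shuffle_by_key(partitions)   # e.g. key "a" gathers [1, 3]
```

If the data were already hash-partitioned by key, the shuffle step would be unnecessary, which is exactly why `groupByKey` after a matching `partitionBy` becomes a narrow dependency.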
Checkpoints and Partitioning
• Partitioning function declared when an RDD is created
• Checkpointing
  • Speeds up recovery by truncating the lineage
  • When to checkpoint is left to the user
  • Checkpoints are stored (replicated) on the file system
PageRank Example
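As a reference point for the questions below, here is the classic Spark PageRank loop sketched in plain Python: lists of key-value pairs stand in for the `links` and `ranks` RDDs, a list comprehension plays the role of `flatMap`, and the summing loop plays the role of `reduceByKey`. The three-page link graph is made up for illustration.

```python
# PageRank in Spark style, sketched with plain Python collections.
links = [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]   # page -> outlinks
ranks = {page: 1.0 for page, _ in links}                   # initial ranks

for _ in range(10):
    # "flatMap": each page sends rank/fanout contributions to its outlinks.
    contribs = [(dest, ranks[page] / len(dests))
                for page, dests in links for dest in dests]
    # "reduceByKey": sum the contributions arriving at each page.
    summed = {}
    for dest, c in contribs:
        summed[dest] = summed.get(dest, 0.0) + c
    # "mapValues": apply the damping factor.
    ranks = {p: 0.15 + 0.85 * summed.get(p, 0.0) for p in ranks}

print(sorted(ranks.items()))
```

In real Spark, joining `links` with `ranks` each iteration is the wide (shuffled) step, and hash-partitioning both RDDs by page up front is the standard trick to avoid reshuffling `links` in every iteration.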
Questions
• Which intermediate RDDs are created?
• Stages? Narrow/wide operators?
• How to reduce shuffling?
• Lineage graph
Lineage Graph