
Spark, Marco Serafini, COMPSCI 532, Lecture 4



  1. Spark Marco Serafini COMPSCI 532 Lecture 4

  2. Goals • Support for iterative jobs • Reuse of intermediate results without writing to disk • Lineage-based fault tolerance • Does not require checkpointing all intermediate results

  3. Resilient Distributed Datasets (RDDs) • Collection of records (serialized objects) • Read-only • Created through deterministic transformations from • Data in storage • Other RDDs • Lineage of an RDD • Sequence of transformations that created it • Replayed (in parallel) from persisted data to recreate a lost RDD • Caching an RDD: keeping it in memory for later reuse
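The lineage idea above can be sketched in a few lines of plain Python. This is a toy model, not the real Spark API; the class and method names are invented for illustration:

```python
# Toy model of an RDD: a read-only dataset that records the
# transformations used to create it, so a lost result can be
# recomputed from its lineage instead of from a checkpoint.

class ToyRDD:
    def __init__(self, source, transformations=()):
        self.source = source                    # persisted input data
        self.transformations = transformations  # lineage: functions applied in order

    def map(self, f):
        # Returns a *new* RDD; the parent is never mutated (read-only).
        return ToyRDD(self.source, self.transformations + (f,))

    def compute(self):
        # Replaying the lineage from the persisted source recreates the
        # data, e.g. after the worker holding a cached copy is lost.
        records = list(self.source)
        for f in self.transformations:
            records = [f(r) for r in records]
        return records

base = ToyRDD([1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.compute())  # [2, 4, 6] — recomputable at any time from lineage
```

Note that `doubled` stores only its source and the list of functions, never the materialized records; that is what makes lineage-based recovery cheap compared to checkpointing every intermediate result.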

  4. Spark terminology • Driver • Process executing the application code • Sends RDD transformations and actions to workers • Workers • Host partitions of RDDs • Execute transformations

  5. More Terminology • Task • Unit of work sent to one executor • Partition of a fundamental operator, such as Map or Reduce • Stage • Set of parallel tasks, one task per partition • Can include multiple pipelined tasks with no shuffling • Job • Parallel computation consisting of multiple stages • Spawned in response to an action

  6. More terminology • Shuffle • Data transfer among workers • Partition • Chunk of an RDD, processed by one worker thread • Transformation • Function that produces a new RDD but no output • Lazily evaluated • Action • Function that returns output • Triggers evaluation

  7. Spark API

  8. Spark Computation • Driver executes application code • Workers execute only transformations • Driver sends closures to workers • Lazy evaluation • Driver records transformations without executing them • It builds a Directed Acyclic Graph (DAG) of transformations • Executes only as needed, when output (an action) is required
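Lazy evaluation can be demonstrated with a small plain-Python sketch (invented names, a linear plan standing in for the DAG): transformations only record work, and nothing runs until an action asks for output.

```python
# Minimal sketch of lazy evaluation: map() extends a recorded plan,
# and only the action collect() actually executes it.

calls = []  # counts how many times the user function really ran

def logged(f, name):
    def wrapper(x):
        calls.append(name)
        return f(x)
    return wrapper

class LazyDataset:
    def __init__(self, data, plan=()):
        self.data, self.plan = data, plan

    def map(self, f):
        # Transformation: records the step, executes nothing.
        return LazyDataset(self.data, self.plan + (f,))

    def collect(self):
        # Action: walks the recorded plan and runs it.
        out = list(self.data)
        for f in self.plan:
            out = [f(x) for x in out]
        return out

ds = LazyDataset([1, 2, 3]).map(logged(lambda x: x + 1, "inc"))
print(len(calls))    # 0 — the map has not run yet
print(ds.collect())  # [2, 3, 4]
print(len(calls))    # 3 — executed only when the action was called
```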

  9. Closures • Driver serializes functions to be executed by workers • It computes a “closure” and sends it to the workers • Closure includes all objects referenced by the function • Careful with references, or you might send huge closures!
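The size of the problem is easy to see by pickling the objects a function would drag into its closure. This is a hedged sketch (variable names invented); it compares what ships when a function references a whole driver-side list versus a single extracted value:

```python
import pickle

lookup = list(range(100_000))  # large object living in the driver
last = lookup[-1]              # the only value the function needs

# Bad: `lambda x: x > lookup[-1]` would pull the whole list into the
# closure shipped to every worker. Good: bind the scalar first.
good_fn = lambda x, t=last: x > t

whole_list_size = len(pickle.dumps(lookup))  # what the bad closure carries
scalar_size = len(pickle.dumps(last))        # what the good closure carries
print(whole_list_size > 1000 * scalar_size)  # True: thousands of times larger
```

The fix generalizes: extract the specific fields a task needs into local variables before defining the function, so the serialized closure stays small.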

  10. Narrow vs. Wide Operators • Narrow dependency • Executes on data local to the same worker • Can be pipelined locally • Faster recovery (only local re-execution needed) • Example: map → filter • Wide dependency • Requires a shuffle, which marks the end of a stage • Complex recovery (multi-worker) • Example: map → groupByKey (if the RDD is not partitioned by that key)
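The two kinds of dependency can be contrasted with a toy two-partition dataset (a plain-Python sketch with invented names, not Spark itself):

```python
# Two partitions of (key, value) records, as they would sit on two workers.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: map then filter, pipelined within each partition independently —
# no record ever leaves its partition.
narrow = [
    [(k, v * 10) for k, v in part if v * 10 > 10]
    for part in partitions
]

# Wide: groupByKey needs all values for a key in one place, so records are
# first exchanged between partitions by key hash (a shuffle; in real Spark
# this is a network transfer and marks the end of a stage).
num_out = 2
shuffled = [[] for _ in range(num_out)]
for part in partitions:
    for k, v in part:
        shuffled[hash(k) % num_out].append((k, v))

grouped = [
    {k: [v for kk, v in bucket if kk == k] for k, _ in bucket}
    for bucket in shuffled
]
```

The narrow chain never moves data, which is why it can be pipelined and recovered locally; the wide step must gather every record for a key onto one worker before grouping.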

  11. Checkpoints and Partitioning • Partitioning function declared when the RDD is created • Checkpointing • Speeds up recovery by shortening the lineage to replay • When to checkpoint is left to the user • Checkpoints are stored (replicated) on a file system
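Why a checkpoint shortens recovery can be sketched in plain Python (file names and steps are invented for illustration): after persisting an intermediate result, recovery replays only the transformations applied since the checkpoint, not the whole chain.

```python
import os
import pickle
import tempfile

data = list(range(5))
step1 = [x + 1 for x in data]  # lineage accumulated so far

# Checkpoint: persist the intermediate result (replicated to a
# fault-tolerant file system in real Spark; a local file here).
ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
with open(ckpt_path, "wb") as f:
    pickle.dump(step1, f)

# On failure, recovery starts from the checkpoint, not the original
# input, so the transformations before it need not be replayed.
with open(ckpt_path, "rb") as f:
    recovered = pickle.load(f)
step2 = [x * 2 for x in recovered]
print(step2)  # [2, 4, 6, 8, 10]
```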

  12. PageRank Example
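PageRank is the classic iterative workload Spark targets: each iteration reuses the previous ranks, which Spark would keep in memory as a cached RDD rather than rewrite to disk. A plain-Python sketch of the computation (the toy graph is illustrative; the 0.15/0.85 damping constants are the standard ones):

```python
# Toy link graph: page -> list of outgoing links.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {p: 1.0 for p in links}  # initial rank 1.0 per page

for _ in range(10):  # each iteration reuses the previous ranks
    contribs = {p: 0.0 for p in links}
    for page, outlinks in links.items():
        for dest in outlinks:
            # A page splits its rank evenly among its outgoing links.
            contribs[dest] += ranks[page] / len(outlinks)
    # Standard damping: rank = 0.15 + 0.85 * incoming contributions.
    ranks = {p: 0.15 + 0.85 * c for p, c in contribs.items()}

print(round(sum(ranks.values()), 6))  # 3.0 — total rank is conserved
```

In Spark, `links` would be an RDD partitioned by key and cached, so that the join with `ranks` in every iteration is a narrow dependency instead of a repeated shuffle.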

  13. Questions • Which intermediate RDDs are created? • Stages? Narrow/wide operators? • How to reduce shuffling? • Lineage graph?

  14. Lineage Graph
