Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
Checkpointing in Shared-Memory MPs
[Figure: timeline in which checkpoints (chkpt) are saved periodically and a fault triggers a rollback; processors P1 to P4 all take each checkpoint together]
• HW-based schemes for small CMPs use global checkpointing
– All processors participate in system-wide checkpoints
• Global checkpointing is not scalable
– Synchronization, bursty movement of data, loss in rollback…
R. Agarwal, P. Garg, J. Torrellas, Rebound: Scalable Checkpointing
Alternative: Coordinated Local Checkpointing
• Idea: threads coordinate their checkpointing in groups
• Rationale:
– Faults propagate only through communication
– Interleaving between non-communicating threads is irrelevant
[Figure: processors P1 to P5 taking local checkpoints in groups, contrasted with a global checkpoint across all of them]
+ Scalable: checkpoint and rollback in processor groups
– Complexity: inter-thread dependences must be recorded dynamically
Contributions
Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory
• Leverages the directory protocol to track inter-thread dependences
• Optimizations to boost checkpointing efficiency:
– Delaying write-back of data to safe memory at checkpoints
– Supporting multiple checkpoints
– Optimizing checkpointing at barrier synchronization
• Average performance overhead for 64 processors: 2%
– Compared to 15% for global checkpointing
Background: In-Memory Checkpointing with ReVive [Prvulovic-02]
[Figure: processors P1 to P3 with their caches, memory, and the log, shown during execution and at a checkpoint (CHK)]
• Execution: writes (W) stay in the caches; when a dirty line is displaced, its writeback (WB) to memory is logged
• Checkpoint: each processor dumps its registers and writes back its dirty cache lines; the application stalls during the writebacks
Background: In-Memory Checkpointing with ReVive [Prvulovic-02]
[Figure: processors P1 to P3 rolling back to the checkpoint (CHK) after a fault]
• On a fault: the old registers are restored, the caches are invalidated, and memory lines are reverted using the log
• Global checkpointing: broadcast protocol
• Coordinated local checkpointing: scalable protocol
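The log-based revert on this slide can be illustrated with a small undo log. This is a minimal sketch under stated assumptions, not ReVive's actual design: the class `LoggedMemory` and its methods are hypothetical names, and every writeback is logged rather than only the first one per line.

```python
_ABSENT = object()  # marks "address had no value before this writeback"


class LoggedMemory:
    """Toy memory protected by an undo log (all names are illustrative)."""

    def __init__(self):
        self.mem = {}  # address -> current value
        self.log = []  # (address, old value) pairs, oldest first

    def writeback(self, addr, value):
        # Save the old contents before memory is overwritten.
        self.log.append((addr, self.mem.get(addr, _ABSENT)))
        self.mem[addr] = value

    def checkpoint(self):
        # Memory now holds the checkpoint state, so older log entries are obsolete.
        self.log.clear()

    def rollback(self):
        # Undo writebacks newest-first so the oldest saved value wins.
        while self.log:
            addr, old = self.log.pop()
            if old is _ABSENT:
                self.mem.pop(addr, None)
            else:
                self.mem[addr] = old
```

After `rollback()`, memory holds exactly what it held at the last `checkpoint()` call, which is the property the slide's "Memory Lines Reverted" step relies on.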
Coordinated Local Checkpointing Rules
[Figure: P1 writes x and P2 reads x, making P1 the producer and P2 the consumer; one panel shows checkpointing, the other rollback]
• P checkpoints → P's producers checkpoint
• P rolls back → P's consumers roll back
• Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96]
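The two rules can be read as consistency conditions on a group of processors. The sketch below is only an illustration: the function names and the representation of dependences as (producer, consumer) pairs are assumptions, not part of Rebound.

```python
def checkpoint_is_consistent(deps, group):
    """Rule 1: if a consumer checkpoints, each of its producers must too.

    deps:  iterable of (producer, consumer) pairs (hypothetical representation)
    group: set of processors taking the checkpoint together
    """
    return all(prod in group for prod, cons in deps if cons in group)


def rollback_is_consistent(deps, group):
    """Rule 2: if a producer rolls back, each of its consumers must too."""
    return all(cons in group for prod, cons in deps if prod in group)
```

With the single dependence P1 → P2 from the figure, P2 cannot checkpoint alone and P1 cannot roll back alone, while P1 may checkpoint alone and P2 may roll back alone.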
Rebound Fault Model
[Figure: chip multiprocessor connected to main memory, which holds the log (in SW)]
• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no fault on their own (e.g., NVM)
• Fault detection is outside our scope:
– Fault detection latency has an upper bound of L cycles
Rebound Architecture
[Figure: chip multiprocessor and main memory; the components shown are P+L1, the L2 cache with its Dep registers (MyProducers, MyConsumers), and the directory with an LW-ID field]
• Dependence (Dep) registers in the L2 cache controller:
– MyProducers: bitmap of processors that produced data consumed by the local processor
– MyConsumers: bitmap of processors that consumed data produced by the local processor
• Processor ID in each directory entry:
– LW-ID: last writer to the line in the current checkpoint interval
Recording Inter-Thread Dependences (assume MESI protocol)
[Figure: P1 and P2 with their MyProducers and MyConsumers registers, the directory, the log, and memory]
• P1 writes the line: LW-ID is set to P1; the line is dirty (D) in P1's cache
Recording Inter-Thread Dependences (assume MESI protocol)
• P2 reads the line; LW-ID identifies P1 as the last writer
• P1: MyConsumers ← P2
• P2: MyProducers ← P1
• The dirty line is written back and logged; its state changes from D to S
Recording Inter-Thread Dependences (assume MESI protocol)
• P1 writes the line again: LW-ID stays P1; the line goes from shared (S) back to dirty (D) in P1's cache
• The Dep registers keep their values: P1's MyConsumers holds P2, P2's MyProducers holds P1
Recording Inter-Thread Dependences (assume MESI protocol)
• P1 checkpoints: its dirty lines are written back and logged
• P1's Dep registers are cleared
• LW-ID is cleared; LW-ID should remain set until the line is checkpointed
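The four steps above (write, read, write, checkpoint) can be traced with a small software model. This is a sketch under stated assumptions, not the hardware design: `DepTracker` and its methods are hypothetical names, processors are numbered from 0 (so P1 is index 0), writebacks are treated as instantaneous, and dependences created by writes to another processor's line are not modeled.

```python
class DepTracker:
    """Toy model of the Dep registers and directory LW-ID fields."""

    def __init__(self, num_procs):
        self.my_producers = [0] * num_procs  # one bitmap per processor
        self.my_consumers = [0] * num_procs
        self.lw_id = {}                      # line -> last writer this interval

    def write(self, proc, line):
        # The directory records proc as the last writer of the line.
        self.lw_id[line] = proc

    def read(self, proc, line):
        writer = self.lw_id.get(line)
        if writer is not None and writer != proc:
            # proc consumed data produced by writer: set one bit on each side.
            self.my_producers[proc] |= 1 << writer
            self.my_consumers[writer] |= 1 << proc

    def checkpoint(self, proc):
        # Clear the checkpointing processor's Dep registers.
        self.my_producers[proc] = 0
        self.my_consumers[proc] = 0
        # Assumption: lines are checkpointed immediately, so LW-ID entries
        # naming proc can be cleared right away.
        self.lw_id = {l: w for l, w in self.lw_id.items() if w != proc}
```

Replaying the slides with `t = DepTracker(4)`: after `t.write(0, "x")` and `t.read(1, "x")`, P1's MyConsumers has P2's bit set and P2's MyProducers has P1's bit set; `t.checkpoint(0)` then clears P1's registers and the LW-ID of `"x"`.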
Distributed Checkpointing Protocol in SW
• InteractionSet[Pi]: set of producer processors (transitively) for Pi
– Built using MyProducers
[Figure: P1 initiates a checkpoint; its InteractionSet starts as P1 and grows to P1, P2, P3 as Ck? requests go to P2 and P3; a further Ck? request goes to P4; the processors in the set then checkpoint]
• Rollback is handled similarly using MyConsumers
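The InteractionSet is a transitive closure over the MyProducers bitmaps. The sketch below computes it centrally for illustration only; the slides describe a distributed exchange of Ck? messages, which is not modeled here. The function name `interaction_set` and the 0-based processor numbering (P1 is index 0) are assumptions.

```python
from collections import deque


def interaction_set(initiator, bitmaps):
    """Return every processor reachable from initiator through the bitmaps.

    bitmaps[p] is p's MyProducers bitmap when building a checkpoint set,
    or p's MyConsumers bitmap when building a rollback set.
    """
    seen = {initiator}
    work = deque([initiator])
    while work:
        bits = bitmaps[work.popleft()]
        q = 0
        while bits:
            if bits & 1 and q not in seen:
                seen.add(q)
                work.append(q)
            bits >>= 1
            q += 1
    return seen
```

For example, with `my_producers = [0b0110, 0, 0, 0]` (P1 consumed data from P2 and P3), `interaction_set(0, my_producers)` yields `{0, 1, 2}`, matching the set P1, P2, P3 on the slide. Passing the MyConsumers bitmaps instead gives the set of processors that must roll back together.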
Optimization 1: Delayed Writebacks
[Figure: timelines for intervals I1 and I2; in the baseline, processors sync, stall while dirty lines are written back (WB), and sync again before I2; with delayed writebacks, the stall covers only the sync and the writebacks overlap with I2]
• Checkpointing overhead is dominated by data writebacks
• Delayed Writeback optimization:
– Processors synchronize and resume execution
– Hardware automatically writes back dirty lines in the background
– The checkpoint is complete only when all delayed data has been written back
– Inter-thread dependences on delayed data must still be recorded
Delayed Writeback Pros/Cons
+ Significant reduction in checkpoint overhead
- Additional support:
  • Each processor has two sets of Dep registers
  • Each cache line has a delayed bit
- Increased vulnerability:
  • A rollback event forces both intervals to roll back
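The role of the delayed bit can be seen in a toy cache model. This is a sketch under stated assumptions rather than the hardware mechanism: `DelayedWritebackCache` and its methods are hypothetical names, a write to a line whose delayed bit is set is assumed to force that line's writeback first so the checkpointed value is not lost, and the second set of Dep registers is not modeled.

```python
class DelayedWritebackCache:
    """Toy per-processor cache with a delayed bit per line."""

    def __init__(self):
        self.dirty = set()      # lines dirtied in the current interval
        self.delayed = set()    # lines whose delayed bit is set
        self.written_back = []  # order in which lines reached memory

    def _write_back(self, line):
        self.delayed.discard(line)
        self.written_back.append(line)

    def write(self, line):
        if line in self.delayed:
            # Assumption: save the checkpointed value before overwriting it.
            self._write_back(line)
        self.dirty.add(line)

    def begin_checkpoint(self):
        # Only the case where the previous checkpoint has completed is modeled.
        assert not self.delayed, "previous checkpoint still draining"
        # Mark all dirty lines as delayed; execution can resume immediately.
        self.delayed, self.dirty = self.dirty, set()

    def drain(self, max_lines):
        # Background writeback of up to max_lines delayed lines.
        for line in sorted(self.delayed)[:max_lines]:
            self._write_back(line)

    def checkpoint_complete(self):
        return not self.delayed

    def intervals_to_roll_back(self):
        # Until the delayed data is safe, a fault undoes both intervals.
        return 1 if self.checkpoint_complete() else 2
```

The `intervals_to_roll_back` method captures the "increased vulnerability" point: while any delayed line is still in the cache, a rollback has to discard the interval being checkpointed as well as the one in progress.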
Optimization 2: Multiple Checkpoints
• Problem: fault detection is not instantaneous
– A checkpoint is safe only after the maximum fault-detection latency (L) has elapsed
[Figure: timeline with Ckpt 1 (Dep registers 1) and Ckpt 2 (Dep registers 2), a fault, its detection latency, and the resulting rollback]
• Solution: keep multiple checkpoints
– On a fault, roll back interacting processors to safe checkpoints
• No domino effect
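Choosing a safe checkpoint reduces to a comparison against the detection latency bound L. The sketch below is illustrative only: the function name, the use of cycle counts as timestamps, and the conservative treatment of a checkpoint taken exactly L cycles before detection as unsafe are all assumptions.

```python
import bisect


def latest_safe_checkpoint(ckpt_times, detect_time, max_latency):
    """Return the newest checkpoint taken before the fault could have occurred.

    ckpt_times:  sorted list of cycle counts at which checkpoints were taken
    detect_time: cycle at which the fault was detected
    max_latency: upper bound L on the fault-detection latency, in cycles
    Returns None if no retained checkpoint is old enough to be safe.
    """
    # The fault occurred no earlier than detect_time - max_latency.
    cutoff = detect_time - max_latency
    # Count the checkpoints strictly older than the cutoff.
    i = bisect.bisect_left(ckpt_times, cutoff)
    return ckpt_times[i - 1] if i > 0 else None
```

With checkpoints at cycles 100, 200, and 300, L = 50, and a fault detected at cycle 320, the checkpoint at 300 may already contain the fault, so the rollback target is the one at cycle 200. This is why more than one checkpoint has to be kept.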