Blocking and Non-blocking Checkpointing and Rollback Recovery for Networks-on-Chip
Claudia Rusu (1), Cristian Grecu (2), Lorena Anghel (1)
(1) TIMA Laboratory, CNRS-UJF-INP, Grenoble, France
(2) SoC Laboratory, University of British Columbia, Vancouver, Canada
WDSN 2008 – Anchorage, AK
OUTLINE
• Introduction
  – Networks-on-Chip
  – Checkpoint and rollback recovery
• Coordinated checkpointing
• Blocking and non-blocking coordinated checkpointing
• Case study
• Conclusions and future work
Network-on-Chip based Systems
• NoC vs. traditional connection systems (P2P, bus)
• NoC advantages
  – Efficient sharing of wires
  – Shorter design time, lower effort
  – Scalability
[Figure: P2P, bus, and NoC interconnects; the NoC is built from routers, PEs, and links.]
NoC QoS vs. Faults
• Quality of service (QoS)
  – reliability, throughput, latency, bandwidth
• Unreliable signal transmission medium
  – timing and data errors
  – process variation, crosstalk, electromagnetic interference, radiation
• Technology downscaling and increased system complexity => increased vulnerability to faults
Fault Tolerance in Networks-on-Chip
• Faults and fault tolerance
  – At different NoC components
    • Links
    • Routers
      – switching blocks
      – memories
  – At different levels of the communication protocol stack (application, transport, network, data link, physical)
• Fault tolerant solutions
  – adaptive routing
  – stochastic communication
  – EDC, ECC, NMR
[Figure: NoC with routers, PEs, and links, with a fault marked; protocol stack layers.]
Checkpoint and Rollback Recovery. Principle
• No failure tolerance
  – Failure => Restart
• Checkpoint and rollback recovery
  – Failure => Resume from a more recent state
  – Principle
    • Failure-free: periodically store states on stable storage
    • Failure: rollback to the last consistent stored state
[Figure: two timelines from start to failure; without fault tolerance the application restarts from the start, with checkpointing it rolls back to the last consistent state.]
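
The principle above can be illustrated with a minimal single-task sketch. Everything here is an assumption made for illustration and does not come from the slides: the class name `CheckpointedTask`, the toy counter state, and the use of an in-memory copy as a stand-in for stable storage.

```python
import copy


class CheckpointedTask:
    """Toy task with a counter state (illustrative names, not from the slides)."""

    def __init__(self, checkpoint_period):
        self.state = {"step": 0, "acc": 0}
        self.checkpoint_period = checkpoint_period
        # Stand-in for stable storage: a copy that survives a failure.
        self.stable = copy.deepcopy(self.state)

    def step(self):
        # One unit of application work.
        self.state["step"] += 1
        self.state["acc"] += self.state["step"]
        # Failure-free operation: periodically store the state.
        if self.state["step"] % self.checkpoint_period == 0:
            self.stable = copy.deepcopy(self.state)

    def fail_and_rollback(self):
        # The volatile state is lost; resume from the last stored state.
        self.state = copy.deepcopy(self.stable)


def run(total_steps, checkpoint_period, fail_at):
    """Run with one failure at step `fail_at`.

    Returns (number of re-executed steps, final accumulator value).
    """
    task = CheckpointedTask(checkpoint_period)
    executed = 0
    failed = False
    while task.state["step"] < total_steps:
        task.step()
        executed += 1
        if not failed and task.state["step"] == fail_at:
            failed = True
            task.fail_and_rollback()
    return executed - total_steps, task.state["acc"]
```

With `run(10, 4, 7)` the task rolls back from step 7 to the checkpoint taken at step 4 and re-executes three steps. With a period longer than the run, for example `run(10, 100, 7)`, no checkpoint is ever stored, so the rollback degenerates into a full restart and seven steps are repeated.
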
Checkpoint and Rollback Recovery. Consistent State
• Message types vs. recovery line: past, future, late, and early (orphan) messages
• Consistent state with late messages
  – early messages are avoided
  – late messages are to be replayed after rollback
[Figure: timelines of tasks T_A and T_B with saved states S_A and S_B. Top: a recovery line crossed by an early/orphan message and a late message. Bottom: a consistent recovery line crossed only by late messages.]
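
The four message types can be told apart by comparing send and receive times with the checkpoint times of the sender and the receiver. The sketch below assumes a single global time axis, as in the slide figure, and assumes that no send or receive event coincides exactly with a checkpoint. The function names and the tuple layout `(src, dst, send_time, recv_time)` are illustrative choices, not taken from the slides.

```python
def classify(send_time, recv_time, sender_ckpt, receiver_ckpt):
    """Classify one message against a recovery line (one checkpoint per task)."""
    sent_before = send_time <= sender_ckpt
    received_before = recv_time <= receiver_ckpt
    if sent_before and received_before:
        return "past"      # fully contained in the saved states
    if not sent_before and not received_before:
        return "future"    # happens entirely after the recovery line
    if sent_before:
        return "late"      # sent before, received after: must be replayed
    return "early"         # received before it was sent (orphan): inconsistent


def is_consistent(messages, ckpts):
    """A recovery line is consistent if no message crosses it as an early message."""
    return all(
        classify(s, r, ckpts[src], ckpts[dst]) != "early"
        for src, dst, s, r in messages
    )


def late_messages(messages, ckpts):
    """Messages that have to be logged and replayed after a rollback."""
    return [
        (src, dst, s, r)
        for src, dst, s, r in messages
        if classify(s, r, ckpts[src], ckpts[dst]) == "late"
    ]
```

For example, with checkpoints `{"A": 3, "B": 8}`, a message sent by A at time 4 and received by B at time 6 is early: after a rollback B would remember a message that A never sent, so this recovery line is rejected.
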
Checkpoint and Rollback Recovery. Classification
• Checkpointing
  – coordinated
    • blocking
    • non-blocking
  – uncoordinated
  – communication-induced
• Message logging
  – optimistic
  – pessimistic
  – causal
Coordinated Checkpointing
• Task checkpoint
  – task state
  – list of late messages
• Late messages log
  – optimistic approach -> small latency on failure-free execution
  – logged at receiver -> small recovery overhead
• Unique coordinator
  – reduced overhead
• Unique blocking and non-blocking synchronization protocol
  – allows, for the same checkpoint, the blocking of one task set and the non-blocking of another
• Principle
  – Failure-free: synchronization –> consistent state
  – Failure: rollback to the last consistent state
[Figure: tasks T_A to T_D over time; synchronizations define global consistent states, with an epoch between consecutive states and a rollback to the last one.]
Synchronization. Markers
• Markers
  – are used to
    • avoid early messages
    • identify late messages and end the log of late messages
  – dedicated messages (avoid long checkpointing durations when communication among certain tasks is scarce)
• A task has taken the checkpoint only after its state and the late messages from other tasks are on stable storage
[Figure, inconsistent state: tasks T_A, T_B, T_C with message 1 (early) and message 2 (late).]
[Figure, consistent state using markers: marker 1 precedes message 1; marker 2 follows message 2 (for replay).]
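
The role of markers can be sketched at the receiver side. This is a Chandy-Lamport-style illustration and may differ from the authors' protocol in its details. It assumes FIFO channels between each pair of tasks, and it assumes that a task which receives a marker before starting its own checkpoint takes the checkpoint immediately. The `Task` class and the `demo` scenario are hypothetical.

```python
class Task:
    """Receiver-side marker handling (illustrative, assumes FIFO channels)."""

    def __init__(self, name, peers):
        self.name = name
        self.peers = list(peers)
        self.state = 0            # number of application messages processed
        self.checkpointed = False
        self.saved_state = None
        self.marker_seen = set()  # senders whose marker has arrived
        self.late_log = []        # late messages logged for replay

    def take_checkpoint(self):
        self.checkpointed = True
        self.saved_state = self.state
        # Markers to be sent on every outgoing channel.
        return [(self.name, peer, "MARKER") for peer in self.peers]

    def receive(self, src, payload):
        if payload == "MARKER":
            out = []
            if not self.checkpointed:
                # Checkpoint before processing anything sent after the marker,
                # so that no early message ends up in the saved state.
                out = self.take_checkpoint()
            self.marker_seen.add(src)  # ends the late-message log for src
            return out
        if self.checkpointed and src not in self.marker_seen:
            # Sent before src's checkpoint, received after ours: late message.
            self.late_log.append((src, payload))
        self.state += 1
        return []

    def checkpoint_complete(self):
        return self.checkpointed and self.marker_seen == set(self.peers)


def demo():
    """Delivery order at B: marker from A, m1 from A, m2 from C, marker from C."""
    a = Task("A", ["B", "C"])
    b = Task("B", ["A", "C"])
    c = Task("C", ["A", "B"])
    a.take_checkpoint()          # A's markers are delivered by hand below
    b.receive("A", "MARKER")     # B checkpoints before it sees m1
    b.receive("A", "m1")         # sent after A's checkpoint: not early, not logged
    b.receive("C", "m2")         # sent before C's checkpoint: logged as late
    c.receive("A", "MARKER")     # C checkpoints
    b.receive("C", "MARKER")     # closes B's log for channel C -> B
    return b
```

In `demo()`, B's saved state excludes `m1`, its late-message log contains only `m2`, and its checkpoint is complete once markers from both A and C have arrived.
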
Blocking and Non-blocking Coordinated Checkpointing Protocol
• Checkpointing protocol
  – Initiator
    - broadcast CK_REQ
    - when CK_TAKEN received from all tasks: validate global checkpoint
  – Non-initiator (blocking or not)
    - on CK_REQ receipt: broadcast CK_START
    - when CK_START received from all tasks: take local checkpoint, send CK_TAKEN to initiator
• Synchronization messages
[Figure: initiator I and tasks T_A, T_B, T_C, T_D exchanging synchronization messages.]
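
The message flow of the protocol can be traced with a small event-queue sketch. It models only the synchronization messages (no application traffic, no network latency), and it assumes that the initiator I is separate from the n tasks, which matches the per-type message counts given on the overhead slide. The function name `run_checkpoint_round` and the task names `T0`, `T1`, ... are illustrative.

```python
from collections import Counter, deque


def run_checkpoint_round(n):
    """Simulate one checkpoint round with initiator "I" and n >= 2 tasks.

    Returns (messages sent per type, whether the global checkpoint was validated).
    """
    if n < 2:
        raise ValueError("this sketch needs at least two tasks")
    tasks = [f"T{i}" for i in range(n)]
    queue = deque()
    sent = Counter()
    start_seen = {t: set() for t in tasks}  # CK_START senders seen by each task
    taken = set()                           # tasks whose CK_TAKEN reached I
    validated = False

    def send(kind, src, dst):
        sent[kind] += 1
        queue.append((kind, src, dst))

    # Initiator: broadcast CK_REQ.
    for t in tasks:
        send("CK_REQ", "I", t)

    while queue:
        kind, src, dst = queue.popleft()
        if kind == "CK_REQ":
            # Non-initiator: on CK_REQ receipt, broadcast CK_START.
            for other in tasks:
                if other != dst:
                    send("CK_START", dst, other)
        elif kind == "CK_START":
            start_seen[dst].add(src)
            if len(start_seen[dst]) == n - 1:
                # CK_START received from all other tasks:
                # take the local checkpoint and report to the initiator.
                send("CK_TAKEN", dst, "I")
        elif kind == "CK_TAKEN":
            taken.add(src)
            if len(taken) == n:
                validated = True  # initiator validates the global checkpoint
    return sent, validated
```

Calling `run_checkpoint_round(4)` gives 4 `CK_REQ`, 12 `CK_START`, and 4 `CK_TAKEN` messages and a validated global checkpoint.
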
Blocking and Non-blocking Overhead
• Synchronization messages (n nodes): O(n^2) in total
  – CK_REQ: n
  – CK_START: n*(n-1)
  – CK_TAKEN: n
• Messages in NoC during checkpointing
  – Blocking: synchronization messages
  – Non-blocking: synchronization messages and application messages
[Figure: initiator I and tasks T_A, T_B, T_C, T_D.]
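
Summing the three per-type counts gives the total number of synchronization messages per checkpoint. The symbol N_sync is introduced here for illustration only. The numeric line assumes one task per processing element of a 4x4 mesh (n = 16) and an initiator that is not counted among the n tasks; neither assumption is stated on this slide.

```latex
\begin{align*}
N_{\text{sync}}(n) &= \underbrace{n}_{\texttt{CK\_REQ}}
                    + \underbrace{n(n-1)}_{\texttt{CK\_START}}
                    + \underbrace{n}_{\texttt{CK\_TAKEN}}
                    = n^2 + n = n(n+1) \in O(n^2) \\
N_{\text{sync}}(16) &= 16 + 240 + 16 = 272
\end{align*}
```

The quadratic CK_START term dominates, so the cost of the all-to-all exchange among tasks is what grows fastest with the network size.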
Checkpointing Duration
• High overhead during checkpointing –> checkpointing phase reduced
• Long checkpointing durations –> reduced number of checkpoints
• When the time between failures is comparable with the checkpointing duration -> repeated rollbacks to the same old checkpoint
[Figure: two timelines of tasks T_A to T_D showing the checkpointing phase and rollbacks.]
Case Study
• 4x4 mesh direct NoC
  – XY routing
  – Wormhole switching
• Consider
  – Different traffic loads
    • uniform traffic loads
    • constant message length
  – Different failure rates
• Analyze
  – Checkpointing duration and overhead
  – Application latency
[Figure: mesh NoC with routers, PEs, and links.]
Checkpointing Duration and Overhead
[Figure: two plots, "Checkpointing Duration" and "Memory Overhead"; plot data not recoverable.]
Application Latency
[Figure: application latency plot; plot data not recoverable.]
Conclusions and Future Work
• Blocking and non-blocking coordinated checkpointing
  – unique protocol
• Analyze and compare overhead and latency
  – Checkpointing duration increases with the traffic load
    • Non-blocking: significantly
    • Blocking: to a lesser extent
  – Application latency increases with the traffic load and the failure rate
    • Non-blocking: significantly
    • Blocking: to a lesser extent
  –> For higher traffic loads and higher failure rates, the blocking approach becomes mandatory
• Future work
  – Evaluate the proposed protocol
    • on other traffic patterns
    • on applications with high traffic loads and critical tasks –> subsets of blocking and non-blocking tasks
Thank you!