Unified Model for Assessing Checkpointing Protocols at Extreme-Scale


  1. Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
     George Bosilca (1), Aurélien Bouteiller (1), Elisabeth Brunet (2), Franck Cappello (3), Jack Dongarra (1), Amina Guermouche (4), Thomas Hérault (1), Yves Robert (1, 4), Frédéric Vivien (4), and Dounia Zaidouni (4)
     1. University of Tennessee Knoxville, USA
     2. Telecom SudParis, France
     3. INRIA & University of Illinois at Urbana Champaign, USA
     4. École Normale Supérieure de Lyon & INRIA, France
     Pittsburgh, June 28, 2012

  2. Protocol Overhead | Accounting for message logging | Instantiating the model | Experimental results
     Motivation
     • Very large number of processing elements (e.g., 2^20) ⇒ the probability of failures increases dramatically
     • Large application to be executed on the whole platform ⇒ failure(s) will most likely occur before completion!
     • Resilience provided through checkpointing
       1. Coordinated protocols
       2. Hierarchical protocols

  3. Which checkpointing protocol to use?
     Coordinated checkpointing
     + No risk of cascading rollbacks
     + No need to log messages
     − All processors need to roll back
     − Rumor: may not scale to very large platforms
     Hierarchical checkpointing
     − Need to log inter-group messages
       • Slows down failure-free execution
       • Increases checkpoint size/time
     + Only processors from the failed group need to roll back
     + Faster re-execution with logged messages
     + Rumor: should scale to very large platforms

  4. Outline
     1. Protocol Overhead
        • Coordinated checkpointing
        • Hierarchical checkpointing
     2. Accounting for message logging
     3. Instantiating the model
        • Applications
        • Platforms
     4. Experimental results
        • Plotting formulas
        • Simulations

  6. Framework
     • Periodic checkpointing policies (of period T)
     • Independent and identically distributed failures
     • Platform failure inter-arrival time: µ
     • Tightly-coupled application: progress ⇔ all processors available
     • First-order approximation: at most one failure within a period
     Waste: fraction of time not spent on useful computations

  7. Checkpointing cost
     [Figure: timeline with the labels "computing the first chunk", "checkpointing the first chunk", "processing the first chunk", "processing the second chunk". Legend: time spent working, time spent checkpointing.]

  8. Checkpointing cost
     [Figure: timeline with the labels "computing the first chunk", "checkpointing the first chunk", "processing the first chunk", "processing the second chunk". Legend: time spent working, time spent checkpointing.]
     Blocking model: while a checkpoint is taken, no computation can be performed.

  9. Checkpointing cost
     [Figure: timeline with the labels "computing the first chunk", "checkpointing the first chunk", "processing the first chunk", "processing the second chunk". Legend: time spent working, time spent checkpointing.]
     Non-blocking model: while a checkpoint is taken, computations are not impacted (e.g., first copy state to RAM, then copy RAM to disk).

  10. Checkpointing cost
      [Figure: timeline with the labels "computing the first chunk", "checkpointing the first chunk", "processing the first chunk". Legend: time spent working, time spent checkpointing, time spent working with slowdown.]
      General model: while a checkpoint is taken, computations are slowed down. During a checkpoint of duration C, the same amount of computation is done as during a time αC without checkpointing (0 ≤ α ≤ 1).
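
      The three checkpointing-cost models differ only in the value of α: the blocking model corresponds to α = 0 and the non-blocking model to α = 1. The sketch below is an illustration only; the function and constant names are hypothetical and do not come from the slides.

```python
# Illustrative sketch; names are assumptions, not from the slides.

BLOCKING = 0.0      # no computation progresses while checkpointing
NON_BLOCKING = 1.0  # computation is not impacted by the checkpoint


def work_during_checkpoint(C, alpha):
    """Useful work (in time units) completed while a checkpoint of
    duration C is taken, under the general model with slowdown alpha."""
    if C < 0:
        raise ValueError("checkpoint duration C must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * C
```

      For example, with C = 10 the blocking model yields 0 units of work during the checkpoint, the non-blocking model yields 10, and an intermediate α = 0.5 yields 5.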

  12. Waste in absence of failures
      [Figure: timeline for processors P0 to P3 over one period T, split into a phase of length T − C and a checkpoint of length C. Legend: time spent working, time spent checkpointing, time spent working with slowdown.]
      Time elapsed since last checkpoint: T
      Amount of computation saved: (T − C) + αC
      $\text{Waste}_{\text{coord-nofailure}} = \frac{T - ((T - C) + \alpha C)}{T} = \frac{(1 - \alpha) C}{T}$
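
      As a quick numerical check of the failure-free waste, the sketch below computes it both from its definition (elapsed time minus saved computation, divided by T) and from the simplified form (1 − α)C/T. The function name is hypothetical.

```python
# Illustrative sketch; the function name is an assumption.

def waste_coord_nofailure(T, C, alpha):
    """Fraction of a period T that is not useful computation when no
    failure occurs (coordinated checkpointing, general model)."""
    if not 0 < C <= T:
        raise ValueError("need 0 < C <= T")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    saved = (T - C) + alpha * C   # computation saved in one period
    return (T - saved) / T        # algebraically equal to (1 - alpha) * C / T
```

      With T = 100 and C = 10, the blocking model (α = 0) wastes 10% of the time, while the non-blocking model (α = 1) wastes nothing in the absence of failures.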

  13. Waste due to failures
      [Figure: timeline for processors P0 to P3. Legend: time spent working, time spent checkpointing, time spent working with slowdown.]
      A failure can happen
      1. During the computation phase
      2. During the checkpointing phase

  16. Waste due to failures
      [Figure: timeline for processors P0 to P3 with the lost time T_lost marked. Legend: time spent working, time spent checkpointing, time spent working with slowdown.]
      Coordinated checkpointing protocol: when one processor is the victim of a failure, all processors lose their work and must roll back to the last checkpoint.

  17. Waste due to failures in computation phase
      [Figure: timeline for processors P0 to P3 with a downtime of length D after the failure. Legend: time spent working, time spent checkpointing, time spent working with slowdown, downtime.]

  18. Waste due to failures in computation phase
      [Figure: timeline for processors P0 to P3 with a recovery of length R. Legend: time spent working, time spent checkpointing, time spent working with slowdown, downtime, recovery time.]
      Coordinated checkpointing protocol: all processors must recover from the last checkpoint.

  19. Waste due to failures in computation phase
      [Figure: timeline for processors P0 to P3 with segments labeled C and αC. Legend: time spent working, time spent checkpointing, time spent working with slowdown, downtime, recovery time, re-executing slowed-down work.]
      Redo the work destroyed by the failure that was done during the checkpointing phase preceding the computation phase. No checkpoint is taken in parallel this time, so this re-computation is faster than the original computation: it takes αC instead of C.

  20. Waste due to failures in computation phase
      [Figure: timeline for processors P0 to P3 with a segment of length T − C. Legend: time spent working, time spent checkpointing, time spent working with slowdown, downtime, recovery time, re-executing slowed-down work.]
      Re-execute the computation phase.

  21. Waste due to failures in computation phase
      [Figure: timeline for processors P0 to P3 with a segment of length C. Legend: time spent working, time spent checkpointing, time spent working with slowdown, downtime, recovery time, re-executing slowed-down work.]
      Finally, the checkpointing phase is executed.
      First-order approximation: we assume that no other failure occurs during the re-execution.

  22. Waste due to failures in computation phase
      [Figure: timeline for processors P0 to P3 with segments T_lost, D, R, αC, T − C, C; the spans T and Δ are marked. Legend: time spent working, time spent checkpointing, time spent working with slowdown, downtime, recovery time, re-executing slowed-down work.]
      Re-Exec: $\Delta - T = T_{\text{lost}} + \alpha C$
      Expectation: $T_{\text{lost}} = \frac{1}{2}(T - C)$
      $\text{Re-Exec}_{\text{coord-fail-in-work}} = \frac{T - C}{2} + \alpha C$

  23. Waste due to failures
      • Failure in the computation phase (probability: $\frac{T - C}{T}$)
        $\text{Re-Exec}_{\text{coord-fail-in-work}} = \frac{T - C}{2} + \alpha C$
      • Failure in the checkpointing phase (probability: $\frac{C}{T}$)
        $\text{Re-Exec}_{\text{coord-fail-in-checkpoint}} = T - C + \frac{C}{2} + \alpha C$
      $\text{Re-Exec}_{\text{coord}} = \frac{T - C}{T}\left(\frac{T - C}{2} + \alpha C\right) + \frac{C}{T}\left(T - C + \frac{C}{2} + \alpha C\right) = \alpha C + \frac{T}{2}$
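
      The weighted average above can be checked with exact rational arithmetic. The sketch below reads the checkpoint-phase term as (T − C) + C/2 + αC, which is the form under which the two cases combine to αC + T/2; the function names and the sample values are hypothetical.

```python
# Illustrative sketch; function names and sample values are assumptions.
from fractions import Fraction


def reexec_fail_in_work(T, C, alpha):
    """Expected re-execution time when the failure hits the computation phase."""
    return (T - C) / 2 + alpha * C


def reexec_fail_in_checkpoint(T, C, alpha):
    """Expected re-execution time when the failure hits the checkpointing phase."""
    return (T - C) + C / 2 + alpha * C


def expected_reexec(T, C, alpha):
    """Average of the two cases, weighted by the fraction of the period
    spent in each phase."""
    p_work = (T - C) / T
    p_checkpoint = C / T
    return (p_work * reexec_fail_in_work(T, C, alpha)
            + p_checkpoint * reexec_fail_in_checkpoint(T, C, alpha))


# Exact check on one sample point: both sides equal 53.
T, C, alpha = Fraction(100), Fraction(10), Fraction(3, 10)
assert expected_reexec(T, C, alpha) == alpha * C + T / 2
```

      Using `Fraction` avoids floating-point rounding, so the equality test is exact rather than approximate.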

  24. Overall waste
      $\text{Waste}_{\text{coord}} = \text{Waste}_{\text{coord-nofailure}} + \frac{1}{\mu}\left(D + R + \text{Re-Exec}_{\text{coord}}\right) = \frac{(1 - \alpha) C}{T} + \frac{1}{\mu}\left(D + R + \alpha C + \frac{T}{2}\right)$
      Minimize $\text{Waste}_{\text{coord}}$ subject to:
      • C ≤ T (by construction)
      • T ≤ 0.1µ (⇒ $\text{Proba}(\text{Poisson}(T/\mu) \ge 2) \le 0.005$)
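
      The constrained minimization can be sketched numerically. The closed form used below, T = sqrt(2(1 − α)Cµ), is not stated on the slide: it is obtained here by setting the derivative of the waste with respect to T to zero, and it is then clamped to the interval [C, 0.1µ], which is valid because the waste is convex in T for T > 0. All function names are hypothetical.

```python
# Illustrative sketch; names and the closed-form period are assumptions
# derived from the waste formula, not taken from the slides.
import math


def waste_coord(T, C, D, R, alpha, mu):
    """Overall waste of coordinated checkpointing for period T."""
    return (1 - alpha) * C / T + (D + R + alpha * C + T / 2) / mu


def best_period(C, alpha, mu):
    """Period minimizing waste_coord subject to C <= T <= 0.1 * mu.
    D and R shift the waste by a constant and do not affect the optimum."""
    lower, upper = C, 0.1 * mu
    if lower > upper:
        raise ValueError("infeasible: C exceeds 0.1 * mu")
    unconstrained = math.sqrt(2 * (1 - alpha) * C * mu)
    return min(max(unconstrained, lower), upper)


def prob_two_or_more_failures(T, mu):
    """P(X >= 2) for X ~ Poisson(T / mu): 1 - P(X = 0) - P(X = 1)."""
    lam = T / mu
    return 1 - math.exp(-lam) * (1 + lam)
```

      For instance, with C = 10, α = 0 and µ = 10000 (same time unit), the unconstrained optimum lies inside [C, 0.1µ] and is returned unchanged; with α = 1 the failure-free waste vanishes and the period is clamped to its lower bound C. Evaluating `prob_two_or_more_failures(0.1 * mu, mu)` confirms the bound of 0.005 quoted for the second constraint.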
