Resilient and energy-aware scheduling algorithms

Sections: Introduction, Checkpointing, Replication, Task scheduling, Re-execution speed, Conclusion.
Resilient and energy-aware scheduling algorithms. Anne Benoit, LIP, École Normale Supérieure de Lyon, France. Anne.Benoit@ens-lyon.fr


  1. Coping with fail-stop errors
     Periodic checkpoint, rollback, and recovery.
     Timelines (C = checkpoint, T = work, R = recovery):
       no error:        C T C T C
       fail-stop error: C R T C T C
     Coordinated checkpointing (the platform is a giant macro-processor).
     Assume instantaneous interruption and detection.
     Roll back to the last checkpoint and re-execute.
     Winter School, Feb. 5, 2019. Anne.Benoit@ens-lyon.fr. Resilient and energy-aware scheduling algorithms (slide 11/84).

  8. Coping with silent errors
     Silent error = detection latency: the error is detected only when the corrupted data is activated.
     Same approach as for fail-stop errors?
     [Figure: timeline C T C T C with a silent error, its later detection, and checkpoints marked "corrupted?" and "corrupted!"]
     Keep multiple checkpoints? Which checkpoint to recover from?
     Need an active method to detect silent errors!

  9. Methods for detecting silent errors
     General-purpose approaches:
     - Replication [Fiala et al. 2012] or triple modular redundancy and voting [Lyons and Vanderkulk 1962]
     Application-specific approaches:
     - Algorithm-based fault tolerance (ABFT): checksums in dense matrices; limited to one error detection and/or correction in practice [Huang and Abraham 1984]
     - Partial differential equations (PDE): use a lower-order scheme as verification mechanism [Benson, Schmit and Schreiber 2014]
     - Generalized minimal residual method (GMRES): inner-outer iterations [Hoemmen and Heroux 2011]
     - Preconditioned conjugate gradients (PCG): orthogonalization check every k iterations, re-orthogonalization if a problem is detected [Sao and Vuduc 2013, Chen 2013]
     Data-analytics approaches:
     - Dynamic monitoring of HPC datasets based on physical laws (e.g., temperature limit, speed limit) and space or temporal proximity [Bautista-Gomez and Cappello 2014]
     - Time-series prediction, spatial multivariate interpolation [Di et al. 2014]

  10. Coping with fail-stop and silent errors
      Timelines (V = verification, C = checkpoint, T = work, R = recovery):
        no error:        V C T V C T V C
        fail-stop error: V C R T V C T V C
        silent error:    V C T V R T V C T V C (error in a work segment, detection at the following verification)
      What is the optimal checkpointing period?

  11. Outline
      1. Checkpointing for resilience
         - How to cope with errors?
         - Optimization objective and optimal period
         - Optimal period when accounting for energy consumption
      2. Combining checkpoint with replication
         - Replication analysis
         - Simulations
      3. Back to task scheduling
      4. A different re-execution speed can help
         - Model, optimization problem, optimal solution
         - Simulations
         - Extensions: both fail-stop and silent errors
      5. Summary and need for trade-offs

  12. Optimization objective (1/2)
      Timeline: C T C T C
      - T is the pattern length (time without failures)
      - C is the checkpoint cost
      - E(T) is the expected execution time of the pattern
      The overhead of the pattern is defined as $H(T) = \frac{E(T)}{T} - 1$.
      The overhead measures the fraction of extra time due to:
      - Checkpoints
      - Recoveries and re-executions (failures)
      The goal is to minimize $H(T)$.

  13. Optimization objective (2/2)
      Goal: find the optimal pattern length $T^*$, so that the overhead $H(T) = \frac{E(T)}{T} - 1$ is minimized.
      1. Compute the expected execution time $E(T)$ (exact formula)
      2. Compute the overhead $H(T)$ (first-order approximation)
      3. Derive the optimal $T^*$: fail-stop errors
      4. Derive the optimal $T^*$: silent errors
      5. Derive the optimal $T^*$: both

  15. Step 1: Expected execution time $E(T)$
      - T: pattern length
      - C: checkpoint time
      - R: recovery time
      - $\lambda_f = \frac{1}{\mu_f}$: fail-stop error rate
      Timelines: no error: C T C T C; fail-stop error with recovery: C R T C T C, where $E_{lost}$ is the time lost in the interrupted segment.
      $E(T) = P_{no\text{-}error}\,(T + C) + P_{error}\,\left(E_{lost} + R + E(T)\right)$

  16. Step 1: Expected execution time $E(T)$
      Assume that failures follow an exponential distribution $Exp(\lambda_f)$.
      Independent errors (memoryless property).
      There is at least one error before time t with probability $P(X \le t) = 1 - e^{-\lambda_f t}$ (cdf).
      Probability of failure / no failure:
      $P_{error} = 1 - e^{-\lambda_f T}$
      $P_{no\text{-}error} = e^{-\lambda_f T}$

  18. Step 1: Expected execution time $E(T)$
      Timelines: no error: C T C T C; fail-stop error with recovery: C R T C T C, where $E_{lost}$ is the time lost in the interrupted segment.
      $E(T) = e^{-\lambda_f T}(T + C) + (1 - e^{-\lambda_f T})\left(E_{lost} + R + E(T)\right) = T + C + (e^{\lambda_f T} - 1)\left(E_{lost} + R\right)$
      $E_{lost}$ is the time lost when the failure strikes:
      $E_{lost} = \int_0^{\infty} t\, P(X = t \mid X < T)\, dt = \frac{1}{\lambda_f} - \frac{T}{e^{\lambda_f T} - 1} = \frac{T}{2} + o(\lambda_f T)$
      We lose half the pattern upon failure (in expectation)!
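As a numerical sanity check of the exact expressions for $E_{lost}$ and $E(T)$ on this slide, here is a minimal Python sketch. The function names and the sample values are illustrative assumptions, not part of the slides.

```python
import math

def expected_lost_time(T, lam_f):
    """E_lost = 1/lam_f - T / (e^{lam_f*T} - 1): expected time lost, given
    that a fail-stop error strikes before the end of a pattern of length T."""
    return 1.0 / lam_f - T / math.expm1(lam_f * T)

def expected_pattern_time(T, C, R, lam_f):
    """Exact E(T) = T + C + (e^{lam_f*T} - 1) * (E_lost + R)."""
    return T + C + math.expm1(lam_f * T) * (expected_lost_time(T, lam_f) + R)

if __name__ == "__main__":
    # Hypothetical values: 1-hour pattern, 60 s checkpoint and recovery,
    # error rate of 1e-5 per second.
    T, C, R, lam_f = 3600.0, 60.0, 60.0, 1e-5
    print(expected_lost_time(T, lam_f))        # close to T/2 when lam_f*T is small
    print(expected_pattern_time(T, C, R, lam_f))
```

With a small product $\lambda_f T$, the computed lost time stays within about one percent of $T/2$, which matches the "half the pattern" rule of thumb.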

  20. Step 2: Compute the overhead $H(T)$
      We use a Taylor series to approximate $e^{-\lambda_f T}$ up to first-order terms: $e^{-\lambda_f T} = 1 - \lambda_f T + o(\lambda_f T)$.
      This works well provided that $\lambda_f T$, $\lambda_f C$ and $\lambda_f R$ are all much smaller than 1, i.e., $\mu_f \gg T, C, R$.
      $E(T) = T + C + \lambda_f T \left(\frac{T}{2} + R\right) + o(\lambda_f T)$
      Finally, we get the overhead of the pattern:
      $H(T) = \frac{C}{T} + \frac{\lambda_f T}{2} + o(\lambda_f T)$

  22. Step 3: Derive the optimal $T^*$ for fail-stop errors
      $H(T) = \frac{C}{T} + \frac{\lambda_f T}{2} + o(\lambda_f T)$
      We solve:
      $\frac{\partial H(T)}{\partial T} = -\frac{C}{T^2} + \frac{\lambda_f}{2} = 0$
      Finally, we retrieve:
      $T^* = \sqrt{\frac{2C}{\lambda_f}} = \sqrt{2 \mu_f C}$
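The first-order optimal period for fail-stop errors can be evaluated directly. The sketch below is only an illustration of the formula $T^* = \sqrt{2 \mu_f C}$ and of the first-order overhead $C/T + \lambda_f T/2$; the function names and the sample platform values are assumptions.

```python
import math

def young_daly_period(C, mu_f):
    """First-order optimal pattern length T* = sqrt(2 * mu_f * C)."""
    return math.sqrt(2.0 * mu_f * C)

def first_order_overhead(T, C, lam_f):
    """H(T) = C/T + lam_f*T/2, dropping the lower-order terms."""
    return C / T + lam_f * T / 2.0

if __name__ == "__main__":
    C, mu_f = 60.0, 86400.0  # hypothetical: 1-minute checkpoint, 1-day MTBF
    T_opt = young_daly_period(C, mu_f)
    print(f"T* = {T_opt:.0f} s, H(T*) = {first_order_overhead(T_opt, C, 1 / mu_f):.4f}")
```

For these sample values the period is a little under 54 minutes and the first-order overhead is roughly 3.7%. Halving or doubling the period increases the overhead, as expected at a minimum.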

  23. Step 4: Derive the optimal $T^*$ for silent errors
      Timeline (silent error): V C T V R T V C T V C (error in a work segment, detection at the following verification).
      Similar to fail-stop errors, except:
      - $\lambda_f \to \lambda_s$
      - $E_{lost} = T$
      - V: verification time
      Using the same approach:
      $H(T) = \frac{C + V}{T} + \lambda_s T + o(\lambda_s T)$, where the term $\lambda_s T$ is due to silent errors.

  25. Step 5: Derive the optimal $T^*$ for both errors
      $H(T) = \frac{C + V}{T} + \frac{\lambda_f T}{2} + \lambda_s T + o(\lambda T)$, where $\frac{\lambda_f T}{2}$ is the fail-stop term and $\lambda_s T$ is the silent term.
      First-order approximations [Young 1974, Daly 2006, AB et al. 2016]:
      - Fail-stop errors: pattern $T + C$; optimal $T^* = \sqrt{\frac{C}{\lambda_f / 2}}$; overhead $H^* = 2\sqrt{\frac{\lambda_f}{2}\, C}$
      - Silent errors: pattern $T + V + C$; optimal $T^* = \sqrt{\frac{V + C}{\lambda_s}}$; overhead $H^* = 2\sqrt{\lambda_s (V + C)}$
      - Both errors: pattern $T + V + C$; optimal $T^* = \sqrt{\frac{V + C}{\lambda_s + \lambda_f / 2}}$; overhead $H^* = 2\sqrt{\left(\lambda_s + \frac{\lambda_f}{2}\right)(V + C)}$
      Is this optimal for energy consumption?
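The three cases of this slide share one shape: a per-pattern cost $C + V$ and an effective rate $\lambda_s + \lambda_f/2$. A minimal Python sketch of that observation follows; the function name `optimal_pattern` and the sample values are assumptions made for illustration.

```python
import math

def optimal_pattern(C, V, lam_f, lam_s):
    """Return (T*, H*) from the first-order formulas.

    Fail-stop only: V = 0 and lam_s = 0. Silent only: lam_f = 0.
    """
    cost = C + V                    # checkpoint plus verification per pattern
    rate = lam_s + lam_f / 2.0      # effective error rate of the pattern
    if rate <= 0:
        raise ValueError("at least one error rate must be positive")
    return math.sqrt(cost / rate), 2.0 * math.sqrt(rate * cost)

if __name__ == "__main__":
    # Hypothetical values (seconds and errors per second).
    print(optimal_pattern(C=60.0, V=0.0, lam_f=1e-5, lam_s=0.0))    # fail-stop only
    print(optimal_pattern(C=60.0, V=10.0, lam_f=0.0, lam_s=1e-5))   # silent only
    print(optimal_pattern(C=60.0, V=10.0, lam_f=1e-5, lam_s=1e-5))  # both
```

Adding a second error source shortens the optimal pattern and raises the minimum overhead, which the third call shows relative to the second.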

  27. Energy model (1/2)
      Modern processors are equipped with dynamic voltage and frequency scaling (DVFS) capability.
      The power consumption of a processing unit is $P_{idle} + \kappa \sigma^3$, where $\kappa > 0$ and $\sigma$ is the processing speed.
      Error rate: may also depend on the processing speed. $\lambda(\sigma)$ follows a U-shaped curve:
      - it increases exponentially with decreased processing speed $\sigma$
      - it also increases with increased speed because of high temperature

  28. Energy model (2/2)
      Total power consumption depends on:
      - $P_{idle}$: static power dissipated when the platform is on (even idle)
      - $P_{cpu}(\sigma)$: dynamic power spent by operating the CPU at speed $\sigma$
      - $P_{io}$: dynamic power spent by I/O transfers (checkpoints and recoveries)
      Computation and verification: power depends upon $\sigma$ (total time $T_{cpu}(\sigma)$).
      Checkpointing and recovering: I/O transfers (total time $T_{io}$).
      Total energy consumption:
      $Energy(\sigma) = T_{cpu}(\sigma)\,(P_{idle} + P_{cpu}(\sigma)) + T_{io}\,(P_{idle} + P_{io})$
      - Checkpoint: $E_C = C\,(P_{idle} + P_{io})$
      - Recover: $E_R = R\,(P_{idle} + P_{io})$
      - Verify at speed $\sigma$: $E_V(\sigma) = V(\sigma)\,(P_{idle} + P_{cpu}(\sigma))$
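The energy accounting above is a sum of two time-weighted power terms. The sketch below illustrates it in Python, assuming the dynamic CPU power is the cubic term $\kappa \sigma^3$ of the previous slide; the function names and the numbers are hypothetical.

```python
def dynamic_cpu_power(sigma, kappa):
    """Assumed dynamic power model: P_cpu(sigma) = kappa * sigma**3."""
    return kappa * sigma ** 3

def total_energy(T_cpu, T_io, P_idle, P_cpu, P_io):
    """Energy = T_cpu * (P_idle + P_cpu) + T_io * (P_idle + P_io)."""
    return T_cpu * (P_idle + P_cpu) + T_io * (P_idle + P_io)

if __name__ == "__main__":
    # Hypothetical values: 100 s of computation and verification, 10 s of I/O,
    # 50 W idle power, 20 W I/O power, kappa = 150 and normalized speed 1.
    P_cpu = dynamic_cpu_power(sigma=1.0, kappa=150.0)
    print(total_energy(T_cpu=100.0, T_io=10.0, P_idle=50.0, P_cpu=P_cpu, P_io=20.0))
```

The static power $P_{idle}$ is paid during both phases, which is why slowing down the CPU does not always save energy: the computation time grows while the idle power keeps being drawn.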

  30. Bi-criteria problem
      Linear combination of execution time and energy consumption: $a \cdot Time + b \cdot Energy$.
      Theorem. Consider an application subject to both fail-stop and silent errors. To minimize $a \cdot Time + b \cdot Energy$, the optimal checkpointing period is
      $T^*(\sigma) = \sqrt{\frac{2\,(V(\sigma) + C_e(\sigma))}{\lambda_f(\sigma) + 2\lambda_s(\sigma)}}$, where $C_e(\sigma) = \frac{a + b\,(P_{idle} + P_{io})}{a + b\,(P_{idle} + P_{cpu}(\sigma))}\, C$.
      Similar optimal period as without energy, $T^* = \sqrt{\frac{2\,(V + C)}{\lambda_f + 2\lambda_s}}$, but accounting for the new parameters!
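To see how the weights a and b change the period, the theorem's formula can be coded directly. This is a sketch under the assumption that V, the error rates, and $P_{cpu}$ are already evaluated at the chosen speed $\sigma$; the function name and the sample values are illustrative.

```python
import math

def energy_aware_period(V, C, lam_f, lam_s, a, b, P_idle, P_io, P_cpu):
    """T* = sqrt(2 (V + C_e) / (lam_f + 2 lam_s)), with
    C_e = (a + b (P_idle + P_io)) / (a + b (P_idle + P_cpu)) * C."""
    C_e = (a + b * (P_idle + P_io)) / (a + b * (P_idle + P_cpu)) * C
    return math.sqrt(2.0 * (V + C_e) / (lam_f + 2.0 * lam_s))

if __name__ == "__main__":
    # Hypothetical platform values.
    common = dict(V=10.0, C=60.0, lam_f=1e-5, lam_s=1e-5,
                  P_idle=50.0, P_io=20.0, P_cpu=150.0)
    print(energy_aware_period(a=1.0, b=0.0, **common))  # time only
    print(energy_aware_period(a=0.0, b=1.0, **common))  # energy only
```

With b = 0 the expression reduces to the time-only period. When I/O draws less power than computation, the effective checkpoint cost $C_e$ shrinks, so the energy-oriented period is shorter.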

  32. When Amdahl meets Young/Daly
      Error-free speedup with P processors and sequential fraction $\alpha$:
      Amdahl's Law: $S(P) = \frac{1}{\alpha + \frac{1 - \alpha}{P}}$
      - Bounded above by $1/\alpha$
      - Strictly increasing function of P
      Allocating more processors on an error-prone platform?
      - Higher error-free speedup
      - More errors/faults
      - More frequent checkpointing
      - More resilience overhead
      We can compute the optimal processor allocation and checkpointing interval!
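A short Python sketch of Amdahl's Law makes the two stated properties easy to check numerically. The function name and the sample sequential fraction are assumptions.

```python
def amdahl_speedup(P, alpha):
    """Error-free speedup S(P) = 1 / (alpha + (1 - alpha) / P)."""
    return 1.0 / (alpha + (1.0 - alpha) / P)

if __name__ == "__main__":
    alpha = 0.05  # hypothetical sequential fraction
    for P in (1, 10, 100, 1000, 10000):
        print(P, round(amdahl_speedup(P, alpha), 2))
```

The printed speedups grow with P but stay below $1/\alpha$, so the extra errors brought by additional processors eventually outweigh the shrinking speedup gain.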

  35. How is replication used?
      On a Q-processor platform, the application is replicated n times:
      - Duplication: each replica has P = Q/2 processors
      - Triplication: each replica has P = Q/3 processors
      - General case: each replica has P = Q/n processors
      Having more replicas on an error-prone platform?
      - Lower error-free speedup
      - More resilient
      - Lower checkpointing frequency
      - Less resilience overhead
      Optimal replication level, processor allocation per replica, and checkpointing interval?

  40. Why is replication useful?
      [Figure: error detection with duplication; error correction with triplication]

  43. Two replication modes
      [Figure: process replication; group replication]

  46. Probability of failure
      Independent process error distribution: exponential $Exp(\lambda)$, $\lambda = 1/\mu$ (memoryless).
      Error probability of one process during T time of computation: $P(T) = 1 - e^{-\lambda T}$.
      Process triplication. Failure probability of any triplicated process:
      $P^{prc}_3(T, 1) = \binom{3}{2}\left(1 - P(T)\right) P(T)^2 + P(T)^3 = 3 e^{-\lambda T}\left(1 - e^{-\lambda T}\right)^2 + \left(1 - e^{-\lambda T}\right)^3 = 1 - 3 e^{-2\lambda T} + 2 e^{-3\lambda T}$
      Failure probability of a P-process application:
      $P^{prc}_3(T, P) = 1 - P(\text{"No process fails"}) = 1 - \left(1 - P^{prc}_3(T, 1)\right)^P = 1 - \left(3 e^{-2\lambda T} - 2 e^{-3\lambda T}\right)^P$

  50. Probability of failure
      Group triplication. Failure probability of any P-process group:
      $P^{grp}_1(T, P) = 1 - P(\text{"No process in group fails"}) = 1 - \left(1 - P(T)\right)^P = 1 - e^{-\lambda P T}$
      Failure probability of the three-group application:
      $P^{grp}_3(T, P) = \binom{3}{2}\left(1 - P^{grp}_1(T, P)\right) P^{grp}_1(T, P)^2 + P^{grp}_1(T, P)^3 = 3 e^{-\lambda P T}\left(1 - e^{-\lambda P T}\right)^2 + \left(1 - e^{-\lambda P T}\right)^3 = 1 - 3 e^{-2\lambda P T} + 2 e^{-3\lambda P T}$
      $P^{grp}_3(T, P) > 1 - \left(3 e^{-2\lambda T} - 2 e^{-3\lambda T}\right)^P = P^{prc}_3(T, P)$
      What about duplication? Any error is fatal in both cases:
      $P^{prc}_2(T, P) = P^{grp}_2(T, P) = 1 - e^{-2\lambda P T}$
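The closed forms for process triplication, group triplication, and duplication can be compared numerically. The following Python sketch is illustrative only; the function names and the sample rate, pattern length, and process count are assumptions.

```python
import math

def fail_process_triplication(T, P, lam):
    """1 - (3 e^{-2 lam T} - 2 e^{-3 lam T})^P."""
    q = math.exp(-lam * T)              # survival probability of one process
    return 1.0 - (3 * q**2 - 2 * q**3) ** P

def fail_group_triplication(T, P, lam):
    """1 - 3 e^{-2 lam P T} + 2 e^{-3 lam P T}."""
    q = math.exp(-lam * P * T)          # survival probability of one group
    return 1.0 - 3 * q**2 + 2 * q**3

def fail_duplication(T, P, lam):
    """1 - e^{-2 lam P T}: any error among the 2P processes is fatal."""
    return -math.expm1(-2 * lam * P * T)

if __name__ == "__main__":
    T, P, lam = 1000.0, 100, 1e-7       # hypothetical values
    print(fail_process_triplication(T, P, lam))
    print(fail_group_triplication(T, P, lam))
    print(fail_duplication(T, P, lam))
```

For these values the three probabilities are ordered process triplication < group triplication < duplication, and the two triplication modes coincide when P = 1.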

  52. Two observations
      Observation 1 (Implementation):
      - Process replication is more resilient than group replication (assuming the same overhead)
      - Group replication is easier to implement, by treating the application as a black box
      Observation 2 (Analysis): the following two scenarios are equivalent w.r.t. failure probability:
      - Group replication with n replicas, where each replica has P processes and each process has error rate $\lambda$
      - Process replication with one process, which has error rate $\lambda P$ and is replicated n times
      Benefit for the analysis: Group(n, P, $\lambda$) → Process(n, 1, $\lambda P$)

  53. Analysis steps
      Maximize the error-aware speedup $S_n(T, P) = \frac{S(P)}{E_n(T, P) / T}$:
      1. Derive the failure probability $P^{prc}_n(T, P)$ or $P^{grp}_n(T, P)$ (exact)
      2. Compute the expected execution time $E_n(T, P)$ (exact)
      3. Compute a first-order approximation of the error-aware speedup $S_n(T, P)$
      4. Derive the optimal $T_{opt}$, $P_{opt}$ and get $S_n(T_{opt}, P_{opt})$
      5. Choose the right replication level n

  56. Introduction Checkpointing Replication Task scheduling Re-execution speed Conclusion Analytical results Duplication : On a platform with Q processors and checkpointing cost C , the optimal resilience parameters for process/group duplication are:  � 1  � 2 1 � 3 1 � 1 − α Q   P opt = min 2 , 2 α C λ   � 1 � C 2 T opt = 2 λ P opt S ( P opt ) S opt = � 1 � 1 + 2 2 λ CP opt 2 Triplication & ( n , k ) -replication ( k -out-of- n replica consensus): similar results but different for process and group, less practical for n > 3 For α > 0, not necessarily use up all available Q processors Checkpointing interval T opt nicely extends Young/Daly’s result Error-aware speedup S opt minimally affected for small λ Winter School, Feb. 5, 2019 Anne.Benoit@ens-lyon.fr Resilient and energy-aware scheduling algorithms 39/ 84

  57. Results comparison
      For fully parallel jobs, i.e., α = 0 (similar for α > 0)

      Duplication vs. process triplication
        Processors (↓):       $P_{opt} = \dfrac{Q}{2}$   vs.   $P_{opt} = \dfrac{Q}{3}$
        Chkpt interval (↑):   $T_{opt} = \sqrt{\dfrac{C}{\lambda Q}}$   vs.   $T_{opt} = \sqrt[3]{\dfrac{C}{2 \lambda^2 Q}}$
        Exp. speedup (??):    $S_{opt} = \dfrac{Q/2}{1 + 2\sqrt{\lambda C Q}}$   vs.   $S_{opt} = \dfrac{Q/3}{1 + 3\sqrt[3]{\left(\frac{\lambda C}{2}\right)^2 Q}}$

      Process triplication vs. group triplication
        Processors (=):       $P_{opt} = \dfrac{Q}{3}$   vs.   $P_{opt} = \dfrac{Q}{3}$
        Chkpt interval (↓):   $T_{opt} = \sqrt[3]{\dfrac{C}{2 \lambda^2 Q}}$   vs.   $T_{opt} = \sqrt[3]{\dfrac{3C}{2 (\lambda Q)^2}}$
        Exp. speedup (↓):     $S_{opt} = \dfrac{Q/3}{1 + 3\sqrt[3]{\left(\frac{\lambda C}{2}\right)^2 Q}}$   vs.   $S_{opt} = \dfrac{Q/3}{1 + 3\sqrt[3]{\frac{1}{3}\left(\frac{\lambda C Q}{2}\right)^2}}$
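The three α = 0 cases can be compared side by side with a few lines of Python. This is a minimal sketch of the closed-form expressions only; the function names and the sample values of Q, C and λ are assumptions made for illustration.

```python
import math


def duplication(q, c, lam):
    """(P_opt, T_opt, S_opt) for duplication with alpha = 0."""
    p = q / 2.0
    t = math.sqrt(c / (lam * q))
    s = p / (1.0 + 2.0 * math.sqrt(lam * c * q))
    return p, t, s


def process_triplication(q, c, lam):
    """(P_opt, T_opt, S_opt) for process triplication with alpha = 0."""
    p = q / 3.0
    t = (c / (2.0 * lam ** 2 * q)) ** (1.0 / 3.0)
    s = p / (1.0 + 3.0 * ((lam * c / 2.0) ** 2 * q) ** (1.0 / 3.0))
    return p, t, s


def group_triplication(q, c, lam):
    """(P_opt, T_opt, S_opt) for group triplication with alpha = 0."""
    p = q / 3.0
    t = (3.0 * c / (2.0 * (lam * q) ** 2)) ** (1.0 / 3.0)
    s = p / (1.0 + 3.0 * ((lam * c * q / 2.0) ** 2 / 3.0) ** (1.0 / 3.0))
    return p, t, s


if __name__ == "__main__":
    q, c = 1e6, 1800.0  # hypothetical platform size and checkpoint cost
    for lam in (1e-12, 1e-9):  # hypothetical error rates
        for name, fn in (("duplication", duplication),
                         ("process triplication", process_triplication),
                         ("group triplication", group_triplication)):
            p, t, s = fn(q, c, lam)
            print(f"lam={lam:g} {name}: P={p:.0f} T={t:.1f} efficiency={s / q:.3f}")
```

The "??" on the speedup row shows up in the numbers: with these sample values duplication gives the higher speedup at the smaller error rate, and process triplication at the larger one.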

  59. Results comparison (continued)
      Choosing the right mode and level of replication: based on the analytical results, the application's output structure, and system/language support

  60. Outline
      1. Checkpointing for resilience
         How to cope with errors?
         Optimization objective and optimal period
         Optimal period when accounting for energy consumption
      2. Combining checkpoint with replication
         Replication analysis
         Simulations
      3. Back to task scheduling
      4. A different re-execution speed can help
         Model, optimization problem, optimal solution
         Simulations
         Extensions: both fail-stop and silent errors
      5. Summary and need for trade-offs

  61. Simulations
      Consider a platform with $Q = 10^6$, and study Efficiency $= S_{opt} / Q$
      Impact of MTBE and checkpointing cost C
      Impact of sequential fraction α
      Impact of number of processes P

  62. Impact of MTBE and checkpointing cost
      $\alpha = 10^{-6}$
      [Figure: efficiency vs. system MTBE (from 10^6 down to 10^2) for duplication, process triplication and group triplication, simulated (Sim.) and theoretical (Th.); panels (a) C = 1800 s and (b) C = 60 s]
      First-order approximation is accurate except for duplication (where P is larger) and with small MTBE
      Duplication can be sufficient for large MTBE, especially for small checkpointing cost

  63. Impact of sequential fraction
      C = 1800 s
      [Figure: efficiency vs. system MTBE (from 10^6 down to 10^2) for duplication, process triplication and group triplication, simulated (Sim.) and theoretical (Th.); panels (c) $\alpha = 10^{-7}$, (d) $\alpha = 10^{-6}$ and (e) $\alpha = 10^{-5}$]
      Increased α reduces efficiency
      Increased α increases the minimum MTBE for which duplication is sufficient

  64. Impact of number of processes
      $\alpha = 10^{-5}$, C = 1800 s
      [Figure: panels (f) MTBE = 10^4 and (g) MTBE = 10^3]
      Efficiency/speedup not strictly increasing with P
      First-order $P_{opt}$ close to actual optimum

  65. What to remember
      "Replication + checkpointing" as a general-purpose fault-tolerance protocol for detecting/correcting silent errors in HPC
      Process replication is more resilient than group replication, but group replication is easier to implement
      Analytical solution for $P_{opt}$, $T_{opt}$, and $S_{opt}$, and for choosing the right replication mode and level

  67. Chains of tasks
      High-performance computing (HPC) application: chain of tasks $T_1 \to T_2 \to \dots \to T_n$
      Parallel tasks executed on the whole platform
      For instance: tightly-coupled computational kernels, image processing applications, ...
      Goal: efficient execution, i.e., minimize total execution time
      Checkpoints can only be taken after a task has completed

  69. Dynamic programming algorithm without replication
      Possibility to add a verification (V), a memory checkpoint (M) and a disk checkpoint (D) at the end of a task
      [Figure: task chain from $T_0$ to $T_{d_2}$ with verifications, memory checkpoints and disk checkpoints; $E_{disk}(d_1)$ covers the chain up to the disk checkpoint after $T_{d_1}$, and $E(d_1, d_2)$ covers $T_{d_1+1}$ to $T_{d_2}$]
      $E_{disk}(d_2) = \min_{0 \le d_1 < d_2} \{ E_{disk}(d_1) + E(d_1, d_2) + C_D \}$
      Initialization: $E_{disk}(0) = 0$
      Objective: compute $E_{disk}(n)$
      Compute $E_{disk}(0), E_{disk}(1), E_{disk}(2), \dots, E_{disk}(n)$ in that order
      Complexity: $O(n^2)$
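The disk-level recurrence can be sketched in Python as follows. This is an illustration only: `segment_cost(d1, d2)` stands in for E(d1, d2), whose actual computation is not given on the slide, and `make_toy_cost` is a hypothetical placeholder cost (work plus a first-order re-execution term) used just to make the example runnable.

```python
from itertools import accumulate


def optimal_disk_checkpoints(n, segment_cost, c_disk):
    """Compute E_disk(n) and the tasks after which a disk checkpoint is taken.

    Implements E_disk(d2) = min over 0 <= d1 < d2 of
    E_disk(d1) + E(d1, d2) + C_D, with E_disk(0) = 0.
    """
    e_disk = [0.0] + [float("inf")] * n
    best_prev = [0] * (n + 1)
    for d2 in range(1, n + 1):
        for d1 in range(d2):
            candidate = e_disk[d1] + segment_cost(d1, d2) + c_disk
            if candidate < e_disk[d2]:
                e_disk[d2] = candidate
                best_prev[d2] = d1
    # Walk back through the stored choices to recover the checkpoint positions.
    positions = []
    d = n
    while d > 0:
        positions.append(d)
        d = best_prev[d]
    positions.reverse()
    return e_disk[n], positions


def make_toy_cost(weights, lam):
    """Hypothetical stand-in for E(d1, d2): w + lam * w^2 / 2 for segment work w."""
    prefix = [0.0] + list(accumulate(weights))

    def cost(d1, d2):
        w = prefix[d2] - prefix[d1]
        return w + lam * w * w / 2.0

    return cost


if __name__ == "__main__":
    cost = make_toy_cost([10.0, 10.0, 10.0, 10.0], lam=0.01)
    print(optimal_disk_checkpoints(4, cost, c_disk=2.0))
```

The two nested loops give the O(n^2) behavior stated on the slide, provided each call to `segment_cost` takes constant time.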

  70. Coping with fail-stop errors with replication
      [Figure: three timelines of a five-task chain with checkpoints $C_1$, $C_3$, $C_5$; $T_1$ and $T_4$ are replicated, each replica running on $p/2$ processors, while $T_2$, $T_3$ and $T_5$ run on all $p$ processors; a fail-stop error strikes in the second and third timelines]
      The whole platform is used at all times; some tasks are replicated
      If a failure hits a replicated task, there is no need to roll back
      Otherwise, roll back to the last checkpoint and re-execute

  73. Dynamic programming algorithm with replication
      Recursively computes the expectation of the optimal time required to execute tasks $T_1$ to $T_i$ and then checkpoint $T_i$
      Distinguish whether $T_i$ is replicated or not
      $T^{rep}_{opt}(i)$: knowing that $T_i$ is replicated
      $T^{norep}_{opt}(i)$: knowing that $T_i$ is not replicated
      Solution: $\min \left\{ T^{rep}_{opt}(n) + C^{rep}_n,\ T^{norep}_{opt}(n) + C^{norep}_n \right\}$

  74. Computing $T^{rep}_{opt}(j)$: j is replicated
      $T^{rep}_{opt}(j) = \min_{1 \le i < j} \begin{cases} T^{rep}_{opt}(i) + C^{rep}_i + T^{rep,rep}_{NC}(i+1, j), \\ T^{rep}_{opt}(i) + C^{rep}_i + T^{norep,rep}_{NC}(i+1, j), \\ T^{norep}_{opt}(i) + C^{norep}_i + T^{rep,rep}_{NC}(i+1, j), \\ T^{norep}_{opt}(i) + C^{norep}_i + T^{norep,rep}_{NC}(i+1, j), \\ R^{rep}_1 + T^{rep,rep}_{NC}(1, j), \\ R^{norep}_1 + T^{norep,rep}_{NC}(1, j) \end{cases}$
      $T_i$: last checkpointed task before $T_j$
      $T_i$ can be replicated or not, $T_{i+1}$ can be replicated or not
      $T^{A,B}_{NC}$: no intermediate checkpoint, first/last task replicated or not, previous task checkpointed: complicated formula, but computed in constant time
      Similar equation for $T^{norep}_{opt}(j)$
      Overall complexity: $O(n^2)$
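The structure of this recurrence can be sketched in Python. The sketch makes several assumptions: the slide does not give the formula for T_NC, so `t_nc` is a caller-supplied function and `toy_t_nc` is a purely hypothetical placeholder; the equation for the non-replicated case is taken to have the same shape with the last task not replicated; and all names (`optimal_chain_time`, `ckpt`, `rec`) are made up for illustration.

```python
import math

MODES = ("rep", "norep")


def optimal_chain_time(n, t_nc, ckpt, rec):
    """Expected optimal time for tasks T_1..T_n, ending with a checkpoint of T_n.

    t_nc(first, last, i, j): expected time of T_i..T_j without intermediate
        checkpoint, given whether the first and last tasks are replicated.
    ckpt[mode][i]: checkpoint cost of T_i (index 0 unused).
    rec[mode]: initial recovery cost R_1 for a replicated / non-replicated T_1.
    """
    t_opt = {mode: [math.inf] * (n + 1) for mode in MODES}
    for j in range(1, n + 1):
        for last in MODES:
            # Cases with no checkpoint before T_j: start from the beginning.
            best = min(rec[first] + t_nc(first, last, 1, j) for first in MODES)
            # Cases where T_i (i < j) is the last checkpointed task before T_j.
            for i in range(1, j):
                for prev in MODES:
                    for first in MODES:
                        candidate = (t_opt[prev][i] + ckpt[prev][i]
                                     + t_nc(first, last, i + 1, j))
                        best = min(best, candidate)
            t_opt[last][j] = best
    return min(t_opt[mode][n] + ckpt[mode][n] for mode in MODES)


def toy_t_nc(first, last, i, j):
    """Hypothetical placeholder for T_NC; NOT the formula from the talk."""
    if i == j and first != last:
        return math.inf  # a single task cannot be both replicated and not
    length = j - i + 1
    return length + 0.05 * length ** 2


if __name__ == "__main__":
    n = 4
    ckpt = {"rep": [0.0] + [0.3] * n, "norep": [0.0] + [0.2] * n}
    rec = {"rep": 0.0, "norep": 0.0}
    print(optimal_chain_time(n, toy_t_nc, ckpt, rec))
```

Each pair (i, j) is visited a constant number of times, so the sketch matches the O(n^2) complexity as long as `t_nc` runs in constant time.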

  75. Comparison to checkpoint only
      With identical tasks
      Reports the occurrence of checkpoints and replicas in the optimal solution
      Checkpointing cost ≤ task length ⇒ no replication
      [Figure: map of the optimal strategy (None, Checkpointing Only, Replication Only, Checkpointing+Replication); vertical axis: error rate, from 1.00e-08 to 1.05e-02; horizontal axis: checkpoint/recovery cost over task length ratio, from 1.0e-03 to 1.0e+03]

  76. Summary
      Goal: minimize execution time of linear workflows
      Decide which task to checkpoint and/or replicate
      Sophisticated dynamic programming algorithms: optimal solutions
      Even when accounting for energy: decide at which speed to execute each task
      Even with k different levels of checkpoints and partial verifications: algorithm in $O(n^{k+5})$
      Simulations: with replication, the gain over the checkpoint-only approach is quite significant when checkpoints are costly and the error rate is high

  78. Silent vs fail-stop errors
      C: time to checkpoint; V: time to verify; R: time to recover; λ: error rate (platform MTBF μ = 1/λ)
      Optimal checkpointing period W for fail-stop errors (Young/Daly): $W = \sqrt{2 C \mu}$ (V = 0)
      [Figure: timeline with a fail-stop error striking during a period W, followed by a recovery R and re-execution]
      Silent errors: $W = \sqrt{(V + C)\mu}$ (C → V + C; missing factor 2)
      [Figure: timeline with a silent error detected by the verification V at the end of the period, followed by a recovery R and re-execution]
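The two period formulas are easy to compare numerically. The snippet below is a small sketch with hypothetical function names and sample values. One common way to read the missing factor 2: a fail-stop error is detected immediately and, to first order, loses half a period of work on average, whereas a silent error is only caught by the verification at the end of the period, so the whole period is lost.

```python
import math


def failstop_period(c, mu):
    """Young/Daly period for fail-stop errors: W = sqrt(2 * C * mu), with V = 0."""
    return math.sqrt(2.0 * c * mu)


def silent_period(v, c, mu):
    """Period for silent errors with verified checkpoints: W = sqrt((V + C) * mu)."""
    return math.sqrt((v + c) * mu)


if __name__ == "__main__":
    mu = 10000.0  # hypothetical platform MTBF, in seconds
    print(failstop_period(c=50.0, mu=mu))
    print(silent_period(v=20.0, c=80.0, mu=mu))
```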

  79. Back to energy consumption
      Need to reduce the energy consumption of future platforms
      Popular technique: dynamic voltage and frequency scaling (DVFS)
      Lower speed → energy savings: when computing at speed σ, power is proportional to $\sigma^3$ and execution time is proportional to $1/\sigma$ → (dynamic) energy proportional to $\sigma^2$
      Also account for static energy: trade-offs to be found
      Realistic approach: minimize energy consumption while guaranteeing a performance bound
      ⇒ At which speed should we execute the workload?

  82. Framework
      Divisible-load applications
      Subject to silent data corruption
      Checkpoint/restart strategy: periodic patterns that repeat over time
      Verified checkpoints
      Is it better to use two different speeds rather than only one?
      What are the optimal checkpointing period and optimal execution speeds?

  83. Model
      Set of speeds $S = \{s_1, \dots, s_K\}$: $\sigma_1 \in S$ speed for first execution, $\sigma_2 \in S$ speed for re-executions
      Silent errors: exponential distribution of rate λ
      Verification: V units of work; Checkpointing: time C; Recovery: time R
      $P_{idle}$ and $P_{io}$ constant; and $P_{cpu}(\sigma) = \kappa \sigma^3$
      Energy for W units of work at speed σ: $\dfrac{W}{\sigma} (P_{idle} + \kappa \sigma^3)$
      Energy of a verification at speed σ: $\dfrac{V}{\sigma} (P_{idle} + \kappa \sigma^3)$
      Energy of a checkpoint: $C (P_{idle} + P_{io})$
      Energy of a recovery: $R (P_{idle} + P_{io})$
      [Figure: timeline of periodic patterns with a silent error (work segments labeled T(p, n), each followed by a verification V and a checkpoint C), showing the silent error, its detection at the next verification, and a recovery R]
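The energy terms of the model translate directly into code. The following sketch only covers an error-free pattern (work, verification, checkpoint); the function names and the numeric values are hypothetical, and the expected energy with errors and re-execution at speed σ2 is deliberately left out.

```python
def compute_energy(work, sigma, p_idle, kappa):
    """Energy of `work` units executed at speed sigma: (work / sigma) * (P_idle + kappa * sigma^3)."""
    return (work / sigma) * (p_idle + kappa * sigma ** 3)


def io_energy(duration, p_idle, p_io):
    """Energy of a checkpoint or a recovery lasting `duration`: duration * (P_idle + P_io)."""
    return duration * (p_idle + p_io)


def error_free_pattern_energy(w, v, c, sigma, p_idle, p_io, kappa):
    """Energy of one pattern without error: W units of work, a verification, a checkpoint."""
    return (compute_energy(w, sigma, p_idle, kappa)
            + compute_energy(v, sigma, p_idle, kappa)
            + io_energy(c, p_idle, p_io))


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the formulas.
    for sigma in (1.0, 2.0, 4.0):
        energy = error_free_pattern_energy(w=100.0, v=10.0, c=5.0, sigma=sigma,
                                           p_idle=10.0, p_io=3.0, kappa=1.0)
        print(sigma, energy)
```

Evaluating the pattern energy over the discrete speed set shows the trade-off between static energy (which favors higher speeds) and dynamic energy (which favors lower speeds).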
