Robustness of the Young/Daly formula for stochastic iterative applications
Yishu Du (1, 2), Loris Marchal (2), Guillaume Pallez (3), Yves Robert (2, 4)
(1) Tongji University, China
(2) CNRS, ENS Lyon and Inria, France
(3) Inria and University of Bordeaux, France
(4) University of Tennessee, USA
August 18, 2020
Contents
1 Introduction
2 Model
3 Static strategy
4 Dynamic strategy
5 Experiments
6 Conclusion
Yishu Du (ENS Lyon, LIP) ICPP August 18, 2020
The road to Exascale
Two growth rates have been observed.
What are the barriers on the road to achieving Exascale?
The road to Exascale
In February 2014, the Advanced Scientific Computing Advisory Committee published its top ten challenges for the development of an Exascale system.
We focus here on one of them: resilience.
Why resilience?
Supercomputers enroll huge numbers of processors;
the Mean Time Between Failures (MTBF) of each individual component is $\mu_{ind}$;
the MTBF of $P$ processors is $\mu_P = \mu_{ind} / P$;
the most powerful computers in the Top 500 lists are victims of at least one failure a day;
. . .
One processor: MTBF ≈ 10 years. Petascale: MTBF ≈ 1 hour. Exascale: MTBF ≈ 5 minutes.
Need for fault-tolerance algorithms!
The fault rate is proportional to the number of components.
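The scaling $\mu_P = \mu_{ind} / P$ can be checked with a few lines of Python. This is only a sketch: the processor counts below are illustrative assumptions chosen to land near the orders of magnitude quoted on the slide, not figures taken from it.

```python
# Platform MTBF under the rule mu_P = mu_ind / P.
# The processor counts are illustrative assumptions, not values from the slides.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def platform_mtbf(mu_ind: float, num_procs: int) -> float:
    """MTBF of a platform with num_procs components, in the unit of mu_ind."""
    return mu_ind / num_procs


# One processor: MTBF of about 10 years, expressed in minutes.
mu_ind_minutes = 10 * MINUTES_PER_YEAR

for procs in (1, 10**5, 10**6):
    mtbf = platform_mtbf(mu_ind_minutes, procs)
    print(f"{procs:>8} processors: MTBF = {mtbf:.1f} minutes")
```

With these assumed counts, a hundred thousand processors give an MTBF of slightly under an hour and a million processors give a few minutes.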
Fail-Stop Errors
Fail-stop errors: hardware failures or crashes.
Effects:
they are quickly detected;
the execution stops;
the entire content of local memory is lost;
computation has to be restarted from the last checkpoint.
To handle fail-stop errors → Checkpoint/Restart
Expected execution time
The expected execution time to perform a work of size $W$ followed by a checkpoint of size $C$ in the presence of failures (exponential distribution of parameter $\lambda$), with a restart cost $R$ and a downtime $D$, is:
$$T_\lambda(W, C, D, R) = e^{\lambda R} \left( \frac{1}{\lambda} + D \right) \left( e^{\lambda (W + C)} - 1 \right).$$
We assume that failures can strike during checkpoint and recovery, but not during downtime.
[Springer Monograph on Resilience 2015]
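The formula can be evaluated directly. The sketch below assumes the closed form $T_\lambda(W, C, D, R) = e^{\lambda R}(1/\lambda + D)(e^{\lambda(W+C)} - 1)$; the function name and the numerical values are illustrative assumptions.

```python
import math


def expected_time(W: float, C: float, D: float, R: float, lam: float) -> float:
    """Expected time to run W units of work plus a checkpoint of length C,
    with exponential failures of rate lam, restart cost R and downtime D."""
    # expm1 keeps precision when lam * (W + C) is very small.
    return (1.0 / lam + D) * math.exp(lam * R) * math.expm1(lam * (W + C))


# Illustrative values (assumptions): W = 250, C = R = 5, D = 1.
for lam in (1e-9, 1e-4, 1e-3):
    print(f"lambda = {lam:g}: expected time = {expected_time(250, 5, 1, 5, lam):.2f}")
```

As the failure rate goes to zero, the expected time tends to the failure-free time $W + C$, which is a quick sanity check on the formula.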
Objective
Minimizing the expectation of the execution time, or makespan.
Divisible applications
Optimal period: $P_{YD} = \sqrt{2 \mu_f C} = \sqrt{2C / \lambda}$
$\mu_f$: platform MTBF, $C$: checkpoint time
[Young 1974, Daly 2006]
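As a rough numerical check, the Young/Daly period can be compared with a brute-force search over the expected time per unit of work. This is a sketch under assumptions: the helper names are hypothetical, the search grid is arbitrary, and the failure rate is an illustrative value.

```python
import math


def young_daly_period(C: float, lam: float) -> float:
    """First-order optimal period sqrt(2C / lambda)."""
    return math.sqrt(2.0 * C / lam)


def time_per_unit_work(W: float, C: float, D: float, R: float, lam: float) -> float:
    """Expected time T_lambda(W, C, D, R) divided by the useful work W."""
    T = (1.0 / lam + D) * math.exp(lam * R) * math.expm1(lam * (W + C))
    return T / W


# Illustrative parameters (assumptions): C = R = 5, D = 1, lambda = 1e-4.
C, D, R, lam = 5.0, 1.0, 5.0, 1e-4
p_yd = young_daly_period(C, lam)

# Brute-force search over integer work sizes (arbitrary grid).
best_W = min(range(50, 1001), key=lambda W: time_per_unit_work(W, C, D, R, lam))

print(f"Young/Daly period: {p_yd:.1f}")
print(f"Best integer W on the grid: {best_W}")
```

The grid optimum lands within a few units of the Young/Daly value, as expected from a first-order approximation.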
Iterative applications
Applications are decomposed into computational iterations:
the duration of an iteration is stochastic, i.e., obeys a probability distribution law $\mathcal{D}$ of mean $\mu_{\mathcal{D}}$;
one can checkpoint only at the end of an iteration.
Given an iterative application with $n$ consecutive iterations:
the execution times of the iterations are $X_1, \dots, X_n$, where the $X_i$ are IID (independent and identically distributed) variables following $\mathcal{D}$;
a solution with $m$ checkpoints is written as $S = (\delta_1, \dots, \delta_n)$, where $\delta_i = 1$ if and only if we perform a checkpoint after the $i$-th iteration of length $X_i$;
$1 \le i_1 < i_2 < \dots < i_m = n$, and $\delta_j = 1 \iff j \in \{i_1, \dots, i_m\}$;
$W_j = \sum_{l = i_{j-1} + 1}^{i_j} X_l$ denotes the work between two consecutive checkpoints (of number $j - 1$ and $j$), with the convention $i_0 = 0$.
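The mapping from a solution $S = (\delta_1, \dots, \delta_n)$ to the work segments $W_j$ can be sketched as follows. The function name and the sample durations are hypothetical.

```python
from typing import List, Sequence


def segment_work(x: Sequence[float], delta: Sequence[int]) -> List[float]:
    """Return [W_1, ..., W_m], the work between consecutive checkpoints.

    x[i] is the duration of iteration i + 1 and delta[i] is 1 when a
    checkpoint is taken right after it.
    """
    if not x or len(x) != len(delta):
        raise ValueError("x and delta must be non-empty and of equal length")
    if delta[-1] != 1:
        raise ValueError("the last iteration must be checkpointed (i_m = n)")
    segments, current = [], 0.0
    for duration, checkpoint in zip(x, delta):
        current += duration
        if checkpoint == 1:
            segments.append(current)
            current = 0.0
    return segments


# Hypothetical durations: checkpoints after iterations 2 and 5.
print(segment_work([40, 55, 60, 45, 50], [0, 1, 0, 0, 1]))
```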
Static strategy
Consider an iterative application with $n$ iterations, split by the checkpoints into $m$ segments of work $W_j$. We are interested in minimizing the total execution time (makespan) of the application. The expected makespan is given as follows:
$$E[MS(S)] = E\left[ \sum_{j=1}^{m} T_\lambda(W_j, C, D, R) \right].$$
Static solutions decide in advance which iterations to checkpoint.
One can choose a solution to be periodic with period $k$, i.e., checkpoints are taken every $k$ iterations, namely at the end of iterations number $k, 2k, \dots$ until the last iteration.
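A periodic solution and a sampled estimate of its expected makespan can be sketched as below. The helper names, the Uniform(20, 80) iteration lengths, the failure rate and the number of runs are assumptions for illustration; the last iteration is always checkpointed, as the model requires $i_m = n$.

```python
import math
import random


def periodic_solution(n: int, k: int) -> list:
    """delta_i = 1 after iterations k, 2k, ... and after the last one."""
    return [1 if (i % k == 0 or i == n) else 0 for i in range(1, n + 1)]


def expected_time(W, C, D, R, lam):
    """T_lambda(W, C, D, R) for exponential failures of rate lam."""
    return (1.0 / lam + D) * math.exp(lam * R) * math.expm1(lam * (W + C))


def makespan_estimate(n, k, C, D, R, lam, runs=50, seed=0):
    """Average of sum_j T_lambda(W_j, C, D, R) over sampled iteration lengths."""
    rng = random.Random(seed)
    delta = periodic_solution(n, k)
    total = 0.0
    for _ in range(runs):
        work = 0.0
        for checkpoint in delta:
            work += rng.uniform(20.0, 80.0)  # assumed iteration-length law
            if checkpoint:
                total += expected_time(work, C, D, R, lam)
                work = 0.0
    return total / runs


# Illustrative parameters (assumptions): n = 1000, C = R = 5, D = 1, lambda = 1.8e-4.
estimates = {k: makespan_estimate(1000, k, 5.0, 1.0, 5.0, 1.8e-4) for k in (1, 5, 10)}
for k, value in estimates.items():
    print(f"k = {k:2d}: estimated expected makespan = {value:.0f}")
```

Checkpointing after every iteration pays the checkpoint cost too often, while a very long period re-executes too much work after a failure; an intermediate $k$ gives the smallest estimate.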
Theorem
The periodic solution checkpointing every $k_{static}$ iterations is asymptotically optimal, where
$$x_{static} = \frac{W_0\left(-e^{-\lambda C - 1}\right) + 1}{\log\left(E[e^{\lambda X}]\right)}$$
and $k_{static}$ is either $\max(1, \lfloor x_{static} \rfloor)$ or $\lceil x_{static} \rceil$, whichever achieves the smaller value of
$$C_{ind}(k) = \frac{e^{\lambda C} \, E[e^{\lambda X}]^k - 1}{k}.$$
$W_0$ is the principal branch of the Lambert $W$ function.
Proposition
The first-order approximation $k_{FO}$ of $k_{static}$ obeys the equation
$$k_{FO} \cdot \mu_{\mathcal{D}} = \sqrt{\frac{2C}{\lambda}}.$$
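The theorem can be evaluated numerically. The sketch below makes several assumptions: iteration lengths follow Uniform(20, 80) with $C = 5$ and $\mu_{\mathcal{D}} = 50$; the failure rate is back-computed from the value 4.6787 reported for $\frac{1}{\mu_{\mathcal{D}}}\sqrt{2C/\lambda}$ in the static results table; and `lambert_w0` is a hand-written Newton iteration, valid only for arguments in $[-1/e, 0)$, standing in for a library routine.

```python
import math


def lambert_w0(z: float, tol: float = 1e-15, max_iter: int = 100) -> float:
    """Principal branch W_0(z) for -1/e <= z < 0, via Newton's method."""
    p = math.sqrt(max(0.0, 2.0 * (1.0 + math.e * z)))
    if p == 0.0:
        return -1.0  # branch point z = -1/e
    w = -1.0 + p - p * p / 3.0  # series expansion around the branch point
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def mgf_uniform(t: float, a: float, b: float) -> float:
    """E[exp(t X)] for X uniform on [a, b]."""
    return (math.exp(t * b) - math.exp(t * a)) / (t * (b - a))


def c_ind(k: int, C: float, lam: float, mgf: float) -> float:
    """C_ind(k) = (exp(lam * C) * mgf**k - 1) / k."""
    return (math.exp(lam * C) * mgf ** k - 1.0) / k


def static_period(C: float, lam: float, mgf: float):
    """Return (x_static, k_static) as defined in the theorem."""
    x = (lambert_w0(-math.exp(-lam * C - 1.0)) + 1.0) / math.log(mgf)
    candidates = sorted({max(1, math.floor(x)), math.ceil(x)})
    return x, min(candidates, key=lambda k: c_ind(k, C, lam, mgf))


C, mu_D = 5.0, 50.0
lam = 2.0 * C / (4.6787 * mu_D) ** 2   # assumption: rate implied by the table value
x_static, k_static = static_period(C, lam, mgf_uniform(lam, 20.0, 80.0))
k_fo = math.sqrt(2.0 * C / lam) / mu_D  # first-order value before rounding

print(f"x_static = {x_static:.4f}, k_static = {k_static}, first-order x = {k_fo:.4f}")
```

Under these assumptions the computed $x_{static}$ is close to the Uniform entry of the results table, and both the exact and first-order values round to the same period.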
Dynamic strategy
We fix a threshold $W_{th}$ for the amount of work since the last checkpoint. When iteration $X_i$ finishes, if the amount of work since the last checkpoint is greater than $W_{th}$, then $\delta_i = 1$ (we checkpoint); otherwise $\delta_i = 0$ (we do not checkpoint).
The slowdown $H$ is defined as the ratio
$$H = \frac{\text{actual execution time}}{\text{useful execution time}},$$
so that the slowdown is equal to 1 if there is no cost for fault tolerance (no checkpoints, nor re-execution after failures).
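The threshold rule can be sketched as a single pass over the iteration durations. The function name and sample values are hypothetical; forcing a checkpoint after the last iteration is an assumption carried over from the model's requirement $i_m = n$.

```python
from typing import List, Sequence


def dynamic_checkpoints(x: Sequence[float], w_th: float) -> List[int]:
    """Return (delta_1, ..., delta_n) produced by the threshold rule."""
    delta, work = [], 0.0
    n = len(x)
    for i, duration in enumerate(x, start=1):
        work += duration
        # Checkpoint when the accumulated work exceeds the threshold,
        # and always after the final iteration (assumption: i_m = n).
        if work > w_th or i == n:
            delta.append(1)
            work = 0.0
        else:
            delta.append(0)
    return delta


# Hypothetical iteration durations.
durations = [60, 70, 80, 50, 90, 40]
print(dynamic_checkpoints(durations, w_th=120))
print(dynamic_checkpoints(durations, w_th=200))
```

Unlike the static strategy, the decision depends on the observed durations, so two runs of the same application may checkpoint at different iterations.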
When an iteration is completed, we compute two values, where $w_{dyn}$ is the amount of work since the last checkpoint:
The expected slowdown $H_{ckpt}$ if a checkpoint is taken at the end of this iteration:
$$H_{ckpt}(w_{dyn}) = \frac{T(w_{dyn}, 0, D, R) + T(0, C, D, R + w_{dyn})}{w_{dyn}}$$
The expected slowdown $H_{no}$ if no checkpoint is taken at the end of this iteration:
$$H_{no}(w_{dyn}) = \frac{E\left[ T(w_{dyn}, 0, D, R) + T(X, C, D, R + w_{dyn}) \right]}{E[w_{dyn} + X]}$$
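Both ratios simplify once the closed form of the expected execution time is substituted. The derivation below is a sketch that assumes $T(W, C, D, R) = e^{\lambda R}\left(\frac{1}{\lambda} + D\right)\left(e^{\lambda(W+C)} - 1\right)$ and writes $w$ for $w_{dyn}$ and $K = \frac{1}{\lambda} + D$ as shorthand.

```latex
\begin{align*}
T(w, 0, D, R) + T(0, C, D, R + w)
  &= K e^{\lambda R}\left(e^{\lambda w} - 1\right)
   + K e^{\lambda (R + w)}\left(e^{\lambda C} - 1\right) \\
  &= K e^{\lambda R}\left(e^{\lambda (w + C)} - 1\right), \\[4pt]
H_{ckpt}(w)
  &= \frac{K e^{\lambda R}\left(e^{\lambda (w + C)} - 1\right)}{w}, \\[8pt]
E\left[T(w, 0, D, R) + T(X, C, D, R + w)\right]
  &= K e^{\lambda R}\left(e^{\lambda w} - 1\right)
   + K e^{\lambda (R + w)}\left(e^{\lambda C} E[e^{\lambda X}] - 1\right) \\
  &= K e^{\lambda R}\left(e^{\lambda (w + C)} E[e^{\lambda X}] - 1\right), \\[4pt]
H_{no}(w)
  &= \frac{K e^{\lambda R}\left(e^{\lambda (w + C)} E[e^{\lambda X}] - 1\right)}{w + E[X]}.
\end{align*}
```

The common factor $K e^{\lambda R}$ cancels when the two slowdowns are compared, so the comparison depends only on $\lambda$, $C$, $E[X]$ and $E[e^{\lambda X}]$.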
By definition, $W_{th}$ is the threshold value where
$$H_{ckpt}(W_{th}) = H_{no}(W_{th}).$$
Finally, we derive the threshold value:
$$W_{th} = \frac{1}{\lambda} \, W_0\left( - \frac{\lambda E[X]}{E[e^{\lambda X}] - 1} \, e^{-\left( \lambda C + \frac{\lambda E[X]}{E[e^{\lambda X}] - 1} \right)} \right) + \frac{E[X]}{E[e^{\lambda X}] - 1}.$$
Proposition
The first-order approximation $W_{FO}$ of $W_{th}$ obeys the equation
$$W_{FO} = \sqrt{\frac{2C}{\lambda}}.$$
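The threshold can be checked numerically against its definition. This sketch assumes Uniform(20, 80) iteration lengths, $C = R = 5$, $D = 1$ and an illustrative failure rate; `lambert_w0` is a hand-written Newton iteration valid for arguments in $[-1/e, 0)$, and all function names are hypothetical.

```python
import math


def lambert_w0(z, tol=1e-15, max_iter=100):
    """Principal branch W_0(z) for -1/e <= z < 0, via Newton's method."""
    p = math.sqrt(max(0.0, 2.0 * (1.0 + math.e * z)))
    if p == 0.0:
        return -1.0
    w = -1.0 + p - p * p / 3.0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def T(W, C, D, R, lam):
    """Expected execution time T_lambda(W, C, D, R)."""
    return (1.0 / lam + D) * math.exp(lam * R) * math.expm1(lam * (W + C))


def h_ckpt(w, C, D, R, lam):
    """Expected slowdown if a checkpoint is taken now."""
    return (T(w, 0.0, D, R, lam) + T(0.0, C, D, R + w, lam)) / w


def h_no(w, C, D, R, lam, mean, mgf):
    """Expected slowdown if no checkpoint is taken now.
    E[T(X, C, D, R + w)] follows from E[exp(lam * X)] = mgf."""
    e_next = (1.0 / lam + D) * math.exp(lam * (R + w)) * (math.exp(lam * C) * mgf - 1.0)
    return (T(w, 0.0, D, R, lam) + e_next) / (w + mean)


def threshold(C, lam, mean, mgf):
    """Closed-form W_th, using q = lam * E[X] / (E[exp(lam X)] - 1)."""
    q = lam * mean / (mgf - 1.0)
    return (lambert_w0(-q * math.exp(-lam * C - q)) + q) / lam


# Illustrative parameters (assumptions).
C, D, R, lam, mean = 5.0, 1.0, 5.0, 1.8e-4, 50.0
mgf = (math.exp(80.0 * lam) - math.exp(20.0 * lam)) / (60.0 * lam)  # Uniform(20, 80)

w_th = threshold(C, lam, mean, mgf)
w_fo = math.sqrt(2.0 * C / lam)
print(f"W_th = {w_th:.2f}, first-order W_FO = {w_fo:.2f}")
print(f"H_ckpt(W_th) = {h_ckpt(w_th, C, D, R, lam):.6f}")
print(f"H_no(W_th)   = {h_no(w_th, C, D, R, lam, mean, mgf):.6f}")
```

The two slowdowns coincide at the computed threshold, which is what the definition of $W_{th}$ requires.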
Methodology
An iterative application composed of $n = 1000$ consecutive iterations.
The execution time of each iteration follows a probability distribution $\mathcal{D}$ with $\mu_{\mathcal{D}} = 50$ and standard deviation $\sigma$:
Uniform(20, 80)
Gamma(25, 0.5)
Normal(50, 2.5^2)
Each iteration fails with probability $p_{fail} \in \{10^{-3}, 10^{-2.5}, \dots, 10^{-0.1}\}$.
Checkpoint time $C = \eta \mu_{\mathcal{D}}$, where $\eta$ is the ratio of the checkpoint time to the expected iteration time (default $\eta = 0.1$).
Recovery time $R = C$, and fixed downtime $D = 1$.
The makespan is evaluated with 10000 random simulations.
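The three iteration-length distributions can be sampled with the Python standard library. This is a sketch under one notable assumption: Gamma(25, 0.5) is read as shape 25 and rate 0.5 (scale 2), which is the reading that gives the stated mean of 50. The names below are hypothetical.

```python
import random
import statistics

MU_D = 50.0        # mean iteration time
ETA = 0.1          # default ratio of checkpoint time to mean iteration time
C = ETA * MU_D     # checkpoint time
R = C              # recovery time
D = 1.0            # downtime

# random.gammavariate takes (shape, scale); rate 0.5 means scale 1 / 0.5 (assumption).
SAMPLERS = {
    "Uniform(20, 80)": lambda rng: rng.uniform(20.0, 80.0),
    "Gamma(25, 0.5)": lambda rng: rng.gammavariate(25.0, 1.0 / 0.5),
    "Normal(50, 2.5^2)": lambda rng: rng.gauss(50.0, 2.5),
}


def sample_iterations(name: str, n: int = 1000, seed: int = 0) -> list:
    """Draw n iteration durations from the named distribution."""
    rng = random.Random(seed)
    return [SAMPLERS[name](rng) for _ in range(n)]


means = {name: statistics.fmean(sample_iterations(name, n=20000)) for name in SAMPLERS}
for name, value in means.items():
    print(f"{name}: empirical mean = {value:.2f}")
```

All three empirical means should sit close to 50, while the spreads differ widely, which is what makes the comparison across distributions informative.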
Static strategy results
[Figure: three panels (Gamma, Normal, Uniform); x-axis: k from 1 to 10; y-axis: makespan normalized by MS_YD_sta.]
Figure: Performance (with boxplots) of the static strategy that chooses the value of $k$. Brown-red diamonds plot $E[MS_{\mathcal{D}}](k)$ (theoretical makespan). The blue (resp. red) line represents the makespan obtained by the optimal dynamic strategy $MS^{sim}_{dyn}(W_{th})$ (resp. the YD-dynamic strategy $MS^{YD}_{dyn}$).
Static strategy results
Table: Simulation for static case.

p_fail = 10^-2              Gamma     Normal    Uniform
k_sim                       5         5         5
x_static                    4.6114    4.6122    4.6097
k_static                    5         5         5
(1/mu_D) * sqrt(2C/lambda)  4.6787    4.6787    4.6787
k_FO                        5         5         5