Checkpointing strategies for parallel jobs


  1. Checkpointing strategies for parallel jobs. Marin Bougeret, Henri Casanova, Mikaël Rabie, Yves Robert, and Frédéric Vivien. ENS Lyon & INRIA, France; University of Hawai‘i at Mānoa, USA; University of Montpellier, France

  2. Motivation. Framework: a very large number of processing elements (e.g., 2^20); a failure-prone platform (like any realistic platform); a large application to be executed on the whole platform ⇒ failure(s) will certainly occur before completion! Resilience is provided through coordinated checkpointing. Question: when should we checkpoint the application?
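A quick back-of-the-envelope computation (with illustrative numbers, not from the talk) shows why: with 2^20 processors whose individual MTBF is 25 years, the platform MTBF drops to 25 × 365 × 24 / 2^20 ≈ 0.2 hours, i.e., one failure expected roughly every 12 minutes, far shorter than typical large application runs.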

  3. State of the art. One knows that applications should be checkpointed periodically.

  4. State of the art. One knows that applications should be checkpointed periodically. Is this optimal?

  5. State of the art. One knows that applications should be checkpointed periodically. Is this optimal? Several proposed values for the period:
Young: √(2 × C × MTBF) (1st-order approximation)
Daly (1): √(2 × C × (R + MTBF)) (1st-order approximation)
Daly (2): η × MTBF − C, where η = 2ξ² + 1 + L(−e^{−(2ξ²+1)}), ξ = √(C / (2 × MTBF)), and L is the Lambert function, defined by L(z)e^{L(z)} = z (higher-order approximation)

  6. State of the art. One knows that applications should be checkpointed periodically. Is this optimal? Several proposed values for the period:
Young: √(2 × C × MTBF) (1st-order approximation)
Daly (1): √(2 × C × (R + MTBF)) (1st-order approximation)
Daly (2): η × MTBF − C, where η = 2ξ² + 1 + L(−e^{−(2ξ²+1)}), ξ = √(C / (2 × MTBF)), and L is the Lambert function, defined by L(z)e^{L(z)} = z (higher-order approximation)
How good are these approximations? Could we find the optimal value? At least for Exponential failures? And for Weibull failures?
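As a numerical sanity check of these three periods, here is a minimal sketch in Python. It assumes SciPy's lambertw for the Lambert function L; the values of C, R, and MTBF are illustrative, not from the talk.

```python
import math
from scipy.special import lambertw

def young(C, mtbf):
    # Young's 1st-order period: sqrt(2 * C * MTBF)
    return math.sqrt(2 * C * mtbf)

def daly1(C, R, mtbf):
    # Daly's 1st-order period: sqrt(2 * C * (R + MTBF))
    return math.sqrt(2 * C * (R + mtbf))

def daly2(C, mtbf):
    # Daly's higher-order period: eta * MTBF - C
    xi = math.sqrt(C / (2 * mtbf))
    eta = 2 * xi**2 + 1 + lambertw(-math.exp(-(2 * xi**2 + 1))).real
    return eta * mtbf - C

C, R, mtbf = 1 / 6, 1 / 6, 25.0   # 10-minute checkpoint/recovery, 25 h MTBF
print(young(C, mtbf), daly1(C, R, mtbf), daly2(C, mtbf))
```

With these illustrative values the three periods agree to within a few percent, as expected when C ≪ MTBF.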

  7. Outline. 1. Single-processor jobs (Solving Makespan; Solving NextFailure). 2. Parallel jobs (Solving Makespan; Solving NextFailure). 3. Experiments (Simulation framework; Sequential jobs under synthetic failures; Parallel jobs under synthetic failures; Parallel jobs under trace-based failures). 4. Conclusion.

  8. Outline. 1. Single-processor jobs (Solving Makespan; Solving NextFailure). 2. Parallel jobs (Solving Makespan; Solving NextFailure). 3. Experiments (Simulation framework; Sequential jobs under synthetic failures; Parallel jobs under synthetic failures; Parallel jobs under trace-based failures). 4. Conclusion.

  9. Hypotheses. Overall size of work: W. Checkpoint cost: C (e.g., writing the contents of each processor's memory to disk). Downtime: D (hardware replacement by a spare, or software rejuvenation via rebooting). Recovery cost after a failure: R. Homogeneous platform (same computation speeds, i.i.d. failure distributions). The history of failures has no impact, only the time elapsed since the last failure does. A failure can happen during a checkpoint or a recovery, but not during a downtime (otherwise replace D by 0 and R by R + D).

  10. Outline. 1. Single-processor jobs (Solving Makespan; Solving NextFailure). 2. Parallel jobs (Solving Makespan; Solving NextFailure). 3. Experiments (Simulation framework; Sequential jobs under synthetic failures; Parallel jobs under synthetic failures; Parallel jobs under trace-based failures). 4. Conclusion.

  11. Problem statement: Makespan. Minimize the job's expected makespan, that is, the expectation of the time T needed to process a work of size W, knowing that the (single) processor failed τ units of time ago. Notation: minimize E(T(W | τ)). ω₁(W | τ): the amount of work we attempt to do before taking the first checkpoint.

  12.–21. Recursive approach.
E(T(W | τ)) = P_suc(ω₁ + C | τ) × [ω₁ + C + E(T(W − ω₁ | τ + ω₁ + C))]
            + (1 − P_suc(ω₁ + C | τ)) × [E(T_lost(ω₁ + C | τ)) + E(T_rec) + E(T(W | R))]
In the success branch: P_suc(ω₁ + C | τ) is the probability of success; ω₁ + C is the time needed to compute and checkpoint the 1st chunk; E(T(W − ω₁ | τ + ω₁ + C)) is the time needed to compute the remainder.
In the failure branch: 1 − P_suc(ω₁ + C | τ) is the probability of failure; E(T_lost(ω₁ + C | τ)) is the time elapsed before the failure occurred; E(T_rec) is the time needed to perform the downtime and recovery; E(T(W | R)) is the time needed to compute W from scratch.
Problem: find ω₁(W, τ) minimizing E(T(W | τ)).

  22. Failures following an exponential distribution. Theorem: the optimal strategy splits W into K* same-size chunks, where K* = max(1, ⌊K₀⌋) or K* = ⌈K₀⌉ (whichever leads to the smaller value), with
K₀ = λW / (1 + L(−e^{−λC−1})), L being the Lambert function, defined by L(z)e^{L(z)} = z.
The optimal expectation of the makespan is
K* × e^{λR} × (1/λ + D) × (e^{λ(W/K* + C)} − 1)
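A minimal sketch of this theorem in code, assuming an exponential distribution with rate λ and SciPy's lambertw; parameter values are illustrative, not from the talk.

```python
import math
from scipy.special import lambertw

def expected_makespan(K, W, C, R, D, lam):
    # E(T) = K * e^{lam*R} * (1/lam + D) * (e^{lam*(W/K + C)} - 1)
    return K * math.exp(lam * R) * (1 / lam + D) * math.expm1(lam * (W / K + C))

def optimal_K(W, C, R, D, lam):
    # K0 = lam*W / (1 + L(-e^{-lam*C - 1})); take the better of floor/ceil
    K0 = lam * W / (1 + lambertw(-math.exp(-lam * C - 1)).real)
    candidates = {max(1, math.floor(K0)), max(1, math.ceil(K0))}
    return min(candidates, key=lambda K: expected_makespan(K, W, C, R, D, lam))

# Illustrative values: one week of work, 10-minute checkpoints, MTBF = 25 h
W, C, R, D, lam = 168.0, 1 / 6, 1 / 6, 1 / 60, 1 / 25
K = optimal_K(W, C, R, D, lam)
print(K, expected_makespan(K, W, C, R, D, lam))
```

With these values K₀ ≈ 58, i.e., a chunk of roughly 2.9 hours, close to Young's first-order period, as one would expect when C ≪ MTBF.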

  23. Arbitrary failure distributions.
E(T(W | τ)) = min over 0 < ω₁ ≤ W of
  P_suc(ω₁ + C | τ) × [ω₁ + C + E(T(W − ω₁ | τ + ω₁ + C))]
  + (1 − P_suc(ω₁ + C | τ)) × [E(T_lost(ω₁ + C | τ)) + E(T_rec) + E(T(W | R))]
Solve via dynamic programming:
• Time quantum u: all chunk sizes ω_i are integer multiples of u
• Trade-off: accuracy versus higher computing time

  24. Dynamic programming.
Algorithm 1: DPMakespan(x, b, y, τ₀)
  if x = 0 then return 0
  if solution[x][b][y] = unknown then
    best ← ∞; τ ← b·τ₀ + y·u
    for i = 1 to x do
      exp_succ ← first(DPMakespan(x − i, b, y + i + C/u, τ₀))
      exp_fail ← first(DPMakespan(x, 0, R/u, τ₀))
      cur ← P_suc(i·u + C | τ) × (i·u + C + exp_succ)
            + (1 − P_suc(i·u + C | τ)) × (E(T_lost(i·u + C | τ)) + E(T_rec) + exp_fail)
      if cur < best then best ← cur; chunksize ← i
    solution[x][b][y] ← (best, chunksize)
  return solution[x][b][y]
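For intuition, here is a runnable sketch of the Makespan dynamic program, specialized to the memoryless exponential case, so the state reduces to the remaining work x instead of the slide's (x, b, y). Two simplifications are mine, not the talk's: downtime and recovery are assumed to always succeed, and the self-referential failure term E(T(W | R)) is resolved algebraically (for a fixed chunk size the recursion is linear in E, so we solve for E) rather than by table lookup. All parameter values are illustrative.

```python
import math
from functools import lru_cache

u = 0.1                      # time quantum (hours)
C, R, D = 0.2, 0.2, 0.05     # checkpoint, recovery, downtime costs (hours)
lam = 1 / 25.0               # exponential failure rate (MTBF = 25 hours)

def p_suc(t):
    # exponential survival; the time since the last failure is irrelevant
    return math.exp(-lam * t)

def e_lost(t):
    # E[time elapsed before the failure | the failure strikes within t]
    return 1 / lam - t * math.exp(-lam * t) / -math.expm1(-lam * t)

E_rec = D + R                # simplification: downtime and recovery never fail

@lru_cache(maxsize=None)
def E(x):
    # expected makespan (hours) for x remaining quanta of work
    if x == 0:
        return 0.0
    best = math.inf
    for i in range(1, x + 1):          # candidate sizes for the first chunk
        t = i * u + C
        p = p_suc(t)
        # E = p*(t + E(x - i)) + (1 - p)*(e_lost(t) + E_rec + E); solve for E
        cur = (p * (t + E(x - i)) + (1 - p) * (e_lost(t) + E_rec)) / p
        best = min(best, cur)
    return best

print(E(int(50 / u)))        # expected makespan for W = 50 hours of work
```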

  25. Outline. 1. Single-processor jobs (Solving Makespan; Solving NextFailure). 2. Parallel jobs (Solving Makespan; Solving NextFailure). 3. Experiments (Simulation framework; Sequential jobs under synthetic failures; Parallel jobs under synthetic failures; Parallel jobs under trace-based failures). 4. Conclusion.

  26. Problem statement: NextFailure. Maximize the expected amount of work completed before the next failure. Optimization on a "failure-by-failure" basis. Hopefully a good approximation, at least for large job sizes W.

  27. Approach.
E(W(ω | τ)) = P_suc(ω₁ + C | τ) × (ω₁ + E(W(ω − ω₁ | τ + ω₁ + C)))
Proposition:
E(W(W | 0)) = Σ_{i=1}^{K} [ ω_i × Π_{j=1}^{i} P_suc(ω_j + C | t_j) ]
where t_j = Σ_{ℓ=1}^{j−1} (ω_ℓ + C) is the total time elapsed (without failure) before the execution of chunk ω_j, and K is the (unknown) target number of chunks.
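The proposition translates directly into a product-sum evaluation. A sketch, assuming an arbitrary survival function p_suc(t, τ) supplied by the caller; the exponential example at the end is illustrative:

```python
import math

def expected_work(chunks, C, p_suc):
    # E(W) = sum_i omega_i * prod_{j <= i} P_suc(omega_j + C | t_j),
    # where t_j is the failure-free time elapsed before chunk j starts
    total, prob, t = 0.0, 1.0, 0.0
    for w in chunks:
        prob *= p_suc(w + C, t)   # chunk succeeds given survival up to t_j
        total += w * prob
        t += w + C
    return total

# exponential law: memoryless, so the tau argument is simply ignored
lam, C = 1 / 25.0, 0.2
p_exp = lambda t, tau: math.exp(-lam * t)
print(expected_work([5.0, 5.0, 5.0], C, p_exp))
```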

  28. Solving through dynamic programming.
Algorithm 2: DPNextFailure(x, n, τ₀)
  if x = 0 then return 0
  if solution[x][n] = unknown then
    best ← 0; τ ← τ₀ + (W − x·u) + n·C
    for i = 1 to x do
      work ← first(DPNextFailure(x − i, n + 1, τ₀))
      cur ← P_suc(i·u + C | τ) × (i·u + work)
      if cur > best then best ← cur; chunksize ← i
    solution[x][n] ← (best, chunksize)
  return solution[x][n]
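A memoized Python sketch of this dynamic program, assuming a Weibull survival function (the shape and scale values are illustrative, not from the talk); the state is the remaining work x (in quanta) and the number n of checkpoints taken, from which τ is reconstructed as on the slide:

```python
import math
from functools import lru_cache

u, C = 0.1, 0.2               # time quantum and checkpoint cost (hours)
W = 20.0                      # total work (hours)
k, scale = 0.7, 30.0          # assumed Weibull shape and scale
tau0 = 0.0                    # time since the last failure at start

def p_suc(t, tau):
    # P(no failure in the next t units | alive tau units after last failure)
    return math.exp((tau / scale) ** k - ((tau + t) / scale) ** k)

@lru_cache(maxsize=None)
def dp(x, n):
    # returns (expected work before the next failure, best first-chunk size)
    if x == 0:
        return 0.0, 0
    tau = tau0 + (W - x * u) + n * C
    best, chunk = 0.0, 1
    for i in range(1, x + 1):
        work, _ = dp(x - i, n + 1)
        cur = p_suc(i * u + C, tau) * (i * u + work)
        if cur > best:
            best, chunk = cur, i
    return best, chunk

best, first_chunk = dp(int(W / u), 0)
print(best, first_chunk * u)
```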

  29. Outline. 1. Single-processor jobs (Solving Makespan; Solving NextFailure). 2. Parallel jobs (Solving Makespan; Solving NextFailure). 3. Experiments (Simulation framework; Sequential jobs under synthetic failures; Parallel jobs under synthetic failures; Parallel jobs under trace-based failures). 4. Conclusion.

  30. Outline. 1. Single-processor jobs (Solving Makespan; Solving NextFailure). 2. Parallel jobs (Solving Makespan; Solving NextFailure). 3. Experiments (Simulation framework; Sequential jobs under synthetic failures; Parallel jobs under synthetic failures; Parallel jobs under trace-based failures). 4. Conclusion.

  31. Failures following an exponential distribution. Theorem: the optimal strategy splits W(p) into K*(p) same-size chunks, where K*(p) = max(1, ⌊K₀(p)⌋) or K*(p) = ⌈K₀(p)⌉ (whichever leads to the smaller value), with
K₀(p) = pλ × W(p) / (1 + L(−e^{−pλC−1})), L being the Lambert function, defined by L(z)e^{L(z)} = z.
The optimal expectation of the makespan is
K*(p) × (1/(pλ) + E(T_rec(p))) × (e^{pλ(W(p)/K*(p) + C)} − 1)
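The numerical recipe from the single-processor case carries over with the platform rate pλ. In this sketch E(T_rec(p)) is approximated by a plain D + R, which ignores failures during recovery (the talk's E(T_rec(p)) does not); all values are illustrative.

```python
import math
from scipy.special import lambertw

def optimal_K_parallel(Wp, C, lam, p, E_Trec):
    rate = p * lam                 # platform failure rate with p processors
    # K0(p) = p*lam*W(p) / (1 + L(-e^{-p*lam*C - 1}))
    K0 = rate * Wp / (1 + lambertw(-math.exp(-rate * C - 1)).real)
    def makespan(K):
        # K * (1/(p*lam) + E(T_rec(p))) * (e^{p*lam*(W(p)/K + C)} - 1)
        return K * (1 / rate + E_Trec) * math.expm1(rate * (Wp / K + C))
    return min({max(1, math.floor(K0)), max(1, math.ceil(K0))}, key=makespan)

# Illustrative: 2^16 processors, per-processor MTBF of 25 years, times in hours
p, lam = 2**16, 1 / (25 * 365 * 24)
C, R, D, Wp = 0.2, 0.2, 0.05, 1000.0
print(optimal_K_parallel(Wp, C, lam, p, D + R))
```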
