Introduction Model DP Algo Experiments Conclusion
Combining Checkpointing and Replication for Reliable Execution of Linear Workflows
Anne Benoit 1,2, Aurélien Cavelan 3, Florina M. Ciorba 3, Valentin Le Fèvre 1, Yves Robert 1,4
1. LIP, École Normale Supérieure de Lyon, France
2. Georgia Institute of Technology, Atlanta, GA, USA
3. University of Basel, Switzerland
4. University of Tennessee, Knoxville, TN, USA
http://graal.ens-lyon.fr/~abenoit/
APDCM workshop @ IPDPS'18, Vancouver, May 21, 2018
APDCM'18 Anne.Benoit@ens-lyon.fr Combining checkpointing and replication 1/ 26

Linear workflows
High-performance computing (HPC) application: chain of tasks $T_1 \to T_2 \to \cdots \to T_n$
Parallel tasks executed on the whole platform
For instance: tightly-coupled computational kernels, image processing applications, ...
Goal: efficient execution, i.e., minimize total execution time

Reliable execution
Hierarchical
• $10^5$ or $10^6$ nodes
• Each node equipped with $10^4$ or $10^3$ cores
Failure-prone
MTBF, one node                    1 year    10 years    120 years
MTBF, platform of $10^6$ nodes    30 sec    5 min       1 h
More nodes ⇒ shorter MTBF (Mean Time Between Failures)
Need to ensure that the execution will be reliable, i.e., without failures
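
The platform figures in the table are consistent with dividing the single-node MTBF by the number of nodes, which holds when node failures are independent and identically distributed. A minimal sketch of that arithmetic, assuming a 365-day year; the helper name `platform_mtbf_seconds` is illustrative and does not come from the slides:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year


def platform_mtbf_seconds(node_mtbf_years, num_nodes):
    """MTBF (in seconds) of a platform of independent, identical nodes."""
    return node_mtbf_years * SECONDS_PER_YEAR / num_nodes


if __name__ == "__main__":
    for years in (1, 10, 120):
        seconds = platform_mtbf_seconds(years, 10**6)
        print(f"node MTBF {years:>3} years -> platform MTBF {seconds:8.1f} s")
```

For $10^6$ nodes this gives about 32 s, 5.3 min and 1.05 h, in line with the rounded values in the table.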

Coping with fail-stop errors with checkpoints
Checkpoint, rollback, and recovery:
[Figure: execution timelines. No error: T1 C1 T2 T3 C3 T4 C4. With a fail-stop error: T1 C1 T2 T3 (error) R2 T2 T3 C3 ...]
Coordinated checkpointing (the platform is a giant macro-processor)
Assume instantaneous interruption and detection
Rollback to last checkpoint and re-execute
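
The rollback rule can be illustrated with a small deterministic simulator. This is only a sketch under stated assumptions: the function `makespan`, its parameters, and the numeric values are hypothetical, every checkpoint has the same cost, and errors that fall inside a downtime or recovery period are ignored.

```python
def makespan(work, checkpointed, ckpt_cost, failures, downtime=0.0, recovery=0.0):
    """Completion time of a task chain under checkpoint/rollback.

    work         -- failure-free duration of each task
    checkpointed -- set of 0-based task indices that are followed by a checkpoint
    ckpt_cost    -- duration of one checkpoint (assumed identical for all tasks)
    failures     -- absolute times at which a fail-stop error strikes
    Simplifying assumption: errors during downtime or recovery are ignored.
    """
    pending = sorted(failures)
    k = 0        # index of the next failure to consider
    t = 0.0      # current time
    restart = 0  # first task to re-execute after a rollback
    i = 0        # task currently being executed
    while i < len(work):
        end = t + work[i] + (ckpt_cost if i in checkpointed else 0.0)
        while k < len(pending) and pending[k] < t:
            k += 1  # error fell inside downtime/recovery: ignored by assumption
        if k < len(pending) and pending[k] < end:
            # Error during the task or its checkpoint: pay downtime and
            # recovery, then restart from the task after the last checkpoint.
            t = pending[k] + downtime + recovery
            k += 1
            i = restart
        else:
            t = end
            if i in checkpointed:
                restart = i + 1
            i += 1
    return t


if __name__ == "__main__":
    work = [10.0, 10.0, 10.0, 10.0]   # T1..T4
    ckpts = {0, 2, 3}                 # checkpoints C1, C3, C4
    print(makespan(work, ckpts, 1.0, []))                    # no error
    print(makespan(work, ckpts, 1.0, [25.0], recovery=1.0))  # error during T3
```

In the second call the error at time 25 interrupts T3, so the run pays one recovery and re-executes T2 and T3, mirroring the sequence T1 C1 T2 T3 R2 T2 T3 C3 of the figure.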

Coping with fail-stop errors with replication
[Figure: execution timelines for T1(p/2) C1 T2(p) T3(p) C3 T4(p/2) T5(p) C5, where T1 and T4 each run as two replicas on p/2 processors and T2, T3, T5 run on all p processors; shown without error and with fail-stop errors]
The whole platform is used at all times; some tasks are replicated
If a failure hits a replicated task, no need to rollback
Otherwise, rollback to last checkpoint and re-execute

Contributions
Both checkpointing and replication have been extensively studied
Combination of both techniques not yet investigated
Detailed model
Optimal dynamic programming algorithm
Experiments to evaluate impact of using both replication and checkpointing during execution
Guidelines about when to checkpoint only, replicate only, or combine both techniques

Outline
1 Model and objective
2 Optimal dynamic programming algorithm
3 Experiments
4 Conclusion

Application and platform model
Application:
Chain $T_1 \to T_2 \to \cdots \to T_n$
Parallel tasks: (failure-free) execution time of $T_i$ using $q_i$ processors is $w_i \left( \alpha_i + \frac{1 - \alpha_i}{q_i} \right)$ (Amdahl's law)
Platform:
Homogeneous platform with $p$ processors $P_i$, $1 \le i \le p$
Fail-stop errors, Exponential distribution, error rate $\lambda_{ind}$
$\mathbb{P}(X \le T) = 1 - e^{-q \lambda_{ind} T}$ on $q$ processors
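
The two formulas of this model translate directly into code. A minimal sketch, where the function names are illustrative and `lam_ind` is read as the error rate of a single processor (as the factor $q \lambda_{ind}$ suggests):

```python
import math


def exec_time(w, alpha, q):
    """Failure-free time of a task on q processors (Amdahl's law).

    w     -- sequential work of the task
    alpha -- sequential fraction of the task, between 0 and 1
    """
    return w * (alpha + (1.0 - alpha) / q)


def failure_probability(q, lam_ind, t):
    """P(X <= t): probability that at least one of q processors fails by time t."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return -math.expm1(-q * lam_ind * t)


if __name__ == "__main__":
    # Illustrative values only, not taken from the slides.
    t = exec_time(w=100.0, alpha=0.1, q=10)
    print(t, failure_probability(q=10, lam_ind=1e-4, t=t))
```

Using more processors shortens the task but raises the aggregate error rate $q \lambda_{ind}$, which is the tension the rest of the model has to balance.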

Checkpointing
Checkpointing time: $C_i(q_i) = a_i + \frac{b_i}{q_i} + c_i q_i$
$a_i + \frac{b_i}{q_i}$: communication time with latency $a_i$
$c_i q_i$: message passing overhead
Downtime $D$
Recovery cost $R_{j+1}$ (where $T_j$ is the last checkpointed task)
$R_{i+1}(q_i) = C_i(q_i)$ for $1 \le i \le n-1$: recovering for $T_{i+1}$ ≈ reading $C_i$
$T_0$ with $w_0 = 0$ checkpointed (input time $R_1(q_1)$)
$T_n$ always checkpointed (output time $C_n(q_n)$)
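
As a small illustration of the cost model, the sketch below evaluates a checkpoint cost of the form $a + b/q + c\,q$ and reuses it as the recovery cost of the next task. The function names and the numeric values are hypothetical, and reading the second term as $b/q$ is an assumption about the formula on this slide.

```python
def checkpoint_time(a, b, c, q):
    """Checkpoint cost on q processors, assuming the form a + b/q + c*q.

    a     -- latency
    b / q -- transfer time, which shrinks as more processors share the data
    c * q -- message passing overhead, which grows with q
    """
    return a + b / q + c * q


def recovery_time(a, b, c, q):
    """Recovery for the next task is modeled as reading back the checkpoint."""
    return checkpoint_time(a, b, c, q)


if __name__ == "__main__":
    # Illustrative values only, not taken from the slides.
    for q in (1, 4, 16):
        print(q, checkpoint_time(a=1.0, b=8.0, c=0.5, q=q))
```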

No replication
$T_i$ not replicated: costs $C_i^{norep}$ and $R_i^{norep}$
Failure-free execution time: $T_i^{norep} = w_i \left( \alpha_i + \frac{1 - \alpha_i}{p} \right)$
Expected execution time $\mathbb{E}^{norep}(i)$:
$$\mathbb{E}^{norep}(i) = \mathbb{P}(X_p \le T_i^{norep}) \left( T_{lost}^{norep}(T_i^{norep}) + D + R_i^{norep} + \mathbb{E}^{norep}(i) \right) + \left( 1 - \mathbb{P}(X_p \le T_i^{norep}) \right) T_i^{norep}$$
$\mathbb{P}(X_p \le t) = 1 - e^{-\lambda_{ind} p t}$: probability of failure on one of the $p$ processors before time $t$
$$T_{lost}^{norep}(T_i^{norep}) = \frac{1}{\lambda_{ind} p} - \frac{T_i^{norep}}{e^{\lambda_{ind} p T_i^{norep}} - 1}$$
$$\mathbb{E}^{norep}(i) = \left( e^{\lambda_{ind} p T_i^{norep}} - 1 \right) \left( \frac{1}{\lambda_{ind} p} + D + R_i^{norep} \right)$$
If $T_i$ is checkpointed, add $C_i^{norep}$
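
The closed form for the expected execution time without replication can be checked numerically against its recursive definition. The sketch below does so for one set of parameters; the function names and the numeric values are illustrative and not taken from the slides.

```python
import math


def time_lost(t, lam):
    """Expected time lost, given that an error strikes within a window of length t."""
    return 1.0 / lam - t / math.expm1(lam * t)


def expected_time_norep(t_norep, lam_ind, p, downtime, recovery):
    """Closed form (e^{lam*T} - 1) * (1/lam + D + R), with lam = lam_ind * p."""
    lam = lam_ind * p
    return math.expm1(lam * t_norep) * (1.0 / lam + downtime + recovery)


def recursion_rhs(e, t_norep, lam_ind, p, downtime, recovery):
    """Right-hand side of the recursive definition, evaluated at E = e."""
    lam = lam_ind * p
    p_fail = -math.expm1(-lam * t_norep)  # P(X_p <= T)
    lost = time_lost(t_norep, lam)
    return p_fail * (lost + downtime + recovery + e) + (1.0 - p_fail) * t_norep


if __name__ == "__main__":
    # Illustrative values only: 1 h task, 10^5 processors, per-processor rate 1e-9 /s.
    args = dict(t_norep=3600.0, lam_ind=1e-9, p=10**5, downtime=60.0, recovery=120.0)
    e = expected_time_norep(**args)
    # The two printed values should agree up to rounding error.
    print(e, recursion_rhs(e, **args))
```

Plugging the closed form back into the recursion returns the same value, which is what solving the recursive equation for $\mathbb{E}^{norep}(i)$ predicts.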