Practical foundations for resilient applications
George Bosilca
Algorithms and Scheduling Techniques to Manage Resilience and Power – Dagstuhl 2015
Failures are bad for business …
• In HPC: “Today, 20% or more of the computing capacity in a large high-performance computing system is wasted due to failures and recoveries” (Dr. M. Elnozahy et al., System Resilience at Extreme Scale, DARPA)
• Outside HPC: Dynamic execution environments (clouds) are not suitable for parallel application execution due to volatility.
• Tomorrow: U.S. Department of Energy identified 10 research challenges to Exascale. One of them is
  • Resilience and correctness: Ensuring correct scientific computation in face of faults, reproducibility, and algorithm verification challenges.
Fault Tolerance: many solutions
• Rollback Recovery
  • Coordinated checkpoint: legacy approach (with blocking, constant checkpoints)
  • Checkpoint/Restart based
  • Active research on introducing more asynchrony (uncoordinated checkpoint, message logging, correlated sets), increasing the MTBF (hardware) and decreasing the overheads (buddy checkpointing, NVRAM)
• Forward Recovery
  • Replication (the only system level Forward Recovery)
  • Master-Worker with simple resubmission
  • Iterative methods, naturally fault tolerant algorithms
  • Algorithm Based Fault Tolerance (ABFT)
[Figure: Master-Worker timeline (Master, Worker0, Worker1, Worker2)]
[Figure: ABFT factorization with protection blocks: the part factorized in previous iterations, and the trailing matrix and its protection, updated by applying the same operations]
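The ABFT idea named on this slide (protection blocks that undergo the same operations as the data) can be illustrated with a classic checksum-protected matrix product. This is only a minimal sketch and not the factorization shown in the talk: the helper names `with_checksum_row` and `recover_row` are hypothetical, and a single lost row is assumed.

```python
def matmul(a, b):
    """Plain dense matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def with_checksum_row(a):
    """Append a protection row holding the column sums of `a` (hypothetical helper)."""
    return a + [[sum(col) for col in zip(*a)]]

def recover_row(c_protected, lost):
    """Rebuild data row `lost` from the checksum row and the surviving rows."""
    *data, checksum = c_protected
    return [checksum[j] - sum(row[j] for i, row in enumerate(data) if i != lost)
            for j in range(len(checksum))]

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    # The product is applied to the data rows and to the checksum row alike,
    # so the last row of the result is still the column sum of the data rows.
    c = matmul(with_checksum_row(a), b)
    c[1] = None                      # simulate the loss of one data row
    c[1] = recover_row(c, lost=1)    # forward recovery, no rollback
    print(c[:2])
```

Recovery here costs one pass over the surviving rows instead of a restart from a checkpoint, which is the trade-off the Forward Recovery bullets describe.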
Research Status Anatomy
Rollback Recovery: Checkpointing & Restart (C/R) … Forward Recovery: Algorithm Based Fault Tolerance (ABFT)
Overhead: Large (C/R) … Small (ABFT)
Application Specificity: None (C/R) … Significant (ABFT)
Rollback recovery modeling
PurePeriodicCkpt (Young/Daly): every process (Process 0, 1, 2) alternates Application and Library phases, with a single optimal checkpoint interval
$P^{opt}_{PC} = \sqrt{2C(\mu - D - R)}$
BiPeriodicCkpt: separate checkpoint intervals for the General and Library phases
$P^{opt}_{BPC,G} = \sqrt{2C(\mu - D - R)}$
$P^{opt}_{BPC,L} = \sqrt{2C_L(\mu - D - R)}$
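As a worked illustration of the Young/Daly interval on this slide, the sketch below evaluates the formula numerically. The slide does not define its symbols; the code assumes the usual reading (C: checkpoint cost, µ: platform MTBF, D: downtime, R: restart cost, all in the same time unit), and the function name `optimal_period` is hypothetical.

```python
import math

def optimal_period(ckpt_cost, mtbf, downtime, restart):
    """Checkpoint interval sqrt(2 C (mu - D - R)), all arguments in the same unit.

    Assumed meaning of the symbols (not stated on the slide):
    ckpt_cost = C, mtbf = mu, downtime = D, restart = R.
    """
    usable = mtbf - downtime - restart
    if usable <= 0:
        raise ValueError("MTBF must exceed downtime + restart cost")
    return math.sqrt(2.0 * ckpt_cost * usable)

if __name__ == "__main__":
    # Illustrative values only: 1 day MTBF, 1 minute checkpoint and restart.
    period = optimal_period(ckpt_cost=60.0, mtbf=86400.0, downtime=0.0, restart=60.0)
    print(f"checkpoint roughly every {period:.0f} s")
```

The same function covers the bi-periodic case by passing the library-phase checkpoint cost (C_L on the slide) as `ckpt_cost`.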
Rollback recovery modeling: evolutionary platforms design
Scenario:
• Memory/component remains constant
• Problem size increases in O(√n)
• MTBF assumed to be 1 day at 10k nodes, scaled in O(1/n)
• Checkpoint (and restart) cost of 1 minute at 10k nodes, scaled in O(n)
• 80% of each iteration is spent in the ABFT algorithm, modifying 80% of the data
[Figure: number of faults (top) and waste (bottom) versus number of nodes (1k, 10k, 100k, 1M), for PeriodicCkpt and Bi-PeriodicCkpt]
Too many checkpoints!!!
Source: Assessing the impact of ABFT & Checkpoint composite strategies, George Bosilca, Aurelien Bouteiller, Thomas Herault, Yves Robert and Jack Dongarra, International Journal of Networking and Computing, ISSN 2185-2847
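To see why this scaling leads to "too many checkpoints", the scenario can be pushed through a first-order waste model. This is a rough sketch and does not reproduce the plotted curves: the waste expression 1 - (1 - C/P)(1 - (D + R + P/2)/µ) is an assumed first-order model (it is minimized by the Young/Daly interval), downtime is assumed to be zero, and restart is assumed to cost as much as a checkpoint.

```python
import math

DAY = 86400.0        # seconds
REF_NODES = 10_000   # reference platform size used on the slide

def waste_periodic(nodes, downtime=0.0):
    """Assumed first-order waste of periodic checkpointing at `nodes` nodes."""
    mtbf = DAY * REF_NODES / nodes    # 1 day at 10k nodes, scaled in O(1/n)
    ckpt = 60.0 * nodes / REF_NODES   # 1 minute at 10k nodes, scaled in O(n)
    restart = ckpt                    # assumption: restart costs as much as a checkpoint
    if mtbf <= downtime + restart:
        return 1.0                    # a failure strikes before recovery completes
    period = max(math.sqrt(2.0 * ckpt * (mtbf - downtime - restart)), ckpt)
    fault_free = 1.0 - ckpt / period
    after_faults = 1.0 - (downtime + restart + period / 2.0) / mtbf
    return min(max(1.0 - fault_free * after_faults, 0.0), 1.0)

if __name__ == "__main__":
    for nodes in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{nodes:>9} nodes: waste ~ {waste_periodic(nodes):.3f}")
```

Under these assumptions the waste grows with the node count because the MTBF shrinks while the checkpoint cost grows, until no useful work is possible.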
Rollback recovery modeling: revolutionary platforms design
Scenario:
• Memory/component remains constant
• Problem size increases in O(√n)
• MTBF assumed to be 1 day at 10k nodes, scaled in O(1/n)
• Checkpoint (and restart) cost of 1 minute at 10k nodes, scaled in O(1)
• O(n^3) vs O(n^2) of each iteration is spent in the ABFT algorithm, modifying 80% of the data
[Figure: number of faults (top) and waste (bottom) versus number of nodes (1k, 10k, 100k, 1M), for PeriodicCkpt and Bi-PeriodicCkpt; α = 0.55, 0.8, 0.92, 0.975 shown along the node axis]
Still too many checkpoints!!!
Source: Assessing the impact of ABFT & Checkpoint composite strategies, George Bosilca, Aurelien Bouteiller, Thomas Herault, Yves Robert and Jack Dongarra, International Journal of Networking and Computing, ISSN 2185-2847
Research Status Anatomy
Rollback Recovery: Checkpointing & Restart (C/R) … Forward Recovery: Algorithm Based Fault Tolerance (ABFT)
Overhead: Large (C/R) … Small (ABFT)
Application Specificity: Small (C/R) … Large (ABFT)
This situation can be improved by moving investments from hardware (more I/O bandwidth, future technologies such as NVRAM, increasing the MTBF of components) into software and developers.
Forward Recovery
• Any technique that permits the application to continue without rollback
  • Master-Worker with simple resubmission
  • Iterative methods, naturally fault tolerant algorithms
  • Algorithm Based Fault Tolerance
  • Replication (the only system level Forward Recovery)
• No checkpoint I/O overhead
• Minimal or no rollback, no loss of completed work
• May require protection/recovery operations (sometimes expensive, such as replication), but still generally more scalable than checkpointing
• “Why is not everybody doing this already, then?”
  • Often requires an in-depth algorithm rewrite (in contrast to automatic system based C/R)
  • Supposes that MPI continues to operate across failures
Standardization of programming paradigms behavior after failures is a key missing infrastructure
USER LEVEL FAILURE MITIGATION (ULFM)
Extend the MPI communication infrastructure to integrate faults as first-class citizens of the message passing concepts