Practical foundations for resilient applications George Bosilca - PowerPoint PPT Presentation

Practical foundations for resilient applications George Bosilca Algorithms and Scheduling Techniques to Manage Resilience and Power – Dagstuhl 2015

Failures are bad for business …
• In HPC: “Today, 20% or more of the computing capacity in a large high-performance computing system is wasted due to failures and recoveries” (Dr. M. Elnozahy et al., System Resilience at Extreme Scale, DARPA)
• Outside HPC: Dynamic execution environments (clouds) are not suitable for parallel application execution due to volatility.
• Tomorrow: The U.S. Department of Energy identified 10 research challenges to Exascale. One of them is:
  • Resilience and correctness: Ensuring correct scientific computation in face of faults, reproducibility, and algorithm verification challenges.

Fault Tolerance: many solutions
• Rollback Recovery
  • Coordinated checkpoint
  • Legacy approach (with blocking, constant checkpoints)
  • Checkpoint/Restart based
  • Active research on introducing more asynchrony (uncoordinated checkpoint, message logging, correlated sets), increasing the MTBF (hardware) and decreasing the overheads (buddy checkpointing, NVRAM)
• Forward Recovery
  • Replication (the only system level Forward Recovery)
  • Master-Worker with simple resubmission
  • Iterative methods, naturally fault tolerant algorithms
  • Algorithm Based Fault Tolerance (ABFT)
[Figure: Master-Worker timeline (Master, Worker0, Worker1, Worker2) and ABFT factorization diagram showing the protection blocks, the part factorized in previous iterations, and the trailing matrix; the protection is updated by applying the same operations as the factorization.]
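The ABFT idea on this slide (protection data kept up to date by applying the same operations as the computation) can be illustrated with a minimal checksum sketch. It is not the factorization shown in the figure: the operation is a plain scaling, and the helper names `make_checksum`, `scale`, and `recover` are hypothetical, chosen only for this example.

```python
# Minimal checksum-based ABFT sketch (illustrative assumption, not the
# factorization from the slide). Each "worker" owns one block of integers;
# one extra protection block holds their element-wise sum.

def make_checksum(blocks):
    """Return the element-wise sum of all blocks."""
    return [sum(column) for column in zip(*blocks)]

def scale(block, factor):
    """The computation step: a linear operation applied to one block."""
    return [factor * x for x in block]

def recover(survivors, checksum):
    """Rebuild the single lost block from the checksum and the survivors."""
    partial = make_checksum(survivors)
    return [c - p for c, p in zip(checksum, partial)]

blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
checksum = make_checksum(blocks)

# Apply the same operation to the data and to the protection block,
# so the checksum stays valid without being recomputed from scratch.
blocks = [scale(b, 2) for b in blocks]
checksum = scale(checksum, 2)

# Simulate the loss of worker 1, then recover its block without rollback.
lost = blocks[1]
survivors = [blocks[0], blocks[2]]
recovered = recover(survivors, checksum)
print(recovered == lost)
```

Because the operation is linear, the checksum of the results equals the result applied to the checksum, which is what lets the lost block be rebuilt from the survivors alone.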

Research Status Anatomy
                          Rollback Recovery                  Forward Recovery
Example                   Checkpointing & Restart (C/R)      Algorithm Based Fault Tolerance (ABFT)
Overhead                  Large                              Small
Application Specificity   None                               Significant

Rollback recovery modeling
PurePeriodicCkpt (Young/Daly), optimal checkpoint interval:
$P^{opt}_{PC} = \sqrt{2C(\mu - D - R)}$
BiPeriodicCkpt, checkpoint intervals for the General (G) and Library (L) phases:
$P^{opt}_{BPC,G} = \sqrt{2C(\mu - D - R)}$
$P^{opt}_{BPC,L} = \sqrt{2C_L(\mu - D - R)}$
[Figure: timelines of Process 0, Process 1, and Process 2 alternating between Application and Library phases, with the checkpoint intervals of each strategy marked.]
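As a worked example of the Young/Daly interval on this slide, the sketch below evaluates P = sqrt(2C(µ − D − R)). The slide does not define the symbols; the usual reading is assumed here (C checkpoint cost, µ platform MTBF, D downtime, R recovery cost, all in seconds), and the numeric values are illustrative only.

```python
import math

def optimal_period(C, mu, D, R):
    """Young/Daly first-order optimal checkpoint period, in seconds.

    Assumed meaning of the symbols: C = checkpoint cost, mu = MTBF,
    D = downtime, R = recovery cost.
    """
    slack = mu - D - R
    if slack <= 0:
        raise ValueError("MTBF must exceed downtime plus recovery cost")
    return math.sqrt(2.0 * C * slack)

# Illustrative values: 1-minute checkpoint and recovery, 1-day MTBF,
# no downtime (D = 0 is an assumption made for this example).
C, mu, D, R = 60.0, 86400.0, 0.0, 60.0
P = optimal_period(C, mu, D, R)
print(f"checkpoint every {P:.1f} s (about {P / 60:.1f} min)")
```

With these values the period comes out a little under an hour, so roughly one minute in every 54 is spent checkpointing.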

Rollback recovery modeling (evolutionary platforms design)
Assumptions:
• Problem size increases in O(√n), so memory/component remains constant
• MTBF of 10k nodes at 1 day, scaled in O(1/n)
• Checkpoint (and restart) cost at 10k nodes of 1 minute, scaled in O(n)
• 80% of each iteration is spent in the ABFT algorithm, modifying 80% of the data
[Figure: number of faults (Nb Faults, 0 to 40) and waste (0 to 0.4) versus number of nodes (1k, 10k, 100k, 1M), for PeriodicCkpt and Bi-PeriodicCkpt.]
Source: Assessing the impact of ABFT & Checkpoint composite strategies, George Bosilca, Aurelien Bouteiller, Thomas Herault, Yves Robert and Jack Dongarra, International Journal of Networking and Computing, ISSN 2185-2847
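The scaling assumptions on this slide can be turned into numbers with a short sketch. It applies the Young/Daly period to a platform whose MTBF shrinks in O(1/n) and whose checkpoint cost grows in O(n), starting from the slide's 10k-node baseline. Two things are assumptions of this example, not of the slide: the downtime is taken as 0, and the restart cost is taken equal to the checkpoint cost.

```python
import math

BASE_NODES = 10_000
BASE_MTBF = 24 * 3600.0  # 1 day at 10k nodes (from the slide)
BASE_CKPT = 60.0         # 1 minute at 10k nodes (from the slide)

def platform(nodes):
    """Return (MTBF, checkpoint cost) in seconds for a given node count."""
    scale = nodes / BASE_NODES
    return BASE_MTBF / scale, BASE_CKPT * scale  # O(1/n) and O(n)

def young_daly_period(nodes, downtime=0.0):
    """Young/Daly period, or None when no useful period exists."""
    mtbf, ckpt = platform(nodes)
    slack = mtbf - downtime - ckpt  # restart cost assumed equal to ckpt
    if slack <= 0:
        return None  # a restart already takes longer than the MTBF
    return math.sqrt(2.0 * ckpt * slack)

for nodes in (1_000, 10_000, 100_000, 1_000_000):
    mtbf, ckpt = platform(nodes)
    period = young_daly_period(nodes)
    shown = "none" if period is None else f"{period:.0f} s"
    print(f"{nodes:>9} nodes: MTBF {mtbf:.0f} s, ckpt {ckpt:.0f} s, period {shown}")
```

Under these assumptions the checkpoint cost overtakes the MTBF somewhere between 100k and 1M nodes, at which point periodic checkpointing alone can no longer make progress.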

Rollback recovery modeling (same plots, assumptions, and source as the previous slide)
Too many checkpoints!!!

Rollback recovery modeling (revolutionary platforms design)
Assumptions:
• Problem size increases in O(√n), so memory/component remains constant
• MTBF of 10k nodes at 1 day, scaled in O(1/n)
• Checkpoint (and restart) cost at 10k nodes of 1 minute, scaled in O(1)
• O(n^3) vs O(n^2): the fraction α of each iteration spent in the ABFT algorithm, modifying 80% of the data
[Figure: number of faults (Nb Faults, 0 to 6) and waste (0 to 0.4) versus number of nodes, for PeriodicCkpt and Bi-PeriodicCkpt. Nodes axis: 1k (α = 0.55), 10k (α = 0.8), 100k (α = 0.92), 1M (α = 0.975).]
Source: Assessing the impact of ABFT & Checkpoint composite strategies, George Bosilca, Aurelien Bouteiller, Thomas Herault, Yves Robert and Jack Dongarra, International Journal of Networking and Computing, ISSN 2185-2847
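The α values on this slide can be approximated with a simple model, offered here as a reconstruction rather than as the authors' derivation. Assume the ABFT phase costs O(N^3) and the remaining phase O(N^2) for a problem of size N, that N grows in O(√n) with the node count n, and that α = 0.8 at 10k nodes. The ratio of the two phases then grows like √n. The function name `abft_fraction` is hypothetical.

```python
import math

def abft_fraction(nodes, base_nodes=10_000, base_alpha=0.8):
    """Fraction of an iteration spent in the ABFT phase.

    Assumed model: ABFT work ~ N^3, other work ~ N^2, N ~ sqrt(nodes),
    so the ABFT-to-other ratio scales with sqrt(nodes / base_nodes).
    """
    growth = math.sqrt(nodes / base_nodes)
    abft = base_alpha * growth
    other = 1.0 - base_alpha
    return abft / (abft + other)

for nodes in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{nodes:>9} nodes: alpha = {abft_fraction(nodes):.3f}")
```

The results land close to the values listed on the slide (0.55, 0.8, 0.92, 0.975), which suggests the slide's α follows from the O(n^3) versus O(n^2) scaling in this way.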

Rollback recovery modeling (same plots, assumptions, and source as the previous slide)
Still too many checkpoints!!!

Research Status Anatomy
                          Rollback Recovery                  Forward Recovery
Example                   Checkpointing & Restart (C/R)      Algorithm Based Fault Tolerance (ABFT)
Overhead                  Large                              Small
Application Specificity   Small                              Large
This situation can be improved by moving investments from hardware (more I/O bandwidth, future technologies such as NVRAM, increasing the MTBF of components) into software and developers.

Forward Recovery
• Any technique that permits the application to continue without rollback
  • Replication (the only system level Forward Recovery)
  • Master-Worker with simple resubmission
  • Iterative methods, naturally fault tolerant algorithms
  • Algorithm Based Fault Tolerance
• No checkpoint I/O overhead
• No rollback, no loss of completed work
• May require protection/recovery operations (sometimes expensive, as with replicas), but still generally more scalable than checkpointing
• “Why is not everybody doing this already, then?”
  • Often requires an in-depth algorithm rewrite (in contrast to automatic system-based C/R)
  • Supposes that MPI continues to operate across failures
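Of the techniques listed on this slide, Master-Worker with simple resubmission is the easiest to sketch. The example below is a single-process simulation, not an MPI program: failures are scripted through the hypothetical `failing_attempts` parameter, which stands in for real failure detection, and the task is just squaring a number.

```python
from collections import deque

def run_master_worker(tasks, workers, failing_attempts):
    """Square each task, resubmitting any task whose worker fails.

    failing_attempts: set of 0-based attempt numbers at which the chosen
    worker is declared dead (a scripted stand-in for failure detection).
    """
    pending = deque(tasks)
    alive = list(workers)
    results = {}
    attempt = 0
    while pending:
        if not alive:
            raise RuntimeError("all workers failed")
        task = pending.popleft()
        worker = alive[attempt % len(alive)]  # round-robin assignment
        if attempt in failing_attempts:
            alive.remove(worker)   # the worker is lost...
            pending.append(task)   # ...and only its task is resubmitted
        else:
            results[task] = task * task
        attempt += 1
    return results, alive

results, alive = run_master_worker([1, 2, 3, 4], ["w0", "w1", "w2"], {1})
print(results, alive)
```

Completed results are never discarded: the failure of `w1` costs one repeated task, which is the forward-recovery property the slide describes as no loss of completed work.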

Forward Recovery (same slide as above, with one bullet refined)
• Minimal or no rollback, no loss of completed work

Forward Recovery
• Any technique that permits the application to continue without rollback
  • Master-Worker with simple resubmission
  • Iterative methods, naturally fault tolerant algorithms
  • Algorithm Based Fault Tolerance
  • Replication (the only system level Forward Recovery)
• No checkpoint I/O overhead
• No rollback, no loss of completed work
• May require protection/recovery operations (sometimes expensive, as with replicas), but still generally more scalable than checkpointing
• “Why is not everybody doing this already, then?”
  • Often requires an in-depth algorithm rewrite (in contrast to automatic system-based C/R)
  • Supposes that MPI continues to operate across failures
Standardization of the behavior of programming paradigms after failures is a key missing piece of infrastructure.

USER LEVEL FAILURE MITIGATION (ULFM)
Extend the MPI communication infrastructure to integrate faults as a first-class citizen of the message passing concepts.
