Algorithm-Based Fault Tolerance for Linear Algebra - Thomas Herault


  1. Motivation Algorithm-Based Fault Tolerance Implementing ABFT on MPI Conclusion Algorithm-Based Fault Tolerance for Linear Algebra Thomas Herault University of Tennessee Knoxville http://icl.utk.edu/~herault/slides/AER-2013.pdf AES 2013, Eugene, OR herault@icl.utk.edu ABFT for Linear Algebra 1/ 66

  2. Thanks
     UT Knoxville: George Bosilca, Aurélien Bouteiller, Jack Dongarra; PhD students (Wesley Bland, Peng Du)
     INRIA & ENS Lyon: Yves Robert, Frédéric Vivien; PhD students (Guillaume Aupy, Dounia Zaidouni)
     Others: Franck Cappello, UIUC-Inria joint lab; Henri Casanova, Univ. Hawai‘i; Amina Guermouche, UIUC-Inria joint lab

  3. Outline
     1 Motivation
       Large Scale & Failures
       Checkpointing Approaches
       Coordinated Checkpointing – Young/Daly’s approximation
       Hierarchical checkpointing
       Evaluation of Checkpointing Approaches for Large Scale Platforms
       Conclusion on Checkpointing Approaches
     2 Algorithm-Based Fault Tolerance
       Principle
       Example: LU Factorization
       Step by Step ABFT LU
     3 Implementing ABFT on MPI
       Fault-Tolerance & the MPI Standard
       User-Level Failure Mitigation
       Checkpoint on Failures
     4 Conclusion

  6. Exascale platforms (courtesy Jack Dongarra)
     Potential system architecture with a cap of $200M and 20 MW

     | Systems | 2011 (K computer) | 2019 | Difference (today & 2019) |
     |---|---|---|---|
     | System peak | 10.5 Pflop/s | 1 Eflop/s | O(100) |
     | Power | 12.7 MW | ~20 MW | |
     | System memory | 1.6 PB | 32 - 64 PB | O(10) |
     | Node performance | 128 GF | 1, 2 or 15 TF | O(10) – O(100) |
     | Node memory BW | 64 GB/s | 2 - 4 TB/s | O(100) |
     | Node concurrency | 8 | O(1k) or 10k | O(100) – O(1000) |
     | Total node interconnect BW | 20 GB/s | 200 - 400 GB/s | O(10) |
     | System size (nodes) | 88,124 | O(100,000) or O(1M) | O(10) – O(100) |
     | Total concurrency | 705,024 | O(billion) | O(1,000) |
     | MTTI | days | O(1 day) | - O(10) |

  7. Exascale platforms
     Hierarchical
     • 10^5 or 10^6 nodes
     • Each node equipped with 10^4 or 10^3 cores
     Failure-prone

     | MTBF – one node | 1 year | 10 years | 120 years |
     |---|---|---|---|
     | MTBF – platform of 10^6 nodes | 30 sec | 5 min | 1 h |

     More nodes ⇒ Shorter MTBF (Mean Time Between Failures)
     Exascale ≠ Petascale × 1000
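
The platform MTBF figures on this slide are consistent with dividing the MTBF of one node by the number of nodes. The sketch below reproduces them under the assumption that node failures are independent, which the slide does not state; the function name is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def platform_mtbf_seconds(node_mtbf_years, num_nodes):
    """Platform MTBF in seconds, assuming independent node failures."""
    return node_mtbf_years * SECONDS_PER_YEAR / num_nodes


if __name__ == "__main__":
    # Node MTBF values taken from the slide, for a platform of 10^6 nodes.
    for years in (1, 10, 120):
        seconds = platform_mtbf_seconds(years, 10**6)
        print(f"node MTBF {years:>3} y -> platform MTBF {seconds:7.1f} s "
              f"({seconds / 60:.1f} min)")
```

With these inputs the results are roughly 32 s, 5.3 min and 63 min, which match the slide's rounded 30 sec, 5 min and 1 h.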

  10. Coordinated checkpointing protocols
      [Figure: timeline of processes P0, P1, P2 exchanging messages m1 to m5]
      Coordinated checkpoints over all processes
      Global restart after a failure
      (+) No risk of cascading rollbacks
      (+) No need to log messages
      (-) All processors need to roll back

  11. Message logging protocols
      [Figure: timeline of processes P0, P1, P2 exchanging messages m1 to m5]
      Message payload logging: in sender memory
      Event logging: in stable memory (replicated)
      Restart only the failed processes
      (+) No cascading rollbacks
      (+) Number of processes to roll back
      (-) Memory occupation
      (-) Overhead

  12. Hierarchical protocols
      [Figure: timeline of processes P0 to P3, grouped in clusters, exchanging messages m1 to m5]
      Clusters of processes
      Coordinated checkpointing protocol within clusters
      Message logging protocols between clusters
      Only processors from group(s) with failed process(es) need to roll back
      (-) Need to log inter-group message payload
          • Slows down failure-free execution
          • Increases checkpoint size/time
      (+) No need to log intra-group message payload
      (+) Faster re-execution with logged messages

  13. Which checkpointing protocol to use?
      Coordinated checkpointing
      (+) No risk of cascading rollbacks
      (+) No need to log messages
      (-) All processors need to roll back
      • Rumor: may not scale to very large platforms
      Hierarchical checkpointing
      (-) Need to log inter-group messages
          • Slows down failure-free execution
          • Increases checkpoint size/time
      (+) Only processors from the failed group need to roll back
      (+) Faster re-execution with logged messages
      • Rumor: should scale to very large platforms
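
One way to compare the protocols is to look at which processes must roll back after a failure. The following is a minimal sketch of that comparison only, not an implementation of any protocol; the function name, the protocol labels and the cluster layout are all hypothetical.

```python
def processes_to_roll_back(protocol, clusters, failed):
    """Return the set of process ids that restart from a checkpoint.

    clusters: list of sets of process ids (hypothetical grouping).
    failed:   set of process ids that failed.
    """
    if not failed:
        return set()
    if protocol == "coordinated":
        # Global restart: every process rolls back.
        return set().union(*clusters)
    if protocol == "message_logging":
        # Only the failed processes restart.
        return set(failed)
    if protocol == "hierarchical":
        # Every process of each cluster that contains a failed process.
        hit = [cluster for cluster in clusters if cluster & failed]
        return set().union(*hit)
    raise ValueError(f"unknown protocol: {protocol}")


if __name__ == "__main__":
    clusters = [{0, 1}, {2, 3}]  # two clusters of two processes each
    for name in ("coordinated", "message_logging", "hierarchical"):
        print(name, sorted(processes_to_roll_back(name, clusters, {2})))
```

With process 2 failing, coordinated checkpointing rolls back all four processes, message logging only process 2, and the hierarchical protocol the cluster {2, 3}.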

  14. Periodic checkpointing: what period to use?
      Intuition
      Short period ⇒ small risk of losing work
      Short period ⇒ more overhead
      Long period ⇒ low overhead
      Long period ⇒ high risk of losing work
      Optimal period computation
      Model the waste as a function of $T$, $\mu_P$, etc.
      Find the minimum of this function: $\frac{d\,\mathrm{Waste}}{dT} = 0$
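
As a minimal sketch of this minimization, assume a first-order waste model that is not given on the slide: Waste(T) ≈ C/T + T/(2μ), where C is the checkpoint duration and μ the platform MTBF, valid only when C < T and T is much smaller than μ. Setting the derivative to zero gives T = sqrt(2μC), the Young approximation named in the outline. The function names and the numeric values are illustrative.

```python
import math


def waste(T, C, mu):
    """Assumed first-order waste: checkpoint overhead plus expected lost work."""
    return C / T + T / (2.0 * mu)


def young_period(C, mu):
    """Period that cancels d(waste)/dT for the model above."""
    return math.sqrt(2.0 * mu * C)


if __name__ == "__main__":
    C, mu = 600.0, 86400.0  # hypothetical: 10 min checkpoint, 1 day MTBF
    T_opt = young_period(C, mu)
    # Numerical cross-check: scan periods from 0.5 * T_opt to 1.5 * T_opt.
    grid = [T_opt * (0.5 + 0.01 * k) for k in range(101)]
    T_best = min(grid, key=lambda T: waste(T, C, mu))
    print(f"closed form: {T_opt:.0f} s, grid search: {T_best:.0f} s, "
          f"waste: {waste(T_opt, C, mu):.3f}")
```

Both a shorter and a longer period increase the waste, which matches the intuition listed on the slide.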

  16. Checkpointing cost
      [Figure: time axis with legend "time spent working" and "time spent checkpointing"; labels: computing the first chunk, checkpointing the first chunk, processing the first chunk, processing the second chunk]
      Blocking model: checkpointing blocks all computations

  17. Checkpointing cost
      [Figure: time axis with legend "time spent working" and "time spent checkpointing"; labels: computing the first chunk, checkpointing the first chunk, processing the first chunk, processing the second chunk]
      Non-blocking model: checkpointing has no impact on computations (e.g., first copy state to RAM, then copy RAM to disk)

  18. Checkpointing cost
      [Figure: time axis with legend "time spent working", "time spent checkpointing" and "time spent working with slowdown"; labels: computing the first chunk, checkpointing the first chunk, processing the first chunk]
      General model: checkpointing slows computations down: during a checkpoint of duration C, the same amount of computation is done as during a time αC without checkpointing (0 ≤ α ≤ 1)
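
The three models differ only in the value of α. The sketch below computes the work completed in one period, assuming a period of length T that includes one checkpoint of duration C; that period layout and the function name are assumptions made for illustration.

```python
def work_per_period(T, C, alpha):
    """Work done in a period T that contains one checkpoint of duration C.

    alpha = 0 is the blocking model, alpha = 1 the non-blocking model,
    and values in between describe the general slowdown model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if not 0.0 <= C <= T:
        raise ValueError("C must satisfy 0 <= C <= T")
    # Full-speed work outside the checkpoint, slowed work during it.
    return (T - C) + alpha * C


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0):
        print(f"alpha={alpha}: work={work_per_period(100.0, 10.0, alpha)}")
```

For T = 100 and C = 10, the blocking model yields 90 units of work, the non-blocking model 100, and α = 0.5 yields 95.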
