
Algorithm-Based Fault Tolerance (Thomas Hérault, Yves Robert & Frédéric Vivien)


  1. Algorithm-Based Fault Tolerance. Thomas Hérault (1), Yves Robert (1, 2) & Frédéric Vivien (2). (1) University of Tennessee Knoxville, USA; (2) ENS Lyon & INRIA, France. frederic.vivien@inria.fr. 3rd JLESC Summer School. Sections: Introduction; ABFT for block LU factorization; Composite approach: ABFT & Checkpointing. 45 slides.

  2. Outline: (1) Introduction: Matrix-Matrix Multiplication; (2) ABFT for block LU factorization; (3) Composite approach: ABFT & Checkpointing.

  4. Generic vs. application-specific approaches. Generic solutions: universal; very low prerequisites; one size fits all (pros and cons). Application-specific solutions: require a (deep) study of the application/algorithm; the solution is tailored, hence more efficient.

  5. Backward recovery vs. forward recovery. Backward recovery (rollback): goes back in the execution history to recover from failures; spends time re-executing computations; rebuilds states that had already been reached. Typical example: checkpointing techniques.

  6. Backward recovery vs. forward recovery. Forward recovery: proceeds without rolling back; either pays additional costs during (failure-free) computation to maintain consistent redundancy, or pays additional computations when failures happen. General technique: replication. Application-specific techniques: iterative algorithms with fixed-point convergence, ABFT, ...

  7. Algorithm-Based Fault Tolerance (ABFT): principle. Limited to linear algebra computations. Matrices are extended with rows and/or columns of checksums. For example, the matrix

     $$M = \begin{pmatrix} 5 & 1 & 7 \\ 4 & 3 & 5 \\ 4 & 6 & 9 \end{pmatrix}$$

     is extended with a checksum column holding the sum of each row:

     $$M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 12 \\ 4 & 6 & 9 & 19 \end{pmatrix}$$
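
The checksum extension on this slide can be sketched in a few lines of Python. This is only an illustration: the helper name `add_row_checksums` and the list-of-lists representation are assumptions, not something taken from the slides.

```python
def add_row_checksums(matrix):
    """Return a copy of `matrix` where each row is extended by the sum of its entries."""
    return [row + [sum(row)] for row in matrix]


# The 3x3 example matrix from the slide.
M = [[5, 1, 7],
     [4, 3, 5],
     [4, 6, 9]]

M_checked = add_row_checksums(M)
print(M_checked)  # [[5, 1, 7, 13], [4, 3, 5, 12], [4, 6, 9, 19]]
```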

  10. ABFT and fail-stop errors. Missing checksum data (the lost entry is marked "?"):

     $$M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & ? \\ 4 & 6 & 9 & 19 \end{pmatrix}$$

     Simple recomputation: 4 + 3 + 5 = 12. Missing original data:

     $$M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & ? & 5 & 12 \\ 4 & 6 & 9 & 19 \end{pmatrix}$$

     Simple recomputation: 12 - (4 + 5) = 3.
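
Both recoveries follow the same rule: the row sum ties the entries together, so any single lost entry can be rebuilt from the others. A minimal sketch, assuming the lost entry is represented by `None` (the function name `recover_row` and that encoding are hypothetical, chosen for illustration):

```python
def recover_row(row):
    """Rebuild the single missing entry (None) of a row whose last element is its checksum."""
    missing = [i for i, value in enumerate(row) if value is None]
    if len(missing) != 1:
        raise ValueError("exactly one entry must be missing")
    i = missing[0]
    known = [value for value in row if value is not None]
    if i == len(row) - 1:
        # The checksum itself was lost: sum the data entries again.
        rebuilt = sum(known)
    else:
        # A data entry was lost: checksum minus the surviving data entries.
        rebuilt = known[-1] - sum(known[:-1])
    return row[:i] + [rebuilt] + row[i + 1:]


print(recover_row([4, 3, 5, None]))  # [4, 3, 5, 12]
print(recover_row([4, None, 5, 12]))  # [4, 3, 5, 12]
```

With a single checksum per row, two lost entries in the same row cannot be rebuilt this way, which is why the sketch rejects that case.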

  14. ABFT and silent data corruption.

     $$M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 13 \\ 4 & 6 & 9 & 19 \end{pmatrix}$$

     Error detection: $4 + 3 + 5 \neq 13$. Limitations: the following matrix would have successfully passed the sanity check:

     $$M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 5 & 3 & 5 & 13 \\ 4 & 6 & 9 & 19 \end{pmatrix}$$

     Can detect one error and correct zero.
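
The "detect one, correct zero" limitation can be made concrete with a short sketch. The helper names `row_is_consistent` and `candidate_repairs` are assumptions introduced here for illustration; the point is that a failed check admits several equally plausible single-entry repairs.

```python
def row_is_consistent(row):
    """Check that the last element equals the sum of the data entries."""
    return sum(row[:-1]) == row[-1]


def candidate_repairs(row):
    """List every (index, value) single-entry change that would make the row consistent."""
    data, checksum = row[:-1], row[-1]
    repairs = [(i, checksum - (sum(data) - value)) for i, value in enumerate(data)]
    repairs.append((len(data), sum(data)))  # or the checksum itself is wrong
    return repairs


print(row_is_consistent([4, 3, 5, 13]))  # False: the error is detected
print(row_is_consistent([5, 3, 5, 13]))  # True: this row passes the check
print(candidate_repairs([4, 3, 5, 13]))  # [(0, 5), (1, 4), (2, 6), (3, 12)]
```

Four different repairs restore consistency, so one checksum column alone cannot say which entry was corrupted.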

  16. ABFT and silent data corruption. One row and one column of checksums:

     $$M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 11 \\ 4 & 6 & 9 & 19 \\ 13 & 9 & 21 & 43 \end{pmatrix}$$

     Checksum recomputation to look for silent data corruptions (the last line sums the recomputed column checksums 13, 10, 21):

     $$\begin{aligned} 5 + 1 + 7 &= 13 \\ 4 + 3 + 5 &= 12 \\ 4 + 6 + 9 &= 19 \\ 13 + 10 + 21 &= 44 \end{aligned}$$

     Checksums do not match! Row 2 gives 12 instead of the stored 11, and column 2 gives 10 instead of the stored 9. Both checksums are affected, which gives the location of the error: row 2, column 2. We solve:

     $$4 + x + 5 = 11, \qquad 1 + x + 6 = 9,$$

     so $x = 2$. Recomputing the checksums we find that:

     $$\begin{aligned} 5 + 1 + 7 &= 13 \\ 4 + 2 + 5 &= 11 \\ 4 + 6 + 9 &= 19 \\ 13 + 9 + 21 &= 43 \end{aligned}$$

     Checksums match. Can detect two errors and correct one.
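
The locate-and-correct step can be sketched as follows. This is a hedged illustration, assuming the matrix is stored as a list of lists whose last column holds row sums and whose last row holds column sums, and assuming at most one corrupted data entry; the function name `locate_and_correct` is hypothetical.

```python
def locate_and_correct(full):
    """Find and fix a single corrupted data entry in a full-checksum matrix.

    Returns (row, column, corrected_value), or None if all checksums match.
    Modifies `full` in place.
    """
    n_rows, n_cols = len(full) - 1, len(full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(full[i][:n_cols]) != full[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(full[i][j] for i in range(n_rows)) != full[n_rows][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single corrupted data entry")
    i, j = bad_rows[0], bad_cols[0]
    # Solve the row equation for the unknown entry x at position (i, j).
    others = sum(full[i][:n_cols]) - full[i][j]
    full[i][j] = full[i][n_cols] - others
    return (i, j, full[i][j])


M = [[5, 1, 7, 13],
     [4, 3, 5, 11],
     [4, 6, 9, 19],
     [13, 9, 21, 43]]

print(locate_and_correct(M))  # (1, 1, 2): zero-based row 1, column 1, corrected to 2
print(locate_and_correct(M))  # None: the checksums now match
```

The sketch only uses the row equation to compute the corrected value; the column equation gives the same answer here and could serve as a cross-check.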

  21. ABFT for matrix-matrix multiplication. Aim: computation of $C = A \times B$. Let $e^T = [1, 1, \cdots, 1]$; we define

     $$A_c := \begin{pmatrix} A \\ e^T A \end{pmatrix}, \qquad B_r := \begin{pmatrix} B & Be \end{pmatrix}, \qquad C_f := \begin{pmatrix} C & Ce \\ e^T C & e^T C e \end{pmatrix},$$

     where $A_c$ is the column checksum matrix, $B_r$ is the row checksum matrix and $C_f$ is the full checksum matrix. Then

     $$A_c \times B_r = \begin{pmatrix} A \\ e^T A \end{pmatrix} \times \begin{pmatrix} B & Be \end{pmatrix} = \begin{pmatrix} AB & ABe \\ e^T AB & e^T ABe \end{pmatrix} = \begin{pmatrix} C & Ce \\ e^T C & e^T C e \end{pmatrix} = C_f.$$
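
The identity $A_c \times B_r = C_f$ can be checked numerically on a small example. The sketch below uses plain Python lists and hypothetical helper names (`matmul`, `column_checksum`, `row_checksum`, `full_checksum`); the 2x2 matrices are arbitrary test values, not taken from the slides.

```python
def matmul(X, Y):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def column_checksum(A):
    """A_c: append the row e^T A (the column sums) below A."""
    return A + [[sum(col) for col in zip(*A)]]


def row_checksum(B):
    """B_r: append the column B e (the row sums) to the right of B."""
    return [row + [sum(row)] for row in B]


def full_checksum(C):
    """C_f: C extended with both its row sums and its column sums."""
    return column_checksum(row_checksum(C))


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

product_of_checksums = matmul(column_checksum(A), row_checksum(B))
checksum_of_product = full_checksum(matmul(A, B))
print(product_of_checksums)  # [[19, 22, 41], [43, 50, 93], [62, 72, 134]]
print(product_of_checksums == checksum_of_product)  # True
```

The comparison is exact here because the entries are integers; with floating-point data the two sides would normally be compared up to a rounding tolerance.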
