Introduction | ABFT for block LU factorization | Composite approach: ABFT & Checkpointing

Algorithm-Based Fault Tolerance

Thomas Hérault (1), Yves Robert (1, 2) & Frédéric Vivien (2)
1 – University of Tennessee Knoxville, USA
2 – ENS Lyon & INRIA, France
frederic.vivien@inria.fr

3rd JLESC Summer School

Outline

1. Introduction: Matrix-Matrix Multiplication
2. ABFT for block LU factorization
3. Composite approach: ABFT & Checkpointing

Generic vs. application-specific approaches

Generic solutions
- Universal
- Very low prerequisites
- One size fits all (pros and cons)

Application-specific solutions
- Require a (deep) study of the application/algorithm
- Tailored solution: higher efficiency

Backward Recovery vs. Forward Recovery

Backward Recovery
- Rollback / Backward Recovery: goes back in the execution history to recover from failures
- Spends time re-executing computations
- Rebuilds states already reached
- Typical example: checkpointing techniques

Forward Recovery
- Forward Recovery: proceeds without rolling back
- Pays additional costs during (failure-free) computation to maintain consistent redundancy
- Or pays for additional computations when failures happen
- General technique: replication
- Application-specific techniques: iterative algorithms with fixed-point convergence, ABFT, ...

Algorithm-Based Fault Tolerance (ABFT)

Principle
- Limited to linear algebra computations
- Matrices are extended with rows and/or columns of checksums

Original matrix:
\[
M = \begin{pmatrix} 5 & 1 & 7 \\ 4 & 3 & 5 \\ 4 & 6 & 9 \end{pmatrix}
\]

Extended with a checksum column (the sum of each row):
\[
M = \left(\begin{array}{ccc|c} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 12 \\ 4 & 6 & 9 & 19 \end{array}\right)
\]

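The checksum extension can be sketched in a few lines of Python. This is only an illustration on the 3 × 3 example from the slide; the helper name `add_row_checksums` is hypothetical and not part of the original material.

```python
def add_row_checksums(matrix):
    """Return a copy of `matrix` with one extra column holding each row's sum."""
    return [row + [sum(row)] for row in matrix]


# Example matrix from the slide.
M = [[5, 1, 7],
     [4, 3, 5],
     [4, 6, 9]]

M_ext = add_row_checksums(M)
for row in M_ext:
    print(row)
# [5, 1, 7, 13]
# [4, 3, 5, 12]
# [4, 6, 9, 19]
```

The original matrix is left untouched, so the checksummed copy can be compared against it later.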
ABFT and fail-stop errors

Missing checksum data (the lost entry is marked "?"):
\[
M = \left(\begin{array}{ccc|c} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & ? \\ 4 & 6 & 9 & 19 \end{array}\right)
\]
Simple recomputation: 4 + 3 + 5 = 12.

Missing original data:
\[
M = \left(\begin{array}{ccc|c} 5 & 1 & 7 & 13 \\ 4 & ? & 5 & 12 \\ 4 & 6 & 9 & 19 \end{array}\right)
\]
Simple recomputation: 12 - (4 + 5) = 3.

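Both recovery cases follow the same rule: the data entries of a row must sum to its checksum. A minimal sketch, assuming a lost entry is represented by `None` and that the last element of each row is its checksum (the function name `recover_row` is hypothetical):

```python
def recover_row(row):
    """Fill in the single None entry of a row whose last element is its checksum."""
    missing = [i for i, v in enumerate(row) if v is None]
    if len(missing) != 1:
        raise ValueError("exactly one entry must be missing")
    i = missing[0]
    known = row[:i] + row[i + 1:]
    if i == len(row) - 1:
        # The checksum itself was lost: re-add the data entries.
        value = sum(known)
    else:
        # A data entry was lost: checksum minus the surviving data entries.
        value = known[-1] - sum(known[:-1])
    return row[:i] + [value] + row[i + 1:]


print(recover_row([4, 3, 5, None]))   # [4, 3, 5, 12]
print(recover_row([4, None, 5, 12]))  # [4, 3, 5, 12]
```

With a single checksum per row, only one lost entry per row can be rebuilt this way, which is why the sketch rejects rows with more than one `None`.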
ABFT and silent data corruption

\[
M = \left(\begin{array}{ccc|c} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 13 \\ 4 & 6 & 9 & 19 \end{array}\right)
\]
Error detection: 4 + 3 + 5 ≠ 13

Limitations
The following matrix, which contains two errors, would have passed the sanity check:
\[
M = \left(\begin{array}{ccc|c} 5 & 1 & 7 & 13 \\ 5 & 3 & 5 & 13 \\ 4 & 6 & 9 & 19 \end{array}\right)
\]
A single checksum column can detect one error and correct zero.

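The detection check and its blind spot can be reproduced directly. The sketch below assumes integer entries so that an exact comparison is meaningful (floating-point data would need a tolerance); `row_is_consistent` is a hypothetical helper name.

```python
def row_is_consistent(row):
    """True when the data entries of `row` sum to its trailing checksum."""
    return sum(row[:-1]) == row[-1]


one_error = [4, 3, 5, 13]    # some entry is wrong, but the check cannot say which
two_errors = [5, 3, 5, 13]   # two wrong entries whose effects cancel out

print(row_is_consistent(one_error))   # False: 4 + 3 + 5 = 12, not 13
print(row_is_consistent(two_errors))  # True: 5 + 3 + 5 = 13, errors go unnoticed
```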
ABFT and silent data corruption

One row and one column of checksums:
\[
M = \left(\begin{array}{ccc|c} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 11 \\ 4 & 6 & 9 & 19 \\ \hline 13 & 9 & 21 & 43 \end{array}\right)
\]

Checksum recomputation to look for silent data corruptions:
  5 + 1 + 7 = 13
  4 + 3 + 5 = 12      (stored checksum: 11)
  4 + 6 + 9 = 19
  13 + 10 + 21 = 44   (recomputed column sums are 13, 10, 21; stored total: 43)
Checksums do not match!

ABFT and silent data corruption

Stored matrix (left) and recomputed sums (right):
\[
M = \left(\begin{array}{ccc|c} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 11 \\ 4 & 6 & 9 & 19 \\ \hline 13 & 9 & 21 & 43 \end{array}\right)
\qquad
\begin{array}{l} 5 + 1 + 7 = 13 \\ 4 + 3 + 5 = 12 \\ 4 + 6 + 9 = 19 \\ 13 + 10 + 21 = 44 \end{array}
\]

Both a row checksum (row 2) and a column checksum (column 2) are affected, which gives the location of the error.

We solve for the corrupted entry x:
  4 + x + 5 = 11
  1 + x + 6 = 9
Both equations give x = 2.

Recomputing the checksums with x = 2, we find that:
  5 + 1 + 7 = 13
  4 + 2 + 5 = 11
  4 + 6 + 9 = 19
  13 + 9 + 21 = 43
Checksums match.

Can detect two errors and correct one.

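The locate-and-correct step can be sketched as follows. This is an illustrative implementation, not the authors' code: it assumes integer entries, a matrix stored as a list of rows with the checksum column last and the checksum row last, and a single corrupted data entry. The name `locate_and_correct` is hypothetical, and indices are zero-based.

```python
def locate_and_correct(full):
    """Locate and fix a single corrupted data entry in a full-checksum matrix.

    Returns (row, col, corrected_value), or None when every checksum matches.
    The matrix is corrected in place.
    """
    n_rows, n_cols = len(full) - 1, len(full[0]) - 1
    # Rows whose data no longer sums to the stored row checksum.
    bad_rows = [i for i in range(n_rows)
                if sum(full[i][:n_cols]) != full[i][n_cols]]
    # Columns whose data no longer sums to the stored column checksum.
    bad_cols = [j for j in range(n_cols)
                if sum(full[i][j] for i in range(n_rows)) != full[n_rows][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single data error; cannot correct")
    i, j = bad_rows[0], bad_cols[0]
    # Solve "other entries of row i + x = stored row checksum" for x.
    others = sum(full[i][:n_cols]) - full[i][j]
    corrected = full[i][n_cols] - others
    full[i][j] = corrected
    return i, j, corrected


M = [[5, 1, 7, 13],
     [4, 3, 5, 11],
     [4, 6, 9, 19],
     [13, 9, 21, 43]]

first = locate_and_correct(M)
second = locate_and_correct(M)
print(first)   # (1, 1, 2): row 2, column 2 in the slide's one-based numbering
print(second)  # None: all checksums match after the correction
```

The intersection of the failing row and the failing column pinpoints the entry; a single failing row with no failing column (or the reverse) would instead point at a corrupted checksum, which this sketch deliberately refuses to handle.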
ABFT for Matrix-Matrix multiplication

Aim: computation of C = A × B

Let $e^T = [1, 1, \cdots, 1]$. We define
\[
A_c := \begin{pmatrix} A \\ e^T A \end{pmatrix}, \qquad
B_r := \begin{pmatrix} B & Be \end{pmatrix}, \qquad
C_f := \begin{pmatrix} C & Ce \\ e^T C & e^T C e \end{pmatrix},
\]
where $A_c$ is the column checksum matrix, $B_r$ is the row checksum matrix, and $C_f$ is the full checksum matrix.

\[
A_c \times B_r
= \begin{pmatrix} A \\ e^T A \end{pmatrix} \times \begin{pmatrix} B & Be \end{pmatrix}
= \begin{pmatrix} AB & ABe \\ e^T AB & e^T ABe \end{pmatrix}
= \begin{pmatrix} C & Ce \\ e^T C & e^T C e \end{pmatrix}
= C_f
\]

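The identity $A_c \times B_r = C_f$ can be checked numerically. The sketch below uses plain Python lists and a small 2 × 2 example chosen only for illustration; the matrices and the helper names (`matmul`, `column_checksum`, `row_checksum`, `full_checksum`) are assumptions, not taken from the slides.

```python
def matmul(X, Y):
    """Plain matrix product of two lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def column_checksum(A):
    """A_c: append the row e^T A (the column sums) below A."""
    return A + [[sum(col) for col in zip(*A)]]


def row_checksum(B):
    """B_r: append the column Be (the row sums) to the right of B."""
    return [row + [sum(row)] for row in B]


def full_checksum(C):
    """C_f: append both the row-sum column and the column-sum row."""
    return column_checksum(row_checksum(C))


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Multiply the checksummed operands; the checksums of C come out for free.
C_f = matmul(column_checksum(A), row_checksum(B))
assert C_f == full_checksum(matmul(A, B))
for row in C_f:
    print(row)
# [19, 22, 41]
# [43, 50, 93]
# [62, 72, 134]
```

The product of the extended operands already carries the row and column checksums of C, so no separate pass over C is needed to build them.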