
Which verification for soft error detection? Leonardo Bautista-Gomez



  1. Which verification for soft error detection?
     Sections: Problem statement | Theoretical analysis | Performance evaluation | Conclusion
     Leonardo Bautista-Gomez (1), Anne Benoit (2), Aurélien Cavelan (2), Saurabh K. Raina (3), Yves Robert (2, 4) and Hongyang Sun (2)
     (1) Argonne National Laboratory, USA; (2) ENS Lyon & INRIA, France; (3) Jaypee Institute of Information Technology, India; (4) University of Tennessee Knoxville, USA
     Anne.Benoit@ens-lyon.fr
     December 17, HiPC’2015, Bengaluru, India

  3. Computing at Exascale
     - Exascale platform: $10^5$ or $10^6$ nodes, each equipped with $10^2$ or $10^3$ cores
     - Shorter Mean Time Between Failures (MTBF) $\mu$
     - Theorem: $\mu_p = \mu_{\text{ind}} / p$ for arbitrary distributions, where $p$ is the number of nodes

     | MTBF (individual node) | MTBF (platform of $10^6$ nodes) |
     |---|---|
     | 1 year | 30 sec |
     | 10 years | 5 min |
     | 120 years | 1 h |

     - Need more reliable components!
     - Need more resilient techniques!
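The platform figures in the table can be checked numerically. The following is a minimal sketch, assuming the theorem reads $\mu_p = \mu_{\text{ind}} / p$ with $p$ the number of nodes; the helper name `platform_mtbf` and the 365-day year are illustrative choices, not taken from the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a 365-day year


def platform_mtbf(node_mtbf_seconds: float, num_nodes: int) -> float:
    """Platform MTBF mu_p = mu_ind / p (hypothetical helper name)."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return node_mtbf_seconds / num_nodes


if __name__ == "__main__":
    # Reproduce the three columns of the table for a platform of 10^6 nodes.
    for years in (1, 10, 120):
        mu_p = platform_mtbf(years * SECONDS_PER_YEAR, 10**6)
        print(f"node MTBF {years:>3} years -> platform MTBF {mu_p:8.1f} s")
```

The results land within rounding of the slide's 30 sec, 5 min and 1 h entries.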

  6. General-purpose approach
     - Periodic checkpoint, rollback and recovery
     - [Figure: timeline of work segments W separated by checkpoints C; labels: Error, Corrupt, Detect]
     - Fail-stop errors: instantaneous error detection, e.g., resource crash
     - Silent errors (aka silent data corruptions): e.g., soft faults in L1 cache, ALU, double bit flip
     - A silent error is detected only when the corrupted data is activated, which could happen long after its occurrence
     - Detection latency is problematic ⇒ risk of saving a corrupted checkpoint!

  8. Coping with silent errors
     - Couple checkpointing with verification
     - [Figure: timeline in which each work segment W is followed by a verification $V^*$ and a checkpoint C; labels: Error, Detect]
     - Before each checkpoint, run some verification mechanism (checksum, ECC, coherence tests, TMR, etc.)
     - Silent error is detected by verification ⇒ checkpoint always valid
     - Optimal period (Young/Daly):

     | | Fail-stop (classical) | Silent errors |
     |---|---|---|
     | Pattern | $T = W + C$ | $T = W + V^* + C$ |
     | Optimal | $W^* = \sqrt{2C\mu}$ | $W^* = \sqrt{(C + V^*)\mu}$ |
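The two optimal periods in the table are simple to evaluate. The sketch below assumes all costs and the MTBF are expressed in the same time unit; the function names and the numeric values in the demo are hypothetical, chosen only to illustrate the formulas.

```python
import math


def optimal_work_fail_stop(checkpoint_cost: float, mtbf: float) -> float:
    """W* = sqrt(2 * C * mu), for the fail-stop pattern T = W + C."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def optimal_work_silent(checkpoint_cost: float, verification_cost: float,
                        mtbf: float) -> float:
    """W* = sqrt((C + V*) * mu), for the silent-error pattern T = W + V* + C."""
    return math.sqrt((checkpoint_cost + verification_cost) * mtbf)


if __name__ == "__main__":
    # Hypothetical values (seconds): C = 60, V* = 30, platform MTBF = 3600.
    C, V_STAR, MU = 60.0, 30.0, 3600.0
    print("fail-stop W*:", optimal_work_fail_stop(C, MU))
    print("silent    W*:", optimal_work_silent(C, V_STAR, MU))
```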

  10. One step further
      - Perform several verifications before each checkpoint
      - [Figure: timeline with several verifications $V^*$ between consecutive checkpoints C; labels: Error, Detect]
      - Pro: silent error is detected earlier in the pattern
      - Con: additional overhead in error-free executions
      - How many intermediate verifications to use, and at which positions?

  13. Partial verification
      - Guaranteed/perfect verifications ($V^*$) can be very expensive!
      - Partial verifications ($V$) are available for many HPC applications!
        - Lower accuracy: recall $r = \frac{\#\text{detected errors}}{\#\text{total errors}} < 1$
        - Much lower cost, i.e., $V < V^*$
      - [Figure: timeline with partial verifications $V_1$, $V_2$ placed between perfect verifications $V^*$ and checkpoints C; labels: Error, Detect?, Detect!]
      - Which verification(s) to use? How many? Positions?

  14. Outline
      1. Problem statement
      2. Theoretical analysis
      3. Performance evaluation
      4. Conclusion

  15. Model and objective
      - Silent errors
        - Poisson process: arrival rate $\lambda = 1/\mu$, where $\mu$ is the platform MTBF
        - Strike only computations; checkpointing, recovery, and verifications are protected
      - Resilience parameters
        - Cost of checkpointing $C$, cost of recovery $R$
        - $k$ types of partial detectors and a perfect detector: $(D^{(1)}, D^{(2)}, \dots, D^{(k)}, D^*)$
          - $D^{(i)}$: cost $V^{(i)}$ and recall $r^{(i)} < 1$
          - $D^*$: cost $V^*$ and recall $r^* = 1$
      - Design an optimal periodic computing pattern that minimizes execution time (or makespan) of the application
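Two quantities follow directly from this model and may help build intuition. The sketch below uses the standard Poisson-process fact that at least one arrival occurs in a window of length $w$ with probability $1 - e^{-\lambda w}$. The second helper additionally assumes that successive partial detectors miss an error independently of one another, which is an assumption of this illustration and is not stated on the slide. Both function names are hypothetical.

```python
import math
from typing import Sequence


def prob_error_in_work(work: float, mtbf: float) -> float:
    """Probability that at least one silent error strikes during `work`
    units of computation, for Poisson arrivals with rate lambda = 1 / mtbf."""
    lam = 1.0 / mtbf
    return 1.0 - math.exp(-lam * work)


def prob_missed_by_all(recalls: Sequence[float]) -> float:
    """Probability that an existing error escapes every detector in
    `recalls`, ASSUMING independent detections (illustrative assumption)."""
    return math.prod(1.0 - r for r in recalls)


if __name__ == "__main__":
    # Hypothetical values: 600 s of work on a platform with a 3600 s MTBF,
    # followed by two partial detectors with recalls 0.5 and 0.8.
    print("P(error during work):", prob_error_in_work(600.0, 3600.0))
    print("P(missed by both):   ", prob_missed_by_all([0.5, 0.8]))
```

A perfect detector ($r^* = 1$) drives the miss probability to zero, which is why the model can guarantee valid checkpoints whenever one is run before saving.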

  17. Pattern
      Formally, a pattern $\text{Pattern}(W, n, \alpha, D)$ is defined by
      - $W$: pattern work length (or period)
      - $n$: number of work segments, of lengths $w_i$ (with $\sum_{i=1}^{n} w_i = W$)
      - $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_n]$: work fraction of each segment ($\alpha_i = w_i / W$ and $\sum_{i=1}^{n} \alpha_i = 1$)
      - $D = [D_1, D_2, \dots, D_{n-1}, D^*]$: detectors used at the end of each segment ($D_i = D^{(j)}$ for some type $j$)
      - [Figure: timeline of segments $w_1, w_2, w_3, \dots, w_n$, each followed by its detector $D_1, D_2, D_3, \dots, D_{n-1}, D^*$, then a checkpoint C]
      - Last detector is perfect to avoid saving corrupted checkpoints
      - The same detector type $D^{(j)}$ could be used at the end of several segments
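The pattern definition maps naturally onto a small data structure. The sketch below is one possible encoding; the class and field names are hypothetical, and the `error_free_time` method simply adds up the work, every verification cost and one checkpoint, which is this illustration's reading of the pattern when no error occurs.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detector:
    name: str
    cost: float    # V for this detector
    recall: float  # r; 1.0 denotes the perfect detector D*


@dataclass
class Pattern:
    work: float                # W, total work in the pattern
    fractions: List[float]     # alpha, one work fraction per segment
    detectors: List[Detector]  # D, the detector ending each segment

    def __post_init__(self) -> None:
        if len(self.fractions) != len(self.detectors):
            raise ValueError("need exactly one detector per segment")
        if not math.isclose(sum(self.fractions), 1.0):
            raise ValueError("work fractions must sum to 1")
        if self.detectors[-1].recall != 1.0:
            raise ValueError("last detector must be perfect")

    @property
    def n(self) -> int:
        """Number of work segments."""
        return len(self.fractions)

    def segment_lengths(self) -> List[float]:
        """w_i = alpha_i * W for each segment."""
        return [a * self.work for a in self.fractions]

    def error_free_time(self, checkpoint_cost: float) -> float:
        """Work plus all verification costs plus one checkpoint."""
        return self.work + sum(d.cost for d in self.detectors) + checkpoint_cost
```

Rejecting a pattern whose last detector is not perfect enforces the first remark above at construction time.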


  20. Summary of results
      In a nutshell:
      - Given a pattern $\text{Pattern}(W, n, \alpha, D)$,
        - we show how to compute the expected execution time
        - we are able to characterize its optimal length
        - we can compute the optimal positions of the partial verifications
      - However, we prove that finding the optimal pattern is NP-hard
      - We design an FPTAS (Fully Polynomial-Time Approximation Scheme) that gives a makespan within $(1 + \epsilon)$ times the optimal, with running time polynomial in the input size and $1/\epsilon$
      - We show a simple greedy algorithm that works well in practice

  21. Summary of results
      Algorithm to determine a pattern $\text{Pattern}(W, n, \alpha, D)$:
      - Use FPTAS or Greedy (or even brute force for small instances) to find the (optimal) number $n$ of segments and the set $D$ of used detectors
      - Arrange the $n - 1$ partial detectors in any order
      - Compute
        $$W^* = \sqrt{\frac{o_{\text{ff}}}{\lambda f_{\text{re}}}} \quad\text{and}\quad \alpha_i^* = \frac{1}{U_n} \cdot \frac{1 - g_{i-1} g_i}{(1 + g_{i-1})(1 + g_i)} \quad\text{for } 1 \le i \le n,$$
        where
        $$o_{\text{ff}} = \sum_{i=1}^{n-1} V_i + V^* + C \quad\text{and}\quad f_{\text{re}} = \frac{1}{2}\left(1 + \frac{1}{U_n}\right),$$
        with
        $$g_i = 1 - r_i \quad\text{and}\quad U_n = 1 + \sum_{i=1}^{n-1} \frac{1 - g_i}{1 + g_i}.$$
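The last step of the algorithm is a closed-form computation, sketched below for a given ordered list of partial detectors. The function name and argument names are hypothetical. The slide does not define $g_0$ and $g_n$; this sketch assumes $g_0 = g_n = 0$, on the grounds that every pattern starts after and ends with a perfect verification (recall 1). Under that assumption the fractions sum to 1, and with no partial detector the period reduces to $\sqrt{(C + V^*)\mu}$.

```python
import math
from typing import List, Sequence, Tuple


def optimal_pattern(partial_costs: Sequence[float],
                    partial_recalls: Sequence[float],
                    perfect_cost: float,
                    checkpoint_cost: float,
                    mtbf: float) -> Tuple[float, List[float]]:
    """Return (W*, [alpha_1*, ..., alpha_n*]) for n - 1 partial detectors
    followed by the perfect one."""
    if len(partial_costs) != len(partial_recalls):
        raise ValueError("one cost and one recall per partial detector")

    lam = 1.0 / mtbf                          # lambda = 1 / mu
    g = [1.0 - r for r in partial_recalls]    # g_1 .. g_{n-1}
    n = len(g) + 1                            # number of segments

    u_n = 1.0 + sum((1.0 - gi) / (1.0 + gi) for gi in g)
    o_ff = sum(partial_costs) + perfect_cost + checkpoint_cost
    f_re = 0.5 * (1.0 + 1.0 / u_n)
    w_star = math.sqrt(o_ff / (lam * f_re))

    # ASSUMPTION (not on the slide): g_0 = g_n = 0, i.e. the pattern is
    # bounded on both sides by a perfect verification.
    g_ext = [0.0] + g + [0.0]
    alphas = [
        (1.0 - g_ext[i - 1] * g_ext[i])
        / (u_n * (1.0 + g_ext[i - 1]) * (1.0 + g_ext[i]))
        for i in range(1, n + 1)
    ]
    return w_star, alphas


if __name__ == "__main__":
    # Hypothetical instance: two partial detectors, costs in seconds.
    w, a = optimal_pattern([2.0, 2.0], [0.5, 0.8], 20.0, 60.0, 3600.0)
    print("W* =", w)
    print("alpha* =", a, "sum =", sum(a))
```

Choosing which detectors to include and how many segments to use is the NP-hard part; this sketch only covers the evaluation step once that choice has been made.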
