Which verification for soft error detection?

Leonardo Bautista-Gomez (1), Anne Benoit (2), Aurélien Cavelan (2),
Saurabh K. Raina (3), Yves Robert (2,4) and Hongyang Sun (2)

(1) Argonne National Laboratory, USA
(2) ENS Lyon & INRIA, France
(3) Jaypee Institute of Information Technology, India
(4) University of Tennessee Knoxville, USA

Anne.Benoit@ens-lyon.fr

Dagstuhl Seminar #15281: Algorithms and Scheduling Techniques to Manage
Resilience and Power Consumption in Distributed Systems
July 6, 2015, Schloss Dagstuhl, Germany
Computing at Exascale

Exascale platform: 10^5 or 10^6 nodes, each equipped with 10^2 or 10^3 cores

Shorter Mean Time Between Failures (MTBF) µ

Theorem: $\mu_p = \mu_{ind} / p$ for arbitrary failure distributions, where
$\mu_{ind}$ is the MTBF of an individual node and p is the number of nodes

MTBF (individual node)        | 1 year | 10 years | 120 years
MTBF (platform of 10^6 nodes) | 30 s   | 5 min    | 1 h

Need more reliable components!! Need more resilient techniques!!!
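To see why the platform MTBF shrinks so dramatically, a minimal sketch of the
theorem above (node counts and individual MTBFs are the ones from the table):

    # Platform MTBF mu_p = mu_ind / p, for p nodes with individual MTBF mu_ind.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    def platform_mtbf_seconds(mtbf_ind_years: float, p: int) -> float:
        """MTBF of a platform of p nodes, each with MTBF mtbf_ind_years."""
        return mtbf_ind_years * SECONDS_PER_YEAR / p

    for mtbf_ind in (1, 10, 120):   # years, as in the table above
        mu_p = platform_mtbf_seconds(mtbf_ind, p=10**6)
        print(f"node MTBF = {mtbf_ind:>3} years -> platform MTBF = {mu_p:8.0f} s")
    # ~32 s, ~315 s (about 5 min) and ~3784 s (about 1 h), matching the table.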
General-purpose approach

Periodic checkpoint, rollback and recovery:
[Figure: pattern C; W; C; W; C along the time axis, with an error striking
during a work segment W, corrupting data, and being detected only later]

Fail-stop errors: instantaneous error detection, e.g., resource crash

Silent errors (aka silent data corruptions): e.g., soft faults in L1 cache,
ALU, double bit flip

A silent error is detected only when the corrupted data is activated, which
could happen long after its occurrence

Detection latency is problematic ⇒ risk of saving a corrupted checkpoint!
Coping with silent errors

Couple checkpointing with verification:
[Figure: pattern V*; C; W; V*; C; W; V*; C along the time axis, with a silent
error striking during W and detected by the next verification V*]

Before each checkpoint, run some verification mechanism (checksum, ECC,
coherence tests, TMR, etc)

Silent error is detected by verification ⇒ checkpoint always valid!

Optimal period (Young/Daly):

        | Fail-stop (classical)  | Silent errors
Pattern | $T = W + C$            | $T = W + V^* + C$
Optimal | $W^* = \sqrt{2C\mu}$   | $W^* = \sqrt{(C + V^*)\mu}$
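A minimal numerical sketch of the two period formulas above (the MTBF and the
costs C and V* are made-up values for illustration only):

    import math

    mu = 3600.0      # platform MTBF in seconds (assumed value)
    C = 60.0         # checkpoint cost in seconds (assumed value)
    V_star = 30.0    # cost of a guaranteed verification (assumed value)

    # Young/Daly period for fail-stop errors: W* = sqrt(2 * C * mu)
    w_failstop = math.sqrt(2 * C * mu)

    # Analogue for silent errors, with one guaranteed verification per
    # checkpoint: W* = sqrt((C + V*) * mu)
    w_silent = math.sqrt((C + V_star) * mu)

    print(f"fail-stop: W* = {w_failstop:.0f} s, silent: W* = {w_silent:.0f} s")
    # fail-stop: W* = 657 s, silent: W* = 569 s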
One step further

Perform several verifications before each checkpoint:
[Figure: pattern V*; C; W; V*; W; V*; W; V*; C along the time axis, with a
silent error detected by the next intermediate verification]

Pro: silent error is detected earlier in the pattern
Con: additional overhead in error-free executions

How many intermediate verifications to use, and at which positions?
Partial verification

Guaranteed/perfect verifications (V*) can be very expensive!
Partial verifications (V) are available for many HPC applications!

- Lower accuracy: recall $r = \frac{\#\text{detected errors}}{\#\text{total errors}} < 1$
- Much lower cost, i.e., $V < V^*$

[Figure: pattern V*; C; V_1; V_2; V*; C; V_1; V_2; V*; C along the time axis;
a partial verification may or may not detect the error ("Detect?"), while the
final guaranteed verification always does ("Detect!")]

Which verification(s) to use? How many? At which positions?
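A quick Monte Carlo sketch of what a recall r < 1 means in practice (the
recall value 0.8 and the error count are arbitrary choices for illustration):

    import random

    random.seed(42)
    r = 0.8                 # recall of the partial detector (assumed value)
    n_errors = 100_000      # simulated silent errors

    # Each error is detected by the partial verification with probability r;
    # a guaranteed verification (r* = 1) catches everything that slips through.
    detected = sum(random.random() < r for _ in range(n_errors))
    print(f"partial detector caught {detected / n_errors:.1%} of errors")  # ~80%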
Model and objective

Silent errors:
- Poisson process: arrival rate $\lambda = 1/\mu$, where $\mu$ is the platform MTBF
- Strike only computations; checkpointing, recovery, and verifications are protected

Resilience parameters:
- Cost of checkpointing C, cost of recovery R
- k types of partial detectors and a perfect detector: $D^{(1)}, D^{(2)}, \ldots, D^{(k)}, D^*$
  - $D^{(i)}$: cost $V^{(i)}$ and recall $r^{(i)} < 1$
  - $D^*$: cost $V^*$ and recall $r^* = 1$

Objective: design an optimal periodic computing pattern that minimizes the
execution time (or makespan) of the application
Pattern

Formally, a pattern Pattern(W, n, α, D) is defined by:
- W: pattern work length (or period)
- n: number of work segments, of lengths $w_i$ (with $\sum_{i=1}^{n} w_i = W$)
- $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_n]$: work fraction of each segment
  ($\alpha_i = w_i / W$ and $\sum_{i=1}^{n} \alpha_i = 1$)
- $D = [D_1, D_2, \ldots, D_{n-1}, D^*]$: detectors used at the end of each
  segment ($D_i = D^{(j)}$ for some type j)

[Figure: ...; D*; C; w_1; D_1; w_2; D_2; w_3; D_3; ...; w_n; D*; C along the
time axis]

- The last detector is perfect, to avoid saving corrupted checkpoints
- The same detector type $D^{(j)}$ can be used at the end of several segments
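A minimal data-structure sketch of such a pattern (the names and validation
rules below are my own, not from the talk):

    from dataclasses import dataclass

    @dataclass
    class Detector:
        cost: float      # verification cost V
        recall: float    # fraction of errors detected; r = 1 for the perfect D*

    @dataclass
    class Pattern:
        W: float                   # total work length of the pattern (the period)
        alpha: list[float]         # work fraction of each of the n segments
        detectors: list[Detector]  # one detector per segment; the last must be D*

        def __post_init__(self):
            assert abs(sum(self.alpha) - 1.0) < 1e-9, "fractions must sum to 1"
            assert len(self.detectors) == len(self.alpha)
            assert self.detectors[-1].recall == 1.0, "last detector must be perfect"

    # Example: n = 3 segments, two partial detectors then a guaranteed one.
    D1 = Detector(cost=2.0, recall=0.8)
    D_star = Detector(cost=20.0, recall=1.0)
    p = Pattern(W=3600.0, alpha=[0.3, 0.3, 0.4], detectors=[D1, D1, D_star])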
Summary of results

In a nutshell, given a pattern Pattern(W, n, α, D):
- We show how to compute its expected execution time
- We characterize its optimal length
- We compute the optimal positions of the partial verifications

However:
- We prove that finding the optimal pattern is NP-hard
- We design an FPTAS (Fully Polynomial-Time Approximation Scheme) that gives a
  makespan within (1 + ε) times the optimal, with running time polynomial in
  the input size and 1/ε
- We show that a simple greedy algorithm works well in practice
Summary of results

Algorithm to determine a pattern Pattern(W, n, α, D):
- Use FPTAS or Greedy (or even brute force for small instances) to find the
  (optimal) number n of segments and the set D of used detectors
- Arrange the n − 1 partial detectors in any order
- Compute (see the sketch after this list)

  $W^* = \sqrt{\dfrac{o_{ff}}{\lambda f_{re}}}$ and
  $\alpha_i^* = \dfrac{1}{U_n} \cdot \dfrac{1 - g_{i-1} g_i}{(1 + g_{i-1})(1 + g_i)}$ for $1 \le i \le n$,

  where $o_{ff} = \sum_{i=1}^{n-1} V_i + V^* + C$ and
  $f_{re} = \dfrac{1}{2}\left(1 + \dfrac{1}{U_n}\right)$,
  with $g_i = 1 - r_i$ (and the convention $g_0 = g_n = 0$) and
  $U_n = 1 + \sum_{i=1}^{n-1} \dfrac{1 - g_i}{1 + g_i}$
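A direct transcription of these formulas as a minimal sketch (the detector
costs and recalls in the example are made-up inputs):

    import math

    def optimal_pattern(recalls, costs, V_star, C, lam):
        """Optimal length W* and fractions alpha* for a fixed set of detectors.

        recalls, costs: recall r_i and cost V_i of the n-1 partial detectors,
        already arranged in some order; V_star, C: guaranteed verification and
        checkpoint costs; lam: error rate lambda = 1/mu.
        """
        n = len(recalls) + 1
        g = [0.0] + [1 - r for r in recalls] + [0.0]   # convention g_0 = g_n = 0
        U_n = 1 + sum((1 - g[i]) / (1 + g[i]) for i in range(1, n))
        o_ff = sum(costs) + V_star + C                  # fault-free overhead
        f_re = 0.5 * (1 + 1 / U_n)                      # re-execution fraction
        W_star = math.sqrt(o_ff / (lam * f_re))
        alpha = [(1 - g[i-1] * g[i]) / (U_n * (1 + g[i-1]) * (1 + g[i]))
                 for i in range(1, n + 1)]
        return W_star, alpha

    # Example: two partial detectors (r = 0.8, 0.9) before the guaranteed one.
    W, alpha = optimal_pattern(recalls=[0.8, 0.9], costs=[2.0, 3.0],
                               V_star=20.0, C=60.0, lam=1 / 3600)
    print(W, alpha, sum(alpha))   # the fractions sum to 1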
Expected execution time of a pattern

Proposition. The expected time to execute a pattern Pattern(W, n, α, D) is

$\mathbb{E}(W) = W + \sum_{i=1}^{n-1} V_i + V^* + C + \lambda W \left(R + W\,\alpha^T A\,\alpha + d^T \alpha\right) + o(\lambda)$,

where A is a symmetric matrix defined by $A_{ij} = \frac{1}{2}\left(1 + \prod_{k=i}^{j-1} g_k\right)$ for $i \le j$,
and d is a vector defined by $d_i = \sum_{j=i}^{n} \left(\prod_{k=i}^{j-1} g_k\right) V_j$ for $1 \le i \le n$ (with $V_n = V^*$).

First-order approximation (as in Young/Daly's classic formula)

The matrix A is essential to the analysis. For instance, when n = 4 we have:

$A = \frac{1}{2}\begin{pmatrix} 2 & 1+g_1 & 1+g_1 g_2 & 1+g_1 g_2 g_3 \\ 1+g_1 & 2 & 1+g_2 & 1+g_2 g_3 \\ 1+g_1 g_2 & 1+g_2 & 2 & 1+g_3 \\ 1+g_1 g_2 g_3 & 1+g_2 g_3 & 1+g_3 & 2 \end{pmatrix}$
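A sketch that builds A and d and evaluates the first-order formula numerically;
the instance parameters are made up, and the matrix construction follows the
n = 4 example above:

    def expected_time(W, alpha, recalls_all, costs_all, C, R, lam):
        """First-order expected execution time E(W) of a pattern.

        recalls_all, costs_all: recall and cost of all n detectors, the last
        one being the guaranteed verification (recall 1, cost V*).
        """
        n = len(alpha)
        g = [1 - r for r in recalls_all]        # g_n = 0 for the perfect D*

        def prod_g(i, j):                       # product g_i ... g_{j-1}, 1-based
            p = 1.0
            for k in range(i, j):
                p *= g[k - 1]
            return p

        # Symmetric matrix A with A_ij = (1 + g_i ... g_{j-1}) / 2 for i <= j
        A = [[0.5 * (1 + prod_g(min(i, j), max(i, j))) for j in range(1, n + 1)]
             for i in range(1, n + 1)]
        # d_i = sum over j >= i of (g_i ... g_{j-1}) V_j
        d = [sum(prod_g(i, j) * costs_all[j - 1] for j in range(i, n + 1))
             for i in range(1, n + 1)]

        quad = sum(alpha[i] * A[i][j] * alpha[j]
                   for i in range(n) for j in range(n))
        o_ff = sum(costs_all) + C               # includes V* via costs_all
        return W + o_ff + lam * W * (R + W * quad
                                     + sum(di * ai for di, ai in zip(d, alpha)))

    # Example instance (made-up numbers): n = 3, two partial detectors + D*.
    print(expected_time(W=2000.0, alpha=[0.34, 0.30, 0.36],
                        recalls_all=[0.8, 0.9, 1.0], costs_all=[2.0, 3.0, 20.0],
                        C=60.0, R=30.0, lam=1 / 3600))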
Minimizing makespan

For an application with total work $W_{base}$, the makespan is

$W_{final} \approx \dfrac{\mathbb{E}(W)}{W} \times W_{base} = W_{base} + H(W) \times W_{base}$,

where $H(W) = \dfrac{\mathbb{E}(W)}{W} - 1$ is the execution overhead.

For instance, if $W_{base} = 100$ and $W_{final} = 120$, we have H(W) = 20%

Minimizing makespan is equivalent to minimizing overhead!

$H(W) = \dfrac{o_{ff}}{W} + \lambda f_{re} W + \lambda (R + d^T \alpha) + o(\lambda)$

- fault-free overhead: $o_{ff} = \sum_{i=1}^{n-1} V_i + V^* + C$
- re-execution fraction: $f_{re} = \alpha^T A\,\alpha$
Optimal pattern length to minimize overhead

Proposition. The execution overhead of a pattern Pattern(W, n, α, D) is
minimized when its length is

$W^* = \sqrt{\dfrac{o_{ff}}{\lambda f_{re}}}$.

The optimal overhead is

$H(W^*) = 2\sqrt{\lambda\, o_{ff}\, f_{re}} + o(\lambda)$.
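A quick numeric sanity check of the proposition, under made-up instance
parameters: evaluate the first-order overhead $H(W) = o_{ff}/W + \lambda f_{re} W$
on a grid and compare with the closed form.

    import math

    lam, o_ff, f_re = 1 / 3600, 85.0, 0.7    # made-up instance parameters

    def H(W):                                 # first-order overhead, dropping
        return o_ff / W + lam * f_re * W      # the W-independent o(lambda) terms

    W_star = math.sqrt(o_ff / (lam * f_re))   # closed-form optimum
    H_star = 2 * math.sqrt(lam * o_ff * f_re) # closed-form optimal overhead

    best = min((H(W), W) for W in range(100, 10000))
    print(W_star, H_star)    # ~661 s, ~0.257
    print(best)              # grid search agrees with the closed form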