Two-level checkpointing and partial verifications for linear task graphs
Anne Benoit, Aurélien Cavelan, Yves Robert and Hongyang Sun
ENS Lyon, France
Anne.Benoit@ens-lyon.fr
http://graal.ens-lyon.fr/~abenoit
6th Int. Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS15) @ SC'15
November 15, 2015, Austin, TX
Sections: Problem statement, Theoretical analysis, Performance evaluation, Conclusion
Computing at Exascale
- Exascale platform: $10^5$ or $10^6$ nodes, each equipped with $10^2$ or $10^3$ cores
- Shorter Mean Time Between Failures (MTBF) $\mu$
- Theorem: $\mu_p = \mu_{\text{ind}} / p$ for arbitrary distributions, where $p$ is the number of nodes and $\mu_{\text{ind}}$ the MTBF of an individual node

MTBF (individual node):            1 year    10 years    120 years
MTBF (platform of $10^6$ nodes):   30 sec    5 min       1 h

Need more reliable components!! Need more resilient techniques!!!
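The table can be checked directly from the theorem. The sketch below is only an illustration: the function name `platform_mtbf` and the 365-day year are assumptions, not part of the slides.

```python
# Illustrative check of mu_p = mu_ind / p (names and the 365-day year are assumptions).
SECONDS_PER_YEAR = 365 * 24 * 3600


def platform_mtbf(node_mtbf_seconds, num_nodes):
    """MTBF of a platform made of num_nodes nodes with identical individual MTBF."""
    return node_mtbf_seconds / num_nodes


for years in (1, 10, 120):
    mtbf = platform_mtbf(years * SECONDS_PER_YEAR, 10**6)
    print(f"node MTBF {years:>3} y -> platform MTBF {mtbf:8.1f} s ({mtbf / 60:.1f} min)")
```

For $10^6$ nodes this gives roughly 32 seconds, 5.3 minutes and 63 minutes, in line with the rounded values in the table.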
Two main sources of errors
- Fail-stop errors: instantaneous error detection, e.g., resource crash
- Silent errors (aka silent data corruptions), e.g., soft faults in L1 cache, ALU, double bit flip
- A silent error is detected only when the corrupted data is activated, which could happen long after its occurrence: detection latency is problematic
- Before each checkpoint, run some verification mechanism (checksum, ECC, coherence tests, TMR, etc.)
- A silent error is detected by the verification ⇒ the checkpoint is always valid: verified checkpoints, rollback and recovery
One step further and partial verifications
Perform several verifications before each checkpoint:
- Pro: a silent error is detected earlier in the pattern
- Con: additional overhead in error-free executions

[Figure: pattern of tasks between two verified checkpoints ($V^* C$), with additional verifications inside the pattern]

- Guaranteed/perfect verifications ($V^*$) can be very expensive!
- Partial verifications ($V$) are available for many HPC applications!
  - Lower accuracy: recall $r = \frac{\#\text{detected errors}}{\#\text{total errors}} < 1$
  - Much lower cost, i.e., $V < V^*$
- How many intermediate verifications should be used, and at which positions?
Two-level checkpointing
- Silent errors: use a lightweight mechanism of in-memory checkpoints $C_M$
- Local copies are lost in case of fail-stop errors: use (less frequent) copies on stable storage (classical disk checkpoints) $C_D$
- Always $C_M$ before $C_D$: little overhead, enforced in practice
- Always $V^*$ before $C_M$: all checkpoints are valid
- Verifications, memory copies and I/O transfers are protected from errors

[Figure: chain of tasks with a partial verification $V$, a memory checkpoint $V^* C_M$, and a disk checkpoint $V^* C_M C_D$]
Outline
1. Problem statement
2. Theoretical analysis
3. Performance evaluation
4. Conclusion
Application and errors
- Linear chain of tasks $T_1, T_2, \dots, T_n$
- Each task $T_i$ has a weight $w_i$ (computational load)
- $W_{i,j} = \sum_{k=i+1}^{j} w_k$: time to execute tasks $T_{i+1}$ to $T_j$
- Subject to fail-stop and silent errors, independent and following a Poisson process with arrival rates $\lambda_f$ and $\lambda_s$
- $p^f_{i,j} = 1 - e^{-\lambda_f W_{i,j}}$: probability of having at least one fail-stop error while executing $T_{i+1}$ to $T_j$
- $p^s_{i,j} = 1 - e^{-\lambda_s W_{i,j}}$: idem for silent errors
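As a minimal sketch of these definitions, the quantities $W_{i,j}$, $p^f_{i,j}$ and $p^s_{i,j}$ can be computed from prefix sums of the task weights. The helper name `make_chain` and the closure-based layout are illustrative assumptions, not something the slides prescribe.

```python
import math
from itertools import accumulate


def make_chain(weights, lam_f, lam_s):
    """Return W(i, j), p_fail(i, j), p_silent(i, j) for a chain T_1..T_n.

    weights[k-1] is w_k; indices i <= j range over 0..n, as in the slides.
    """
    prefix = [0.0, *accumulate(weights)]  # prefix[j] = w_1 + ... + w_j

    def W(i, j):
        # Time to execute tasks T_{i+1} to T_j.
        return prefix[j] - prefix[i]

    def p_fail(i, j):
        # 1 - exp(-lam_f * W), written with expm1 for accuracy at small rates.
        return -math.expm1(-lam_f * W(i, j))

    def p_silent(i, j):
        return -math.expm1(-lam_s * W(i, j))

    return W, p_fail, p_silent


W, p_fail, p_silent = make_chain([1.0, 2.0, 3.0], lam_f=0.1, lam_s=0.01)
print(W(1, 3), p_fail(1, 3), p_silent(1, 3))
```

Prefix sums make each $W_{i,j}$ an $O(1)$ lookup, which matters once these values are queried inside nested dynamic programming loops.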
Resilience parameters and objective
- Cost of disk checkpointing $C_D$, cost of disk recovery $R_D$
- Cost of memory checkpointing $C_M$, cost of memory recovery $R_M$
- For simplicity, $R_M$ is included in $R_D$
- Cost $V^*$ for a guaranteed verification
- Cost $V$ for a partial verification with recall $r$, where $g = 1 - r$ is the proportion of undetected errors

⇒ Decide where to place disk checkpoints, memory checkpoints, guaranteed verifications and partial verifications, in order to minimize the expected execution time (or makespan) of the application
Dynamic programming algorithm
Several dynamic programming levels:
- First decide where to place disk checkpoints
- Then memory checkpoints between any two disk checkpoints
- And finally, guaranteed or partial verifications between any two memory checkpoints
- Compute the expected execution time between any two verifications

[Figure: positions $d_0, d_1, d_2$ (disk checkpoints), $m_1, m_2$ (memory checkpoints) and $v_1, v_2$ (verifications), with the nested quantities $E(d_1, m_1, v_1, v_2)$, $E_{verif}(d_1, m_1, v_2)$, $E_{mem}(d_1, m_2)$ and $E_{disk}(d_2)$]
Without partial verifications
Placing disk checkpoints:

[Figure: positions $d_0, d_1, d_2$, with $E_{disk}(d_1)$, $E_{mem}(d_1, d_2)$ and $E_{disk}(d_2)$]

$E_{disk}(d_2)$: expected time needed to successfully execute tasks $T_1$ to $T_{d_2}$, where $T_{d_2}$ is followed by $V^* C_M C_D$:

$$E_{disk}(d_2) = \min_{0 \le d_1 < d_2} \left\{ E_{disk}(d_1) + E_{mem}(d_1, d_2) + C_D \right\}$$

- Objective: $E_{disk}(n)$
- Initialization: $E_{disk}(0) = 0$
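The outer recurrence can be sketched on its own by treating $E_{mem}$ as a black box. In the sketch below, `place_disk_checkpoints` and the callback `e_mem` are hypothetical names introduced for illustration; the toy quadratic cost in the usage lines is not from the slides and only exercises the recurrence. The sketch also records the minimizing $d_1$ so that the checkpoint positions can be recovered.

```python
def place_disk_checkpoints(n, e_mem, C_D):
    """Evaluate E_disk(n) bottom-up and recover the disk checkpoint positions.

    e_mem(d1, d2) is a caller-supplied stand-in for E_mem(d1, d2).
    Returns (E_disk(n), sorted list of tasks followed by a disk checkpoint).
    """
    best = [0.0] + [float("inf")] * n   # best[d] = E_disk(d), with E_disk(0) = 0
    prev = [None] * (n + 1)             # prev[d] = minimizing d1 for E_disk(d)
    for d2 in range(1, n + 1):
        for d1 in range(d2):
            cand = best[d1] + e_mem(d1, d2) + C_D
            if cand < best[d2]:
                best[d2], prev[d2] = cand, d1
    # Walk back from n to 0 to list where disk checkpoints are taken.
    positions = []
    d = n
    while d > 0:
        positions.append(d)
        d = prev[d]
    return best[n], positions[::-1]


# Toy stand-in cost: quadratic in the number of tasks between checkpoints.
cost, positions = place_disk_checkpoints(4, lambda d1, d2: (d2 - d1) ** 2, C_D=2.0)
print(cost, positions)
```

The same pattern (a table of best values plus a table of minimizers) applies unchanged to the inner levels of the dynamic program.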
Without partial verifications
Placing memory checkpoints:

[Figure: positions $d_0, d_1, d_2$ and $m_1, m_2$, with $E_{mem}(d_1, m_1)$, $E_{verif}(d_1, m_1, m_2)$, $E_{mem}(d_1, m_2)$ and $E_{mem}(d_1, d_2)$]

$E_{mem}(d_1, m_2)$: expected time needed to successfully execute tasks $T_{d_1+1}$ to $T_{m_2}$, where $T_{d_1}$ is followed by $V^* C_M C_D$ and $T_{m_2}$ is followed by $V^* C_M$:

$$E_{mem}(d_1, m_2) = \min_{d_1 \le m_1 < m_2} \left\{ E_{mem}(d_1, m_1) + E_{verif}(d_1, m_1, m_2) + C_M \right\}$$

- Initialization: $E_{mem}(d_1, d_1) = 0$
Without partial verifications
Placing additional guaranteed verifications:

[Figure: positions $d_1$, $m_1, m_2$ and $v_1, v_2$, with $E_{verif}(d_1, m_1, v_1)$, $E(d_1, m_1, v_1, v_2)$, $E_{verif}(d_1, m_1, v_2)$ and $E_{verif}(d_1, m_1, m_2)$]

$E_{verif}(d_1, m_1, v_2)$: expected time needed to successfully execute tasks $T_{m_1+1}$ to $T_{v_2}$, where $T_{d_1}$ is followed by $V^* C_M C_D$, $T_{m_1}$ is followed by $V^* C_M$, and $T_{v_2}$ is followed by $V^*$:

$$E_{verif}(d_1, m_1, v_2) = \min_{m_1 \le v_1 < v_2} \left\{ E_{verif}(d_1, m_1, v_1) + E(d_1, m_1, v_1, v_2) \right\}$$

- Initialization: $E_{verif}(d_1, m_1, m_1) = 0$
Without partial verifications
Expected execution time between two verifications, $E(d_1, m_1, v_1, v_2)$, knowing the positions of the last $C_D$ and the last $C_M$:
- With probability $p^f_{v_1,v_2}$, a fail-stop error strikes: recover from $C_D$
- Otherwise, with probability $p^s_{v_1,v_2}$, a silent error strikes: detect it at $v_2$ and recover from $C_M$

$$E(d_1, m_1, v_1, v_2) = p^f_{v_1,v_2} \left( T^{lost}_{v_1,v_2} + R_D + E_{mem}(d_1, m_1) + E_{verif}(d_1, m_1, v_1) + E(d_1, m_1, v_1, v_2) \right) + \left( 1 - p^f_{v_1,v_2} \right) \left( W_{v_1,v_2} + V^* + p^s_{v_1,v_2} \left( R_M + E_{verif}(d_1, m_1, v_1) + E(d_1, m_1, v_1, v_2) \right) \right)$$

Compute $T^{lost}_{v_1,v_2} = \frac{1}{\lambda_f} - \frac{W_{v_1,v_2}}{e^{\lambda_f W_{v_1,v_2}} - 1}$ and simplify.
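The recursion is linear in $E(d_1, m_1, v_1, v_2)$, so it can be solved for that term. Writing $W = W_{v_1,v_2}$ and using $(1 - p^f)(1 - p^s) = e^{-(\lambda_f + \lambda_s) W}$, one simplification is

$$E = e^{\lambda_s W} \left( \frac{e^{\lambda_f W} - 1}{\lambda_f} + V^* \right) + e^{\lambda_s W} \left( e^{\lambda_f W} - 1 \right) \left( R_D + E_{mem}(d_1, m_1) \right) + \left( e^{(\lambda_f + \lambda_s) W} - 1 \right) E_{verif}(d_1, m_1, v_1) + \left( e^{\lambda_s W} - 1 \right) R_M.$$

The following is a minimal sketch of the three-level dynamic program without partial verifications, built on that closed form. The function names (`segment_time`, `expected_makespan`), the memoized recursive layout and the sample parameter values are assumptions made for illustration, not the authors' implementation. It assumes $\lambda_f > 0$ and a chain short enough for Python's default recursion limit.

```python
import math
from functools import lru_cache
from itertools import accumulate


def segment_time(wt, lam_f, lam_s, R_D, R_M, V_star, e_mem, e_verif):
    """Closed form of E(d1, m1, v1, v2) for a segment of work wt = W_{v1,v2}.

    e_mem stands for E_mem(d1, m1), e_verif for E_verif(d1, m1, v1).
    """
    ef_m1 = math.expm1(lam_f * wt)  # e^{lam_f W} - 1
    es_m1 = math.expm1(lam_s * wt)  # e^{lam_s W} - 1
    es = es_m1 + 1.0                # e^{lam_s W}
    return (es * (ef_m1 / lam_f + V_star)
            + es * ef_m1 * (R_D + e_mem)
            + (es * ef_m1 + es_m1) * e_verif  # (e^{(lam_f+lam_s) W} - 1) * E_verif
            + es_m1 * R_M)


def expected_makespan(w, lam_f, lam_s, C_D, C_M, R_D, R_M, V_star):
    """E_disk(n) for task weights w, using guaranteed verifications only."""
    n = len(w)
    prefix = [0.0, *accumulate(w)]

    @lru_cache(maxsize=None)
    def E_verif(d1, m1, v2):
        if v2 == m1:
            return 0.0
        e_mem = E_mem(d1, m1)
        best = float("inf")
        for v1 in range(m1, v2):
            ev = E_verif(d1, m1, v1)
            wt = prefix[v2] - prefix[v1]
            seg = segment_time(wt, lam_f, lam_s, R_D, R_M, V_star, e_mem, ev)
            best = min(best, ev + seg)
        return best

    @lru_cache(maxsize=None)
    def E_mem(d1, m2):
        if m2 == d1:
            return 0.0
        return min(E_mem(d1, m1) + E_verif(d1, m1, m2) + C_M
                   for m1 in range(d1, m2))

    @lru_cache(maxsize=None)
    def E_disk(d2):
        if d2 == 0:
            return 0.0
        return min(E_disk(d1) + E_mem(d1, d2) + C_D for d1 in range(d2))

    return E_disk(n)


# Sample values chosen only for illustration.
print(expected_makespan([100.0] * 6, lam_f=1e-4, lam_s=5e-4,
                        C_D=30.0, C_M=5.0, R_D=30.0, R_M=5.0, V_star=2.0))
```

This direct version evaluates $O(n^3)$ states for $E_{verif}$, each with a minimum over $O(n)$ candidates, so it is meant for small chains; it returns only the optimal expected time, not the positions.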