Introduction Probabilistic models Buddy algorithm Silent errors Conclusion Optimal checkpointing periods with fail-stop and silent errors Anne Benoit ENS Lyon Anne.Benoit@ens-lyon.fr http://graal.ens-lyon.fr/~abenoit 3rd JLESC Summer School June 30, 2016 Anne.Benoit@ens-lyon.fr 3rd JLESC Summer School Optimal checkpointing periods 1/ 57

Exascale platforms
Hierarchical
• $10^5$ or $10^6$ nodes
• Each node equipped with $10^4$ or $10^3$ cores
Failure-prone
MTBF, one node                   | 1 year | 10 years | 120 years
MTBF, platform of $10^6$ nodes   | 30 sec | 5 min    | 1 h
Exascale $\neq$ Petascale × 1000
More nodes ⇒ Shorter MTBF (Mean Time Between Failures)

Even for today's platforms
[Figure, courtesy of F. Cappello]

A few definitions
• Many types of faults: software error, hardware malfunction, memory corruption
• Many possible behaviors: silent, transient, unrecoverable
• Restrict to faults that lead to application failures
• This includes all hardware faults, and some software ones
• Will use terms fault and failure interchangeably
• Silent errors (SDC) will be addressed later in the course
First question: quantify the rate or frequency at which these faults strike!

Exponential failure distributions
[Figure: failure probability of a sequential machine versus time (years), for Exp(1/100)]
$Exp(\lambda)$: Exponential distribution law of parameter $\lambda$
• Probability density function (pdf): $f(t) = \lambda e^{-\lambda t}$ for $t \geq 0$
• Cumulative distribution function (cdf): $F(t) = 1 - e^{-\lambda t}$
• Mean: $\mu = \frac{1}{\lambda}$
$X$ random variable for $Exp(\lambda)$ failure inter-arrival times:
• $P(X \leq t) = 1 - e^{-\lambda t}$ (by definition)
• Memoryless property: $P(X \geq t + s \mid X \geq s) = P(X \geq t)$ for all $t, s \geq 0$: at any instant, the time to the next failure does not depend on the time elapsed since the last failure
• Mean Time Between Failures (MTBF): $\mu = E(X) = \frac{1}{\lambda}$
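The cdf, the memoryless property and the MTBF can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the rate $\lambda = 1/100$ per year is taken from the Exp(1/100) curve of the figure.

```python
import math

def cdf(t, lam):
    """F(t) = P(X <= t) = 1 - exp(-lam * t), for t >= 0."""
    return 1.0 - math.exp(-lam * t)

def survival(t, lam):
    """P(X >= t) = exp(-lam * t), for t >= 0."""
    return math.exp(-lam * t)

def conditional_survival(t, s, lam):
    """P(X >= t + s | X >= s), from the definition of conditional probability."""
    return survival(t + s, lam) / survival(s, lam)

lam = 1.0 / 100.0      # assumed rate: one failure per 100 years on average
mtbf = 1.0 / lam       # mean of Exp(lam)

print("MTBF (years):", mtbf)
print("P(failure before the MTBF):", cdf(mtbf, lam))
# Memoryless property: both values below should coincide.
print(conditional_survival(50.0, 200.0, lam), survival(50.0, lam))
```

Note that the probability of failing before the MTBF is $1 - e^{-1}$, not one half: the exponential law is skewed towards early failures.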

With several processors
• Rebooting only faulty processor
• Platform failure distribution ⇒ superposition of $p$ IID processor distributions ⇒ IID only for Exponential
• Define $\mu_p$ by $\lim_{F \to +\infty} \frac{n(F)}{F} = \frac{1}{\mu_p}$, where $n(F)$ = number of platform failures until time $F$ is exceeded
Theorem: $\mu_p = \frac{\mu}{p}$ for arbitrary distributions
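As a quick check of the theorem against the figures quoted on the "Exascale platforms" slide, the sketch below divides the node MTBF by the number of nodes. The helper name `platform_mtbf` and the 365-day year are assumptions made for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

def platform_mtbf(node_mtbf, p):
    """Platform MTBF mu_p = mu / p for p identical processors."""
    if p <= 0:
        raise ValueError("p must be a positive number of processors")
    return node_mtbf / p

# Node MTBFs of 1, 10 and 120 years on a platform of 10^6 nodes.
for years in (1, 10, 120):
    mu_p = platform_mtbf(years * SECONDS_PER_YEAR, 10**6)
    print(f"node MTBF {years:>3} years -> platform MTBF {mu_p:8.1f} s ({mu_p / 60:.1f} min)")
```

The three results are roughly 32 seconds, 5 minutes and 1 hour, in line with the rounded values of the slide.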

Summary for the road
• MTBF is the key parameter, and $\mu_p = \frac{\mu}{p}$
• Exponential distribution OK for most purposes
• Failure independence is assumed, although it is not (completely) true

General purpose approach
Periodic checkpointing, rollback and recovery

Outline
1 Probabilistic models
  • Young/Daly's approximation
  • Assessing protocols at scale
2 In-memory checkpointing
3 Dealing with silent errors
4 Conclusion

Checkpointing cost
[Figure: timeline alternating time spent working and time spent checkpointing: computing the first chunk, checkpointing the first chunk, processing the second chunk]
Blocking model: while a checkpoint is taken, no computation can be performed

Framework
• Periodic checkpointing policy of period $T$
• Independent and identically distributed (IID) failures
• Applies to a single processor with MTBF $\mu = \mu_{ind}$
• Applies to a platform with $p$ processors with MTBF $\mu = \frac{\mu_{ind}}{p}$
  • coordinated checkpointing
  • tightly-coupled application
  • progress ⇔ all processors available
  ⇒ platform = single (powerful, unreliable) processor
Waste: fraction of time not spent for useful computations

Waste in fault-free execution
[Figure: timeline alternating time spent working and time spent checkpointing]
• $\text{Time}_{base}$: application base time
• $\text{Time}_{FF}$: with periodic checkpoints but failure-free
$\text{Time}_{FF} = \text{Time}_{base} + \#checkpoints \times C$
$\#checkpoints = \left\lceil \frac{\text{Time}_{base}}{T - C} \right\rceil \approx \frac{\text{Time}_{base}}{T - C}$ (valid for large jobs)
$\text{Waste}[FF] = \frac{\text{Time}_{FF} - \text{Time}_{base}}{\text{Time}_{FF}} = \frac{C}{T}$
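The fault-free formulas can be evaluated directly. The following sketch uses hypothetical function names and made-up values (a period of 600 s, a checkpoint cost of 60 s and a base time of 540000 s); it only illustrates that the waste computed from $\text{Time}_{FF}$ matches $C/T$.

```python
import math

def time_ff(time_base, T, C):
    """Failure-free time: base time plus one checkpoint of cost C per period T."""
    if not 0 < C < T:
        raise ValueError("need 0 < C < T")
    n_checkpoints = math.ceil(time_base / (T - C))  # each period holds T - C of work
    return time_base + n_checkpoints * C

def waste_ff(T, C):
    """Fault-free waste C / T (large-job approximation)."""
    return C / T

# Illustrative values only: times in seconds.
base, T, C = 540000, 600, 60
ff = time_ff(base, T, C)
print("Time_FF:", ff)
print("waste from times:", (ff - base) / ff, " C/T:", waste_ff(T, C))
```

Here the base time is an exact multiple of $T - C$, so the ceiling has no effect and the two waste values agree exactly; for other job lengths they agree only up to the large-job approximation.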

Waste due to failures
• $\text{Time}_{base}$: application base time
• $\text{Time}_{FF}$: with periodic checkpoints but failure-free
• $\text{Time}_{final}$: expectation of time with failures
$\text{Time}_{final} = \text{Time}_{FF} + N_{faults} \times T_{lost}$
• $N_{faults}$: number of failures during execution
• $T_{lost}$: average time lost per failure
$N_{faults} = \frac{\text{Time}_{final}}{\mu}$
$T_{lost}$?

Computing $T_{lost}$
[Figure: timeline for processors $P_0$ to $P_3$ showing time spent working, time spent checkpointing, downtime and recovery time; a failure strikes on average $T/2$ into a period of length $T$ (work $T - C$, then checkpoint $C$), followed by downtime $D$ and recovery $R$]
$T_{lost} = D + R + \frac{T}{2}$
Rationale
⇒ Instants when periods begin and failures strike are independent
⇒ Approximation used for all distribution laws
⇒ Exact for Exponential and uniform distributions
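The $T/2$ term can be checked empirically with a small Monte Carlo experiment. This is a minimal sketch under simplifying assumptions that are not part of the slides: failures follow a Poisson process with exponential inter-arrival times, and the periods form a fixed grid that is not restarted after a failure, which is one way to model the independence of period starts and failure instants. Function names, the seed and the numeric values are hypothetical.

```python
import random

def mean_offset_in_period(mu, T, n_failures, seed=0):
    """Average position of a failure inside its period, on a fixed period grid."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    t = 0.0
    total = 0.0
    for _ in range(n_failures):
        t += rng.expovariate(1.0 / mu)   # next failure instant (mean inter-arrival mu)
        total += t % T                   # time elapsed since the current period began
    return total / n_failures

def t_lost(D, R, T):
    """Average time lost per failure: downtime + recovery + half a period."""
    return D + R + T / 2.0

# Illustrative values only (seconds): MTBF 1000, period 60.
estimate = mean_offset_in_period(mu=1000.0, T=60.0, n_failures=200000)
print("empirical mean offset:", estimate, " T/2:", 60.0 / 2)
print("T_lost with D = 1, R = 2:", t_lost(1.0, 2.0, 60.0))
```

The empirical mean offset should come out close to $T/2 = 30$, which is the work lost on average before downtime and recovery are added.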

Waste due to failures
$\text{Time}_{final} = \text{Time}_{FF} + N_{faults} \times T_{lost}$
$\text{Waste}[fail] = \frac{\text{Time}_{final} - \text{Time}_{FF}}{\text{Time}_{final}} = \frac{1}{\mu} \left( D + R + \frac{T}{2} \right)$
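Substituting $N_{faults} = \text{Time}_{final} / \mu$ into the first equation gives $\text{Time}_{final} \left( 1 - \frac{T_{lost}}{\mu} \right) = \text{Time}_{FF}$, from which the waste formula follows. The sketch below evaluates both quantities; the function names and the numeric values (in seconds) are hypothetical and chosen only for illustration.

```python
import math

def waste_fail(mu, D, R, T):
    """Waste due to failures: (D + R + T / 2) / mu."""
    return (D + R + T / 2.0) / mu

def time_final(time_ff, mu, D, R, T):
    """Expected time with failures, solving Time_final = Time_FF + (Time_final / mu) * T_lost."""
    w = waste_fail(mu, D, R, T)
    if w >= 1.0:
        raise ValueError("T_lost >= mu: the first-order model predicts no progress")
    return time_ff / (1.0 - w)

# Illustrative values only: platform MTBF of 1 h, D = R = 60 s, period T = 600 s.
mu, D, R, T = 3600.0, 60.0, 60.0, 600.0
ff = 600000.0
final = time_final(ff, mu, D, R, T)
print("Waste[fail]:", waste_fail(mu, D, R, T))
print("Time_final:", final)
print("waste from times:", (final - ff) / final)
```

The guard on `w >= 1.0` reflects a limit of this first-order model: when the average time lost per failure reaches the MTBF, the formula no longer yields a finite execution time.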