On the Combination of Silent Error Detection and Checkpointing

Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Frédéric Vivien & Dounia Zaidouni
PRDC 2013

1 Introduction, motivation
2 Optimal checkpointing strategy
  • Exponential distribution
  • Arbitrary distribution
3 Limited resources
4 Incorporating detection
  • k checkpoints for 1 verification
  • k verifications for 1 checkpoint
5 Conclusion, future work
6 Announcement

A few definitions
• Many types of faults: software error, hardware malfunction, memory corruption
• Many possible behaviors: transient, unrecoverable, silent
• Restrict to silent errors
• This includes some software faults, some hardware errors (soft errors in L1 cache), double bit flips
• A silent error is detected only when the corrupt data is activated
• Silent errors are the black swans of errors (Marc Snir)

Error sources (courtesy Franck Cappello)
• Analysis of error and failure logs
• In 2005 (Ph.D. of Charng-Da Lu): “Software halts account for the most number of outages (59-84 percent), and take the shortest time to repair (0.6-1.5 hours). Hardware problems, albeit rarer, need 6.3-100.7 hours to solve.”
• In 2007 (Garth Gibson, ICPP Keynote): [chart not recovered; the only surviving label is "Hardware 50%"]
• In 2008 (Oliner and J. Stearley, DSN Conf.): [chart not recovered]
Software errors: applications, OS bugs (kernel panic), communication libraries, file system errors, and others.
Hardware errors: disks, processors, memory, network.
Conclusion: both hardware and software failures have to be considered.


[Figure: Error and detection latency. Time axis with the marks "Error" and "Detection" and the intervals $X_e$ and $X_d$.]
• $X_e$: inter-arrival time between errors; mean time $\mu_e$
• $X_d$: error detection time; mean time $\mu_d$
• Assume $X_d$ and $X_e$ independent
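The two random variables above can be made concrete with a small sampling sketch. It draws error inter-arrival times $X_e$ and detection latencies $X_d$ independently and returns, for each error, the instant it strikes and the instant it is detected. Assumptions that are not from the slides: the function name `sample_error_timeline` is hypothetical, and $X_d$ is drawn from an Exponential law purely for illustration (the slides only give its mean $\mu_d$).

```python
import random


def sample_error_timeline(mu_e, mu_d, n, seed=0):
    """Draw n (strike_time, detection_time) pairs.

    X_e (time between consecutive errors) is Exponential with mean mu_e.
    X_d (detection latency) is ASSUMED Exponential with mean mu_d here;
    the slides only specify its mean. X_e and X_d are sampled independently.
    """
    rng = random.Random(seed)
    strike = 0.0
    events = []
    for _ in range(n):
        strike += rng.expovariate(1.0 / mu_e)    # next error after X_e
        latency = rng.expovariate(1.0 / mu_d)    # detected X_d later
        events.append((strike, strike + latency))
    return events


if __name__ == "__main__":
    for strike, detected in sample_error_timeline(mu_e=1000.0, mu_d=50.0, n=5):
        print(f"error at {strike:10.2f}, detected at {detected:10.2f}")
```

Because the latency is unbounded, an error may be detected long after the checkpoint that follows it, which is why detection has to be modeled alongside checkpointing.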

Notations
• $C$: checkpointing time
• $R$: recovery time
• $W$: total work
• $w$: some piece of work

For one chunk
When $X_e$ follows an Exponential law of parameter $\lambda_e = 1/\mu_e$, in order to execute a total work of $w + C$, we need:

\[
E(T(w)) = e^{-\lambda_e (w+C)} (w+C) + \left(1 - e^{-\lambda_e (w+C)}\right) \left( E(T_{lost}) + E(X_d) + E(T_{rec}) + E(T(w)) \right)
\]

• $e^{-\lambda_e (w+C)}$: probability of execution without error
• $1 - e^{-\lambda_e (w+C)}$: probability of error during $w + C$
• $E(T_{lost}) + E(X_d) + E(T_{rec}) + E(T(w))$: execution time with an error
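The recursion above is linear in $E(T(w))$, so it can be solved for $E(T(w))$ directly. A short derivation, writing $p = e^{-\lambda_e (w+C)}$ as shorthand (the symbol $p$ is introduced here for convenience and does not appear in the slides):

```latex
\begin{align*}
E(T(w)) &= p\,(w+C) + (1-p)\bigl(E(T_{lost}) + E(X_d) + E(T_{rec}) + E(T(w))\bigr) \\
p\,E(T(w)) &= p\,(w+C) + (1-p)\bigl(E(T_{lost}) + E(X_d) + E(T_{rec})\bigr) \\
E(T(w)) &= (w+C) + \bigl(e^{\lambda_e (w+C)} - 1\bigr)\bigl(E(T_{lost}) + E(X_d) + E(T_{rec})\bigr)
\end{align*}
```

The last step uses $(1-p)/p = e^{\lambda_e (w+C)} - 1$. The expected time is therefore the error-free time $w + C$ plus the per-error loss $E(T_{lost}) + E(X_d) + E(T_{rec})$ weighted by $e^{\lambda_e (w+C)} - 1$.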

Let us focus on the time lost due to an error: $E(T_{lost}) + E(X_d) + E(T_{rec})$

• $E(T_{lost})$ is the time elapsed between the completion of the last checkpoint and the error:

\[
E(T_{lost}) = \int_0^{\infty} x \, P(X_e = x \mid X_e < w + C) \, dx
= \frac{1}{P(X_e < w + C)} \int_0^{w+C} x \, \lambda_e e^{-\lambda_e x} \, dx
= \frac{1}{\lambda_e} - \frac{w + C}{e^{\lambda_e (w+C)} - 1}
\]

• $E(X_d) = \mu_d$ is the time needed for error detection.

• $E(T_{rec})$ is the time to recover from the error (there can be a fault during recovery):

\[
E(T_{rec}) = e^{-\lambda_e R} R + \left(1 - e^{-\lambda_e R}\right) \left( E(R_{lost}) + E(X_d) + E(T_{rec}) \right)
\]

Similarly to $E(T_{lost})$, we have:

\[
E(R_{lost}) = \frac{1}{\lambda_e} - \frac{R}{e^{\lambda_e R} - 1}
\]
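The pieces above can be evaluated numerically. The following is a minimal sketch, not the authors' code: it computes $E(T_{lost})$ and $E(R_{lost})$ from their closed forms, solves the two linear recursions for $E(T_{rec})$ and $E(T(w))$, and returns the expected time for one chunk. The function names and the parameter values in the usage example are hypothetical and chosen only for illustration.

```python
import math


def expected_lost(lam, span):
    """Expected time before the error, given that an error strikes
    within an interval of length `span` (E(T_lost) for span = w + C,
    E(R_lost) for span = R)."""
    return 1.0 / lam - span / math.expm1(lam * span)


def expected_recovery(lam, R, mu_d):
    """Solve E(T_rec) = q*R + (1 - q)*(E(R_lost) + mu_d + E(T_rec))
    for E(T_rec), with q = exp(-lam * R)."""
    q = math.exp(-lam * R)
    return R + (1.0 - q) / q * (expected_lost(lam, R) + mu_d)


def expected_chunk_time(lam, w, C, R, mu_d):
    """Solve the recursion for E(T(w)): expected time to execute
    a chunk of work w followed by a checkpoint of length C."""
    p = math.exp(-lam * (w + C))  # probability of no error during w + C
    loss = (expected_lost(lam, w + C)
            + mu_d
            + expected_recovery(lam, R, mu_d))
    return (w + C) + (1.0 - p) / p * loss


if __name__ == "__main__":
    # Illustrative values only: mean time between errors 1000, so lam = 1e-3.
    lam, w, C, R, mu_d = 1e-3, 100.0, 10.0, 5.0, 20.0
    print("E(T_lost) =", expected_lost(lam, w + C))
    print("E(T_rec)  =", expected_recovery(lam, R, mu_d))
    print("E(T(w))   =", expected_chunk_time(lam, w, C, R, mu_d))
```

Solving each recursion by isolating the unknown on one side keeps the code a direct transcription of the formulas, which makes it easy to check against any simplified closed form.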
