On the Combination of Silent Error Detection and Checkpointing
Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Frédéric Vivien & Dounia Zaidouni
PRDC 2013
Silent error detection, G. Aupy

1 Introduction, motivation
2 Optimal checkpointing strategy
   Exponential distribution
   Arbitrary distribution
3 Limited resources
4 Incorporating detection
   k checkpoints for 1 verification
   k verifications for 1 checkpoint
5 Conclusion, future work
6 Announcement
A few definitions

• Many types of faults: software error, hardware malfunction, memory corruption
• Many possible behaviors: transient, unrecoverable, silent
• Restrict to silent errors
• This includes some software faults, some hardware errors (soft errors in L1 cache), double bit flip
• Silent error detected when corrupt data is activated
• Silent errors are the black swans of errors (Marc Snir)
Error sources (courtesy Franck Cappello)

• Analysis of error and failure logs
• In 2005 (Ph.D. of Charng-Da Lu): “Software halts account for the most number of outages (59-84 percent), and take the shortest time to repair (0.6-1.5 hours). Hardware problems, albeit rarer, need 6.3-100.7 hours to solve.”
• In 2007 (Garth Gibson, ICPP Keynote): [Figure: chart; the only recoverable label is "Hardware 50%"]
• In 2008 (Oliner and J. Stearley, DSN Conf.): [Figure not recoverable]

Software errors: applications, OS bug (kernel panic), communication libs, file system error and other.
Hardware errors: disks, processors, memory, network.

Conclusion: both hardware and software failures have to be considered.
Figure: Error and detection latency (timeline marking Error and Detection, with intervals $X_e$ and $X_d$).

• $X_e$: inter-arrival time between errors; mean time $\mu_e$
• $X_d$: error detection time; mean time $\mu_d$
• Assume $X_d$ and $X_e$ independent
Notations

• $C$: checkpointing time
• $R$: recovery time
• $W$: total work
• $w$: some piece of work
For one chunk

When $X_e$ follows an Exponential law of parameter $\lambda_e = \frac{1}{\mu_e}$, in order to execute a total work of $w + C$, we need:

\[
E(T(w)) = e^{-\lambda_e (w+C)} (w + C) + \left(1 - e^{-\lambda_e (w+C)}\right) \left( E(T_{lost}) + E(X_d) + E(T_{rec}) + E(T(w)) \right)
\]

• $e^{-\lambda_e (w+C)}$: probability of execution without error
• $1 - e^{-\lambda_e (w+C)}$: probability of error during $w + C$
• $E(T_{lost}) + E(X_d) + E(T_{rec}) + E(T(w))$: execution time with an error
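Since $E(T(w))$ appears on both sides of the equation above, it can be isolated by elementary algebra. The sketch below does this under the assumption that the per-error overhead $E(T_{lost}) + E(X_d) + E(T_{rec})$ is already known and passed in as a single number; the function name, parameter names, and numeric values are illustrative and do not come from the slides.

```python
import math

def expected_chunk_time(w, C, lam_e, overhead):
    """Solve E = p*(w+C) + (1-p)*(overhead + E) for E, where p = exp(-lam_e*(w+C)).

    `overhead` stands for E(T_lost) + E(X_d) + E(T_rec), assumed given.
    Rearranging yields E = (w+C) + (exp(lam_e*(w+C)) - 1) * overhead.
    """
    a = w + C
    return a + math.expm1(lam_e * a) * overhead

# Illustrative values only (seconds); none of them come from the slides.
w, C, lam_e, overhead = 3600.0, 600.0, 1e-5, 900.0
E = expected_chunk_time(w, C, lam_e, overhead)

# Sanity check: plugging E back into the right-hand side reproduces E.
p = math.exp(-lam_e * (w + C))
rhs = p * (w + C) + (1.0 - p) * (overhead + E)
print(E, rhs)
```

The rearranged form makes the structure visible: the error-free time $w + C$, plus the overhead weighted by $e^{\lambda_e (w+C)} - 1$.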
Let us focus on the time lost due to an error:

\[
E(T_{lost}) + E(X_d) + E(T_{rec})
\]
$E(T_{lost})$ is the time elapsed between the completion of the last checkpoint and the error:

\[
\begin{aligned}
E(T_{lost}) &= \int_0^{\infty} x \, P(X = x \mid X < w + C) \, dx \\
            &= \frac{1}{P(X < w + C)} \int_0^{w+C} x \, \lambda_e e^{-\lambda_e x} \, dx \\
            &= \frac{1}{\lambda_e} - \frac{w + C}{e^{\lambda_e (w+C)} - 1}
\end{aligned}
\]
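The closed form for $E(T_{lost})$ can be checked numerically against the conditional-mean integral it comes from. The following is a minimal sketch: the function names, the midpoint-rule integration, and the parameter values are choices made for illustration, not part of the slides.

```python
import math

def expected_time_lost(a, lam_e):
    """Closed form 1/lam_e - a / (exp(lam_e*a) - 1), with a = w + C."""
    return 1.0 / lam_e - a / math.expm1(lam_e * a)

def expected_time_lost_numeric(a, lam_e, n=200_000):
    """Midpoint-rule estimate of (1/P(X<a)) * integral_0^a x*lam_e*exp(-lam_e*x) dx."""
    h = a / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += x * lam_e * math.exp(-lam_e * x)
    prob_error = -math.expm1(-lam_e * a)  # P(X < a) = 1 - exp(-lam_e*a)
    return total * h / prob_error

# Illustrative values only: a = w + C = 4200 s, error rate 1e-3 per second.
a, lam_e = 4200.0, 1e-3
closed = expected_time_lost(a, lam_e)
numeric = expected_time_lost_numeric(a, lam_e)
print(closed, numeric)
```

Because the error is conditioned to strike before $w + C$, the result always lies between 0 and $w + C$.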
$E(X_d)$ is the time needed for error detection: $E(X_d) = \mu_d$.
$E(T_{rec})$ is the time to recover from the error (there can be a fault during recovery):

\[
E(T_{rec}) = e^{-\lambda_e R} R + \left(1 - e^{-\lambda_e R}\right) \left( E(R_{lost}) + E(X_d) + E(T_{rec}) \right)
\]

Similarly to $E(T_{lost})$, we have:

\[
E(R_{lost}) = \frac{1}{\lambda_e} - \frac{R}{e^{\lambda_e R} - 1}
\]
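The pieces above can be combined into a single number for one chunk. The sketch below solves the two recursions (for $E(T_{rec})$ and for $E(T(w))$) by isolating the unknown on the left-hand side, then compares the result with the simplification $e^{\lambda_e R} (\mu_e + \mu_d) (e^{\lambda_e (w+C)} - 1)$. That simplification is not stated in this excerpt; it is what the algebra on the recursions gives here, so treat it as an assumption to be checked rather than a quoted result. All names and numeric values are illustrative.

```python
import math

def expected_lost(length, lam_e):
    """Mean time to the error, given it strikes within `length` (E(T_lost) or E(R_lost))."""
    return 1.0 / lam_e - length / math.expm1(lam_e * length)

def expected_recovery(R, lam_e, mu_d):
    """Solve E(T_rec) = q*R + (1-q)*(E(R_lost) + mu_d + E(T_rec)), with q = exp(-lam_e*R)."""
    return R + math.expm1(lam_e * R) * (expected_lost(R, lam_e) + mu_d)

def expected_chunk_time(w, C, R, lam_e, mu_d):
    """Solve the one-chunk recursion for E(T(w)) using the terms defined above."""
    a = w + C
    overhead = expected_lost(a, lam_e) + mu_d + expected_recovery(R, lam_e, mu_d)
    return a + math.expm1(lam_e * a) * overhead

def simplified_chunk_time(w, C, R, lam_e, mu_d):
    """Candidate simplification: exp(lam_e*R) * (mu_e + mu_d) * (exp(lam_e*(w+C)) - 1)."""
    mu_e = 1.0 / lam_e
    return math.exp(lam_e * R) * (mu_e + mu_d) * math.expm1(lam_e * (w + C))

# Illustrative values only (seconds); none of them come from the slides.
w, C, R, lam_e, mu_d = 3600.0, 600.0, 600.0, 1e-4, 300.0
from_recursion = expected_chunk_time(w, C, R, lam_e, mu_d)
from_simplified = simplified_chunk_time(w, C, R, lam_e, mu_d)
print(from_recursion, from_simplified)
```

Keeping the recursive version next to the simplified one makes it easy to re-run the comparison for other parameter values before relying on either.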