The Impact of Recovery Mechanisms on the Likelihood of Saving - PowerPoint PPT Presentation

The Impact of Recovery Mechanisms on the Likelihood of Saving Corrupted State
Subhachandra Chandra, Cosine Communications
Peter M. Chen, University of Michigan

Motivation
- Computer software is not reliable
- Recovery from failures is vital for usability and availability
- Successful recovery requires that the system does not save data that has been corrupted by the fault
- The recovery system itself may increase the chances of saving corrupted state

Main factors
- Quality of error detection
- Location of the fault
- Frequency of state saves
- Comprehensiveness of state saved

Comprehensiveness / Frequency of State Commits

                                 More comprehensive /   Less comprehensive /
                                 more frequent          less frequent
automatic state reconstruction   more                   little
failure transparency             transparent            visible
commits of corrupt state         more likely            less likely

Recovery System Determines Comprehensiveness and Frequency
- Generic mechanisms
  - have to save all state
  - have to save state for all visible events
  - e.g. checkpointing, logging
- Application-specific mechanisms
  - know which state is important
  - know which visible events are important
  - e.g. auto-save

Strategies for Saving State
- Three strategies by varying comprehensiveness and frequency
- LC/LF - Less Comprehensive / Less Frequent: application-specific recovery
- C/LF - Comprehensive / Less Frequent: modified generic recovery
- C/F - Comprehensive / Frequent: generic recovery like Discount Checking

Obtaining Faulty Runs
- Inject faults either into the source code or dynamically into the process address space during execution
- Detect failures by comparing the output of the run into which faults have been injected with the output from a good run
- If the run did not complete, or completed with faulty output, it is counted as a failure (faulty run)
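The classification rule above can be sketched in a few lines. This is only an illustration: the `RunResult` record and the `is_faulty_run` function are hypothetical names, not part of the original study's tooling, and the sketch assumes a run's output can be captured as bytes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    """Outcome of one workload run (hypothetical record for illustration)."""
    completed: bool          # False if the run crashed or hung
    output: Optional[bytes]  # captured output, None if nothing was produced


def is_faulty_run(injected: RunResult, golden: RunResult) -> bool:
    """Count a run as faulty if it did not complete or its output
    differs from the output of the fault-free (good) run."""
    if not injected.completed:
        return True
    return injected.output != golden.output


if __name__ == "__main__":
    golden = RunResult(completed=True, output=b"42\n")
    print(is_faulty_run(RunResult(True, b"42\n"), golden))
    print(is_faulty_run(RunResult(False, None), golden))
```

Runs where the injected fault has no visible effect are not counted as faulty under this rule, which is why each table reports results relative to faulty runs only.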

Detecting Corrupted Committed State: Application-Specific Recovery
- Have a reference run generate all the possible states saved by the application on the disk
- Compare the final state saved by the faulty run on the disk with the list of reference states
- If the final state does not match any of the reference states, then corrupted state was committed by the recovery mechanism
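A minimal sketch of the reference-state comparison, assuming each saved state can be read back as a byte string. The function names and the use of SHA-256 digests to avoid keeping full copies of every reference state are assumptions made for illustration, not details from the slides.

```python
import hashlib
from typing import Iterable, Set


def digest(state: bytes) -> str:
    """Fingerprint a saved state so states can be compared cheaply."""
    return hashlib.sha256(state).hexdigest()


def build_reference_set(reference_states: Iterable[bytes]) -> Set[str]:
    """Collect fingerprints of every state the fault-free reference run saved."""
    return {digest(state) for state in reference_states}


def committed_corrupted_state(final_state: bytes, reference: Set[str]) -> bool:
    """The faulty run committed corrupted state if its final saved state
    matches none of the reference states."""
    return digest(final_state) not in reference


if __name__ == "__main__":
    reference = build_reference_set([b"doc v1", b"doc v2", b"doc v3"])
    print(committed_corrupted_state(b"doc v2", reference))
    print(committed_corrupted_state(b"doc v2 \x00garbage", reference))
```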

Detecting Corrupted Committed State: Generic Recovery
- Recover the application from the last saved checkpoint
- If the application does not complete with the correct results, then the run recovered from corrupted state
- Another way to detect whether the committed state was corrupted is to check whether the last checkpoint was committed after the activation of the fault
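Both checks can be sketched as follows. The names are hypothetical, `recover_and_run` stands in for whatever restarts the application from its last checkpoint, and timestamps are assumed to come from a single clock so they can be compared directly.

```python
from typing import Callable, Optional


def recovered_from_corrupted_state(
    recover_and_run: Callable[[], Optional[bytes]],
    expected_output: bytes,
) -> bool:
    """First check: restart from the last checkpoint and see whether the
    run finishes with the correct results. None models a run that did
    not complete after recovery."""
    output = recover_and_run()
    return output != expected_output


def checkpoint_after_fault(
    last_commit_time: float,
    fault_activation_time: Optional[float],
) -> bool:
    """Second check: the last checkpoint is treated as corrupted if it was
    committed after the fault was activated. None means the injected
    fault was never activated."""
    if fault_activation_time is None:
        return False
    return last_commit_time > fault_activation_time


if __name__ == "__main__":
    print(recovered_from_corrupted_state(lambda: b"result", b"result"))
    print(checkpoint_after_fault(last_commit_time=12.0, fault_activation_time=9.5))
```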

Workload and Fault Models
Workloads: nvi, postgres, oleo

Fault Type                   Example of Programming Error
stack                        flip random bit
allocation                   move use(ptr) to after free(ptr)
heap                         flip random bit
off-by-one                   substitute < with <=
initialization               delete i=0;
delete branch                substitute "if" for a "while"
delete random instruction    delete a simple statement "i=j+k;"
destination variable         substitute one dest. variable with another
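As an illustration of the simplest fault type in the list (stack and heap bit flips), the sketch below flips one random bit in a byte buffer. It is a toy model: the original study injects faults into a live process's address space or its source code, while this hypothetical `flip_random_bit` helper only operates on a Python `bytearray` standing in for a memory region.

```python
import random


def flip_random_bit(memory: bytearray, rng: random.Random) -> int:
    """Flip one randomly chosen bit in place and return its bit index."""
    if not memory:
        raise ValueError("memory region is empty")
    bit_index = rng.randrange(len(memory) * 8)
    byte_index, bit_in_byte = divmod(bit_index, 8)
    memory[byte_index] ^= 1 << bit_in_byte  # XOR toggles exactly that bit
    return bit_index


if __name__ == "__main__":
    region = bytearray(b"\x00" * 8)     # stand-in for a stack or heap region
    rng = random.Random(1234)           # seeded so a run can be repeated
    flipped = flip_random_bit(region, rng)
    print(flipped, region.hex())
```

Seeding the generator makes each injected fault reproducible, which helps when a faulty run has to be repeated under a different recovery strategy.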

Results for nvi - Application Faults

Fault Type        Faulty Runs   App-specific   Low Freq App-Generic   App-Generic   Undetected Errors
Stack             50            0              0                      0             0
Alloc             50            24             40                     50            0
Heap              50            6              12                     35            8
Off by One        50            6              7                      9             12
Init Errors       50            0              2                      2             0
Delete Branch     50            25             27                     34            8
Delete Inst       50            12             14                     24            3
Change Dest Var   50            1              5                      8             5
Total             400           74 (19%)       107 (27%)              162 (41%)     36 (9%)

Results for postgres - Application Faults

Fault Type        Faulty Runs   App-specific   Low Freq App-Generic   App-Generic   Undetected Errors
Stack             50            0              16                     17            1
Alloc             50            0              22                     24            0
Heap              50            0              0                      44            2
Off by One        50            0              0                      0             8
Init Errors       50            0              2                      3             2
Delete Branch     50            0              0                      38            6
Delete Inst       50            1              2                      6             5
Change Dest Var   50            2              2                      3             0
Total             400           3 (1%)         44 (11%)               135 (34%)     24 (6%)

Results for oleo - Application Faults

Fault Type        Faulty Runs   App-specific   Low Freq App-Generic   App-Generic   Undetected Errors
Stack             50            0              0                      3             0
Alloc             50            0              2                      34            9
Heap              50            0              0                      12            19
Off by One        50            0              0                      10            7
Init Errors       50            0              3                      15            8
Delete Branch     50            0              0                      19            7
Delete Inst       50            0              2                      9             18
Change Dest Var   50            3              3                      5             20
Total             400           3 (1%)         10 (3%)                107 (27%)     88 (22%)

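Faults in the Operating System
[Figure: layered diagram with the labels Application, Recovery Mech., System Call, OS and Hardware, marking a Fault in the OS and the resulting Error in the Application]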

Results for nvi - OS Faults

Fault Type        Faulty Runs   App-specific   Low Freq App-Generic   App-Generic   Undetected Errors
Stack             50            0              1                      6             0
Alloc             50            1              5                      19            0
Heap              50            2              3                      4             0
Off by One        50            0              6                      11            0
Init Errors       50            3              2                      8             1
Delete Branch     50            1              2                      12            0
Delete Inst       50            0              1                      6             0
Change Dest Var   50            2              0                      5             0
Total             400           9 (2%)         20 (5%)                71 (18%)      1 (0%)

Results for postgres - OS Faults

Fault Type        Faulty Runs   App-specific   Low Freq App-Generic   App-Generic   Undetected Errors
Stack             50            0              5                      5             0
Heap              50            1              3                      3             0
Off by One        50            0              0                      0             0
Init Errors       50            0              0                      0             0
Delete Branch     50            1              2                      2             0
Delete Inst       50            0              1                      2             0
Change Dest Var   50            0              0                      0             1
Total             350           2 (1%)         11 (3%)                12 (3%)       1 (0%)

Results for oleo - OS Faults

Fault Type        Faulty Runs   App-specific   Low Freq App-Generic   App-Generic   Undetected Errors
Stack             50            4              0                      3             0
Alloc             50            0              0                      0             0
Heap              50            1              1                      1             0
Off by One        50            3              0                      0             0
Init Errors       50            0              1                      1             0
Delete Branch     50            1              3                      4             0
Delete Inst       50            5              0                      1             0
Change Dest Var   50            3              4                      4             0
Total             400           17 (4%)        9 (2%)                 14 (3%)       0 (0%)
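The totals from the six results tables can be put side by side with a short script. The counts below are copied from the "Total" rows. The column order (app-specific, low-frequency app-generic, app-generic) is an assumption read off the flattened table headers, and the function and variable names are illustrative only.

```python
# Totals per workload: (faulty runs, app-specific, low-freq app-generic, app-generic)
APPLICATION_FAULTS = {
    "nvi": (400, 74, 107, 162),
    "postgres": (400, 3, 44, 135),
    "oleo": (400, 3, 10, 107),
}
OS_FAULTS = {
    "nvi": (400, 9, 20, 71),
    "postgres": (350, 2, 11, 12),
    "oleo": (400, 17, 9, 14),
}


def corrupted_commit_rates(totals):
    """Return the percentage of faulty runs that committed corrupted
    state, one value per recovery strategy."""
    runs, *counts = totals
    return tuple(100.0 * count / runs for count in counts)


def report(title, tables):
    """Print one line per workload with the three rates."""
    print(title)
    for workload, totals in tables.items():
        lclf, clf, cf = corrupted_commit_rates(totals)
        print(f"  {workload:9s} LC/LF {lclf:5.1f}%  C/LF {clf:5.1f}%  C/F {cf:5.1f}%")


if __name__ == "__main__":
    report("Application faults", APPLICATION_FAULTS)
    report("OS faults", OS_FAULTS)
```

The script prints rates to one decimal place, so individual values can differ slightly from the whole-number percentages shown on the slides.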

Conclusions
- Generic recovery mechanisms are of little use in the presence of application-level faults, as they save corrupted state very frequently
- The increased likelihood of saving corrupted state seems to be due more to the frequency of state saves than to their comprehensiveness
- When the faults are in the operating system layer, the likelihood of saving corrupted state is reduced significantly. Generic recovery mechanisms can be useful in such cases
