CSCI654 Advanced Computer Architecture
When Software Meets Hardware Faults
Hao Han, hhan@cs.wm.edu
7 April 2009
Some slides are adapted from the talks on "SWAT" [ASPLOS'08], "SymPLFIED" [DSN'08], "Trace-based diagnosis" [DSN'08], and "Likely program invariants" [DSN'08].
Outline
• Motivation
• Background
• Research points
  – Program verification: SymPLFIED
  – Error detection: SWAT
• Experimental methodology (see report)
• Limitations
• Conclusion
Motivation
• Goal: highly reliable systems
• Conventional illusion: hardware devices appear fault-free to software
  ⇒ Reliability work cannot focus only on software bugs in programs
• Hardware faults will happen in the field
  – Traditional solutions: (1) hardware redundancy, (2) special circuits to verify the hardware
  ⇒ Too expensive in area, power, and so on
• Today: rethink the reliability problem with hardware faults in mind, especially faults in the core
Background - Location of H/W faults
Microarchitectural structure    Fault
Instruction decoder             Instruction being decoded is corrupted
Integer ALU                     Output latch of one of the ALUs
FP ALU                          Output latch of one of the ALUs
Address or data bus             Bus of register, cache, or memory
Physical reg file               Physical regs in the reg file
Reorder buffer (ROB)            Src/dest reg of an instruction in a ROB entry
Address gen unit (AGEN)         Virtual address generated by the unit
Register alias table (RAT)      Logical-to-physical mapping of a logical reg
Background - Hardware Faults
• Categories of H/W faults: (1) permanent, (2) transient, (3) intermittent
• Impact of H/W faults
Research Points
• Program verification under hardware faults
  – SymPLFIED [DSN'08] (Best Paper Award)
• Low-cost error detection for hardware faults
  – SWAT [ASPLOS'08]
  – SWAT Trace-Based Fault Diagnosis [DSN'08]
  – Likely Program Invariants [DSN'08]
  – Accurate Fault Models [HPCA'09]
SymPLFIED [DSN'08]
• Goal: a formal framework to evaluate the effects of hardware faults on arbitrary programs, independent of the detection mechanism
[Figure: Conceptual design flow of SymPLFIED]
Techniques of SymPLFIED
• Model error propagation by representing errors in the program as abstract symbols <symbolic execution>
  – Represents all kinds of faults
  – Avoids the explosion of exhaustive fault injection
• Automatically search for values of the symbolic error that escape detection and cause silent data corruption (SDC) <model checking>
  – Bounded model checking using satisfiability solving
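The two techniques on this slide can be illustrated with a toy sketch. This is not SymPLFIED itself: the program `y = 2*x + 3`, the range-check detector, and all names below are hypothetical, and a brute-force search over a bounded range stands in for the satisfiability solver. The error is kept as a single symbol while it propagates, and only at the end does the code search for concrete error values that pass the detector yet produce a wrong output.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Affine:
    """A value of the form a*err + b, where err is the one symbolic error."""
    a: int
    b: int

    def __add__(self, k: int) -> "Affine":
        return Affine(self.a, self.b + k)

    def __mul__(self, k: int) -> "Affine":
        return Affine(self.a * k, self.b * k)

    def at(self, err: int) -> int:
        """Evaluate the expression for one concrete error value."""
        return self.a * err + self.b


def golden(x: int) -> int:
    """Fault-free toy program (hypothetical): y = 2*x + 3."""
    return 2 * x + 3


def symbolic_output() -> Affine:
    """Same program, with the input register replaced by the symbol err."""
    r1 = Affine(1, 0)   # r1 = err (the corrupted register)
    r2 = r1 * 2         # r2 = 2*err
    return r2 + 3       # r3 = 2*err + 3


def passes_detector(y: int) -> bool:
    """Hypothetical detector: a range check on the program output."""
    return 0 <= y <= 100


def find_sdc_witnesses(x: int, bound: int) -> List[int]:
    """Bounded search for error values that escape detection and corrupt y."""
    out = symbolic_output()
    want = golden(x)
    return [e for e in range(-bound, bound + 1)
            if passes_detector(out.at(e)) and out.at(e) != want]


witnesses = find_sdc_witnesses(x=5, bound=60)
print(len(witnesses), witnesses[:3])
```

The symbolic run happens once, regardless of how many concrete error values exist; that is the sense in which the symbolic representation avoids exhaustive fault injection. A real tool would hand the escape condition to a solver instead of enumerating values.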
SWAT System
• Assumptions:
  – Multicore system where a fault-free core is always available
  – Checkpoint/rollback mechanism
• Goals:
  – Provide low-cost, software-level detection of permanent hardware faults, plus low-level diagnosis to support recovery and possibly repair/reconfiguration
• SWAT components
  – Detection: use software-visible symptoms to detect faults
  – Diagnosis: identify the faulty unit
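One way to see why the two assumptions matter is a replay-based decision procedure. The sketch below is an assumption made for illustration and is not stated on the slide: after a symptom, execution is rolled back to the last checkpoint and replayed, first on the suspect core and then on the fault-free core. The function name, the core labels, and the result strings are all hypothetical.

```python
from typing import Callable


def diagnose(replay_has_symptom: Callable[[str], bool]) -> str:
    """Classify a detected symptom by replaying from the last checkpoint.

    replay_has_symptom(core) is a hypothetical hook: it rolls back to the
    checkpoint, re-executes on the named core ("suspect" or "fault_free"),
    and reports whether the symptom shows up again.
    """
    if not replay_has_symptom("suspect"):
        # The symptom did not repeat on the same core.
        return "transient_or_nondeterministic"
    if replay_has_symptom("fault_free"):
        # The symptom repeats even on known-good hardware.
        return "software_bug"
    # Repeats on the suspect core only.
    return "permanent_hardware_fault"


# Example: symptom repeats on the suspect core but not on the fault-free one.
print(diagnose(lambda core: core == "suspect"))
```

Without a checkpoint there is nothing to replay from, and without a fault-free core there is no reference execution to compare against.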
[Figure: timeline labels: Checkpoint, Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair. Numbered components: 1. Detectors w/ simple symptoms [ASPLOS'08]; 2. Detectors w/ compiler support [DSN'08]; 3. Trace-Based Fault Diagnosis [DSN'08]; 4. Accurate Fault Models [HPCA'09]]
Simple Symptoms
• Observe anomalous symptoms for fault detection
  – Incur low overhead for “always-on” detectors
  – Minimal support from hardware, no software support
• Anomalous symptoms
  – Fatal hardware traps
    • For example, division by zero, RED state, etc.
  – Abnormal application exit, indicated by the OS
    • For example, the application terminates due to a segmentation fault
  – Hangs
    • The whole system becomes unresponsive
    • Detected by setting up a counter
  – High OS activity
    • Monitor the amount of time execution remains in the OS without returning to the application
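The four symptom classes above can be sketched as a single monitor over an event stream. This is a minimal illustration under stated assumptions: the event encoding, the trap names, and both thresholds (`HANG_LIMIT`, `OS_LIMIT`) are hypothetical and are not taken from SWAT.

```python
from typing import Iterable, Optional, Tuple

FATAL_TRAPS = {"div_by_zero", "red_state"}  # hypothetical trap names
HANG_LIMIT = 5   # hypothetical: max consecutive cycles with no progress
OS_LIMIT = 4     # hypothetical: max consecutive cycles spent in the OS

Event = Tuple[str, str]  # (kind, detail)


def first_symptom(events: Iterable[Event]) -> Optional[str]:
    """Return the first anomalous symptom in the stream, or None."""
    idle = 0     # counter used for hang detection
    os_run = 0   # counter used for high-OS-activity detection
    for kind, detail in events:
        if kind == "trap" and detail in FATAL_TRAPS:
            return "fatal_trap"
        if kind == "exit" and detail != "normal":
            return "abnormal_exit"
        if kind == "cycle":
            if detail == "stall":      # nothing made progress this cycle
                idle += 1
            else:                      # detail is "os" or "app"
                idle = 0
                os_run = os_run + 1 if detail == "os" else 0
            if idle > HANG_LIMIT:
                return "hang"
            if os_run > OS_LIMIT:
                return "high_os_activity"
    return None


print(first_symptom([("cycle", "app"), ("exit", "segfault")]))
```

Each check is a comparison or a counter update, which is why detectors of this kind can stay always on at low overhead.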
[Figure: timeline labels: Checkpoint, Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair. Numbered components: 1. SWAT [ASPLOS'08]; 2. Detectors w/ compiler support [DSN'08]; 3. Trace-Based Fault Diagnosis [DSN'08]; 4. Accurate Fault Models [HPCA'09]]
Likely Program Invariant
[Figure: training phase. Labels: Application; Compiler Pass in LLVM; Invariant Monitoring Code; Test, train, external inputs (i/p #1 ... i/p #n); Range (one per input); Invariant Ranges, of the form MIN ≤ value ≤ MAX]
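The training phase on this slide produces one range per input and combines them into invariant ranges of the form MIN ≤ value ≤ MAX. A minimal sketch of that combination step follows; the notion of a named monitoring "site" and the data layout are assumptions made for illustration, not details from the slide.

```python
from typing import Dict, Iterable, Tuple

Range = Tuple[int, int]  # (MIN, MAX)


def train_ranges(runs: Iterable[Dict[str, Iterable[int]]]) -> Dict[str, Range]:
    """Merge the values observed over all training inputs into one range per site.

    Each run maps a (hypothetical) monitored site name to the values
    the monitoring code recorded there for one training input.
    """
    ranges: Dict[str, Range] = {}
    for run in runs:
        for site, values in run.items():
            for v in values:
                lo, hi = ranges.get(site, (v, v))
                ranges[site] = (min(lo, v), max(hi, v))
    return ranges


runs = [
    {"site_a": [3, 7], "site_b": [10]},   # values seen with i/p #1
    {"site_a": [1, 5]},                   # values seen with i/p #2
]
print(train_ranges(runs))  # {'site_a': (1, 7), 'site_b': (10, 10)}
```

Because the ranges come from a finite set of training inputs, they are only likely invariants: a fault-free run on a new input may still fall outside them.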
Likely Program Invariant
[Figure: training phase and fault detection phase. Training phase labels: Application; Compiler Pass in LLVM; Invariant Monitoring Code; Test, train, external inputs (i/p #1 ... i/p #n); Range; Invariant Ranges. Fault detection phase labels: Application; Compiler Pass in LLVM; Invariant Checking Code; Ref input; Inject Faults; Full System Simulation; Invariant Violation; SWAT Diagnosis; outcomes: Fault Detection, or False Positive (Disable Invariant)]
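The detection phase on this slide checks values against the trained ranges, sends violations to diagnosis, and disables an invariant when the violation turns out to be a false positive. A minimal sketch of that loop is below. The class, its method names, and the boolean standing in for the diagnosis result are hypothetical.

```python
from typing import Dict, Set, Tuple

Range = Tuple[int, int]  # (MIN, MAX)


class InvariantChecker:
    """Range checks with false-positive disabling (illustrative only)."""

    def __init__(self, ranges: Dict[str, Range]) -> None:
        self.ranges = dict(ranges)
        self.disabled: Set[str] = set()

    def violates(self, site: str, value: int) -> bool:
        """True when value breaks an enabled invariant at this site."""
        if site in self.disabled or site not in self.ranges:
            return False
        lo, hi = self.ranges[site]
        return not (lo <= value <= hi)

    def handle_violation(self, site: str, diagnosis_found_fault: bool) -> str:
        """Act on the diagnosis result for a violation at this site."""
        if diagnosis_found_fault:
            return "fault_detected"
        self.disabled.add(site)   # false positive: stop checking this site
        return "false_positive"


checker = InvariantChecker({"site_a": (1, 7)})
if checker.violates("site_a", 9):
    print(checker.handle_violation("site_a", diagnosis_found_fault=False))
print(checker.violates("site_a", 9))  # False: the invariant is now disabled
```

Disabling trades some coverage for fewer repeated false alarms: once an invariant is shown to be too tight for real inputs, it no longer triggers diagnosis.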