Fault-tolerant techniques

EDA421/DIT171 - Parallel and Distributed Real-Time Systems, Chalmers/GU, 2011/2012
Lecture #14, updated May 2, 2012

Fault-tolerant techniques
What are the effects if the hardware or software is not fault-free in a real-time system?

Fault-tolerant techniques
What causes component faults?
• Specification or design faults:
  – Incomplete or erroneous models
  – Lack of techniques for formal checking
• Component defects:
  – Manufacturing effects (in hardware or software)
  – Wear and tear due to component use
• Environmental effects:
  – High stress (temperature, G-forces, vibrations)
  – Electromagnetic or elementary-particle radiation

Fault-tolerant techniques
What types of (hardware) faults are there?
• Permanent faults:
  – Total failure of a component
  – Caused by, for example, short-circuits or melt-down
  – Remains until component is repaired or replaced
• Transient faults:
  – Temporary malfunctions of a component
  – Caused by magnetic or ionizing radiation, or power fluctuation
• Intermittent faults:
  – Repeated occurrences of transient faults
  – Caused by, for example, loose wires

Fault-tolerant techniques
What types of (software) faults are there?
• Permanent faults:
  – Total failure of a component
  – Caused by, for example, corrupted data structures
  – Remains until component is repaired or replaced
• Transient faults:
  – Temporary malfunctions of a component
  – Caused by data-dependent bugs in the program code
• Intermittent faults:
  – Repeated occurrences of transient faults
  – Caused by, for example, dangling-pointer problems

Fault-tolerant techniques
How are faults handled at run-time?
• Error detection:
  – Erroneous data or program behavior is detected
  – Watchdog mechanism, comparisons, diagnostic tests
• Error correction:
  – The originally-intended data/behavior is restored
  – Intelligent codes used for restoring corrupt data
  – Check-pointing used for restoring corrupt program flow
• Fault masking:
  – Effects of erroneous data or program behavior are "hidden"
  – Voting mechanism

Fault-tolerant techniques
How are errors detected?
• Watchdog mechanism:
  – A monitor looks for signs that hardware or software is faulty
  – For example: time-outs, signature checking, or checksums
• Comparisons:
  – The outputs of redundant components are compared
  – A "golden run" of intended behavior can be available
• Diagnostic tests:
  – Tests on hardware or software are (transparently) executed as part of the schedule

Fault-tolerant techniques
How is fault-tolerance obtained?
• Hardware redundancy:
  – Additional hardware components are used
• Software redundancy:
  – Different application software versions are used
• Time redundancy:
  – Schedule contains ample slack so tasks can be re-executed
• Information redundancy:
  – Data is coded so that errors can be detected and/or corrected

Fault-tolerant techniques
Hardware redundancy:
• Voting mechanism:
  – Majority voter (largest group must have majority of values)
  – k-plurality voter (largest group must have at least k values)
  – Median voter
• N-modular redundancy (NMR):
  – 2m+1 units are needed to mask the effects of m faults
  – One or more voters can be used in parallel
This technique is very expensive, which means that it is only justified in the most critical applications.
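The time-out variant of the watchdog mechanism listed under error detection can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the class name `Watchdog`, its methods, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class Watchdog:
    """Software watchdog (hypothetical sketch): the monitored task must
    call kick() at least once every `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock            # injectable so the monitor can be tested
        self.last_kick = clock()

    def kick(self):
        """Called by the monitored task to signal that it is still alive."""
        self.last_kick = self.clock()

    def expired(self):
        """Called by the monitor: True if the task missed its time-out."""
        return self.clock() - self.last_kick > self.timeout
```

A monitor would poll `expired()` periodically and trigger error handling (for example, a restart) when it returns True. A hardware watchdog follows the same pattern, with the time-out enforced by a separate timer circuit.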

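The three voter types listed under hardware redundancy, together with the 2m+1 sizing rule for NMR, can be sketched as below. The function names are hypothetical, the outputs are assumed to be a non-empty list of comparable values, and ties between equally large groups are resolved by first occurrence, which is a simplification chosen for this example.

```python
from collections import Counter
from statistics import median_low


def majority_vote(outputs):
    """Return the value produced by more than half of the units, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def k_plurality_vote(outputs, k):
    """Return the most common value if at least k units agree on it, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= k else None


def median_vote(outputs):
    """Return the middle value of the sorted outputs (always one of the inputs)."""
    return median_low(outputs)


def units_needed(m):
    """NMR sizing rule from the slide: 2m + 1 units mask m faulty units."""
    return 2 * m + 1
```

With three units (triple modular redundancy), `majority_vote([7, 7, 9])` masks the single deviating output, while `median_vote([10, 12, 250])` discards the outlier without requiring exact agreement between units.
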
Fault-tolerant techniques
Software redundancy:
• N-version programming:
  – Different versions of the program are run in parallel
  – Voting is used for fault masking
  – Software development is diversified using different languages and even different software development teams
• Recovery-block approach:
  – Different versions of the program are used, but only one version is run at a time
  – Acceptance test is used for determining validity of results
This technique is also very expensive, because of the development of independent program versions.

Fault-tolerant techniques
Time redundancy (backward error recovery):
• Retry:
  – The failed instruction is repeated
• Rollback:
  – Execution is re-started from the beginning of the program
  – Execution is re-started from a checkpoint where sufficient program state has been saved
This technique does not require additional hardware, which significantly reduces the weight, size, power consumption and cost of the system.

Fault-tolerant techniques
Information redundancy (forward error recovery):
• Duplication:
  – Errors are detected by duplicating each data word
• Parity encoding:
  – Errors are detected/corrected by keeping the number of ones in the data word odd or even
• Checksum codes:
  – Errors are detected by adding the data words into sums
• Cyclic codes:
  – Errors are detected/corrected by interpreting the data bits as coefficients in a polynomial and deriving redundant bits through division of a generator polynomial

Fault-tolerant scheduling
To extend real-time computing towards fault-tolerance, the following issues must be considered:
1. What is the fault model used?
  – What type of fault is assumed?
  – How and when are faults detected?
2. How should fault-tolerance be implemented?
  – Using temporal redundancy (re-execution)?
  – Using spatial redundancy (replicated tasks/processors)?
3. What scheduling policy should be used?
  – Extend existing policies (for example, RM or EDF)?
  – Suggest new policies?
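The recovery-block approach and checkpoint-based rollback described above combine naturally: state is saved once, each version runs on a fresh copy of it, and an acceptance test decides whether to keep the result. The sketch below is one possible arrangement under that assumption; `recovery_block`, `versions` and `acceptance_test` are hypothetical names, and treating an exception as a failed version is a choice made for this example.

```python
import copy


def recovery_block(state, versions, acceptance_test):
    """Run the versions one at a time until one passes the acceptance test.

    Each version works on a copy of the checkpoint, so a failed version
    cannot corrupt the state seen by the next one (rollback).
    """
    checkpoint = copy.deepcopy(state)          # save sufficient program state
    for version in versions:
        working = copy.deepcopy(checkpoint)    # restart from the checkpoint
        try:
            result = version(working)
        except Exception:
            continue                           # a crash counts as a failed version
        if acceptance_test(result):
            return result
    raise RuntimeError("all versions failed the acceptance test")
```

Only one version executes at a time, so no extra hardware is needed, but the worst-case execution time grows with the number of alternates that may have to run.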

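The parity, checksum and cyclic codes listed under information redundancy can be illustrated with a few small functions. This is a sketch with hypothetical names; bits are represented as lists of 0/1 with the most significant bit first, and the generator polynomial is assumed to start with a 1 and to be shorter than the bit string being divided. Note that a single parity bit only detects an odd number of bit errors; it cannot correct them.

```python
def parity_bit(word, even=True):
    """Bit to append so that the total number of ones is even (or odd)."""
    bit = bin(word).count("1") % 2
    return bit if even else bit ^ 1


def checksum(words, bits=8):
    """Sum of the data words, truncated to `bits` bits."""
    return sum(words) & ((1 << bits) - 1)


def mod2_remainder(bits, generator):
    """Remainder of polynomial division over GF(2) (XOR instead of subtract)."""
    r = len(generator) - 1
    work = list(bits)
    for i in range(len(work) - r):
        if work[i]:
            for j, g in enumerate(generator):
                work[i + j] ^= g
    return work[-r:]


def crc_encode(data, generator):
    """Append the redundant bits derived from dividing data * x^r by the generator."""
    r = len(generator) - 1
    return list(data) + mod2_remainder(list(data) + [0] * r, generator)


def crc_ok(codeword, generator):
    """A received codeword is accepted if it divides evenly by the generator."""
    return not any(mod2_remainder(codeword, generator))
```

For example, with generator x^3 + x + 1 (`[1, 0, 1, 1]`), the data bits `[1, 1, 0, 1]` are encoded as `[1, 1, 0, 1, 0, 0, 1]`, and flipping any single bit of that codeword makes `crc_ok` return False.
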
Fault-tolerant scheduling
What fault model is used?
Type of fault:
  – Transient, intermittent and/or permanent faults
  – For transient/intermittent faults: is there a minimum interarrival time between two subsequent faults?
Error detection:
  – Voting (after task execution)
  – Checksums or signature checking (during task execution)
  – Watchdogs or diagnostic testing (during task execution)
Note: the fault model assumed is a key part of the method used for validating the system. If the true system behavior differs from the assumed behavior, any guarantees we have made may not be correct!

Fault-tolerant scheduling
How is fault-tolerance implemented?
Temporal redundancy:
  – Tasks are re-executed to provide replicas for voting decisions
  – Tasks are re-executed to recover from a fault
  – Re-execution may be from beginning or from check-point
  – Re-executed task may be original or simplified version
Spatial redundancy:
  – Replicas of tasks are distributed on multiple processors
  – Identical or different implementations of tasks
  – Voting decisions are made to detect errors or mask faults
Note: the choice of fault-tolerance mechanism should be made in conjunction with the choice of scheduling policy.

Fault-tolerant scheduling
What do existing scheduling policies offer?
Static scheduling:
  – Simple to implement (unfortunately, supported by very few commercial real-time operating systems)
  – High observability (facilitates monitoring, testing & debugging)
  – Natural points in time for self-check & synchronization (facilitates implementation of task redundancy)
Dynamic scheduling:
  – RM simple to implement (supported by most commercial real-time operating systems)
  – RM and EDF are optimal scheduling policies
  – RM and EDF come with a solid analysis framework

Fault-tolerant scheduling
How do we extend existing techniques to FT?
Uniprocessor scheduling:
  – Use RM, DM or EDF and use any surplus capacity (slack) to re-execute tasks that experience errors during their execution.
  – The slack is reserved a priori and can be accounted for in a schedulability test. This allows for performance guarantees (under the assumed fault model).
  – Or: re-executions can be modeled as aperiodic tasks. The slack is then extracted dynamically at run-time by dedicated aperiodic servers. This allows for statistical guarantees.
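To make the idea of accounting for re-execution slack in a schedulability test concrete, the sketch below extends standard fixed-priority response-time analysis with a recovery term. The slides do not specify a particular test, so everything here is an assumption: faults are at least `t_fault` time units apart, each fault forces one full re-execution of the affected task, deadlines equal periods, all times are integers, and tasks are listed in priority order (index 0 highest, as in RM). The function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def response_time(tasks, i, t_fault=None):
    """Worst-case response time of tasks[i], or None if it misses its deadline.

    tasks: list of (C, T) pairs, execution time and period, highest priority first.
    t_fault: assumed minimum time between two faults (None = fault-free).
    """
    c_i, t_i = tasks[i]
    # A fault may hit this task or any higher-priority task; reserve
    # slack for the longest re-execution among them.
    recovery = max(c for c, _ in tasks[: i + 1])
    r = c_i
    while True:
        interference = sum(ceil_div(r, t) * c for c, t in tasks[:i])
        faults = ceil_div(r, t_fault) * recovery if t_fault else 0
        nxt = c_i + interference + faults
        if nxt > t_i:
            return None          # deadline (= period) exceeded
        if nxt == r:
            return r             # fixed point reached
        r = nxt
```

For the task set `[(1, 4), (2, 6), (2, 12)]`, the lowest-priority task has a response time of 6 without faults and 11 when faults may arrive every 20 time units, so the guarantee still holds under that fault model. A guarantee of this kind is only valid if real faults respect the assumed minimum interarrival time.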
