ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja
Department of Electrical and Computer Engineering
Basic Concepts in Fault-Tolerance
2/4/2014

Overview
• Introduction - Sources
• Hardware redundancy
• Information redundancy
• Time redundancy
• Software redundancy

Introduction
• Sources
  – Main source - Text, Chapters 2 and 3
  – Other sources
    • [prad:96] Chapter 1
    • [siew:99] Chapter 3
    • [Shooman:02] Chapter 4
  These three books contain sufficient material covering this part of the course.

Introduction (contd.)
• Scope - Explain using the example of a filter
  – inputs
  – A/D
  – digital subsystem - DSP/custom design
  – D/A
  – outputs
• Problems and solutions
  – inputs out of range
    • add extra code to check out of range inputs and outputs
    • can also add code to check large deviations between samples
    • software redundancy normally - could do in hardware but costly

Introduction (contd.)
• Problems and solutions - contd.
  – Power transients may corrupt the values or fault the algorithm
    • read values twice, execute algorithm twice and compare results
    • Time redundancy
  – Values transmitted by A/D to the digital system may get corrupted
    • encode the values and decode them at the destination
    • Information redundancy
  – Components (DSP processor or A/D or D/A) may fail
    • duplicate such parts
    • Hardware redundancy

Hardware redundancy
• Passive hardware redundancy
  – TMR with a voter
    • main problem
      – single point of failure in hardware or software
    • justification - voter is much lower complexity and can be designed using more reliable technology
    • alternative - use of restoring organ
      – TMR with triplicated voter
  – NMR voter based generalization
  – Hardware voter (1-bit), software voter - simple
  – Timing issue - sandwich between pairs of flip-flops (FFs)
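The TMR voter and its NMR generalization can be sketched in software as follows. This is a minimal illustration, not taken from the course material: the function names `tmr_vote` and `nmr_vote` are hypothetical, and module outputs are assumed to be plain integers.

```python
from collections import Counter


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority, the software analogue of a 1-bit hardware voter per bit."""
    return (a & b) | (b & c) | (a & c)


def nmr_vote(outputs):
    """Word-level majority over N module outputs; None when no strict majority exists."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


good = 0b10110010
faulty = good ^ 0b00001000          # one module delivers a single flipped bit
print(tmr_vote(good, faulty, good) == good)   # the single faulty module is masked
print(nmr_vote([good, faulty, good, good, faulty]) == good)
```

The voter itself remains a single point of failure in this sketch, which is the problem the triplicated-voter (restoring organ) arrangement addresses.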
Hardware redundancy (contd.)
• Passive hardware redundancy (contd.)
  – Comparison between hw and sw voter schemes

                        hw                          sw
    cost                high                        low
    flexibility         inflexible                  flexible
    synchronization     tight                       loose
    performance         high (fast)                 low (slow)
    types of voting*    majority (others costly)    different types (no extra cost)

Hardware redundancy (contd.)
• Passive hardware redundancy (contd.)
  – types of voting
    • majority
      – in many practical situations it is meaningless
    • average
      – can have poor performance if a sensor always provides a very low value
    • mid value
      – a good choice - can be very costly to implement in HW

Hardware redundancy (contd.)
• Active hardware redundancy
  – Key - detect fault, locate, reconfigure
  – duplicate with comparison
    • single point of failure
  – standby sparing
    • one operational unit - it has its own fault detection mechanism
    • on occurrence of fault a second unit (spare) is used
      – cold standby - standby is in unknown state
      – hot standby - standby is in the same state as the system - quick start
    • can generalize to n - one active and n-1 standby spares

Active approach to FT
• See figure 1.6 of [prad:96]
[Figure: Basic operations in active fault tolerance. Source: Pradhan 1996]

Hardware redundancy (contd.)
• Active hardware redundancy (contd.)
  – Pair-and-a-spare - this combines “duplicate with comparison” with “standby sparing”
    • duplicate units (pair of units) are used to compare and signal an error to the reconfiguration unit
    • second duplicate (pair, and possibly more in case of pair and k-spare) is used to take over in case the working duplicate (pair) detects an error
    • a pair is always operational
  – Watchdog timer
    • a “timer” - substantially low cost hardware monitors the function of the working unit

Hardware redundancy (contd.)
• Hybrid hardware redundancy
  – Key - combine passive and active redundancy schemes
  – NMR with spares
    • example - 5 units
      – 3 in TMR mode
      – 2 spares
      – all 5 connected to a switch that can be reconfigured
    • comparison with 5MR
      – 5MR can tolerate only two faults whereas the hybrid scheme can tolerate three faults that occur sequentially
      – cost of the extra fault-tolerance: switch
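The comparison between the 5-unit hybrid scheme and 5MR can be checked with a small simulation. This is a sketch under stated assumptions, none of which come from the slides: the class `HybridTMR` and helper `majority` are hypothetical names, a faulty module always outputs a wrong value, spares are fault-free until switched in, and the switch itself never fails.

```python
from collections import Counter


def majority(values):
    """Return the strict-majority value, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else None


class HybridTMR:
    """Three voting modules plus spares behind an (assumed perfect) switch."""

    def __init__(self, n_active=3, n_spares=2):
        self.active = list(range(n_active))                       # module ids being voted on
        self.spares = list(range(n_active, n_active + n_spares))  # module ids waiting as spares
        self.faulty = set()                                       # ids of failed modules

    def step(self, correct):
        # Each active module produces an output; a faulty one is assumed to be wrong.
        outputs = [(m, correct if m not in self.faulty else correct ^ 1)
                   for m in self.active]
        voted = majority([out for _, out in outputs])
        # Switch out any module that disagrees with the vote, while spares remain.
        for slot, (_, out) in enumerate(outputs):
            if out != voted and self.spares:
                self.active[slot] = self.spares.pop(0)
        return voted


system = HybridTMR()
results = []
for bad in (0, 1, 2):            # three faults, occurring one at a time
    system.faulty.add(bad)
    results.append(system.step(42) == 42)
print(results)                   # all three sequential faults are tolerated

# Plain 5MR for comparison: two faulty modules are masked, three are not.
print(majority([42, 42, 42, 43, 43]) == 42)
print(majority([42, 42, 43, 43, 43]) == 42)
```

The hybrid scheme survives the third fault only because each earlier faulty module was replaced before the next fault arrived; three simultaneous faults would defeat it just as they defeat 5MR.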
Hardware redundancy (contd.)
• Hybrid hardware redundancy (contd.)
  – Self purging redundancy
    • initially start with NMR
    • purge one unit at a time till arrive at 3MR
      – can tolerate more faults initially compared to NMR with spare
      – cost of the switch - higher?
      – How does it compare to sift-out redundancy?
  – Triple-duplex redundancy
    • combines duplication-with-compare and TMR

Information redundancy
• Key concept - add redundancy to information/data
  – all schemes use error detecting or error correcting coding
• Use of parity
  – very effective single error detection
  – encoding and decoding cost is low
  – commonly used in memories, transmission over short reliable channels
  – limitations
    • unable to detect common multiple errors
    • cannot be used in data transformation - for example addition does not preserve parity

Information redundancy (contd.)
• Error correcting codes
  – triplication
  – Hamming code - you have learnt it
  – byte error detection/correction - to be discussed later
  – cyclic code - see book
• m-out-of-n codes
  – encode each word (data/control) such that the coded word is of length n and each coded word has exactly m 1’s in it
    • can detect all single errors
    • can detect all unidirectional multiple errors

Information redundancy (contd.)
• Berger codes
  – n information bits are encoded into an n+k bit code word. The k check bits are binary encoding of the number of 1’s (or 0’s) in the n information bits
    • can detect all single errors
    • can detect all unidirectional multiple errors if carefully designed
• Arithmetic codes
  – AN code
    • used for arithmetic function unit designs
    • each data word is multiplied by a constant A
    • makes use of the identity A(N+M) = AN + AM
    • choice of A is important
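The Berger code and the AN code can be illustrated with a short sketch. The function names are hypothetical, the Berger variant shown counts 0's (one of the two options named above), and A = 3 is an assumed choice of constant: an odd A greater than 1 never divides a single-bit error of ±2^i, so every single-bit error in a coded word is detected.

```python
def berger_encode(bits: str) -> str:
    """Append k check bits holding the binary count of 0's in the n information bits."""
    k = len(bits).bit_length()              # k = ceil(log2(n + 1)) check bits
    return bits + format(bits.count("0"), f"0{k}b")


def berger_ok(word: str, n: int) -> bool:
    """Recount the 0's in the information part and compare with the check part."""
    info, check = word[:n], word[n:]
    return int(check, 2) == info.count("0")


def force_zero(word: str, positions) -> str:
    """Model unidirectional 1 -> 0 errors at the given bit positions."""
    return "".join("0" if i in positions else b for i, b in enumerate(word))


word = berger_encode("10110010")
damaged = force_zero(word, {0, 2, 9})       # 1 -> 0 errors in both info and check bits
print(word, berger_ok(word, 8), berger_ok(damaged, 8))

A = 3                                       # assumed check constant for the AN code


def an_encode(n: int) -> int:
    return A * n


def an_ok(x: int) -> bool:
    return x % A == 0                       # valid code words are multiples of A


coded_sum = an_encode(25) + an_encode(17)   # A(N+M) = AN + AM, so the sum stays a code word
print(an_ok(coded_sum), coded_sum // A)
print(an_ok(coded_sum ^ 0b100))             # a single flipped bit breaks divisibility by A
```

The Berger check works for unidirectional errors because 1 -> 0 errors can only raise the count of 0's in the information bits while they can only lower the value stored in the check bits, so the two can never agree again.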
Information redundancy (contd.)
• Arithmetic codes (contd.)
  – Residue code
    • discussed earlier in the course using modulo addition
    • makes use of the fact (M+N) mod k = (M mod k + N mod k) mod k
  – Checksums
    • data is sent/stored with a checksum and when used the checksum is regenerated and compared to the a priori known checksum
    • functions used for checksum
      – add, exclusive-OR (bit wise), add with end-around carry, LFSR, …
    • limitation
      – can only perform (normally) error detection

Information redundancy (contd.)
• Self-Checking
  – This is a form of hardware redundancy but often it is closely related to ECC techniques, therefore I have chosen to include it here
  – Assumptions: inputs are coded and outputs are coded
  – Objective: in the presence of a fault the circuit should either continue to provide correct output(s) or indicate, by providing an error indication, that there is a fault.
    • Clearly error indication cannot be 1-bit output (why?)
    • With 2-bit output, 00 and 11 may indicate no failure
    • other output combinations (10, 01) may indicate a failure
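The residue identity and the checksum scheme from the arithmetic codes slide can be sketched as follows. The names `checked_add` and `checksum8`, the modulus K = 3, and the 8-bit word size are all assumptions made for illustration; the slides do not fix any of them.

```python
K = 3  # assumed check modulus for the residue code


def checked_add(m: int, n: int, adder=lambda a, b: a + b):
    """Add with a residue check: (M+N) mod k must equal (M mod k + N mod k) mod k."""
    result = adder(m, n)                       # main (possibly faulty) adder
    expected_residue = (m % K + n % K) % K     # computed independently on the residues
    return result, result % K == expected_residue


def checksum8(data: bytes) -> int:
    """8-bit additive checksum with end-around carry."""
    total = 0
    for byte in data:
        total += byte
        total = (total & 0xFF) + (total >> 8)  # fold the carry back into the low bits
    return total


print(checked_add(25, 17))                                  # fault-free adder passes the check
print(checked_add(25, 17, lambda a, b: (a + b) ^ 0b1000))   # adder with a flipped result bit

stored = checksum8(b"ECE753")                # checksum kept alongside the data
print(checksum8(b"ECE754") == stored)        # regenerated checksum exposes the corruption
```

Both checks only detect that something went wrong; neither says which bit is in error, which matches the limitation noted for checksums.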