Institut für Technische Informatik, Chair for Embedded Systems - Prof. Dr. J. Henkel
Lecture, summer semester (SS) 2013
Reconfigurable and Adaptive Systems (RAS)
Lars Bauer, Artjom Grudnitsky, Hongyan Zhang, Jörg Henkel

8. Fault Tolerance and Reliability in FPGA-based Systems
RAS Topic Overview
1. Introduction
2. Overview
3. Special Instructions
4. Fine-Grained Reconfigurable Processors
5. Configuration Prefetching
6. Coarse-Grained Reconfigurable Processors
7. Adaptive Reconfigurable Processors
8. Fault-tolerance by Reconfiguration
  • Introduction
  • Fault Detection and Mitigation Techniques
  • Applications of Reliability Techniques
    ◦ LHC
    ◦ Space
    ◦ OTERA

L. Bauer, CES, KIT, 2013

8.1 Introduction
Why Fault Tolerance?
• CMOS scaling increases the occurrence of
  ◦ Manufacturing defects
  ◦ Post-deployment degradation
  ◦ Especially important for FPGAs, as they have a large number of transistors and interconnect wires
• Environmental conditions can cause temporary faults
  ◦ E.g. aerospace industry: hardened devices are used for mission-critical tasks, FPGAs for non-critical data processing
• Unlike ASICs, FPGAs can adapt to deal with permanent and temporary faults
[Figures: photo of Gordon E. Moore (co-founded Intel in 1968); plot of # of dopant atoms in T-channel (source: ITRS)]

Types of Faults
• Permanent Faults: e.g. stuck-at failures in CLBs, and opens, bridges, and shorts in the programmable switching matrix
  ◦ Can occur during the fabrication process without being detected
  ◦ Damage to device resources may also appear during the life cycle of an FPGA
• Intermittent Faults: have a permanent cause in the structure of the circuit, but their effect is intermittent, e.g. depending on temperature or power consumption
• Transient Faults: have a temporary cause that can alter signal values or state stored in memory cells, which creates undefined or incorrect states in the computation
  ◦ E.g. a high-energy particle strike resulting in an energy exchange and charge displacement
Negative Bias Temperature Instability (NBTI)
• Breakdown of Si-H bonds at the silicon-oxide interface due to voltage/thermal stress
• Causes interface traps
• Affects mostly P-MOSFETs because of their negative gate bias
  ◦ Effect in N-MOSFETs is negligible
• Despite being a research focus, NBTI is observed but not yet fully understood
[Figure: cross-section of a P-type MOSFET (gate, oxide, source S, drain D, p/p/n regions) with Si-H bonds at the interface releasing H+ and leaving traps; V_g < 0 means stress]

Negative Bias Temperature Instability (NBTI) (cont'd)
• NBTI manifests itself as a shift in V_th
  ◦ Causes an increase in transistor delay
  ◦ NBTI leads to delay faults and resulting circuit failure
• Recovery effect in periods of no stress
  ◦ When voltage and temperature are low, V_th can shift back towards its original value
  ◦ Full recovery from a stress period is only possible in infinite time
• In practice, the overall V_th shift increases over longer periods, e.g. months or years
[Figure: V_th shift [V] and V_g [V] (between 0 and -1) over time, with alternating stress and recovery phases]
NBTI and Temperature
• Temperature is an important factor in NBTI modeling
• Higher temperatures increase the shift in threshold voltage
• ΔV_th is approximately 50% higher at 75°C than at 55°C
• The NBTI effect at 75°C is approximately equal to that of alternating between 85°C and 25°C

NBTI Impact on Lifetime of SRAM
[Figure: Signal to Noise Margin (SNM) degradation after 7 years in 32nm (y-axis, 0% to 40%) over the percentage of time that the cell stores zero [%] (x-axis). Annotation at the minimum of the curve: "The NBTI effect is minimum here because the NBTI stress will equally be distributed between the two PMOS transistors existing in the SRAM"]
src: S. Kothawade, K. Chakraborty, S. Roy, "Analysis and mitigation of NBTI aging in register file: An end-to-end approach"
Types of Degradation (cont'd)
• Hot-Carrier Injection (HCI): build-up of trapped charges in the gate-channel interface region
  ◦ Progressive reduction of carrier mobility and increase in CMOS threshold voltage
  ◦ Switching speed becomes slower, which leads to timing problems

Types of Degradation (cont'd)
• Time-Dependent Dielectric Breakdown (TDDB): over time, a conducting path forms in thin oxide layers [CCMA10]
[Figure: transistor with gate G, source S, and drain D]
Example: Effect of TDDB on SRAM
• Example: read noise margin
• Worst case: "half-selected" state (wordline + bitlines high)
[Figure: SRAM cell schematic (wordline WL, bitlines BL and BR, pass-gate, stored values "1" and "0", node voltages V_L and V_R) and plots of V_R over V_L (both 0 to V_dd) for the cases fresh, p-source breakdown, n-source breakdown, and drain breakdown]
src: Stathis, IRPS (2008)

Main Reason for many of these effects: High Fields
• Most device problems can be traced back to high-field effects, which are related to the failure to follow Dennard scaling
src: Radhakrishnan et al., IEDM (2001)
Dennard Scaling vs. Power Density
• Transistor and power scaling are no longer balanced
  ◦ Scaling is limited by power
• Higher power density leads to thermal problems
  ◦ Accelerates aging effects

                       Classical scaling (Dennard)   Power-limited scaling
  Device count         S^2                           S^2
  Device frequency     S                             S
  Device power (cap)   1/S                           1/S
  Device power (V_dd)  1/S^2                         ~1
  Power density        1                             S^2
  S: scaling factor
src: G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations", ASPLOS '10

Types of Degradation (cont'd)
• Electromigration: thermally activated metal ions may leave their potential wells
  ◦ The electric field and the momentum exchange with electrons give the metal ion migration a direction
  ◦ Can lead to open/short circuits
[wikipedia]
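The power density row of the Dennard scaling table above follows from multiplying the other four rows. The snippet below is a minimal sketch of that arithmetic. The function name `power_density` and the scaling factor 1.4 are illustrative assumptions, not values from the lecture.

```python
from fractions import Fraction


def power_density(s: Fraction, vdd_power_factor: Fraction) -> Fraction:
    """Relative power density after scaling by factor s.

    Product of the table rows: device count * frequency
    * capacitance-related power * V_dd-related power.
    """
    device_count = s ** 2        # devices per unit area grow with S^2
    frequency = s                # device frequency grows with S
    cap_power = Fraction(1) / s  # capacitance-related power shrinks with 1/S
    return device_count * frequency * cap_power * vdd_power_factor


S = Fraction(14, 10)  # hypothetical scaling factor, chosen for illustration

# Classical (Dennard) scaling: V_dd-related power scales with 1/S^2.
classical = power_density(S, Fraction(1) / S ** 2)

# Power-limited scaling: V_dd no longer scales, so this factor stays ~1.
power_limited = power_density(S, Fraction(1))

print(classical)      # 1
print(power_limited)  # 49/25, i.e. S^2
```

Under classical scaling the factors cancel and power density stays constant. Once V_dd stops scaling, power density grows with S^2, which is the thermal problem named on the slide.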
Radiation-induced faults
• Radiation-induced faults
  ◦ Single Event Upsets (SEU) / Single Event Transients (SET)
  ◦ Most common: single bit flip in an SRAM cell
  ◦ SEU effect on ASICs
    • Transient (the only variation is the time duration of the fault)
    • Even if latched, it will eventually be overwritten
  ◦ SEU effect on FPGAs
    • Permanent (until reset/reconfiguration) if the configuration memory is hit by an SEU
[Figure: high-energy particle (neutron or proton) striking a CMOS cross-section (gate, isolation, p+ and n+ regions, N-well, P-well, P-substrate) and displacing charges in the depletion region]
Sources: Intel; S. Borkar @ DAC'03; Patrick-Emil Zörner; W.D. Nix, 1992; L. Finkelstein, Intel 2005; R. Baumann, TI @ Design&Test'05; Ziegler, IBM @ IBM JRD'96

8.2 Fault Detection and Mitigation Techniques
Modular Redundancy
• Masks errors, but does not correct the underlying fault
  ◦ Problem: error accumulation
• External
  ◦ Multiple FPGAs working in lockstep, i.e. performing the same operation in each cycle
  ◦ Output sent to a radiation-hardened voter
• Internal
  ◦ Replicate the functional block in the FPGA
• Popular configuration: Triple Modular Redundancy (TMR)

Fault detection methods comparison
src: [SCC08]

  Criterion             Modular Redundancy
  Detection speed       Fast: as soon as fault is manifest
  Resource overhead     Very large: triplicate + voter
  Performance overhead  Very small: voter delay
  Granularity           Coarse: protect module-sized blocks
  Coverage              Good: all manifest errors detected
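The internal TMR configuration described above can be illustrated in software. This is a minimal sketch, not an FPGA implementation. The functional block `good_block`, the fault model in `faulty_block` (one output bit stuck at 1), and the helper names are assumptions made for this example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority over three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def run_tmr(replicas, x):
    """Run all three replicas on x.

    Returns the voted output and the indices of the replicas
    whose output disagrees with the vote.
    """
    outputs = [replica(x) for replica in replicas]
    voted = majority_vote(*outputs)
    suspects = [i for i, out in enumerate(outputs) if out != voted]
    return voted, suspects


def good_block(x: int) -> int:
    """Hypothetical 8-bit functional block."""
    return (3 * x + 1) & 0xFF


def faulty_block(x: int) -> int:
    """The same block with output bit 2 stuck at 1 (assumed fault model)."""
    return good_block(x) | 0x04


# One faulty replica: the voter masks the error.
single = run_tmr([good_block, faulty_block, good_block], 5)
print(single)  # (16, [1])

# Two replicas with the same fault: the voter outputs a wrong value.
double = run_tmr([good_block, faulty_block, faulty_block], 5)
print(double)  # (20, [0])
```

The second call shows the error accumulation problem from the slide: the voter only masks errors, so once a second replica fails in the same way, the majority is wrong and the healthy replica is the one flagged as a suspect.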
Concurrent Error Detection
• More space-efficient design than modular redundancy
• Error coding algorithms (e.g. parity) at data flows/stores
• Time redundancy can be used for concurrent error detection
  ◦ Repeat the computation in a way that allows errors to be detected
  ◦ First computation at t0: compute result in combinational logic, store result
  ◦ Second computation at t0+d: encode operands, compute in combinational logic, decode result, compare to first result

Concurrent Error Detection (cont'd)
[Figure; src: [LCR03]]
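The parity example mentioned above for data stores can be sketched as follows. This is a minimal illustration under assumed names (`parity_bit`, `protect`, `is_consistent`) and an arbitrarily chosen 8-bit data word. It is not taken from the lecture.

```python
def parity_bit(word: int) -> int:
    """Even parity: returns 1 if the word contains an odd number of 1 bits."""
    return bin(word).count("1") % 2


def protect(word: int):
    """Store a word together with its parity bit."""
    return word, parity_bit(word)


def is_consistent(word: int, stored_parity: int) -> bool:
    """Check a stored word against its stored parity bit."""
    return parity_bit(word) == stored_parity


word, p = protect(0b10110010)

# A single bit flip (e.g. caused by an SEU) changes the parity and is detected.
single_flip = word ^ 0b00001000
print(is_consistent(single_flip, p))  # False

# Two bit flips cancel out in the parity and go undetected.
double_flip = word ^ 0b00011000
print(is_consistent(double_flip, p))  # True
```

One extra bit per word is far cheaper than triplication, but the undetected double flip shows why coverage depends on how much is spent on the code.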
Concurrent Error Detection (cont'd)
• Different techniques for encode/decode, e.g. bit inversion to detect stuck-at faults
• Recomputation with shifted operands (RESO) for faulty arithmetic slices
  ◦ Encode: left shift operands
  ◦ Decode: right shift result
• Combine with double modular redundancy (DMR)
  ◦ RESO determines which module is faulty, DMR uses the result of the other module
  ◦ Less area required than TMR
  ◦ Slightly slower (time-shifted re-computation)

Fault detection methods comparison
src: [SCC08]

  Criterion             Modular Redundancy                    Concurrent Error Detection
  Detection speed       Fast: as soon as fault is manifest    Fast: as soon as fault is manifest
  Resource overhead     Very large: triplicate + voter        Medium: tradeoff with coverage
  Performance overhead  Very small: voter delay               Small: CRC logic delay
  Granularity           Coarse: protect module-sized blocks   Medium: tradeoff with resource
  Coverage              Good: all manifest errors detected    Medium: not practical for all types of functionality
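The RESO scheme and its combination with DMR, as listed above, can be sketched in software. This is a simplified model under stated assumptions: the adder is 9 bits wide, operands are at most 7 bits so the shifted sum still fits, and the fault is a single sum output bit stuck at 0 (carry logic is assumed fault-free). All function names are hypothetical.

```python
WIDTH = 9                 # 8 result bits plus one spare bit for the left shift
MASK = (1 << WIDTH) - 1


def make_adder(stuck_at_zero_bit=None):
    """Build an adder; optionally one sum output bit is stuck at 0."""
    def add(a: int, b: int) -> int:
        result = (a + b) & MASK
        if stuck_at_zero_bit is not None:
            result &= ~(1 << stuck_at_zero_bit)  # assumed fault model
        return result
    return add


def reso_add(adder, a: int, b: int):
    """Add twice on the same hardware and compare.

    First run: plain operands. Second run: operands shifted left by one
    (encode), result shifted right by one (decode). A faulty bit slice
    corrupts different result bits in the two runs, so they disagree.
    """
    first = adder(a, b)
    second = adder(a << 1, b << 1) >> 1
    return first, first == second


def dmr_with_reso(adder_a, adder_b, a: int, b: int) -> int:
    """DMR: use the result of the first module that passes its RESO check."""
    for adder in (adder_a, adder_b):
        result, ok = reso_add(adder, a, b)
        if ok:
            return result
    raise RuntimeError("both modules failed their RESO check")


healthy = make_adder()
faulty = make_adder(stuck_at_zero_bit=1)

print(reso_add(healthy, 5, 6))            # (11, True)
print(reso_add(faulty, 5, 6))             # (9, False)
print(dmr_with_reso(faulty, healthy, 5, 6))  # 11
```

Two modules plus a comparator need less area than three modules plus a voter, at the cost of the second, time-shifted computation in each module.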