Institut für Technische Informatik, Chair for Embedded Systems - Prof. Dr. J. Henkel
Lecture, summer semester 2014
Reconfigurable and Adaptive Systems (RAS)
Lars Bauer, Artjom Grudnitsky, Hongyan Zhang, Jörg Henkel

8. Fault Tolerance and Reliability in FPGA based Systems
RAS Topic Overview
1. Introduction
2. Overview
3. Special Instructions
4. Fine-Grained Reconfigurable Processors
5. Configuration Prefetching
6. Coarse-Grained Reconfigurable Processors
7. Adaptive Reconfigurable Processors
8. Fault-tolerance by Reconfiguration
   • Introduction
   • Fault Detection and Mitigation Techniques
   • Applications of Reliability Techniques
   • LHC
   • Space
   • OTERA

L. Bauer, CES, KIT, 2014

8.1 Introduction
Why Fault Tolerance?
• CMOS scaling increases the occurrence of
  • manufacturing defects
  • post-deployment degradation
• Especially important for FPGAs, as they have a large number of transistors and interconnect wires
• Environmental conditions can cause temporary faults
  • E.g. aerospace industry: hardened devices are used for mission-critical tasks, FPGAs for non-critical data processing
• Unlike ASICs, FPGAs can adapt to deal with permanent and temporary faults
[Figure: number of dopant atoms in the transistor channel; labels: Gordon E. Moore (co-founded Intel in 1968), ITRS]

Types of Faults
• Permanent faults: e.g. stuck-at failures in CLBs, and opens, bridges, and shorts in the programmable switching matrix
  • Can occur during fabrication without being detected
  • Device resources may also be damaged during the lifetime of the FPGA
• Transient faults: have a temporary cause that can alter signal values or state stored in memory cells, which creates undefined or incorrect states in the computation
  • E.g. a high-energy particle strike that results in an energy exchange and charge displacement
• Intermittent faults: have a permanent cause in the structure of the circuit, but their effect is intermittent, e.g. depending on temperature or power consumption
Negative Bias Temperature Instability (NBTI)
• Breakdown of Si-H bonds at the silicon-oxide interface due to voltage/thermal stress
  • Causes interface traps
• Affects mostly P-MOSFETs because of their negative gate bias
  • Effect in N-MOSFETs is negligible
• Despite the research focus: NBTI is observed, but not yet fully understood
[Figure: cross-section of a P-type MOSFET (gate, oxide, source, drain) with Si-H bonds and a trap at the interface; V_g < 0 means stress]

Negative Bias Temperature Instability (NBTI) (cont'd)
• NBTI manifests itself as a shift in V_th
  • Causes an increase in transistor delay
  • NBTI leads to delay faults and resulting circuit failure
• Recovery effect in periods of no stress
  • When voltage and temperature are low, V_th can shift back towards its original value
  • Full recovery from a stress period is only possible in infinite time
• In practice, the overall V_th shift increases over longer periods, e.g. months or years
[Figure: V_th shift [V] and V_g [V] over time, with alternating stress and recovery phases]
NBTI and Temperature
• Temperature is an important aspect of NBTI modeling
• Higher temperatures increase the shift in threshold voltage
  • ΔV_th is approximately 50% higher at 75°C than at 55°C
  • The NBTI effect at a constant 75°C is approximately equal to that of alternating between 85°C and 25°C

NBTI Impact on Lifetime of SRAM
[Figure: Static Noise Margin (SNM) degradation after 7 years in 32nm [%] over the percentage of time that the cell stores zero [%]]
• Plot annotation: the NBTI effect is minimal where the NBTI stress is distributed equally between the two PMOS transistors of the SRAM cell, i.e. when the cell stores zero and one for equal shares of time
src: S. Kothawade, K. Chakraborty, S. Roy, "Analysis and mitigation of NBTI aging in register file: An end-to-end approach"
Types of Degradation (cont'd)
• Hot-Carrier Injection (HCI): build-up of trapped charges in the gate-channel interface region
  • Progressive reduction of carrier mobility
  • Increase in CMOS threshold voltage
  • Switching becomes slower, which leads to timing problems

Types of Degradation (cont'd)
• Time-Dependent Dielectric Breakdown (TDDB): over time, a conducting path forms in thin oxide layers [CCMA10]
[Figure: transistor cross-section with gate (G), source (S), and drain (D)]
Main Reason for many of these effects: High-Fields
• Most device problems can be traced back to high-field effects, which are related to the failure to follow Dennard Scaling
src: Radhakrishnan et al., IEDM (2001)

Dennard Scaling vs. Power Density
• Transistor and power scaling are no longer balanced
• Scaling is limited by power
• Higher power density leads to thermal problems
  • Accelerates aging effects

Classical scaling (Dennard)
Quantity               Factor   Note
Device count           S^2      Assuming a constant area
Device frequency       S        Chip frequency may be lower due to wire delay
Device power (cap)     1/S
Device power (V_dd)    1/S^2    Voltage scales with 1/S, power with its square
Power density          1        [W/mm^2]
S: scaling factor; device: transistor
src: G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations", ASPLOS '10
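The rows of the scaling table can be read as factors that multiply to the power density. The following is a minimal sketch of that arithmetic, not taken from the slides: the function name `power_density_factor` and the flag `voltage_scales` are made up for illustration, and the power-limited case is approximated by assuming that V_dd stays exactly constant.

```python
from fractions import Fraction


def power_density_factor(s, voltage_scales=True):
    """Relative power density after scaling by factor s (1 means unchanged).

    Multiplies the per-row factors of the scaling table:
    device count * device frequency * capacitance term * V_dd^2 term.
    """
    s = Fraction(s)
    device_count = s ** 2   # more transistors in a constant area
    frequency = s           # each device switches faster
    capacitance = 1 / s     # capacitance term of the device power
    # Classical scaling: V_dd scales with 1/s, so the V_dd^2 term is 1/s^2.
    # Assumption for the power-limited case: V_dd stays constant (term = 1).
    vdd_squared = 1 / s ** 2 if voltage_scales else Fraction(1)
    return device_count * frequency * capacitance * vdd_squared


classical = power_density_factor(2, voltage_scales=True)
limited = power_density_factor(2, voltage_scales=False)
print(classical, limited)
```

Under these assumptions, the power density stays constant when the voltage scales along with the dimensions, and grows with S^2 once it no longer does.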
Dennard Scaling vs. Power Density (cont'd)

Quantity               Classical scaling (Dennard)   Power-limited scaling
Device count           S^2                           S^2
Device frequency       S                             S
Device power (cap)     1/S                           1/S
Device power (V_dd)    1/S^2                         ~1
Power density          1                             S^2
S: scaling factor; device: transistor
src: G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations", ASPLOS '10

Types of Degradation (cont'd)
• Electromigration: thermally activated metal ions may leave their potential wells
  • The electric field and the momentum exchange with electrons give the metal ion migration a direction
  • Can lead to open/short circuits [wikipedia]
Radiation induced faults
• Single Event Upsets (SEU) / Single Event Transients (SET)
  • Most common: single bit flip in an SRAM cell
• SEU effect on ASICs
  • Transient (the only variation is the time duration of the fault)
  • Even if latched, it will eventually be overwritten
• SEU effect on FPGAs
  • Permanent (until reset/reconfiguration) if the configuration memory is hit by an SEU
[Figure: high-energy particle (neutron or proton) striking a transistor; labels: gate, isolation, n+, p+, N-well, P-well, depletion region, P-substrate]
Sources: Intel, S. Borker@DAC’03, Patrick-Emil Zörner, W.D. Nix, 1992, L. Finkelstein, Intel 2005, R. Baumann, TI@Design&Test’05, Ziegler, IBM@IBM JRD’96

8.2 Fault Detection and Mitigation Techniques
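As motivation for the techniques in this section, the SEU effect on FPGA configuration memory from 8.1 can be illustrated with a toy model. This is a minimal sketch under stated assumptions: the class `Lut2`, its 4-bit truth-table configuration, and the method names are hypothetical and do not correspond to any vendor's architecture or API.

```python
class Lut2:
    """Hypothetical 2-input LUT whose 4 configuration bits form the truth table."""

    def __init__(self, config_bits):
        self.golden = list(config_bits)  # reference copy of the configuration
        self.config = list(config_bits)  # SRAM configuration memory (can be upset)

    def evaluate(self, a, b):
        # The inputs select one configuration bit as the output.
        return self.config[(a << 1) | b]

    def inject_seu(self, index):
        # Model a single event upset as a flip of one configuration bit.
        self.config[index] ^= 1

    def reconfigure(self):
        # Reloading the reference configuration removes the upset.
        self.config = list(self.golden)


lut = Lut2([0, 0, 0, 1])   # configured as a 2-input AND
lut.inject_seu(3)          # upset the entry for a=1, b=1
faulty_outputs = [lut.evaluate(1, 1) for _ in range(3)]
lut.reconfigure()
repaired_output = lut.evaluate(1, 1)
print(faulty_outputs, repaired_output)
```

In this model the wrong output persists across repeated evaluations, because the implemented function itself has changed; only reloading the configuration restores the intended behavior.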
Modular Redundancy
• Masks errors, but does not correct the underlying fault
  • Problem: error accumulation
• External
  • Multiple FPGAs working in lockstep, i.e. performing the same operation in each cycle
  • Output is sent to a radiation-hardened voter
• Internal
  • Replicate the functional block in the FPGA
• Popular configurations
  • Triple Modular Redundancy (TMR)
  • Duplication with Comparison (DWC)

Fault detection methods comparison (src: [SCC08])
Method: Modular Redundancy
  Detection speed:       Fast: as soon as the fault is manifest
  Resource overhead:     Very large: triplicate + voter
  Performance overhead:  Very small: voter delay
  Granularity:           Coarse: protects module-sized blocks
  Coverage:              Good: all manifest errors detected
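The two popular configurations named under Modular Redundancy can be sketched in a few lines. This is an illustrative software model, not a hardware description; the function names `tmr_vote` and `dwc_check` are made up for this example, and the last case shows one plausible reading of the error accumulation problem.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over the outputs of three replicas."""
    return (a & b) | (a & c) | (b & c)


def dwc_check(a, b):
    """Duplication with comparison: reports a mismatch but cannot correct it."""
    return a == b


correct = 0b1011
faulty = correct ^ 0b0100   # replica output with one flipped bit

# TMR masks a fault in a single replica.
voted = tmr_vote(correct, faulty, correct)

# DWC only detects that the two replicas disagree.
dwc_ok = dwc_check(correct, faulty)

# Error accumulation: if a second replica fails in the same way before the
# first fault is repaired, the faulty replicas outvote the correct one.
accumulated = tmr_vote(faulty, faulty, correct)
print(bin(voted), dwc_ok, bin(accumulated))
```

Since the voter only masks the error, the faulty replica stays faulty, which is why masking is usually combined with some form of repair.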
Concurrent Error Detection
• More space-efficient design than modular redundancy
• Error coding algorithms (e.g. parity) on data flows/stores
• Time redundancy can be used for concurrent error detection
  • Repeat the computation in a way that allows errors to be detected
  • First computation at t0: compute the result in combinational logic, store the result
  • Second computation at t0+d: encode the operands, compute in combinational logic, decode the result, compare it to the first result

Concurrent Error Detection (cont'd)
[Figure: concurrent error detection scheme; src: [LCR03]]
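The two-step time redundancy scheme can be sketched as follows. The slides do not say which operand encoding is used; this sketch assumes a left shift by one bit as the encoding (often called recomputing with shifted operands) and models the fault as one adder output bit stuck at 0. The names `faulty_adder` and `add_with_time_redundancy` are hypothetical.

```python
def faulty_adder(x, y, stuck_at_zero_bit=None):
    """Combinational adder; optionally one output bit is stuck at 0."""
    result = x + y
    if stuck_at_zero_bit is not None:
        result &= ~(1 << stuck_at_zero_bit)
    return result


def add_with_time_redundancy(a, b, stuck_at_zero_bit=None):
    """Return (first_result, error_detected) for the two-step scheme."""
    # First computation at t0: plain operands, store the result.
    first = faulty_adder(a, b, stuck_at_zero_bit)
    # Second computation at t0+d: encode (shift left), compute on the same
    # logic, decode (shift right), then compare with the stored result.
    second = faulty_adder(a << 1, b << 1, stuck_at_zero_bit) >> 1
    return first, first != second


ok = add_with_time_redundancy(3, 4)
bad = add_with_time_redundancy(3, 4, stuck_at_zero_bit=1)
print(ok, bad)
```

Because the encoded operands exercise different bit positions of the same logic, a permanent fault tends to corrupt the two computations differently, so the comparison exposes it. A fault that does not change either result passes unnoticed, which is harmless for that particular computation.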