Lars Bauer, Artjom Grudnitsky, Hongyan Zhang, Jörg Henkel


  1. Institut für Technische Informatik, Chair for Embedded Systems, Prof. Dr. J. Henkel. Lecture in the summer semester (SS) 2014. Lars Bauer, Artjom Grudnitsky, Hongyan Zhang, Jörg Henkel

  2. 8. Fault Tolerance and Reliability in FPGA-based Systems

  3. 1. Introduction
     2. Overview
     3. Special Instructions
     4. Fine-Grained Reconfigurable Processors
     5. Configuration Prefetching
     6. Coarse-Grained Reconfigurable Processors
     7. Adaptive Reconfigurable Processors
     8. Fault-tolerance by Reconfiguration
        • Introduction
        • Fault Detection and Mitigation Techniques
        • Applications of Reliability Techniques
          ◦ LHC
          ◦ Space
          ◦ OTERA
     L. Bauer, KIT, 2014

  5. • CMOS scaling increases the occurrence of
       ◦ Manufacturing defects
       ◦ Post-deployment degradation
       ◦ Especially important for FPGAs, as they have a large number of transistors and interconnect wires
     • Environmental conditions can cause temporary faults
       ◦ E.g. aerospace industry: uses hardened devices for mission-critical tasks, FPGAs for non-critical data processing
     • Unlike ASICs, FPGAs can adapt to deal with permanent and temporary faults
     [Figures: Gordon E. Moore (co-founded Intel in 1968); ITRS plot of the number of dopant atoms in the transistor channel]

  6. • Permanent faults: e.g. stuck-at failures in CLBs, and opens, bridges and shorts in the programmable switching matrix
       ◦ Can occur during the fabrication process without being detected
       ◦ Damage to device resources may also appear during the life cycle of an FPGA
     • Transient faults: have a temporary cause that can alter signal values or state stored in memory cells, which creates indefinite and incorrect states in the computation
       ◦ E.g. a high-energy particle strike resulting in an energy exchange and charge displacement
     • Intermittent faults: have a permanent cause in the structure of the circuit, but their effect is intermittent, e.g. depending on temperature or power consumption

  7. • Negative Bias Temperature Instability (NBTI): breakdown of Si-H bonds at the silicon-oxide interface due to voltage/thermal stress, which causes interface traps
     • Affects mostly P-MOSFETs because of their negative gate bias
       ◦ Effect in N-MOSFETs is negligible
     • Despite the research focus: NBTI is observed, but not yet fully understood
     [Figure: P-type MOSFET (gate, oxide, source, drain) under stress at V_g < 0; Si-H bonds at the interface break, releasing H+ and leaving traps]

  8. • NBTI manifests itself as a shift in V_th
       ◦ Causes an increase in transistor delay
       ◦ NBTI leads to delay faults and resulting circuit failure
     • Recovery effect in periods of no stress
       ◦ When voltage and temperature are low, V_th can shift back towards its original value
       ◦ Full recovery from a stress period is only possible in infinite time
     • In practice, the overall V_th shift increases over longer periods, e.g. months or years
     [Figure: V_th shift [V] and V_g [V] (between 0 and -1) over time, with alternating stress and recovery phases]

  9. • Temperature is an important factor in NBTI modeling
     • Higher temperatures increase the shift in threshold voltage
     • ΔV_th is approximately 50% higher at 75°C than at 55°C
     • The NBTI effect at a constant 75°C is approximately equal to that of alternating between 85°C and 25°C

  10. [Figure: Signal to Noise Margin (SNM) degradation after 7 years in 32nm (y-axis, 0% to 40%) over the percentage of time that the cell stores zero [%] (x-axis)]
     • The NBTI effect is minimal where the NBTI stress is distributed equally between the two PMOS transistors of the SRAM cell
     src: S. Kothawade, K. Chakraborty, S. Roy, "Analysis and mitigation of NBTI aging in register file: An end-to-end approach"

  11. • Hot-Carrier Injection (HCI): build-up of trapped charges in the gate-channel interface region
       ◦ Progressive reduction of carrier mobility, which increases the CMOS threshold voltage
       ◦ Switching speed becomes slower, which leads to timing problems

  12. • Time-Dependent Dielectric Breakdown (TDDB): over time, a conducting path forms in thin oxide layers [CCMA10]
     [Figure: transistor cross-section with gate (G), source (S) and drain (D)]

  13. • Most device problems can be traced back to high-field effects, which are related to the failure to follow Dennard scaling
     src: Radhakrishnan et al., IEDM (2001)

  14. • Transistor and power scaling are no longer balanced
       ◦ Scaling is limited by power
     • Higher power density leads to thermal problems
       ◦ Accelerates aging effects

     Classical scaling (Dennard), assuming a constant area:

     | Quantity                | Scaling | Note                                                   |
     |-------------------------|---------|--------------------------------------------------------|
     | Device count            | S^2     |                                                        |
     | Device frequency        | S       | Chip frequency may be lower due to wire delay          |
     | Device power (cap)      | 1/S     |                                                        |
     | Device power (V_dd)     | 1/S^2   | Voltage scales with 1/S, power with its square         |
     | Power density [W/mm^2]  | 1       |                                                        |

     S: scaling factor; device: transistor
     src: G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations", ASPLOS '10

  15. • Transistor and power scaling are no longer balanced
       ◦ Scaling is limited by power
     • Higher power density leads to thermal problems
       ◦ Accelerates aging effects

     | Quantity             | Classical scaling (Dennard) | Power-limited scaling |
     |----------------------|-----------------------------|-----------------------|
     | Device count         | S^2                         | S^2                   |
     | Device frequency     | S                           | S                     |
     | Device power (cap)   | 1/S                         | 1/S                   |
     | Device power (V_dd)  | 1/S^2                       | ~1                    |
     | Power density        | 1                           | S^2                   |

     S: scaling factor; device: transistor
     src: G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations", ASPLOS '10
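The power density rows of the scaling table follow from multiplying the other four rows. The sketch below reproduces that arithmetic; the function name `power_density_factor` and its `voltage_scales` flag are illustrative assumptions, not part of the lecture material.

```python
from fractions import Fraction


def power_density_factor(s, voltage_scales=True):
    """Relative power density after scaling by factor s at constant die area.

    Multiplies the table rows: device count, device frequency,
    device power (capacitance) and device power (V_dd).
    """
    s = Fraction(s)
    device_count = s ** 2
    device_frequency = s
    power_cap = 1 / s
    # Classical (Dennard) scaling: V_dd shrinks with 1/s, power with its square.
    # Power-limited scaling: V_dd stays roughly constant, so the factor is ~1.
    power_vdd = 1 / s ** 2 if voltage_scales else Fraction(1)
    return device_count * device_frequency * power_cap * power_vdd


classical = power_density_factor(2)
power_limited = power_density_factor(2, voltage_scales=False)
print(classical, power_limited)  # prints "1 4"
```

With S = 2, classical scaling keeps the power density constant, while power-limited scaling quadruples it, which matches the S^2 entry in the table.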

  16. • Electromigration: thermally activated metal ions may leave their potential wells
       ◦ The electric field and the momentum exchange with electrons direct the metal ion migration
       ◦ Can lead to open or short circuits
     src: [wikipedia]

  17. • Radiation-induced faults
       ◦ Single Event Upsets (SEU) / Single Event Transients (SET)
       ◦ Most common: single bit flip in an SRAM cell
       ◦ SEU effect on ASICs
         ▪ Transient (the only variation is the time duration of the fault)
         ▪ Even if latched, it will eventually be overwritten
       ◦ SEU effect on FPGAs
         ▪ Permanent (until reset/reconfiguration) if the configuration memory is hit by an SEU
     [Figure: high-energy particle (neutron or proton) striking a transistor cross-section (gate, isolation, p+, n+, N-well, P-well, P-substrate) and displacing charge in the depletion region]
     Sources: Intel, S. Borker@DAC'03, Patrick-Emil Zörner, W.D. Nix, 1992, L. Finkelstein, Intel 2005, R. Baumann, TI@Design&Test'05, Ziegler, IBM@IBM JRD'96

  19. Modular redundancy
     • Masks errors, but does not correct the underlying fault
       ◦ Problem: error accumulation
     • External
       ◦ Multiple FPGAs working in lockstep, i.e. performing the same operation in each cycle
       ◦ Output sent to a radiation-hardened voter
     • Internal
       ◦ Replicate the functional block within the FPGA
     • Popular configurations
       ◦ Triple Modular Redundancy (TMR)
       ◦ Duplication with Comparison (DWC)
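A minimal sketch of the two configurations, assuming each replica output is modeled as a Python integer; the helper names `tmr_vote` and `dwc_match` are hypothetical. It also shows the error accumulation problem: once a second replica fails in the same bit, the voter can no longer mask the error.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over the outputs of three replicas."""
    return (a & b) | (a & c) | (b & c)


def dwc_match(a, b):
    """Duplication with comparison: detects a mismatch, but cannot
    tell which of the two replicas is the faulty one."""
    return a == b


correct = 0b1011
flipped = correct ^ 0b0100  # one replica with a single flipped bit

# One faulty replica: the error is masked by the majority.
print(bin(tmr_vote(correct, flipped, correct)))  # prints "0b1011"

# Two replicas faulty in the same bit: the wrong value wins the vote.
print(bin(tmr_vote(flipped, flipped, correct)))  # prints "0b1111"

# DWC only reports that something is wrong.
print(dwc_match(correct, flipped))  # prints "False"
```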

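  20. src: [SCC08]

     | Technique          | Detection speed                    | Resource overhead              | Performance overhead    | Granularity                         | Coverage                           |
     |--------------------|------------------------------------|--------------------------------|-------------------------|-------------------------------------|------------------------------------|
     | Modular redundancy | Fast: as soon as fault is manifest | Very large: triplicate + voter | Very small: voter delay | Coarse: protect module-sized blocks | Good: all manifest errors detected |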

  21. Concurrent error detection
     • More space-efficient design than modular redundancy
     • Error coding algorithms (e.g. parity) at data flows/stores
     • Time redundancy can be used for concurrent error detection
       ◦ Repeat the computation in a way that allows errors to be detected
       ◦ First computation at t0: compute the result in combinational logic, store the result
       ◦ Second computation at t0+d: encode the operands, compute in combinational logic, decode the result, compare it to the first result
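As an illustration of error coding at a data store, the following sketch protects a word with a single even-parity bit. The function names are hypothetical and the 8-bit word is an arbitrary example. A single bit flip is detected; a double flip is not, which is one reason coverage depends on the chosen code.

```python
def parity_bit(word):
    """Even-parity bit: 1 if the word contains an odd number of ones."""
    return bin(word).count("1") % 2


def store(word):
    """Model of a protected store: keep the word together with its parity."""
    return word, parity_bit(word)


def check(word, parity):
    """Return True if the word is consistent with its stored parity bit."""
    return parity_bit(word) == parity


word, parity = store(0b10110010)

single_flip = word ^ 0b00001000  # one bit upset
double_flip = word ^ 0b00001001  # two bits upset

print(check(word, parity))         # prints "True"
print(check(single_flip, parity))  # prints "False": error detected
print(check(double_flip, parity))  # prints "True": error goes unnoticed
```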

  22. [Figure; src: [LCR03]]

  23. • Different techniques for encode/decode, e.g. bit inversion to detect stuck-at faults
     • Recomputation with shifted operands (RESO) for faulty arithmetic slices
       ◦ Encode: left-shift the operands
       ◦ Decode: right-shift the result
     • Combine with Duplication with Comparison (DWC)
       ◦ RESO determines which module is faulty, DWC uses the result of the other module
       ◦ Less area required than TMR
       ◦ Slightly slower (time-shifted re-computation)
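The RESO scheme and its combination with DWC can be sketched as follows. This is a simplified software model under stated assumptions: the adder is 10 bits wide so that 8-bit operands, the carry and the left shift all fit, and a fault is modeled as one result bit stuck at a fixed value. `make_adder`, `reso_add` and `dwc_with_reso` are hypothetical names.

```python
WIDTH = 10  # assumed datapath width: 8-bit operands + carry + one shift bit
MASK = (1 << WIDTH) - 1


def make_adder(stuck_bit=None, stuck_value=0):
    """Build an adder model; optionally one result bit is stuck at a value."""
    def add(a, b):
        result = (a + b) & MASK
        if stuck_bit is not None:
            if stuck_value:
                result |= 1 << stuck_bit
            else:
                result &= ~(1 << stuck_bit)
        return result
    return add


def reso_add(adder, a, b):
    """Compute a + b twice: plain, and with left-shifted operands.

    Returns the first result and whether both computations agree.
    """
    first = adder(a, b)
    second = adder(a << 1, b << 1) >> 1  # encode: shift left, decode: shift right
    return first, first == second


def dwc_with_reso(adder_a, adder_b, x, y):
    """DWC detects a mismatch; RESO decides which module to trust."""
    result_a, result_b = adder_a(x, y), adder_b(x, y)
    if result_a == result_b:
        return result_a
    _, a_consistent = reso_add(adder_a, x, y)
    return result_a if a_consistent else result_b


healthy = make_adder()
faulty = make_adder(stuck_bit=3, stuck_value=0)

print(reso_add(healthy, 5, 3))              # prints "(8, True)"
print(reso_add(faulty, 5, 3))               # prints "(0, False)"
print(dwc_with_reso(faulty, healthy, 5, 3)) # prints "8"
```

The shift moves each result bit to a different slice of the adder, so a stuck slice corrupts a different bit position in the second pass and the comparison fails.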

  24. src: [SCC08]

     | Technique                  | Detection speed                    | Resource overhead              | Performance overhead    | Granularity                         | Coverage                                              |
     |----------------------------|------------------------------------|--------------------------------|-------------------------|-------------------------------------|-------------------------------------------------------|
     | Modular redundancy         | Fast: as soon as fault is manifest | Very large: triplicate + voter | Very small: voter delay | Coarse: protect module-sized blocks | Good: all manifest errors detected                    |
     | Concurrent error detection | Fast: as soon as fault is manifest | Medium: tradeoff with coverage | Small: CRC logic delay  | Medium: tradeoff with resource      | Medium: not practical for all types of functionality  |

  25. • Built-in Self-Test (BIST): does not use external test equipment
     • In FPGAs: test configurations containing
       ◦ Test pattern generator (TPG)
       ◦ Output response analyzer (ORA)
       ◦ Between them: device (i.e. logic and interconnect) under test (DUT)
     • Can test for faults that are difficult to cover in online tests, e.g. in the clock network
     • Major drawback: the system must enter a dedicated test mode
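The TPG / DUT / ORA arrangement can be modeled in a few lines. The sketch below is an assumption-laden toy: the TPG is a 4-bit LFSR, the DUT is a hypothetical 4-input logic function, and the ORA compares each response against a fault-free software model (a hardware ORA would more typically compare identically configured blocks or compact the responses into a signature).

```python
def lfsr_patterns(seed=0b0001, count=15):
    """TPG: 4-bit LFSR (polynomial x^4 + x^3 + 1), cycles through all
    15 non-zero patterns."""
    state = seed
    for _ in range(count):
        yield state
        feedback = ((state >> 3) ^ (state >> 2)) & 1
        state = ((state << 1) | feedback) & 0xF


def dut_logic(x):
    """Hypothetical function under test: (x0 AND x1) OR (x2 AND x3)."""
    x0, x1, x2, x3 = (x >> 0) & 1, (x >> 1) & 1, (x >> 2) & 1, (x >> 3) & 1
    return (x0 & x1) | (x2 & x3)


def with_input_stuck_at_1(function, bit):
    """Fault model: the given input bit of the function is stuck at 1."""
    return lambda x: function(x | (1 << bit))


def run_bist(dut, reference):
    """ORA: pass only if every DUT response matches the reference."""
    return all(dut(p) == reference(p) for p in lfsr_patterns())


print(run_bist(dut_logic, dut_logic))                            # prints "True"
print(run_bist(with_input_stuck_at_1(dut_logic, 2), dut_logic))  # prints "False"
```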
