
Reconfigurable and Adaptive Systems (RAS) - PowerPoint PPT Presentation



  1. Institut für Technische Informatik, Chair for Embedded Systems - Prof. Dr. J. Henkel
     Lecture, summer semester 2014: Reconfigurable and Adaptive Systems (RAS)
     8. Fault Tolerance and Reliability in FPGA based Systems
     Lars Bauer, Artjom Grudnitsky, Hongyan Zhang, Jörg Henkel
     L. Bauer, CES, KIT, 2014

     RAS Topic Overview
     1. Introduction
     2. Overview
     3. Special Instructions
     4. Fine-Grained Reconfigurable Processors
     5. Configuration Prefetching
     6. Coarse-Grained Reconfigurable Processors
     7. Adaptive Reconfigurable Processors
     8. Fault-tolerance by Reconfiguration

     8.1 Introduction
     • Introduction
     • Fault Detection and Mitigation Techniques
     • Applications of Reliability Techniques
     • LHC
     • Space
     • OTERA

  2. Why Fault Tolerance?
     • CMOS scaling increases the occurrence of
       • Manufacturing defects
       • Post-deployment degradation
       • Especially important for FPGAs, as they have a large number of transistors and interconnect wires
     • Environmental conditions can cause temporary faults
       • E.g. aerospace industry: hardened devices for mission-critical tasks, FPGAs for non-critical data processing
     • Unlike ASICs, FPGAs can adapt to deal with permanent and temporary faults
     [Figure: Gordon E. Moore (co-founded Intel in 1968); ITRS plot of the number of dopant atoms in the transistor channel]

     Types of Faults
     • Permanent faults: e.g. stuck-at failures in CLBs, and opens, bridges, or shorts in the programmable switching matrix
       • Could occur during the fabrication process without being detected
       • Damage to device resources may also appear during the life cycle of FPGAs
     • Transient faults: have a temporary cause that can alter signal values or state stored in memory cells, which creates indefinite and incorrect states in the computation
       • E.g. a high-energy particle strike resulting in an energy exchange and charge displacement
     • Intermittent faults: have a permanent cause in the structure of the circuit, but their effect is intermittent, e.g. depending on temperature or power consumption

     Negative Bias Temperature Instability (NBTI)
     • Breakdown of Si-H bonds at the silicon-oxide interface due to voltage/thermal stress
       • Causes interface traps
     • Affects mostly P-MOSFETs because of negative gate bias
       • Effect in N-MOSFETs is negligible
     • Despite research focus: NBTI is observed, but not yet fully understood
     [Figure: P-type MOSFET cross-section showing Si-H bonds breaking at the oxide interface; V_g < 0: stress]

     Negative Bias Temperature Instability (NBTI) (cont'd)
     • NBTI manifests itself as a shift in V_th
       • Causes an increase in transistor delay
       • NBTI leads to delay faults and resulting circuit failure
     • Recovery effect in periods of no stress
       • When voltage and temperature are low, V_th can shift back towards its original value
       • Full recovery from a stress period is only possible in infinite time
       • In practice, the overall V_th shift increases over longer periods, e.g. months or years
     [Figure: V_th shift [V] and V_g [V] over time, alternating stress and recovery phases]
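     The three fault classes listed under "Types of Faults" differ mainly in when their effect is visible. The following is a minimal sketch of that distinction, not part of the lecture material: the function names, the cycle-based timing model, and the `active` predicate (a stand-in for conditions such as temperature or power consumption) are all illustrative assumptions.

```python
# Illustrative fault models acting on an integer "register" value.
# All names here are hypothetical; they only mirror the three fault
# classes (permanent, transient, intermittent) described in the slides.

def stuck_at(bit, value):
    """Permanent fault: `bit` always reads as `value`, in every cycle."""
    def apply(word, cycle):
        mask = 1 << bit
        return (word | mask) if value else (word & ~mask)
    return apply

def transient(bit, at_cycle):
    """Transient fault: `bit` is flipped in exactly one cycle."""
    def apply(word, cycle):
        return word ^ (1 << bit) if cycle == at_cycle else word
    return apply

def intermittent(bit, active):
    """Intermittent fault: permanent cause, but `bit` is only flipped
    while the condition `active(cycle)` holds."""
    def apply(word, cycle):
        return word ^ (1 << bit) if active(cycle) else word
    return apply

def observe(fault, word, cycles):
    """Read the same fault-free value `word` for a number of cycles."""
    return [fault(word, cycle) for cycle in range(cycles)]

if __name__ == "__main__":
    print(observe(stuck_at(0, 1), 0b0000, 4))
    print(observe(transient(1, at_cycle=2), 0b0000, 4))
    print(observe(intermittent(0, lambda c: c % 2 == 0), 0b0000, 4))
```

     In this model a permanent fault corrupts every read, a transient fault corrupts a single read, and an intermittent fault corrupts reads only while its triggering condition holds.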

  3. NBTI Impact on Lifetime of SRAM
     [Figure: Signal to Noise Margin (SNM) degradation after 7 years in 32nm (0% to 40%) versus the percentage of time that the cell stores zero [%]. Annotation: the NBTI effect is minimal where the NBTI stress is equally distributed between the two PMOS transistors in the SRAM cell.]
     src: S. Kothawade, K. Chakraborty, S. Roy, "Analysis and mitigation of NBTI aging in register file: An end-to-end approach"

     NBTI and Temperature
     • Temperature is an important aspect of NBTI modeling
     • Higher temperatures increase the shift in threshold voltage
       • ΔV_th is approximately 50% higher at 75°C than at 55°C
       • The NBTI effect at 75°C is approximately equal to alternating between 85°C and 25°C

     Types of Degradation (cont'd)
     • Hot-Carrier Injection (HCI): build-up of trapped charges in the gate-channel interface region
       • Progressive reduction of carrier mobility
       • Increase in CMOS threshold voltage
       • Switching speed becomes slower, which leads to timing problems [CCMA10]

     Types of Degradation (cont'd)
     • Time-Dependent Dielectric Breakdown (TDDB): over time, a conducting path forms in thin oxide layers
     [Figure: transistor cross-section with gate (G), source (S) and drain (D)]

  4. Main Reason for many of these effects: High-Fields
     • Most device problems can be tracked down to high-field effects, related to the failure to follow Dennard scaling
     src: Radhakrishnan et al., IEDM (2001)

     Types of Degradation (cont'd)
     • Electromigration: thermally activated metal ions may leave their potential wells
       • Electric field and momentum exchange through electrons direct the metal ion migration
       • Can lead to open/short circuits
     [wikipedia]

     Dennard Scaling vs. Power Density
     • Transistor and power scaling are no longer balanced
     • Scaling is limited by power
     • Higher power density leads to thermal problems
       • Accelerates aging effects

                             Classical scaling (Dennard)   Power limited scaling
       Device count          S^2                           S^2
       Device frequency      S                             S
       Device power (cap)    1/S                           1/S
       Device power (V_dd)   1/S^2                         ~1
       Power density         1                             S^2

     S: scaling factor; device: transistor; power density in W/mm^2
     Notes: a constant area is assumed; chip frequency may reduce due to wire delay; under classical scaling the voltage scales with 1/S, so the V_dd contribution to power scales with 1/S^2.
     src: G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations", ASPLOS '10
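     The power density row of the Dennard scaling comparison can be reproduced with a short calculation. The sketch below is not from the slides. It assumes the usual dynamic power relation P = C · V_dd² · f per device, treats the "~1" entry for power limited scaling as exactly 1, and uses a hypothetical function name `scale`.

```python
from fractions import Fraction

def scale(S, voltage_scales):
    """Relative scaling of per-device power and power density.

    S              -- scaling factor (assumed > 0)
    voltage_scales -- True for classical (Dennard) scaling, where V_dd
                      scales with 1/S; False for power limited scaling,
                      where V_dd is assumed to stay constant.
    Assumes dynamic power P = C * V_dd^2 * f and a constant chip area.
    """
    S = Fraction(S)
    device_count = S ** 2                  # devices per constant area
    frequency = S                          # device frequency
    cap_term = 1 / S                       # capacitance contribution
    vdd_term = 1 / S ** 2 if voltage_scales else Fraction(1)
    device_power = cap_term * vdd_term * frequency
    power_density = device_count * device_power
    return {
        "device_count": device_count,
        "device_power": device_power,
        "power_density": power_density,
    }

if __name__ == "__main__":
    for name, flag in (("classical", True), ("power limited", False)):
        result = scale(2, voltage_scales=flag)
        print(name, {k: str(v) for k, v in result.items()})
```

     Under these assumptions the per-device power falls with 1/S² in the classical case, which cancels the S² growth in device count. With a constant supply voltage the per-device power stays at 1, so the power density grows with S².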

  5. Radiation induced faults
     • Radiation induced faults
       • Single Event Upsets/Single Event Transients
       • Most common: single bit flip in SRAM cell
     • SEU effect on ASIC
       • Transient (the only variation is the time duration of the fault)
       • Even if latched, it will eventually be overwritten
     • SEU effect on FPGAs
       • Permanent (until reset/reconfiguration) if the configuration memory is hit by an SEU
     [Figure: high-energy particle (neutron or proton) striking a transistor, displacing charge in the depletion region]
     Sources: Intel, S. Borker@DAC'03, Patrick-Emil Zörner, W.D. Nix, 1992, L. Finkelstein, Intel 2005, R. Baumann, TI@Design&Test'05, Ziegler, IBM@IBM JRD'96

     8.2 Fault Detection and Mitigation Techniques

     Fault detection methods comparison
     src: [SCC08]
     Modular Redundancy
       Detection speed:       Fast: as soon as the fault is manifest
       Resource overhead:     Very large: triplicate + voter
       Performance overhead:  Very small: voter delay
       Granularity:           Coarse: protects module-sized blocks
       Coverage:              Good: all manifest errors detected

     Modular Redundancy
     • Masks errors, but does not correct the underlying fault
       • Problem: error accumulation
     • External
       • Multiple FPGAs working in lockstep, i.e. performing the same operation in each cycle
       • Output sent to radiation hardened voter
     • Internal
       • Replicate functional block in FPGA
     • Popular configurations
       • Triple Modular Redundancy (TMR)
       • Duplication with Comparison (DWC)
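     The two popular configurations named above can be sketched in software to show why modular redundancy masks errors without repairing them. This is a minimal illustration, not the lecture's implementation: the bitwise majority voter, the helper names, and the modelling of replica outputs as integers are all assumptions.

```python
def flip_bit(word, bit):
    """Model a single event upset as one flipped bit in a replica's output."""
    return word ^ (1 << bit)

def tmr_vote(a, b, c):
    """Triple Modular Redundancy: bitwise majority of three replica outputs."""
    return (a & b) | (a & c) | (b & c)

def dwc_check(a, b):
    """Duplication with Comparison: returns (output, error_detected).
    A mismatch is detected, but the check cannot tell which replica is wrong."""
    return a, a != b

if __name__ == "__main__":
    correct = 0b1011

    # One faulty replica: the voter masks the error.
    print(bin(tmr_vote(correct, flip_bit(correct, 2), correct)))

    # Error accumulation: the first fault was masked but never repaired,
    # and a second replica now has the same bit flipped, so the voter
    # outputs the wrong value.
    print(bin(tmr_vote(correct, flip_bit(correct, 2), flip_bit(correct, 2))))

    # DWC detects the mismatch but cannot correct it.
    print(dwc_check(correct, flip_bit(correct, 0)))
```

     The second call illustrates the error accumulation problem: the voter hides the first fault, so unless the faulty replica is repaired (on an FPGA, for example by reconfiguration), a later fault in a second replica can outvote the remaining correct one.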
