Institut für Technische Informatik
Chair for Embedded Systems - Prof. Dr. J. Henkel
Lecture, summer semester 2014

Reconfigurable and Adaptive Systems (RAS)
Lars Bauer, Artjom Grudnitsky, Hongyan Zhang, Jörg Henkel

8. Fault Tolerance and Reliability in FPGA based Systems

RAS Topic Overview
1. Introduction
2. Overview
3. Special Instructions
4. Fine-Grained Reconfigurable Processors
5. Configuration Prefetching
6. Coarse-Grained Reconfigurable Processors
7. Adaptive Reconfigurable Processors
8. Fault-tolerance by Reconfiguration
   • Introduction
   • Fault Detection and Mitigation Techniques
   • Applications of Reliability Techniques
   • LHC
   • Space
   • OTERA

L. Bauer, CES, KIT, 2014

8.1 Introduction

Why Fault Tolerance?
• CMOS scaling increases the occurrence of
  - Manufacturing defects
  - Post-deployment degradation
  - Especially important for FPGAs, as they have a large number of transistors and interconnect wires
• Environmental conditions can cause temporary faults
  - E.g. aerospace industry: hardened devices for mission-critical tasks, FPGAs for non-critical data processing
• Unlike ASICs, FPGAs can adapt to deal with permanent and temporary faults
[Figure: number of dopant atoms in the transistor channel (ITRS); photo: Gordon E. Moore (co-founded Intel in 1968)]

Types of Faults
• Permanent faults: e.g. stuck-at failures in CLBs, and opens, bridges, and shorts in the programmable switching matrix
  - Can occur during the fabrication process without being detected
  - Damage to device resources may also appear during the life cycle of an FPGA
• Transient faults: have a temporary cause that can alter signal values or state stored in memory cells, which creates indefinite and incorrect states in the computation
  - E.g. a high-energy particle strike resulting in an energy exchange and charge displacement
• Intermittent faults: have a permanent cause in the structure of the circuit, but their effect is intermittent, e.g. depending on temperature or power consumption
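The three fault classes differ mainly in when the fault is visible. The following minimal sketch applies each class to the same fault-free signal trace. The trace, the affected bit, the temperature values, and the 80-degree threshold are all hypothetical and chosen only for illustration; they do not come from the lecture.

```python
def stuck_at(value, bit, level):
    """Force one bit of `value` to `level` (0 or 1), as a stuck-at fault would."""
    return value | (1 << bit) if level else value & ~(1 << bit)


def run(signal, fault):
    """Apply fault(cycle, value) to every value of a fault-free signal trace."""
    return [fault(cycle, value) for cycle, value in enumerate(signal)]


# Fault-free trace: six cycles of the same 4-bit value (hypothetical).
signal = [0b1111] * 6

# Permanent fault: bit 0 is stuck at 0 in every cycle.
permanent = run(signal, lambda cycle, v: stuck_at(v, 0, 0))

# Transient fault: a single upset flips bit 0 in cycle 2 only.
transient = run(signal, lambda cycle, v: v ^ 1 if cycle == 2 else v)

# Intermittent fault: the defect is always present, but it only shows
# while the (hypothetical) temperature exceeds a threshold.
temperature = [60, 60, 90, 95, 60, 90]
intermittent = run(
    signal,
    lambda cycle, v: stuck_at(v, 0, 0) if temperature[cycle] > 80 else v,
)

print(permanent)
print(transient)
print(intermittent)
```

In this sketch the permanent fault corrupts every cycle, the transient fault corrupts exactly one, and the intermittent fault appears and disappears with the operating condition although its cause never goes away.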

Negative Bias Temperature Instability (NBTI)
• Breakdown of Si-H bonds at the silicon-oxide interface due to voltage/thermal stress
  - Causes interface traps
• Affects mostly P-MOSFETs because of negative gate bias
  - Effect in N-MOSFETs is negligible
• Despite research focus: NBTI is observed, but not yet fully understood
[Figure: P-type MOSFET cross-section (gate, oxide, source, drain) showing Si-H bonds at the interface breaking into H+ and traps; V_g < 0 means stress]

Negative Bias Temperature Instability (NBTI) (cont'd)
• NBTI manifests itself as a shift in V_th
  - Causes an increase in transistor delay
  - NBTI leads to delay faults and resulting circuit failure
• Recovery effect in periods of no stress
  - When voltage and temperature are low, V_th can shift back towards its original value
  - Full recovery from a stress period is only possible in infinite time
• In practice, the overall V_th shift increases over longer periods, e.g. months or years
[Figure: V_th shift [V] and V_g [V] (between 0 and -1) over time, with alternating stress and recovery phases]

NBTI and Temperature
• Temperature is an important aspect of NBTI modeling
• Higher temperatures increase the shift in threshold voltage
  - ΔV_th is approximately 50% higher at 75°C than at 55°C
  - The NBTI effect at 75°C is approximately equal to alternating between 85°C and 25°C

NBTI Impact on Lifetime of SRAM
[Figure: Signal to Noise Margin (SNM) degradation after 7 years in 32nm (0% to 40%) versus percentage of time that the cell stores zero [%]. Annotation at the minimum of the curve: "The NBTI effect is minimum here because the NBTI stress will equally be distributed between the two PMOS transistors existing in the SRAM"]
src: S. Kothawade, K. Chakraborty, S. Roy, "Analysis and mitigation of NBTI aging in register file: An end-to-end approach"

Types of Degradation (cont'd)
• Hot-Carrier Injection (HCI): build-up of trapped charges in the gate-channel interface region
  - Progressive reduction of carrier mobility
  - Increase in CMOS threshold voltage
  - Slower switching speed, which leads to timing problems

Types of Degradation (cont'd)
• Time-Dependent Dielectric Breakdown (TDDB): over time, a conducting path forms in thin oxide layers [CCMA10]
[Figure: transistor cross-section with gate (G), source (S), and drain (D)]

Main Reason for Many of These Effects: High Fields
• Most device problems can be traced back to high-field effects, which are related to the failure to follow Dennard scaling
src: Radhakrishnan et al., IEDM (2001)

Dennard Scaling vs. Power Density
• Transistor and power scaling are no longer balanced
• Scaling is limited by power
• Higher power density leads to thermal problems
  - Accelerates aging effects

Classical scaling (Dennard)
  Device count          S^2     (assuming a constant area)
  Device frequency      S       (chip frequency may reduce due to wire delay)
  Device power (cap)    1/S
  Device power (V_dd)   1/S^2   (voltage scales with 1/S, power with its square)
  Power density         1       [W/mm^2]
S: scaling factor; device: transistor
src: G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations", ASPLOS '10
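The power density row follows from multiplying the other four rows, since power per area is proportional to device count times frequency times capacitance times V_dd squared. The sketch below redoes that arithmetic for the classical case and for a power-limited case in which the supply voltage is assumed to stay roughly constant (V_dd contribution of about 1). The function name and the example scaling factors are illustrative assumptions, not part of the lecture.

```python
from fractions import Fraction


def power_density(s, voltage_scales=True):
    """Relative power density after scaling by factor s (constant chip area).

    Multiplies the per-row factors of the scaling table:
    device count * device frequency * capacitance * V_dd^2.
    """
    s = Fraction(s)
    device_count = s ** 2          # S^2 more devices on the same area
    frequency = s                  # each device switches S times faster
    capacitance = 1 / s            # capacitance per device shrinks by 1/S
    # V_dd^2 shrinks by 1/S^2 only if the supply voltage still scales by 1/S.
    vdd_squared = 1 / s ** 2 if voltage_scales else Fraction(1)
    return device_count * frequency * capacitance * vdd_squared


for s in (Fraction(7, 5), 2):
    print(s, power_density(s), power_density(s, voltage_scales=False))
```

With voltage scaling the product is always 1, so power density stays constant. Without it, the result equals S^2, so each scaling step raises the power per unit area.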

Dennard Scaling vs. Power Density (cont'd)

                        Classical scaling (Dennard)   Power Limited Scaling
  Device count          S^2                           S^2
  Device frequency      S                             S
  Device power (cap)    1/S                           1/S
  Device power (V_dd)   1/S^2                         ~1
  Power density         1                             S^2
S: scaling factor; device: transistor
src: G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations", ASPLOS '10

Types of Degradation (cont'd)
• Electromigration: thermally activated metal ions may leave their potential wells
  - Electric field and momentum exchange through electrons direct the metal ion migration
  - Can lead to open/short circuits
[Figure: electromigration, src: wikipedia]

Radiation induced faults
• Radiation induced faults
  - Single Event Upsets (SEU) / Single Event Transients
  - Most common: single bit flip in an SRAM cell
• SEU effect on ASIC
  - Transient (only variation is time duration of fault)
  - Even if latched, will eventually be overwritten
• SEU effect on FPGAs
  - Permanent (until reset/reconfiguration) if configuration memory is hit by an SEU
[Figure: high-energy particle (neutron or proton) striking a transistor cross-section and displacing charge: gate, n+ regions, p+, isolation, N-well, P-well, depletion region, P-substrate]
Sources: Intel, S. Borker@DAC'03, Patrick-Emil Zörner, W.D. Nix, 1992, L. Finkelstein, Intel 2005, R. Baumann, TI@Design&Test'05, Ziegler, IBM@IBM JRD'96

8.2 Fault Detection and Mitigation Techniques

Modular Redundancy
• Masks errors, but does not correct the underlying fault
  - Problem: error accumulation
• External
  - Multiple FPGAs working in lockstep, i.e. performing the same operation in each cycle
  - Output sent to a radiation hardened voter
• Internal
  - Replicate the functional block in the FPGA
• Popular configurations
  - Triple Modular Redundancy (TMR)
  - Duplication with Comparison (DWC)

Fault detection methods comparison
src: [SCC08]

Method: Modular Redundancy
  Detection speed        Fast: as soon as the fault is manifest
  Resource overhead      Very large: triplicate + voter
  Performance overhead   Very small: voter delay
  Granularity            Coarse: protect module sized blocks
  Coverage               Good: all manifest errors detected
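The two popular configurations can be sketched in a few lines. The example below models replica outputs as integers, uses a bitwise majority function as the TMR voter, and uses a simple inequality check for DWC. The function names and bit patterns are illustrative assumptions, and a real voter would be implemented in hardware, not software.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over the outputs of three replicas (TMR)."""
    return (a & b) | (a & c) | (b & c)


def dwc_check(a, b):
    """Duplication with comparison: return (output, error_detected)."""
    return a, a != b


correct = 0b1011            # fault-free replica output (hypothetical)
upset = correct ^ 0b0100    # same output with one flipped bit

# One faulty replica: the voter masks the error.
single_fault = tmr_vote(correct, upset, correct)

# Two replicas with the same wrong value (faults have accumulated):
# the voter now outputs the wrong value without any indication.
double_fault = tmr_vote(upset, upset, correct)

# DWC detects the mismatch but cannot tell which replica is right.
dwc_output, dwc_error = dwc_check(correct, upset)

print(bin(single_fault), bin(double_fault), dwc_error)
```

The second case illustrates the error accumulation problem: the voter masks a single fault but does not repair it, so a later fault in another replica can outvote the remaining correct one.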

Concurrent Error Detection
• More space-efficient design than modular redundancy
• Error coding algorithms (e.g. parity) at data flows/stores
• Time redundancy can be used for concurrent error detection
  - Repeat the computation in a way that allows errors to be detected
  - First computation at t0: compute result in combinational logic, store result
  - Second computation at t0+d: encode operands, compute in combinational logic, decode result, compare to first result

Concurrent Error Detection (cont'd)
[Figure, src: [LCR03]]
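The two-step time redundancy scheme can be sketched with an adder. The slide does not say which encoding is used, so this example assumes one common choice: shifting both operands left by one bit and shifting the result back (often called recomputing with shifted operands). The adder model, the stuck-at-0 fault on one output bit, and all names are hypothetical.

```python
def add(a, b, stuck_at_zero_bit=None):
    """Model of a combinational adder for non-negative integers.

    If stuck_at_zero_bit is given, that output bit is forced to 0
    (a hypothetical permanent fault in the logic).
    """
    result = a + b
    if stuck_at_zero_bit is not None:
        result &= ~(1 << stuck_at_zero_bit)
    return result


def add_with_time_redundancy(a, b, stuck_at_zero_bit=None):
    """Return (first_result, error_detected) using the same adder twice."""
    # First computation at t0: plain operands, result is stored.
    first = add(a, b, stuck_at_zero_bit)
    # Second computation at t0+d: encode operands (shift left by one),
    # compute on the same logic, then decode (shift right by one).
    encoded = add(a << 1, b << 1, stuck_at_zero_bit)
    second = encoded >> 1
    # Compare the decoded second result with the stored first result.
    return first, first != second


print(add_with_time_redundancy(3, 1))                       # fault-free adder
print(add_with_time_redundancy(3, 1, stuck_at_zero_bit=2))  # faulty adder
```

Because the shifted operands exercise different bit positions of the same logic, a fault that corrupts the first result affects the second one differently, and the comparison exposes the mismatch. Repeating the computation without encoding would hit the faulty bit identically both times and the comparison would pass.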
