Invited talk, NODES Winter Seminar, Turku, Finland, February 3, 2012

Fault Injection-based Assessment of Software Techniques for Hardware Fault Tolerance

Johan Karlsson
(work with Ruben Alexandersson, Daniel Skarin, Raul Barbosa, Peter Öhman, Domenico Di Leo, Behrooz Sangchoolie, Fatemeh Ayat)
Department of Computer Science and Engineering
Chalmers University of Technology, Göteborg, Sweden

Transistor reliability trends

Shekhar Borkar, Intel Corp.: "As technology scales, variability in transistor performance will continue to increase, making transistors less and less reliable. … Finding solutions to these challenges will require a concerted effort on the part of all the players in a system design."

Borkar, S., "Designing reliable systems from unreliable components: the challenges of transistor variability and degradation," IEEE Micro, December 2005.

Outline
• Hardware reliability trends
• Layered fault tolerance
• Fault injection
• Target application: Brake-by-wire controller
• Low-cost software techniques
• High-cost software techniques
• Tool and experimental set-up
• Summary
• Future work

Main sources of transistor faults
• Process variations
  – Random variations related to lithography, etching, dopant count
  – Voltage and temperature variations
• Wear-out effects (degradation)
  – NBTI: negative bias temperature instability
  – HCI: hot carrier injection
  – Gate oxide breakdown
  – Electromigration
  – …
• Soft errors
  – Bit-flips in latches, flip-flops and memory cells
  – Mainly caused by cosmic-ray-induced high-energy neutrons (cosmic neutrons)
  – Soft errors cause no permanent damage to hardware

Trends in the bathtub curve
[Figure: bathtub curve of failure rate versus time with three phases: infant mortality, constant failure rate, wear-out; time axis marked at 1 to 20 weeks and 3 to 10 years]
• Infant mortality: Increasing manufacturing defects
• Constant failure rate: Increasing rate of transient, intermittent and permanent faults
• Wear-out: Acceleration of aging phenomena
Source: Vikas Chandra, ARM R&D, "Dependable Design in Nanoscale CMOS Technologies: Challenges and Solutions," keynote address, WDSN, Estoril, Portugal, June 29, 2009.

Soft error rate trend for SRAM
(Radiation test data from Sun Microsystems)
1 FIT = 10^-9 faults per hour
Source: A. Dixit, R. Heald, and A. Wood, "Trends from Ten Years of Soft Error Experimentation," SELSE'09, Stanford, CA, USA.
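
The FIT definition above converts directly into expected fault counts. The following is a minimal sketch, assuming a constant fault rate over the period of interest. The function names, the 1000 FIT device and the fleet size are made-up values for illustration and are not figures from the talk.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def fit_to_faults_per_hour(fit: float) -> float:
    """Convert a rate in FIT to faults per hour (1 FIT = 1e-9 faults/hour)."""
    return fit * 1e-9


def expected_faults(fit: float, devices: int, hours: float) -> float:
    """Expected number of faults for a fleet, assuming a constant rate."""
    return fit_to_faults_per_hour(fit) * devices * hours


def mean_time_between_faults_hours(fit: float) -> float:
    """Mean time between faults for one device, in hours."""
    return 1.0 / fit_to_faults_per_hour(fit)


# Hypothetical example: 10,000 devices rated at 1000 FIT each, over one year.
# This comes to about 87.6 expected faults across the fleet.
fleet_faults = expected_faults(1000, 10_000, HOURS_PER_YEAR)
```

Even a rate that looks negligible per device becomes visible at fleet scale, which is one reason the per-processor trend matters.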

Raw soft error rate trend for microprocessors
(Data from Sun Microsystems)

Technology   Year         Relative SEU rate   Mbits/processor   Relative uncorrected SEU
node (nm)    introduced   in FITs/kbit                          rate / microprocessor
250          1998         3.2                 1.52              5.0
180          1999         3.0                 1.52              4.3
130          2000         2.4                 3.28              7.9
90           2002         1.0                 33.6              33.6
65           2006         0.7                 44.3              30.5
40           2008         0.94                71                67

1 FIT = 10^-9 faults per hour
Source: A. Dixit, R. Heald, and A. Wood, "The Impact of New Technology on Soft Error Rates," SELSE-6, Stanford, CA, USA, 2010.

Layered fault tolerance
[Figure: layered fault tolerance. Fault classes: SW design faults, HW design faults, physical faults. Three lines of defense against the resulting errors: hardware mechanisms (1st), software mechanisms (2nd), system mechanisms (3rd). Error labels: corrected, detected, undetected. Processor failure modes: timing failure, value failure, bounded failure, fail silent, fail signal. System failure modes: critical failure, benign failure, safe shutdown. Other labels: "Cost balancing", "Focus of my talk".]

Error handling in hardware
Some examples
• Duplication and comparison
  – E.g., lock-stepped processors
  – High cost, high energy consumption and high failure rate
• Error correction codes (ECC) and parity bits
  – Commonly used to protect caches and other memory arrays
• Instruction retry
  – Re-execution of a machine instruction after an ECC or parity error
• Reloading of untouched data from main memory when uncorrectable errors occur in the cache
• Etc. …
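
To make the parity example concrete, here is a minimal software sketch of even parity over one word. It only illustrates the principle: real hardware computes parity in logic alongside the memory array, and the function names are hypothetical. A single parity bit detects any odd number of bit-flips in the protected word but cannot correct them, which is why it is paired with retry or reload.

```python
def even_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of 1-bits even."""
    return bin(word).count("1") % 2


def store(word: int) -> tuple:
    """Return the word together with its parity bit, as a memory would keep it."""
    return word, even_parity_bit(word)


def check(word: int, parity: int) -> bool:
    """True if the word still matches the stored parity bit."""
    return even_parity_bit(word) == parity


word, parity = store(0b1011)
single_flip = word ^ (1 << 2)               # one bit-flip: detected
double_flip = word ^ (1 << 2) ^ (1 << 0)    # two bit-flips: not detected
```

The double-flip case shows the limit of the scheme and motivates stronger codes such as ECC for arrays where multi-bit upsets are a concern.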

Fault Injection
• Fault injection is a technique for verification and validation of fault and error handling mechanisms
• Exposes a system, subsystem or component to artificial faults
• Sometimes called FMET, Failure Mode Effects Testing (cf. FMEA)
• Main benefit: improves our understanding of how a system behaves in the presence of faults and errors

Uses of fault injection
• Fault forecasting
  – E.g., estimation of error detection coverage
• Fault removal
  – To find bugs in fault and error handling mechanisms
• Benchmarking
  – Comparison of alternative design solutions
  – Identify weaknesses
• Evaluation-driven design
  – Iterative process of design, evaluation and improvement

Error model
• We use single bit-flip errors to benchmark the error sensitivity of executable programs with respect to transistor faults in microprocessors
• The single bit-flip model is an engineering approximation
• Bit-flips are injected in CPU registers and the data segment of main memory
• Injection is done just before the register or memory word is read by a machine instruction. This ensures injection of errors in live data
• We use pre-injection analysis of a fault-free execution trace to avoid injection in registers or memory that hold dead data
• There is no guarantee that errors are not injected in data items that are transitively dead
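
The error model above can be sketched in a few lines. This is an illustration under stated assumptions and not the tool used in the talk: the trace format, the `Access` type and all function names are hypothetical, and words are assumed to be 32 bits wide. Each read in the fault-free trace is treated as a candidate injection point, so a value that is overwritten before being read is never selected.

```python
import random
from typing import List, NamedTuple, Tuple


class Access(NamedTuple):
    step: int       # index of the machine instruction in the fault-free trace
    kind: str       # "R" for read, "W" for write
    location: str   # register name or memory address (hypothetical naming)


def flip_bit(word: int, bit: int, width: int = 32) -> int:
    """Return word with one bit inverted (the single bit-flip model)."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (word ^ (1 << bit)) & ((1 << width) - 1)


def live_injection_points(trace: List[Access]) -> List[Access]:
    """Reads are the points where the data is live: inject just before them."""
    return [a for a in trace if a.kind == "R"]


def pick_injection(trace: List[Access], rng: random.Random,
                   width: int = 32) -> Tuple[Access, int]:
    """Choose a random read and a random bit position to flip before it."""
    target = rng.choice(live_injection_points(trace))
    return target, rng.randrange(width)


# Hypothetical trace: the value written to r1 at step 2 is never read (dead),
# so no injection point exists for it.
trace = [
    Access(0, "W", "r1"),
    Access(1, "R", "r1"),
    Access(2, "W", "r1"),
    Access(3, "R", "0x2000"),
]
target, bit = pick_injection(trace, random.Random(0))
```

Note that this filter only removes directly dead data. A value that is read but whose result is later discarded (transitively dead) still counts as an injection point, which matches the caveat in the last bullet.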

Failure mode distributions for three programs from the MiBench suite

                                                       Failure modes
Program (description)                     # Injected   No effect   Detected by   Program   Value failure
                                          errors                   hardware      hang      (non-detected
                                                                   exception               erroneous output)
CRC-32 (32-bit cyclic redundancy check)   224999       56.3%       31.5%         5.6%      6.6%
SHA (secure hash algorithm)               225000       14.6%       39.7%         1.5%      44.2%
Quicksort (recursive sorting algorithm)   175000       30.7%       46.7%         3.7%      18.9%

Injected errors: Single bit-flips in CPU registers and volatile main memory
The failure mode distribution varies for different programs!
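
A failure mode distribution like the one in the table above is obtained by classifying the outcome of every injection experiment against a fault-free reference run. The sketch below shows one way to do that. The `RunResult` fields, the mode names and the order of the checks are assumptions made for illustration, not the classification logic of the tool used in the talk.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class RunResult:
    output: Optional[bytes]   # program output, or None if none was produced
    hw_exception: bool        # a hardware exception was raised
    timed_out: bool           # the program did not finish within the time limit


def classify(run: RunResult, golden_output: bytes) -> str:
    """Map one experiment to a failure mode (check order is an assumption)."""
    if run.hw_exception:
        return "detected_by_hw_exception"
    if run.timed_out:
        return "program_hang"
    if run.output == golden_output:
        return "no_effect"
    return "value_failure"    # erroneous output that nothing detected


def distribution(runs: Iterable[RunResult],
                 golden_output: bytes) -> Dict[str, float]:
    """Percentage of experiments per failure mode."""
    counts = Counter(classify(r, golden_output) for r in runs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode: 100.0 * n / total for mode, n in counts.items()}


# Hypothetical results from four experiments, one per failure mode.
runs = [
    RunResult(b"ok", False, False),
    RunResult(None, True, False),
    RunResult(None, False, True),
    RunResult(b"bad", False, False),
]
shares = distribution(runs, b"ok")
```

Value failures are the costly category, since the program completes and delivers wrong output without any indication that an error occurred.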

Brake system emulator
[Figure: brake system emulator; recoverable label: "Release request"]

Workload: Brake-by-wire control loop
[Figure: brake-by-wire control loop]
The parts of the program subjected to error injection are encircled.