Hardware Reliability of Embedded Systems: Are We There Yet?
Bashir M. Al-Hashimi, FREng, FIEEE
March 19th 2014
PAnDA - Programmable Digital and Analogue Array, York, 18-19 March 2014
Overview
• Where are we? – academic and industrial research highlights
• Where are we heading to? – personal perspectives
Hardware Reliability
• Reliability* as described by IBM
  – Computers designed with reliability to protect data integrity and stay available for long periods of time without failure
• Unreliability sources
  – Logic faults
    • Radiation
  – Timing faults
    • Transistor wear-out
• Exacerbated by: low power design, process variation, technology scaling
* Wikipedia
Hardware Reliability Trends
Voltage scaling and process variation degrade reliability
[Figure: Critical charge of flip-flops for 45nm node*]
* S. Yang, S. Khursheed, B. M. Al-Hashimi, D. Flynn, and S. Idgunji, “Reliable State Retention-Based Embedded Processors Through Monitoring and Recovery,” IEEE TCAD, vol. 30, no. 12, pp. 1773–1785, Dec. 2011.
Where Does Reliability Matter?
Source: ARM
Embedded Systems Reliability
[Figure: block diagram of an embedded system: Processor #1 … Processor #n (each with data path, control logic, cache and register files), interconnect, peripherals, Memory #1 … Memory #n]
Where are we in dealing with hardware reliability?
Reliability Publications
[Figure: number of publications per year, 2000 to 2011]
• 9000+ publications over the past 12 years
[Figure: reliability conference publications in 2011 for DATE, DAC, ICCAD, ASPDAC and DSN]
• Publications from both academia and industry
Academic & Industrial Research Examples
• Hazucha and Svensson, Impact of CMOS technology scaling on the atmospheric neutron soft error rate, IEEE Trans. Nuclear Science, 2000 (citations > 330)
• Srinivasan, The impact of technology scaling on lifetime reliability, DSN’04 (citations > 350)
• Intel: Borkar et al., Parameter variations and impact on circuits and microarchitecture, DAC’03 (citations > 1000)
• IBM: Ziegler et al., “IBM experiments in soft fails in computer electronics (1978–1994),” IBM Journal of Research and Development, vol. 40, no. 1, pp. 3–18, Jan. 1996 (citations > 400)
• TI: McPherson, Reliability challenges for 45nm and beyond, DAC’06 (citations > 330)
Reliability Research Approaches
Hardware approach
• Redundancy (DMR, TMR, ECC, Parity, etc.)
Software approach
• Compilers
• Operating System (scheduling, mapping)
• Runtime Management
Tried and Tested Method
• Triple modular redundancy
[Figure: outputs of Module 1, Module 2 and Module 3 feed a voting stage (MUX)]
• High cost rules out this method
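The voting step of triple modular redundancy can be illustrated in software. The following is a minimal sketch, not taken from the slides: the names `majority_vote`, `tmr` and `increment`, and the use of XOR masks to emulate bit flips, are assumptions made for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, x, fault_masks=(0, 0, 0)):
    """Run three copies of `module` on the same input and vote on the result.

    Each copy's output is XORed with a fault mask to emulate bit flips
    (hypothetical fault-injection mechanism, for illustration only).
    """
    outputs = [module(x) ^ mask for mask in fault_masks]
    return majority_vote(*outputs)


def increment(x: int) -> int:
    """Hypothetical 8-bit module under protection."""
    return (x + 1) & 0xFF


if __name__ == "__main__":
    print(tmr(increment, 41))                      # fault-free run
    print(tmr(increment, 41, (0, 0b100, 0)))       # one faulty copy is outvoted
    print(tmr(increment, 41, (0b100, 0b100, 0)))   # two agreeing faults win the vote
```

The last call shows the limit of the scheme: the voter masks any fault confined to one copy, but two copies failing in the same bit defeat it, and the threefold hardware is paid regardless.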
Low-Cost Hardware Methods: Examples
• Selective duplication (timing faults)
  • only insert RAZOR flip-flops in critical paths
• Re-use existing circuitry (logic faults)
  • scan flip-flops in BISER
  • idle register files for redundancy
[Figure: RAZOR, BISER and register file circuits]
* Ernst et al, “Razor: a low-power pipeline based on circuit-level timing speculation”, MICRO-36, 2003, pp. 7–18.
* Mitra et al, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43–52, 2005.
* Memik et al, “Increasing Register File Immunity to Transient Errors,” in DATE05, pp. 586–591.
Low-Cost HW-SW Method: Example
• Hardware detection
  • Parity through scan-chains
• Software correction
  • Interrupt service routine as firmware
* S. Yang, S. Khursheed, B. M. Al-Hashimi, D. Flynn, and G. V. Merrett, “Improved State Integrity of Flip-Flops for Voltage Scaled Retention Under PVT Variation,” IEEE TCAS-I: Regular Papers, vol. 1, pp. 1–9, 2013.
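The split between hardware detection and software correction can be sketched as follows. This is an illustrative model only, not the cited paper's design: the class `RetentionBank`, the software-held checkpoint and the `restore_isr` routine are all assumptions. A single parity bit detects any odd number of flipped bits but cannot locate them, which is why correction is delegated to a software routine here.

```python
from functools import reduce
from operator import xor


def parity(bits):
    """Even-parity bit over a chain of flip-flop values (0 or 1)."""
    return reduce(xor, bits, 0)


class RetentionBank:
    """Hypothetical bank of retention flip-flops read out as one scan chain."""

    def __init__(self, bits):
        self.bits = list(bits)
        self.saved_parity = None
        self.checkpoint = None

    def enter_retention(self):
        # "Hardware" records parity before the voltage is scaled down.
        self.saved_parity = parity(self.bits)
        # Assumption: software keeps a copy it can restore from.
        self.checkpoint = list(self.bits)

    def wake(self, isr):
        # Recompute parity on wake-up; a mismatch triggers the interrupt routine.
        if parity(self.bits) != self.saved_parity:
            isr(self)
            return True
        return False


def restore_isr(bank):
    """Hypothetical interrupt service routine: restore the checkpointed state."""
    bank.bits = list(bank.checkpoint)


if __name__ == "__main__":
    bank = RetentionBank([1, 0, 1, 1])
    bank.enter_retention()
    bank.bits[2] ^= 1                      # emulate a bit flip during retention
    print(bank.wake(restore_isr), bank.bits)
```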
Software Approach
The hardware approach emphasizes detection and correction; the software approach emphasizes software failure prevention.
Unreliable Hardware: Software Approach
Compilers
– Improve software program reliability by quantifying vulnerability of instructions
– Instruction scheduling impacts vulnerable periods of an instruction’s variables
– Reduce critical instructions’ occupancy in the pipeline and their operands’ vulnerable periods
– Schedule the instruction with highest vulnerability first
Compiler flow: Source code (input) → Analysis of vulnerable periods of processor register variables → Estimation of program reliability → Reliability-optimised instruction scheduling → Reliability-aware binary (output)
• J. Henkel et al, “RAISE: Reliability-Aware Instruction Scheduling”
• T. Jones, Energy-aware compilers, Cambridge University, http://www.cl.cam.ac.uk/~tmj32/
• S. Garg et al, Cross-layer reliability modelling and optimisation for embedded systems under PV, Tutorial, CODES-ISSS 2013
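To see why instruction order changes vulnerable periods, consider a deliberately simplified proxy: a register value is counted as vulnerable from the cycle it is written until the cycle of its last read. The sketch below is an assumption-laden illustration, not the RAISE metric; the function `vulnerable_cycles`, the one-instruction-per-cycle model and the example schedules are all hypothetical.

```python
def vulnerable_cycles(schedule):
    """Sum, over all register values, of cycles between write and last read.

    `schedule` is a list of (dest, sources) tuples, one instruction per cycle
    (simplified single-issue model, assumed for illustration).
    """
    written_at = {}   # register -> cycle of the live value's write
    last_read = {}    # register -> cycle of the live value's latest read
    total = 0
    for cycle, (dest, sources) in enumerate(schedule):
        for reg in sources:
            if reg in written_at:
                last_read[reg] = cycle
        if dest in written_at:
            # Overwriting a register closes the previous value's interval.
            total += last_read.get(dest, written_at[dest]) - written_at[dest]
            last_read.pop(dest, None)
        written_at[dest] = cycle
    for reg, start in written_at.items():
        total += last_read.get(reg, start) - start
    return total


# Same computation, two legal orderings: r5 = r1 + (r2 + r3)
early_load = [("r1", ()), ("r2", ()), ("r3", ()),
              ("r4", ("r2", "r3")), ("r5", ("r1", "r4"))]
late_load = [("r2", ()), ("r3", ()), ("r4", ("r2", "r3")),
             ("r1", ()), ("r5", ("r1", "r4"))]

if __name__ == "__main__":
    print(vulnerable_cycles(early_load), vulnerable_cycles(late_load))
```

Loading `r1` just before its only use shortens the time its value sits exposed in the register file, so the second ordering scores lower under this proxy. A real scheduler would weigh this against pipeline stalls and per-instruction vulnerability.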
Unreliable Hardware: Software Approach
Operating Systems
- Heuristics decide on mapping of application tasks to processors, scheduling and FT policies to meet the reliability requirement
- Many heuristics have been proposed, examples:
• V. Izosimov, P. Pop, and P. Eles, “Design Optimization of Time- and Cost-Constrained Fault-Tolerant Distributed Embedded Systems,” DATE05, pp. 864–869.
• R. Shafik, B. M. Al-Hashimi, K. Chakrabarty, “Soft error-aware design optimisation of low power and time-constrained embedded systems”, DATE10, pp. 1462–1467.
[Figure: flow: Input (reliability requirement, tasks, task reliability profile) → Mapping (duplication) → Scheduling (re-execution) → Reliability analysis → Reliable? Fail: repeat mapping and scheduling; Pass: hardware platform execution]
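The analyse-and-iterate loop of such heuristics can be sketched with a toy reliability model. Everything here is an assumption for illustration and does not reproduce either cited heuristic: faults are taken to be independent, detection is perfect, a task with per-run success probability r and k re-executions succeeds with probability 1 - (1 - r)^(k+1), and timing constraints are ignored. The function names and the greedy policy are hypothetical.

```python
from math import prod


def task_reliability(r: float, reexecs: int) -> float:
    """Probability that at least one of (reexecs + 1) runs succeeds."""
    return 1.0 - (1.0 - r) ** (reexecs + 1)


def assign_reexecutions(profile, requirement, max_total=32):
    """Greedily add re-executions until the system meets `requirement`.

    `profile` lists each task's per-run success probability; all tasks must
    succeed. Returns re-executions per task, or None if the budget runs out.
    """
    reexecs = [0] * len(profile)

    def system_reliability():
        return prod(task_reliability(r, k) for r, k in zip(profile, reexecs))

    while system_reliability() < requirement:       # "Reliable?" -> Fail
        if sum(reexecs) >= max_total:
            return None
        # Give the next re-execution to the currently least reliable task.
        weakest = min(range(len(profile)),
                      key=lambda i: task_reliability(profile[i], reexecs[i]))
        reexecs[weakest] += 1
    return reexecs                                   # "Reliable?" -> Pass


if __name__ == "__main__":
    print(assign_reexecutions([0.999, 0.9, 0.9999], requirement=0.99))
```

A practical heuristic would also check that the added re-executions still fit the schedule, which is where the time and cost constraints in the cited work enter.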
Industry Pragmatic Approach to Reliable Processors
(every bit matters; users are willing to pay)
ARM Cortex-R Series
- Dual core lock-step configuration*: two identical cores run the same set of operations and their outputs are compared. If a difference is detected, the cores are rolled back to the last correct operation
- Pipelines, caches and memories are protected with ECC
* http://www.arm.com/products/processors/cortex-r/cortex-r4.php
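The compare-then-commit idea behind lock-step execution can be modelled in a few lines. This is a behavioural sketch under stated assumptions, not ARM's implementation: cores are modelled as Python functions, the "last correct operation" is simply the last committed state, and `run_lockstep`, `good_core` and `make_flaky_core` are hypothetical names.

```python
def run_lockstep(program, state, core_a, core_b, max_retries=3):
    """Execute `program` on two cores, committing only results that agree."""
    for op in program:
        for _ in range(max_retries + 1):
            out_a = core_a(op, state)
            out_b = core_b(op, state)
            if out_a == out_b:
                state = out_a          # outputs agree: commit
                break
            # Mismatch: discard both results and retry from the committed state.
        else:
            raise RuntimeError("persistent mismatch between cores")
    return state


def good_core(op, state):
    """Hypothetical core: each operation adds its operand to the state."""
    return state + op


def make_flaky_core(fail_on_call):
    """Core that flips the lowest result bit on one chosen call (transient fault)."""
    calls = 0

    def core(op, state):
        nonlocal calls
        calls += 1
        result = state + op
        return result ^ 1 if calls == fail_on_call else result

    return core


if __name__ == "__main__":
    print(run_lockstep([1, 2, 3], 0, good_core, make_flaky_core(2)))
```

Comparison alone only detects a disagreement; it cannot tell which core is wrong, so recovery relies on re-running from state that both cores previously agreed on.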
Oracle/Fujitsu: SPARC64
• Error detection in execution units and interconnect using data and address parity*
• Recovery via instruction re-execution
• ECC in L1D and L2 caches
* Ando et al, “A 1.3-GHz fifth-generation SPARC64 microprocessor”, JSSC, 38(11), 1896–1905, 2003.
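Unlike parity, which only detects an error, an error-correcting code can also locate and repair it. The sketch below uses a textbook Hamming(7,4) single-error-correcting code as a small stand-in; it is an assumption chosen for brevity, and the codes actually used in the caches named above are not specified on the slide.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    code = [0] * 8                                   # index 0 unused
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(word):
    """Correct up to one flipped bit; return (data bits, syndrome)."""
    code = [0] + list(word)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position                     # XOR of set-bit positions
    if syndrome:
        code[syndrome] ^= 1                          # syndrome names the bad position
    return [code[3], code[5], code[6], code[7]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                                     # emulate a soft error at position 5
    print(hamming74_decode(word))
```

A nonzero syndrome equals the position of the flipped bit, so one XOR repairs it. Two simultaneous flips would be miscorrected by this code, which is why memory ECC typically adds an extra parity bit for double-error detection.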
IBM Power7
• Core
  – Hardened latches
  – Spare cores
  – Re-execution, task migration
• Memory
  – Tag un-correctable errors
  – Dynamic sparing
• Interconnects
  – ECC-protected interconnect between cluster nodes
  – Redundant paths
* Kalla et al., “Power7: IBM's next-generation server processor,” IEEE Micro, 2010.
Where are we heading to? Personal Perspectives
(Automation, Cross-layer)
Reliability/Safety Standards
• IEC 61508 (meta-standard)
• IEC 60601 (medical equipment)
• RTCA DO-178B / DO-254 (aerospace)
• EN 50128 (railway)
• IEC 50156 (furnaces)
• IEC 60880 (nuclear power stations)
• ISO 26262 (automotive)
• IEC 61511 (process industry)
• IEC 62061 (machinery)
Source: YOGITECH
ISO 26262 and RIIF
• ISO 26262: automotive safety standard for functional safety of electronic systems in vehicles
  – Focuses on risks arising from random hardware faults and systematic faults in HW/SW development
• Reliability Information Interchange Format (RIIF): IEEE initiative to develop HW reliability modeling language
  – EDA tools analyze reliability models to compute failure rates
* Standards for specifying and modeling the reliability of complex electronic systems, 1st RIIF Workshop, DATE2013
* Evans et al, RIIF - Reliability Information Interchange Format, On-Line Testing Symposium, 2012
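The kind of calculation a tool might perform on a reliability model can be shown with a simple roll-up of failure rates. This sketch does not use RIIF syntax or any RIIF tool: the component names and FIT values are invented for illustration, and it assumes a series system with constant, independent failure rates. FIT is the standard unit of failures per 10^9 device-hours.

```python
def system_fit(components):
    """Total failure rate of a series system: sum of (FIT per instance * count)."""
    return sum(fit * count for fit, count in components.values())


def mttf_hours(fit: float) -> float:
    """Mean time to failure in hours for a constant failure rate given in FIT."""
    return 1e9 / fit


# Hypothetical reliability model: name -> (FIT per instance, instance count)
example_model = {
    "sram_block": (100.0, 4),
    "flip_flop": (0.5, 200),
    "control_logic": (50.0, 1),
}

if __name__ == "__main__":
    total = system_fit(example_model)
    print(total, mttf_hours(total))
```

Summing rates is only valid when any single component failure fails the system; redundancy or masking between components needs a richer model, which is part of what a dedicated modeling language is meant to capture.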
Low-Power EDA: Example
• Tools and standards made low-power design mainstream
• UPF (Unified Power Format): IEEE standard for describing power intent in power optimization in EDA
• Example of automatic insertion of power gating in RTL description
[Figure: design flow: Design (RTL) and power description (e.g. UPF) → Synthesis → Placement and Route]
1. Create power switches [Figure: pg_switch between Vdd and sw_Vdd, controlled by pg_ctrl/power]
2. Create state retention [Figure: retention-enabled F/F with master F/F and slave retention latch; signals D, clock, Q, RETAIN; supplies Vdd, sw_Vdd, Gnd]
3. Create output isolation [Figure: isolation cell iso1 between IN and out, controlled by pg_ctrl/nclamp]
Where are we heading to? Reliable Hardware EDA
[Figure: proposed design flow]
• Inputs: Specification (performance and reliability); Failure mechanism (RIIF) (e.g. SEU, NBTI, HCI, …); RTL
• Reliability analysis → Reliability map (failure rates, …) → Unified Reliability Format (URF)
• Reliability Synthesis with a fault tolerance policy: Razor, ECC, Hardening, Duplication
• Output: Reliable Hardware