Hardware Reliability of Embedded Systems: Are We There Yet?

Hardware Reliability of Embedded Systems: Are We There Yet?
Bashir M. Al-Hashimi, FREng, FIEEE
March 19th 2014
PAnDA - Programmable Digital and Analogue Array, York, 18-19 March 2014

Overview
• Where are we?
  – academic and industrial research highlights
• Where are we heading to?
  – personal perspectives

Hardware Reliability
• Reliability* as described by IBM
  – Computers designed with reliability to protect data integrity and stay available for long periods of time without failure
• Unreliability sources
  – Logic faults
    • Radiation
  – Timing faults
    • Transistor wear-out
• Exacerbated by: low power design, process variation, technology scaling
* Wikipedia

Hardware Reliability Trends
• Voltage scaling and process variation degrade reliability
[Figure: critical charge of flip-flops for 45nm node*]
* S. Yang, S. Khursheed, B. M. Al-Hashimi, D. Flynn, and S. Idgunji, “Reliable State Retention-Based Embedded Processors Through Monitoring and Recovery,” IEEE TCAD, vol. 30, no. 12, pp. 1773–1785, Dec. 2011.

Where Does Reliability Matter?
[Figure; source: ARM]

Embedded Systems Reliability
[Block diagram: Processor #1 … Processor #n, each with data path, control logic, cache and register files; interconnect; peripherals; Memory #1 … Memory #n]

Where are we in dealing with hardware reliability?

Reliability Publications
[Bar chart: number of publications per year, 2000 to 2011]
• 9000+ publications over the past 12 years
[Chart: reliability conference publications in 2011 at DATE, DAC, ICCAD, ASPDAC, DSN]
• Publications from both academia and industry

Academic & Industrial Research Examples
• Hazucha and Svensson, Impact of CMOS technology scaling on the atmospheric neutron soft error rate, IEEE Trans. Nuclear Science, 2000 (citations > 330)
• Srinivasan, The impact of technology scaling on lifetime reliability, DSN’04 (citations > 350)
• Intel: Borkar et al., Parameter variations and impact on circuits and microarchitecture, DAC’03 (citations > 1000)
• IBM: Ziegler et al., "IBM experiments in soft fails in computer electronics (1978–1994)," IBM Journal of Research and Development, vol. 40, no. 1, pp. 3–18, Jan. 1996 (citations > 400)
• TI: McPherson, Reliability challenges for 45nm and beyond, DAC’06 (citations > 330)

Reliability Research Approaches
Hardware approach
• Redundancy (DMR, TMR, ECC, Parity, etc.)
Software approach
• Compilers
• Operating System (scheduling, mapping)
• Runtime Management

Tried and Tested Method
• Triple modular redundancy
[Diagram: Module 1, Module 2, Module 3 feeding a MUX / voting stage]
• High cost rules out this method
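
The voting step can be illustrated with a minimal sketch. It assumes a bitwise 2-of-3 majority voter; the `tmr` wrapper, the `fault_masks` fault-injection hook and the `add_one` module are hypothetical names introduced only for this illustration and do not come from the slides.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, x: int, fault_masks=(0, 0, 0)) -> int:
    """Run three copies of `module` and vote on their outputs.

    `fault_masks` is a hypothetical fault-injection hook: each mask is
    XORed into one copy's output to model bit flips in that copy.
    """
    outputs = [module(x) ^ mask for mask in fault_masks]
    return majority_vote(*outputs)


def add_one(x: int) -> int:
    """Example 8-bit module under protection."""
    return (x + 1) & 0xFF


# A fault confined to one module is masked by the voter.
single_fault = tmr(add_one, 41, fault_masks=(0, 0b100, 0))

# The same bit flipped in two modules defeats the voter.
double_fault = tmr(add_one, 41, fault_masks=(0b100, 0b100, 0))
```

Here `single_fault` is the correct value 42, while `double_fault` is the corrupted value 46. The sketch also makes the cost visible: the module is evaluated three times for every result.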

Low-Cost Hardware Methods: Examples
• Selective duplication (timing faults)
  – only insert RAZOR flip-flops in critical paths
• Re-use existing circuitry (logic faults)
  – scan flip-flops in BISER
  – idle register files for redundancy
[Figures: RAZOR, BISER, register files]
* Ernst et al, “Razor: a low-power pipeline based on circuit-level timing speculation”, MICRO-36, 2003, pp. 7–18.
* Mitra et al, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43–52, 2005.
* Memik et al, “Increasing Register File Immunity to Transient Errors,” in DATE05, pp. 586–591.

Low-Cost HW-SW Method: Example
• Hardware detection
  – Parity through scan-chains
• Software correction
  – Interrupt service routine as firmware
* S. Yang, S. Khursheed, B. M. Al-Hashimi, D. Flynn, and G. V. Merrett, “Improved State Integrity of Flip-Flops for Voltage Scaled Retention Under PVT Variation,” IEEE TCAS-I: Regular Papers, vol. 1, pp. 1–9, 2013.
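
The split between hardware detection and software correction can be sketched as a small behavioural model. This is not the scheme from the cited paper: the `RetentionDomain` class, the one-parity-bit-per-chain layout and the checkpoint-based repair in `isr_correct` are all assumptions made for illustration.

```python
from typing import List


def parity(bits: List[int]) -> int:
    """Even-parity bit of one scan chain (XOR of all flip-flop values)."""
    p = 0
    for b in bits:
        p ^= b
    return p


class RetentionDomain:
    """Hypothetical model of flip-flops grouped into scan chains."""

    def __init__(self, chains: List[List[int]]):
        self.chains = [list(c) for c in chains]
        self.stored_parity: List[int] = []
        self.checkpoint: List[List[int]] = []

    def enter_retention(self) -> None:
        # Hardware side: record one parity bit per chain.
        self.stored_parity = [parity(c) for c in self.chains]
        # Assumption: firmware keeps a copy of the state for later repair.
        self.checkpoint = [list(c) for c in self.chains]

    def wake_up(self) -> List[int]:
        """Hardware detection: indices of chains whose parity changed."""
        return [i for i, c in enumerate(self.chains)
                if parity(c) != self.stored_parity[i]]

    def isr_correct(self, bad_chains: List[int]) -> None:
        """Software correction: the interrupt handler restores bad chains."""
        for i in bad_chains:
            self.chains[i] = list(self.checkpoint[i])


domain = RetentionDomain([[1, 0, 1, 1], [0, 0, 1, 0]])
domain.enter_retention()
domain.chains[1][2] ^= 1           # model a single upset during retention
faulty = domain.wake_up()          # detection flags chain 1
domain.isr_correct(faulty)         # correction restores it
```

A single parity bit per chain only detects an odd number of flipped bits in that chain, which is one reason detection granularity matters in such schemes.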

Software Approach
The hardware approach emphasizes detection and correction; the software approach emphasizes software failure prevention.

Unreliable Hardware: Software Approach
Compilers
– Improve software program reliability by quantifying the vulnerability of instructions
– Instruction scheduling impacts the vulnerable periods of an instruction’s variables
– Reduce critical instructions’ occupancy in the pipeline and their operands’ vulnerable periods
– Schedule the instruction with the highest vulnerability first
Compiler flow: Source code (input) → Vulnerable periods of processor register variables analysis → Estimation of program reliability → Reliability-optimised instruction scheduling → Reliability-aware binary (output)
• J. Henkel et al, “RAISE: Reliability-Aware Instruction Scheduling”
• T. Jones, Energy-aware compilers, Cambridge University, http://www.cl.cam.ac.uk/~tmj32/
• S. Garg et al, Cross-layer reliability modelling and optimisation for embedded systems under PV, Tutorial, CODES-ISSS 2013
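
The "highest vulnerability first" rule can be sketched as a simple list scheduler. This is not the RAISE algorithm itself: the vulnerability scores, the instruction names and the `schedule` function are hypothetical, and the sketch only shows how a vulnerability-based priority could be combined with data dependencies.

```python
import heapq


def schedule(vulnerability, deps):
    """Order instructions so the most vulnerable ready one issues first.

    vulnerability: dict mapping instruction name -> score (higher = more vulnerable)
    deps: dict mapping instruction name -> set of instructions it depends on
    """
    remaining = {i: set(deps.get(i, ())) for i in vulnerability}
    users = {i: [] for i in vulnerability}
    for instr, preds in remaining.items():
        for p in preds:
            users[p].append(instr)

    # Max-heap on vulnerability, implemented by negating the score.
    ready = [(-vulnerability[i], i) for i, preds in remaining.items() if not preds]
    heapq.heapify(ready)

    order = []
    while ready:
        _, instr = heapq.heappop(ready)
        order.append(instr)
        for u in users[instr]:
            remaining[u].discard(instr)
            if not remaining[u]:
                heapq.heappush(ready, (-vulnerability[u], u))

    if len(order) != len(vulnerability):
        raise ValueError("dependency cycle in instruction graph")
    return order


scores = {"load_a": 0.9, "load_b": 0.2, "add": 0.7, "mul": 0.4, "store": 0.5}
depends = {"add": {"load_a", "load_b"}, "mul": {"load_b"}, "store": {"add", "mul"}}
issue_order = schedule(scores, depends)
```

For this input the order is `load_a, load_b, add, mul, store`: among the instructions whose operands are ready, the one with the highest score is always issued first.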

Unreliable Hardware: Software Approach
Operating Systems
– Heuristics decide on the mapping of application tasks to processors, scheduling and FT policies to meet the reliability requirement
– Many heuristics have been proposed, examples:
  • V. Izosimov, P. Pop, and P. Eles, “Design Optimization of Time- and Cost-Constrained Fault-Tolerant Distributed Embedded Systems,” DATE05, pp. 864–869.
  • R. Shafik, B. M. Al-Hashimi, K. Chakrabarty, “Soft error-aware design optimisation of low power and time-constrained embedded systems”, DATE10, pp. 1462–1467.
Flow: Input (reliability requirement, tasks, task reliability profile) → Mapping (duplication) → Scheduling (re-execution) → Reliability analysis → Reliable? (Fail: repeat mapping and scheduling; Pass: hardware platform execution)
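
The "reliability analysis, then iterate until reliable" loop can be sketched with a simple analytical model. The model is an assumption, not taken from the cited papers: each execution of a task succeeds independently with probability r, every error is detected, and all tasks must succeed. Under that model a task with k re-executions has reliability

    R_task = 1 - (1 - r)^(k + 1),    R_system = product of R_task over all tasks.

The greedy heuristic, the task names and the `max_total` budget below are hypothetical, and timing constraints are ignored.

```python
from math import prod


def task_reliability(r: float, reexecutions: int) -> float:
    """Probability that at least one of (reexecutions + 1) attempts succeeds."""
    return 1.0 - (1.0 - r) ** (reexecutions + 1)


def plan_reexecutions(profile, requirement, max_total=32):
    """Greedily add re-executions to the weakest task until the requirement is met.

    profile: dict mapping task name -> per-execution success probability
    Returns (re-execution counts, system reliability), or None if the
    hypothetical budget `max_total` is exhausted first.
    """
    k = {task: 0 for task in profile}

    def system_reliability():
        return prod(task_reliability(profile[t], k[t]) for t in profile)

    while system_reliability() < requirement:      # "Reliable?" -> Fail
        if sum(k.values()) >= max_total:
            return None
        weakest = min(profile, key=lambda t: task_reliability(profile[t], k[t]))
        k[weakest] += 1                            # add one re-execution
    return k, system_reliability()                 # "Reliable?" -> Pass


profile = {"sense": 0.999, "filter": 0.99, "actuate": 0.995}
plan, achieved = plan_reexecutions(profile, requirement=0.999)
```

With these numbers every task ends up with one re-execution, because without it the product of the task reliabilities stays below the 0.999 requirement.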

Industry Pragmatic Approach to Reliable Processors (every bit matters; users are willing to pay)

ARM Cortex-R Series
- Dual-core lock-step configuration*: two identical cores run the same set of operations and their outputs are compared. If a difference is detected, the cores are rolled back to the last correct operation
- Pipelines, caches and memories are protected with ECC
* http://www.arm.com/products/processors/cortex-r/cortex-r4.php
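
The compare-and-roll-back behaviour described on this slide can be sketched in software. This is a behavioural illustration only, not ARM's implementation: the `run_lockstep` function, the integer state and the `fault_at` fault-injection parameter are hypothetical.

```python
def run_lockstep(program, state, fault_at=None):
    """Run `program` on two modelled cores and compare after every step.

    program: list of functions, each mapping an integer state to the next state
    fault_at: hypothetical step index at which core B suffers one transient fault
    Returns (final state, number of rollbacks).
    """
    checkpoint = state          # last state both cores agreed on
    step = 0
    rollbacks = 0
    while step < len(program):
        out_a = program[step](checkpoint)
        out_b = program[step](checkpoint)
        if fault_at == step and rollbacks == 0:
            out_b ^= 0x1        # inject a single transient bit flip in core B
        if out_a != out_b:
            rollbacks += 1      # mismatch: discard results, retry from checkpoint
            continue
        checkpoint = out_a      # outputs agree: commit and advance
        step += 1
    return checkpoint, rollbacks


program = [lambda s: s + 1, lambda s: s * 2, lambda s: s - 3]
clean_result = run_lockstep(program, 5)
faulty_result = run_lockstep(program, 5, fault_at=1)
```

Both runs finish with state 9; the faulty run reports one rollback. Comparison alone cannot tell which core was wrong, which is why recovery here re-executes from the last agreed state instead of picking one core's result.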

Oracle/Fujitsu: SPARC64
• Error detection in execution units and interconnect using data and address parity*
• Recovery via instruction re-execution
• ECC in L1D and L2 caches
* Ando et al, “A 1.3-GHz fifth-generation SPARC64 microprocessor”, JSSC, 38 (11), 1896–1905, 2003.

IBM Power7
• Core
  – Hardened latches
  – Spare cores
  – Re-execution, task migration
• Memory
  – Tag uncorrectable errors
  – Dynamic sparing
• Interconnects
  – ECC-protected interconnect between cluster nodes
  – Redundant paths
* Kalla et al. "Power7: IBM's next-generation server processor." Micro, IEEE, 2010.

Where are we heading to? Personal Perspectives (Automation, Cross-layer)

Reliability/Safety Standards
• IEC 61508 (meta-standard)
• IEC 60601 (medical equipment)
• RTCA DO-178B / DO-254 (aerospace)
• EN 50128 (railway)
• IEC 50156 (furnaces)
• IEC 60880 (nuclear power stations)
• ISO 26262 (automotive)
• IEC 61511 (process industry)
• IEC 62061 (machinery)
Source: YOGITECH

ISO 26262 and RIIF
• ISO 26262: automotive safety standard for functional safety of electronic systems in vehicles
  – Focuses on risks arising from random hardware faults and systematic faults in HW/SW development
• Reliability Information Interchange Format (RIIF): IEEE initiative to develop a HW reliability modeling language
  – EDA tools analyze reliability models to compute failure rates
* Standards for specifying and modeling the reliability of complex electronic systems, 1st RIIF Workshop, DATE2013
* Evans et al, RIIF - Reliability Information Interchange Format, On-Line Testing Symposium, 2012

Low-Power EDA: Example
• Tools and standards made low-power design mainstream
• UPF (Unified Power Format): IEEE standard for describing power intent in power optimization in EDA
• Example of automatic insertion of power gating in RTL description
Flow: Design (RTL) + Power description (e.g. UPF) → Synthesis → Placement and Route
1. Create power switches [Figure: pg_switch between Vdd and Sw_Vdd, controlled by pg_ctrl/power]
2. Create state retention [Figure: retention-enabled F/F with master F/F on sw_Vdd, slave retention latch on Vdd, inputs D, clock and RETAIN, output Q]
3. Create output isolation [Figure: isolation cell iso1 between IN and out, controlled by pg_ctrl/nclamp]

Where are we heading to? Reliable Hardware EDA
[Flow diagram, elements as labelled on the slide:]
• Specification: performance and reliability
• Failure mechanism (RIIF) (e.g. SEU, NBTI, HCI, …)
• RTL
• Reliability analysis
• Reliability map (failure rates, …)
• Unified Reliability Format (URF)
• Reliability synthesis
• Fault tolerance policy: Razor, ECC, hardening, duplication
• Reliable hardware
