
ECE 753: Fault-Tolerant Computing, Low Level Fault-Tolerance: Watchdog and Re-execution - PDF document



  1. 3/25/2014

     ECE 753: FAULT-TOLERANT COMPUTING
     Low Level Fault-Tolerance: Watchdog and Re-execution
     Kewal K. Saluja
     Department of Electrical and Computer Engineering

     Overview
     • Introduction
     • Watchdog techniques
       – Timers, watchdog processors, error model, control flow checking, memory access and assertion checking
     • Re-execution for fault-tolerance
       – Basic techniques: RESO concept, program re-execution, instruction re-execution
       – Case studies: fine grain parallel architecture (CRAY), SMT architecture, multiscalar architecture, chip multiprocessor
     • Summary

     Introduction
     • Somewhat higher level than ECC and masking at circuit level
     • Bordering between hardware and software (hardware often assisted by software)
     • These are some of the very first fault-tolerance methods

     Introduction (contd.)
     • References
       • Watchdog - [mahm:88]
       • Re-execution - [rotenberg:99], [rashid:00], [subra:10], [kala:13]
       • Sohi, Franklin, and Saluja, “A study of time-redundant fault-tolerant techniques for high-performance pipelined computers,” Proceedings FTCS-19, June 1989, pp. 436-443.

     Watchdog techniques
     • Key concept
       – The actions of a process or processor are checked by another unit (normally hardware). Actions checked include whether the process is still active, alive, not executing incorrect paths during execution, etc.

     Watchdog: Timers
     • Check for aliveness
       – Processor resets the timer at certain intervals or on certain conditions
       – Timer raises error flag if not reset before it overruns
     [Figure: Processor, watchdog timer, Error signal]
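The aliveness check on the "Watchdog: Timers" slide can be illustrated with a small simulation. This is only a sketch under stated assumptions: time is modeled as discrete ticks, and the names `WatchdogTimer`, `limit`, and `run` are hypothetical and do not come from the slides.

```python
class WatchdogTimer:
    """Counts ticks and latches an error flag if it is not reset in time."""

    def __init__(self, limit):
        self.limit = limit    # ticks allowed between resets (assumed parameter)
        self.count = 0
        self.error = False    # error flag, latched once raised

    def reset(self):
        """Called by the monitored processor to show it is still alive."""
        self.count = 0

    def tick(self):
        """Advance one time unit; raise the error flag on overrun."""
        self.count += 1
        if self.count > self.limit:
            self.error = True
        return self.error


def run(limit, reset_ticks, total_ticks):
    """Simulate total_ticks ticks; the processor resets the timer at reset_ticks.

    Returns the tick at which the error flag is raised, or None if it never is.
    """
    wd = WatchdogTimer(limit)
    for t in range(1, total_ticks + 1):
        if t in reset_ticks:
            wd.reset()
        if wd.tick():
            return t
    return None


# A healthy processor resets every second tick; a hung one stops after tick 2.
healthy = run(3, {2, 4, 6, 8}, 10)
hung = run(3, {2}, 10)
```

In a real system the counter would be a hardware timer and the reset a write to a dedicated register; the simulation only shows the reset-or-overrun logic.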

  2. Watchdog: Timers (contd.)
     • Check for timeout
       – Processor sends a message and starts a timer; the second processor must reply within this time (hardware/software implementation)
     [Figure: Processor A, Processor B, Timer]

     Watchdog: Timers (contd.)
     • Applications
       – Processor control systems (chemical, mechanical and other control systems)
       – Switching systems: messages sent or received often await a certain length of time before they are repeated
       – Networks: email messages often have timeouts associated with them

     Watchdog: Processors
     • Architecture: can be complex, but let us consider the following simple architecture
     [Figure: Memory on a BUS (data, address, control) with a Watchdog Processor (observer) attached]

     Watchdog: Processors (contd.)
     • What can it achieve?
       – Observe the address bus
         • Can observe the data
         • Can observe instructions
         • Can check the flow of program control
       – Need to know what kind of errors can occur to determine the capability of this method

     Watchdog: Error models
     • Experimental setup to develop error models applicable at this level
       – Processor-memory architecture
       – Inject faults (random errors): in I/O processor, within processor (register file, states), within memory
       – Simulate
       – Also, hardware was designed to inject such faults and study the impact/behavior

     Watchdog: Error models (contd.)
     • Conclusions of the studies
       – Program flow could change (branch to no branch, or vice versa)
       – Instruction fetched from data space
       – Access to non-existent memory space
       – Data fetched from instruction space
       – Illegal instruction
       – Writing in protected area (ROM)
     • 60% of all faults could be detected by monitoring control flow
       – Thus we need to develop methods that are good at monitoring control flow
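Several of the error classes listed under "Conclusions of the studies" can be recognized by an observer that sees only the address and the type of each bus access. The following is a minimal sketch, assuming a hypothetical two-region memory map; the region boundaries, the `REGIONS` table, and the `check_access` function are illustrative and are not taken from the slides. It also makes the simplifying assumption that no data is ever read from instruction space.

```python
# Hypothetical memory map: (start, end_exclusive, kind, writable)
REGIONS = [
    (0x0000, 0x4000, "code", False),  # instruction space, assumed to be ROM
    (0x4000, 0x8000, "data", True),   # data space, assumed to be RAM
]


def check_access(addr, op):
    """Classify one observed bus access; op is 'fetch', 'read', or 'write'."""
    for start, end, kind, writable in REGIONS:
        if start <= addr < end:
            if op == "fetch" and kind != "code":
                return "instruction fetched from data space"
            if op == "read" and kind == "code":
                return "data fetched from instruction space"
            if op == "write" and not writable:
                return "write to protected area"
            return "ok"
    # The address matched no region in the map.
    return "access to non-existent memory"


# A short trace of (address, operation) pairs as the watchdog might observe it.
trace = [(0x0100, "fetch"), (0x4100, "write"), (0x4100, "fetch"), (0x9000, "read")]
results = [check_access(a, op) for a, op in trace]
```

Changes in program flow and illegal instructions are not covered here; they need knowledge of the program itself, which is what control flow checking supplies.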

  3. Watchdog: Control flow checking
     • Basic principle
       – Analyze the program and extract control information
         • Branch free intervals
         • Subroutine calls
       – Assign signatures to branch free intervals and provide these signatures to the watchdog processor to check these values

     Watchdog: Control flow checking (contd.)
     • A simple example

       Program                   Watchdog
       start             --->    receive start
       branch free code          observe bus cont. to form signature
       check sig X       --->    check X against collected sig

     Watchdog: Control flow checking (contd.)
     • Details and variations
       – Structural integrity checking
         • Analyze the program control flow: create a program control flow graph
         • Assign unique identifier to the nodes of the graph
         • Provide control flow graph to the watchdog along with the identifiers
         • In case of branches, watchdog expects one of the many possible identifiers
     • Limitations
       – Performance impact: insertion of special instructions
       – Inability to detect data processing variations (e.g., add changed to sub)

     Watchdog: Control flow checking (contd.)
     • Details and variations (contd.)
       – Derived signature checking
         • Compiler identifies branch free intervals and generates signatures (such as a checksum) for these intervals
         • At run time these signatures are provided to the watchdog using tag bits to differentiate between regular instructions and watchdog messages
         • Watchdog monitors the bus, generates the signatures, and compares them with the signatures captured from the bus (compiled signatures)
         • Example: associate two tag bits with every memory word to differentiate between instructions and compiled signatures. When a tag for a signature appears on the bus, the watchdog captures it and forces a NOP on the bus for the regular processor

     Watchdog: Control flow checking (contd.)
     • Details and variations (contd.)
       – Derived signature checking (contd.)
         • Coverage
           – Can detect random errors in instructions in branch free intervals (but aliasing can occur)
         • Overheads
           – Memory width increase due to tag bits
           – Memory increase due to signature insertions
           – Performance impact due to NOPs
         • Solutions
           – Using path signature method: reduces the number of signatures needed
           – Branch address hashing: merge signature and branch address

     Watchdog: Mem access and assertion checks
     • What to do about memory/data errors
       – Use ECC
       – Few other methods using watchdog
         • Check for non-existent memory addresses
         • Check for out of range addresses
         • Capability based checking for objects is also possible
         • Assertion based checking and sanity checks using watchdog (independent hardware) are also possible
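The derived signature checking scheme described on the control flow checking slides can be sketched in a few lines. This is an illustration under assumptions, not the scheme from the cited references: the signature is a toy 16-bit additive checksum, a single tag value stands in for the two tag bits, the forced NOP is not modeled, and the names `signature`, `compile_with_signatures`, and `watchdog_check` are hypothetical.

```python
INSTR, SIG = 0, 1  # tag values (assumed encoding; the slides mention two tag bits)


def signature(words):
    """Toy signature: 16-bit additive checksum over the instruction words."""
    s = 0
    for w in words:
        s = (s + w) & 0xFFFF
    return s


def compile_with_signatures(intervals):
    """Build a tagged memory image: each branch free interval, then its signature."""
    image = []
    for words in intervals:
        image.extend((INSTR, w) for w in words)
        image.append((SIG, signature(words)))
    return image


def watchdog_check(bus_stream):
    """Observe (tag, word) pairs on the bus.

    Returns the index of the first compiled signature that does not match
    the signature derived from the observed instructions, or None.
    """
    running = 0
    for i, (tag, word) in enumerate(bus_stream):
        if tag == INSTR:
            running = (running + word) & 0xFFFF
        else:
            if running != word:
                return i
            running = 0  # start a new branch free interval
    return None


image = compile_with_signatures([[0x1234, 0x00FF], [0xABCD]])
clean = watchdog_check(image)

corrupted = list(image)
corrupted[1] = (INSTR, 0x00FE)      # single-bit error in one instruction word
detected_at = watchdog_check(corrupted)

swapped = list(image)
swapped[0], swapped[1] = swapped[1], swapped[0]   # same words, different order
aliased = watchdog_check(swapped)   # the additive checksum cannot see this
```

The last case shows the aliasing mentioned under "Coverage": two different instruction sequences can map to the same signature, so a stronger signature function reduces, but does not eliminate, undetected errors.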
