
ECE 753: Fault-Tolerant Computing - Overview, Introduction and Basic Concept - PDF document



4/1/2014

ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja
Department of Electrical and Computer Engineering
High-Level Fault-Tolerance: Checkpointing and Recovery
Introductory material

Overview
• Introduction and basic concept
• Fault model and fault coverage
• Checkpointing and backward error recovery (rollback)
  – General principles
  – Uniprocessor systems
• Summary
• Cost, overhead, latency issues
• Distributed systems

Introduction
• References
  – Text Chapter 6
  – [Prad:96] Chapter 3: sections on rollback and reconfiguration

Introduction (contd.)
• Somewhat higher level than ECC and watchdog; uses re-execution as the basic recovery strategy
• In practice it is a hardware-assisted software method
• Basic concept: save a fault-free state of the system and, if and when an error is detected, reload the fault-free state and re-execute

Introduction - Basic Concept (contd.)
• Three phases of recovery
  – Error detection
  – Damage assessment
  – Recovery: error elimination and arrival at the point where the error was detected
    • often entails restarting fresh on a system presumably fault-free
• Backward error recovery
  – Current process is rolled back to some error-free point and re-executes
  – Trivial solution: start afresh from the beginning of the program

Fault model and fault coverage
• Possible scenarios
  – Hardware is faulty, software is fault-free
  – Fault detection mechanism exists, in hardware or in software form
  – Hardware is fault-free, software is faulty
  – Both hardware and software are faulty
• Assumptions for backward error recovery
  – Reliable error detection mechanism exists
  – Error can be removed by re-execution
  – Process state can be restored to a previous error-free state
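The basic concept (save a fault-free state, reload it when an error is detected, re-execute) can be illustrated with a minimal sketch. Everything named here is hypothetical and not from the lecture: `TransientFault` stands in for whatever error detection mechanism exists, `run_with_rollback` is an illustrative helper, and the `max_retries` bound is an assumption added so that a persistent fault does not loop forever.

```python
import copy


class TransientFault(Exception):
    """Hypothetical signal raised by an error detection mechanism."""


def run_with_rollback(state, steps, max_retries=3):
    """Run each step in order; on a detected error, reload the saved
    fault-free state and re-execute the step (backward error recovery)."""
    for step in steps:
        checkpoint = copy.deepcopy(state)      # save fault-free state
        for _attempt in range(max_retries + 1):
            try:
                step(state)                    # may corrupt state, then raise
                break                          # step completed without error
            except TransientFault:
                state.clear()                  # discard possibly corrupted data
                state.update(copy.deepcopy(checkpoint))  # rollback
        else:
            # Re-execution did not remove the error, so the assumption
            # "error can be removed by re-execution" does not hold here.
            raise RuntimeError("error persists after re-execution")
    return state


def make_flaky_increment(failures):
    """Build a step that increments state['x'] but, for the first
    `failures` calls, corrupts the state and reports a fault."""
    remaining = {"n": failures}

    def step(state):
        state["x"] += 1
        if remaining["n"] > 0:
            remaining["n"] -= 1
            state["x"] = -999                  # simulated corruption
            raise TransientFault("simulated transient hardware error")

    return step
```

A step that fails once is retried from the checkpoint and then succeeds, so the corrupted value never survives. A step that keeps failing exhausts the retry bound and is reported as unrecoverable by this method.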

Fault model and fault coverage (contd.)
• Based on the assumptions stated:
  – Time redundancy is permissible
  – Transient hardware errors
  – If software errors (design or otherwise): alternative modules exist, or there are timing errors that may be solved during re-execution
• Methods to address other fault scenarios
  – Reliable error detection mechanisms
  – Re-configuration
  – Software fault-tolerance: e.g. recovery block and n-version programming

Checkpointing and Rollback
• General principles
  – The method is normally applicable when: an error detection mechanism exists, hardware faults are transient, and there are no software faults
  – It is feasible to determine checkpoints (system states that need to be saved) in an application
  – Method can apply to redundant as well as nonredundant systems

Checkpointing and Rollback (contd.)
• General issues: checkpointing & rollback
  – Save system state at regular intervals
    • How often to save: checkpoint interval
    • How much to save: can be as little as PC and status flags (just one instruction), or as much as a log of all messages, the complete program and associated data values at a given time
    • How long between fault occurrence and its detection (error latency) is tolerable: a large error latency often makes this method less than ideal

Checkpointing and Rollback (contd.)
• General issues: checkpointing & rollback
  – Rollback recovery
    • Where do we go back to: damage assessment
    • Rollback: load the state vector (state of the processor, the data that may have been altered or corrupted)
    • Restart the computation

Checkpointing and Rollback (contd.)
• What do we need
  – Error detection mechanism
    • Various self-checking mechanisms, e.g. error detection, timers, watchdog, acceptance tests
  – Storage for state/data saving
    • Large enough storage: PC, stack, data segments (static and dynamic), information about user and system files that may be open
    • Access time: an issue during storing and retrieval
    • Volatility and stability of the storage

Checkpointing and Rollback (contd.)
• What do we need (contd.)
  – Events
    • Messages and transactions that should be logged and replayed
  – Procedures to handle errors and restart computation
  – What if errors continue to exist? A mechanism to handle this is needed
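The interplay of checkpoint interval, message logging and replay can be sketched as follows. This is an illustration under stated assumptions, not a scheme from the lecture: the class `LoggedProcess`, its `interval` parameter (a checkpoint every N messages) and the toy state (a running total and count) are all hypothetical, and the in-memory checkpoint and log stand in for storage that would have to be stable in a real system.

```python
import copy


class LoggedProcess:
    """Toy process that checkpoints every `interval` messages and logs
    the messages received since the last checkpoint for replay."""

    def __init__(self, interval):
        self.interval = interval
        self.state = {"total": 0, "count": 0}
        self._checkpoint = copy.deepcopy(self.state)  # assumed stable storage
        self._log = []                                # messages since checkpoint

    def _apply(self, msg):
        # Deterministic state update, so replay reproduces the same state.
        self.state["total"] += msg
        self.state["count"] += 1

    def deliver(self, msg):
        self._log.append(msg)          # log the event before acting on it
        self._apply(msg)
        if len(self._log) >= self.interval:
            # Checkpoint interval reached: save state, truncate the log.
            self._checkpoint = copy.deepcopy(self.state)
            self._log.clear()

    def recover(self):
        """Rollback: load the saved state, then replay logged messages."""
        self.state = copy.deepcopy(self._checkpoint)
        for msg in self._log:
            self._apply(msg)
```

A shorter interval means less replay work at recovery but more frequent state saving. A longer interval reverses the trade. Replay restores the pre-fault state only because `_apply` is deterministic.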

Checkpointing: Uniprocessor systems
• Uniprocess and uniprocessor systems equivalence
• Simplest scheme
  – Instruction re-execution
    • Hardware (parity, self-checking, duplication) reports error
    • Instruction is re-executed using previous data and state
  – Issues
    • Register file update (commit)
    • Latency, especially in pipeline systems

Checkpointing: Uniprocessor systems (contd.)
• Process control systems
  – Program that monitors a process behaves in a predetermined manner: known control flow and typically periodic
  – Define checkpoints statically
  – Key is to determine the state to be saved

Checkpointing: Uniprocessor systems (contd.)
• Process control systems (contd.)
  – Typical objectives
    • Recovery possible in a given time
    • Minimize the total number of checkpoints
    • Methods of this nature studied in 60's

Checkpointing: Uniprocessor systems (contd.)
• General purpose systems
  – How much information to save
    • System state consisting of register file, PC, stack, etc.
    • Data?
      – All of it? Can be prohibitive (space and time)
      – So?
      – Only that data which is modified after the last checkpoint
      – How do we do this efficiently?
      – Caches provide a nice boundary to achieve this

Summary
• Discussed checkpointing classical studies
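The idea of saving only the data modified after the last checkpoint can be sketched in software with an undo log. This is an analogy, not the cache-based scheme the slides allude to: the class `IncrementalStore` and its methods are hypothetical, and a Python dictionary stands in for memory.

```python
class IncrementalStore:
    """Key-value store that records, per checkpoint interval, the old
    value of each key on its first modification only."""

    _MISSING = object()   # marks keys that did not exist at the checkpoint

    def __init__(self, data=None):
        self._data = dict(data or {})
        self._undo = {}   # key -> value the key had at the last checkpoint

    def read(self, key):
        return self._data[key]

    def write(self, key, value):
        # Save the old value only the first time a key is modified after
        # the last checkpoint; unmodified data is never copied.
        if key not in self._undo:
            self._undo[key] = self._data.get(key, self._MISSING)
        self._data[key] = value

    def checkpoint(self):
        """Commit the current data as the new recovery point.
        Returns how many modified keys were being tracked."""
        tracked = len(self._undo)
        self._undo.clear()
        return tracked

    def rollback(self):
        """Restore every modified key to its value at the last checkpoint."""
        for key, old in self._undo.items():
            if old is self._MISSING:
                self._data.pop(key, None)
            else:
                self._data[key] = old
        self._undo.clear()

    def snapshot(self):
        return dict(self._data)
```

The cost of a checkpoint here grows with the number of keys written since the previous one, not with the total size of the store, which is the property the "only modified data" question is after.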
