
10 Dependable Architectures

EPFL, Spring 2017. 10 Dependable Architectures. The material of this course was initially created by Prof. Dr. H. Kirrmann and adapted by Dr. Y-A. Pignolet & J-C. Tournier.


  1. EPFL, Spring 2017 10 Dependable Architectures The material of this course has been initially created by Prof. Dr. H. Kirrmann and adapted by Dr. Y-A. Pignolet & J-C. Tournier

  2. Fault – Error – Failure
     - Fault: defect in the system (bug).
     - Error: difference between intended and actual behavior.
     - Failure: not satisfying the specification, i.e. the system does not perform its required function.
     A fault may cause an error, and an error may cause a failure. Faults and errors are internal to the system; a failure is visible externally.
     Fault examples: SW bug, stuck bit, loose connector, …
     Error examples: missing values, measured value ≠ real value, …
     Industrial Automation | 2017

  3. Fault Tolerance Mechanisms
     1. Error detection: identify and record the cause(s) of error(s), their location and type; detection is concurrent or pre-emptive.
     2. Fault isolation: reconfiguration (online repair), passivation.
     3. Error recovery: transform a state with errors into a state without errors (forward or backward recovery).
     4. Error compensation: fault masking, error correction.
     Goal: deliver the required service in the presence of faults.

  4. Main dependable computer architectures
     a) Integer: "rather nothing than wrong" (fail-silent, fail-stop, "fail-safe"). A processor with diagnostics (D) and an off-switch on its outputs; 1oo1d.
     b) Persistent: "rather wrong than nothing" ("fail-operate"). Two processors, one on-line and one in workby, with fail-over logic; (1oo2d).
     c) Integer & persistent: error masking by massive redundancy. Three processors feeding a 2/3 voter (2oo3v).
     Exercise: compute the reliability and availability of all architectures, without and with repairs.
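As a starting point for the exercise above, the no-repair reliability of the basic redundancy structures can be sketched as follows. This is a minimal sketch under assumptions that are not in the slides: all units are identical and fail independently with a constant failure rate, and the comparator, fail-over logic and voter are ideal (they never fail and detect every error). The function names are hypothetical. Availability with repairs is not covered here.

```python
import math


def unit_reliability(failure_rate: float, hours: float) -> float:
    """R(t) = exp(-lambda * t) for one unit with a constant failure rate (assumption)."""
    return math.exp(-failure_rate * hours)


def r_2oo2(r: float) -> float:
    """Both units must work (e.g. duplication and comparison): series structure."""
    return r * r


def r_1oo2(r: float) -> float:
    """One working unit out of two suffices (ideal fail-over): parallel structure."""
    return 1.0 - (1.0 - r) ** 2


def r_2oo3(r: float) -> float:
    """At least two of three units must work (ideal 2/3 voter)."""
    return 3.0 * r**2 - 2.0 * r**3


if __name__ == "__main__":
    # Hypothetical numbers: lambda = 1e-4 per hour, mission time 1000 h.
    r = unit_reliability(1e-4, 1000.0)
    for name, f in (("2oo2", r_2oo2), ("1oo2", r_1oo2), ("2oo3", r_2oo3)):
        print(f"{name}: {f(r):.4f}")
```

Under these assumptions a 2oo2 structure is less reliable than a single unit (it trades reliability for integrity), while 1oo2 and, for sufficiently reliable units, 2oo3 improve on it.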

  5. 10.1 Error Detection and Fail-Silent
     10.1 Error detection and fail-silent computers
          - check redundancy
          - duplication and comparison
     10.2 Fault-Tolerant Structures
     10.3 Issues in Workby operation
          - Input Processing
          - Synchronization
          - Output Processing
     10.4 Issues in Standby Implementation
          - Standby Redundancy Structures
          - Checkpointing
          - Recovery
     10.5 Examples of Dependable Architectures
          - ABB dual controller
          - Boeing 777 Primary Flight Control
          - Space Shuttle PASS Computer

  6. Error Detection: Classification
     - Error detection is the base of "safe" computing ("fail-silent"): disable the outputs if an error is detected.
     - Error detection is the base of fault-tolerant computing ("fail-operate"): switch over if an error is detected and passivate the faulty unit.
     Key factors:
     - Hamming distance: how many simultaneous errors can be detected.
     - Coverage (recouvrement, Deckungsgrad): probability that an error is discovered within useful time ("useful time" meaning before any damage occurs, before automatic shutdown, …).
     - Latency (latence, Latenz): time between the occurrence and the detection of an error.

  7. Error Detection: Classification
     Errors can be detected (in order of increasing latency):
     - on-line (while the specified function is performed): by continuous monitoring/supervision
     - off-line (in a time period when the unit is not used for its specified function): by periodic testing
     - during periodic maintenance (when the unit is tested and calibrated): by thorough testing, uncovering lurking errors

  8. Error detection
     The correctness of a result can be checked by:
     - relative tests (comparison tests): comparing several results of redundant units or computations (not necessarily identical). These are pessimistic, i.e. differences due to (allowed) indeterminism count as errors. High coverage, high cost.
     - absolute tests (acceptance tests): checking the result against an a priori consistency condition (plausibility check). These are optimistic, i.e. even if the result is consistent it may not be correct (but they can catch some design errors).

  9. Error Detection: Possibilities
     On-line, relative test:
     - duplication and comparison (either hardware duplication or time redundancy)
     - triplication and voting
     On-line, absolute test:
     - watchdog (time-out)
     - control flow checking
     - error-detecting code (CRC, etc.)
     - illegal address checking
     Off-line, relative test:
     - comparison with precomputed test result (fixed inputs), e.g. memory test
     Off-line, absolute test:
     - check of program version
     - check of watchdog function
     - check code for program code
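Triplication and voting, listed above as an on-line relative test, can be sketched in a few lines. This is an illustration only: the function and exception names are hypothetical, results are compared for exact equality, and the voter itself is assumed to be fault-free.

```python
from collections import Counter
from typing import Sequence, TypeVar

T = TypeVar("T")


class NoMajorityError(Exception):
    """Raised when no two of the three results agree."""


def vote_2oo3(results: Sequence[T]) -> T:
    """Return the value delivered by at least two of three redundant units."""
    if len(results) != 3:
        raise ValueError("a 2oo3 voter needs exactly three results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise NoMajorityError(f"all three results differ: {list(results)}")
    return value
```

A single wrong result is masked: `vote_2oo3([5, 5, 7])` returns 5. Exact comparison is pessimistic for analogue values, where a tolerance band would be needed.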

  10. Detection of Errors Caused by Physical Faults
      Detection depends on the type of component, its error rate and its complexity.
      Component / Error characteristics / Typical error detection:
      - Data transmission lines / medium to high error rate, memoryless / parity, CRC, watchdog
      - Regular memory elements / medium error rate, large storage / parity, Hamming codes, EDC, CRC on disk
      - Processors and controllers / low error rate, high complexity / duplication and comparison, coded logic
      - Auxiliary elements (hard disk, ventilation) / high error rate, high diversity / mechanical integrity, voltage supervision, watchdogs, ...

  11. Watchdog Processor (absolute test)
      The application processor runs a cyclic application that resets the watchdog timer every k ms. If it fails to do so (time since the last reset > k ms), the watchdog processor shuts down and restarts the application processor, and inhibits its outputs through a trusted switch.
      [Figure: watchdog processor supervising the application processor; signals shown are supply voltage, cyclic reset (every k ms), inhibit, trusted switch.]
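The watchdog behaviour described above can be sketched as a small simulation. This is a minimal sketch, not the mechanism of any particular controller: the class and method names are hypothetical, time is passed in explicitly as milliseconds instead of being read from a hardware timer, and "inhibit" is modelled as a latched flag that only a restart clears.

```python
class Watchdog:
    """Simulated watchdog timer with a latched inhibit output (hypothetical API)."""

    def __init__(self, timeout_ms: int, now_ms: int = 0) -> None:
        self.timeout_ms = timeout_ms      # the "k ms" of the slide
        self.last_reset_ms = now_ms
        self.tripped = False

    def reset(self, now_ms: int) -> None:
        """Called by the cyclic application; ignored once the watchdog has tripped."""
        if not self.tripped:
            self.last_reset_ms = now_ms

    def poll(self, now_ms: int) -> bool:
        """Return True if the outputs must be inhibited."""
        if now_ms - self.last_reset_ms > self.timeout_ms:
            self.tripped = True
        return self.tripped

    def restart(self, now_ms: int) -> None:
        """Model the restart of the application processor after a trip."""
        self.tripped = False
        self.last_reset_ms = now_ms
```

Latching the trip means a late reset cannot silently re-enable the outputs; the application has to go through an explicit restart.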

  12. Duplication and Comparison (relative test)
      Advantage: high coverage, short latency.
      Problem: non-determinism. Digital computers are made of analogue elements with variable delays, thresholds, asynchronous clocks, ...
      The safety-relevant parts (comparator and switch) are useless if not regularly checked.
      Conditions:
      - worker and checker are identical and deterministic
      - inputs are (made) identical and synchronized (interrupts!)
      - outputs must be synchronized to allow comparison
      Variant: the checker only checks the plausibility of the results (this requires a definition of what is forbidden).
      [Figure: a safe input goes through a spreader to worker and checker, which share a clock and are synchronized; a comparator drives the switch that enables the fail-silent output.]
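The worker/checker arrangement above can be illustrated with a short sketch. It assumes that both channels are deterministic functions of the same input, which is exactly the condition the slide states; the function name and the `(enabled, output)` return convention are hypothetical, and the comparator itself is assumed to be fault-free.

```python
from typing import Callable, Optional, Tuple, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


def duplicate_and_compare(
    worker: Callable[[In], Out],
    checker: Callable[[In], Out],
    value: In,
) -> Tuple[bool, Optional[Out]]:
    """Feed the same input to both channels and enable the output only if they agree."""
    worker_result = worker(value)    # "spreader": identical input to both channels
    checker_result = checker(value)
    if worker_result == checker_result:
        return True, worker_result   # switch closed: output is delivered
    return False, None               # switch open: fail-silent, no output
```

For example, with `worker = checker = lambda x: 2 * x` the call returns `(True, 6)` for input 3, while a checker that computes `2 * x + 1` makes the output go silent.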

  13. Error detection method by coding (absolute test)
      This method is used in networks and storage, where error patterns are simple. It consists of adding a code (parity, checksum, cyclic redundancy check, …) to the useful data to protect its integrity: k data bits plus r check bits form an n-bit code word (n = k + r).
      Coding is more efficient than duplication and comparison.
      Coding has also been applied to processing elements, but the complexity can be large: for each operation on the values, a corresponding operation on the check bits has to be done.
      [Figure: values A, B, C with their codes A', B', C'.]
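As an illustration of adding check bits to data, the sketch below appends a 32-bit CRC to a byte string and verifies it on reception. It uses `zlib.crc32` from the Python standard library; the function names `encode` and `check`, the big-endian layout, and the choice of CRC-32 are assumptions made for this example, not something the slides prescribe.

```python
import zlib

CRC_BYTES = 4  # r = 32 check bits


def encode(data: bytes) -> bytes:
    """Build the code word: k data bits followed by r = 32 check bits."""
    return data + zlib.crc32(data).to_bytes(CRC_BYTES, "big")


def check(code_word: bytes) -> bool:
    """Recompute the CRC over the data part and compare it with the received check bits."""
    if len(code_word) < CRC_BYTES:
        return False
    data, received = code_word[:-CRC_BYTES], code_word[-CRC_BYTES:]
    return zlib.crc32(data).to_bytes(CRC_BYTES, "big") == received
```

A single flipped bit anywhere in the code word makes `check` return False. The code detects such errors but does not correct them, and an error pattern that happens to produce another valid code word goes unnoticed.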

  14. Error detection by predicates (absolute check)
      Results of a computation are checked against predicates that must be fulfilled, e.g. the sum of two positive integers is a positive integer.
      Plausibility checks require knowledge of the specification:
      - e.g. not all traffic lights may be green at the same time
      Plausibility may involve different information sources:
      - e.g. compare wheel speed with GPS speed
      Dangers:
      - false alarms (legal situations not foreseen by the application, e.g. flight altitude below sea level)
      - undetected real errors (the result is wrong, but plausible)
      Error coverage is not 100%!
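The two plausibility examples above can be written as predicates. This is a sketch only: the function names, the data layout and the 5 km/h tolerance are hypothetical values chosen for illustration, and a real application would derive them from its specification.

```python
from typing import Mapping


def lights_plausible(states: Mapping[str, str]) -> bool:
    """Predicate: not all traffic lights may be green at the same time."""
    return any(colour != "green" for colour in states.values())


def speeds_plausible(wheel_kmh: float, gps_kmh: float, tolerance_kmh: float = 5.0) -> bool:
    """Predicate: two independent speed sources must agree within a tolerance (assumed value)."""
    return abs(wheel_kmh - gps_kmh) <= tolerance_kmh
```

Both limits of the method show up here: a wheel speed that is wrong by 3 km/h passes the check (wrong but plausible), and a legitimately spinning wheel on ice would be flagged although nothing is broken.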

  15. Integer processors
      Integer processors are capable of detecting all single errors and of switching their outputs to a safe state in case of error ("fail-silent" processors). They are often called "fail-safe" processors, but they are only safe when used in plants where a safe state can be reached by passive means.
      This requires a high coverage, which is usually achieved by duplication and comparison. Both computers must then be operational for the system to operate: this is a 2oo2 structure (2 out of 2).

  16. Integer Computers: Self-Testing
      Computers increasingly include means to detect their own errors.
      [Figure: a system of self-testing processors (e.g. duplication & comparison) on a parallel backplane bus (self-test by parity), with stable storage (MEM, with error detection and correction) and I/O; a serial bus (CRC); changeover logic switching the output to a safe state (Vs, safe value).]
      What happens if the safe switch fails?

  17. Integer outputs: selection by the plant
      The dual channel should be extended as far as possible into the plant.
      [Figure: three output arrangements:
      - worker and checker, act if both agree (workby)
      - worker and checker, act if any does (workby)
      - controller with error detection, act if error detection agrees (error detector controls power)]

  18. 10.2 Fault-tolerant structures
      10.1 Error detection and fail-silent computers
           - check redundancy
           - duplication and comparison
      10.2 Fault-Tolerant Structures
      10.3 Issues in Workby operation
           - Input Processing
           - Synchronization
           - Output Processing
      10.4 Issues in Standby operation
           - Standby Redundancy Structures
           - Checkpointing
           - Recovery
      10.5 Examples of Dependable Architectures
           - ABB dual controller
           - Boeing 777 Primary Flight Control
           - Space Shuttle PASS Computer
