System Dependability Robert Wierschke Seminar “Prozesssteuerung und Robotik” 14. Januar 2009
Outline 2 ■ Motivation ■ Dependability □ Definition □ Dependability attributes □ Attribute relevance ■ Threats □ Fault model □ Fault-error-failure ■ Attaining dependability □ Fault tolerance □ Redundancy ■ Software Dependability ■ Summary
Motivation 3 ■ Deliver correct communication and computation services ■ First-generation computers used unreliable components □ Hardware concept ■ Consequences of system failure □ Economic ◊ Credit card authorization: $2.6 million / hour of downtime ◊ Airline reservation: $89,500 / hour of downtime □ Human life ◊ What happens if the on-board computer of an airplane crashes?
Definition: Dependability 1|2 4 [Merriam-Webster Online] dependability: capable of being depended on : reliable reliable: suitable or fit to be relied on : dependable “the collective term used to describe the availability performance and its influencing factors : reliability performance, maintainability performance and maintenance support performance” [7] □ (Zuverlässigkeit) □ Focus: availability ◊ Strongly influenced by telecommunication industry ■ Evolves over time ■ Depends on problem domain
Definition: Dependability 2|2 5 “the trustworthiness of a computing system which allows reliance to be justifiably placed on the service it delivers” [1] □ Behaves as specified □ Avoids hazards ■ Dependability Attributes □ Reliability (Funktionsfähigkeit) □ Availability (Verfügbarkeit) □ Safety (Sicherheit) □ Confidentiality (Vertraulichkeit) □ Integrity (Integrität) □ Maintainability (Wartbarkeit)
Reliability 1|2 6 ■ Continuity of service ■ Probability R(t) that a system/component operates correctly during a time period t ■ Example: □ A space probe needs to operate correctly for the whole mission time. ■ Calculating reliability □ R(0) = 1, R(∞) = 0 □ Failure probability Q(t) = 1 – R(t) □ Failure rate λ(t): number of failures per unit time during Δt □ For a constant failure rate λ: R(t) = e^{-λt} [3]
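The constant-failure-rate formula on this slide can be tried out numerically. The following is a minimal sketch; the function names and the example numbers (a failure rate of 1e-4 per hour and a 1000-hour mission) are assumptions chosen for illustration and are not taken from the slides.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """R(t) = exp(-lambda * t), valid only for a constant failure rate."""
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)


def failure_probability(failure_rate: float, t: float) -> float:
    """Q(t) = 1 - R(t)."""
    return 1.0 - reliability(failure_rate, t)


# Hypothetical mission: lambda = 1e-4 failures/hour, t = 1000 hours.
r = reliability(1e-4, 1000.0)
q = failure_probability(1e-4, 1000.0)
print(f"R = {r:.4f}, Q = {q:.4f}")  # prints "R = 0.9048, Q = 0.0952"
```

The boundary values from the slide fall out directly: R(0) = 1, and R(t) approaches 0 as t grows.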
Reliability 2|2 7 ■ Empirical values ■ Assuming independent components with reliabilities R_1, …, R_n □ Series: R = R_1 · R_2 · … · R_n □ Parallel: R = 1 – (1 – R_1)(1 – R_2) … (1 – R_n) [3]
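For independent components, the standard combination rules are that a series arrangement works only if every component works, while a parallel arrangement fails only if every component fails. The sketch below applies those two rules; the function names and the component reliabilities of 0.9 are assumptions for illustration.

```python
from math import prod
from typing import Sequence


def series_reliability(rs: Sequence[float]) -> float:
    """All components must work: R = product of R_i."""
    if not rs:
        raise ValueError("need at least one component")
    return prod(rs)


def parallel_reliability(rs: Sequence[float]) -> float:
    """At least one component must work: R = 1 - product of (1 - R_i)."""
    if not rs:
        raise ValueError("need at least one component")
    return 1.0 - prod(1.0 - r for r in rs)


# Two hypothetical components, each with reliability 0.9.
components = [0.9, 0.9]
print(f"series:   {series_reliability(components):.2f}")    # prints "series:   0.81"
print(f"parallel: {parallel_reliability(components):.2f}")  # prints "parallel: 0.99"
```

Chaining components lowers reliability below that of the weakest part, whereas duplicating them raises it, which is the motivation for redundancy later in the talk.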
Availability 1|2 8 ■ Readiness for usage ■ Probability A that a system/component operates correctly at any given point in time ■ Example: □ Telecommunication ■ Availability vs. reliability □ A service that crashes often but restarts instantly has high A but low R(t). ■ Number of 9s □ Availability and resulting downtime per year
  9s   Availability   Downtime per year
  1    90.0 %         36.5 d
  2    99.0 %         3.65 d
  3    99.9 %         8.76 h
  4    99.99 %        52.6 min
  5    99.999 %       5.26 min
  6    99.9999 %      31.5 s
  7    99.99999 %     3.15 s
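The downtime column follows from the availability by simple arithmetic: the unavailable fraction of a year is 1 – A. A small sketch, assuming a 365-day year (the function name is hypothetical):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a non-leap year


def downtime_seconds_per_year(nines: int) -> float:
    """Downtime per year for an availability of `nines` nines.

    One nine is 90 %, two nines are 99 %, and so on, so the
    unavailability is 10 ** -nines.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return SECONDS_PER_YEAR * 10.0 ** (-nines)


for n in range(1, 8):
    print(n, f"{downtime_seconds_per_year(n):.2f} s")
```

Each additional nine cuts the permitted downtime by a factor of ten, so seven nines leave roughly three seconds per year.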
Availability 2|2 9 ■ A = MTTF / (MTTF + MTTR) □ MTTF mean time to failure ◊ MTTF = 1/λ (for a constant failure rate λ) □ MTTR mean time to repair ◊ Shorter repair time leads to higher availability □ MTBF mean time between failures ◊ MTBF = MTTF + MTTR [5]
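The availability formula also makes the contrast between availability and reliability concrete. The sketch below uses invented numbers, a service that fails on average once per hour but restarts within one second, and assumes a constant failure rate so that λ = 1/MTTF; the function names are hypothetical.

```python
import math


def availability(mttf: float, mttr: float) -> float:
    """A = MTTF / (MTTF + MTTR); both arguments in the same time unit."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)


def reliability_from_mttf(mttf: float, t: float) -> float:
    """R(t) = exp(-t / MTTF), assuming a constant failure rate 1/MTTF."""
    return math.exp(-t / mttf)


# Hypothetical service: fails about once per hour, restarts in one second.
mttf_hours = 1.0
mttr_hours = 1.0 / 3600.0

a = availability(mttf_hours, mttr_hours)
r_day = reliability_from_mttf(mttf_hours, 24.0)
print(f"A = {a:.5f}")             # availability above 99.9 %
print(f"R(24 h) = {r_day:.2e}")   # practically zero
```

The service is almost always ready when asked, yet the chance that it runs a full day without interruption is negligible.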
Safety 10 ■ Avoidance of catastrophic consequences ■ Property of a system/component that it will not imperil equipment or human life. ■ Example: nuclear power plant □ If the reactor reaches temperature X, it must shut down within time Y. ■ Conflicting with reliability: a non-working system is often safe □ Fail-safe: system reaches a safe state □ Fail-operational: system provides a degraded service mode ◊ Example: spare tire
Security 11 ■ Combines confidentiality and integrity ■ Property of a system/component that it will prevent unauthorized access to or alteration of data ■ Example: control board in a train □ Displays and switches are behind a glass door, thus values can be read by everyone but not modified.
Maintainability 12 [2] ■ System can be repaired and modified ■ Repair rate: μ = 1 / MTTR ■ Hard to specify and measure □ Low maintainability ◊ e.g. Satellites ◊ Requires high reliability
Attribute relevance 13 ■ Depends on problem domain □ Economic: Are the financial consequences acceptable? □ Ethical: Are the risks to life or equipment acceptable? ■ Attributes might be conflicting □ Fail-safe state (reliability vs. safety) ■ Classes □ Uncritical Embedded Systems (e.g. mobile phone, Lego NXT) □ High-Integrity Embedded Systems (e.g. satellites) □ Safety-Critical Systems (e.g. aircraft)
Threats 14 ■ Anything that is capable of decreasing the system dependability ■ A meaningful specification must state the threats to the relevant dependability attributes ■ Fault model [5]
Fault-Error-Failure 1|3 15 ■ Fault □ A defect within the system that eventually leads to an error. ◊ Active fault: it causes an error ◊ Dormant fault: it has not (yet) caused an error [2] □ Fault classes ■ Error □ Part of the system state that may lead to a failure.
Fault-Error-Failure 2|3 16 ■ Failure □ Event that occurs when the delivered service deviates from correct service □ Service restoration: transition from incorrect to correct service □ Partial failure: a failure of a service may leave the system in degraded mode (e.g. emergency service) ■ Fault/failure chain □ Failures are recognized at component boundaries, thus the failure of one component can be considered a fault in a component that depends on it [2]
Fault-Error-Failure 3|3 17 ■ Failure □ Typical failure rate over time ◊ Hardware: bathtub curve [5] ◊ Software [5]
Attaining dependability 18 ■ Fault prevention □ Prevent faults from being introduced into the system □ Development techniques ■ Fault tolerance □ Mechanisms that allow the system to operate correctly in the presence of certain faults □ Possibly degraded service mode ■ Fault removal □ Development or usage phase □ Maintenance ■ Fault forecasting □ Predicting likely faults
Fault tolerance 1|4 19 ■ Definition: Means to maintain service while faults are present. ■ Improves reliability and availability ■ Fault can only be tolerated if it was expected ■ Phases □ Error detection □ Damage assessment □ State restoration □ Continue service ◊ Degraded mode
Fault tolerance 2|4 20 ■ Error detection techniques □ Result comparison ◊ Compare results of redundant components □ Watchdog timers ◊ Assume failure if result is late □ Reasonableness ◊ Range checks ◊ Constraints (e.g. value must not be negative) □ Information redundancy ◊ Checksums □ Functionality test ◊ Memory checks
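Three of the detection techniques listed here are small enough to sketch directly: a reasonableness (range) check, a checksum as information redundancy, and a watchdog timer. This is an illustrative sketch only; all names are hypothetical, CRC-32 is merely one possible checksum, and the watchdog is a polled software object rather than the hardware timer an embedded system would typically use.

```python
import time
import zlib


def in_range(value: float, low: float, high: float) -> bool:
    """Reasonableness check: reject values outside the plausible range."""
    return low <= value <= high


def attach_checksum(payload: bytes) -> bytes:
    """Information redundancy: append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify_checksum(frame: bytes) -> bool:
    """Recompute the CRC-32 and compare it with the transmitted one."""
    if len(frame) < 4:
        return False
    payload, received = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


class Watchdog:
    """Assume a failure if the supervised task has not reported in time."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_kick = clock()

    def kick(self) -> None:
        """Called by the supervised task to signal that it is alive."""
        self._last_kick = self._clock()

    def expired(self) -> bool:
        """True if the task has been silent for longer than the timeout."""
        return self._clock() - self._last_kick > self._timeout_s


frame = attach_checksum(b"temperature=21")
print(verify_checksum(frame))          # intact frame passes
print(in_range(-5.0, 0.0, 100.0))      # implausible reading is rejected
```

The clock is injectable so that the watchdog can be exercised without actually waiting.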
Fault tolerance 3|4 21 ■ Error recovery □ Forward ◊ Discard computation ◊ Resume service from an error-free system state. ◊ Typically used for periodic tasks □ Backward ◊ Roll back to a known-good state (checkpoints) ◊ Re-execution of the failed task possible
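Backward recovery can be sketched as "save a checkpoint, run the task, restore the checkpoint if the task fails, then try again". The code below is a minimal in-memory illustration; the class and function names are hypothetical, and a real system would usually persist checkpoints to stable storage.

```python
import copy


class CheckpointedState:
    """Holds a mutable state together with one known-good copy of it."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self) -> None:
        """Record the current state as known-good."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Discard the current state and restore the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def run_with_rollback(cs: CheckpointedState, task, max_attempts: int = 3) -> bool:
    """Run task(state); on an exception roll back and re-execute."""
    cs.checkpoint()
    for _ in range(max_attempts):
        try:
            task(cs.state)
            return True
        except Exception:
            cs.rollback()  # partial updates made by the task are undone
    return False


# Hypothetical task that fails once after a partial update.
calls = {"n": 0}


def flaky_withdraw(state):
    state["balance"] -= 10
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient fault")


account = CheckpointedState({"balance": 100})
print(run_with_rollback(account, flaky_withdraw), account.state)
```

Without the rollback, the partial update from the failed first attempt would be applied twice.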
Fault tolerance 4|4 22 ■ Replication □ Using multiple identical instances of a component □ Parallel task processing □ Voting/quorum ■ Diversity □ Tolerate systematic failures □ Using multiple different implementations of a component □ Otherwise use as replica ■ Redundancy “The most certain and effectual check upon errors which arise in the process of computation, is to cause the same computations to be made by separate and independent computers; and this check is rendered still more decisive if they make their computations by different methods” [Lardner, 1834; 2]
Redundancy 1|2 23 ■ Using multiple identical instances of a component: if one instance fails, switch to another (fail-over) ■ Types □ Space ◊ Use multiple components of the same type □ Time ◊ Send messages multiple times ◊ Execute computation multiple times □ Information ◊ Checksums, error-correcting codes
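Fail-over over space-redundant components can be illustrated as trying one replica after another until one answers. In this sketch the replicas are plain Python callables and a failure is signalled by an exception; both choices, like all names used, are assumptions for illustration.

```python
from typing import Callable, Sequence


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Send the request to the first replica; on failure switch to the next."""
    if not replicas:
        raise ValueError("need at least one replica")
    for replica in replicas:
        try:
            return replica(request)
        except Exception:
            continue  # this replica failed: fail over to the next one
    raise RuntimeError(f"all {len(replicas)} replicas failed")


def broken_primary(request: str) -> str:
    raise ConnectionError("primary is down")


def healthy_backup(request: str) -> str:
    return f"handled {request}"


print(call_with_failover([broken_primary, healthy_backup], "req-1"))
# prints "handled req-1"
```

The caller never sees the primary's failure, only a slightly later answer from the backup.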
Redundancy 2|2 24 ■ Active redundancy □ Parallel components, voting □ Voter is a single point of failure □ Fail-silent □ N-modular redundancy (NMR) ◊ N >= 3 (N = 3: triple modular redundancy, TMR) [3] ◊ Tolerates ⌊(N-1)/2⌋ component failures ■ Passive redundancy □ Hot standby ◊ Operating in the background to keep synchronized [3] □ Cold standby
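The voting step of N-modular redundancy can be sketched as a strict-majority vote over the results of the N parallel components. The function names are hypothetical, and results are assumed to be hashable and comparable for exact equality.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the components."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count > len(results) // 2:
        return value
    raise RuntimeError("no majority: too many components disagree")


def tolerated_failures(n: int) -> int:
    """An N-modular system masks floor((N - 1) / 2) faulty components."""
    return (n - 1) // 2


# TMR: one of the three components delivers a wrong result.
print(majority_vote([42, 42, 7]))   # prints 42
print(tolerated_failures(3))        # prints 1
print(tolerated_failures(5))        # prints 2
```

The voter itself is not replicated here, which is exactly the single point of failure the slide warns about.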
Software Dependability 1|2 25 ■ Fault model for software components □ Bohrbugs: easily reproducible □ Heisenbugs: complex event combination; transient faults; hard to reproduce □ “Aging”: fault accumulation (e.g. memory leaks) [8]
Software Dependability 2|2 26 ■ Rejuvenation “proactive fault management technique aimed at cleaning up the system internal state to prevent the occurrence of more severe crash failures in the future” [8] □ Heisenbugs, Aging ■ N-version Programming □ Different implementations of a software component (diversity) □ Bohrbugs (operational phase) ■ Retrying operations □ Heisenbugs, Aging ■ Fault masking/reconfiguring/fail-over □ e.g. FT CORBA □ NMR □ Heisenbugs, Aging
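Retrying is effective against Heisenbugs because the triggering combination of events is unlikely to recur on the next attempt. Below is a minimal sketch with exponential backoff; the helper name, its parameters, and the simulated transient fault are assumptions for illustration. It would not help against a Bohrbug, which fails the same way on every attempt.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 3,
          delay_s: float = 0.0, sleep=time.sleep) -> T:
    """Call operation(); on an exception wait and try again.

    The wait doubles after each failed attempt. The last exception is
    re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: report the failure
            sleep(delay_s * (2 ** attempt))
    raise AssertionError("unreachable")


# Simulated Heisenbug: the operation fails on the first two calls only.
state = {"calls": 0}


def sometimes_fails() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("transient fault")
    return "ok"


print(retry(sometimes_fails, attempts=5))  # prints "ok"
```

The `sleep` parameter is injectable so the backoff can be checked without real delays.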
Summary 1|2 27 ■ System Dependability is the ability of a system/component to operate as specified. Depending on the problem domain it subsumes a set of dependability attributes (reliability, availability, safety, security, maintainability, ...) with varying importance. ■ Threats □ Define possible degradation of system dependability □ Need to be specified □ Fault -> Error -> Failure -> Fault
Summary 2|2 28 ■ Fault tolerance □ Operate regardless of certain faults □ Tolerates only expected faults □ Strategies ◊ Replication ◊ Diversity ◊ Redundancy ● Space, time, information ● Active or passive