
Overview of reliability engineering Eric Marsden - PowerPoint PPT Presentation



  1. Overview of reliability engineering Eric Marsden <eric.marsden@risk-engineering.org>

  2. Context
     ▷ I am purchasing pumps for my refinery and want to understand the MTBF, lambda etc. provided by the manufacturers
     ▷ I want to compare different system designs to determine the impact of architecture on availability
     ▷ I have a fleet of airline engines and want to anticipate when they may fail

  3. Reliability engineering
     ▷ Reliability engineering is the discipline of ensuring that a system will function as required over a specified time period when operated and maintained in a specified manner.
     ▷ Reliability engineers address 3 basic questions:
        • When does something fail?
        • Why does it fail?
        • How can the likelihood of failure be reduced?

  4. Failure
     Failure: The termination of the ability of an item to perform a required function. [IEV 191-04-01]
     ▷ A failure is always related to a required function. The function is often specified together with a performance requirement (e.g. “must handle up to 3 tonnes per minute”, “must respond within 0.1 seconds”).
     ▷ A failure occurs when the function cannot be performed or has a performance that falls outside the performance requirement.

  5. Fault
     Fault: The state of an item characterized by inability to perform a required function. [IEV 191-05-01]
     ▷ While a failure is an event that occurs at a specific point in time, a fault is a state that will last for a shorter or longer period.
     ▷ When a failure occurs, the item enters the failed state. A failure may occur:
        • while running
        • while in standby
        • due to demand

  6. Error
     Error: Discrepancy between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition. [IEC 191-05-24]
     ▷ An error is present when the performance of a function deviates from the target performance, but still satisfies the performance requirement
     ▷ An error will often, but not always, develop into a failure

  7. Failure mode
     Failure mode: The way a failure is observed on a failed item. [IEC 191-05-22]
     ▷ An item can fail in many different ways: a failure mode is a description of a possible state of the item after it has failed

  8. Failure classification
     IEC 61508 classifies failures according to their:
     ▷ Causes:
        • random (hardware) faults
        • systematic faults (including software faults)
     ▷ Effects:
        • safe failures
        • dangerous failures
     ▷ Detectability:
        • detected: revealed by online diagnostics
        • undetected: revealed by functional tests or upon a real demand for activation
     IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems

  9. Markovian models
     ▷ Models the transitions between correct state and failed state.
     ▷ Assumption: nothing in the past determines future events except for current state.
     ▷ Failure and repair are stochastic processes.
     ▷ Availability = proportion of time spent in correct state.
     [Figure: two-state diagram of a system with inputs and outputs; correct state → failed state at failure rate λ, failed state → correct state at repair rate μ]
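For the two-state model above, the long-run availability can be worked out by balancing the flow out of the correct state (λ × time spent correct) against the flow back into it (μ × time spent failed), which gives A = μ / (λ + μ). The sketch below assumes constant rates; the function name and the numeric rates are hypothetical and not taken from the slides.

```python
def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Long-run proportion of time spent in the correct state.

    Assumes a two-state Markov model with constant failure rate (lambda)
    and constant repair rate (mu), both expressed per hour.
    """
    return repair_rate / (failure_rate + repair_rate)


# Hypothetical rates: one failure per 1000 h on average, mean repair time 10 h
lam = 1 / 1000
mu = 1 / 10
availability = steady_state_availability(lam, mu)
print(f"steady-state availability: {availability:.4f}")
```

With these assumed rates the system spends roughly 99% of the time in the correct state; halving the mean repair time has about the same effect on unavailability as doubling the mean time between failures.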

  10. The “safe failure fraction”
      ▷ Not all failures are dangerous: the system may have been designed to tolerate them.
      ▷ Importance of the coverage of the error detection mechanisms, measured by the “safe failure fraction”: conditional probability that a failure will be safe, or dangerous-but-detected.
      [Figure: state diagram extending the two-state model (correct state, failed state, failure rate λ, repair rate μ) with a degraded-but-safe service state and a dangerous state; λ_S = rate of safe or dangerous-and-detected failures, λ_D = rate of non-detected and dangerous failures]

  13. Failure classification
      ▷ Safe undetected (SU): A spurious (untimely) activation of a component when not demanded
      ▷ Safe detected (SD): A non-critical alarm raised by the component
      ▷ Dangerous detected (DD): A critical diagnostic alarm reported by the component, which will, as long as it is not corrected, prevent the safety function from being executed
      ▷ Dangerous undetected (DU): A critical dangerous failure which is not reported and remains hidden until the next test or demanded activation of the safety function
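The four classes above are what the safe failure fraction is computed from: it is the share of the total failure rate that is either safe (SU, SD) or dangerous but detected (DD). A minimal sketch follows, assuming constant failure rates per class; the function name and the rates are hypothetical illustration values, not data from the slides.

```python
def safe_failure_fraction(lam_su: float, lam_sd: float,
                          lam_dd: float, lam_du: float) -> float:
    """Fraction of failures that are safe or dangerous-but-detected.

    Each argument is the failure rate (per hour) of one class:
    safe undetected, safe detected, dangerous detected, dangerous undetected.
    """
    total = lam_su + lam_sd + lam_dd + lam_du
    return (lam_su + lam_sd + lam_dd) / total


# Hypothetical failure rates per hour for a single component
sff = safe_failure_fraction(lam_su=2e-7, lam_sd=3e-7, lam_dd=4e-7, lam_du=1e-7)
print(f"safe failure fraction: {sff:.0%}")
```

Only the dangerous undetected rate lowers the fraction, which is why better diagnostic coverage (moving failures from DU to DD) raises it.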

  14. Common cause failures
      Common cause failure: A failure that is the result of one or more events, causing concurrent failures of two or more separate channels in a multiple channel system, leading to system failure. [IEC 61508]
      ▷ Typical examples: loss of electricity supply, massive physical destruction
      ▷ More subtle examples: loss of clock function (electronics), common maintenance procedure

  15. Reliability: definitions
      Reliability: The ability of an item to perform a required function, under given environmental and operational conditions, for a stated period of time. [ISO 8402]
      ▷ The reliability R(t) of an item at time t is the probability that the item performs the required function in the interval [0, t], given the stress and environmental conditions in which it operates

  16. Reliability: definitions
      ▷ If T is a random variable representing time to failure of an item, the survival function (or reliability function) R(t) is
            R(t) = Pr(T > t)
      ▷ R(t) represents the probability that the item is working correctly at time t
      ▷ Properties:
         • R(t) is non-increasing (no rising from the dead)
         • R(0) = 1 (no immediate death/failure)
         • lim_{t→∞} R(t) = 0 (no eternal life)

  17. Interpreting the reliability function
      ▷ Cumulative distribution function: F(t) = P(T ≤ t). Tells you the probability that lifetime is ≤ t.
      ▷ Reliability function: R(t) = P(T > t) = 1 − F(t). Tells you the probability that lifetime is > t.
      [Figure: two plots of probability (0 to 1) against time to failure (T): F(t), showing P(T ≤ t), and the survival function R(t), showing P(T > t)]
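The relationship between the two functions can be checked numerically. The sketch below assumes an exponentially distributed lifetime with a hypothetical mean of 1000 hours (any lifetime distribution would do) and tabulates F(t) and R(t) to show that they sum to 1, that R(0) = 1, and that R(t) decreases towards 0.

```python
import math

MEAN_LIFE = 1000.0  # hypothetical mean time to failure, in hours


def cdf(t: float) -> float:
    """F(t) = P(T <= t) for an exponential lifetime."""
    return 1.0 - math.exp(-t / MEAN_LIFE)


def reliability(t: float) -> float:
    """R(t) = P(T > t) = 1 - F(t)."""
    return math.exp(-t / MEAN_LIFE)


times = [0.0, 500.0, 1000.0, 5000.0, 50000.0]
for t in times:
    print(f"t = {t:>8.0f} h   F(t) = {cdf(t):.4f}   R(t) = {reliability(t):.4f}")
```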

  18. Exercise
      Problem: The lifetime of a modern low-wattage electronic light bulb is known to be exponentially distributed with a mean of 8000 hours.
      Q1: Find the proportion of bulbs that may be expected to fail before 7000 hours use.
      Q2: What is the lifetime that we have 95% confidence will be exceeded?
      For more on the reliability of solid-state lamps, see energy.gov

  19. Exercise
      Solution: The time to failure of our light bulbs can be modelled by the distribution
          dist = scipy.stats.expon(scale=8000)
      Q1: The CDF gives us the probability that the lifetime is ≤ t. We want dist.cdf(7000), which is 0.583137. So about 58% of light bulbs will fail before they reach 7000 hours of operation.
      Q2: We need the 0.05 quantile of the lifetime distribution, dist.ppf(0.05), which is around 410 hours.
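Both answers can be cross-checked without scipy, because the exponential distribution has a closed-form CDF, F(t) = 1 − exp(−t / mean), which inverts to t = −mean × ln(1 − p). The sketch below uses only the standard library; the variable names are illustrative.

```python
import math

mean_life = 8000.0  # mean lifetime in hours, from the problem statement

# Q1: proportion of bulbs failing before 7000 hours, F(7000)
p_fail_7000 = 1.0 - math.exp(-7000.0 / mean_life)

# Q2: lifetime exceeded by 95% of bulbs, i.e. the 0.05 quantile of the lifetime
t_95 = -mean_life * math.log(1.0 - 0.05)

print(f"Q1: proportion failing before 7000 h = {p_fail_7000:.4f}")
print(f"Q2: lifetime exceeded with 95% confidence = {t_95:.0f} h")
```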

  20. Exercise
      Problem: A particular electronic device will only function correctly if two essential components both function correctly. The lifetime of the first component is known to be exponentially distributed with a mean of 5000 hours and the lifetime of the second component (whose failures can be assumed to be independent of those of the first component) is known to be exponentially distributed with a mean of 7000 hours. Find the proportion of devices that may be expected to fail before 6000 hours use.

  21. Exercise
      Solution: The device will only be working after 6000 hours if both components are operating. The probability of the first component still working is
          > pa = 1 - scipy.stats.expon(scale=5000).cdf(6000)
          > pa
          0.3011942119122022
      and likewise for the second component
          > pb = 1 - scipy.stats.expon(scale=7000).cdf(6000)
          > pb
          0.42437284567695
      The probability of both working is pa × pb = 0.127818, so the proportion of devices that can be expected to fail before 6000 hours use is around 87%.
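The same result can be reproduced with the standard library alone. The sketch below also illustrates a convenient property of two independent exponential components in series: their failure rates add, so the device lifetime is itself exponential with rate 1/5000 + 1/7000 per hour. Variable names are illustrative.

```python
import math

mean_a = 5000.0  # mean lifetime of the first component, hours
mean_b = 7000.0  # mean lifetime of the second component, hours
t = 6000.0       # mission time, hours

# Survival probability of each component at time t
r_a = math.exp(-t / mean_a)
r_b = math.exp(-t / mean_b)

# The device works only if both independent components work
r_system = r_a * r_b

# Cross-check: the failure rates of the two components add
combined_rate = 1.0 / mean_a + 1.0 / mean_b
r_check = math.exp(-combined_rate * t)

p_fail = 1.0 - r_system
print(f"pa = {r_a:.6f}, pb = {r_b:.6f}")
print(f"P(device still working at {t:.0f} h) = {r_system:.6f}")
print(f"P(device fails before {t:.0f} h) = {p_fail:.2%}")
```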

  22. Hazard function
      ▷ The hazard function or failure rate function h(t) gives the conditional probability of failure, per unit of time, in the interval t to t + dt, given that no failure has occurred by t:
            h(t) = f(t) / R(t)
        where f(t) is the probability density function (failure density function) and R(t) is the reliability function.
      ▷ It's the probability of quitting a given state after having spent a given time in that state.
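As an illustration of the definition, the sketch below evaluates h(t) = f(t) / R(t) for an exponentially distributed lifetime, reusing the 8000-hour mean from the light bulb exercise as an assumed example. For this distribution the hazard is constant and equal to 1 / mean, and h(t) × dt closely matches the conditional probability of failing in a short interval [t, t + dt] given survival to t. Function names are illustrative.

```python
import math

MEAN_LIFE = 8000.0  # assumed mean lifetime in hours (light bulb exercise)


def pdf(t: float) -> float:
    """Failure density f(t) of an exponential lifetime."""
    return math.exp(-t / MEAN_LIFE) / MEAN_LIFE


def reliability(t: float) -> float:
    """Reliability function R(t) = P(T > t)."""
    return math.exp(-t / MEAN_LIFE)


def hazard(t: float) -> float:
    """Hazard function h(t) = f(t) / R(t)."""
    return pdf(t) / reliability(t)


dt = 1.0  # short interval, in hours
for t in (0.0, 1000.0, 7000.0, 20000.0):
    # Conditional probability of failing in [t, t + dt] given survival to t
    cond_prob = (reliability(t) - reliability(t + dt)) / reliability(t)
    print(f"t = {t:>7.0f} h   h(t) = {hazard(t):.6e}   "
          f"P(fail in next hour | alive) = {cond_prob:.6e}")
```

Other lifetime distributions generally have a hazard that changes with age; the constant hazard shown here is specific to the exponential case.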
