Avoiding Accidents - A Mission Impossible?
Michael Dorner
Chair for Network Architectures and Services
Department for Computer Science
Technical University of Munich
11.10.2013

Outline
1. Motivation for a science based on accidents
2. Normal Accident Theory and High Reliability Theory
   - Normal Accident Theory
   - High Reliability Organization
   - NAT and/vs. HRO?
3. Accidents and Computer Systems
4. Conclusion

Motivation
- Accidents happen to all of us
- The reasons for accidents are not as simple as we sometimes think
- Most accident investigations stop after assigning blame - most often to the operator
- What should we do if accidents must not happen, because the risk is, e.g., a nuclear catastrophe?

Accident at Three Mile Island
- Three Mile Island (TMI) was and still is a nuclear power plant
- In 1979 there was a partial meltdown in reactor 2
- It was the most serious accident in the history of commercial nuclear power in the US

Basic Layout of an NPP
[Figure: basic layout of a nuclear power plant]

What happened?
- The plant's feedwater pumps failed, so the reactor was not cooled properly
- The rising pressure forced the emergency relief valve to open
- The relief valve got stuck open, but its indicator showed it was closed
- The operators failed to realize what was happening
- More coolant escaped until part of the fuel rods was no longer covered
- The uncovered fuel rods began to melt

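The mismatch between the stuck relief valve and its indicator can be illustrated with a toy model. One way such a mismatch can arise is an indicator that reports the command sent to the valve rather than the valve's measured position. The sketch below assumes exactly that. The class `ReliefValve` and both indicator functions are hypothetical names made up for illustration, not a model of the actual plant instrumentation.

```python
from dataclasses import dataclass


@dataclass
class ReliefValve:
    """Toy model of a relief valve whose mechanism can jam (hypothetical)."""
    commanded_open: bool = False   # what the control system asked for
    actually_open: bool = False    # physical position of the valve
    jammed: bool = False           # a jammed valve ignores new commands

    def command(self, open_valve: bool) -> None:
        self.commanded_open = open_valve
        if not self.jammed:
            self.actually_open = open_valve


def indicator_from_command(valve: ReliefValve) -> str:
    # Reports the command signal only; this is the misleading design.
    return "OPEN" if valve.commanded_open else "CLOSED"


def indicator_from_position(valve: ReliefValve) -> str:
    # Reports the measured position; assumes a position sensor exists.
    return "OPEN" if valve.actually_open else "CLOSED"


valve = ReliefValve()
valve.command(True)      # pressure relief: the valve opens
valve.jammed = True      # the mechanism sticks in the open position
valve.command(False)     # the control system asks it to close

print(indicator_from_command(valve))   # CLOSED
print(indicator_from_position(valve))  # OPEN
```

The two indicators disagree as soon as the valve jams, which is the kind of uncertainty about the exact system state that makes an operator's job hard.
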
Normal Accident Theory - NAT

What is NAT?
- For the next 20 minutes it is not Network Address Translation!
- NAT is a theory that tries to explain why accidents happen
- It pays special attention to systems that are tightly coupled and have highly complex interactions (HC2-systems), such as nuclear power plants
- According to NAT, "normal accidents" occur in these systems because component failures interact in unanticipated ways, and such accidents cannot be avoided

Interactive Complexity Criteria
- Errors cannot be isolated and easy fixes (e.g. replacing A with B) are not possible
- Common-mode failures: a failure in one system causes the failure of multiple systems
- Uncertainty about exact processes, internal feedback and exact system state
- Local proximity or interconnection of subsystems

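The common-mode criterion above can be made concrete with a small dependency check. This is a minimal sketch under stated assumptions: the subsystem and component names in `DEPENDS_ON` are invented for illustration, and the function only finds shared components. It is not a metric for interactive complexity.

```python
from collections import defaultdict

# Hypothetical plant model: each subsystem lists the components it depends on.
DEPENDS_ON = {
    "primary_cooling": {"pump_a", "power_bus_1"},
    "backup_cooling": {"pump_b", "power_bus_1"},
    "control_room_display": {"sensor_net", "power_bus_1"},
    "pressure_relief": {"relief_valve"},
}


def common_mode_components(depends_on):
    """Return components shared by two or more subsystems, with their users."""
    users = defaultdict(set)
    for subsystem, components in depends_on.items():
        for component in components:
            users[component].add(subsystem)
    return {c: sorted(s) for c, s in users.items() if len(s) >= 2}


shared = common_mode_components(DEPENDS_ON)
for component, subsystems in shared.items():
    print(f"{component} failing would affect: {', '.join(subsystems)}")
```

In this made-up model the backup cooling looks redundant, but it shares `power_bus_1` with the primary cooling, so a single failure can take out both.
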
Coupling Criteria
- Processes follow an immutable order and cannot be delayed or interrupted
- Only one path to success
- Buffers and redundancy must be designed into the system from the very start
- The system has little slack

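The effect of slack in a process with a fixed order can be sketched with a few lines of arithmetic. The function `propagate_delay` and its parameters are hypothetical and chosen only for illustration. It assumes a strictly sequential process in which one step runs late and every step is followed by the same amount of planned spare time.

```python
def propagate_delay(n_steps, delay_at, delay, slack_per_step):
    """Return how late each step finishes in a strictly sequential process.

    n_steps:        number of steps, executed in a fixed order
    delay_at:       index of the step that runs late
    delay:          extra time that step needs
    slack_per_step: spare time planned after every step (the buffer)
    """
    lateness = []
    late = 0.0
    for step in range(n_steps):
        if step == delay_at:
            late += delay
        lateness.append(late)
        # Slack after the step absorbs part of the accumulated lateness.
        late = max(0.0, late - slack_per_step)
    return lateness


tight = propagate_delay(5, delay_at=1, delay=3.0, slack_per_step=0.0)
loose = propagate_delay(5, delay_at=1, delay=3.0, slack_per_step=1.0)
print("no slack:  ", tight)
print("with slack:", loose)
```

Without slack the disturbance reaches every later step at full size. With slack it fades out after a few steps.
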
What NAT does not do
- It does not say anything about how accidents can be avoided
- It does not offer a metric for interactive complexity or coupling
- It does not treat intentions, group interests and other human factors as relevant

High Reliability Organization - HRO

What is HRO?
- HRO is an organizational strategy to prevent accidents in HC2-systems, i.e. to provide high reliability
- It encourages redundancy as the central technical means of preventing accidents
- It focuses on an organizational "culture of reliability" to prevent accidents, which is put in place by centralized control and executed in a decentralized manner
- An HRO values reliability over everything else, even performance/cost

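Redundancy as a technical means can be sketched with median voting over redundant sensors. This is one common redundancy technique, chosen here for illustration and not taken from the HRO literature. The function name `voted_reading` and the sample values are assumptions.

```python
from statistics import median


def voted_reading(readings):
    """Return the median of redundant sensor readings.

    With three sensors, a single faulty value cannot move the median
    outside the range of the two healthy readings.
    """
    if len(readings) < 3:
        raise ValueError("median voting needs at least three readings")
    return median(readings)


healthy = voted_reading([150.2, 150.4, 150.3])
one_fault = voted_reading([150.2, 0.0, 150.3])   # one sensor reports 0.0
print(healthy, one_fault)
```

The sketch masks a single independent fault. It offers no protection against a common-mode failure that corrupts all readings at once.
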
Culture of Reliability
- Preoccupation with failure - could this minor fault result in a big catastrophe next time?
- Don’t simplify - stay aware of the complex processes
- Sensitivity to operations
- Commitment to resilience - improve crisis management
- Value experience over hierarchy

What HRO does not do
- HRO does not offer an alternative accident model; in fact, it claims to be based on NAT
- It does not consider system design (except redundancy), only operation and organization
- It also does not offer a metric for interactive complexity and coupling

NAT vs. HRO? NAT and HRO?

Why they conflict
- Both claim to be applicable to HC2-systems
- NAT claims "normal accidents" cannot be avoided, while HRO claims to provide an accident avoidance strategy
- NAT advocates claim that HRO is a theory of its own (HRT) → NAT vs. HRT
- HRO considers itself an organizational strategy for those systems → NAT and HRO
- The two theories assume different decision-making models (sensemaking vs. garbage can)

Why they don’t conflict
- Both say they apply to HC2-systems, but neither has a reproducible metric for categorizing systems
- Both sides categorize systems solely based on their own subjective impression
- Their definitions of HC2-systems do not seem to match, so they appear to be talking about two different things
- The practical consequences of the two theories usually don’t conflict

HRO and NAT
- NAT explains the impact of certain design factors on complex/unpredictable accidents
- There are likely some rare "normal" accidents, and they do depend on the coupling and the complexity of the interactions
- HRO offers a promising culture for operating risky systems, but not a way to prevent these "normal" accidents (which is not contrary to its original claim)
- Forcing multidimensional properties into a four-fold table with a non-reproducible metric causes a categorization problem, which makes both sides think they are talking about the same thing
- The more properties of tight coupling and complex interaction a system has, the more likely normal accidents become

Computers in Classical Systems
- Using computers in a system will add interactive complexity, but will likely not increase coupling
- Computers are often black boxes to system operators, which makes sensitivity to operations harder to achieve in the overall system
- Computers can be expected to make "normal" accidents more likely and HRO harder to apply

NAT in Computer Systems
- NAT should be useful during system design, because it explains which criteria increase the likelihood of unpredictable failure
- Computers as they are now are not tightly coupled, e.g. execution order is not guaranteed at any level unless explicitly synchronized
- Computers do have a certain level of complexity, but are not on par with nuclear power plants
- Many trends in computer science may increase the risk of "normal accidents", e.g. managed services, cloud computing (→ common mode!)

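The remark about execution order can be illustrated at the thread level. This is a minimal sketch using Python's standard `threading` module. The function names `first` and `second` and the shared list `log` are made up for the example. Without the `Event`, the order of the two appends would depend on the scheduler.

```python
import threading

log = []
first_done = threading.Event()


def first():
    log.append("first")
    first_done.set()          # signal that the first step has finished


def second():
    first_done.wait()         # block until the first step has finished
    log.append("second")


# Start the threads in the "wrong" order on purpose.
t2 = threading.Thread(target=second)
t1 = threading.Thread(target=first)
t2.start()
t1.start()
t1.join()
t2.join()

print(log)  # ['first', 'second']
```

The order is fixed only because the code explicitly synchronizes on `first_done`. In NAT terms, the synchronization point is where the programmer deliberately introduces coupling.
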
HRO in Computer Systems
- HRO culture is not applicable to software, but it is applicable to administration
- Systems with distributed organization are a weak spot of HRO, because HRO culture is imposed by a centralized leadership
- Thus HRO is not useful for the Internet as a whole, but may be for providers of centralized services
- When more critical processes embrace computer technology, there may very well be an even bigger market for highly reliable services

Conclusion
- Some accidents may be impossible to prevent - NAT explains why, and from it we can conclude how to reduce their probability
- A lot of accidents can be avoided - HRO offers an organizational model to achieve that
- Both theories apply to computer systems with some limitations, e.g. decentralization
- Both theories can contribute to a more complete approach to accidents

Questions?

Full List of Interactive Complexity Criteria
- Local proximity
- Common-mode connections
- Interconnected subsystems
- Limited substitution of materials
- Unknown/unfamiliar feedback loops
- Multiple and interacting controls
- Indirect information sources
- Limited understanding of processes