What can nuclear engineering teach us about software? Todd Lewis & Eduardo Bellani tlewis@brickabode.com emb@brickabode.com 24 April 2017
Read every word of Lamport ● Leslie Lamport (1977): "Proving the Correctness of Multiprocess Programs" ● This paper is amazing ● Leslie Lamport is amazing ● He published "Time, Clocks, and the Ordering of Events in a Distributed System" only a year later ● (Has there ever been a computer science decade as great as the 1970s?)
System properties come in two kinds! ● Computing is great at liveness: lots of features! ● Benefit of features often outweighs cost of failure, so “Move Fast & Break Things” ● However, we often do safety so badly that there is opportunity there; lots of low-hanging fruit

            Liveness      Safety
When        Sometimes     Always
Where       Somewhere     Everywhere
Nature      Good thing    Bad thing
Action      Happens       Does not happen
Means       Feature       Control
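A compact way to state the two kinds formally is linear temporal logic (a formalization that postdates Lamport's informal 1977 definitions); the lines below are our sketch, with Bad and Good standing for whatever predicates your system cares about.

```latex
% Safety: something bad never happens.
%   \Box = "always" (at every point of every execution)
\text{Safety:}\quad \Box\,\lnot\mathit{Bad}

% Liveness: something good eventually happens.
%   \Diamond = "eventually" (at some point of every execution)
\text{Liveness:}\quad \Diamond\,\mathit{Good}
```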
Let’s talk about saving lives ● Starting in the 1970s, human factors analysis took hold in aviation ● What used to be called “pilot error” is now recognized as “bad interface design” ● Because of this advance, hundreds of thousands of people are alive today who would otherwise be dead
Compare and contrast
Let’s design a nuclear plant! ● We are putting a nuclear plant next to the ocean ● Your mother lives next door ● What failures would you want the designers to care about?
Multi-system failures (Oceanic edition)

Bad outcome             Cause                                    Control
Multi-system failure    Tsunami                                  Put critical infrastructure up high
Multi-system failure    Corrosion                                Annual inspections
Multi-system failure    Flooding                                 Sea wall and drainage
Multi-system failure    Loss of coolant (biomass clogs pipes)    Inspect & clean pipes
Multi-system failure    Sea-borne attack                         Sea walls
Multi-system failure    Erosion kills plant                      Sea walls
Multi-system failure    Sedimentation blocks coolant             Inspect & dredge
We can do this systematically 1) What failures matter? (“Bad business outcome” is a useful criterion) 2) For each failure, what can cause it? 3) How do you address each cause? ● Gives you a finite list of hazards handled ● Gives you a clear model to give to your operators: here are the risks we manage, and how
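Here is a minimal sketch, in Haskell (one of the languages named on the closing slide), of what the output of that process could look like as data; the types and names are ours, not from the talk. Making it data means the “finite list of hazards handled” becomes reviewable and queryable, e.g. for hazards that still lack a control.

```haskell
-- Sketch: "Failures -> Causes -> Controls" as a data model.
-- Type, field, and function names are illustrative, not from the talk.

data Hazard = Hazard
  { badOutcome :: String   -- what must not happen, e.g. "Multi-system failure"
  , cause      :: String   -- what could make it happen, e.g. "Tsunami"
  , controls   :: [String] -- how that cause is addressed
  } deriving Show

-- The finite list of hazards handled (a few rows from the oceanic example).
hazards :: [Hazard]
hazards =
  [ Hazard "Multi-system failure" "Tsunami"   ["Put critical infrastructure up high"]
  , Hazard "Multi-system failure" "Corrosion" ["Annual inspections"]
  , Hazard "Multi-system failure" "Flooding"  ["Sea wall and drainage"]
  ]

-- A hazard with no listed control is a gap worth flagging to the red team.
uncontrolled :: [Hazard] -> [Hazard]
uncontrolled = filter (null . controls)

main :: IO ()
main = mapM_ print (uncontrolled hazards)
```

Printing `uncontrolled hazards` gives exactly the list to bring back to the whiteboard.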
Pro tip: Create a Red Team ● It is psychologically difficult to look at your own designs critically ● You need distance in order to tease out assumptions and blind spots ● Bring an outsider into your analysis, and encourage them to ask “dumb questions”
How to do this 1) Get a few hours of whiteboard time: your team, plus a smart outsider 2) Failures → Causes → Controls 3) Write it up 4) Start sharing it with others: here are some new options to improve our system
Where to find more ● Engineering a Safer World, by Nancy Leveson ● Resilience Engineering, by Hollnagel, Woods and Leveson ● Drift into Failure, by Sidney Dekker
“Correct, On-Time, On-Budget” ● Do you like building systems that work? ● Are you a Haskell, ML, or Lisp programmer? ● Meet us after the talk! ● jobs@brickabode.com