Building Reliable and Safe Systems - Lessons Learned Scott Torborg storborg@mit.edu April 2009
The “Right” Way • Failure Modes and Effects Analysis (FMEA) • Root Cause Analysis (RCA) • MTBF, FIT, etc. ...yeah, yeah Learn that at engineering school.
Design
Standards • Might matter • EN 61508 = 10^9 hours MTBF for safety critical systems http://www.flickr.com/photos/lickyoats/2290383219/
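To put a number like 10^9 hours in perspective, here is a minimal sketch of the usual MTBF/FIT arithmetic. It assumes a constant failure rate (which later slides warn is often not true), and the function names are hypothetical, chosen for illustration only.

```python
# FIT (Failures In Time) is defined as failures per 10^9 device-hours.
HOURS_PER_BILLION = 1e9


def mtbf_to_fit(mtbf_hours: float) -> float:
    """Convert an MTBF in hours to a FIT rate (constant failure rate assumed)."""
    return HOURS_PER_BILLION / mtbf_hours


def fit_to_mtbf(fit: float) -> float:
    """Convert a FIT rate back to an MTBF in hours."""
    return HOURS_PER_BILLION / fit


def expected_failures(mtbf_hours: float, units: int, hours_each: float) -> float:
    """Expected number of failures across a fleet, constant failure rate assumed."""
    return units * hours_each / mtbf_hours


if __name__ == "__main__":
    print(mtbf_to_fit(1e9))                     # FIT for a 10^9 hour MTBF
    print(expected_failures(1e9, 1000, 8760))   # 1000 units running for one year
```

An MTBF this large says nothing about how long any single unit lasts; it is a fleet-level failure rate.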
Redundancy http://www.flickr.com/photos/lickyoats/2290383219/
Failure Isolation http://www.flickr.com/photos/metrix_feet/357018809/ Redundancy is no good if failures affect everything at once.
Heterogeneous Redundancy (this is just adding redundancy in design) ...this is impractical, because design is expensive Architecture of the space shuttle primary avionics software system - http://portal.acm.org/citation.cfm?id=358258 Space shuttle has 5x redundant computers, with different configurations. Akamai championed the “keep the systems heterogeneous” mentality, but that’s easier with open source/other platforms.
Graceful Degradation
Failures aren’t uniform or random ...don’t treat them like they are http://en.wikipedia.org/wiki/Bathtub_curve Don’t apply MTBF without considering product lifetime.
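A rough sketch of why a single MTBF figure can mislead: the bathtub curve can be approximated as the sum of three Weibull hazard rates (infant mortality, random failures, wear-out). The shape and scale values below are made-up illustration numbers, not data for any real part.

```python
def weibull_hazard(t_hours: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1), for t > 0."""
    return (shape / scale) * (t_hours / scale) ** (shape - 1)


def bathtub_hazard(t_hours: float) -> float:
    """Illustrative bathtub curve; all parameters are assumptions."""
    infant = weibull_hazard(t_hours, shape=0.5, scale=2.0e5)    # falling rate
    random_ = weibull_hazard(t_hours, shape=1.0, scale=1.0e5)   # constant rate
    wearout = weibull_hazard(t_hours, shape=5.0, scale=6.0e4)   # rising rate
    return infant + random_ + wearout


if __name__ == "__main__":
    for t in (10, 10_000, 80_000):
        print(f"{t:>6} h: {bathtub_hazard(t):.2e} failures/hour")
```

Only the flat middle of the curve behaves like the constant rate that MTBF implies; a product used mostly in its first hours, or well past its design life, sees a very different failure rate.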
Manufacturers Mislead You • “Typical”? Yeah, right. • Sometimes they just lie, or don’t know
Humans • Least reliable part of most systems • Political challenges v. Technical challenges • Interfaces and feedback • ...get stupid when in immediate danger
Look at the whole picture • Reliability doesn’t stop at the product • Training • Maintenance • Support “Despite these efforts, the F-22A continues to operate below its expected reliability rates. A key reliability requirement for the F-22A is a 3-hour mean time between maintenance intervals... Currently, the mean time between maintenance is less than 1 hour.” March 2008 GAO Congressional Report
Testing
Test it! • Do it yourself: putting your life on the line makes you very focused • Seeing field failures yourself helps • Have an answer for every “what if?” There’s a limit to this, because testing is often destructive... focus on common use scenarios and the most risky situations.
Know what happens • Reliable == Deterministic • Test everything • Talk to users as much as possible • If there’s an incident (death or injury) everyone stop everything
Some environments to test • Temp / humidity extremes • Rapid changes in temp / humidity • High vibration • EMI / ESD • Oxidation risk (high-O2 or corrosive env.)
Build Awesome Fixtures (3000 feet!)
Burn-in Automated burn-in testing reduces infant mortality
Maintenance • Record everything • Infrastructure helps make it easy, identify trends - You wouldn’t try to write software without a bug tracker. - The better the tools you have for this, the more data you’ll get. - Keep the feedback loop between maintenance and design engineers tight.
Tricks
Generally • Use simpler, more reliable devices to supplement more complex devices • Voltage supervisors • Watchdog timers • Diagnostic sensors
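As an illustration of the watchdog idea, here is a minimal behavioral model. A real watchdog is a separate, simpler piece of hardware that resets the main device; this sketch only shows the kick/timeout logic, and the class and method names are hypothetical.

```python
class Watchdog:
    """Behavioral model of a watchdog timer (timestamps passed in explicitly)."""

    def __init__(self, timeout_s: float, now_s: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_kick_s = now_s

    def kick(self, now_s: float) -> None:
        """Called by the supervised system to prove it is still alive."""
        self.last_kick_s = now_s

    def expired(self, now_s: float) -> bool:
        """True if the supervised system has gone silent for too long."""
        return (now_s - self.last_kick_s) > self.timeout_s


if __name__ == "__main__":
    wd = Watchdog(timeout_s=1.0)
    wd.kick(now_s=0.5)
    print(wd.expired(now_s=1.2))  # still within the timeout window
    print(wd.expired(now_s=2.0))  # main loop hung: force a reset
```

Kick the watchdog from the main control loop only, not from a timer interrupt, otherwise a hung main loop goes unnoticed.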
Logic is your friend Check Faults [Diagram: FAULT A and FAULT B are combined into a single FAULT output] Logic gates are small and very reliable, e.g. TI “Little Logic”
Logic is your friend Check Status [Diagram: OK A and OK B are combined into a single OK output]
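The fault and status checks can be written as truth functions. This sketch assumes the fault lines are combined with an OR gate (any fault trips the output) and the OK lines with an AND gate (every system must report OK); the slides do not name the gate types, so treat both as assumptions.

```python
def any_fault(*fault_lines: bool) -> bool:
    """OR gate: the combined FAULT output is set if any input reports a fault."""
    return any(fault_lines)


def all_ok(*ok_lines: bool) -> bool:
    """AND gate: the combined OK output is set only if every input reports OK."""
    return all(ok_lines)


if __name__ == "__main__":
    print(any_fault(False, True))   # FAULT B alone trips the output
    print(all_ok(True, False))      # a single missing OK clears the output
```

In hardware this is one gate per signal, which is the point: the check is much simpler, and so more trustworthy, than the systems it monitors.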
Logic is your friend Share Outputs [Diagram: the CTL outputs of A and B are combined into one shared CTL output] Each independent system can override the other. Needs careful control algorithms!
Do it with power too [Diagram: supplies A and B are combined into a single POWER rail] Great chips for this, e.g. Linear PowerPath controllers - Can also be done with FETs, so don’t fret about power consumption.
Voting Logic [Diagram: Flaky Sensors A, B, and C feed a Controller, which produces one OUTPUT] 3, 5, 7... inputs - Don’t use just voting logic, because it can make a bad problem really bad.
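One possible voter for an odd number of analog sensors is a median vote that also reports which sensors disagree, so the vote is not the only line of defense. This is a sketch under assumptions: the function name and the `max_spread` threshold are hypothetical.

```python
from statistics import median
from typing import List, Sequence, Tuple


def vote(readings: Sequence[float], max_spread: float) -> Tuple[float, List[int]]:
    """Median vote over an odd number (3, 5, 7...) of redundant sensor readings.

    Returns the voted value and the indices of sensors that differ from it
    by more than max_spread, so disagreements can be logged or acted on.
    """
    if len(readings) < 3 or len(readings) % 2 == 0:
        raise ValueError("need an odd number of readings, at least 3")
    voted = median(readings)
    outliers = [i for i, r in enumerate(readings) if abs(r - voted) > max_spread]
    return voted, outliers


if __name__ == "__main__":
    value, suspects = vote([1.0, 1.1, 9.0], max_spread=0.5)
    print(value, suspects)  # sensor C (index 2) is flagged as disagreeing
```

A voter silently masks the first failure; flagging outliers gives maintenance a chance to act before a second failure outvotes the good sensor.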
Detect Failures with Internal Models • Sensor value can’t change faster than 10mV/sec • Limit switch A can’t be tripped at the same time as limit switch B Pick the simplest constraints (least amount of state required) and go up from there.
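The two example constraints can be sketched as stateless or near-stateless checks. The 10 mV/sec limit comes from the slide; the function names and the volts/seconds units are assumptions for illustration.

```python
MAX_SLEW_V_PER_S = 0.010  # 10 mV/sec, from the example constraint above


def slew_ok(prev_v: float, new_v: float, dt_s: float,
            max_slew: float = MAX_SLEW_V_PER_S) -> bool:
    """True if the sensor value changed no faster than physically plausible."""
    if dt_s <= 0:
        raise ValueError("dt_s must be positive")
    return abs(new_v - prev_v) / dt_s <= max_slew


def limits_ok(limit_a_tripped: bool, limit_b_tripped: bool) -> bool:
    """Limit switches A and B can never both be tripped at the same time."""
    return not (limit_a_tripped and limit_b_tripped)


if __name__ == "__main__":
    print(slew_ok(1.000, 1.005, dt_s=1.0))  # 5 mV in 1 s: plausible
    print(slew_ok(1.000, 1.500, dt_s=1.0))  # 500 mV in 1 s: suspect the sensor
    print(limits_ok(True, True))            # both tripped: wiring or switch fault
```

The limit-switch check needs no state at all and the slew check needs one previous sample, which is why these are good first constraints to add.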
I/O is like sex Use protection! [Diagram: protection components placed between OUTSIDE (bad ESD, EMI) and the microcontroller] • Resistors: isolate short conditions / component failures, reduce max currents • Ferrite beads: damp HF noise • Small low-ESR caps (ceramic): absorb power spikes, ESD • ESD/TVS diodes: clamp over/under voltage Don’t use all of these! Just some. - Be mindful of slew rates, extra capacitance, etc. - Excess capacitance or resistance can increase power consumption, exacerbate loads, and make things worse. - Ceramic caps work best for absorbing pulses, and can be a cheap substitute for an ESD diode. - Especially protect things like reset, fault, shutdown lines.
Mechanical • Don’t overconstrain or stress the board • Vibration is bad • Potting helps • Piezoceramic effects - Beware of pressure effects with soft potting compounds at altitude and pressure
Board Mounting Loosen up, it’s not going anywhere
Components Large components are more vulnerable
Piezoceramics (e.g. ceramic capacitor) = power supply noise
Piezoceramics sometimes intentional
Some things that suck
Most capacitors • Tantalum especially • Electrolytic bad long-term because of leaking
Electromechanical Devices • Mechanical Relays ➔ Solid-state Relays • Tilt Switches ➔ Accelerometers • Mechanical Switches ➔ Piezo, FETs • Connectors The least bad connectors are optical, or process control/instrumentation connectors (e.g. M8, M12).
Tin Whiskers Until they’re dealt with, get an RoHS exemption http://nepp.nasa.gov/WHISKER/index.html Can also occur with other metals, e.g. zinc. The solution is to use leaded solder.
Flux Residue Clean boards after assembly! http://glacier.lbl.gov/%7Egtp/DOM/MB/V5.0/206_ps4.JPG Often overlooked reliability issue, particularly for low-voltage analog circuits.
ESD • Take it seriously! • Especially while potting and testing
Some things that don’t suck Notice a trend? These things that don’t suck apply the same principles discussed earlier.
PPTCs • Polymer Positive Temp Coefficient • Like a fuse, but resets
TDK Capacitors - Affects MLCC (multilayer ceramic caps) - “Open Mode” - Fail open instead of fail short - Much more conservative ratings
Hi-Rel • Only part of solution • Flight-grade, mil-spec, etc. • $$$ Expensive • Don’t go overboard
Envirogel • Makes potting practical • Watch out for rapid pressure changes, behaves like lipid tissue http://www.kellerstudio.de/repairfaq/sam/ya234p1.jpg
CAN Bus • Deterministic • Robust • Fault Tolerant
Process Control Connectors • E.g. M8, M12 • Affordable and easy • Turck, Phoenix, Woodhead, Binder, Tyco
In short... • Be paranoid • Test thoroughly • Analyze everything ...thanks!