Building Reliable and Safe Systems - Lessons Learned Scott Torborg

  1. Building Reliable and Safe Systems - Lessons Learned Scott Torborg storborg@mit.edu April 2009

  2. The “Right” Way • Failure Modes and Effects Analysis (FMEA) • Root Cause Analysis (RCA) • MTBF, FIT, etc. ...yeah, yeah Learn that at engineering school.

  3. Design

  4. Standards • Might matter • EN 61508 = 10^9 hours MTBF for safety critical systems http://www.flickr.com/photos/lickyoats/2290383219/
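
To put the 10^9-hour MTBF figure in perspective, here is a minimal arithmetic sketch. It assumes a constant failure rate (an exponential model), which is an assumption of this example and not something the slides state; the function names and the 20-year mission are purely illustrative.

```python
import math

HOURS_PER_YEAR = 8766  # average calendar year, leap years included


def fit_from_mtbf(mtbf_hours: float) -> float:
    """Failure rate in FIT (failures per 1e9 device-hours)."""
    return 1e9 / mtbf_hours


def failure_probability(mtbf_hours: float, mission_hours: float) -> float:
    """P(at least one failure within mission_hours).

    Assumes a constant failure rate (exponential model). This is an
    assumption of the sketch, not a claim from the slides.
    """
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-mission_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 1e9  # hours, the figure quoted on the slide
    print(f"{fit_from_mtbf(mtbf):.1f} FIT")
    p = failure_probability(mtbf, 20 * HOURS_PER_YEAR)
    print(f"P(failure within 20 years) = {p:.2e}")
```

Under this model an MTBF of 10^9 hours corresponds to 1 FIT. It does not mean any single unit is expected to last 10^9 hours.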

  5. Redundancy http://www.flickr.com/photos/lickyoats/2290383219/

  6. Failure Isolation http://www.flickr.com/photos/metrix_feet/357018809/ Redundancy is no good if failures affect everything at once.
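
The point about redundancy and failure isolation can be made concrete with a small calculation. The sketch below uses a simplified common-cause ("beta-factor" style) model, which is an assumption of this example and not taken from the slides: a fraction `beta` of unit failures is assumed to take out every redundant unit at once. The function name and all numbers are hypothetical.

```python
def system_failure_probability(p_unit: float, n_units: int, beta: float = 0.0) -> float:
    """Probability that all n redundant units are down.

    p_unit: probability that a single unit fails during the mission.
    beta:   assumed fraction of unit failures that are common-cause
            (one event takes out every unit at once).
    """
    if not (0.0 <= p_unit <= 1.0 and 0.0 <= beta <= 1.0 and n_units >= 1):
        raise ValueError("invalid parameters")
    p_common = beta * p_unit               # shared event, defeats redundancy
    p_independent = (1.0 - beta) * p_unit  # per-unit, isolated failures
    all_fail_independently = p_independent ** n_units
    # The system is lost if the common event occurs OR every unit fails on its own.
    return 1.0 - (1.0 - p_common) * (1.0 - all_fail_independently)


if __name__ == "__main__":
    for beta in (0.0, 0.1, 1.0):
        p = system_failure_probability(p_unit=0.01, n_units=2, beta=beta)
        print(f"beta={beta:.1f}: P(system failure) = {p:.6f}")
```

In this toy model, even a 10% share of common-cause failures makes a duplicated system roughly ten times more likely to fail than the fully isolated case, which is why isolation matters as much as the redundancy itself.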

  7. Heterogeneous Redundancy (this is just adding redundancy in design) ...this is impractical, because design is expensive. Reference: “Architecture of the space shuttle primary avionics software system” - http://portal.acm.org/citation.cfm?id=358258 Space shuttle has 5x redundant computers, with different configurations. Akamai championed the “keep the systems heterogeneous” mentality, but that’s easier with open source/other platforms.

  8. Graceful Degradation

  9. Failures aren’t uniform or random ...don’t treat them like they are http://en.wikipedia.org/wiki/Bathtub_curve Don’t apply MTBF without considering product lifetime.
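
A rough sketch of why a single MTBF number can mislead: the bathtub curve is often approximated as the sum of an early-failure term, a constant term, and a wear-out term. The Weibull form and every parameter below are hypothetical choices for illustration, not data from the slides or from any real product.

```python
def weibull_hazard(t_hours: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate of a Weibull distribution at age t_hours."""
    if t_hours <= 0:
        raise ValueError("age must be positive")
    return (shape / scale) * (t_hours / scale) ** (shape - 1)


def bathtub_hazard(t_hours: float) -> float:
    """Hypothetical bathtub curve: infant mortality + random + wear-out."""
    infant = weibull_hazard(t_hours, shape=0.5, scale=2e5)   # decreasing with age
    random_failures = 1e-6                                   # constant, per hour
    wear_out = weibull_hazard(t_hours, shape=5.0, scale=6e4) # increasing with age
    return infant + random_failures + wear_out


if __name__ == "__main__":
    for age in (100, 10_000, 80_000):
        print(f"age {age:>6} h: hazard = {bathtub_hazard(age):.2e} per hour")
```

With these made-up parameters the failure rate is high early on, lowest in mid-life, and high again near end of life, so an MTBF quoted for the flat middle says little about a product used outside that window.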

  10. Manufacturers Mislead You • “Typical”? Yeah, right. • Sometimes they just lie, or don’t know

  11. Humans • Least reliable part of most systems • Political challenges v. Technical challenges • Interfaces and feedback • ...get stupid when in immediate danger

  12. Look at the whole picture • Reliability doesn’t stop at the product • Training • Maintenance • Support “Despite these efforts, the F-22A continues to operate below its expected reliability rates. A key reliability requirement for the F-22A is a 3-hour mean time between maintenance intervals... Currently, the mean time between maintenance is less than 1 hour.” March 2008 GAO Congressional Report

  13. Testing

  14. Test it! • Do it yourself: putting your life on the line makes you very focused • Seeing field failures yourself helps • Have an answer for every “what if?” There’s a limit to this, because testing is often destructive... focus on common use scenarios and the most risky situations.

  15. Know what happens • Reliable == Deterministic • Test everything • Talk to users as much as possible • If there’s an incident (death or injury) everyone stop everything

  16. Some environments to test • Temp / humidity extremes • Rapid changes in temp / humidity • High vibration • EMI / ESD • Oxidation risk (high-O2 or corrosive env.)

  17. Build Awesome Fixtures (3000 feet!)

  18. Burn-in Automated burn-in testing reduces infant mortality

  19. Maintenance • Record everything • Infrastructure helps make it easy, identify trends - You wouldn’t try to write software without a bug tracker. - The better the tools you have for this, the more data you’ll get. - Keep the feedback loop between maintenance and design engineers tight.

  20. Tricks

  21. Generally • Use simpler, more reliable devices to supplement more complex devices • Voltage supervisors • Watchdog timers • Diagnostic sensors
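
As an illustration of the watchdog idea only: a real watchdog timer is an independent hardware device that resets the system, whereas the sketch below merely simulates the behavior in software. The class name, method names, and the injectable clock are assumptions made for this example.

```python
import time


class Watchdog:
    """Software simulation of a watchdog timer (hypothetical interface)."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_kick = clock()

    def kick(self) -> None:
        """Called by the supervised code to prove it is still alive."""
        self._last_kick = self._clock()

    def expired(self) -> bool:
        """True if the supervised code has not kicked within the timeout."""
        return self._clock() - self._last_kick > self._timeout_s


if __name__ == "__main__":
    wd = Watchdog(timeout_s=0.05)
    wd.kick()
    time.sleep(0.1)  # simulate a hung main loop
    print("reset required" if wd.expired() else "healthy")
```

The value of the hardware version comes from its simplicity and independence: it keeps working when the complex device it supervises does not.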

  22. Logic is your friend: Check Faults [Diagram: FAULT A and FAULT B inputs combined by a logic gate into a single FAULT output] Logic gates are small and very reliable, e.g. TI “Little Logic”

  23. Logic is your friend: Check Status [Diagram: OK A and OK B inputs combined by a logic gate into a single OK output]

  24. Logic is your friend: Share Outputs [Diagram: systems A and B each drive a CTL signal; the two are combined into a single CTL output] Each independent system can override the other. Needs careful control algorithms!

  25. Do it with power too [Diagram: sources A and B combined into a single POWER output] Great chips for this, e.g. Linear PowerPath controllers - Can also be done with FETs, so don’t fret about power consumption.

  26. Voting Logic [Diagram: Flaky Sensors A, B, and C feed a Controller, which produces OUTPUT] 3, 5, 7... inputs - Don’t use just voting logic, because it can make a bad problem really bad.
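
A minimal sketch of voting across an odd number of flaky sensors. Taking the median and flagging outliers is one common approach and is an assumption of this example, not necessarily what the talk had in mind; the function name and tolerance are hypothetical. Reporting which sensors disagree is what lets the rest of the system avoid relying on the vote alone.

```python
from statistics import median


def vote(readings: list[float], tolerance: float) -> tuple[float, list[int]]:
    """Return (voted_value, suspect_indices).

    voted_value is the median of the readings; suspect_indices lists the
    sensors whose reading differs from the median by more than tolerance.
    """
    if len(readings) < 3 or len(readings) % 2 == 0:
        raise ValueError("use an odd number of inputs (3, 5, 7, ...)")
    voted = median(readings)
    suspects = [i for i, r in enumerate(readings) if abs(r - voted) > tolerance]
    return voted, suspects


if __name__ == "__main__":
    value, suspects = vote([1.0, 1.1, 9.0], tolerance=0.5)
    print(value, suspects)
```

A persistent entry in `suspects` is worth logging and acting on: once a second sensor drifts, the vote can confidently pick the wrong answer.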

  27. Detect Failures with Internal Models • Sensor value can’t change faster than 10mV/sec • Limit switch A can’t be tripped at the same time as limit switch B Pick the simplest constraints (least amount of state required) and go up from there.
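
The two constraints on this slide can be sketched as a small plausibility checker. The class and method names are hypothetical; the 10 mV/sec limit and the mutually exclusive limit switches come from the slide. The only state kept is the previous sample, in the spirit of starting with the simplest constraints.

```python
class PlausibilityChecker:
    """Flags sensor data that violates simple physical constraints."""

    def __init__(self, max_rate_v_per_s: float = 0.010):  # 10 mV/sec
        self.max_rate = max_rate_v_per_s
        self._last = None  # (timestamp_s, volts) of the previous sample

    def check(self, t_s: float, sensor_v: float,
              limit_a: bool, limit_b: bool) -> list[str]:
        """Return a list of detected faults (empty if the sample is plausible)."""
        faults = []
        # Stateless constraint: both limit switches cannot be tripped at once.
        if limit_a and limit_b:
            faults.append("limit switches A and B both tripped")
        # One sample of state: the value cannot change faster than max_rate.
        if self._last is not None:
            last_t, last_v = self._last
            dt = t_s - last_t
            if dt <= 0:
                faults.append("timestamp did not advance")
            elif abs(sensor_v - last_v) / dt > self.max_rate:
                faults.append("sensor changed faster than allowed")
        self._last = (t_s, sensor_v)
        return faults


if __name__ == "__main__":
    checker = PlausibilityChecker()
    print(checker.check(0.0, 1.000, False, False))
    print(checker.check(1.0, 1.100, True, True))
```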

  28. I/O is like sex: Use protection! [Diagram: protection components placed between OUTSIDE (bad ESD, EMI) and the microcontroller] • resistors: isolate short conditions / component failures, reduce max currents • ferrite beads: damp HF noise • small low-ESR caps (ceramic): absorb power spikes, ESD • ESD/TVS diodes: clamp over/under voltage Don’t use all of these! Just some. - Be mindful of slew rates, extra capacitance, etc. - Excess capacitance or resistance can increase power consumption, exacerbate loads, and make things worse. - Ceramic caps work best for absorbing pulses, and can be a cheap substitute for an ESD diode. - Especially protect things like reset, fault, shutdown lines.

  29. Mechanical • Don’t overconstrain or stress the board • Vibration is bad • Potting helps • Piezoceramic effects - Beware of pressure effects with soft potting compounds at altitude and pressure

  30. Board Mounting Loosen up, it’s not going anywhere

  31. Components Large components are more vulnerable

  32. Piezoceramics (e.g. ceramic capacitor) = power supply noise

  33. Piezoceramics sometimes intentional

  34. Some things that suck

  35. Most capacitors • Tantalum especially • Electrolytic bad long-term because of leaking

  36. Electromechanical Devices • Mechanical Relays ➔ Solid-state Relays • Tilt Switches ➔ Accelerometers • Mechanical Switches ➔ Piezo, FETs • Connectors The least bad connectors are optical, or process control/instrumentation connectors (e.g. M8, M12).

  37. Tin Whiskers Until they’re dealt with, get an RoHS exemption http://nepp.nasa.gov/WHISKER/index.html Can also occur with other metals, e.g. zinc. The solution is to use leaded solder.

  38. Flux Residue Clean boards after assembly! http://glacier.lbl.gov/%7Egtp/DOM/MB/V5.0/206_ps4.JPG Often overlooked reliability issue, particularly for low-voltage analog circuits.

  39. ESD • Take it seriously! • Especially while potting and testing

  40. Some things that don’t suck Notice a trend? These things that don’t suck apply the same principles discussed earlier.

  41. PPTCs • Polymer Positive Temp Coefficient • Like a fuse, but resets

  42. TDK Capacitors - Affects MLCC (multilayer ceramic caps) - “Open Mode” - Fail open instead of fail short - Much more conservative ratings

  43. Hi-Rel • Only part of solution • Flight-grade, mil-spec, etc. • $$$ Expensive • Don’t go overboard

  44. Envirogel • Makes potting practical • Watch out for rapid pressure changes, behaves like lipid tissue http://www.kellerstudio.de/repairfaq/sam/ya234p1.jpg

  45. CAN Bus • Deterministic • Robust • Fault Tolerant

  46. Process Control Connectors • E.g. M8, M12 • Affordable and easy • Turck, Phoenix, Woodhead, Binder, Tyco

  47. In short... • Be paranoid • Test thoroughly • Analyze everything ...thanks!
