Building Reliable and Safe Systems - Lessons Learned

Building Reliable and Safe Systems - Lessons Learned Scott Torborg storborg@mit.edu April 2009

The “Right” Way • Failure Modes and Effects Analysis (FMEA) • Root Cause Analysis (RCA) • MTBF, FIT, etc. ...yeah, yeah Learn that at engineering school.

Design

Standards • Might matter • EN 61508 = 10^9 hours MTBF for safety critical systems http://www.flickr.com/photos/lickyoats/2290383219/
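To give a sense of scale for the MTBF and FIT figures mentioned above, here is a minimal sketch of the usual conversions. It assumes a constant failure rate (exponential model), which only holds in the flat part of a product's life; the function names are made up for illustration.

```python
import math

# FIT is defined as failures per 10^9 device-hours.
HOURS_PER_FIT_UNIT = 1e9


def fit_from_mtbf(mtbf_hours):
    """Convert an MTBF in hours to a failure rate in FIT."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return HOURS_PER_FIT_UNIT / mtbf_hours


def failure_probability(mission_hours, mtbf_hours):
    """Probability of at least one failure during the mission.

    Assumes a constant failure rate, i.e. P = 1 - exp(-t / MTBF).
    """
    if mission_hours < 0 or mtbf_hours <= 0:
        raise ValueError("hours must be non-negative, MTBF positive")
    # expm1 keeps precision when t / MTBF is tiny.
    return -math.expm1(-mission_hours / mtbf_hours)
```

For example, `fit_from_mtbf(1e9)` gives 1 FIT, and ten years of continuous operation (87,600 hours) against a 10^9 hour MTBF works out to a failure probability a little under 1 in 10,000.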

Redundancy http://www.flickr.com/photos/lickyoats/2290383219/

Failure Isolation http://www.flickr.com/photos/metrix_feet/357018809/ Redundancy is no good if failures affect everything at once.
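A rough way to see why isolation matters: the sketch below uses a simplified beta-factor model, where a fraction `beta` of each unit's failure probability comes from a shared cause that takes out every unit at once. The model and the numbers are illustrative assumptions, not something from the talk.

```python
def system_failure_probability(p_unit, n_units, beta=0.0):
    """Probability that a redundant set of n_units fails completely.

    p_unit : failure probability of a single unit over the mission
    beta   : fraction of p_unit due to a common cause (0 = fully isolated)
    """
    if not 0.0 <= p_unit <= 1.0 or not 0.0 <= beta <= 1.0 or n_units < 1:
        raise ValueError("invalid probability, beta, or unit count")
    p_common = beta * p_unit               # shared failure hits every unit
    p_independent = (1.0 - beta) * p_unit  # failure confined to one unit
    # The system is lost if the common cause strikes, or if it does not
    # and every unit still fails on its own.
    return p_common + (1.0 - p_common) * p_independent ** n_units
```

With two units at `p_unit=0.01`, perfect isolation gives about 1e-4, while `beta=0.1` gives about 1.1e-3: the common-cause term dominates, and adding more units barely helps.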

Heterogeneous Redundancy (this is just adding redundancy in design) ...this is impractical, because design is expensive. Architecture of the space shuttle primary avionics software system - http://portal.acm.org/citation.cfm?id=358258 Space shuttle has 5x redundant computers, with different configurations. Akamai championed the “keep the systems heterogeneous” mentality, but that’s easier with open source/other platforms.

Graceful Degradation

Failures aren’t uniform or random ...don’t treat them like they are http://en.wikipedia.org/wiki/Bathtub_curve Don’t apply MTBF without considering product lifetime.
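One common way to picture the bathtub curve is as the sum of three Weibull hazard rates: a decreasing one (infant mortality), a constant one (random failures), and an increasing one (wear-out). The sketch below does exactly that; the shape and scale values are invented for illustration and do not describe any real part.

```python
def weibull_hazard(t_hours, shape, scale_hours):
    """Weibull hazard rate h(t) = (k / s) * (t / s)^(k - 1), for t > 0."""
    if t_hours <= 0:
        raise ValueError("t_hours must be positive")
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1)


def bathtub_hazard(t_hours):
    """Illustrative bathtub curve (all parameters are assumptions)."""
    infant = weibull_hazard(t_hours, 0.5, 2000.0)    # shape < 1: decreasing
    random = weibull_hazard(t_hours, 1.0, 50000.0)   # shape = 1: constant
    wearout = weibull_hazard(t_hours, 5.0, 60000.0)  # shape > 1: increasing
    return infant + random + wearout
```

Evaluating `bathtub_hazard` at 100, 25,000, and 80,000 hours shows a high early rate, a low mid-life rate, and a rising late-life rate, which is why a single MTBF number quoted for mid-life says little about either end.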

Manufacturers Mislead You • “Typical”? Yeah, right. • Sometimes they just lie, or don’t know

Humans • Least reliable part of most systems • Political challenges v. Technical challenges • Interfaces and feedback • ...get stupid when in immediate danger

Look at the whole picture • Reliability doesn’t stop at the product • Training • Maintenance • Support “Despite these efforts, the F-22A continues to operate below its expected reliability rates. A key reliability requirement for the F-22A is a 3-hour mean time between maintenance intervals... Currently, the mean time between maintenance is less than 1 hour.” March 2008 GAO Congressional Report

Testing

Test it! • Do it yourself : putting your life on the line makes you very focused • Seeing field failures yourself helps • Have an answer for every “what if?” There’s a limit to this, because testing is often destructive... focus on common use scenarios and the most risky situations.

Know what happens • Reliable == Deterministic • Test everything • Talk to users as much as possible • If there’s an incident (death or injury) everyone stop everything

Some environments to test • Temp / humidity extremes • Rapid changes in temp / humidity • High vibration • EMI / ESD • Oxidation risk (high-O2 or corrosive env.)

Build Awesome Fixtures (3000 feet!)

Burn-in Automated burn-in testing reduces infant mortality

Maintenance • Record everything • Infrastructure helps make it easy, identify trends - You wouldn’t try to write software without a bug tracker. - The better the tools you have for this, the more data you’ll get. - Keep the feedback loop between maintenance and design engineers tight.

Tricks

Generally • Use simpler, more reliable devices to supplement more complex devices • Voltage supervisors • Watchdog timers • Diagnostic sensors

Logic is your friend: Check Faults [diagram: FAULT A, FAULT B → gate → FAULT] Logic gates are small and very reliable, e.g. TI “Little Logic”

Logic is your friend: Check Status [diagram: OK A, OK B → gate → OK]
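The two gate arrangements can be written out as truth functions. This sketch assumes the fault lines are combined with OR (any fault trips the output) and the OK lines with AND (every subsystem must report healthy); the slides do not name the gate types, and `output_enabled` is a hypothetical way of using both signals together.

```python
def any_fault(*fault_lines):
    """OR gate: the combined FAULT is asserted if any input fault is."""
    return any(fault_lines)


def all_ok(*ok_lines):
    """AND gate: the combined OK is asserted only if every input is OK."""
    return all(ok_lines)


def output_enabled(fault_lines, ok_lines):
    """Hypothetical enable signal: healthy everywhere and no faults."""
    return all_ok(*ok_lines) and not any_fault(*fault_lines)
```

In hardware, each of these is a single small gate sitting outside the microcontroller, so the check keeps working even when the firmware does not.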

Logic is your friend: Share Outputs [diagram: A CTL, B CTL → gate → CTL] Each independent system can override the other. Needs careful control algorithms!

Do it with power too [diagram: A, B → POWER] Great chips for this, e.g. Linear PowerPath controllers - Can also be done with FETs, so don’t fret about power consumption.

Voting Logic [diagram: Flaky Sensors A, B, C → Controller → OUTPUT] 3, 5, 7... inputs - Don’t use just voting logic, because it can make a bad problem really bad.
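A minimal sketch of a voter for an odd number of analog sensors, using the median as the voted value. It also reports which sensors disagree with the vote, so that voting is paired with failure detection rather than used alone. The function name and the `max_spread` threshold are assumptions for illustration.

```python
from statistics import median


def vote(readings, max_spread):
    """Return (voted_value, indices_of_disagreeing_sensors).

    readings   : an odd number (3, 5, 7, ...) of sensor values
    max_spread : largest allowed distance from the voted value
    """
    if len(readings) < 3 or len(readings) % 2 == 0:
        raise ValueError("need an odd number of readings, at least 3")
    voted = median(readings)
    # Flag outliers so a failing sensor gets noticed and serviced,
    # instead of being silently outvoted until a second one fails.
    outliers = [i for i, r in enumerate(readings) if abs(r - voted) > max_spread]
    return voted, outliers
```

For example, `vote([10.0, 10.2, 55.0], max_spread=1.0)` returns `(10.2, [2])`: the output stays sane and the third sensor is flagged.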

Detect Failures with Internal Models • Sensor value can’t change faster than 10mV/sec • Limit switch A can’t be tripped at the same time as limit switch B Pick the simplest constraints (least amount of state required) and go up from there.
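The two example constraints can be sketched as stateless (or nearly stateless) plausibility checks. The 10 mV/sec limit comes from the slide; the function names and the sampling interface are assumptions.

```python
MAX_SLEW_V_PER_S = 0.010  # 10 mV/sec, from the example constraint


def sensor_plausible(prev_volts, curr_volts, dt_seconds):
    """True if the sensor did not change faster than the physical limit."""
    if dt_seconds <= 0:
        raise ValueError("dt_seconds must be positive")
    return abs(curr_volts - prev_volts) / dt_seconds <= MAX_SLEW_V_PER_S


def limit_switches_plausible(switch_a_tripped, switch_b_tripped):
    """True unless both limit switches claim to be tripped at once."""
    return not (switch_a_tripped and switch_b_tripped)
```

The slew check needs only one previous sample and the switch check needs no state at all, which fits the advice to start with the simplest constraints. A failed check signals a sensor or wiring fault, not a real-world event.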

I/O is like sex: Use protection! [diagram: OUTSIDE (bad ESD, EMI) → protection components → microcontroller] • Resistors: isolate short conditions / component failures, reduce max currents • Ferrite beads: damp HF noise • Small low-ESR caps (ceramic): absorb power spikes, ESD • ESD/TVS diodes: clamp over/under voltage Don’t use all of these! Just some. - Be mindful of slew rates, extra capacitance, etc. - Excess capacitance or resistance can increase power consumption, exacerbate loads, and make things worse. - Ceramic caps work best for absorbing pulses, and can be a cheap substitute for an ESD diode. - Especially protect things like reset, fault, shutdown lines.

Mechanical • Don’t overconstrain or stress the board • Vibration is bad • Potting helps • Piezoceramic effects - Beware of pressure effects with soft potting compounds at altitude and pressure

Board Mounting Loosen up, it’s not going anywhere

Components Large components are more vulnerable

Piezoceramics (e.g. ceramic capacitor) = power supply noise

Piezoceramics sometimes intentional

Some things that suck

Most capacitors • Tantalum especially • Electrolytic bad long-term because of leaking

Electromechanical Devices • Mechanical Relays ➔ Solid-state Relays • Tilt Switches ➔ Accelerometers • Mechanical Switches ➔ Piezo, FETs • Connectors The least bad connectors are optical, or process control/instrumentation connectors (e.g. M8, M12).

Tin Whiskers Until they’re dealt with, get an RoHS exemption http://nepp.nasa.gov/WHISKER/index.html Can also occur with other metals, e.g. zinc. The solution is to use leaded solder.

Flux Residue Clean boards after assembly! http://glacier.lbl.gov/%7Egtp/DOM/MB/V5.0/206_ps4.JPG Often overlooked reliability issue, particularly for low-voltage analog circuits.

ESD • Take it seriously! • Especially while potting and testing

Some things that don’t suck Notice a trend? These things that don’t suck apply the same principles discussed earlier.

PPTCs • Polymer Positive Temp Coefficient • Like a fuse, but resets

TDK Capacitors - Affects MLCC (multilayer ceramic caps) - “Open Mode” - Fail open instead of fail short - Much more conservative ratings

Hi-Rel • Only part of solution • Flight-grade, mil-spec, etc. • $$$ Expensive • Don’t go overboard

Envirogel • Makes potting practical • Watch out for rapid pressure changes, behaves like lipid tissue http://www.kellerstudio.de/repairfaq/sam/ya234p1.jpg

CAN Bus • Deterministic • Robust • Fault Tolerant

Process Control Connectors • E.g. M8, M12 • Affordable and easy • Turck, Phoenix, Woodhead, Binder, Tyco

In short... • Be paranoid • Test thoroughly • Analyze everything ...thanks!
