An Overview of Human Error
Drawn from J. Reason, Human Error, Cambridge, 1990
Aaron Brown
CS 294-4 ROC Seminar
Outline
• Human error and computer system failures
• A theory of human error
• Human error and accident theory
• Addressing human error
Slide 2
Dependability and human error
• Industry data shows that human error is the largest contributor to reduced dependability
  – HP HA labs: human error is the #1 cause of failures (2001)
  – Oracle: half of DB failures due to human error (1999)
  – Gray/Tandem: 42% of failures from human administrator errors (1986)
  – Murphy/Gent study of VAX systems (1993):
    [Figure: "Causes of system crashes". Y-axis: % of system crashes (0% to 100%); x-axis: time (1985-1993). Series: other, system management, software failure, hardware failure. Data labels as extracted: 18%, 53%, 18%, 10%.]
Slide 3
Learning from other fields: PSTN
• FCC-collected data on outages in the US public-switched telephone network
  – metric: breakdown of customer calls blocked by system outages (excluding natural disasters), Jan-June 2001
    [Figure: pie chart of blocked calls by cause. Categories: Human-co., Human-ext., Hardware Failure, Software Failure, Overload, Vandalism. Slice labels as extracted: 9%, 22%, 5%, 47%, 17%. Callout: human error accounts for 56% of all blocked calls.]
  – comparison with 1992-4 data shows that human error is the only factor that is not improving over time
Slide 4
Learning from other fields: PSTN
• PSTN trends: 1992-1994 vs. 2001
  Minutes (millions of customer minutes/month)
  Cause                   Trend   2001   1992-94
  Human error: company              98       176
  Human error: external            100        75
  Hardware                          49        49
  Software                          15        12
  Overload                         314        60
  Vandalism                          5         3
Slide 5
Learning from experiments
• Human error rates during maintenance of software RAID system
  – participants attempt to repair RAID disk failures
    » by replacing broken disk and reconstructing data
  – each participant repeated task several times
  – data aggregated across 5 participants
  Error type (columns: Windows, Solaris, Linux)
  Fatal Data Loss: x xx
  Unsuccessful Repair: x
  System ignored fatal input: x
  User Error – Intervention Required: x xx x
  User Error – User Recovered: x xxxx xx
  Total number of trials: 35 33 31
Slide 6
Learning from experiments
• Errors occur despite experience:
  [Figure: number of errors (0 to 3) per iteration (1 to 9) for Windows, Solaris, and Linux]
• Training and familiarity don’t eliminate errors
  – types of errors change: mistakes vs. slips/lapses
• System design affects error-susceptibility
Slide 7
Outline
• Human error and computer system failures
• A theory of human error
• Human error and accident theory
• Addressing human error
Slide 8
A theory of human error
(distilled from J. Reason, Human Error, 1990)
• Preliminaries: the three stages of cognitive processing for tasks
  1) planning
    » a goal is identified and a sequence of actions is selected to reach the goal
  2) storage
    » the selected plan is stored in memory until it is appropriate to carry it out
  3) execution
    » the plan is implemented by the process of carrying out the actions specified by the plan
Slide 9
A theory of human error (2)
• Each cognitive stage has an associated form of error
  – slips: execution stage
    » incorrect execution of a planned action
    » example: miskeyed command
  – lapses: storage stage
    » incorrect omission of a stored, planned action
    » examples: skipping a step on a checklist, forgetting to restore normal valve settings after maintenance
  – mistakes: planning stage
    » the plan is not suitable for achieving the desired goal
    » example: TMI operators prematurely disabling HPI pumps
Slide 10
Origins of error: the GEMS model
• GEMS: Generic Error-Modeling System
  – an attempt to understand the origins of human error
• GEMS identifies three levels of cognitive task processing
  – skill-based: familiar, automatic procedural tasks
    » usually low-level, like knowing to type “ls” to list files
  – rule-based: tasks approached by pattern-matching from a set of internal problem-solving rules
    » “observed symptoms X mean system is in state Y”
    » “if system state is Y, I should probably do Z to fix it”
  – knowledge-based: tasks approached by reasoning from first principles
    » when rules and experience don’t apply
Slide 11
GEMS and errors
• Errors can occur at each level
  – skill-based: slips and lapses
    » usually errors of inattention or misplaced attention
  – rule-based: mistakes
    » usually a result of picking an inappropriate rule
    » caused by misconstrued view of state, over-zealous pattern matching, frequency gambling, deficient rules
  – knowledge-based: mistakes
    » due to incomplete/inaccurate understanding of system, confirmation bias, overconfidence, cognitive strain, ...
• Errors can result from operating at wrong level
  – humans are reluctant to move from RB to KB level even if rules aren’t working
Slide 12
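The two classifications in the slides (error form by cognitive stage, and error forms by GEMS level) can be read as a pair of lookup tables. The sketch below is only an illustration of that reading; the names `Stage`, `Level`, `STAGE_ERROR`, `LEVEL_ERRORS`, and `levels_for_stage` are hypothetical and do not come from Reason or from the talk.

```python
from enum import Enum


class Stage(Enum):
    """The three stages of cognitive processing for a task."""
    PLANNING = "planning"
    STORAGE = "storage"
    EXECUTION = "execution"


class Level(Enum):
    """The three GEMS levels of cognitive task processing."""
    SKILL = "skill-based"
    RULE = "rule-based"
    KNOWLEDGE = "knowledge-based"


# Error form associated with each cognitive stage (per the slides).
STAGE_ERROR = {
    Stage.PLANNING: "mistake",
    Stage.STORAGE: "lapse",
    Stage.EXECUTION: "slip",
}

# Error forms that occur at each GEMS level (per the slides).
LEVEL_ERRORS = {
    Level.SKILL: {"slip", "lapse"},
    Level.RULE: {"mistake"},
    Level.KNOWLEDGE: {"mistake"},
}


def levels_for_stage(stage):
    """Return the GEMS levels at which the error form tied to `stage` occurs."""
    form = STAGE_ERROR[stage]
    return [level for level in Level if form in LEVEL_ERRORS[level]]


print(levels_for_stage(Stage.PLANNING))   # the rule- and knowledge-based levels
print(levels_for_stage(Stage.EXECUTION))  # the skill-based level only
```

Read this way, planning-stage errors (mistakes) belong to the rule-based and knowledge-based levels, while storage-stage and execution-stage errors (lapses and slips) belong to the skill-based level.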
Error frequencies
• In raw frequencies, SB >> RB > KB
  – 61% of errors are at skill-based level
  – 27% of errors are at rule-based level
  – 11% of errors are at knowledge-based level
• But if we look at opportunities for error, the order reverses
  – humans perform vastly more SB tasks than RB, and vastly more RB than KB
    » so a given KB task is more likely to result in error than a given RB or SB task
Slide 13
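The reversal can be illustrated with a small calculation. The error shares (61%, 27%, 11%) are from the slide; the task counts and the total number of errors are hypothetical values, chosen only so that SB tasks vastly outnumber RB tasks and RB tasks vastly outnumber KB tasks. The slide gives no such counts, so the resulting rates are illustrative and not measured.

```python
# Share of observed errors at each level, in percent (from the slide).
error_share = {"SB": 61, "RB": 27, "KB": 11}

# HYPOTHETICAL number of tasks attempted at each level (not from the source).
opportunities = {"SB": 100_000, "RB": 5_000, "KB": 200}


def errors_per_opportunity(total_errors):
    """Return errors per attempted task at each level."""
    return {
        level: (error_share[level] / 100) * total_errors / opportunities[level]
        for level in error_share
    }


# HYPOTHETICAL total number of observed errors.
rates = errors_per_opportunity(total_errors=1_000)

# Rank levels from most to least error-prone per task.
ranking = sorted(rates, key=rates.get, reverse=True)
print(ranking)
```

With these assumed counts the per-task ranking is KB, then RB, then SB, the reverse of the raw-frequency ordering.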
Error detection and correction
• Basic detection mechanism is self-monitoring
  – periodic attentional checks, measurement of progress toward goal, discovery of surprise inconsistencies, ...
• Effectiveness of self-detection of errors
  – SB errors: 75-95% detected, avg 86%
    » but some lapse-type errors were resistant to detection
  – RB errors: 50-90% detected, avg 73%
  – KB errors: 50-80% detected, avg 70%
• Including correction tells a different story:
  – SB: ~70% of all errors detected and corrected
  – RB: ~50% detected and corrected
  – KB: ~25% detected and corrected
Slide 14
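One way to read the "different story" is to ask what fraction of detected errors also get corrected. The sketch below divides the slide's detected-and-corrected figures by its average detection rates. It assumes both sets of figures describe the same population of errors, which the slide does not state, and the corrected figures are themselves approximate.

```python
# Average self-detection rates, in percent (from the slide).
detected = {"SB": 86, "RB": 73, "KB": 70}

# Errors both detected and corrected, in percent (approximate on the slide).
detected_and_corrected = {"SB": 70, "RB": 50, "KB": 25}


def correction_given_detection(level):
    """Fraction of detected errors that were also corrected at this level."""
    return detected_and_corrected[level] / detected[level]


for level in ("SB", "RB", "KB"):
    print(f"{level}: {correction_given_detection(level):.2f}")
# SB: 0.81
# RB: 0.68
# KB: 0.36
```

Under that assumption, detection rates differ only modestly across levels, but a detected KB error is far less likely to be corrected than a detected SB error.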
Outline
• Human error and computer system failures
• A theory of human error
• Human error and accident theory
• Addressing human error
Slide 15
Human error and accident theory
• Major systems accidents (“normal accidents”) start with an accumulation of latent errors
  – most of those latent errors are human errors
    » latent slips/lapses, particularly in maintenance
      • example: misconfigured valves in TMI
    » latent mistakes in system design, organization, and planning, particularly of emergency procedures
      • example: flowcharts that omit unforeseen paths
  – invisible latent errors change system reality without altering operator’s models
    » seemingly-correct actions can then trigger accidents
Slide 16
Accident theory (2)
• Accidents are exacerbated by human errors made during operator response
  – RB errors made due to lack of experience with system in failure states
    » training is rarely sufficient to develop a rule base that captures system response outside of normal bounds
  – KB reasoning is hindered by system complexity and cognitive strain
    » system complexity prohibits mental modeling
    » stress of an emergency encourages RB approaches and diminishes KB effectiveness
  – system visibility limited by automation and “defense in depth”
    » results in improper rule choices and KB reasoning
Slide 17
Outline
• Human error and computer system failures
• A theory of human error
• Human error and accident theory
• Addressing human error
  – general guidelines
  – the ROC approach: system-level undo
Slide 18
Addressing human error
• Challenges
  – humans are inherently fallible and errors are inevitable
  – hard-to-detect latent errors can be more troublesome than front-line errors
  – human psychology must not be ignored
    » especially the SB/RB/KB distinction and human behavior at each level
• General approach: error-tolerance rather than error-avoidance
  “It is now widely held among human reliability specialists that the most productive strategy for dealing with active errors is to focus upon controlling their consequences rather than upon striving for their elimination.” (Reason, p. 246)
Slide 19
The Automation Irony
• Automation is not the cure for human error
  – automation addresses the easy SB/RB tasks, leaving the complex KB tasks for the human
    » humans are ill-suited to KB tasks, especially under stress
  – automation hinders understanding and mental modeling
    » decreases system visibility and increases complexity
    » operators don’t get hands-on control experience
    » rule-set for RB tasks and models for KB tasks are weak
  – automation shifts the error source from operator errors to design errors
    » harder to detect/tolerate/fix design errors
Slide 20