Errudite: Scalable, Reproducible, and Testable Error Analysis

Tongshuang (Sherry) Wu (@tongshuangwu), University of Washington
Marco Tulio Ribeiro, Microsoft Research
Jeffrey Heer (@jeffrey_heer), University of Washington
Daniel S. Weld (@dsweld), University of Washington

Motivation & Contributions

Error analysis is important for…
Uncovering bugs
Improving the state of the art
Safeguarding deployments

Where We Are
Fader et al., ACL’13: We performed an error analysis on a sample of 100 questions
Chen et al., ACL’16: We randomly select 50 incorrect questions and categorize them into 6 classes.
Wadhwa et al., ACL’18: We sample 100 incorrect predictions and try to find common error categories.


Where We Are
“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”
Biased conclusion due to…
Subjectively defined hypotheses
+ Small samples
+ Focus exclusively on errors
+ No test on true cause

Where We Are & Our Contribution
“We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”
Biased conclusion due to… -> Principles & Errudite
Subjectively defined hypotheses -> Precise & reproducible hypotheses
+ Small samples -> Scale up to the entire dev set
+ Focus exclusively on errors -> Cover errors & correct instances
+ No test on true cause -> Test via counterfactual analysis


[Figure: screenshot with panels labeled A-F]
Video demo: https://tinyurl.com/errudite-video

Core Design: Precise & Reproducible Domain Specific Language

Precise DSL (Domain Specific Language)
DSL = Target + Attribute Extractor + Operators
Example: length(q) > 20
Instances -> Extract -> Instance Attribute -> Filter -> Instance Groups
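The extract-then-filter idea on this slide can be sketched in plain Python. This is not Errudite's API: the `Instance` class, the `length` extractor and the `build_group` helper are hypothetical names chosen for illustration, and only the example condition `length(q) > 20` comes from the slide.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    """A toy QA instance (hypothetical); only the question text is needed here."""
    key: str
    question: str


def length(text: str) -> int:
    """Attribute extractor: number of whitespace-separated tokens."""
    return len(text.split())


def build_group(instances: List[Instance],
                condition: Callable[[Instance], bool]) -> List[Instance]:
    """Filter step: keep the instances whose attributes satisfy the condition."""
    return [inst for inst in instances if condition(inst)]


instances = [
    Instance("A", "Who created the 2005 theme for Doctor Who?"),
    Instance("B", " ".join(["word"] * 25) + "?"),  # artificial 25-token question
]

# Counterpart of `length(q) > 20`: target q, extractor length, operator >.
long_questions = build_group(instances, lambda inst: length(inst.question) > 20)
print([inst.key for inst in long_questions])
```

Because the condition is an explicit expression rather than a verbal description, anyone re-running it on the same instances gets the same group.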

Biased conclusion due to… subjectively defined hypotheses: too ambiguous to reproduce.

User study: What is “imprecise answer boundaries”?
“The model is making predictions with missing or additional words…?”
D1: No exact match, but high overlap
    exact_match(p(m)) == 0 and f1(p(m)) > 0.7
D2: Off by at most 2 tokens both on the left and right
    exact_match(p(m)) == 0 and abs(answer_offset(p(m),"left")) <= 2 and abs(answer_offset(p(m),"right")) <= 2


User study: What is “imprecise answer boundaries”?
D1: No exact match, but high overlap
D2: Off by at most 2 tokens both on the left and right
Example comparing D1 and D2 (groundtruth and prediction spans are highlighted on the slide):
…the polynomial time hierarchy collapses.
…believed that the polynomial hierarchy does..
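To make the gap between the two definitions concrete, here is a minimal Python sketch of D1 and D2 over token spans. It is an illustration under stated assumptions, not Errudite's implementation: spans are hypothetical `(start, end)` token offsets with an exclusive end, `f1` is computed from positional overlap (SQuAD's official F1 is bag-of-tokens), and `answer_offset` is assumed to be the signed boundary distance.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumption)


def exact_match(pred: Span, gold: Span) -> int:
    """1 if the predicted span equals the groundtruth span, else 0."""
    return int(pred == gold)


def f1(pred: Span, gold: Span) -> float:
    """Overlap F1 computed from token positions."""
    overlap = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    if overlap == 0:
        return 0.0
    precision = overlap / (pred[1] - pred[0])
    recall = overlap / (gold[1] - gold[0])
    return 2 * precision * recall / (precision + recall)


def answer_offset(pred: Span, gold: Span, direction: str) -> int:
    """Signed token distance between predicted and groundtruth boundaries."""
    return pred[0] - gold[0] if direction == "left" else pred[1] - gold[1]


def d1(pred: Span, gold: Span) -> bool:
    """D1: no exact match, but high overlap."""
    return exact_match(pred, gold) == 0 and f1(pred, gold) > 0.7


def d2(pred: Span, gold: Span) -> bool:
    """D2: no exact match, off by at most 2 tokens on both sides."""
    return (exact_match(pred, gold) == 0
            and abs(answer_offset(pred, gold, "left")) <= 2
            and abs(answer_offset(pred, gold, "right")) <= 2)


# Long answer, prediction drops the last 3 tokens: high overlap, large offset.
long_case = (d1((5, 12), (5, 15)), d2((5, 12), (5, 15)))
# One-token answer, prediction adds a token on each side: low overlap, small offset.
short_case = (d1((4, 7), (5, 6)), d2((4, 7), (5, 6)))
print(long_case, short_case)  # (True, False) (False, True)
```

The two toy cases land in different groups under D1 and D2, which is why a verbal label such as "imprecise answer boundaries" is not enough to reproduce an analysis.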

Errudite: precise & reproducible hypotheses. Quantify instances with a domain specific language.

Design & Use Scenario
Examine the distractor hypothesis on BiDAF (Seo et al., 2016), with SQuAD (10570 instances; Rajpurkar et al., 2016)
Independently tested by 4 (out of 10) participants in the user study

Scenario: distractor hypothesis
Question: Who created the 2005 theme for Doctor Who?
Context: … John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured sampled from the 1963 original.
Common belief: BiDAF…
Matches entity types (knows to find a PERSON)
Finds the exact answer spans
Distracted by other PERSON spans

Biased conclusion due to… small samples: 100 << 2000+ errors in total.

Errudite: scale up to the entire dev set.

Build distractor groups with DSL
    ENT(g) != ""
    and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g)))
    and ENT(g) == ENT(p(m))
    and f1(m) == 0

Build distractor groups with DSL
is_entity: ENT(g) != ""
“The groundtruth is an ENTity.”
Example: ENT(Murray Gold) == PERSON

Build distractor groups with DSL
has_distractor: count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g)))
“There are more tokens matching the groundtruth entity type (ENT(g)) in the whole context than in the groundtruth.”
Example: count(PERSON: Murray Gold, John Debney, Ron Grainer) == 3 in the context; count(PERSON: Murray Gold) == 1 in the groundtruth

Build distractor groups with DSL
correct_type: ENT(g) == ENT(p(m))
“The model prediction ENTity type matches the groundtruth ENTity type.”
Example: ENT(John Debney) == PERSON

Build distractor groups with DSL
is_distracted: f1(m) == 0
“The model prediction is incorrect.”

Build distractor groups with DSL
is_entity:      ENT(g) != ""
has_distractor: and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g)))
correct_type:   and ENT(g) == ENT(p(m))
is_distracted:  and f1(m) == 0
[Chart legend: Correct, Incorrect]
5.7% of all BiDAF errors: The distractor hypothesis seems correct!
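The four clauses above narrow the dataset step by step. A minimal Python sketch of that nesting follows; it assumes the entity types and counts have already been extracted into a hypothetical `Instance` record (Errudite computes them through its DSL), and the five instances are toy data, not SQuAD.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Instance:
    """Pre-extracted attributes for one QA instance (hypothetical layout)."""
    gold_ent: str       # ENT(g): entity type of the groundtruth, "" if none
    n_ent_context: int  # count(token(c, pattern=ENT(g)))
    n_ent_gold: int     # count(token(g, pattern=ENT(g)))
    pred_ent: str       # ENT(p(m)): entity type of the model prediction
    f1: float           # f1(m)


# One clause per line of the slide, in order; each is AND-ed with the previous ones.
CLAUSES: List[Tuple[str, Callable[[Instance], bool]]] = [
    ("is_entity", lambda i: i.gold_ent != ""),
    ("has_distractor", lambda i: i.n_ent_context > i.n_ent_gold),
    ("correct_type", lambda i: i.gold_ent == i.pred_ent),
    ("is_distracted", lambda i: i.f1 == 0),
]


def build_nested_groups(instances: List[Instance]) -> Dict[str, List[Instance]]:
    """Apply the clauses cumulatively, recording the surviving group after each."""
    groups: Dict[str, List[Instance]] = {}
    current = list(instances)
    for name, clause in CLAUSES:
        current = [inst for inst in current if clause(inst)]
        groups[name] = current
    return groups


toy = [
    Instance("PERSON", 3, 1, "PERSON", 0.0),  # satisfies all four clauses
    Instance("PERSON", 3, 1, "PERSON", 1.0),  # right type and answered correctly
    Instance("PERSON", 1, 1, "PERSON", 1.0),  # entity answer, no distractor
    Instance("", 0, 0, "", 0.0),              # groundtruth is not an entity
    Instance("DATE", 2, 1, "PERSON", 0.0),    # distractor present, wrong type
]

sizes = {name: len(group) for name, group in build_nested_groups(toy).items()}
print(sizes)
```

Keeping every intermediate group, rather than only the final `is_distracted` set, is what later allows the correct instances in `correct_type` to be inspected alongside the errors.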

Biased conclusion due to… focus exclusively on errors: wrongly prioritize groups that are well handled on average.

Errudite: cover errors & correct instances.

Build distractor groups with DSL
all_instance
is_entity:      ENT(g) != ""
has_distractor: and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g)))
correct_type:   and ENT(g) == ENT(p(m))
is_distracted:  and f1(m) == 0
[Chart legend: Correct, Incorrect]
88% EM > 68% EM: BiDAF performs better when distractors are present and the entity type is matched than it does overall. Reject / revise the hypothesis!
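The contrast between the two readings (share of errors versus accuracy within the group) can be sketched with a few lines of Python. The records below are toy data chosen for illustration, not the BiDAF results, and the `summarize` helper is a hypothetical name.

```python
from typing import Dict, List, Tuple

# Each record is (in_group, correct); toy data only.
Record = Tuple[bool, bool]


def summarize(records: List[Record]) -> Dict[str, float]:
    """Compare a group's share of all errors with its exact-match rate.

    Assumes at least one error and at least one instance in the group.
    """
    errors = [r for r in records if not r[1]]
    group = [r for r in records if r[0]]
    return {
        # What an errors-only analysis reports.
        "error_share": sum(1 for in_group, _ in errors if in_group) / len(errors),
        # What becomes visible once correct instances are included.
        "group_em": sum(1 for _, correct in group if correct) / len(group),
        "overall_em": sum(1 for _, correct in records if correct) / len(records),
    }


records = (
    [(True, True)] * 4      # in the group, answered correctly
    + [(True, False)] * 1   # in the group, answered incorrectly
    + [(False, True)] * 2   # outside the group, correct
    + [(False, False)] * 3  # outside the group, incorrect
)

stats = summarize(records)
print(stats)
print("group beats baseline:", stats["group_em"] > stats["overall_em"])
```

In this toy setup the group accounts for a quarter of the errors yet is answered more accurately than the dataset as a whole, which mirrors the slide's reason for rejecting or revising the hypothesis.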
