Stress Test Evaluation for Natural Language Inference


  1. Stress Test Evaluation for Natural Language Inference. Aakanksha Naik*, Abhilasha Ravichander*, Norman Sadeh, Carolyn Rose, Graham Neubig

  2. Natural Language Inference (a.k.a. Recognizing Textual Entailment) Premise: Stimpy was a little cat who believed he could fly. Hypothesis: Stimpy could fly. (Fyodorov, 2000; Condoravdi, 2003; Bos and Markert, 2005; Dagan et al., 2006; MacCartney and Manning, 2009)

  3. Natural Language Inference (a.k.a. Recognizing Textual Entailment) Premise: Stimpy was a little cat who believed he could fly. Hypothesis: Stimpy could fly. Given a premise, determine whether a hypothesis is true (entailment), false (contradiction), or undecided (neutral). (Fyodorov, 2000; Condoravdi, 2003; Bos and Markert, 2005; Dagan et al., 2006; MacCartney and Manning, 2009)
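To make the three-way setup concrete, here is a minimal sketch (my own illustration, not from the talk) of an NLI example as a labeled premise/hypothesis pair; the gold label shown for the Stimpy pair is an assumption, since the slide does not state it.

```python
# Minimal sketch of the NLI task setup: an example is a premise/hypothesis pair
# with one of three labels. The label below is assumed, not given on the slide
# (belief in a proposition neither proves nor disproves it).
from dataclasses import dataclass

LABELS = ("entailment", "contradiction", "neutral")

@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # one of LABELS

example = NLIExample(
    premise="Stimpy was a little cat who believed he could fly",
    hypothesis="Stimpy could fly",
    label="neutral",
)
```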

  4. Natural Language Inference (a.k.a. Recognizing Textual Entailment) Benchmark task for Natural Language Understanding. Prevalent view: to perform well at NLI, models must ○ learn good sentence representations: “handle nearly the full complexity of compositional semantics” (Williams et al., 2018) ○ reason over “difficult” phenomena like lexical entailment, quantification, coreference, tense, belief, modality, and lexical and syntactic ambiguity (Dagan et al., 2009; MacCartney and Manning, 2009; Marelli et al., 2014; Williams et al., 2018)

  5. Natural Language Inference (a.k.a. Recognizing Textual Entailment) MultiNLI ● Text from 10 genres ● Covers written & spoken English ● Longer, more complex sentences ● Variety of linguistic phenomena ● Sentence-encoder SOTA: 74.5% (Nie and Bansal, 2017)* *All results are for the matched evaluation set; refer to the paper for further details.

  6. Motivation Neural networks can solve nearly ¾ of the examples from the challenging MultiNLI dataset!

  7. Motivation Neural networks can solve nearly ¾ of the examples from the challenging MultiNLI dataset! But, more difficult cases occur rarely and are masked in traditional evaluation, giving an optimistic estimate of model performance.

  8. Motivation Neural networks can solve nearly ¾ of the examples from the challenging MultiNLI dataset! But, more difficult cases occur rarely and are masked in traditional evaluation, giving an optimistic estimate of model performance. We want to figure out whether our systems have the ability to make real inferential decisions, and if so, to what extent.

  9. What are Stress Tests? Stress testing: testing a system beyond normal operational capacity to confirm that intended specifications are being met and to identify weaknesses, if any. For NLI: building large-scale diagnostic datasets to exercise models on their weaknesses and better understand their capabilities.

  10. Why Stress Tests? Reward a system's ability to reason about the task instead of encouraging reliance on misleading correlations in datasets. “Sanity checking” for NLP models. Analyze strengths and weaknesses of various models. Fine-grained, phenomenon-by-phenomenon evaluation scheme.

  11. Weaknesses of SOTA NLI Models ● To construct stress tests, we must first identify “bugs” (potential weaknesses) ● Analyzed errors of Nie & Bansal (2017) (best-performing single model)

  12. Word Overlap (29% of errors) Premise: And, could it not result in a decline in Postal Service volumes across-the-board? Hypothesis: There may not be a decline in Postal Service volumes across-the-board. Gold label: Neutral. Model prediction: Entailment.

  13. Negation (13% of errors) Premise: Enthusiasm for Disney’s Broadway production of The Lion King dwindles. Hypothesis: The broadway production of The Lion King is no longer enthusiastically attended. Gold label: Entailment. Model prediction: Contradiction.

  14. Length Mismatch (3% of errors) Premise: So you know well a lot of the stuff you hear coming from South Africa now and from West Africa that’s considered world music because it’s not particularly using certain types of folk styles. Hypothesis: They rely too heavily on the types of folk styles. Gold label: Contradiction. Model prediction: Neutral.

  15. Numerical Reasoning (3% of errors) Premise: Deborah Pryce said Ohio Legal Services in Columbus will receive a $200,000 federal grant toward an online legal self-help center. Hypothesis: A $900,000 federal grant will be received by Missouri Legal Services, said Deborah Pryce. Gold label: Contradiction. Model prediction: Entailment.

  16. Antonymy (5% of errors) Premise: “Have her show it,” said Thorn. Hypothesis: Thorn told her to hide it. Gold label: Contradiction. Model prediction: Entailment.

  17. Grammaticality (3% of errors) Premise: So if there are something interesting or something worried, please give me a call at any time. Hypothesis: The person is open to take a call anytime. Gold label: Contradiction. Model prediction: Neutral.

  18. Real World Knowledge (12% of errors) Premise: It was still night. Hypothesis: The sun hadn’t risen yet, for the moon was shining daringly in the sky. Gold label: Entailment. Model prediction: Neutral.

  19. Ambiguity (6% of errors) Premise: Outside the cathedral you will find a statue of John Knox with Bible in hand. Hypothesis: John Knox was someone who read the Bible. Gold label: Entailment. Model prediction: Neutral.

  20. Unknown (26% of errors) Premise: We’re going to try something different this morning, said Jon. Hypothesis: Jon decided to try a new approach. Gold label: Entailment. Model prediction: Contradiction.

  21. Constructing Stress Tests
      ● Competence Tests: evaluate model ability to reason about quantities and understand antonyms. Target error categories: antonymy, numerical reasoning. Construction framework: heuristic rules, external knowledge sources.
      ● Distraction Tests: evaluate model robustness to shallow distractions. Target error categories: word overlap, negation, length mismatch. Construction framework: label-preserving perturbations using propositional logic.
      ● Noise Tests: evaluate model robustness to noise in data. Target error category: grammaticality. Construction framework: random perturbation.
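Since the stress sets are meant to support the fine-grained, phenomenon-by-phenomenon evaluation mentioned on slide 10, a small harness along these lines could tie them together. This is a hypothetical sketch: the file names, the `load_examples` loader, and the `model.predict` interface are placeholders rather than part of the released stress tests, and examples are assumed to carry premise/hypothesis/label attributes as in the sketch after slide 3.

```python
# Hypothetical evaluation harness: file names, load_examples(), and
# model.predict() are placeholders, not the authors' release.
def accuracy(model, examples):
    """Fraction of (premise, hypothesis, label) examples the model labels correctly."""
    correct = sum(model.predict(ex.premise, ex.hypothesis) == ex.label for ex in examples)
    return correct / len(examples)

STRESS_SETS = {
    "antonymy": "antonymy.jsonl",               # competence
    "numerical_reasoning": "numerical.jsonl",   # competence
    "word_overlap": "word_overlap.jsonl",       # distraction
    "negation": "negation.jsonl",               # distraction
    "length_mismatch": "length_mismatch.jsonl", # distraction
    "grammaticality": "noise.jsonl",            # noise
}

def stress_report(model, load_examples):
    """One accuracy number per stress phenomenon."""
    return {name: accuracy(model, load_examples(path))
            for name, path in STRESS_SETS.items()}
```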

  22. Constructing Competence Tests: Antonymy. Pipeline for a sentence such as P: I saw a big house → POS tagging (nouns: house; adjectives: big) → word sense disambiguation (WSD) over the word senses → WordNet antonym lookup (big × small) → replace the word with its antonym in the sentence → H: I saw a small house.

  23. Constructing Competence Tests: Antonymy. Final entailment pair: Premise: I saw a big house. Hypothesis: I saw a small house. Label: Contradiction. Size: 3.2k pairs.
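A rough sketch of the antonymy pipeline above, using NLTK for POS tagging, Lesk-based word sense disambiguation, and WordNet antonym lookup. The filtering heuristics here are assumptions for illustration, not the authors' exact implementation, and the usual NLTK data packages (tokenizer, tagger, WordNet) are assumed to be installed.

```python
import nltk
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

# Map Penn Treebank tag prefixes to WordNet POS tags.
WN_POS = {"JJ": wn.ADJ, "NN": wn.NOUN, "VB": wn.VERB, "RB": wn.ADV}

def antonym_pairs(sentence):
    """Yield (premise, hypothesis, label) triples built by swapping one word
    for a WordNet antonym of its disambiguated sense (label: contradiction)."""
    tokens = nltk.word_tokenize(sentence)
    for i, (word, tag) in enumerate(nltk.pos_tag(tokens)):
        pos = WN_POS.get(tag[:2])
        if pos is None:
            continue
        sense = lesk(tokens, word, pos=pos)   # crude word sense disambiguation
        if sense is None:
            continue
        for lemma in sense.lemmas():
            for ant in lemma.antonyms():      # antonyms of that specific sense
                swapped = tokens[:i] + [ant.name().replace("_", " ")] + tokens[i + 1:]
                yield sentence, " ".join(swapped), "contradiction"

# For "I saw a big house" this can yield
# ("I saw a big house", "I saw a small house", "contradiction"),
# matching the final pair above.
```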

  24. Constructing Competence Tests: Numerical Reasoning. Start from AQuA-RAT word problems (problem, options, answer, rationale). Preprocessing: 1. discard “complex” problems; 2. split problems into sentences; 3. discard sentences without numbers and named entities. From a sentence such as P: Tim had 750 bags of cement, use heuristics to generate an entailed hypothesis (Ent H: Tim had more than 550 bags of cement), flip it to get a contradicting hypothesis (Cont H: Tim had 350 bags of cement), and randomly choose one quantity to form a neutral pair (P: Tim had more than 550 bags of cement; H: Tim had 750 bags of cement; L: neutral).

  25. Constructing Competence Tests: Numerical Reasoning. Final entailment pairs: Premise: Tim had 750 bags of cement. Hypothesis: Tim had more than 550 bags of cement. Label: Entailment. Premise: Tim had 750 bags of cement. Hypothesis: Tim had 350 bags of cement. Label: Contradiction. Size: 7.5k pairs.
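The heuristics sketched on these two slides could look roughly like the following; the choice of bounds, offsets, and the "more than ..." phrasing are my assumptions about the general recipe, not the authors' exact rules.

```python
import random
import re

def numeric_pairs(sentence):
    """From a sentence containing a quantity (e.g. a preprocessed AQuA-RAT
    sentence), build entailment / contradiction / neutral pairs around it."""
    match = re.search(r"\b\d+\b", sentence)
    if match is None:
        return []
    n = int(match.group())
    lower = n - random.randint(1, max(1, n // 2))   # a bound strictly below n
    different = n + random.randint(1, max(1, n))    # a quantity different from n
    weaker = sentence.replace(match.group(), f"more than {lower}", 1)
    wrong = sentence.replace(match.group(), str(different), 1)
    return [
        (sentence, weaker, "entailment"),    # the exact count entails the weaker bound
        (sentence, wrong, "contradiction"),  # a different count contradicts it
        (weaker, sentence, "neutral"),       # the bound does not pin down the count
    ]

# numeric_pairs("Tim had 750 bags of cement") reproduces the pattern of the
# pairs above, with randomly chosen bounds and quantities.
```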

  26. Constructing Distraction Tests. Logic framework for distraction test construction: appending a tautology to either the premise P or the hypothesis H preserves the label. Entailment: (P ⇒ H) ⇒ ((P ⋀ True) ⇒ H). Neutral: appending a tautology still keeps P and H neutral: (P ⇏ H) ⇒ ((P ⋀ True) ⇏ H).
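Spelled out in propositional terms (my phrasing of the slide's argument, with the contradiction case stated analogously for completeness): since P ∧ ⊤ ≡ P, conjoining a tautology to the premise, and symmetrically to the hypothesis, cannot change which of the three relations holds.

```latex
% Label preservation under conjunction with a tautology \top (e.g. "true is true"):
% P \land \top \equiv P, so any relation that holds of P also holds of P \land \top.
\begin{align*}
  \text{Entailment:}    &\quad (P \Rightarrow H)       \;\Rightarrow\; \bigl((P \land \top) \Rightarrow H\bigr) \\
  \text{Contradiction:} &\quad (P \Rightarrow \lnot H) \;\Rightarrow\; \bigl((P \land \top) \Rightarrow \lnot H\bigr) \\
  \text{Neutral:}       &\quad (P \not\Rightarrow H)   \;\Rightarrow\; \bigl((P \land \top) \not\Rightarrow H\bigr)
\end{align*}
```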

  27. Constructing Distraction Tests (Cont.)
      ● Word Overlap: tautology “true is true”, appended to the hypothesis, reducing word overlap.
      ● Negation: tautology “false is not true”, appended to the hypothesis, introducing strong negation.
      ● Length Mismatch: tautology “(true is true)*5”, appended to the premise, adding irrelevant information to the premise.

  28. Constructing Distraction Tests (Cont.) Final entailment pairs:
      ● Word Overlap: Premise: Possibly no other country has had such a turbulent history. Hypothesis: The country’s history has been turbulent and true is true. Label: Entailment.
      ● Negation: Premise: Possibly no other country has had such a turbulent history. Hypothesis: The country’s history has been turbulent and false is not true. Label: Entailment.
      ● Length Mismatch: Premise: Possibly no other country has had such a turbulent history and true is true and true is true and true is true and true is true and true is true. Hypothesis: The country’s history has been turbulent. Label: Entailment.
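A minimal sketch of the three distraction perturbations, assuming plain string concatenation; the exact punctuation handling in the released stress sets may differ. The label is passed through unchanged, per the logic on slide 26.

```python
TAUTOLOGY = "and true is true"
NEGATION_TAUTOLOGY = "and false is not true"

def _append(sentence, clause):
    """Join a clause onto a sentence, dropping a trailing period for readability."""
    return f"{sentence.rstrip('. ')} {clause}"

def word_overlap(premise, hypothesis, label):
    # Lower premise/hypothesis word overlap without changing the meaning.
    return premise, _append(hypothesis, TAUTOLOGY), label

def negation(premise, hypothesis, label):
    # Introduce a strong negation word ("not") that carries no semantic weight.
    return premise, _append(hypothesis, NEGATION_TAUTOLOGY), label

def length_mismatch(premise, hypothesis, label):
    # Pad the premise with irrelevant (but true) material, repeated five times.
    return _append(premise, " ".join([TAUTOLOGY] * 5)), hypothesis, label

# word_overlap("Possibly no other country has had such a turbulent history.",
#              "The country's history has been turbulent.", "entailment")
# -> (premise unchanged,
#     "The country's history has been turbulent and true is true", "entailment")
```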
