

  1. Adversarial NLI: A New Benchmark for Natural Language Understanding. Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, Douwe Kiela. UNC Chapel Hill & Facebook AI Research.

  2. Development of AI has been driven by benchmarks and datasets. Computer vision: ImageNet (Russakovsky et al., 2015). NLP: SQuAD (Rajpurkar et al., 2016) and GLUE (Wang et al., 2018).

  3. [Chart: ImageNet error rate by year. XRCE 26 (2011), AlexNet 16.4 (2012), ZF 11.7 (2013), VGG 7.3 (2014), GoogleNet 6.7 (2014), ResNet 3.6 (2015), GoogleNet-v4 3.1 (2016), SENet 2.3 (2017); human level 5.1, surpassed about 3 years after AlexNet.]

  4. [Chart: SQuAD exact match by year. Match-LSTM Ptr 64.74 (2016), BiDAF 67.97 (2016), BiDAF+SelfAtt 72.14 (2017), BiDAF+SelfAtt+ELMo 78.58 (2018), BERT 85.08 (2018), XLNet 89.9 (2019); human level 86.8, surpassed in about 2 years.]

  5. [Chart: GLUE score by year. BiLSTM+Attn+ELMo 70 (2018), BERT 80.5 (2018), RoBERTa 88.1 (2019), T5 90.3 (2019); human level 87.1, surpassed in about 1 year.]

  6.–9. Model vs. Human on Static Benchmarks. Human won: Word2Vec, GloVe. Human still won: ELMo, GPT-1, GPT-2. Superhuman performance achieved: BERT, RoBERTa, T5, GPT-3, ... Superhuman at NLU? Are current NLU models genuinely as good as their high performance on static benchmarks suggests?

  10.–11. Overestimated NLU Ability. State-of-the-art models learn to exploit spurious statistical patterns and are vulnerable to adversaries: adversarial examples exist for reading comprehension (Jia and Liang, 2017) and for natural language inference (Nie et al., 2018). Related evidence (a hypothesis-only probe is sketched below):
     • Annotation artifacts (Gururangan et al., 2018; Poliak et al., 2018)
     • Breaking NLI with lexical inference (Glockner et al., 2018)
     • Pathologies of neural models (Feng et al., 2018)
     • Modeling the task or the annotator? (Geva et al., 2019)
     • Right for the wrong reasons (McCoy et al., 2019)
     • ...
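
One recurring artifact from this literature (Gururangan et al., 2018; Poliak et al., 2018) is that NLI labels can often be predicted from the hypothesis alone. A minimal sketch of such a hypothesis-only probe with scikit-learn; the toy examples are illustrative, not drawn from any of the datasets above.

    # Hypothesis-only NLI probe: if this beats chance on held-out data,
    # labels are leaking through annotation artifacts in the hypotheses.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # (hypothesis, label) pairs only; the premise is deliberately ignored.
    train = [
        ("A man is sleeping.", "contradiction"),            # negation-style cue
        ("Nobody is outside.", "contradiction"),
        ("A person is outdoors.", "entailment"),            # generalization cue
        ("Someone is moving.", "entailment"),
        ("The man is waiting for his friend.", "neutral"),  # extra-detail cue
        ("The woman is late for work.", "neutral"),
    ]
    texts, labels = zip(*train)

    probe = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    probe.fit(texts, labels)
    print(probe.predict(["Nobody is sleeping.", "Someone is outdoors."]))

Chance for three-way NLI is 33%; on SNLI, hypothesis-only baselines of this kind (with stronger encoders) score far above that, which is exactly the spurious signal the works above document.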

  12. Performance is Overestimated. Model brittleness can be exposed by researchers or even non-experts, and general NLU is still far from achieved despite the high scores. How can we solve the benchmark fast-saturation and robustness issues?

  13.–19. HAMLET: Human-And-Model-in-the-Loop Enabled Training. In NLI terminology, the context is also the premise. (Slides 13–19 step through an animated diagram of the HAMLET loop: given a context and a target label, a human annotator writes a hypothesis intended to fool the current model; if the model mispredicts and human verifiers confirm the target label, the example is added to the new round's dataset. A control-flow sketch follows below.)
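
As a rough illustration, here is a minimal control-flow sketch of the loop in Python. The model, write_hypothesis, and verify callables are hypothetical stand-ins (a trained NLI classifier, a crowdworker interface, and majority-vote verification); only the control flow mirrors HAMLET as described in the talk.

    import random

    LABELS = ["entailment", "neutral", "contradiction"]

    def collect_round(contexts, model, write_hypothesis, verify, per_context=1):
        """One HAMLET collection round (sketch).

        model(context, hypothesis)           -> predicted label
        write_hypothesis(context, target)    -> hypothesis written by a human
        verify(context, hypothesis, target)  -> True if other humans confirm the label
        """
        new_data = []
        for context in contexts:
            for _ in range(per_context):
                target = random.choice(LABELS)          # annotator gets a target label
                hypothesis = write_hypothesis(context, target)
                predicted = model(context, hypothesis)  # model-in-the-loop prediction
                if predicted != target and verify(context, hypothesis, target):
                    # A verified model error: exactly the kind of example
                    # that makes the next round's training set harder.
                    new_data.append((context, hypothesis, target))
        return new_data

After each round the model is retrained on all data collected so far and the loop repeats, which is how A1, A2, and A3 were built against increasingly strong base models.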

  20. Related work: adversarial data collection & human-in-the-loop approaches.

  21.–23. Adversarial NLI (ANLI). Analogy: white-hat hackers find vulnerabilities in models, which we then patch for the next round. Three rounds of data collection:
     • Round 1 (A1). Model: BERT (trained on SNLI+MNLI). Domain: Wikipedia.
     • Round 2 (A2). Model: RoBERTa ensemble (trained on SNLI+MNLI+FEVER+A1). Domain: Wikipedia.
     • Round 3 (A3). Model: RoBERTa ensemble (trained on SNLI+MNLI+FEVER+A1+A2). Domains: Wikipedia, News, Fiction, Spoken, WikiHow, RTE5.

     Dataset           Genre     Contexts   Train / Dev / Test
     A1                Wiki      2,080      16,946 / 1,000 / 1,000
     A2                Wiki      2,694      45,460 / 1,000 / 1,000
     A3                Various   6,002      100,459 / 1,200 / 1,200
     A3 (Wiki subset)  Wiki      1,000      19,920 / 200 / 200
     ANLI (total)      Various   10,776     162,865 / 3,200 / 3,200

     For scale, SNLI has ~570K training examples and MNLI ~433K, versus ANLI's ~163K. The key properties: ANLI is adversarially collected and, as later slides show, more data-efficient in training. (A loading sketch follows below.)
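
To work with the released data, a minimal loading sketch, assuming the anli dataset as published on the Hugging Face Hub; the per-round split names (train_r1, dev_r2, test_r3, ...) and field names are that distribution's convention, not something defined in the talk.

    # pip install datasets
    from datasets import load_dataset

    anli = load_dataset("anli")  # one split per collection round

    for round_id in (1, 2, 3):
        train = anli[f"train_r{round_id}"]
        print(f"A{round_id}: {len(train)} training examples")

    # Each example carries a premise (the context), a hypothesis, and a label
    # (0 = entailment, 1 = neutral, 2 = contradiction in this distribution).
    ex = anli["train_r1"][0]
    print(ex["premise"], "|", ex["hypothesis"], "|", ex["label"])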

  24.–25. Collection Statistics. [Charts: model error rate and median time (sec.) per example during collection, per round. Error rate: 29.68% (A1), 16.59% (A2), 14.79% (A3, all genres; 17.47% on the Wiki subset). The error rate roughly halved over three rounds, yet stays well above zero, so room for improvement on NLI still exists. Median time per example: 125.2 s (A1), 157 s (A2), ~189 s (A3; 189.6 on the Wiki subset, 189.1 overall).]
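
Both statistics are easy to recompute from collection logs. A sketch under an assumed log format of (round, fooled_and_verified, seconds_spent) records; none of these field names or values come from the talk.

    from collections import defaultdict
    from statistics import median

    # Hypothetical collection log records: (round, fooled_and_verified, seconds).
    log = [
        ("A1", True, 98.0), ("A1", False, 140.5), ("A1", True, 133.2),
        ("A2", False, 171.0), ("A2", True, 157.3),
        ("A3", False, 201.9), ("A3", True, 189.1),
    ]

    by_round = defaultdict(list)
    for rnd, fooled, secs in log:
        by_round[rnd].append((fooled, secs))

    for rnd, rows in sorted(by_round.items()):
        error_rate = sum(f for f, _ in rows) / len(rows)  # model error rate
        med_time = median(s for _, s in rows)             # median sec. per example
        print(f"{rnd}: error rate {error_rate:.2%}, median time {med_time:.1f}s")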

  26. Findings. Base-model (the backend model used during collection) performance is low: by construction, every verified example in a round fooled that round's model.

  27.–33. [Chart: RoBERTa accuracy on the A1, A2, and A3 test sets as training data is cumulatively combined (S = SNLI, M = MNLI, F = FEVER): S+M, S+M+F, S+M+F+A1, S+M+F+A1+A2, S+M+F+A1+A2+A3, plotted against a chance baseline.] Two takeaways: rounds become increasingly difficult, and training on more rounds improves robustness. (A sketch of these training mixtures follows below.)
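
The series in this chart are nested training mixtures. A small sketch of how one might build them; the dataset variables are placeholders for loaded corpora (with FEVER converted to NLI format), and train_and_eval is a hypothetical helper.

    # Cumulative training mixtures, mirroring the chart's series.
    # Each argument stands for a list of (premise, hypothesis, label) examples.
    def cumulative_mixtures(snli, mnli, fever_nli, a1, a2, a3):
        stages = [
            ("S+M",            [snli, mnli]),
            ("S+M+F",          [snli, mnli, fever_nli]),
            ("S+M+F+A1",       [snli, mnli, fever_nli, a1]),
            ("S+M+F+A1+A2",    [snli, mnli, fever_nli, a1, a2]),
            ("S+M+F+A1+A2+A3", [snli, mnli, fever_nli, a1, a2, a3]),
        ]
        # Concatenate each stage's parts into a single training set.
        return {name: [ex for part in parts for ex in part] for name, parts in stages}

    # for name, train_set in cumulative_mixtures(...).items():
    #     train_and_eval(train_set, eval_sets=[a1_test, a2_test, a3_test])  # hypothetical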

  34.–35. [Chart: RoBERTa (all data) vs. XLNet (all data) vs. BERT (all data) on A1, A2, and A3, against the same chance baseline.] Different models have different weaknesses.
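
A minimal evaluation sketch against one round, using the anli Hub dataset from above and the public roberta-large-mnli checkpoint (a RoBERTa fine-tuned on MNLI only, so weaker than the paper's all-data models; the label-order remapping is that checkpoint's convention).

    import torch
    from datasets import load_dataset
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # roberta-large-mnli outputs [contradiction, neutral, entailment];
    # the anli dataset labels are [entailment, neutral, contradiction].
    MNLI_TO_ANLI = {0: 2, 1: 1, 2: 0}

    tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
    model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
    model.eval()

    test_r1 = load_dataset("anli", split="test_r1")
    correct = 0
    for ex in test_r1:
        inputs = tok(ex["premise"], ex["hypothesis"], truncation=True, return_tensors="pt")
        with torch.no_grad():
            pred = model(**inputs).logits.argmax(dim=-1).item()
        correct += MNLI_TO_ANLI[pred] == ex["label"]
    print(f"A1 accuracy: {correct / len(test_r1):.1%}")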

  36.–38. [Chart: RoBERTa performance with different training data, evaluated on A1, A2, A3, SNLI, MNLI-m, and MNLI-mm: SNLI+MNLI (~900K examples) vs. ANLI-only (162K examples, less than 1/5 the size).] A model trained only on SNLI and MNLI (statically collected) is not good at ANLI, but a model trained only on ANLI (adversarially collected) is reasonably good at SNLI and MNLI, despite using less than a fifth as much training data.
