

  1. Linguistic Knowledge and Transferability of Contextual Representations. Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, Noah A. Smith. NAACL 2019, June 3, 2019. UWNLP.

  2. [McCann et al., 2017; Peters et al., 2018a; Devlin et al., 2019, inter alia] Contextual Word Representations Are Extraordinarily Effective • Contextual word representations (from contextualizers like ELMo or BERT) work well on many NLP tasks. • But why do they work so well? A better understanding enables principled improvement. • This work studies several questions about their generalizability and transferability.

  3. (1) Probing Contextual Representations. Question: Is the information necessary for a variety of core NLP tasks linearly recoverable from contextual word representations? Answer: Yes, to a great extent! Tasks with lower performance may require fine-grained linguistic knowledge.

  4. (2) How Does Transferability Vary? Question: How does transferability vary across contextualizer layers? Answer: The first layer of LSTMs is the most transferable; for transformers, the middle layers are.

  5. (3) Why Does Transferability Vary? Question: Why does transferability vary across contextualizer layers? Answer: It depends on pretraining task-specificity!

  6. (4) Alternative Pretraining Objectives. Question: How does language model pretraining compare to alternatives? Answer: Even with 1 million tokens, language model pretraining yields the most transferable representations. But transferring between related tasks does help.

  7. [Shi et al., 2016; Adi et al., 2017] Probing Models


  12. [Belinkov, 2018; Blevins et al., 2018; Tenney et al., 2019] Pairwise Probing
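The pairwise probing idea referenced above can be sketched as follows: for a task like dependency arc prediction, the probe scores a pair of tokens from the concatenation of their frozen contextual representations. Everything below (dimensions, weights, data) is an illustrative stand-in, not the authors' implementation.

```python
import numpy as np

# Hypothetical frozen contextual representations for two tokens i and j.
rng = np.random.default_rng(0)
rep_dim = 16
h_i = rng.standard_normal(rep_dim)  # representation of token i
h_j = rng.standard_normal(rep_dim)  # representation of token j

# The pairwise probe sees the concatenation [h_i; h_j] ...
pair_features = np.concatenate([h_i, h_j])  # shape: (2 * rep_dim,)

# ... and applies a linear layer followed by a sigmoid to decide
# whether an arc exists between the two tokens.
w = rng.standard_normal(2 * rep_dim)  # probe weights (toy, untrained)
b = 0.0
arc_score = pair_features @ w + b
arc_probability = 1.0 / (1.0 + np.exp(-arc_score))  # sigmoid
predicts_arc = bool(arc_probability > 0.5)
```

For arc classification, the same concatenated features would instead feed a multi-class linear layer over dependency labels.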


  20. Probing Model Setup • Contextualizer weights are always frozen. • Results are from the highest-performing contextualizer layer. • We use a linear probing model.
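The linear probing setup above can be sketched as multinomial logistic regression trained on frozen token representations; only the probe's weights are updated. All dimensions and data below are toy stand-ins (not the paper's code or tasks), assuming one frozen contextualizer layer has already produced one vector per token.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, rep_dim, num_labels = 200, 32, 5

# Stand-ins for frozen contextual representations and gold tags.
reps = rng.standard_normal((num_tokens, rep_dim))
labels = rng.integers(0, num_labels, num_tokens)

# The probe is a single linear map: its weights are the ONLY trained
# parameters; the contextualizer producing `reps` stays frozen.
W = np.zeros((rep_dim, num_labels))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for _ in range(100):  # batch gradient descent on cross-entropy
    grad_logits = softmax(reps @ W)
    grad_logits[np.arange(num_tokens), labels] -= 1.0  # softmax - one-hot
    W -= 0.1 * reps.T @ grad_logits / num_tokens

accuracy = float((softmax(reps @ W).argmax(axis=1) == labels).mean())
```

Probing accuracy then measures how linearly recoverable the task labels are from that layer's representations.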

  21. Contextualizers Analyzed

  24. [Peters et al., 2018a,b; Radford et al., 2018; Devlin et al., 2019] Contextualizers Analyzed
  • ELMo: 2-layer LSTM (ELMo original), 4-layer LSTM (ELMo 4-layer), and 6-layer transformer (ELMo transformer); bidirectional language model (BiLM) pretraining on the 1B Word Benchmark.
  • OpenAI Transformer: 12-layer transformer; left-to-right language model pretraining on uncased BookCorpus.
  • BERT (cased): 12-layer transformer (BERT base) and 24-layer transformer (BERT large); masked language model pretraining on BookCorpus + Wikipedia.

  25. (1) Probing Contextual Representations. Question: Is the information necessary for a variety of core NLP tasks linearly recoverable from contextual word representations? Answer: Yes, to a great extent! Tasks with lower performance may require fine-grained linguistic knowledge.

  26. Examined 17 Diverse Probing Tasks
  • Part-of-speech tagging • CCG supertagging • Grammatical error detection • Conjunct identification • Syntactic constituency ancestor tagging
  • Syntactic chunking • Named entity recognition • Semantic tagging • Preposition supersense disambiguation • Event factuality
  • Syntactic dependency arc prediction • Syntactic dependency arc classification • Semantic dependency arc prediction • Semantic dependency arc classification • Coreference arc prediction

  27. Linear Probing Models Rival Task-Specific Architectures (across the task list above)

  28–31. CCG Supertagging
  [Bar chart of accuracy: GloVe: 71.58; ELMo (original): 93.31; ELMo (transformer): 92.68; OpenAI Transformer: 82.69; BERT (large): 94.28; SOTA: 94.7.]

  32–33. Event Factuality
  [Bar chart of Pearson correlation (r) × 100: GloVe: 49.70; the remaining bars (ELMo original, ELMo transformer, OpenAI Transformer, BERT large, SOTA) take the values 70.88, 73.20, 74.03, 76.25, and 77.10.]

  34. But Linear Probing Models Underperform on Some Tasks • Tasks on which a linear model over contextual word representations performs poorly may require more fine-grained linguistic knowledge. • In these cases, task-specific contextualization leads to especially large gains. See the paper for more details.

  35–38. Named Entity Recognition
  [Bar chart of F1: GloVe: 53.22; ELMo (original): 82.85; ELMo (transformer): 81.21; OpenAI Transformer: 58.14; BERT (large): 84.44; SOTA: 91.38.]

  39. (2) How Does Transferability Vary? Question: How does transferability vary across contextualizer layers? Answer: The first layer of LSTMs is the most transferable; for transformers, the middle layers are.

  40–44. Layerwise Patterns in Transferability
  [Heatmaps of per-layer probing performance across tasks. LSTM-based contextualizers: ELMo (original), ELMo (4-layer). Transformer-based contextualizers: OpenAI Transformer, ELMo (transformer), BERT (base, cased), BERT (large, cased).]

  45. (3) Why Does Transferability Vary? Question: Why does transferability vary across contextualizer layers? Answer: It depends on pretraining task-specificity!

  46. Layerwise Patterns Dictated by Perplexity: LSTM-based ELMo (original). Outputs of higher LSTM layers are better for language modeling (have lower perplexity).
  [Bar chart of perplexity by layer: layer 0: 7026; layer 1: 920; layer 2: 235.]

  47. Layerwise Patterns Dictated by Perplexity: LSTM-based ELMo (4-layer). Outputs of higher LSTM layers are better for language modeling (have lower perplexity).
  [Bar chart of perplexity by layer: layer 0: 4204; layer 1: 2398; layer 2: 2363; layer 3: 1013; layer 4: 195.]

  48. Layerwise Patterns Dictated by Perplexity: Transformer-based ELMo (6-layer).
  [Bar chart of perplexity by layer: layer 0: 546; layer 1: 523; layer 2: 448; layer 3: 374; layer 4: 314; layer 5: 295; layer 6: 91.]
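The perplexity numbers on these slides follow the standard definition: perplexity is the exponential of the average per-token negative log-likelihood under the language model. A minimal sketch (the probabilities are toy values, not the slides' data):

```python
import math

def perplexity(token_probs):
    """Perplexity from the probability the model assigned to each gold token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that spreads probability uniformly over 4 choices has
# perplexity 4; more confident correct predictions drive it lower.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])   # ≈ 4.0
confident_ppl = perplexity([0.9, 0.8, 0.95, 0.85])   # close to 1
```

Lower perplexity from a layer's outputs indicates that layer is more specialized toward the pretraining (language modeling) objective, which is the task-specificity argument of slide 45.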
