Linguistic Knowledge and Transferability of Contextual Representations
Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, Noah A. Smith
NAACL 2019, June 3, 2019
UWNLP
[McCann et al., 2017; Peters et al., 2018a; Devlin et al., 2019, inter alia]
Contextual Word Representations Are Extraordinarily Effective
• Contextual word representations (from contextualizers like ELMo or BERT) work well on many NLP tasks.
• But why do they work so well? Better understanding enables principled enhancement.
• This work studies several questions about their generalizability and transferability.
(1) Probing Contextual Representations
Question: Is the information necessary for a variety of core NLP tasks linearly recoverable from contextual word representations?
Answer: Yes, to a great extent! Tasks with lower performance may require fine-grained linguistic knowledge.
(2) How Does Transferability Vary?
Question: How does transferability vary across contextualizer layers?
Answer: The first layer is the most transferable in LSTMs; the middle layers are the most transferable in transformers.
(3) Why Does Transferability Vary?
Question: Why does transferability vary across contextualizer layers?
Answer: It depends on pretraining task-specificity!
(4) Alternative Pretraining Objectives
Question: How does language model pretraining compare to alternatives?
Answer: Even with only 1 million tokens, language model pretraining yields the most transferable representations. But transferring between related tasks does help.
[Shi et al., 2016; Adi et al., 2017]
Probing Models
[Figure: probing-model diagram, built up over several slides: a lightweight classifier is trained on frozen contextual word representations to predict a linguistic property of each token.]
[Belinkov, 2018; Blevins et al., 2018; Tenney et al., 2019]
Pairwise Probing
[Figure: pairwise-probing diagram, built up over several slides: the probe takes the representations of two words and predicts the relation between them, e.g., a dependency arc; a code sketch follows below.]
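To make the pairwise setup concrete, here is a minimal sketch of a pairwise probe in PyTorch, assuming the word pair is encoded by concatenating the two frozen word vectors (as the paper describes for arc tasks). The class and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class PairwiseProbe(nn.Module):
    """Linear probe over a pair of frozen contextual word representations.

    For arc prediction/classification, the probe sees the two words'
    vectors concatenated and predicts the label of the (word_i, word_j) pair.
    """
    def __init__(self, rep_dim: int, num_labels: int):
        super().__init__()
        # A single linear layer: probing measures what is *linearly*
        # recoverable, so there are no hidden layers or nonlinearities.
        self.classifier = nn.Linear(2 * rep_dim, num_labels)

    def forward(self, rep_i: torch.Tensor, rep_j: torch.Tensor) -> torch.Tensor:
        # rep_i, rep_j: (batch, rep_dim) frozen contextualizer outputs.
        pair = torch.cat([rep_i, rep_j], dim=-1)
        return self.classifier(pair)  # (batch, num_labels) logits
```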
Probing Model Setup
• Contextualizer weights are always frozen.
• Results are from the highest-performing contextualizer layer.
• We use a linear probing model (a minimal sketch follows below).
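A minimal sketch of this setup (ours, not the authors' released code): representations are computed once with the contextualizer frozen, and only a linear classifier is trained per task. reps and labels are assumed to be precomputed tensors.

```python
import torch
import torch.nn as nn

def train_linear_probe(reps, labels, num_labels, epochs=10, lr=1e-3):
    """Train a linear probe on frozen per-token representations.

    reps:   (num_tokens, rep_dim) tensor from one contextualizer layer,
            computed once under torch.no_grad() so no gradients ever
            reach the contextualizer.
    labels: (num_tokens,) gold tags for the probing task.
    """
    probe = nn.Linear(reps.size(-1), num_labels)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(reps), labels)
        loss.backward()  # updates the probe only
        optimizer.step()
    return probe

def probe_accuracy(probe, reps, labels):
    """Evaluate a trained probe on held-out frozen representations."""
    with torch.no_grad():
        predictions = probe(reps).argmax(dim=-1)
    return (predictions == labels).float().mean().item()
```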
[Peters et al., 2018a,b; Radford et al., 2018; Devlin et al., 2019]
Contextualizers Analyzed
• ELMo: 2-layer LSTM (ELMo original), 4-layer LSTM (ELMo 4-layer), and 6-layer transformer (ELMo transformer); bidirectional language model (BiLM) pretraining on the 1B Word Benchmark.
• OpenAI Transformer: 12-layer transformer; left-to-right language model pretraining on uncased BookCorpus.
• BERT (cased): 12-layer transformer (BERT base) and 24-layer transformer (BERT large); masked language model pretraining on BookCorpus + Wikipedia.
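For reference, one convenient modern way to obtain frozen per-layer representations for the BERT models above is the Hugging Face transformers library; the paper predates it and used other tooling, and ELMo and the OpenAI Transformer require their own loaders.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load one of the probed contextualizers (BERT base, cased) and expose
# every layer's output so each layer can be probed separately.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()  # contextualizer weights stay frozen throughout

inputs = tokenizer("The probe sees frozen representations.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of (embedding layer + 12 transformer layers),
# each of shape (batch, seq_len, 768).
hidden_states = outputs.hidden_states
print(len(hidden_states), hidden_states[0].shape)
```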
(1) Probing Contextual Representations
Question: Is the information necessary for a variety of core NLP tasks linearly recoverable from contextual word representations?
Answer: Yes, to a great extent! Tasks with lower performance may require fine-grained linguistic knowledge.
Examined 17 Diverse Probing Tasks
• Part-of-speech tagging
• CCG supertagging
• Semantic tagging
• Preposition supersense disambiguation
• Event factuality
• Syntactic constituency ancestor tagging
• Syntactic chunking
• Named entity recognition
• Grammatical error detection
• Conjunct identification
• Coreference arc prediction
• Syntactic dependency arc prediction
• Syntactic dependency arc classification
• Semantic dependency arc prediction
• Semantic dependency arc classification
Linear Probing Models Rival Task-Specific Architectures
(evaluated on the tasks listed above)
CCG Supertagging
[Chart: linear-probe accuracy. GloVe 71.58; ELMo (original) 93.31; ELMo (transformer) 92.68; OpenAI Transformer 82.69; BERT (large) 94.28; task-specific SOTA 94.7.]
Event Factuality
[Chart: Pearson correlation (r) x 100. GloVe 49.70; ELMo (original) 73.20; ELMo (transformer) 74.03; OpenAI Transformer 70.88; BERT (large) 77.10; SOTA 76.25.]
But Linear Probing Models Underperform on Some Tasks
• Tasks on which a linear model over contextual word representations performs poorly may require more fine-grained linguistic knowledge.
• In these cases, task-specific contextualization leads to especially large gains. See the paper for more details.
Named Entity Recognition
[Chart: F1. GloVe 53.22; ELMo (original) 82.85; ELMo (transformer) 81.21; OpenAI Transformer 58.14; BERT (large) 84.44; SOTA 91.38.]
(2) How Does Transferability Vary?
Question: How does transferability vary across contextualizer layers?
Answer: The first layer is the most transferable in LSTMs; the middle layers are the most transferable in transformers.
Layerwise Patterns in Transferability
[Figure: heatmaps of layerwise probing performance across tasks. LSTM-based contextualizers: ELMo (original) and ELMo (4-layer). Transformer-based contextualizers: OpenAI Transformer, ELMo (transformer), BERT (base, cased), and BERT (large, cased).]
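Concretely, the layerwise comparison amounts to repeating the probing experiment once per layer and keeping score. A short sketch reusing the train_linear_probe and probe_accuracy helpers defined above; layer_representations, dev_reps, and the label tensors are hypothetical precomputed inputs, not names from the authors' code.

```python
# Probe each contextualizer layer separately and compare transferability.
# layer_representations: dict[int, torch.Tensor] of frozen per-token reps,
# one (num_tokens, rep_dim) entry per layer (hypothetical helper output).
layer_scores = {}
for layer, reps in layer_representations.items():
    probe = train_linear_probe(reps, train_labels, num_labels)
    layer_scores[layer] = probe_accuracy(probe, dev_reps[layer], dev_labels)

# The reported probing result for a contextualizer uses its best layer.
best_layer = max(layer_scores, key=layer_scores.get)
print(f"Most transferable layer for this task: {best_layer}")
```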
(3) Why Does Transferability Vary?
Question: Why does transferability vary across contextualizer layers?
Answer: It depends on pretraining task-specificity!
Layerwise Patterns Dictated by Perplexity
LSTM-based: ELMo (original). Outputs of higher LSTM layers are better for language modeling (they have lower perplexity).
[Chart: perplexity by layer. Layer 0: 7026; Layer 1: 920; Layer 2: 235.]
LSTM-based: ELMo (4-layer).
[Chart: perplexity by layer. Layer 0: 4204; Layer 1: 2398; Layer 2: 2363; Layer 3: 1013; Layer 4: 195.]
Transformer-based: ELMo (transformer, 6-layer).
[Chart: perplexity by layer. Layer 0: 546; Layer 1: 523; Layer 2: 448; Layer 3: 374; Layer 4: 314; Layer 5: 295; Layer 6: 91.]
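For reference, perplexity is the exponential of the average per-token negative log-likelihood. A minimal sketch of that computation, assuming next-token logits have already been produced from a given layer's outputs (exactly how logits are derived from intermediate layers is a detail of the experimental setup not shown here):

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood).

    logits:  (num_tokens, vocab_size) next-token predictions computed
             from one contextualizer layer's outputs.
    targets: (num_tokens,) gold next-token ids.
    """
    nll = F.cross_entropy(logits, targets, reduction="mean")
    return torch.exp(nll).item()
```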