Language Models with Transformers
Chenguang Wang, Mu Li, Alexander J. Smola
Amazon Web Services
Background
Language Model (LM)
• Predict what word comes next
• Useful in many NLP applications
• Word order matters: compare "Start to learn English" with "Learn to start business"
• Many NLP problems share a similar definition
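A minimal illustration of "predict what word comes next" (not from the slides): a count-based bigram model that turns counts into a next-word distribution. The toy corpus is made up for the example; real language models are trained on far more text.

```python
from collections import Counter, defaultdict

# Toy corpus built from the two example sentences on the slide.
corpus = "start to learn english . learn to start business .".split()

# Count bigrams: how often each word follows each context word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    """P(next word | previous word) estimated from bigram counts."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("to"))   # e.g. {'learn': 0.5, 'start': 0.5}
```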
Language Model with RNNs
• RNN uses one-hot encoding of the input words
• RNN models the word order in its hidden state
[Figure: RNN unrolled over the input "Start to learn", passing the hidden state from step to step and outputting the next words "to learn English"]
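A minimal PyTorch sketch of the RNN language model in the diagram (an illustrative assumption, not the exact model from the slides): the hidden state carries the word order forward, and every position predicts the next word.

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Predict the next word at every position; the hidden state encodes word order."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        # The embedding layer plays the role of the one-hot input encoding.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_ids, hidden=None):
        x = self.embed(input_ids)            # (batch, seq_len, embed_dim)
        h, hidden = self.rnn(x, hidden)      # hidden state passed along the sequence
        return self.out(h), hidden           # (batch, seq_len, vocab_size) next-word logits

# Usage: feed "Start to learn" (as ids) and read off logits for "to", "learn", "English".
model = RNNLanguageModel(vocab_size=10000)
logits, _ = model(torch.randint(0, 10000, (1, 3)))
```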
SOTA NLP with Transformers
• Transformer: self-attention + positional encoding (other components omitted for simplicity)
  • With less word order
  • Parallelizable, efficient
• RNN, in contrast:
  • With word order
  • Sequential, less efficient
[Devlin, Jacob, et al., 2018]
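A sketch of the two components named on the slide, self-attention and sinusoidal positional encoding (simplified: single head, no masking, even `d_model` assumed, other Transformer components omitted as in the slide):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(seq_len, d_model):
    """Inject (weak) word-order information; attention itself is order-agnostic."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # added to the token embeddings: x = embed(tokens) + pe

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention; all positions are processed in parallel."""
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        return torch.softmax(scores, dim=-1) @ v
```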
SOTA NLP with Transformers
• BERT: a stack of 12 (or 24) Transformer blocks (Transformer 0 ... Transformer 11)
• Trained on large language-model datasets
• Full training cost in excess of $10,000 (16 TPUs, 4 days)
• Achieved SOTA results on 11 NLP applications
  • Sentence-level tasks: care less about word order
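The "stack of 12 Transformer blocks" structure can be approximated with PyTorch's built-in encoder layers. This is only a structural sketch under BERT-base-like sizes; the real BERT also has its own embeddings, pre-training objectives, and pre-trained weights.

```python
import torch.nn as nn

# 12 identical Transformer blocks, BERT-base sizes (use num_layers=24 for BERT-large).
block = nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                   dim_feedforward=3072, batch_first=True)
bert_like_stack = nn.TransformerEncoder(block, num_layers=12)
```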
Approach: Make Best Use of BERT for Language Modeling
LM: Adapted BERT
• BERT with a linear output layer on top
[Figure: linear layer stacked on the BERT encoder (embedding, Transformer 0 ... Transformer 11); legend marks which weights are fixed and which are tunable]
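A sketch of the adapted model on this slide, assuming a generic `bert_encoder` module that maps token ids of shape (batch, seq_len) to hidden states of shape (batch, seq_len, hidden_size); the encoder interface is an assumption for illustration, not a specific library API.

```python
import torch.nn as nn

class BertLM(nn.Module):
    """Language-model head on top of a pre-trained BERT encoder."""
    def __init__(self, bert_encoder, hidden_size, vocab_size, freeze_bert=True):
        super().__init__()
        self.bert = bert_encoder
        self.head = nn.Linear(hidden_size, vocab_size)   # tunable linear LM head
        if freeze_bert:
            for p in self.bert.parameters():             # fixed pre-trained weights
                p.requires_grad = False

    def forward(self, input_ids):
        hidden = self.bert(input_ids)                    # (B, T, H) contextual states
        return self.head(hidden)                         # (B, T, V) next-word logits
```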
LM 1: Adapted BERT with Fixed Weights
• Only moderate results

Model   Test PPL (lower is better)
BERT    69.32
RNN     42.25

[Figure: adapted BERT with all pre-trained weights fixed; only the linear layer is tunable]
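For reference, test perplexity is the exponentiated average negative log-likelihood the model assigns to the test tokens, so lower is better:

$$ \mathrm{PPL} = \exp\Big(-\tfrac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_1, \dots, w_{i-1})\Big) $$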
LM 2: Adapted BERT with All Weights
• Fine-tuning all weights overfits

Model      Test PPL
BERT       69.32
BERT-All   67.43
RNN        42.25

[Figure: adapted BERT with all weights tunable]
LM 3: Adapted BERT with Partial Weights
• Fixing a subset of weights is promising
• However, enumerating all subsets is not feasible

Model        Test PPL
BERT         69.32
BERT-All     67.43
BERT-Subset  40.56
RNN          42.25

[Figure: adapted BERT with some Transformer blocks fixed and the rest tunable]
LM 4: Adapted BERT with RNN
• Adding an RNN to capture word order is promising
• However, enumerating placements is not feasible: where to add, and how many?

Model     Test PPL
BERT      69.32
BERT-RNN  41.64
RNN       42.25

[Figure: RNN layer inserted between the BERT encoder and the linear output layer]
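A sketch combining the two ideas from LM 3 and LM 4: fix a subset of the pre-trained blocks and insert an RNN before the linear output layer. The `bert_encoder` interface and its `blocks` attribute are assumptions for illustration, not part of a specific library.

```python
import torch.nn as nn

class BertRNNLM(nn.Module):
    """Partially frozen BERT encoder + LSTM (for word order) + linear LM head."""
    def __init__(self, bert_encoder, hidden_size, vocab_size, layers_to_fix=(0, 1, 2)):
        super().__init__()
        self.bert = bert_encoder
        # Fix the pre-trained weights of a chosen subset of Transformer blocks.
        # `self.bert.blocks` is an assumed attribute exposing the block list.
        for i in layers_to_fix:
            for p in self.bert.blocks[i].parameters():
                p.requires_grad = False
        # RNN layer added on top to recapture word order.
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids):
        hidden = self.bert(input_ids)        # (B, T, H) contextual representations
        hidden, _ = self.rnn(hidden)         # model word order sequentially
        return self.head(hidden)             # next-word logits over the vocabulary
```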
Which layer's pre-trained weights should be fixed?
Where to add the RNN layers?
Coordinate Architecture Search (CAS)
• Step 1: Choose a layer's weights to fix
• Step 2: Choose a position to add an RNN layer
• Step 3: Go to Step 1, or terminate
• Greedy strategy: fine-tune the resulting BERT at each step and keep the best
[Figure: example search trajectory: fix Transformer 0's weights, add an RNN layer on top of Transformer 1, then add a linear output layer on top]
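A highly simplified sketch of the greedy loop described on this slide. The callables passed in (`candidate_ops`, `fine_tune_and_eval`) are placeholders for the actual operations, and the real search has more bookkeeping than shown here.

```python
import copy
import random

def coordinate_architecture_search(model, candidate_ops, fine_tune_and_eval, num_steps=8):
    """Greedy CAS sketch.

    candidate_ops:       list of functions that modify a model in place, e.g.
                         "fix one more layer's weights" or "add an RNN at position k".
    fine_tune_and_eval:  fine-tunes the modified BERT and returns validation perplexity.
    """
    best_model = model
    best_ppl = fine_tune_and_eval(best_model)

    for _ in range(num_steps):
        candidate = copy.deepcopy(best_model)
        random.choice(candidate_ops)(candidate)     # Step 1 or Step 2: sample a modification
        ppl = fine_tune_and_eval(candidate)         # fine-tune the resulting model
        if ppl < best_ppl:                          # greedy: keep the best so far
            best_model, best_ppl = candidate, ppl

    return best_model                               # Step 3: terminate after num_steps
```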
Best LM: Adapted BERT with CAS
• BERT-Large-CAS is best; CAS captures word order
• Achieves SOTA: 31.34 test PPL (PTB) with 0.5 GPU days
• Achieves 20.42 test PPL (WT-103) with 1B tokens
[Figure: bar chart of test perplexity on PTB and WT-103 for AWD-LSTM-MoS-BERTVocab, BERT, BERT-CAS-Subset, BERT-CAS-LSTM, BERT-CAS, and BERT-Large-CAS]
Take-aways
• BERT needs to be adapted for language modeling
• Adding RNN layers with neural architecture search works
• Fixing pre-trained weights with neural architecture search works