

  1. Language Models with Transformers. Chenguang Wang, Mu Li, Alexander J. Smola (Amazon Web Services)

  2. Background

  3. Language Model (LM) • Predict what word comes next [Figure: example sentence "Start to learn English"]

  4. Language Model (LM) • Predict what word comes next • Useful in many NLP applications [Figure: example sentence "Start to learn English"]

  5. Language Model (LM) • Predict what word comes next • Useful in many NLP applications • Many NLP problems share a similar definition • Word order matters! (example: "Start to learn English" vs. "Learn to start business")
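
(Not on the slides: a minimal toy sketch of what "predict what word comes next" means, namely a probability distribution over the next word given the preceding words. The corpus and counts below are made up purely for illustration.)

```python
# Toy next-word prediction: estimate P(next word | "Start to learn")
# from a tiny made-up corpus. Real language models learn this with
# neural networks instead of counting.
from collections import Counter

corpus = [
    "Start to learn English".split(),
    "Start to learn English".split(),
    "Start to learn French".split(),
]

context = ("Start", "to", "learn")
next_words = Counter(s[3] for s in corpus if tuple(s[:3]) == context)
total = sum(next_words.values())
for word, count in next_words.most_common():
    print(f"P({word!r} | 'Start to learn') = {count / total:.2f}")
# P('English' | 'Start to learn') = 0.67
# P('French' | 'Start to learn') = 0.33
```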

  6. Language Model with RNNs • RNN uses one-hot encoding [Figure: input "Start"]

  7. Language Model with RNNs • RNN models the word order in the hidden state [Figure: input "Start" → RNN hidden state → output "to"]

  8. Language Model with RNNs • RNN models the word order in the hidden state [Figure: inputs "Start", "to" → RNN → outputs "to", "learn"]

  9. Language Model with RNNs • RNN models the word order in the hidden state [Figure: inputs "Start", "to", "learn" → RNN → outputs "to", "learn", "English"]
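
(Not on the slides: a minimal PyTorch sketch of the RNN language model pictured on slides 6-9. Embedded tokens pass through a recurrent layer whose hidden state carries the word order, and a linear layer produces next-word logits at every position. Vocabulary size and dimensions are arbitrary placeholders.)

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Predict the next word from the RNN hidden state at each position."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # dense stand-in for one-hot input
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)      # logits over the next word

    def forward(self, token_ids, hidden=None):
        emb = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        output, hidden = self.rnn(emb, hidden)   # hidden state models word order
        logits = self.decoder(output)            # (batch, seq_len, vocab_size)
        return logits, hidden

# Example: ids standing in for "Start to learn"; targets would be shifted by one.
model = RNNLanguageModel()
tokens = torch.randint(0, 10000, (1, 3))
logits, _ = model(tokens)
print(logits.shape)                              # torch.Size([1, 3, 10000])
```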

  10. SOTA NLP with Transformers • Positional encoding • With less word order [Figure: Transformer block with positional encoding; other components omitted for simplicity] [Devlin, Jacob, et al., 2018]

  11. SOTA NLP with Transformers • Self-attention • Parallelizable • Efficient [Figure: Transformer block with self-attention and positional encoding; other components omitted for simplicity] [Devlin, Jacob, et al., 2018]

  12. SOTA NLP with Transformers • Transformer: with less word order, parallelizable, efficient • RNN: with word order, sequential, less efficient [Figure: Transformer (self-attention, positional encoding) vs. RNN; other components omitted for simplicity] [Devlin, Jacob, et al., 2018]
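
(Not on the slides: a small sketch of the Transformer side of this comparison, using PyTorch's built-in multi-head self-attention plus the standard sinusoidal positional encoding. Unlike the RNN above, attention looks at all positions in parallel, so word order only enters through the positional encoding, which is the "less word order" point on the slide.)

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(seq_len, dim):
    """Standard sinusoidal positional encoding (Vaswani et al., 2017)."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

seq_len, dim = 3, 64                        # e.g. the 3 tokens "Start to learn"
x = torch.randn(1, seq_len, dim)            # stand-in token embeddings
x = x + sinusoidal_positions(seq_len, dim)  # inject word order

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
out, weights = attn(x, x, x)                # all positions attend to each other in parallel
print(out.shape, weights.shape)             # (1, 3, 64) (1, 3, 3)
```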

  13. SOTA NLP with Transformers • BERT: a stack of 12 (or 24) Transformer blocks [Figure: Transformer 0, Transformer 1, ..., Transformer 11 stacked]

  14. SOTA NLP with Transformers • BERT: a stack of 12 (or 24) Transformer blocks • Trained on large language-model datasets • Full training cost in excess of $10,000 (16 TPUs, 4 days) • Achieved SOTA results on 11 NLP applications • Sentence-level tasks: care less about word order [Figure: stack of Transformer blocks]

  15. Approach: Make Best Use of BERT for Language Modeling

  16. LM: Adapted BERT • BERT with a linear layer [Figure: embedding → Transformer 0 ... Transformer 11 → Linear; legend: fixed weights vs. tunable weights]
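
(Not on the slides: a rough sketch of this adaptation, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which the slides specify. A linear decoder sits on top of the pre-trained encoder, and a flag decides whether the pre-trained weights stay fixed or are tuned.)

```python
import torch.nn as nn
from transformers import BertModel   # assumption: Hugging Face transformers

class AdaptedBertLM(nn.Module):
    """Pre-trained BERT encoder with a tunable linear output layer on top."""
    def __init__(self, checkpoint="bert-base-uncased", freeze_encoder=True):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        self.decoder = nn.Linear(self.bert.config.hidden_size,
                                 self.bert.config.vocab_size)  # tunable linear layer
        if freeze_encoder:
            for p in self.bert.parameters():   # keep the pre-trained weights fixed
                p.requires_grad = False

    def forward(self, input_ids, attention_mask=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.decoder(hidden)            # per-position logits over the vocabulary
```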

  17. LM 1: Adapted BERT with Fixed Weights • Only moderate results
      Model   Test PPL (lower is better)
      BERT    69.32
      RNN     42.25
      [Figure: embedding → Transformer 0 ... Transformer 11 (fixed weights) → Linear (tunable weights)]
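
(A side note on the "Test PPL" metric used here and on the following slides: perplexity is the exponential of the average per-token negative log-likelihood, so lower is better. The number below is made up only to show the scale.)

```python
import math

avg_nll = 4.24                         # made-up average cross-entropy per token, in nats
print(round(math.exp(avg_nll), 2))     # 69.41, roughly the scale of BERT's 69.32 above
```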

  18. LM 2: Adapted BERT with All Weights • Overfitting
      Model      Test PPL
      BERT       69.32
      BERT-All   67.43
      RNN        42.25
      [Figure: embedding → Transformer 0 ... Transformer 11 → Linear, all weights tunable]

  19. LM 3: Adapted BERT with Partial Weights • Fixing a subset of weights is promising • However, enumerating all subsets is not feasible
      Model         Test PPL
      BERT          69.32
      BERT-All      67.43
      BERT-Subset   40.56
      RNN           42.25
      [Figure: embedding → Transformer 0 ... Transformer 11 → Linear; a subset of weights fixed, the rest tunable]

  20. LM 4: Adapted BERT with RNN • Adding an RNN to capture word order is promising • However, enumerating is not feasible: where to add, and how many?
      Model      Test PPL
      BERT       69.32
      BERT-RNN   41.64
      RNN        42.25
      [Figure: embedding → Transformer 0 ... Transformer 11 → RNN → Linear; legend: fixed weights vs. tunable weights]
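
(Not on the slides: one way such a variant could look, extending the AdaptedBertLM sketch above. An LSTM is inserted between the pre-trained encoder and the linear decoder so that word order is modeled explicitly, and only a subset of the Transformer blocks is kept fixed. Which blocks to freeze and where to add the RNN are exactly the open questions on the next slides; the layer indices here are illustrative only, and the Hugging Face transformers library is again an assumption.)

```python
import torch.nn as nn
from transformers import BertModel   # assumption: Hugging Face transformers

class BertRnnLM(nn.Module):
    """BERT encoder with an LSTM and linear decoder on top; some blocks frozen."""
    def __init__(self, checkpoint="bert-base-uncased", frozen_layers=(0, 1, 2, 3)):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        hidden = self.bert.config.hidden_size
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)      # captures word order
        self.decoder = nn.Linear(hidden, self.bert.config.vocab_size)
        # Fix only a subset of pre-trained Transformer blocks (indices are illustrative).
        for i in frozen_layers:
            for p in self.bert.encoder.layer[i].parameters():
                p.requires_grad = False

    def forward(self, input_ids, attention_mask=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        out, _ = self.rnn(hidden)
        return self.decoder(out)
```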

  21. Where to add the RNN layers?

  22. Which layer’s pre-trained weights should be fixed? Where to add the RNN layers?

  23. Coordinate Architecture Search (CAS) • Step 1: Choose a layer’s weights to fix • Step 2: Choose a position to add an RNN layer • Step 3: Go to Step 1 or terminate • Greedy strategy: fine-tune the resulting BERT and keep the best [Figure: fix Transformer 0’s weights; legend: fixed weights vs. tunable weights]

  24. Coordinate Architecture Search (CAS) • Step 1: Choose a layer’s weights to fix • Step 2: Choose a position to add an RNN layer • Step 3: Go to Step 1 or terminate • Greedy strategy: fine-tune the resulting BERT and keep the best [Figure: add an RNN layer on top of Transformer 1]

  25. Coordinate Architecture Search (CAS) • Step 1: Choose a layer’s weights to fix • Step 2: Choose a position to add an RNN layer • Step 3: Go to Step 1 or terminate • Greedy strategy: fine-tune the resulting BERT and keep the best [Figure: add a linear layer on top (embedding → Transformer 0 → Transformer 1 → RNN → Linear)]

  26. Coordinate Architecture Search (CAS) • Step 1: Choose a layer’s weights to fix • Step 2: Choose a position to add an RNN layer • Step 3: Go to Step 1 or terminate • Greedy strategy: fine-tune the resulting BERT and keep the best [Figure: resulting architecture: embedding → Transformer 0 → Transformer 1 → RNN → Linear; legend: fixed weights vs. tunable weights]
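
(Not the authors' code: a high-level pseudocode sketch of the greedy loop described on slides 23-26. candidate_moves, fine_tune, and evaluate are hypothetical callables standing in for "fix one more layer's weights or add an RNN layer at some position", a brief fine-tuning run, and validation perplexity, respectively.)

```python
def coordinate_architecture_search(base_model, candidate_moves, fine_tune, evaluate,
                                   max_rounds=10):
    """Greedy CAS loop (sketch): try candidate modifications, fine-tune each,
    and keep the one with the best (lowest) validation perplexity."""
    best_model = fine_tune(base_model)
    best_ppl = evaluate(best_model)
    for _ in range(max_rounds):
        improved = False
        for candidate in candidate_moves(best_model):   # Step 1 or Step 2
            candidate = fine_tune(candidate)            # greedy strategy: fine-tune the result
            ppl = evaluate(candidate)
            if ppl < best_ppl:                          # keep the best
                best_model, best_ppl, improved = candidate, ppl, True
        if not improved:                                # Step 3: terminate
            break
    return best_model, best_ppl
```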

  27. Best LM: Adapted BERT with CAS [Figure: bar chart of test perplexity on PTB and WT-103 for AWD-LSTM-MoS-BERTVocab, BERT, BERT-CAS-Subset, BERT-CAS-LSTM, BERT-CAS, and BERT-Large-CAS]

  28. Best LM: Adapted BERT with CAS • BERT-Large+CAS is best [Figure: same test-perplexity bar chart on PTB and WT-103]

  29. Best LM: Adapted BERT with CAS • BERT-Large+CAS is best • Capture word order [Figure: same test-perplexity bar chart on PTB and WT-103]

  30. Best LM: Adapted BERT with CAS • BERT-Large+CAS is best • Capture word order • Achieve SOTA: 31.34 PPL with 0.5 GPU days [Figure: same test-perplexity bar chart on PTB and WT-103]

  31. Best LM: Adapted BERT with CAS • BERT-Large+CAS is best • Capture word order • Achieve SOTA: 31.34 PPL with 0.5 GPU days • Achieve 20.42 PPL with 1B tokens [Figure: same test-perplexity bar chart on PTB and WT-103]

  32. Take-aways • BERT needs to be adapted for language modeling • Adding RNN layers via neural architecture search works • Fixing pre-trained weights via neural architecture search works
