  1. IN5550: Neural Methods in Natural Language Processing – Transformers. Jeremy Barnes, University of Oslo, March 31, 2020

  2. Attention - tl;dr
      Pay attention to a weighted combination of input states to generate the right output state

  3. Attention
      RNNs + attention work great, but are inefficient
      ◮ impossible to parallelize the computation
      ◮ leads to long training times and smaller models
      ◮ To enjoy the benefits of deep learning, our models need to be truly deep!

  4. Desiderata for a new kind of model
      1. Reduce the total computational complexity per layer
      2. Increase the amount of computation that can be parallelized
      3. Ensure that the model can efficiently learn long-range dependencies

  5. Self-attention
      John Lennon, 1967: love is all u need
      Vaswani et al., 2017: attention is all you need

  6. Self-attention Main principle: instead of a target paying attention to different parts of the source, make the source pay attention to itself. 6

  7. Self-attention
      [figure: self-attention weights between the keys (K) and values (V) of the sentence "I am not so happy ."]
      ◮ By making parts of a sentence pay attention to other parts of itself, we get better representations
      ◮ This can be an RNN replacement
      ◮ Where an RNN carries long-term information down a chain, self-attention acts more like a tree

  8. Transformer
      Remember, for attention: K = key vector, V = value vector

  9. Transformer
      For Transformers, we will add another:
      Q = query vector, K = key vector, V = value vector
      We will see what the difference is in a minute

  10. Transformer 10

  11. Transformer attention 11

  12. Scaled dot-product attention
      Remember, we have two ways of doing attention:
      1. the easy, fast way: dot-product attention
      2. parameterized: a 1-layer feed-forward network determines the attention weights
      But can’t we get the benefits of the 2nd without the extra parameters?
      hypothesis: with high-dimensional vectors, the dot products become large in magnitude, and the gradient becomes small (squashing with sigmoid/tanh)
      solution: scaled dot-product attention – make sure the dot product doesn’t get too big

  13. Scaled dot-product attention
      The important bit – the maths:
      $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$

  14. Transformer
      $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$
      What’s happening at a token level:
      ◮ Obtain three representations of the input, Q, K and V - query, key and value
      ◮ Obtain a set of relevance strengths: $QK^{T}$. For words i and j, $Q_i \cdot K_j$ represents the strength of the association - exactly like in seq2seq attention.
      ◮ Scale it (stabler gradients, boring maths) and softmax to get the αs.
      ◮ Unlike seq2seq, use separate ‘value’ vectors for the weighted sum.
      In a sense, this is exactly like seq2seq attention, except: a) non-recurrent representations, b) same source/target, c) different value vectors
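A minimal PyTorch sketch of the operation described above; the tensor shapes and the random toy input are illustrative, not from the slides:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.size(-1)
    # relevance strengths: entry (i, j) is Q_i . K_j
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # scale to keep gradients stable
    alphas = F.softmax(scores, dim=-1)              # one attention distribution per query
    return alphas @ V                                # weighted sum of value vectors

# toy self-attention: 5 tokens, dimensionality 8; here the same random matrix
# stands in for the learned Q, K and V projections of one input sentence
x = torch.randn(5, 8)
out = scaled_dot_product_attention(x, x, x)   # shape (5, 8)
```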

  15. Intuition behind Query, Key, and Value vectors 15

  16. Multi-head attention 16

  17. Adding heads
      Revolutionary idea: if representations learn so much from attention, why not learn many attentions?
      Multi-headed attention is many self-attentions run in parallel
      (Simplified) transformer: [figure]
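A sketch of multi-headed self-attention under the same conventions as the sketch above; the layer sizes (d_model = 512, 8 heads) follow the base Transformer, and the class and parameter names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    # h independent self-attentions over learned projections of the same input
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        # project, then split the last dimension into (n_heads, d_head)
        def split(t):
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / self.d_head ** 0.5
        alphas = F.softmax(scores, dim=-1)      # one attention distribution per head
        heads = alphas @ V                      # (batch, n_heads, seq_len, d_head)
        concat = heads.transpose(1, 2).reshape(B, T, -1)
        return self.W_o(concat)                 # mix the heads back together
```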

  18. Point-wise feed-forward layers 18

  19. Point-wise feed-forward layers
      ◮ Add a 2-layer feed-forward network after each attention layer
      ◮ Same across all positions, but different across layers
      ◮ Again a trade-off to increase model complexity while keeping computation costs down
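A sketch of one such point-wise feed-forward layer; the 512/2048 dimensions are the base Transformer's defaults and are not stated on the slide:

```python
import torch.nn as nn

class PointWiseFeedForward(nn.Module):
    # the same 2-layer network applied independently at every position;
    # each Transformer layer has its own copy with its own weights
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):      # x: (batch, seq_len, d_model)
        return self.net(x)     # nn.Linear acts on the last dim, i.e. per position
```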

  20. Position embeddings 20

  21. Position embeddings
      But wait, now we lost our sequence information :(
      ◮ Use an encoding that gives this information
      ◮ Mix of sine and cosine functions (see the sketch below)
      ◮ How would this work?
      ◮ Why do they need both?
      In the end, learning the positional embeddings is often better
      ◮ But it has a very large disadvantage:
      ◮ no way to represent sequences longer than those seen in training
      ◮ you have to chop your data off at an arbitrary length
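A sketch of the fixed sine/cosine encoding from Vaswani et al. (2017): even dimensions use a sine, odd dimensions a cosine, with wavelengths forming a geometric progression. The max_len and d_model values are illustrative:

```python
import math
import torch

def sinusoidal_position_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of fixed position encodings."""
    pos = torch.arange(max_len).unsqueeze(1).float()          # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))         # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)     # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)     # odd dimensions
    return pe

# added to the token embeddings before the first encoder layer
pe = sinusoidal_position_encoding(max_len=100, d_model=512)
```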

  22. Depth
      To get the benefits of deep learning, we need depth. Let’s make it deep:
      ◮ encoder: 6 layers
      ◮ decoder: 6 layers
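For reference, a sketch of a 6-layer encoder stack using torch.nn's built-in modules; the hyperparameters mirror the base model, and this is not the course's own implementation:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                   dim_feedforward=2048, dropout=0.1)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # 6 stacked encoder layers

x = torch.randn(10, 32, 512)   # (seq_len, batch, d_model), PyTorch's default layout
h = encoder(x)                 # contextualized representations, same shape
```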

  23. Transformer 23

  24. Transformer
      ◮ Can be complicated to train
      ◮ Has its own Adam setup (learning rate proportional to $\mathrm{step}^{-0.5}$; see the sketch below)
      ◮ dropout added just before the residual connection
      ◮ label smoothing
      ◮ during decoding, add length penalties
      ◮ checkpoint averaging
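A sketch of that learning-rate schedule (linear warmup, then decay proportional to step^{-0.5}); the warmup length of 4000 steps is the paper's default and is not stated on the slide:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    # linear warmup for the first `warmup` steps, then decay as step ** -0.5
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# typical use: feed this value to the optimizer each step,
# e.g. via torch.optim.lr_scheduler.LambdaLR with a base lr of 1.0
```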

  25. Transformer ◮ Have a look at The Annotated Transformer ◮ http://nlp.seas.harvard.edu/2018/04/03/attention.html 25

  26. Evolution of Transformer-based Models 26

  27. Evolution of Transformer-based Models 27

  28. Evolution of Transformer-based Models 28

  29. Carbon footprint
      *from Strubell et al. (2019), Energy and Policy Considerations for Deep Learning in NLP

  30. What can we do to avoid wasting resources?

  31. Sharing is caring
      To avoid retraining lots of models:
      ◮ We can share the trained models
      ◮ The Nordic Language Processing Laboratory (NLPL) is a good example
      ◮ But it’s important to get things right
      ◮ METADATA!!!
      ◮ same format for all models

  32. Reduce model size?
      What if we can reduce the size of these giant models?
      ◮ Often, overparameterized transformer models lead to better performance, even with less data
      ◮ Lottery-ticket hypothesis: any single randomly initialized subnetwork has only a small chance of starting with good weights for the task, but a large enough model will contain some subnetwork that does
      ◮ But interestingly, you can often remove a large number of the parameters for only a small decrease in performance (see the pruning sketch below)
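As an illustration of removing parameters with little performance loss, a sketch using PyTorch's built-in magnitude pruning; the 30% pruning amount and the toy layer are arbitrary examples:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

linear = nn.Linear(512, 512)
# zero out the 30% of weights with the smallest absolute value
prune.l1_unstructured(linear, name="weight", amount=0.3)
print((linear.weight == 0).float().mean())   # roughly 0.3 of the weights are now zero
```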

  33. Reduce model size? 33

  34. Model distillation
      [figure: a pretrained teacher model labels example inputs ("This is an example.") and a smaller student model is trained to reproduce the teacher's outputs]
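A sketch of a typical distillation objective, assuming the teacher and student produce logits over the same label set; the temperature T and mixing weight alpha are illustrative choices, not values from the slides:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft targets: match the teacher's temperature-softened output distribution
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # hard targets: ordinary cross-entropy against the gold labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```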

  35. Head pruning
      [figure: pruning attention heads from a pretrained model; performance stays close to the original 93.7, e.g. 93.0 after pruning]

  36. But how can we NLPers contribute to sustainability?
      ◮ When possible, use pre-trained models
      ◮ If you train a strong model, similarly make it available to the community
      ◮ Try to reduce the amount of hyperparameter tuning we do (for example, by working with models that are more robust to hyperparameters)
