  1. Transformer Models CSE545 - Spring 2019

  2. Review: Feed-Forward Network (fully connected) (skymind, AI Wiki)

  3. Review: Convolutional NN (Barter, 2018)

  4. Review: Recurrent Neural Network. Output: y^(t) = f(h^(t) W), where f is an activation function. "Hidden layer": h^(t) = g(h^(t-1) U + x^(t) V). (Jurafsky, 2019)
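
The recurrence on this slide can be made concrete with a short sketch. A minimal NumPy version, assuming g = tanh and f = softmax (the slide leaves both activation functions unspecified; all names and dimensions are illustrative):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def rnn_forward(xs, U, V, W, h0):
        """Run the slide's recurrence h^(t) = g(h^(t-1) U + x^(t) V), y^(t) = f(h^(t) W).
        Assumes g = tanh and f = softmax; xs is a (T, d_in) array of inputs."""
        h, ys = h0, []
        for x in xs:                      # sequential: step t depends on step t-1
            h = np.tanh(h @ U + x @ V)    # hidden-state update
            ys.append(softmax(h @ W))     # output at this step
        return np.array(ys), h

    # toy dimensions: input 4, hidden 8, output 5
    rng = np.random.default_rng(0)
    U, V, W = rng.normal(size=(8, 8)), rng.normal(size=(4, 8)), rng.normal(size=(8, 5))
    ys, h_last = rnn_forward(rng.normal(size=(6, 4)), U, V, W, np.zeros(8))

Note that the loop over time steps is inherently sequential, which is exactly the parallelization question raised on the next few slides.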

  5. FFN / CNN / RNN: Can model computation (e.g. matrix operations for a single input) be parallelized?

  6. FFN / CNN / RNN: Can model computation (e.g. matrix operations for a single input) be parallelized?

  7. FFN / CNN / RNN: Can model computation (e.g. matrix operations for a single input) be parallelized?

  8. FFN / CNN / RNN: Can model computation (e.g. matrix operations for a single input) be parallelized? For the RNN it cannot, which ultimately limits how complex the model can be (i.e. its total number of parameters/weights) as compared to a CNN.

  9. The Transformer: "Attention-only" models. Can handle sequences and long-distance dependencies, but: ● don't want the complexity of LSTM/GRU cells ● constant number of edges between input steps ● enables "interactions" (i.e. adaptations) between words ● easy to parallelize -- no sequential processing needed.

  10. The Transformer: "Attention-only" models. Challenge: ● long-distance dependency when translating, e.g. "Kayla kicked the ball." -> "The ball was kicked by Kayla." [figure: encoder-decoder emitting y(0), y(1), y(2), ... after a <go> token]

  11. The Transformer: "Attention-only" models. Challenge: ● long-distance dependency when translating, e.g. "Kayla kicked the ball." -> "The ball was kicked by Kayla." [figure: encoder-decoder emitting y(0), y(1), y(2), ... after a <go> token]

  12. Attention [figure: a context vector c formed from values z_1..z_4 weighted by attention weights α]

  13. Attention [figure adds: a query h_i and a score function ω (with weights W) that produces the attention weights α over the values z_1..z_4]

  14. Attention [figure adds: keys s_1..s_4, which the score function ω compares against the query h_i; the resulting weights α are applied to the values z_1..z_4 to form c]

  15. The Transformer: "Attention-only" models. Challenge: ● long-distance dependency when translating. Attention came about for encoder-decoder models; then self-attention was introduced:

  16. Attention [figure: the full picture -- query h_i, keys s_1..s_4, values z_1..z_4, score function ω with weights W, attention weights α, context c]

  17. Self-Attention [figure: same structure, but the query now comes from the sequence itself (position i); keys s_1..s_i..s_4 and values z_1..z_i..z_4 are derived from the same inputs, and the score function ω with weights W produces the weights α for the context c_i]

  18. The Transformer: “Attention-only” models Attention as weighting a value based on a query and key: (Eisenstein, 2018)
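
A minimal NumPy sketch of this weighting, assuming the score function ω is a dot product and the weights α come from a softmax (both are the choices the deck arrives at on slides 26-27; the function and variable names are illustrative):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attention(query, keys, values):
        """Weight each value by how well its key matches the query.
        query: (d,), keys: (n, d), values: (n, d_v) -> context vector (d_v,)."""
        scores = keys @ query          # score function: dot product of each key with the query
        alpha = softmax(scores)        # attention weights alpha over the n positions
        return alpha @ values          # context c: weighted sum of the values

    rng = np.random.default_rng(0)
    c = attention(rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=(6, 8)))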

  19. The Transformer: "Attention-only" models [figure: an output computed from hidden states h_{i-1}, h_i, h_{i+1} via the score function ω and attention weights α] (Eisenstein, 2018)

  20. The Transformer: "Attention-only" models [figure: self-attention -- each hidden state h_{i-1}, h_i, h_{i+1} attends over the others via ω and α to produce the output] (Eisenstein, 2018)

  21. The Transformer: "Attention-only" models [figure: hidden states h_{i-1}..h_{i+2}, score function ω, attention weights α, and the attention output]

  22. The Transformer: "Attention-only" models [figure adds: input words w_{i-1}..w_{i+2} feeding the hidden states, and an FFN applied to the attention output]

  23. The Transformer: "Attention-only" models [figure adds: outputs y_{i-1}..y_{i+2}, one per position]

  24. The Transformer: "Attention-only" models [figure: the same stack extended along the sequence (..., w_i, ..., y_i, ...)]

  25. The Transformer: "Attention-only" models. Attend to all hidden states in your "neighborhood." [figure: words w_{i-1}..w_{i+2}, hidden states h_{i-1}..h_{i+2}, attention weights α, outputs y_{i-1}..y_{i+2}]

  26. The Transformer: "Attention-only" models. The score function ω is a dot product: each key k is scored against the query q as k^T q, and the weighted values are summed. [figure: dot products over h_{i-1}..h_{i+2}, weights α, outputs y_{i-1}..y_{i+2}]

  27. The Transformer: "Attention-only" models. The dot products are scaled and passed through a softmax: α = σ(k^T q / scaling parameter). [figure: same stack with the softmax σ over the scaled scores]

  28. The Transformer: "Attention-only" models. Linear layer: W^T X, with one set of weights for each of K, Q, and V. [figure: linear projections producing keys, queries, and values from the hidden states]
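
Slides 26-28 spell out the pieces: dot-product scores, a scaling parameter, a softmax, and one linear layer each for queries, keys, and values. A minimal sketch putting them together for a whole sequence, assuming the 1/√d scaling used in the original Transformer paper (shapes and names are illustrative):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(H, Wq, Wk, Wv):
        """H: (n, d) hidden states; one weight matrix each for Q, K, V (slide 28)."""
        Q, K, V = H @ Wq, H @ Wk, H @ Wv           # linear layers: queries, keys, values
        scores = Q @ K.T / np.sqrt(K.shape[-1])    # scaled dot products k^T q (slides 26-27)
        alpha = softmax(scores, axis=-1)           # each position attends to every position
        return alpha @ V                           # new hidden states: weighted sums of values

    rng = np.random.default_rng(0)
    H = rng.normal(size=(5, 16))
    Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
    out = self_attention(H, Wq, Wk, Wv)            # (5, 16)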

  29. The Transformer Limitation (thus far): Can’t capture multiple types of dependencies between words.

  30. The Transformer Solution: Multi-head attention

  31. Multi-head Attention
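
One way to read "multi-head": run several attention heads in parallel, each with its own Q/K/V projections so it can capture a different type of dependency between words, then concatenate the heads and project back to the model dimension. A minimal sketch under that reading (head count, sizes, and names are illustrative):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

    def multi_head_attention(H, heads, Wo):
        """heads: list of (Wq, Wk, Wv) triples, one per head; Wo: output projection.
        Each head can learn to attend to a different type of dependency."""
        outs = [attention(H @ Wq, H @ Wk, H @ Wv) for Wq, Wk, Wv in heads]
        return np.concatenate(outs, axis=-1) @ Wo   # concatenate heads, project back to d

    rng = np.random.default_rng(0)
    d, d_head, n_heads = 16, 4, 4
    H = rng.normal(size=(5, d))
    heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3)) for _ in range(n_heads)]
    Wo = rng.normal(size=(n_heads * d_head, d))
    out = multi_head_attention(H, heads, Wo)        # (5, 16)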

  32. Transformer for Encoder-Decoder

  33. Transformer for Encoder-Decoder sequence index (t)
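
This slide annotates the sequence index (t), and the summary slide (43) lists "added positional encoding": attention by itself is order-agnostic, so position information has to be injected into the inputs. A sketch of the sinusoidal encoding used in the original Transformer paper, added to the word embeddings (dimensions are illustrative):

    import numpy as np

    def sinusoidal_positional_encoding(n_positions, d_model):
        """PE[t, 2i] = sin(t / 10000^(2i/d)), PE[t, 2i+1] = cos(t / 10000^(2i/d))."""
        t = np.arange(n_positions)[:, None]          # sequence index t
        i = np.arange(0, d_model, 2)[None, :]
        angles = t / np.power(10000.0, i / d_model)
        pe = np.zeros((n_positions, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    # word embeddings (stand-in values) plus position information
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 16))                    # 10 tokens, d_model = 16
    X_in = X + sinusoidal_positional_encoding(10, 16)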

  34. Transformer for Encoder-Decoder

  35. Transformer for Encoder-Decoder Residualized Connections

  36. Transformer for Encoder-Decoder: Residualized Connections. Residuals enable positional information to be passed along.
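
A sketch of the residual ("add & norm") pattern around a sub-layer, which is what lets the positional information added at the input pass through each block untouched. The layer-norm placement follows the original encoder-decoder Transformer; the sub-layer here is just a stand-in:

    import numpy as np

    def layer_norm(x, eps=1e-6):
        mu = x.mean(axis=-1, keepdims=True)
        sigma = x.std(axis=-1, keepdims=True)
        return (x - mu) / (sigma + eps)

    def add_and_norm(x, sublayer):
        """Residual connection: the sub-layer output is added to its own input."""
        return layer_norm(x + sublayer(x))

    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 16))
    x = rng.normal(size=(5, 16))
    out = add_and_norm(x, lambda h: np.maximum(0.0, h @ W))   # stand-in feed-forward sub-layer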

  37. Transformer for Encoder-Decoder

  38. Transformer for Encoder-Decoder essentially, a language model

  39. Transformer for Encoder-Decoder: the decoder is essentially a language model; it blocks out (masks) future inputs.
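
A sketch of that masking: before the softmax, every position to the right of the current one gets a score of -inf, so each decoder position can attend only to itself and earlier positions (dimensions are illustrative):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def masked_self_attention(Q, K, V):
        """Causal mask: position i may not attend to positions j > i (future inputs)."""
        n = Q.shape[0]
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        scores = np.where(np.triu(np.ones((n, n), dtype=bool), k=1), -np.inf, scores)
        return softmax(scores, axis=-1) @ V

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
    out = masked_self_attention(Q, K, V)   # row i only mixes values 0..i

This is what lets the whole target sequence be processed in parallel during training while still behaving like a left-to-right language model.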

  40. Transformer for Encoder-Decoder: essentially a language model, with added conditioning on the encoder output.

  41. Transformer for Encoder-Decoder

  42. Transformer (as of 2017) “WMT-2014” Data Set. BLEU scores:

  43. Transformer ● Utilizes self-attention ● Simple attention scoring function (scaled dot product) ● Added linear layers for Q, K, and V ● Multi-head attention ● Added positional encoding ● Added residual connections ● Simulates decoding by masking https://4.bp.blogspot.com/-OlrV-PAtEkQ/W3RkOJCBkaI/AAAAAAAADOg/gNZXo_eK3tMNOmIfsuvPzrRfNb3qFQwJwCLcBGAs/s640/image1.gif

  44. Transformer. Why? ● Don't need the complexity of LSTM/GRU cells ● Constant number of edges between words (or input steps) ● Enables "interactions" (i.e. adaptations) between words ● Easy to parallelize -- no sequential processing needed. Drawbacks: ● Only unidirectional by default ● Only a "single-hop" relationship per layer (multiple layers needed to capture multiple)

  45. BERT: Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder). Why (recap)? ● Don't need the complexity of LSTM/GRU cells ● Constant number of edges between words (or input steps) ● Enables "interactions" (i.e. adaptations) between words ● Easy to parallelize -- no sequential processing needed. Drawbacks of vanilla Transformers: ● Only unidirectional by default ● Only a "single-hop" relationship per layer (multiple layers to capture multiple)

  46. BERT: Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder). ● Bidirectional context by "masking" in the middle ● A lot of layers, hidden states, attention heads. Drawbacks of vanilla Transformers (addressed here): ● Only unidirectional by default ● Only a "single-hop" relationship per layer (multiple layers to capture multiple)

  47. BERT (Devlin et al., 2019). Tokenizes into "word pieces." Differences from previous state of the art: ● Bidirectional transformer (through masking) ● Both directions jointly trained at once ● Captures sentence-level relations
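
The "masking" that makes BERT bidirectional can be sketched as a data-preparation step: hide a random subset of word pieces (15% in the paper) and train the model to predict the originals from both left and right context. A simplified NumPy sketch (the paper's full recipe also sometimes keeps the token or swaps in a random one; the token ids and mask id below are illustrative):

    import numpy as np

    def mask_for_mlm(token_ids, mask_id, mask_prob=0.15, rng=None):
        """Return (masked input, labels): labels hold the original ids at masked
        positions and -1 elsewhere (ignored by the loss)."""
        rng = rng or np.random.default_rng()
        token_ids = np.asarray(token_ids)
        mask = rng.random(token_ids.shape) < mask_prob
        labels = np.where(mask, token_ids, -1)           # predict only the hidden positions
        inputs = np.where(mask, mask_id, token_ids)      # replace them with a [MASK] id
        return inputs, labels

    inputs, labels = mask_for_mlm([2054, 318, 262, 3139, 286, 4881], mask_id=103,
                                  rng=np.random.default_rng(0))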

  48. BERT: Attention by Layers https://colab.research.google.com/drive/1vlOJ1lhdujVjfH857hvYKIdKPTD9Kid8 (Vig, 2019)

  49. BERT Performance: e.g. Question Answering https://rajpurkar.github.io/SQuAD-explorer/

  50. BERT: Pre-training; Fine-tuning 12 or 24 layers

  51. BERT: Pre-training; Fine-tuning 12 or 24 layers

  52. BERT: Pre-training; Fine-tuning. 12 or 24 layers; add a novel classifier on top (e.g. a sentiment classifier, stance detector, etc.)
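
Fine-tuning as described here amounts to putting a small task-specific head on top of the pre-trained encoder's output (commonly the hidden state at the first, [CLS]-like position) and training the stack on labeled data. A minimal sketch of just the new head; `encode` below is a hypothetical stand-in for the 12- or 24-layer pre-trained encoder, not a real API:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def encode(token_ids, d_model=768):
        """Hypothetical stand-in for the pre-trained BERT encoder: returns one
        d_model-dimensional contextualized vector per token (first position = [CLS])."""
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(token_ids), d_model))

    def classify(token_ids, W_head, b_head):
        cls_vec = encode(token_ids)[0]              # [CLS] hidden state summarizes the sequence
        return softmax(cls_vec @ W_head + b_head)   # e.g. class probabilities for sentiment

    rng = np.random.default_rng(1)
    probs = classify([101, 2023, 3185, 2001, 2307, 102],
                     rng.normal(size=(768, 2)) * 0.01, np.zeros(2))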

  53. The Transformer: "Attention-only" models. Can handle sequences and long-distance dependencies, but: ● don't want the complexity of LSTM/GRU cells ● constant number of edges between input steps ● enables "interactions" (i.e. adaptations) between words ● easy to parallelize -- no sequential processing needed.
