
Transformer Sequence Models and Sequence Applications (Machine Translation, Speech Recognition) - PowerPoint PPT Presentation



  1. Transformer Sequence Models and Sequence Applications (Machine Translation, Speech Recognition) CSE392 - Spring 2019 Special Topic in CS

  2. Most NLP Tasks, e.g.: ● Transformer Networks ○ Transformers ○ BERT ● Sequence Tasks ○ Language Modeling ○ Machine Translation ○ Speech Recognition

  3. Multi-level bidirectional RNN (LSTM or GRU) (Eisenstein, 2018)

  4. Multi-level bidirectional RNN (LSTM or GRU). Each node has a forward (->) and backward (<-) hidden state; the node can be represented as a concatenation of both. (Eisenstein, 2018)

  5. Multi-level bidirectional RNN (LSTM or GRU). The average of the top layer is an embedding (an average of the concatenated vectors). (Eisenstein, 2018)

  6. Multi-level bidirectional RNN (LSTM or GRU). Sometimes just the left-most and right-most hidden states are used instead. (Eisenstein, 2018)
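
A minimal PyTorch sketch of these pooling choices (averaging the concatenated top-layer states, or keeping only the two "end" states); the sizes and variable names are illustrative, not from the slides:

```python
import torch
import torch.nn as nn

# 2-layer bidirectional LSTM; each top-layer state is [forward ++ backward]
lstm = nn.LSTM(input_size=50, hidden_size=64, num_layers=2,
               bidirectional=True, batch_first=True)

x = torch.randn(1, 10, 50)            # (batch, seq_len, embedding_dim)
top, _ = lstm(x)                      # top-layer states: (1, 10, 128)

avg_embedding = top.mean(dim=1)       # average of the concatenated top-layer states

# alternative: just the two "ends" -- the forward state at the last token
# and the backward state at the first token
endpoints = torch.cat([top[:, -1, :64], top[:, 0, 64:]], dim=-1)
```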

  7. Encoder: a representation of the input. (Eisenstein, 2018)

  8. Encoder-Decoder: representing the input and converting it to the output. (Eisenstein, 2018)

  9. Encoder-Decoder [diagram: the decoder emits y(0), y(1), y(2), y(3), ... through a softmax layer] (Eisenstein, 2018)

  10. Encoder-Decoder [diagram: the decoder starts from <go>, is fed its previous outputs y(0), y(1), y(2), ..., and emits y(0), y(1), y(2), y(3), ... through a softmax layer]

  11. Encoder-Decoder [diagram: the encoder produces a representation of the input; the decoder starts from <go> and emits y(0), y(1), y(2), y(3), ... through a softmax layer]

  12. Encoder-Decoder [diagram: encoder representation of the input; decoder from <go> emitting y(0), ..., y(3) through a softmax] The decoder is essentially a language model conditioned on the final state from the encoder.

  13. Encoder-Decoder. When applied to new data, the decoder runs as a language model conditioned on the final state from the encoder, starting from <go>.
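
A minimal sketch of that inference loop: a toy decoder seeded with the encoder's final state, feeding each greedy prediction back in until an end token; every name and weight here is illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 8                      # toy vocabulary size and hidden size
GO, EOS = 0, 1                   # illustrative special token ids
E = rng.normal(size=(V, d))      # token embeddings
W_h, W_x, W_o = (rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                 rng.normal(size=(V, d)))

def decoder_step(h, y_prev):
    """One toy RNN decoder step: new hidden state and next-token logits."""
    h = np.tanh(W_h @ h + W_x @ E[y_prev])
    return h, W_o @ h

def greedy_decode(encoder_final_state, max_len=10):
    """Language model conditioned on the encoder's final state: start at
    <go>, feed each argmax prediction back in until <eos>."""
    h, y, out = encoder_final_state, GO, []
    for _ in range(max_len):
        h, logits = decoder_step(h, y)
        y = int(np.argmax(logits))       # greedy choice from the softmax
        if y == EOS:
            break
        out.append(y)
    return out

print(greedy_decode(rng.normal(size=d)))
```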

  14. Encoder-Decoder [diagram repeated: encoder representation of the input; decoder from <go> emitting y(0), ..., y(3) through a softmax]

  15. Encoder-Decoder ("seq2seq" model) [diagram: the encoder reads Language 1 (e.g. Chinese); the decoder, starting from <go>, emits Language 2 (e.g. English) as y(0), y(1), y(2), y(3), ... through a softmax layer]

  16. Encoder-Decoder. Challenge: ● Long-distance dependencies when translating [diagram: decoder from <go> emitting y(0), ..., y(4)]

  17. Encoder-Decoder. Challenge: ● Long-distance dependencies when translating [diagram repeated]

  18. Encoder-Decoder. Challenge: ● Long-distance dependencies when translating, e.g. "Kayla kicked the ball." vs. "The ball was kicked by Kayla." [diagram: decoder from <go> emitting y(0), ..., y(4)]

  19. Encoder-Decoder. Challenge: ● Long-distance dependencies when translating, e.g. "Kayla kicked the ball." vs. "The ball was kicked by Kayla." A lot of responsibility is put on the fixed-size hidden state passed from the encoder to the decoder. [diagram: decoder from <go> emitting y(0), ..., y(4)]

  20. Long-distance / out-of-order dependencies [diagram: decoder from <go> emitting y(0), ..., y(3) through a softmax] A lot of responsibility is put on the fixed-size hidden state passed from the encoder to the decoder.

  21. Long-distance / out-of-order dependencies [diagram: decoder from <go> emitting y(0), ..., y(3) through a softmax]

  22. Attention [diagram: the decoder, from <go>, emits y(0), ..., y(3) through a softmax while attending to the encoder states s_1, s_2, s_3, s_4]

  23. Attention. Analogy: random access memory. [diagram: decoder attending to encoder states s_1, ..., s_4]

  24. Attention [diagram: an attention layer connects each decoder step to the encoder states s_1, ..., s_4]

  25. Attention [diagram: the attention layer computes a context vector c_{h_i} for the current decoder state h_i over the encoder states s_1, ..., s_4 (with values z_{n-1}, z_n, z_{n+1}); i indexes the current output token, n the input tokens]
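
The actual figure is an image; a hedged sketch of the computation it depicts, in the slides' notation (a score function 𝜔 compares h_i to each input state s_n; the softmaxed weights combine the values z_n into the context c_{h_i}):

```latex
\alpha_{h_i \to s_n} = \frac{\exp \omega(h_i, s_n)}{\sum_{n'} \exp \omega(h_i, s_{n'})},
\qquad
c_{h_i} = \sum_{n} \alpha_{h_i \to s_n}\, z_n
```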

  26. Attention [diagram: the context vector c_{h_i} is formed from attention weights α_{h_i->s_1}, ..., α_{h_i->s_4} over the states s_1, ..., s_4]

  27. Attention [diagram: c_{h_i} from weights α_{h_i->s_1}, ..., α_{h_i->s_4} over values z_1, ..., z_4] Z is the vector to be attended to (the value in memory). It is typically the hidden states of the input (i.e. s_n) but can be anything.

  28. Attention [diagram repeated: c_{h_i} from weights α_{h_i->s_1}, ..., α_{h_i->s_4} over s_1, ..., s_4]

  29. Attention [diagram: a score function 𝜔 compares h_i with each s_n to produce the weights α_{h_i->s_n}; c_{h_i} is their weighted combination]

  30. Attention [diagram: score function 𝜔 with parameters v, W_h, W_s producing the weights α_{h_i->s_1}, ..., α_{h_i->s_4} over s_1, ..., s_4]

  31. Attention [diagram: score function 𝜔 with parameters v, W_h, W_s; weights α_{h_i->s_n} over keys s_1, ..., s_4 and values z_1, ..., z_4] A useful abstraction is to make the vector attended to (the "value vector", Z) separate from the "key vector" (s).

  32. Attention [diagram: h_i is the query, s_1, ..., s_4 are the keys, z_1, ..., z_4 are the values; score function 𝜔 with parameters v, W_h, W_s] A useful abstraction is to make the vector attended to (the "value vector", Z) separate from the "key vector" (s).
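
A minimal NumPy sketch of this query/key/value attention, assuming the additive (Bahdanau-style) score function suggested by the parameters v, W_h, W_s on the slide; all dimensions and names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(h_i, S, Z, W_h, W_s, v):
    """h_i: query (d_h,); S: keys (n, d_s); Z: values (n, d_z).
    Score: omega(h_i, s_n) = v . tanh(W_h h_i + W_s s_n)."""
    scores = np.array([v @ np.tanh(W_h @ h_i + W_s @ s_n) for s_n in S])
    alpha = softmax(scores)               # attention weights over the n inputs
    c = (alpha[:, None] * Z).sum(axis=0)  # context vector c_{h_i}
    return c, alpha

# toy example
rng = np.random.default_rng(0)
d_h, d_s, d_z, d_a, n = 4, 4, 4, 3, 5
h_i = rng.normal(size=d_h)
S, Z = rng.normal(size=(n, d_s)), rng.normal(size=(n, d_z))
W_h, W_s, v = rng.normal(size=(d_a, d_h)), rng.normal(size=(d_a, d_s)), rng.normal(size=d_a)
c, alpha = additive_attention(h_i, S, Z, W_h, W_s, v)
```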

  33. Attention: Alternative Scoring Functions [diagram: score function 𝜔 compares h_i with s_1, ..., s_4 to produce weights α_{h_i->s_n} and the context c_{h_i}]

  34. Attention: Alternative Scoring Functions. If the variables are standardized, a matrix multiply produces a similarity score. [diagram as above]
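
The specific functions listed on the slide are inside the figure; as a hedged reconstruction, commonly used scoring functions include the dot-product family (the "matrix multiply as similarity" the slide notes) and the additive form from Bahdanau et al. (2015):

```latex
\omega_{\text{dot}}(h, s) = h^\top s, \qquad
\omega_{\text{scaled}}(h, s) = \frac{h^\top s}{\sqrt{d}}, \qquad
\omega_{\text{bilinear}}(h, s) = h^\top W s, \qquad
\omega_{\text{additive}}(h, s) = v^\top \tanh(W_h h + W_s s)
```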

  35. Attention [diagram: attention alignment between h_i and s_1, ..., s_4] ("synced", 2017)

  36. Attention [diagram repeated] ("synced", 2017)

  37. Attention [diagram repeated] (Bahdanau et al., 2015; "synced", 2017)

  38. Attention [diagram repeated] (Bahdanau et al., 2015; "synced", 2017)

  39. Machine Translation. Why? ● A $40 billion/year industry ● A centerpiece of many genres of science fiction ● A fairly "universal" problem: ○ language understanding ○ language generation ● Societal benefits of intercultural communication

  40. Machine Translation. Why? ● A $40 billion/year industry ● A centerpiece of many genres of science fiction ● A fairly "universal" problem: ○ language understanding ○ language generation ● Societal benefits of intercultural communication (Douglas Adams)

  41. Machine Translation. Why does the neural network approach work? (Manning, 2018) ● Joint end-to-end training: learning all parameters at once ● Exploiting distributed representations (embeddings) ● Exploiting variable-length context ● High-quality generation from deep decoders: stronger language models (outputs make sense even when wrong)

  42. Machine Translation As an optimization problem (Eisenstein, 2018):
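
The formula on this slide is an image; a hedged reconstruction of the standard formulation (following the general setup in Eisenstein, 2018, with y the target sentence and x the source) is:

```latex
\hat{y} = \operatorname*{argmax}_{y} \Psi(y, x),
\qquad
\Psi(y, x) = \log p(y \mid x) = \sum_{t=1}^{|y|} \log p\!\left(y^{(t)} \mid y^{(0:t-1)}, x\right)
```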

  43. Attention [diagram repeated: attention alignment between h_i and s_1, ..., s_4] ("synced", 2017)

  44. Attention. Analogy: random access memory. [diagram: decoder from <go> emitting y(0), ..., y(3) through a softmax, attending to encoder states s_1, ..., s_4]

  45. Attention [diagram: decoder attending to encoder states s_1, ..., s_4] Do we even need all these RNNs? (Vaswani et al., 2017: Attention Is All You Need)

  46. Attention [diagram repeated: query h_i, keys s_1, ..., s_4, values z_1, ..., z_4; score function with parameters v, W_h, W_s] A useful abstraction is to make the vector attended to (the "value vector", Z) separate from the "key vector" (s).

  47. Attention [diagram: query h_i, keys s_j, values z_j] A useful abstraction is to make the vector attended to (the "value vector", Z) separate from the "key vector" (s). (Eisenstein, 2018)

  48. The Transformer: “Attention-only” models Attention as weighting a value based on a query and key: (Eisenstein, 2018)
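
The equation itself is an image; a hedged sketch of that weighting, in generic query/key/value notation (the same computation as on slide 25, with q in place of h_i, keys k_j in place of s_n, and values v_j in place of z_n):

```latex
\alpha_j = \frac{\exp \omega(q, k_j)}{\sum_{j'} \exp \omega(q, k_{j'})},
\qquad
\text{output}(q) = \sum_j \alpha_j\, v_j
```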

  49. The Transformer: "Attention-only" models [diagram: the output is a weighted combination (weights α from score function 𝜔) of the hidden states h_{i-1}, h_i, h_{i+1}] (Eisenstein, 2018)

  50. The Transformer: "Attention-only" models [diagram: self-attention, where each h_i attends over its neighbors h_{i-1}, h_i, h_{i+1}] (Eisenstein, 2018)

  51. The Transformer: "Attention-only" models [diagram: attention over h_{i-1}, h_i, h_{i+1}, h_{i+2}]

  52. The Transformer: "Attention-only" models [diagram: word inputs w_{i-1}, ..., w_{i+2} -> hidden states h_{i-1}, ..., h_{i+2} -> attention -> feed-forward network (FFN) -> output]

  53. The Transformer: "Attention-only" models [diagram: words w_{i-1}, ..., w_{i+2} -> hidden states h_{i-1}, ..., h_{i+2} -> attention -> outputs y_{i-1}, ..., y_{i+2}]

  54. The Transformer: "Attention-only" models [diagram repeated, extended across a longer sequence (...)]

  55. The Transformer: "Attention-only" models. Attend to all hidden states in your "neighborhood". [diagram: each output y_i attends over h_{i-1}, ..., h_{i+2}]

  56. The Transformer: "Attention-only" models [diagram: the score 𝜔 is a dot product k^T q between key and query; the weights α multiply (X) each value and are summed (+) to form the output y_i at each position, over inputs w_{i-1}, ..., w_{i+2} and hidden states h_{i-1}, ..., h_{i+2}]

  57. The Transformer: "Attention-only" models [diagram: the score σ(k, q) is the dot product k^T q with a scaling parameter; the weights α multiply each value and are summed to form the output]

  58. The Transformer: "Attention-only" models [diagram: scaled dot-product score σ(k, q) on k^T q; a linear layer W^T X produces the keys, queries, and values, with one set of weights for each of K, Q, and V]
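
A minimal NumPy sketch of this self-attention step, assuming scaled dot-product scoring and one linear layer each for Q, K, and V; all sizes and names are illustrative:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (n, d) hidden states h_1..h_n (one row per input position).
    Each position's query attends over every position's key; the
    softmaxed, scaled dot-product weights combine the values."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # linear layers for Q, K, V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scaled dot product k^T q
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                         # weighted sum of the values

# toy example: 4 positions, model dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)         # (4, 8), one output per position
```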

  59. The Transformer: "Attention-only" models. Why? ● No need for the complexity of LSTM/GRU cells ● A constant number of edges between words (or input steps) ● Enables "interactions" (i.e. adaptations) between words ● Easy to parallelize: no sequential processing needed

  60. The Transformer Limitation (thus far): Can’t capture multiple types of dependencies between words.
