
Transformer Sequence Models
CSE354 - Spring 2020, Natural Language Processing
Overview: Transformer networks for sequence tasks (most NLP tasks, e.g. language modeling, machine translation, speech) and BERT.


  1. The Transformer: “Attention-only” models [Diagram: inputs w_{i-1}..w_{i+2} with hidden states h_{i-1}..h_{i+2}; attention weights α come from dot products of keys k and queries q, and each output y_i is a weighted sum over the hidden states]

  2. The Transformer: “Attention-only” models [Diagram: as above, adding a scaling parameter: attention weights α = σ(kᵀq), a softmax over the scaled dot products of keys and queries]

  3. The Transformer: “Attention-only” models [Diagram: keys, queries, and values are produced by linear layers (WᵀX), with one set of weights for each of K, Q, and V; attention weights α = σ(kᵀq) as before]
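
A minimal sketch of what slides 1-3 build up to, in PyTorch (the library choice and names such as SelfAttention and d_model are mine, not from the slides): keys, queries, and values come from separate linear layers over the hidden states, attention weights are a softmax over scaled dot products kᵀq, and each output is the weighted sum of the values.

    import math
    import torch
    import torch.nn as nn

    class SelfAttention(nn.Module):
        """Single-head self-attention: softmax over scaled dot products k·q, applied to values."""
        def __init__(self, d_model):
            super().__init__()
            # One set of weights each for K, Q, and V (the linear layers of slide 3).
            self.W_k = nn.Linear(d_model, d_model, bias=False)
            self.W_q = nn.Linear(d_model, d_model, bias=False)
            self.W_v = nn.Linear(d_model, d_model, bias=False)
            self.scale = math.sqrt(d_model)        # scaling parameter (slide 2)

        def forward(self, h):                      # h: (seq_len, d_model) hidden states
            k, q, v = self.W_k(h), self.W_q(h), self.W_v(h)
            scores = q @ k.T / self.scale          # dot products, scaled
            alpha = torch.softmax(scores, dim=-1)  # attention weights alpha
            return alpha @ v                       # each y_i is a weighted sum over positions

    y = SelfAttention(d_model=64)(torch.randn(4, 64))   # 4 input steps -> 4 outputs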

  4. The Transformer: “Attention-only” models -- Why?
     ● Don’t need complexity of LSTM/GRU cells
     ● Constant number of edges between words (or input steps)
     ● Enables “interactions” (i.e. adaptations) between words
     ● Easy to parallelize -- don’t need sequential processing.

  5. The Transformer Limitation (thus far): Can’t capture multiple types of dependencies between words.

  6. The Transformer Solution: Multi-head attention

  7. Multi-head Attention
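
A quick sketch of multi-head attention using PyTorch’s built-in module (my choice of tooling; the slides do not prescribe one): each head gets its own K/Q/V projections, so different heads can capture different types of dependencies, and the heads’ outputs are concatenated and projected back to the model dimension.

    import torch
    import torch.nn as nn

    seq_len, batch, d_model, n_heads = 5, 1, 64, 8
    x = torch.randn(seq_len, batch, d_model)     # (sequence, batch, embedding)

    # 8 heads, each with its own key/query/value projections.
    mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)
    out, weights = mha(x, x, x)                  # self-attention: query = key = value = x
    print(out.shape, weights.shape)              # torch.Size([5, 1, 64]) torch.Size([1, 5, 5])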

  8. Transformer for Encoder-Decoder

  9. Transformer for Encoder-Decoder [Diagram label: sequence index (t)]

  10. Transformer for Encoder-Decoder

  11. Transformer for Encoder-Decoder Residualized Connections

  12. Transformer for Encoder-Decoder: Residualized Connections -- residuals enable positional information to be passed along.
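
A minimal sketch of a residualized connection (the name ResidualBlock and the use of layer normalization, which the original Transformer pairs with the residual, are my additions): the sublayer’s input, which carries the positional information, is added back to its output, so position can pass along unchanged through the stack.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """out = LayerNorm(x + sublayer(x)); the skip path lets information in x,
        including the positional encoding, flow past the sublayer untouched."""
        def __init__(self, sublayer, d_model):
            super().__init__()
            self.sublayer = sublayer
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            return self.norm(x + self.sublayer(x))

    block = ResidualBlock(nn.Linear(64, 64), d_model=64)
    y = block(torch.randn(10, 64))               # same shape in, same shape out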

  13. Transformer for Encoder-Decoder

  14. Transformer for Encoder-Decoder -- essentially, a language model

  15. Transformer for Encoder-Decoder -- essentially, a language model; the decoder blocks out future inputs.
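
A minimal illustration (not the course’s code) of how the decoder blocks out future inputs: an upper-triangular mask sets the attention scores for later positions to -inf before the softmax, so position i can only attend to positions up to i.

    import torch

    seq_len = 4
    scores = torch.randn(seq_len, seq_len)                     # raw query-key attention scores
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(future, float('-inf'))         # block out future positions
    alpha = torch.softmax(scores, dim=-1)                      # row i attends only to positions <= i
    print(alpha)                                               # upper triangle is all zeros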

  16. Transformer for Encoder-Decoder -- essentially, a language model; add conditioning of the LM based on the encoder.

  17. Transformer for Encoder-Decoder

  18. Transformer (as of 2017): BLEU scores on the “WMT-2014” data set.

  19. Transformer
     ● Utilize self-attention
     ● Simple attention scoring function (dot product, scaled)
     ● Added linear layers for Q, K, and V
     ● Multi-head attention
     ● Added positional encoding (see the sketch below)
     ● Added residual connection
     ● Simulate decoding by masking
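
Since self-attention by itself ignores word order, a positional encoding is added to each input embedding. Below is a sketch of the sinusoidal version from Vaswani et al. (2017); the slides only note that a positional encoding was added, so take the exact formula as the original paper’s choice rather than the course’s.

    import math
    import torch

    def sinusoidal_positional_encoding(seq_len, d_model):
        """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
        pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)          # (seq_len, 1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                        * (-math.log(10000.0) / d_model))                    # (d_model/2,)
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    x = torch.randn(10, 512)                            # 10 token embeddings of width 512
    x = x + sinusoidal_positional_encoding(10, 512)     # inject the sequence index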

  20. Transformer -- Why?
     ● Don’t need complexity of LSTM/GRU cells
     ● Constant number of edges between words (or input steps)
     ● Enables “interactions” (i.e. adaptations) between words
     ● Easy to parallelize -- don’t need sequential processing.
     Drawbacks:
     ● Only unidirectional by default
     ● Only a “single-hop” relationship per layer (multiple layers to capture multiple)

  21. BERT: Bidirectional Encoder Representations from Transformers
     Produces contextualized embeddings (or a pre-trained contextualized encoder).
     Why?
     ● Don’t need complexity of LSTM/GRU cells
     ● Constant number of edges between words (or input steps)
     ● Enables “interactions” (i.e. adaptations) between words
     ● Easy to parallelize -- don’t need sequential processing.
     Drawbacks of vanilla Transformers:
     ● Only unidirectional by default
     ● Only a “single-hop” relationship per layer (multiple layers to capture multiple)

  22. BERT: Bidirectional Encoder Representations from Transformers
     Produces contextualized embeddings (or a pre-trained contextualized encoder).
     ● Bidirectional context by “masking” in the middle
     ● A lot of layers, hidden states, attention heads.
     Why?
     ● Don’t need complexity of LSTM/GRU cells
     ● Constant number of edges between words (or input steps)
     ● Enables “interactions” (i.e. adaptations) between words
     ● Easy to parallelize -- don’t need sequential processing.
     Drawbacks of vanilla Transformers:
     ● Only unidirectional by default
     ● Only a “single-hop” relationship per layer (multiple layers to capture multiple)

  23. BERT: Bidirectional Encoder Representations from Transformers
     Produces contextualized embeddings (or a pre-trained contextualized encoder).
     ● Bidirectional context by “masking” in the middle
     ● A lot of layers, hidden states, attention heads.
     She saw the man on the hill with the telescope.
     She [mask] the man on the hill [mask] the telescope.

  24. BERT: Bidirectional Encoder Representations from Transformers
     Produces contextualized embeddings (or a pre-trained contextualized encoder).
     ● Bidirectional context by “masking” in the middle
     ● A lot of layers, hidden states, attention heads.
     She saw the man on the hill with the telescope.
     She [mask] the man on the hill [mask] the telescope.
     Mask 1 in 7 words:
     ● Too few: expensive, less robust
     ● Too many: not enough context
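
A minimal sketch of the masking step (the function and mask rate here are illustrative; BERT’s actual recipe masks about 15% of word pieces, roughly 1 in 7, and sometimes substitutes a random or unchanged token instead of [MASK]):

    import random

    def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
        """Randomly hide ~15% of tokens; the model is trained to predict the originals."""
        masked, targets = [], []
        for tok in tokens:
            if random.random() < mask_rate:
                masked.append(mask_token)
                targets.append(tok)          # label the model must recover
            else:
                masked.append(tok)
                targets.append(None)         # no loss at unmasked positions
        return masked, targets

    tokens = "She saw the man on the hill with the telescope .".split()
    print(mask_tokens(tokens))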

  25. BERT: Bidirectional Encoder Representations from Transformers
     Produces contextualized embeddings (or a pre-trained contextualized encoder).
     ● Bidirectional context by “masking” in the middle
     ● A lot of layers, hidden states, attention heads.
     ● BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters

  26. BERT: Bidirectional Encoder Representations from Transformers
     Produces contextualized embeddings (or a pre-trained contextualized encoder).
     ● Bidirectional context by “masking” in the middle
     ● A lot of layers, hidden states, attention heads.
     ● BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
     ● BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
     ● BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
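
These off-the-shelf checkpoints can be loaded with, for example, the HuggingFace transformers library (my choice of tooling; the slides do not name one). A quick sketch that loads BERT-Base, Cased and checks the 12-layer / 768-hidden / 12-head configuration:

    from transformers import BertModel, BertTokenizer    # pip install transformers

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    model = BertModel.from_pretrained("bert-base-cased")

    print(model.config.num_hidden_layers,       # 12 layers
          model.config.hidden_size,             # 768 hidden units
          model.config.num_attention_heads)     # 12 attention heads
    print(sum(p.numel() for p in model.parameters()))    # roughly 110M parameters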

  27. BERT (Devlin et al., 2019)

  28. BERT Differences from previous state of the art: ● Bidirectional transformer (through masking) ● Directions jointly trained at once. (Devlin et al., 2019)

  29. BERT Differences from previous state of the art: ● Bidirectional transformer (through masking) ● Directions jointly trained at once. ● Capture sentence-level relations (Devlin et al., 2019)

  30. BERT Differences from previous state of the art: ● Bidirectional transformer (through masking) ● Directions jointly trained at once. ● Capture sentence-level relations (Devlin et al., 2019)

  31. BERT Differences from previous state of the art: ● Bidirectional transformer (through masking) ● Directions jointly trained at once. ● Capture sentence-level relations (Devlin et al., 2019)

  32. BERT Differences from previous state of the art: ● Bidirectional transformer (through masking) ● Directions jointly trained at once. ● Capture sentence-level relations (Devlin et al., 2019)

  33. BERT [Callout: tokenize into “word pieces”] Differences from previous state of the art: ● Bidirectional transformer (through masking) ● Directions jointly trained at once. ● Capture sentence-level relations (Devlin et al., 2019)
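
A quick look at word-piece tokenization, again via the HuggingFace tokenizer (an assumed tool, not named on the slide): words outside the vocabulary are split into sub-word pieces, with continuation pieces marked by a leading '##'.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    print(tokenizer.tokenize("She saw the telescoping contraption."))
    # Rare words come back as pieces prefixed with '##'; the exact splits depend on the vocabulary.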

  34. BERT Performance: e.g. Question Answering https://rajpurkar.github.io/SQuAD-explorer/

  35. BERT: Attention by Layers https://colab.research.google.com/drive/1vlOJ1lhdujVjfH857hvYKIdKPTD9Kid8 (Vig, 2019)

  36. BERT: Pre-training; Fine-tuning [Diagram: 12 or 24 layers]

  37. BERT: Pre-training; Fine-tuning [Diagram: 12 or 24 layers]

  38. BERT: Pre-training; Fine-tuning -- Novel classifier (e.g. sentiment classifier, stance detector, etc.) on top of the 12 or 24 layers.

  39. BERT: Pre-training; Fine-tuning -- Novel classifier (e.g. sentiment classifier, stance detector, etc.); the [CLS] vector at the start is supposed to capture the meaning of the whole sequence.

  40. BERT: Pre-training; Fine-tuning -- Novel classifier (e.g. sentiment classifier, stance detector, etc.); the [CLS] vector at the start is supposed to capture the meaning of the whole sequence. The average of the top layer (or second-to-top layer) is also often used.
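
A sketch of the two pooling strategies from this slide, using the HuggingFace transformers API (assumed tooling): take the [CLS] vector at position 0, or average the top-layer (or second-to-top-layer) hidden states, and feed the result to the novel classifier head.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    model = BertModel.from_pretrained("bert-base-cased", output_hidden_states=True)

    enc = tokenizer("She saw the man on the hill with the telescope.", return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)

    cls_vec = out.last_hidden_state[:, 0, :]        # [CLS] vector at the start
    avg_top = out.last_hidden_state.mean(dim=1)     # average of the top layer
    avg_2nd = out.hidden_states[-2].mean(dim=1)     # average of the second-to-top layer

    classifier = torch.nn.Linear(model.config.hidden_size, 2)   # the "novel classifier" head
    logits = classifier(cls_vec)                                # e.g. 2-way sentiment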

  41. Extra Material:
