Transformer Models CSE545 - Spring 2019
Review: Feed-Forward Network (fully-connected) (skymind, AI Wiki)
Review: Convolutional NN (Barter, 2018)
Review: Recurrent Neural Network. "Hidden layer": h^(t) = g(h^(t-1) U + x^(t) V); output/activation: y^(t) = f(h^(t) W) (Jurafsky, 2019)
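A minimal numpy sketch of one recurrent step under these definitions (the tanh/softmax choices for g and f, and the sizes, are illustrative assumptions, not from the slides):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, V, W):
    """One step of the recurrence: h(t) = g(h(t-1) U + x(t) V); y(t) = f(h(t) W)."""
    h_t = np.tanh(h_prev @ U + x_t @ V)           # g = tanh (a common choice)
    logits = h_t @ W
    y_t = np.exp(logits) / np.exp(logits).sum()   # f = softmax over outputs
    return h_t, y_t

# illustrative sizes: 5-dim inputs, 8-dim hidden state, 3 output classes
rng = np.random.default_rng(0)
U, V, W = rng.normal(size=(8, 8)), rng.normal(size=(5, 8)), rng.normal(size=(8, 3))
h = np.zeros(8)
for x_t in rng.normal(size=(4, 5)):     # a length-4 input sequence
    h, y = rnn_step(x_t, h, U, V, W)    # sequential: step t depends on step t-1
```

Note the loop: each hidden state depends on the previous one, which is exactly what prevents parallelizing across time steps.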
FFN vs. CNN vs. RNN: Can model computation (e.g. matrix operations for a single input) be parallelized? For the RNN it cannot, which ultimately limits how complex the model can be (i.e. its total number of parameters/weights) as compared to a CNN.
The Transformer: “Attention-only” models. Can handle sequences and long-distance dependencies, but:
● Don’t want the complexity of LSTM/GRU cells
● Constant number of edges between input steps
● Enables “interactions” (i.e. adaptations) between words
● Easy to parallelize -- no need for sequential processing
The Transformer: “Attention-only” models. Challenge: long-distance dependency when translating, e.g. “The ball was kicked by Kayla.” → “Kayla kicked the ball.” [Figure: encoder-decoder unrolled over output steps y^(0) ... y^(4), starting from <go>]
Attention: [Figure: a query h_i is scored against each key s_1 ... s_4 by a score function ω; the resulting weights α_{h_i→s_j} form a weighted sum of the values z_1 ... z_4, producing the context vector c (W denotes learned weights).]
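In equation form (notation follows the figure; the softmax normalization of the scores is a standard assumption):

```latex
% weight of value z_j for query h_i: softmax of the scores \omega(h_i, s_j)
\alpha_{h_i \to s_j} = \frac{\exp\!\big(\omega(h_i, s_j)\big)}{\sum_{j'} \exp\!\big(\omega(h_i, s_{j'})\big)},
\qquad
c = \sum_j \alpha_{h_i \to s_j}\, z_j
```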
The Transformer: “Attention-only” models. Challenge: long-distance dependencies when translating. Attention came about for encoder-decoder models; then self-attention was introduced:
Self-Attention: [Figure: the same attention mechanism, but the query is itself one of the sequence’s hidden states (position i): each position’s state is scored against the keys s_1 ... s_4 by ω, and the weights α combine the values z_1 ... z_4 into c_i.]
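A minimal numpy sketch of self-attention in its simplest form, using the raw dot product between hidden states as the score function ω (the learned projections and scaling come in on the following slides; sizes are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# hidden states for a length-4 sequence, 8-dim each (illustrative)
H = np.random.default_rng(0).normal(size=(4, 8))

scores = H @ H.T          # omega(h_i, h_j) = h_i . h_j  -> (4, 4)
alpha = softmax(scores)   # row i: attention weights over all positions j
C = alpha @ H             # c_i = sum_j alpha_ij * h_j   -> (4, 8)
```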
The Transformer: “Attention-only” models Attention as weighting a value based on a query and key: (Eisenstein, 2018)
The Transformer: “Attention-only” models. [Figure (Eisenstein, 2018), built up across several slides: word inputs w_{i-1}, w_i, w_{i+1}, w_{i+2} are mapped to hidden states h_{i-1} ... h_{i+2}; a self-attention layer (score function ω, weights α) lets each position attend to all hidden states in its “neighborhood”; a feed-forward network (FFN) then produces the outputs y_{i-1} ... y_{i+2}.]
The Transformer: “Attention-only” models. The score function is the (scaled) dot product between key and query: σ(k, q) = (k^T q) divided by a scaling parameter (√d_k in Vaswani et al., 2017). Keys, queries, and values come from linear layers (W^T x), with one set of weights for each of K, Q, and V.
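Putting those pieces together, a minimal numpy sketch of scaled dot-product self-attention with learned projections (the √d_k scale follows Vaswani et al., 2017; sizes and weight values here are random placeholders):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 8, 8
X = rng.normal(size=(4, d_model))            # hidden states for a length-4 sequence

# one set of linear-layer weights for each of Q, K, and V
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)              # scaled dot-product scores
alpha = softmax(scores)                      # attention weights per query position
output = alpha @ V                           # weighted sum of values -> (4, d_k)
```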
The Transformer Limitation (thus far): Can’t capture multiple types of dependencies between words.
The Transformer Solution: Multi-head attention
Multi-head Attention
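A minimal numpy sketch of the multi-head idea, reusing single-head attention from above: run several heads with their own projections, concatenate their outputs, and mix them with an output projection (head count and sizes are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
n_heads, d_model = 2, 8
d_head = d_model // n_heads
X = rng.normal(size=(4, d_model))

heads = []
for _ in range(n_heads):
    # each head has its own Q/K/V projections, so it can capture a different
    # type of dependency between words
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

W_o = rng.normal(size=(d_model, d_model))
multi_head_out = np.concatenate(heads, axis=-1) @ W_o   # (4, d_model)
```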
Transformer for Encoder-Decoder
Transformer for Encoder-Decoder. [Figure: positional information as a function of the sequence index (t).]
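For reference, the original Transformer adds sinusoidal positional encodings, a deterministic function of the sequence index (t) and the embedding dimension (Vaswani et al., 2017); a minimal numpy sketch with illustrative sizes:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[t, 2i] = sin(t / 10000^(2i/d_model)); PE[t, 2i+1] = cos(same angle)."""
    t = np.arange(seq_len)[:, None]              # sequence index (t)
    i = np.arange(0, d_model, 2)[None, :]        # even embedding dimensions
    angles = t / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

X = np.random.default_rng(0).normal(size=(4, 8))   # token embeddings (placeholder)
X = X + positional_encoding(4, 8)                  # inject order information
```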
Transformer for Encoder-Decoder: residualized connections. Residuals enable positional information to be passed along (each sublayer’s input is added to its output).
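In the original architecture, each sublayer (self-attention or FFN) is wrapped in a residual connection followed by layer normalization (Vaswani et al., 2017):

```latex
\mathrm{output} = \mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)
```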
Transformer for Encoder-Decoder: the decoder is essentially a language model. The decoder blocks out (masks) future inputs, as sketched below.
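A minimal numpy sketch of how the decoder blocks out future inputs: positions j > i get a score of -inf before the softmax, so their attention weight becomes zero (sizes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T = 4
scores = np.random.default_rng(0).normal(size=(T, T))   # raw attention scores

# causal mask: position i may only attend to positions j <= i
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

alpha = softmax(scores)   # rows still sum to 1; masked (future) positions get weight 0
```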
Transformer for Encoder-Decoder: add conditioning of the LM based on the encoder (the decoder attends to the encoder outputs).
Transformer (as of 2017): BLEU scores on the “WMT-2014” data set. [Table of BLEU results]
Transformer:
● Utilizes self-attention
● Simple attention scoring function (dot product, scaled)
● Added linear layers for Q, K, and V
● Multi-head attention
● Added positional encoding
● Added residual connections
● Simulates decoding by masking
https://4.bp.blogspot.com/-OlrV-PAtEkQ/W3RkOJCBkaI/AAAAAAAADOg/gNZXo_eK3tMNOmIfsuvPzrRfNb3qFQwJwCLcBGAs/s640/image1.gif
Transformer. Why?
● Don’t need the complexity of LSTM/GRU cells
● Constant number of edges between words (or input steps)
● Enables “interactions” (i.e. adaptations) between words
● Easy to parallelize -- no need for sequential processing
Drawbacks:
● Only unidirectional by default
● Only a “single-hop” relationship per layer (multiple layers needed to capture multiple)
BERT: Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (i.e. a pre-trained contextualized encoder), addressing the drawbacks of vanilla Transformers listed above:
● Bidirectional context by “masking” in the middle
● A lot of layers, hidden states, attention heads
BERT: tokenize into “word pieces”. Differences from previous state of the art:
● Bidirectional transformer (through masking)
● Directions jointly trained at once
● Captures sentence-level relations
(Devlin et al., 2019)
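A minimal sketch of how the masked-language-model pre-training input could be constructed: select roughly 15% of word-piece tokens for prediction and replace most of them with a [MASK] token (the 15%/80%/10% proportions follow Devlin et al., 2019; the tokens below are a toy example, not real WordPiece output):

```python
import random

random.seed(0)
tokens = ["the", "ball", "was", "kick", "##ed", "by", "kayla"]  # toy word pieces

masked, targets = [], []
for tok in tokens:
    if random.random() < 0.15:                    # select ~15% of tokens to predict
        targets.append(tok)
        r = random.random()
        if r < 0.8:
            masked.append("[MASK]")               # 80%: replace with [MASK]
        elif r < 0.9:
            masked.append(random.choice(tokens))  # 10%: replace with a random token
        else:
            masked.append(tok)                    # 10%: keep the original token
    else:
        targets.append(None)                      # not predicted
        masked.append(tok)

# the model sees `masked` and is trained to predict the non-None entries of `targets`
```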
BERT: Attention by Layers https://colab.research.google.com/drive/1vlOJ1lhdujVjfH857hvYKIdKPTD9Kid8 (Vig, 2019)
BERT Performance: e.g. Question Answering https://rajpurkar.github.io/SQuAD-explorer/
BERT: Pre-training; Fine-tuning. [Figure: 12 or 24 Transformer layers]
BERT: Pre-training; Fine-tuning. Fine-tuning adds a novel classifier (e.g. a sentiment classifier, stance detector, etc.) on top of the pre-trained 12- or 24-layer encoder, as sketched below.
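A minimal numpy sketch of that fine-tuning setup: take the encoder’s output at the [CLS] position and feed it to a small, newly initialized classification layer, which is trained (along with the encoder) on the downstream task. The [CLS]-pooling choice and hidden size follow Devlin et al. (2019); the encoder output here is a random placeholder rather than a real BERT forward pass:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
hidden_size, n_classes = 768, 3                   # e.g. 3-way stance detection

# stand-in for BERT's final-layer outputs: ([CLS], tok_1, ..., tok_n)
encoder_out = rng.normal(size=(10, hidden_size))
cls_vec = encoder_out[0]                          # representation of [CLS]

# novel classifier head: one linear layer + softmax, learned during fine-tuning
W = rng.normal(size=(hidden_size, n_classes)) * 0.02
b = np.zeros(n_classes)
probs = softmax(cls_vec @ W + b)                  # class probabilities
```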