Transformer Sequence Models and Sequence Applications (Machine Translation, Speech Recognition) CSE392 - Spring 2019 Special Topic in CS
Most NLP Tasks. E.g.
● Transformer Networks
  ○ Transformers
  ○ BERT
● Sequence Tasks
  ○ Language Modeling
  ○ Machine Translation
  ○ Speech Recognition
Multi-level bidirectional RNN (LSTM or GRU) (Eisenstein, 2018)
Multi-level bidirectional RNN (LSTM or GRU): each node has a forward (→) and a backward (←) hidden state, which can be represented as a concatenation of both. (Eisenstein, 2018)
Multi-level bidirectional RNN (LSTM or GRU): the average of the top layer (the average of the concatenated vectors) serves as an embedding of the sequence. (Eisenstein, 2018)
Multi-level bidirectional RNN (LSTM or GRU): sometimes just the left-most and right-most hidden states are used instead. (Eisenstein, 2018)
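The two pooling options above can be sketched with plain arrays standing in for the hidden states of a trained bidirectional RNN (the states here are random placeholders, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

T, d = 5, 4                      # 5 tokens, hidden size 4
fwd = rng.normal(size=(T, d))    # forward (→) hidden states of the top layer
bwd = rng.normal(size=(T, d))    # backward (←) hidden states of the top layer

# Each node's state is the concatenation of both directions (2d-dimensional).
h = np.concatenate([fwd, bwd], axis=1)

# Option 1: average the top layer's concatenated vectors -> one embedding.
sent_avg = h.mean(axis=0)

# Option 2: use the "outermost" states instead: the forward state at the
# last token and the backward state at the first token.
sent_ends = np.concatenate([fwd[-1], bwd[0]])

print(sent_avg.shape, sent_ends.shape)   # (8,) (8,)
```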
Encoder A representation of input. (Eisenstein, 2018)
Encoder-Decoder: representing the input and converting it to output. (Eisenstein, 2018)
Encoder-Decoder: the decoder produces outputs y(0), y(1), y(2), y(3), … through a softmax layer, starting from a <go> token; each predicted token is fed back in as the next input. The encoder supplies a representation of the input, and the decoder is essentially a language model conditioned on the final state from the encoder. When applied to new data, the decoder conditions on its own previous predictions. (Eisenstein, 2018)
Encoder-Decoder: the "seq2seq" model. The encoder reads Language 1 (e.g. Chinese), and the decoder generates Language 2 (e.g. English): y(0), y(1), y(2), y(3), … through a softmax, starting from <go>.
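The decoder loop can be sketched as greedy decoding. Here `decoder_step` is a hypothetical stand-in for one step of a trained RNN decoder (a real model would apply learned GRU/LSTM weights); the structure of the loop — start from `<go>`, feed each prediction back in, stop at `<eos>` — is the part being illustrated:

```python
import numpy as np

VOCAB = ["<go>", "<eos>", "kayla", "kicked", "the", "ball"]

def decoder_step(state, token_id):
    """Stand-in for one decoder step: returns (new_state, logits).
    A real model would apply trained weights; here we fake logits
    deterministically from the state and input token."""
    rng = np.random.default_rng(state + token_id)
    return state + 1, rng.normal(size=len(VOCAB))

def greedy_decode(encoder_state, max_len=10):
    """Decode starting from <go>, conditioned on the encoder's final state."""
    state, token_id = encoder_state, VOCAB.index("<go>")
    out = []
    for _ in range(max_len):
        state, logits = decoder_step(state, token_id)
        token_id = int(np.argmax(logits))   # argmax of the softmax distribution
        if VOCAB[token_id] == "<eos>":
            break
        out.append(VOCAB[token_id])         # fed back in as the next input
    return out

print(greedy_decode(encoder_state=0))
```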
Encoder-Decoder. Challenge:
● Long-distance dependencies when translating, which may also be out of order, e.g. "The ball was kicked by Kayla." ↔ "Kayla kicked the ball."
● A lot of responsibility is put on the fixed-size hidden state passed from the encoder to the decoder.
Long-distance / out-of-order dependencies: a lot of responsibility is put on the fixed-size hidden state passed from the encoder to the decoder.
Attention. Analogy: random access memory. An attention layer sits between the encoder states s_1, s_2, s_3, s_4 and the decoder states h_{i-1}, h_i, h_{i+1}, producing a context vector c_{h_i} for each output position (i: current token of the output; n: tokens of the input).
Attention: the context vector is a weighted sum, c_{h_i} = Σ_j α_{h_i→s_j} z_j. The weights α_{h_i→s_j} are produced by a score function ω(h_i, s_j), with parameters v, W_h, W_s, followed by a softmax. Z is the vector to be attended to (the value in memory); it is typically the hidden states of the input (i.e. s_n) but can be anything. A useful abstraction is to make the vector attended to (the "value vector", z) separate from the "key vector" (s): h_i is the query, s_1 … s_4 are the keys, and z_1 … z_4 are the values.
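A minimal sketch of this query/key/value computation, using an additive ("MLP") score function with parameters v, W_h, W_s; the states and weights are random placeholders standing in for a trained model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d = 4
h_i = rng.normal(size=d)          # query: current decoder state
S = rng.normal(size=(4, d))       # keys: encoder states s_1..s_4
Z = S.copy()                      # values: typically the same input states

# Additive score function with parameters v, W_h, W_s:
#   score(h_i, s_j) = v . tanh(W_h h_i + W_s s_j)
v = rng.normal(size=d)
W_h, W_s = rng.normal(size=(d, d)), rng.normal(size=(d, d))
scores = np.array([v @ np.tanh(W_h @ h_i + W_s @ s_j) for s_j in S])

alpha = softmax(scores)           # attention weights alpha_{h_i -> s_j}
c = alpha @ Z                     # context vector c_{h_i}: weighted sum of values
print(alpha.round(2), c.shape)
```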
Attention: alternative scoring functions, e.g. the dot product: if the variables are standardized, a matrix multiply produces a similarity score.
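The "standardized variables" point can be checked directly: for standardized vectors, the dot product (divided by length) is exactly the Pearson correlation, so plain matrix multiplication acts as a similarity score:

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=100)
b = 0.8 * a + 0.2 * rng.normal(size=100)   # b is strongly correlated with a

def standardize(x):
    return (x - x.mean()) / x.std()

# For standardized vectors, (1/n) * a^T b equals the Pearson correlation,
# so a dot product directly measures similarity.
sim = standardize(a) @ standardize(b) / len(a)
print(round(sim, 2))
```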
Attention: alignment between source states s_1 … s_4 and decoder state h_i (Bahdanau et al., 2015; "synced", 2017)
Machine Translation. Why?
● A $40 billion/year industry
● A centerpiece of many genres of science fiction (Douglas Adams)
● A fairly "universal" problem:
  ○ Language understanding
  ○ Language generation
● Societal benefits of inter-cultural communication
Machine Translation. Why does the neural network approach work? (Manning, 2018)
● Joint end-to-end training: learning all parameters at once.
● Exploiting distributed representations (embeddings).
● Exploiting variable-length context.
● High-quality generation from deep decoders, i.e. stronger language models (even when wrong, the outputs make sense).
Machine Translation As an optimization problem (Eisenstein, 2018):
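The equation itself did not survive extraction; a standard formulation consistent with this framing (the exact notation on the slide may differ) is to pick the output sequence maximizing conditional log-probability, decomposed token by token as in the decoder language model:

```latex
\hat{y} = \operatorname*{argmax}_{y} \; \log p(y \mid x)
        = \operatorname*{argmax}_{y} \; \sum_{t=1}^{|y|} \log p\big(y^{(t)} \mid y^{(1)}, \dots, y^{(t-1)}, x\big)
```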
Attention: do we even need all these RNNs? (Vaswani et al., 2017: Attention Is All You Need)
Attention (recap): the query is h_i, the keys are s_1 … s_4 (generally s_j), and the values are z_1 … z_4 (generally z_j), scored by ω with parameters v, W_h, W_s. A useful abstraction is to make the vector attended to (the "value vector", z) separate from the "key vector" (s). (Eisenstein, 2018)
The Transformer: “Attention-only” models Attention as weighting a value based on a query and key: (Eisenstein, 2018)
The Transformer: "attention-only" models. Self-attention: each position's hidden state h_i attends to all hidden states in its "neighborhood" (h_{i-1}, h_i, h_{i+1}, h_{i+2}, …). The word inputs w_{i-1} … w_{i+2} feed the hidden states; attention weights α come from the score function ω; a feed-forward network (FFN) on top produces the outputs y_{i-1} … y_{i+2}. (Eisenstein, 2018)
The Transformer: "attention-only" models. The score is the dot product of key and query, k^T q, divided by a scaling parameter: σ(k, q) = (k^T q) / √d (scaled dot-product attention). K, Q, and V are each produced by a linear layer (W^T x): one set of weights for each of K, Q, and V.
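Putting the pieces together, a minimal sketch of single-head scaled dot-product self-attention, with one set of linear-layer weights each for K, Q, and V (random placeholder weights, not a trained model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
T, d = 4, 8                          # 4 tokens, model dimension 8
H = rng.normal(size=(T, d))          # hidden states h_{i-1} .. h_{i+2}

# One set of linear-layer weights (W^T x) for each of K, Q, and V:
W_k, W_q, W_v = (rng.normal(size=(d, d)) for _ in range(3))
K, Q, V = H @ W_k, H @ W_q, H @ W_v

# Dot-product scores k^T q, divided by the scaling parameter sqrt(d),
# softmaxed into attention weights, then used to mix the values.
scores = Q @ K.T / np.sqrt(d)        # (T, T): every token attends to every token
alpha = softmax(scores, axis=-1)     # each row sums to 1
out = alpha @ V                      # new representation for each token
print(out.shape)                     # (4, 8)
```

Note that nothing here is sequential: all T positions are computed with a few matrix multiplies, which is the parallelism argument on the next slide.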
The Transformer: "attention-only" models. Why?
● Doesn't need the complexity of LSTM/GRU cells.
● A constant number of edges between words (or input steps).
● Enables "interactions" (i.e. adaptations) between words.
● Easy to parallelize: no sequential processing needed.
The Transformer Limitation (thus far): Can’t capture multiple types of dependencies between words.