

  1. CS447: Natural Language Processing (http://courses.engr.illinois.edu/cs447)
 Lecture 12: Attention and Transformers
 Julia Hockenmaier (juliahmr@illinois.edu), 3324 Siebel Center

  2. Lecture 12: Attention and Transformers
 Attention Mechanisms

  3. Encoder-Decoder (seq2seq) model
 Task: Read an input sequence and return an output sequence
 – Machine translation: translate source into target language
 – Dialog system/chatbot: generate a response
 Reading the input sequence: RNN Encoder
 Generating the output sequence: RNN Decoder
 [Figure: encoder-decoder diagram (input, hidden, output)]
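As a point of reference before attention is introduced, here is a minimal NumPy sketch of a plain encoder-decoder model in the spirit of this slide. The function names, the simple Elman-style RNN cell, and the teacher-forced decoder inputs are illustrative assumptions, not details from the slides; real systems use LSTM/GRU cells and a prediction layer over the output vocabulary.

```python
import numpy as np

def rnn_step(h, x, W_h, W_x, b):
    # One simple (Elman-style) RNN step; real encoders/decoders use LSTMs or GRUs.
    return np.tanh(W_h @ h + W_x @ x + b)

def seq2seq(src_vecs, tgt_vecs, enc_params, dec_params):
    """Bare-bones encoder-decoder (no attention yet).

    src_vecs: list of input word vectors (the source sequence)
    tgt_vecs: list of output word vectors fed to the decoder (teacher forcing)
    Returns the decoder hidden states, from which output words would be predicted.
    """
    W_h, W_x, b = enc_params
    h = np.zeros_like(b)
    for x in src_vecs:                    # RNN encoder reads the input sequence
        h = rnn_step(h, x, W_h, W_x, b)

    V_h, V_x, c = dec_params
    dec_states = []
    for y in tgt_vecs:                    # RNN decoder starts from the encoder's
        h = rnn_step(h, y, V_h, V_x, c)   # final state and generates the output
        dec_states.append(h)
    return dec_states
```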

  4. A more general view of seq2seq
 Insight 1: In general, any function of the encoder’s output can be used as a representation of the context we want to condition the decoder on.
 Insight 2: We can feed the context in at any time step during decoding (not just at the beginning).

  5. Adding attention to the decoder
 Basic idea: Feed a d-dimensional representation of the entire (arbitrary-length) input sequence into the decoder at each time step during decoding.
 This representation of the input can be a weighted average of the encoder’s representation of the input (i.e. its output).
 The weights of each encoder output element tell us how much attention we should pay to different parts of the input sequence.
 Since different parts of the input may be more or less important for different parts of the output, we want to vary the weights over the input during the decoding process. (Cf. word alignments in machine translation)

  6. Adding attention to the decoder
 We want to condition the output generation of the decoder on a context-dependent representation of the input sequence.
 Attention computes a probability distribution over the encoder’s hidden states that depends on the decoder’s current hidden state. (This distribution is computed anew for each output symbol.)
 This attention distribution is used to compute a weighted average of the encoder’s hidden state vectors. This context-dependent embedding of the input sequence is fed into the output computation of the decoder RNN.

  7. Attention, more formally
 Define a probability distribution α^(t) = (α^(t)_1, ..., α^(t)_S) over the S elements of the input sequence that depends on the current output element t.
 Use this distribution to compute a weighted average of the encoder’s outputs, Σ_{s=1..S} α^(t)_s o_s, or hidden states, Σ_{s=1..S} α^(t)_s h_s, and feed that into the decoder.
 (See https://www.tensorflow.org/tutorials/text/nmt_with_attention)

  8. Attention, more formally
 1. Compute a probability distribution α^(t) = (α^(t)_1, ..., α^(t)_S) over the encoder’s hidden states h^(s) that depends on the decoder’s current hidden state h^(t):
    α^(t)_s = exp(s(h^(t), h^(s))) / Σ_{s'} exp(s(h^(t), h^(s')))
 2. Use α^(t) to compute a weighted avg. c^(t) of the encoder’s h^(s):
    c^(t) = Σ_{s=1..S} α^(t)_s h^(s)
 3. Use both c^(t) and h^(t) to compute a new output o^(t), e.g. as
    o^(t) = tanh(W_1 h^(t) + W_2 c^(t))
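A minimal NumPy sketch of one decoder step performing these three operations; the function names are illustrative, and dot-product scoring (one of the options on the next slide) is assumed for s(·,·).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(h_dec, H_enc, W1, W2):
    """One decoder time step with attention.

    h_dec : (d,)   current decoder hidden state h^(t)
    H_enc : (S, d) encoder hidden states h^(1) .. h^(S)
    W1, W2: (d, d) learned output matrices
    """
    scores = H_enc @ h_dec            # 1. s(h^(t), h^(s)) for all s (dot product)
    alpha = softmax(scores)           #    attention distribution α^(t)
    c = alpha @ H_enc                 # 2. context c^(t) = Σ_s α^(t)_s h^(s)
    o = np.tanh(W1 @ h_dec + W2 @ c)  # 3. new output o^(t)
    return o, alpha
```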

  9. Defining Attention Weights
 Hard attention (degenerate case, non-differentiable):
 α^(t) = (α^(t)_1, ..., α^(t)_S) is a one-hot vector
 (e.g. 1 = most similar element to the decoder’s vector, 0 = all other elements)
 Soft attention (general case):
 α^(t) = (α^(t)_1, ..., α^(t)_S) is not a one-hot vector; the score s(h^(t), h^(s)) can be defined in several ways:
 — Use the dot product (no learned parameters): s(h^(t), h^(s)) = h^(t) · h^(s)
 — Learn a bilinear matrix W: s(h^(t), h^(s)) = (h^(t))^T W h^(s)
 — Learn separate weights for the hidden states: s(h^(t), h^(s)) = v^T tanh(W_1 h^(t) + W_2 h^(s))
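For reference, the three scoring options written out in NumPy (function names and shapes are illustrative; h_t and h_s are d-dimensional vectors, W, W1, W2 are d × d matrices, and v is a d-dimensional vector):

```python
import numpy as np

def score_dot(h_t, h_s):
    # Dot product: no learned parameters.
    return h_t @ h_s

def score_bilinear(h_t, h_s, W):
    # Bilinear: s = h_t^T W h_s with a learned matrix W.
    return h_t @ W @ h_s

def score_mlp(h_t, h_s, W1, W2, v):
    # Separate learned weights for the two hidden states.
    return v @ np.tanh(W1 @ h_t + W2 @ h_s)
```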

  10. Lecture 12: Attention and Transformers
 Transformers (“Attention Is All You Need”, Vaswani et al., NIPS 2017)

  11. Transformers
 Sequence transduction model based on attention (no convolutions or recurrence):
 — easier to parallelize than recurrent nets
 — faster to train than recurrent nets
 — captures more long-range dependencies than CNNs with fewer parameters
 Transformers use stacked self-attention and position-wise, fully-connected layers for the encoder and decoder.
 Transformers form the basis of BERT, GPT(-2/-3), and other state-of-the-art neural sequence models.

  12. Seq2seq attention mechanisms
 Define a probability distribution α^(t) = (α^(t)_1, ..., α^(t)_S) over the S elements of the input sequence that depends on the current output element t.
 Use this distribution to compute a weighted average of the encoder’s outputs, Σ_{s=1..S} α^(t)_s o_s, or hidden states, Σ_{s=1..S} α^(t)_s h_s, and feed that into the decoder.
 (See https://www.tensorflow.org/tutorials/text/nmt_with_attention)

  13. Self-Attention
 Attention so far (in seq2seq architectures): in the decoder (which has access to the complete input sequence), compute attention weights over encoder positions that depend on each decoder position:
 For each decoder position t, compute an attention weight for each encoder position s, then renormalize these weights (which depend on t) with softmax to get a new weighted avg. of the input sequence vectors.
 Self-attention: if the encoder has access to the complete input sequence, we can also compute attention weights over encoder positions that depend on each encoder position.
 [Figure: self-attention in the encoder]

  14. Self-attention: Simple variant
 Given T k-dimensional input vectors x^(1) … x^(i) … x^(T), compute T k-dimensional output vectors y^(1) … y^(i) … y^(T), where each output y^(i) is a weighted average of the input vectors, and where the weights w_ij depend on x^(i) and x^(j):
    y^(i) = Σ_{j=1..T} w_ij x^(j)
 Computing the weights w_ij naively (no learned parameters):
 Dot product: w′_ij = Σ_k x^(i)_k x^(j)_k
 Followed by softmax: w_ij = exp(w′_ij) / Σ_{j'} exp(w′_ij')
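This simple variant can be transcribed almost directly into NumPy; the function name and the max-subtraction (for numerical stability of the softmax) are my additions.

```python
import numpy as np

def simple_self_attention(X):
    """Self-attention with no learned parameters.

    X: (T, k) matrix whose rows are the input vectors x^(1) .. x^(T).
    Returns Y: (T, k) matrix whose rows are the outputs y^(1) .. y^(T).
    """
    raw = X @ X.T                          # w'_ij = x^(i) · x^(j)
    raw -= raw.max(axis=1, keepdims=True)  # for numerical stability
    W = np.exp(raw)
    W /= W.sum(axis=1, keepdims=True)      # softmax over j
    return W @ X                           # y^(i) = Σ_j w_ij x^(j)
```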

  15. Towards more flexible self-attention
 To compute y^(i) = Σ_{j=1..T} w_ij x^(j), we must…
 … take the element x^(i),
 … decide the weight w_ij of each x^(j) depending on x^(i),
 … average all elements x^(j) according to their weights.
 Observation 1: Dot-product-based weights are large when x^(i) and x^(j) are similar. But we may want a more flexible approach.
 Idea 1: Learn attention weights w_ij that depend on x^(i) and x^(j) in a manner that works best for the task.
 Observation 2: This weighted average is still just a simple function of the original x^(j)s.
 Idea 2: Learn weights that re-weight the elements of x^(j) in a manner that works best for the task.

  16. Self-attention with queries, keys, values
 Let’s add learnable parameters (three k × k weight matrices) that allow us to turn any input vector x^(i) into three versions:
 — Query vector q^(i) = W_q x^(i): to compute the averaging weights at position i
 — Key vector k^(i) = W_k x^(i): to compute the averaging weights of position i
 — Value vector v^(i) = W_v x^(i): to compute the value of position i to be averaged
 The attention weight of the j-th position used in the weighted average at the i-th position depends on the query of i and the key of j:
    w^(i)_j = exp(Σ_l q^(i)_l k^(j)_l) / Σ_{j'} exp(Σ_l q^(i)_l k^(j')_l) = exp(q^(i) · k^(j)) / Σ_{j'} exp(q^(i) · k^(j'))
 The new output vector for the i-th position depends on the attention weights and value vectors of all input positions j:
    y^(i) = Σ_{j=1..T} w^(i)_j v^(j)

  17. Transformer Architecture
 Non-recurrent encoder-decoder architecture:
 — No hidden states
 — Context information captured via attention and positional encodings
 — Consists of stacks of layers with various sublayers
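To make “stacks of layers with various sublayers” a bit more concrete, here is a rough sketch of sinusoidal positional encodings and one simplified encoder layer (self-attention plus a position-wise feed-forward sublayer). It reuses qkv_self_attention from the sketch above; the residual connections follow the Transformer paper, while layer normalization and multi-head attention are omitted for brevity.

```python
import numpy as np

def positional_encoding(T, k):
    # Sinusoidal positional encodings, added to the input embeddings.
    pos = np.arange(T)[:, None]
    dim = np.arange(k)[None, :]
    angles = pos / np.power(10000.0, (2 * (dim // 2)) / k)
    return np.where(dim % 2 == 0, np.sin(angles), np.cos(angles))

def encoder_layer(X, Wq, Wk, Wv, W1, b1, W2, b2):
    """One simplified Transformer encoder layer.

    X: (T, k) inputs; W1: (k, d_ff), b1: (d_ff,), W2: (d_ff, k), b2: (k,).
    """
    A = qkv_self_attention(X, Wq, Wk, Wv)        # self-attention sublayer
    X = X + A                                    # residual connection
    F = np.maximum(0.0, X @ W1 + b1) @ W2 + b2   # position-wise FFN (ReLU)
    return X + F                                 # residual connection

# Usage sketch: X = token_embeddings + positional_encoding(T, k), then apply
# a stack of encoder layers: for p in layer_params: X = encoder_layer(X, *p)
```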
