slide-1
SLIDE 1

Seq2Seq Models and Attention

  • M. Soleymani

Sharif University of Technology, Spring 2020. Most slides have been adopted from Bhiksha Raj, 11-785, CMU 2019, and some from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017.

slide-2
SLIDE 2
  • Problem:

– A sequence $Y_1 \dots Y_N$ goes in
– A different sequence $Z_1 \dots Z_M$ comes out

  • E.g.

– Speech recognition: Speech goes in, a word sequence comes out

  • Alternately the output may be a phoneme or character sequence

– Machine translation: Word sequence goes in, word sequence comes out

  • In general $M \neq N$

– No synchrony between $Y$ and $Z$.

2

Sequence-to-sequence modelling

slide-3
SLIDE 3

Sequence to sequence

  • Sequence goes in, sequence comes out
  • No notion of “synchrony” between input and output

– May not even have a notion of “alignment”

  • E.g. “I ate an apple” → “Ich habe einen apfel gegessen”

3

[Figure: a Seq2seq box maps “I ate an apple” to “Ich habe einen apfel gegessen”]

slide-4
SLIDE 4
  • Sequence goes in, sequence comes out
  • No notion of “synchrony” between input and output

– May not even have a notion of “alignment”

  • E.g. “I ate an apple” → “Ich habe einen apfel gegessen”

4

[Figure: a Seq2seq box maps “I ate an apple” to “Ich habe einen apfel gegessen”]

Today

slide-5
SLIDE 5

Recap: Predicting text

  • Simple problem: Given a series of symbols (characters or words) $w_1 w_2 \dots w_n$, predict the next symbol (character or word) $w_{n+1}$

5

slide-6
SLIDE 6

Recap: Text Modelling

  • Learn a model that can predict the next character given a sequence of characters

– Or, at a higher level, words

  • After observing inputs $x_1 \dots x_k$ it predicts $x_{k+1}$

[Figure: a recurrent network unrolled over inputs $x_1 \dots x_6$, predicting $x_2 \dots x_7$ from initial state $h_0$]

6

slide-7
SLIDE 7
  • Input: symbols as one-hot vectors
  • Dimensionality of the input vector is the size of the “vocabulary”
  • Projected down to lower-dimensional “embeddings”
  • Output: Probability distribution over symbols

$Z(u, i) = Q(W_i \mid x_1 \dots x_u)$

  • Loss

$\mathrm{Loss}\big(\mathbf{Z}_{\mathrm{target}}(1 \dots U), \mathbf{Z}(1 \dots U)\big) = \sum_u \mathrm{Xent}\big(\mathbf{Z}_{\mathrm{target}}(u), \mathbf{Z}(u)\big) = -\sum_u \log Z(u, x_{u+1})$

i.e. the negative log of the probability assigned to the correct next word

7

Recap: Training

$W_i$ is the i-th symbol in the vocabulary
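
The loss above is simply the total negative log-probability the model assigns to the words that actually come next. A minimal NumPy sketch (my own illustration, not code from the lecture), where `Z[u]` plays the role of $Z(u, \cdot)$ and `targets[u]` is the index of the symbol that followed:

```python
import numpy as np

def next_word_loss(Z: np.ndarray, targets: np.ndarray) -> float:
    """Total cross-entropy: -sum_u log Z[u, targets[u]]."""
    u = np.arange(len(targets))
    return float(-np.sum(np.log(Z[u, targets])))

# Toy example: vocabulary of 4 symbols, 3 prediction steps.
Z = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.1, 0.1, 0.7]])
targets = np.array([0, 1, 3])        # the symbols that actually came next
print(next_word_loss(Z, targets))    # -(log 0.7 + log 0.5 + log 0.7)
```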

slide-8
SLIDE 8

Recap: Generating Language or Synthesis

  • On a trained model: Provide the first few words

– One-hot vectors

  • After the last input word, the network generates a probability distribution over words

– Outputs an N-valued probability distribution rather than a one-hot vector

8

$z_t^i = Q(X_t = W_i \mid X_1 \dots X_{t-1})$

The probability that the t-th word in the sequence is the i-th word in the vocabulary given all previous t-1 words

slide-9
SLIDE 9

Recap: Generating Language or Synthesis

  • On a trained model: Provide the first few words

– One-hot vectors

  • After the last input word, the network generates a probability distribution over words

– Outputs an N-valued probability distribution rather than a one-hot vector

  • Draw a word from the distribution

– And set it as the next word in the series

9

$z_t^i = Q(X_t = W_i \mid X_1 \dots X_{t-1})$

The probability that the t-th word in the sequence is the i-th word in the vocabulary given all previous t-1 words

slide-10
SLIDE 10

Recap: Generating Language or Synthesis

  • Feed the drawn word as the next word in the series

– And draw the next word from the output probability distribution

  • Continue this process until we terminate generation

– In some cases, e.g. generating programs, there may be a natural termination

10

$z_t^i = Q(X_t = W_i \mid X_1 \dots X_{t-1})$

The probability that the t-th word in the sequence is the i-th word in the vocabulary given all previous t-1 words

slide-11
SLIDE 11

Recap: Generating Language or Synthesis

  • Feed the drawn word as the next word in the series

– And draw the next word from the output probability distribution

  • Continue this process until we terminate generation

– In some cases, e.g. generating programs, there may be a natural termination

11

$z_t^i = Q(X_t = W_i \mid X_1 \dots X_{t-1})$

The probability that the t-th word in the sequence is the i-th word in the vocabulary given all previous t-1 words

slide-12
SLIDE 12

Recap: Generating Language or Synthesis

  • Feed the drawn word as the next word in the series

– And draw the next word from the output probability distribution

  • Continue this process until we terminate generation

– For text generation we will usually end at an <eos> (end of sequence) symbol

  • The <eos> symbol is a special symbol included in the vocabulary, which indicates the termination of a sequence and occurs only at the final position of sequences

[Figure: the network is run forward, drawing $X_4, X_5, \dots, X_{10}$ one at a time until <eos>]

12
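
A minimal sketch of this synthesis loop (an illustration under assumptions, not the lecture's code): `rnn_step` is a hypothetical stand-in for one step of a trained recurrent language model; any model that returns a new hidden state and a probability distribution over the vocabulary would fit.

```python
import numpy as np

def generate(rnn_step, h0, seed_ids, eos_id, max_len=50, rng=np.random.default_rng(0)):
    # Prime the network with the seed words (seed_ids must be non-empty);
    # keep only the distribution produced after the last seed word.
    h = h0
    for w in seed_ids:
        h, probs = rnn_step(h, w)
    out = []
    for _ in range(max_len):
        w = int(rng.choice(len(probs), p=probs))   # draw a word from the distribution
        out.append(w)
        if w == eos_id:                            # stop once <eos> is drawn
            break
        h, probs = rnn_step(h, w)                  # feed the drawn word back in
    return out
```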

slide-13
SLIDE 13
  • Problem:

– A sequence $Y_1 \dots Y_N$ goes in
– A different sequence $Z_1 \dots Z_M$ comes out

  • Similar to predicting text, but with a difference

– The output is in a different language.

13

[Figure: a Seq2seq box maps “I ate an apple” to “Ich habe einen apfel gegessen”]

Returning to our problem

slide-14
SLIDE 14
  • Delayed sequence to sequence

14

Modelling the problem

slide-15
SLIDE 15
  • Delayed sequence to sequence

– Delayed self-referencing sequence-to-sequence

15

Modelling the problem

slide-16
SLIDE 16
  • The input sequence feeds into a recurrent structure
  • The input sequence is terminated by an explicit <eos> symbol

– The hidden activation at the <eos> “stores” all information about the sentence

  • Subsequently a second RNN uses the hidden activation as initial state to produce a sequence of outputs

– The output at each time becomes the input at the next time
– Output production continues until an <eos> is produced

16

[Figure: encoder reading “I ate an apple <eos>”]

The “simple” translation model

SLIDES 17-21
  • The input sequence feeds into a recurrent structure
  • The input sequence is terminated by an explicit <eos> symbol

– The hidden activation at the <eos> “stores” all information about the sentence

  • Subsequently a second RNN uses the hidden activation as initial state to produce a sequence of outputs

– The output at each time becomes the input at the next time
– Output production continues until an <eos> is produced

21

[Figure, built up over slides 17-21: after “I ate an apple <eos>” the decoder emits “Ich”, “habe”, “einen”, “apfel”, “gegessen”, <eos>, feeding each output back as the next input]

The “simple” translation model

slide-22
SLIDE 22

22

  • We will illustrate with a single hidden layer, but the discussion generalizes to more layers

[Figure: full encoder-decoder unrolled over “I ate an apple <eos>” and “Ich habe einen apfel gegessen <eos>”]

slide-23
SLIDE 23

23

[Figure: the encoder (left, reading “I ate an apple <eos>”) and the decoder (right, producing “Ich habe einen apfel gegessen <eos>”)]

ENCODER DECODER

The “simple” translation model
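
A minimal sketch of this encoder-decoder arrangement. PyTorch is my choice here (the lecture does not prescribe a framework), and the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encoder: the final hidden state "stores" the whole source sentence.
        _, h = self.encoder(self.src_emb(src))
        # Decoder: starts from that state; tgt_in is the target shifted right
        # (teacher forcing during training).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), h)
        return self.out(dec_out)          # logits over the target vocabulary

# Shape check with dummy data: batch of 2, source length 5, target length 6.
model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
logits = model(torch.randint(0, 1000, (2, 5)), torch.randint(0, 1200, (2, 6)))
print(logits.shape)                        # torch.Size([2, 6, 1200])
```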

slide-24
SLIDE 24
  • A more detailed look: The one-hot word representations may be compressed via embeddings

– Embeddings will be learned along with the rest of the net
– In the following slides we will not represent the projection matrices

24

[Figure: each input and output word passes through a projection (embedding) matrix before entering the recurrent layers]

The “simple” translation model

slide-25
SLIDE 25
  • Must learn to make predictions appropriately

– Given “I ate an apple <eos>”, produce “Ich habe einen apfel gegessen <eos>”.

25

[Figure: encoder-decoder over the training pair “I ate an apple <eos>” / “Ich habe einen apfel gegessen <eos>”]

Training the system

slide-26
SLIDE 26
  • Forward pass: Input the source and target sequences, sequentially

– Output will be a probability distribution over target symbol set (vocabulary)

26

[Figure: the source and target sequences are fed in; the decoder produces distributions $Z_1 \dots Z_6$ over the target vocabulary]

Training: Forward pass

slide-27
SLIDE 27
  • Backward pass: Compute the loss between the output distribution and target

word sequence

27

[Figure: a Loss is computed between each output distribution $Z_t$ and the corresponding target word of “Ich habe einen apfel gegessen <eos>”]

Training: Backward pass

slide-28
SLIDE 28
  • Backward pass: Compute the loss between the output distribution and target word

sequence

  • Backpropagate the derivatives of the loss through the network to learn the net

28

[Figure: per-word Loss terms between the output distributions $Z_1 \dots Z_6$ and “Ich habe einen apfel gegessen <eos>”]

Training: Backward pass

slide-29
SLIDE 29
  • In practice, if we apply SGD, we may randomly sample words from the output to actually use for

the backprop and update

– Typical usage: Randomly select one word from each input training instance (comprising an input-output pair)

  • For each iteration
  • Randomly select training instance: (input, output)
  • Forward pass
  • Randomly select a single output y(t) and corresponding desired output d(t) for backprop

29

[Figure: encoder-decoder with per-output Loss terms]

Training: Backward pass

slide-30
SLIDE 30
  • Standard trick of the trade: The input sequence is fed in reverse order

– Things work better this way

30

[Figure: encoder-decoder with per-output Loss terms]

30

Trick of the trade: Reversing the input

slide-32
SLIDE 32
  • Standard trick of the trade: The input sequence is fed in reverse order

– Things work better this way

  • This happens both for training and during actual decode

32

[Figure: encoder-decoder with per-output Loss terms]

Trick of the trade: Reversing the input

slide-33
SLIDE 33

Overall training

  • Given several training instances $(\mathbf{Y}, \mathbf{Z}_{\mathrm{target}})$
  • Forward pass: Compute the output of the network for $(\mathbf{Y}, \mathbf{Z}_{\mathrm{target}})$ with the input in reverse order

– Note, both $\mathbf{Y}$ and $\mathbf{Z}_{\mathrm{target}}$ are used in the forward pass

  • Backward pass: Compute the loss between the desired target $\mathbf{Z}_{\mathrm{target}}$ and the actual output $\mathbf{Z}$

– Propagate derivatives of the loss for updates

  • In the following slides we refer to the input $\mathbf{Y}$ as $J$ and the target $\mathbf{Z}_{\mathrm{target}}$ as $P$

33
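
One possible training step for this recipe, sketched under assumptions: a model with the interface of the Seq2Seq sketch shown earlier, PyTorch, and padded (source, target) index tensors. Here the loss is taken at every output position rather than at a randomly sampled one.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, src, tgt, sos_id, pad_id):
    # Input fed in reverse order (the "trick of the trade" from slides 30-32).
    src_rev = torch.flip(src, dims=[1])
    # Decoder input: <sos> followed by target[:-1]; the target itself is the desired output.
    sos = torch.full((tgt.size(0), 1), sos_id, dtype=tgt.dtype)
    tgt_in = torch.cat([sos, tgt[:, :-1]], dim=1)
    logits = model(src_rev, tgt_in)                      # forward pass
    loss = nn.functional.cross_entropy(                  # Xent at every output step
        logits.reshape(-1, logits.size(-1)), tgt.reshape(-1),
        ignore_index=pad_id)
    optimizer.zero_grad()
    loss.backward()                                      # backward pass
    optimizer.step()
    return float(loss)
```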

slide-34
SLIDE 34
  • At each time $t$ the network actually produces a probability distribution over the output vocabulary

– $z_t^w = Q(z_t = w \mid P_{t-1}, \dots, P_1, J_1, \dots, J_N)$
– The probability given the entire input sequence $J_1, \dots, J_N$ and the partial output sequence $P_1, \dots, P_{t-1}$ until $t$

  • At each time a word is drawn from the output distribution
  • The drawn word is provided as input to the next time

34

[Figure: at $t=1$ the decoder outputs one probability $z_1^{w}$ per word $w$ in the vocabulary]

What the network actually produces

SLIDES 35-42
  • At each time $t$ the network actually produces a probability distribution over the output vocabulary

– $z_t^w = Q(z_t = w \mid P_{t-1}, \dots, P_1, J_1, \dots, J_N)$
– The probability given the entire input sequence $J_1, \dots, J_N$ and the partial output sequence $P_1, \dots, P_{t-1}$ until $t$

  • At each time a word is drawn from the output distribution
  • The drawn word is provided as input to the next time

42

[Figure, built up over slides 35-42: at each step the decoder outputs a distribution $z_t^w$ over the vocabulary, a word is drawn (“Ich”, “habe”, “einen”, “apfel”, “gegessen”) and fed back as the next input, until <eos> is drawn]

What the network actually produces
slide-43
SLIDE 43
  • At each time the network produces a probability distribution over words, given the entire input and previous outputs
  • At each time a word is drawn from the output distribution
  • The drawn word is provided as input to the next time
  • The process continues until an <eos> is generated

43

[Figure: full decode of “I ate an apple <eos>” → “Ich habe einen apfel gegessen <eos>”, with the output distribution $z_t^w$ shown at each step]

Generating an output from the net

slide-44
SLIDE 44

$Q(P_1, \dots, P_L \mid J_1, \dots, J_N) = z_1^{P_1} z_2^{P_2} \cdots z_L^{P_L}$

  • The objective of drawing: Produce the most likely output (that ends in an <eos>)

$\operatorname*{argmax}_{P_1, \dots, P_L} \; z_1^{P_1} z_2^{P_2} \cdots z_L^{P_L}$

44

[Figure: full decode of “I ate an apple <eos>” → “Ich habe einen apfel gegessen <eos>”]

The probability of the output

slide-45
SLIDE 45
  • Cannot just pick the most likely symbol at each time

– That may cause the distribution to be more “confused” at the next time
– Choosing a different, less likely word could cause the distribution at the next time to be more peaky, resulting in a more likely output overall

45

Objective: $\operatorname*{argmax}_{P_1, \dots, P_L} \; z_1^{P_1} z_2^{P_2} \cdots z_L^{P_L}$

The probability of the output

slide-46
SLIDE 46
  • Hypothetical example (from English speech recognition: Input is speech, output must be text)
  • “Nose” has the highest probability at t=2 and is selected

– The model is very confused at t=3 and assigns low probabilities to many words at the next time
– Selecting any of these will result in low probability for the entire 3-word sequence

  • “Knows” has slightly lower probability than “nose”, but is still high and is selected

– “he knows” is a reasonable beginning and the model assigns high probabilities to words such as “something”
– Selecting one of these results in higher overall probability for the 3-word sequence

46

[Figure: two bar plots of $Q(P_3 \mid P_1, P_2, J_1, \dots, J_N)$ over the vocabulary, after selecting “nose” vs. after selecting “knows”]

Greedy is not good

slide-47
SLIDE 47
  • Problem: Impossible to know a priori which word leads to the more promising future

– Should we draw “nose” or “knows”?
– The effect may not be obvious until several words down the line
– Or the choice of the wrong word early may cumulatively lead to a poorer overall score over time

47

[Figure: bar plot of $Q(P_2 \mid P_1, J_1, \dots, J_N)$, with “nose” and “knows” highlighted]

What should we have chosen at t=2?? Will selecting “nose” continue to have a bad effect into the distant future?

Greedy is not good

slide-48
SLIDE 48
  • Problem: Impossible to know a priori which word leads to the more promising future

– Even earlier: Choosing the lower probability “the” instead of “he” at T=1 may have made a choice of “nose” more reasonable at T=2.

  • In general, making a poor choice at any time commits us to a poor future

– But we cannot know at that time that the choice was poor

  • Solution: Don’t choose.

48

[Figure: bar plot of $Q(P_1 \mid J_1, \dots, J_N)$, with “the” and “he” highlighted]

What should we have chosen at t=1?? Choose “the” or “he”?

Greedy is not good

slide-49
SLIDE 49
  • Retain both choices and fork the network

– With every possible word as input

49

[Figure: the decoder is forked after the first step, once for each candidate first word: I, He, We, The, …]

Solution: Multiple choices

slide-50
SLIDE 50
  • Problem: This will blow up very quickly

– For an output vocabulary of size $W$, after $T$ output steps we’d have forked out $W^T$ branches

50

[Figure: the tree of forks (I, He, We, The, …) grows exponentially with the number of output steps]

Problem: Multiple choices

slide-51
SLIDE 51
  • Solution: Prune

– At each time, retain only the top $K$ scoring forks

51

$\operatorname{top-}K \; Q(P_1 \mid J_1, \dots, J_N)$

[Figure: only the top-$K$ first words (among I, He, We, The, …) are retained]

Solution: Prune

Beam search

slide-53
SLIDE 53
  • Solution: Prune

– At each time, retain only the top $K$ scoring forks

53

$\operatorname{top-}K \; Q(P_2, P_1 \mid J_1, \dots, J_N) = \operatorname{top-}K \; Q(P_2 \mid P_1, J_1, \dots, J_N)\, Q(P_1 \mid J_1, \dots, J_N)$
Note: based on the product

[Figure: each retained fork (He, The) is extended with candidate second words (Knows, Nose, …) and pruned again]

Solution: Prune

Beam search

slide-55
SLIDE 55
  • Solution: Prune

– At each time, retain only the top $K$ scoring forks

55

$\operatorname{top-}K \; Q(P_3 \mid P_2, P_1, J_1, \dots, J_N) \times Q(P_2 \mid P_1, J_1, \dots, J_N) \times Q(P_1 \mid J_1, \dots, J_N)$

[Figure: the retained two-word forks (He/The followed by Knows/Nose) are extended to a third word and pruned again]

Solution: Prune

Beam search

slide-57
SLIDE 57
  • Solution: Prune

– At each time, retain only the top $K$ scoring forks

57

$\operatorname{top-}K \; \prod_{t=1}^{T} Q(P_t \mid P_1, \dots, P_{t-1}, J_1, \dots, J_N)$

[Figure: surviving forks: He/The followed by Knows/Nose]

Solution: Prune

Beam search

slide-58
SLIDE 58
  • Terminate

– When the current most likely path overall ends in <eos>

  • Or continue producing more outputs (each of which terminates in <eos>) to get N-best outputs

58

[Figure: the search ends when the best path overall emits <eos>]

Terminate

Beam search

slide-59
SLIDE 59
  • Terminate

– Paths cannot continue once they output an <eos>

  • So paths may be of different lengths
  • Select the most likely sequence ending in <eos> across all terminating sequences

59

[Figure: beams of different lengths, each ending in <eos>; the example has K = 2]

Termination: <eos>

Beam search
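
A compact sketch of this pruned search. `log_step` is a hypothetical stand-in for the decoder: given the output prefix so far (and, implicitly, the encoded input), it returns log-probabilities over the vocabulary for the next word.

```python
import numpy as np

def beam_search(log_step, eos_id, K=2, max_len=30):
    beams = [([], 0.0)]                       # (prefix, total log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            logp = log_step(prefix)           # log Q(next word | prefix, input)
            for w in np.argsort(logp)[-K:]:   # expand only the K best words per fork
                candidates.append((prefix + [int(w)], score + float(logp[w])))
        # Prune: keep the K best forks by total (product of) probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:K]:
            if prefix[-1] == eos_id:
                finished.append((prefix, score))   # a path stops once it emits <eos>
            else:
                beams.append((prefix, score))
        if not beams:
            break
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[1])      # most likely <eos>-terminated path
```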

slide-60
SLIDE 60

Applications

  • Machine Translation

– My name is Tom → Ich heisse Tom / Mein Name ist Tom

  • Dialog

– “I have a problem” → “How may I help you”

  • Image to text

– Picture → Caption for picture

60

slide-61
SLIDE 61
  • Hidden state clusters by meaning!

61

Machine Translation Example

“Sequence-to-sequence learning with neural networks”, Sutskever, Vinyals, and Le, 2014

slide-62
SLIDE 62

“Sequence-to-sequence learning with neural networks”, Sutskever, Vinyals and Le, 2014

62

Machine Translation Example

slide-63
SLIDE 63
  • Trained on human-human conversations
  • Task: Human text in, machine response out

63

Human-Machine Conversation: Example

“A neural conversational model”, Oriol Vinyals and Quoc Le, 2015

slide-64
SLIDE 64
  • All the information about the input sequence is embedded into a single vector

– The “hidden” node layer at the end of the input sequence
– This one node is “overloaded” with information

  • Particularly if the input is long

64

[Figure: encoder-decoder; the single hidden vector after <eos> must carry the entire sentence “I ate an apple”]

A problem with this framework

slide-65
SLIDE 65

A problem with this framework

  • In reality: All hidden values carry information

– Some of which may be diluted downstream

  • Different outputs are related to different inputs

– Recall input and output may not be in sequence
– Have no way of knowing a priori which input must connect to what output

65

[Figure: encoder-decoder over “I ate an apple” / “Ich habe einen apfel gegessen”]

slide-66
SLIDE 66

Variants

66

[Figure: a better model: the encoded input embedding is fed as input to all output timesteps; shown for translation (“<sos> I ate an apple <eos>” → “Ich habe einen apfel gegessen <eos>”) and for captioning (“A boy on a surfboard <eos>”)]

slide-67
SLIDE 67

A problem with this framework

  • All the information about the input sequence is embedded into a single vector

– The “hidden” node layer at the end of the input sequence
– This one node is “overloaded” with information

  • Particularly if the input is long

67

[Figure: encoder-decoder; the single hidden vector after <eos> must carry the entire sentence “I ate an apple”]

slide-68
SLIDE 68

A problem with this framework

  • In reality: All hidden values carry information

– Some of which may be diluted downstream

68

[Figure: encoder over “I ate an apple <eos>”, annotated “FIX ENCODER DECODER SEPARATION”]

slide-69
SLIDE 69

A problem with this framework

  • In reality: All hidden values carry information

– Some of which may be diluted downstream

  • Different outputs are related to different inputs

– Recall input and output may not be in sequence

69

[Figure: encoder-decoder over “I ate an apple” / “Ich habe einen apfel gegessen”]

slide-70
SLIDE 70

A problem with this framework

  • In reality: All hidden values carry information

– Some of which may be diluted downstream

  • Different outputs are related to different inputs

– Recall input and output may not be in sequence
– Have no way of knowing a priori which input must connect to what output

70

[Figure: encoder-decoder over “I ate an apple” / “Ich habe einen apfel gegessen”]

slide-71
SLIDE 71

A problem with this framework

  • Connecting everything to everything is infeasible

– Variable sized inputs and outputs
– Overparametrized
– The connection pattern ignores the actual asynchronous dependence of output on input

71

[Figure: encoder-decoder over “I ate an apple” / “Ich habe einen apfel gegessen”]

slide-72
SLIDE 72

Solution: Attention models

  • Separating the encoder and decoder in the illustration

72

[Figure: encoder hidden states $\mathbf{i}_j$ over “I ate an apple <eos>”, drawn separately from the decoder states $\mathbf{t}_t$]

slide-73
SLIDE 73

Attention models

  • Compute a weighted combination of all the hidden outputs into a single vector

– Weights vary by output time

73

[Figure: each decoder step $t$ receives $\sum_j \beta_{t,j}\, \mathbf{i}_j$, a weighted combination of the encoder states $\mathbf{i}_1 \dots \mathbf{i}_5$]
slide-74
SLIDE 74

Solution: Attention models

  • Compute a weighted combination of all the hidden outputs into a single vector

– Weights vary by output time

74

Note: The weights vary with output time. The input to the hidden decoder layer at time $t$ is $\sum_j \beta_{t,j}\, \mathbf{i}_j$; the weights $\beta_{t,j}$ are scalars

[Figure: decoder states $\mathbf{t}_0 \dots \mathbf{t}_6$, each fed its own weighted combination of the encoder states $\mathbf{i}_1 \dots \mathbf{i}_5$]
slide-75
SLIDE 75

Attention instead of simple encoder-decoder

  • Encoder-decoder models

– needs to be able to compress all the necessary information of a source sentence into a fixed-length vector
– performance deteriorates rapidly as the length of an input sentence increases

  • Attention avoids this by:

– allowing the RNN generating the output to focus on hidden states (generated by the first RNN) as they become relevant.

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015 75

slide-76
SLIDE 76

Soft Attention for Translation

An RNN can attend over the output of another RNN. At every time step, it focuses on different positions in the other RNN. “I love coffee” -> “Me gusta el café”

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015 76

slide-77
SLIDE 77

Soft Attention for Translation

“I love coffee” -> “Me gusta el café”

Distribution over input words

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015 77


slide-81
SLIDE 81

Solution: Attention models

  • Require a time-varying weight that specifies the relationship of output time to input time

– Weights are functions of the current output state

81

$\beta_{t,j} = b(\mathbf{i}_j, \mathbf{t}_{t-1})$

Input to the hidden decoder layer at time $t$: $\sum_j \beta_{t,j}\, \mathbf{i}_j$

[Figure: each decoder step receives its own weighted combination of the encoder states $\mathbf{i}_1 \dots \mathbf{i}_5$]

slide-82
SLIDE 82
  • The weights are a distribution over the input

– Must automatically highlight the most important input components for any output

82

$\beta_{t,j} = b(\mathbf{i}_j, \mathbf{t}_{t-1})$

Input to the hidden decoder layer at time $t$: $\sum_j \beta_{t,j}\, \mathbf{i}_j$

For a given $t$ the weights $\beta_{t,j}$ sum to 1.0 over the input positions $j$

Attention models

[Figure: encoder states $\mathbf{i}_1 \dots \mathbf{i}_5$ and decoder states $\mathbf{t}_0 \dots \mathbf{t}_6$; the decoder has produced “Ich habe einen”]

slide-83
SLIDE 83
  • “Raw” weight at any time: a function $h()$ that works on the two hidden states
  • Actual weight: softmax over the raw weights

83

$f_j(t) = h(\mathbf{i}_j, \mathbf{t}_{t-1})$

$\beta_{t,j} = \dfrac{\exp(f_j(t))}{\sum_k \exp(f_k(t))}$

Input to the hidden decoder layer at time $t$: $\sum_j \beta_{t,j}\, \mathbf{i}_j$ (the weights sum to 1.0)

Attention models

[Figure: encoder-decoder with attention; the decoder has produced “Ich habe einen”]

slide-84
SLIDE 84
  • Typical options for $h()$…

– The parameters ($\mathbf{W}_a$, $\mathbf{w}_a$, and the MLP below) are to be learned

84

$f_j(t) = h(\mathbf{i}_j, \mathbf{t}_{t-1})$

$h(\mathbf{i}_j, \mathbf{t}_{t-1}) = \mathbf{i}_j^\top \mathbf{t}_{t-1}$

$h(\mathbf{i}_j, \mathbf{t}_{t-1}) = \mathbf{i}_j^\top \mathbf{W}_a\, \mathbf{t}_{t-1}$

$h(\mathbf{i}_j, \mathbf{t}_{t-1}) = \mathbf{w}_a^\top \tanh\!\big(\mathbf{W}_a [\mathbf{i}_j ; \mathbf{t}_{t-1}]\big)$

$h(\mathbf{i}_j, \mathbf{t}_{t-1}) = \mathrm{MLP}\big([\mathbf{i}_j ; \mathbf{t}_{t-1}]\big)$

$\beta_{t,j} = \dfrac{\exp(f_j(t))}{\sum_k \exp(f_k(t))}$

Attention models

[Figure: encoder-decoder with attention; the decoder has produced “Ich habe einen”]
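
A NumPy sketch of these scoring options and the softmax that turns the raw scores into attention weights. The parameter names `W_a` and `w_a` follow the reconstruction above and are only illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(enc, t_prev, W_a=None, w_a=None, kind="dot"):
    """enc: encoder states i_j as rows; t_prev: decoder state t_{t-1}."""
    if kind == "dot":            # h(i_j, t_{t-1}) = i_j . t_{t-1}
        f = enc @ t_prev
    elif kind == "bilinear":     # h(i_j, t_{t-1}) = i_j^T W_a t_{t-1}
        f = enc @ (W_a @ t_prev)
    else:                        # additive: w_a^T tanh(W_a [i_j ; t_{t-1}])
        concat = np.hstack([enc, np.tile(t_prev, (len(enc), 1))])
        f = np.tanh(concat @ W_a.T) @ w_a
    return softmax(f)            # beta_{t,j}: sums to 1 over input positions

# Context vector for one decoder step: sum_j beta_{t,j} i_j
enc, t_prev = np.random.randn(5, 8), np.random.randn(8)
beta = attention_weights(enc, t_prev, kind="dot")
context = beta @ enc
print(beta.sum(), context.shape)   # ~1.0 and (8,)
```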

slide-85
SLIDE 85
  • Pass the input through the encoder to produce hidden representations $\mathbf{i}_j$

85

[Figure: encoder runs over “I ate an apple <eos>”, producing $\mathbf{i}_1 \dots \mathbf{i}_5$]

Converting an input (forward pass)

slide-86
SLIDE 86
  • Compute weights for the first output

86

What is $\mathbf{t}_0$? Multiple options. Simplest: $\mathbf{t}_0 = \mathbf{i}_N$. If $\mathbf{t}$ and $\mathbf{i}$ are different sizes: $\mathbf{t}_0 = \mathbf{W}_s \mathbf{i}_N$, where $\mathbf{W}_s$ is a learnable parameter

Converting an input (forward pass)

slide-87
SLIDE 87
  • Compute weights (for every $\mathbf{i}_j$) for the first output

87

$\beta_{1,j} = \dfrac{\exp(f_j(1))}{\sum_k \exp(f_k(1))}$, where $f_j(1) = h(\mathbf{i}_j, \mathbf{t}_0)$, e.g. $h(\mathbf{i}_j, \mathbf{t}_0) = \mathbf{i}_j^\top \mathbf{W}_a \mathbf{t}_0$

Converting an input (forward pass)

slide-88
SLIDE 88
  • Compute weights (for every $\mathbf{i}_j$) for the first output
  • Compute the weighted combination of hidden values: $\mathcal{A}_1 = \sum_j \beta_{1,j}\, \mathbf{i}_j$

88

Converting an input (forward pass)

slide-89
SLIDE 89
  • Produce the first output

– Will be a distribution over words

89

[Figure: $\mathcal{A}_1$ and $\mathbf{t}_0$ produce $\mathbf{t}_1$ and the output distribution $\mathbf{Z}_1$]

Converting an input (forward pass)

slide-90
SLIDE 90
  • Produce the first output

– Will be a distribution over words
– Draw a word from the distribution

90

Converting an input (forward pass)
slide-91
SLIDE 91

91

[Figure: the decoder has produced $\mathbf{Z}_1$ and drawn “Ich”]

$\beta_{2,j} = \dfrac{\exp(f_j(2))}{\sum_k \exp(f_k(2))}$, where $f_j(2) = h(\mathbf{i}_j, \mathbf{t}_1)$, e.g. $h(\mathbf{i}_j, \mathbf{t}_1) = \mathbf{i}_j^\top \mathbf{W}_a \mathbf{t}_1$

slide-92
SLIDE 92

92

  • Compute the weighted combination of hidden values for t=2: $\mathcal{A}_2 = \sum_j \beta_{2,j}\, \mathbf{i}_j$

slide-93
SLIDE 93

93

  • Compute the output at t=2
  • Will be a probability distribution over words

slide-94
SLIDE 94

94

  • Draw a word from the output distribution at t=2

slide-95
SLIDE 95

95

  • Compute the weights (for every $\mathbf{i}_j$) for time t=3: $\beta_{3,j} = \dfrac{\exp(f_j(3))}{\sum_k \exp(f_k(3))}$ with $f_j(3) = h(\mathbf{i}_j, \mathbf{t}_2)$, and the combination $\mathcal{A}_3 = \sum_j \beta_{3,j}\, \mathbf{i}_j$

slide-96
SLIDE 96

96

  • Compute the output at t=3
  • Will be a probability distribution over words

slide-97
SLIDE 97

97

  • Draw a word from the distribution

slide-98
SLIDE 98

98

  • Compute the weights for time t=4: $\beta_{4,j} = \dfrac{\exp(f_j(4))}{\sum_k \exp(f_k(4))}$ with $f_j(4) = h(\mathbf{i}_j, \mathbf{t}_3)$

slide-99
SLIDE 99

[Figure: the process continues through $\mathbf{t}_4, \mathbf{t}_5, \mathbf{t}_6$, producing “einen apfel gegessen <eos>”]

99

  • As before, the objective of drawing: Produce the most likely output (that ends in an <eos>)

$\operatorname*{argmax}_{P_1, \dots, P_L} \; z_1^{P_1} z_2^{P_2} \cdots z_L^{P_L}$

  • Simply selecting the most likely symbol at each time may result in suboptimal output
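
A sketch that ties slides 85-99 into one loop, under stated assumptions: NumPy, a plain tanh recurrence, and the bilinear score from slide 84 (real systems typically use GRU/LSTM cells and beam search rather than sampling).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_and_decode(enc, params, embed, eos_id, max_len=20):
    """enc: encoder states i_j (N x d); embed: output-word embedding table (V x e)."""
    W_a, W_rec, W_in, W_ctx, W_out = params
    t = enc[-1]                                    # t_0: simplest choice, the last encoder state
    y = None                                       # no previous output word yet
    words = []
    for _ in range(max_len):
        beta = softmax(enc @ (W_a @ t))            # weights over input positions
        A = beta @ enc                             # context A_t = sum_j beta_{t,j} i_j
        y_emb = embed[y] if y is not None else np.zeros(embed.shape[1])
        t = np.tanh(W_rec @ t + W_in @ y_emb + W_ctx @ A)   # new decoder state
        probs = softmax(W_out @ t)                 # distribution over output words
        y = int(rng.choice(len(probs), p=probs))   # draw a word
        words.append(y)
        if y == eos_id:                            # stop at <eos>
            break
    return words
```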

slide-100
SLIDE 100
  • The key component of this model is the attention weight

– It captures the relative importance of each position in the input to the current output

100

[Figure: the first two decoder steps with attention, producing “Ich”]

What does the attention learn?

$\beta_{1,j} = \dfrac{\exp(f_j(1))}{\sum_k \exp(f_k(1))}$, where $f_j(1) = h(\mathbf{i}_j, \mathbf{t}_0)$, e.g. $h(\mathbf{i}_j, \mathbf{t}_0) = \mathbf{i}_j^\top \mathbf{W}_a \mathbf{t}_0$

slide-101
SLIDE 101

Context vector (input to the decoder): $\mathbf{A}_t = \sum_{j=1}^{N} \beta_{t,j}\, \mathbf{h}_j$

Mixture weights: $\beta_{t,j} = \dfrac{\exp(f_{t,j})}{\sum_{k=1}^{N} \exp(f_{t,k})}$

Alignment score (how well do input words near $j$ match output words at position $t$): $f_{t,j} = a(\mathbf{t}_{t-1}, \mathbf{h}_j)$

The alignment model $a$ is a feedforward neural network which is jointly trained with all the other components of the proposed system

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015

Attention for Translation

101

slide-102
SLIDE 102

102

[Plot of $\beta_{t,j}$ over output position $t$ and input position $j$. Color shows the value (white is larger). Note how the most important input words for any output word get automatically highlighted. The general trend is somewhat linear because word order is roughly similar in both languages.]

“Alignments” example: Bahdanau et al.

slide-103
SLIDE 103

Attention Effect in machine translation

103

Bahdanau et al. "Neural Machine Translation by Jointly Learning to Align and Translate", 2014

slide-104
SLIDE 104

Training the network

  • We have seen how a trained network can be used to compute outputs

– Convert one sequence to another

  • Let’s consider training…

104

slide-105
SLIDE 105

105

  • Given training input (source sequence, target sequence) pairs
  • Forward pass: Pass the input sequence through the encoder and the target sequence through the decoder
  • At each time the output is a probability distribution over words

[Figure: attention-based encoder-decoder over “I ate an apple <eos>” / “Ich habe einen apfel gegessen”; at each decoder step the output is a distribution $z_t^w$ over the target vocabulary]

slide-106
SLIDE 106

106

  • Backward pass: Compute a loss between the target output and the output distributions
  • Backpropagate derivatives through the network

[Figure: a Loss term at each decoder output against “Ich habe einen apfel gegessen <eos>”]

Backpropagation also updates the parameters of the “attention” function $h()$

slide-107
SLIDE 107

Various extensions

  • Attention: Local attention vs. global attention

– E.g. “Effective Approaches to Attention-based Neural Machine Translation”, Luong et al., 2015
– Other variants

  • Bidirectional processing of the input sequence

– Bidirectional networks in the encoder
– E.g. “Neural Machine Translation by Jointly Learning to Align and Translate”, Bahdanau et al. 2016

107

slide-108
SLIDE 108

Attention for Translation

From Y. Bengio CVPR 2015 Tutorial

Bidirectional encoder RNN Decoder RNN Attention Model

108

slide-109
SLIDE 109

109

  • Teacher forcing: Occasionally pass the system output as input during training
  • The “Gumbel noise” trick: Making drawing from a distribution differentiable

Tricks of the trade…

[Figure: attention-based encoder-decoder; the decoder inputs during training and the per-output Loss terms against “Ich habe einen apfel gegessen <eos>” are shown]

slide-110
SLIDE 110

Some impressive results…

  • Attention-based models are currently responsible for the state of the

art in many sequence-conversion systems

– Machine translation

  • Input: Speech in source language
  • Output: Speech in target language

– Speech recognition

  • Input: Speech audio feature vector sequence
  • Output: Transcribed word or character sequence

110

slide-111
SLIDE 111

Other Applications

  • Ba et al 2014, Visual attention for recognition
  • Mnih et al 2014, Visual attention for recognition
  • Chorowski et al, 2014, Speech recognition
  • Graves et al 2014, Neural Turing machines
  • Yao et al 2015, Video description generation
  • Vinyals et al, 2015, Conversational Agents
  • Xu et al 2015, Image caption generation
  • Xu et al 2015, Visual Question Answering

111

slide-112
SLIDE 112
  • “Show, attend, and tell: Neural image caption generation with visual attention”, Xu et al., 2016
  • Encoder network is a convolutional neural network

– Filter outputs at each location are the equivalent of $\mathbf{i}_j$ in the regular sequence-to-sequence model

112

Attention models in image captioning

slide-113
SLIDE 113

113

Recap: Training of Image Captioning Networks

  • Training: Given several (Image, Caption) pairs
  • The image network is pretrained on a large corpus, e.g. ImageNet
  • Forward pass: Produce output distributions given the image and caption
  • Backward pass: Compute the loss w.r.t. the training caption, and backpropagate derivatives

[Figure: CNN image features feed a decoder trained on “<sos> A boy on a surfboard <eos>”, with a Loss at each output]

slide-114
SLIDE 114

Image Captioning with Attention

114

slide-115
SLIDE 115

Recall: RNN for Captioning

[Figure: Image (H x W x 3) → CNN → feature vector (D) → initial hidden state $h_0$; the RNN then produces hidden states $h_1, h_2, \dots$ and distributions $d_1, d_2, \dots$ over the vocabulary, emitting the first word, second word, …]

The RNN only looks at the whole image once.

What if the RNN looks at different parts of the image at each timestep?

115

slide-116
SLIDE 116

Soft Attention for Captioning

[Figure: Image (H x W x 3) → CNN → grid of features (L x D)]

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

116

SLIDES 117-124

Soft Attention for Captioning

[Figure, built up over slides 117-124: Image (H x W x 3) → CNN → grid of features (L x D). From the hidden state $h_0$ the model produces a distribution $a_1$ over the L locations; the weighted combination of features gives $z_1$ (D-dimensional); $h_1$ then outputs both a distribution $d_1$ over the vocabulary (the first word $y_1$) and the next attention distribution $a_2$; the process repeats for $z_2, h_2, d_2, y_2, a_3, \dots$]

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

124

slide-125
SLIDE 125

Soft vs Hard Attention

CNN

Image: H x W x 3 Grid of features (Each D-dimensional)

Grid cells a, b, c, d with attention weights pa, pb, pc, pd

Distribution over grid locations (from the RNN): pa + pb + pc + pd = 1

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

125

slide-126
SLIDE 126

Soft vs Hard Attention

CNN

Image: H x W x 3 Grid of features (Each D-dimensional)

Grid cells a, b, c, d with attention weights pa, pb, pc, pd

Distribution over grid locations (from the RNN): pa + pb + pc + pd = 1. Context vector z (D-dimensional)

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

126

slide-127
SLIDE 127

Soft vs Hard Attention

CNN

Image: H x W x 3 Grid of features (Each D-dimensional)

Grid cells a, b, c, d with attention weights pa, pb, pc, pd

Distribution over grid locations (from the RNN): pa + pb + pc + pd = 1. Context vector z (D-dimensional). Soft attention: summarize ALL locations, z = pa·a + pb·b + pc·c + pd·d. The derivative dz/dp is nice! Train with gradient descent

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

127

slide-128
SLIDE 128

Soft Attention for Captioning

The model wants to attend to the salient parts of an image while generating its caption. [Figure: hard attention vs. soft attention maps over example images]

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

128

slide-129
SLIDE 129

Soft vs Hard Attention Models

Hard attention:

  • Attend to a single input location among the set of locations.
  • Can’t use gradient descent.
  • Need reinforcement learning or …

Soft attention:

  • Compute a weighted combination (attention) over some inputs

using an attention network.

  • Can use backpropagation to train end-to-end.

129

slide-130
SLIDE 130

Soft vs Hard Attention

CNN

Image: H x W x 3 Grid of features (Each D-dimensional)

Grid cells a, b, c, d with attention weights pa, pb, pc, pd

Distribution over grid locations (from the RNN): pa + pb + pc + pd = 1. Context vector z (D-dimensional). Soft attention: summarize ALL locations, z = pa·a + pb·b + pc·c + pd·d; the derivative dz/dp is nice, so train with gradient descent. Hard attention: sample ONE location according to p and set z to that feature vector; with argmax, dz/dp is zero almost everywhere, so we can’t use gradient descent and need reinforcement learning

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

130
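
A NumPy sketch contrasting the two summaries on this slide; the CNN that produces the L x D feature grid and the RNN that produces the raw scores are outside the snippet.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attend(features, scores):
    """features: L x D grid features; scores: raw attention scores from the RNN."""
    p = softmax(scores)             # p_a + p_b + ... = 1 over the L locations
    z = p @ features                # soft attention: summarize ALL locations
    return z, p                     # differentiable w.r.t. p: trainable by gradient descent

def hard_attend(features, scores, rng=np.random.default_rng(0)):
    p = softmax(scores)
    loc = int(rng.choice(len(p), p=p))   # hard attention: sample ONE location
    return features[loc], loc            # not differentiable w.r.t. p (needs RL / REINFORCE)

feats = np.random.randn(4, 8)            # L=4 grid cells (a, b, c, d), D=8 features each
z, p = soft_attend(feats, np.array([1.0, 0.2, -0.5, 0.3]))
print(p.round(2), z.shape)               # distribution over the 4 cells, (8,)
```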

slide-131
SLIDE 131

Soft Attention for Captioning

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

131

slide-132
SLIDE 132

Visual Question Answering

132

slide-133
SLIDE 133

Visual Question Answering: RNNs with Attention

133

slide-134
SLIDE 134

In closing

  • Have looked at various forms of sequence-to-sequence models
  • Generalizations of recurrent neural network formalisms
  • For more details, please refer to papers

134

slide-135
SLIDE 135

Resources

  • Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015.

  • Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015.

135