[Figure: encoder-decoder network translating "I ate an apple <eos>" into "Ich habe einen apfel gegessen <eos>"]
• We will illustrate with a single hidden layer, but the discussion generalizes to more layers
The “simple” translation model
[Figure: the ENCODER reads "I ate an apple <eos>"; the DECODER produces "Ich habe einen apfel gegessen <eos>"]
The “simple” translation model
[Figure: the same network, with embedding/projection layers shown between the words and the recurrent layer]
• A more detailed look: The one-hot word representations may be compressed via embeddings
– Embeddings will be learned along with the rest of the net
– In the following slides we will not represent the projection matrices
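To make the embedding step concrete, here is a minimal sketch (not from the slides; the sizes are assumed): a learned embedding is simply a matrix lookup that replaces the explicit product of a one-hot vector with a projection matrix.

```python
# Minimal sketch: an embedding lookup is equivalent to one-hot x projection matrix.
import numpy as np

V, d = 10000, 256                 # vocabulary size, embedding dimension (assumed values)
E = np.random.randn(V, d) * 0.01  # embedding matrix, learned along with the rest of the net

def embed(word_index: int) -> np.ndarray:
    # Same result as one_hot(word_index, V) @ E, without forming the one-hot vector
    return E[word_index]

one_hot = np.zeros(V); one_hot[42] = 1.0
assert np.allclose(one_hot @ E, embed(42))
```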
Training the system
[Figure: the network is fed "I ate an apple <eos>" followed by "Ich habe einen apfel gegessen"]
• Must learn to make predictions appropriately
– Given "I ate an apple <eos>", produce "Ich habe einen apfel gegessen <eos>"
Training: Forward pass
[Figure: the network reads "I ate an apple <eos>" and then "Ich habe einen apfel gegessen", producing output distributions $Y_1, \ldots, Y_6$]
• Forward pass: Input the source and target sequences, sequentially
– The output will be a probability distribution over the target symbol set (vocabulary)
Training: Backward pass
[Figure: a loss is computed between each output distribution $Y_t$ and the corresponding target word of "Ich habe einen apfel gegessen <eos>"]
• Backward pass: Compute the loss between the output distributions and the target word sequence
• Backpropagate the derivatives of the loss through the network to learn the net
Training: Backward pass
• In practice, if we apply SGD, we may randomly sample words from the output to actually use for the backprop and update
– Typical usage: Randomly select one word from each input training instance (comprising an input-output pair)
For each iteration:
• Randomly select a training instance: (input, output)
• Forward pass
• Randomly select a single output y(t) and the corresponding desired output d(t) for backprop
• Backpropagate the error at that output and update the parameters
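A minimal, self-contained sketch of this sampled-loss idea (toy sizes and random stand-in values, not the lecture's code): given the decoder's per-step distributions under teacher forcing, the loss can be taken over every target position, or over a single randomly sampled position as on the slide.

```python
# Minimal sketch: full per-position cross-entropy vs. loss at one sampled position.
import numpy as np

rng = np.random.default_rng(0)

V, L = 8, 6                                  # toy vocabulary size and target length (assumed)
logits = rng.normal(size=(L, V))             # stand-in for the decoder outputs at each step
target = rng.integers(0, V, size=L)          # desired word indices, e.g. "Ich habe einen ..."

probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax per step

full_loss = -np.log(probs[np.arange(L), target]).mean()   # loss over every target position

t = rng.integers(0, L)                                     # sample a single position ...
sampled_loss = -np.log(probs[t, target[t]])                # ... and backprop only its loss
print(full_loss, sampled_loss)
```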
Trick of the trade: Reversing the input
[Figure: the same network, with the source sentence "I ate an apple <eos>" fed to the encoder in reverse order]
• Standard trick of the trade: The input sequence is fed in reverse order
– Things work better this way
• This happens both for training and during the actual decode
Overall training
• Given several training instances $(\mathbf{X}, \mathbf{Y}_{target})$
• Forward pass: Compute the output of the network for $(\mathbf{X}, \mathbf{Y}_{target})$, with the input in reverse order
– Note: both $\mathbf{X}$ and $\mathbf{Y}_{target}$ are used in the forward pass
• Backward pass: Compute the loss between the desired target $\mathbf{Y}_{target}$ and the actual output $\mathbf{Y}$
– Propagate the derivatives of the loss for updates
• Notation: in the following slides the input $\mathbf{X}$ is written as $I_1, \ldots, I_N$ and the target $\mathbf{Y}_{target}$ as $O_1, \ldots, O_L$
What the network actually produces
[Figure: at each decoder step the network outputs a probability distribution over the target vocabulary; the word drawn at each step ("Ich", "habe", "einen", ...) is fed back as the input at the next step]
• At each time $t$ the network actually produces a probability distribution over the output vocabulary
– $y_t^w = P(O_t = w \mid O_{t-1}, \ldots, O_1, I_1, \ldots, I_N)$
– The probability given the entire input sequence $I_1, \ldots, I_N$ and the partial output sequence $O_1, \ldots, O_{t-1}$ until $t$
• At each time a word is drawn from the output distribution
• The drawn word is provided as input to the next time
Generating an output from the net
• At each time the network produces a probability distribution over words, given the entire input and the previous outputs
• At each time a word is drawn from the output distribution
• The drawn word is provided as input to the next time
• The process continues until an <eos> is generated
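A minimal sketch of this generation loop (not the lecture's code): `decoder_step` and the `<sos>` start token are assumed placeholders for however the trained network exposes one step of the decoder.

```python
# Minimal sketch: draw a word at each step and feed it back until <eos> appears.
import numpy as np

rng = np.random.default_rng(0)

def generate(decoder_step, state, vocab, max_len=50):
    """decoder_step(state, word) -> (new_state, probs over vocab) is an assumed interface."""
    word, output = "<sos>", []
    for _ in range(max_len):
        state, probs = decoder_step(state, word)   # distribution over the vocabulary
        word = rng.choice(vocab, p=probs)          # draw a word from the distribution
        if word == "<eos>":
            break
        output.append(word)
    return output
```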
The probability of the output
• $P(O_1, \ldots, O_L \mid I_1, \ldots, I_N) = y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$
• The objective of drawing: Produce the most likely output (that ends in an <eos>)
– $\underset{O_1, \ldots, O_L}{\mathrm{argmax}}\ y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$
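In code, that sequence probability is just the product of the per-step conditional probabilities (a tiny illustration with made-up numbers; in practice log-probabilities are summed to avoid underflow):

```python
# Tiny sketch: sequence probability as a product of per-step probabilities y_t^{O_t}.
import math

step_probs = [0.4, 0.7, 0.9, 0.5]                     # assumed values of y_t^{O_t}
seq_prob = math.prod(step_probs)                      # P(O_1..O_L | I_1..I_N)
seq_logprob = sum(math.log(p) for p in step_probs)    # numerically safer form
print(seq_prob, math.exp(seq_logprob))
```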
The probability of the output
• Objective: $\underset{O_1, \ldots, O_L}{\mathrm{argmax}}\ y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$
• Cannot just pick the most likely symbol at each time
– That may cause the distribution to be more "confused" at the next time
– Choosing a different, less likely word could cause the distribution at the next time to be more peaky, resulting in a more likely output overall
Greedy is not good
[Figure: two possible distributions $P(O_3 \mid O_1, O_2, I_1, \ldots, I_N)$ over the vocabulary $w_1, \ldots, w_V$ at t=3, depending on the word chosen at t=2]
• Hypothetical example (from English speech recognition: the input is speech, the output must be text)
• "Nose" has the highest probability at t=2 and is selected
– The model is very confused at t=3 and assigns low probabilities to many words at the next time
– Selecting any of these will result in a low probability for the entire 3-word sequence
• "Knows" has slightly lower probability than "nose", but is still high; if it is selected instead:
– "He knows" is a reasonable beginning, and the model assigns high probabilities to words such as "something"
– Selecting one of these results in a higher overall probability for the 3-word sequence
Greedy is not good
[Figure: the distribution $P(O_2 \mid O_1, I_1, \ldots, I_N)$ at t=2, with "nose" and "knows" marked. What should we have chosen at t=2? Will selecting "nose" continue to have a bad effect into the distant future?]
• Problem: Impossible to know a priori which word leads to the more promising future
– Should we draw "nose" or "knows"?
– The effect may not be obvious until several words down the line
– Or the choice of the wrong word early may cumulatively lead to a poorer overall score over time
Greedy is not good
[Figure: the distribution $P(O_1 \mid I_1, \ldots, I_N)$ at t=1. Choose "the" or "he"?]
• Problem: Impossible to know a priori which word leads to the more promising future
– Even earlier: Choosing the lower-probability "the" instead of "he" at t=1 may have made a choice of "nose" more reasonable at t=2
• In general, making a poor choice at any time commits us to a poor future
– But we cannot know at that time that the choice was poor
• Solution: Don't choose..
Solution: Multiple choices
[Figure: the decoder is forked after the first step, once for each candidate first word: "I", "He", "We", "The", ...]
• Retain both choices and fork the network
– With every possible word as input
Problem: Multiple choices
[Figure: the tree of forked networks grows at every step]
• Problem: This will blow up very quickly
– For an output vocabulary of size $V$, after $T$ output steps we'd have forked out $V^T$ branches
Solution: Prune (beam search)
[Figure: at each step the tree of candidate word sequences ("I", "He", "We", "The", ...; then "Knows", "Nose", ...) is pruned to the best-scoring branches]
• Solution: Prune
– At each time, retain only the top K scoring forks
• Note: the score is based on the product of the word probabilities along each fork
– At t=1: $\mathrm{Top}_K\ P(O_1 \mid I_1, \ldots, I_N)$
– At t=2: $\mathrm{Top}_K\ P(O_2, O_1 \mid I_1, \ldots, I_N) = \mathrm{Top}_K\ P(O_2 \mid O_1, I_1, \ldots, I_N)\, P(O_1 \mid I_1, \ldots, I_N)$
– In general, at time $t$: $\mathrm{Top}_K\ \prod_{i=1}^{t} P(O_i \mid O_1, \ldots, O_{i-1}, I_1, \ldots, I_N)$
Terminate (beam search)
[Figure: one path in the beam ends in <eos>]
• Terminate
– When the current most likely path overall ends in <eos>
• Or continue producing more outputs (each of which terminates in <eos>) to get N-best outputs
Termination: <eos> (beam search)
[Figure: example beam with K = 2; different paths end in <eos> at different depths]
• Terminate
– Paths cannot continue once they output an <eos>
• So paths may be of different lengths
• Select the most likely sequence ending in <eos> across all terminating sequences
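A compact sketch of the pruning-and-termination logic just described (not the lecture's code): `step(prefix)` is an assumed function that returns the model's distribution {word: P(word | prefix, input)} for the next position, and `<sos>` / `<eos>` are assumed marker tokens.

```python
# Minimal beam search sketch: keep the top-K prefixes by product of probabilities,
# stop when the best finished (<eos>-terminated) path beats every live path.
import heapq

def beam_search(step, K=2, max_len=50):
    beams = [(1.0, ["<sos>"])]                    # (path probability, word sequence)
    finished = []                                 # hypotheses that already produced <eos>
    for _ in range(max_len):
        candidates = []
        for prob, prefix in beams:
            for word, p in step(prefix).items():  # extend every live path by every word
                cand = (prob * p, prefix + [word])
                (finished if word == "<eos>" else candidates).append(cand)
        beams = heapq.nlargest(K, candidates, key=lambda c: c[0])   # prune to top K
        if not beams or (finished and max(finished)[0] >= beams[0][0]):
            break                                 # best finished path beats all live paths
    return max(finished)[1] if finished else beams[0][1]
```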
Applications
• Machine translation
– "My name is Tom" → "Ich heisse Tom" / "Mein Name ist Tom"
• Dialog
– "I have a problem" → "How may I help you"
• Image to text
– Picture → Caption for the picture
Machine translation example
[Figure: 2-D projection of the encoder hidden states; sentences with similar meanings cluster together]
• Hidden state clusters by meaning!
"Sequence to Sequence Learning with Neural Networks", Sutskever, Vinyals and Le, 2014
Machine translation example
[Figure: example results from the paper]
"Sequence to Sequence Learning with Neural Networks", Sutskever, Vinyals and Le, 2014
Human-machine conversation: Example
[Figure: sample exchanges with the trained model]
• Trained on human-human conversations
• Task: Human text in, machine response out
"A Neural Conversational Model", Oriol Vinyals and Quoc Le, 2015
A problem with this framework
[Figure: the entire source sentence "I ate an apple <eos>" is summarized by the single hidden state at the end of the input before decoding begins]
• All the information about the input sequence is embedded into a single vector
– The "hidden" node layer at the end of the input sequence
– This one node is "overloaded" with information
• Particularly if the input is long
A problem with this framework
• In reality: All hidden values carry information
– Some of which may be diluted downstream
• Different outputs are related to different inputs
– Recall the input and output may not be in sequence
– We have no way of knowing a priori which input must connect to what output
Variants
• A better model: The encoded input embedding is provided as input to all output timesteps
[Figure: the translation example, and a second example generating "A boy on a surfboard <eos>", both with the encoded input fed into every decoder step]
A problem with this framework
[Figure: every encoder hidden state connected to every decoder step]
• Connecting everything to everything is infeasible
– Variable-sized inputs and outputs
– Overparametrized
– The connection pattern ignores the actual, asynchronous dependence of the output on the input
Solution: Attention models
[Figure: the encoder (producing hidden states $h_i$ from "I ate an apple <eos>") drawn separately from the decoder]
• Separating the encoder and decoder in the illustration
Attention models
[Figure: each decoder state $s_t$ receives a weighted combination of all the encoder hidden states $h_i$]
• Compute a weighted combination of all the hidden outputs into a single vector
– Input to the hidden decoder layer at time $t$: $\sum_i w_{t,i}\, h_i$
– The weights $w_{t,i}$ are scalars
– Note: the weights vary with the output time
Attention instead of simple encoder-decoder
• Encoder-decoder models:
– need to be able to compress all the necessary information of a source sentence into a fixed-length vector
– performance deteriorates rapidly as the length of an input sentence increases
• Attention avoids this by:
– allowing the RNN generating the output to focus on the hidden states (generated by the first RNN) as they become relevant
Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
Soft Attention for Translation
[Figure: "I love coffee" → "Me gusta el café"; at each output step, a distribution over the input words]
• An RNN can attend over the output of another RNN
• At every time step, it focuses on different positions in the other RNN
Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
Solution: Attention models
[Figure: the input to the hidden decoder layer at time $t$ is $\sum_i w_{t,i}\, h_i$, with $w_{t,i} = a(h_i, s_{t-1})$]
• Require a time-varying weight that specifies the relationship of the output time to the input time
– The weights are functions of the current output state
Attention models
[Figure: for each output word ("Ich", "habe", "einen", ...), the weights over the input positions sum to 1.0]
• The weights are a distribution over the input
– They must automatically highlight the most important input components for any output
Attention models
• "Raw" weight at any time: a function $g()$ that works on the two hidden states
– $e_{t,i} = g(h_i, s_{t-1})$
• Actual weight: a softmax over the raw weights
– $w_{t,i} = \dfrac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}$
Attention models
• Typical options for $g()$:
– $g(h_i, s_{t-1}) = h_i^\top s_{t-1}$
– $g(h_i, s_{t-1}) = h_i^\top W_g\, s_{t-1}$
– $g(h_i, s_{t-1}) = v_g^\top \tanh\!\left(W_g\, [h_i, s_{t-1}]\right)$
– $g(h_i, s_{t-1}) = \mathrm{MLP}\!\left([h_i, s_{t-1}]\right)$
– The parameters ($W_g$, $v_g$, the MLP weights) are to be learned
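A minimal numeric sketch of these computations for one decoder step (toy sizes and random untrained parameters, not the lecture's code), using the bilinear option $g(h_i, s_{t-1}) = h_i^\top W_g s_{t-1}$ from the list above:

```python
# Minimal sketch: raw scores -> softmax attention weights -> context vector.
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 16                              # number of encoder states, state size (assumed)
H = rng.normal(size=(N, d))               # encoder hidden states h_1 ... h_N (rows)
s_prev = rng.normal(size=d)               # previous decoder state s_{t-1}
W_g = rng.normal(size=(d, d)) * 0.1       # learned score matrix (random stand-in here)

e = H @ W_g @ s_prev                      # raw scores e_{t,i} = h_i^T W_g s_{t-1}
w = np.exp(e - e.max()); w /= w.sum()     # softmax -> attention weights, sum to 1
c = w @ H                                 # context vector c_t = sum_i w_{t,i} h_i
print(w, c.shape)
```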
Converting an input (forward pass)
[Figure: the encoder produces hidden states $h_i$ from "I ate an apple <eos>"]
• Pass the input through the encoder to produce the hidden representations $h_i$
Converting an input (forward pass)
• What is the initial decoder state $s_0$? Multiple options
– Simplest: $s_0 = h_N$ (the final encoder state)
– If $s$ and $h$ are of different sizes: $s_0 = W_s h_N$, where $W_s$ is a learnable parameter
• Compute the weights for the first output
Converting an input (forward pass)
• Compute the weights (for every $h_i$) for the first output:
– $e_{1,i} = g(h_i, s_0)$, e.g. $g(h_i, s_0) = h_i^\top W_g\, s_0$
– $w_{1,i} = \dfrac{\exp(e_{1,i})}{\sum_j \exp(e_{1,j})}$
• Compute the weighted combination of the hidden values:
– $c_1 = \sum_i w_{1,i}\, h_i$
Converting an input (forward pass)
• Produce the first output $Y_1$
– It will be a distribution over words
– Draw a word from the distribution
[Figure: the drawn word ("Ich", then "habe", then "einen", ...) is fed back into the decoder at each step]
• The process repeats for each subsequent output time $t = 2, 3, 4, \ldots$:
– Compute the weights for every input position: $e_{t,i} = g(h_i, s_{t-1})$, $w_{t,i} = \dfrac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}$
– Compute the weighted combination $c_t = \sum_i w_{t,i}\, h_i$
– Compute the output at time $t$: a probability distribution over words
– Draw a word from the distribution and provide it as input at the next time
[Figure: the complete decode produces "Ich habe einen apfel gegessen <eos>"]
• As before, the objective of drawing: Produce the most likely output (that ends in an <eos>)
– $\underset{O_1, \ldots, O_L}{\mathrm{argmax}}\ y_1^{O_1} y_2^{O_2} \cdots y_L^{O_L}$
• Simply selecting the most likely symbol at each time may result in a suboptimal output
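Putting the forward pass above together, here is a self-contained toy sketch (random untrained parameters; the state-update equation, output projection and special token ids are assumptions, not the lecture's exact equations):

```python
# Toy attention decoding loop: attend, update the decoder state, emit, draw, repeat.
import numpy as np

rng = np.random.default_rng(0)
V, d, N = 12, 8, 5                          # vocab size, state size, source length (assumed)
H = rng.normal(size=(N, d))                 # encoder states h_1 ... h_N
E = rng.normal(size=(V, d)) * 0.1           # word embeddings for the decoder input
W_g = rng.normal(size=(d, d)) * 0.1         # attention score matrix
W_r = rng.normal(size=(d, 3 * d)) * 0.1     # decoder state-update weights (assumed form)
W_o = rng.normal(size=(V, d)) * 0.1         # output projection to the vocabulary
EOS, SOS = 0, 1                             # assumed special token ids

s, word, out = H[-1], SOS, []               # s_0 = h_N; start token as the first input
for _ in range(20):
    e = H @ W_g @ s                         # raw scores e_{t,i} = g(h_i, s_{t-1})
    w = np.exp(e - e.max()); w /= w.sum()   # attention weights over the input
    c = w @ H                               # context c_t = sum_i w_{t,i} h_i
    s = np.tanh(W_r @ np.concatenate([s, c, E[word]]))   # next decoder state (simplified)
    y = np.exp(W_o @ s); y /= y.sum()       # output distribution over the vocabulary
    word = rng.choice(V, p=y)               # draw a word
    if word == EOS:
        break
    out.append(int(word))
print(out)
```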
What does the attention learn?
[Figure: the first decoding step, with the attention weights $w_{1,i}$ over the encoder states]
• The key component of this model is the attention weight
– It captures the relative importance of each position in the input to the current output