Seq2Seq Models and Attention
M. Soleymani
Sharif University of Technology, Spring 2020. Most slides have been adopted from Bhiksha Raj, 11-785, CMU 2019, and some from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017.
– A sequence X_1 … X_N goes in
– A different sequence Y_1 … Y_M comes out
– Speech recognition: Speech goes in, a word sequence comes out
– Machine translation: Word sequence goes in, word sequence comes out
– No synchrony between X and Y
– There may not even be a notion of “alignment”
Example: “I ate an apple” → “Ich habe einen apfel gegessen”
– Or, at a higher level, words
(Figure: an RNN language model unrolled over word inputs, starting from initial state h0)
At each time t the network outputs a distribution over the vocabulary:

    Y(t, i) = P(W_{t+1} = V_i | W_1 … W_t)

and is trained with a cross-entropy loss against the actual next words:

    Loss(Y_target(1 … T), Y(1 … T)) = Σ_t Xent(Y_target(t), Y(t)) = − Σ_t log Y(t, w_{t+1})

i.e., the negative log of the probability assigned to the correct next word.
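To make the loss concrete, here is a minimal NumPy sketch of this computation; the toy distributions and vocabulary size are invented for illustration.

```python
import numpy as np

def sequence_xent(Y, next_word_ids):
    """Loss = -sum_t log Y(t, w_{t+1}).

    Y             : (T, V) array; Y[t] is the predicted distribution over
                    the V-word vocabulary at step t.
    next_word_ids : length-T list of the correct next-word indices.
    """
    T = len(next_word_ids)
    return -sum(np.log(Y[t, next_word_ids[t]]) for t in range(T))

# Toy example: V = 4 words, T = 3 steps.
Y = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.2, 0.6, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.7]])
print(sequence_xent(Y, [0, 1, 3]))  # low loss: correct words got high probability
```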
– V_i is the i-th symbol in the vocabulary
– One-hot vectors
– Outputs an N-valued probability distribution rather than a one-hot vector
The output at each time is a vector of word probabilities (y_t^1, …, y_t^N):

    y_t^i = P(W_t = V_i | W_1 … W_{t−1})
The probability that the t-th word in the sequence is the i-th word in the vocabulary given all previous t-1 words
– One-hot vectors
– Outputs an N-valued probability distribution rather than a one-hot vector
– And set it as the next word in the series
– And draw the next word from the output probability distribution
– In some cases, e.g. generating programs, there may be a natural termination
– For text generation we will usually end at an <eos> (end of sequence) symbol
– <eos> indicates the termination of a sequence and occurs only at the final position of a sequence
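A minimal sketch of this generation loop, assuming hypothetical helpers rnn_step (one recurrent update) and output_distribution (softmax over the vocabulary) that stand in for a trained network:

```python
import numpy as np

def generate(rnn_step, output_distribution, h0, sos_id, eos_id, max_len=50):
    """Draw each next word from the model's output distribution and feed it
    back as the next input; stop when <eos> is drawn (or at max_len)."""
    h, word, out = h0, sos_id, []
    for _ in range(max_len):
        h = rnn_step(word, h)                      # advance the recurrent state
        p = output_distribution(h)                 # (V,) probabilities, sums to 1
        word = int(np.random.choice(len(p), p=p))  # sample (argmax = greedy)
        if word == eos_id:                         # <eos> only ends a sequence
            break
        out.append(word)
    return out
```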
– A sequence X_1 … X_N goes in
– A different sequence Y_1 … Y_M comes out
– The output is in a different language.
I ate an apple Ich habe einen apfel gegessen
– Delayed self-referencing sequence-to-sequence
– The hidden activation at the <eos> “stores” all information about the sentence
– A second RNN then uses this hidden activation as its initial state to produce a sequence of outputs
– The output at each time becomes the input at the next time
– Output production continues until an <eos> is produced
I ate an apple <eos> → Ich habe einen apfel gegessen <eos>
(The decoder emits one word at a time, each emitted word is fed back as the next input, and decoding stops at <eos>; a sketch of this loop follows.)
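The full encode-then-decode loop might look like the following sketch; encoder_step and decoder_step are hypothetical one-step RNN updates standing in for the trained networks.

```python
import numpy as np

def translate(encoder_step, decoder_step, src_ids, h0, sos_id, eos_id, max_len=50):
    """Run the encoder over the whole input, hand its final hidden state
    (the stored "summary" of the sentence) to the decoder, then emit one
    word at a time, feeding each output back in until <eos>."""
    h = h0
    for w in src_ids:                  # e.g. ids of "I ate an apple <eos>"
        h = encoder_step(w, h)

    out, word = [], sos_id
    for _ in range(max_len):
        h, p = decoder_step(word, h)   # p: distribution over target words
        word = int(np.argmax(p))       # greedy choice of the next word
        if word == eos_id:
            break
        out.append(word)
    return out                         # e.g. ids of "Ich habe einen apfel gegessen"
```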
– The recurrent networks may have more layers
– Input words are compressed via embeddings
– Embeddings will be learned along with the rest of the net
– In the following slides we will not represent the projection matrices
Training: given “I ate an apple <eos>”, produce “Ich habe einen apfel gegessen <eos>”.
– Output will be a probability distribution over target symbol set (vocabulary)
– The decoder produces a distribution Y_t over the target vocabulary at each step
– Each Y_t is compared against the target word sequence “Ich habe einen apfel gegessen <eos>” with a per-step loss
– The per-step losses are summed for the backprop and update
– Typical usage: Randomly select one word from each input training instance (comprising an input-output pair)
– During training, the decoder input at each time step is the ground-truth previous word rather than the decoder's own (possibly erroneous) prediction; this is commonly called teacher forcing
– Things work better this way
– Note: both the input X and the target output Y_target are used in the forward pass
– Propagate derivatives of loss for updates
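A sketch of one teacher-forced training pass under the same hypothetical encoder_step / decoder_step helpers; in a real implementation the returned loss would be backpropagated by the framework.

```python
import numpy as np

def teacher_forced_loss(encoder_step, decoder_step, src_ids, tgt_ids, h0, sos_id):
    """Forward pass using both the input and the target sequence.

    The decoder input at step t is the ground-truth previous word
    (teacher forcing), not the decoder's own prediction."""
    h = h0
    for w in src_ids:                          # encode "I ate an apple <eos>"
        h = encoder_step(w, h)

    loss, prev = 0.0, sos_id
    for target_word in tgt_ids:                # target ends with <eos>
        h, p = decoder_step(prev, h)           # p: distribution over vocabulary
        loss -= np.log(p[target_word])         # per-step cross-entropy
        prev = target_word                     # feed the ground truth back in
    return loss
```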
What the decoder computes at each step:

    y_l^w = P(O_l = w | O_1, …, O_{l−1}, I_1, …, I_N)

– The probability of word w given the entire input sequence I_1, …, I_N and the partial output sequence O_1, …, O_{l−1} until step l
I ate an apple <eos> → Ich habe einen apfel gegessen <eos>

At every step t the decoder produces a full distribution over the target vocabulary, (y_t^{ich}, y_t^{habe}, y_t^{einen}, y_t^{apfel}, …), conditioned on the input and on the words emitted so far. A word is drawn from this distribution, emitted, and fed back as the next decoder input, step after step, until <eos>.
The most likely output sequence maximizes the product of the stepwise probabilities:

    P(O_1, …, O_L | I_1, …, I_N) = y_1^{O_1} y_2^{O_2} … y_L^{O_L}

    argmax_{O_1, …, O_L}  y_1^{O_1} y_2^{O_2} … y_L^{O_L}
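In practice the product above is evaluated in log space, since multiplying many probabilities underflows; a small sketch (decoder_step again a hypothetical one-step update):

```python
import numpy as np

def sequence_log_prob(decoder_step, h_enc, out_ids, sos_id):
    """log P(O_1 ... O_L | input) = sum_t log y_t^{O_t}."""
    h, prev, logp = h_enc, sos_id, 0.0
    for w in out_ids:
        h, p = decoder_step(prev, h)
        logp += np.log(p[w])          # sum of logs instead of product
        prev = w
    return logp                        # maximize this over candidate sequences
```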
– Greedily emitting the most likely word at each time may cause the distribution to be more “confused” at the next time
– Choosing a different, less likely word could instead make the distribution at the next time more peaky, resulting in a more likely output overall
– The model is very confused at t=3 and assigns low probabilities to many words at the next time; selecting any of these results in a low probability for the entire 3-word sequence
– “he knows” is a reasonable beginning, and the model assigns high probabilities to words such as “something”; selecting one of these results in a higher overall probability for the 3-word sequence
(Figure: two distributions P(O_3 | O_1, O_2, I_1, …, I_N) over the vocabulary w1 … wV at t=3; the choice made at t=2 determines whether the model faces a confused distribution or a promising future.)
– Should we draw “nose” or “knows”? The effect may not be obvious until several words down the line
– The choice of a wrong word early may cumulatively lead to a poorer overall score over time
– What should we have chosen at t=2? Will selecting “nose” continue to have a bad effect into the distant future?
– Even earlier: choosing the lower-probability “the” instead of “he” at t=1 may have made the choice of “nose” more reasonable at t=2
– But we cannot know at that time that the choice was poor
– In principle we could fork the decoder at each step, continuing with every possible word as input (I, He, We, The, …)
– For an output vocabulary of size V, after T output steps we'd have forked out V^T branches, so exhaustive search is intractable
Beam search (a sketch follows this list):
– At each time, retain only the top K scoring forks
– Step 1: retain the top K of P(O_1 | I_1, …, I_N)
– Step 2: retain the top K of P(O_2, O_1 | I_1, …, I_N); note that this score is a product:

    P(O_2, O_1 | I_1, …, I_N) = P(O_2 | O_1, I_1, …, I_N) P(O_1 | I_1, …, I_N)

– In general, at step T retain the top K of

    ∏_{t=1}^{T} P(O_t | O_1, …, O_{t−1}, I_1, …, I_N)

– Terminate when the currently most likely path overall ends in <eos>
– Paths cannot continue once they output an <eos>
– (The example figure has K = 2, with candidates such as “He”, “The”, “Knows”, “Nose”)
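A compact beam-search sketch under the same assumptions (hypothetical decoder_step, log-space scores); it keeps K forks alive and stops when the best overall path has emitted <eos>.

```python
import numpy as np

def beam_search(decoder_step, h_enc, sos_id, eos_id, K=2, max_len=50):
    """Each beam is (log_prob, words, hidden_state, finished)."""
    beams = [(0.0, [sos_id], h_enc, False)]
    for _ in range(max_len):
        candidates = []
        for logp, words, h, done in beams:
            if done:                          # finished paths carry over as-is
                candidates.append((logp, words, h, True))
                continue
            h2, p = decoder_step(words[-1], h)
            for w in np.argsort(p)[-K:]:      # only top-K children can survive
                candidates.append((logp + np.log(p[w]), words + [int(w)],
                                   h2, int(w) == eos_id))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:K]
        if beams[0][3]:                       # best path overall ends in <eos>
            break
    return beams[0][1][1:]                    # drop the initial <sos>
```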
Sequence-to-sequence tasks:
– “My name is Tom” → “Ich heisse Tom” / “Mein Name ist Tom” (translation)
– “I have a problem” → “How may I help you” (dialog)
– Picture → caption for the picture
“Sequence to Sequence Learning with Neural Networks”, Sutskever, Vinyals, and Le, 2014
“A Neural Conversational Model”, Oriol Vinyals and Quoc Le, 2015
Problems with this framework:
– All information about the input sequence must be stored in the “hidden” node layer at the end of the input sequence; this one node is “overloaded” with information
– Some of which may be diluted downstream
– Recall that input and output may not be in sequence; we have no way of knowing a priori which input must connect to what output
– A better model: the encoded input embedding is provided as input at all output timesteps
– Simply connecting the inputs directly to the outputs does not fix the encoder-decoder separation either:
  – Inputs and outputs are variable-sized
  – The model becomes overparametrized
  – The connection pattern ignores the actual asynchronous dependence of outputs on inputs
Attention models
(Figure: encoder states h_1 … h_N over “I ate an apple <eos>”; decoder states s_0, s_1, …, s_6)
– At each output time u, the input to the hidden decoder layer is a weighted sum of all encoder states: Σ_i α_{u,i} h_i
– The weights α_{u,i} are scalars and vary with the output time u
– A fixed-length encoding needs to be able to compress all the necessary information of a source sentence into a single vector; performance deteriorates rapidly as the length of an input sentence increases
– Attention fixes this by allowing the RNN generating the output to focus on the hidden states (generated by the first RNN) as they become relevant
Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015
Example: “I love coffee” → “Me gusta el café”. At each decoder step the attention mechanism computes a distribution over the input words (Bahdanau et al., ICLR 2015).
– The weights are functions of the current output state: at output time u, the input to the hidden decoder layer is Σ_i α_{u,i} h_i
– The weights must automatically highlight the most important input components for each output
– At every output step, the weights over the inputs sum to 1.0
– The weights are a softmax over per-input scores:

    α_{u,i} = exp(e_i(u)) / Σ_j exp(e_j(u)),    where e_i(u) = g(h_i, s_{u−1})
– Typical choices for the score function g (learned variables shown in red on the original slides):

    g(h_i, s_{u−1}) = h_i^T s_{u−1}                      (dot product)
    g(h_i, s_{u−1}) = h_i^T W_g s_{u−1}                  (bilinear, with learned W_g)
    g(h_i, s_{u−1}) = w_g^T tanh(W_g [h_i; s_{u−1}])     (additive, with learned w_g, W_g)
    g(h_i, s_{u−1}) = MLP([h_i; s_{u−1}])                (a small learned MLP)
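The options above can be written directly in NumPy; this is an illustrative sketch, with Wg and wg standing in for the learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(H, s_prev, score="dot", Wg=None, wg=None):
    """alpha_{u,i} = softmax_i(g(h_i, s_{u-1})) for one output time u.

    H: (N, d) encoder states h_1..h_N; s_prev: (d,) decoder state s_{u-1}."""
    if score == "dot":            # g = h_i^T s_{u-1}
        e = H @ s_prev
    elif score == "bilinear":     # g = h_i^T W_g s_{u-1}
        e = H @ (Wg @ s_prev)
    else:                         # g = w_g^T tanh(W_g [h_i; s_{u-1}])
        concat = np.hstack([H, np.tile(s_prev, (len(H), 1))])
        e = np.tanh(concat @ Wg.T) @ wg
    return softmax(e)             # the weights sum to 1.0

H, s = np.random.randn(5, 8), np.random.randn(8)
print(attention_weights(H, s).sum())   # -> 1.0
```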
– What is the initial decoder state s_0? Multiple options:
  – Simplest: s_0 = h_N
  – If s and h have different sizes: s_0 = W_s h_N, where W_s is a learnable parameter
Decoding with attention, step by step:
– At u = 1: compute scores e_i(1) = g(h_i, s_0) = h_i^T W_g s_0, weights α_{1,i} = exp(e_i(1)) / Σ_j exp(e_j(1)), and the attended context A_1 = Σ_i α_{1,i} h_i
– From A_1 the decoder produces state s_1 and an output Y_1, which is a distribution over words
– Draw a word from the distribution (e.g. “Ich”) and feed it back as the next decoder input
– At u = 2: recompute the scores against the new state, e_i(2) = g(h_i, s_1), obtain weights α_{2,i} and context A_2 = Σ_i α_{2,i} h_i, and continue
– Repeat until <eos> is produced, yielding “Ich habe einen apfel gegessen <eos>”
– The decoding objective is unchanged: argmax_{O_1, …, O_L} y_1^{O_1} y_2^{O_2} … y_L^{O_L}
– It captures the relative importance of each position in the input to the current output
Summary of attention (Bahdanau et al., ICLR 2015):

    Context vector (input to decoder):  A_u = Σ_{j=1}^{N} α_{u,j} h_j

    Mixture weights:  α_{u,j} = exp(e_{u,j}) / Σ_{l=1}^{N} exp(e_{u,l})

    Alignment score (how well do input words near position j match output words at position u):  e_{u,j} = a(s_{u−1}, h_j)

– The alignment model a is a feedforward neural network which is jointly trained with all the other components of the proposed system
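Putting the three equations together, one decoding step might look like this sketch; a_score is any alignment callable (a small MLP in the paper) and decoder_cell is a hypothetical RNN cell.

```python
import numpy as np

def attention_step(H, s_prev, a_score, decoder_cell, prev_word_vec):
    """One decoder step: scores -> weights -> context -> new state/output.

    H: (N, d) encoder states h_1..h_N; s_prev: decoder state s_{u-1}."""
    e = np.array([a_score(s_prev, h) for h in H])  # e_{u,j} = a(s_{u-1}, h_j)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                           # mixture weights alpha_{u,j}
    A = alpha @ H                                  # context A_u = sum_j alpha_{u,j} h_j
    s_next, y_dist = decoder_cell(prev_word_vec, A, s_prev)
    return s_next, y_dist, alpha   # alpha is also what the heatmap below plots
```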
(Figure: plot of α_{u,j}; color shows value, white is larger. The most important input words for each output word are automatically highlighted. The general trend is roughly linear because word order is roughly similar in both languages.)
Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015
Training the attention model to convert one sequence to another proceeds exactly as before: the decoder produces a distribution Y_t over the target vocabulary at each step, each Y_t is scored against the target “Ich habe einen apfel gegessen <eos>” with a per-step loss, and the gradients are backpropagated through the decoder, the attention function, and the encoder.
– Backpropagation also updates the parameters of the attention function g()
– Other variants exist, e.g. “Effective Approaches to Attention-based Neural Machine Translation”, Luong et al., 2015
– Bidirectional networks in the encoder, e.g. “Neural Machine Translation by Jointly Learning to Align and Translate”, Bahdanau et al., ICLR 2015
(Figure: bidirectional encoder RNN, decoder RNN, and attention model; from Y. Bengio, CVPR 2015 tutorial)
Attention-based sequence-to-sequence models apply broadly:
– Machine translation
– Speech recognition
– Image captioning: a CNN encodes the image, and its filter outputs at each location play the role of the h_j in the regular sequence-to-sequence model
(Figure: image-captioning decoder trained to produce “A boy … a surfboard <eos>” starting from <sos>, with a distribution over caption words and a loss at every step)
Captioning without attention (cs231n): a CNN encodes the image (H × W × 3) into a single feature vector of size D, which initializes the RNN hidden state h0. The RNN then produces hidden states h1, h2, …, each yielding a distribution (d1, d2, …) from which the first word y1, the second word y2, … are drawn.
– The RNN only looks at the whole image, once
– What if the RNN looks at different parts of the image at each timestep?
Show, Attend and Tell (Xu et al., ICML 2015):
– The CNN now produces a grid of features, L locations × D dimensions, instead of a single vector
– From the initial state h0, the model computes a1, a distribution over the L locations
– The features are combined under a1 into a weighted feature vector z1 of dimension D
– z1, together with the first word y1, drives the RNN to state h1, which yields both the word distribution d1 and the next location distribution a2
– The process repeats: a2 gives z2, which with y2 gives h2, then a3 and d2, and so on
Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Soft attention over the feature grid:
– The grid of features gives vectors a, b, c, d (each D-dimensional)
– From the RNN state, compute a distribution over grid locations: p_a, p_b, p_c, p_d, with p_a + p_b + p_c + p_d = 1
– Soft attention: summarize ALL locations into a context vector z = p_a·a + p_b·b + p_c·c + p_d·d (D-dimensional)
– The derivative dz/dp is nice, so we can train with gradient descent
Hard attention vs. soft attention: the model wants to attend to the salient parts of an image while generating its caption (Xu et al., ICML 2015)
– Soft attention: summarize ALL locations, z = p_a·a + p_b·b + p_c·c + p_d·d; dz/dp is nice, so train with gradient descent
– Hard attention: sample ONE location according to p and set z to that location's feature vector; with argmax, dz/dp is zero almost everywhere, so gradient descent cannot be used and reinforcement learning is needed
Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
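The soft/hard distinction is essentially a one-liner each in NumPy; a sketch with invented toy shapes:

```python
import numpy as np

def soft_attention(features, p):
    """z = p_a*a + p_b*b + ... : a smooth function of p, so dz/dp exists
    everywhere and the model trains with ordinary gradient descent.

    features: (L, D) grid features; p: (L,) distribution over locations."""
    return p @ features                      # (D,) summary of ALL locations

def hard_attention(features, p, rng=np.random.default_rng(0)):
    """Sample ONE location and return its feature vector; the sampling is
    not differentiable, hence REINFORCE-style training in the paper."""
    return features[rng.choice(len(p), p=p)]

feats = np.random.randn(4, 8)                # a, b, c, d: L=4 locations, D=8
p = np.array([0.7, 0.1, 0.1, 0.1])           # from the RNN, sums to 1
print(soft_attention(feats, p).shape)        # (8,)
```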