slide-1
SLIDE 1

Seq2Seq Models and Attention

  • M. Soleymani

Sharif University of Technology, Spring 2020. Most slides have been adopted from Bhiksha Raj, 11-785, CMU 2019, and some from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017.

slide-2
SLIDE 2
  • Problem:

– A sequence $Y_1 \dots Y_N$ goes in
– A different sequence $Z_1 \dots Z_M$ comes out

  • E.g.

– Speech recognition: Speech goes in, a word sequence comes out

  • Alternately the output may be a phoneme or character sequence

– Machine translation: Word sequence goes in, word sequence comes out

  • In general $M \neq N$

– No synchrony between $Y$ and $Z$.

2

Sequence-to-sequence modelling

slide-3
SLIDE 3

Sequence to sequence

  • Sequence goes in, sequence comes out
  • No notion of “synchrony” between input and output

– May not even have a notion of “alignment”

  • E.g. “I ate an apple” → “Ich habe einen apfel gegessen”

3

[Figure: a Seq2seq box maps “I ate an apple” to “Ich habe einen apfel gegessen”]

slide-4
SLIDE 4
  • Sequence goes in, sequence comes out
  • No notion of “synchrony” between input and output

– May not even have a notion of “alignment”

  • E.g. “I ate an apple” → “Ich habe einen apfel gegessen”

4

[Figure: a Seq2seq box maps “I ate an apple” to “Ich habe einen apfel gegessen”]

Today

slide-5
SLIDE 5

Recap: Predicting text

  • Simple problem: Given a series of symbols (characters or words) $w_1 w_2 \dots w_n$, predict the next symbol (character or word) $w_{n+1}$

5

slide-6
SLIDE 6

Recap: Text Modelling

  • Learn a model that can predict the next character given a sequence of characters

– Or, at a higher level, words

  • After observing inputs $x_1 \dots x_k$ it predicts $x_{k+1}$

[Figure: a recurrent network unrolled over inputs $x_1 \dots x_6$, predicting $x_2 \dots x_7$ from initial state $h_0$]

6

slide-7
SLIDE 7
  • Input: symbols as one-hot vectors
  • Dimensionality of the input vector is the size of the “vocabulary”
  • Projected down to lower-dimensional “embeddings”
  • Output: Probability distribution over symbols

$Z(u, i) = Q(W_i \mid x_1 \dots x_u)$

  • Loss

$\mathrm{Loss}\big(\mathbf{Z}_{\mathrm{target}}(1 \dots U), \mathbf{Z}(1 \dots U)\big) = \sum_u \mathrm{Xent}\big(\mathbf{Z}_{\mathrm{target}}(u), \mathbf{Z}(u)\big) = -\sum_u \log Z(u, x_{u+1})$

i.e. the negative log of the probability assigned to the correct next word

7

Recap: Training

$W_i$ is the i-th symbol in the vocabulary
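
The loss above is simply the total negative log-probability the model assigns to the words that actually come next. A minimal NumPy sketch (my own illustration, not code from the lecture), where `Z[u]` plays the role of $Z(u, \cdot)$ and `targets[u]` is the index of the symbol that followed:

```python
import numpy as np

def next_word_loss(Z: np.ndarray, targets: np.ndarray) -> float:
    """Total cross-entropy: -sum_u log Z[u, targets[u]]."""
    u = np.arange(len(targets))
    return float(-np.sum(np.log(Z[u, targets])))

# Toy example: vocabulary of 4 symbols, 3 prediction steps.
Z = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.1, 0.1, 0.7]])
targets = np.array([0, 1, 3])        # the symbols that actually came next
print(next_word_loss(Z, targets))    # -(log 0.7 + log 0.5 + log 0.7)
```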

slide-8
SLIDE 8

Recap: Generating Language or Synthesis

  • On a trained model: Provide the first few words

– One-hot vectors

  • After the last input word, the network generates a probability distribution over words

– Outputs an N-valued probability distribution rather than a one-hot vector

8

$z_t^i = Q(X_t = W_i \mid X_1 \dots X_{t-1})$

The probability that the t-th word in the sequence is the i-th word in the vocabulary given all previous t-1 words

slide-9
SLIDE 9

Recap: Generating Language or Synthesis

  • On a trained model: Provide the first few words

– One-hot vectors

  • After the last input word, the network generates a probability distribution over words

– Outputs an N-valued probability distribution rather than a one-hot vector

  • Draw a word from the distribution

– And set it as the next word in the series

9

$z_t^i = Q(X_t = W_i \mid X_1 \dots X_{t-1})$

The probability that the t-th word in the sequence is the i-th word in the vocabulary given all previous t-1 words

slide-10
SLIDE 10

Recap: Generating Language or Synthesis

  • Feed the drawn word as the next word in the series

– And draw the next word from the output probability distribution

  • Continue this process until we terminate generation

– In some cases, e.g. generating programs, there may be a natural termination

10

$z_t^i = Q(X_t = W_i \mid X_1 \dots X_{t-1})$

The probability that the t-th word in the sequence is the i-th word in the vocabulary given all previous t-1 words

slide-11
SLIDE 11

Recap: Generating Language or Synthesis

  • Feed the drawn word as the next word in the series

– And draw the next word from the output probability distribution

  • Continue this process until we terminate generation

– In some cases, e.g. generating programs, there may be a natural termination

11

$z_t^i = Q(X_t = W_i \mid X_1 \dots X_{t-1})$

The probability that the t-th word in the sequence is the i-th word in the vocabulary given all previous t-1 words

slide-12
SLIDE 12

Recap: Generating Language or Synthesis

  • Feed the drawn word as the next word in the series

– And draw the next word from the output probability distribution

  • Continue this process until we terminate generation

– For text generation we will usually end at an <eos> (end of sequence) symbol

  • The <eos> symbol is a special symbol included in the vocabulary, which indicates the termination of a sequence and occurs only at the final position of sequences

[Figure: the network is run forward, drawing $X_4, X_5, \dots, X_{10}$ one at a time until <eos>]

12
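
A minimal sketch of this synthesis loop (an illustration under assumptions, not the lecture's code): `rnn_step` is a hypothetical stand-in for one step of a trained recurrent language model; any model that returns a new hidden state and a probability distribution over the vocabulary would fit.

```python
import numpy as np

def generate(rnn_step, h0, seed_ids, eos_id, max_len=50, rng=np.random.default_rng(0)):
    # Prime the network with the seed words (seed_ids must be non-empty);
    # keep only the distribution produced after the last seed word.
    h = h0
    for w in seed_ids:
        h, probs = rnn_step(h, w)
    out = []
    for _ in range(max_len):
        w = int(rng.choice(len(probs), p=probs))   # draw a word from the distribution
        out.append(w)
        if w == eos_id:                            # stop once <eos> is drawn
            break
        h, probs = rnn_step(h, w)                  # feed the drawn word back in
    return out
```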

slide-13
SLIDE 13
  • Problem:

– A sequence $Y_1 \dots Y_N$ goes in
– A different sequence $Z_1 \dots Z_M$ comes out

  • Similar to predicting text, but with a difference

– The output is in a different language.

13

[Figure: a Seq2seq box maps “I ate an apple” to “Ich habe einen apfel gegessen”]

Returning to our problem

slide-14
SLIDE 14
  • Delayed sequence to sequence

14

Modelling the problem

slide-15
SLIDE 15
  • Delayed sequence to sequence

– Delayed self-referencing sequence-to-sequence

15

Modelling the problem

slide-16
SLIDE 16
  • The input sequence feeds into a recurrent structure
  • The input sequence is terminated by an explicit <eos> symbol

– The hidden activation at the <eos> “stores” all information about the sentence

  • Subsequently a second RNN uses the hidden activation as initial state to produce a sequence of outputs

– The output at each time becomes the input at the next time
– Output production continues until an <eos> is produced

16

[Figure: encoder reading “I ate an apple <eos>”]

The “simple” translation model

SLIDES 17-21
  • The input sequence feeds into a recurrent structure
  • The input sequence is terminated by an explicit <eos> symbol

– The hidden activation at the <eos> “stores” all information about the sentence

  • Subsequently a second RNN uses the hidden activation as initial state to produce a sequence of outputs

– The output at each time becomes the input at the next time
– Output production continues until an <eos> is produced

21

[Figure, built up over slides 17-21: after “I ate an apple <eos>” the decoder emits “Ich”, “habe”, “einen”, “apfel”, “gegessen”, <eos>, feeding each output back as the next input]

The “simple” translation model

slide-22
SLIDE 22

22

  • We will illustrate with a single hidden layer, but the discussion generalizes to more layers

[Figure: full encoder-decoder unrolled over “I ate an apple <eos>” and “Ich habe einen apfel gegessen <eos>”]

slide-23
SLIDE 23

23

[Figure: the encoder (left, reading “I ate an apple <eos>”) and the decoder (right, producing “Ich habe einen apfel gegessen <eos>”)]

ENCODER DECODER

The “simple” translation model
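
A minimal sketch of this encoder-decoder arrangement. PyTorch is my choice here (the lecture does not prescribe a framework), and the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encoder: the final hidden state "stores" the whole source sentence.
        _, h = self.encoder(self.src_emb(src))
        # Decoder: starts from that state; tgt_in is the target shifted right
        # (teacher forcing during training).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), h)
        return self.out(dec_out)          # logits over the target vocabulary

# Shape check with dummy data: batch of 2, source length 5, target length 6.
model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
logits = model(torch.randint(0, 1000, (2, 5)), torch.randint(0, 1200, (2, 6)))
print(logits.shape)                        # torch.Size([2, 6, 1200])
```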

slide-24
SLIDE 24
  • A more detailed look: The one-hot word representations may be compressed via embeddings

– Embeddings will be learned along with the rest of the net
– In the following slides we will not represent the projection matrices

24

[Figure: each input and output word passes through a projection (embedding) matrix before entering the recurrent layers]

The “simple” translation model

slide-25
SLIDE 25
  • Must learn to make predictions appropriately

– Given “I ate an apple <eos>”, produce “Ich habe einen apfel gegessen <eos>”.

25

[Figure: encoder-decoder over the training pair “I ate an apple <eos>” / “Ich habe einen apfel gegessen <eos>”]

Training the system

slide-26
SLIDE 26
  • Forward pass: Input the source and target sequences, sequentially

– Output will be a probability distribution over target symbol set (vocabulary)

26

[Figure: the source and target sequences are fed in; the decoder produces distributions $Z_1 \dots Z_6$ over the target vocabulary]

Training: Forward pass

slide-27
SLIDE 27
  • Backward pass: Compute the loss between the output distribution and target

word sequence

27

[Figure: a Loss is computed between each output distribution $Z_t$ and the corresponding target word of “Ich habe einen apfel gegessen <eos>”]

Training: Backward pass

slide-28
SLIDE 28
  • Backward pass: Compute the loss between the output distribution and target word

sequence

  • Backpropagate the derivatives of the loss through the network to learn the net

28

[Figure: per-word Loss terms between the output distributions $Z_1 \dots Z_6$ and “Ich habe einen apfel gegessen <eos>”]

Training: Backward pass

slide-29
SLIDE 29
  • In practice, if we apply SGD, we may randomly sample words from the output to actually use for

the backprop and update

– Typical usage: Randomly select one word from each input training instance (comprising an input-output pair)

  • For each iteration
  • Randomly select training instance: (input, output)
  • Forward pass
  • Randomly select a single output y(t) and corresponding desired output d(t) for backprop

29

[Figure: encoder-decoder with per-output Loss terms]

Training: Backward pass

slide-30
SLIDE 30
  • Standard trick of the trade: The input sequence is fed in reverse order

– Things work better this way

30

[Figure: encoder-decoder with per-output Loss terms]

30

Trick of the trade: Reversing the input

slide-32
SLIDE 32
  • Standard trick of the trade: The input sequence is fed in reverse order

– Things work better this way

  • This happens both for training and during actual decode

32

[Figure: encoder-decoder with per-output Loss terms]

Trick of the trade: Reversing the input

slide-33
SLIDE 33

Overall training

  • Given several training instances $(\mathbf{Y}, \mathbf{Z}_{\mathrm{target}})$
  • Forward pass: Compute the output of the network for $(\mathbf{Y}, \mathbf{Z}_{\mathrm{target}})$ with the input in reverse order

– Note, both $\mathbf{Y}$ and $\mathbf{Z}_{\mathrm{target}}$ are used in the forward pass

  • Backward pass: Compute the loss between the desired target $\mathbf{Z}_{\mathrm{target}}$ and the actual output $\mathbf{Z}$

– Propagate derivatives of the loss for updates

  • In the following slides we refer to the input $\mathbf{Y}$ as $J$ and the target $\mathbf{Z}_{\mathrm{target}}$ as $P$

33
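
One possible training step for this recipe, sketched under assumptions: a model with the interface of the Seq2Seq sketch shown earlier, PyTorch, and padded (source, target) index tensors. Here the loss is taken at every output position rather than at a randomly sampled one.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, src, tgt, sos_id, pad_id):
    # Input fed in reverse order (the "trick of the trade" from slides 30-32).
    src_rev = torch.flip(src, dims=[1])
    # Decoder input: <sos> followed by target[:-1]; the target itself is the desired output.
    sos = torch.full((tgt.size(0), 1), sos_id, dtype=tgt.dtype)
    tgt_in = torch.cat([sos, tgt[:, :-1]], dim=1)
    logits = model(src_rev, tgt_in)                      # forward pass
    loss = nn.functional.cross_entropy(                  # Xent at every output step
        logits.reshape(-1, logits.size(-1)), tgt.reshape(-1),
        ignore_index=pad_id)
    optimizer.zero_grad()
    loss.backward()                                      # backward pass
    optimizer.step()
    return float(loss)
```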

slide-34
SLIDE 34
  • At each time $t$ the network actually produces a probability distribution over the output vocabulary

– $z_t^w = Q(z_t = w \mid P_{t-1}, \dots, P_1, J_1, \dots, J_N)$
– The probability given the entire input sequence $J_1, \dots, J_N$ and the partial output sequence $P_1, \dots, P_{t-1}$ until $t$

  • At each time a word is drawn from the output distribution
  • The drawn word is provided as input to the next time

34

[Figure: at $t=1$ the decoder outputs one probability $z_1^{w}$ per word $w$ in the vocabulary]

What the network actually produces

SLIDES 35-42
  • At each time $t$ the network actually produces a probability distribution over the output vocabulary

– $z_t^w = Q(z_t = w \mid P_{t-1}, \dots, P_1, J_1, \dots, J_N)$
– The probability given the entire input sequence $J_1, \dots, J_N$ and the partial output sequence $P_1, \dots, P_{t-1}$ until $t$

  • At each time a word is drawn from the output distribution
  • The drawn word is provided as input to the next time

42

[Figure, built up over slides 35-42: at each step the decoder outputs a distribution $z_t^w$ over the vocabulary, a word is drawn (“Ich”, “habe”, “einen”, “apfel”, “gegessen”) and fed back as the next input, until <eos> is drawn]

What the network actually produces
slide-43
SLIDE 43
  • At each time the network produces a probability distribution over words, given the entire input and previous outputs
  • At each time a word is drawn from the output distribution
  • The drawn word is provided as input to the next time
  • The process continues until an <eos> is generated

43

[Figure: full decode of “I ate an apple <eos>” → “Ich habe einen apfel gegessen <eos>”, with the output distribution $z_t^w$ shown at each step]

Generating an output from the net

slide-44
SLIDE 44

$Q(P_1, \dots, P_L \mid J_1, \dots, J_N) = z_1^{P_1} z_2^{P_2} \cdots z_L^{P_L}$

  • The objective of drawing: Produce the most likely output (that ends in an <eos>)

$\operatorname*{argmax}_{P_1, \dots, P_L} \; z_1^{P_1} z_2^{P_2} \cdots z_L^{P_L}$

44

[Figure: full decode of “I ate an apple <eos>” → “Ich habe einen apfel gegessen <eos>”]

The probability of the output

slide-45
SLIDE 45
  • Cannot just pick the most likely symbol at each time

– That may cause the distribution to be more “confused” at the next time
– Choosing a different, less likely word could cause the distribution at the next time to be more peaky, resulting in a more likely output overall

45

Objective: $\operatorname*{argmax}_{P_1, \dots, P_L} \; z_1^{P_1} z_2^{P_2} \cdots z_L^{P_L}$

The probability of the output

slide-46
SLIDE 46
  • Hypothetical example (from English speech recognition: Input is speech, output must be text)
  • “Nose” has the highest probability at t=2 and is selected

– The model is very confused at t=3 and assigns low probabilities to many words at the next time
– Selecting any of these will result in low probability for the entire 3-word sequence

  • “Knows” has slightly lower probability than “nose”, but is still high and is selected

– “he knows” is a reasonable beginning and the model assigns high probabilities to words such as “something”
– Selecting one of these results in higher overall probability for the 3-word sequence

46

[Figure: two bar plots of $Q(P_3 \mid P_1, P_2, J_1, \dots, J_N)$ over the vocabulary, after selecting “nose” vs. after selecting “knows”]

Greedy is not good

slide-47
SLIDE 47
  • Problem: Impossible to know a priori which word leads to the more promising future

– Should we draw “nose” or “knows”?
– The effect may not be obvious until several words down the line
– Or the choice of the wrong word early may cumulatively lead to a poorer overall score over time

47

[Figure: bar plot of $Q(P_2 \mid P_1, J_1, \dots, J_N)$, with “nose” and “knows” highlighted]

What should we have chosen at t=2?? Will selecting “nose” continue to have a bad effect into the distant future?

Greedy is not good

slide-48
SLIDE 48
  • Problem: Impossible to know a priori which word leads to the more promising future

– Even earlier: Choosing the lower probability “the” instead of “he” at T=1 may have made a choice of “nose” more reasonable at T=2.

  • In general, making a poor choice at any time commits us to a poor future

– But we cannot know at that time that the choice was poor

  • Solution: Don’t choose.

48

[Figure: bar plot of $Q(P_1 \mid J_1, \dots, J_N)$, with “the” and “he” highlighted]

What should we have chosen at t=1?? Choose “the” or “he”?

Greedy is not good

slide-49
SLIDE 49
  • Retain both choices and fork the network

– With every possible word as input

49

[Figure: the decoder is forked after the first step, once for each candidate first word: I, He, We, The, …]

Solution: Multiple choices

slide-50
SLIDE 50
  • Problem: This will blow up very quickly

– For an output vocabulary of size $W$, after $T$ output steps we’d have forked out $W^T$ branches

50

[Figure: the tree of forks (I, He, We, The, …) grows exponentially with the number of output steps]

Problem: Multiple choices

slide-51
SLIDE 51
  • Solution: Prune

– At each time, retain only the top $K$ scoring forks

51

$\operatorname{top-}K \; Q(P_1 \mid J_1, \dots, J_N)$

[Figure: only the top-$K$ first words (among I, He, We, The, …) are retained]

Solution: Prune

Beam search

slide-53
SLIDE 53
  • Solution: Prune

– At each time, retain only the top $K$ scoring forks

53

$\operatorname{top-}K \; Q(P_2, P_1 \mid J_1, \dots, J_N) = \operatorname{top-}K \; Q(P_2 \mid P_1, J_1, \dots, J_N)\, Q(P_1 \mid J_1, \dots, J_N)$
Note: based on the product

[Figure: each retained fork (He, The) is extended with candidate second words (Knows, Nose, …) and pruned again]

Solution: Prune

Beam search

slide-55
SLIDE 55
  • Solution: Prune

– At each time, retain only the top $K$ scoring forks

55

$\operatorname{top-}K \; Q(P_3 \mid P_2, P_1, J_1, \dots, J_N) \times Q(P_2 \mid P_1, J_1, \dots, J_N) \times Q(P_1 \mid J_1, \dots, J_N)$

[Figure: the retained two-word forks (He/The followed by Knows/Nose) are extended to a third word and pruned again]

Solution: Prune

Beam search

slide-57
SLIDE 57
  • Solution: Prune

– At each time, retain only the top $K$ scoring forks

57

$\operatorname{top-}K \; \prod_{t=1}^{T} Q(P_t \mid P_1, \dots, P_{t-1}, J_1, \dots, J_N)$

[Figure: surviving forks: He/The followed by Knows/Nose]

Solution: Prune

Beam search

slide-58
SLIDE 58
  • Terminate

– When the current most likely path overall ends in <eos>

  • Or continue producing more outputs (each of which terminates in <eos>) to get N-best outputs

58

[Figure: the search ends when the best path overall emits <eos>]

Terminate

Beam search

slide-59
SLIDE 59
  • Terminate

– Paths cannot continue once they output an <eos>

  • So paths may be of different lengths
  • Select the most likely sequence ending in <eos> across all terminating sequences

59

[Figure: beams of different lengths, each ending in <eos>; the example has K = 2]

Termination: <eos>

Beam search
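
A compact sketch of this pruned search. `log_step` is a hypothetical stand-in for the decoder: given the output prefix so far (and, implicitly, the encoded input), it returns log-probabilities over the vocabulary for the next word.

```python
import numpy as np

def beam_search(log_step, eos_id, K=2, max_len=30):
    beams = [([], 0.0)]                       # (prefix, total log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            logp = log_step(prefix)           # log Q(next word | prefix, input)
            for w in np.argsort(logp)[-K:]:   # expand only the K best words per fork
                candidates.append((prefix + [int(w)], score + float(logp[w])))
        # Prune: keep the K best forks by total (product of) probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:K]:
            if prefix[-1] == eos_id:
                finished.append((prefix, score))   # a path stops once it emits <eos>
            else:
                beams.append((prefix, score))
        if not beams:
            break
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[1])      # most likely <eos>-terminated path
```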

slide-60
SLIDE 60

Applications

  • Machine Translation

– My name is Tom → Ich heisse Tom / Mein Name ist Tom

  • Dialog

– “I have a problem” → “How may I help you”

  • Image to text

– Picture → Caption for picture

60

slide-61
SLIDE 61
  • Hidden state clusters by meaning!

61

Machine Translation Example

“Sequence-to-sequence learning with neural networks”, Sutskever, Vinyals, and Le, 2014

slide-62
SLIDE 62

“Sequence-to-sequence learning with neural networks”, Sutskever, Vinyals and Le, 2014

62

Machine Translation Example

slide-63
SLIDE 63
  • Trained on human-human conversations
  • Task: Human text in, machine response out

63

Human-Machine Conversation: Example

“A neural conversational model”, Oriol Vinyals and Quoc Le, 2015

slide-64
SLIDE 64
  • All the information about the input sequence is embedded into a single vector

– The “hidden” node layer at the end of the input sequence
– This one node is “overloaded” with information

  • Particularly if the input is long

64

[Figure: encoder-decoder; the single hidden vector after <eos> must carry the entire sentence “I ate an apple”]

A problem with this framework

slide-65
SLIDE 65

A problem with this framework

  • In reality: All hidden values carry information

– Some of which may be diluted downstream

  • Different outputs are related to different inputs

– Recall input and output may not be in sequence
– Have no way of knowing a priori which input must connect to what output

65

[Figure: encoder-decoder over “I ate an apple” / “Ich habe einen apfel gegessen”]

slide-66
SLIDE 66

Variants

66

[Figure: a better model: the encoded input embedding is fed as input to all output timesteps; shown for translation (“<sos> I ate an apple <eos>” → “Ich habe einen apfel gegessen <eos>”) and for captioning (“A boy on a surfboard <eos>”)]

slide-67
SLIDE 67

A problem with this framework

  • All the information about the input sequence is embedded into a single vector

– The “hidden” node layer at the end of the input sequence
– This one node is “overloaded” with information

  • Particularly if the input is long

67

[Figure: encoder-decoder; the single hidden vector after <eos> must carry the entire sentence “I ate an apple”]

slide-68
SLIDE 68

A problem with this framework

  • In reality: All hidden values carry information

– Some of which may be diluted downstream

68

[Figure: encoder over “I ate an apple <eos>”, annotated “FIX ENCODER DECODER SEPARATION”]

slide-69
SLIDE 69

A problem with this framework

  • In reality: All hidden values carry information

– Some of which may be diluted downstream

  • Different outputs are related to different inputs

– Recall input and output may not be in sequence

69

[Figure: encoder-decoder over “I ate an apple” / “Ich habe einen apfel gegessen”]

slide-70
SLIDE 70

A problem with this framework

  • In reality: All hidden values carry information

– Some of which may be diluted downstream

  • Different outputs are related to different inputs

– Recall input and output may not be in sequence
– Have no way of knowing a priori which input must connect to what output

70

[Figure: encoder-decoder over “I ate an apple” / “Ich habe einen apfel gegessen”]

slide-71
SLIDE 71

A problem with this framework

  • Connecting everything to everything is infeasible

– Variable sized inputs and outputs
– Overparametrized
– The connection pattern ignores the actual asynchronous dependence of output on input

71

[Figure: encoder-decoder over “I ate an apple” / “Ich habe einen apfel gegessen”]

slide-72
SLIDE 72

Solution: Attention models

  • Separating the encoder and decoder in the illustration

72

[Figure: encoder hidden states $\mathbf{i}_j$ over “I ate an apple <eos>”, drawn separately from the decoder states $\mathbf{t}_t$]

slide-73
SLIDE 73

Attention models

  • Compute a weighted combination of all the hidden outputs into a single vector

– Weights vary by output time

73

[Figure: each decoder step $t$ receives $\sum_j \beta_{t,j}\, \mathbf{i}_j$, a weighted combination of the encoder states $\mathbf{i}_1 \dots \mathbf{i}_5$]
slide-74
SLIDE 74

Solution: Attention models

  • Compute a weighted combination of all the hidden outputs into a single vector

– Weights vary by output time

74

Note: The weights vary with output time. The input to the hidden decoder layer at time $t$ is $\sum_j \beta_{t,j}\, \mathbf{i}_j$; the weights $\beta_{t,j}$ are scalars

[Figure: decoder states $\mathbf{t}_0 \dots \mathbf{t}_6$, each fed its own weighted combination of the encoder states $\mathbf{i}_1 \dots \mathbf{i}_5$]
slide-75
SLIDE 75

Attention instead of simple encoder-decoder

  • Encoder-decoder models

– needs to be able to compress all the necessary information of a source sentence into a fixed-length vector
– performance deteriorates rapidly as the length of an input sentence increases

  • Attention avoids this by:

– allowing the RNN generating the output to focus on hidden states (generated by the first RNN) as they become relevant.

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015 75

slide-76
SLIDE 76

Soft Attention for Translation

An RNN can attend over the output of another RNN. At every time step, it focuses on different positions in the other RNN. “I love coffee” -> “Me gusta el café”

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015 76

slide-77
SLIDE 77

Soft Attention for Translation

“I love coffee” -> “Me gusta el café”

Distribution over input words

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015 77


slide-81
SLIDE 81

Solution: Attention models

  • Require a time-varying weight that specifies the relationship of output time to input time

– Weights are functions of the current output state

81

$\beta_{t,j} = b(\mathbf{i}_j, \mathbf{t}_{t-1})$

Input to the hidden decoder layer at time $t$: $\sum_j \beta_{t,j}\, \mathbf{i}_j$

[Figure: each decoder step receives its own weighted combination of the encoder states $\mathbf{i}_1 \dots \mathbf{i}_5$]

slide-82
SLIDE 82
  • The weights are a distribution over the input

– Must automatically highlight the most important input components for any output

82

$\beta_{t,j} = b(\mathbf{i}_j, \mathbf{t}_{t-1})$

Input to the hidden decoder layer at time $t$: $\sum_j \beta_{t,j}\, \mathbf{i}_j$

For a given $t$ the weights $\beta_{t,j}$ sum to 1.0 over the input positions $j$

Attention models

[Figure: encoder states $\mathbf{i}_1 \dots \mathbf{i}_5$ and decoder states $\mathbf{t}_0 \dots \mathbf{t}_6$; the decoder has produced “Ich habe einen”]

slide-83
SLIDE 83
  • “Raw” weight at any time: a function $h()$ that works on the two hidden states
  • Actual weight: softmax over the raw weights

83

$f_j(t) = h(\mathbf{i}_j, \mathbf{t}_{t-1})$

$\beta_{t,j} = \dfrac{\exp(f_j(t))}{\sum_k \exp(f_k(t))}$

Input to the hidden decoder layer at time $t$: $\sum_j \beta_{t,j}\, \mathbf{i}_j$ (the weights sum to 1.0)

Attention models

[Figure: encoder-decoder with attention; the decoder has produced “Ich habe einen”]

slide-84
SLIDE 84
  • Typical options for $h()$…

– The parameters ($\mathbf{W}_a$, $\mathbf{w}_a$, and the MLP below) are to be learned

84

$f_j(t) = h(\mathbf{i}_j, \mathbf{t}_{t-1})$

$h(\mathbf{i}_j, \mathbf{t}_{t-1}) = \mathbf{i}_j^\top \mathbf{t}_{t-1}$

$h(\mathbf{i}_j, \mathbf{t}_{t-1}) = \mathbf{i}_j^\top \mathbf{W}_a\, \mathbf{t}_{t-1}$

$h(\mathbf{i}_j, \mathbf{t}_{t-1}) = \mathbf{w}_a^\top \tanh\!\big(\mathbf{W}_a [\mathbf{i}_j ; \mathbf{t}_{t-1}]\big)$

$h(\mathbf{i}_j, \mathbf{t}_{t-1}) = \mathrm{MLP}\big([\mathbf{i}_j ; \mathbf{t}_{t-1}]\big)$

$\beta_{t,j} = \dfrac{\exp(f_j(t))}{\sum_k \exp(f_k(t))}$

Attention models

[Figure: encoder-decoder with attention; the decoder has produced “Ich habe einen”]
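
A NumPy sketch of these scoring options and the softmax that turns the raw scores into attention weights. The parameter names `W_a` and `w_a` follow the reconstruction above and are only illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(enc, t_prev, W_a=None, w_a=None, kind="dot"):
    """enc: encoder states i_j as rows; t_prev: decoder state t_{t-1}."""
    if kind == "dot":            # h(i_j, t_{t-1}) = i_j . t_{t-1}
        f = enc @ t_prev
    elif kind == "bilinear":     # h(i_j, t_{t-1}) = i_j^T W_a t_{t-1}
        f = enc @ (W_a @ t_prev)
    else:                        # additive: w_a^T tanh(W_a [i_j ; t_{t-1}])
        concat = np.hstack([enc, np.tile(t_prev, (len(enc), 1))])
        f = np.tanh(concat @ W_a.T) @ w_a
    return softmax(f)            # beta_{t,j}: sums to 1 over input positions

# Context vector for one decoder step: sum_j beta_{t,j} i_j
enc, t_prev = np.random.randn(5, 8), np.random.randn(8)
beta = attention_weights(enc, t_prev, kind="dot")
context = beta @ enc
print(beta.sum(), context.shape)   # ~1.0 and (8,)
```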

slide-85
SLIDE 85
  • Pass the input through the encoder to produce hidden representations $\mathbf{i}_j$

85

[Figure: encoder runs over “I ate an apple <eos>”, producing $\mathbf{i}_1 \dots \mathbf{i}_5$]

Converting an input (forward pass)

slide-86
SLIDE 86
  • Compute weights for the first output

86

What is $\mathbf{t}_0$? Multiple options. Simplest: $\mathbf{t}_0 = \mathbf{i}_N$. If $\mathbf{t}$ and $\mathbf{i}$ are different sizes: $\mathbf{t}_0 = \mathbf{W}_s \mathbf{i}_N$, where $\mathbf{W}_s$ is a learnable parameter

Converting an input (forward pass)

slide-87
SLIDE 87
  • Compute weights (for every $\mathbf{i}_j$) for the first output

87

$\beta_{1,j} = \dfrac{\exp(f_j(1))}{\sum_k \exp(f_k(1))}$, where $f_j(1) = h(\mathbf{i}_j, \mathbf{t}_0)$, e.g. $h(\mathbf{i}_j, \mathbf{t}_0) = \mathbf{i}_j^\top \mathbf{W}_a \mathbf{t}_0$

Converting an input (forward pass)

slide-88
SLIDE 88
  • Compute weights (for every $\mathbf{i}_j$) for the first output
  • Compute the weighted combination of hidden values: $\mathcal{A}_1 = \sum_j \beta_{1,j}\, \mathbf{i}_j$

88

Converting an input (forward pass)

slide-89
SLIDE 89
  • Produce the first output

– Will be a distribution over words

89

[Figure: $\mathcal{A}_1$ and $\mathbf{t}_0$ produce $\mathbf{t}_1$ and the output distribution $\mathbf{Z}_1$]

Converting an input (forward pass)

slide-90
SLIDE 90
  • Produce the first output

– Will be a distribution over words
– Draw a word from the distribution

90

Converting an input (forward pass)
slide-91
SLIDE 91

91

[Figure: the decoder has produced $\mathbf{Z}_1$ and drawn “Ich”]

$\beta_{2,j} = \dfrac{\exp(f_j(2))}{\sum_k \exp(f_k(2))}$, where $f_j(2) = h(\mathbf{i}_j, \mathbf{t}_1)$, e.g. $h(\mathbf{i}_j, \mathbf{t}_1) = \mathbf{i}_j^\top \mathbf{W}_a \mathbf{t}_1$

slide-92
SLIDE 92

92

  • Compute the weighted combination of hidden values for t=2: $\mathcal{A}_2 = \sum_j \beta_{2,j}\, \mathbf{i}_j$

slide-93
SLIDE 93

93

  • Compute the output at t=2
  • Will be a probability distribution over words

slide-94
SLIDE 94

94

  • Draw a word from the output distribution at t=2

slide-95
SLIDE 95

95

  • Compute the weights (for every $\mathbf{i}_j$) for time t=3: $\beta_{3,j} = \dfrac{\exp(f_j(3))}{\sum_k \exp(f_k(3))}$ with $f_j(3) = h(\mathbf{i}_j, \mathbf{t}_2)$, and the combination $\mathcal{A}_3 = \sum_j \beta_{3,j}\, \mathbf{i}_j$

slide-96
SLIDE 96

96

  • Compute the output at t=3
  • Will be a probability distribution over words

slide-97
SLIDE 97

97

  • Draw a word from the distribution

slide-98
SLIDE 98

98

  • Compute the weights for time t=4: $\beta_{4,j} = \dfrac{\exp(f_j(4))}{\sum_k \exp(f_k(4))}$ with $f_j(4) = h(\mathbf{i}_j, \mathbf{t}_3)$

slide-99
SLIDE 99

[Figure: the process continues through $\mathbf{t}_4, \mathbf{t}_5, \mathbf{t}_6$, producing “einen apfel gegessen <eos>”]

99

  • As before, the objective of drawing: Produce the most likely output (that ends in an <eos>)

$\operatorname*{argmax}_{P_1, \dots, P_L} \; z_1^{P_1} z_2^{P_2} \cdots z_L^{P_L}$

  • Simply selecting the most likely symbol at each time may result in suboptimal output
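
A sketch that ties slides 85-99 into one loop, under stated assumptions: NumPy, a plain tanh recurrence, and the bilinear score from slide 84 (real systems typically use GRU/LSTM cells and beam search rather than sampling).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_and_decode(enc, params, embed, eos_id, max_len=20):
    """enc: encoder states i_j (N x d); embed: output-word embedding table (V x e)."""
    W_a, W_rec, W_in, W_ctx, W_out = params
    t = enc[-1]                                    # t_0: simplest choice, the last encoder state
    y = None                                       # no previous output word yet
    words = []
    for _ in range(max_len):
        beta = softmax(enc @ (W_a @ t))            # weights over input positions
        A = beta @ enc                             # context A_t = sum_j beta_{t,j} i_j
        y_emb = embed[y] if y is not None else np.zeros(embed.shape[1])
        t = np.tanh(W_rec @ t + W_in @ y_emb + W_ctx @ A)   # new decoder state
        probs = softmax(W_out @ t)                 # distribution over output words
        y = int(rng.choice(len(probs), p=probs))   # draw a word
        words.append(y)
        if y == eos_id:                            # stop at <eos>
            break
    return words
```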

slide-100
SLIDE 100
  • The key component of this model is the attention weight

– It captures the relative importance of each position in the input to the current output

100

[Figure: the first two decoder steps with attention, producing “Ich”]

What does the attention learn?

$\beta_{1,j} = \dfrac{\exp(f_j(1))}{\sum_k \exp(f_k(1))}$, where $f_j(1) = h(\mathbf{i}_j, \mathbf{t}_0)$, e.g. $h(\mathbf{i}_j, \mathbf{t}_0) = \mathbf{i}_j^\top \mathbf{W}_a \mathbf{t}_0$

slide-101
SLIDE 101

Context vector (input to the decoder): $\mathbf{A}_t = \sum_{j=1}^{N} \beta_{t,j}\, \mathbf{h}_j$

Mixture weights: $\beta_{t,j} = \dfrac{\exp(f_{t,j})}{\sum_{k=1}^{N} \exp(f_{t,k})}$

Alignment score (how well do input words near $j$ match output words at position $t$): $f_{t,j} = a(\mathbf{t}_{t-1}, \mathbf{h}_j)$

The alignment model $a$ is a feedforward neural network which is jointly trained with all the other components of the proposed system

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015

Attention for Translation

101

slide-102
SLIDE 102

102

[Plot of $\beta_{t,j}$ over output position $t$ and input position $j$. Color shows the value (white is larger). Note how the most important input words for any output word get automatically highlighted. The general trend is somewhat linear because word order is roughly similar in both languages.]

“Alignments” example: Bahdanau et al.

slide-103
SLIDE 103

Attention Effect in machine translation

103

Bahdanau et al. "Neural Machine Translation by Jointly Learning to Align and Translate", 2014

slide-104
SLIDE 104

Training the network

  • We have seen how a trained network can be used to compute outputs

– Convert one sequence to another

  • Let’s consider training…

104

slide-105
SLIDE 105

105

  • Given training input (source sequence, target sequence) pairs
  • Forward pass: Pass the input sequence through the encoder and the target sequence through the decoder
  • At each time the output is a probability distribution over words

[Figure: attention-based encoder-decoder over “I ate an apple <eos>” / “Ich habe einen apfel gegessen”; at each decoder step the output is a distribution $z_t^w$ over the target vocabulary]

slide-106
SLIDE 106

106

  • Backward pass: Compute a loss between the target output and the output distributions
  • Backpropagate derivatives through the network

[Figure: a Loss term at each decoder output against “Ich habe einen apfel gegessen <eos>”]

Backpropagation also updates the parameters of the “attention” function $h()$

slide-107
SLIDE 107

Various extensions

  • Attention: Local attention vs. global attention

– E.g. “Effective Approaches to Attention-based Neural Machine Translation”, Luong et al., 2015
– Other variants

  • Bidirectional processing of the input sequence

– Bidirectional networks in the encoder
– E.g. “Neural Machine Translation by Jointly Learning to Align and Translate”, Bahdanau et al. 2016

107

slide-108
SLIDE 108

Attention for Translation

From Y. Bengio CVPR 2015 Tutorial

Bidirectional encoder RNN Decoder RNN Attention Model

108

slide-109
SLIDE 109

109

  • Teacher forcing: Occasionally pass the system output as input during training
  • The “Gumbel noise” trick: Making drawing from a distribution differentiable

Tricks of the trade…

[Figure: attention-based encoder-decoder; the decoder inputs during training and the per-output Loss terms against “Ich habe einen apfel gegessen <eos>” are shown]

slide-110
SLIDE 110

Some impressive results…

  • Attention-based models are currently responsible for the state of the

art in many sequence-conversion systems

– Machine translation

  • Input: Speech in source language
  • Output: Speech in target language

– Speech recognition

  • Input: Speech audio feature vector sequence
  • Output: Transcribed word or character sequence

110

slide-111
SLIDE 111

Other Applications

  • Ba et al 2014, Visual attention for recognition
  • Mnih et al 2014, Visual attention for recognition
  • Chorowski et al, 2014, Speech recognition
  • Graves et al 2014, Neural Turing machines
  • Yao et al 2015, Video description generation
  • Vinyals et al, 2015, Conversational Agents
  • Xu et al 2015, Image caption generation
  • Xu et al 2015, Visual Question Answering

111

slide-112
SLIDE 112
  • “Show, attend, and tell: Neural image caption generation with visual attention”, Xu et al., 2016
  • Encoder network is a convolutional neural network

– Filter outputs at each location are the equivalent of $\mathbf{i}_j$ in the regular sequence-to-sequence model

112

Attention models in image captioning

slide-113
SLIDE 113

113

Recap: Training of Image Captioning Networks

  • Training: Given several (Image, Caption) pairs
  • The image network is pretrained on a large corpus, e.g. ImageNet
  • Forward pass: Produce output distributions given the image and caption
  • Backward pass: Compute the loss w.r.t. the training caption, and backpropagate derivatives

[Figure: CNN image features feed a decoder trained on “<sos> A boy on a surfboard <eos>”, with a Loss at each output]

slide-114
SLIDE 114

Image Captioning with Attention

114

slide-115
SLIDE 115

Recall: RNN for Captioning

[Figure: Image (H x W x 3) → CNN → feature vector (D) → initial hidden state $h_0$; the RNN then produces hidden states $h_1, h_2, \dots$ and distributions $d_1, d_2, \dots$ over the vocabulary, emitting the first word, second word, …]

The RNN only looks at the whole image once.

What if the RNN looks at different parts of the image at each timestep?

115

slide-116
SLIDE 116

Soft Attention for Captioning

[Figure: Image (H x W x 3) → CNN → grid of features (L x D)]

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

116

SLIDES 117-124

Soft Attention for Captioning

[Figure, built up over slides 117-124: Image (H x W x 3) → CNN → grid of features (L x D). From the hidden state $h_0$ the model produces a distribution $a_1$ over the L locations; the weighted combination of features gives $z_1$ (D-dimensional); $h_1$ then outputs both a distribution $d_1$ over the vocabulary (the first word $y_1$) and the next attention distribution $a_2$; the process repeats for $z_2, h_2, d_2, y_2, a_3, \dots$]

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

124

slide-125
SLIDE 125

Soft vs Hard Attention

CNN

Image: H x W x 3 Grid of features (Each D-dimensional)

Grid cells a, b, c, d with attention weights pa, pb, pc, pd

Distribution over grid locations (from the RNN): pa + pb + pc + pd = 1

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

125

slide-126
SLIDE 126

Soft vs Hard Attention

CNN

Image: H x W x 3 Grid of features (Each D-dimensional)

Grid cells a, b, c, d with attention weights pa, pb, pc, pd

Distribution over grid locations (from the RNN): pa + pb + pc + pd = 1. Context vector z (D-dimensional)

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

126

slide-127
SLIDE 127

Soft vs Hard Attention

CNN

Image: H x W x 3 Grid of features (Each D-dimensional)

Grid cells a, b, c, d with attention weights pa, pb, pc, pd

Distribution over grid locations (from the RNN): pa + pb + pc + pd = 1. Context vector z (D-dimensional). Soft attention: summarize ALL locations, z = pa·a + pb·b + pc·c + pd·d. The derivative dz/dp is nice! Train with gradient descent

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

127

slide-128
SLIDE 128

Soft Attention for Captioning

The model wants to attend to the salient parts of an image while generating its caption. [Figure: hard attention vs. soft attention maps over example images]

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

128

slide-129
SLIDE 129

Soft vs Hard Attention Models

Hard attention:

  • Attend to a single input location among the set of locations.
  • Can’t use gradient descent.
  • Need reinforcement learning or …

Soft attention:

  • Compute a weighted combination (attention) over some inputs

using an attention network.

  • Can use backpropagation to train end-to-end.

129

slide-130
SLIDE 130

Soft vs Hard Attention

CNN

Image: H x W x 3 Grid of features (Each D-dimensional)

Grid cells a, b, c, d with attention weights pa, pb, pc, pd

Distribution over grid locations (from the RNN): pa + pb + pc + pd = 1. Context vector z (D-dimensional). Soft attention: summarize ALL locations, z = pa·a + pb·b + pc·c + pd·d; the derivative dz/dp is nice, so train with gradient descent. Hard attention: sample ONE location according to p and set z to that feature vector; with argmax, dz/dp is zero almost everywhere, so we can’t use gradient descent and need reinforcement learning

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

130
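
A NumPy sketch contrasting the two summaries on this slide; the CNN that produces the L x D feature grid and the RNN that produces the raw scores are outside the snippet.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attend(features, scores):
    """features: L x D grid features; scores: raw attention scores from the RNN."""
    p = softmax(scores)             # p_a + p_b + ... = 1 over the L locations
    z = p @ features                # soft attention: summarize ALL locations
    return z, p                     # differentiable w.r.t. p: trainable by gradient descent

def hard_attend(features, scores, rng=np.random.default_rng(0)):
    p = softmax(scores)
    loc = int(rng.choice(len(p), p=p))   # hard attention: sample ONE location
    return features[loc], loc            # not differentiable w.r.t. p (needs RL / REINFORCE)

feats = np.random.randn(4, 8)            # L=4 grid cells (a, b, c, d), D=8 features each
z, p = soft_attend(feats, np.array([1.0, 0.2, -0.5, 0.3]))
print(p.round(2), z.shape)               # distribution over the 4 cells, (8,)
```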

slide-131
SLIDE 131

Soft Attention for Captioning

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

131

slide-132
SLIDE 132

Visual Question Answering

132

slide-133
SLIDE 133

Visual Question Answering: RNNs with Attention

133

slide-134
SLIDE 134

In closing

  • Have looked at various forms of sequence-to-sequence models
  • Generalizations of recurrent neural network formalisms
  • For more details, please refer to papers

134

slide-135
SLIDE 135

Resources

  • Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015.

  • Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015.

135