Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu*, Jimmy Ba†, Ryan Kiros†, Kyunghyun Cho*, Aaron Courville*, Ruslan Salakhutdinov†, Richard Zemel†, Yoshua Bengio*
*Université de Montréal / †University of Toronto
(some figures from Hugo Larochelle)
July 8, 2015
Caption generation is another building block
Figure: levels of visual understanding over time, from low-level features and descriptors (~90 ms), through detection, segmentation, shape, and tracking (~150 ms), up to high-level understanding of causality, functionality, social roles, goals, and intentions (~1 sec); adapted from a figure by Fei-Fei Li
What our model does:
Figure: A bird flying over a body of water.
Overview
◮ Recent work in image caption generation: encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others)
◮ Motivating our architecture from human attention
◮ Our proposed attention model: model description, quantitative/qualitative results
This talk: Recent work in image caption generation (encoder-decoder models from Berkeley, Google, Stanford, Toronto, and others)
Recent surge of interest in image captioning
◮ Submissions on this topic at CVPR 2015 (from groups at Google, Berkeley, Stanford, Microsoft, etc.)
◮ Inspired by successes in machine translation (Kalchbrenner et al. 2013, Sutskever et al. 2014, Cho et al. 2014)
Theme: use a convnet to condition a recurrent language model
Figure: from Karpathy et al. (2015)
Figure: the Vinyals et al. (2015) model is quite similar
This talk: Motivating our architecture from human attention
What are some things we know about human attention?
(1) human vision is foveated & sequential
◮ Particular parts of an image come to the forefront
◮ It is a sequential decision process (“saccades”, glimpses)
(2) bottom-up input influences
Figure: from Borji and Itti (2013) [2]
(3) top-down task-level control
Figure: from Yarbus (1967)
Summary: useful aspects of attention
◮ foveated visual field (spatial focus)
◮ sequential decision making (temporal dynamics)
◮ bottom-up input influence
◮ top-down modulation by the specific task
This talk: Our proposed attention model (model description, quantitative/qualitative results)
Our proposed attention model
◮ “Low-level” convolutional feature extraction: a = {a_1, a_2, ..., a_L}
◮ Compute the importance of each of these regions: α = {α_1, α_2, ..., α_L}
◮ Combine α and a to represent the image (context: ẑ_t)
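To make the shapes concrete, a minimal sketch assuming VGG-style features (the paper extracts a 14×14×512 convolutional feature map, flattened into L = 196 annotation vectors of dimension D = 512; the random array below is just a stand-in for real CNN output):

```python
import numpy as np

# Stand-in for a CNN feature map: a 14x14 grid of 512-d feature columns.
feature_map = np.random.randn(14, 14, 512)
L, D = 14 * 14, 512
a = feature_map.reshape(L, D)   # annotation vectors a = {a_1, ..., a_L}
```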
A little bit more specific
output = (a, man, is, jumping, into, a, lake, .)
Convolutional feature extraction
output = (a, man, is, jumping, into, a, lake, .)
Figure: a convolutional neural network produces annotation vectors a_j
Given an initial hidden state (predicted from the image)...
output = (a, man, is, jumping, into, a, lake, .)
Figure: a recurrent state h_i sits above the annotation vectors a_j from the convolutional neural network
Predict the “importance” of each region
output = (a, man, is, jumping, into, a, lake, .)
Figure: an attention mechanism maps h_i and the a_j to weights α_j with Σ_j α_j = 1
Combine with annotation vectors...
output = (a, man, is, jumping, into, a, lake, .)
Figure: the attention weights α_j are combined with the annotation vectors a_j (weighted sum)
Feed into next hidden state and predict the next word
output = (a, man, is, jumping, into, a, lake, .)
Figure: the attended context feeds the recurrent state h_i, from which the word y_i is sampled
In the next step, we use the new hidden state
output = (a, man, is, jumping, into, a, lake, .)
Figure: same diagram; the updated h_i drives the next attention step
Continue until end of sequence
output = (a, man, is, jumping, into, a, lake, .)
Figure: same diagram, repeated for each output word
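The loop just walked through, as a short sketch; every *_fn argument below is a hypothetical hook standing in for a trained piece of the model:

```python
def generate_caption(a, init_state_fn, attention_fn, lstm_step_fn, next_word_fn,
                     eos_id=0, max_len=20):
    """Decode a caption from annotation vectors a, shape (L, D)."""
    h, c = init_state_fn(a)                # initial state predicted from the image
    caption = []
    for _ in range(max_len):
        alpha = attention_fn(a, h)         # importance alpha_j of each region
        z = alpha @ a                      # soft context: weighted annotation vectors
        h, c = lstm_step_fn(z, h, c)       # feed the context into the recurrent state
        y = next_word_fn(h, z)             # sample (or argmax) the next word id
        caption.append(y)
        if y == eos_id:                    # stop at the end-of-sequence token
            break
    return caption
```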
The attention is driven by the recurrent state + image
◮ At every time step, compute the importance of each region depending on the top-down + bottom-up signals:
e_{ti} = f_att(a_i, h_{t−1})
α_{ti} = exp(e_{ti}) / Σ_{k=1}^{L} exp(e_{tk})
◮ We use a softmax to constrain these weights to sum to 1
◮ We explore two different ways to use the above distribution to compute a meaningful image representation
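A minimal NumPy sketch of this computation, assuming f_att is a small MLP (the projection weights Wa, Wh, v are hypothetical and would be learned jointly with the rest of the model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(a, h_prev, Wa, Wh, v):
    """a: (L, D) annotation vectors (bottom-up); h_prev: (H,) previous
    recurrent state (top-down); returns alpha_t, shape (L,), summing to 1."""
    e = np.tanh(a @ Wa + h_prev @ Wh) @ v   # e_ti = f_att(a_i, h_{t-1})
    return softmax(e)                        # alpha_ti via the softmax above
```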
Stochastic or Deterministic?
output = (a, man, is, jumping, into, a, lake, .)
Figure: the attention weights α_j can be used either stochastically or deterministically to form the context
Quick note on our decoder: LSTM (Hochreiter & Schmidhuber, 1997)
Figure: an LSTM cell; the input gate i, forget gate f, output gate o, and input modulator each see the context z_t, the previous state h_{t−1}, and the previous word embedding Ey_{t−1}; the memory cell c yields the new state h_t
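A sketch of one decoder step under the same assumptions (W is a hypothetical dict of weight matrices, one per gate; biases omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(Ey_prev, h_prev, c_prev, z, W):
    """Every gate sees the previous word embedding Ey_{t-1}, the previous
    hidden state h_{t-1}, and the attention context z_t."""
    x = np.concatenate([Ey_prev, h_prev, z])   # shared input to all gates
    i = sigmoid(W['i'] @ x)                    # input gate
    f = sigmoid(W['f'] @ x)                    # forget gate
    o = sigmoid(W['o'] @ x)                    # output gate
    g = np.tanh(W['g'] @ x)                    # input modulator
    c = f * c_prev + i * g                     # memory cell update
    h = o * np.tanh(c)                         # new hidden state
    return h, c
```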
Deterministic (Soft) Attention
◮ Feed in an attention-weighted image input:
ẑ_t = Σ_{i=1}^{L} α_{t,i} a_i
◮ This is what A. Graves (2013) / D. Bahdanau et al. (2015) did in handwriting generation / machine translation
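In code, the soft context is a single weighted sum (a sketch, reusing the attention_weights output from above):

```python
def soft_context(alpha, a):
    """Deterministic context z_t = sum_i alpha_{t,i} a_i: the expected
    annotation vector under the attention distribution.
    alpha: (L,) weights summing to 1; a: (L, D) annotation vectors."""
    return alpha @ a   # shape (D,)
```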
Alternatively: Stochastic (Hard) Attention
◮ Sample the attended location s stochastically at every time step
◮ In RL terms, think of the softmax α as a Boltzmann policy; maximize a variational lower bound on the log-likelihood:
L_s = Σ_s p(s|a) log p(y|s, a) ≤ log p(y|a)
∂L_s/∂W ≈ (1/N) Σ_{n=1}^{N} [ ∂ log p(y|s̃^n, a)/∂W + log p(y|s̃^n, a) ∂ log p(s̃^n|a)/∂W ]
◮ This is the estimator of Williams (1992), re-popularized recently by Mnih et al. (2014) and Ba et al. (2015)
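A sketch of the hard variant and its REINFORCE-style gradient estimate; the three callables are hypothetical hooks returning log p(y|s̃, a) and the two gradient terms for a sampled location:

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_attention_step(alpha, a):
    """Sample one region s ~ Categorical(alpha) (the Boltzmann policy)
    and use its annotation vector alone as the context."""
    s = rng.choice(len(alpha), p=alpha)
    return a[s], s

def reinforce_grad(alpha, a, log_lik, grad_log_lik, grad_log_policy, N=10):
    """Monte Carlo estimate of dL_s/dW averaged over N sampled locations."""
    total = 0.0
    for _ in range(N):
        _, s = hard_attention_step(alpha, a)                   # s~n ~ p(s|a)
        total += grad_log_lik(s) + log_lik(s) * grad_log_policy(s)
    return total / N
```

In practice the paper reduces the variance of this estimator with a moving-average baseline and an entropy term, but the above is the core.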
Quantitative Results
A footnote on these metrics
Under automatic metrics, humans are not great :(
But human evaluation (Mechanical Turk) is quite different
Stochastic or Deterministic?
Figure: same diagram as before, contrasting the stochastic and deterministic context
Visualizing our learned attention: the good
Visualizing our learned attention: the bad
Other fun things you can do:
A soccer ball...
Two cakes on a plate...
Important previous work
Attention in machine translation
Figure: also from the UdeM lab (Bahdanau et al. 2014) [1]
Attention mechanism in handwritten character generation
Figure: from Graves (2013) [3]
Recently, many more...
Thanks for attending!
Code: https://github.com/kelvinxu/arctic-captions
References
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[2] Ali Borji and Laurent Itti. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185–207, 2013.
[3] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.