
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention



  1. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu*, Jimmy Ba†, Ryan Kiros†, Kyunghyun Cho*, Aaron Courville*, Ruslan Salakhutdinov†, Richard Zemel†, Yoshua Bengio*. *Université de Montréal, †University of Toronto. (Some figures from Hugo Larochelle.) July 8, 2015

  2. Caption generation is another building block of scene understanding. Figure: levels of understanding vs. processing time, from low-level features and descriptors (texture, ~90 ms) through detection, segmentation, shape, and tracking (~150 ms) up to situations, causality, functionality, social roles, goals, and intentions (~1 s), across object, scene, and activity tasks; adapted from a figure by Fei-Fei Li.

  3. What our model does. Figure: "A bird flying over a body of water."

  4. Overview
  ◮ Recent work in image caption generation: encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others)
  ◮ Motivating our architecture from human attention
  ◮ Our proposed attention model: model description, quantitative/qualitative results

  5. This talk:
  ◮ Recent work in image caption generation: encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others)
  ◮ Motivating our architecture from human attention
  ◮ Our proposed attention model: model description, quantitative/qualitative results

  6. Recent surge of interest in image captioning
  ◮ Submissions on this topic at CVPR 2015 (from groups at Google, Berkeley, Stanford, Microsoft, etc.)
  ◮ Inspired by some successes in machine translation (Kalchbrenner et al. 2013; Sutskever et al. 2014; Cho et al. 2014)

  7. Theme: use a convnet to condition the caption generator. Figure: from Karpathy et al. (2015)

  8. Figure: Vinyals et al.'s (2015) model is quite similar

  9. This talk:
  ◮ Recent work in image caption generation: encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others)
  ◮ Motivating our architecture from human attention
  ◮ Our proposed attention model: model description, quantitative/qualitative results

  10. What are some things we know about human attention?

  11. (1) Human vision is foveated & sequential
  ◮ Particular parts of an image come to the forefront (figure: numbered fixation points on two example images)
  ◮ It is a sequential decision process ("saccades", glimpses)

  12. (2) Bottom-up input influences. Figure: from Borji and Itti (2013) [2]

  13. (3) Top-down, task-level control. Figure: from Yarbus (1967)

  14. Summary: useful aspects of attention
  ◮ foveated visual field (spatial focus)
  ◮ sequential decision making (temporal dynamics)
  ◮ bottom-up input influence
  ◮ top-down modulation by the specific task

  15. This talk:
  ◮ Recent work in image caption generation: encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others)
  ◮ Motivating our architecture from human attention
  ◮ Our proposed attention model: model description, quantitative/qualitative results

  16. Our proposed attention model
  ◮ "Low-level" convolutional feature extraction: $a = \{a_1, a_2, \dots, a_L\}$
  ◮ Compute the importance of each of these regions: $\alpha = \{\alpha_1, \alpha_2, \dots, \alpha_L\}$
  ◮ Combine $\alpha$ and $a$ to represent the image (context $\hat{z}_t$)

  17. A little bit more specific: output = (a, man, is, jumping, into, a, lake, .)

  18. Convolutional feature extraction. Figure: a convolutional neural network maps the image to annotation vectors a_j.

  19. Given an initial hidden state (predicted from the image)... Figure: a recurrent state h_i is added on top of the annotation vectors a_j.

  20. Predict the "importance" of each region. Figure: an attention mechanism maps h_i and the annotation vectors a_j to weights α_j with Σ_j α_j = 1.

  21. Combine with the annotation vectors... Figure: the attention weights α_j are combined with the a_j into an attention-weighted context.

  22. Feed into the next hidden state and predict the next word. Figure: the context updates the recurrent state h_i, from which the next word y_i is sampled.

  23. In the next step, we use the new hidden state. Figure: the same pipeline, now driven by the updated h_i.

  24. Continue until the end of the sequence, as in the sketch below. Figure: the full loop: convnet → annotation vectors → attention → recurrent state → sampled word, repeated.
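To make the walkthrough concrete, here is a toy end-to-end NumPy sketch of the decode loop. Everything in it is illustrative rather than the paper's exact model: random parameters, a plain tanh recurrence standing in for the LSTM, a bilinear stand-in for the attention scorer, and word ids instead of learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, H, V, T = 196, 512, 256, 1000, 10  # regions, feature dim, state dim, vocab, max steps

a = rng.standard_normal((L, D)) * 0.1    # stand-in annotation vectors a_1..a_L
Ua = rng.standard_normal((D, H)) * 0.01  # toy attention parameters
Wh = rng.standard_normal((H, H)) * 0.01  # toy recurrence parameters
Wz = rng.standard_normal((H, D)) * 0.01
Wo = rng.standard_normal((V, H)) * 0.01  # toy output projection

h = np.tanh(Wz @ a.mean(axis=0))         # initial state predicted from the image
caption = []
for t in range(T):
    e = a @ (Ua @ h)                     # score each region against the current state
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                 # importance weights, sum to 1
    z = alpha @ a                        # soft context: sum_i alpha_i * a_i
    h = np.tanh(Wh @ h + Wz @ z)         # update state (the real model uses an LSTM and
                                         # also feeds the previous word's embedding)
    logits = Wo @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()
    caption.append(int(rng.choice(V, p=p)))  # sample the next word id
```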

  25. The attention is driven by the recurrent state + image
  ◮ At every time step, compute the importance of each region depending on the top-down + bottom-up signals:
  $$e_{ti} = f_{\text{att}}(a_i, h_{t-1}), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$$
  ◮ We use a softmax to constrain these weights to sum to 1
  ◮ We explore two different ways to use the above distribution to compute a meaningful image representation
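A minimal sketch of this computation. The paper learns f_att as a small network conditioned on h_{t-1}; the additive form below, with made-up weight shapes Wa, Wh, and v, is one common choice and not necessarily the exact parameterization used.

```python
import numpy as np

def attention_weights(a, h_prev, Wa, Wh, v):
    """alpha_t over L annotation vectors; a: (L, D), h_prev: (H,)."""
    # e_{ti} = f_att(a_i, h_{t-1}); additive form: v^T tanh(Wa a_i + Wh h_{t-1})
    e = np.tanh(a @ Wa.T + Wh @ h_prev) @ v
    e -= e.max()                               # stabilize the exponentials
    alpha = np.exp(e) / np.exp(e).sum()        # softmax, so sum_i alpha_{ti} = 1
    return alpha

rng = np.random.default_rng(0)
alpha = attention_weights(rng.standard_normal((196, 512)), rng.standard_normal(256),
                          rng.standard_normal((64, 512)), rng.standard_normal((64, 256)),
                          rng.standard_normal(64))
assert np.isclose(alpha.sum(), 1.0)
```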

  26. Stochastic or Deterministic? Figure: the same pipeline as before, but the attention-weighted context is computed either stochastically (sampling a region) or deterministically (weighted sum).

  27. Quick note on our decoder: the LSTM (Hochreiter & Schmidhuber, 1997). Figure: LSTM cell; the input gate i, forget gate f, output gate o, and input modulator each see $Ey_{t-1}$, $h_{t-1}$, and $\hat{z}_t$; the memory cell c produces the new state $h_t$.
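A sketch of one decoder step matching the figure: all four gate pre-activations see the previous word embedding Ey_{t-1}, the previous state h_{t-1}, and the context ẑ_t. The packed parameter layout (W, b) is my own convenience, not the paper's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(Ey_prev, h_prev, c_prev, z, W, b):
    """One LSTM step. W: (4H, E+H+D), b: (4H,), rows packed as [i; f; o; g]."""
    H = h_prev.shape[0]
    x = np.concatenate([Ey_prev, h_prev, z])  # gates see Ey_{t-1}, h_{t-1}, z_t
    gates = W @ x + b
    i = sigmoid(gates[:H])          # input gate
    f = sigmoid(gates[H:2*H])       # forget gate
    o = sigmoid(gates[2*H:3*H])     # output gate
    g = np.tanh(gates[3*H:])        # input modulator
    c = f * c_prev + i * g          # memory cell update
    h = o * np.tanh(c)              # new hidden state
    return h, c
```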

  28. Deterministic (Soft) Attention
  ◮ Feed in an attention-weighted image input:
  $$\hat{z}_t = \sum_{i=1}^{L} \alpha_{t,i}\, a_i$$
  ◮ This is what A. Graves (2013) / D. Bahdanau et al. (2015) did in handwriting generation / machine translation
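In code the deterministic context is just the α-weighted average of the annotation vectors, which keeps the whole model differentiable and trainable with plain backpropagation (a one-liner under the toy shapes used above):

```python
import numpy as np

def soft_context(alpha, a):
    """z_t = sum_i alpha_{t,i} a_i; alpha: (L,), a: (L, D) -> (D,)."""
    return alpha @ a
```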

  29. Alternatively: Stochastic (Hard) Attention
  ◮ Sample the attended location $s_t$ stochastically (from $\alpha$) at every time step
  ◮ In RL terms, think of the softmax $\alpha$ as a Boltzmann policy; maximize the lower bound
  $$L_s = \sum_s p(s \mid a) \log p(y \mid s, a) \le \log p(y \mid a)$$
  $$\frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{\partial \log p(y \mid \tilde{s}^n, a)}{\partial W} + \log p(y \mid \tilde{s}^n, a)\, \frac{\partial \log p(\tilde{s}^n \mid a)}{\partial W} \right]$$
  By Williams (1992), re-popularized recently by Mnih et al. (2014) and Ba et al. (2015)
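A sketch of the stochastic side under the same toy shapes: sample a region index from α, attend to that single annotation vector, and, given the relevant log-probabilities and their gradients (computed elsewhere), form the one-sample score-function term from the estimator above. The paper's variance-reduction tricks (a baseline and an entropy term) are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_context(alpha, a):
    """Sample s_t ~ Multinoulli(alpha) and attend to a single region."""
    s = rng.choice(len(alpha), p=alpha)
    return a[s], s

def score_function_grad(log_p_y, grad_log_p_y, grad_log_p_s):
    """One-sample term of the estimator:
    d log p(y|s~,a)/dW + log p(y|s~,a) * d log p(s~|a)/dW.
    Average this over N samples s~^1..s~^N in practice."""
    return grad_log_p_y + log_p_y * grad_log_p_s
```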

  30. Quantitative Results

  31. A footnote on these metrics

  32. Under automatic metrics, humans are not great :(

  33. But human evaluation (Mechanical Turk) is quite different

  34. Stochastic or Deterministic? Figure: the stochastic-vs-deterministic pipeline from slide 26, repeated.

  35. Visualizing our learned attention: the good

  36. Visualizing our learned attention: the bad

  37. Other fun things you can do:

  38. A soccer ball...

  39. Two cakes on a plate...

  40. Important previous work

  41. Attention in machine translation. Figure: also from the UdeM lab (Bahdanau et al. 2014) [1]

  42. Attention mechanism in handwritten character generation. Figure: from Graves (2013) [3]

  43. Recently, many more...

  44. Thanks for attending!

  45. Thanks for attending! Code: https://github.com/kelvinxu/arctic-captions

  46. References
  [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  [2] Ali Borji and Laurent Itti. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185–207, 2013.
  [3] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
