Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu*, Jimmy Ba†, Ryan Kiros†, Kyunghyun Cho*, Aaron Courville*, Ruslan Salakhutdinov†, Richard Zemel†, Yoshua Bengio*
*Université de Montréal / †University of Toronto
(some figures from Hugo Larochelle)
July 8, 2015
Caption generation is another building block
Figure: levels of visual understanding over time, from low-level features and descriptors (~90 ms), through detection, segmentation, shape, and tracking (~150 ms), up to high-level understanding of causality, functionality, social roles, goals, and intentions (~1 sec); adapted from a figure by Fei-Fei Li
What our model does:
Figure: A bird flying over a body of water.
Overview
◮ Recent work in image caption generation: encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others)
◮ Motivating our architecture from human attention
◮ Our proposed attention model: model description, quantitative/qualitative results
This talk: Recent work in image caption generation (encoder-decoder models from Berkeley, Google, Stanford, Toronto, and others)
Recent surge of interest in image captioning
◮ Submissions on this topic at CVPR 2015 (from groups at Google, Berkeley, Stanford, Microsoft, etc.)
◮ Inspired by successes in machine translation (Kalchbrenner et al. 2013, Sutskever et al. 2014, Cho et al. 2014)
Theme: use a convnet to condition a recurrent language model
Figure: from Karpathy et al. (2015)
Figure: the Vinyals et al. (2015) model is quite similar
This talk: Motivating our architecture from human attention
What are some things we know about human attention?
(1) human vision is foveated & sequential
◮ Particular parts of an image come to the forefront
◮ It is a sequential decision process (“saccades”, glimpses)
(2) bottom-up input influences
Figure: from Borji and Itti (2013) [2]
(3) top-down task-level control
Figure: from Yarbus (1967)
Summary: useful aspects of attention
◮ foveated visual field (spatial focus)
◮ sequential decision making (temporal dynamics)
◮ bottom-up input influence
◮ top-down modulation by the specific task
This talk: Our proposed attention model (model description, quantitative/qualitative results)
Our proposed attention model
◮ “Low-level” convolutional feature extraction: a = {a_1, a_2, ..., a_L}
◮ Compute the importance of each of these regions: α = {α_1, α_2, ..., α_L}
◮ Combine α and a to represent the image (context: ẑ_t)
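To make the shapes concrete, a minimal sketch assuming VGG-style features (the paper extracts a 14×14×512 convolutional feature map, flattened into L = 196 annotation vectors of dimension D = 512; the random array below is just a stand-in for real CNN output):

```python
import numpy as np

# Stand-in for a CNN feature map: a 14x14 grid of 512-d feature columns.
feature_map = np.random.randn(14, 14, 512)
L, D = 14 * 14, 512
a = feature_map.reshape(L, D)   # annotation vectors a = {a_1, ..., a_L}
```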
A little bit more specific
output = (a, man, is, jumping, into, a, lake, .)
Convolutional feature extraction
output = (a, man, is, jumping, into, a, lake, .)
Figure: a convolutional neural network produces annotation vectors a_j
Given an initial hidden state (predicted from the image)...
output = (a, man, is, jumping, into, a, lake, .)
Figure: a recurrent state h_i sits above the annotation vectors a_j from the convolutional neural network
Predict the “importance” of each region
output = (a, man, is, jumping, into, a, lake, .)
Figure: an attention mechanism maps h_i and the a_j to weights α_j with Σ_j α_j = 1
Combine with annotation vectors...
output = (a, man, is, jumping, into, a, lake, .)
Figure: the attention weights α_j are combined with the annotation vectors a_j (weighted sum)
Feed into next hidden state and predict the next word
output = (a, man, is, jumping, into, a, lake, .)
Figure: the attended context feeds the recurrent state h_i, from which the word y_i is sampled
In the next step, we use the new hidden state
output = (a, man, is, jumping, into, a, lake, .)
Figure: same diagram; the updated h_i drives the next attention step
Continue until end of sequence
output = (a, man, is, jumping, into, a, lake, .)
Figure: same diagram, repeated for each output word
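The loop just walked through, as a short sketch; every *_fn argument below is a hypothetical hook standing in for a trained piece of the model:

```python
def generate_caption(a, init_state_fn, attention_fn, lstm_step_fn, next_word_fn,
                     eos_id=0, max_len=20):
    """Decode a caption from annotation vectors a, shape (L, D)."""
    h, c = init_state_fn(a)                # initial state predicted from the image
    caption = []
    for _ in range(max_len):
        alpha = attention_fn(a, h)         # importance alpha_j of each region
        z = alpha @ a                      # soft context: weighted annotation vectors
        h, c = lstm_step_fn(z, h, c)       # feed the context into the recurrent state
        y = next_word_fn(h, z)             # sample (or argmax) the next word id
        caption.append(y)
        if y == eos_id:                    # stop at the end-of-sequence token
            break
    return caption
```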
The attention is driven by the recurrent state + image
◮ At every time step, compute the importance of each region depending on the top-down + bottom-up signals:
e_{ti} = f_att(a_i, h_{t−1})
α_{ti} = exp(e_{ti}) / Σ_{k=1}^{L} exp(e_{tk})
◮ We use a softmax to constrain these weights to sum to 1
◮ We explore two different ways to use the above distribution to compute a meaningful image representation
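A minimal NumPy sketch of this computation, assuming f_att is a small MLP (the projection weights Wa, Wh, v are hypothetical and would be learned jointly with the rest of the model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(a, h_prev, Wa, Wh, v):
    """a: (L, D) annotation vectors (bottom-up); h_prev: (H,) previous
    recurrent state (top-down); returns alpha_t, shape (L,), summing to 1."""
    e = np.tanh(a @ Wa + h_prev @ Wh) @ v   # e_ti = f_att(a_i, h_{t-1})
    return softmax(e)                        # alpha_ti via the softmax above
```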
Stochastic or Deterministic?
output = (a, man, is, jumping, into, a, lake, .)
Figure: the attention weights α_j can be used either stochastically or deterministically to form the context
Quick note on our decoder: LSTM (Hochreiter & Schmidhuber, 1997)
Figure: an LSTM cell; the input gate i, forget gate f, output gate o, and input modulator each see the context z_t, the previous state h_{t−1}, and the previous word embedding Ey_{t−1}; the memory cell c yields the new state h_t
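A sketch of one decoder step under the same assumptions (W is a hypothetical dict of weight matrices, one per gate; biases omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(Ey_prev, h_prev, c_prev, z, W):
    """Every gate sees the previous word embedding Ey_{t-1}, the previous
    hidden state h_{t-1}, and the attention context z_t."""
    x = np.concatenate([Ey_prev, h_prev, z])   # shared input to all gates
    i = sigmoid(W['i'] @ x)                    # input gate
    f = sigmoid(W['f'] @ x)                    # forget gate
    o = sigmoid(W['o'] @ x)                    # output gate
    g = np.tanh(W['g'] @ x)                    # input modulator
    c = f * c_prev + i * g                     # memory cell update
    h = o * np.tanh(c)                         # new hidden state
    return h, c
```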
Deterministic (Soft) Attention
◮ Feed in an attention-weighted image input:
ẑ_t = Σ_{i=1}^{L} α_{t,i} a_i
◮ This is what A. Graves (2013) / D. Bahdanau et al. (2015) did in handwriting generation / machine translation
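In code, the soft context is a single weighted sum (a sketch, reusing the attention_weights output from above):

```python
def soft_context(alpha, a):
    """Deterministic context z_t = sum_i alpha_{t,i} a_i: the expected
    annotation vector under the attention distribution.
    alpha: (L,) weights summing to 1; a: (L, D) annotation vectors."""
    return alpha @ a   # shape (D,)
```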
Alternatively: Stochastic (Hard) Attention
◮ Sample the attended location s stochastically at every time step
◮ In RL terms, think of the softmax α as a Boltzmann policy; maximize a variational lower bound on the log-likelihood:
L_s = Σ_s p(s|a) log p(y|s, a) ≤ log p(y|a)
∂L_s/∂W ≈ (1/N) Σ_{n=1}^{N} [ ∂ log p(y|s̃^n, a)/∂W + log p(y|s̃^n, a) ∂ log p(s̃^n|a)/∂W ]
◮ This is the estimator of Williams (1992), re-popularized recently by Mnih et al. (2014) and Ba et al. (2015)
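A sketch of the hard variant and its REINFORCE-style gradient estimate; the three callables are hypothetical hooks returning log p(y|s̃, a) and the two gradient terms for a sampled location:

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_attention_step(alpha, a):
    """Sample one region s ~ Categorical(alpha) (the Boltzmann policy)
    and use its annotation vector alone as the context."""
    s = rng.choice(len(alpha), p=alpha)
    return a[s], s

def reinforce_grad(alpha, a, log_lik, grad_log_lik, grad_log_policy, N=10):
    """Monte Carlo estimate of dL_s/dW averaged over N sampled locations."""
    total = 0.0
    for _ in range(N):
        _, s = hard_attention_step(alpha, a)                   # s~n ~ p(s|a)
        total += grad_log_lik(s) + log_lik(s) * grad_log_policy(s)
    return total / N
```

In practice the paper reduces the variance of this estimator with a moving-average baseline and an entropy term, but the above is the core.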
Quantitative Results
A footnote on these metrics
Under automatic metrics, humans are not great :(
But human evaluation (Mechanical Turk) is quite different
Stochastic or Deterministic?
Figure: same diagram as before, contrasting the stochastic and deterministic context
Visualizing our learned attention: the good
Visualizing our learned attention: the bad
Other fun things you can do:
A soccer ball...
Two cakes on a plate...
Important previous work
Attention in machine translation
Figure: also from the UdeM lab (Bahdanau et al. 2014) [1]
Attention mechanism in handwritten character generation
Figure: from Graves (2013) [3]
Recently, many more...
Thanks for attending!
Code: https://github.com/kelvinxu/arctic-captions
References
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[2] Ali Borji and Laurent Itti. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185–207, 2013.
[3] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.