Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio. ICML 2015 Presented By: Sai Krishna Bollam

Outline
• Introduction
• Model Overview
• Model Details
  • Encoder
  • Decoder
  • Attention
• Experiments
• Results
• Conclusion

Introduction
• Multimodal Machine Learning
  • Relate information from multiple modalities: speech, image, language, etc.
• Scene understanding
• Automatic caption generation
  • Task: Given an image, generate a sentence describing it
  • Object Detection and Machine Translation
  • Image to Language translation
Example captions: "A woman throwing a frisbee in a park." "A bird flying over a body of water."

Model Overview
• Encoder-Decoder framework
  • Analogous to translation (Bahdanau et al., 2014), but the encoder output is not a single vector
• Learn alignments from scratch
  • Using attention over low-level feature maps
  • Instead of a joint object-text embedding

Model Details: Encoder
• Model:
  • Input: Raw image
  • Output: Sequence of C words from a vocabulary of size K: $\mathbf{y} = \{y_1, \dots, y_C\},\ y_i \in \mathbb{R}^K$
• Encoder: Convolutional Neural Network
  • Input: Raw image
  • Output: multiple feature vectors (annotation vectors) from lower conv layers: $\mathbf{a} = \{\mathbf{a}_1, \dots, \mathbf{a}_L\},\ \mathbf{a}_i \in \mathbb{R}^D$
[Figure: CNN diagram with labels Image, Conv, FC, NN, and L x D]

Model Details: Decoder
• LSTM Network
• $\hat{\mathbf{z}}_t$: Context vector

Model Details: Decoder – Context Vector
• Context vector ($\hat{\mathbf{z}}_t$): a dynamic representation of the relevant part of the image at time $t$
• $\hat{\mathbf{z}}_t = \phi(\{\mathbf{a}_i\}, \{\alpha_i\})$, where $\mathbf{a}_i$ are the annotation vectors and $\alpha_i$ are the attention weights, calculated using $f_{att}$
• $f_{att}$: attention model, an MLP conditioned on the previous hidden state
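The slide only says that the attention model is an MLP conditioned on the previous hidden state. As a rough sketch of how the weights $\alpha_i$ could be computed, the code below assumes a one-hidden-layer scoring form $e_i = v \cdot \tanh(W_a \mathbf{a}_i + W_h \mathbf{h}_{t-1})$ followed by a softmax. The names `W_a`, `W_h`, `v` and all sizes are hypothetical choices for illustration, not taken from the slides.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(annotations, h_prev, W_a, W_h, v):
    """Score each annotation vector a_i against h_{t-1}, then normalise.

    Assumed (hypothetical) scoring form: e_i = v . tanh(W_a a_i + W_h h_prev).
    """
    def matvec(W, x):
        return [sum(w * xj for w, xj in zip(row, x)) for row in W]

    proj_h = matvec(W_h, h_prev)  # shared across all locations
    scores = []
    for a_i in annotations:
        proj_a = matvec(W_a, a_i)
        hidden = [math.tanh(pa + ph) for pa, ph in zip(proj_a, proj_h)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    return softmax(scores)

# Toy sizes: L locations, D feature dims, H hidden dims, M MLP width.
rng = random.Random(0)
L, D, H, M = 4, 3, 5, 6
annotations = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(L)]
h_prev = [rng.uniform(-1, 1) for _ in range(H)]
W_a = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(M)]
W_h = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(M)]
v = [rng.uniform(-1, 1) for _ in range(M)]

alpha = attention_weights(annotations, h_prev, W_a, W_h, v)
print(alpha)  # one positive weight per location, summing to 1
```

Because the weights depend on the previous hidden state, they are recomputed at every decoding step, which is what makes the context vector dynamic.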

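Model Representation
[Figure: Image (H x W x 3) → CNN → features (L x D). Hidden states h0, h1, h2 each produce a distribution over the L locations (a1, a2, a3) and a distribution over the vocabulary (d1, d2). A weighted combination of features gives the weighted features z1, z2 (dimension D), which are fed in together with the words y1, y2, where y1 is the first word.]
Based on CS231n by Fei-Fei Li, Justin Johnson & Serena Yeung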

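Attention Mechanism: Stochastic Attention
Stochastic “Hard” Attention
• At every time step, focus on exactly one location ($\mathbf{a}_i$)
• $s_{t,i} = 1$ iff the $i$-th location is used to extract visual features
• $p(s_{t,i} = 1 \mid s_{j<t}, \mathbf{a}) = \alpha_{t,i} = \mathrm{softmax}(f_{att}(\mathbf{a}_i, \mathbf{h}_{t-1}))$
• $\hat{\mathbf{z}}_t = \sum_i s_{t,i}\, \mathbf{a}_i$
• Sample $\mathbf{a}_i$ based on a Multinoulli distribution
• $L_s = \sum_s p(s \mid \mathbf{a}) \log p(\mathbf{y} \mid s, \mathbf{a}) \le \log \sum_s p(s \mid \mathbf{a})\, p(\mathbf{y} \mid s, \mathbf{a}) = \log p(\mathbf{y} \mid \mathbf{a})$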
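To make the sampling step concrete, here is a minimal sketch of drawing a single location from the Multinoulli (categorical) distribution given by the attention weights and using that one annotation vector as the context. The weights and annotation vectors are made-up toy values, and `sample_hard_attention` is a hypothetical helper name.

```python
import random

def sample_hard_attention(alpha, annotations, rng):
    """Draw one location index from the categorical distribution alpha.

    Returns (index, context), where context is the single selected
    annotation vector, i.e. sum_i s_i * a_i with a one-hot s.
    """
    idx = rng.choices(range(len(alpha)), weights=alpha, k=1)[0]
    return idx, annotations[idx]

rng = random.Random(0)
alpha = [0.1, 0.6, 0.2, 0.1]                      # toy attention weights
annotations = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

idx, z_hat = sample_hard_attention(alpha, annotations, rng)

# Over many draws, each location is picked roughly in proportion to alpha.
n_draws = 10000
counts = [0] * len(alpha)
for _ in range(n_draws):
    i, _ctx = sample_hard_attention(alpha, annotations, rng)
    counts[i] += 1
freqs = [c / n_draws for c in counts]
print(idx, z_hat, freqs)
```

The selection itself is a discrete choice, so no gradient flows through it; that is why this variant needs a sampling-based training rule.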

Attention Mechanism: Stochastic Attention
• Monte Carlo based sampling approximation
• REINFORCE learning rule
[Figure: image credited to Wikipedia]
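The idea behind the Monte Carlo / REINFORCE estimate can be illustrated on a toy problem: estimate the gradient of an expected reward under a softmax distribution from samples only, using the score-function identity, and compare it with the exact gradient. In this sketch the fixed `reward` list is a stand-in for the log-likelihood of the caption given a sampled location, and the constant `baseline` is an assumed variance-reduction term; neither is taken from the slides.

```python
import math
import random

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(theta, reward):
    """d/d theta_k of E_{s ~ softmax(theta)}[reward[s]] = p_k * (r_k - E[r])."""
    p = softmax(theta)
    expected = sum(pk * rk for pk, rk in zip(p, reward))
    return [pk * (rk - expected) for pk, rk in zip(p, reward)]

def reinforce_gradient(theta, reward, n_samples, rng, baseline=0.0):
    """Score-function (REINFORCE) estimate of the same gradient."""
    p = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        s = rng.choices(range(len(p)), weights=p, k=1)[0]
        advantage = reward[s] - baseline
        for k in range(len(theta)):
            # For a softmax, d log p(s) / d theta_k = 1[k == s] - p_k.
            grad[k] += advantage * ((1.0 if k == s else 0.0) - p[k])
    return [g / n_samples for g in grad]

theta = [0.0, 0.0, 0.0]
reward = [1.0, 2.0, 3.0]   # toy stand-in for log p(y | s, a)
rng = random.Random(0)

exact = exact_gradient(theta, reward)
estimate = reinforce_gradient(theta, reward, 20000, rng, baseline=2.0)
print(exact)
print(estimate)
```

Subtracting a baseline leaves the expected value of the estimate unchanged but lowers its variance, which matters because the sampled gradient is noisy.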

Attention Mechanisms: Deterministic Attention
Deterministic “Soft” Attention
• Expectation of the context vector, instead of sampling. Differentiable!
• $\mathbb{E}_{p(s_t \mid \mathbf{a})}[\hat{\mathbf{z}}_t] = \sum_{i=1}^{L} \alpha_{t,i}\, \mathbf{a}_i$
• Soft attention weighted vector: $\phi(\{\mathbf{a}_i\}, \{\alpha_i\}) = \sum_{i}^{L} \alpha_i\, \mathbf{a}_i$
• Normalized Weighted Geometric Mean
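The soft context vector is just the attention-weighted sum of the annotation vectors. A minimal sketch with made-up toy values (`soft_context` is a hypothetical helper name):

```python
def soft_context(alpha, annotations):
    """Return sum_i alpha_i * a_i, computed per feature dimension."""
    dim = len(annotations[0])
    return [sum(w * a[d] for w, a in zip(alpha, annotations)) for d in range(dim)]

alpha = [0.1, 0.6, 0.2, 0.1]                      # toy attention weights, sum to 1
annotations = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

z_hat = soft_context(alpha, annotations)
print(z_hat)  # approximately [0.4, 1.3]
```

Each output component is linear in the weights, so the derivative of the context with respect to $\alpha_i$ is simply $\mathbf{a}_i$, and the whole model can be trained with ordinary backpropagation.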

Attention Mechanisms: Deterministic Attention
Doubly Stochastic Attention
• Introduce regularization: $\sum_t \alpha_{t,i} \approx 1$
  • Encourages the model to pay equal attention to every part of the image over time
• $\beta_t = \sigma(f_\beta(\mathbf{h}_{t-1}))$
• $\phi(\{\mathbf{a}_i\}, \{\alpha_i\}) = \beta \sum_{i}^{L} \alpha_i\, \mathbf{a}_i$
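One way to encourage $\sum_t \alpha_{t,i} \approx 1$ is a squared penalty on how far each location's total attention over the caption deviates from 1, added to the training loss. The sketch below assumes the form $\lambda \sum_i (1 - \sum_t \alpha_{t,i})^2$; the weight `lam` and the toy attention matrices are illustrative assumptions.

```python
def doubly_stochastic_penalty(alphas, lam=1.0):
    """alphas[t][i] is the weight on location i at time step t.

    Returns lam * sum_i (1 - sum_t alphas[t][i]) ** 2.
    """
    n_locations = len(alphas[0])
    totals = [sum(step[i] for step in alphas) for i in range(n_locations)]
    return lam * sum((1.0 - total) ** 2 for total in totals)

# Two time steps, two locations.
spread = [[0.5, 0.5], [0.5, 0.5]]   # every location receives total attention 1
peaked = [[1.0, 0.0], [1.0, 0.0]]   # location 0 gets everything, location 1 nothing

penalty_spread = doubly_stochastic_penalty(spread)
penalty_peaked = doubly_stochastic_penalty(peaked)
print(penalty_spread, penalty_peaked)
```

The penalty is zero when attention is spread so that every location is covered once over the whole caption, and grows when the model keeps staring at the same region.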

Attention Mechanisms
[Figure: Image (H x W x 3) → CNN → grid of features a, b, c, d (each D-dimensional). From the RNN: a distribution over grid locations $p_a, p_b, p_c, p_d$ with $p_a + p_b + p_c + p_d = 1$. Output: context vector z (D-dimensional).]
Soft attention: Summarize ALL locations
• $z = p_a a + p_b b + p_c c + p_d d$
• Derivative $dz/dp$ is nice!
• Train with gradient descent
Hard attention: Sample ONE location according to p, z = that vector
• With argmax, $dz/dp$ is zero almost everywhere
• Can’t use gradient descent; need reinforcement learning
Based on CS231n by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Visualizing Attention
[Figure: attention visualizations for soft attention and hard attention]

Experiments
Encoder:
• Oxford VGGnet, pretrained on ImageNet
• Feature maps from the 4th conv layer before pooling: 14x14x512, flattened to 196 x 512 (L x D)
Training:
• Flickr8k: RMSProp
• Flickr30k/MS COCO: Adam
• Dropout
• Early stopping on BLEU
• Batching by sentence lengths
Datasets:
• Flickr8k: 8,000
• Flickr30k: 30,000
• MS COCO: 82,783
• 5 reference sentences per image
• Vocabulary: 10,000 words
Metrics:
• BLEU-1, 2, 3, 4 (no brevity penalty)
• METEOR
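The flattening of the 14x14x512 feature map into 196 annotation vectors of dimension 512 can be sketched as follows. The row-major ordering and the placeholder feature values are assumptions made for illustration; the slides do not specify the ordering.

```python
def flatten_feature_map(fmap):
    """Turn an H x W x D nested list into an (H*W) x D list, row by row."""
    return [vec for row in fmap for vec in row]

H, W, D = 14, 14, 512
# Placeholder features: every entry of the vector at (r, c) holds r * W + c.
fmap = [[[float(r * W + c)] * D for c in range(W)] for r in range(H)]

annotations = flatten_feature_map(fmap)
print(len(annotations), len(annotations[0]))  # 196 512
```

Each of the 196 rows corresponds to one spatial location of the grid, so an attention weight over rows is an attention weight over image regions.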

Results

Key Points
• Learn latent alignments from scratch
• Better context to decoder
• Attends to non-object regions
• Joint representation
• Visualizing attention to interpret functioning
• Stochastic Attention
• Deterministic Attention

Thank you
Questions?
