Show, Attend and Tell: Neural Image Caption Generation with Visual Attention Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio. ICML 2015 Presented By: Sai Krishna Bollam
Outline
• Introduction
• Model Overview
• Model Details
  • Encoder
  • Decoder
  • Attention
• Experiments
• Results
• Conclusion
Introduction
• Multimodal Machine Learning
  • Relate information from multiple modalities: speech, image, language, etc.
• Scene understanding
• Automatic caption generation
  • Task: given an image, generate a sentence describing it
  • Combines Object Detection and Machine Translation
  • Image-to-language translation
[Figure: two example images with generated captions: "A woman throwing a frisbee in a park." and "A bird flying over a body of water."]
Model Overview
• Encoder-decoder framework
  • Analogous to translation, but the encoder output is not a single vector
• Learn alignments from scratch
  • Instead of a joint object-text embedding
• Using attention over low-level feature maps
Bahdanau et al. (2014)
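Model Details: Encoder
• Model:
  • Input: raw image
  • Output: sequence of C words from a vocabulary of size K
    $\mathbf{y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_C\}, \quad \mathbf{y}_i \in \mathbb{R}^K$
• Encoder: Convolutional Neural Network
  • Input: raw image
  • Output: multiple feature vectors (annotation vectors) from lower conv layers
    $\mathbf{a} = \{\mathbf{a}_1, \ldots, \mathbf{a}_L\}, \quad \mathbf{a}_i \in \mathbb{R}^D$
[Figure: encoder diagram with labels Image, Conv NN, FC, and L x D features]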
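A minimal sketch of how a convolutional feature map becomes the L x D set of annotation vectors. It assumes NumPy and the 14 x 14 x 512 map size quoted on the Experiments slide; the random array is only a stand-in for real CNN activations.

```python
import numpy as np

# Stand-in for a conv feature map (height x width x channels). Real values
# would come from the CNN encoder; the 14 x 14 x 512 shape is an assumption
# taken from the Experiments slide.
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((14, 14, 512))

# Flatten the spatial grid: each of the L = 14 * 14 locations becomes one
# D-dimensional annotation vector a_i.
H, W, D = feature_map.shape
annotations = feature_map.reshape(H * W, D)  # shape (L, D)

print(annotations.shape)
```

Row i of `annotations` corresponds to grid location (i // W, i % W), so the attention weights over rows can be reshaped back to the grid for visualization.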
Model Details: Decoder
• LSTM network
• $\hat{\mathbf{z}}_t$: context vector
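A rough sketch of one decoder step, assuming a standard LSTM cell whose input is the concatenation of the previous word embedding, the previous hidden state, and the context vector. The function name `lstm_step`, the single weight matrix `W`, and the toy sizes are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(prev_word_emb, h_prev, c_prev, z_hat, W, b):
    """One LSTM step conditioned on the context vector z_hat (hypothetical helper)."""
    n = h_prev.shape[0]
    # Concatenate previous word embedding, previous hidden state, context vector.
    x = np.concatenate([prev_word_emb, h_prev, z_hat])
    gates = W @ x + b                   # shape (4 * n,)
    i = sigmoid(gates[0:n])             # input gate
    f = sigmoid(gates[n:2 * n])         # forget gate
    o = sigmoid(gates[2 * n:3 * n])     # output gate
    g = np.tanh(gates[3 * n:4 * n])     # candidate memory
    c = f * c_prev + i * g              # new memory state
    h = o * np.tanh(c)                  # new hidden state
    return h, c

# Toy sizes (assumptions): embedding m, hidden n, context D.
m, n, D = 8, 16, 32
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * n, m + n + D)) * 0.1
b = np.zeros(4 * n)
h, c = lstm_step(rng.standard_normal(m), np.zeros(n), np.zeros(n),
                 rng.standard_normal(D), W, b)
print(h.shape, c.shape)
```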
Model Details: Decoder (Context Vector)
• Context vector ($\hat{\mathbf{z}}_t$): a dynamic representation of the relevant part of the image at time $t$
  $\hat{\mathbf{z}}_t = \phi(\{\mathbf{a}_i\}, \{\alpha_i\})$
  • $\{\mathbf{a}_i\}$: annotation vectors
  • $\{\alpha_i\}$: attention weights, calculated using $f_{att}$
• $f_{att}$: attention model, an MLP conditioned on the previous hidden state
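A minimal sketch of how the attention weights might be computed. The slide only says that f_att is an MLP conditioned on the previous hidden state; the single tanh hidden layer and the parameter names (`W_a`, `W_h`, `b`, `w`) are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)          # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def attention_weights(a, h_prev, W_a, W_h, b, w):
    """Score every annotation vector against h_prev, then normalize.

    The one-hidden-layer form of f_att here is an assumption.
    """
    hidden = np.tanh(a @ W_a + h_prev @ W_h + b)   # (L, k)
    scores = hidden @ w                            # (L,) one score per location
    return softmax(scores)                         # alpha_i, sums to 1

# Toy sizes (assumptions): L locations, D features, n hidden units, k MLP units.
L, D, n, k = 196, 512, 64, 32
rng = np.random.default_rng(0)
a = rng.standard_normal((L, D))
h_prev = rng.standard_normal(n)
W_a = rng.standard_normal((D, k)) * 0.01
W_h = rng.standard_normal((n, k)) * 0.01
b = np.zeros(k)
w = rng.standard_normal(k)

alpha = attention_weights(a, h_prev, W_a, W_h, b, w)
print(alpha.shape, alpha.sum())
```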
Model Representation
[Figure: Image (H x W x 3) → CNN → features (L x D). Hidden states h0, h1, h2 each emit a distribution over the L locations (a1, a2, a3) and a distribution over the vocabulary (d1, d2). The weighted combination of features (z1, z2, D-dimensional) and the words (y1 is the first word, then y2) are fed to the next step.]
Based on CS231n by Fei-Fei Li, Justin Johnson & Serena Yeung
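Attention Mechanism: Stochastic Attention
Stochastic "Hard" Attention
• At every time step, focus on exactly one location ($\mathbf{a}_i$)
• $s_{t,i} = 1$ iff the $i$-th location is used to extract visual features
  $p(s_{t,i} = 1 \mid s_{j<t}, \mathbf{a}) = \alpha_{t,i} = \mathrm{softmax}(f_{att}(\mathbf{a}_i, \mathbf{h}_{t-1}))$
  $\hat{\mathbf{z}}_t = \sum_i s_{t,i} \mathbf{a}_i$
• Sample $\mathbf{a}_i$ from a Multinoulli distribution parameterized by $\alpha_{t,i}$
• Lower bound on the log-likelihood:
  $L_s = \sum_s p(s \mid \mathbf{a}) \log p(\mathbf{y} \mid s, \mathbf{a}) \le \log \sum_s p(s \mid \mathbf{a}) \, p(\mathbf{y} \mid s, \mathbf{a}) = \log p(\mathbf{y} \mid \mathbf{a})$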
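A small sketch of hard attention at one time step: draw a single location from the attention distribution and use its annotation vector as the context. The second half checks the lower bound numerically on a toy example. The attention weights and the per-location likelihoods `p_y_given_s` are made-up numbers for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 5, 3
a = rng.standard_normal((L, D))               # toy annotation vectors
alpha = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # toy attention weights (assumed)

# Hard attention: sample exactly one location from the Multinoulli
# distribution defined by alpha, encoded as a one-hot vector s_t.
loc = rng.choice(L, p=alpha)
s_t = np.zeros(L)
s_t[loc] = 1.0
z_hat = s_t @ a                               # equals a[loc]

# Toy check of the bound: hypothetical values of p(y | s, a) per location.
p_y_given_s = np.array([0.05, 0.3, 0.6, 0.2, 0.1])
lower_bound = np.sum(alpha * np.log(p_y_given_s))   # L_s
log_marginal = np.log(np.sum(alpha * p_y_given_s))  # log p(y | a)
print(lower_bound <= log_marginal)
```

The inequality is Jensen's inequality applied to the concave log function, so it holds for any choice of weights and likelihoods.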
Attention Mechanism: Stochastic Attention
• Monte Carlo based sampling approximation of the gradient
• Equivalent to the REINFORCE learning rule
[Image credit: Wikipedia]
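A toy illustration of the score-function (REINFORCE-style) estimator, under simplifying assumptions: the attention logits are the only parameters, and the log-likelihood of the caption for each location is a fixed, made-up number. In the full model the log-likelihood also depends on the parameters, which adds a second gradient term not shown here.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([0.5, -1.0, 2.0, 0.0])           # toy attention scores
log_lik = np.log(np.array([0.2, 0.1, 0.6, 0.3]))   # hypothetical log p(y | s=i, a)
alpha = softmax(logits)

# Exact gradient of sum_i alpha_i * log_lik_i with respect to the logits.
exact_grad = alpha * (log_lik - np.dot(alpha, log_lik))

# Monte Carlo estimate: sample locations, weight the score function
# d log p(s) / d logits = one_hot(s) - alpha by the sampled log-likelihood.
N = 200_000
samples = rng.choice(len(alpha), size=N, p=alpha)
one_hot = np.eye(len(alpha))[samples]              # (N, L)
score = one_hot - alpha
mc_grad = (log_lik[samples][:, None] * score).mean(axis=0)

print(np.round(exact_grad, 3))
print(np.round(mc_grad, 3))
```

The estimate is unbiased but noisy, which is why variance-reduction tricks such as a baseline are commonly paired with this kind of estimator.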
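Attention Mechanisms: Deterministic Attention
Deterministic "Soft" Attention
• Expectation of the context vector, instead of sampling. Differentiable!
  $\mathbb{E}_{p(s_t \mid \mathbf{a})}[\hat{\mathbf{z}}_t] = \sum_{i=1}^{L} \alpha_{t,i} \mathbf{a}_i$
• Soft attention weighted vector:
  $\phi(\{\mathbf{a}_i\}, \{\alpha_i\}) = \sum_{i}^{L} \alpha_i \mathbf{a}_i$
• Normalized Weighted Geometric Mean (NWGM) approximation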
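A short sketch of the soft context vector as a weighted sum of annotation vectors, with a numerical check that it matches the average of many hard-attention samples. The array sizes and attention weights are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 4, 3
a = rng.standard_normal((L, D))          # toy annotation vectors
alpha = np.array([0.1, 0.2, 0.3, 0.4])   # toy attention weights (assumed)

# Soft attention: the expected context vector, a weighted sum over locations.
z_soft = alpha @ a                       # shape (D,)

# Averaging many hard samples approaches the same vector.
samples = rng.choice(L, size=100_000, p=alpha)
z_hard_mean = a[samples].mean(axis=0)

print(np.round(z_soft, 3))
print(np.round(z_hard_mean, 3))
```

Because `z_soft` is a smooth function of `alpha`, gradients flow through it directly, which is what makes standard backpropagation applicable.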
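Attention Mechanisms: Deterministic Attention
Doubly Stochastic Attention
• Introduce regularization: $\sum_t \alpha_{t,i} \approx 1$
  • Encourages the model to pay equal attention to every part of the image over time
• Gating scalar predicted from the previous hidden state:
  $\beta_t = \sigma(f_\beta(\mathbf{h}_{t-1}))$
  $\phi(\{\mathbf{a}_i\}, \{\alpha_i\}) = \beta \sum_{i}^{L} \alpha_i \mathbf{a}_i$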
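A sketch of one way to express the regularizer and the gate. The squared-error form of the penalty, the weight `lam`, and the linear stand-in for f_beta are assumptions made for illustration; the slide itself only states the constraint and the gate equation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def doubly_stochastic_penalty(alphas, lam=1.0):
    """Penalize locations whose total attention over time differs from 1.

    alphas: (C, L) array, one attention distribution per time step.
    lam:    hypothetical regularization weight.
    """
    per_location_total = alphas.sum(axis=0)          # sum over time steps
    return lam * np.sum((1.0 - per_location_total) ** 2)

# Three time steps that each attend to a different location: no penalty.
uniform_cover = doubly_stochastic_penalty(np.eye(3))
# Three time steps that all attend to location 0: penalized.
stuck = doubly_stochastic_penalty(np.array([[1.0, 0.0, 0.0]] * 3))
print(uniform_cover, stuck)

# Gated soft context: beta scales the weighted sum of annotation vectors.
rng = np.random.default_rng(0)
a = rng.standard_normal((3, 4))                      # toy annotation vectors
alpha = np.array([0.2, 0.5, 0.3])
h_prev = rng.standard_normal(6)
w_beta = rng.standard_normal(6)                      # linear f_beta (assumption)
beta = sigmoid(np.dot(w_beta, h_prev))
z_hat = beta * (alpha @ a)
```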
Attention Mechanisms
[Figure: Image (H x W x 3) → CNN → grid of features a, b, c, d (each D-dimensional). From the RNN: distribution over grid locations $p_a, p_b, p_c, p_d$ with $p_a + p_b + p_c + p_d = 1$. Output: context vector z (D-dimensional).]
Soft attention: summarize ALL locations
• $z = p_a a + p_b b + p_c c + p_d d$
• Derivative dz/dp is nice!
• Train with gradient descent
Hard attention: sample ONE location according to p
• z = that vector
• With argmax, dz/dp is zero almost everywhere
• Can't use gradient descent; need reinforcement learning
Based on CS231n by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Visualizing Attention
[Figure: attention visualizations; panels: Soft attention, Hard attention]
Experiments
Encoder:
• Oxford VGGnet, pretrained on ImageNet
• Feature maps from the 4th conv layer, before pooling
• 14 x 14 x 512, flattened to 196 x 512 (L x D)
Training:
• Flickr8k: RMSProp
• Flickr30k/MS COCO: Adam
• Dropout
• Early stopping on BLEU
• Batching by sentence lengths
Datasets:
• Flickr8k: 8,000 images
• Flickr30k: 30,000 images
• MS COCO: 82,783 images
• 5 reference sentences per image
• Vocabulary: 10,000 words
Metrics:
• BLEU-1, 2, 3, 4 (no brevity penalty)
• METEOR
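To make "BLEU without brevity penalty" concrete, here is a simplified sentence-level sketch: clipped n-gram precisions combined by a geometric mean with uniform weights. The function names are hypothetical, and published scores are normally computed at corpus level with standard tooling, so this is only an illustration of the idea.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Each candidate n-gram is credited at most as often as it appears
    # in any single reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu_no_bp(candidate, references, max_n):
    """Geometric mean of precisions for n = 1..max_n, with no brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "a bird flying over water".split()
references = ["a bird flying over a body of water".split()]
bleu1 = bleu_no_bp(candidate, references, 1)
bleu2 = bleu_no_bp(candidate, references, 2)
print(bleu1, bleu2)
```

Without the brevity penalty, a short caption that only uses words found in the references is not penalized for its length, which is worth keeping in mind when comparing scores across papers.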
Results
Key Points
• Learn latent alignments from scratch
• Better context to decoder
• Attends to non-object regions
• Joint representation
• Visualizing attention to interpret functioning
• Stochastic Attention
• Deterministic Attention
Thank you
Questions?