Tutorial on Recent Advances in Visual Captioning
Luowei Zhou, 06/15/2020
Outline
• Problem Overview
• Visual Captioning Taxonomy
• Image Captioning
• Datasets and Evaluation
• Video Description
• Grounded Caption Generation
• Dense Caption Generation
• Conclusion
• Q&A
Problem Overview
• Visual Captioning: describe the content of an image or video with a natural language sentence.
• Example captions: "A cat is sitting next to a pine tree, looking up." / "A dog is playing piano with a girl."
Cat image is free to use under the Pixabay License. Dog video is free to use under the Creative Commons license.
Applications of Visual Captioning
• Alt-text generation (e.g., in PowerPoint)
• Content-based image retrieval (CBIR)
• Or just for fun!
A fun video of a visual captioning model running in real time, made by Kyle McDonald. Source: https://vimeo.com/146492001
Visual Captioning Taxonomy
• Methods
  – Template-based
  – Retrieval-based
  – Deep Learning-based: CNN-LSTM, Soft Attention, Region Attention, Semantic Attention, Transformer-based
• Visual Domain
  – Image captioning
  – Video captioning: short clips, long videos
Image Captioning with CNN-LSTM ("Show and Tell")
• Problem formulation: the encoder-decoder framework
• A visual encoder maps the image to features; a language decoder generates the sentence, e.g., "Cat sitting outside".
Image credit: Vinyals et al. "Show and Tell: A Neural Image Caption Generator", CVPR 2015.
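The encoder-decoder interface above can be sketched in a few lines. This is a minimal toy sketch, not the actual "Show and Tell" model: the `encode`/`decode` helpers, weight shapes, and the single-linear-step "decoder" are all illustrative stand-ins (a real system would use a pretrained CNN and a trained LSTM).

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image):
    """Stand-in visual encoder: a real model would run a CNN and return
    a pooled activation; here we just average pixels per channel."""
    return image.mean(axis=(0, 1))        # (channels,) global feature

def decode(feature, W_in, W_out, vocab, max_len=4):
    """Stand-in language decoder: a real model would be an LSTM conditioned
    on the image feature; a single linear step per word shows the shapes."""
    s = np.tanh(feature @ W_in)           # init decoder state from the image
    words = []
    for _ in range(max_len):
        idx = int(np.argmax(s @ W_out))   # greedy word choice
        words.append(vocab[idx])
        s = np.tanh(s + 0.1)              # toy (untrained) state update
    return words

image = rng.normal(size=(32, 32, 3))      # fake image
W_in = rng.normal(size=(3, 8))            # random, untrained weights
W_out = rng.normal(size=(8, 4))
vocab = ["cat", "sitting", "outside", "[STOP]"]
caption = decode(encode(image), W_in, W_out, vocab)
```

The point is the division of labor: the encoder compresses the image into a feature vector, and the decoder turns that vector into a word sequence.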
Image Captioning with Soft Attention
• Soft attention: dynamically attend to input content based on a query.
• Basic elements: query 𝑞, keys 𝑘, and values 𝑣.
• In our case, keys and values are usually identical: they come from the CNN activation map.
• The query 𝑞 is derived from the global image feature or the LSTM's hidden state.
Bahdanau et al. "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015. Xu et al. "Show, Attend and Tell", ICML 2015.
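The query/key/value mechanics can be made concrete with a small sketch of additive (Bahdanau-style) attention. All names and weight shapes here are illustrative assumptions; the weights are random and untrained, so only the computation pattern matters.

```python
import numpy as np

def soft_attention(query, keys, values, Wq, Wk, w):
    """Additive (Bahdanau-style) soft attention.
    query: (d_q,) decoder state; keys/values: (N, d) CNN features."""
    # Alignment scores: a small MLP scores each key against the query.
    scores = np.tanh(query @ Wq + keys @ Wk) @ w        # (N,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax -> (N,)
    context = weights @ values                          # weighted sum of values
    return context, weights

# Toy example: a 3x3 CNN grid flattened to N=9 keys of dim 4.
rng = np.random.default_rng(0)
h = rng.normal(size=(9, 4))          # keys == values (CNN activation map)
s = rng.normal(size=(6,))            # query: decoder hidden state
Wq = rng.normal(size=(6, 8))
Wk = rng.normal(size=(4, 8))
w = rng.normal(size=(8,))
c, a = soft_attention(s, h, h, Wq, Wk, w)
```

Note that keys and values are the same array here, matching the bullet above: both come from the CNN activation map.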
Image Captioning with Soft Attention (step by step)
1. Use a CNN to compute a grid of features h_{i,j} for the image; initialize the decoder state s_0.
2. Alignment scores: e_{t,i,j} = f_att(s_{t-1}, h_{i,j})
3. Attention weights: a_{t,:,:} = softmax(e_{t,:,:})
4. Context vector: c_t = Σ_{i,j} a_{t,i,j} h_{i,j}
5. Decoder step: the new state s_t and output word y_t are computed from s_{t-1}, c_t, and the previous word y_{t-1} (starting from y_0 = [START]).
6. Repeat until [STOP]: each timestep of the decoder uses a different context vector that looks at different parts of the input image. Example output: "cat sitting outside [STOP]".
Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: "Show, Attend and Tell" by Xu et al., ICML 2015.
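The full decode loop above can be sketched end to end. This is a toy greedy decoder under stated assumptions: the bilinear score `f_att(s, h) = sᵀ W h`, the simplified tanh state update, and all weights are illustrative stand-ins for a trained LSTM captioner, so the emitted words are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["[START]", "cat", "sitting", "outside", "[STOP]"]
V, D = len(vocab), 8

# Stand-in parameters (random, untrained) -- the shapes are what matter.
H = rng.normal(size=(9, D))            # 3x3 CNN feature grid, flattened
E = rng.normal(size=(V, D))            # word embeddings
W_att = rng.normal(size=(D, D))        # bilinear score f_att(s, h)
W_s = rng.normal(size=(3 * D, D)) * 0.1   # state update from [s, y, c]
W_out = rng.normal(size=(D, V))        # output projection

s = np.zeros(D)                        # s_0
y = E[0]                               # embedding of [START]
caption = []
for t in range(1, 6):
    e = H @ (W_att.T @ s)              # e_{t,i,j} = f_att(s_{t-1}, h_{i,j})
    a = np.exp(e - e.max())
    a /= a.sum()                       # a_t = softmax(e_t)
    c = a @ H                          # c_t = sum_{i,j} a_{t,i,j} h_{i,j}
    s = np.tanh(np.concatenate([s, y, c]) @ W_s)   # simplified RNN step
    idx = int(np.argmax(s @ W_out))    # greedy word choice y_t
    caption.append(vocab[idx])
    if vocab[idx] == "[STOP]":
        break
    y = E[idx]
```

The key property from the slides is visible in the loop: a fresh context vector c_t is recomputed at every timestep, so each word can look at a different part of the image.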
Image Captioning with Region Attention
• Variants of soft attention differ in the feature input:
  – Grid activation features (covered above)
  – Region proposal features, e.g., from Faster R-CNN
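The switch from grid features to region features leaves the attention machinery unchanged: only the set of vectors being attended over differs. A minimal sketch with random stand-in features (a real pipeline would take the 36 region vectors from a Faster R-CNN detector, which this toy array only imitates):

```python
import numpy as np

def attend(query, feats):
    """Dot-product soft attention over a set of feature vectors.
    feats: (N, d) -- N = H*W grid cells, or N = number of region proposals."""
    e = feats @ query                  # alignment scores (N,)
    a = np.exp(e - e.max())
    a /= a.sum()                       # softmax attention weights
    return a @ feats                   # context vector (d,)

rng = np.random.default_rng(0)
q = rng.normal(size=(5,))              # decoder-state query
grid = rng.normal(size=(49, 5))        # 7x7 grid activations, flattened
regions = rng.normal(size=(36, 5))     # stand-in for 36 region features
c_grid = attend(q, grid)
c_reg = attend(q, regions)
```

Because attention is a weighted sum over a set, the number of features N can vary per image, which is exactly what region proposals require.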
Image Captioning with "Fancier" Attention
• Semantic attention: attend over visual attributes.
• Adaptive attention: knowing when (and when not) to attend to the image.
You et al. "Image Captioning with Semantic Attention", CVPR 2016. Yao et al. "Boosting Image Captioning with Attributes", ICCV 2017. Lu et al. "Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning", CVPR 2017.
Image Captioning with "Fancier" Attention
• Attention on Attention
• X-Linear Attention: spatial and channel-wise bilinear attention
Huang et al. "Attention on Attention for Image Captioning", ICCV 2019. Pan et al. "X-Linear Attention Networks for Image Captioning", CVPR 2020.
Image Captioning with "Fancier" Attention
• Hierarchy Parsing and GCNs: hierarchical tree structure in the image
• Auto-Encoding Scene Graphs: scene graphs in both image and text
Yao et al. "Hierarchy Parsing for Image Captioning", ICCV 2019. Yang et al. "Auto-Encoding Scene Graphs for Image Captioning", CVPR 2019.