Why Did You Say That? Explaining and Diversifying Captioning Models Kate Saenko VQA Workshop, CVPR, July 26, 2017
Explaining: Top-down saliency guided by captions. http://ai.bu.edu/caption-guided-saliency/ Vasili Ramanishka (Boston University), Abir Das (Boston University), Jianming Zhang (Adobe Research), Kate Saenko (Boston University)
Captioning: "A woman is cutting a piece of meat"
Why did the network say that?
Captioning: "A woman is .. cooking"; "A man is talking about … science"
Which parts of the input made the network say "A woman is cutting a piece of meat"?
Explaining the network's captions. Predicted sentence: "A woman is cutting a piece of meat". Can the network localize the objects it mentions?
Related: Attention layers (Show, Attend and Tell [Xu et al., ICML'15]). "Attention layers" sequentially process regions in a single image; the objective is for the model to learn "where to look" next.
• Soft attention adds a special attention layer to the image captioning model
• Attention is only spatial or only temporal; spatio-temporal attention is hard
• Can we get salient regions without adding such layers?
Key idea: probe the network with a small part of the input. Feed a single frame or region through the same encoder-decoder network and read off P(word) for each word of the caption.
• No need for a special attention layer
• Spatio-temporal attention comes for free
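To make the probing idea concrete, one rough formulation (a sketch consistent with the normalization step described later, not necessarily the paper's exact definition) is to take the saliency of input item $i$ for word $w_t$ as the word's probability when only that item is encoded, normalized over items:

$$ s_{t,i} = p\!\left(w_t \mid x_i\right), \qquad \operatorname{Sal}(t, i) = \frac{s_{t,i}}{\sum_{j} s_{t,j}} $$

where $x_i$ is the descriptor of a single frame or spatial location and the sum runs over all frames (or grid locations) of the input.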
Encoder-decoder framework for video description (slide: Vasili Ramanishka). Encoder: per-frame CNN features (8x8x2048) are average-pooled to 1x2048 and the frame sequence is fed to an LSTM. Decoder: an LSTM generates the caption word by word ("a man is … car").
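A minimal PyTorch sketch of such an encoder-decoder (the feature sizes follow the slide; the class name, hidden size, and other details are assumptions, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Encoder-decoder video captioner: mean-pooled CNN features -> LSTM encoder -> LSTM decoder."""
    def __init__(self, vocab_size, feat_dim=2048, hidden=512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, 8, 8, 2048) conv features from a pretrained CNN
        pooled = frame_feats.mean(dim=(2, 3))      # (B, T, 2048): average the 8x8 grid per frame
        _, (h, c) = self.encoder(pooled)           # encode the frame sequence
        words = self.embed(captions)               # (B, L, hidden) caption word embeddings
        dec_out, _ = self.decoder(words, (h, c))   # condition the decoder on the video encoding
        return self.out(dec_out)                   # (B, L, vocab) word logits
```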
Saliency estimation (slide: Vasili Ramanishka). Instead of the average-pooled descriptor, the descriptor of a single frame (or single spatial location) is fed through the unchanged encoder-decoder, and the probability of each decoded word ("a man is … car") is recorded for that item.
Saliency estimation (slide: Vasili Ramanishka). For the sentence "A man is driving a car", the per-item word probabilities are normalized across all items to produce a saliency value for every (word, frame) or (word, location) pair.
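A sketch of this probing loop on top of a model like the VideoCaptioner above (the function name is hypothetical; the per-item probing and normalization follow the slides, other details are assumptions):

```python
import torch

@torch.no_grad()
def caption_saliency(model, frame_feats, caption_ids):
    """For each word of the caption, estimate how strongly each frame supports it.

    frame_feats: (1, T, 8, 8, 2048); caption_ids: (1, L) word indices of the predicted sentence.
    Returns an (L-1, T) matrix of normalized saliency scores (rows sum to 1).
    """
    T = frame_feats.shape[1]
    scores = []
    for i in range(T):
        single = frame_feats[:, i:i + 1]                         # probe with frame i only
        logits = model(single, caption_ids)                      # (1, L, vocab)
        probs = logits.softmax(dim=-1)
        # position t predicts word t+1 (teacher forcing), so align logits with the next word
        word_p = probs[:, :-1].gather(2, caption_ids[:, 1:].unsqueeze(-1))
        scores.append(word_p.squeeze())                          # (L-1,) p(w_t | frame i)
    sal = torch.stack(scores, dim=1)                             # (L-1, T)
    return sal / sal.sum(dim=1, keepdim=True)                    # normalize over frames per word
```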
Spatiotemporal saliency. Predicted sentence: "A woman is cutting a piece of meat"
Spatiotemporal saliency: example maps for the words "woman" and "phone"
Image captioning with the same architecture: CNN features v_i of size WxHxC are average-pooled to 1xC and decoded by an LSTM with hidden states h_i; probing individual spatial locations yields a per-word spatial saliency map.
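The same probing can be applied to a single image by treating each of the W x H grid locations as an "item" (again only a sketch, reusing the hypothetical caption_saliency above; conv_feats, model, and caption_ids are assumed inputs):

```python
# conv_feats: (1, W, H, C) features of one image; flatten the grid into W*H pseudo-frames
items = conv_feats.reshape(1, -1, 1, 1, conv_feats.shape[-1])          # (1, W*H, 1, 1, C)
sal = caption_saliency(model, items, caption_ids)                      # (L-1, W*H)
sal_maps = sal.reshape(-1, conv_feats.shape[1], conv_feats.shape[2])   # one WxH map per word
```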
Image captioning with the same architecture. Input query: "A man in a jacket is standing at the slot machine"
Flickr30kEntities dataset [Plummer et al., ICCV 2015]
Pointing game in Flickr30kEntities
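The pointing game can be scored roughly as follows (a sketch; saliency_maps and gt_boxes are hypothetical inputs, and the exact matching protocol used for Flickr30kEntities may differ):

```python
import numpy as np

def pointing_game_accuracy(saliency_maps, gt_boxes):
    """saliency_maps: dict word -> 2D array over the image;
    gt_boxes: dict word -> (x1, y1, x2, y2) ground-truth box in the same coordinates.
    A hit is scored when the maximum-saliency point falls inside the box."""
    hits = total = 0
    for word, box in gt_boxes.items():
        if word not in saliency_maps:
            continue
        y, x = np.unravel_index(np.argmax(saliency_maps[word]), saliency_maps[word].shape)
        x1, y1, x2, y2 = box
        hits += int(x1 <= x <= x2 and y1 <= y <= y2)
        total += 1
    return hits / max(total, 1)
```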
Comparison to Soft Attention on Flickr30kEntities: attention correctness, pointing game accuracy, and captioning performance. [14] C. Liu, J. Mao, F. Sha, and A. L. Yuille. Attention Correctness in Neural Image Captioning, 2016 (implementation of K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015).
Video summarization: predicted sentence
Video summarization: arbitrary query
Diversifying: Captioning Images with Diverse Objects. Lisa Anne Hendricks (UC Berkeley), Subhashini Venugopalan (UT Austin), Marcus Rohrbach (UC Berkeley), Raymond Mooney (UT Austin), Kate Saenko (Boston University), Trevor Darrell (UC Berkeley)
Object recognition (slide: Subhashini Venugopalan). Models can identify thousands of object categories: ImageNet has 14M images and 22K classes [Deng et al., CVPR'09].
Visual description (slide: Subhashini Venugopalan). Berkeley LRCN [Donahue et al., CVPR'15]: "A brown bear standing on top of a lush green field." MSR CaptionBot [http://captionbot.ai/]: "A large brown bear walking through a forest." Both are trained on MSCOCO, which covers only 80 object classes.
Novel Object Captioner (NOC) (slide: Subhashini Venugopalan). We present the Novel Object Captioner, which can compose descriptions of hundreds of objects in context, describing novel objects without paired image-caption data. Existing captioners trained on MSCOCO: "A horse standing in the dirt." NOC (ours), built by combining MSCOCO caption data with visual classifiers for novel objects such as okapi (init + train): "An okapi standing in the middle of a field."
Insight 1 (slide: Subhashini Venugopalan): we need to recognize and describe objects (e.g., okapi) that appear outside image-caption datasets.
Insight 1: train effectively on external sources (slide: Subhashini Venugopalan). An image-specific loss trains the visual side (CNN + embedding) on unpaired image data, and a text-specific loss trains the language model (embedding + LSTM) on unannotated text data.
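A rough sketch of the two auxiliary objectives (the specific loss choices, multi-label classification for images and next-word prediction for text, are assumptions consistent with the slide rather than a quoted implementation):

```python
import torch.nn.functional as F

def image_specific_loss(image_logits, object_labels):
    # Multi-label classification on unpaired images: does the image contain each object?
    return F.binary_cross_entropy_with_logits(image_logits, object_labels)

def text_specific_loss(next_word_logits, next_word_ids):
    # Language modeling on unannotated text: predict the next word of each sentence.
    return F.cross_entropy(next_word_logits.flatten(0, 1), next_word_ids.flatten())
```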
Insight 2 (slide: Subhashini Venugopalan): describe unseen objects (e.g., okapi) that are similar to objects seen in image-caption datasets (e.g., zebra).
Insight 2: capture semantic similarity of words (slide: Subhashini Venugopalan). The input and output word embeddings are initialized from GloVe (W_glove and its transpose W_glove^T), so unseen words such as okapi, scone, and tutu start out close to seen MSCOCO words such as zebra, cake, and dress.
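One way to realize this (a sketch; glove_vectors is a hypothetical dict from word to pretrained vector, and tying the output layer to the same matrix is an assumption based on the W_glove / W_glove^T notation on the slide):

```python
import torch
import torch.nn as nn

def build_glove_embeddings(vocab, glove_vectors, dim=300):
    # Input embedding W_glove and output projection W_glove^T share one matrix,
    # so an unseen word (okapi) starts near related seen words (zebra).
    W = torch.randn(len(vocab), dim) * 0.01
    for i, word in enumerate(vocab):
        if word in glove_vectors:
            W[i] = torch.tensor(glove_vectors[word], dtype=torch.float32)
    embed = nn.Embedding.from_pretrained(W, freeze=False)   # W_glove
    out_proj = nn.Linear(dim, len(vocab), bias=False)
    out_proj.weight = embed.weight                           # ties the output layer to W_glove^T
    return embed, out_proj
```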
Combine to form a caption model (slide: Subhashini Venugopalan). The image network and the text network are joined by an elementwise sum and trained with an image-text loss on MSCOCO, with the embedding and LSTM parameters initialized from the separately pre-trained networks (and W_glove / W_glove^T). This is not different from existing caption models. Problem: forgetting.
Insight 3 (slide: Subhashini Venugopalan): overcome "forgetting", since pre-training alone is not sufficient [Catastrophic Forgetting in Neural Networks, Kirkpatrick et al., PNAS 2017].
Insight 3: jointly train on multiple sources (slide: Subhashini Venugopalan). Rather than pre-training and fine-tuning, the image-specific, image-text, and text-specific losses are optimized together, with the embedding and LSTM parameters shared across the image, MSCOCO caption, and text networks.
Novel Object Captioner (NOC) model (slide: Subhashini Venugopalan). The full model minimizes a joint objective: the sum of the image-specific, image-text, and text-specific losses, with shared parameters and GloVe-initialized embeddings (W_glove, W_glove^T).
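Putting the three objectives together, a sketch of one joint training step, building on the image_specific_loss and text_specific_loss sketches above (the model methods and the equal loss weights are hypothetical; the slide only states that the losses are optimized jointly with shared parameters):

```python
import torch.nn.functional as F

def joint_objective(model, image_batch, caption_batch, text_batch, w=(1.0, 1.0, 1.0)):
    # Image-specific loss on unpaired images, image-text loss on MSCOCO pairs,
    # text-specific loss on unannotated sentences; embedding/LSTM weights are shared.
    l_img = image_specific_loss(model.classify(image_batch.images), image_batch.labels)
    l_cap = F.cross_entropy(
        model.caption(caption_batch.images, caption_batch.captions_in).flatten(0, 1),
        caption_batch.captions_out.flatten())
    l_txt = text_specific_loss(model.language_model(text_batch.words_in), text_batch.words_out)
    return w[0] * l_img + w[1] * l_cap + w[2] * l_txt
```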
Empirical evaluation on COCO, in-domain setting (slide: Subhashini Venugopalan). Three training sources drawn from MSCOCO:
MSCOCO paired image-sentence data | MSCOCO unpaired image data (labels) | MSCOCO unpaired text data
"An elephant galloping in the green grass" | Elephant, Galloping, Green, Grass | "An elephant galloping in the green grass"
"Two people playing ball in a field" | People, Playing, Ball, Field | "Two people playing ball in a field"
"A black train stopped on the tracks" | Black, Train, Tracks | "A black train stopped on the tracks"
"Someone is about to eat some pizza" | Eat, Pizza | "Someone is about to eat some pizza"
"A kitchen counter with a microwave on it" | Kitchen, Microwave | "A microwave is sitting on top of a kitchen counter"
Empirical evaluation on the COCO held-out setting (slide: Subhashini Venugopalan). The same three sources, except that pizza and microwave examples are removed from the paired image-sentence data; those objects remain available only as unpaired image labels (Pizza, Microwave) and in unpaired text (e.g., "A white plate topped with cheesy pizza and toppings.", "A white refrigerator, stove, oven, dishwasher and microwave").
Empirical evaluation on COCO (slide: Subhashini Venugopalan). The unpaired image data and unpaired text data need not overlap with the paired captions: the image labels (e.g., two, elephants, path, walking; baseball, batting, boy, swinging) and text sentences (e.g., "A small elephant standing on top of a dirt field", "A hitter swinging his bat to hit the ball") come from different examples, and the held-out objects (pizza, microwave) still appear only in the unpaired sources. The CNN is pre-trained on ImageNet.
Empirical evaluation metrics (slide: Subhashini Venugopalan). F1 (utility): the ability to recognize and incorporate new words, i.e., is the novel word/object mentioned in the caption? METEOR: fluency and overall sentence quality.
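The F1 (utility) metric for a novel word can be computed roughly like this (a sketch; the paper's exact matching rules may differ):

```python
def novel_word_f1(word, captions, has_object):
    """captions: generated caption per test image; has_object: ground-truth flag per image
    saying whether the novel object is actually present. F1 balances mentioning the word
    when the object is there (recall) against not hallucinating it (precision)."""
    tp = sum(1 for c, y in zip(captions, has_object) if y and word in c.lower().split())
    fp = sum(1 for c, y in zip(captions, has_object) if not y and word in c.lower().split())
    fn = sum(1 for c, y in zip(captions, has_object) if y and word not in c.lower().split())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```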