

  1. Why Did You Say That? Explaining and Diversifying Captioning Models Kate Saenko VQA Workshop, CVPR, July 26, 2017

  2. Explaining: Top-down saliency guided by captions http://ai.bu.edu/caption-guided-saliency/ Vasili Ramanishka (Boston University), Abir Das (Boston University), Jianming Zhang (Adobe Research), Kate Saenko (Boston University)

  3. Captioning: “A woman is cutting a piece of meat” (Kate Saenko)

  4. Why did the network say that?

  5. Captioning: “A woman is ... cooking”; “A man is talking about ... science”

  6. ? “A woman is cutting a piece of meat”

  7. (image-only slide)

  8. Explaining the network’s captions. Predicted sentence: “A woman is cutting a piece of meat”. Can the network localize the objects it mentions?

  9. Related: attention layers. “Attention layers” sequentially process regions in a single image; the objective is for the model to learn “where to look” next (Show, Attend and Tell [Xu et al., ICML’15]).
• Soft attention adds a special attention layer to the image captioning model
• Attention is only spatial or only temporal; spatio-temporal attention is hard
• Can we get salient regions (e.g., for “girl”, “teddy bear”) without adding such layers?

  10. Key idea: probe the network with a small part of the input. Feed the encoder-decoder only one piece of the input at a time and read off P(word).
• No need for a special attention layer
• Spatio-temporal attention comes for free
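The probing idea above can be sketched in a few lines of Python. This is a toy illustration, not the authors' code: the hypothetical `score_word` callable stands in for a full encoder-decoder forward pass that returns P(word | only this region or frame).

```python
def probe_saliency(score_word, regions):
    """Estimate per-region saliency for one predicted word by probing
    the captioner with a single region at a time (no attention layer).

    score_word(region) is assumed to return P(word | only that region).
    """
    raw = [score_word(r) for r in regions]
    total = sum(raw)
    # Normalize so the saliency values sum to 1 over regions.
    return [s / total for s in raw]

# Toy example: a fake scorer under which region 2 best explains the word.
scores = {0: 0.1, 1: 0.1, 2: 0.8}
saliency = probe_saliency(lambda r: scores[r], [0, 1, 2])
```

Because the probe reuses the trained captioner as-is, the same loop works over spatial regions of an image or frames of a video, which is how spatio-temporal attention comes "for free".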

  11. Encoder-decoder framework for video description (slide: Vasili Ramanishka). Encoder: a CNN produces 8x8x2048 features per frame, which are average-pooled to 1x2048 and fed to an LSTM.

  12. Encoder-decoder framework for video description (cont.). Decoder: an LSTM unrolls over the encoded video and emits the caption one word at a time (“a”, “man”, “is”, ..., “car”).

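The encoder stage described in the slides above can be sketched as plain mean pooling (a minimal stand-in for the real 8x8x2048 CNN feature grid):

```python
def mean_pool(region_features):
    """Average a grid of CNN region descriptors (e.g. 8x8 = 64 vectors
    of length 2048) into a single vector, as in the encoder stage
    before the LSTM decoder conditions on it."""
    dim = len(region_features[0])
    pooled = [0.0] * dim
    for feat in region_features:
        for i, v in enumerate(feat):
            pooled[i] += v
    return [v / len(region_features) for v in pooled]

# Toy 2x2 grid of 3-d features instead of 8x8 of 2048-d.
grid = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [0.0, 0.0, 0.0], [4.0, 4.0, 4.0]]
pooled = mean_pool(grid)  # one vector summarizing the whole grid
```

Because pooling throws away the spatial layout, the decoder never sees where anything is; the probing method recovers that location information afterwards.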
  14. Saliency estimation (slide: Vasili Ramanishka): the same CNN + LSTM pipeline is run, but probed with one descriptor at a time while word probabilities are recorded.

  16. Saliency estimation (cont.): the decoder emits the caption (“a”, “man”, “is”, ..., “car”), and each word’s probability under the probed input is collected.

  17. Saliency estimation (slide: Vasili Ramanishka): for the sentence “A man is driving a car”, the per-region word probabilities are normalized to form the saliency map.

  18. Spatiotemporal saliency. Predicted sentence: “A woman is cutting a piece of meat”

  19. Spatiotemporal saliency: example maps for the words “woman” and “phone”

  20. Image captioning with the same architecture: CNN features v_i (W x H x C, pooled to 1 x C) condition the LSTM hidden states h_i.

  21. Image captioning with the same architecture. Input query: “A man in a jacket is standing at the slot machine”

  22. Flickr30kEntities dataset (Plummer et al., ICCV 2015)

  23. Pointing game in Flickr30kEntities
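The pointing game used for this evaluation can be sketched simply (a minimal version: one saliency map, one ground-truth box; the published metric averages hits over all annotated phrases):

```python
def pointing_game_hit(saliency_map, box):
    """Pointing game: a hit when the maximum of the saliency map falls
    inside the annotated bounding box (x0, y0, x1, y1), else a miss.
    saliency_map is a 2-D list indexed [row][col]."""
    best, best_rc = float("-inf"), (0, 0)
    for r, row in enumerate(saliency_map):
        for c, v in enumerate(row):
            if v > best:
                best, best_rc = v, (r, c)
    x0, y0, x1, y1 = box
    r, c = best_rc
    return x0 <= c <= x1 and y0 <= r <= y1

smap = [[0.1, 0.2, 0.1],
        [0.1, 0.9, 0.1]]          # peak at row 1, col 1
hit = pointing_game_hit(smap, (1, 1, 2, 1))    # box covers the peak
miss = pointing_game_hit(smap, (0, 0, 0, 0))   # box misses the peak
```

Accuracy is then the fraction of hits over the dataset, which is what the Flickr30kEntities comparison reports.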

  24. Comparison to soft attention on Flickr30kEntities: attention correctness, pointing game accuracy, and captioning performance. [14] C. Liu, J. Mao, F. Sha, and A. L. Yuille. Attention correctness in neural image captioning, 2016; implementation of K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML 2015.

  25. Video summarization: predicted sentence

  26. Video summarization: arbitrary query

  27. Diversifying: Captioning Images with Diverse Objects. Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell (UT Austin, UC Berkeley, Boston Univ.)

  28. Object recognition (slide: Subhashini Venugopalan): classifiers can identify thousands of object categories; ImageNet has 14M images, 22K classes [Deng et al., CVPR’09].

  29. Visual description (slide: Subhashini Venugopalan). Berkeley LRCN [Donahue et al., CVPR’15]: “A brown bear standing on top of a lush green field.” MSR CaptionBot [http://captionbot.ai/]: “A large brown bear walking through a forest.” Both are limited to MSCOCO’s 80 object classes.

  30. Novel Object Captioner (NOC) (slide: Subhashini Venugopalan): composes descriptions of hundreds of objects in context, without paired image-caption data for them. Existing captioners trained on MSCOCO output “A horse standing in the dirt.”; NOC, initialized from visual classifiers and then trained, outputs “An okapi standing in the middle of a field.”

  31. Insight 1 (slide: Subhashini Venugopalan): we need to recognize and describe objects (e.g., okapi) that fall outside image-caption datasets.

  32. Insight 1: train effectively on external sources. An image-specific loss learns visual features from unpaired image data (CNN + embedding), and a text-specific loss learns a language model from unannotated text data (embedding + LSTM).

  33. Insight 2 (slide: Subhashini Venugopalan): describe unseen objects (okapi) that are similar to objects seen in image-caption datasets (zebra).

  34. Insight 2: capture semantic similarity of words. The input and output embeddings are set to pretrained GloVe vectors (W_glove and its transpose W_glove^T), so related words lie close together: zebra/okapi, cake/scone, tutu/dress.

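Why shared GloVe embeddings help can be seen with cosine similarity: an unseen word that sits near a seen word in embedding space inherits the contexts the decoder learned for its neighbor. A toy illustration with hypothetical 2-d stand-ins for real 300-d GloVe vectors:

```python
def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

# Hypothetical toy vectors, not real GloVe values: "okapi" (unseen in
# captions) lies close to "zebra" (seen), and far from "cake".
glove = {"zebra": [0.9, 0.2], "okapi": [0.8, 0.3], "cake": [0.1, 0.9]}
```

With embeddings tied this way, sentence patterns learned for "A zebra standing in ..." transfer to "An okapi standing in ..." even though "okapi" never appears in paired training captions.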
  36. Combine to form a caption model (slide: Subhashini Venugopalan): the image-specific, image-text (MSCOCO), and text-specific networks share GloVe embeddings (W_glove, W_glove^T), the caption model is initialized from the pretrained parameters, and the visual and language paths are combined by elementwise sum. So far this is not different from existing caption models; the problem is forgetting.

  37. Insight 3 (slide: Subhashini Venugopalan): overcome “forgetting”, since pre-training alone is not sufficient [Catastrophic Forgetting in Neural Networks, Kirkpatrick et al., PNAS 2017].

  38. Insight 3: jointly train on multiple sources. Instead of pre-training and then fine-tuning, the image-specific, image-text, and text-specific losses are optimized together with shared LSTM and embedding parameters.

  39. Novel Object Captioner (NOC) model (slide: Subhashini Venugopalan): the full model optimizes a joint objective, combining the image-specific, image-text, and text-specific losses over shared, jointly trained parameters.
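The joint objective can be sketched as a single training step that always touches all three data sources (a schematic, not the authors' implementation; the loss callables are hypothetical stand-ins for the three NOC losses):

```python
def joint_training_step(image_batch, caption_batch, text_batch,
                        image_loss, caption_loss, text_loss):
    """One joint update (sketch): every step sums losses from all three
    sources, so the shared parameters are never fine-tuned on one
    source alone, which is what causes catastrophic forgetting."""
    return (image_loss(image_batch)
            + caption_loss(caption_batch)
            + text_loss(text_batch))

# Toy constant losses just to show the combination.
total = joint_training_step("imgs", "caps", "text",
                            lambda b: 1.0, lambda b: 2.0, lambda b: 3.0)
```

The design choice is the key point of Insight 3: gradients from the unpaired image and text sources flow in every update, rather than only during a pre-training phase that later gets overwritten.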

  40. Empirical evaluation: COCO, in-domain setting (slide: Subhashini Venugopalan). Each concept appears in all three training sources: MSCOCO paired image-sentence data, MSCOCO unpaired image data (word labels only), and MSCOCO unpaired text data (sentences only). Examples:
• “An elephant galloping in the green grass” / image labels: Elephant, Galloping, Green, Grass
• “Two people playing ball in a field” / image labels: People, Playing, Ball, Field
• “A black train stopped on the tracks” / image labels: Black, Train, Tracks
• “Someone is about to eat some pizza” / image labels: Eat, Pizza
• “A kitchen counter with a microwave on it” / image labels: Kitchen, Microwave

  41. Empirical evaluation: COCO held-out setting (slide: Subhashini Venugopalan). Pizza and microwave examples are removed from the paired image-sentence data; those concepts remain only in the unpaired image data (labels Pizza, Microwave) and the unpaired text data (“A white plate topped with cheesy pizza and toppings.”, “A white refrigerator, stove, oven, dishwasher and microwave”).

  42. Empirical evaluation: COCO (cont.) (slide: Subhashini Venugopalan). The unpaired image data may carry different labels (e.g., Two, elephants, Path, walking; Baseball, batting, boy, swinging) with corresponding unpaired sentences (“A small elephant standing on top of a dirt field”, “A hitter swinging his bat to hit the ball”); the held-out pizza and microwave concepts again appear only in the unpaired sources. The CNN is pre-trained on ImageNet.

  43. Empirical evaluation: metrics (slide: Subhashini Venugopalan). F1 (utility) measures the ability to recognize and incorporate new words: is the word/object mentioned in the caption? METEOR measures fluency and sentence quality.
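The F1 (utility) metric described above can be sketched as precision/recall over whether the generated caption mentions the novel word on images that contain the object (a simplified per-word version; exact matching details in the published evaluation may differ):

```python
def novel_word_f1(captions, has_object, word):
    """F1 for novel-word mention: a true positive is a caption that
    mentions `word` for an image that actually contains the object."""
    tp = sum(1 for c, l in zip(captions, has_object) if word in c and l)
    fp = sum(1 for c, l in zip(captions, has_object) if word in c and not l)
    fn = sum(1 for c, l in zip(captions, has_object) if word not in c and l)
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

captions = ["an okapi in a field", "a zebra grazing", "an okapi standing"]
has_object = [True, True, False]  # ground truth: image contains an okapi?
f1 = novel_word_f1(captions, has_object, "okapi")  # 1 TP, 1 FP, 1 FN
```

F1 alone rewards blurting the word out, which is why it is paired with METEOR to check that the sentence remains fluent.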
