Why Did You Say That? Explaining and Diversifying Captioning Models Kate Saenko VQA Workshop, CVPR, July 26, 2017
Explaining: Top-down saliency guided by captions. http://ai.bu.edu/caption-guided-saliency/ Vasili Ramanishka (Boston University), Abir Das (Boston University), Jianming Zhang (Adobe Research), Kate Saenko (Boston University)
Captioning: "A woman is cutting a piece of meat"
Why did the network say that?
Captioning: "A woman is .. cooking"; "A man is talking about … science"
Which parts of the input made the network say "A woman is cutting a piece of meat"?
Explaining the network's captions. Predicted sentence: "A woman is cutting a piece of meat". Can the network localize the objects it mentions?
Related: Attention layers (Show, Attend and Tell [Xu et al., ICML'15]). "Attention layers" sequentially process regions in a single image; the objective is for the model to learn "where to look" next.
• Soft attention adds a special attention layer to the image captioning model
• Attention is only spatial or only temporal; spatio-temporal attention is hard
• Can we get salient regions without adding such layers?
Key idea: probe the network with a small part of the input. Feed a single frame or region through the same encoder-decoder network and read off P(word) for each word of the caption.
• No need for a special attention layer
• Spatio-temporal attention comes for free
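To make the probing idea concrete, one rough formulation (a sketch consistent with the normalization step described later, not necessarily the paper's exact definition) is to take the saliency of input item $i$ for word $w_t$ as the word's probability when only that item is encoded, normalized over items:

$$ s_{t,i} = p\!\left(w_t \mid x_i\right), \qquad \operatorname{Sal}(t, i) = \frac{s_{t,i}}{\sum_{j} s_{t,j}} $$

where $x_i$ is the descriptor of a single frame or spatial location and the sum runs over all frames (or grid locations) of the input.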
Encoder-decoder framework for video description (slide: Vasili Ramanishka). Encoder: per-frame CNN features (8x8x2048) are average-pooled to 1x2048 and the frame sequence is fed to an LSTM. Decoder: an LSTM generates the caption word by word ("a man is … car").
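A minimal PyTorch sketch of such an encoder-decoder (the feature sizes follow the slide; the class name, hidden size, and other details are assumptions, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Encoder-decoder video captioner: mean-pooled CNN features -> LSTM encoder -> LSTM decoder."""
    def __init__(self, vocab_size, feat_dim=2048, hidden=512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, 8, 8, 2048) conv features from a pretrained CNN
        pooled = frame_feats.mean(dim=(2, 3))      # (B, T, 2048): average the 8x8 grid per frame
        _, (h, c) = self.encoder(pooled)           # encode the frame sequence
        words = self.embed(captions)               # (B, L, hidden) caption word embeddings
        dec_out, _ = self.decoder(words, (h, c))   # condition the decoder on the video encoding
        return self.out(dec_out)                   # (B, L, vocab) word logits
```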
Saliency estimation (slide: Vasili Ramanishka). Instead of the average-pooled descriptor, the descriptor of a single frame (or single spatial location) is fed through the unchanged encoder-decoder, and the probability of each decoded word ("a man is … car") is recorded for that item.
Saliency estimation (slide: Vasili Ramanishka). For the sentence "A man is driving a car", the per-item word probabilities are normalized across all items to produce a saliency value for every (word, frame) or (word, location) pair.
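A sketch of this probing loop on top of a model like the VideoCaptioner above (the function name is hypothetical; the per-item probing and normalization follow the slides, other details are assumptions):

```python
import torch

@torch.no_grad()
def caption_saliency(model, frame_feats, caption_ids):
    """For each word of the caption, estimate how strongly each frame supports it.

    frame_feats: (1, T, 8, 8, 2048); caption_ids: (1, L) word indices of the predicted sentence.
    Returns an (L-1, T) matrix of normalized saliency scores (rows sum to 1).
    """
    T = frame_feats.shape[1]
    scores = []
    for i in range(T):
        single = frame_feats[:, i:i + 1]                         # probe with frame i only
        logits = model(single, caption_ids)                      # (1, L, vocab)
        probs = logits.softmax(dim=-1)
        # position t predicts word t+1 (teacher forcing), so align logits with the next word
        word_p = probs[:, :-1].gather(2, caption_ids[:, 1:].unsqueeze(-1))
        scores.append(word_p.squeeze())                          # (L-1,) p(w_t | frame i)
    sal = torch.stack(scores, dim=1)                             # (L-1, T)
    return sal / sal.sum(dim=1, keepdim=True)                    # normalize over frames per word
```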
Spatiotemporal saliency. Predicted sentence: "A woman is cutting a piece of meat"
Spatiotemporal saliency: example maps for the words "woman" and "phone"
Image captioning with the same architecture: CNN features v_i of size WxHxC are average-pooled to 1xC and decoded by an LSTM with hidden states h_i; probing individual spatial locations yields a per-word spatial saliency map.
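The same probing can be applied to a single image by treating each of the W x H grid locations as an "item" (again only a sketch, reusing the hypothetical caption_saliency above; conv_feats, model, and caption_ids are assumed inputs):

```python
# conv_feats: (1, W, H, C) features of one image; flatten the grid into W*H pseudo-frames
items = conv_feats.reshape(1, -1, 1, 1, conv_feats.shape[-1])          # (1, W*H, 1, 1, C)
sal = caption_saliency(model, items, caption_ids)                      # (L-1, W*H)
sal_maps = sal.reshape(-1, conv_feats.shape[1], conv_feats.shape[2])   # one WxH map per word
```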
Image captioning with the same architecture. Input query: "A man in a jacket is standing at the slot machine"
Flickr30kEntities dataset [Plummer et al., ICCV 2015]
Pointing game in Flickr30kEntities
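The pointing game can be scored roughly as follows (a sketch; saliency_maps and gt_boxes are hypothetical inputs, and the exact matching protocol used for Flickr30kEntities may differ):

```python
import numpy as np

def pointing_game_accuracy(saliency_maps, gt_boxes):
    """saliency_maps: dict word -> 2D array over the image;
    gt_boxes: dict word -> (x1, y1, x2, y2) ground-truth box in the same coordinates.
    A hit is scored when the maximum-saliency point falls inside the box."""
    hits = total = 0
    for word, box in gt_boxes.items():
        if word not in saliency_maps:
            continue
        y, x = np.unravel_index(np.argmax(saliency_maps[word]), saliency_maps[word].shape)
        x1, y1, x2, y2 = box
        hits += int(x1 <= x <= x2 and y1 <= y <= y2)
        total += 1
    return hits / max(total, 1)
```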
Comparison to Soft Attention on Flickr30kEntities: attention correctness, pointing game accuracy, and captioning performance. [14] C. Liu, J. Mao, F. Sha, and A. L. Yuille. Attention Correctness in Neural Image Captioning, 2016 (implementation of K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015).
Video summarization: predicted sentence
Video summarization: arbitrary query
Diversifying: Captioning Images with Diverse Objects. Lisa Anne Hendricks (UC Berkeley), Subhashini Venugopalan (UT Austin), Marcus Rohrbach (UC Berkeley), Raymond Mooney (UT Austin), Kate Saenko (Boston University), Trevor Darrell (UC Berkeley)
Object recognition (slide: Subhashini Venugopalan). Models can identify thousands of object categories: ImageNet has 14M images and 22K classes [Deng et al., CVPR'09].
Visual description (slide: Subhashini Venugopalan). Berkeley LRCN [Donahue et al., CVPR'15]: "A brown bear standing on top of a lush green field." MSR CaptionBot [http://captionbot.ai/]: "A large brown bear walking through a forest." Both are trained on MSCOCO, which covers only 80 object classes.
Novel Object Captioner (NOC) (slide: Subhashini Venugopalan). We present the Novel Object Captioner, which can compose descriptions of hundreds of objects in context, describing novel objects without paired image-caption data. Existing captioners trained on MSCOCO: "A horse standing in the dirt." NOC (ours), built by combining MSCOCO caption data with visual classifiers for novel objects such as okapi (init + train): "An okapi standing in the middle of a field."
Insight 1 (slide: Subhashini Venugopalan): we need to recognize and describe objects (e.g., okapi) that appear outside image-caption datasets.
Insight 1: train effectively on external sources (slide: Subhashini Venugopalan). An image-specific loss trains the visual side (CNN + embedding) on unpaired image data, and a text-specific loss trains the language model (embedding + LSTM) on unannotated text data.
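A rough sketch of the two auxiliary objectives (the specific loss choices, multi-label classification for images and next-word prediction for text, are assumptions consistent with the slide rather than a quoted implementation):

```python
import torch.nn.functional as F

def image_specific_loss(image_logits, object_labels):
    # Multi-label classification on unpaired images: does the image contain each object?
    return F.binary_cross_entropy_with_logits(image_logits, object_labels)

def text_specific_loss(next_word_logits, next_word_ids):
    # Language modeling on unannotated text: predict the next word of each sentence.
    return F.cross_entropy(next_word_logits.flatten(0, 1), next_word_ids.flatten())
```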
Insight 2 (slide: Subhashini Venugopalan): describe unseen objects (e.g., okapi) that are similar to objects seen in image-caption datasets (e.g., zebra).
Insight 2: capture semantic similarity of words (slide: Subhashini Venugopalan). The input and output word embeddings are initialized from GloVe (W_glove and its transpose W_glove^T), so unseen words such as okapi, scone, and tutu start out close to seen MSCOCO words such as zebra, cake, and dress.
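One way to realize this (a sketch; glove_vectors is a hypothetical dict from word to pretrained vector, and tying the output layer to the same matrix is an assumption based on the W_glove / W_glove^T notation on the slide):

```python
import torch
import torch.nn as nn

def build_glove_embeddings(vocab, glove_vectors, dim=300):
    # Input embedding W_glove and output projection W_glove^T share one matrix,
    # so an unseen word (okapi) starts near related seen words (zebra).
    W = torch.randn(len(vocab), dim) * 0.01
    for i, word in enumerate(vocab):
        if word in glove_vectors:
            W[i] = torch.tensor(glove_vectors[word], dtype=torch.float32)
    embed = nn.Embedding.from_pretrained(W, freeze=False)   # W_glove
    out_proj = nn.Linear(dim, len(vocab), bias=False)
    out_proj.weight = embed.weight                           # ties the output layer to W_glove^T
    return embed, out_proj
```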
Combine to form a caption model (slide: Subhashini Venugopalan). The image network and the text network are joined by an elementwise sum and trained with an image-text loss on MSCOCO, with the embedding and LSTM parameters initialized from the separately pre-trained networks (and W_glove / W_glove^T). This is not different from existing caption models. Problem: forgetting.
Insight 3 (slide: Subhashini Venugopalan): overcome "forgetting", since pre-training alone is not sufficient [Catastrophic Forgetting in Neural Networks, Kirkpatrick et al., PNAS 2017].
Insight 3: jointly train on multiple sources (slide: Subhashini Venugopalan). Rather than pre-training and fine-tuning, the image-specific, image-text, and text-specific losses are optimized together, with the embedding and LSTM parameters shared across the image, MSCOCO caption, and text networks.
Novel Object Captioner (NOC) model (slide: Subhashini Venugopalan). The full model minimizes a joint objective: the sum of the image-specific, image-text, and text-specific losses, with shared parameters and GloVe-initialized embeddings (W_glove, W_glove^T).
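Putting the three objectives together, a sketch of one joint training step, building on the image_specific_loss and text_specific_loss sketches above (the model methods and the equal loss weights are hypothetical; the slide only states that the losses are optimized jointly with shared parameters):

```python
import torch.nn.functional as F

def joint_objective(model, image_batch, caption_batch, text_batch, w=(1.0, 1.0, 1.0)):
    # Image-specific loss on unpaired images, image-text loss on MSCOCO pairs,
    # text-specific loss on unannotated sentences; embedding/LSTM weights are shared.
    l_img = image_specific_loss(model.classify(image_batch.images), image_batch.labels)
    l_cap = F.cross_entropy(
        model.caption(caption_batch.images, caption_batch.captions_in).flatten(0, 1),
        caption_batch.captions_out.flatten())
    l_txt = text_specific_loss(model.language_model(text_batch.words_in), text_batch.words_out)
    return w[0] * l_img + w[1] * l_cap + w[2] * l_txt
```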
Empirical evaluation on COCO, in-domain setting (slide: Subhashini Venugopalan). Three training sources drawn from MSCOCO:
MSCOCO paired image-sentence data | MSCOCO unpaired image data (labels) | MSCOCO unpaired text data
"An elephant galloping in the green grass" | Elephant, Galloping, Green, Grass | "An elephant galloping in the green grass"
"Two people playing ball in a field" | People, Playing, Ball, Field | "Two people playing ball in a field"
"A black train stopped on the tracks" | Black, Train, Tracks | "A black train stopped on the tracks"
"Someone is about to eat some pizza" | Eat, Pizza | "Someone is about to eat some pizza"
"A kitchen counter with a microwave on it" | Kitchen, Microwave | "A microwave is sitting on top of a kitchen counter"
Empirical evaluation on the COCO held-out setting (slide: Subhashini Venugopalan). The same three sources, except that pizza and microwave examples are removed from the paired image-sentence data; those objects remain available only as unpaired image labels (Pizza, Microwave) and in unpaired text (e.g., "A white plate topped with cheesy pizza and toppings.", "A white refrigerator, stove, oven, dishwasher and microwave").
Empirical evaluation on COCO (slide: Subhashini Venugopalan). The unpaired image data and unpaired text data need not overlap with the paired captions: the image labels (e.g., two, elephants, path, walking; baseball, batting, boy, swinging) and text sentences (e.g., "A small elephant standing on top of a dirt field", "A hitter swinging his bat to hit the ball") come from different examples, and the held-out objects (pizza, microwave) still appear only in the unpaired sources. The CNN is pre-trained on ImageNet.
Empirical evaluation metrics (slide: Subhashini Venugopalan). F1 (utility): the ability to recognize and incorporate new words, i.e., is the novel word/object mentioned in the caption? METEOR: fluency and overall sentence quality.
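The F1 (utility) metric for a novel word can be computed roughly like this (a sketch; the paper's exact matching rules may differ):

```python
def novel_word_f1(word, captions, has_object):
    """captions: generated caption per test image; has_object: ground-truth flag per image
    saying whether the novel object is actually present. F1 balances mentioning the word
    when the object is there (recall) against not hallucinating it (precision)."""
    tp = sum(1 for c, y in zip(captions, has_object) if y and word in c.lower().split())
    fp = sum(1 for c, y in zip(captions, has_object) if not y and word in c.lower().split())
    fn = sum(1 for c, y in zip(captions, has_object) if y and word not in c.lower().split())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```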