Designing deep architectures for Visual Question Answering
Matthieu Cord, Sorbonne University / valeo.ai research lab, Paris
Thanks to H. Ben-younes, R. Cadène
Visual Question Answering
Visual Question Answering: image + "What does Claudia do?" → "Sitting at the bottom", "Standing at the back", …
Solving this task is interesting for:
- Studying deep learning models in a multimodal context
- Improving human-machine interaction
- One step toward building a visual assistant for blind people
Outline
1. Multimodal embedding
• Deep nets to align text+image
• Learning
2. VQA framework
• Task modeling
• Fusion in VQA
• Reasoning in VQA
Deep semantic-visual embedding
A ConvNet encodes the image and an RNN encodes the text into a common space:
• Distance in this space carries semantics
• Retrieval by nearest-neighbor (NN) search
Deep semantic-visual embedding
2D semantic-visual space example: "A car", "A cat on a sofa", "A dog playing"
• Distance in the space has a semantic interpretation
• Retrieval is done by finding nearest neighbors
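As a toy illustration of this retrieval step, a minimal nearest-neighbor search in a hypothetical 2D joint space (all embeddings and captions below are made up; a real system would produce them with the deep pipelines described next):

```python
# Toy sketch of cross-modal retrieval in a shared semantic-visual space.
# The 2D embeddings are hypothetical stand-ins for deep features.
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embedded captions (text side of the joint space).
captions = {
    "A car": [0.9, 0.1],
    "A cat on a sofa": [0.1, 0.9],
    "A dog playing": [0.3, 0.8],
}

def retrieve(image_embedding, k=2):
    """Cross-modal retrieval = k nearest neighbors in the joint space."""
    ranked = sorted(captions, key=lambda c: -cosine(captions[c], image_embedding))
    return ranked[:k]

print(retrieve([0.2, 0.85]))  # a "pet-like" image embedding retrieves pet captions
```

The semantics live entirely in the geometry of the space: no class labels are needed at retrieval time, only distances.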
Deep semantic-visual embedding
• Designing image and text embedding architectures
• Learning scheme for these deep hybrid nets
Deep semantic-visual embedding
DeViSE: A Deep Visual-Semantic Embedding Model, A. Frome et al., NIPS 2013
Finding beans in burgers: Deep semantic-visual embedding with localization, M. Engilberge et al., CVPR 2018
Textual pipeline:
• Pretrained word embedding (w2v)
• Simple Recurrent Unit (SRU)
• Normalization
• Affine projection + normalization
Visual pipeline:
• ResNet-152 pretrained
• Weldon spatial pooling
• Affine projection + normalization
Example input: (a, man, in, ski, gear, skiing, on, snow) → w2v → SRU + norm
θ and ϕ are the trained parameters of the two pipelines
Deep semantic-visual embedding
The parameters θ and ϕ of the two pipelines are learned jointly on a training set of aligned image-caption pairs.
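As a sketch of what "learning on a training set" typically means here, a minimal triplet ranking loss of the kind commonly used to train joint image-text embeddings (the margin value, embeddings and helper names below are illustrative, not the paper's exact objective):

```python
# Minimal sketch of a triplet ranking objective for joint embeddings.
# All vectors are hypothetical, assumed already L2-normalized so that
# the dot product plays the role of a similarity.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triplet_loss(img, pos_cap, neg_cap, margin=0.2):
    """Hinge loss: the matching caption must score higher than a
    non-matching one by at least `margin`."""
    return max(0.0, margin - dot(img, pos_cap) + dot(img, neg_cap))

img = [0.6, 0.8]    # embedded image (through the visual pipeline, params θ)
pos = [0.55, 0.83]  # its own caption (through the textual pipeline, params ϕ)
neg = [0.9, -0.2]   # a caption from another image

print(triplet_loss(img, pos, neg))  # 0.0 when the pair is already well aligned
```

Gradients of this loss flow back into both pipelines, pulling matching pairs together and pushing mismatched pairs apart in the shared space.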
How to get large training datasets?
Cooking recipes: an easy way to get large multimodal datasets with aligned data.
Learning Cross-modal Embeddings for Cooking Recipes and Food Images, A. Salvador et al., CVPR 2017
Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings, M. Carvalho, R. Cadene, D. Picard, L. Soulier, N. Thome, M. Cord, SIGIR 2018
Deep semantic-visual embedding
Demo: visiir.lip6.fr
Cross-modal retrieval
Text queries and their closest images: "A plane in a cloudy sky", "A dog playing with a frisbee"
Image query and its closest captions:
1. A herd of sheep standing on top of a snow covered field.
2. There are sheep standing in the grass near a fence.
3. Some black and white sheep, a fence, dirt and grass.
Cross-modal retrieval and localization
Visual grounding examples:
• Generating multiple heat maps with different textual queries
Finding beans in burgers: Deep semantic-visual embedding with localization, M. Engilberge et al., CVPR 2018
Cross-modal retrieval and localization Emergence of color understanding:
Outline
1. Multimodal embedding
• Deep nets to align text+image
• Learning
2. Visual Question Answering
• Task modeling
• Fusion in VQA
• Reasoning in VQA
VQA
VQA
What color is the fire hydrant on the left? → Green
VQA
What color is the fire hydrant on the right? → Yellow
Who is wearing glasses? Similar images, different answers ("man" vs "woman") [@VQA workshop, CVPR 2017]
⇒ Need very good visual and question (deep) representations
⇒ Full scene understanding
⇒ Need high-level multimodal interaction modeling
⇒ Merging operators, attention and reasoning
Vanilla VQA scheme: two deep representations (question and image) + fusion
VQA: the output space
Image representation + question representation → VQA model → "Yes"
Question: Is the lady with the blue fur wearing glasses?
VQA: the output space
Output space representation: classify over the most frequent answers (the top ~3000 answers cover ~95% of the dataset)
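A minimal sketch of how such a fixed answer vocabulary is built in this classification view (the example answers and `top_k` value below are hypothetical):

```python
# Sketch: build the fixed answer vocabulary by keeping the most frequent
# training answers; the slide keeps ~3000, covering ~95% of the answers.
from collections import Counter

def build_answer_vocab(train_answers, top_k):
    """Return the top_k most frequent answers as the class list."""
    counts = Counter(train_answers)
    return [ans for ans, _ in counts.most_common(top_k)]

# Hypothetical training-set answers.
answers = ["yes", "no", "yes", "2", "green", "yes", "no", "laptop"]
vocab = build_answer_vocab(answers, top_k=2)
print(vocab)  # ['yes', 'no']
```

Any rarer answer falls outside the vocabulary and simply cannot be predicted, which is the price of turning open-ended answering into classification.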
VQA: the output space
Image representation + question representation → VQA model → classification over answer classes
Question: Is the lady with the blue fur wearing glasses?
VQA processing
Image:
● Convolutional network (VGG, ResNet, …)
● Detection system (EdgeBoxes, Faster R-CNN, …)
Question:
● Bag-of-words
● Recurrent network (RNN, LSTM, GRU, SRU, …)
Multimodal fusion + reasoning
Learning:
● Fixed answer vocabulary
● Classification (cross-entropy)
Fusion in VQA
VQA: fusion
Question: Is the lady with the purple fur wearing glasses?
Fusion strategies: concat + projection, element-wise product, concat + MLP
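The simple fusion schemes above can be sketched on toy vectors (real q and v are high-dimensional deep features, and the learned projection or MLP that follows each operator is omitted here):

```python
# Toy sketch of the two basic fusion operators; in practice each is
# followed by a learned linear projection or MLP down to the answer space.
def concat(q, v):
    """Concatenation fusion (before a learned projection)."""
    return q + v

def element_wise(q, v):
    """Element-wise (Hadamard) product fusion."""
    return [a * b for a, b in zip(q, v)]

q = [1.0, 2.0]   # toy question embedding
v = [3.0, 4.0]   # toy image embedding

print(concat(q, v))        # [1.0, 2.0, 3.0, 4.0]
print(element_wise(q, v))  # [3.0, 8.0]
```

Concatenation keeps the two modalities separate until the projection mixes them; the element-wise product forces a multiplicative interaction, a first step toward the bilinear models that follow.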
VQA: bilinear fusion
[Fukui et al., Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, EMNLP 2016]
[Kim et al., Hadamard Product for Low-rank Bilinear Pooling, ICLR 2017]
Question: Is the lady with the purple fur wearing glasses?
Bilinear model: y_k = Σ_ij T_ijk q_i v_j, with T a full 3-way tensor of parameters
VQA: bilinear fusion
Learn the coefficients of the 3-way tensor T
• Different from tensor analysis (representation) in signal processing
Problem: q, v and y are of dimension ~2000 ⇒ 8 billion free parameters in the tensor
Need to reduce the tensor size:
• Idea: structure the tensor to reduce the number of parameters
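The 8-billion figure is simply the size of the full 3-way tensor, one free parameter per (question dim, image dim, answer dim) triple:

```python
# Where the "8 billion free parameters" figure comes from:
# the full bilinear tensor has d_q * d_v * d_y entries.
d_q = d_v = d_y = 2000
n_params = d_q * d_v * d_y
print(n_params)  # 8_000_000_000
```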
VQA: bilinear fusion
Tensor structure — Tucker decomposition: T = ((T_c ×1 W_q) ×2 W_v) ×3 W_o
⇔ constraining the rank of each unfolding of T
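A pure-Python sketch of Tucker-structured bilinear fusion in the MUTAN spirit: project q and v to small core dimensions, contract with a small core tensor, then project to the output space. Dimensions and random weights below are toy values for illustration, not the paper's configuration:

```python
# Sketch of Tucker-structured bilinear fusion (MUTAN-style), pure Python.
import random

random.seed(0)

def rand_mat(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def tucker_fusion(q, v, Wq, Wv, core, Wo):
    qt = matvec(Wq, q)   # project question: d_q -> t_q
    vt = matvec(Wv, v)   # project image:    d_v -> t_v
    # contract the small core tensor with both projected vectors
    z = [sum(core[i][j][k] * qt[i] * vt[j]
             for i in range(len(qt)) for j in range(len(vt)))
         for k in range(len(core[0][0]))]
    return matvec(Wo, z)  # project to the output space: t_o -> d_y

d_q, d_v, d_y = 4, 4, 3   # toy dims (real ones are ~2000)
t_q, t_v, t_o = 2, 2, 2   # small core dims

Wq, Wv, Wo = rand_mat(t_q, d_q), rand_mat(t_v, d_v), rand_mat(d_y, t_o)
core = [[[random.uniform(-1, 1) for _ in range(t_o)]
         for _ in range(t_v)] for _ in range(t_q)]

y = tucker_fusion([1, 0, 2, 1], [0.5, 1, 0, 0], Wq, Wv, core, Wo)
print(len(y))  # 3 — the output lives in the answer space

# Parameter comparison at realistic scale, assuming a 200-d core:
full = 2000 * 2000 * 2000                   # 8 billion
structured = 3 * (200 * 2000) + 200 ** 3    # 9.2 million
```

The three factor matrices plus the small core replace the full tensor, cutting the parameter count by roughly three orders of magnitude while keeping a bilinear q-v interaction.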
VQA: bilinear fusion
Other ways of structuring the tensor of parameters:
Compact Bilinear Pooling (MCB) / Low-Rank Bilinear Pooling (MLB) / Tucker Decomposition (MUTAN)
Ben-younes H.*, Cadene R.*, Thome N., Cord M., MUTAN: Multimodal Tucker Fusion for Visual Question Answering, ICCV 2017
VQA: BLOCK fusion
Ben-younes H., Cadene R., Thome N., Cord M., BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection, AAAI 2019
VQA: bilinear fusion
Question: Is the lady with the purple fur wearing glasses?
Reasoning in VQA
VQA: reasoning
What is reasoning (for VQA)?
• Attentional reasoning: given a certain context (i.e. Q), focus only on the relevant subparts of the image
• Relational reasoning
• Iterative reasoning
• Compositional reasoning
VQA: attentional reasoning
Idea: focus only on the parts of the image relevant to Q
● Each region is scored according to the question
● Image representation = sum of all weighted region embeddings
Example question: What is sitting on the desk in front of the boys?
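The two bullets above can be sketched as follows, with a plain dot product standing in for the learned question/region fusion that produces the scores:

```python
# Sketch of single-glimpse attention: score each region against the
# question, softmax the scores, and pool regions by the weights.
# The dot-product score is a hypothetical stand-in for a learned fusion.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(question, regions):
    """Return attention weights and the attended image representation."""
    scores = [sum(q * r for q, r in zip(question, reg)) for reg in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    pooled = [sum(w * reg[d] for w, reg in zip(weights, regions))
              for d in range(dim)]
    return weights, pooled

q = [1.0, 0.0]                       # toy question embedding
regions = [[2.0, 0.0], [0.0, 2.0]]   # two toy region embeddings
w, pooled = attend(q, regions)
print(w)  # the first region dominates (its score is 2.0 vs 0.0)
```

Because the weights sum to one, the pooled vector stays in the same space as the region embeddings, so the rest of the pipeline is unchanged.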
VQA: attentional reasoning
ResNet region features and the GRU question embedding are combined by a MUTAN fusion inside the attention mechanism; a second MUTAN fusion between the question and the attended image representation predicts the answer ("laptop").
An attentional glimpse is used in most recent strategies [MLB, MCB, MUTAN, …]
VQA: attentional reasoning
Focusing on multiple regions: multi-glimpse attention
Example question: Where is the smoke coming from?
VQA: attentional reasoning with multi-glimpse attention
ResNet region features and the GRU question embedding feed a MUTAN-based attention mechanism; one glimpse focuses on the train, another on the smoke. A final MUTAN fusion predicts the answer: "train".
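Multi-glimpse attention can be sketched as several independent attention maps whose pooled vectors are concatenated (the per-glimpse score functions below are hypothetical stand-ins for the learned per-glimpse fusions):

```python
# Sketch of multi-glimpse attention: G score functions give G attention
# maps; the G pooled vectors are concatenated into one representation.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def multi_glimpse(regions, score_fns):
    pooled_all = []
    for score in score_fns:  # one attention map per glimpse
        weights = softmax([score(r) for r in regions])
        dim = len(regions[0])
        pooled_all += [sum(w * r[d] for w, r in zip(weights, regions))
                       for d in range(dim)]
    return pooled_all        # concatenation of the G pooled vectors

regions = [[2.0, 0.0], [0.0, 2.0]]          # e.g. "train" and "smoke" regions
glimpses = [lambda r: r[0], lambda r: r[1]]  # each glimpse prefers one region
out = multi_glimpse(regions, glimpses)
print(len(out))  # 4 = 2 glimpses x 2 dims
```

Each glimpse can thus specialize on a different part of the scene (here, one on each region), which a single softmax-normalized map cannot do.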
VQA: attentional reasoning
Evaluation on the VQA dataset: best MUTAN score of 67.36% on test-std
Human performance is about 83% on this dataset
The winner of the VQA Challenge at CVPR 2017 (and CVPR 2018) integrates adaptive grid selection from an additional region-detection learning process
VQA: reasoning
What is reasoning (for VQA)?
• Attentional reasoning: given a certain context (i.e. Q), focus only on the relevant subparts of the image
• Relational reasoning: object detection + mutual relationships (spatial, semantic, …), merging both with Q
• Iterative reasoning
• Compositional reasoning
Bottom-up and relational reasoning
Determine the answer using relevant objects and their relationships