Visual Question Answering and Visual Reasoning Zhe Gan 6/15/2020
Overview • Goal of this part of the tutorial: • Use VQA and visual reasoning as example tasks to understand Vision-and-Language representation learning • After the talk, everyone can confidently say: “yeah, I know VQA and visual reasoning pretty well now” • Focus on high-level intuitions, not technical details • Focus on static images, instead of videos • Focus on a selected set of papers, not a comprehensive literature review
Agenda • Task Overview • What are the main tasks that are driving progress in VQA and visual reasoning? • Method Overview • What are the state-of-the-art approaches and the key model design principles underlying these methods? • Summary • What are the core challenges and future directions?
What is V+L about? • V+L research is about how to train a smart AI system that can see and talk
What is V+L about? • V+L research is about how to train a smart AI system that can see and talk • Prof. Yann LeCun’s cake theory, mapped to our V+L context: Reinforcement Learning ↔ Multimodal Intelligence; Supervised Learning ↔ Language Understanding (BERT); Unsupervised/Self-supervised Learning ↔ Visual Understanding (ResNet)
Task Overview: VQA and Visual Reasoning • Large-scale annotated datasets have driven tremendous progress in this field • Dataset timeline: VQA v0.1 (2015/6) → Visual Dialog (2016/11) → VQA v2.0 (2017/4) → VQA-CP (2017/12) → VizWiz (2018/2) → NLVR2 (2018/11/1) → VCR (2018/11/27) → VE (2019/1) → VQA-Rephrasings (2019/2/15) → GQA (2019/2/25) → TextVQA (2019/4) → OK-VQA (2019/5) → ST-VQA (2019/10) → ...
VQA and Visual Dialog [1] VQA: Visual Question Answering, ICCV 2015 [2] Visual Dialog, CVPR 2017 Image credit: https://visualqa.org/, https://visualdialog.org/
VQA v2.0 and VQA-CP [1] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, CVPR 2017 [2] Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering, CVPR 2018
VizWiz and NLVR2 [1] VizWiz Grand Challenge: Answering Visual Questions from Blind People, CVPR 2018 [2] A Corpus for Reasoning About Natural Language Grounded in Photographs, ACL 2019
VCR [1] From Recognition to Cognition: Visual Commonsense Reasoning, CVPR 2019
Visual Entailment and VQA-Rephrasings [1] Visual Entailment: A Novel Task for Fine-Grained Image Understanding, 2019 [2] Cycle-Consistency for Robust Visual Question Answering, CVPR 2019
GQA [1] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, CVPR 2019
TextVQA [1] Towards VQA Models That Can Read, CVPR 2019
OK-VQA and Scene Text VQA (ST-VQA) [1] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge, CVPR 2019 [2] Scene Text Visual Question Answering, ICCV 2019
More datasets…
Diagnostic Datasets • CLEVR (Compositional Language and Elementary Visual Reasoning) • Has been extended to visual dialog (CLEVR-Dialog), referring expressions (CLEVR-Ref+), and video reasoning (CLEVRER) [1] CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, CVPR 2017 [2] CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog, NAACL 2019 [3] CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions, CVPR 2019 [4] CLEVRER: CoLlision Events for Video REpresentation and Reasoning, ICLR 2020
Beyond VQA: Visual Grounding • Referring Expression Comprehension: RefCOCO(+/g) • ReferIt Game: Referring to Objects in Photographs of Natural Scenes • Flickr30k Entities [1] ReferItGame: Referring to Objects in Photographs of Natural Scenes, EMNLP 2014 [2] Flickr30K Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models, IJCV 2017
Beyond VQA: Visual Grounding • PhraseCut: Language-based image segmentation [1] PhraseCut: Language-based Image Segmentation in the Wild, CVPR 2020
Visual Question Answering • [Chart: steady progress on the VQA challenge over the years, with the best accuracy shown reaching 76.36%] Image Credit: CVPR 2019 Visual Question Answering and Dialog Workshop
Agenda • Task Overview • What are the main tasks that are driving progress in VQA and visual reasoning? • Method Overview • What are the state-of-the-art approaches and the key model design principles underlying these methods? • Summary • What are the core challenges and future directions?
Overview • What a typical system looks like: Image → Image Feature Extraction; Question (“What is she eating?”) → Question Encoding; both streams → Multi-Modal Fusion → Answer Prediction → “Hamburger”
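To make the pipeline concrete, here is a minimal PyTorch sketch of this encode-fuse-predict architecture. All names and sizes (SimpleVQA, the 2048-d image feature, the 3129-way answer classifier commonly used for VQA v2) are illustrative assumptions, not any particular paper's implementation:

import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=3129,
                 img_dim=2048, q_dim=1024, fused_dim=1024):
        super().__init__()
        # Question encoding: word embedding + GRU.
        self.embed = nn.Embedding(vocab_size, 300)
        self.gru = nn.GRU(300, q_dim, batch_first=True)
        # Project both modalities into a common space.
        self.img_proj = nn.Linear(img_dim, fused_dim)
        self.q_proj = nn.Linear(q_dim, fused_dim)
        # Answer prediction as classification over frequent answers.
        self.classifier = nn.Linear(fused_dim, num_answers)

    def forward(self, img_feat, question_tokens):
        # img_feat: (B, img_dim) pooled CNN feature, assumed precomputed.
        # question_tokens: (B, T) word indices.
        _, h = self.gru(self.embed(question_tokens))  # h: (1, B, q_dim)
        q = h.squeeze(0)
        # Multimodal fusion: the simplest choice is an element-wise product.
        fused = torch.tanh(self.img_proj(img_feat)) * torch.tanh(self.q_proj(q))
        return self.classifier(fused)                 # answer logits

Most of the methods covered in this section refine exactly these blocks: better image features, better fusion, or multi-step reasoning on top of them.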
Overview • Better image feature preparation • Enhanced multimodal fusion • Bilinear pooling: how to fuse two vectors into one • Multimodal alignment: cross-modal attention • Incorporation of object relations: intra-modal self-attention, graph attention • Multi-step reasoning • Neural module networks for compositional reasoning • Robust VQA (mentioned briefly) • Multimodal pre-training (mentioned briefly)
Better Image Feature Preparation • From grid features to region features, and to grid features again • Timeline: Show, Attend and Tell (2015/2) → SAN (2015/11) → BUTD (2017/7) → Grid Feature (2020/1) → Pixel-BERT (2020/4)
Show, Attend and Tell → Stacked Attention Networks (SAN) → BUTD (2017 VQA Challenge Winner) [1] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015 [2] Stacked Attention Networks for Image Question Answering, CVPR 2016 [3] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, CVPR 2018
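A minimal sketch of one attention hop in the spirit of SAN, where the question vector scores each image region and the attended evidence refines the query; dimensions and module names are assumptions for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHop(nn.Module):
    def __init__(self, d=1024, k=512):
        super().__init__()
        self.w_v = nn.Linear(d, k)   # projects region/grid features
        self.w_q = nn.Linear(d, k)   # projects the query
        self.w_a = nn.Linear(k, 1)   # scores each region

    def forward(self, v, q):
        # v: (B, R, d) grid/region features; q: (B, d) question/query vector.
        h = torch.tanh(self.w_v(v) + self.w_q(q).unsqueeze(1))   # (B, R, k)
        alpha = F.softmax(self.w_a(h).squeeze(-1), dim=-1)       # (B, R)
        attended = (alpha.unsqueeze(-1) * v).sum(dim=1)          # (B, d)
        # The attended visual evidence refines the query for the next hop.
        return q + attended

Stacking two such hops, as SAN does, lets the model progressively sharpen where it looks: q1 = hop1(v, q0); q2 = hop2(v, q1); the final query feeds the answer classifier.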
In Defense of Grid Features for VQA [1] In Defense of Grid Features for Visual Question Answering, CVPR 2020
Pixel-BERT [1] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, 2020
Bilinear Pooling • Instead of simple concatenation or element-wise product for fusion, bilinear pooling methods have been studied • Bilinear pooling and the attention mechanism can enhance each other • Timeline: MCB (2016/6) → MLB (2016/10) → MUTAN (2017/5) → MFB & MFH (2017/8) → BLOCK (2019/1)
Multimodal Compact Bilinear Pooling (MCB, 2016 VQA Challenge Winner) and Multimodal Low-rank Bilinear Pooling (MLB) • MCB approximates the bilinear (outer-product) interaction with count sketch and FFT; however, the fused feature after the FFT step is still very high-dimensional, which motivated low-rank alternatives such as MLB [1] Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, EMNLP 2016 [2] Hadamard Product for Low-rank Bilinear Pooling, ICLR 2017 [3] Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering, ICCV 2017
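A minimal sketch of the low-rank (Hadamard-product) fusion idea behind MLB, with illustrative dimensions; the element-wise product of two low-rank projections approximates a full bilinear map without materializing its O(d^2) parameters:

import torch
import torch.nn as nn

class MLBFusion(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, rank=1200, out_dim=1024):
        super().__init__()
        self.U = nn.Linear(img_dim, rank)   # low-rank projection of the image
        self.V = nn.Linear(q_dim, rank)     # low-rank projection of the question
        self.P = nn.Linear(rank, out_dim)   # maps the joint space to the output

    def forward(self, v, q):
        # Hadamard product of the two projections plays the role of the
        # bilinear interaction; tanh adds the usual nonlinearity.
        return self.P(torch.tanh(self.U(v)) * torch.tanh(self.V(q)))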
MUTAN: Multimodal Tucker Fusion; BLOCK: Bilinear Superdiagonal Fusion [1] MUTAN: Multimodal Tucker Fusion for Visual Question Answering, ICCV 2017 [2] BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection, AAAI 2019
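A loose sketch of MUTAN-style rank-constrained Tucker fusion, with assumed sizes: both inputs are first projected into a small core space, then fused as a sum of R Hadamard-product terms, mirroring the rank-R constraint on the core tensor:

import torch
import torch.nn as nn

class TuckerFusion(nn.Module):
    def __init__(self, q_dim=1024, v_dim=2048, core=360, R=10, out_dim=1024):
        super().__init__()
        self.wq = nn.Linear(q_dim, core)   # question factor matrix
        self.wv = nn.Linear(v_dim, core)   # image factor matrix
        # Rank-R decomposition of the core tensor: each slice is realized
        # as a Hadamard product of two linear maps.
        self.hq = nn.ModuleList([nn.Linear(core, out_dim) for _ in range(R)])
        self.hv = nn.ModuleList([nn.Linear(core, out_dim) for _ in range(R)])

    def forward(self, q, v):
        q, v = torch.tanh(self.wq(q)), torch.tanh(self.wv(v))
        # Sum over the R rank-constrained slices of the core tensor.
        return sum(self.hq[r](q) * self.hv[r](v) for r in range(len(self.hq)))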
FiLM: Feature-wise Linear Modulation • Closely related to conditional batch normalization [1] FiLM: Visual Reasoning with a General Conditioning Layer, AAAI 2018
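A stripped-down sketch of a FiLM-conditioned residual block, with assumed sizes; the question encoding predicts a per-channel scale (gamma) and shift (beta) that modulate the visual feature map, much like conditional batch normalization:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FiLMBlock(nn.Module):
    def __init__(self, q_dim=1024, channels=128):
        super().__init__()
        self.film = nn.Linear(q_dim, 2 * channels)   # predicts (gamma, beta)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        # No learned affine in BN: the scale/shift come from the question.
        self.bn = nn.BatchNorm2d(channels, affine=False)

    def forward(self, x, q):
        # x: (B, C, H, W) visual feature map; q: (B, q_dim) question encoding.
        gamma, beta = self.film(q).chunk(2, dim=-1)   # each (B, C)
        h = self.bn(self.conv(x))
        # Feature-wise linear modulation: per-channel scale and shift.
        h = gamma[:, :, None, None] * h + beta[:, :, None, None]
        return x + F.relu(h)   # residual connection around the modulated branch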