VQA: Visual Question Answering
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh
Presented by: Surbhi Goel
Note: Images and tables that have not been cited have been taken from the above-mentioned paper.
Outline ● VQA Task ● Importance of VQA ● Dataset Analysis ● Human Accuracy ● Model Comparison for VQA ● Common Sense Knowledge ● Conclusion ● Future Work ● Discussion
Visual Question Answering
Importance of VQA
● Multi-modal task - a step towards solving AI
● Allows automatic quantitative evaluation
● Useful applications, e.g. answering questions asked by visually-impaired users
Image credits: Dhruv Batra
Dataset
Stump a smart robot! Ask a question that a human can answer, but a smart robot probably can't!
● >250K images
○ 200K from MS COCO (80K train / 40K val / 80K test)
○ 50K from Abstract Scenes
● QAs
○ 3 questions/image
○ 10 answers/question
○ +3 answers/question collected without showing the image
● Amazon Mechanical Turk
○ >10,000 Turkers
○ >41,000 human hours (4.7 human years; 20.61 person-job-years!)
● >760K questions
● ~10M answers (will grow over the years)
Slide credits: Dhruv Batra
Questions
Questions
[Figure: distribution of questions by their first words, e.g. "what is …"]
Answers
● 38.4% of questions are binary yes/no (39.3% for abstract scenes)
● 98.97% of questions have answers of 3 words or fewer
○ 23K unique one-word answers
● Two evaluation formats:
○ Open-ended
■ Input = question
○ Multiple choice
■ Input = question + 18 answer options
■ Options = correct / plausible / popular / random answers
Slide credits: Dhruv Batra
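Since each question has 10 human answers, open-ended evaluation uses a consensus-based accuracy: an answer counts as fully correct if at least 3 annotators gave it. A minimal sketch of that metric (ignoring the paper's answer-string normalization and subset averaging, which are simplified away here):

```python
# Sketch of the VQA open-ended accuracy metric: an answer is fully
# correct if at least 3 of the 10 human annotators gave the same answer.
# Real evaluation also normalizes answer strings; that is omitted here.

def vqa_accuracy(predicted, human_answers):
    """Accuracy = min(#humans who gave `predicted` / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Illustrative annotator responses (made up for this example)
humans = ["red", "red", "red", "maroon", "red", "red",
          "dark red", "red", "red", "red"]
print(vqa_accuracy("red", humans))     # 1.0 (at least 3 humans agree)
print(vqa_accuracy("maroon", humans))  # 1/3 (only 1 of 10 humans)
```

This partial credit is why "popular but slightly off" answers still score above zero, which motivates the later discussion of alternative metrics such as METEOR.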
Answers Slide credits: Dhruv Batra
Answers
Human Accuracy
VQA Model (BoW question channel)
● Image channel: CNN (convolution + pooling layers with non-linearities, then fully connected) → image embedding
● Question channel: bag-of-words (BoW) embedding over beginning-of-question words, e.g. for "How many horses are in this image?": what 0, where 0, how 1, is 0, could 0, are 1, horse 1, image 1, …
● Fused embeddings → fully-connected MLP → softmax over top-K answers (1k output units)
Slide credits: Dhruv Batra
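The BoW question channel above can be sketched in a few lines: each question becomes a fixed-length count vector over a word vocabulary. The tiny vocabulary below is illustrative, not the paper's actual top-word list.

```python
# Minimal bag-of-words (BoW) question embedding, as on the slide.
# The vocabulary here is a small illustrative stand-in.

vocab = ["what", "where", "how", "is", "could", "are",
         "many", "horses", "in", "this", "image"]

def bow_embed(question, vocab=vocab):
    """Map a question string to a count vector over `vocab`."""
    tokens = question.lower().replace("?", "").split()
    return [tokens.count(word) for word in vocab]

vec = bow_embed("How many horses are in this image?")
# "how", "many", "horses", "are", "in", "this", "image" each appear once
```

Note the design limitation this exposes: BoW discards word order, so "Is the dog on the couch?" and "Is the couch on the dog?" embed identically, which is one motivation for the LSTM channel on the next slide.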
VQA Model (LSTM question channel)
● Image channel: CNN (convolution + pooling layers with non-linearities, then fully connected) → image embedding
● Question channel: LSTM embedding of "How many horses are in this image?"
● Fused embeddings → fully-connected MLP → softmax over top-K answers (1k output units)
Slide credits: Dhruv Batra
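The fusion step of the two-channel model can be sketched numerically: project both channels to a common dimension, fuse them, and apply a softmax over the top-K answers. Dimensions, random weights, and the element-wise fusion are illustrative stand-ins, not the trained model.

```python
import numpy as np

# Sketch of the two-channel VQA model head: an image embedding (stand-in
# for CNN features) and a question embedding (stand-in for an LSTM state)
# are fused element-wise and classified over the top-K answers.
rng = np.random.default_rng(0)
K = 1000                               # top-K most frequent answers
D = 1024                               # common embedding dimension (assumed)

img_emb = rng.standard_normal(D)       # stand-in for CNN image features
ques_emb = rng.standard_normal(D)      # stand-in for LSTM question state

fused = img_emb * ques_emb             # element-wise fusion (one option)
W = rng.standard_normal((K, D)) * 0.01 # untrained classifier weights
logits = W @ fused

def softmax(z):
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits)                # distribution over K answers
answer_idx = int(np.argmax(probs))     # index into the answer vocabulary
```

Treating VQA as K-way classification over frequent answers is what makes the "98.97% of answers are 3 words or fewer" statistic so convenient: a small closed answer set covers almost all questions.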
Baseline #1 - Language-alone
● Same architecture, but the image channel (CNN) is dropped: the LSTM question embedding alone predicts the answer
Slide credits: Dhruv Batra
Baseline #2 - Vision-alone
● Same architecture, but the question channel is dropped: the CNN image embedding alone predicts the answer
Slide credits: Dhruv Batra
Results
Challenge: Common Sense Does the person have perfect vision?
Evaluate Common Sense in the Dataset Asked users: ● Does the question require common sense? ● How old should a person be to answer the question? Image credits: Dhruv Batra
Conclusion
● Compelling 'AI-complete' task
● Combines a range of vision problems in one, such as
○ Scene Recognition
○ Object Recognition
○ Object Localization
○ Knowledge-base Reasoning
○ Commonsense Reasoning
● Current models are far from human-level accuracy
Future Work
● Dataset
○ Extend the dataset
○ Create task-specific datasets, e.g. for visually-impaired users
● Model
○ Exploit more image-related information
○ Identify the task and then use existing systems
Challenge and workshop to promote systematic research (www.visualqa.org)
Discussion Points (Piazza)
● Should a different evaluation metric (such as METEOR) be used?
● How can questions be collected faster (compared to using humans)?
● Is the length restriction on the answers limiting the scope of the task?
● Since the distribution of question types is skewed, will it bias a statistical learner to answer only certain types of questions?
● Why use 'realistic' abstract scenes for the task?
● Why does the LSTM not perform well?
● Would using Question + Image + Caption give better results than using Question + Image?
● Should we focus on task-specific VQA?
Thank You!