From Recognition to Cognition: Visual Commonsense Reasoning
Rowan Zellers et al. | Presented by Chinmoy Samant and Anurag Patil
Outline ● The Problem ● Motivation and Importance ● Dataset ● The Approach ○ Adversarial Matching ○ Recognition and Cognition Model ● Results ● Critique
The Problem
● Infer the entire situation: what is happening and why it is happening.
1. Three people are dining and have already ordered food.
2. Person 3 is serving and is not with the group.
3. Person 1 ordered pancakes and bacon.
4. How do we know? Person 4 is pointing at Person 1 while looking at the server (Person 3).
Motivation and Importance
● Recognition [find objects] ➔ cognition [infer interactions]: good vision systems must also become good cognition systems at scale.
● Image Captioning: high-level understanding, but difficult evaluation.
● VQA: easy evaluation, but no rationale.
● VCR: a multiple-choice setting, keeping evaluation easy while requiring rationales.
● A justification has to include details about the scene and background knowledge about how the world works.
Collecting commonsense inferences
Collecting commonsense inferences
● Every 48 hours, the authors manually verified the submitted work.
Adversarial Matching
● Annotation artifacts: subtle patterns that are, by themselves, highly predictive of the correct label.
● Wrong answers must be relevant to the question, yet different from the correct answer:
○ (Q, A'): question relevance
○ (A, A'): entailment (similarity to the correct answer)
Adversarial Matching We’ll use these two metrics to recycle right answers to other questions, using a minimum weight bipartite matching.
Adversarial Matching
● Wrong answers must be relevant to the question, yet different from the correct answer.
● Weight for recycling response r_j as a distractor for question q_i:
W_ij = log P_rel(q_i, r_j) + λ · log(1 − P_sim(r_i, r_j))
where P_rel is the question-relevance score and P_sim is the similarity (entailment) score against the correct answer r_i.
● λ trades off the two terms: when question relevance dominates, the distractors are hard for machines; when the entailment penalty dominates, they are easy for humans.
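A minimal sketch of how this matching could be implemented (our illustration, not the authors' released code): build the weight matrix from precomputed P_rel and P_sim scores and solve a maximum-weight bipartite assignment with SciPy. The λ default here is a placeholder; the paper tunes this trade-off and repeats the matching to collect several distractors per question.

import numpy as np
from scipy.optimize import linear_sum_assignment

def adversarial_match(p_rel, p_sim, lam=0.5):
    """p_rel[i, j]: relevance of response r_j to question q_i.
    p_sim[i, j]: similarity of response r_j to the correct answer r_i.
    Returns cols, where cols[i] is the index of the answer recycled
    as a distractor for question q_i."""
    eps = 1e-8  # avoid log(0)
    w = np.log(p_rel + eps) + lam * np.log(1.0 - p_sim + eps)
    np.fill_diagonal(w, -1e9)  # a question must not recycle its own answer
    rows, cols = linear_sum_assignment(-w)  # negate: maximize total weight
    return cols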
What about the people tags (e.g., [person5])?
● We modify the detection tags in each candidate answer to better match the new question/image, based on heuristics (a sketch follows below).
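A hypothetical sketch of such a heuristic (the exact rules are ours, not the paper's): remap person tags in the recycled answer to tags that are valid for the new image, preferring people already mentioned in the new question.

import re

def remap_person_tags(answer, question, people_in_new_image):
    # Tags mentioned in the question come first, then remaining detections.
    mentioned = re.findall(r"\[person\d+\]", question)
    pool = mentioned + [p for p in people_in_new_image if p not in mentioned]
    mapping = {}
    for tag in re.findall(r"\[person\d+\]", answer):
        if tag not in mapping:
            # Reuse a valid tag while any remain, else leave the tag as-is.
            mapping[tag] = pool.pop(0) if pool else tag
    return re.sub(r"\[person\d+\]", lambda m: mapping[m.group(0)], answer)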
Unique Contributions of Adversarial Matching
Thus, the dataset definition: 290K questions | 110K images
● Image (I): a sequence of object detections (o), each with a bounding box (b), a segmentation mask (m), and a class label.
● Query (q): a sequence of words from the vocabulary, or tags referring to objects.
● Responses (r): N candidate responses with the same structure as the query.
● 38% why/how questions | 24% cognition-level activities | 13% temporal reasoning
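A minimal container mirroring this definition (field names are ours, not the dataset's exact schema):

from dataclasses import dataclass
from typing import List, Union

@dataclass
class ObjectDetection:
    box: List[float]           # bounding box (x1, y1, x2, y2)
    mask: List[List[float]]    # segmentation-mask polygon
    label: str                 # class label, e.g. "person"

@dataclass
class VCRExample:
    objects: List[ObjectDetection]          # detections o in image I
    query: List[Union[str, int]]            # words, or ints tagging objects
    responses: List[List[Union[str, int]]]  # N candidates, same structure
    label: int                              # index of the correct response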
Setting up the Task
Problem Description
How do we reach the correct answer and rationale?
1. Figure out the meanings of the query and the responses with respect to the image and to each other.
2. Perform inference over this representation.
R2C model stages 1. Grounding 2. Contextualization 3. Reasoning
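A compressed PyTorch sketch of the three stages (dimensions and details simplified; the real R2C grounds tokens with BERT embeddings plus RoI-aligned CNN features for tagged objects, and uses richer attention):

import torch
import torch.nn as nn

class R2CSketch(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # 1. Grounding: contextualize query/response tokens (image features
        #    are mixed into object-tagged tokens before this step).
        self.ground = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        # 2. Contextualization: attend from the response to the query
        #    and to the object features.
        self.attend = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # 3. Reasoning: an LSTM over the concatenated streams, then pool
        #    and score each response.
        self.reason = nn.LSTM(3 * dim, dim // 2, bidirectional=True, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, query, response, objects):
        q, _ = self.ground(query)
        r, _ = self.ground(response)
        r_q, _ = self.attend(r, q, q)
        r_o, _ = self.attend(r, objects, objects)
        h, _ = self.reason(torch.cat([r, r_q, r_o], dim=-1))
        return self.score(h.max(dim=1).values)

# One logit per (query, response) pair; softmax over the N=4 candidates.
model = R2CSketch()
logits = model(torch.randn(4, 12, 512),   # question, repeated per candidate
               torch.randn(4, 9, 512),    # 4 candidate responses
               torch.randn(4, 6, 512))    # object features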
Final architecture
● Input: question (Q) + image. Two R2C models are chained:
○ R2C Model 1 (Q→A): query q = question (Q); responses r = candidate answers (A).
○ R2C Model 2 (QA→R): query q = question + chosen answer (QA); responses r = candidate rationales (R).
● Output: answer (A) + rationale (R).
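The joint Q→AR metric composes the two stages: an example counts as correct only when both the chosen answer and the chosen rationale are right. A sketch, assuming the rationale predictions come from Model 2 run on Model 1's predicted answers:

import numpy as np

def q2ar_accuracy(ans_pred, rat_pred, ans_label, rat_label):
    both = (ans_pred == ans_label) & (rat_pred == rat_label)
    return float(both.mean())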
Results
Baselines used
● Text-only baselines:
○ BERT
○ BERT (response only)
○ ESIM+ELMo
○ LSTM+ELMo
● VQA baselines:
○ RevisitedVQA
○ Bottom-Up and Top-Down Attention
○ MLB (Multimodal Low-rank Bilinear Attention)
○ MUTAN (Multimodal Tucker Fusion)
Results
Yes (that’s why it got published!) (chance on the joint Q→AR task: 6.2%)
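Where the 6.2 comes from: with four answer choices and four rationale choices, random guessing gets both stages right with probability 1/4 × 1/4 = 6.25%, reported as 6.2.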
Ablation Results
Okay, but….
VCR Leaderboard * *As of Feb 29th 2020, may have changed since then.
VCR Leaderboard* (← This paper)
*As of Feb 29th 2020, may have changed since then.
(The leading entry is from Microsoft Dynamics 365 AI Research: UNITER, https://arxiv.org/abs/1909.11740)
Sample Results (Correct)
Sample Results (Incorrect)
Critique: The “Good”
● Dataset:
○ A comprehensive pipeline to select and annotate interesting images at scale: 290K questions over 110K images.
● Adversarial Matching:
○ Leverages the Natural Language Inference task to build a robust dataset with minimized annotation artifacts. A better BERT yields a better dataset.
● Ablation study:
○ Ablates all the important model components as well as alternatives that were not included, giving the reasoning behind model design decisions.
● Details:
○ Main paper: 8 pages; appendix: 17 pages.
○ Detailed descriptions of the dataset, Adversarial Matching, the R2C model, and even the hyper-parameters used, ensuring detailed insights and easy reproducibility of results.
● Output analysis:
○ Analyzes correct as well as incorrect results, giving more insight into the model’s understanding of the world.
Critique: The “Not So Good”
● Language bias in the dataset:
○ Abstract images help reduce language bias, e.g., a fire hydrant being red or the sky being blue; artificial images discourage such bias. Could these have been added?
● Adversarial Matching:
○ The P_rel computation makes it hard to simply swap in a bigger BERT.
● R2C Model:
○ Could use knowledge bases for world knowledge and automatic rationale generation.
○ Could handle the Q→A and QA→R subtasks better than by naive composition.
● Evaluation Methodology:
○ Stronger baseline VQA models could have been selected to ensure a fair comparison with R2C.
○ No masking-based analysis or results to verify the model attends to the whole image.
Questions?
References
1. Zellers et al., From Recognition to Cognition: Visual Commonsense Reasoning. https://arxiv.org/pdf/1811.10830.pdf
2. https://www.youtube.com/watch?v=Je5LlZlqUt8&t=4776s
3. https://www.youtube.com/watch?v=nl6IsjfWKms