

  1. From Recognition to Cognition: Visual Commonsense Reasoning. Rowan Zellers et al. Presented by Chinmoy Samant and Anurag Patil.

  2. Outline
  ● The Problem
  ● Motivation and Importance
  ● Dataset
  ● The Approach
    ○ Adversarial Matching
    ○ Recognition to Cognition (R2C) Model
  ● Results
  ● Critique

  3. The Problem
  ● Infer the entire situation: what is happening and why it is happening.
  1. Three people are dining and have already ordered food.
  2. Person 3 is serving and is not with the group.
  3. Person 1 ordered pancakes and bacon.
  4. How do we know? Person 4 is pointing to Person 1 while looking at the server (Person 3).

  4. Motivation and Importance
  ● Recognition [find objects] ➔ cognition [infer interactions].
  ● Good vision systems ➔ good cognition systems at scale.
  ● Image captioning: high-level understanding, but difficult evaluation.
  ● VQA: easy evaluation, but no rationale.
  ● Multiple-choice setting.
  ● The justification has to include details about the scene and background knowledge about how the world works.

  5. Collecting commonsense inferences

  6. Collecting commonsense inferences

  7. Collecting commonsense inferences. Every 48 hours, the authors manually verified the work.

  8. Adversarial Matching
  ● Annotation artifacts: subtle patterns that are by themselves highly predictive of the correct label.
  ● Wrong answers must be relevant to the question yet different from the correct answer:
    ○ (Q, A′): question relevance
    ○ (A, A′): entailment

  9. Adversarial Matching. We use these two metrics to recycle right answers as wrong answers for other questions, via a minimum-weight bipartite matching.

  10. Adversarial Matching
  Wrong answers must be relevant to the question yet different from the correct answer: (Q, A′) measures question relevance, (A, A′) measures entailment.

  W_ij = log(P_rel(q_i, r_j)) + λ log(1 − P_sim(r_i, r_j))

  where W_ij is an element of the weight matrix, P_rel is the relevance score, and P_sim is the similarity (entailment) score.
  ● Question-relevance term dominates ➔ hard for machines.
  ● Entailment penalty dominates ➔ easy for humans.
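The weighted assignment on this slide can be illustrated with a tiny brute-force version. This is a sketch, not the paper's implementation: `p_rel` and `p_sim` are random stand-ins for the trained relevance/similarity scorers, λ is arbitrary, and enumerating permutations only works at toy sizes (the paper solves the same objective with a polynomial-time bipartite matching algorithm).

```python
import itertools
import math
import random

def adversarial_match(p_rel, p_sim, lam=0.3):
    """Pick a distractor answer j for every question i (j != i), maximizing
    W_ij = log P_rel(q_i, r_j) + lam * log(1 - P_sim(r_i, r_j)) over all
    one-to-one assignments. Brute force, so usable only for tiny n."""
    n = len(p_rel)
    eps = 1e-8

    def w(i, j):
        if i == j:  # an answer may not serve as its own distractor
            return float("-inf")
        return math.log(p_rel[i][j] + eps) + lam * math.log(1 - p_sim[i][j] + eps)

    best, best_score = None, float("-inf")
    for perm in itertools.permutations(range(n)):
        score = sum(w(i, perm[i]) for i in range(n))
        if score > best_score:
            best, best_score = perm, score
    return list(best)

random.seed(0)
n = 5
p_rel = [[random.uniform(0.1, 0.9) for _ in range(n)] for _ in range(n)]  # stand-in P_rel
p_sim = [[random.uniform(0.0, 0.9) for _ in range(n)] for _ in range(n)]  # stand-in P_sim
assignment = adversarial_match(p_rel, p_sim)  # assignment[i] = recycled answer for q_i
```

Because the diagonal weight is minus infinity, the best assignment is always a derangement: every question receives someone else's correct answer as its distractor.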

  11. What about the person tags ([person5])? We modify the detection tags in the candidate answer to better match the new question/image, based on heuristics.

  12. Unique Contributions of Adversarial Matching

  13. Thus, the dataset. Definition: 290K questions | 110K images.
  ● Image (I): a sequence of object detections (o), each with a bounding box (b), a segmentation mask (m), and a class label.
  ● Query (q): a sequence of words from the vocabulary and tags referring to objects.
  ● Responses (r): N candidates with the same structure as the query.
  ● 38% why/how questions | 24% cognition-level activities | 13% temporal reasoning.
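Under these definitions, one instance can be pictured as the following structure. This is purely illustrative: the field names, filename, and values are invented for this sketch and are not the dataset's actual JSON schema.

```python
# Hypothetical shape of one VCR-style instance (illustrative names only).
instance = {
    "image": "frame_0042.jpg",  # placeholder path
    "objects": [
        # each detection: bounding box (b), segmentation mask (m), class label
        {"box": [10, 20, 110, 220], "mask": None, "label": "person"},
        {"box": [150, 30, 260, 240], "mask": None, "label": "person"},
    ],
    # a query mixes vocabulary words with tags referring to detections
    "question": ["why", "is", {"tag": 0}, "pointing", "at", {"tag": 1}, "?"],
    # responses share the query's mixed word/tag structure
    "answers": [[{"tag": 0}, "is", "ordering", "food"]],
    "rationales": [["pointing", "at", "a", "server", "signals", "ordering"]],
}

def tags(seq):
    """Collect the object indices a token sequence refers to."""
    return [t["tag"] for t in seq if isinstance(t, dict)]
```

The mixed word/tag structure is what lets a model ground "person1" to a specific detection instead of treating it as an ordinary word.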

  14. Setting up the Task

  15. Problem Description

  16. Problem Description

  17. Problem Description

  18. Problem Description

  19. Problem Description

  20. How do we reach the correct answer and reasoning?
  1. Figure out the meanings of the query and the responses with respect to the image and each other.
  2. Perform inference on this representation.

  21. R2C model stages 1. Grounding 2. Contextualization 3. Reasoning
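The three stages can be sketched numerically. Everything below is a stand-in: the real R2C grounds tokens with BERT embeddings plus CNN object features and reasons with an LSTM; this toy version only shows the data flow (ground, attend, pool to a score), with invented 2-dimensional embeddings.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ground(tokens, word_emb, obj_feats):
    """Stage 1 (Grounding): map each token to a vector; object tags borrow
    the image feature of the detection they refer to."""
    return [obj_feats[t["tag"]] if isinstance(t, dict) else word_emb[t]
            for t in tokens]

def contextualize(resp_vecs, query_vecs):
    """Stage 2 (Contextualization): attend each response vector over the
    query and add the attended summary back in."""
    out = []
    for r in resp_vecs:
        attn = softmax([dot(r, q) for q in query_vecs])
        summary = [sum(a * q[k] for a, q in zip(attn, query_vecs))
                   for k in range(len(r))]
        out.append([x + y for x, y in zip(r, summary)])
    return out

def reason(ctx_vecs):
    """Stage 3 (Reasoning): pool to a scalar response score
    (the paper runs an LSTM here; mean-pooling is a stand-in)."""
    pooled = [sum(col) / len(ctx_vecs) for col in zip(*ctx_vecs)]
    return sum(pooled)

word_emb = {"who": [1.0, 0.0], "serves": [0.0, 1.0]}   # toy embeddings
obj_feats = [[0.9, 0.1], [0.2, 0.8]]                    # toy image features
q = ground(["who", "serves", {"tag": 0}], word_emb, obj_feats)
r = ground([{"tag": 1}, "serves"], word_emb, obj_feats)
score = reason(contextualize(r, q))
```

In the real model this score is computed for each of the four candidate responses and the highest-scoring one is chosen.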

  22. Final architecture: two R2C models composed. Input: Question (Q) + Image.
  ● R2C Model 1 (Q ➔ A): query q = Question (Q), responses r = Answer candidates (A).
  ● R2C Model 2 (QA ➔ R): query q = Question + Answer (QA), responses r = Rationale candidates (R).
  Output: Answer (A) + Rationale (R).
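The composition of the two models can be sketched as a two-stage pipeline. The scorer below is a word-overlap stand-in for a trained R2C model, and the question/answer strings are invented for illustration.

```python
def two_stage_vcr(question, answers, rationales, score_fn):
    """Q -> A: pick the best-scoring answer; QA -> R: pick the best-scoring
    rationale conditioned on the question plus the chosen answer."""
    a_idx = max(range(len(answers)), key=lambda i: score_fn(question, answers[i]))
    qa = question + " " + answers[a_idx]
    r_idx = max(range(len(rationales)), key=lambda i: score_fn(qa, rationales[i]))
    return a_idx, r_idx

def overlap_score(query, response):
    """Toy scorer: count shared words (a stand-in for a trained model)."""
    q, r = set(query.lower().split()), set(response.lower().split())
    return len(q & r)

a, r = two_stage_vcr(
    "why is person1 pointing at person3",
    ["person1 is pointing to order food", "person1 is waving goodbye"],
    ["pointing at a server signals ordering food", "waving usually means goodbye"],
    overlap_score,
)
```

Note that the QA ➔ R stage trusts whatever answer the Q ➔ A stage picked, so an early mistake propagates; this is the naive composition of the subtasks.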

  23. Results

  24. Baselines used
  ● Text-only baselines:
    ○ BERT
    ○ BERT (response only)
    ○ ESIM+ELMo
    ○ LSTM+ELMo
  ● VQA baselines:
    ○ RevisitedVQA
    ○ Bottom-Up and Top-Down Attention
    ○ MLB (Multimodal Low-rank Bilinear Attention)
    ○ MUTAN (Multimodal Tucker Fusion)

  25. Results

  26. Yes (that’s why it got published!) (Chance: 6.2)

  27. Ablation Results

  28. Okay, but….

  29. VCR Leaderboard* (*As of Feb 29, 2020; may have changed since then.)

  30. VCR Leaderboard* (*As of Feb 29, 2020; may have changed since then.) The dotted line marks this paper's entry.

  31. (From Microsoft Dynamics 365 AI Research https://arxiv.org/abs/1909.11740)

  32. Sample Results (Correct)

  33. Sample Results (Incorrect)

  34. Critique: The “Good”
  ● Dataset:
    ○ A comprehensive pipeline to select and annotate interesting images at scale: 290K questions over 110K images.
  ● Adversarial Matching:
    ○ Leverages the Natural Language Inference task to build a robust dataset with minimal annotation artifacts. A better BERT yields a better dataset.
  ● Ablation study:
    ○ Ablates all the important model components as well as alternatives not included, giving the reasoning behind model design decisions.
  ● Details:
    ○ Main paper: 8 pages; appendix: 17 pages.
    ○ Detailed description of the dataset, Adversarial Matching, the R2C model, and even the hyperparameters used, ensuring detailed insight and easy reproducibility of results.
  ● Output analysis:
    ○ Analyzes correct as well as incorrect results, giving more insight into the model’s understanding of the world.

  35. Critique: The “Not So Good”
  ● Language bias in the dataset:
    ○ Abstract (artificial) images help reduce language bias, e.g., a fire hydrant being red or the sky being blue. Could these have been added?
  ● Adversarial Matching:
    ○ The P_rel computation makes it hard to swap in a bigger BERT.
  ● R2C model:
    ○ Could use knowledge bases for world knowledge and automatic rationale generation.
    ○ Better handling of the subtasks Q➔A and QA➔R rather than naive composition.
  ● Evaluation methodology:
    ○ Better baseline VQA models could have been selected to ensure a fair comparison with R2C.
    ○ No masking approach or masking-based results to ensure complete attention coverage of images.

  36. Questions?

  37. References
  1. https://arxiv.org/pdf/1811.10830.pdf
  2. https://www.youtube.com/watch?v=Je5LlZlqUt8&t=4776s
  3. https://www.youtube.com/watch?v=nl6IsjfWKms
