
  1. Yash Goyal (Georgia Tech), Aishwarya Agrawal (Georgia Tech)

  2-6. Outline
       • Overview of Task and Dataset
       • Overview of Challenge
       • Winner Announcements
       • Analysis of Results

  7-10. VQA Task
        Given an image and a question about it (e.g., "What is the mustache made of?"), the AI System produces an answer (e.g., "bananas").

  11-15. VQA v1.0 Dataset
         Example questions cover several skills: questions about objects, fine-grained recognition, counting, and common sense.

  16. VQA v2.0 Dataset

  17. New in VQA v2.0 (vs. VQA v1.0): complementary images
      Similar images, different answers: "Who is wearing glasses?" is answered "man" for one image and "woman" for its complementary image.

  18. VQA v2.0 Dataset Stats
      • >200K images
      • >1.1M questions
      • >11M answers
      (approx. 1.8x the size of VQA v1.0)

  19. Accuracy Metric
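The deck does not reproduce the formula on this slide; as a reference, a minimal Python sketch of the standard VQA accuracy metric (an answer counts as fully correct if at least 3 of the 10 human annotators gave it):

```python
def vqa_accuracy(predicted, human_answers):
    """Standard VQA accuracy for a single question.

    `human_answers` is the list of 10 crowd-sourced answers. The official
    evaluation additionally averages this score over all 10 subsets obtained
    by leaving one human answer out, and normalizes answer strings first.
    """
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators said "bananas" -> accuracy 1.0
print(vqa_accuracy("bananas", ["bananas"] * 4 + ["fruit"] * 6))
```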

  20. Outline
      • Overview of Task and Dataset
      • Overview of Challenge
      • Winner Announcements
      • Analysis of Results

  21. VQA Challenge on https://evalai.cloudcv.org/

  22-24. Dataset splits (dataset sizes are approximate)

         Split        Images   Questions   Answers
         Training     80K      443K        4.4M
         Validation   40K      214K        2.1M
         Test         80K      447K        -

  25. Test Dataset
      • 4 splits of approximately equal size
      • Test-dev (development): debugging and validation
      • Test-standard (publications): used to score entries for the public leaderboard
      • Test-challenge (competitions): used to rank challenge participants
      • Test-reserve (check overfitting): used to estimate overfitting; scores on this split are never released
      (Slide adapted from: MSCOCO Detection/Segmentation Challenge, ICCV 2015)

  26. Outline
      • Overview of Task and Dataset
      • Overview of Challenge
      • Winner Announcements
      • Analysis of Results

  27. Challenge Stats
      • 40 teams
      • >=40 institutions*
      • >=8 countries*
      *Statistics based on teams that have replied

  28. Challenge Runner-Ups: Joint Runner-Up Team 1, SNU-BI
      Jin-Hwa Kim (Seoul National University), Jaehyun Jun (Seoul National University), Byoung-Tak Zhang (Seoul National University & Surromind Robotics)
      Challenge Accuracy: 71.69

  29. Challenge Runner-Ups: Joint Runner-Up Team 2, HDU-UCAS-USYD
      Zhou Yu (Hangzhou Dianzi University, China), Jun Yu (Hangzhou Dianzi University, China), Chenchao Xiang (Hangzhou Dianzi University, China), Liang Wang (Hangzhou Dianzi University, China), Dalu Guo (The University of Sydney, Australia), Qingming Huang (University of Chinese Academy of Sciences), Jianping Fan (Hangzhou Dianzi University, China), Dacheng Tao (The University of Sydney, Australia)
      Challenge Accuracy: 71.91

  30. Challenge Winner: FAIR-A*
      Yu Jiang† (Facebook AI Research), Vivek Natarajan† (Facebook AI Research), Xinlei Chen† (Facebook AI Research), Marcus Rohrbach (Facebook AI Research), Dhruv Batra (Facebook AI Research & Georgia Tech), Devi Parikh (Facebook AI Research & Georgia Tech)
      Challenge Accuracy: 72.41 († equal contribution)

  31. Outline
      • Overview of Task and Dataset
      • Overview of Challenge
      • Winner Announcements
      • Analysis of Results

  32-35. Challenge Results
         [Bar chart of overall challenge accuracy for all participating teams; y-axis approx. 60-74%. Annotation on the chart: "+3.4% absolute".]

  36. Statistical Significance
      • Bootstrap resampling, 5000 samples
      • 95% confidence intervals
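The deck gives only the bootstrap settings; a minimal sketch, assuming per-question accuracy scores are available as an array, of how a 95% percentile-bootstrap confidence interval for a team's overall accuracy could be computed:

```python
import numpy as np

def bootstrap_ci(per_question_acc, n_boot=5000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy.

    `per_question_acc`: one VQA accuracy value (0.0-1.0) per test question.
    Questions are resampled with replacement `n_boot` times (5000 on the slide).
    """
    rng = np.random.default_rng(seed)
    acc = np.asarray(per_question_acc, dtype=float)
    n = len(acc)
    means = np.array([acc[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
    lower = np.percentile(means, 100 * (1 - confidence) / 2)
    upper = np.percentile(means, 100 * (1 + confidence) / 2)
    return acc.mean(), (lower, upper)
```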

  37. Statistical Significance
      [Overall accuracy of the top teams with bootstrap confidence intervals; y-axis approx. 67-73%.]

  38. Easy vs. Difficult Questions

  39-42. Easy vs. Difficult Questions
         [Histogram: x-axis = number of the top 10 teams (0/10 through 10/10) that answered a question correctly; y-axis = percentage of questions (0-70%). Questions answered by 0/10 teams are labeled "Difficult Questions" and questions answered by all 10/10 teams "Easy Questions"; 82.5% of questions can be answered by at least 1 method. A final overlay compares the 2016, 2017, and 2018 distributions.]
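As a reference for how such a difficulty histogram could be produced, a minimal sketch assuming a hypothetical 0/1 matrix `correct` of shape (10 teams x N questions):

```python
import numpy as np

def difficulty_histogram(correct):
    """Fraction of questions answered correctly by exactly k of the top teams.

    `correct[t, q]` is 1 if team t answered question q correctly, else 0
    (hypothetical layout: 10 top teams by rows, test questions by columns).
    """
    n_teams, n_questions = correct.shape
    per_question = correct.sum(axis=0).astype(int)          # teams correct per question
    hist = np.bincount(per_question, minlength=n_teams + 1)  # index k = answered by k teams
    return hist / n_questions

# Fraction of questions answered by at least one method:
#   1 - difficulty_histogram(correct)[0]
```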

  43. Difficult Questions with Rare Answers

  44. Difficult Questions with Rare Answers
      • What is the name of …
      • What is the number on …
      • What is written on the …
      • What does the sign …
      • What time is it?
      • What kind of …
      • What type of …
      • Why is the …

  45. Easy vs. Difficult Questions

  46. Easy vs. Difficult Questions
      [Example images contrasting Difficult Questions with Easy Questions with Frequent Answers.]

  47. Answer Type Analyses
      • SNU-BI performs best for "number" questions

  48. Results on "number" questions
      [Bar chart of "number"-question accuracy (approx. 30-60%) for all participating teams; SNU-BI performs best.]

  49. Answer Type Analyses
      • SNU-BI performs best for "number" questions
      • No team is statistically significantly better than the winning team for "yes/no" and "other" questions

  50. Are models sensitive to subtle changes in images?
      [Complementary-pair example repeated: "Who is wearing glasses?" with similar images and different answers ("man" / "woman").]

  51. Are models sensitive to subtle changes in images?
      • Are predictions different for complementary images?
      • Are predictions accurate for complementary images?
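A minimal sketch of the two pair-level checks above, using a hypothetical per-pair record layout and exact-match correctness (the deck does not spell out the precise metric definitions):

```python
def pair_metrics(pairs):
    """Pair-level analysis over complementary image pairs.

    Each element of `pairs` describes one complementary pair (same question,
    two images with different ground-truth answers), e.g.:
        {"pred1": "man", "gt1": "man", "pred2": "man", "gt2": "woman"}
    Returns the fraction of pairs where the two predictions differ, and the
    fraction where both predictions are correct (exact match used here as a
    simplification of the 10-annotator VQA accuracy).
    """
    n = len(pairs)
    different = sum(p["pred1"] != p["pred2"] for p in pairs)
    both_correct = sum(p["pred1"] == p["gt1"] and p["pred2"] == p["gt2"] for p in pairs)
    return different / n, both_correct / n
```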

  52. Are predictions different for complementary images?
      [Bar chart (approx. 40-70%) of the fraction of complementary image pairs receiving different predictions, per team.]

  53-54. Are predictions accurate for complementary images?
         [Bar chart (approx. 40-60%) of accuracy on complementary image pairs, per team. Annotations: "+4.8% absolute" over the 2017 winner (52.7%).]

  55. Are models driven by priors?
      Only consider questions whose answers are not popular (given the question type) in training:
      • 1-Prior: the test answer is not the top-1 most common training answer for that question type
      • 2-Prior: the test answer is not among the top-2 most common training answers for that question type
      (Agrawal et al., CVPR 2018)
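A minimal sketch of how such k-Prior evaluation subsets could be constructed, assuming (question type, answer) pairs are available for the training and test questions (the deck does not give the exact procedure):

```python
from collections import Counter, defaultdict

def k_prior_subset(train_qas, test_qas, k):
    """Indices of test questions whose answer is NOT among the top-k most
    common training answers for the same question type (the k-Prior subset).

    `train_qas`, `test_qas`: lists of (question_type, answer) pairs
    (hypothetical layout, for illustration only).
    """
    answer_counts = defaultdict(Counter)
    for qtype, ans in train_qas:
        answer_counts[qtype][ans] += 1

    subset = []
    for idx, (qtype, ans) in enumerate(test_qas):
        top_k = {a for a, _ in answer_counts[qtype].most_common(k)}
        if ans not in top_k:
            subset.append(idx)
    return subset
```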
