Aishwarya Agrawal Ph.D. Student Machine Learning and Perception Lab
Identify objects in scene: sky, stop light, building, bus, car, person, sidewalk
Identify attributes of objects: blue sky, green stop light, tall building, many cars, red bus, one bicycle
Identify activities in scene: man walking on sidewalk; person wearing a helmet riding bicycle
Identify the scene: street scene
Describe the scene: A person on bike going through green light with bus nearby
Describe the scene: A giraffe standing in the grass next to a tree.
• Answer questions about the scene
– Q: How many buses are there?
– Q: What is the name of the street?
– Q: Is the man on bicycle wearing a helmet?
Visual Question Answering (VQA)
Task: Given an image and a natural language open-ended question, generate a natural language answer.
VQA Task
VQA CloudCV Demo: cloudcv.org/vqa/?useVoice=1&listenAnswer=1
Applications of VQA
• An aid to the visually impaired: Is it safe to cross the street now?
• Surveillance: What kind of car did the man in the red shirt leave in?
• Interacting with a robot: Is my laptop in my bedroom upstairs?
VQA Dataset
Real images (from MSCOCO)
Tsung-Yi Lin et al. “Microsoft COCO: Common Objects in COntext.” ECCV 2014. http://mscoco.org/
Questions
Stump a smart robot! Ask a question that a human can answer, but a smart robot probably can’t!
Two modalities of answering
• Open Ended
• Multiple Choice
Open Ended Task
• What is the girl holding in her hand?
• How many mirrors?
• Why is the girl holding an umbrella?
Multiple Choice Task
What is the bus number?
a) 3  b) 1  c) green  d) 4  e) window trim  f) blue
g) m5  h) corn, carrots, onions, rice  i) red  j) 125  k) san antonio  l) sign pen
m) 478  n) no  o) 25  p) 2  q) yes  r) white
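One way to read the multiple-choice task is as "score every candidate, return the best one". The sketch below illustrates only that selection step. The function name `answer_multiple_choice` and the scores are made up for illustration; they are not the model's outputs and do not indicate the correct answer.

```python
from typing import Callable, Sequence


def answer_multiple_choice(candidates: Sequence[str],
                           score: Callable[[str], float]) -> str:
    """Return the candidate answer with the highest score."""
    return max(candidates, key=score)


# The 18 candidate answers listed on the slide.
candidates = [
    "3", "1", "green", "4", "window trim", "blue", "m5",
    "corn, carrots, onions, rice", "red", "125", "san antonio",
    "sign pen", "478", "no", "25", "2", "yes", "white",
]

# Hypothetical scores standing in for a trained model's confidence.
toy_scores = {"125": 0.6, "478": 0.2, "25": 0.1}

pick = answer_multiple_choice(candidates, lambda a: toy_scores.get(a, 0.0))
print(pick)
```

In the open-ended setting the same scorer would range over the model's whole answer vocabulary instead of a short candidate list.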
Dataset Stats
• >250K images (MSCOCO + 50K Abstract Scenes)
• >750K questions (3 per image)
• ~10M answers (10 w/ image + 3 w/o image)
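The three figures are consistent with each other, which a quick back-of-the-envelope check shows. The sketch assumes the "10 + 3" answers are collected per question and uses the slide's lower bound of 250K images.

```python
# Back-of-the-envelope check of the dataset statistics quoted on the slide.
images = 250_000                 # ">250K images" (lower bound)
questions_per_image = 3
answers_per_question = 10 + 3    # 10 given with the image, 3 without (assumed per question)

questions = images * questions_per_image
answers = questions * answers_per_question

print(f"questions >= {questions:,}")   # questions >= 750,000
print(f"answers   >= {answers:,}")     # answers   >= 9,750,000
```

The 9.75M lower bound is what the slide rounds to "~10M answers".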
Please visit www.visualqa.org for more details.
Browse the Dataset: http://visualqa.org/browser/
Questions
Dataset Visualization: http://visualqa.org/visualize/
Answers
• 38.4% of questions are binary yes/no
• 98.97% of questions have answers of 3 words or fewer
– 23k unique 1-word answers
2-Channel VQA Model
• Image channel: Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP → Image Embedding (4096-dim)
• Question channel: “How many horses are in this image?” → Question Embedding (1024-dim)
• Neural Network → Softmax over top K answers
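The data flow on this slide can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the authors' implementation: the slide does not say how the two embeddings are fused (concatenation is assumed here), the hidden size and K are placeholders, and random weights stand in for trained ones.

```python
import math
import random

random.seed(0)

IMAGE_DIM, QUESTION_DIM = 4096, 1024   # embedding sizes shown on the slide
HIDDEN_DIM, K = 32, 1000               # assumptions: placeholders for the sketch


def random_matrix(rows, cols, scale=0.01):
    """Random weights standing in for trained parameters."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def linear(weights, vector):
    """Matrix-vector product (bias omitted for brevity)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


W_hidden = random_matrix(HIDDEN_DIM, IMAGE_DIM + QUESTION_DIM)
W_out = random_matrix(K, HIDDEN_DIM)


def predict(image_emb, question_emb):
    """Return a probability distribution over the top-K answers."""
    fused = image_emb + question_emb            # list concatenation (assumed fusion)
    hidden = [math.tanh(h) for h in linear(W_hidden, fused)]
    return softmax(linear(W_out, hidden))


# Random vectors stand in for the CNN and question-encoder outputs.
image_emb = [random.gauss(0.0, 1.0) for _ in range(IMAGE_DIM)]
question_emb = [random.gauss(0.0, 1.0) for _ in range(QUESTION_DIM)]

probs = predict(image_emb, question_emb)
best = max(range(K), key=probs.__getitem__)     # index of the predicted answer
print(len(probs), round(sum(probs), 6))
```

Treating answering as K-way classification keeps the output layer simple; the predicted index is mapped back to an answer string through the top-K answer list.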
Ablation #1: Language-alone
• Question channel: “How many horses are in this image?” → Question Embedding (1024-dim)
• Neural Network → Softmax over top K answers (1k output units)
• The image channel of the full model (convolution and pooling layers, fully-connected MLP, image embedding) is not used
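The softmax here ranges over the "top K answers", with 1k output units. A minimal sketch of how such an answer vocabulary could be built from training answers follows. The helper name `build_answer_vocab`, the lower-casing step, and the toy answers are assumptions for illustration.

```python
from collections import Counter


def build_answer_vocab(answers, k):
    """Return the k most frequent answers, most frequent first."""
    counts = Counter(a.strip().lower() for a in answers)   # simple normalization (assumed)
    return [answer for answer, _ in counts.most_common(k)]


# Made-up training answers, for illustration only.
training_answers = ["yes", "no", "2", "yes", "white", "Yes",
                    "no", "red", "2", "yes", "no"]

vocab = build_answer_vocab(training_answers, k=3)
answer_to_index = {answer: i for i, answer in enumerate(vocab)}

print(vocab)             # ['yes', 'no', '2']
print(answer_to_index)   # {'yes': 0, 'no': 1, '2': 2}
```

With K = 1000, each softmax output unit would correspond to one entry of `vocab`; answers outside the list cannot be predicted.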
Ablation #2: Vision-alone
• Image channel: Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP → Image Embedding (4096-dim)
• Neural Network → Softmax over top K answers
• The question embedding of the full model is not used
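The two ablations differ from the full model only in which embedding reaches the answer classifier. A small sketch makes that explicit. The function name `classifier_input` and the use of concatenation for the full model are assumptions; zero vectors stand in for real embeddings.

```python
IMAGE_DIM, QUESTION_DIM = 4096, 1024   # embedding sizes from the slides


def classifier_input(image_emb, question_emb, variant):
    """Select which embeddings are fed to the answer classifier."""
    if variant == "full":
        return list(image_emb) + list(question_emb)   # assumed fusion: concatenation
    if variant == "language_alone":
        return list(question_emb)                     # Ablation #1
    if variant == "vision_alone":
        return list(image_emb)                        # Ablation #2
    raise ValueError(f"unknown variant: {variant!r}")


image_emb = [0.0] * IMAGE_DIM
question_emb = [0.0] * QUESTION_DIM

input_sizes = {
    variant: len(classifier_input(image_emb, question_emb, variant))
    for variant in ("full", "language_alone", "vision_alone")
}
print(input_sizes)   # {'full': 5120, 'language_alone': 1024, 'vision_alone': 4096}
```

Comparing the accuracy of the three variants shows how much each modality contributes.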
Accuracy Metric
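The extracted slide carries only the title. For reference, the VQA paper (Antol et al., ICCV 2015) scores a predicted answer as min(number of humans who gave that answer / 3, 1), using the 10 human answers collected per question. A minimal sketch of that formula follows; the string normalization is a simplification assumed here, not the official evaluation script.

```python
def vqa_accuracy(predicted, human_answers):
    """min(#humans who gave the predicted answer / 3, 1)."""
    def normalize(text):
        # Simplified normalization; the official evaluation does more.
        return text.strip().lower()

    target = normalize(predicted)
    matches = sum(1 for answer in human_answers if normalize(answer) == target)
    return min(matches / 3, 1.0)


# Made-up example: 10 human answers to "How many horses are in this image?"
human_answers = ["2", "2", "3", "3", "3", "3", "3", "3", "3", "3"]

print(vqa_accuracy("3", human_answers))   # 1.0
print(vqa_accuracy("2", human_answers))   # two matches, so 2/3
print(vqa_accuracy("4", human_answers))   # 0.0
```

An answer given by at least three annotators counts as fully correct, which tolerates disagreement among humans.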
Open-Ended Task Accuracies
Human vs. Machine performance
[Figure: Human and Machine accuracies; room for improvement: 25.14]
Results
• Multiple-Choice > Open-Ended
• Question alone does quite well
• Image helps
Code available!
Commonsense
• Does this person have 20/20 vision?
Does this question need commonsense?
Q: How many calories are in this pizza?
How old does a person need to be?
Q: How many calories are in this pizza?
Most “commonsense” questions
Least “commonsense” questions
Spectrum
• 3-4 (15.3%)
– Is that a bird in the sky?
– What color is the shoe?
– How many zebras are there?
– Is there food on the table?
– Is this man wearing shoes?
• 5-8 (39.7%)
– How many pizzas are shown?
– What are the sheep eating?
– What color is his hair?
– What sport is being played?
– Name one ingredient in the skillet.
• 9-12 (28.4%)
– Where was this picture taken?
– What ceremony does the cake commemorate?
– Are these boats too tall to fit under the bridge?
– What is the name of the white shape under the batter?
– Is this at the stadium?
• 13-17 (11.2%)
– Is he likely to get mugged if he walked down a dark alleyway like this?
– Is this a vegetarian meal?
– What type of beverage is in the glass?
– Can you name the performer in the purple costume?
– Besides these humans, what other animals eat here?
• 18+ (5.5%)
– What type of architecture is this?
– Is this a Flemish bricklaying pattern?
– How many calories are in this pizza?
– What government document is needed to partake in this activity?
– What is the make and model of this vehicle?
Question       Average Age
what brand     12.5
why            11.18
what type      11.04
what kind      10.55
is this        10.13
what does      10.06
what time      9.81
who            9.58
where          9.54
which          9.32
does           9.29
do             9.23
what is        9.11
what are       9.04
are            8.65
is the         8.52
is there       8.24
what sport     8.06
how many       7.67
what animal    6.74
what color     6.6
VQA Age
• Average “age of questions” = 8.98 years.
• Our model = 4.74 years old!*
* age as estimated by untrained crowd-sourced workers
VQA Common sense
• Average common sense required = 31%.
• Our best algorithm has 17% common sense!*
* as estimated by untrained crowd-sourced workers
VQA Challenges on www.codalab.org
VQA Challenge @ CVPR16 (code available!)
VQA Workshop @ CVPR16
Papers using VQA … and many more
• Dataset: >1k downloads
• Code: >1.5k views
• Academia, industry, start ups
Conclusions
• VQA: Visual Question Answering
– The next “grand challenge” in vision, language, AI
• Spectrum: Easy to Difficult
– “What room is this?” Scene Recognition
– “How many …” Object Recognition
– …
– “Does this person have 20/20 vision” Common sense
• Exciting times ahead!
VQA Team
• Jiasen Lu, Virginia Tech
• Akrit Mohapatra, Virginia Tech (Webmaster)
• Aishwarya Agrawal, Virginia Tech
• Stanislaw Antol, Virginia Tech
• Meg Mitchell, Microsoft Research
• Larry Zitnick, Facebook AI Research
• Dhruv Batra, Virginia Tech
• Devi Parikh, Virginia Tech
Closing Remarks
• CloudCV VQA Exhibition: Booth 101
• Contact email: aish@vt.edu
• Please complete the Presenter Evaluation sent to you by email or through the GTC Mobile App. Your feedback is important!
Thanks! Questions?
Visual Question Answering (VQA)