

  1. Aishwarya Agrawal, Ph.D. Student, Machine Learning and Perception Lab

  2. (title slide)

  3. Identify objects in the scene: sky, stop light, building, bus, car, person, sidewalk

  4. Identify attributes of objects: blue, green, tall, sky, stop light, building, many red cars, bus, one bicycle

  5. Identify activities in the scene: man walking on sidewalk; person wearing a helmet, riding a bicycle

  6. Identify the scene: street scene

  7. Describe the scene: a person on a bike going through a green light with a bus nearby

  8. A giraffe standing in the grass next to a tree.

  9. • Answer questions about the scene – Q: How many buses are there? – Q: What is the name of the street? – Q: Is the man on the bicycle wearing a helmet?

  10. (image slide)

  11. Visual Question Answering (VQA) Task: Given an image and a natural language open-ended question, generate a natural language answer.

  12. VQA Task

  13. VQA CloudCV Demo: cloudcv.org/vqa/?useVoice=1&listenAnswer=1

  14. Applications of VQA • An aid to the visually impaired: Is it safe to cross the street now?

  15. Applications of VQA • Surveillance: What kind of car did the man in the red shirt leave in?

  16. Applications of VQA • Interacting with a robot: Is my laptop in my bedroom upstairs?

  17. VQA Dataset

  18. Real images (from MSCOCO). Tsung-Yi Lin et al. “Microsoft COCO: Common Objects in Context.” ECCV 2014. http://mscoco.org/

  19. Questions: Stump a smart robot! Ask a question that a human can answer, but a smart robot probably can’t!

  20. Two modalities of answering • Open-Ended • Multiple-Choice

  21. Open-Ended Task: What is the girl holding in her hand? How many mirrors? Why is the girl holding an umbrella?

  22. Multiple-Choice Task: What is the bus number? a) 3 b) 1 c) green d) 4 e) window trim f) blue g) m5 h) corn, carrots, onions, rice i) red j) 125 k) san antonio l) sign pen m) 478 n) no o) 25 p) 2 q) yes r) white

  23. Dataset Stats • >250K images (MSCOCO + 50K abstract scenes) • >750K questions (3 per image) • ~10M answers (10 with the image + 3 without the image)

  24. Please visit www.visualqa.org for more details.

  25. Browse the Dataset: http://visualqa.org/browser/

  26. Questions

  27. Dataset Visualization: http://visualqa.org/visualize/

  28. Answers • 38.4% of questions are binary (yes/no) • 98.97% of questions have answers of 3 words or fewer – 23K unique one-word answers
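Statistics like those on the slide above (fraction of yes/no answers, fraction of short answers, count of unique one-word answers) are straightforward to compute; the sketch below uses a small hypothetical list of answers in place of the dataset's real ~10M.

```python
# Toy reproduction of the answer-distribution stats; the `answers`
# list is illustrative, not drawn from the actual VQA dataset.
answers = ["yes", "no", "2", "red", "playing tennis",
           "yes", "on the table", "no",
           "green and white striped umbrella", "4"]

binary = sum(a in ("yes", "no") for a in answers) / len(answers)
short = sum(len(a.split()) <= 3 for a in answers) / len(answers)
unique_one_word = len({a for a in answers if len(a.split()) == 1})

print(f"{binary:.0%} yes/no, {short:.0%} have <= 3 words, "
      f"{unique_one_word} unique 1-word answers")
```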

  29. Answers

  30. 2-Channel VQA Model: an image embedding (CNN with convolution, pooling, and fully-connected layers + non-linearities; 4096-dim) and a question embedding (“How many horses are in this image?”; 1024-dim) are combined by a neural network (MLP) with a softmax over the top K answers.
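A minimal pure-Python sketch of the 2-channel idea: embed the image and the question separately, fuse the two embeddings pointwise, and apply a softmax over the top K answers. Dimensions are toy-sized (the talk uses 4096-dim image and 1024-dim question embeddings), the weights are random stand-ins for trained parameters, and the pointwise-product fusion is one common choice, not necessarily the exact trained model.

```python
import math
import random

random.seed(0)

IMG_DIM, Q_DIM, FUSE_DIM, K = 8, 4, 6, 5  # toy sizes (talk: 4096 / 1024 / top-K answers)

def linear(x, w):
    """Dense layer: w has one row of len(x) weights per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Random toy weights standing in for trained parameters.
W_img = [[random.gauss(0, 0.1) for _ in range(IMG_DIM)] for _ in range(FUSE_DIM)]
W_q   = [[random.gauss(0, 0.1) for _ in range(Q_DIM)] for _ in range(FUSE_DIM)]
W_out = [[random.gauss(0, 0.1) for _ in range(FUSE_DIM)] for _ in range(K)]

def vqa_forward(img_feat, q_feat):
    i = [math.tanh(v) for v in linear(img_feat, W_img)]  # image channel
    q = [math.tanh(v) for v in linear(q_feat, W_q)]      # question channel
    fused = [a * b for a, b in zip(i, q)]                # pointwise fusion
    return softmax(linear(fused, W_out))                 # dist. over top-K answers

probs = vqa_forward([0.5] * IMG_DIM, [0.2] * Q_DIM)
print(len(probs), sum(probs))
```

The language-alone and vision-alone ablations on the next two slides correspond to zeroing out one channel before the fusion step.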

  31. Ablation #1: Language-alone – only the question embedding (1024-dim) feeds the MLP and softmax over the top K answers (1k output units); the image channel is removed.

  32. Ablation #2: Vision-alone – only the image embedding (4096-dim) feeds the MLP and softmax over the top K answers; the question channel is removed.

  33. Accuracy Metric
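The VQA dataset scores an open-ended answer by consensus with the 10 human answers: an answer is fully correct if at least 3 humans gave it, i.e. min(#matching humans / 3, 1). The sketch below uses this simplified form; the official evaluation additionally averages over leave-one-out subsets of annotators and normalizes answer strings.

```python
def vqa_accuracy(pred, human_answers):
    """Consensus accuracy: min(#humans who gave this answer / 3, 1)."""
    matches = sum(1 for a in human_answers if a == pred)
    return min(matches / 3.0, 1.0)

humans = ["yes"] * 7 + ["no"] * 2 + ["maybe"]
print(vqa_accuracy("yes", humans))  # 1.0 (at least 3 of 10 humans agree)
print(vqa_accuracy("no", humans))   # 2 of 10 humans agree -> 2/3
```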

  34. Open-Ended Task Accuracies: human vs. machine performance – room for improvement (25.14)

  35. Results • Multiple-Choice > Open-Ended • Question alone does quite well • Image helps • Code available!

  36. Commonsense • Does this person have 20/20 vision?

  37. Does this question need commonsense? Q: How many calories are in this pizza?

  38. How old does a person need to be? Q: How many calories are in this pizza?

  39. Most “commonsense” questions

  40. Least “commonsense” questions

  41. Spectrum of age required to answer:
      – 3-4 (15.3%): Is that a bird in the sky? What color is the shoe? How many zebras are there? Is there food on the table? Is this man wearing shoes?
      – 5-8 (39.7%): How many pizzas are shown? What are the sheep eating? What color is his hair? What sport is being played? Name one ingredient in the skillet.
      – 9-12 (28.4%): Where was this picture taken? Is this a vegetarian meal? Are these boats too tall to fit under the bridge? What is the name of the white shape under the batter? Is this at the stadium?
      – 13-17 (11.2%): Is he likely to get mugged if he walked down a dark alleyway like this? What ceremony does the cake commemorate? What type of beverage is in the glass? Can you name the performer in the purple costume? Besides these humans, what other animals eat here?
      – 18+ (5.5%): What type of architecture is this? Is this a Flemish bricklaying pattern? How many calories are in this pizza? What government document is needed to partake in this activity? What is the make and model of this vehicle?

  42. Average age required, by question type:
      what brand 12.50 • why 11.18 • what type 11.04 • what kind 10.55 • is this 10.13 • what does 10.06 • what time 9.81 • who 9.58 • where 9.54 • which 9.32 • does 9.29 • do 9.23 • what is 9.11 • what are 9.04 • are 8.65 • is the 8.52 • is there 8.24 • what sport 8.06 • how many 7.67 • what animal 6.74 • what color 6.60

  43. VQA Age • Average “age of questions” = 8.98 years • Our model = 4.74 years old!* (*age as estimated by untrained crowd-sourced workers)

  44. VQA Common Sense • Average common sense required = 31% • Our best algorithm has 17% common sense!* (*as estimated by untrained crowd-sourced workers)

  45. VQA Challenges on www.codalab.org

  46. VQA Challenge @ CVPR16

  47. VQA Challenge @ CVPR16 (code available!)

  48. VQA Workshop @ CVPR16

  49. Papers using VQA … and many more

  50. Dataset: >1K downloads • Code: >1.5K views • Users in academia, industry, and start-ups

  51. Conclusions • VQA: Visual Question Answering – the next “grand challenge” in vision, language, and AI • Spectrum: easy to difficult – “What room is this?” → scene recognition – “How many …?” → object recognition – … – “Does this person have 20/20 vision?” → common sense • Exciting times ahead!

  52. VQA Team: Jiasen Lu (Virginia Tech), Akrit Mohapatra (Virginia Tech, webmaster), Aishwarya Agrawal (Virginia Tech), Stanislaw Antol (Virginia Tech), Meg Mitchell (Microsoft Research), Larry Zitnick (Facebook AI Research), Dhruv Batra (Virginia Tech), Devi Parikh (Virginia Tech)

  53. Closing Remarks • CloudCV VQA Exhibition: Booth 101 • Contact email: aish@vt.edu • Please complete the Presenter Evaluation sent to you by email or through the GTC Mobile App. Your feedback is important!

  54. Thanks! Questions?

  55. Visual Question Answering (VQA)
