
Embodied Question Answering. NVIDIA GTC, March 26, 2018. Abhishek Das.

  1. Embodied Question Answering NVIDIA GTC March 26, 2018 Abhishek Das PhD student, Georgia Tech

  2. Embodied Question Answering. Samyak Datta (Georgia Tech), Georgia Gkioxari (FAIR), Stefan Lee (Georgia Tech), Devi Parikh (FAIR/Georgia Tech), Dhruv Batra (FAIR/Georgia Tech). embodiedqa.org/paper.pdf. To appear in CVPR 2018 (Oral).

  3. Forward

  4. Forward

  5. Turn Left

  6. Q. What is to the left of the shower? A. Cabinet. Slide credit: Devi Parikh

  7. EmbodiedQA: AI Challenges • Language understanding • Visual understanding • Active perception • Common sense reasoning • Grounding into actions • Selective memory • Credit assignment Slide credit: Devi Parikh

  8. EmbodiedQA: Context [diagram: a vision axis (single frame to video) against a language axis (single-shot QA to dialog)] Slide credit: Devi Parikh

  9. EmbodiedQA: Context [diagram: adds a third axis, action, ranging from passive to active] Slide credit: Devi Parikh

  10. EmbodiedQA: Context. VQA: single frame, single-shot QA, passive. Q. What is the mustache made of? [Antol and Agrawal et al., ICCV 2015; Malinowski et al., ICCV 2015; …] Slide credit: Devi Parikh

  11. EmbodiedQA: Context. VideoQA: video, single-shot QA, passive. Q. How many times does the cat touch the dog? A. 4 times [Jang et al., CVPR 2017]. Attributes: "dog", "egg", "bowl", "woman", "plate"; Q. What is a woman boiling in a pot of water? A. Eggs [Ye et al., SIGIR 2017]. [Tapaswi et al., CVPR 2016; …] Slide credit: Devi Parikh

  12. EmbodiedQA: Context. Visual Dialog: single frame, dialog, passive. [Das et al., CVPR 2017; Das and Kottur et al., ICCV 2017; …] Slide credit: Devi Parikh

  13. EmbodiedQA: Context. Active, goal-driven agents in prior work:
      • Goal specified via reward, e.g., [Gupta et al., CVPR17, Zhu et al., ICCV17]
      • Goal specified via visual target, e.g., [Zhu et al., ICRA17]
      • Fully observable environment, e.g., [Wang et al., ACL16]
      • Recent: more complex environments [Hermann et al., 2017, Chaplot et al., 2017]; higher-level tasks [Anderson et al., CVPR18]; interactive downstream tasks (Embodied QA)
      Slide credit: Devi Parikh

  14. EQA Dataset • Questions in environments Slide credit: Devi Parikh

  15. EQA Dataset • Questions in environments ("environments" highlighted) Slide credit: Devi Parikh

  16. EQA Dataset: Environments House3D: A Rich and Realistic 3D environment https://github.com/facebookresearch/House3D Georgia Gkioxari Yuandong Tian Yuxin Wu Yi Wu UC Berkeley Facebook AI Research Slide credit: Georgia Gkioxari

  17. SUNCG dataset [Song et al., CVPR 2017] Manually designed using an online interior design interface (Planner5D) Slide credit: Georgia Gkioxari

  18. SUNCG dataset [Song et al., CVPR 2017]: 45,622 indoor scenes, 404,058 rooms, 5,697,217 object instances, 2,644 unique objects, 80 object categories. Manually designed using an online interior design interface (Planner5D). Slide credit: Georgia Gkioxari

  19. House3D
      • Collision and free-space prediction
      • OpenGL rendering: 600 fps single-process, 1800 fps multi-process (Tesla M40 GPU, 120x90 resolution)
      • Linux/MacOS compatible
      Slide credit: Georgia Gkioxari
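
For orientation, here is a minimal rendering loop patterned on the House3D repo README. The class and method names (objrender.RenderAPI, load_config, Environment, debug_render) are assumptions based on the public repo and may not match a given release.

```python
# Minimal House3D rendering loop; API names are assumptions from the
# facebookresearch/House3D README and may differ across versions.
from House3D import objrender, Environment, load_config

api = objrender.RenderAPI(w=120, h=90, device=0)  # low resolution, as benchmarked on the slide
cfg = load_config('config.json')                  # points at SUNCG data on disk
house_id = '...'                                  # any SUNCG house id string

env = Environment(api, house_id, cfg)
env.reset()                      # place the agent at a valid, collision-free pose
for _ in range(100):
    frame = env.debug_render()   # RGB frame; depth / segmentation modes also exist
```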

  20. House3D renders RGB images, depth maps, semantic segmentation masks, and top-down 2D views. Slide credit: Georgia Gkioxari

  21. EQA Dataset: Environments • Subset of House3D: typical home environments • Realistic layout according to all three SUNCG annotators • Not too large or too small (300-800 m²; objects cover 1/3rd of the ground area) • Have at least one kitchen, living room, dining room, and bedroom • Ignore obscure rooms (e.g., loggia) and tiny objects (e.g., light switches). Slide credit: Devi Parikh
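
A sketch of how this environment filter might look in code; the metadata field names (area_m2, room_types, object_ground_coverage, annotators_agree) are hypothetical stand-ins for whatever the SUNCG/House3D loaders actually expose.

```python
# Hypothetical house-filtering sketch; field names are illustrative,
# not the real SUNCG/House3D schema.
REQUIRED_ROOMS = {'kitchen', 'living room', 'dining room', 'bedroom'}

def keep_house(house: dict) -> bool:
    return (
        300 <= house['area_m2'] <= 800                    # not too large or small
        and house['object_ground_coverage'] >= 1 / 3      # densely furnished
        and REQUIRED_ROOMS <= set(house['room_types'])    # core rooms present
        and house['annotators_agree']                     # realistic layout per all 3 SUNCG annotators
    )

# envs = [h for h in all_houses if keep_house(h)]   # all_houses: assumed loader output
```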

  22. EQA Dataset: Environments. Test for generalization to novel environments!
      Homes (767): train: 643, val: 67, test: 57
      Rooms (12): gym, patio, office, lobby, garage, kitchen, dining room, living room, bathroom, bedroom, elevator, balcony
      Objects (50): rug, piano, dryer, computer, fireplace, whiteboard, bookshelf, wardrobe cabinet, pan, toilet, plates, ottoman, fish tank, dishwasher, microwave, water dispenser, bed, table, mirror, tv stand, stereo set, chessboard, playstation, vacuum cleaner, cup, xbox, heater, bathtub, shoe rack, range oven, refrigerator, coffee machine, sink, sofa, kettle, dresser, knife rack, towel rack, loudspeaker, utensil holder, desk, vase, shower, washer, fruit bowl, television, dressing table, cutting board, ironing board, food processor
      Slide credit: Devi Parikh

  23. EQA Dataset: Environments [example renderings of a bedroom, kitchen, and living room, with objects such as a fish tank, piano, pedestal fan, candle, and air conditioner] Slide credit: Devi Parikh

  24. EQA Dataset • Questions in environments ("environments" highlighted) Slide credit: Devi Parikh

  25. EQA Dataset • Questions in environments ("Questions" highlighted) Slide credit: Devi Parikh

  26. EQA Dataset: Questions • Programmatically generate questions and answers from templates (a generation sketch follows below):
      location: What room is the <OBJ> located in?
      color: What color is the <OBJ>?
      color_room: What color is the <OBJ> in the <ROOM>?
      preposition: What is <on/above/below/next-to> the <OBJ> in the <ROOM>?
      existence: Is there a(n) <OBJ> in the <ROOM>?
      logical: Is there a(n) <OBJ1> and a(n) <OBJ2> in the <ROOM>?
      count: How many <OBJs> in the <ROOM>?
      room_count: How many <ROOMs> in the house?
      distance: Is the <OBJ1> closer to the <OBJ2> than to the <OBJ3> in the <ROOM>?
      …
      The templates require varying amounts of navigation and memory and test combinations of skills. Slide credit: Devi Parikh
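
The generation sketch referenced above, assuming access to each house's ground-truth annotations. The `house` accessors (objects, is_unique, room_of, contains) and ALL_OBJECT_TYPES are hypothetical, not the authors' actual generation engine.

```python
# Hypothetical template-instantiation sketch; accessors are illustrative.
TEMPLATES = {
    'location':  'What room is the {obj} located in?',
    'color':     'What color is the {obj}?',
    'existence': 'Is there a(n) {obj} in the {room}?',
}

ALL_OBJECT_TYPES = ['piano', 'fish tank', 'microwave']  # subset of the 50-object vocabulary

def generate_qa(house):
    qa = []
    for obj in house.objects:
        if not house.is_unique(obj):   # skip ambiguous referents ("which table?")
            continue
        qa.append((TEMPLATES['location'].format(obj=obj.name), house.room_of(obj)))
        qa.append((TEMPLATES['color'].format(obj=obj.name), obj.color))
    for room in house.rooms:
        for obj_name in ALL_OBJECT_TYPES:
            ans = 'yes' if house.contains(room, obj_name) else 'no'
            qa.append((TEMPLATES['existence'].format(obj=obj_name, room=room.name), ans))
    return qa
```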

  27. EQA Dataset: Questions • Programmatically generate questions and answers Slide credit: Devi Parikh

  28. EQA Dataset: Questions • Programmatically generate questions and answers. The nine templates above (location through distance) constitute EQA v1. Slide credit: Devi Parikh

  29. EQA Dataset: Questions • Programmatically generate questions and answers, then remove questions with peaky answer distributions (one way to implement such a filter is sketched below). Resulting EQA v1 split: Questions (5281): train: 4246, val: 506, test: 529. Slide credit: Devi Parikh
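
One way to operationalize "peaky" is to drop questions whose empirical answer distribution has low normalized entropy, so that a prior-only (no-navigation) baseline cannot do well. The threshold below is illustrative, not the paper's.

```python
from collections import Counter
from math import log

def normalized_entropy(answers):
    """Entropy of the empirical answer distribution, normalized to [0, 1]."""
    counts = Counter(answers)
    n = sum(counts.values())
    if len(counts) < 2:
        return 0.0                       # a single answer is maximally peaky
    h = -sum(c / n * log(c / n) for c in counts.values())
    return h / log(len(counts))

def keep_question(answers_across_envs, threshold=0.5):
    # Keep only if answers, aggregated over environments, are spread out enough.
    return normalized_entropy(answers_across_envs) > threshold
```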

  30. EQA Dataset: Expert Demonstrations • Connected House3D to Amazon Mechanical Turk Slide credit: Devi Parikh

  31. EQA Dataset: Expert Demonstrations Slide credit: Devi Parikh

  32. Slide credit: Devi Parikh

  33. Slide credit: Devi Parikh

  34. EQA Dataset: Expert Demonstrations • Connected House3D to Amazon Mechanical Turk • Currently: demonstrations for 1162 questions across 70 environments • Can be used for training • Learn how to explore • Capture human common sense • Can serve as a performance reference Slide credit: Devi Parikh

  35. EQA Dataset: Expert Demonstrations • Connected House3D to Amazon Mechanical Turk • Currently: demonstrations for 1162 questions across 70 environments • Can be used for training • Learn how to explore • Capture human common sense • Can serve as a performance reference (see paper). Slide credit: Devi Parikh
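
The slides do not specify how the demonstrations are consumed during training; behavior cloning on the human trajectories is one standard option. A PyTorch-style sketch under that assumption (the demo loader is a dummy stand-in):

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Toy policy: frame features + question embedding -> action logits."""
    def __init__(self, feat_dim=128, q_dim=128, n_actions=4):
        super().__init__()
        self.head = nn.Linear(feat_dim + q_dim, n_actions)

    def forward(self, frame_feat, q_emb):
        return self.head(torch.cat([frame_feat, q_emb], dim=-1))

policy = Policy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy stand-in for (frame_feat, q_emb, expert_action) batches extracted
# from the Mechanical Turk trajectories; the real loader is assumed.
demos = [(torch.randn(8, 128), torch.randn(8, 128), torch.randint(0, 4, (8,)))
         for _ in range(10)]

for frame_feat, q_emb, expert_action in demos:
    loss = loss_fn(policy(frame_feat, q_emb), expert_action)
    opt.zero_grad()
    loss.backward()
    opt.step()
```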

  36. Model: Vision, Language, Navigation, Answering Slide credit: Devi Parikh

  37. Model: Vision, Language, Navigation, Answering ("Vision" highlighted).
      [architecture diagram: a four-layer CNN encoder (Conv_1 to Conv_4 with 8, 16, 32, 32 channels; feature maps 224 -> 110 -> 53 -> 24 -> 10) over 224x224 RGB input, trained with three decoders: RGB autoencoding, semantic segmentation, and depth]
      Slide credit: Devi Parikh
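
A compact PyTorch sketch of this multi-task encoder, reconstructed from the diagram's layer sizes. The kernel sizes, strides, and decoder heads are assumptions: 6x6 convolutions with stride 2 merely reproduce the feature-map sizes shown, and the heads here are toy upsamplers, not the paper's decoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def block(cin, cout):
    # 6x6 conv, stride 2 reproduces the diagram: 224 -> 110 -> 53 -> 24 -> 10.
    return nn.Sequential(nn.Conv2d(cin, cout, kernel_size=6, stride=2), nn.ReLU())

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            block(3, 8),     # Conv_1
            block(8, 16),    # Conv_2
            block(16, 32),   # Conv_3
            block(32, 32),   # Conv_4 -> 32 x 10 x 10 features
        )
    def forward(self, rgb):
        return self.convs(rgb)

class Head(nn.Module):
    """Toy decoder head: 1x1 projection + bilinear upsample back to 224x224."""
    def __init__(self, cout):
        super().__init__()
        self.proj = nn.Conv2d(32, cout, kernel_size=1)
    def forward(self, feat):
        return F.interpolate(self.proj(feat), size=(224, 224),
                             mode='bilinear', align_corners=False)

num_seg_classes = 40                      # placeholder; real count comes from SUNCG labels
enc = Encoder()
rgb_head, depth_head, seg_head = Head(3), Head(1), Head(num_seg_classes)

x = torch.randn(2, 3, 224, 224)           # batch of RGB frames
depth_target = torch.randn(2, 1, 224, 224)                      # dummy targets
seg_target = torch.randint(0, num_seg_classes, (2, 224, 224))   # for illustration

feat = enc(x)
loss = (F.mse_loss(rgb_head(feat), x)                   # autoencoding
        + F.mse_loss(depth_head(feat), depth_target)    # depth prediction
        + F.cross_entropy(seg_head(feat), seg_target))  # segmentation
```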

  38. Model: Vision, Language, Navigation, Answering ("Language" highlighted). Slide credit: Devi Parikh

  39. Model: Vision, Language, Navigation, Answering ("Navigation" highlighted) • Planner: direction or intention • Controller: velocity or primitive actions; at each step the controller outputs Repeat (execute the planner's action again) or Stop (return control to the planner). Slide credit: Devi Parikh
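
A sketch of that planner-controller decomposition as a control loop; the interfaces are illustrative and simplified relative to the paper's actual navigation module.

```python
# Illustrative planner-controller loop; module interfaces are hypothetical.
FORWARD, TURN_LEFT, TURN_RIGHT, STOP = range(4)

def navigate(planner, controller, env, q_emb, max_steps=100):
    frame = env.reset()
    for _ in range(max_steps):
        action = planner.act(frame, q_emb)   # planner picks a direction/intention
        if action == STOP:
            break                            # done: hand off to the answering module
        frame = env.step(action)
        # The controller keeps repeating the planner's primitive action until
        # it emits "stop", at which point control returns to the planner.
        while controller.repeat(frame, action):
            frame = env.step(action)
    return frame                             # final observation(s) for the answerer
```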
