Embodied Question Answering NVIDIA GTC March 26, 2018 Abhishek Das PhD student, Georgia Tech
Embodied Question Answering. Samyak Datta (Georgia Tech), Georgia Gkioxari (FAIR), Stefan Lee (Georgia Tech), Devi Parikh (FAIR/Georgia Tech), Dhruv Batra (FAIR/Georgia Tech). embodiedqa.org/paper.pdf. To appear in CVPR 2018 (Oral).
Forward
Forward
Turn Left
Q. What is to the left of the shower? A. Cabinet. Slide credit: Devi Parikh
EmbodiedQA: AI Challenges • Language understanding • Visual understanding • Active perception • Common sense reasoning • Grounding into actions • Selective memory • Credit assignment Slide credit: Devi Parikh
EmbodiedQA: Context. Task axes: Language (Single-Shot QA → Dialog), Vision (Single Frame → Video). Slide credit: Devi Parikh
EmbodiedQA: Context. Task axes: Language (Single-Shot QA → Dialog), Vision (Single Frame → Video), Action (Passive → Active). Slide credit: Devi Parikh
EmbodiedQA: Context. VQA (single frame, single-shot QA, passive): Q. What is the mustache made of? [Antol and Agrawal et al., ICCV 2015; Malinowski et al., ICCV 2015] … Slide credit: Devi Parikh
EmbodiedQA: Context. VideoQA (video, single-shot QA, passive):
Q. How many times does the cat touch the dog? A. 4 times [Jang et al., CVPR 2017]
Attributes: "dog", "egg", "bowl", "woman", "plate"; Q. What is a woman boiling in a pot of water? A. Eggs [Ye et al., SIGIR 2017]
[Tapaswi et al., CVPR 2016] …
Slide credit: Devi Parikh
EmbodiedQA: Context. Visual Dialog (single frame, dialog, passive): [Das et al., CVPR 2017; Das and Kottur et al., ICCV 2017] … Slide credit: Devi Parikh
EmbodiedQA: Context. Embodied QA (active agent). Related work:
• Goal specified via reward, e.g., [Gupta et al., CVPR17, Zhu et al., ICCV17]
• Goal specified via visual target, e.g., [Zhu et al., ICRA17]
• Fully observable environment, e.g., [Wang et al., ACL16]
• Recent: [Hermann et al., 2017, Chaplot et al., 2017], [Anderson et al., CVPR18]; more complex environments, higher-level tasks, interactive downstream tasks
Slide credit: Devi Parikh
EQA Dataset • Questions in environments Slide credit: Devi Parikh
EQA Dataset • Questions in environments Slide credit: Devi Parikh
EQA Dataset: Environments. House3D: A Rich and Realistic 3D Environment. https://github.com/facebookresearch/House3D. Georgia Gkioxari, Yuandong Tian, Yuxin Wu (Facebook AI Research); Yi Wu (UC Berkeley). Slide credit: Georgia Gkioxari
SUNCG dataset [Song et al., CVPR 2017] Manually designed using an online interior design interface (Planner5D) Slide credit: Georgia Gkioxari
SUNCG dataset [Song et al., CVPR 2017]: 45,622 indoor scenes, 404,058 rooms, 5,697,217 object instances, 2,644 unique objects, 80 object categories. Manually designed using an online interior design interface (Planner5D). Slide credit: Georgia Gkioxari
House3D
• Collision and free space prediction
• OpenGL rendering: on a Tesla M40 GPU at 120x90 resolution, 600 fps single process, 1800 fps multi-process
• Linux/MacOS compatible
Slide credit: Georgia Gkioxari
House3D renders RGB images, depth maps, semantic segmentation masks, and top-down 2D views. Slide credit: Georgia Gkioxari
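A rough usage sketch of the renderer's Python interface, assuming the API exposed in the public House3D repository (objrender.RenderAPI, load_config, Environment, per-modality render modes); these names and signatures are assumptions and may differ from the version shown in the talk.

```python
# Hypothetical House3D usage sketch; names/signatures are assumptions based on
# github.com/facebookresearch/House3D and may differ across versions.
from House3D import objrender, Environment, load_config

api = objrender.RenderAPI(w=120, h=90, device=0)  # resolution quoted on the previous slide
cfg = load_config('config.json')                  # paths to SUNCG houses and assets

env = Environment(api, '<house_id>', cfg)         # '<house_id>' is a placeholder SUNCG house id
env.reset()                                       # place the agent in the house

rgb = env.render(mode='rgb')        # first-person RGB frame
depth = env.render(mode='depth')    # per-pixel depth map
sem = env.render(mode='semantic')   # semantic segmentation mask
```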
EQA Dataset: Environments
• Subset of House3D: typical home environments
• Realistic layout according to all three SUNCG annotators
• Not too large or too small (300-800 m², rooms cover 1/3rd of ground area)
• Have at least one kitchen, living room, dining room, bedroom
• Ignore obscure rooms (e.g., loggia) and tiny objects (e.g., light switches)
Slide credit: Devi Parikh
EQA Dataset: Environments
Homes (767): train 643, val 67, test 57 (test for generalization to novel environments!)
Rooms (12): gym, dining room, patio, living room, office, bathroom, lobby, bedroom, garage, elevator, kitchen, balcony
Objects (50): rug, piano, dryer, computer, fireplace, whiteboard, bookshelf, wardrobe, cabinet, pan, toilet, plates, ottoman, fish tank, dishwasher, microwave, water dispenser, bed, table, mirror, tv stand, stereo set, chessboard, playstation, vacuum cleaner, cup, xbox, heater, bathtub, shoe rack, range oven, refrigerator, coffee machine, sink, sofa, kettle, dresser, knife rack, towel rack, loudspeaker, utensil holder, desk, vase, shower, washer, fruit bowl, television, dressing table, cutting board, ironing board, food processor
Slide credit: Devi Parikh
EQA Dataset: Environments. Example objects: fish tank, piano, pedestal fan, candle, air conditioner. Example rooms: bedroom, kitchen, living room. Slide credit: Devi Parikh
EQA Dataset • Questions in environments Slide credit: Devi Parikh
EQA Dataset • Questions in environments Slide credit: Devi Parikh
EQA Dataset: Questions
• Programmatically generate questions and answers (a sketch follows this list)
location: What room is the <OBJ> located in?
color: What color is the <OBJ>?
color_room: What color is the <OBJ> in the <ROOM>?
preposition: What is <on/above/below/next-to> the <OBJ> in the <ROOM>?
existence: Is there a(n) <OBJ> in the <ROOM>?
logical: Is there a(n) <OBJ1> and a(n) <OBJ2> in the <ROOM>?
count: How many <OBJs> in the <ROOM>?
room_count: How many <ROOMs> in the house?
distance: Is the <OBJ1> closer to the <OBJ2> than to the <OBJ3> in the <ROOM>?
…
Varying navigation and memory requirements; skill combinations.
Slide credit: Devi Parikh
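A minimal sketch of what such templated generation could look like. The template strings come from the slide; the house annotation interface (house.rooms, room.objects, object names/colors, object_counts) is hypothetical and only illustrates the idea of filling templates from ground-truth annotations.

```python
# Minimal sketch of templated QA generation from ground-truth house annotations.
# The annotation accessors used below are hypothetical placeholders.
TEMPLATES = {
    "location":  "what room is the {obj} located in?",
    "color":     "what color is the {obj}?",
    "existence": "is there a(n) {obj} in the {room}?",
    "count":     "how many {obj}s in the {room}?",
}

def generate_questions(house):
    """Yield (question, answer) pairs for one annotated house."""
    for room in house.rooms:                      # hypothetical annotation API
        for obj in room.objects:
            yield (TEMPLATES["location"].format(obj=obj.name), room.name)
            yield (TEMPLATES["color"].format(obj=obj.name), obj.color)
            yield (TEMPLATES["existence"].format(obj=obj.name, room=room.name), "yes")
        for obj_name, n in room.object_counts().items():
            yield (TEMPLATES["count"].format(obj=obj_name, room=room.name), str(n))
```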
EQA Dataset: Questions • Programmatically generate questions and answers Slide credit: Devi Parikh
EQA Dataset: Questions
• Programmatically generate questions and answers
EQA v1 templates:
location: What room is the <OBJ> located in?
color: What color is the <OBJ>?
color_room: What color is the <OBJ> in the <ROOM>?
preposition: What is <on/above/below/next-to> the <OBJ> in the <ROOM>?
Other templates:
existence: Is there a(n) <OBJ> in the <ROOM>?
logical: Is there a(n) <OBJ1> and a(n) <OBJ2> in the <ROOM>?
count: How many <OBJs> in the <ROOM>?
room_count: How many <ROOMs> in the house?
distance: Is the <OBJ1> closer to the <OBJ2> than to the <OBJ3> in the <ROOM>?
Slide credit: Devi Parikh
EQA Dataset: Questions
• Programmatically generate questions and answers
• Remove questions with peaky answer distributions (one possible filter is sketched below)
EQA v1 templates:
location: What room is the <OBJ> located in?
color: What color is the <OBJ>?
color_room: What color is the <OBJ> in the <ROOM>?
preposition: What is <on/above/below/next-to> the <OBJ> in the <ROOM>?
Other templates:
existence: Is there a(n) <OBJ> in the <ROOM>?
logical: Is there a(n) <OBJ1> and a(n) <OBJ2> in the <ROOM>?
count: How many <OBJs> in the <ROOM>?
room_count: How many <ROOMs> in the house?
distance: Is the <OBJ1> closer to the <OBJ2> than to the <OBJ3> in the <ROOM>?
Questions (5,281): train 4,246, val 506, test 529
Slide credit: Devi Parikh
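One way such a filter could be implemented, as a sketch: compute the entropy of the empirical answer distribution for each generated question across training environments and drop questions whose distribution is too peaked (e.g., questions that are almost always answered "yes"). The entropy threshold below is an illustrative placeholder, not the value used to build EQA v1.

```python
from collections import Counter
import math

def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution."""
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def filter_peaky(question_to_answers, min_entropy=0.5):
    """Keep only questions whose answers vary enough across environments.

    question_to_answers maps a question string to the list of its ground-truth
    answers across all training houses. min_entropy is an illustrative threshold.
    """
    return {q: a for q, a in question_to_answers.items()
            if answer_entropy(a) >= min_entropy}
```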
EQA Dataset: Expert Demonstrations • Connected House3D to Amazon Mechanical Turk Slide credit: Devi Parikh
EQA Dataset: Expert Demonstrations Slide credit: Devi Parikh
EQA Dataset: Expert Demonstrations • Connected House3D to Amazon Mechanical Turk • Currently: demonstrations for 1162 questions across 70 environments • Can be used for training • Learn how to explore • Capture human common sense • Can serve as a performance reference Slide credit: Devi Parikh
EQA Dataset: Expert Demonstrations • Connected House3D to Amazon Mechanical Turk • Currently: demonstrations for 1162 questions across 70 environments • Can be used for training • Learn how to explore • Capture human common sense • Can serve as a performance reference (see paper) Slide credit: Devi Parikh
Model: Vision, Language, Navigation, Answering Slide credit: Devi Parikh
Model: Vision, Language, Navigation, Answering
Vision: a shared CNN encoder over the RGB input (Conv_1–Conv_4 with 8, 16, 32, 32 channels; feature maps 224 → 110 → 53 → 24 → 10), trained with decoders for RGB reconstruction (autoencoder), semantic segmentation, and depth.
Slide credit: Devi Parikh
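A PyTorch sketch of an encoder consistent with the feature-map sizes on the slide (224 → 110 → 53 → 24 → 10 spatially, with 8/16/32/32 channels), assuming each block is a 5x5 convolution followed by BatchNorm, ReLU, and 2x2 max-pooling; the exact kernel sizes are an assumption, and the decoder heads (RGB autoencoding, segmentation, depth) are omitted.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    # 5x5 conv (no padding) + BatchNorm + ReLU + 2x2 max-pool gives the
    # spatial progression 224 -> 110 -> 53 -> 24 -> 10 shown on the slide.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=5),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class VisionEncoder(nn.Module):
    """Shared encoder; in the talk it is pretrained with RGB autoencoding,
    semantic segmentation, and depth decoders (not shown in this sketch)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(block(3, 8), block(8, 16),
                                     block(16, 32), block(32, 32))

    def forward(self, rgb):           # rgb: (B, 3, 224, 224)
        return self.encoder(rgb)      # -> (B, 32, 10, 10)

feats = VisionEncoder()(torch.zeros(1, 3, 224, 224))
print(feats.shape)  # torch.Size([1, 32, 10, 10])
```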
Model: Vision, Language, Navigation, Answering (Language module). Slide credit: Devi Parikh
Model: Vision, Language, Navigation, Answering (Navigation module; loop sketched below)
• Planner: direction or intention
• Controller: velocity or primitive actions; decides whether to Repeat the action or Stop and return control to the planner
Slide credit: Devi Parikh
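A hedged pseudocode sketch of this planner/controller decomposition: the planner emits a direction (or a stop intention), the controller then executes that primitive action one step at a time and decides whether to repeat it or hand control back to the planner. The planner, controller, and env interfaces are hypothetical placeholders, not the released implementation.

```python
# Illustrative planner/controller navigation loop; the module interfaces
# (planner, controller, env) are hypothetical placeholders.
def navigate(env, planner, controller, question_feat, max_planner_steps=100):
    obs = env.observe()
    for _ in range(max_planner_steps):
        action = planner.step(obs, question_feat)   # direction or intention
        if action == "STOP":                        # planner decides to answer
            return obs
        while True:
            obs = env.step(action)                  # execute one primitive action
            keep_going = controller.step(obs, action)  # Repeat or Stop?
            if not keep_going:                      # return control to the planner
                break
    return obs
```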