
Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks - PowerPoint PPT Presentation



  1. MONASH INFORMATION TECHNOLOGY Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks Fengda Zhu, Yi Zhu, Xiaojun Chang, Xiaodan Liang

  2. Outline 1. Embodied Navigation 2. Vision-language Navigation Task 3. Related Works 4. Our Methods 5. Conclusion

  3. Embodied Navigation Problem: 1. datasets that provide 3D assets with semantic annotations 2. simulators that render these assets and simulate an embodied agent 3. tasks that define evaluable problems and enable us to benchmark scientific progress

  4. Synthetic Image vs. Real Image. Synthetic images: advantages are more data and faster rendering; disadvantage is limited applicability. Real images: advantage is closeness to the real application; disadvantages are less data and easy overfitting. Transfer: Sim-Real Joint Reinforcement Transfer for 3D Indoor Navigation (Zhu et al., CVPR 2019)

  5. Matterport3D

  6. Habitat Simulator A flexible, high-performance 3D simulator with configurable agents, multiple sensors, and generic 3D dataset handling (with built-in support for MatterPort3D, Gibson, Replica, and other datasets). Advantage: • Real image • Fast rendering • Continuous action space Disadvantage: • Low rendering quality
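The slide describes Habitat as a simulator with configurable agents, sensors, and a step-by-step interface. As a rough, hypothetical illustration of the observe-act loop such simulators expose (NavEnv, its fields, and the action names below are invented for the sketch and are not the real habitat-lab API):

```python
# Hypothetical stand-in for a Habitat-style environment: reset() returns an
# observation, step() advances the agent, and the episode ends on STOP or a
# step budget. None of these names come from the real habitat-lab package.
import random

class NavEnv:
    ACTIONS = ["MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]

    def __init__(self, max_steps=50):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return {"rgb": None, "pointgoal": (5.0, 0.3)}  # placeholder observation

    def step(self, action):
        self.steps += 1
        done = (action == "STOP") or (self.steps >= self.max_steps)
        return {"rgb": None, "pointgoal": (5.0, 0.3)}, done

env = NavEnv()
obs, done = env.reset(), False
while not done:
    action = random.choice(NavEnv.ACTIONS)  # a trained policy would choose here
    obs, done = env.step(action)
```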

  7. PointGoal Task

  8. ObjectGoal Task

  9. Vision-Language Navigation (VLN) Task. Room-to-Room (R2R) dataset: 90 houses, 7k trajectories, 21k instructions. Combines Computer Vision + Natural Language Processing and Natural Language + Reinforcement Learning • More detailed descriptions • Requires complex scene understanding
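To make the R2R statistics concrete, here is a rough sketch of what a single R2R episode looks like; the field names approximate the public R2R JSON release, and the ids and instruction text are invented placeholders:

```python
# Approximate shape of one Room-to-Room episode (field names paraphrase the
# public R2R JSON; ids and the instruction below are invented placeholders).
episode = {
    "scan": "house_id",                       # Matterport3D building
    "path_id": 0,
    "heading": 3.14,                          # initial heading in radians
    "path": ["vp_start", "vp_1", "vp_goal"],  # ground-truth viewpoint sequence
    "instructions": [
        "Walk through the living room, exit, and turn right into the bedroom.",
    ],
}

# Each (instruction, path) pair is one supervised training example.
for text in episode["instructions"]:
    print(text, "->", episode["path"])
```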

  10. VLN baseline (seq-to-seq). Disadvantages: 1. Supervised learning easily overfits 2. Does not sufficiently exploit the panoramic view 3. The action space is redundant 4. Training-testing domain gap
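For reference, a minimal PyTorch sketch of the kind of seq-to-seq follower the slide criticizes: an LSTM encodes the instruction, and an LSTM decoder attends over the encoded words to pick one discrete action per step. All sizes, layer choices, and names here are illustrative assumptions, not the original baseline code:

```python
# Minimal seq-to-seq VLN follower sketch: instruction encoder LSTM + attentive
# LSTM action decoder over per-step visual features. Dimensions are illustrative.
import torch
import torch.nn as nn

class Seq2SeqFollower(nn.Module):
    def __init__(self, vocab=1000, emb=64, hid=128, img_feat=2048, n_actions=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.decoder = nn.LSTMCell(img_feat, hid)
        self.attn = nn.Linear(hid, hid)             # projection for dot-product attention
        self.policy = nn.Linear(hid * 2, n_actions)

    def forward(self, instr_tokens, img_feats):
        # instr_tokens: (B, L) word ids; img_feats: (B, T, img_feat), one view per step
        ctx, (h, c) = self.encoder(self.embed(instr_tokens))      # ctx: (B, L, hid)
        h, c = h.squeeze(0), c.squeeze(0)
        logits = []
        for t in range(img_feats.size(1)):
            h, c = self.decoder(img_feats[:, t], (h, c))
            scores = torch.bmm(ctx, self.attn(h).unsqueeze(2)).squeeze(2)       # (B, L)
            attended = torch.bmm(scores.softmax(-1).unsqueeze(1), ctx).squeeze(1)
            logits.append(self.policy(torch.cat([h, attended], dim=-1)))
        return torch.stack(logits, dim=1)            # (B, T, n_actions)

model = Seq2SeqFollower()
out = model(torch.randint(0, 1000, (2, 10)), torch.randn(2, 5, 2048))
print(out.shape)  # torch.Size([2, 5, 6])
```

Training such a model purely with teacher forcing over low-level actions is what exposes it to the overfitting and redundant-action-space issues listed above.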

  11. Speaker-Follower Model Speaker: trajectory to instruction Follower: instruction to trajectory

  12. Speaker-Follower Model
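A small sketch of the pragmatic re-ranking idea behind the Speaker-Follower model: the follower proposes candidate routes, and the speaker scores how well each route would generate the given instruction. The mixing weight and the probabilities below are invented placeholders, not the paper's settings:

```python
# Speaker-Follower re-ranking sketch: candidate routes are scored by a mix of
# follower and speaker log-likelihoods. Values here are stand-ins for model outputs.
import math

def combined_score(log_p_follower, log_p_speaker, weight=0.95):
    """Pragmatic re-ranking: mix follower and speaker log-likelihoods."""
    return (1 - weight) * log_p_follower + weight * log_p_speaker

candidates = [
    {"route": ["vp0", "vp1", "vp2"], "log_p_follower": math.log(0.6), "log_p_speaker": math.log(0.2)},
    {"route": ["vp0", "vp3", "vp4"], "log_p_follower": math.log(0.3), "log_p_speaker": math.log(0.7)},
]
best = max(candidates, key=lambda c: combined_score(c["log_p_follower"], c["log_p_speaker"]))
print(best["route"])   # the speaker can overturn the follower's greedy choice
```

The same speaker is also used to synthesize extra instructions for sampled trajectories, which augments the small R2R training set.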

  13. Reinforced Cross-Modal Matching (RCM). Advantages: 1. Uses cross-modal attention 2. Introduces RL + supervised learning
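A hedged sketch of the second listed advantage, mixing a supervised (imitation) term with a policy-gradient term in one loss; the reward here is a placeholder scalar rather than RCM's actual matching-critic reward, and the weighting is illustrative:

```python
# Mixed imitation + REINFORCE objective sketch for an RCM-style agent.
import torch
import torch.nn.functional as F

def mixed_loss(logits, teacher_action, sampled_action, reward, rl_weight=0.5):
    # logits: (B, n_actions); teacher/sampled_action: (B,) long; reward: (B,) float
    sup_loss = F.cross_entropy(logits, teacher_action)                 # imitation term
    log_prob = F.log_softmax(logits, dim=-1).gather(1, sampled_action.unsqueeze(1)).squeeze(1)
    rl_loss = -(reward * log_prob).mean()                              # REINFORCE term
    return sup_loss + rl_weight * rl_loss

logits = torch.randn(4, 6, requires_grad=True)
loss = mixed_loss(logits, torch.tensor([1, 2, 0, 3]), torch.tensor([1, 1, 0, 2]), torch.rand(4))
loss.backward()
```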

  14. Environmental Dropout (Envdrop)
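The slide names Environmental Dropout without detail; as a rough illustration of the idea usually associated with it, the sketch below samples one feature-level dropout mask per environment and reuses it across all views and timesteps, so the augmented features behave like a consistent "new" environment. The shapes and drop rate are assumptions:

```python
# Environmental-dropout sketch: one shared dropout mask per environment,
# applied to every view and timestep from that environment.
import torch

def environmental_dropout(env_feats, drop_p=0.4):
    # env_feats: (views, timesteps, feat_dim) features from a single environment
    feat_dim = env_feats.size(-1)
    mask = (torch.rand(feat_dim) > drop_p).float() / (1 - drop_p)  # shared, rescaled mask
    return env_feats * mask            # broadcasts over views and timesteps

feats = torch.randn(36, 5, 2048)
aug = environmental_dropout(feats)
print(aug.shape)  # same shape, but a fixed subset of feature channels is zeroed
```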

  15. Self-Supervised Auxiliary Reasoning Tasks. Example instruction: "Please turn left and walk through the living room. Exit the room and turn right into the bedroom." Rich information to explore: • Semantics of the route • Navigation Progress • Vision-Language Consistency • Room Structure. [Figure: navigation graph with nodes t0, t1, t2 leading to the goal; legend: Navigation Node, Navigation Edge, Feasible Edge]

  16. Self-Supervised Auxiliary Reasoning Tasks. Example instruction: "Please turn left and walk through the living room. Exit the room and turn right into the bedroom." We require the agent to: • Interpret its actions • Reason about the past • Align vision and language explicitly • Predict the future (see the sketch below). [Figure: same navigation graph as slide 15]
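A minimal sketch of how the four abilities above could be realized as auxiliary heads over the navigator's shared hidden state: interpreting actions as a trajectory-retelling head, reasoning about the past as progress estimation, explicit vision-language alignment as a matching score, and predicting the future as a heading classifier. Layer sizes and the bag-of-words retelling head are simplifying assumptions, not the paper's exact architecture:

```python
# Four self-supervised auxiliary heads sharing the navigator's hidden state.
import torch
import torch.nn as nn

class AuxiliaryHeads(nn.Module):
    def __init__(self, hid=512, vocab=1000, n_angles=12):
        super().__init__()
        self.retell = nn.Linear(hid, vocab)      # explain the trajectory so far
        self.progress = nn.Linear(hid, 1)        # fraction of the route completed
        self.matching = nn.Linear(hid * 2, 1)    # instruction-trajectory consistency
        self.angle = nn.Linear(hid, n_angles)    # predict the next heading bin

    def forward(self, h_traj, h_instr):
        return {
            "retell_logits": self.retell(h_traj),
            "progress": torch.sigmoid(self.progress(h_traj)),
            "match_score": torch.sigmoid(self.matching(torch.cat([h_traj, h_instr], -1))),
            "angle_logits": self.angle(h_traj),
        }

heads = AuxiliaryHeads()
out = heads(torch.randn(2, 512), torch.randn(2, 512))
print({k: v.shape for k, v in out.items()})
```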

  17. Self-Supervised Auxiliary Reasoning Tasks

  18. Self-Supervised Auxiliary Reasoning Tasks

  19. Self-Supervised Auxiliary Reasoning Tasks

  20. Self-Supervised Auxiliary Reasoning Tasks Demo Code
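Since the slide only points to demo code, here is a hedged sketch of how the navigation loss and the auxiliary losses might be combined into a single objective; the loss names and weights are placeholders, not the paper's reported settings:

```python
# Combine the main navigation loss with weighted auxiliary losses.
import torch

def total_loss(nav_loss, aux_losses, weights=None):
    weights = weights or {"retell": 1.0, "progress": 1.0, "match": 1.0, "angle": 1.0}
    return nav_loss + sum(weights[k] * aux_losses[k] for k in aux_losses)

loss = total_loss(
    torch.tensor(1.2),
    {"retell": torch.tensor(0.8), "progress": torch.tensor(0.1),
     "match": torch.tensor(0.3), "angle": torch.tensor(0.5)},
)
print(loss)  # auxiliary terms regularize the navigator without extra labels
```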

  21. Thank You
