
Geometry-Aware Deep Visual Learning, Katerina Fragkiadaki (PowerPoint PPT Presentation)



  1. Geometry-Aware Deep Visual Learning Katerina Fragkiadaki

  2. How this talk fits the workshop • We will discuss new neural architectures for video understanding and feature learning without human annotations • We will still use SGD to train the models

  3. What is the goal of computer vision?

  4. Label image pixels, detect and segment objects (image from Bruno Olshausen)

  5. Label image pixels, detect and segment objects. K. He et al., Mask R-CNN, 2017

  6. Registration against known HD maps, 3D object detection, 3D motion forecasting

  7. Image Understanding as Inverse Graphics

  8. zebras A reasonable answer: the goal of computer vision is task specific

  10. Internet vision: photos taken by people (and uploaded on the Internet). Mobile (embodied) computer vision: photos taken by a NAO robot during a robot soccer game. Our detectors may not work very well here… Do we have more suitable models for this domain?

  11. Why Embodied Computer Vision Matters: 1. Agents that move around in the world, perceive it, and accomplish tasks are (close to) the goal of AI research. 2. It may be the key towards unsupervised visual feature learning. "We must perceive in order to move, but we must also move in order to perceive" - J. J. Gibson, The Ecological Approach to Visual Perception, 1979

  12. Internet and mobile perception have developed independently and have each made great progress • Internet vision has trained great DeepNets for image labelling and object detection and segmentation • Mobile computer vision has produced great SLAM (Simultaneous Localization and Mapping) methods

  13. Image Understanding as Inverse Graphics? Should we be engineering a different model for every domain?

  14. Image Understanding as Inverse Graphics: Blocks world. Larry Roberts, Machine Perception of Three-Dimensional Solids, MIT 1965. (Input image → image gradient → computed 3D model, rendered from a new viewpoint)

  15. Image Understanding as Inverse Graphics. David Marr, 1982

  16. 3D models are impossible and unnecessary. "Internal world models which are complete representations of the external environment, besides being impossible to obtain, are not at all necessary for agents to act in a competent manner." "…(1) eventually computer vision will catch up and provide such world models -- I don't believe this based on the biological evidence presented below, or (2) complete objective models of reality are unrealistic and hence the methods of Artificial Intelligence that rely on such models are unrealistic." "Intelligence without reason", Rodney Brooks, IJCAI 1991

  17. 25 years later, the iRobot vacuum cleaner is building a map! (Rodney Brooks co-founded iRobot in 1990.)

  18. To 3D or not to 3D?

  19. And if to 3D, what 3D representation to use? Depth map, surface normals, 3D mesh, 3D point cloud, 3D voxel occupancy

  20. This talk: To 3D using 3D feature tensors, H × W × D × C (3 spatial dimensions, 1 feature dimension)
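Not part of the original slides: a minimal NumPy sketch of what an H × W × D × C feature tensor looks like in code. The sizes are illustrative, not the ones used in the talk.

```python
import numpy as np

# A 4D feature tensor with 3 spatial dimensions (H, W, D) and one
# feature dimension C: each cell of a coarse 3D grid over the scene
# holds a C-dimensional feature vector.
H, W, D, C = 32, 32, 32, 16
memory = np.zeros((H, W, D, C), dtype=np.float32)

# Writing a feature vector into a single voxel of the grid:
memory[10, 20, 5] = np.random.randn(C).astype(np.float32)

print(memory.shape)   # (32, 32, 32, 16)
print(memory.nbytes)  # 2097152 bytes = 2 MiB of float32 features
```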

  24. Geometry-Aware Recurrent Networks: 1. Hidden state: a 4D deep feature tensor, akin to a 3D feature map (as opposed to a 3D point-cloud map) of the scene. 2. Egomotion-stabilized hidden state updates
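Not from the original talk: a toy sketch of the egomotion-stabilized recurrence, with stub helpers (`unproject`, `rotate_to_canonical` are hypothetical names) and a running-average fusion in place of the learned update.

```python
import numpy as np

H = W = D = C = 8  # illustrative sizes

def unproject(view_2d):
    # Stub: tile 2D features along the depth axis (see the unprojection slide).
    return np.repeat(view_2d[:, :, None, :], D, axis=2)

def rotate_to_canonical(tensor_3d, egomotion):
    # Stub: the real module resamples the grid by the inverse egomotion rotation.
    return tensor_3d

# Each new view is lifted to 3D, rotated into the frame of the first
# view, and fused into the persistent 4D memory.
memory = np.zeros((H, W, D, C), dtype=np.float32)
views = [np.random.randn(H, W, C).astype(np.float32) for _ in range(3)]
for t, view in enumerate(views, start=1):
    lifted = rotate_to_canonical(unproject(view), egomotion=None)
    memory = memory + (lifted - memory) / t  # running-average fusion

print(memory.shape)  # (8, 8, 8, 8)
```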

  25. 2D recurrent networks (LSTMs, ConvLSTMs, …): hidden states h_t, h_{t+1}, h_{t+2}, each updated from CNN features of the current frame

  26. 4D latent state: hidden states h_t are 4D feature tensors, updated over time using the egomotion (R, t) between frames and CNN features of each view

  29. Geometry-Aware Recurrent Networks (GRNNs) H × W × D × C

  31. GRNNs: • A set of differentiable neural modules that learn to go from 2D to 3D and back • A lot of SLAM ideas built into the neural modules

  32. Unprojection (2D to 3D)
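Not in the original slides: a brute-force NumPy sketch of unprojection under an assumed pinhole camera model. Each voxel center is projected onto the image plane and copies the 2D feature found there, so one image feature fills the whole ray of voxels behind a pixel. The focal length and grid extents are made-up values.

```python
import numpy as np

H = W = 16  # image and grid resolution (illustrative)
D = 8       # depth bins
C = 4       # feature channels
f = 16.0    # assumed focal length in pixels

feat2d = np.random.randn(H, W, C).astype(np.float32)
grid = np.zeros((H, W, D, C), dtype=np.float32)

# Voxel centers in camera coordinates, projected back to the image.
xs = np.linspace(-1.0, 1.0, W)
ys = np.linspace(-1.0, 1.0, H)
zs = np.linspace(1.0, 3.0, D)
for iy, y in enumerate(ys):
    for ix, x in enumerate(xs):
        for iz, z in enumerate(zs):
            u = int(round(f * x / z + W / 2))  # perspective projection
            v = int(round(f * y / z + H / 2))
            if 0 <= u < W and 0 <= v < H:
                grid[iy, ix, iz] = feat2d[v, u]
```

In the real model this resampling is done with differentiable bilinear interpolation rather than nearest-neighbor loops.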

  37. Rotation (azimuth, elevation)
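Not from the talk: a small sketch of building a rotation matrix from azimuth and elevation angles (one assumed convention: azimuth about the y-axis, elevation about the x-axis; other conventions are equally valid).

```python
import numpy as np

def rotation_from_az_el(az, el):
    """Compose azimuth (about y) and elevation (about x) rotations."""
    Raz = np.array([[ np.cos(az), 0.0, np.sin(az)],
                    [ 0.0,        1.0, 0.0       ],
                    [-np.sin(az), 0.0, np.cos(az)]])
    Rel = np.array([[1.0, 0.0,         0.0        ],
                    [0.0, np.cos(el), -np.sin(el)],
                    [0.0, np.sin(el),  np.cos(el)]])
    return Rel @ Raz

R = rotation_from_az_el(np.pi / 4, np.pi / 6)
print(np.allclose(R @ R.T, np.eye(3)))     # True: orthogonal
print(np.isclose(np.linalg.det(R), 1.0))   # True: a proper rotation
```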

  38. Egomotion-stabilized memory update: unproject the new view, rotate it by the relative rotation R, and fuse it into the 3D feature memory via cross convolution

  39. Egomotion-stabilized memory update: the hidden state update from h_t to h_{t+1} unprojects the new view and rotates it by −R before fusing

  40. Projection (3D to 2D)
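Not in the original slides: projection collapses the depth axis of the H × W × D × C memory back to a 2D H × W × C feature map. A simple differentiable stand-in is max or mean over depth; the actual module in the talk is learned.

```python
import numpy as np

H, W, D, C = 16, 16, 8, 4
memory = np.random.randn(H, W, D, C).astype(np.float32)

# Reduce along the depth axis (axis=2) to get a 2D feature map.
feat2d_max = memory.max(axis=2)    # (H, W, C)
feat2d_mean = memory.mean(axis=2)  # (H, W, C)

print(feat2d_max.shape)  # (16, 16, 4)
```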

  45. Training GRNNs: 1. Self-supervised, by predicting the images the agent will see under novel viewpoints. 2. Supervised, for 3D object detection

  46. View prediction: rotate the 3D feature memory to the query view, project it to 2D, and decode an image with the image generator
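Not from the talk: a stub end-to-end sketch of the view-prediction pipeline and its self-supervised loss. All module names (`rotate_to_view`, `project`, `image_generator`) are hypothetical placeholders for the learned components.

```python
import numpy as np

H = W = D = 8
C = 4

def rotate_to_view(memory, azimuth, elevation):
    return memory  # stub; the real module resamples the voxel grid

def project(memory):
    return memory.mean(axis=2)  # (H, W, C); stub depth reduction

def image_generator(feat2d):
    return feat2d[..., :3]  # stub decoder: first 3 channels as RGB

# Predict the image seen from a query viewpoint and score it against
# the image the agent actually observes there (no human labels needed).
memory = np.random.randn(H, W, D, C).astype(np.float32)
target = np.random.rand(H, W, 3).astype(np.float32)
pred = image_generator(project(rotate_to_view(memory, 0.5, 0.1)))
loss = float(np.mean((pred - target) ** 2))  # self-supervised L2 loss

print(pred.shape)  # (8, 8, 3)
```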

  47. 3 input views: 2D RNN [1] vs. GRNN. [1] Neural scene representation and rendering, DeepMind, Science, 2018

  48. 3 input views: 2D RNN [1] vs. GRNN, testing on scenes with more objects than at train time. [1] Neural scene representation and rendering, DeepMind, Science, 2018

  49. View prediction: geometry-aware RNN vs. 2D RNN [1]

  50. 3D object detection: an RPN-based 3D version of Mask R-CNN

  51. Results: 3D object detection

  52. 3D object detection: predicted segmentations and predicted boxes (input, ground truth, prediction; front view and bird's-eye view). Object detections learn to persist in time; they do not switch on and off from frame to frame

  53. GRNNs: differentiable SLAM for better space-aware deep feature learning • A generative model of scenes with a 3D bottleneck when trained from view prediction • Generalize better than 2D models

  54. What's next? • Use GRNNs for tracking, dynamics learning, and as a perceptual front-end for RL and robotic learning

  55. Thank you! Fish Tung, Ziyan Wang, Ricson Cheng • Learning Spatial Common Sense with Geometry-Aware Recurrent Networks, F. Tung, R. Cheng, K. Fragkiadaki, arXiv • Geometry-Aware Recurrent Neural Networks for Active Visual Recognition, R. Cheng, Z. Wang, K. Fragkiadaki, NIPS 2018
