Monocular Depth-Pose Prediction
Wang Zhao, Shaohui Liu, Yezhi Shu, Yong-Jin Liu (Tsinghua University)


  1. Wang Zhao, Shaohui Liu, Yezhi Shu, Yong-Jin Liu (Tsinghua University)

  2. Monocular Depth-Pose Prediction: a sequence of RGB frames is mapped to a predicted depth map and the relative camera pose [R, t].

  3. PoseNet Fails to Generalize! Visual odometry with unseen camera ego-motions: all trajectories drift. Depth estimation in indoor environments: complex camera motions and low texture.

  4. Joint Learning without PoseNet. The pipeline is built on top of two-frame structure-from-motion: FlowNet produces dense correspondences; sampled correspondences are fed to a normalized 8-point solver with an inlier mask to recover the relative pose [R, t]; sampling and triangulation then yield a sparse triangulated depth, which is scale-aligned with the DepthNet prediction to form the loss.

  5. Joint Learning without PoseNet: recovering the relative pose [R, t] from sampled correspondences.
     • Correspondences are sampled based on the occlusion mask and the forward-backward consistency score produced by the optical flow network.
     • The 8-point algorithm is run inside a RANSAC loop to robustly recover the relative pose.
     • The epipolar distance (inlier mask) is computed and used to further filter out incorrect matches and non-rigid objects.
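
A rough illustration of this pose-recovery step, as a minimal Python sketch: it recovers [R, t] from sampled correspondences with an essential-matrix RANSAC and then scores each match by its Sampson (epipolar) distance. OpenCV's solver stands in for the normalized 8-point layer, synthetic points stand in for real flow matches, and the intrinsics and thresholds are placeholder values, not the authors' settings.

    import numpy as np
    import cv2

    # Placeholder KITTI-like intrinsics (assumption, not the paper's values).
    K = np.array([[718.856, 0.0, 607.193],
                  [0.0, 718.856, 185.216],
                  [0.0, 0.0, 1.0]])

    # Synthetic rigid scene standing in for matches sampled from the flow network.
    rng = np.random.default_rng(0)
    X1 = rng.uniform([-10.0, -5.0, 5.0], [10.0, 5.0, 40.0], size=(2000, 3))
    R_gt, _ = cv2.Rodrigues(np.array([[0.0], [0.02], [0.0]]))
    t_gt = np.array([0.5, 0.0, 0.1])
    X2 = X1 @ R_gt.T + t_gt
    pts1 = X1 @ K.T; pts1 = pts1[:, :2] / pts1[:, 2:3]
    pts2 = X2 @ K.T; pts2 = pts2[:, :2] / pts2[:, 2:3]

    # Essential matrix estimated inside a RANSAC loop, then a cheirality check
    # (cv2.recoverPose) to obtain the relative pose [R, t] up to scale.
    E, ransac_mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                          prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=ransac_mask)

    # Sampson (epipolar) distance per match: small values indicate rigid,
    # well-matched pixels; large values flag wrong matches and moving objects.
    F = np.linalg.inv(K).T @ E @ np.linalg.inv(K)
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones]); x2 = np.hstack([pts2, ones])
    Fx1 = x1 @ F.T                      # rows are F @ x1_i
    Ftx2 = x2 @ F                       # rows are F^T @ x2_i
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    inlier_mask = num / den < 1.0       # threshold chosen arbitrarily for this sketch
    print(R, t.ravel(), inlier_mask.mean())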

  6. Joint Learning without PoseNet: sampling and triangulation (flow correspondences + relative pose → sparse triangulated depth).
     • We sample 6k matches from the flow for triangulation, according to the occlusion mask, the forward-backward score, and the inlier mask.
     • We use mid-point triangulation for its convenience, and it is naturally differentiable.
     • A match is discarded if the angle between its two rays is too small.
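
Below is a minimal sketch of the mid-point triangulation itself, assuming PyTorch and a single pinhole intrinsics matrix K. The function name, the 1-degree parallax threshold, and the positive-distance check are illustrative choices, not the paper's exact implementation; the point of the sketch is that the whole computation is plain tensor algebra, so it stays differentiable end to end.

    import torch

    def midpoint_triangulate(px1, px2, K, R, t, min_angle_deg=1.0):
        """px1, px2: (N, 2) matched pixels in frames 1 and 2. [R, t] maps frame-1
        points into frame 2 (X2 = R @ X1 + t). Returns (N, 3) points expressed in
        frame 1 and a validity mask."""
        K_inv = torch.inverse(K)
        ones = torch.ones(px1.shape[0], 1, dtype=px1.dtype)
        # Ray directions of both cameras, expressed in frame-1 coordinates.
        d1 = torch.cat([px1, ones], dim=1) @ K_inv.T
        d2 = (torch.cat([px2, ones], dim=1) @ K_inv.T) @ R   # rows are R^T (K^-1 x2)
        d1 = d1 / d1.norm(dim=1, keepdim=True)
        d2 = d2 / d2.norm(dim=1, keepdim=True)
        o1 = torch.zeros(3, dtype=px1.dtype)                 # camera-1 centre
        o2 = (-R.T @ t).reshape(3)                           # camera-2 centre in frame 1
        w0 = (o1 - o2).expand_as(d1)
        # Closed-form least squares of |(o1 + s*d1) - (o2 + u*d2)|^2 over s and u.
        a = (d1 * d1).sum(1); b = (d1 * d2).sum(1); c = (d2 * d2).sum(1)
        d = (d1 * w0).sum(1); e = (d2 * w0).sum(1)
        denom = a * c - b * b + 1e-12
        s = (b * e - c * d) / denom                          # distance along ray 1
        u = (a * e - b * d) / denom                          # distance along ray 2
        X = 0.5 * ((o1 + s.unsqueeze(1) * d1) + (o2 + u.unsqueeze(1) * d2))
        # Drop matches whose rays are nearly parallel or that fall behind a camera.
        angle = torch.rad2deg(torch.arccos((d1 * d2).sum(1).clamp(-1.0, 1.0)))
        valid = (angle > min_angle_deg) & (s > 0) & (u > 0)
        return X, valid

The z-components of the returned points then serve as the sparse triangulated depth used on the next slide.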

  7. Joint Learning without PoseNet: scale alignment between the sparse triangulated depth and the DepthNet prediction.
     • The predicted depth is aligned with the triangulated depth map so that the two share a consistent scale.
     • A triangulation loss, a depth re-projection loss, and a depth smoothness loss are used to supervise the DepthNet.
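
As a minimal sketch of the alignment, assuming PyTorch: the predicted depth is rescaled by a single global factor computed at the sampled pixels, and an L1 triangulation loss compares the rescaled prediction with the triangulated depths. Using the median ratio as the alignment statistic and a plain L1 penalty are illustrative assumptions; the paper additionally uses depth re-projection and smoothness terms.

    import torch

    def align_and_triangulation_loss(pred_depth, tri_depth, pix, valid):
        """pred_depth: (H, W) DepthNet output; tri_depth: (N,) sparse triangulated
        depths at integer pixel locations pix (N, 2) given as (x, y); valid: (N,) mask."""
        sampled = pred_depth[pix[valid, 1], pix[valid, 0]]         # prediction at the matches
        scale = torch.median(tri_depth[valid] / (sampled + 1e-8))  # one global scale factor
        aligned = scale * pred_depth                               # depth map on the triangulation scale
        tri_loss = torch.abs(scale * sampled - tri_depth[valid]).mean()
        return aligned, tri_loss

    # Usage with dummy tensors.
    H, W, N = 128, 416, 512
    pred = (torch.rand(H, W) + 0.1).requires_grad_()
    pix = torch.stack([torch.randint(0, W, (N,)), torch.randint(0, H, (N,))], dim=1)
    tri = torch.rand(N) * 20.0 + 1.0
    mask = torch.rand(N) > 0.2
    aligned, loss = align_and_triangulation_loss(pred, tri, pix, mask)
    loss.backward()                                                # gradients reach the depth prediction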

  8. Scale Disentanglement.
     1. The translation t of the pose [R, t] estimated from monocular video is only known up to scale.
     2. The monocular depth prediction D from the network has a learnt scale.
     3. The joint training losses require a consistent scale across the learnt depth and pose.
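
To make point 3 concrete, here is the standard view-synthesis relation this family of joint depth-pose losses is built on (a sketch in my own notation, not taken from the slides): the warped pixel mixes the learnt depth with the translation, so only their relative scale is constrained by the loss.

    % Warping a pixel p_1 with predicted depth D(p_1) into frame 2 via the pose [R, t]:
    \[
      p_2 \;\sim\; K \left( R \, D(p_1) \, K^{-1} \tilde{p}_1 + t \right)
    \]
    % Rescaling depth and translation together, D -> alpha*D and t -> alpha*t,
    % leaves p_2 unchanged, so the training losses only pin down the ratio of the
    % two scales; the learnt depth and the estimated translation must therefore be
    % kept on one consistent scale.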

  9. Scale Disentanglement: PoseNet-based learning system vs. our system.
     • PoseNet-based system: the RGB input goes to DepthNet (depth D) and PoseNet (pose [R, t]), and the loss couples the two, so PoseNet needs to learn a translation scale consistent with DepthNet.
     • Our system: the RGB input goes to DepthNet (depth D) and to FlowNet + solver (pose [R, t]); scale alignment rescales the predicted depth (D → D′) before the loss, so there is no need for the network to learn a translation scale consistent with DepthNet.

  10. Quantitative Results on the KITTI dataset. Our method achieves state-of-the-art performance on KITTI depth and optical flow estimation.

  11. Robustness Improved (KITTI): visual odometry with unseen camera ego-motion, PoseNet-based vs. our system.

  12. Robustness Improved (TUM): visual odometry in indoor environments, PoseNet-based vs. our system.

  13. Robustness Improved (NYUv2): depth estimation in indoor environments; qualitative comparison of the input image, the PoseNet-based system, and our system.

  14. Robustness Improved (NYUv2): depth estimation in indoor environments, PoseNet-based vs. our system. Best performance on NYUv2 among unsupervised methods!

  15. Code and models are available at https://github.com/B1ueber2y/TrianFlow. Check our paper for more details!
