Unsupervised Monocular Depth Estimation CNN Robust to Training Data Diversity
Valery Anisimovskiy, Andrey Shcherbinin, Sergey Turko and Ilya Kurilin
15 May 2020
Problem Statement: Depth Sensor Limitations
IR depth sensor (e.g. Kinect 2): + dense depth map, instantaneous; - noisy, short range
Stereo camera (e.g. ZED): + dense, instantaneous; - unreliable, long baseline, occlusions
Lidar (e.g. YDLIDAR X1): + long range, good accuracy; - sparse, non-instantaneous, expensive
Problem Statement: RGB Camera + CNN
Unsupervised monocular CNN: a photo from a monocular camera is processed by the CNN to predict a depth map.
+ Dense and instantaneous depth map
+ Cheap monocular camera
+ Adaptable to scenery by training on a relevant dataset
+ Easy training data collection
- High computational overhead
Existing Approaches: Supervised Monocular CNN
CNN model trained on a dataset containing monocular input images and ground-truth depth maps (e.g. KITTI, Cityscapes).
+ Best depth estimation accuracy
- Requires hard-to-get precise ground-truth depth maps
- Costly training datasets
- Trained model is scene-dependent
Existing Approaches: Unsupervised CNN Trained on Video Sequences
CNN model trained on video sequences, predicting camera pose jointly with depth.
+ Leverages readily available video sequence data for training
- Worse depth estimation accuracy
- Requires camera intrinsics for training
Existing Approaches: Unsupervised CNN Trained on Stereo Pairs
Unsupervised CNN model trained on stereo pairs from available stereo datasets (KITTI, Cityscapes, ...) with a loss based on opposite-image reconstruction and left-right consistency.
+ Good depth estimation accuracy
- Lacks robustness when trained on hybrid datasets containing data from different stereo rigs
Suggested Approach: Unsupervised CNN Trained on Stereo Pairs + Camera Parameter Estimation
Unsupervised CNN model trained on stereo pairs, estimating stereo camera parameters jointly with depth.
Possible sources of easily available training data: high-quality commercial stereo movies, stereo web datasets, custom stereo video.
+ State-of-the-art depth estimation accuracy (among unsupervised monocular methods)
+ Easy adaptation to any scene category through routine training data collection
+ No expensive ground-truth depth
+ Robustness to training data diversity via camera parameter estimation
Proposed Model: Data Flow
1. Input stereo-pair images (I_L and I_R) are separately processed by the Siamese depth estimation CNN to produce inverse depth maps (ẑ_L and ẑ_R), high-level feature maps, and correction maps ΔI_L and ΔI_R.
2. The high-level feature maps of both images are processed by the camera parameters estimation CNN to produce the stereo camera parameters, gain G and bias B, which transform ẑ_L and ẑ_R into disparity maps: d_L = G(ẑ_L + B), d_R = -G(ẑ_R + B).
3. The disparity maps d_L and d_R, along with the corrected input images Î_L and Î_R, are used to reconstruct the counterpart images I′_R and I′_L as well as the counterpart disparity maps d′_L and d′_R.
4. The generated disparity maps d′_L and d′_R are further used to reconstruct auxiliary images I″_L and I″_R.
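A minimal PyTorch sketch of the disparity transform in step 2, assuming the gain and bias are predicted as per-sample scalars (function and tensor names are illustrative, not from the paper):

```python
import torch

def inverse_depth_to_disparity(z_left, z_right, gain, bias):
    """Turn inverse depth maps into signed disparity maps using the
    predicted stereo camera parameters (slide 8):
        d_L = G * (z_L + B),   d_R = -G * (z_R + B)

    z_left, z_right: [B, 1, H, W] inverse depth maps
    gain, bias:      [B, 1, 1, 1] per-sample scalars from the
                     camera parameters estimation CNN
    """
    d_left = gain * (z_left + bias)
    d_right = -gain * (z_right + bias)
    return d_left, d_right
```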
Proposed Model: Loss
The total loss combines four terms (the formulas appear only in the slide graphics):
- Primary reconstruction
- Auxiliary reconstruction
- Left-right consistency
- Correction regularization
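A sketch of how such a total loss could be assembled; since the slide's actual formulas are not reproduced here, the L1 penalties and the weights w_aux, w_lr, w_corr are assumptions:

```python
import torch
import torch.nn.functional as F

def total_loss(i_left, i_right,          # input images
               i_left_rec, i_right_rec,  # primary reconstructions I'
               i_left_aux, i_right_aux,  # auxiliary reconstructions I''
               d_left, d_right,          # predicted disparity maps
               d_left_gen, d_right_gen,  # counterpart-generated disparities d'
               corr_left, corr_right,    # correction maps
               w_aux=1.0, w_lr=1.0, w_corr=0.1):
    # Primary reconstruction: counterpart-warped images vs. inputs.
    primary = F.l1_loss(i_left_rec, i_left) + F.l1_loss(i_right_rec, i_right)
    # Auxiliary reconstruction via the generated disparity maps.
    auxiliary = F.l1_loss(i_left_aux, i_left) + F.l1_loss(i_right_aux, i_right)
    # Left-right consistency between directly predicted and
    # counterpart-generated disparity maps.
    lr_consistency = (F.l1_loss(d_left_gen, d_left)
                      + F.l1_loss(d_right_gen, d_right))
    # Regularization keeping the correction maps small.
    correction_reg = corr_left.abs().mean() + corr_right.abs().mean()
    return (primary + w_aux * auxiliary
            + w_lr * lr_consistency + w_corr * correction_reg)
```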
Depth Estimation CNN Architecture
U-Net-style encoder-decoder with skip connections (architecture diagram):
- Encoder: conv + ELU layers with 2x2 pooling, kernel sizes shrinking from 9x9 at full resolution through 7x7 and 5x5 down to 3x3, reducing the W x H x 3 input to W/64 x H/64 x 512.
- Decoder: 2x2 upsampling with conv3x3 + ELU layers and skip connections from the matching encoder resolutions.
- Outputs: an inverse depth pyramid (conv3x3 + tanh heads at scales from W/8 x H/8 up to W x H) and correction maps (conv3x3 + tanh heads at W/2 x H/2 and W x H).
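A condensed PyTorch sketch of this encoder-decoder, with fewer levels and a single output scale; channel counts and layer grouping are illustrative rather than an exact reproduction of the diagram:

```python
import torch
import torch.nn as nn

class DepthNet(nn.Module):
    """Condensed sketch of the slide-10 encoder-decoder (illustrative)."""
    def __init__(self):
        super().__init__()
        # Encoder: large kernels at high resolution, ELU, stride-2 downsampling.
        self.enc1 = self._down(3, 32, k=9)     # -> W/2
        self.enc2 = self._down(32, 64, k=7)    # -> W/4
        self.enc3 = self._down(64, 128, k=5)   # -> W/8
        self.enc4 = self._down(128, 256, k=3)  # -> W/16
        # Decoder: 2x upsampling + conv3x3/ELU; inputs include skip channels.
        self.dec3 = self._up(256, 128)
        self.dec2 = self._up(256, 64)   # 128 (dec3) + 128 (skip from enc3)
        self.dec1 = self._up(128, 32)   # 64 (dec2) + 64 (skip from enc2)
        # Heads: tanh-activated single-channel outputs.
        self.inv_depth = nn.Sequential(nn.Conv2d(64, 1, 3, padding=1), nn.Tanh())
        self.correction = nn.Sequential(nn.Conv2d(64, 1, 3, padding=1), nn.Tanh())

    @staticmethod
    def _down(c_in, c_out, k):
        return nn.Sequential(nn.Conv2d(c_in, c_out, k, stride=2, padding=k // 2),
                             nn.ELU())

    @staticmethod
    def _up(c_in, c_out):
        return nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                             nn.Conv2d(c_in, c_out, 3, padding=1), nn.ELU())

    def forward(self, x):
        e1 = self.enc1(x)   # W/2
        e2 = self.enc2(e1)  # W/4
        e3 = self.enc3(e2)  # W/8
        e4 = self.enc4(e3)  # W/16 -- also serves as the high-level feature map
        d3 = torch.cat([self.dec3(e4), e3], dim=1)  # W/8, with skip
        d2 = torch.cat([self.dec2(d3), e2], dim=1)  # W/4, with skip
        d1 = torch.cat([self.dec1(d2), e1], dim=1)  # W/2, with skip
        # The full model emits an inverse-depth pyramid at several scales
        # plus two correction-map scales; one scale of each is shown here.
        return self.inv_depth(d1), self.correction(d1), e4
```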
Camera Parameters Estimation CNN Architecture
The left and right high-level feature maps (512 channels each) produced by the Siamese depth estimation CNN are concatenated and passed through conv3x3 + ELU layers (256 and 128 channels), average pooling, a 256-unit fully-connected ELU layer, and a final 2-unit fully-connected layer with sigmoid/tanh activations, yielding the gain g and bias b that convert the left and right inverse depth maps into disparity maps as on slide 8.
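A sketch of this head in PyTorch; the layer widths follow the slide, while kernel details, pooling placement, and the mapping of sigmoid vs. tanh to gain vs. bias are assumptions:

```python
import torch
import torch.nn as nn

class CameraParamsNet(nn.Module):
    """Sketch of the slide-11 camera parameters head (illustrative)."""
    def __init__(self, feat_channels=512):
        super().__init__()
        self.convs = nn.Sequential(
            # Concatenated left + right high-level feature maps.
            nn.Conv2d(2 * feat_channels, 256, 3, padding=1), nn.ELU(),
            nn.Conv2d(256, 128, 3, padding=1), nn.ELU())
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.fc = nn.Sequential(
            nn.Linear(128, 256), nn.ELU(),
            nn.Linear(256, 2))               # -> (gain, bias) logits

    def forward(self, feat_left, feat_right):
        x = self.convs(torch.cat([feat_left, feat_right], dim=1))
        x = self.pool(x).flatten(1)
        g_logit, b_logit = self.fc(x).unbind(dim=1)
        # Slide 11 lists sigmoid/tanh output activations: here sigmoid keeps
        # the gain positive and tanh keeps the bias bounded (an assumption
        # about which activation belongs to which parameter).
        gain = torch.sigmoid(g_logit).view(-1, 1, 1, 1)
        bias = torch.tanh(b_logit).view(-1, 1, 1, 1)
        return gain, bias
```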
Training Datasets
- Hybrid city driving dataset: Cityscapes + KITTI (CS+K)
- Stereo Movies dataset (SM)
Quantitative Results on KITTI Dataset (Trained on CS+K)
Comparison table against unsupervised and (semi-)supervised methods (see slide graphics).
Quantitative Results on DIW Dataset (Trained on SM)
WHDR = Weighted Human Disagreement Rate (results table in the slide graphics).
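For reference, a sketch of how WHDR can be computed from ordinal point-pair annotations like DIW's; the tensor layout and the optional per-pair confidence weights are assumptions:

```python
import torch

def whdr(pred_depth, point_a, point_b, human_order, weights=None):
    """Weighted Human Disagreement Rate (slide 14), a sketch.

    pred_depth:  [H, W] predicted depth map
    point_a/b:   [N, 2] long tensors of (row, col) annotated point pairs
    human_order: [N] +1 if humans judged A closer than B, -1 otherwise
    weights:     [N] optional per-pair confidence weights
    """
    da = pred_depth[point_a[:, 0], point_a[:, 1]]
    db = pred_depth[point_b[:, 0], point_b[:, 1]]
    # Predicted ordering: +1 means A is closer (smaller depth) than B.
    pred_order = torch.where(da < db, torch.ones_like(da), -torch.ones_like(da))
    disagree = (pred_order != human_order.float()).float()
    if weights is None:
        weights = torch.ones_like(disagree)
    return (disagree * weights).sum() / weights.sum()
```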
Test of Robustness to Training Dataset Diversity
CS→K: training on Cityscapes, then fine-tuning on KITTI
CS+K: training on a mixture of Cityscapes and KITTI
Camera Parameters Predicted for KITTI/Cityscapes
Qualitative Results on KITTI/Cityscapes Datasets
Figure: for KITTI and Cityscapes examples, the input image, the correction map, the depth map by our method, and the depth map by Godard et al. [14].
Qualitative Results for Uncontrolled Street Views (CS+K)
Figure: input images with depth maps by our method and by Godard et al. [14]. Our model was trained on the CS+K dataset.
Qualitative Results for Uncontrolled People Images (SM)
Figure: input images and depth maps by our method.
Conclusion
- State-of-the-art accuracy among unsupervised monocular depth estimation methods
- Robustness to dataset diversity and variability allows efficient training on hybrid datasets combining data from different stereo rigs
- The smallest CNN model size among the high-accuracy methods
- Relaxed requirements for training data allow easy, routine gathering of large and representative datasets
Thank you