Deep Models for 3D Reconstruction
Andreas Geiger
Autonomous Vision Group, MPI for Intelligent Systems, Tübingen
Computer Vision and Geometry Group, ETH Zürich
October 12, 2017
3D Reconstruction [Furukawa & Hernandez: Multi-View Stereo: A Tutorial]
Task:
◮ Given a set of 2D images
◮ Reconstruct the 3D shape of the object/scene
3D Reconstruction Pipeline
Input Images → Camera Poses → Dense Correspondences → Depth Maps → Depth Map Fusion → 3D Reconstruction
Large 3D Datasets and Repositories
[Newcombe et al., 2011] [Choi et al., 2011] [Dai et al., 2017] [Wu et al., 2015] [Chang et al., 2015] [Chang et al., 2017]
Can we learn 3D Reconstruction from Data?
OctNet: Learning Deep 3D Representations at High Resolutions [Riegler, Ulusoy, & Geiger, CVPR 2017]
Deep Learning in 2D [LeCun, 1998]
Deep Learning in 3D
◮ Existing 3D networks limited to ∼32³ voxels
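To see why dense 3D networks stall around 32³, a quick back-of-the-envelope memory estimate helps (a sketch; the 32-channel count and float32 dtype are illustrative assumptions, not values from the slides):

```python
# Rough memory estimate for a single dense 3D activation tensor (float32).
# The channel count is an illustrative assumption; cubic growth is the point.
def dense_grid_bytes(resolution, channels, dtype_bytes=4):
    """Bytes needed for one dense activation of shape (channels, R, R, R)."""
    return channels * resolution ** 3 * dtype_bytes

for r in (32, 64, 128, 256):
    mib = dense_grid_bytes(r, channels=32) / 1024 ** 2
    print(f"{r}^3 grid, 32 channels: {mib:.0f} MiB")  # 4, 32, 256, 2048 MiB
```

Doubling the resolution multiplies memory by 8, and a network stores many such tensors, which is what motivates the sparse octree representation next.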
3D Data is often Sparse [Geiger et al., 2012] [Li et al., 2016]
Can we exploit sparsity for efficient deep learning?
Network Activations
Layer 1: 32³, Layer 2: 16³, Layer 3: 8³
Idea:
◮ Partition space adaptively based on the sparse input
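The adaptive partitioning idea can be sketched as a tiny octree builder that only subdivides mixed (partially occupied) cells; a minimal illustration, not the OctNet implementation, and the recursion depth is an assumed parameter:

```python
import numpy as np

def build_octree(occ, depth=0, max_depth=3):
    """Recursively partition a binary occupancy cube: keep a cell as a
    single leaf if it is entirely empty or full, else split it into its
    8 octants. Returns a nested structure mirroring the adaptive grid."""
    if occ.all() or not occ.any() or depth == max_depth:
        return float(occ.mean())  # leaf: occupancy fraction of the cell
    h = occ.shape[0] // 2
    return [build_octree(occ[ix*h:(ix+1)*h, iy*h:(iy+1)*h, iz*h:(iz+1)*h],
                         depth + 1, max_depth)
            for ix in range(2) for iy in range(2) for iz in range(2)]

# Example: a sparse 8^3 volume with a single occupied corner voxel.
vol = np.zeros((8, 8, 8), dtype=bool)
vol[0, 0, 0] = True
tree = build_octree(vol)
```

In this example 7 of the 8 top-level octants collapse to a single empty leaf each, so memory and computation concentrate on the occupied corner.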
Convolution
Example 3×3 kernel:
0.125 0.250 0.125
0.000 0.000 0.000
0.125 0.250 0.125
◮ Differentiable ⇒ allows for end-to-end learning
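As a reference for the operation being visualized, here is a naive dense 2D convolution with the kernel from the slide (a sketch of the dense case only; OctNet evaluates the same operation on octree cells instead of every grid location):

```python
import numpy as np

# The 3x3 kernel shown on the slide (entries sum to 1).
kernel = np.array([[0.125, 0.250, 0.125],
                   [0.000, 0.000, 0.000],
                   [0.125, 0.250, 0.125]])

def conv2d(image, k):
    """Naive 'valid' 2D convolution (cross-correlation form): slide the
    kernel over the image and take the weighted sum at each position."""
    kh, kw = k.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * k)
    return out

# On a constant image, a kernel that sums to 1 reproduces the constant:
smoothed = conv2d(np.ones((5, 5)), kernel)
```

Every step is a weighted sum, hence differentiable with respect to both the input and the kernel, which is what permits end-to-end learning.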
Efficient Convolution
This operation can be implemented very efficiently:
◮ 4 different cases
◮ The first case requires only 1 evaluation!
Pooling
◮ Unpooling operation defined similarly
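A dense analogue of the pooling and unpooling pair can be sketched as follows (illustrative only; in OctNet these merge and split octree cells rather than operating on a dense grid):

```python
import numpy as np

def pool2(x):
    """2x2x2 max pooling on a dense grid: group each 2x2x2 block of
    voxels and keep the maximum, halving every spatial dimension."""
    r = x.shape[0] // 2
    return x.reshape(r, 2, r, 2, r, 2).max(axis=(1, 3, 5))

def unpool2(x):
    """Nearest-neighbour unpooling: copy each cell into its 8 children,
    doubling every spatial dimension."""
    return x.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)

x = np.arange(64, dtype=float).reshape(4, 4, 4)
pooled = pool2(x)       # shape (2, 2, 2)
restored = unpool2(pooled)  # shape (4, 4, 4), values block-constant
```

The octree versions merge eight sibling cells into their parent (pooling) and split a parent back into eight children (unpooling), mirroring these dense operations.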
Results: 3D Shape Classification
Network: [Convolution and Pooling] ×2 → [Fully Connected] ×2 → class label (e.g., Airplane)
Results: 3D Shape Classification
(Figures: Memory [GB] and Runtime [s] vs. input resolution, 8³ to 256³, OctNet vs. DenseNet)
Results: 3D Shape Classification
(Figures: Accuracy vs. input resolution, 8³ to 256³, OctNet vs. DenseNet; OctNet 1/2/3)
◮ Input: voxelized meshes from ModelNet
Results: 3D Semantic Labeling
(Figure: Input / Prediction)
◮ Dataset: RueMonge2014
Results: 3D Semantic Labeling
Encoder-decoder: [Convolution and Pooling] ×2 → [Unpooling and Conv.] ×2, with skip connections
◮ Decoder octree structure copied from encoder
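The encoder-decoder wiring can be illustrated at the level of tensor shapes (a dense sketch with the convolutions omitted; the stage count and channel sizes are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def pool(x):
    """Halve spatial dims of a (R, R, R, C) tensor by 2x2x2 max pooling."""
    r = x.shape[0] // 2
    return x.reshape(r, 2, r, 2, r, 2, -1).max(axis=(1, 3, 5))

def unpool(x):
    """Double spatial dims by copying each cell into its 8 children."""
    return x.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)

def encoder_decoder(x):
    """Shape-level sketch of the labeling network: two pooling stages down,
    two unpooling stages up, with the matching encoder features
    concatenated back in at each resolution (the skip connections)."""
    skip1, e1 = x, pool(x)            # full res -> 1/2 res
    skip2, e2 = e1, pool(e1)          # 1/2 res  -> 1/4 res (bottleneck)
    d2 = np.concatenate([unpool(e2), skip2], axis=-1)   # back to 1/2 res
    d1 = np.concatenate([unpool(d2), skip1], axis=-1)   # back to full res
    return d1

features = encoder_decoder(np.ones((8, 8, 8, 4)))
```

The skips restore fine spatial detail that pooling discards; copying the encoder's octree structure into the decoder plays the same role for the cell layout.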
Results: 3D Semantic Labeling
                                  IoU
[Riemenschneider et al., 2014]   42.3
[Martinovic et al., 2015]        52.2
[Gadde et al., 2016]             54.4
OctNet 64³                       45.6
OctNet 128³                      50.4
OctNet 256³                      59.2
OctNetFusion: Learning Depth Fusion from Data [Riegler, Ulusoy, Bischof & Geiger, 3DV 2017]
Volumetric Fusion
$d_{i+1}(p) = \frac{w_i(p)\, d_i(p) + \hat{w}(p)\, \hat{d}(p)}{w_i(p) + \hat{w}(p)}$
$w_{i+1}(p) = w_i(p) + \hat{w}(p)$
◮ $p \in \mathbb{R}^3$: voxel location
◮ $d$: distance, $w$: weight
[Curless and Levoy, SIGGRAPH 1996]
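The update is straightforward to implement per voxel; a minimal sketch (the function name and the small guard against division by zero are my own additions):

```python
import numpy as np

def fuse(d, w, d_new, w_new):
    """One volumetric-fusion step (Curless and Levoy): keep a running
    weighted average of the signed distance and accumulate the weights."""
    w_out = w + w_new
    d_out = np.where(w_out > 0,
                     (w * d + w_new * d_new) / np.maximum(w_out, 1e-9),
                     d)
    return d_out, w_out

# Two equally weighted observations of distances 1 and 3 average to 2:
d, w = fuse(np.array([1.0]), np.array([1.0]),
            np.array([3.0]), np.array([1.0]))
```

Averaging suppresses zero-mean noise as more views arrive, which is exactly why the method needs many redundant views and cannot handle outliers or fill unobserved surfaces.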
Volumetric Fusion
◮ Pros:
◮ Simple, fast, easy to implement
◮ De facto "gold standard" (KinectFusion, Voxel Hashing, ...)
◮ Cons:
◮ Requires many redundant views to reduce noise
◮ Can't handle outliers / complete missing surfaces
(Figure: Ground Truth | Volumetric Fusion)
TV-L1 Fusion
◮ Pros:
◮ Prior on surface area
◮ Noise reduction
◮ Cons:
◮ Simplistic local prior (penalizes surface area, shrinking bias)
◮ Can't complete missing surfaces
(Figure: Ground Truth | Volumetric Fusion | TV-L1 Fusion)
Learned Fusion
◮ Pros:
◮ Learn noise suppression from data
◮ Learn surface completion from data
◮ Cons:
◮ Requires large 3D datasets for training
◮ How to scale to high resolutions?
(Figure: Ground Truth | Volumetric Fusion | TV-L1 Fusion | OctNetFusion)
Learning 3D Fusion
Encoder-decoder: [Convolution and Pooling] ×2 → [Unpooling and Conv.] ×2, with skip connections
Input representation: ◮ TSDF ◮ Higher-order statistics
Output representation: ◮ Occupancy ◮ TSDF
Learning 3D Fusion
What is the problem?
◮ Octree structure unknown ⇒ needs to be inferred as well!
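One way to picture inferring the structure is a coarse-to-fine split rule: predict occupancy at a coarse resolution, then subdivide only the cells predicted occupied (a hypothetical sketch; the function and the threshold value are illustrative, not the paper's exact mechanism):

```python
import numpy as np

def refine_structure(occ_prob, threshold=0.5):
    """Coarse-to-fine structure inference sketch: cells whose predicted
    occupancy exceeds the threshold are split into 8 children and
    processed at the next, finer resolution; the rest stay coarse."""
    split = occ_prob > threshold
    # Each coarse cell maps onto a 2x2x2 block of fine cells.
    return split.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)

coarse = np.zeros((2, 2, 2))
coarse[0, 0, 0] = 0.9           # one cell predicted occupied
fine_mask = refine_structure(coarse)  # which fine cells the next stage keeps
```

Chaining such a rule across resolutions lets each stage spend computation only where the previous, coarser stage found surface evidence.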
OctNetFusion Architecture
(Figure: three coarse-to-fine stages at 64³, 128³ and 256³; each stage maps its input to an output Δ at that resolution and passes octree structure and features on to the next, finer stage)
Results: Surface Reconstruction
(Figure: VolFus | TV-L1 | Ours | Ground Truth, at 64³, 128³ and 256³)
Results: Volumetric Completion
(Figure: [Firman, 2016] | Ours | Ground Truth)
Thank you!