3D Shape Representations (overview figure: Depth Map, Voxel Grid, Implicit Surface, Pointcloud, Mesh — next: Pointcloud)
3D Shape Representations: Point Cloud
• Represent shape as a set of P points in 3D space
• (+) Can represent fine structures without huge numbers of points
• (-) Requires new architectures, losses, etc.
• (-) Doesn't explicitly represent the surface of the shape: extracting a mesh for rendering or other applications requires post-processing
Fan et al, "A Point Set Generation Network for 3D Object Reconstruction from a Single Image", CVPR 2017
Processing Pointcloud Inputs: PointNet
Want to process pointclouds as sets: order should not matter.
Input pointcloud (P x 3) → Run MLP on each point → Point features (P x D) → Max-Pool → Pooled vector (D) → Fully Connected → Class score (C)
Qi et al, "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation", CVPR 2017
Qi et al, "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space", NeurIPS 2017
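To make the set-processing idea concrete, here is a minimal PointNet-style classifier sketch in PyTorch; layer sizes and depth are illustrative assumptions, not the exact architecture from Qi et al:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes=10, feat_dim=128):
        super().__init__()
        # Shared MLP applied independently to every point
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, points):            # points: (B, P, 3)
        feats = self.point_mlp(points)    # (B, P, D) per-point features
        pooled = feats.max(dim=1).values  # (B, D): symmetric max-pool, so point order doesn't matter
        return self.classifier(pooled)    # (B, C) class scores

scores = TinyPointNet()(torch.randn(2, 1024, 3))  # permuting the P points leaves scores unchanged
```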
Generating Pointcloud Outputs
Input image (3 x H x W) → 2D CNN → Features (C x H' x W'), then two branches:
- Fully connected branch → Points (P1 x 3)
- Convolutional branch → Points ((P2 x 3) x H' x W')
Final pointcloud: (P1 + H'W'P2) x 3
Fan et al, "A Point Set Generation Network for 3D Object Reconstruction from a Single Image", CVPR 2017
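A hedged sketch of this two-branch generator; the encoder and all layer sizes are illustrative assumptions, not Fan et al's exact configuration:

```python
import torch
import torch.nn as nn

class PointSetGenerator(nn.Module):
    def __init__(self, p1=256, p2=4):
        super().__init__()
        self.encoder = nn.Sequential(                 # 3 x H x W -> C x H' x W'
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),
        )
        self.fc_branch = nn.Linear(64, p1 * 3)        # global branch: P1 points
        self.conv_branch = nn.Conv2d(64, p2 * 3, 1)   # P2 points per feature location
        self.p1, self.p2 = p1, p2

    def forward(self, img):                           # img: (B, 3, H, W)
        f = self.encoder(img)                         # (B, 64, H', W')
        B, _, Hp, Wp = f.shape
        pts_fc = self.fc_branch(f.mean(dim=(2, 3))).view(B, self.p1, 3)
        pts_conv = self.conv_branch(f).view(B, self.p2 * 3, Hp * Wp)
        pts_conv = pts_conv.permute(0, 2, 1).reshape(B, -1, 3)  # (B, H'W'P2, 3)
        return torch.cat([pts_fc, pts_conv], dim=1)   # (B, P1 + H'W'P2, 3)

cloud = PointSetGenerator()(torch.randn(2, 3, 64, 64))
```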
Predicting Point Clouds: Loss Function
We need a (differentiable) way to compare pointclouds as sets!
Chamfer distance is the sum of the L2 distance to each point's nearest neighbor in the other set:
d_CD(S1, S2) = Σ_{x∈S1} min_{y∈S2} ||x − y||² + Σ_{y∈S2} min_{x∈S1} ||x − y||²
Fan et al, "A Point Set Generation Network for 3D Object Reconstruction from a Single Image", CVPR 2017
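A minimal Chamfer distance in PyTorch, following the definition above (squared distances, as in Fan et al; drop the .pow(2) for plain L2). It materializes the full pairwise distance matrix, so it is a sketch for modest point counts only:

```python
import torch

def chamfer_distance(x, y):
    # x: (B, N, 3) predicted points; y: (B, M, 3) ground-truth points
    d = torch.cdist(x, y)                             # (B, N, M) pairwise L2 distances
    loss_xy = d.min(dim=2).values.pow(2).sum(dim=1)   # each x-point to its nearest y-point
    loss_yx = d.min(dim=1).values.pow(2).sum(dim=1)   # each y-point to its nearest x-point
    return (loss_xy + loss_yx).mean()                 # average over the batch
```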
3D Shape Representations (overview figure: Depth Map, Voxel Grid, Implicit Surface, Pointcloud, Mesh — next: Mesh)
3D Shape Representations: Triangle Mesh
Represent a 3D shape as a set of triangles.
Vertices: set of V points in 3D space
Faces: set of triangles over the vertices
(+) Standard representation for graphics
(+) Explicitly represents 3D shapes
(+) Adaptive: can represent flat surfaces very efficiently, can allocate more faces to areas with fine detail
(+) Can attach data on vertices and interpolate over the whole surface: RGB colors, texture coordinates, normal vectors, etc.
(-) Nontrivial to process with neural nets!
Dolphin image is in the public domain. UV mapping figure is licensed under CC BY-SA 3.0; figure slightly reorganized.
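As a concrete toy example of the representation itself: a mesh is just a float tensor of vertex positions plus an integer tensor of faces indexing into it. Here, a single tetrahedron:

```python
import torch

verts = torch.tensor([[0., 0., 0.],
                      [1., 0., 0.],
                      [0., 1., 0.],
                      [0., 0., 1.]])   # (V, 3) vertex positions
faces = torch.tensor([[0, 1, 2],
                      [0, 1, 3],
                      [0, 2, 3],
                      [1, 2, 3]])      # (F, 3) vertex indices per triangle

# Per-vertex data (colors, normals, UVs) is stored alongside and
# interpolated across each face, e.g. per-vertex RGB:
colors = torch.rand(verts.shape[0], 3)
```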
Predicting Meshes: Pixel2Mesh
Input: single RGB image of an object. Output: triangle mesh for the object.
Key ideas: Iterative Refinement, Graph Convolution, Vertex-Aligned Features, Chamfer Loss Function
Wang et al, "Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images", ECCV 2018
Predicting Triangle Meshes: Iterative Refinement
Idea #1: Iterative mesh refinement. Start from an initial ellipsoid mesh; the network predicts offsets for each vertex; repeat.
Wang et al, "Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images", ECCV 2018
Predicting Triangle Meshes: Graph Convolution
Input: graph with a feature vector at each vertex. Output: new feature vector for each vertex.
Vertex v_i has feature f_i. The new feature f'_i depends on the features of the neighboring vertices N(i):
f'_i = W_0 f_i + Σ_{j ∈ N(i)} W_1 f_j
The same weights W_0 and W_1 are used to compute all outputs.
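A sketch of one layer implementing exactly this update rule (Pixel2Mesh's actual layer may differ in normalization and nonlinearity):

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w0 = nn.Linear(in_dim, out_dim, bias=False)  # weight on the vertex itself
        self.w1 = nn.Linear(in_dim, out_dim, bias=False)  # weight on its neighbors

    def forward(self, feats, edges):
        # feats: (V, D) per-vertex features
        # edges: (E, 2) index pairs; include both (i, j) and (j, i) for mesh edges
        src, dst = edges[:, 0], edges[:, 1]
        msg = self.w1(feats)                  # transform all vertices once
        agg = torch.zeros_like(msg)
        agg.index_add_(0, dst, msg[src])      # sum messages over j in N(i)
        return torch.relu(self.w0(feats) + agg)
```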
Predicting Triangle Meshes: Graph Convolution
Each of these blocks consists of a stack of graph convolution layers operating on edges of the mesh.
Problem: How to incorporate image features?
Wang et al, "Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images", ECCV 2018
Predicting Triangle Meshes: Vertex-Aligned Features
Idea #2: Aligned vertex features. For each vertex of the mesh:
- Use camera information to project the vertex onto the image plane
- Use bilinear interpolation to sample a CNN feature (e.g. a vertex projecting to (6.5, 5.8) interpolates the features f_{6,5}, f_{7,5}, f_{6,6}, f_{7,6})
Similar to the RoI-Align operation from last time: maintains alignment between the input image and the feature vectors.
Wang et al, "Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images", ECCV 2018
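A hedged sketch of this sampling step: project vertices with an assumed pinhole camera whose image plane is already normalized to [-1, 1], then let grid_sample do the bilinear interpolation. The helper name vert_align and the projection convention are ours, not the paper's:

```python
import torch
import torch.nn.functional as F

def vert_align(feat, verts, focal=1.0):
    # feat: (B, C, H', W') CNN features; verts: (B, V, 3) in camera coordinates
    x = focal * verts[..., 0] / verts[..., 2]   # perspective projection; assumes z > 0
    y = focal * verts[..., 1] / verts[..., 2]   # and coordinates normalized to [-1, 1]
    grid = torch.stack([x, y], dim=-1).unsqueeze(2)            # (B, V, 1, 2)
    samp = F.grid_sample(feat, grid, mode='bilinear',
                         align_corners=False)                  # (B, C, V, 1)
    return samp.squeeze(3).permute(0, 2, 1)                    # (B, V, C): one feature per vertex
```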
Predicting Meshes: Loss Function
The same shape can be represented with different meshes – how can we define a loss between the predicted and ground-truth mesh?
Idea: Convert meshes to pointclouds, then compute the loss. Sample points from the surface of the ground-truth mesh (offline).
Loss = Chamfer distance between the predicted vertices and the ground-truth samples.
Wang et al, "Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images", ECCV 2018
Problem: Using only the predicted vertices doesn't take the interior of the predicted faces into account!
Fix: Loss = Chamfer distance between predicted samples and ground-truth samples – also sample points from the surface of the predicted mesh (online!).
Problem: Need to sample online, so sampling must be efficient! Problem: Need to backprop through sampling!
Smith et al, "GEOMetrics: Exploiting Geometric Structure for Graph-Encoded Objects", ICML 2019
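One common recipe for differentiable surface sampling (a simplification; GEOMetrics' exact estimator differs): pick faces with probability proportional to their area, then take random barycentric combinations of their vertices. Gradients flow to the vertex positions through the barycentric average, though not through the discrete face choice:

```python
import torch

def sample_points(verts, faces, n):
    # verts: (V, 3) float; faces: (F, 3) long
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    areas = torch.cross(v1 - v0, v2 - v0, dim=1).norm(dim=1) / 2
    idx = torch.multinomial(areas, n, replacement=True)   # area-weighted face choice
    u, v = torch.rand(n, 1), torch.rand(n, 1)
    flip = (u + v) > 1                                    # fold the unit square into the triangle
    u = torch.where(flip, 1 - u, u)
    v = torch.where(flip, 1 - v, v)
    w = 1 - u - v
    return w * v0[idx] + u * v1[idx] + v * v2[idx]        # (n, 3) points on the surface
```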
3D Shape Representations (overview figure: Depth Map, Voxel Grid, Implicit Surface, Pointcloud, Mesh)
3D Shape Prediction: Shape Representations · Metrics · Camera Systems · Datasets (next: Metrics)
Shape Comparison Metrics: Intersection over Union
In 2D, we evaluate boxes and segmentation masks with intersection over union (IoU). In 3D: Voxel IoU.
Problem: Cannot capture thin structures.
Problem: Cannot be applied to pointclouds.
Problem: For meshes, need to voxelize or sample.
Problem: Not very meaningful at low values!
Figure credits: Alexander Kirillov; Tatarchenko et al, "What Do Single-view 3D Reconstruction Networks Learn?", CVPR 2019
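Voxel IoU itself is a one-liner over binary occupancy grids:

```python
import torch

def voxel_iou(a, b):
    # a, b: boolean occupancy grids of the same shape
    inter = (a & b).sum().float()   # voxels occupied in both
    union = (a | b).sum().float()   # voxels occupied in either
    return (inter / union).item()
```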
State-of-the-art methods achieve low voxel IoU (results from Mescheder et al, "Occupancy Networks: Learning 3D Reconstruction in Function Space", CVPR 2019):
3D-R2N2 (voxels): 0.493 | Pixel2Mesh (mesh): 0.480 | OccNet (implicit): 0.571
Conclusion: Voxel IoU is not a good metric.
Figure credit: Tatarchenko et al, "What Do Single-view 3D Reconstruction Networks Learn?", CVPR 2019
Shape Comparison Metrics: Chamfer Distance
We've already seen another shape comparison metric: Chamfer distance.
1. Convert your prediction and the ground-truth into pointclouds via sampling
2. Compare with Chamfer distance
Problem: Chamfer is very sensitive to outliers.
Figure credit: Tatarchenko et al, "What Do Single-view 3D Reconstruction Networks Learn?", CVPR 2019
Shape Comparison Metrics: F1 Score
Similar to Chamfer: sample points from the surface of the prediction and the ground-truth.
Precision@t = fraction of predicted points within t of some ground-truth point
Recall@t = fraction of ground-truth points within t of some predicted point
F1@t = 2 · Precision@t · Recall@t / (Precision@t + Recall@t)
Example from the figure: Precision@t = 3/4 and Recall@t = 2/3, so F1@t = 2 · (3/4) · (2/3) / (3/4 + 2/3) = 12/17 ≅ 0.70
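A direct implementation of these definitions over two sampled point clouds:

```python
import torch

def f1_score(pred, gt, t):
    # pred: (N, 3) and gt: (M, 3) points sampled from the two surfaces
    d = torch.cdist(pred, gt)                              # (N, M) pairwise distances
    precision = (d.min(dim=1).values < t).float().mean()   # predicted points near gt
    recall = (d.min(dim=0).values < t).float().mean()      # gt points near prediction
    return (2 * precision * recall / (precision + recall + 1e-8)).item()
```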
F1 score is robust to outliers!
Conclusion: F1 score is probably the best shape prediction metric in common use.
Figure credit: Tatarchenko et al, "What Do Single-view 3D Reconstruction Networks Learn?", CVPR 2019
Shape Comparison Metrics: Summary
Intersection over Union: doesn't capture fine structure, not meaningful at low values
Chamfer Distance: very sensitive to outliers, but can be directly optimized
F1 score: robust to outliers, but need to look at different threshold values to capture details at different scales (the figure shows two quite different reconstructions that both score F1@1% = 0.56)
3D Shape Prediction: Shape Representations · Metrics · Camera Systems · Datasets (next: Camera Systems)
Cameras: Canonical vs View Coordinates
Canonical Coordinates: predict the 3D shape in a canonical coordinate system (e.g. the front of a chair is +z) regardless of the viewpoint of the input image.
View Coordinates: predict the 3D shape aligned to the viewpoint of the camera.
Many papers predict in canonical coordinates – easier to load data.
Problem: Canonical coordinates break the "principle of feature alignment": predictions should be aligned to inputs. View coordinates maintain alignment between inputs and predictions!
Problem: Canonical coordinates overfit to the training shapes: better generalization to new views of known shapes, but worse generalization to new shapes or new categories.
Voxel IoU from Shin et al — Novel view: Canonical 0.902 vs View 0.714; Novel model: View 0.570 vs Canonical 0.517; Novel category: View 0.474 vs Canonical 0.309
Conclusion: Prefer the view coordinate system.
Shin et al, "Pixels, voxels, and views: A study of shape representations for single view 3D object shape prediction", CVPR 2018
View-Centric Voxel Predictions
View-centric predictions take the perspective camera into account, so our "voxels" are actually frustums.
Gkioxari, Malik, and Johnson, "Mesh R-CNN", ICCV 2019
3D Shape Prediction: Shape Representations · Metrics · Camera Systems · Datasets (next: Datasets)
3D Datasets: Object-Centric
ShapeNet: ~50 categories, ~50k 3D CAD models. The standard split has 13 categories, ~44k models, and 25 rendered images per model. Many papers show results here.
(-) Synthetic, isolated objects; no context
(-) Lots of chairs, cars, airplanes
Pix3D: 9 categories, 219 3D models of IKEA furniture aligned to ~17k real images. Some papers train on ShapeNet and show qualitative results here, but use ground-truth segmentation masks.
(+) Real images! Context!
(-) Small, partial annotations – only 1 object/image
Chang et al, "ShapeNet: An Information-Rich 3D Model Repository", arXiv 2015
Choy et al, "3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction", ECCV 2016
Sun et al, "Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling", CVPR 2018
3D Shape Prediction: Mesh R-CNN
Mask R-CNN: 2D image → 2D shapes (He, Gkioxari, Dollár, and Girshick, "Mask R-CNN", ICCV 2017)
Mesh R-CNN: 2D image → triangle meshes (Gkioxari, Malik, and Johnson, "Mesh R-CNN", ICCV 2019)
Mesh R-CNN: Task
Input: single RGB image. Output: a set of detected objects. For each object:
- Bounding box (Mask R-CNN)
- Category label (Mask R-CNN)
- Instance segmentation (Mask R-CNN)
- 3D triangle mesh (mesh head)
Mesh R-CNN: Hybrid 3D Shape Representation
Mesh deformation gives good results, but the topology (vertices, faces, genus, number of connected components) is fixed by the initial mesh.
Our approach: use voxel predictions to create the initial mesh prediction!
Mesh R-CNN Pipeline
Input image → 2D object recognition → 3D object voxels → 3D object meshes
Mesh R-CNN: ShapeNet Results
Mesh R-CNN: Shape Regularizers
Using Chamfer as the only mesh loss gives degenerate meshes. We also need a "mesh regularizer" to encourage nice predictions:
L_edge = minimize the L2 norm of the edges in the predicted mesh
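A sketch of such an edge regularizer (Mesh R-CNN's exact weighting and normalization may differ): penalizing edge lengths keeps Chamfer alone from producing spiky, degenerate meshes.

```python
import torch

def edge_loss(verts, edges):
    # verts: (V, 3) vertex positions; edges: (E, 2) long tensor of vertex index pairs
    d = verts[edges[:, 0]] - verts[edges[:, 1]]   # per-edge difference vectors
    return d.pow(2).sum(dim=1).mean()             # mean squared edge length
```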
Mesh R-CNN: Pix3D Results
Predicting many objects per scene: box & mask predictions alongside mesh predictions.