
Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views - PowerPoint PPT Presentation



  1. Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views. Hao Su*, Charles R. Qi*, Yangyan Li, Leonidas J. Guibas

  2. ILSVRC Image Classification Top-5 Error (%): 28.2 (2010), 25.8 (2011), 16.4 (2012), 11.7 (2013), 6.7 (2014), 3.6 (2015)

  3. Go beyond 2D Image Classification (example: car) • 3D bounding box • 3D alignment • 3D model retrieval

  4. Go beyond 2D Image Classification: 3D Viewpoint Estimation (azimuth, elevation, in-plane rotation)

  5. 3D Viewpoint Estimation in the Wild: images in the wild, models unknown

  6. 3D Perception in the Wild: Learn from Data. Images in the wild, models unknown. AlexNet [Krizhevsky et al.]

  7. However... Accurate Label Acquisition is Expensive. What are the camera viewpoint angles of the SUV in the image?

  8. However... Accurate Label Acquisition is Expensive. PASCAL3D+ dataset [Xiang et al.]

  9. However... Accurate Label Acquisition is Expensive. PASCAL3D+ dataset [Xiang et al.] Step 1: Choose a similar model

  10. However... Accurate Label Acquisition is Expensive. PASCAL3D+ dataset [Xiang et al.] Step 1: Choose a similar model. Step 2: Coarse viewpoint labeling

  11. However... Accurate Label Acquisition is Expensive. PASCAL3D+ dataset [Xiang et al.] Step 1: Choose a similar model. Step 2: Coarse viewpoint labeling. Step 3: Label keypoints for alignment. Annotation takes ~1 min per object

  12. High-capacity Model, High-cost Label Acquisition. AlexNet [Krizhevsky et al.]: 60M parameters. PASCAL3D+ dataset [Xiang et al.]: 30K images with viewpoint labels. How to get MORE images with ACCURATE viewpoint labels?

  13. Manual alignment by annotators vs. auto alignment through rendering

  14. Good News: ShapeNet. 3M models in total; 330K models from 4K categories annotated. Growth in shape datasets (#models per class vs. #total models): PSB ’05, SHREC ’14, ModelNet ’15, ShapeNet (ongoing). http://shapenet.cs.stanford.edu

  15. Key Idea: Render for CNN. Training: ShapeNet -> rendering -> synthetic images -> viewpoint

  16. Key Idea: Render for CNN. Testing: real images -> viewpoint

  17. I want data! How to render data with both quantity and quality?

  18. Synthesize: Scalability vs. Quality. Axes: quality (low to high) vs. scalability (low to high); the ideal is high on both.


  20. Synthesize: Scalability vs. Quality. Previous works: high quality but low scalability.

  21. Synthesize: Scalability vs. Quality. Target: the sweet spot.

  22. Synthesize: Scalability vs. Quality. Story time!

  23. A “Data Engineering” Journey. Setup: 80K rendered chair images; metric: 16-view classification accuracy, tested on real images. At the beginning: lighting from 4 fixed point light sources on the sphere; clean background.
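The 16-view metric above treats azimuth estimation as classification into 16 discrete bins. A minimal sketch of that quantization and accuracy computation (the exact bin boundaries used in the experiments are an assumption here; this version centers the bins on 0, 22.5, 45, ... degrees):

```python
import numpy as np

def azimuth_to_bin(azimuth_deg, num_bins=16):
    """Quantize an azimuth angle (degrees) into one of num_bins view classes.

    Bins are centered on 0, 22.5, 45, ... degrees, so azimuth 350 and
    azimuth 5 fall into the same class (both near 0).
    """
    bin_width = 360.0 / num_bins
    return int(((azimuth_deg + bin_width / 2) % 360.0) // bin_width)

def view_accuracy(pred_azimuths, gt_azimuths, num_bins=16):
    """Fraction of predictions whose bin matches the ground-truth bin."""
    pred = [azimuth_to_bin(a, num_bins) for a in pred_azimuths]
    gt = [azimuth_to_bin(a, num_bins) for a in gt_azimuths]
    return float(np.mean([p == g for p, g in zip(pred, gt)]))
```

Centering the bins (rather than starting the first bin at 0 degrees) keeps predictions near a canonical view from being split across two classes.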

  24. A “Data Engineering” Journey. 95% on the synthetic val set, 47% on the real test set. ConvNet: “Ah ha, I know! Viewpoint is just the brightness pattern!”


  26. A “Data Engineering” Journey. Randomize lighting: 47% -> 74%. ConvNet: “Hmm... viewpoint is not the brightness pattern. Maybe it’s the contour?”


  28. A “Data Engineering” Journey. Add backgrounds: 74% -> 86%. ConvNet: “It becomes really hard! Let me look more into the picture.”

  29. A “Data Engineering” Journey. Bounding-box cropping and texture: 86% -> 93%.

  30. A “Data Engineering” Journey. Bounding-box cropping and texture: 86% -> 93%. ConvNet: “The mapping becomes hard. I have to learn harder to get it right!” Key Lesson: Don’t give the CNN a chance to “cheat” - it’s very good at it. When there is no way to cheat, true learning starts.

  31. Render for CNN Image Synthesis Pipeline: 3D model -> rendering (sample lighting and camera params) -> add background (sample background image, alpha blending) -> crop (sample cropping params). Hyper-parameters are estimated from real images.

  32. Render for CNN Image Synthesis Pipeline: 3D model -> rendering (sample lighting and camera params).

  33. Rendering. Camera params: KDE from the PASCAL3D+ train set. Lighting params: randomly sampled - number of light sources, light distances, light energies, light positions, light types.
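The KDE-based camera-parameter sampling can be sketched with `scipy.stats.gaussian_kde`; the toy `train_params` values below are invented placeholders, not actual PASCAL3D+ statistics:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Invented placeholder statistics: (azimuth, elevation, distance) triples.
# In the actual pipeline these would come from PASCAL3D+ annotations.
train_params = np.array([
    [30.0, 10.0, 4.0],
    [45.0, 12.0, 5.0],
    [60.0, 8.0, 4.5],
    [90.0, 15.0, 6.0],
    [120.0, 5.0, 3.5],
]).T  # gaussian_kde expects shape (n_dims, n_samples)

kde = gaussian_kde(train_params)

def sample_camera_params(n):
    """Draw n (azimuth, elevation, distance) triples from the fitted density."""
    return kde.resample(n).T  # resample returns shape (n_dims, n)
```

Sampling from a density fit to real annotations, rather than uniformly, keeps the synthetic viewpoint distribution matched to the test distribution.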

  34. Render for CNN Image Synthesis Pipeline: 3D model -> rendering (sample lighting and camera params) -> add background (sample background image, alpha blending).

  35. Background Composition • Simple but effective! • Backgrounds randomly sampled from SUN397 dataset [Xiao et al.] • Alpha blending composition for natural boundaries
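Alpha-blending composition is the standard per-pixel mix of foreground and background weighted by the alpha channel; a minimal sketch, assuming float images with values in [0, 1]:

```python
import numpy as np

def alpha_blend(render_rgba, background_rgb):
    """Composite a rendered RGBA image over a background image.

    render_rgba: float array (H, W, 4), values in [0, 1]
    background_rgb: float array (H, W, 3), values in [0, 1]
    Returns the blended (H, W, 3) image.
    """
    rgb = render_rgba[..., :3]
    alpha = render_rgba[..., 3:4]  # keep the last axis for broadcasting
    return alpha * rgb + (1.0 - alpha) * background_rgb
```

The soft alpha values along the object silhouette are what give the “natural boundaries” mentioned on the slide, instead of hard cut-out edges.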

  36. Render for CNN Image Synthesis Pipeline: 3D model -> rendering (sample lighting and camera params) -> add background (sample background image, alpha blending) -> crop (sample cropping params).

  37. Image Cropping. Cropping patterns: KDE from the PASCAL3D+ train set.

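A cropping step of this kind might be sketched as below; the uniform jitter and scale ranges are illustrative stand-ins for the KDE-fit distributions described on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_crop(img_h, img_w, box):
    """Jitter a tight object bounding box into a training crop.

    box = (x0, y0, x1, y1) in pixels. The uniform jitter and scale ranges
    here are illustrative; the real pipeline samples cropping params from
    KDE-fit PASCAL3D+ statistics.
    """
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    scale = rng.uniform(1.0, 1.4)        # enlarge the box by up to 40%
    dx = rng.uniform(-0.1, 0.1) * w      # shift center by up to 10% of box size
    dy = rng.uniform(-0.1, 0.1) * h
    cx, cy = (x0 + x1) / 2 + dx, (y0 + y1) / 2 + dy
    nw, nh = w * scale, h * scale
    nx0 = int(max(0, cx - nw / 2))
    ny0 = int(max(0, cy - nh / 2))
    nx1 = int(min(img_w, cx + nw / 2))
    ny1 = int(min(img_h, cy + nh / 2))
    return nx0, ny0, nx1, ny1
```

Varying the crop makes the network robust to the imperfect bounding boxes produced by detectors at test time.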

  39. 2.4M Synthesized Images for 12 Categories • High scalability • High quality • Overfit-resistant • Accurate labels

  40. Results

  41. 3D Viewpoint Estimation. Evaluation metric: median angle error (lower is better). Real test images from the PASCAL3D+ dataset.

  42. 3D Viewpoint Estimation. Evaluation metrics: viewpoint accuracy and median angle error (lower is better). Our model trained on rendered images outperforms the state-of-the-art model trained on real images on PASCAL3D+. Real test images from the PASCAL3D+ dataset. [Bar chart: viewpoint median error, Vps&Kps (CVPR15) vs. RenderForCNN (ours).]
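The paper’s median angle error is a geodesic distance between full 3D rotations; as a simplified azimuth-only sketch of the same idea:

```python
import numpy as np

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def median_angle_error(pred_azimuths, gt_azimuths):
    """Median of per-sample angular errors (degrees); lower is better."""
    errors = [angle_diff(p, g) for p, g in zip(pred_azimuths, gt_azimuths)]
    return float(np.median(errors))
```

The median, rather than the mean, is used so that a few 180-degree flip failures do not dominate the score.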

  43. How many 3D models are necessary? 10 vs. 1000 models: a 20%+ accuracy difference. [Plot: accuracy vs. #models for one category, at 10, 91, 1000, and 6928 models.]

  44. 3D Viewpoint Estimation

  45. Azimuth Viewpoint Estimation. [Plots of estimated-view confidence over azimuth (0-360 degrees) against the ground-truth view for airplane, bicycle, boat, car, and motorbike.]

  46. Azimuth Viewpoint Estimation. [Plots of estimated-view confidence against the ground-truth view for chair, table, monitor, and sofa.]

  47. Failure Cases: sofa occluded by people; ambiguous car viewpoint; multiple cars; car occluded by motorbike; multiple chairs; ambiguous chair viewpoint.

  48. Limitations of Current Synthesis Pipeline • Modeling Occlusions? • Modeling Background Context? • Shape database augmentation by interpolation?

  49. Render for CNN - Beyond Viewpoint • 3D model retrieval • Joint embedding [Li et al., SIGGRAPH Asia 15] • Object detection • Segmentation • Intrinsic image decomposition • Controlled experiments for DL • Vision algorithm verification

  50. Conclusion. Images rendered from 3D models can be effectively used to train CNNs, especially for 3D tasks. State-of-the-art results have been achieved. Keys to success • Quantity: large-scale 3D model collection (ShapeNet) • Quality: overfit-resistant, scalable image synthesis pipeline. http://shapenet.cs.stanford.edu

  51. THE END THANK YOU!
