ICDSC 2019
Region Merging Driven by Deep Learning for RGB-D Segmentation and Labeling
U. Michieli, M. Camporese, A. Agiollo, G. Pagnutti, P. Zanuttigh
September 9th, 2019
2 Outline
• Semantic Segmentation
• Proposed Framework
• Pre-processing
• Over-segmentation and Classification
• Merging Phase
• Results
• Conclusions and Future Work
3 Semantic Segmentation
[Figure: example indoor scene with regions labeled wall, objects, furniture, floor]
• Segmentation + labeling (pixel-wise classification)
• Deep learning and consumer depth sensors
• Very useful for free navigation systems that need to explore their surroundings
5 Proposed Framework
6 Proposed Framework
AIM: propose a CNN for region merging that also refines the boundaries of shapes
Uses normalized cuts spectral clustering extended to RGB-D data → but it is biased toward regions of similar size
Hence a 2-step procedure:
• Initial over-segmentation to properly separate objects
• Region merging procedure to remove the over-segmentation
Framework derived from [1] but much faster and simpler
[1] G. Pagnutti, L. Minto, P. Zanuttigh, "Segmentation and Semantic Labeling of RGBD Data with Convolutional Neural Networks and Surface Fitting", IET Computer Vision, 2017
7 Framework of [1]
[Figure: block diagram of the pipeline of [1]; tensor sizes shown in the diagram: 320x240x6 and 160x120x6]
Pre-processing:
• Depth data → (x, y, z) point set → geometry vectors, weighted by 1/σ_g
• Normals computation → orientation vectors, weighted by 1/σ_n
• Color data → RGB to CIELab conversion → color vectors, weighted by 1/σ_c
Over-segmentation and classification:
• Normalized cuts spectral clustering
• Convolutional Neural Network (CNN) → segment descriptors
Merge phase:
• Compute similarity of adjacent segments
• Sort and discard segments below similarity threshold
• Select two segments to be joined
• NURBS fitting on segment 1, on segment 2 and on their union
• Surface fitting accuracy improved? Yes: keep union. No: discard union
CONs:
• NURBS fitting very slow
• Many hand-tuned thresholds (on depth, color, normals, NURBS fitting)
8 Proposed Framework
PROs:
• Much faster
• Fewer thresholds
• Same accuracy
9 Proposed Framework - Preprocessing
• 3 channels for 3D location
• 3 channels for surface normals
• 3 channels for color representation → CIELab for perceptual uniformity
• Normalization to achieve a consistent representation across the 3 domains
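
The 9-channel representation listed above can be sketched as follows. This is only an illustration under stated assumptions: the slide does not say how the normalization constants are computed, so here each 3-channel block is divided by the mean of its per-axis standard deviations, and the CIELab values are assumed to be already available. The function name `build_9d_vectors` is hypothetical.

```python
import numpy as np

def build_9d_vectors(points, normals, lab):
    """Stack geometry, orientation and color into one 9D vector per pixel.

    points, normals, lab: arrays of shape (N, 3).
    ASSUMPTION: each block is divided by one scalar sigma, taken here as the
    mean of the per-axis standard deviations of that block.
    """
    def scale(block):
        block = np.asarray(block, dtype=float)
        sigma = block.std(axis=0).mean()
        # Leave a constant block untouched instead of dividing by zero.
        return block if sigma == 0 else block / sigma

    return np.hstack([scale(points), scale(normals), scale(lab)])
```

After this scaling the three blocks have comparable spread, so no single domain dominates the distances used by the clustering step.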
10 Proposed Framework – Oversegmentation
• Over-segmentation with normalized cuts spectral clustering with Nyström acceleration: 9D input
• CNN for the semantic labeling of each segment and for guiding the region merging process
• 9 conv layers
• 15 classes
• very simple
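
To make the clustering step concrete, here is a minimal sketch of a single two-way normalized cut on a handful of feature vectors. It is not the method of the slides as such: it builds a dense affinity matrix and omits the Nyström acceleration, so it only suits tiny inputs. The Gaussian affinity, its `sigma` parameter and the function name `two_way_ncut` are assumptions made for illustration.

```python
import numpy as np

def two_way_ncut(features, sigma=1.0):
    """Split N feature vectors into two groups with one normalized cut.

    features: array of shape (N, D), e.g. the 9D vectors of the pre-processing.
    Returns a boolean array of length N (group membership).
    """
    features = np.asarray(features, dtype=float)
    # Gaussian affinity between every pair of vectors (assumed kernel).
    diff = features[:, None, :] - features[None, :, :]
    affinity = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: I - D^(-1/2) W D^(-1/2).
    laplacian = np.eye(len(degree)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    # The second eigenvector, mapped back with D^(-1/2), gives the relaxed cut.
    indicator = d_inv_sqrt * eigvecs[:, 1]
    return indicator > 0
```

Applying the split recursively to each group yields an over-segmentation. The slide notes that this criterion is biased toward regions of similar size, which is why a merging phase follows.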
11 Proposed Framework – Region Merging
• Compute adjacency map of the segments
• Compute similarity between adjacent segment descriptors with the Bhattacharyya coefficient:
  $b_{i,j} = \sum_t \sqrt{s_i^t \, s_j^t}$
  t: class scores, s_i: descriptors (~PDFs)
• Sort the list on the basis of $b_{i,j}$
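
As a small worked example of the similarity step, the sketch below computes the Bhattacharyya coefficient between two descriptors and sorts the adjacent pairs by it. The data layout (a dict of per-segment class-score vectors and a list of adjacent index pairs) and both function names are assumptions for illustration.

```python
import numpy as np

def bhattacharyya_coefficient(s_i, s_j):
    """Sum over classes t of sqrt(s_i^t * s_j^t), for two discrete PDFs."""
    s_i = np.asarray(s_i, dtype=float)
    s_j = np.asarray(s_j, dtype=float)
    return float(np.sum(np.sqrt(s_i * s_j)))

def rank_adjacent_pairs(descriptors, adjacent_pairs):
    """Return (similarity, i, j) tuples, most similar pair first."""
    scored = [
        (bhattacharyya_coefficient(descriptors[i], descriptors[j]), i, j)
        for i, j in adjacent_pairs
    ]
    return sorted(scored, reverse=True)
```

The coefficient is 1 for identical distributions and 0 for distributions with no class in common, so a threshold close to 1 keeps only pairs that the semantic CNN sees as very similar.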
12 Proposed Framework
Iterative merging procedure
• Select segments with $b_{i,j} > T_{sim}$
• CNN classifier decides whether the two segments will be joined or not
  • If merged: a new segment for the union is created and the list is updated
  • If not merged: the two segments are removed from the list
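
The loop above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the merging CNN is replaced by a caller-supplied function `should_merge` (hypothetical), the descriptor of a merged region is assumed to be the size-weighted average of the two descriptors, and a rejected pair is simply dropped from the candidate list.

```python
import numpy as np

def bhattacharyya(p, q):
    return float(np.sum(np.sqrt(p * q)))

def iterative_merge(descriptors, sizes, adjacency, should_merge, t_sim=0.8):
    """descriptors: {id: PDF}, sizes: {id: pixel count}, adjacency: pairs of ids.

    should_merge(desc_i, desc_j) -> bool stands in for the merging CNN.
    Returns the final regions as sorted lists of original segment ids.
    """
    descriptors = {k: np.asarray(v, dtype=float) for k, v in descriptors.items()}
    sizes = dict(sizes)
    members = {k: {k} for k in descriptors}          # original ids in each region
    candidates = {frozenset(p) for p in adjacency}   # adjacent pairs still to examine
    next_id = max(descriptors) + 1

    while True:
        scored = []
        for pair in candidates:
            i, j = sorted(pair)
            b = bhattacharyya(descriptors[i], descriptors[j])
            if b > t_sim:
                scored.append((b, i, j))
        if not scored:
            break
        _, i, j = max(scored)                        # most similar pair first
        candidates.discard(frozenset((i, j)))
        if not should_merge(descriptors[i], descriptors[j]):
            continue                                 # rejected: pair leaves the list
        k, next_id = next_id, next_id + 1
        total = sizes[i] + sizes[j]
        # ASSUMPTION: descriptor of the union = size-weighted average.
        descriptors[k] = (sizes[i] * descriptors[i] + sizes[j] * descriptors[j]) / total
        sizes[k] = total
        members[k] = members[i] | members[j]
        # Pairs that touched i or j now point at the new region k.
        updated = set()
        for pair in candidates:
            if i in pair or j in pair:
                (other,) = pair - {i, j}
                updated.add(frozenset((k, other)))
            else:
                updated.add(pair)
        candidates = updated
        for old in (i, j):
            del descriptors[old], sizes[old], members[old]
    return sorted(sorted(m) for m in members.values())
```

For instance, with three segments in a chain where the first two have nearly identical descriptors and the third differs, the first two are merged and the result is two regions.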
13 CNN for Region Merging - PDFs
CNN for classification (6 conv. layers, symmetric padding, 2x2 maxpool, ReLU)
Input: 2 outputs of the softmax layer of the semantic CNN (15 channels for each candidate)
Training: 50 epochs, batch size of 32 samples, cross-entropy and L2 regularization losses, Adam
Training time: about 11 hours on an NVIDIA Titan X GPU, with $T_{sim} = 0.8$; the learning rate lr and the regularization constant are powers of ten whose exponents are garbled in the source ("10 34" and "10 35" respectively)
Architecture (input 560x425x30 with PDFs, 560x425x6 with normals):
• CONV 4@9x9, MAXP 2x2, RELU → 280x212x4
• CONV 4@7x7, MAXP 2x2, RELU → 140x106x4
• CONV 4@5x5, MAXP 2x2, RELU → 70x53x4
• CONV 4@3x3, MAXP 2x2, RELU → 35x26x4
• CONV 4@3x3, MAXP 2x2, RELU → 17x13x4
• CONV 2@17x13 → 1x2
• ARGMAX → Not merged / Merged
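
The feature-map sizes on this slide can be checked with a few lines of arithmetic. The sketch below assumes that each padded convolution keeps the spatial size, that each 2x2 max-pooling halves it with rounding down, and that the last 17x13 convolution is applied without padding; the function name `trace_shapes` is hypothetical.

```python
def trace_shapes(size_a, size_b, channels, block_filters, final_kernel, final_filters):
    """Follow the tensor shape through the conv + max-pool blocks.

    size_a, size_b: spatial sizes in the order used on the slide (560, 425).
    block_filters: number of filters of each conv + 2x2 max-pool block.
    """
    shapes = [(size_a, size_b, channels)]
    for n_filters in block_filters:
        # Padded conv keeps the spatial size; 2x2 max-pool halves it (floor).
        size_a, size_b = size_a // 2, size_b // 2
        shapes.append((size_a, size_b, n_filters))
    k_a, k_b = final_kernel
    # Unpadded conv whose kernel covers the whole remaining map.
    shapes.append((size_a - k_a + 1, size_b - k_b + 1, final_filters))
    return shapes

shapes = trace_shapes(560, 425, 30, [4, 4, 4, 4, 4], (17, 13), 2)
for shape in shapes:
    print("x".join(str(v) for v in shape))
```

Under these assumptions the trace reproduces the sizes on the slide and ends with two scores, one for "not merged" and one for "merged".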
14 CNN for Region Merging - Normals
CNN for classification (6 conv. layers, symmetric padding, 2x2 maxpool, ReLU)
Input: the surface normals of the 2 candidate segments (3 channels each)
Training: 50 epochs, batch size of 32 samples, cross-entropy and L2 regularization losses, Adam
Training time: about 3 hours on an NVIDIA Titan X GPU, with $T_{sim} = 0.75$; the learning rate lr and the regularization constant are powers of ten whose exponents are garbled in the source ("10 35" and "5 ⋅ 10 3:" respectively)
Architecture (input 560x425x30 with PDFs, 560x425x6 with normals):
• CONV 4@9x9, MAXP 2x2, RELU → 280x212x4
• CONV 4@7x7, MAXP 2x2, RELU → 140x106x4
• CONV 4@5x5, MAXP 2x2, RELU → 70x53x4
• CONV 4@3x3, MAXP 2x2, RELU → 35x26x4
• CONV 4@3x3, MAXP 2x2, RELU → 17x13x4
• CONV 2@17x13 → 1x2
• ARGMAX → Not merged / Merged
→ PDFs give richer descriptions, while normals are faster with limited impact on the final accuracy
15 Experimental Results
16 NYUDv2 Dataset [2]
1449 depth maps + color images of indoor scenes acquired with a Kinect sensor
[Figure: example scene; panels: RGB, raw depth, GT]
Training set: 795 scenes
Test set: 654 scenes
894 classes clustered into 15 classes as in [3]
Unknown and unlabeled classes excluded
[2] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, "Indoor Segmentation and Support Inference from RGBD Images", ECCV, Springer, 2012
[3] C. Couprie, C. Farabet, L. Najman, Y. LeCun, "Indoor Semantic Segmentation using Depth Information", ICLR, 2013
17 Merging CNN – Ground Truth Generation
Need a dataset to train the merging CNN
• Randomly select 10 pairs of adjacent segments in each image
• Assign label 1 if more than 85% of the union of the two segments belongs to the same object in the semantic segmentation ground truth
• Assign label 0 otherwise
[Figure: selection of a segment → selection of an adjacent segment → ground truth examination (region appears to be uniform) → label 1 → merging CNN]
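
The labeling rule above can be written in a few lines. The sketch assumes boolean masks for the two segments and an integer ground-truth label map with non-negative values; the function name `merge_label` and the parameter `min_fraction` are hypothetical.

```python
import numpy as np

def merge_label(mask_a, mask_b, gt_labels, min_fraction=0.85):
    """Return 1 if more than min_fraction of the union of the two segments
    carries the same ground-truth label, 0 otherwise."""
    union = np.asarray(mask_a, dtype=bool) | np.asarray(mask_b, dtype=bool)
    labels_in_union = np.asarray(gt_labels)[union]
    if labels_in_union.size == 0:
        return 0
    counts = np.bincount(labels_in_union)     # pixels per ground-truth label
    return int(counts.max() / labels_in_union.size > min_fraction)
```

Two segments lying on the same ground-truth region get label 1, while a pair that straddles two regions in equal parts gets label 0.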
18 Merging CNN – GT Ambiguities
• Examples of ambiguities in ground truth:
• Inconsistent labeling
• Objects not labeled
[Figure: ground-truth examples with inconsistent and missing labels]
Legend: Bed, Objects, Chair, Furniture, Ceiling, Floor, Picture/Deco, Sofa, Table, Wall, Windows, Books, Monitor/TV, Unknown
19 Merging CNN – Results
[Figure: example segment pairs; panel labels: Predicted: Merge, Predicted: Not Merged, GT: Merge, GT: Not Merged]
• Good oversegmentation (inter-uniformity)
20 Merging CNN – Results
[Figure: example segment pairs; panel labels: Predicted: Not Merged, Predicted: Merge, GT: Merge, GT: Not Merged]
• Bad oversegmentation
21 Qualitative Results
[Figure: qualitative comparison; columns: Color view, Semantic CNN, Pagnutti et al. [1], Our Approach, Ground Truth]
Legend: Bed, Objects, Chair, Furniture, Ceiling, Floor, Picture/Deco, Sofa, Table, Wall, Windows, Books, Monitor/TV, Unknown