LEARNING AFFINITY VIA SPATIAL PROPAGATION NETWORK


  1. LEARNING AFFINITY VIA SPATIAL PROPAGATION NETWORK. Sifei Liu, Shalini De Mello, Jinwei Gu, Guangyu Zhong, Ming-Hsuan Yang, Jan Kautz. NVIDIA Research, March 27, 2018.

  2. WHAT IS AFFINITY?
     • The relation between two pixels/regions.
     • A typical model combines pixel locations and intensities: $w_{ij} = K_s(x_i, x_j)\, K_r(I_i, I_j)$, where $K_s$ and $K_r$ are manually designed kernel functions (e.g., Gaussians) measuring geometric closeness and intensity closeness, respectively.
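As a concrete illustration, a minimal sketch of such a hand-designed affinity with two Gaussian kernels (the coordinates, intensities, and bandwidths sigma_s, sigma_r below are made-up example values, not from the slides):

```python
import numpy as np

def handcrafted_affinity(x_i, x_j, I_i, I_j, sigma_s=5.0, sigma_r=0.1):
    """Bilateral-style affinity w_ij = K_s(x_i, x_j) * K_r(I_i, I_j):
    a spatial kernel (geometric closeness) times a range kernel (intensity closeness)."""
    k_s = np.exp(-np.sum((np.asarray(x_i, float) - np.asarray(x_j, float)) ** 2) / (2 * sigma_s ** 2))
    k_r = np.exp(-np.sum((np.asarray(I_i, float) - np.asarray(I_j, float)) ** 2) / (2 * sigma_r ** 2))
    return k_s * k_r

# Nearby pixels with similar intensity -> high affinity; distant, dissimilar pixels -> low affinity.
print(handcrafted_affinity((10, 10), (12, 11), 0.60, 0.58))
print(handcrafted_affinity((10, 10), (60, 80), 0.60, 0.95))
```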

  3. WHY AFFINITY? A semantic-level example (figure: image and its semantic labels, with regions such as nose, eye, and hair):
     • spatially adjacent regions (e.g., the nose)
     • regions far away, but with similar texture (e.g., the hair)
     • regions far away, different in shape and appearance
     Hand-designed kernels over location and intensity, $w_{ij} = K_s(x_i, x_j)\, K_r(I_i, I_j)$, do not capture such semantic-level relations.

  4. A CNN vs. a propagation network (over an image patch):
     • a CNN: deep; output pixels are treated independently
     • a propagation network: shallow; pixels are correlated through propagation

  6. HOW TO USE IT? CNN-based semantic segmentation:
     image → CNN → pixel-wise probability (softmax) → segmentation
     • Standard CNN-based image segmentation does not explicitly model the pairwise relations of pixels.
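For reference, a minimal sketch of this per-pixel decision, which ignores pairwise relations (the logits here are random placeholders, not the output of any particular network):

```python
import numpy as np

def segment_from_logits(logits):
    """logits: H x W x C class scores from a CNN.
    Softmax and argmax are applied pixel-wise, so each label is chosen
    independently of the neighbouring pixels."""
    z = logits - logits.max(axis=-1, keepdims=True)           # for numerical stability
    prob = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)  # pixel-wise probability map
    return prob, prob.argmax(axis=-1)                         # probabilities, label map

prob, labels = segment_from_logits(np.random.randn(8, 8, 21))  # e.g. 21 VOC classes
```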

  7. WHY AFFINITY? Refine the segmentation (figure: image, CNN-based segmentation, our result).

  8. WHY AFFINITY? Improve the context (figure: image, CNN-based segmentation, our result).

  9. PROPOSED ARCHITECTURE (figure): the RGB image is fed to a guidance network (deep CNN) that outputs the affinity weights w; the coarse mask is passed through conv layers and propagated with this affinity to produce the refined result.

  10. PROPOSED ARCHITECTURE. Spatial propagation network (SPN) (figure: coarse mask, conv layers, propagation, refined result).

  11. SPATIAL PROPAGATION NETWORK. Row/column-wise linear propagation (e.g., left-to-right):
     $h_t = (I - d_t)\, x_t + w_t\, h_{t-1}, \quad t \in [2, n]$
     • $h_t$: the $t$-th column of the SPN hidden layer
     • $w_t$: an $n \times n$ transform matrix between columns $t-1$ and $t$
     • $d_t$: the diagonal degree matrix normalizing the response of $h_t$, with $d_t(i,i) = \sum_{j=1, j \neq i}^{n} w_t(i,j)$
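A minimal NumPy sketch of this left-to-right pass (an illustration, not the authors' implementation; it assumes the transform matrices w_t are already given and builds each degree matrix d_t from them):

```python
import numpy as np

def propagate_left_to_right(X, W):
    """X: (n_rows, n_cols) input map whose columns are the x_t.
    W: list of n_cols matrices; W[t] is the (n_rows x n_rows) transform w_t
       between columns t-1 and t (W[0] is unused).
    Returns H with h_1 = x_1 and h_t = (I - d_t) x_t + w_t h_{t-1}."""
    n_rows, n_cols = X.shape
    I = np.eye(n_rows)
    H = np.zeros_like(X, dtype=float)
    H[:, 0] = X[:, 0]
    for t in range(1, n_cols):
        w_t = W[t]
        # d_t(i, i) = sum over j != i of w_t(i, j)
        d_t = np.diag(w_t.sum(axis=1) - np.diag(w_t))
        H[:, t] = (I - d_t) @ X[:, t] + w_t @ H[:, t - 1]
    return H

n = 6
W = [None] + [0.1 * np.random.rand(n, n) for _ in range(n - 1)]
H = propagate_left_to_right(np.random.rand(n, n), W)
```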

  12. SPATIAL PROPAGATION NETWORK. Row/column-wise linear propagation, unrolled:
     • $h_1 = x_1$
     • $h_2 = (I - d_2)\, x_2 + w_2\, h_1$, …
     • $h_t = (I - d_t)\, x_t + w_t\, h_{t-1}$, …
     • $h_n = (I - d_n)\, x_n + w_n\, h_{n-1}$
     Stacking the columns gives $H_v = G\, X_v$, where
     • $H_v$: the concatenation of $h_1, h_2, \dots, h_n$
     • $X_v$: the concatenation of $x_1, x_2, \dots, x_n$
     • $G$: an $N \times N$ matrix ($N = n^2$) whose diagonal blocks are $\lambda_t = I - d_t$
     • Affinity: the off-diagonal entries of $G$
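A small numerical check of this unrolled form (my own sketch and notation): build the lower-triangular G block by block, with block (t, s) equal to $w_t w_{t-1} \cdots w_{s+1} \lambda_s$, and compare against the column-wise recurrence.

```python
import numpy as np

def build_G(W, n_rows, n_cols):
    """G is N x N with N = n_rows * n_cols, acting on the stacked columns of X."""
    lam = [np.eye(n_rows)]                                  # lambda_1 = I since h_1 = x_1
    for t in range(1, n_cols):
        d_t = np.diag(W[t].sum(axis=1) - np.diag(W[t]))
        lam.append(np.eye(n_rows) - d_t)                    # lambda_t = I - d_t
    N = n_rows * n_cols
    G = np.zeros((N, N))
    for t in range(n_cols):                                 # block row (column t+1 of H)
        for s in range(t + 1):                              # lower-triangular block column
            block = lam[s]
            for k in range(s + 1, t + 1):
                block = W[k] @ block                        # chain of transforms w_{s+1}..w_t
            G[t * n_rows:(t + 1) * n_rows, s * n_rows:(s + 1) * n_rows] = block
    return G

n = 4
W = [None] + [0.1 * np.random.rand(n, n) for _ in range(n - 1)]
X = np.random.rand(n, n)

# H_v = G X_v, where X_v stacks the columns x_1, ..., x_n.
H_unrolled = (build_G(W, n, n) @ X.flatten(order="F")).reshape(n, n, order="F")

# Column-wise recurrence h_t = (I - d_t) x_t + w_t h_{t-1}.
H = np.zeros_like(X)
H[:, 0] = X[:, 0]
for t in range(1, n):
    d_t = np.diag(W[t].sum(axis=1) - np.diag(W[t]))
    H[:, t] = (np.eye(n) - d_t) @ X[:, t] + W[t] @ H[:, t - 1]

np.testing.assert_allclose(H_unrolled, H)
```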

  13. SPATIAL PROPAGATION NETWORK. Advantage: a compact representation.
     • Affinity: the off-diagonal entries of $G$
     • Parameters:
       • dense affinity (dense CRF): $n^2 \times n^2$
       • affinity in the SPN (for all $w_t$): $n^2 \times n$
     • Learnable affinity matrix = learn all $w_1, w_2, \dots, w_n$
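For a concrete (assumed) image size of n = 128, the gap is a factor of n:

```python
n = 128                        # assumed image side length; N = n * n pixels
dense_params = (n * n) ** 2    # dense pairwise affinity: N x N entries
spn_params = (n * n) * n       # SPN: n transform matrices w_t of size n x n (per direction)
print(dense_params, spn_params, dense_params // spn_params)
# 268435456  2097152  128
```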

  14. LEARNING THE AFFINITY MATRIX
     • Basic idea: learn all the entries of the $w_t$ for 4 directions: image → CNN → $\{w_1, w_2, \dots, w_n\}$ for each of →, ←, ↑, ↓.
     • Disadvantage: a fully-connected spatial propagation (each pixel connected to every pixel of the previous row/column $h_{t-1}$) requires a huge number of output channels from the network.

  15. REDUCING CONNECTIONS. Three-way connection.
     • Each pixel is connected to 3 adjacent pixels in the previous row/column:
       $h_{k,t} = \left(1 - \sum_{i \in N(k)} p_{i,t}\right) x_{k,t} + \sum_{i \in N(k)} p_{i,t}\, h_{i,t-1}$, where $N(k)$ is the set of 3 neighbors of pixel $k$.
     • For each pixel, only 3 scalar weights need to be learned for each direction.
     • $w_t$ becomes a tridiagonal transform matrix, and the propagation is still $h_t = (I - d_t)\, x_t + w_t\, h_{t-1}, \; t \in [2, n]$.
     • The total CNN output is $n^2 \times 12$ (3 weights × 4 directions per pixel).
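A minimal sketch of one three-way pass (left to right), assuming the guidance network has already produced the three per-pixel weight maps; the names p_top, p_mid, p_bot and the zero padding at the image borders are my own choices for the illustration:

```python
import numpy as np

def three_way_left_to_right(X, p_top, p_mid, p_bot):
    """X and the three weight maps are (H, W) arrays.
    Pixel (k, t) connects to (k-1, t-1), (k, t-1), (k+1, t-1) of the previous column:
      h[k,t] = (1 - p_top - p_mid - p_bot) * x[k,t]
               + p_top * h[k-1,t-1] + p_mid * h[k,t-1] + p_bot * h[k+1,t-1]"""
    n_rows, n_cols = X.shape
    H = X.astype(float).copy()
    for t in range(1, n_cols):
        prev = H[:, t - 1]
        up   = np.concatenate(([0.0], prev[:-1]))   # h[k-1, t-1], zero at the top border
        down = np.concatenate((prev[1:], [0.0]))    # h[k+1, t-1], zero at the bottom border
        s = p_top[:, t] + p_mid[:, t] + p_bot[:, t]
        H[:, t] = (1 - s) * X[:, t] + p_top[:, t] * up + p_mid[:, t] * prev + p_bot[:, t] * down
    return H

X = np.random.rand(8, 8)
p = [np.full((8, 8), 0.2) for _ in range(3)]        # toy weights, summing to 0.6 per pixel
H = three_way_left_to_right(X, *p)
```

In the paper the three scalar weights per pixel are additionally normalized so that the sum of their absolute values stays at or below one, which keeps the propagation stable; the toy weights above satisfy this by construction.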

  16. PROPOSED ARCHITECTURE. Learnable affinity through a guidance network (CNN) (figure: RGB image → guidance network (deep CNN) → affinity weights w, $3 \times n^2$ per direction → SPN).

  17. REDUCING CONNECTIONS Three-way connection • The integration of 4 directions results in dense connections between all pixels.

  18. IMAGE SEGMENTATION REFINEMENT
     • Helen Dataset: high-resolution face parsing with 11 classes
     • VOC 2012 Dataset: general object semantic segmentation with 21 classes

  19. A TYPICAL CNN FOR IMAGE SEGMENTATION (figure: RGB image). Jonathan Long, Evan Shelhamer, Trevor Darrell. Fully convolutional networks for semantic segmentation. CVPR 2015.

  20. SPATIAL PROPAGATION NETWORK
     • Input and output: the probability map of all classes (one class shown as an example).
     (Figure: coarse mask 128×128×1; 4×4×16 conv layers with stride 2; 64×64×16 feature maps; node-wise max-pool; refined result 128×128×1.)
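A small sketch of the node-wise max-pool step on this 64×64×16 feature map, with made-up directional outputs (this covers only the pooling over the four propagation directions, not the full module):

```python
import numpy as np

# Hypothetical outputs of the four directional passes (left-to-right,
# right-to-left, top-to-bottom, bottom-to-top), each 64 x 64 x 16.
h_lr, h_rl, h_tb, h_bt = (np.random.rand(64, 64, 16) for _ in range(4))

# Node-wise max-pool: for every node (pixel, channel) keep the strongest
# response among the four directions; the result is still 64 x 64 x 16.
h_pooled = np.maximum.reduce([h_lr, h_rl, h_tb, h_bt])
```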

  21. GUIDANCE NETWORK (figure: RGB image → symmetric CNN with skipped links → 64×64×16 weights w of one direction → to the propagation module)
     • Helen: a relatively small network with symmetric upsample layers, learned from scratch.
     • VOC: VGG16 conv1~pool5 with pre-trained weights; upsample layers learned from scratch.

  22. IMPLEMENTATION
     • Baseline CNN
       • Helen: we train a CNN-base network whose output size matches the input image.
       • VOC: we directly use FCN-8s.
     • Train a universal SPN
       • Coarse mask: segmentation results on the training set produced by CNN-base.
       • Guidance network: an independent deep CNN.
     • Test on any base network (for VOC only)
       • We directly apply the SPN to any CNN-base network (e.g., DeepLab-VGG-16 and ResNet-101).
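Put together, the test-time pipeline amounts to something like the following sketch, where base_net, guidance_net, and spn_refine are hypothetical placeholders for the baseline segmentation CNN, the guidance CNN, and the propagation module:

```python
def refine(image, base_net, guidance_net, spn_refine):
    """Apply a trained universal SPN on top of an arbitrary base network."""
    coarse = base_net(image)             # coarse per-class probability maps (e.g. from FCN-8s)
    affinity = guidance_net(image)       # learned three-way weights for the four directions
    return spn_refine(coarse, affinity)  # refined probability maps

# Because the SPN consumes only probability maps and the RGB image, the same
# trained module can be applied to FCN-8s, DeepLab-VGG-16, or DeepLab-ResNet-101 outputs.
```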

  23. HELEN FACE PARSING (figure: original, CNN-base, CNN+[1], ours)
     • CNN-base: the baseline CNN network
     • SPN: the three-way model; the two different connections ([1] vs. ours) are compared with the same guidance network.
     [1] Sifei Liu, Jinshan Pan and Ming-Hsuan Yang. Learning recursive filters for low-level vision via a hybrid neural network. ECCV 2016.

  25. QUANTITATIVE EVALUATION (f-score)

     | model            | skin  | brows | eyes  | nose  | mouth | lip_upper | lip_lower | lip_inner | overall |
     |------------------|-------|-------|-------|-------|-------|-----------|-----------|-----------|---------|
     | Multi-obj [1]    | 90.87 | 69.89 | 74.74 | 90.23 | 82.07 | 59.22     | 66.30     | 81.70     | 83.68   |
     | CNN base         | 90.53 | 70.09 | 74.86 | 89.16 | 83.83 | 55.61     | 64.88     | 71.27     | 82.89   |
     | CNN Highres      | 91.78 | 71.84 | 74.46 | 89.42 | 81.83 | 68.15     | 72.00     | 71.95     | 83.21   |
     | CNN + [2]        | 92.26 | 75.05 | 85.44 | 91.51 | 88.13 | 77.61     | 70.81     | 79.95     | 87.09   |
     | CNN + SPN (ours) | 93.10 | 78.53 | 87.71 | 92.62 | 91.08 | 80.17     | 71.53     | 83.13     | 89.30   |

     [1] Sifei Liu, Jimei Yang, Chang Huang and Ming-Hsuan Yang. Multi-objective convolutional learning for face labeling. CVPR 2015.
     [2] Sifei Liu, Jinshan Pan and Ming-Hsuan Yang. Learning recursive filters for low-level vision via a hybrid neural network. ECCV 2016.

  26. VOC2012 OBJECT SEGMENTATION. Experimental comparison with a universal SPN.
     • Segmentation networks: FCN-8s, DeepLab-VGG-16, DeepLab-ResNet-101
     • Refinement models: Dense CRF, SPN (trained on FCN-8s)

  27. VOC2012 OBJECT SEGMENTATION (figure: original, CNN base, Dense CRF, SPN)
     • CNN base: ResNet-101 with dilated convolution (DeepLab pretrained model)
     • Dense CRF: CNN base + Dense CRF
     • SPN: pretrained CNN base + SPN

  29. VOC2012 OBJECT SEGMENTATION

  30. VOC2012 OBJECT SEGMENTATION. Quantitative results on the validation set (F: FCN-8s, V: VGG-16, R: ResNet-101)

     | model      | F     | F+[2] | F+SPN | V     | V+[2] | V+SPN | R     | R+[2] | R+SPN |
     |------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
     | overall AC | 91.22 | 90.64 | 92.90 | 92.61 | 92.16 | 93.83 | 94.63 | 94.12 | 95.49 |
     | mean AC    | 77.61 | 70.64 | 79.49 | 80.97 | 73.53 | 83.15 | 84.16 | 77.46 | 86.09 |
     | mean IoU   | 65.51 | 60.95 | 69.86 | 68.97 | 64.42 | 73.12 | 76.46 | 72.02 | 79.76 |

     [2] Sifei Liu, Jinshan Pan and Ming-Hsuan Yang. Learning recursive filters for low-level vision via a hybrid neural network. ECCV 2016.

  31. VOC2012 OBJECT SEGMENTATION. Quantitative results: comparison with Dense CRF [1]

     | mean IoU          | CNN base | +Dense CRF | +SPN  |
     |-------------------|----------|------------|-------|
     | VGG-16 (val)      | 68.97    | 71.57      | 73.12 |
     | ResNet-101 (val)  | 76.40    | 77.69      | 79.76 |
     | ResNet-101 (test) | -        | 79.70      | 80.22 |

     [1] Liang-Chieh Chen*, George Papandreou*, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv preprint arXiv:1606.00915, 2016.

  32. RUNTIME ANALYSIS (VOC MODEL). Computational efficiency on 512×512 images (chart):
     • Dense CRF [1]: 4.4 s
     • Dense CRF [2]: 3.2 s
     • CRF as RNN [3]: ~1 s
     • Ours: 0.08 s
     [1] Philipp Krähenbühl, Vladlen Koltun. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. NIPS 2011.
     [2] Liang-Chieh Chen*, George Papandreou*, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
     [3] Shuai Zheng*, Sadeep Jayasumana*, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional Random Fields as Recurrent Neural Networks. ICCV 2015.
