G²DeNet: Global Gaussian Distribution Embedding Network and Its Application to Visual Recognition
Qilong Wang¹, Peihua Li¹, Lei Zhang²
¹Dalian University of Technology, ²Hong Kong Polytechnic University
Trend of CNN architectures
LeNet-5 → AlexNet-8 → VGG-VD-19 / GoogLeNet-22 → ResNet-152 / Inception-V4
CNN architectures tend to be deeper and wider, and more accurate, built from only convolution, non-linearity (ReLU), and pooling.
Trainable structural layers
Modeling the outputs of the last convolutional layer with trainable structural layers:
- O²P layer (LogCOV) [DeepO2P, ICCV'15]
- Bilinear pooling (COV) [B-CNN, ICCV'15]
- Mean map embedding [DMMs, arXiv'15]
- VLAD coding [NetVLAD, CVPR'16]
Pipeline: Images → Conv. layers → structural layer → Loss
Trainable structural layers
Fine-grained Visual Classification: B-CNN [D,D] (84.1, 84.1, 91.3) vs. VGG-VD16 (76.4, 74.1, 79.8), an improvement of ~8%.
T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, 2015.
Trainable structural layers
Place Recognition (Pitts30k): NetVLAD (85.6) vs. AlexNet (69.8), an improvement of ~15% (both built on AlexNet).
R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.
Trainable structural layers
Scene Categorization (Places205): DMMs + GoogLeNet (49.0) vs. GoogLeNet (47.5).
J. B. Oliva, D. J. Sutherland, B. Póczos, and J. G. Schneider. Deep mean maps. arXiv, abs/1511.04150, 2015.
Trainable structural layers
Takeaway: integrating trainable structural layers into deep CNNs achieves significant improvements in many challenging vision tasks.
Parametric probability distribution modeling
① Models abundant statistics of features.
② Produces fixed-size representations regardless of varying feature sizes.
Promising modeling performance (outperforming coding methods): distributions used include the Gaussian, the Gaussian mixture model, and the Gaussian-Laplacian model [Nakayama et al., CVPR'10; Serra et al., CVIU'15; Wang et al., CVPR'16].
High computational efficiency: closed-form solution for parameter estimation.
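As a concrete illustration of the closed-form estimation noted above, the following NumPy sketch (my own illustration, not code from any cited paper; the function name is hypothetical) fits a global Gaussian to a convolutional feature map:

```python
import numpy as np

def global_gaussian(feature_map):
    """Closed-form estimation of a global Gaussian from an (H, W, d) feature map.

    The H*W spatial positions are treated as N samples of a d-dimensional
    variable; the output size depends only on d, not on H or W.
    """
    d = feature_map.shape[-1]
    X = feature_map.reshape(-1, d)        # N x d sample matrix
    mu = X.mean(axis=0)                   # mean: closed form
    Xc = X - mu
    sigma = Xc.T @ Xc / X.shape[0]        # covariance: closed form, no iteration
    return mu, sigma
```

Because both estimates are single matrix expressions, estimation is cheap compared with iteratively fitting, e.g., a mixture model, and the representation size is fixed no matter how many features the convolutional layer produces.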
Embedding of global Gaussian in a CNN
Pipeline: Images → Conv. layers → Global Gaussian → … → Loss
Global Gaussian distribution embedding network (G²DeNet)
Pipeline: Images → Conv. layers → Global Gaussian embedding layer → Loss
Gaussian embedding: N(μ, Σ) ↦ [Σ + μμᵀ, μ; μᵀ, 1]^(1/2) (a 2×2 block matrix)
The embedding layer consists of two sub-layers:
- Matrix Partition Sub-layer: Y = f_MPL(X) = (1/N)(A X Xᵀ Aᵀ + 2(A X 1 bᵀ)_sym) + B
- Square-rooted SPD Matrix Sub-layer: Z = f_SRL(Y) = Y^(1/2)
A trainable global Gaussian embedding layer for modeling convolutional features: the first attempt to plug a parametric probability distribution into deep CNNs.
Challenges
Forward propagation. Q: How to construct the trainable global Gaussian embedding layer? A: The key is to give an explicit matrix form of the Gaussian, respecting both the Riemannian geometry and the algebraic structure of the space of Gaussians.
Backward propagation. The embedding must be differentiable so that gradients can flow through it.
Gaussian embedding
The space of Gaussians is a Riemannian manifold with a special geometric structure; [TPAMI'17] shows that it is endowed with a Lie group structure.
Via Cholesky decomposition Σ = LLᵀ, the Gaussian N(μ, Σ) corresponds to the positive upper triangular matrix A = [L, μ; 0ᵀ, 1]; via left polar decomposition A = PO, it further corresponds to the SPD matrix P = (AAᵀ)^(1/2) = [Σ + μμᵀ, μ; μᵀ, 1]^(1/2).
Gaussian → positive upper triangular matrix → SPD matrix
[TPAMI'17] Peihua Li, Qilong Wang et al. Local Log-Euclidean Multivariate Gaussian Descriptor and Its Application to Image Classification. TPAMI, 2017.
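The embedding chain above can be checked numerically. This sketch (my own illustration, with hypothetical variable names; not the authors' code) builds the triangular factor from a Cholesky decomposition and confirms that the left polar decomposition of the resulting matrix yields the SPD embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
mu = rng.standard_normal(d)
M = rng.standard_normal((d, d))
sigma = M @ M.T + d * np.eye(d)           # a well-conditioned SPD covariance

# Cholesky factor (NumPy returns the lower-triangular variant; the
# identities below hold for either triangular convention).
L = np.linalg.cholesky(sigma)             # sigma = L @ L.T
A = np.block([[L, mu[:, None]],
              [np.zeros((1, d)), np.ones((1, 1))]])

# A A^T reproduces the block matrix [Sigma + mu mu^T, mu; mu^T, 1]
S = np.block([[sigma + np.outer(mu, mu), mu[:, None]],
              [mu[None, :], np.ones((1, 1))]])
assert np.allclose(A @ A.T, S)

# Left polar decomposition A = P O: P = (A A^T)^{1/2} is the SPD
# embedding of the Gaussian, and O = P^{-1} A is orthogonal.
w, U = np.linalg.eigh(S)
P = U @ np.diag(np.sqrt(w)) @ U.T
O = np.linalg.solve(P, A)
assert np.allclose(O @ O.T, np.eye(d + 1), atol=1e-8)
```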
Global Gaussian embedding layer
Gaussian embedding: N(μ, Σ) ↦ [Σ + μμᵀ, μ; μᵀ, 1]^(1/2)
1. Matrix Partition Sub-layer: Y = f_MPL(X) = (1/N)(A X Xᵀ Aᵀ + 2(A X 1 bᵀ)_sym) + B, where (M)_sym = (M + Mᵀ)/2; Y is a function of the convolutional features X.
2. Square-rooted SPD Matrix Sub-layer: Z = f_SRL(Y) = Y^(1/2), computed via the SVD (eigendecomposition) of Y.
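A minimal NumPy sketch of the two sub-layers (my own re-implementation for illustration, not the authors' code; A, b, B are fixed to the constant selector matrices implied by the block form of the embedding):

```python
import numpy as np

def f_mpl(X):
    """Matrix partition sub-layer: Y = (1/N)(A X X^T A^T + 2(A X 1 b^T)_sym) + B.

    X is d x N (N convolutional features as columns).  With the constant
    matrices below, Y equals the block matrix [Sigma + mu mu^T, mu; mu^T, 1].
    """
    d, N = X.shape
    A = np.vstack([np.eye(d), np.zeros((1, d))])   # (d+1) x d selector
    b = np.zeros(d + 1); b[-1] = 1.0
    B = np.outer(b, b)
    M = np.outer(A @ X @ np.ones(N), b)            # A X 1 b^T
    return (A @ X @ X.T @ A.T + M + M.T) / N + B

def f_srl(Y):
    """Square-rooted SPD matrix sub-layer: Z = Y^{1/2} via eigendecomposition."""
    w, U = np.linalg.eigh(Y)
    return U @ np.diag(np.sqrt(np.maximum(w, 0.0))) @ U.T
```

The top-left d×d block of Y is Σ + μμᵀ = (1/N)XXᵀ, the last column holds μ, and the corner entry is 1, matching the Gaussian embedding above.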
BP for the global Gaussian embedding layer
Pipeline: Images → Conv. layers → Global Gaussian embedding layer (Y = f_MPL(X), Z = f_SRL(Y) = Y^(1/2)) → Loss
The goal is to compute ∂f/∂X; the first step is to compute ∂f/∂Y from ∂f/∂Z.
BP for the square-rooted SPD matrix sub-layer
Compute ∂f/∂Y from ∂f/∂U and ∂f/∂Σ, given the eigendecomposition Y = UΣUᵀ, Σ = diag(σ₁, …, σₙ) [DeepO2P, ICCV'15]:
  dΣ = (Uᵀ dY U)_diag,   dU = U(Kᵀ ∘ (Uᵀ dY U))
  ∂f/∂Y = U((Kᵀ ∘ (Uᵀ ∂f/∂U))_sym + (∂f/∂Σ)_diag)Uᵀ,   where K_ij = 1/(σ_i − σ_j) for i ≠ j and 0 otherwise
[DeepO2P, ICCV'15]: Catalin Ionescu et al. Matrix Backpropagation for Deep Networks with Structured Layers. ICCV, 2015.
BP for the square-rooted SPD matrix sub-layer (cont.)
Compute ∂f/∂U and ∂f/∂Σ from ∂f/∂Z, given Z = f_SRL(Y) = UΣ^(1/2)Uᵀ:
  dZ = 2(dU Σ^(1/2) Uᵀ)_sym + (1/2) U Σ^(−1/2) dΣ Uᵀ
  ∂f/∂U = 2(∂f/∂Z)_sym U Σ^(1/2),   ∂f/∂Σ = (1/2) Σ^(−1/2) Uᵀ (∂f/∂Z) U
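Putting the two steps together, backprop through Z = Y^(1/2) can be sketched and verified against finite differences. This is my own illustration (not the authors' code); it assumes Y is SPD with distinct eigenvalues and takes sym(M) = (M + Mᵀ)/2:

```python
import numpy as np

def sqrt_backward(Y, dZ):
    """Gradient of f w.r.t. Y for Z = Y^{1/2}, given dZ = df/dZ."""
    s, U = np.linalg.eigh(Y)                 # Y = U diag(s) U^T
    G = (dZ + dZ.T) / 2                      # (df/dZ)_sym
    dU = 2 * G @ U @ np.diag(np.sqrt(s))     # df/dU = 2 (df/dZ)_sym U S^{1/2}
    dS = 0.5 * np.diag(1 / np.sqrt(s)) @ U.T @ G @ U   # df/dSigma
    n = len(s)
    K = np.zeros((n, n))                     # K_ij = 1/(s_i - s_j), i != j
    for i in range(n):
        for j in range(n):
            if i != j:
                K[i, j] = 1.0 / (s[i] - s[j])
    P = K.T * (U.T @ dU)                     # K^T ∘ (U^T df/dU)
    inner = (P + P.T) / 2 + np.diag(np.diag(dS))
    return U @ inner @ U.T

# Finite-difference check of df/dY for f(Y) = <G0, Y^{1/2}>
rng = np.random.default_rng(2)
n = 4
R = rng.standard_normal((n, n))
Y = R @ R.T + n * np.eye(n)                  # SPD, distinct eigenvalues
G0 = rng.standard_normal((n, n)); G0 = (G0 + G0.T) / 2

def f(Y):
    w, U = np.linalg.eigh(Y)
    return np.sum(G0 * (U @ np.diag(np.sqrt(w)) @ U.T))

dfdY = sqrt_backward(Y, G0)
E = rng.standard_normal((n, n)); E = (E + E.T) / 2   # symmetric perturbation
h = 1e-6
num = (f(Y + h * E) - f(Y - h * E)) / (2 * h)
assert abs(np.sum(dfdY * E) - num) < 1e-5
```

The per-element loop building K is kept explicit for clarity; a vectorized version would broadcast over the eigenvalue vector.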
BP for the global Gaussian embedding layer
The goal is to compute ∂f/∂X given ∂f/∂Y. From
  Y = f_MPL(X) = (1/N)(A X Xᵀ Aᵀ + 2(A X 1 bᵀ)_sym) + B,
the gradient is
  ∂f/∂X = (2/N) Aᵀ (∂f/∂Y)_sym (A X + b 1ᵀ).
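This gradient can also be sanity-checked numerically. The sketch below (my own illustration, not the authors' code) implements both the forward map and the stated gradient, then compares against finite differences of f(X) = ⟨G, Y(X)⟩:

```python
import numpy as np

def f_mpl(X):
    """Y = (1/N)(A X X^T A^T + 2(A X 1 b^T)_sym) + B, with constant A, b, B."""
    d, N = X.shape
    A = np.vstack([np.eye(d), np.zeros((1, d))])
    b = np.zeros(d + 1); b[-1] = 1.0
    M = np.outer(A @ X @ np.ones(N), b)          # A X 1 b^T
    return (A @ X @ X.T @ A.T + M + M.T) / N + np.outer(b, b)

def mpl_backward(X, dY):
    """df/dX = (2/N) A^T (df/dY)_sym (A X + b 1^T)."""
    d, N = X.shape
    A = np.vstack([np.eye(d), np.zeros((1, d))])
    b = np.zeros(d + 1); b[-1] = 1.0
    G = (dY + dY.T) / 2                          # (df/dY)_sym
    return (2.0 / N) * A.T @ G @ (A @ X + np.outer(b, np.ones(N)))

# Finite-difference check of df/dX for f(X) = <G0, f_mpl(X)>
rng = np.random.default_rng(3)
d, N = 3, 10
X = rng.standard_normal((d, N))
G0 = rng.standard_normal((d + 1, d + 1))

dfdX = mpl_backward(X, G0)
E = rng.standard_normal((d, N))
h = 1e-6
num = (np.sum(G0 * f_mpl(X + h * E)) - np.sum(G0 * f_mpl(X - h * E))) / (2 * h)
assert abs(np.sum(dfdX * E) - num) < 1e-6
```

Since f_MPL is quadratic in X, the central difference is essentially exact here, which makes this a tight check of the formula.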
Global Gaussian distribution embedding network (G²DeNet): summary
Pipeline: Images → Conv. layers → Global Gaussian embedding layer → Loss
Gaussian embedding: N(μ, Σ) ↦ [Σ + μμᵀ, μ; μᵀ, 1]^(1/2), implemented by the Matrix Partition Sub-layer Y = f_MPL(X) followed by the Square-rooted SPD Matrix Sub-layer Z = f_SRL(Y) = Y^(1/2).
Structural backpropagation: ∂f/∂Y and ∂f/∂X.
Experiments on MS-COCO
890k segmented instances from the MS-COCO dataset: 80 classes, ~600k training instances, ~290k validation ones.
Comparison of classification errors (%) on MS-COCO:
  AlexNet (baseline)         25.3
  DeepO2P [ICCV'15]          28.6
  DeepO2P-FC (S) [ICCV'15]   28.9
  DeepO2P-FC [ICCV'15]       25.2
  DMMs-FC [arXiv'15]         24.6
  G²DeNet (Ours)             24.4
  G²DeNet-FC (S) (Ours)      22.6
  G²DeNet-FC (Ours)          21.5
Figure: convergence curve of our G²DeNet-FC with AlexNet on MS-COCO.
Experiments on FGVR: benchmarks
  Birds (CUB-200-2011): 200 classes, 5,994 training / 5,794 test images
  FGVC-Aircraft: 100 classes, 6,667 training / 3,333 test images
  FGVC-Cars: 196 classes, 8,144 training / 8,041 test images
Experiments on FGVR: results
  Methods              CUB-200-2011   FGVC-Aircraft   FGVC-Cars
  FC-CNN                   76.4           74.1           79.8
  FV-CNN                   77.5           77.6           85.7
  VLAD-CNN                 79.0           80.6           85.6
  NetFV [TPAMI'17]         79.9           79.0           86.2
  NetVLAD [CVPR'16]        81.9           81.8           88.6
  B-CNN [ICCV'15]          84.1           84.1           91.3
  G²DeNet (Ours)           87.1           89.0           92.5
Comparison of different counterparts using VGG-VD16 without bounding boxes or part annotations, under the same settings as B-CNN.
NetFV [TPAMI'17]: Lin et al. Bilinear CNNs for Fine-grained Visual Recognition. TPAMI, 2017.