Paper Motivation
● Fixed geometric structures of CNN models – "CNNs are inherently limited to model geometric transformations"
● Higher-level features combine lower-level features at fixed positions as a weighted sum
● Pooling chooses the dominant features / averages features at fixed positions
Invariance to Geometric Transformations
● Learned from data augmentation
● Using transformation-invariant features and algorithms
● "Unknown or complex geometric transformations not learned or modeled"
Standard Convolution and RoI Pooling
● Convolution samples the feature map at fixed locations
● RoI pooling reduces the spatial resolution at a fixed ratio
● "The higher the layer, the less desired behaviour"
Deformable Convolution
● Adds 2D offsets to the regular grid sampling locations
● Free-form deformation of the sampling grid
Deformable Convolution
● Offsets are learned from the preceding feature maps via additional convolutional layers (see the sketch below)
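A minimal sketch of such a layer in PyTorch, using torchvision's deform_conv2d; the class name, initialization, and shapes are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, padding=1, stride=1):
        super().__init__()
        # Regular convolution weights, applied at the deformed locations.
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        # Additional conv layer predicting one 2D offset (dy, dx) per
        # sampling location -> 2*k*k output channels, same output resolution.
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k,
                                     padding=padding, stride=stride)
        nn.init.zeros_(self.offset_conv.weight)  # zero offsets = plain convolution
        nn.init.zeros_(self.offset_conv.bias)
        self.padding, self.stride = padding, stride

    def forward(self, x):
        offsets = self.offset_conv(x)            # offsets depend on the input
        return deform_conv2d(x, offsets, self.weight,
                             padding=self.padding, stride=self.stride)

x = torch.randn(1, 8, 32, 32)
print(DeformableConv2d(8, 16)(x).shape)          # torch.Size([1, 16, 32, 32])
```

Initializing the offset branch to zero makes the layer start out as a standard convolution, so the offsets are learned as a refinement.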
Deformable RoI Pooling
● Adds a 2D offset to each bin position in the regular bin partition
● Adaptive part localization for objects with different shapes
Deformable RoI Pooling
● Offsets are learned from the preceding feature maps via an additional RoI pooling branch and a fully connected layer (see the sketch below)
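A minimal sketch of deformable RoI pooling, assuming PyTorch/torchvision; it approximates the per-bin average pooling with roi_align, uses gamma = 0.1 as in the paper, and all names and shapes are illustrative:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class DeformableRoIPool(nn.Module):
    def __init__(self, channels, k=3, gamma=0.1):
        super().__init__()
        self.k, self.gamma = k, gamma
        # fc layer predicting one normalized (dy, dx) offset per bin
        self.fc = nn.Linear(channels * k * k, 2 * k * k)
        nn.init.zeros_(self.fc.weight)   # zero offsets = regular RoI pooling
        nn.init.zeros_(self.fc.bias)

    def forward(self, feat, rois):       # rois: (R, 5) = (batch_idx, x1, y1, x2, y2)
        k = self.k
        pooled = roi_align(feat, rois, output_size=(k, k))   # regular pooling first
        off = self.fc(pooled.flatten(1)).view(-1, k, k, 2)   # normalized offsets
        x1, y1 = rois[:, 1], rois[:, 2]
        w, h = rois[:, 3] - x1, rois[:, 4] - y1
        bins = []
        for i in range(k):               # bin rows
            for j in range(k):           # bin columns
                # shift each bin by gamma * offset * RoI size, then re-pool it
                bx = x1 + j * w / k + self.gamma * off[:, i, j, 1] * w
                by = y1 + i * h / k + self.gamma * off[:, i, j, 0] * h
                bin_rois = torch.stack([rois[:, 0], bx, by, bx + w / k, by + h / k], dim=1)
                bins.append(roi_align(feat, bin_rois, output_size=(1, 1)))
        return torch.cat(bins, dim=-1).view(-1, feat.shape[1], k, k)

feat = torch.randn(1, 8, 32, 32)
rois = torch.tensor([[0., 4., 4., 24., 28.]])
print(DeformableRoIPool(8)(feat, rois).shape)    # torch.Size([1, 8, 3, 3])
```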
Deformable Position-Sensitive RoI Pooling
● Differs from deformable RoI pooling by using a different set of feature maps (score maps) for each bin position
Deformable Convolution and RoI Pooling Summary
● Inference: offsets depend on the input features
● Learning: offsets are learned from data
● Filters are differentiable
Method Details
● Offsets are fractional → bilinear interpolation (see the sketch below)
● For (PS) RoI pooling, normalized offsets must be used
● The number of additional parameters:
  – Convolution and RoI pooling:
  – PS RoI pooling:
● The learning rate for offsets can be different
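A minimal sketch of the bilinear interpolation used to sample a feature map at a fractional location (plain NumPy, illustrative names):

```python
import numpy as np

def bilinear(feat, y, x):
    """Sample a (H, W) feature map at the fractional location (y, x)."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    y0, x0 = max(y0, 0), max(x0, 0)
    wy, wx = y - np.floor(y), x - np.floor(x)
    # Weighted sum of the 4 nearest integer locations; the weights are
    # (sub)differentiable in (y, x), so gradients flow to the offsets.
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

feat = np.arange(16.0).reshape(4, 4)
print(bilinear(feat, 1.5, 2.25))   # 8.25, a mix of feat[1:3, 2:4]
```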
PS RoI Offsets Examples
● One 3x3 deformable PS RoI pooling layer
● Input: a bounding box with a label
PS RoI Offsets Examples (figure slide)
Conv Offsets Examples
● Three consecutive 3x3 deformable convolutional layers = 9³ = 729 sampling points
Conv Example – Man and a Goat
● Blue dots – standard convolution sampling locations
● Red dots – deformable convolution sampling locations
● For 1, 2 and 3 consecutive layers
Conv Example – Man and a Goat
● Center of convolution on a man, the sky and grass
● For 3 consecutive layers
Conv Example – Man and a Goat
● The magnitude of the offsets
● For 3 consecutive layers – res5a, res5b and res5c
Conv Example – Man and a Goat
● The anisotropic scale HSV visualization
● Red – horizontal, green – vertical
● For 3 consecutive layers
Conv Example – Man and a Goat
● Offsets HSV visualization
● For 3 consecutive layers
Conv Example – Cars
● The magnitude of the offsets
● For 3 consecutive layers
● The foreground-background separation can be seen
Affine Transformation Approximation
● The "unknown and complex" transformation was approximated by an affine transformation (a least-squares sketch follows below)
● Format is MEAN (STD); the first value refers to the vertical axis
● Unit is pixels in the feature map

                      Man and a Goat        Cars
Mean squared error    3.1 (1.5)             2.7 (1.4)
Scale                 3.4, 3.7 (0.8, 1.1)   2.9, 3.6 (1.0, 1.1)
Translation           0.8, 0.0 (1.3, 0.2)   0.3, 0.0 (1.2, 0.1)
Rotation              -0.1 (0.0)            -0.1 (0.0)
Shear                 0.0 (0.0)             0.0 (0.0)

● Other tested images had similar results
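A sketch of how such an affine approximation can be obtained by least squares, assuming pairs of regular and deformed sampling locations; the point data below is synthetic:

```python
import numpy as np

# p: regular grid sampling locations (N, 2) as (y, x);
# q: the corresponding deformed locations (N, 2)
p = np.array([[y, x] for y in (-1, 0, 1) for x in (-1, 0, 1)], dtype=float)
q = p * [3.4, 3.7] + [0.8, 0.0]   # synthetic: scale + translation, as in the table

A = np.hstack([p, np.ones((len(p), 1))])         # homogeneous coordinates [p, 1]
M, res, *_ = np.linalg.lstsq(A, q, rcond=None)   # least squares: q ≈ [p, 1] @ M
print(M.T)   # 2x3 affine matrix: scale/rotation/shear block plus translation
```

Decomposing the fitted 2x3 matrix then yields the scale, translation, rotation, and shear entries reported in the table.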
Statistics of Learned Scale – Effective Dilation
● The mean of the distances between all adjacent pairs of sampling locations in the deformable convolution filter (see the sketch below)
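A minimal sketch of this statistic for one 3x3 filter (plain NumPy, illustrative input):

```python
import numpy as np

def effective_dilation(locs):
    """locs: (3, 3, 2) array of (y, x) sampling locations for one filter."""
    dists = []
    for i in range(3):
        for j in range(3):
            if j + 1 < 3:   # horizontally adjacent pair
                dists.append(np.linalg.norm(locs[i, j + 1] - locs[i, j]))
            if i + 1 < 3:   # vertically adjacent pair
                dists.append(np.linalg.norm(locs[i + 1, j] - locs[i, j]))
    return np.mean(dists)

grid = np.stack(np.meshgrid([-1, 0, 1], [-1, 0, 1], indexing="ij"),
                axis=-1).astype(float)
print(effective_dilation(grid))         # 1.0 for the regular, undeformed grid
print(effective_dilation(grid * 2.5))   # larger learned scale -> larger dilation
```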
Remarks
● The shift is a function of the feature maps and is not constrained to any (e.g. affine) transformation
● Surprisingly, there is no need for shift regularization
Relation to Deformable Part Models
● Maximize the similarity of parts while minimizing the inter-part connection cost
● Inference can be converted to a CNN; learning is not end-to-end
● Deformable convolutions: no spatial relations between parts, unlimited in modeling deformations
Relation to Spatial Transformer Networks
1. Localization net
   ● Input: feature map
   ● Output: affine transformation
2. Grid generator
   ● Generates a sampling grid according to the transformation
3. Sampler (a minimal sketch of the pipeline follows)
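A minimal sketch of an STN block in PyTorch; the tiny localization net is illustrative, not the one from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1. Localization net: feature map -> 6 affine parameters (one global theta)
        self.loc = nn.Sequential(nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                                 nn.Linear(channels * 16, 6))
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0, 0, 0, 1, 0]))  # identity

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)  # 2. grid generator
        return F.grid_sample(x, grid, align_corners=False)          # 3. sampler

y = STN(8)(torch.randn(2, 8, 32, 32))   # same shape, globally warped
```

Note the contrast: the STN predicts one global parametric warp per feature map, while deformable convolution predicts a free-form local offset for every sampling location.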
Relation to Spatial Transformer Networks
● Can be inserted between any two layers
● Deformable convolutions:
  – No global parametric transformation
  – Easier training
Relation to Atrous / Dilated Convolutions
● Exponential expansion of the receptive field
● Deformable convolutions: an input-dependent and learnable dilated convolution
● Both can replace filters with a larger receptive field while constraining their connectivity (see the sketch below)
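A short contrast in PyTorch: a fixed dilated convolution versus the input-dependent spacing of deformable convolution (illustrative shapes):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)

# Dilated convolution: a fixed gap of 2 pixels between the 3x3 taps,
# enlarging the receptive field from 3x3 to 5x5 with the same 9 weights.
dilated = nn.Conv2d(8, 16, kernel_size=3, dilation=2, padding=2)
print(dilated(x).shape)   # torch.Size([1, 16, 32, 32])

# Stacking layers with dilation 1, 2, 4, ... expands the receptive field
# exponentially; deformable convolution instead predicts the tap spacing
# per location from the input (see the earlier DeformableConv2d sketch).
```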
Relation to Active Convolution
● Learns the shape of the convolution during training; the learned offsets are static parameters shared across locations
● Deformable convolutions: input-dependent offsets
Relation to Dynamic Filter Networks
● The convolution weights are generated from the input feature map
● Deformable convolutions: the same idea, but for offsets instead of weights
Their Task
● Semantic segmentation
● Object detection
Their Setup
State-of-the-art object detection and semantic segmentation CNNs:
1. A deep network generates feature maps
   – Replace the last 3 conv layers with deformable ones
2. A shallow task-specific network generates the results
   – Replace (PS) RoI pooling with its deformable counterpart
Convolutions and offsets are learned simultaneously (a sketch of the layer swap follows).
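A minimal sketch of the backbone swap, assuming a recent torchvision ResNet-50 and reusing the DeformableConv2d class sketched earlier; which layers to replace and all names here are illustrative:

```python
import torchvision

backbone = torchvision.models.resnet50(weights=None)

# Replace the 3x3 convolutions of the last stage (layer4, i.e. res5a/b/c)
# with deformable ones, keeping channel counts, padding, and strides.
for block in backbone.layer4:
    old = block.conv2
    block.conv2 = DeformableConv2d(old.in_channels, old.out_channels, k=3,
                                   padding=old.padding[0], stride=old.stride[0])
```

With the offset branches initialized to zero, the swapped network behaves like the original at the start of training, and the regular weights and the offsets are then learned simultaneously.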
Results
● Object detection (deformable vs. plain)
  – VOC 07: 82.3 vs. 79.6 mAP@0.5
  – COCO: 56.8 vs. 54.3 mAP@0.5
● Semantic segmentation (deformable vs. plain)
  – Cityscapes: 75.2 vs. 70.3 mIoU
  – VOC 12: 75.9 vs. 70.7 mIoU
● Others' results
  – COCO (with Soft-NMS): 62.8 mAP@0.5
Paper Evaluation – Formal Objections
● Page 2, formula (2) – the notation Δp_n is misleading since it depends on p_0
● Page 3, paragraph 3 – a scalar gamma further scales the normalized offsets; it is empirically set to 0.1
● Page 5, figure 4 – the figure is misleading; the output feature map has depth (C+1)
Paper Evaluation – Subjective Objections
● Page 3, paragraphs 1 and 2 – the notation is ambiguous
● The application to max pooling is missing
References
● Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in Neural Information Processing Systems. 2015.
● Jeon, Yunho, and Junmo Kim. "Active Convolution: Learning the Shape of Convolution for Image Classification." arXiv preprint arXiv:1703.09076 (2017).
● Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." arXiv preprint arXiv:1511.07122 (2015).
● Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained part-based models." IEEE Transactions on Pattern Analysis and Machine Intelligence 32.9 (2010): 1627-1645.
● De Brabandere, Bert, et al. "Dynamic filter networks." Advances in Neural Information Processing Systems. 2016.