Visual Place Recognition as Image Retrieval with CNN
Giorgos Tolias, Visual Recognition Group, CTU in Prague
CVPR 2017 tutorial on Large-Scale Visual Place Recognition and Image-Based Localization
Alex Kendall, Torsten Sattler, Giorgos Tolias, Akihiko Torii
Visual place recognition
Visual place recognition by image retrieval
Query (image) → query descriptor → Nearest Neighbor search ← Descriptors for database images
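The pipeline on this slide can be sketched in a few lines of NumPy. This is a minimal sketch, assuming descriptors are compared by inner product after L2 normalization (cosine similarity) and that the database is small enough for brute-force search. The helper names `l2_normalize` and `nearest_neighbors` and the toy vectors are illustrative and do not come from the tutorial.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def nearest_neighbors(query_desc, db_descs, top_k=5):
    """Rank database images by inner product with the query descriptor."""
    q = l2_normalize(query_desc)
    db = l2_normalize(db_descs)
    scores = db @ q                      # cosine similarity per database image
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return order, scores[order]

# Toy example: 4 database descriptors of dimension 3 (made-up values).
db = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.7, 0.7, 0.0],
               [0.0, 0.0, 1.0]])
query = np.array([0.9, 0.1, 0.0])
ranked_ids, ranked_scores = nearest_neighbors(query, db, top_k=2)
```

In a place recognition setting, the location attached to the top-ranked database image would then serve as the estimate for the query.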
http://viral.image.ntua.gr
CNN as feature extractors
• CNN pre-trained for image classification
• Internal layer activations as features
• Good generalization properties
• Detection
• Fine-grained classification
• Scene classification
• Semantic segmentation
• …
Figure from Razavian et al.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. In: arXiv:1310.1531. (2013)
Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: An astounding baseline for recognition. In: CVPRW. (2014)
Image retrieval with pre-trained CNN
Global image representation – FC layer
• Features: FC layer activations
• Resize/crop to fixed image size
Figure from Babenko et al.
Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: ECCV. (2014)
Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: An astounding baseline for recognition. In: CVPRW. (2014)
Global image representation – Conv layer
• Features: Conv layer activations
• Global max or sum pooling
• Any input image size
• Better to use last Conv layer (VGG, AlexNet)
Figure from Razavian et al.
Azizpour, H., Razavian, A.S., Sullivan, J., Maki, A., Carlsson, S.: From generic to specific deep representations for visual recognition. In: CVPRW. (2015)
Babenko, A., Lempitsky, V.: Aggregating deep convolutional features for image retrieval. In: ICCV. (2015)
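Global pooling is what makes the descriptor independent of the input size: the spatial dimensions are collapsed and only the channel dimension remains. The following is a minimal sketch, assuming a feature map stored as a (K, H, W) array of non-negative activations; the random arrays stand in for real network outputs and `global_pool` is a hypothetical helper name.

```python
import numpy as np

def global_pool(feature_map, mode="max"):
    """Pool a (K, H, W) conv feature map into a K-dim, L2-normalized descriptor."""
    if mode == "max":
        desc = feature_map.max(axis=(1, 2))
    elif mode == "sum":
        desc = feature_map.sum(axis=(1, 2))
    else:
        raise ValueError("mode must be 'max' or 'sum'")
    return desc / max(np.linalg.norm(desc), 1e-12)

# Two stand-in feature maps with different spatial sizes (ReLU-like values).
rng = np.random.default_rng(0)
small = np.maximum(rng.standard_normal((512, 7, 7)), 0)
large = np.maximum(rng.standard_normal((512, 20, 31)), 0)
d_small = global_pool(small, "max")
d_large = global_pool(large, "sum")
```

Both descriptors have 512 dimensions even though the two maps differ in height and width.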
Spatial and channel weighting
• Channel-wise and spatial-wise weighting
• Global sum pooling
• Channel-wise: IDF-like weighting
• Spatial-wise: saliency mask by L2 norm
Figures from Kalantidis et al.
Kalantidis, Y., Mellina, C., Osindero, S.: Cross-dimensional weighting for aggregated deep convolutional features. In: ECCVW. (2016)
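The two weightings can be illustrated with a simplified sketch. It assumes that the spatial weight of a location is the channel-summed activation, L2-normalized over the map, and that the channel weight grows as a channel fires at fewer locations (an IDF-like rule). The exact formulas in Kalantidis et al. may differ in details such as exponents and smoothing terms, and all function names here are hypothetical.

```python
import numpy as np

def spatial_weights(fmap, eps=1e-12):
    """Saliency-style map: sum over channels, L2-normalized over all locations."""
    s = fmap.sum(axis=0)
    return s / max(np.linalg.norm(s), eps)

def channel_weights(fmap, eps=1e-12):
    """IDF-like weights: channels that fire at fewer locations weigh more."""
    q = (fmap > 0).reshape(fmap.shape[0], -1).mean(axis=1)  # firing frequency
    return np.log(1.0 + q.sum() / (q + eps))

def weighted_sum_pool(fmap):
    """Spatially weighted sum pooling followed by channel weighting."""
    pooled = (fmap * spatial_weights(fmap)[None, :, :]).sum(axis=(1, 2))
    desc = pooled * channel_weights(fmap)
    return desc / max(np.linalg.norm(desc), 1e-12)

# Toy map: channel 0 fires everywhere, channel 1 fires at one of four locations.
toy = np.array([[[1.0, 1.0], [1.0, 1.0]],
                [[0.0, 0.0], [0.0, 2.0]]])
w_channel = channel_weights(toy)
desc = weighted_sum_pool(toy)
```

In the toy map the sparse channel receives the larger channel weight, which mirrors how rare visual words are boosted by IDF in bag-of-words retrieval.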
Maximum Activations of Convolutions - MAC
Input image → conv 5 filter 1, conv 5 filter 2, …, conv 5 filter i, …, conv 5 filter K → maximum activation per filter
Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)
MAC similarity
• Similarity: inner product of L2-normalized MAC descriptors
• Max of the same feature map fires on the same location
• Implicitly forms correspondences (512 for VGG)
Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)
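The inner product of two MAC descriptors is a sum of one term per channel, and each term pairs the location of the maximum in one image with the location of the maximum in the other. The sketch below makes this explicit. It assumes (K, H, W) feature maps, and the second "image" is simply a circularly shifted copy of the first, which is an artificial stand-in for the same object seen at a different position.

```python
import numpy as np

def mac(fmap):
    """MAC: per-channel spatial maximum of a (K, H, W) map, L2-normalized."""
    v = fmap.reshape(fmap.shape[0], -1).max(axis=1)
    return v / max(np.linalg.norm(v), 1e-12)

def mac_similarity(fmap_a, fmap_b):
    """Inner product of MAC descriptors plus the per-channel argmax locations."""
    a, b = mac(fmap_a), mac(fmap_b)
    contributions = a * b  # one term per channel (512 terms for VGG conv5)
    loc_a = [np.unravel_index(c.argmax(), c.shape) for c in fmap_a]
    loc_b = [np.unravel_index(c.argmax(), c.shape) for c in fmap_b]
    return contributions.sum(), contributions, list(zip(loc_a, loc_b))

rng = np.random.default_rng(0)
fa = np.maximum(rng.standard_normal((8, 5, 5)), 0)
fb = np.roll(fa, 1, axis=2)  # same content, shifted one column to the right
sim, per_channel, pairs = mac_similarity(fa, fb)
```

The similarity is unaffected by the shift, while each pair of argmax locations differs by exactly one column, which is the implicit correspondence mentioned on the slide.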
Regional Maximum Activations of Convolutions R-MAC
• Extract MAC descriptor per region
• Sum pool regional descriptors
• Global image representation (same dimensionality as MAC)
• PCA whitening
Per-region pipeline: MAC → L2 norm → whitening → L2 norm
Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)
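A compact sketch of the aggregation is given below. It assumes a simplified multi-scale grid of square regions, which is illustrative and not the exact region sampling of the paper, and it treats whitening as an optional pair `(mean, P)` learned offline. The names `square_regions`, `rmac`, and `whiten` are hypothetical.

```python
import numpy as np

def l2n(v, eps=1e-12):
    """L2-normalize a vector."""
    return v / max(np.linalg.norm(v), eps)

def square_regions(H, W, scales=(1, 2, 3)):
    """Simplified multi-scale grid of square regions as (y0, x0, size)."""
    regions = []
    for l in scales:
        size = min(int(np.ceil(2 * min(H, W) / (l + 1))), H, W)
        ny = max(l, int(np.ceil(H / size)))  # enough positions to cover the map
        nx = max(l, int(np.ceil(W / size)))
        ys = np.unique(np.linspace(0, H - size, ny).round().astype(int))
        xs = np.unique(np.linspace(0, W - size, nx).round().astype(int))
        regions += [(y, x, size) for y in ys for x in xs]
    return regions

def rmac(fmap, whiten=None):
    """fmap: (K, H, W). whiten: optional (mean, P) pair learned offline."""
    K, H, W = fmap.shape
    agg = np.zeros(K)
    for y, x, s in square_regions(H, W):
        v = l2n(fmap[:, y:y + s, x:x + s].max(axis=(1, 2)))  # regional MAC
        if whiten is not None:
            mean, P = whiten
            v = l2n(P @ (v - mean))  # PCA whitening, then L2 norm again
        agg += v                     # sum pooling of regional descriptors
    return l2n(agg)

rng = np.random.default_rng(0)
fmap = np.maximum(rng.standard_normal((16, 6, 9)), 0)
d = rmac(fmap)
```

The output has K dimensions regardless of how many regions are used, which is why R-MAC keeps the dimensionality of MAC.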
Comparison with local feature based methods

Method                      Oxf5k   Oxf105k   Par6k   Par106k
BoW(16M) + geometry + QE    84.9    79.5      82.4    77.3
Hamming Query Expansion     88.0    84.0      82.8    -
Triangulation Emb. 1024D    56.0    50.2      -       -
R-MAC (512D)                66.9    61.6      83.0    75.7

Local feature based methods (first two rows): 3-4k features / image, memory demanding
Compact representations (last two rows): one descriptor / image
Object localization Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)
Object localization with integral max pooling Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)
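Evaluating the MAC of many candidate windows is what makes localization costly, and integral max pooling addresses this by approximating the maximum with a generalized mean whose sums can be read from integral images. The sketch below assumes non-negative activations and an exponent of 10; the function names are hypothetical and the code covers only the region descriptor, not the window search itself.

```python
import numpy as np

ALPHA = 10.0  # exponent of the generalized-mean approximation (assumed value)

def integral_images(fmap, alpha=ALPHA):
    """Cumulative sums of fmap**alpha with a zero border: shape (K, H+1, W+1)."""
    K, H, W = fmap.shape
    ii = np.zeros((K, H + 1, W + 1))
    ii[:, 1:, 1:] = (fmap ** alpha).cumsum(axis=1).cumsum(axis=2)
    return ii

def approx_region_mac(ii, y0, x0, y1, x1, alpha=ALPHA):
    """Approximate per-channel max over rows y0..y1-1 and cols x0..x1-1."""
    s = ii[:, y1, x1] - ii[:, y0, x1] - ii[:, y1, x0] + ii[:, y0, x0]
    return np.maximum(s, 0.0) ** (1.0 / alpha)

rng = np.random.default_rng(0)
fmap = rng.random((4, 8, 8))                   # stand-in non-negative activations
ii = integral_images(fmap)
approx = approx_region_mac(ii, 2, 1, 6, 5)     # a 4x4 window
exact = fmap[:, 2:6, 1:5].max(axis=(1, 2))
```

Each window costs four lookups per channel after the one-off integral image computation. The approximation never falls below the true maximum and exceeds it by at most a factor of n^(1/alpha) for a window with n locations.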
Object localization with integral max pooling
[Figure: retrieval examples comparing initial ranking (IR) and re-ranking (RR)]
• Under 3 seconds to re-rank 1000 images using 1 CPU thread
Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max pooling of CNN activations. In: ICLR. (2016)
Comparison with local features, geometry, and query expansion

Method                       Oxf5k   Oxf105k   Par6k   Par106k
BoW(16M) + geometry + QE     84.9    79.5      82.4    77.3
Hamming Query Expansion      88.0    84.0      82.8    -
R-MAC (512D)                 66.9    61.6      83.0    75.7
R-MAC + localization + QE    77.3    73.2      86.5    79.8
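Query expansion (QE) with global descriptors can be as simple as averaging the query with its top-ranked results and searching again. The following is a minimal sketch of that average query expansion, assuming L2-normalized descriptors; it may differ in detail from the variant behind the numbers in the table, and the 2-D toy vectors are made up for illustration.

```python
import numpy as np

def average_query_expansion(query, db, top_n=5):
    """db: (N, D) L2-normalized descriptors; query: (D,) L2-normalized."""
    first_pass = np.argsort(-(db @ query))[:top_n]
    expanded = query + db[first_pass].sum(axis=0)  # merge query with top results
    expanded /= max(np.linalg.norm(expanded), 1e-12)
    return np.argsort(-(db @ expanded)), expanded

def unit(angle_deg):
    """Unit vector in 2-D at the given angle (toy descriptor)."""
    a = np.deg2rad(angle_deg)
    return np.array([np.cos(a), np.sin(a)])

query = unit(0)
db = np.stack([unit(30), unit(60), unit(-50)])
first_pass = np.argsort(-(db @ query))
ranking, expanded = average_query_expansion(query, db, top_n=1)
```

In the toy data the image at 60 degrees ranks last in the first pass but moves ahead of the one at -50 degrees after expansion, because the expanded query is pulled toward the top result at 30 degrees.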
Other approaches
Known encodings applied on CNN local descriptor
• Bag-of-Words: Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., Marqués, F.: Bags of Local Convolutional Features for Scalable Instance Search. In: ICMR. (2016)
• Fisher vectors: Kulkarni, P., Zepeda, J., Jurie, F., Perez, P., Chevallier, L.: Hybrid multi-layer deep CNN/aggregator feature for image classification. In: ICASSP. (2015)
• VLAD: Gong, Y., Wang, L., Guo, R., Lazebnik, S.: Multi-scale Orderless Pooling of Deep Convolutional Activation Features. In: ECCV. (2014)
Figure from Mohedano et al. Figure from Gong et al.
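As an illustration of applying a known encoding to CNN local descriptors, the sketch below computes a VLAD vector. It assumes that every spatial location of a (K, H, W) feature map is treated as one K-dimensional local descriptor and that the codebook is given. Here the centroids are random placeholders, whereas in practice they would typically come from k-means on training descriptors.

```python
import numpy as np

def vlad(local_descs, centroids):
    """local_descs: (N, D); centroids: (C, D). Returns a (C*D,) unit vector."""
    d2 = ((local_descs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)  # hard assignment to the nearest centroid
    out = np.zeros_like(centroids, dtype=float)
    for c in range(centroids.shape[0]):
        members = local_descs[assign == c]
        if len(members):
            out[c] = (members - centroids[c]).sum(axis=0)  # sum of residuals
    out = out.ravel()
    return out / max(np.linalg.norm(out), 1e-12)

rng = np.random.default_rng(0)
fmap = np.maximum(rng.standard_normal((8, 5, 6)), 0)
local_descs = fmap.reshape(8, -1).T        # 30 local descriptors of dimension 8
centroids = rng.standard_normal((4, 8))    # placeholder codebook (assumption)
v = vlad(local_descs, centroids)
```

The output dimensionality is the number of centroids times the descriptor dimension, so it grows with the codebook size, unlike the pooled descriptors discussed earlier.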
Off-the-shelf CNN
• Target application: classification
• Training dataset: ImageNet
• Architecture: AlexNet, VGG, ResNet
• Directly applicable to other tasks: fine-grained classification, object detection, image retrieval
Images from ImageNet.org and PASCAL VOC 2012
CNN fine-tuning for image retrieval
Lots of Training Examples
Large Internet photo collection → Training → Convolutional Neural Network (CNN)
Image annotations:
• Not accurate, expensive $$
• Manual cleaning of the training data done by researchers: very expensive $$$$
• Automated extraction of training data: accurate, free $
Annotations for CNN Image Retrieval
• CNN pre-trained for classification task used for retrieval [Gong et al. ECCV’14, Babenko et al. ICCV’15, Kalantidis et al. ECCVW’16, Tolias et al. ICLR’16] (figure label: building class)
• Fine-tuned CNN using a dataset with landmark classes [Babenko et al. ECCV’14] (figure label: landmark class)
• NetVLAD: weakly supervised fine-tuned CNN using GPS tags [Arandjelovic et al. CVPR’16] (figure label: spatially closest ≠ matching)
NetVLAD
• W×H×D feature map of last conv. layer
• Negatives: geographically far
• Positives: geographically close and close in the feature space
Figures from Arandjelovic et al.
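The way positives and negatives are chosen translates into a weakly supervised ranking loss. GPS tags only yield potential positives, so the sketch below keeps the potential positive that is closest to the query in feature space and pushes every geographically far negative beyond it by a margin. This is a minimal sketch, assuming squared Euclidean distances and a hypothetical margin of 0.1; it illustrates the selection rule and is not the authors' training code.

```python
import numpy as np

def weak_triplet_loss(q, potential_pos, negatives, margin=0.1):
    """q: (D,), potential_pos: (P, D), negatives: (N, D) descriptors."""
    d_pos = ((potential_pos - q) ** 2).sum(axis=1)
    d_neg = ((negatives - q) ** 2).sum(axis=1)
    best_pos = d_pos.min()  # geographically close AND closest in feature space
    return np.maximum(0.0, best_pos + margin - d_neg).sum()

q = np.array([1.0, 0.0])
potential_pos = np.array([[1.0, 0.0],    # shows the same scene
                          [0.0, 1.0]])   # nearby GPS tag, different view
easy_loss = weak_triplet_loss(q, potential_pos, np.array([[-1.0, 0.0]]))
hard_loss = weak_triplet_loss(q, potential_pos, np.array([[1.0, 0.1]]))
```

Taking the minimum over potential positives is what lets training tolerate images that are geographically close but look at a different part of the scene.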
Annotations for CNN Image Retrieval
• CNN pre-trained for classification task used for retrieval [Gong et al. ECCV’14, Babenko et al. ICCV’15, Kalantidis et al. ECCVW’16, Tolias et al. ICLR’16] (figure label: building class)
• Fine-tuned CNN using a dataset with landmark classes [Babenko et al. ECCV’14] (figure label: landmark class)
• NetVLAD: weakly supervised fine-tuned CNN using GPS tags [Arandjelovic et al. CVPR’16] (figure label: spatially closest ≠ matching)
• Automatic annotations for CNN training [Radenovic et al. ECCV’16]: hard negatives, hard positives
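Hard negative mining can be sketched as follows: among images known not to match the query, keep those that the current descriptors rank highest. The sketch assumes each image carries an automatically obtained group label (for example, an ID of the scene it belongs to); this labeling scheme and the function name `hard_negatives` are illustrative assumptions and are not taken from the slide.

```python
import numpy as np

def hard_negatives(query_desc, query_label, db_descs, db_labels, k=2):
    """Indices of the k most similar database images with a different label."""
    scores = db_descs @ query_desc
    candidates = np.flatnonzero(db_labels != query_label)  # non-matching images
    return candidates[np.argsort(-scores[candidates])[:k]]

db_descs = np.array([[1.0, 0.0],
                     [0.9, 0.1],
                     [0.0, 1.0],
                     [0.8, 0.2]])
db_labels = np.array([0, 1, 1, 2])
hard = hard_negatives(np.array([1.0, 0.0]), 0, db_descs, db_labels, k=2)
```

Because the selection depends on the current descriptors, it would normally be repeated as the network changes during training.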