1 Feature-based Place Recognition Akihiko Torii Tokyo Tech CVPR 2017 tutorial on Large-Scale Visual Place Recognition and Image-Based Localization Alex Kendall, Torsten Sattler, Giorgos Tolias, Akihiko Torii
2 Feature-based Place Recognition • Introduction • Challenges in large-scale place recognition • Local feature based image representation • Instance-level recognition to place recognition • Datasets and evaluation protocol
3 Introduction
4 Visual Place Recognition Where? Search box
5 The approach • Represent the world by a set of geotagged images • Given a query image, find the best matching image • Transfer the geotag of the best matching image Query: https://www.google.co.jp/maps/@35.6066354,139.6861582,3a,45.2y,256.68h,96.58t
6 Why is this interesting? • Mapping/organizing any photos on the globe • Recognition & geometry
7 (Visual) place recognition [Knopp10,Torii13, Maddern14, Johns14, Sunderhauf15] Location recognition [Cao13, Arandjelovic14, Sattler15] Landmark identification [Chen11] Geo-localization [Hays08, Cummins08, Zamir10, Zamir16, Kim17]
8 Challenges in large-scale place recognition
9 Sources of geotagged images Photo community sites (Flickr, Instagram, …) + Never stop growing - Noisy images/tags, concentrated on landmarks StreetView images (Google StreetView, Mapillary, …) + Accurate, covering almost all streets - Not (or cannot be) updated frequently Generate perspective cutouts [Gronat11, Chen11, Torii13]
10 Sources of geotagged images San Francisco Landmarks dataset [Chen11] • 1.06M images for 6 x 6 km² Spatial and temporal densities may increase by collecting all the data from autonomous driving! However, it is impossible to monitor all the streets all the time. Figures from: [Chen-CVPR11]
11 Temporal sparseness induces … • Lighting (day-night) • Structure (poster) (Figure: database image and query images on a time axis, 2014/07 and 2014/10/08)
12 Spatial sparseness induces … • Viewpoints • Occlusions (Figure axes: space, time)
13 It is actually a mixture of … lighting, structures, viewpoints, occlusions (Figure axes: space, time)
14 Why is this difficult? • Temporal gap - Lighting, weather, season, structures, moving objects • Spatial gap - Viewpoints, self-occlusions • Large scale - Inter/intra repetitions, saturations of features
15 Local feature based image representation
16 Visual instance recognition Geotagged image database. Design an “image representation” extractor f(I). (Figure: the query f( ) and the database images f( ) plotted as points in the image representation space; GPS is transferred to the query.)
17 Review: Visual instance recognition Image I → Extract local features → Aggregate → f(I). Compact yet discriminative image representation f(I), e.g. BoW, VLAD, FV
18 Review: Bag of Words (BoW) [Sivic03] • Local feature detection & description (DoG+SIFT) • 0/1 assignment of desc. i to cluster k
19 Review: Bag of Words (BoW) [Sivic03] • Local feature detection & description (DoG+SIFT) • 0/1 assignment of desc. i to cluster k • Sum over all N descriptors in the image: B = [ 1, 0, 2, 1, … ]
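The two steps on this slide (0/1 assignment, then a sum over all descriptors) can be sketched in NumPy. This is a minimal illustration rather than the tutorial's code: the name `bow_histogram`, the brute-force nearest-centroid search, and the toy 2-D centroids and descriptors are assumptions, chosen so that the result reproduces the slide's example B = [1, 0, 2, 1].

```python
import numpy as np

def bow_histogram(descriptors, centroids):
    """Hard-assign each descriptor to its nearest centroid and count per cluster."""
    # Squared Euclidean distances, shape (N descriptors, K clusters).
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)            # 0/1 assignment of desc. i to cluster k
    return np.bincount(nearest, minlength=len(centroids))  # sum over descriptors

# Toy vocabulary of K = 4 visual words in a 2-D descriptor space (hypothetical).
centroids = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
descriptors = np.array([[0.5, 0.2], [0.1, 9.0], [0.3, 9.5], [9.0, 9.9]])

B = bow_histogram(descriptors, centroids)
print(B)  # [1 0 2 1]
```

With a vocabulary of millions of words the brute-force distance matrix is not practical, so an approximate nearest-neighbour search would replace it.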
20 Review: Vector of Locally Aggregated Descriptors (VLAD) [Jégou10b] • 0/1 assignment of desc. i to cluster k • Residual vector (descriptor minus its cluster center)
21 Review: Vector of Locally Aggregated Descriptors (VLAD) [Jégou10b] • 0/1 assignment of desc. i to cluster k • Residual vector (descriptor minus its cluster center) • Sum over all N descriptors in the image: V = [ … ] (the per-cluster residual sums, concatenated)
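A minimal NumPy sketch of the VLAD aggregation described on this slide: hard assignment, residual to the assigned center, sum per cluster, concatenation. The name `vlad`, the toy data, and the optional final L2 normalization are assumptions for illustration; the normalization is a common practice that the slide does not mention.

```python
import numpy as np

def vlad(descriptors, centroids, normalize=True):
    """Concatenate, per cluster, the summed residuals of its assigned descriptors."""
    K, D = centroids.shape
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                    # 0/1 assignment
    V = np.zeros((K, D))
    for i, k in enumerate(nearest):
        V[k] += descriptors[i] - centroids[k]      # residual vector, summed per cluster
    V = V.ravel()                                  # dimension = #clusters x feature dim
    if normalize:                                  # assumption: not stated on the slide
        norm = np.linalg.norm(V)
        if norm > 0:
            V = V / norm
    return V

centroids = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
descriptors = np.array([[0.5, 0.2], [0.1, 9.0], [0.3, 9.5], [9.0, 9.9]])

V_raw = vlad(descriptors, centroids, normalize=False)
V = vlad(descriptors, centroids)
print(V_raw.shape)  # (8,)
```

Unlike the BoW histogram, the output length depends on the descriptor dimension, which is why a small vocabulary is usually enough.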
22 BoW vs. VLAD (FV)
BoW: B = [ 1, 0, 2, 1, … ]; Dim. = #clusters (e.g. 1.6M) + Can be a sparse histogram (using a large vocab.) + Can provide matches
VLAD (FV): V = [ … ]; Dim. = #clusters x Dim. of feature (e.g. 256K x 128K) + Performs well with a small vocab. + Can be compressed by PCA with a small loss in performance + No extra memory requirement to encode more features
23 Matching BoW histograms Q = [ 1, 0, 2, 1, … ] B = [ 1, 0, 2, 1, … ]
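The slide text does not preserve how Q and B are compared, so the following is only one common choice: cosine similarity between the query histogram and each database histogram, followed by ranking. The names `bow_similarity` and `rank_database` and the toy database are assumptions; tf-idf weighting and an inverted file, which large-scale systems typically add, are left out for brevity.

```python
import numpy as np

def bow_similarity(q, b):
    """Cosine similarity between two BoW histograms (0 if either is empty)."""
    q = np.asarray(q, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(q) * np.linalg.norm(b)
    return float(q @ b / denom) if denom > 0 else 0.0

def rank_database(q, database):
    """Return database indices sorted from most to least similar, with scores."""
    scores = np.array([bow_similarity(q, b) for b in database])
    order = np.argsort(-scores)
    return order, scores[order]

Q = [1, 0, 2, 1]
database = [[0, 3, 0, 0],   # shares no visual word with the query
            [1, 0, 2, 1],   # identical histogram
            [1, 1, 1, 1]]   # partial overlap

order, scores = rank_database(Q, database)
print(order)  # [1 2 0]
```

The geotag of the top-ranked database image would then be transferred to the query.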
27 Sparse to dense features Image I → Extract local features (DoG+SIFT) → Aggregate → f(I)
28 Sparse to dense features Image I → Extract local features (DSIFT, PHOW) → Aggregate → f(I) + No memory overhead (with VLAD) + No bursts +/- Less invariant to viewpoint changes See [Lazebnik06, Bosch07, Iscen15, Torii15]
29 Sparse to dense features to CNN Image I → Extract local features (DSIFT, PHOW) → Aggregate → f(I) (Figure: NetVLAD architecture. Image → Convolutional Neural Network (CNN layers) → W x H x D map interpreted as N x D local descriptors x → NetVLAD layer (pooling layer): soft-assignment (1 x 1 x D x K conv (w,b), soft-max) → VLAD core (c) → intra-normalization → L2 normalization → (K x D) x 1 VLAD vector V) For details, please be patient and wait for the next session! See also an excellent survey paper [Zheng17]!
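The components named in the NetVLAD figure (a 1 x 1 convolution with parameters (w, b), a soft-max, the VLAD core with centers c, intra-normalization, and a final L2 normalization) can be strung together as a forward pass. This is a NumPy sketch under stated assumptions, not the reference implementation: the function name `netvlad_forward`, the random parameters, and the tensor sizes are hypothetical, and in a real network w, b and c are learned.

```python
import numpy as np

def netvlad_forward(X, W, b, C, eps=1e-12):
    """X: (N, D) local descriptors; W: (K, D), b: (K,) soft-assignment; C: (K, D) centers."""
    logits = X @ W.T + b                                # a 1x1 conv is a per-location linear map
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(logits)
    a = a / a.sum(axis=1, keepdims=True)                # soft-max over the K clusters
    # VLAD core: V[k] = sum_i a[i, k] * (X[i] - C[k])
    V = a.T @ X - a.sum(axis=0)[:, None] * C
    V = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), eps)  # intra-normalization
    V = V.ravel()                                       # (K x D) x 1 vector
    return V / max(np.linalg.norm(V), eps)              # L2 normalization

rng = np.random.default_rng(0)
H, Wd, D, K = 4, 5, 8, 3                                # hypothetical sizes
feature_map = rng.standard_normal((H, Wd, D))           # W x H x D conv map
X = feature_map.reshape(-1, D)                          # read as N x D local descriptors
V = netvlad_forward(X, rng.standard_normal((K, D)), rng.standard_normal(K),
                    rng.standard_normal((K, D)))
print(V.shape)  # (24,)
```

Replacing the 0/1 assignment with a soft-max is what makes the aggregation differentiable, so it can be trained together with the CNN layers.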
30 Instance-level recognition to place recognition
31 Designing (sparse) BoW tailored for place recognition tasks (Please forget about VLAD for 5 min)
32 Using advanced techniques • Burstiness weighting [Jegou09] • Soft/multiple assignment [Philbin08, Chen11] • Hamming embedding [Jegou10, Arandjelovic14, Sattler16] • Query expansion [Chum11, Arandjelovic12] • Spatial/Hough pyramid, WGC [Lazebnik06, Jegou10, Tolias14] Figure from [Zheng16]
33 Using advanced techniques • Burstiness weighting [Jegou09] • Soft/multiple assignment [Philbin08, Chen11] • Hamming embedding [Jegou10, Arandjelovic14, Sattler16] • Query expansion [Chum11, Arandjelovic12] • Spatial/Hough pyramid, WGC [Lazebnik06, Jegou10, Tolias14] The challenges in large-scale place recognition • Temporal - Lighting, weather, season, structures, moving objects • Spatial - Viewpoints, self-occlusions • Large scale - Saturations (repetitions)
35 Burstiness weighting [Jegou09] Suppress saturation of BoW by repetitive patterns. (Figure: query and top 3 ranked images) The score is dominated by visual words (VWs) on repeated structures, but removing them loses too much information.
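To illustrate why down-weighting helps without discarding the repeated words, here is a small sketch that damps word counts with a square root before matching. This is a simplified stand-in in the spirit of burstiness weighting, not the exact formulation of [Jegou09]; the name `damp_bursts`, the exponent 0.5, and the toy histograms are assumptions.

```python
import numpy as np

def damp_bursts(hist, power=0.5):
    """Sub-linear damping of visual word counts (power=0.5 is a square root)."""
    return np.power(np.asarray(hist, dtype=float), power)

def bursty_share(q, b, bursty_index):
    """Fraction of the dot-product score contributed by one visual word."""
    contributions = q * b
    return contributions[bursty_index] / contributions.sum()

# The last word lies on a repeated structure (e.g. a facade of identical windows).
q = np.array([1.0, 1.0, 1.0, 25.0])
b = np.array([1.0, 1.0, 1.0, 16.0])

share_raw = bursty_share(q, b, 3)
share_damped = bursty_share(damp_bursts(q), damp_bursts(b), 3)
print(round(share_raw, 3), round(share_damped, 3))  # 0.993 0.87
```

The repeated word still contributes to the score, but it no longer drowns out the three non-repeated words.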
36 Soft/multiple assignment [Philbin08, Chen11] Retrieve matches lost by quantization. (Figure: database image (correct), query image, soft weight)
37 Adaptive assignment [Torii13] 1. Explicitly detect repeated structures 2. Design an adaptive soft-assignment procedure - Repetitions provide a natural soft-assignment 3. Truncate high weights to limit influence of repeated VWs
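A descriptor that falls near a cell boundary can be assigned to several nearby visual words with distance-dependent weights, so a match is not lost when the query and database descriptors land on opposite sides of the boundary. The sketch below uses a Gaussian weighting over the r nearest words; the name `soft_assign` and the values of `r` and `sigma` are assumptions for illustration, not parameters taken from the slide.

```python
import numpy as np

def soft_assign(descriptor, centroids, r=3, sigma=1.0):
    """Return the r nearest visual words and weights that sum to 1."""
    d2 = ((centroids - descriptor) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:r]
    # Subtract the smallest distance before exponentiating to avoid underflow.
    w = np.exp(-(d2[nearest] - d2[nearest].min()) / (2.0 * sigma ** 2))
    return nearest, w / w.sum()

centroids = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
descriptor = np.array([5.0, 0.0])   # exactly between words 0 and 1

words, weights = soft_assign(descriptor, centroids, r=2, sigma=5.0)
print(sorted(words.tolist()), weights)  # [0, 1] [0.5 0.5]
```

Hard assignment would have picked one of the two words arbitrarily; the soft weights keep both candidates.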
38 Hamming embedding [Jegou10] • Subdivide each cell into finer blocks • Give additional binary signature (Figure: database image (correct) and query image; features labeled with 2-bit signatures such as 10, 00, 11, 01)
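One way to realize the "finer blocks plus binary signature" idea is to project each descriptor with a random orthogonal matrix and threshold each projected component at a per-word median, then accept a match only if the visual words agree and the signatures are close in Hamming distance. The sketch below follows that recipe under stated assumptions: the names `learn_he`, `he_signature`, `he_match`, the bit count, and the random training data are all hypothetical.

```python
import numpy as np

def learn_he(train_desc, train_words, n_words, n_bits, rng):
    """Learn a random orthogonal projection and per-word median thresholds."""
    D = train_desc.shape[1]
    Q, _ = np.linalg.qr(rng.standard_normal((D, D)))   # random orthogonal matrix
    P = Q[:n_bits]                                     # keep n_bits rows (n_bits <= D)
    projected = train_desc @ P.T
    thresholds = np.zeros((n_words, n_bits))
    for k in range(n_words):
        members = projected[train_words == k]
        if len(members):
            thresholds[k] = np.median(members, axis=0)
    return P, thresholds

def he_signature(desc, word, P, thresholds):
    """Binary signature: which side of the per-word median each projection falls on."""
    return (P @ desc) > thresholds[word]

def he_match(word_q, sig_q, word_b, sig_b, max_dist):
    """Match only if the visual words agree and the signatures are close."""
    return bool(word_q == word_b and np.count_nonzero(sig_q != sig_b) <= max_dist)

rng = np.random.default_rng(0)
train = rng.standard_normal((200, 8))                  # toy 8-D descriptors
words = rng.integers(0, 4, size=200)                   # toy word assignments
P, tau = learn_he(train, words, n_words=4, n_bits=4, rng=rng)
sig = he_signature(train[0], words[0], P, tau)
print(he_match(words[0], sig, words[0], sig, max_dist=1))  # True
```

The signature refines the match inside a cell at the cost of a few extra bits per feature.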
39 DisLocation [Arandjelovic14] Distinctiveness = inverse of local density = distance to the k-th nearest descriptor (HE signature) = σ See also DisLoc + geometric burstiness [Sattler16] and selective match kernel [Tolias16]
40 Using GPS tags as weak priors
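The quantity σ on this slide, the distance from a descriptor to its k-th nearest neighbour in signature space, can be computed by brute force for a small set. The sketch below does only that; how σ then enters the matching score is not recoverable from the slide and is not modeled here. The name `kth_nn_distance` and the toy 4-bit signatures are assumptions.

```python
import numpy as np

def kth_nn_distance(signatures, k):
    """Hamming distance from each binary signature to its k-th nearest other signature."""
    # Pairwise Hamming distances, shape (N, N); the diagonal is 0 (self-distance).
    d = (signatures[:, None, :] != signatures[None, :, :]).sum(axis=2)
    d_sorted = np.sort(d, axis=1)
    return d_sorted[:, k]          # column 0 is the self-distance, so k >= 1

signatures = np.array([[0, 0, 0, 0],
                       [0, 0, 0, 0],
                       [0, 0, 0, 0],     # three copies: a dense, repetitive region
                       [1, 1, 1, 1]],    # one isolated signature
                      dtype=bool)

sigma = kth_nn_distance(signatures, k=1)
print(sigma)  # [0 0 0 4]
```

The isolated signature gets the largest σ, matching the slide's reading of distinctiveness as the inverse of local density.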
41 Spatially far images should not match (Figure labels: 200m, positive image, many negative data points) [Schindler07] - What are informative features? [Zamir10] - Ratio test with location constraint. [Knopp10, Gronat13, Cao13, Sattler16, …]
42 Detecting “confusing” image regions [Knopp10] Key idea: Spatially far images should not match. Find the most similar images that are spatially far
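The search step on this slide, finding the most similar images among those that are spatially far, needs only similarity scores and geotags. The sketch below is an assumption-laden illustration: the names `haversine_m` and `confusing_images`, the 200 m default for "far", the similarity scores, and the coordinates are all hypothetical, and the similarity itself would come from whatever image matching is in use.

```python
import numpy as np

EARTH_RADIUS_M = 6371000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def confusing_images(scores, db_latlon, ref_latlon, min_dist_m=200.0, top_n=5):
    """Indices of the most similar database images that lie farther than min_dist_m."""
    dist = haversine_m(ref_latlon[0], ref_latlon[1], db_latlon[:, 0], db_latlon[:, 1])
    far = np.flatnonzero(dist > min_dist_m)
    return far[np.argsort(-scores[far])][:top_n]

ref = (35.6066, 139.6862)
db_latlon = np.array([[35.6066, 139.6862],    # same place
                      [35.6166, 139.6862],    # roughly 1.1 km north
                      [35.6566, 139.6862]])   # roughly 5.6 km north
scores = np.array([0.9, 0.8, 0.3])            # hypothetical similarity to the reference

print(confusing_images(scores, db_latlon, ref))  # [1 2]
```

Image regions that match well against these far-away images are the candidates for "confusing" regions.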
43 Detecting “confusing” image regions [Knopp10] Find image areas with high density of local matches Matches with confused images Confusion score Confusing regions
44 Learning per-place linear SVM [Gronat-CVPR13] Key idea: Spatially far images should not match. (Figure label: 200m) Objective function: where h is the squared hinge loss. See also: [Cao13] Similar to Exemplar SVM by [Malisiewicz11]
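The objective formula itself did not survive in the slide text, so the following is not a transcription of it. It is a sketch of a generic exemplar-SVM-style objective that is consistent with what the slide does state: one linear classifier per place, a squared hinge loss h, a positive image, and many spatially far negatives. The name `per_place_objective`, the weights `c_pos` and `c_neg`, and the toy vectors are assumptions.

```python
import numpy as np

def squared_hinge(z):
    """h(z) = max(0, 1 - z)^2."""
    return np.maximum(0.0, 1.0 - z) ** 2

def per_place_objective(w, b, x_pos, X_neg, c_pos=1.0, c_neg=0.01):
    """Regularizer + loss on the single positive + loss summed over far-away negatives."""
    reg = w @ w
    pos = squared_hinge(w @ x_pos + b)              # positive should score >= +1
    neg = squared_hinge(-(X_neg @ w + b)).sum()     # negatives should score <= -1
    return reg + c_pos * pos + c_neg * neg

x_pos = np.array([1.0, 0.0])                        # descriptor of the place's own image
X_neg = np.array([[0.0, 1.0], [0.0, 2.0], [-1.0, 0.0]])  # images far from the place

value_at_zero = per_place_objective(np.zeros(2), 0.0, x_pos, X_neg)
value_at_w = per_place_objective(np.array([1.0, -1.0]), 0.0, x_pos, X_neg)
print(value_at_zero, value_at_w)  # 1.03 2.0
```

At w = (1, -1) every loss term is zero and only the regularizer remains, which shows the trade-off a solver would balance when minimizing this objective.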
45 NetVLAD [Arandjelovic16] GPS only provides weak supervision - Given a query, GPS gives us: • Definite negatives: geographically far from the query
46 Major changes in appearance
47 Changes across time, weather, season Figure from [Maddern14] Figure from [Neubert15] More studied in the robot vision community, e.g. - Generating illumination invariant images [Maddern14] - Learning from repeated recordings [Neubert15]
48 Place recognition under large changes in appearance & illumination [Torii15] • A very challenging dataset that contains major changes in illumination as well as structural changes. (Figure: database image; query images at day, sunset, night)