Efficient visual search of local features - PowerPoint PPT Presentation


SLIDE 1

Efficient visual search of local features

Cordelia Schmid

SLIDE 2

Visual search

change in viewing angle

SLIDE 3

Matches

22 correct matches

SLIDE 4

Image search system for large datasets

Large image dataset (one million images or more): a query image is submitted to the image search system, which returns a ranked image list.

  • Issues for very large databases:
  • to reduce the query time
  • to reduce the storage requirements
  • with minimal loss in retrieval accuracy

SLIDE 5

Two strategies

  • 1. Efficient approximate nearest neighbour search on local feature descriptors.
  • 2. Quantize descriptors into a “visual vocabulary” and use efficient techniques from text retrieval (bag-of-words representation).

SLIDE 6

Strategy 1: Efficient approximate NN search

Images are represented by local features and invariant descriptor vectors.

1. Compute local features in each image independently.
2. Describe each feature by a descriptor vector.
3. Find nearest neighbour vectors between query and database.
4. Rank matched images by number of (tentatively) corresponding regions.
5. Verify top-ranked images based on spatial consistency.

SLIDE 7

Finding nearest neighbour vectors

Establish correspondences between the query image and images in the database by nearest-neighbour matching on SIFT vectors.

(Figure: 128-D descriptor space, model image vs. image database.)

Solve the following problem for all feature vectors xj in the query image:

  NN(xj) = argmin_i ||xi - xj||

where xi are features from all the database images.

SLIDE 8

Quick look at the complexity of the NN-search

N … images; M … regions per image (~1000); D … dimension of the descriptor (~128)

Exhaustive linear search: O(N M^2 D) (M query descriptors, each compared against N·M database descriptors of dimension D)

Example:

  • Matching two images (N=1), each having 1000 SIFT descriptors
  • Nearest-neighbour search: 0.4 s (2 GHz CPU, implementation in C)
  • Memory footprint: 1000 * 128 bytes = 128 kB / image

# of images                          CPU time     Memory req.
N = 1,000                            ~7 min       ~100 MB
N = 10,000                           ~1 h 7 min   ~1 GB
N = 10^7                             ~115 days    ~1 TB
All images on Facebook: N = 10^10    ~300 years   ~1 PB

SLIDE 9

Nearest-neighbour matching

Solve the following problem for all feature vectors xj in the query image:

  NN(xj) = argmin_i ||xi - xj||

where xi are features in database images. Nearest-neighbour matching is the major computational bottleneck.

  • Linear search performs dn operations for n features in the database and d dimensions.
  • No exact methods are faster than linear search for d > 10.
  • Approximate methods can be much faster, but at the cost of missing some correct matches. The failure rate gets worse for large datasets.

SLIDE 10

K-d tree

  • A k-d tree is a binary tree data structure for organizing a set of points.
  • Each internal node is associated with an axis-aligned hyper-plane splitting its associated points into two sub-trees.
  • Dimensions with high variance are chosen first.
  • The position of the splitting hyper-plane is chosen as the mean/median of the projected points, giving a balanced tree.

(Figure: splitting lines l1-l10 partitioning points 1-11, and the corresponding binary tree.)
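The construction rules above (highest-variance dimension first, median split) can be sketched in pure Python. This is an illustrative toy with invented 2-D points, not the implementation behind the slides; real systems use optimized libraries.

```python
# Toy k-d tree: choose the highest-variance dimension, split at the median,
# then answer exact nearest-neighbour queries with branch pruning.

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def build_kdtree(points):
    if not points:
        return None
    dims = len(points[0])
    # Dimension with the highest variance is chosen first.
    axis = max(range(dims), key=lambda d: variance([p[d] for p in points]))
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # median split -> balanced tree
    return {
        "axis": axis,
        "point": points[mid],
        "left": build_kdtree(points[:mid]),
        "right": build_kdtree(points[mid + 1:]),
    }

def nearest(node, query, best=None):
    if node is None:
        return best
    d2 = sum((a - b) ** 2 for a, b in zip(node["point"], query))
    if best is None or d2 < best[0]:
        best = (d2, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    if diff ** 2 < best[0]:  # the hypersphere crosses the splitting plane
        best = nearest(far, query, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(nearest(tree, (9, 2))[1])  # closest database point to the query
```

The pruning test (`diff ** 2 < best[0]`) is what makes the search cheaper than linear scan in low dimensions; as the slide notes, this advantage disappears for d > 10.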

SLIDE 11

Large scale object/scene recognition

Image dataset: > 1 million images; given a query, the image search system returns a ranked image list.

  • Each image described by approximately 2000 descriptors
  – 2 * 10^9 descriptors to index for one million images!
  • Database representation in RAM:
  – Size of descriptors: 1 TB, search + memory intractable

SLIDE 12

Bag-of-features [Sivic & Zisserman ’03]

(Pipeline: query image → Harris-Hessian-Laplace regions + SIFT descriptors → set of SIFT descriptors → bag-of-features processing + tf-idf weighting against the centroids (visual words) → sparse frequency vector → querying the inverted file → ranked image short-list → geometric verification [Chum & al. 2007] → re-ranked list.)

  • “visual words”:
  – 1 “word” (index) per local descriptor
  – only image ids in the inverted file => 8 GB fits!

SLIDE 13

Indexing text with inverted files

Document collection → inverted file: for each term, a list of hits (occurrences in documents):

Term        List of hits (occurrences in documents)
People      [d1: hit hit hit], [d4: hit hit] …
Common      [d1: hit hit], [d3: hit], [d4: hit hit hit] …
Sculpture   [d2: hit], [d3: hit hit hit] …

Need to map feature descriptors to “visual words”.
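The inverted file above can be sketched directly; the toy documents below reproduce the slide's hit counts, and the scoring rule (sum of hits over query terms) is a simple illustrative choice.

```python
from collections import defaultdict

# Toy document collection matching the slide's table.
docs = {
    "d1": "people people people common common",
    "d2": "sculpture",
    "d3": "common sculpture sculpture sculpture",
    "d4": "people people common common common",
}

# Inverted file: term -> {document id: hit count}.
inverted = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term][doc_id] = inverted[term].get(doc_id, 0) + 1

def score(query_terms):
    """Rank documents by summed hits over the query terms."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id, hits in inverted.get(term, {}).items():
            scores[doc_id] += hits
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score(["common", "sculpture"]))  # d3 ranks first
```

Only documents containing at least one query term are ever touched, which is why the same structure scales to millions of images once descriptors are mapped to visual words.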

SLIDE 14

SLIDE 15

SLIDE 16

Visual words

  • Example: each group of patches belongs to the same visual word.

Slide credit: K. Grauman, B. Leibe. Figure from Sivic & Zisserman, ICCV 2003.

SLIDE 17

Inverted file index for images comprised of visual words

(Figure: for each word number, a list of image numbers. Image credit: A. Zisserman.)

  • Score each image by the number of common visual words (tentative correspondences).
  • Dot product between bag-of-features vectors.
  • Fast for sparse vectors!

Slide credit: K. Grauman, B. Leibe.
SLIDE 18

Visual words – approximate NN search

  • Map descriptors to words by quantizing the feature space:
  – quantize via k-means clustering to obtain visual words;
  – assign descriptors to the closest visual word.
  • Bag-of-features as approximate nearest neighbour search: descriptor matching with k-nearest neighbours is replaced by the bag-of-features matching function

  f_q(x, y) = δ_{q(x), q(y)}

where q(x) is a quantizer, i.e., an assignment to a visual word, and δ_{a,b} is the Kronecker operator (δ_{a,b} = 1 iff a = b).
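The quantized matching function can be sketched in a few lines: assign each descriptor to its closest centroid, and declare a match iff both land in the same cell (the Kronecker delta). Centroids and descriptors below are invented toy values.

```python
# Toy visual vocabulary: k = 3 centroids in a 2-D descriptor space.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]

def q(x):
    """Quantizer: index of the closest visual word (nearest centroid)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])))

def bof_match(x, y):
    """Bag-of-features matching function: Kronecker delta on word indices."""
    return 1 if q(x) == q(y) else 0

print(bof_match((0.5, 0.2), (1.0, -0.3)))  # same Voronoi cell -> match
print(bof_match((0.5, 0.2), (9.0, 0.1)))   # different cells -> no match
```

This makes the weakness discussed on the following slides concrete: the match decision depends only on the cell boundary, so nearby descriptors on opposite sides of a boundary never match.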

SLIDE 19

Approximate nearest neighbor search evaluation

  • ANN algorithms usually return a short-list of nearest neighbours:
  – this short-list is supposed to contain the NN with high probability;
  – exact search may be performed to re-order this short-list.
  • Proposed quality evaluation of ANN search: trade-off between
  – Accuracy: NN recall = probability that the NN is in this list, against
  – Ambiguity removal = proportion of vectors in the short-list:
  • the lower this proportion, the more information we have about the vector;
  • the lower this proportion, the lower the complexity if we perform exact search on the short-list.
  • ANN search algorithms usually have some parameters to handle this trade-off.
SLIDE 20

ANN evaluation of bag-of-features

  • ANN algorithms return a list of potential neighbors.
  • Accuracy: NN recall = probability that the NN is in this list.
  • Ambiguity removal: proportion of vectors in the short-list.
  • In BOF, this trade-off is managed by the number of clusters k.

(Plot: NN recall (0.1-0.7) vs. rate of points retrieved (1e-07 to 0.1, log scale) for BOW with k = 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 30000, 50000.)

SLIDE 21

20K visual words: false matches

SLIDE 22

200K visual words: good matches missed

SLIDE 23

Problem with bag-of-features

  • The intrinsic matching scheme performed by BOF is weak:
  – for a “small” visual dictionary: too many false matches;
  – for a “large” visual dictionary: many true matches are missed.
  • No good trade-off between “small” and “large”!
  – either the Voronoi cells are too big,
  – or these cells can’t absorb the descriptor noise
  ⇒ the intrinsic approximate nearest neighbor search of BOF is not sufficient.
  – Possible solutions:
  • soft assignment [Philbin et al. CVPR’08]
  • additional short codes [Jegou et al. ECCV’08]
SLIDE 24

Hamming Embedding [Jegou et al. ECCV’08]

Representation of a descriptor x:
  – vector-quantized to q(x) as in standard BOF,
  – plus a short binary vector b(x) for an additional localization in the Voronoi cell.

Two descriptors x and y match iff

  q(x) = q(y)  and  h(b(x), b(y)) ≤ h_t

where h(a, b) is the Hamming distance and h_t a fixed threshold.

SLIDE 25

Hamming Embedding

  • Nearest neighbours for the Hamming distance ≈ those for the Euclidean distance.
  • A metric in the embedded space reduces dimensionality-curse effects.
  • Efficiency:
  – Hamming distance = very few operations;
  – fewer random memory accesses: 3x faster than BOF with the same dictionary size!

SLIDE 26

Hamming Embedding

  • Off-line (given a quantizer):
  – draw an orthogonal projection matrix P of size db × d
  ⇒ this defines db random projection directions;
  – for each Voronoi cell and projection direction, compute the median value over a learning set.
  • On-line: compute the binary signature b(x) of a given descriptor:
  – project x onto the projection directions as z(x) = (z1, …, zdb);
  – bi(x) = 1 if zi(x) is above the learned median value, otherwise 0.
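The on-line step can be sketched as follows. The projection directions and medians below are invented toy values; in the actual method P is an orthogonal random matrix and the medians are learned per Voronoi cell, per direction.

```python
# Toy Hamming Embedding signature (db = 3 projection directions).
# P and the medians are illustrative stand-ins for the learned quantities.

def project(P, x):
    """z(x) = P x, computed row by row."""
    return [sum(p_i * x_i for p_i, x_i in zip(row, x)) for row in P]

def signature(P, medians, x):
    """b_i(x) = 1 if z_i(x) is above the learned median, else 0."""
    return [1 if z_i > m_i else 0 for z_i, m_i in zip(project(P, x), medians)]

def hamming(a, b):
    return sum(ai != bi for ai, bi in zip(a, b))

P = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]  # toy directions (not truly orthogonal)
medians = [0.5, 0.5, 0.7]                  # toy per-cell medians

bx = signature(P, medians, (0.9, 0.1))
by = signature(P, medians, (0.8, 0.3))
ht = 1
print(hamming(bx, by) <= ht)  # within-cell match test from the previous slide
```

Thresholding at the per-cell median makes each bit close to balanced over the learning set, which is what gives the binary codes their discriminative power inside a cell.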

SLIDE 27

ANN evaluation of Hamming Embedding

  • Compared to BOW: at least 10 times fewer points in the short-list for the same level of accuracy.
  • Hamming Embedding provides a much better trade-off between recall and ambiguity removal.

(Plot: NN recall (0.1-0.7) vs. rate of points retrieved (1e-08 to 0.1, log scale) for HE+BOW with ht = 16, 18, 20, 22, 24, 28, 32, and BOW with k = 100 to 50000.)

SLIDE 28

Matching points - 20k word vocabulary

201 matches vs. 240 matches: many matches with the non-corresponding image!

SLIDE 29

Matching points - 200k word vocabulary

69 matches vs. 35 matches: still many matches with the non-corresponding one.

SLIDE 30

Matching points - 20k word vocabulary + HE

83 matches vs. 8 matches: 10x more matches with the corresponding image!

SLIDE 31

Bag-of-features [Sivic & Zisserman ’03]

(Pipeline: query image → Harris-Hessian-Laplace regions + SIFT descriptors → set of SIFT descriptors → bag-of-features processing + tf-idf weighting against the centroids (visual words) → sparse frequency vector → querying the inverted file → ranked image short-list → geometric verification [Chum & al. 2007] → re-ranked list.)

  • “visual words”:
  – 1 “word” (index) per local descriptor
  – only image ids in the inverted file => 8 GB fits!

SLIDE 32

Geometric verification

Use the position and shape of the underlying features to improve retrieval quality. Both images have many matches – which is correct?

SLIDE 33

Geometric verification

We can measure spatial consistency between the query and each result to improve retrieval quality. Many spatially consistent matches – correct result; few spatially consistent matches – incorrect result.

SLIDE 34

Geometric verification

Gives localization of the object

SLIDE 35

Geometric verification

  • Remove outliers; matches contain a high number of incorrect ones.
  • Estimate the geometric transformation.
  • Robust strategies:
  – RANSAC
  – Hough transform

SLIDE 36

Example: estimating a 2D affine transformation

  • Simple fitting procedure (linear least squares).
  • Approximates viewpoint changes for roughly planar objects and roughly orthographic cameras.
  • Can be used to initialize fitting for more complex models.

(Figure: matches consistent with an affine transformation.)

SLIDE 37

Fitting an affine transformation

Assume we know the correspondences; how do we get the transformation?

Each correspondence maps (xi, yi) to (xi', yi'):

  [xi']   [m1 m2] [xi]   [t1]
  [yi'] = [m3 m4] [yi] + [t2]

SLIDE 38

Fitting an affine transformation

Stacking the two equations of every correspondence gives a linear system:

  [ …   …   …   …   …  … ]        [ …  ]
  [ xi  yi  0   0   1  0 ] [m1]   [ xi']
  [ 0   0   xi  yi  0  1 ] [m2] = [ yi']
  [ …   …   …   …   …  … ] [m3]   [ …  ]
                           [m4]
                           [t1]
                           [t2]

A linear system with six unknowns: each match gives two linearly independent equations, so we need at least three matches to solve for the transformation parameters.
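For exactly three correspondences the system above is exactly determined, and it splits into two independent 3x3 systems: one for (m1, m2, t1) and one for (m3, m4, t2). A minimal pure-Python sketch with invented point values:

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Cramer's rule (toy-scale only)."""
    def det(M):
        return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
              - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
              + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))
    d = det(A)
    return [det([[A[i][k] if k != j else b[i] for k in range(3)]
                 for i in range(3)]) / d
            for j in range(3)]

def fit_affine(src, dst):
    """Fit (m1, m2, m3, m4, t1, t2) from exactly three matches src[i] -> dst[i]."""
    A = [[x, y, 1.0] for x, y in src]              # rows (xi, yi, 1)
    m1, m2, t1 = solve3(A, [x for x, _ in dst])    # xi' = m1 xi + m2 yi + t1
    m3, m4, t2 = solve3(A, [y for _, y in dst])    # yi' = m3 xi + m4 yi + t2
    return m1, m2, m3, m4, t1, t2

# Toy example: a pure translation by (2, -1).
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(2.0, -1.0), (3.0, -1.0), (2.0, 0.0)]
print(fit_affine(src, dst))  # ~ (1, 0, 0, 1, 2, -1)
```

With more than three matches one would solve the stacked system in the least-squares sense instead; inside RANSAC, this three-match fit is exactly the minimal-sample step.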

SLIDE 39

Comparison

Hough Transform

Advantages:
  • Can handle a high percentage of outliers (>95%)
  • Extracts groupings from clutter in linear time

Disadvantages:
  • Quantization issues
  • Only practical for a small number of dimensions (up to 4)

Improvements available:
  • Probabilistic extensions
  • Continuous voting space
  • Can be generalized to arbitrary shapes and objects [Leibe08]

RANSAC

Advantages:
  • General method suited to a large range of problems
  • Easy to implement
  • “Independent” of number of dimensions

Disadvantages:
  • Basic version only handles a moderate number of outliers (<50%)

Many variants available, e.g.:
  • PROSAC: progressive RANSAC [Chum05]
  • Preemptive RANSAC [Nister05]

SLIDE 40

Geometric verification – example

  • 1. Query
  • 2. Initial retrieval set (bag of words model)

  • 3. Spatial verification (re-rank on # of inliers)
SLIDE 41

Evaluation dataset: Oxford buildings

All Soul's, Ashmolean, Bridge of Sighs, Balliol, Keble, Magdalen, Bodleian, University Museum, Thom Tower, Cornmarket, Radcliffe Camera

  • Ground truth obtained for 11 landmarks.
  • Evaluate performance by mean Average Precision.

SLIDE 42

Measuring retrieval performance: Precision - Recall

  • Precision: % of returned images that are relevant.
  • Recall: % of relevant images that are returned.

(Figure: diagram of returned vs. relevant images within all images; example precision-recall curve.)

SLIDE 43

Average Precision

  • A good AP score requires both high recall and high precision.
  • Application-independent.

(Figure: precision-recall curve; AP is the area under the curve.)

Performance measured by mean Average Precision (mAP):
  • over 55 queries on 100K or 1.1M image datasets.
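Average precision can be computed from a ranked result list as the mean of the precision values at the ranks where relevant images appear (one common definition; benchmark scripts may use a slightly different interpolation). The ranking below is invented for illustration; mAP is then the mean of AP over all queries.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP = mean of precision@rank over the ranks of the relevant images."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

# Toy ranking: the two relevant images are retrieved at ranks 1 and 3.
ranked = ["a", "x", "b", "y"]
print(average_precision(ranked, {"a", "b"}))  # (1/1 + 2/3) / 2
```

Note how a relevant image pushed down the list lowers AP even when recall is eventually 100%, which is why AP rewards both high precision and high recall.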
SLIDE 44

SLIDE 45

INRIA Holidays dataset

  • Evaluation on the INRIA Holidays dataset, 1491 images:
  – 500 query images + 991 annotated true positives;
  – most images are holiday photos of friends and family.
  • 1 million & 10 million distractor images from Flickr.
  • Vocabulary construction on a different Flickr set.
  • Evaluation metric: mean average precision (in [0,1], bigger = better):
  – average over the precision/recall curve.

SLIDE 46

Holidays dataset – example queries

SLIDE 47

Dataset: Venice Channel

(Images: Query; Base 1 - Base 4.)

SLIDE 48

Dataset: San Marco square

(Images: Query; Base 1 - Base 9.)

SLIDE 49

Example distractors - Flickr

SLIDE 50

Experimental evaluation

  • Evaluation on our Holidays dataset, 500 query images, 1 million distractor images.
  • Metric: mean average precision (in [0,1], bigger = better).

(Plot: mAP (0.1-1.0) vs. database size (1000 to 1,000,000) for baseline, HE, and HE + re-ranking.)

SLIDE 51

Results – Venice Channel

(Images: queries with retrieved Base and Flickr results.)

Demo at http://bigimbaz.inrialpes.fr

SLIDE 52

Towards large-scale image search

  • BOF + inverted file can handle up to ~10 million images:
  – with a limited number of descriptors per image;
  – RAM: 40 GB;
  – search: 2 seconds.
  • Web-scale = billions of images:
  – with 100 M images per machine → search: 20 seconds, RAM: 400 GB;
  – not tractable.
  • Solution: represent each image by one compressed vector.
SLIDE 53

Very large scale image search

(Pipeline: query image → Hessian-Affine regions + SIFT descriptors [Mikolajczyk & Schmid 04] [Lowe 04] → bag-of-features processing + tf-idf weighting against the centroids (visual words) → description vector → vector compression → vector search → ranked image short-list → geometric verification [Lowe 04, Chum & al. 2007] → re-ranked list.)

  • Each image is represented by one vector (bag-of-features, VLAD, Fisher, GIST).
  • Vector compression to reduce storage requirements and search time.

SLIDE 54

Related work on very large scale image search

  • Min-hash and geometrical min-hash [Chum et al. 07-09]
  • Compressing the BoF representation (miniBOF) [Jegou et al. 09]
  ⇒ these approaches require hundreds of bytes to obtain a “reasonable quality”.
  • GIST descriptors with Spectral Hashing [Weiss et al.’08]
  ⇒ very limited invariance to scale/rotation/crop.

SLIDE 55

Global scene context – GIST descriptor

The “gist” of a scene: Oliva & Torralba (2001)

5 frequency bands and 6 orientations for each image location

Tiling of the image to describe the image

SLIDE 56

GIST descriptor + spectral hashing

The position of the descriptor in the image is encoded in the representation

Gist

Torralba et al. (2003)

Spectral hashing produces binary codes similar to spectral clusters

SLIDE 57

Related work on very large scale image search

  • Min-hash and geometrical min-hash [Chum et al. 07-09]
  • Compressing the BoF representation (miniBOF) [Jegou et al. 09]
  ⇒ hundreds of bytes are required to obtain a “reasonable quality”.
  • GIST descriptors with Spectral Hashing [Weiss et al.’08]
  ⇒ very limited invariance to scale/rotation/crop.
  • Aggregating local descriptors into a compact image representation [Jegou & al.‘10]
  • Efficient object category recognition using classemes [Torresani et al.’10]

SLIDE 58

Aggregating local descriptors

  • Set of n local descriptors → 1 vector.
  • Popular approach: bag of features, often with SIFT features.
  • Recently improved aggregation schemes:
  – Fisher vector [Perronnin & Dance ‘07]
  – VLAD descriptor [Jegou, Douze, Schmid, Perez ‘10]
  – Supervector [Zhou et al. ‘10]
  – Sparse coding [Wang et al. ’10, Boureau et al.’10]
  • Used in very large-scale retrieval and classification.
slide-59
SLIDE 59

Aggregating local descriptors

Most popular approach: BoF representation [Sivic & Zisserman 03]

sparse vector

highly dimensional → significant dimensionality reduction introduces loss g y

Vector of locally aggregated descriptors (VLAD) [Jegou et al. 10] non sparse vector

non sparse vector

fast to compute

excellent results with a small vector dimensionality

Fisher vector [Perronnin & Dance 07]

probabilistic version of VLAD

probabilistic version of VLAD

initially used for image classification

comparable or improved performance over VLAD for image retrieval

SLIDE 60

VLAD: vector of locally aggregated descriptors

  • Determine a vector quantizer (k-means):
  – output: k centroids (visual words) c1, …, ci, …, ck;
  – centroid ci has dimension d.
  • For a given image:
  – assign each descriptor x to the closest center ci;
  – accumulate (sum) the residuals per cell: vi := vi + (x - ci).
  • VLAD (dimension D = k x d).
  • The vector is square-root + L2-normalized.
  • Alternative: Fisher vector.

[Jegou, Douze, Schmid, Perez, CVPR’10]
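The accumulation step above can be sketched directly. The centroids and descriptors below are toy 2-D values; the square-root step is implemented here as signed square-rooting, which is a common reading of "square-root normalized" but an assumption on the exact variant used.

```python
import math

def vlad(descriptors, centroids):
    """Toy VLAD: sum residuals x - c_i per closest centroid, then
    signed-square-root and L2-normalize the concatenated vector."""
    k, d = len(centroids), len(centroids[0])
    v = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        i = min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
        for t in range(d):
            v[i][t] += x[t] - centroids[i][t]   # v_i := v_i + (x - c_i)
    flat = [val for cell in v for val in cell]  # dimension D = k x d
    flat = [math.copysign(math.sqrt(abs(val)), val) for val in flat]
    norm = math.sqrt(sum(val * val for val in flat)) or 1.0
    return [val / norm for val in flat]

centroids = [(0.0, 0.0), (10.0, 10.0)]
descs = [(1.0, 0.0), (0.0, 1.0), (11.0, 10.0)]
print(vlad(descs, centroids))  # 4-dim VLAD (k=2, d=2)
```

Unlike bag-of-features, which only counts how many descriptors fall in each cell, VLAD keeps *where* in the cell they fall (the summed residual), which is why small k already works well.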

SLIDE 61

VLADs for corresponding images

(Figure: v1, v2, v3, …: SIFT-like representation per centroid (+ components: blue, - components: red); good coincidence of energy & orientations.)

SLIDE 62

Fisher vector [Perronnin & Dance 07]

  • Use a Gaussian Mixture Model (GMM) as vocabulary.
  • Statistical measure of the descriptors of the image w.r.t. the GMM: derivative of the likelihood w.r.t. the GMM parameters.
  • GMM parameters: weight, mean, co-variance (diagonal).
  • A translated cluster → large derivative for this component.

SLIDE 63

Fisher vector

For image retrieval in our experiments:
  • only the deviation w.r.t. the mean is used, dim: K*D [K number of Gaussians, D dim of descriptor];
  • adding the variance does not improve results for a comparable vector length.
SLIDE 64

VLAD/Fisher/BOF performance and dimensionality reduction

We compare Fisher, VLAD and BoF on the INRIA Holidays dataset (mAP %). Dimension is reduced to D' dimensions with PCA.

(Table fragment: GIST, 960 dimensions, 36.5 mAP.)

Observations:
  • Fisher and VLAD are better than BoF for a given descriptor size.
  • Choose a small D if the output dimension D' is small.
  • Performance of GIST is not competitive.

[Jegou, Perronnin, Douze, Sanchez, Perez, Schmid, PAMI’12]

SLIDE 65

Compact image representation

Aim: improve the trade-off between
  • search speed;
  • memory usage;
  • search quality.

Approach: joint optimization of three stages:
  • local descriptor aggregation (VLAD / Fisher);
  • dimension reduction (PCA);
  • indexing algorithm (PQ codes, (non-)exhaustive search).

SLIDE 66

Product quantization for nearest neighbor search

  • The vector is split into m subvectors.
  • Subvectors are quantized separately by quantizers q1, …, qm, where each is learned by k-means with a limited number of centroids.
  • Example: y = 128-dim vector split into 8 subvectors of dimension 16:
  – each subvector is quantized with 256 centroids → 8 bits;
  – very large implicit codebook: 256^8 ≈ 1.8 x 10^19.

(Figure: y = (y1, …, y8), 16 components each; each yj is mapped by its quantizer qj (256 centroids, 8 bits) to qj(yj).)

⇒ 8 subvectors x 8 bits = 64-bit quantization index

[Jegou, Douze, Schmid, PAMI’11]
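The split-and-quantize step can be sketched with a tiny configuration (m = 2 subvectors, 2 centroids each) instead of the slide's 8 x 256; the subquantizer codebooks below are invented toy values rather than ones learned by per-subspace k-means.

```python
# Toy product quantizer: split the vector into m subvectors and quantize each
# with its own small codebook (real codebooks come from k-means per subspace).
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],   # centroids for subvector 1
    [(0.0, 1.0), (1.0, 0.0)],   # centroids for subvector 2
]

def pq_encode(y):
    """Quantization index: one centroid id per subvector."""
    m = len(codebooks)
    d_sub = len(y) // m
    code = []
    for j in range(m):
        sub = y[j * d_sub:(j + 1) * d_sub]
        code.append(min(range(len(codebooks[j])),
                        key=lambda c: sum((a - b) ** 2
                                          for a, b in zip(sub, codebooks[j][c]))))
    return tuple(code)

def pq_decode(code):
    """Reconstruction: concatenate the selected centroids."""
    out = []
    for j, c in enumerate(code):
        out.extend(codebooks[j][c])
    return tuple(out)

y = (0.9, 1.1, 0.1, 0.8)
code = pq_encode(y)
print(code, pq_decode(code))
```

The point of the construction is that storing one small id per subvector implicitly indexes a product codebook of size 2^m here (256^8 on the slide) without ever enumerating it.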

SLIDE 67

Optimizing the dimension reduction and quantization together

  • Fisher vectors undergo two approximations:
  – mean square error from the PCA projection;
  – mean square error from quantization.
  • Given k and a byte budget per image, choose D' minimizing their sum.
  • Results on the Holidays dataset:
  – there exists an optimal D';
  – 16 bytes: best results for k = 64;
  – 320 bytes: best results for k = 256.
slide-68
SLIDE 68

Results on the Holidays dataset with various quantization parameters

SLIDE 69

Joint optimization of Fisher/VLAD and dimension reduction-indexing

  • For Fisher/VLAD:
  – the larger k, the better the raw search performance;
  – but large k produces large vectors, which are harder to index.
  • Optimization of the vocabulary size:
  – fixed output size (in bytes);
  – D' computed from k via the joint optimization of reduction/indexing
  ⇒ end-to-end parameter optimization.

SLIDE 70

Comparison to the state of the art

SLIDE 71

Large scale experiments (10 million images)

  • Exhaustive search of VLADs, D' = 64: 4.77 s.
  • With the product quantizer:
  – exhaustive search with ADC: 0.29 s;
  – non-exhaustive search with IVFADC: 0.014 s (IVFADC = combination with an inverted file).
SLIDE 72

Large scale experiments (10 million images)

(Plot: recall@100 (0.1-0.8) vs. database size (1000 to 10M; Holidays + images from Flickr) for BOF D=200k; VLAD k=64; VLAD k=64, D'=96; VLAD k=64, ADC 16 bytes; VLAD + Spectral Hashing, 16 bytes.)

Timings: exhaustive VLAD search 4.768 s; ADC: 0.286 s; IVFADC: 0.014 s; SH ≈ 0.267 s.

SLIDE 73

Conclusion & future work

  • Excellent search accuracy and speed in 10 million images.
  • Each image is represented by very few bytes (20 - 40 bytes).
  • Tested on up to 220 million video frames.
  • Extrapolation for 1 billion images: 20 GB RAM, query time < 1 s on 8 cores.
  • Available on-line: Matlab source code for the product quantizer.
  • Alternative: using Fisher vectors instead of VLAD descriptors [Perronnin’10].
  • Extension to video & more “semantic” search.

SLIDE 74

Event retrieval in large video collections [Revaud et al. 2013]

  • Video description: frame t → VLAD descriptor, reduced to 512D with PCA.
  • Comparison of two videos: query vs. database video.
  • Fast calculation in the frequency domain + product quantization.