Neural Codes for Image Retrieval
David Stutz
July 22, 2015
Table of Contents

1 Introduction
2 Image Retrieval
  ◮ Bag of Visual Words
  ◮ Vector of Locally Aggregated Descriptors
  ◮ Sparse-Coded Features
  ◮ Compression and Nearest-Neighbor Search
3 Convolutional Neural Networks
  ◮ Multi-layer Perceptrons
  ◮ Convolutional Neural Networks
  ◮ Architectures
  ◮ Training
4 Neural Codes for Image Retrieval
5 Experiments
6 Summary
1. Introduction

Image retrieval:

Problem. Given a large database of images and a query image, find images showing the same object or scene.

Originally:
◮ Text-based retrieval systems based on manual annotations (advantage: supports activities, emotions, ...);
◮ impractical for large collections of images.

Today, content-based image retrieval:
◮ Techniques based on the Bag of Visual Words [SZ03] model.
2. Image Retrieval

Formalization of content-based image retrieval:

Problem. Find the $K$ nearest neighbors of a query $z_0$ in a (large) database $X = \{x_1, \dots, x_N\}$ of image representations.

[Figure: query $z_0$ and its $K = 2$ nearest neighbors, first among $N = 7$ points, then for large $N$.]

Important: the image representation.

Examples for image representations from the "Computer Vision" lecture:
◮ Histograms;
◮ Bag of Visual Words [SZ03].
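To make the formal problem concrete, the following is a minimal sketch of exact $K$-nearest-neighbor retrieval over a database of representation vectors (NumPy assumed; the representations themselves come from the methods discussed next, and all sizes are toy values):

```python
import numpy as np

def knn_retrieve(z0, X, K=2):
    """Return the indices of the K nearest neighbors of query z0 in X.

    z0: query representation, shape (D,)
    X:  database of N image representations, shape (N, D)
    """
    # Euclidean distances between the query and all database entries.
    distances = np.linalg.norm(X - z0, axis=1)
    # Indices of the K smallest distances (unsorted partition, then sort).
    nearest = np.argpartition(distances, K)[:K]
    return nearest[np.argsort(distances[nearest])]

# Toy usage: N = 7 random 4-dimensional representations, K = 2.
X = np.random.rand(7, 4)
z0 = np.random.rand(4)
print(knn_retrieve(z0, X, K=2))
```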
2.1. Bag of Visual Words

Intuition: assign local descriptors $y_{l,n}$ of image $x_n$ to visual words $\hat{y}_1, \dots, \hat{y}_M$ previously obtained using clustering.

[Figure: local descriptor $y_{l,n}$ assigned to its nearest visual word $\hat{y}_m$.]
2.1. Bag of Visual Words

1. Extract local descriptors $Y_n$ for each image $x_n$.
2. Cluster all local descriptors $Y = \bigcup_{n=1}^{N} Y_n$ to obtain visual words $\hat{Y} = \{\hat{y}_1, \dots, \hat{y}_M\}$.
3. Assign each $y_{l,n} \in Y_n$ to its nearest visual word (embedding step):
   $f(y_{l,n}) = \big( \delta(\mathrm{NN}_{\hat{Y}}(y_{l,n}) = \hat{y}_1), \dots \big)$.
4. Count visual word occurrences (aggregation step):
   $F(Y_n) = \sum_{l=1}^{L} f(y_{l,n})$.
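A minimal sketch of these steps, assuming the local descriptors have already been extracted (scikit-learn's KMeans stands in for the clustering; descriptor extraction, e.g. SIFT, is omitted, and all sizes are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Y_all: stacked local descriptors of all images, shape (num_descriptors, D).
Y_all = np.random.rand(1000, 64)
M = 100  # number of visual words

# Step 2: cluster all descriptors; the cluster centers are the visual words.
kmeans = KMeans(n_clusters=M, n_init=10).fit(Y_all)

def bovw(Y_n):
    """Bag of Visual Words representation of one image.

    Y_n: local descriptors of image x_n, shape (L, D).
    """
    # Embedding: assign each descriptor to its nearest visual word.
    assignments = kmeans.predict(Y_n)
    # Aggregation: count how often each visual word occurs.
    return np.bincount(assignments, minlength=M).astype(float)

Y_n = np.random.rand(50, 64)   # descriptors of one image
print(bovw(Y_n).shape)         # (100,)
```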
2.2. Vector of Locally Aggregated Descriptors

Intuition: consider the residuals $y_{l,n} - \hat{y}_m$ instead of counting visual words.

[Figure: residual between local descriptor $y_{l,n}$ and its nearest visual word $\hat{y}_m$.]
2.2. Vector of Locally Aggregated Descriptors

1. Extract and cluster local descriptors.
2. Compute residuals of the local descriptors to their nearest visual words (embedding step):
   $f(y_{l,n}) = \big( \delta(\mathrm{NN}_{\hat{Y}}(y_{l,n}) = \hat{y}_1)\,(y_{l,n} - \hat{y}_1), \dots \big)$.
3. Aggregate the residuals (aggregation step):
   $F(Y_n) = \sum_{l=1}^{L} f(y_{l,n})$.
4. $L_2$-normalize $F(Y_n)$.
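A minimal sketch of the VLAD embedding and aggregation under the same assumptions as above (the `centers` array stands in for the clustered visual words; this is an illustrative sketch, not a reference implementation):

```python
import numpy as np

def vlad(Y_n, centers):
    """VLAD representation of one image.

    Y_n:     local descriptors, shape (L, D).
    centers: visual words, shape (M, D).
    """
    M, D = centers.shape
    # Nearest visual word for each descriptor.
    distances = np.linalg.norm(Y_n[:, None, :] - centers[None, :, :], axis=2)
    nearest = np.argmin(distances, axis=1)

    # Embedding/aggregation: accumulate residuals y_{l,n} - y_hat_m per word.
    F = np.zeros((M, D))
    for y, m in zip(Y_n, nearest):
        F[m] += y - centers[m]

    # Concatenate and L2-normalize.
    F = F.reshape(-1)
    norm = np.linalg.norm(F)
    return F / norm if norm > 0 else F

Y_n = np.random.rand(50, 64)
centers = np.random.rand(100, 64)
print(vlad(Y_n, centers).shape)   # (6400,)
```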
2.3. Sparse-Coded Features

Intuition: soft-assign local descriptors to visual words.

[Figure: local descriptor $y_{l,n}$ soft-assigned to several visual words $\hat{y}_m$, $\hat{y}_{m'}$.]
2.3. Sparse-Coded Features

1. Extract and cluster local descriptors.
2. Compute sparse codes (embedding step):
   $f(y_{l,n}) = \operatorname*{argmin}_{r_l} \; \|y_{l,n} - \hat{Y} r_l\|_2^2 + \lambda \|r_l\|_1$,
   where $\hat{Y}$ contains the visual words $\hat{y}_m$ as columns.
3. Pool the sparse codes (aggregation step):
   $F(Y_n) = \big( \max_{1 \le l \le L} \{ f_1(y_{l,n}) \}, \dots \big)$,
   where $f_1(y_{l,n})$ denotes the first component of $f(y_{l,n})$.
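A minimal sketch of this embedding with max-pooling, using scikit-learn's SparseCoder as the $L_1$-regularized solver (the dictionary rows are the visual words, i.e. the transpose of the $\hat{Y}$ matrix above; `transform_alpha` plays the role of $\lambda$; all values are chosen only for illustration):

```python
import numpy as np
from sklearn.decomposition import SparseCoder

# Visual words as dictionary atoms, one atom per row: M = 100 words, D = 64.
centers = np.random.rand(100, 64)
coder = SparseCoder(dictionary=centers,
                    transform_algorithm='lasso_lars',
                    transform_alpha=0.1)   # corresponds to lambda

def sparse_coded(Y_n):
    """Sparse-coded representation of one image via max-pooling.

    Y_n: local descriptors, shape (L, D).
    """
    # Embedding: one sparse code r_l per local descriptor, shape (L, M).
    codes = coder.transform(Y_n)
    # Aggregation: component-wise maximum over all codes (max-pooling).
    return codes.max(axis=0)

Y_n = np.random.rand(50, 64)
print(sparse_coded(Y_n).shape)   # (100,)
```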
2.4. Compression, Nearest-Neighbor Search

Until now: the image representation. Additional aspects of image retrieval:
◮ compression of the image representations;
◮ efficient indexing and nearest-neighbor search [JDS11];
◮ query expansion [CPS+07] and spatial verification [PCI+07].

For example, compression can be accomplished using:
◮ unsupervised methods, e.g. Principal Component Analysis (PCA);
◮ or discriminative methods, e.g. Joint Subspace and Classifier Learning [GRPV12] or Large Margin Dimensionality Reduction [SPVZ13] (discussed later).

A PCA-based sketch of the unsupervised route follows below.
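As an illustration of compressing high-dimensional representations with PCA (scikit-learn assumed; the target dimensionality of 128 and all sizes are just examples):

```python
import numpy as np
from sklearn.decomposition import PCA

# X: database of N high-dimensional image representations, shape (N, D).
X = np.random.rand(1000, 6400)

# Fit PCA on the database and compress to, e.g., 128 dimensions.
pca = PCA(n_components=128)
X_compressed = pca.fit_transform(X)

# A query is projected with the same transformation before the
# nearest-neighbor search.
z0 = np.random.rand(1, 6400)
z0_compressed = pca.transform(z0)
print(X_compressed.shape, z0_compressed.shape)   # (1000, 128) (1, 128)
```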