Evaluation of neural code compression techniques for image retrieval Feature compression for Image Retrieval Gabriel Nieves-Ponce (nieves1@umbc.edu) University of Maryland Baltimore County CMSC-676 Information Retrieval
Intro to image retrieval ● Image retrieval (IR) is a subset of information retrieval. ● Aims to retrieve images that are semantically similar to user-defined queries. ● Commonly used image representation techniques: ○ Handcrafted Descriptors: SIFT, SURF, ORB, etc. ○ Learned Feature Vectors (neural codes): ResNet, VGG, etc. ● No free lunch: ○ Descriptors are small but inflexible ○ Neural Codes are large but flexible
Descriptors ● Small and easy to compute ● Rotations and translations may skew results ● Require additional steps to compute a similarity metric ○ Geometric Verification Fig 1. SIFT Keypoints mapping
Scale Invariant Feature Transform (SIFT) SIFT is a popular algorithm for computing image descriptors. As the name implies, one of the benefits of SIFT is scale invariance: we can use both close-up and faraway images of an object and be confident that SIFT will capture keypoint descriptors shared by both images. The next couple of slides give a quick overview of how SIFT achieves this, starting with a brief introduction to convolutions. Convolutions are one of the cornerstones of image processing and are widely used in most popular algorithms, including SIFT.
Convolutions Imagine you have a 100x100 image. Now imagine that you look at subsets of the image with a pixel area of 10x10, scanning the image as seen in Fig. 2. For every patch we perform some computation that returns a real number. Once we have scanned the whole image, we end up with a lower-resolution matrix made up of the results of the individual computations performed on the 10x10 “patches” within our image. What I just described is known as a convolution. Fig. 2 Kernel Convolution
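The scanning procedure described on this slide can be sketched in a few lines of plain Python. This is only an illustration: the function name `convolve2d`, the averaging kernel, and the `stride` parameter are assumptions made for the example, and the kernel is not flipped, so strictly speaking this is the cross-correlation that most image-processing libraries call "convolution".

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the matrix of weighted sums."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for top in range(0, len(image) - k_rows + 1, stride):
        row = []
        for left in range(0, len(image[0]) - k_cols + 1, stride):
            # Multiply the patch by the kernel element-wise and add it all up.
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


# Hypothetical 100x100 image where every pixel has the value 1.0.
image = [[1.0] * 100 for _ in range(100)]
# Hypothetical 10x10 averaging kernel (every weight is 1/100).
kernel = [[1.0 / 100] * 10 for _ in range(10)]

# Stepping 10 pixels at a time visits each non-overlapping 10x10 patch once.
result = convolve2d(image, kernel, stride=10)
```

With a stride of 10 the 100x100 image is reduced to a 10x10 matrix, one number per patch. With a stride of 1 the patches overlap and the output would be 91x91.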
Scale Invariant Feature Transform (SIFT) Remember that 10x10 pixel area we mentioned before? This is called a kernel. The proper definition of a kernel is beyond the scope of this presentation, but for the purpose of understanding SIFT we can define a kernel as a matrix of real numbers that, when convolved with the image, augments or diminishes certain properties. If you have ever used a filter to alter an image on Instagram, you have almost certainly used kernels before. Kernels are also called filters; you can think of a filter as a collection of kernels. SIFT uses these filters to make edges in an image more pronounced and everything else less pronounced, which makes it easier for the algorithm to find the edges within the image. As a preprocessing step before extracting the keypoint descriptors, SIFT applies a Gaussian filter to the image, producing a Gaussian blur.
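To make the idea of a Gaussian filter concrete, here is a minimal sketch of how such a kernel could be built. The function name `gaussian_kernel` and the values `size=5` and `sigma=1.0` are assumptions chosen for illustration; they are not the parameters SIFT itself uses.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian kernel whose weights sum to 1."""
    center = (size - 1) / 2
    weights = [
        [
            math.exp(-((r - center) ** 2 + (c - center) ** 2) / (2 * sigma ** 2))
            for c in range(size)
        ]
        for r in range(size)
    ]
    # Normalize so that blurring does not change the overall image brightness.
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


# Hypothetical 5x5 kernel; the largest weight sits at the center.
kernel = gaussian_kernel(size=5, sigma=1.0)
```

Convolving an image with this kernel replaces each pixel with a weighted average of its neighbors, which is what produces the blur. A larger `sigma` spreads the weights out and blurs more strongly.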
Scale Invariant Feature Transform (SIFT) The next series of steps is quite involved and beyond the scope of this presentation, so I will quickly mention the steps without diving into the math. SIFT applies this preprocessing step a number of times, each time rescaling the image, to achieve scale invariance. It then looks for areas of interest known as keypoints. Each keypoint is computed as a function of its surrounding pixels: are there a lot of white pixels around these black pixels? Then this might be an edge. After all the keypoints are located, each is assigned an orientation to achieve rotational invariance. Finally, keypoint descriptors are computed and stored as 128-dimensional vectors.
Speeded Up Robust Features (SURF) SURF is a popular algorithm that was inspired by the SIFT paper [1]. SURF is very similar to SIFT, and both have comparable retrieval performance. The main takeaway is that SURF is up to three times faster than SIFT and provides better rotational invariance while still achieving similar scale invariance.
Neural Codes ● High dimensionality ○ Very large vector ● Expensive to compute - GPUs ● Rotationally and translationally invariant ● Can use L2 norm directly to compute similarity Fig 3. VGG16 Class Activation Map (CAM)
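As a rough illustration of the last bullet, the sketch below ranks a small database of feature vectors by their L2 (Euclidean) distance to a query vector. The three-dimensional vectors are made-up placeholders; real neural codes have hundreds or thousands of dimensions, and the helper names are assumptions for this example.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_l2(query, database):
    """Return database indices sorted from most to least similar."""
    return sorted(range(len(database)), key=lambda i: l2_distance(query, database[i]))


# Hypothetical neural codes for three database images.
database = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.0, 0.9],
]
query = [1.0, 0.0, 0.0]

ranking = rank_by_l2(query, database)  # [0, 2, 1]
```

No geometric verification step is needed here: a smaller distance is read directly as higher similarity.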
Convolutional Neural Network (CNN) Remember when we mentioned convolutions earlier? A CNN is a neural network architecture that takes this concept to the extreme. See Fig. 4. Fig. 4 CNN Architecture
Convolutional Neural Network (CNN) A CNN performs a series of convolutions followed by pooling. Each convolution outputs a list of matrices called Feature Maps (FM), similar to the output of the preprocessing step in SIFT. You can further convolve the FMs to extract deeper, more abstract representations at the expense of resolution. Fig. 5 CNN Architecture
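The pooling step can be sketched as follows. The slide does not say which kind of pooling is used, so max pooling over non-overlapping 2x2 windows is an assumption here, as are the function name `max_pool2d` and the toy feature map.

```python
def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map produced by a convolution.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

pooled = max_pool2d(feature_map)  # [[4, 2], [2, 8]]
```

The 4x4 map shrinks to 2x2, which shows the trade-off mentioned on the slide: each pooling step keeps the strongest responses but halves the resolution.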
Convolutional Neural Network (CNN) A convolutional layer can be described, very naively, as a series of convolutions and pooling packaged into one phase. There are a number of different CNN architectures, such as ResNet50 and VGG16, and each has its own unique implementation. The type of feature maps you get will depend not only on the architecture you decide to use, but also on the convolutional layer you choose. Unlike SIFT, the convolution kernels, or filters, in a CNN are updated and improved via backpropagation to achieve maximum class activation, see fig. 3 [4], without any human input. The network learns which filters are best directly from the dataset. Finally, the rule of thumb is that the early convolutional layers learn simple edge detection while the deeper layers can detect very complex patterns. Take a look at fig. 5 [5] to see a visualization of the kernels that were learned by the last VGG16 convolutional layer.
Problem statement ● Neural codes have yielded promising results in retrieval tasks when compared to descriptors [3] ● Uncompressed state-of-the-art descriptors were able to outperform uncompressed neural codes. ● With the advent of cloud computing, accelerators such as GPUs have become accessible to most consumers ● Can we overcome the feature vector size problem? ○ A. Babenko et al. [3] demonstrated some success using compressed neural codes with minimal loss in performance. ○ Can we improve upon their findings?
Compression and Representation ● Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. ● Dimensionality Reduction: ○ Principal Component Analysis (PCA) [6] ○ Linear Discriminant Analysis (LDA) [7][8] ● Quantization (Compact Coding for ANN Search): ○ Product Quantization (PQ) [4] ○ Optimized Product Quantization (OPQ) [5]
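To give a feel for how Product Quantization compresses a vector, here is a toy sketch in plain Python. It follows the general idea from [4] (split each vector into sub-vectors, learn a small k-means codebook per sub-vector, and store only centroid indices), but every name and parameter below, such as `train_pq`, `n_subvectors=4`, and `k=16`, is an assumption chosen for illustration rather than a setting taken from the paper or from this project.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10, seed=0):
    """Tiny Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def train_pq(vectors, n_subvectors, k):
    """Learn one codebook per sub-vector slice."""
    width = len(vectors[0]) // n_subvectors
    return [
        kmeans([v[m * width:(m + 1) * width] for v in vectors], k, seed=m)
        for m in range(n_subvectors)
    ]


def encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    width = len(vector) // len(codebooks)
    return [
        min(range(len(book)),
            key=lambda c: sq_dist(vector[m * width:(m + 1) * width], book[c]))
        for m, book in enumerate(codebooks)
    ]


def decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    out = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out


# Hypothetical data: 200 random 8-dimensional "neural codes".
rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(200)]

codebooks = train_pq(vectors, n_subvectors=4, k=16)
codes = encode(vectors[0], codebooks)   # 4 small integers instead of 8 floats
approx = decode(codes, codebooks)       # lossy reconstruction of vectors[0]
```

In this toy setup each vector is stored as four indices in the range 0 to 15, so four bits per sub-vector would be enough. The price is that `approx` only approximates the original vector.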
Related Work ● Babenko et al. [3] demonstrated that compressed neural codes outperform even the best-performing low-resolution descriptors. ○ Both PCA and Discriminative Dimensionality Reduction (DDR) were used to generate low-resolution codes. ● PCA: ○ Unsupervised ○ Optimizes for highest variance ● Discriminative Dimensionality Reduction (*): ○ Supervised ○ Optimizes for highest separation of classes * = PCA is applied to codes before DDR to avoid overfitting
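The "optimizes for highest variance" property of PCA can be illustrated with a small sketch that finds only the first principal component. Power iteration on the covariance matrix is used here because it fits in a few lines of plain Python; it is an assumption made for this example and not necessarily how [3] or any particular library computes PCA. The function names and the toy data are likewise hypothetical.

```python
import math


def first_principal_component(data, iters=100):
    """Power iteration on the covariance matrix; returns (means, unit vector)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Start from an arbitrary vector; this assumes it is not orthogonal
    # to the direction of highest variance.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(row, means, component):
    """Compress a vector to a single number along the component."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))


# Hypothetical 2-D data lying exactly on the line y = 2x.
data = [[float(t), 2.0 * t] for t in range(10)]

means, pc = first_principal_component(data)
codes = [project(row, means, pc) for row in data]  # one value per 2-D point
```

Because all of the variance in this toy dataset lies along the line y = 2x, the component found points in the direction (1, 2), and a single number per point is enough to reconstruct the data. Real neural codes lose some information when projected onto fewer components.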
Goals ● Compare performance using state-of-the-art deep neural networks (DNN): ○ ResNet, DenseNet, InceptionV3 ● Extend the compression techniques ○ Linear Discriminant Analysis (LDA) [7][8] ○ Product Quantization (PQ) [4] ○ Optimized Product Quantization (OPQ) [5] ● Further research convolutional features and their quality: ○ Class Activation Map (CAM) [10] ○ Kernel Deconvolution [11]
Fig 6. VGG16 Kernel Deconvolution
References ● [1] Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60, 91–110 (2004). https://doi.org/10.1023/B:VISI.0000029664.99615.94 ● [2] Bay H., Tuytelaars T., Van Gool L. (2006) SURF: Speeded Up Robust Features. In: Leonardis A., Bischof H., Pinz A. (eds) Computer Vision – ECCV 2006. ECCV 2006. Lecture Notes in Computer Science, vol 3951. Springer, Berlin, Heidelberg. ● [3] Babenko A., Slesarev A., Chigorin A., Lempitsky V. (2014) Neural Codes for Image Retrieval. In: Fleet D., Pajdla T., Schiele B., Tuytelaars T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8689. Springer, Cham. ● [4] H. Jégou, M. Douze and C. Schmid, "Product Quantization for Nearest Neighbor Search," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, Jan. 2011.
● [5] T. Ge, K. He, Q. Ke and J. Sun, "Optimized Product Quantization for Approximate Nearest Neighbor Search," 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, 2013, pp. 2946-2953. ● [6] Lindsay I. Smith. "A Tutorial on Principal Component Analysis." http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf, February 26, 2002. ● [7] R.A. Fisher, "The Statistical Utilization of Multiple Measurements," Annals of Eugenics, vol. 8, pp. 376-386, 1938. ● [8] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed. Academic Press, 1990. ● [9] Nieves-Ponce, Gabriel. 2020, UMBC CMSC_676 Term Paper, master, https://gitlab.com/nievespg/umbc/-/tree/10-term-paper/CMSC_676/term_paper
● [10] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In CVPR, 2016. ● [11] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.