Learning Visual Semantics: Models, Massive Computation, and Innovative Applications Part II: Visual Features and Representations Liangliang Cao, IBM Watson Research Center
Evolution of Visual Features
• Low-level features and spatial histograms (fewer parameters)
• SIFT and bag-of-words models
• Sparse coding
• Super vector and Fisher vector
• Deep CNN (more parameters)
Model complexity grows as we move down this list. Three fundamental techniques have been used extensively throughout: 1. histograms, 2. spatial gridding, 3. filters.
Low-Level Features and Spatial Pyramid
Raw Pixels as Features
Application 1: Face recognition. Application 2: Handwritten digits.
Concatenate the raw pixels into a 1D vector.
Tiny Images [Torralba et al. 2007]: resize an image to a 32x32 color thumbnail, which corresponds to a 3072-dimensional vector.
Pictures courtesy of Face Research Lab, Antonio Torralba, and Sam Roweis.
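A minimal sketch of the tiny-image idea described above, assuming PIL and NumPy (the slides do not prescribe an implementation): resize an image to a 32x32 color thumbnail and flatten it into a 3072-dimensional raw-pixel vector.

```python
import numpy as np
from PIL import Image

def tiny_image_feature(path, size=32):
    """Return a size*size*3 raw-pixel feature (3072-dim for size=32)."""
    img = Image.open(path).convert("RGB").resize((size, size))
    x = np.asarray(img, dtype=np.float32) / 255.0   # shape (32, 32, 3)
    return x.reshape(-1)                            # shape (3072,)
```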
From Pixels to Histograms
The color histogram [Swain and Ballard 91] was proposed to model the distribution of colors (r, g, b) in an image. Unlike raw-pixel vectors, histograms are not sensitive to:
• misalignment
• scale changes
• global rotation
We can extend the color histogram to:
• Edge histograms
• Shape context histograms
• Local binary patterns (LBP)
• Histograms of gradients
(Figure: images with similar color histogram features.)
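A minimal sketch of a color histogram, assuming per-channel binning with 8 bins per channel (other binnings, e.g. a joint 3D histogram, are equally valid):

```python
import numpy as np

def color_histogram(img, bins=8):
    """img: HxWx3 uint8 array. Returns an L1-normalized histogram of length 3*bins."""
    hist = []
    for c in range(3):                                # r, g, b channels
        h, _ = np.histogram(img[:, :, c], bins=bins, range=(0, 256))
        hist.append(h)
    hist = np.concatenate(hist).astype(np.float32)
    return hist / max(hist.sum(), 1.0)                # normalization: invariant to image size
```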
From Histograms to Spatialized Histograms
Problem with plain histograms: no spatial information, so very different images can produce exactly the same histogram! (Example thanks to Erik Learned-Miller.)
Remedy: compute histograms over spatial cells, as in spatial pyramid matching [Lazebnik et al., CVPR'06]; see also Ojala et al., PAMI'02.
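A minimal sketch of histograms over spatial cells, reusing color_histogram from the previous sketch; a spatial pyramid would repeat this at several grid resolutions (e.g. 1x1, 2x2, 4x4) and concatenate all levels:

```python
import numpy as np

def gridded_histogram(img, n=2, bins=8):
    """img: HxWx3 uint8 array. Returns n*n concatenated per-cell color histograms."""
    H, W, _ = img.shape
    feats = []
    for i in range(n):
        for j in range(n):
            cell = img[i * H // n:(i + 1) * H // n, j * W // n:(j + 1) * W // n]
            feats.append(color_histogram(cell, bins))   # helper from the sketch above
    return np.concatenate(feats)                        # dim = n*n * 3*bins
```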
IBM IMARS Spatial Gridding
First place in the 1st and 2nd ImageCLEF Medical Imaging Classification (http://www.imageclef.org/2012/medical).
Task: determine which modality a medical image belongs to.
- Images from PubMed articles
- 31 categories (x-ray, CT, MRI, ultrasound, etc.)
Image Filters
In addition to histograms, another group of features can be represented as "filters". For example:
1. Haar-like filters (Viola-Jones face detection)
2. Gabor filters (simple cells in the visual cortex can be modeled by Gabor functions)
These are widely used in fingerprint, iris, OCR, texture, and face recognition.
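A minimal sketch of a small Gabor filter bank, assuming OpenCV and illustrative parameter values (kernel size, sigma, wavelength); the responses are summarized by simple statistics to form a texture feature:

```python
import cv2
import numpy as np

def gabor_features(gray, n_orientations=4):
    """gray: HxW uint8 grayscale image. Returns mean/std of each filter response."""
    feats = []
    for k in range(n_orientations):
        theta = k * np.pi / n_orientations
        kernel = cv2.getGaborKernel((21, 21), 4.0, theta, 10.0, 0.5)  # ksize, sigma, theta, lambd, gamma
        response = cv2.filter2D(gray, cv2.CV_32F, kernel)
        feats.extend([response.mean(), response.std()])
    return np.asarray(feats, dtype=np.float32)
```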
SIFT Feature and Bag-of-Words Model
Classical features:
• Raw pixels
• Histogram features: color histogram, edge histogram, LBP
• Frequency analysis
• Image filters
• Texture features
• Scene features: GIST
• Shape descriptors
• Edge detection
• Corner detection
• …
SIFT (1999) and beyond:
• SIFT
• HOG
• SURF
• DAISY
• BRIEF
• DoG
• Hessian detector
• Harris-Laplace
• FAST
• ORB
• …
Scale-Invariant Feature Transform (SIFT)
David G. Lowe:
- Distinctive image features from scale-invariant keypoints, IJCV 2004
- Object recognition from local scale-invariant features, ICCV 1999
SIFT descriptor: a histogram of gradient orientations, concatenated over spatial cells.
- Histograms are more robust to position than raw pixels
- Edge gradients are more distinctive than color for local patches
David Lowe's careful performance tuning:
• Good parameters: 4 orientation bins, 4x4 spatial grid
• Soft assignment to spatial bins
• Gaussian weighting over spatial location
• Reduced influence of large gradient magnitudes: thresholding + normalization
Scale-Invariant Feature Transform (SIFT)
SIFT detector: detect maxima and minima of the difference-of-Gaussian (DoG) in scale space.
Post-processing: keep corner points but reject low-contrast and edge points.
• For general object recognition, we may combine multiple detectors (e.g., Harris, Hessian) or use dense sampling for good performance.
• Following SIFT, many works including SURF, BRIEF, ORB, BRISK, etc. have been proposed for faster local feature extraction.
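A minimal sketch of SIFT extraction, assuming OpenCV's implementation (not Lowe's original code) and a hypothetical image file name; it shows both interest-point detection and the dense-sampling alternative mentioned above:

```python
import cv2

sift = cv2.SIFT_create()                                  # requires opencv-python >= 4.4
gray = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)      # hypothetical file name

# 1) Interest-point detection (DoG extrema) + 128-dim descriptors
keypoints, descriptors = sift.detectAndCompute(gray, None)   # descriptors: (N, 128)

# 2) Dense sampling: describe a regular grid of patches instead of detected points
step, size = 8, 16.0
grid = [cv2.KeyPoint(float(x), float(y), size)
        for y in range(0, gray.shape[0], step)
        for x in range(0, gray.shape[1], step)]
_, dense_descriptors = sift.compute(gray, grid)
```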
Histogram of Local Features and Bag-of-Words Models
Histogram of Local Features
(Figure: each local descriptor is assigned to a codeword and the frequency of each codeword is counted; dim = #codewords.)
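A minimal sketch of building such a histogram, assuming a k-means codebook learned with scikit-learn (any vector-quantization method would do):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def learn_codebook(all_descriptors, n_codewords=1000):
    """all_descriptors: descriptors pooled from training images, shape (M, 128)."""
    return MiniBatchKMeans(n_clusters=n_codewords).fit(all_descriptors)

def bow_histogram(descriptors, codebook):
    """descriptors: (N, 128) for one image. Returns a #codewords-dim histogram."""
    words = codebook.predict(descriptors)                   # nearest codeword per descriptor
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float32)
    return hist / max(hist.sum(), 1.0)
```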
Histogram of Local Features + Spatial Gridding
(Figure: one histogram per grid cell, concatenated; dim = #codewords x #grids.)
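A minimal sketch of the gridded version, reusing bow_histogram and the codebook from the previous sketch; each descriptor is routed to the grid cell containing its keypoint:

```python
import numpy as np

def gridded_bow(descriptors, locations, image_size, codebook, n=2):
    """locations: (N, 2) keypoint (x, y) positions; image_size: (W, H)."""
    W, H = image_size
    col = np.minimum(locations[:, 0] * n // W, n - 1)
    row = np.minimum(locations[:, 1] * n // H, n - 1)
    cell = (row * n + col).astype(int)
    hists = []
    for g in range(n * n):
        d = descriptors[cell == g]
        if len(d) == 0:                                   # empty cell -> all-zero histogram
            hists.append(np.zeros(codebook.n_clusters, dtype=np.float32))
        else:
            hists.append(bow_histogram(d, codebook))      # helper from the sketch above
    return np.concatenate(hists)                          # dim = #codewords * n*n
```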
Bag-of-Words Models
Bag-of-Words Representation
In computer vision, an object is represented as a bag of visual "words", analogous to the bag-of-words representation of documents in text and NLP. (Slide credit: Fei-Fei Li)
Topic Models for the Bag-of-Words Representation
• Supervised classification: Fei-Fei et al., CVPR 2005
• Unsupervised classification: Sivic et al., ICCV 2005
• Classification + segmentation: Cao and Fei-Fei, ICCV 2007
Pros and Cons of Bag-of-Words Models
Images differ from text! Bag-of-words models are good at:
- Modeling prior knowledge
- Providing intuitive interpretations
But these models suffer from:
- Loss of spatial information
- Loss of information in the quantization into "visual words", which motivates better coding approaches
Sparse Coding
Sparse Coding
• The naïve histogram uses vector quantization (VQ) as a hard assignment, while sparse coding (SC) provides a soft assignment.
• Sparse coding relaxes the sparse l0-norm solution: min_c ||x - Bc||^2 + lambda * ||c||_1 over the codebook B.
• SC works better with max pooling (while traditional VQ uses average pooling).
• References: [M. Ranzato et al., CVPR'07], [J. Yang et al., CVPR'09], [J. Wang et al., CVPR'10], [Y. Boureau et al., CVPR'10]
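A minimal sketch of sparse coding with max pooling in the spirit of the references above, assuming scikit-learn's SparseCoder and a codebook (dictionary) with L2-normalized rows, e.g. learned with DictionaryLearning:

```python
import numpy as np
from sklearn.decomposition import SparseCoder

def sc_max_pooling(descriptors, dictionary, alpha=0.15):
    """descriptors: (N, 128) for one image; dictionary: (K, 128).
    Returns a K-dim image-level code via max pooling."""
    coder = SparseCoder(dictionary=dictionary,
                        transform_algorithm="lasso_lars",   # l1-regularized coding
                        transform_alpha=alpha)
    codes = coder.transform(descriptors)                    # (N, K), mostly zeros
    return np.abs(codes).max(axis=0)                        # max pooling (vs. average for VQ)
```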
Sparse Coding + Spatial Pyramid
Yang et al., Linear Spatial Pyramid Matching using Sparse Coding for Image Classification, CVPR 2009.
Pipeline: sparse coding + spatial pyramid + linear SVM.
Efficient Approach: Locality-constrained Linear Coding (LLC) [J. Wang et al., CVPR'10]
1. Find the k nearest codewords to the query descriptor
2. Compute the sparse code using only those k neighbors
Significantly faster than naïve SC, e.g., O(1000a) -> O(5a). For further speedup, least-squares regression can replace SC on the neighbors, and the top-k search itself can be accelerated.
Matlab implementation: http://www.ifp.illinois.edu/~jyang29/LLC.htm
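A minimal sketch of the approximated LLC coding step described above (k-NN search followed by a small constrained least-squares on the selected codewords); parameter names and the regularization constant are illustrative, not the authors' Matlab interface:

```python
import numpy as np

def llc_code(x, codebook, k=5, beta=1e-4):
    """x: (d,) descriptor; codebook: (K, d). Returns a K-dim code with k nonzeros."""
    K = codebook.shape[0]
    # 1) k nearest codewords of the query descriptor
    dists = np.sum((codebook - x) ** 2, axis=1)
    idx = np.argsort(dists)[:k]
    # 2) min ||x - c^T B_knn||^2  s.t.  sum(c) = 1, solved in closed form on the k neighbors
    z = codebook[idx] - x                        # shift codewords to the query point
    C = z @ z.T
    C += np.eye(k) * beta * np.trace(C)          # regularization for numerical stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()
    code = np.zeros(K)
    code[idx] = w
    return code
```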
Sparse Codes Are Not Necessarily Sparse
Hard quantization (VQ): min_c ||x - Bc||^2 s.t. c has exactly one nonzero entry. This is the sparsest possible solution!
Sparse coding: min_c ||x - Bc||^2 + lambda * ||c||_1. Less sparse than VQ!
So sparse coding is actually less sparse than hard quantization, and the image-level representation is not sparse after pooling. Is the success of SC really due to sparsity?
Fisher Vector and Super Vector
Information Loss
• Coding with information loss: both VQ and sparse coding describe x only through coefficients over the codebook, so information about x is discarded.
• Lossless coding keeps enough information to recover x.
• The significant difference: SC or VQ associates a scalar with each codeword, while lossless coding associates a function of x with each codeword.
Lossless Coding as a Mixture of Experts
• Let's look at each codeword as a "local expert": a gating function (e.g., GMM, sparse GMM, harmonic k-means, etc.) softly assigns each descriptor to the experts.
(Figure: gating function routing a descriptor to Expert 1, Expert 2, Expert 3.)
Pooling Towards the Image-Level Representation
(Figure: per-component codes pooled across descriptors, one pool per mixture component.)
Pooling: normalize and concatenate the per-component pooled vectors. Both the Fisher Vector and the Super Vector can be written in this form (with different subtraction, normalization, and scaling factors).
Related references:
• Fisher Vector [Perronnin et al., ECCV'10]
• Super Vector [X. Zhou, K. Yu, T. Zhang et al., ECCV'10]
• HG [X. Zhou et al., ECCV'09]
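A minimal sketch of this style of pooling, assuming a GMM fitted with scikit-learn as the gating function; only the first-order residual term is shown, without the exact subtraction and normalization factors of the Fisher Vector or Super Vector:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def residual_pooling(descriptors, gmm):
    """descriptors: (N, d); gmm: fitted GaussianMixture with C components.
    Returns a (C * d)-dim image-level vector."""
    q = gmm.predict_proba(descriptors)                    # (N, C) soft assignments (gating)
    parts = []
    for k in range(gmm.n_components):
        residual = descriptors - gmm.means_[k]            # (N, d) local "expert" residuals
        parts.append((q[:, k:k + 1] * residual).sum(axis=0) / len(descriptors))
    v = np.concatenate(parts)                             # dim = C * d
    return v / max(np.linalg.norm(v), 1e-12)              # L2 normalization
```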
Pooling Towards the Image-Level Representation
Big model: the dimension becomes C (#components) x d (#feature dims). For example, if C = 1000 and d = 128, the final dimension is 128K, 100+ times longer than that from SC or VQ!
Very Long Vectors as Feature Representations
We can generate very long image feature vectors as discussed above. The strong feature we used for ImageNet LSVRC 2010:
– Dense sampling: LBP + HOG, feature dim = 100 (after PCA)
– GMM with 1024 components
– 4 spatial grids (1 + 3x1)
– Dimension of the image feature: 100 x 1024 x 4 ≈ 0.41M
(Pipeline figure: LBP/HOG extraction -> GMM coding -> pooling.)
How do we train with such big models?
For Small Datasets: Use the Kernel Trick!
• 10K images => kernel matrix of 10K x 10K ≈ 100M entries
• Computational complexity depends on the size of the kernel matrix, which here is smaller than the feature dimension
We tried nonlinear kernels for face verification and got good performance on the LFW dataset: Learning Locally-Adaptive Decision Functions for Person Verification, CVPR'13 (with Z. Li, S. Chang, F. Liang, T. Huang, and J. Smith).
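A minimal sketch of the kernel trick with a precomputed kernel matrix, assuming scikit-learn and an RBF kernel purely for illustration (not the learned similarity of the CVPR'13 paper); training touches only the N x N kernel matrix, not the huge feature dimension:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def train_kernel_svm(X_train, y_train, gamma=1e-3, C=1.0):
    """X_train: (N, D) with D possibly huge; the kernel matrix is only N x N."""
    K_train = rbf_kernel(X_train, X_train, gamma=gamma)   # (N, N)
    return SVC(kernel="precomputed", C=C).fit(K_train, y_train)

def predict_kernel_svm(clf, X_train, X_test, gamma=1e-3):
    K_test = rbf_kernel(X_test, X_train, gamma=gamma)     # (M, N) against training samples
    return clf.predict(K_test)
```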
For Large Datasets: Use Stochastic Gradient Descent
• Suppose we are working on ImageNet with 0.4M-dimensional feature vectors.
• Total training data: 1.2M samples x 0.4M dimensions ≈ 0.5T real values!
– Too big to load into memory
– Too many samples to use kernel tricks
• Solution: Stochastic Gradient Descent (SGD)
– Idea: estimate the gradient from a single randomly picked sample, instead of computing the full gradient over all samples as in batch gradient descent.
SGD Can Be Very Simple To Implement
A 10-line binary SVM solver by Shai Shalev-Shwartz, using a decreasing learning rate.
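A minimal Pegasos-style re-sketch in Python of such an SGD solver for a binary linear SVM with a decreasing learning rate (this is not Shalev-Shwartz's original code); labels y are assumed to be in {-1, +1}:

```python
import numpy as np

def sgd_svm(X, y, lam=1e-4, n_iters=100000, seed=0):
    """X: (N, D) features; y: (N,) labels in {-1, +1}. Returns the weight vector."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iters + 1):
        i = rng.integers(len(X))                   # gradient estimated from one random sample
        eta = 1.0 / (lam * t)                      # decreasing learning rate
        margin_violated = y[i] * X[i].dot(w) < 1   # hinge loss active for this sample?
        w *= (1 - eta * lam)                       # shrinkage from the regularizer
        if margin_violated:
            w += eta * y[i] * X[i]
    return w
```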
Deep CNN and Related Tech