Dictionaries, Manifolds and Domain Adaptation for Image and Video-based Recognition
Rama Chellappa, University of Maryland
Student and the teachers
Major points
• Training and testing data come from different distributions.
  – Distributions are complex due to variations in patterns
  – Domain adaptation
• Robust representations and distance measures
  – Vector space vs manifolds
  – Euclidean vs geodesics
• Will develop these points for two representations of images and videos.
  – Dictionaries
  – Manifolds

Outline of the talk
• Dictionaries
  – Learning and applications to image and video-based recognition.
• Manifolds
  – Representation, inference and applications to image and video-based recognition.
  – Analytical and empirical
• Domain adaptation
  – How to adapt representations to new domains
  – Domain shifts could be due to pose, illumination, rate, time lapse, views, ...
  – Semi-supervised, unsupervised
• Relies on the work of Profs. Amari and Chikuse.

Motivation - 1
Motivation – 2
• Task: Given a probe video of one or more subjects, retrieve their IDs from a gallery of still face images or face videos.
• Challenges: Getting a face image is more than half the problem
  – Low resolution
  – Pose variation
  – Uncontrolled illumination
  – Blur
  – Camera motion

Dictionaries for signal and image analysis
• Matching pursuit algorithms (Mallat, early 1990s)
• Orthogonal matching pursuit (Pati et al., 1993; Tropp, 2004)
• Saito and Coifman, 1997
• Etemad and Chellappa, 1997
• Represent signals using wavelets, wavelet packets, ...
• Learning the dictionary from data instead of using off-the-shelf bases (Olshausen and Field, 1997), ...

Modern day dictionaries
• Represent signals and images using signals and images.
• Sparse coding has a neural basis.
• Allow compositional representations
• Dictionary updates
  – Batch (Method of Optimal Directions)
  – K-SVD
• Dictionaries for images are more complicated
  – Need to account for pose, illumination and resolution variations.

Basic formulation
• Assume L classes and n images per class in the gallery.
• The training images of the k-th class are represented as $D_k = [x_{k1}, \dots, x_{kn}]$.
• Dictionary D is obtained by concatenating all the training images: $D = [D_1, \dots, D_L]$.
• The unknown test vector y can be represented as a linear combination of the training images as $y = D\alpha$.
• The coefficient vector α is sparse.
Wright et al., 2009; Wagner et al., 2011

Dictionary-based face recognition
• α can be recovered by Basis Pursuit as $\hat{\alpha} = \arg\min_{\alpha} \|\alpha\|_1$ subject to $y = D\alpha$.
• Find the reconstruction error while representing the test image with the coefficients of each class separately.
• Select the class giving the minimum reconstruction error.

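The classification rule on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it swaps Basis Pursuit for a greedy orthogonal matching pursuit (listed earlier in the talk), and the function names `omp` and `src_classify`, the sparsity level and the toy gallery are all assumptions made for the example.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Greedy orthogonal matching pursuit (a stand-in for Basis Pursuit)."""
    residual = y.astype(float).copy()
    support = []
    alpha = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Re-fit all selected atoms jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    alpha[support] = coef
    return alpha

def src_classify(D, labels, y, n_nonzero=5):
    """Return the class whose own coefficients reconstruct y best."""
    alpha = omp(D, y, n_nonzero)
    classes = np.unique(labels)
    residuals = np.array([
        np.linalg.norm(y - D[:, labels == k] @ alpha[labels == k])
        for k in classes
    ])
    return classes[int(np.argmin(residuals))], residuals

# Toy gallery (hypothetical data): L = 3 classes, n = 5 images per class.
rng = np.random.default_rng(0)
L, n, block = 3, 5, 10
D = np.zeros((L * block, L * n))
for k in range(L):
    # Each class occupies its own block of coordinates.
    D[k * block:(k + 1) * block, k * n:(k + 1) * n] = rng.standard_normal((block, n))
D /= np.linalg.norm(D, axis=0)          # unit-norm training images
labels = np.repeat(np.arange(L), n)

y = 0.6 * D[:, 6] + 0.4 * D[:, 8]       # test vector mixing two class-1 images
predicted, residuals = src_classify(D, labels, y)
print(predicted, np.round(residuals, 3))
```

In this toy setup the class-1 residual is essentially zero because the test vector lies in the span of class-1 images; real face images would only approximately satisfy this.
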
Learning dictionaries – K-SVD
[Figure: training faces → K-SVD → learned dictionary]
M. Aharon, M. Elad, and A. M. Bruckstein, 2006

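For readers unfamiliar with K-SVD, the following is a compact sketch of its two alternating steps: sparse coding of the training signals, then a rank-one SVD update of each atom together with its coefficients. It is an illustrative reading of the algorithm of Aharon et al., not code from the talk; the helper names `sparse_code` and `ksvd`, the use of OMP for the coding step, the initialization and the synthetic data are assumptions.

```python
import numpy as np

def sparse_code(D, Y, T0):
    """Code each column of Y with at most T0 atoms of D (greedy OMP)."""
    X = np.zeros((D.shape[1], Y.shape[1]))
    for j in range(Y.shape[1]):
        y = Y[:, j]
        residual, support = y.copy(), []
        for _ in range(T0):
            idx = int(np.argmax(np.abs(D.T @ residual)))
            if idx in support:
                break
            support.append(idx)
            coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
            residual = y - D[:, support] @ coef
        X[support, j] = coef
    return X

def ksvd(Y, n_atoms, T0, n_iter=5, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise the atoms with randomly chosen, normalised training signals.
    D = Y[:, rng.choice(Y.shape[1], n_atoms, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    for _ in range(n_iter):
        X = sparse_code(D, Y, T0)
        for k in range(n_atoms):
            users = np.nonzero(X[k])[0]      # signals that use atom k
            if users.size == 0:
                continue
            X[k, users] = 0.0
            # Error of those signals when atom k is left out.
            E = Y[:, users] - D @ X[:, users]
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]                # best rank-one fit: new atom
            X[k, users] = s[0] * Vt[0]       # and its coefficients
    return D, sparse_code(D, Y, T0)

# Synthetic training set (hypothetical): 200 signals, each built from 3 atoms.
rng = np.random.default_rng(1)
D_true = rng.standard_normal((16, 24))
X_true = np.zeros((24, 200))
for j in range(200):
    X_true[rng.choice(24, 3, replace=False), j] = rng.standard_normal(3)
Y = D_true @ X_true

D_learned, X_learned = ksvd(Y, n_atoms=24, T0=3)
print(np.linalg.norm(Y - D_learned @ X_learned) / np.linalg.norm(Y))
```

The printed value is the relative reconstruction error of the learned dictionary on the training signals; for face images the columns of `Y` would be vectorised training faces.
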
Outlier rejection
The illumination problem
• Robust albedo estimation (Biswas et al., PAMI 2009)
  – Estimate albedo
  – Relight images with different light source directions
  – Use relit images for training

Robust estimation of albedo
• Inverse problem: recover albedo and shape from a single intensity image
[Figure: surface normals + light source + albedo → intensity image]
Biswas et al., ICCV 2007; PAMI 2009

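Albedo estimation
• Lambertian assumption: $I = \rho \, \max(n^{\top} s, 0)$, with intensity I, albedo ρ, surface normal n and light source s.
• Initial albedo estimate: $\rho^{(0)} = I / (n^{(0)\top} s^{(0)})$, with initial surface normal $n^{(0)}$ and estimated light source $s^{(0)}$.
• Error in the initial albedo estimate
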
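A small numerical sketch may help show why the error in the initial albedo estimate depends on the signal. Under the Lambertian model, dividing the image by an inexact shading term gives the true albedo multiplied by a ratio of shadings, so the error scales with the albedo itself. The function names, image size and perturbation levels below are assumptions chosen for illustration; the robust estimator from the cited papers is not implemented here.

```python
import numpy as np

def unit(v):
    """Normalise vectors along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def shading(normals, light):
    """Lambertian shading max(n . s, 0) for normals of shape (H, W, 3)."""
    return np.clip(normals @ light, 0.0, None)

def initial_albedo(image, normals0, light0, eps=1e-3):
    """Initial estimate: image divided by the shading from rough normals/light."""
    return image / np.clip(normals0 @ light0, eps, None)

rng = np.random.default_rng(0)
H, W = 4, 4
albedo = rng.uniform(0.2, 0.9, size=(H, W))              # true albedo (toy)
normals = unit(np.dstack([0.2 * rng.standard_normal((H, W)),
                          0.2 * rng.standard_normal((H, W)),
                          np.ones((H, W))]))              # mostly facing the camera
light = unit(np.array([0.2, 0.1, 1.0]))
image = albedo * shading(normals, light)                  # Lambertian image

# Rough initial normals and light source (perturbed versions of the truth).
normals0 = unit(normals + 0.05 * rng.standard_normal(normals.shape))
light0 = unit(light + 0.05 * rng.standard_normal(3))

rho0 = initial_albedo(image, normals0, light0)
noise = rho0 - albedo
# The error is the true albedo times a shading-ratio term: signal dependent.
predicted_noise = albedo * (shading(normals, light) / shading(normals0, light0) - 1.0)
print(np.max(np.abs(noise - predicted_noise)))
```

With exact normals and light source the initial estimate equals the true albedo, which is a quick sanity check on the model.
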
Albedo estimation
• Initial albedo estimate with signal-dependent additive noise
• Non-stationary Mean Non-stationary Variance (NMNV) model for the true unknown albedo
• Unbiased source assumption
• Uncorrelated noise

Estimated albedo – PIE dataset
Relighting using the estimated albedo
Experimental results
Yale B data set
• DFR – 99.17 %
• SRC – 98.1 %
• CDPCA – 98.83 %
V. M. Patel, T. Wu, S. Biswas, P. J. Phillips, and R. Chellappa, “Dictionary-based face recognition under variable lighting and pose,” IEEE Transactions on Information Forensics and Security, 2011.

Outdoor face dataset
An outdoor dataset with 18 subjects, with 5 gallery images each and 90 low-resolution images.
Gallery – 120 x 120
Probe – 20 x 20
Method          Recognition
SLRFR           67%
Reg. LDA+SVM    60%
CLPM            16.1%
BTAS 2011

Video dictionaries for face recognition
• Preprocessing (extract frames and detect/crop face regions)
• Using a summarization algorithm to partition cropped face images
• Dictionary learning for each partition and finding sequence-specific dictionaries
• Constructing distance/similarity matrices
• Recognition / verification
[1] N. Shroff, P. Turaga, and R. Chellappa, “Video precis: Highlighting diverse aspects of videos,” IEEE Transactions on Multimedia, 2010.
NIPS 2011, ECCV 2012

Dictionary learning (build sequence-specific dictionaries)
• Take the gallery matrix of the k-th partition of the j-th video sequence of subject i.
• Given this gallery matrix, use the K-SVD algorithm [2] to build a (partition-level) sub-dictionary that sparsely represents it.
• Concatenate the (partition-level) sub-dictionaries to form a sequence-specific dictionary.
[2] M. Aharon, M. Elad and A. M. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311-4322, 2006

Recognition/Identification
• Given the m-th query video sequence, we generate its partitions.
• The distance between the query partitions and the dictionary of the p-th video sequence is calculated.
• We select the best match as the one with the minimum distance.

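The pipeline of the last three slides can be sketched end to end as follows. Several pieces are stand-ins and should be read as assumptions rather than the method of the talk: the sub-dictionaries are taken as leading singular vectors instead of being learned with K-SVD, the partitions are given rather than produced by a summarization algorithm, and the distance is a mean least-squares residual of the query frames against each sequence-specific dictionary. All function names are hypothetical.

```python
import numpy as np

def sub_dictionary(G, n_atoms):
    """Stand-in for K-SVD: leading left singular vectors of a partition."""
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :n_atoms]

def sequence_dictionary(partitions, n_atoms):
    """Concatenate partition-level sub-dictionaries into one dictionary."""
    return np.hstack([sub_dictionary(G, n_atoms) for G in partitions])

def residual_distance(Q, D):
    """Assumed distance: mean residual of query frames (columns of Q) on D."""
    coef, *_ = np.linalg.lstsq(D, Q, rcond=None)
    return float(np.mean(np.linalg.norm(Q - D @ coef, axis=0)))

def identify(query_partitions, gallery_dicts):
    """Pick the gallery sequence with the smallest distance to the query."""
    dist = [min(residual_distance(Q, D) for Q in query_partitions)
            for D in gallery_dicts]
    return int(np.argmin(dist)), dist

# Toy gallery (hypothetical): 3 sequences, 2 partitions each, 12 frames each.
rng = np.random.default_rng(0)
dim, d, frames = 50, 3, 12

def make_partition(basis):
    """Frames drawn from the low-dimensional subspace spanned by `basis`."""
    return basis @ rng.standard_normal((d, frames))

bases = [[rng.standard_normal((dim, d)) for _ in range(2)] for _ in range(3)]
gallery_dicts = [sequence_dictionary([make_partition(B) for B in seq], d)
                 for seq in bases]

# The query is generated from the same subspaces as gallery sequence 1.
query_partitions = [make_partition(B) for B in bases[1]]
best, dist = identify(query_partitions, gallery_dicts)
print(best, np.round(dist, 3))
```
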
MBGC recognition results
MBGC dataset:
• 397 walking (frontal-face) videos: 198 SDs + 199 HDs
• 371 activity (profile-face) videos: 185 SDs + 186 HDs

AU-Dictionary
• Facial expression analysis using AUs and high-level knowledge available in FACS regarding AU composition and expression decomposition
• AUs have ambiguous semantic descriptions, so it is difficult to model them accurately
  – We use local features to model each AU

• We learn separate dictionaries for each AU
• AU-Dictionary is then formed using all the individual AU dictionaries.
[Figure: D = individual dictionaries for AU-1, AU-2, AU-5, AU-10, AU-12, AU-23, concatenated]

• Objective function to be minimized:
• Goal:
  – To simultaneously learn structures on the expressive face and corresponding subspace representations
  – We want the final subspaces to be as separate as possible
• Objective:
  – Structures: disjoint subsets of local patch descriptors
  – Learned dictionaries for the structures
• Learned structures for the universal expressions from the CK+ dataset

• Minimum residual error
Some additional results
• Competitive results for iris recognition. Enables cancelability. (PAMI 2011)
• Non-linear dictionaries through kernelization produce improvements of 5-10% depending on the problem. (ICASSP 2012)
  – Illustrated using the USPS dataset and the Caltech 101 and 256 datasets.
• Building dictionaries in the Radon transform domain yields robustness to in-plane rotation and scale in CBIR applications. (IEEE TIP)
• Characteristic views (Chakravarty and Freeman) can be built using sparse representation theory. (ICIP 2012)
• Joint sparsity-driven dictionary learning produces improvements in multi-modal biometrics applications. (Under review)
• Reconstruction from sparse gradients (IEEE TIP 2012), in collaboration with Anna Gilbert.

Domain adaptation: Motivation
• Source domain: data X, labels Y
• Target domain: data X’, labels Y’
• Transfer learning [1]: P(Y|X) ≠ P(Y’|X’), P(X) ≈ P(X’)
• Domain adaptation: P(X) ≠ P(X’), P(Y|X) ≈ P(Y’|X’)
Image credit: Saenko et al., ECCV 2010; Bergamo et al., NIPS 2010
[1] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Trans. Knowledge and Data Engineering, 22:1345–1359, October 2010.

Domain adaptation - Related work
Semi-supervised
• Learns domain change through correspondence
  – Daume and Marcu, JAIR ’06
  – Duan et al., ICML ’09
  – Xing et al., KDD ’07
  – Saenko et al., ECCV 2010; Kulis et al., CVPR 2011
  – Bergamo and Torresani, NIPS 2010
  – Lai and Fox, IJRR 2010
Unsupervised
• No correspondence, no knowledge of domain change
  – Ben-David et al., AISTATS ’10
  – Blitzer et al., NIPS ’08
  – Wang and Mahadevan, IJCAI ’09
  – Gopalan, Li and Chellappa, ICCV 2011
  – Gong et al., CVPR 2012
  – Zheng and Chellappa, ICPR 2012
D. Xu’s group, 2012

Unsupervised domain adaptation*
• Domain 1 (labeled): labeled source domain (X), with generative subspace S_1 from X
• Domain 2 (unlabeled): unlabeled target domain (X~), with generative subspace S_2 from X~ (no labels)
• Intermediate domains (incremental learning): subspaces S_1.3, S_1.6 between S_1 and S_2 on G_{N,d}
* R. Gopalan, R. Li, R. Chellappa, “Domain adaptation for object recognition: An unsupervised approach”, International Conference on Computer Vision, ICCV 2011 (Oral)

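One way to obtain intermediate subspaces such as S_1.3 and S_1.6 is to sample points along the geodesic between S_1 and S_2 on the Grassmann manifold G_{N,d}. The sketch below uses the standard closed-form geodesic built from an SVD; it assumes the two subspaces are not orthogonal to each other (so that `S1.T @ S2` is invertible), and the function names, the PCA-based subspaces and the random toy data are illustrative assumptions rather than the implementation from the cited paper.

```python
import numpy as np

def orthonormal_basis(X, d):
    """Top-d principal directions of the columns of X (a PCA subspace)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :d]

def grassmann_geodesic(S1, S2, ts):
    """Orthonormal bases along the geodesic from span(S1) to span(S2)."""
    # Direction from S1 towards S2, expressed in the complement of span(S1).
    M = (S2 - S1 @ (S1.T @ S2)) @ np.linalg.inv(S1.T @ S2)
    U, sig, Vt = np.linalg.svd(M, full_matrices=False)
    theta = np.arctan(sig)                      # principal angles
    points = []
    for t in ts:
        Y = (S1 @ Vt.T @ np.diag(np.cos(t * theta)) @ Vt
             + U @ np.diag(np.sin(t * theta)) @ Vt)
        points.append(Y)
    return points

# Toy domains (hypothetical data): N = 20 features, d = 3 dimensional subspaces.
rng = np.random.default_rng(0)
X_source = rng.standard_normal((20, 40))
X_target = rng.standard_normal((20, 40)) + 0.5
S1 = orthonormal_basis(X_source, 3)
S2 = orthonormal_basis(X_target, 3)

# t = 0 is S_1, t = 1 spans S_2; 0.3 and 0.6 play the role of S_1.3 and S_1.6.
subspaces = grassmann_geodesic(S1, S2, ts=[0.0, 0.3, 0.6, 1.0])
for Y in subspaces:
    print(np.round(np.linalg.norm(Y.T @ Y - np.eye(3)), 12))
```

Projecting source and target data onto each sampled subspace would then give a sequence of intermediate representations on which a classifier could be trained.
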