Machine Learning Solutions to Visual Recognition Problems
Jakob Verbeek
Habilitation à Diriger des Recherches, Université Grenoble Alpes

Jury:
◮ Prof. Eric Gaussier, Univ. Grenoble Alpes (Président)
◮ Prof. Matthieu Cord, Univ. Pierre et Marie Curie (Rapporteur)
◮ Prof. Erik Learned-Miller, Univ. of Massachusetts (Rapporteur)
◮ Prof. Andrew Zisserman, Univ. of Oxford (Rapporteur)
◮ Dr. Cordelia Schmid, INRIA Rhône-Alpes (Examinateur)
◮ Prof. Tinne Tuytelaars, K.U. Leuven (Examinateur)

Learning-based methods to understand natural imagery
◮ Recognition: people, objects, actions, events, ...
◮ Localization: box, segmentation mask, space-time tube, ...
◮ A technique-driven rather than application-driven approach

Layout of this presentation
◮ Summary of past activities
◮ Overview of contributions
◮ Perspectives

Part I: Summary of past activities

Academic background: 1994 - 2005 - 2016
◮ 1994-1998: MSc Artificial Intelligence, University of Amsterdam
  ◮ With honors; Peter Grünwald, Ronald de Wolf, Paul Vitányi
◮ 1999-2000: MSc Logic, ILLC, University of Amsterdam
  ◮ With honors; Michiel van Lambalgen
◮ 2000-2004: PhD Computer Science, University of Amsterdam
  ◮ Ben Kröse, Nikos Vlassis, Frans Groen
◮ 2005-2007: Postdoctoral fellow, INRIA Rhône-Alpes
  ◮ Bill Triggs
◮ Since 2007: Permanent researcher, INRIA Rhône-Alpes
  ◮ 2009: Promotion to CR1
  ◮ 2016: Outstanding research distinction (PEDR)

Supervised PhD students
◮ 2006-2010: Matthieu Guillaumin
  ◮ Amazon Research, Berlin, Germany
◮ 2008-2011: Josip Krapac
  ◮ Postdoc, Univ. Zagreb, Croatia
◮ 2009-2012: Thomas Mensink, AFRIF best thesis award 2012
  ◮ Postdoc, Univ. Amsterdam, Netherlands
◮ 2010-2014: Gokberk Cinbis, AFRIF best thesis award 2014
  ◮ Assistant Prof., Bilkent Univ., Ankara, Turkey
◮ 2011-2015: Dan Oneață
  ◮ Data scientist, Eloquentix, Bucharest, Romania
◮ Since 2013: Shreyas Saxena
◮ Since 2016: Pauline Luc

Research funding: ANR, EU, Cifre, LabEx
◮ 2006-2009: Cognitive-Level Annotation using Latent Statistical Structure (CLASS), funded by the European Union
◮ 2008-2010: Interactive Image Search, funded by ANR
◮ 2009-2012: Modeling multi-media documents for cross-media access, Cifre PhD with Xerox Research Centre Europe
◮ 2010-2013: Quaero Consortium for Multimodal Person Recognition, funded by ANR
◮ 2011-2015: AXES: Access to Audiovisual Archives, funded by the European Union
◮ 2013-2016: Physionomie: Physiognomic Recognition for Forensic Investigation, funded by ANR
◮ 2016-2018: Weakly supervised structured prediction for semantic segmentation, Cifre with Facebook AI Research
◮ 2016-2020: Deep convolutional and recurrent networks for image, speech and text, Laboratory of Excellence (LabEx) Persyval

Publications
◮ 19 journal articles: 14 in TPAMI, IJCV, PR, TIP
◮ 34 conference papers: 25 in ECCV, CVPR, ICCV, NIPS (6 oral)
◮ 5723 citations, H-index 36, i10-index 58 (Google Scholar)
◮ 3 patents, joint inventions with Xerox Research Centre Europe

Research community service
◮ Associate editor
  ◮ International Journal of Computer Vision (since 2014)
  ◮ Image and Vision Computing Journal (since 2011)
◮ Chairs for international conferences
  ◮ Tutorial chair: ECCV 2016
  ◮ Area chair: CVPR 2015
  ◮ Area chair: ECCV 2012, 2014
  ◮ Area chair: BMVC 2012, 2013, 2014

Part II: Overview of contributions

Layout of this presentation
◮ Summary of past activities
◮ Overview of contributions
  1. The Fisher vector representation
  2. Metric learning approaches
  3. Learning with incomplete supervision
◮ Perspectives

The Fisher vector representation
◮ Data representation by the Fisher score vector [Jaakkola and Haussler, 1999]

    \nabla_\theta \ln p(x; \theta), \quad \theta \in \mathbb{R}^D    (1)

◮ Useful to represent non-vectorial data, e.g. sets, sequences, ...
◮ For images: an iid GMM over the set of local descriptors [Perronnin and Dance, 2007]

    p(x_{1:N}) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n; \mu_k, \sigma_k)    (2)

◮ The Fisher vector contains local first and second order statistics (see the sketch below)

    \nabla_{(\pi_k, \mu_k, \sigma_k)} \ln p(x; \theta) = b + A \sum_{n=1}^{N} p(k \mid x_n) \, [1, x_n, x_n^2]^\top    (3)

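As a concrete illustration of Eqs. (2)-(3), the following is a minimal Python/numpy sketch that computes the mean and variance parts of the Fisher vector of one image from a diagonal-covariance GMM fitted with scikit-learn. The normalization constants follow the commonly used formulation (e.g., Sánchez et al., 2013), and the function name fisher_vector is an illustrative assumption, not the exact pipeline of the cited work.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fisher_vector(descriptors, gmm):
        """Mean and variance parts of the Fisher vector of one image (cf. Eq. 3).

        Assumes gmm is a fitted sklearn GaussianMixture with covariance_type='diag'.
        """
        X = np.atleast_2d(descriptors)                 # (N, D) local descriptors
        N = X.shape[0]
        q = gmm.predict_proba(X)                       # responsibilities p(k | x_n), shape (N, K)
        pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
        sigma = np.sqrt(var)                           # (K, D) standard deviations

        parts = []
        for k in range(gmm.n_components):
            z = (X - mu[k]) / sigma[k]                 # standardized descriptors
            # first-order (mean) and second-order (variance) statistics
            g_mu = (q[:, k, None] * z).sum(axis=0) / (N * np.sqrt(pi[k]))
            g_sigma = (q[:, k, None] * (z ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * pi[k]))
            parts.extend([g_mu, g_sigma])
        return np.concatenate(parts)                   # length 2 * K * D

    # Usage (illustrative): fit the GMM on local descriptors pooled over training
    # images, then encode each image by its Fisher vector.
    # gmm = GaussianMixture(n_components=64, covariance_type='diag').fit(train_descriptors)
    # fv = fisher_vector(image_descriptors, gmm)
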
Related publications
◮ Fisher vectors for non-iid image models
  Cinbis, Schmid, Verbeek [CVPR'12, PAMI'16], 40 citations
◮ Approximate power and L2 normalization of FV
  Oneata, Schmid, Verbeek [CVPR'14], 23 citations
◮ Application to action and event recognition
  Oneata, Schmid, Verbeek, Wang [ICCV'13, IJCV'15], 158 citations
◮ Application to object localization
  Cinbis, Schmid, Verbeek [ICCV'13], 64 citations
◮ Fisher vectors for descriptor layout coding
  Jurie, Krapac, Verbeek [ICCV'11], 135 citations

Fisher vectors for non-iid image models
◮ The independence assumption leads to sum-pooling in the FV
  ◮ Bag-of-words [Csurka et al., 2004, Sivic and Zisserman, 2003] and iid GMM FV [Perronnin and Dance, 2007]
◮ A very poor assumption from a modeling perspective
  ◮ Images are locally self-similar
◮ The representation should discount frequent events
  ◮ In practice compensated for by power normalization, or a Hellinger or χ2 kernel (a sketch follows below)

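A minimal sketch (numpy assumed) of the post-processing mentioned in the last bullet: signed power normalization followed by L2 normalization; alpha = 0.5 gives the widely used square-root (Hellinger-style) transform. The function name is illustrative.

    import numpy as np

    def power_l2_normalize(fv, alpha=0.5):
        """Signed power normalization followed by L2 normalization of a Fisher vector."""
        fv = np.sign(fv) * np.abs(fv) ** alpha   # concave transform that discounts frequent patterns
        norm = np.linalg.norm(fv)
        return fv / norm if norm > 0 else fv
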
Replace iid models with non-iid exchangeable counterparts
[Figure: graphical models of the standard Gaussian mixture (left) and the latent Gaussian mixture (right), with plates over local descriptors i = 1, 2, ..., N and components k = 1, 2, ..., K; in the latent model the mixing weights π and Gaussian parameters μ_k, λ_k are themselves latent, governed by hyper-parameters α, m_k, β_k, a_k, b_k]
◮ Bayesian approach: treat model parameters as latent variables
◮ Compute the Fisher vector w.r.t. the hyper-parameters
◮ Variational inference to approximate the intractable gradients

Comparison to power normalization
[Figure: square-root MoG FV vs. LatMoG FV, plotted against bag-of-words counts and against the Gaussian mean parameter, for several LatMoG variants and values of α]
◮ Fisher vector of the non-iid model vs. power normalization
◮ Qualitatively similar monotonic concave transformations
◮ The latent variable model explains the effectiveness of power normalization

Layout of this presentation
◮ Summary of past activities
◮ Overview of contributions
  1. The Fisher vector representation
  2. Metric learning approaches
  3. Learning with incomplete supervision
◮ Perspectives

Metric learning approaches
◮ Measures of similarity or distance have many applications
  ◮ Retrieval and matching of local descriptors or entire images
  ◮ Nearest neighbor prediction models
  ◮ Verification: do two objects belong to the same category?
◮ Supervised training to discover the important features, since the notion of similarity is task dependent (a training sketch follows below)
◮ Methods such as FDA [Fisher, 1936] use only second moments
[Figure: FDA vs. the learned metric of Mensink et al., 2012]

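Below is a toy numpy sketch of one way such a metric can be trained: a Mahalanobis metric parametrized by a low-rank factor L (so M = L^T L), fit by logistic regression on same/different image pairs, in the spirit of the logistic discriminant metric learning paper listed on the next slide. The low-rank parametrization, learning rate, pair handling, and the name ldml_sketch are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def ldml_sketch(X, pairs, same, dim_out=32, lr=1e-3, epochs=20, seed=0):
        """Toy stochastic-gradient sketch of pairwise logistic metric learning.

        X     : (N, D) image features
        pairs : list of (i, j) index pairs
        same  : 1 if the pair shows the same class/identity, else 0
        Models P(same | i, j) = sigmoid(b - ||L (x_i - x_j)||^2).
        """
        rng = np.random.default_rng(seed)
        L = 0.01 * rng.standard_normal((dim_out, X.shape[1]))
        b = 1.0
        for _ in range(epochs):
            for (i, j), y in zip(pairs, same):
                d = X[i] - X[j]
                z = L @ d
                dist = z @ z                              # squared Mahalanobis distance
                p = 1.0 / (1.0 + np.exp(-(b - dist)))     # predicted P(same)
                # gradient ascent on the pair log-likelihood:
                # d loglik / dL = (y - p) * d(b - dist)/dL = -2 (y - p) z d^T
                L += lr * (-2.0 * (y - p) * np.outer(z, d))
                b += lr * (y - p)
        return L, b
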
Related publications
◮ Coordinated Local Metric Learning
  Saxena and Verbeek [ICCV'15 Workshop]
◮ Metric learning for the nearest class-mean classifier
  Csurka, Mensink, Perronnin, Verbeek [PAMI'13, ECCV'12 oral], 126 citations
◮ Multiple instance metric learning
  Guillaumin, Schmid, Verbeek [ECCV'10], 83 citations
◮ Discriminative metric learning in nearest neighbor models
  Guillaumin, Mensink, Schmid, Verbeek [ICCV'09 oral], 377 citations
◮ Logistic discriminant metric learning
  Guillaumin, Schmid, Verbeek [ICCV'09], 420 citations

Instantaneous adaptation to new samples and classes
◮ Consider a photo-sharing service: a stream of labeled images
◮ Re-training a discriminative model for new data is costly
◮ Generative models are easily updated, but often perform worse
◮ KNN classifiers are very costly to evaluate on large datasets
◮ The nearest class mean (NCM) classifier is linear and easily updated (a sketch follows below)

    y = \arg\min_k \| W (x - \mu_k) \|^2    (4)

◮ Maximum likelihood estimation with a softmax loss

    p(y = k \mid x) \propto \exp\left( -\| W (x - \mu_k) \|^2 \right)    (5)

◮ Corresponds to the class posterior of a generative Gaussian mixture model

    p(x \mid y = k) = \mathcal{N}(x; \mu_k, \Sigma)    (6)

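A minimal numpy sketch of the NCM classifier in Eqs. (4)-(6), assuming the projection W has already been learned by metric learning; the incremental mean update illustrates why new samples and new classes are cheap to absorb. The class name and interface are illustrative assumptions.

    import numpy as np

    class NCMClassifier:
        """Nearest class mean classifier with a fixed linear projection W (cf. Eqs. 4-5)."""

        def __init__(self, W):
            self.W = W          # (d, D) projection learned offline by metric learning
            self.means = {}     # class label -> running mean in input space
            self.counts = {}    # class label -> number of samples seen

        def update(self, x, y):
            # incremental mean update: adding a sample or a new class costs O(D)
            n = self.counts.get(y, 0)
            mu = self.means.get(y, np.zeros_like(x, dtype=float))
            self.means[y] = (n * mu + x) / (n + 1)
            self.counts[y] = n + 1

        def predict_proba(self, x):
            labels = list(self.means)
            dists = np.array([np.sum((self.W @ (x - self.means[c])) ** 2) for c in labels])
            s = -dists                      # unnormalized log-probabilities, Eq. (5)
            p = np.exp(s - s.max())         # softmax with max-shift for numerical stability
            return dict(zip(labels, p / p.sum()))

        def predict(self, x):
            # Eq. (4): assign the class whose projected mean is nearest
            probs = self.predict_proba(x)
            return max(probs, key=probs.get)
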
Experimental evaluation: ImageNet Challenge 2010
◮ Train 1: metric and means estimated from all 1,000 classes
◮ Train 2: metric from 800 classes, means on all 1,000 classes
◮ Test: 200 classes whose data was not used for metric learning in Train 2

    Error in %        KNN    NCM
    Trained on all    38.4   36.4
    Trained on 800    42.4   39.9

◮ The linear NCM classifier outperforms the non-parametric KNN classifier
◮ In both cases the metric is learned
◮ Learning the metric on other classes only moderately impacts performance

Visualization of nearest classes using the L2 and learned metrics
◮ Classes closest to the center of "Palm" in FV image space
◮ The learned Mahalanobis metric is semantically more meaningful
◮ Improves prediction accuracy
◮ Remaining errors are more sensible

Layout of this presentation
◮ Summary of past activities
◮ Overview of contributions
  1. The Fisher vector representation
  2. Metric learning approaches
  3. Learning with incomplete supervision
◮ Perspectives
