machine learning solutions to visual recognition problems



  1. Machine learning solutions to visual recognition problems. Jakob Verbeek. Summary of scientific work submitted to obtain the degree of Habilitation à Diriger des Recherches.

  2. Summary This thesis gives an overview of my research since my arrival in December 2005 as a postdoctoral fellow in the LEAR team at INRIA Rhône-Alpes. After a general introduction in Chapter 1, the contributions are presented in chapters 2–4 along three themes. In each chapter we describe the contributions and how they relate to prior work, and highlight two contributions in more detail. Chapter 2 is concerned with contributions related to the Fisher vector representation. We highlight an extension of the representation based on modeling dependencies among local descriptors (Cinbis et al., 2012, 2016a). The second highlight is an approximate normalization scheme that speeds up applications for object and action localization (Oneata et al., 2014b). In Chapter 3 we consider the contributions related to metric learning. The first contribution we highlight is a nearest-neighbor based image annotation method that learns weights over neighbors, and effectively determines the number of neighbors to use (Guillaumin et al., 2009a). The second contribution we highlight is an image classification method based on metric learning for the nearest class mean classifier that can efficiently generalize to new classes (Mensink et al., 2012, 2013b). The third set of contributions, presented in Chapter 4, is related to learning visual recognition models from incomplete supervision. The first highlighted contribution is an interactive image annotation method that exploits dependencies across different image labels to improve predictions and to identify the most informative user input (Mensink et al., 2011, 2013a). The second highlighted contribution is a multi-fold multiple instance learning method for learning object localization models from training images where we only know whether the object is present in the image or not (Cinbis et al., 2014, 2016b). Finally, Chapter 5 summarizes the contributions and presents future research directions. A curriculum vitae with a list of publications is available in Appendix A.

  3. Résumé This thesis gives an overview of my research since my arrival in December 2005 as a postdoctoral researcher in the LEAR team at INRIA Rhône-Alpes. After a general introduction in Chapter 1, the contributions are presented in chapters 2–4. Each chapter describes the contributions related to one theme and their relation to the associated work. Two contributions are also highlighted. Chapter 2 concerns the contributions related to the Fisher vector representation. We highlight an extension of this representation based on modeling dependencies among local descriptors (Cinbis et al., 2012, 2016a). The second contribution presented in detail is a set of approximations to the Fisher vector normalizations, which speed up applications in object and action localization (Oneata et al., 2014b). In Chapter 3, we consider the contributions related to metric learning. The first contribution we detail is a nearest-neighbor type image annotation method. This method assigns weights to the neighbors and determines the number of neighbors to use (Guillaumin et al., 2009a). The second contribution we highlight is an image classification method based on metric learning that can generalize to new classes (Mensink et al., 2012, 2013b). The third set of contributions, presented in Chapter 4, relates to learning visual recognition models from incomplete data. The highlighted contribution is an interactive image annotation method that exploits dependencies between the different image labels to improve predictions and to optimize interactions with the user (Mensink et al., 2011, 2013a). The second major contribution is a multiple instance learning method for learning object localization models from images for which we only know whether the object is present in the image or not (Cinbis et al., 2014, 2016b). Finally, Chapter 5 summarizes the contributions and presents directions for future research. A curriculum vitae with a list of publications is available in Appendix A.

  4. Contents
     1 Introduction
       1.1 Context
       1.2 Contents of this document
     2 The Fisher vector representation
       2.1 The Fisher vector image representation
       2.2 Modeling local descriptor dependencies
       2.3 Approximate Fisher vector normalization
       2.4 Summary and outlook
     3 Metric learning approaches
       3.1 Contributions and related work
       3.2 Image annotation with TagProp
       3.3 Metric learning for distance-based classification
       3.4 Summary and outlook
     4 Learning with incomplete supervision
       4.1 Contributions and related work
       4.2 Interactive annotation using label dependencies
       4.3 Weakly supervised learning for object localization
       4.4 Summary and outlook
     5 Conclusion and perspectives
       5.1 Summary of contributions
       5.2 Long-term research directions
     Bibliography
     A Curriculum vitae

  5. Chapter 1 Introduction In this chapter we briefly sketch the context of the work presented in this document in Section 1.1. Then, in Section 1.2, we briefly describe the content of the rest of the document. 1.1 Context In the last decade we have witnessed an explosion in the amount of images and videos that are digitally available, e.g. in broadcasting archives, social media sharing websites, and personal collections. The following two statistics underline this observation. According to Business Insider (see http://www.businessinsider.com), Facebook had 350 million photo uploads per day in 2013. Cisco, the world leader in internet infrastructure, estimates that “Globally, IP video traffic will be 80% of all IP traffic (both business and consumer) by 2019, up from 67% in 2014.” (cis, 2015). These unprecedented quantities of visual data motivate the need for computer vision techniques to assist retrieval, annotation, and navigation of visual content. Arguably, the ultimate goal of computer vision as a scientific and engineering discipline is to build general purpose “intelligent” vision systems. Such a system should be able to “represent” (store in an internally useful format), “interpret” (map input to this format), and “understand” (infer facts about the input based on the representation), at a high semantic level, the scene depicted in an image or a dynamic scene that unfolds in a video. Let us clarify these desiderata with more concrete examples. Scene understanding involves determining which types of objects are present in a scene, where they are, how they interact with each other, etc. These questions require high-level semantic interpretation of the scene, which abstracts away from many of the physical geometric and photometric properties such as viewpoint, illumination, blur,
