
Learning How to Move and Where to Look from Unlabeled Video



  1. Learning How to Move and Where to Look from Unlabeled Video. Kristen Grauman, Department of Computer Science, University of Texas at Austin

  2. Visual recognition. Categories: objects, activities, scenes, locations, text / writing, faces, gestures, motions, emotions… Example labels: amusement park, sky, Cedar Point, The Wicked Twister, Ferris wheel, ride, 12 E, Lake Erie, water, tree, people waiting in line, people sitting on ride, umbrellas, maxair, carousel, deck, bench, pedestrians. Kristen Grauman, UT Austin

  3. Visual recognition: applications. Organizing visual content; science and medicine; AI and autonomous robotics; personal photo/video collections; gaming, HCI, augmented reality; surveillance and security.

  4. Significant recent progress in the field: big labeled datasets, deep learning, GPU technology. [Chart: ImageNet top-5 error (%)]

  5. Recognition benchmarks: PASCAL (2007-12), BSD (2001), Caltech 101 (2004), Caltech 256 (2006), LabelMe (2007), ImageNet (2009), SUN (2010), Places (2014), MS COCO (2014), Visual Genome (2016).

  6. How do our systems learn about the visual world today? [Image labels: dog, …, boat] Expensive and restrictive in scope.

  7. Big picture goal: Embodied visual learning. Status quo: learn from a “disembodied” bag of labeled snapshots. Our goal: visual learning in the context of acting and moving in the world. Inexpensive and unrestricted in scope.


  9. Talk overview: Towards embodied visual learning. (1) Learning representations tied to ego-motion. (2) Learning representations from unlabeled video. (3) Learning how to move and where to look.

  10. The kitten carousel experiment [Held & Hein, 1963]. Passive kitten vs. active kitten. Key to perceptual development: self-generated motion + visual feedback.

  11. Our idea: Ego-motion vision. Goal: teach a computer vision system the connection between “how I move” and “how my visual surroundings change”. Training input: unlabeled video + ego-motion motor signals. [Jayaraman & Grauman, ICCV 2015]

  12. Ego-motion vision: view prediction. After moving: [figure not reproduced]

  13. Ego-motion vision for recognition. Learning this connection requires: depth and 3D geometry; semantics; context. These are also key to recognition! Can be learned without manual labels! Our approach: unsupervised feature learning using egocentric video + motor signals. [Jayaraman & Grauman, ICCV 2015]

  14. Approach idea: Ego-motion equivariance. Invariant features: unresponsive to some classes of transformations. Prior work: Simard et al, Tech Report, ’98; Wiskott et al, Neural Comp ’02; Hadsell et al, CVPR ’06; Mobahi et al, ICML ’09; Zou et al, NIPS ’12; Sohn et al, ICML ’12; Cadieu et al, Neural Comp ’12; Goroshin et al, ICCV ’15; Lies et al, PLoS Computational Biology ’14; …

  15. Approach idea: Ego-motion equivariance. Invariant features: unresponsive to some classes of transformations. Equivariant features: predictably responsive to some classes of transformations, through simple mappings (e.g., linear), the “equivariance map”. Invariance discards information; equivariance organizes it.
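As a toy illustration of the invariant/equivariant distinction on slide 15 (not taken from the talk): for a 1-D signal under circular shifts, the DFT magnitude is an invariant feature, while the complex DFT coefficients are equivariant, because a shift by `s` multiplies coefficient `k` by a known phase, which is a linear (diagonal) map. The signal, the shift, and the helper name `shift_map` are assumptions made for this sketch.

```python
import numpy as np


def shift_map(n, s):
    """Diagonal linear map that a circular shift by s induces on the DFT
    of a length-n signal (hypothetical helper for this illustration)."""
    k = np.arange(n)
    return np.diag(np.exp(-2j * np.pi * k * s / n))


rng = np.random.default_rng(0)
x = rng.normal(size=8)        # toy "image": a random 1-D signal
s = 3
x_shifted = np.roll(x, s)     # the transformation g: circular shift by s

z = np.fft.fft(x)
z_shifted = np.fft.fft(x_shifted)

# Invariant feature: the magnitude spectrum does not respond to the shift.
invariant_ok = np.allclose(np.abs(z), np.abs(z_shifted))

# Equivariant feature: the spectrum does change, but predictably,
# through the linear map M_g = shift_map(n, s).
equivariant_ok = np.allclose(shift_map(len(x), s) @ z, z_shifted)

print(invariant_ok, equivariant_ok)
```

The magnitude feature cannot tell how far the signal moved, whereas the equivariant feature keeps that information in a form a simple map can read off.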

  16. Approach idea: Ego-motion equivariance. Training data: unlabeled video + motor signals (plot axes: motor signal, time), organized by ego-motions (left turn, right turn, forward). Learn an equivariant embedding: pairs of frames related by similar ego-motion should be related by the same feature transformation. [Jayaraman & Grauman, ICCV 2015]

  17. Approach idea: Ego-motion equivariance. Training data: unlabeled video + motor signals (plot axes: motor signal, time), organized by ego-motions. Learn an equivariant embedding. [Jayaraman & Grauman, ICCV 2015]

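  18. Ego-motion equivariant feature learning. Given: [not recovered]. Desired: for all motions $g$ and all images $\mathbf{x}$, $\mathbf{z}_\theta(g\mathbf{x}) \approx M_g\,\mathbf{z}_\theta(\mathbf{x})$. Unsupervised training: each frame pair $(\mathbf{x}_i, g\mathbf{x}_i)$ is mapped into feature space as $\mathbf{z}_\theta(\mathbf{x}_i)$ and $\mathbf{z}_\theta(g\mathbf{x}_i)$. [Jayaraman & Grauman, ICCV 2015]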

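  19. Ego-motion equivariant feature learning. Given: [not recovered]. Desired: for all motions $g$ and all images $\mathbf{x}$, $\mathbf{z}_\theta(g\mathbf{x}) \approx M_g\,\mathbf{z}_\theta(\mathbf{x})$. Unsupervised training: equivariance on frame pairs $(\mathbf{x}_i, g\mathbf{x}_i)$. Supervised training: softmax loss on class labels. The feature parameters $\theta$, the equivariance maps $M_g$, and the classifier are jointly trained. [Jayaraman & Grauman, ICCV 2015]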
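The joint objective on slides 18 and 19 can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes a plain squared-error equivariance term, one linear map per discrete ego-motion, and a weighting factor `lam`; all function and variable names are hypothetical, and the actual loss in the ICCV 2015 paper may differ.

```python
import numpy as np


def equivariance_loss(z_before, z_after, motion_ids, maps):
    """Mean of ||M_g z(x) - z(g x)||^2 over frame pairs (x, g x)."""
    residuals = [
        np.sum((maps[g] @ zb - za) ** 2)
        for zb, za, g in zip(z_before, z_after, motion_ids)
    ]
    return float(np.mean(residuals))


def softmax_cross_entropy(logits, labels):
    """Mean softmax loss for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))


def joint_loss(z_before, z_after, motion_ids, maps, logits, labels, lam=1.0):
    """Supervised softmax loss plus weighted unsupervised equivariance loss."""
    return softmax_cross_entropy(logits, labels) + lam * equivariance_loss(
        z_before, z_after, motion_ids, maps
    )


def rotation(angle):
    """2-D rotation matrix, used here as a stand-in equivariance map."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])


# Toy data: features that are perfectly equivariant under two ego-motions.
maps = {"left": rotation(np.pi / 6), "right": rotation(-np.pi / 6)}
rng = np.random.default_rng(0)
motion_ids = ["left", "right", "left", "right"]
z_before = rng.normal(size=(4, 2))
z_after = np.stack([maps[g] @ z for z, g in zip(z_before, motion_ids)])

logits = np.zeros((3, 5))      # an untrained classifier over 5 classes
labels = np.array([0, 2, 4])

total = joint_loss(z_before, z_after, motion_ids, maps, logits, labels)
print(total)
```

In this toy setup the equivariance term vanishes because the features were constructed to obey the maps, so the total reduces to the softmax loss of a uniform classifier. In a real system the features, the maps, and the classifier would all be updated by gradient descent on the combined loss.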

  20. Results: Recognition. Learn from unlabeled car video (KITTI) [Geiger et al, IJRR ’13]. Exploit features for static scene classification (SUN, 397 classes) [Xiao et al, CVPR ’10].

  21. Results: Recognition. Ego-equivariance for unsupervised feature learning. [Bar chart: multi-class accuracy (%) on SUN scenes, 397 classes; series names not recovered.] Egomotion-equivariance induces the strongest representations. Pre-trained models available. Footnoted baselines: + Hadsell, Chopra, LeCun, “Dimensionality Reduction by Learning an Invariant Mapping”, CVPR 2006; * Agrawal, Carreira, Malik, “Learning to see by moving”, ICCV 2015.

  22. Talk overview: Towards embodied visual learning. (1) Learning representations tied to ego-motion. (2) Learning representations from unlabeled video. (3) Learning how to move and where to look.

  23. Learning from arbitrary unlabeled video? Unlabeled video vs. unlabeled video + ego-motion.


  25. Background: Slow feature analysis [Wiskott & Sejnowski, 2002]. Find functions g(x) that map a quickly varying input signal x(t) to slowly varying features y(t). Figure: Laurenz Wiskott, http://www.scholarpedia.org/article/File:SlowFeatureAnalysis-OptimizationProblem.png


  27. Prior work: Slow feature analysis. Wiskott et al, 2002; Hadsell et al. 2006; Mobahi et al. 2009; Bergstra & Bengio 2009; Goroshin et al. 2013; Wang & Gupta 2015; … Learn feature map such that: [equation not reproduced] (invariance).
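To make the slow feature analysis idea concrete, here is a minimal sketch of the classical linear variant: whiten the input, then keep the directions along which the whitened signal changes least from one time step to the next. It is an illustration only; the cited works learn nonlinear or deep feature maps, and the function name `linear_sfa` and the toy signals are assumptions.

```python
import numpy as np


def linear_sfa(x, n_features=1):
    """Linear slow feature analysis.

    x: array of shape (T, D), a time series with a full-rank covariance.
    Returns an array of shape (T, n_features) with the slowest features.
    """
    x = x - x.mean(axis=0)
    # Whiten, so that every output feature has unit variance and the
    # features are mutually decorrelated.
    evals, evecs = np.linalg.eigh(x.T @ x / len(x))
    z = x @ (evecs / np.sqrt(evals))
    # Slow directions are the eigenvectors of the covariance of the
    # temporal differences with the smallest eigenvalues (eigh sorts
    # eigenvalues in ascending order).
    dz = np.diff(z, axis=0)
    _, d_evecs = np.linalg.eigh(dz.T @ dz / len(dz))
    return z @ d_evecs[:, :n_features]


# Toy input x(t): two linear mixtures of a slow and a fast sinusoid.
t = np.linspace(0.0, 2.0 * np.pi, 500)
slow, fast = np.sin(t), np.sin(25.0 * t)
x = np.stack([slow + fast, slow - 2.0 * fast], axis=1)

y = linear_sfa(x, n_features=1)
corr = np.corrcoef(y[:, 0], slow)[0, 1]
print(abs(corr))
```

Although both input channels vary quickly, the extracted feature tracks the slow sinusoid up to sign and scale, which is the sense in which g(x) maps a quickly varying input to a slowly varying output.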

  28. Our idea: Steady feature analysis. Higher order temporal coherence. Learn feature map such that: [equations not reproduced] (invariance) (equivariance). [Jayaraman & Grauman, CVPR 2016]

  29. Our idea: Steady feature analysis. Learn feature map such that: [equations not reproduced] (invariance) (equivariance). [Jayaraman & Grauman, CVPR 2016]
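One way to read "higher order temporal coherence": slowness asks features of adjacent frames to be similar (a small first temporal difference), while steadiness asks feature changes to stay consistent over time (a small second temporal difference). The sketch below uses plain squared-difference penalties on a feature trajectory. That is a simplifying assumption made for illustration, and the CVPR 2016 paper's actual loss may be formulated differently; the function names are hypothetical.

```python
import numpy as np


def slowness_penalty(z):
    """Mean squared first difference, ||z[t+1] - z[t]||^2, of features z (T, D)."""
    return float(np.mean(np.sum(np.diff(z, n=1, axis=0) ** 2, axis=1)))


def steadiness_penalty(z):
    """Mean squared second difference, ||(z[t+2] - z[t+1]) - (z[t+1] - z[t])||^2."""
    return float(np.mean(np.sum(np.diff(z, n=2, axis=0) ** 2, axis=1)))


t = np.arange(10.0)[:, None]
constant = np.ones((10, 2))                 # features that never change
linear = t * np.array([0.5, -1.0])          # features that change at a fixed rate
rng = np.random.default_rng(0)
jittery = rng.normal(size=(10, 2))          # features that change erratically

for name, z in [("constant", constant), ("linear", linear), ("jittery", jittery)]:
    print(name, slowness_penalty(z), steadiness_penalty(z))
```

The constant trajectory is both slow and steady. The linear one is penalized by slowness but is perfectly steady, so steadiness lets the representation respond to change as long as it does so predictably. The jittery trajectory is penalized by both.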

  30. Datasets. Unlabeled video → target task (few labels): Human Motion Database (HMDB) → PASCAL 10 Actions; KITTI Video → SUN 397 Scenes; NORB → NORB 25 Objects. 32 x 32 images or 96 x 96 images.

  31. Results: Steady feature analysis. [Chart: multi-class recognition accuracy.] Footnoted baselines: * Hadsell et al., Dimensionality Reduction by Learning an Invariant Mapping, CVPR’06; ** Mobahi et al., Deep Learning from Temporal Coherence in Video, ICML’09.

  32. Pre-training a representation. Supervised pre-training: labeled images from a related domain, then fine-tune with few labeled images for target task. Unsupervised “pre-training”: unlabeled video, then few labeled images for target task.

  33. Results: Can we learn more from unlabeled video than “related” labeled images? Compared: PASCAL (few img labels); + CIFAR-100 (labeled for other categories); + HMDB (unlabeled video).

  34. Results: Can we learn more from unlabeled video than “related” labeled images? Compared: PASCAL (few img labels); + CIFAR-100 (labeled for other categories); + HMDB (unlabeled video). Better even than providing 50,000 extra manual labels for auxiliary classification task!

  35. Talk overview: Towards embodied visual learning. (1) Learning representations tied to ego-motion. (2) Learning representations from unlabeled video. (3) Learning how to move and where to look.

  36. Current recognition benchmarks: passive, disembodied snapshots at test time, too. BSD (2001), PASCAL (2007-12), Caltech 101 (2004), Caltech 256 (2006), LabelMe (2007), ImageNet (2009), SUN (2010), Places (2014), MS COCO (2014), Visual Genome (2016).

  37. Current recognition benchmarks: passive, disembodied snapshots at test time, too. Object recognition; scene recognition. [Example images marked “?” not reproduced]

  38. Moving to recognize. Time to revisit active recognition in challenging settings! Bajcsy 1985, Aloimonos 1988, Ballard 1991, Wilkes 1992, Dickinson 1997, Schiele & Crowley 1998, Tsotsos 2001, Denzler 2002, Soatto 2009, Ramanathan 2011, Borotschnig 2011, …
