  1. Learning image representations from unlabeled video Kristen Grauman Department of Computer Science The University of Texas at Austin Work with Dinesh Jayaraman

  2. Learning visual categories • Recent major strides in category recognition • Facilitated by large labeled datasets: ImageNet [Deng et al.], 80M Tiny Images [Torralba et al.], SUN Database [Xiao et al.] [Papageorgiou & Poggio 1998, Viola & Jones 2001, Dalal & Triggs 2005, Grauman & Darrell 2005, Lazebnik et al. 2006, Felzenszwalb et al. 2008, Krizhevsky et al. 2012, Russakovsky IJCV 2015, …]

  3. Big picture goal: Embodied vision Status quo: Learn from a "disembodied" bag of labeled snapshots. Our goal: Learn in the context of acting and moving in the world. Kristen Grauman, UT Austin

  4. Beyond "bags of labeled images"? Visual development in nature is based on: • continuous observation • multi-sensory feedback • motion and action … in an environment. Inexpensive and unrestricted in scope. Evidence from psychology, evolutionary biology, cognitive science. [Held et al. 1964] [Moravec et al. 1984] [Wilson et al. 2002]

  5. Talk overview 1. Learning representations tied to ego-motion 2. Learning representations from unlabeled video 3. Learning how to move and where to look Kristen Grauman, UT Austin

  6. The kitten carousel experiment [Held & Hein, 1963] passive kitten active kitten Key to perceptual development: self-generated motion + visual feedback Kristen Grauman, UT Austin

  7. Our idea: Ego-motion ↔ vision Goal: Teach a computer vision system the connection: "how I move" ↔ "how my visual surroundings change". Unlabeled video + ego-motion motor signals Kristen Grauman, UT Austin

  8. Ego-motion ↔ vision: view prediction After moving: Kristen Grauman, UT Austin

  9. Ego-motion ↔ vision for recognition Learning this connection requires depth and 3D geometry, semantics, and context (all also key to recognition!), and it can be learned without manual labels. Our approach: unsupervised feature learning using egocentric video + motor signals Kristen Grauman, UT Austin

  10. Approach idea: Ego-motion equivariance Invariant features: unresponsive to some classes of transformations, z_θ(g x) ≈ z_θ(x). [Simard et al., Tech Report '98; Wiskott et al., Neural Comp '02; Hadsell et al., CVPR '06; Mobahi et al., ICML '09; Zou et al., NIPS '12; Sohn et al., ICML '12; Cadieu et al., Neural Comp '12; Goroshin et al., ICCV '15; Lies et al., PLoS Computational Biology '14; …] Kristen Grauman, UT Austin

  11. Approach idea: Ego-motion equivariance Invariant features: unresponsive to some classes of transformations, z_θ(g x) ≈ z_θ(x). Equivariant features: predictably responsive to some classes of transformations, through simple mappings (e.g., linear): z_θ(g x) ≈ M_g z_θ(x), where M_g is the "equivariance map". Invariance discards information; equivariance organizes it. Kristen Grauman, UT Austin
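To make the distinction concrete, here is a minimal numpy sketch (not from the talk; the toy signal and the cyclic-shift "ego-motion" are illustrative assumptions). The mean feature is invariant to the shift, while the identity feature is equivariant, with a permutation matrix playing the role of the equivariance map M_g.

```python
# Toy illustration of invariance vs. equivariance (assumed example, not the
# authors' code): a 1-D "image" and a cyclic shift standing in for ego-motion g.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)                 # toy 1-D image
g = lambda v: np.roll(v, 1)            # "ego-motion": shift content by one pixel

# Invariant feature: the mean ignores the shift entirely, z(gx) = z(x).
z_inv = lambda v: np.array([v.mean()])
print(np.allclose(z_inv(g(x)), z_inv(x)))        # True

# Equivariant feature: the identity map responds predictably; the effect of g
# in feature space is the linear map M_g (here a permutation matrix).
z_eq = lambda v: v
M_g = np.roll(np.eye(8), 1, axis=0)              # shifts a feature vector by one
print(np.allclose(z_eq(g(x)), M_g @ z_eq(x)))    # True: z(gx) = M_g z(x)
```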

  12. Approach idea: Ego-motion equivariance Learn an equivariant embedding from unlabeled video + motor signals: training data are organized by ego-motion (left turn, right turn, forward), and pairs of frames related by similar ego-motions should be related by the same feature-space transformation. Kristen Grauman, UT Austin

  13. Approach idea: Ego-motion equivariance (Figure: the equivariant embedding is learned from training data organized by ego-motions, i.e., unlabeled video plus the motor signal over time.) Kristen Grauman, UT Austin

  14. Approach overview Our approach: unsupervised feature learning using egocentric video + motor signals 1. Extract training frame pairs from video 2. Learn ego-motion-equivariant image features 3. Train on target recognition task in parallel Kristen Grauman, UT Austin

  15. Training frame pair mining Discovery of ego-motion clusters: temporally adjacent frame pairs are clustered by their motor signals (yaw change, forward distance), yielding ego-motions such as g = left turn, g = right turn, g = forward. Kristen Grauman, UT Austin
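As a sketch only: one plausible way to implement this mining step is k-means over the standardized motor signals, as below. The function name and details are hypothetical, not the authors' code.

```python
# Hypothetical pair-mining sketch: cluster per-frame-pair motor signals
# (yaw change, forward distance) and treat each cluster as one ego-motion g.
import numpy as np
from sklearn.cluster import KMeans

def mine_training_pairs(motor_signals, n_motions=3):
    """motor_signals: (N, 2) array of [yaw_change, forward_distance], one row
    per temporally adjacent frame pair. Returns an ego-motion id per pair."""
    motor_signals = np.asarray(motor_signals, dtype=float)
    # Standardize so yaw (radians) and distance (meters) are comparable.
    z = (motor_signals - motor_signals.mean(0)) / (motor_signals.std(0) + 1e-8)
    g_ids = KMeans(n_clusters=n_motions, n_init=10, random_state=0).fit_predict(z)
    return g_ids  # frame pair i becomes a training pair for ego-motion g_ids[i]
```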

  17. Ego-motion equivariant feature learning Desired: for all motions g and all images x, z_θ(g x) ≈ M_g z_θ(x). Unsupervised training: for each mined frame pair (x_i, g x_i), minimize the equivariance loss L_E = ‖ M_g z_θ(x_i) − z_θ(g x_i) ‖². Supervised training: each labeled image x_k with class y_k passes through z_θ and a classifier W with softmax loss L_C(x_k, y_k). θ, the maps M_g, and W are jointly trained. Kristen Grauman, UT Austin
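A hedged PyTorch sketch of this joint objective, using the notation above (a learned linear map M_g per discovered ego-motion, a squared-error equivariance term on mined pairs, and a softmax term on labeled images). The architecture, loss weight lam, and class names are placeholder assumptions, not the authors' implementation.

```python
# Sketch of joint equivariance + classification training (assumed details).
import torch
import torch.nn as nn

class EquivariantFeatureLearner(nn.Module):
    def __init__(self, feat_dim=64, n_motions=3, n_classes=397):
        super().__init__()
        # Stand-in for the image CNN z_theta; all inputs assumed same image size.
        self.z_theta = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(256), nn.ReLU(), nn.Linear(256, feat_dim))
        # One learned linear equivariance map M_g per discovered ego-motion g.
        self.M = nn.ModuleList(nn.Linear(feat_dim, feat_dim, bias=False)
                               for _ in range(n_motions))
        self.W = nn.Linear(feat_dim, n_classes)  # classifier for labeled images

    def forward(self, x_i, gx_i, g_id, x_k, y_k, lam=1.0):
        # Unsupervised equivariance loss on mined pairs (x_i, g x_i), motion g_id.
        z, z_g = self.z_theta(x_i), self.z_theta(gx_i)
        mapped = torch.stack([self.M[g](zi) for g, zi in zip(g_id.tolist(), z)])
        loss_equiv = ((mapped - z_g) ** 2).sum(dim=1).mean()
        # Supervised softmax loss on labeled images (x_k, y_k).
        loss_cls = nn.functional.cross_entropy(self.W(self.z_theta(x_k)), y_k)
        return loss_cls + lam * loss_equiv   # theta, M_g, and W trained jointly
```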

  18. Method recap APPROACH: ego-motion training pairs → neural network training (features z_θ, equivariance maps M_g with loss L_E, classifier W with loss L_C) → equivariant embedding. RESULTS: scene and object recognition (Football field? Pagoda? Airport? Cathedral? Army base? cup vs. frying pan) and next-best view selection. Kristen Grauman, UT Austin

  19. Datasets • KITTI video [Geiger et al. 2012]: car platform; ego-motions: yaw and forward distance • SUN images [Xiao et al. 2010]: large-scale scene classification task with 397 categories (static images) • NORB images [LeCun et al. 2004]: toy object recognition; ego-motions: elevation and azimuth Kristen Grauman, UT Austin

  20. Results: Equivariance check Visualizing how well equivariance is preserved. (Figure: query frame pairs related by an ego-motion such as a left turn or a zoom, shown alongside the nearest-neighbor pair retrieved with our features and the nearest-neighbor pair retrieved in pixel space.) Kristen Grauman, UT Austin

  21. Results: Equivariance check How well is equivariance preserved? (Chart: normalized equivariance error for features trained with the recognition loss only, with temporal coherence, and with our method.) Temporal coherence: Hadsell et al. CVPR 2006, Mobahi et al. ICML 2009 Kristen Grauman, UT Austin

  22. Results: Recognition Learn from unlabeled car video (KITTI) [Geiger et al., IJRR '13]. Exploit features for static scene classification (SUN, 397 classes) [Xiao et al., CVPR '10]. Kristen Grauman, UT Austin

  23. Results: Recognition Do ego-motion equivariant features improve recognition? (Chart: 397-class recognition accuracy (%) with 6 labeled training examples per class, for KITTI ⟶ SUN, KITTI ⟶ KITTI, and NORB ⟶ NORB, comparing our features against invariance baselines*, **; values shown include 1.58, 1.21, 1.02, 0.70, 0.25.) Up to 30% accuracy increase over the state of the art! * Hadsell et al., Dimensionality Reduction by Learning an Invariant Mapping, CVPR '06 ** Mobahi et al., Deep Learning from Temporal Coherence in Video, ICML '09 Kristen Grauman, UT Austin

  24. Recap so far http://vision.cs.utexas.edu/projects/egoequiv/ • New embodied visual feature learning paradigm • Ego-motion equivariance boosts performance across multiple challenging recognition tasks • Future work: volition at training time too Kristen Grauman, UT Austin

  25. Talk overview 1. Learning representations tied to ego-motion 2. Learning representations from unlabeled video 3. Learning how to move and where to look Kristen Grauman, UT Austin

  26. Learning from arbitrary unlabeled video? Previously: unlabeled video + ego-motion. Now: arbitrary unlabeled video alone. Kristen Grauman, UT Austin

  28. Background: Slow feature analysis [Wiskott & Sejnowski, 2002] Find functions g(x) that map a quickly varying input signal x(t) to slowly varying features y(t). Figure: Laurenz Wiskott, http://www.scholarpedia.org/article/File:SlowFeatureAnalysis-OptimizationProblem.png Kristen Grauman, UT Austin
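For background only, a minimal linear slow feature analysis sketch in numpy (an assumed simplification of the general SFA formulation, not code from the talk): whiten the input signal, then keep the directions whose temporal differences have the smallest variance.

```python
# Minimal linear SFA sketch (assumed simplification of Wiskott & Sejnowski 2002).
import numpy as np

def linear_sfa(X, n_features=2):
    """X: (T, D) multivariate signal over time. Returns (T, n_features)
    slowly varying features with (approximately) unit variance."""
    X = X - X.mean(0)
    # Whiten so the unit-variance / decorrelation constraints become trivial.
    d, E = np.linalg.eigh(np.cov(X, rowvar=False))
    keep = d > 1e-10
    Xw = X @ (E[:, keep] / np.sqrt(d[keep]))
    # Slowness: minimize the variance of the temporal derivative, i.e. take the
    # smallest-eigenvalue directions of the covariance of finite differences.
    dd, V = np.linalg.eigh(np.cov(np.diff(Xw, axis=0), rowvar=False))
    return Xw @ V[:, :n_features]        # eigh sorts ascending: slowest first
```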

  30. Background: Slow feature analysis [Wiskott & Sejnowski, 2002] • Existing work exploits "slowness" as temporal coherence in video → learn an invariant representation [Hadsell et al. 2006; Mobahi et al. 2009; Bergstra & Bengio 2009; Goroshin et al. 2013; Wang & Gupta 2015, …] • Fails to capture how visual content changes over time in the learned embedding Kristen Grauman, UT Austin

  31. Our idea: Steady feature analysis • Higher-order temporal coherence in video → learn an equivariant representation. Second-order slowness operates on frame triplets in the learned embedding. [Jayaraman & Grauman, CVPR 2016] Kristen Grauman, UT Austin

  32. Approach: Steady feature analysis Learn the classifier W and the representation θ jointly, with unsupervised regularization losses: a "slow" term on frame pairs and a "steady" term on frame triplets, each a contrastive loss that also exploits "negative" tuples. Kristen Grauman, UT Austin
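A hedged sketch of what such regularizers could look like, assuming a margin-based contrastive form; the exact loss shape, names, and margin are illustrative assumptions rather than the paper's formulation.

```python
# Sketch of "slow" (pairs) and "steady" (triplets) contrastive regularizers
# for a feature map z_theta; assumed margin-based form, not the paper's code.
import torch
import torch.nn.functional as F

def slow_loss(z_a, z_b, is_neighbor, margin=1.0):
    """First order: neighboring frames map close together; 'negative'
    (non-neighboring) pairs are pushed at least `margin` apart."""
    d = (z_a - z_b).pow(2).sum(dim=1)
    return (is_neighbor * d
            + (1 - is_neighbor) * F.relu(margin - d.sqrt()).pow(2)).mean()

def steady_loss(z_t, z_t1, z_t2, is_positive, margin=1.0):
    """Second order: for true triplets, the feature-space step z_{t+1} - z_t
    should continue steadily into z_{t+2} - z_{t+1}; shuffled 'negative'
    triplets are pushed away by a margin."""
    d = ((z_t2 - z_t1) - (z_t1 - z_t)).pow(2).sum(dim=1)
    return (is_positive * d
            + (1 - is_positive) * F.relu(margin - d.sqrt()).pow(2)).mean()

# total_loss = classification + lam_slow * slow_loss(...) + lam_steady * steady_loss(...)
```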

  33. Approach: Steady feature analysis (Diagram: the supervised classification loss is trained together with the unsupervised "slow" and "steady" regularizers.) [Jayaraman & Grauman, CVPR 2016]

  34. Recap: Steady feature analysis Equivariance ≈ "steadily" varying frame features: d² z_θ(x_t) / dt² ≈ 0, i.e., in discrete form, z_θ(x_{t+2}) − z_θ(x_{t+1}) ≈ z_θ(x_{t+1}) − z_θ(x_t). [Jayaraman & Grauman, CVPR 2016] Kristen Grauman, UT Austin

  35. Datasets Unlabeled video ⟶ target task (few labels): • Human Motion Database (HMDB) ⟶ PASCAL 10 Actions • KITTI video ⟶ SUN 397 Scenes • NORB ⟶ NORB 25 Objects. Images are 32 x 32 or 96 x 96.

  36. Results: Sequence completion Given a sequential pair of frames, infer the next frame (in the embedding). Our top 3 estimates on the KITTI dataset. Kristen Grauman, UT Austin
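One way to read this task: since steady features make feature-space motion roughly linear, a plausible completion procedure (a hypothetical sketch, not necessarily the authors' exact method) extrapolates the feature trajectory and retrieves the nearest database frames as the top estimates.

```python
# Hypothetical sequence-completion sketch in the learned embedding.
import numpy as np

def complete_sequence(z_prev, z_curr, z_database, top_k=3):
    """z_prev, z_curr: (D,) features of two consecutive query frames.
    z_database: (N, D) features of candidate frames. Returns top-k indices."""
    z_next_hat = 2 * z_curr - z_prev     # steadiness => linear extrapolation
    dists = np.linalg.norm(z_database - z_next_hat, axis=1)
    return np.argsort(dists)[:top_k]
```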
