Learning image representations from unlabeled video. Kristen Grauman, Department of Computer Science, The University of Texas at Austin. Work with Dinesh Jayaraman
Learning visual categories • Recent major strides in category recognition • Facilitated by large labeled datasets: ImageNet [Deng et al.], 80M Tiny Images [Torralba et al.], SUN Database [Xiao et al.] [Papageorgiou & Poggio 1998, Viola & Jones 2001, Dalal & Triggs 2005, Grauman & Darrell 2005, Lazebnik et al. 2006, Felzenszwalb et al. 2008, Krizhevsky et al. 2012, Russakovsky IJCV 2015…]
Big picture goal: Embodied vision. Status quo: Learn from “disembodied” bag of labeled snapshots. Our goal: Learn in the context of acting and moving in the world. Kristen Grauman, UT Austin
Beyond “bags of labeled images”? Visual development in nature is based on: • continuous observation • multi-sensory feedback • motion and action … in an environment. Inexpensive and unrestricted in scope. Evidence from psychology, evolutionary biology, cognitive science. [Held et al., 1964] [Moravec et al., 1984] [Wilson et al., 2002]
Talk overview 1. Learning representations tied to ego-motion 2. Learning representations from unlabeled video 3. Learning how to move and where to look Kristen Grauman, UT Austin
The kitten carousel experiment [Held & Hein, 1963] passive kitten active kitten Key to perceptual development: self-generated motion + visual feedback Kristen Grauman, UT Austin
Our idea: Ego-motion ↔ vision. Goal: Teach a computer vision system the connection: “how I move” ↔ “how my visual surroundings change”. Unlabeled video + ego-motion motor signals. Kristen Grauman, UT Austin
Ego-motion ↔ vision: view prediction After moving: Kristen Grauman, UT Austin
Ego-motion ↔ vision for recognition. Learning this connection requires depth and 3D geometry, semantics, and context, all also key to recognition! Can be learned without manual labels! Our approach: unsupervised feature learning using egocentric video + motor signals. Kristen Grauman, UT Austin
Approach idea: Ego-motion equivariance. Invariant features: unresponsive to some classes of transformations: z_θ(gx) ≈ z_θ(x). Simard et al., Tech Report '98; Wiskott et al., Neural Comp '02; Hadsell et al., CVPR '06; Mobahi et al., ICML '09; Zou et al., NIPS '12; Sohn et al., ICML '12; Cadieu et al., Neural Comp '12; Goroshin et al., ICCV '15; Lies et al., PLoS Computational Biology '14; … Kristen Grauman, UT Austin
Approach idea: Ego-motion equivariance. Invariant features: unresponsive to some classes of transformations: z_θ(gx) ≈ z_θ(x). Equivariant features: predictably responsive to some classes of transformations, through simple mappings (e.g., linear): z_θ(gx) ≈ M_g z_θ(x), where M_g is the “equivariance map”. Invariance discards information; equivariance organizes it. Kristen Grauman, UT Austin
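To make the distinction concrete, here is a minimal illustrative sketch (ours, not from the talk). It assumes we already have paired features for images related by the same transformation g, and fits a linear equivariance map by least squares; the feature values are synthetic stand-ins.

```python
import numpy as np

# Illustrative sketch (not the paper's training code): given features of
# image pairs (z_x, z_gx) related by the same ego-motion g, an invariant
# representation satisfies z_gx ≈ z_x, while an equivariant one satisfies
# z_gx ≈ M_g @ z_x for some fixed linear "equivariance map" M_g.

rng = np.random.default_rng(0)
D, N = 64, 500                       # feature dim, number of training pairs
Z_x = rng.normal(size=(N, D))        # stand-in features of original frames
M_true = rng.normal(size=(D, D)) / np.sqrt(D)
Z_gx = Z_x @ M_true.T + 0.01 * rng.normal(size=(N, D))   # transformed-frame features

# Fit the equivariance map by least squares: solve Z_x @ M_g^T ≈ Z_gx.
M_g, *_ = np.linalg.lstsq(Z_x, Z_gx, rcond=None)
M_g = M_g.T

invariance_err   = np.linalg.norm(Z_gx - Z_x) / np.linalg.norm(Z_gx)
equivariance_err = np.linalg.norm(Z_gx - Z_x @ M_g.T) / np.linalg.norm(Z_gx)
print(f"invariance residual:   {invariance_err:.3f}")    # large: features did change
print(f"equivariance residual: {equivariance_err:.3f}")  # small: the change is predictable
```

The point of the sketch: equivariant features are allowed to change under the transformation, but only in a way a simple learned map can predict.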
Approach idea: Ego-motion equivariance. Training data: unlabeled video + motor signals, organized by ego-motions (left turn, right turn, forward). Learn an equivariant embedding: pairs of frames related by similar ego-motions should be related by the same feature transformation. Kristen Grauman, UT Austin
Approach overview Our approach: unsupervised feature learning using egocentric video + motor signals 1. Extract training frame pairs from video 2. Learn ego-motion-equivariant image features 3. Train on target recognition task in parallel Kristen Grauman, UT Austin
Training frame pair mining. Discovery of ego-motion clusters (e.g., left turn, right turn, forward) in the space of yaw change vs. forward distance. Kristen Grauman, UT Austin
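A hedged sketch of how this pair-mining step might look in code, assuming per-frame motor signals are available as arrays; the function and variable names are ours and purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of the pair-mining step: cluster per-frame-pair motor signals
# (yaw change, forward distance between consecutive frames) into coarse
# ego-motion bins such as "left turn", "right turn", "forward".
# `yaw_deltas` and `forward_dists` are hypothetical arrays, one entry per
# consecutive-frame pair from the unlabeled video.

def mine_training_pairs(yaw_deltas, forward_dists, n_clusters=3):
    motions = np.stack([yaw_deltas, forward_dists], axis=1)
    motions = (motions - motions.mean(0)) / (motions.std(0) + 1e-8)  # normalize units
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(motions)
    # Each frame pair (t, t+1) is tagged with its ego-motion cluster id g;
    # pairs sharing a cluster id are later required to share one map M_g.
    return [(t, t + 1, g) for t, g in enumerate(labels)]
```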
Ego-motion equivariant feature learning. Desired: for all ego-motions g and all images x, z_θ(gx) ≈ M_g z_θ(x). Unsupervised training (unlabeled frame pairs): for a pair (x_i, x_j) related by ego-motion g, minimize the equivariance loss ‖M_g z_θ(x_i) − z_θ(x_j)‖². Supervised training (labeled images): softmax classification loss on W z_θ(x_l) against the class label y_l. The feature parameters θ, the maps M_g, and the classifier W are jointly trained. Kristen Grauman, UT Austin
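A rough PyTorch sketch of this joint objective, under our own simplifications: a plain squared-error equivariance term stands in for the paper's exact formulation, and all class, function, and argument names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of the joint objective on one batch: feat_net = z_theta(.),
# maps[g] = one learned linear map M_g per ego-motion cluster,
# classifier = W trained on the (few) labeled images.

class EgoEquivariantModel(nn.Module):
    def __init__(self, feat_net, feat_dim, n_motions, n_classes):
        super().__init__()
        self.feat_net = feat_net                                    # z_theta(.)
        self.maps = nn.ModuleList([nn.Linear(feat_dim, feat_dim)    # M_g
                                   for _ in range(n_motions)])
        self.classifier = nn.Linear(feat_dim, n_classes)            # W

    def loss(self, x_i, x_j, g, x_lab, y_lab, lam=1.0):
        # Unsupervised equivariance term: M_g z(x_i) should match z(x_j),
        # where (x_i, x_j) is a frame pair related by ego-motion cluster g.
        z_i, z_j = self.feat_net(x_i), self.feat_net(x_j)
        pred_j = torch.stack([self.maps[k](z) for k, z in zip(g.tolist(), z_i)])
        equiv_loss = F.mse_loss(pred_j, z_j)
        # Supervised term: standard softmax loss on labeled images.
        cls_loss = F.cross_entropy(self.classifier(self.feat_net(x_lab)), y_lab)
        return cls_loss + lam * equiv_loss                          # jointly trained
```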
Method recap. Approach: mine ego-motion training pairs → neural network training with equivariance maps M_g and classifier W → equivariant embedding. Results: scene and object recognition (football field? pagoda? airport? cathedral? army base? cup? frying pan?) and next-best view selection. Kristen Grauman, UT Austin
Datasets KITTI video [Geiger et al. 2012] Car platform Egomotions: yaw and forward distance SUN images [Xiao et al. 2010] Large-scale scene classification task with 397 categories (static images) NORB images [LeCun et al. 2004] Toy recognition Egomotions: elevation and azimuth Kristen Grauman, UT Austin
Results: Equivariance check. Visualizing how well equivariance is preserved: for a query frame pair (e.g., a left turn or a zoom), we show the nearest-neighbor pair retrieved with our features versus in pixel space. Kristen Grauman, UT Austin
Results: Equivariance check. How well is equivariance preserved? Normalized error compared for: recognition loss only, temporal coherence (Hadsell et al., CVPR 2006; Mobahi et al., ICML 2009), and ours. Kristen Grauman, UT Austin
Results: Recognition. Learn from unlabeled car video (KITTI) [Geiger et al., IJRR '13]. Exploit features for static scene classification (SUN, 397 classes) [Xiao et al., CVPR '10]. Kristen Grauman, UT Austin
Results: Recognition. Do ego-motion equivariant features improve recognition? With only 6 labeled training examples per class, our features improve recognition accuracy over invariance-based baselines* ** on KITTI ⟶ SUN (397 classes), KITTI ⟶ KITTI, and NORB ⟶ NORB: up to a 30% accuracy increase over the state of the art! * Hadsell et al., Dimensionality Reduction by Learning an Invariant Mapping, CVPR '06. ** Mobahi et al., Deep Learning from Temporal Coherence in Video, ICML '09. Kristen Grauman, UT Austin
Recap so far http://vision.cs.utexas.edu/projects/egoequiv/ New embodied visual feature learning paradigm Ego-motion equivariance boosts performance across multiple challenging recognition tasks Future work: volition at training time too Kristen Grauman, UT Austin
Talk overview 1. Learning representations tied to ego-motion 2. Learning representations from unlabeled video 3. Learning how to move and where to look Kristen Grauman, UT Austin
Learning from arbitrary unlabeled video? Previously: unlabeled video + ego-motion. Now: unlabeled video alone. Kristen Grauman, UT Austin
Background: Slow feature analysis [Wiskott & Sejnowski, 2002]. Find functions g(x) that map a quickly varying input signal x(t) to slowly varying features y(t). Figure: Laurenz Wiskott, http://www.scholarpedia.org/article/File:SlowFeatureAnalysis-OptimizationProblem.png Kristen Grauman, UT Austin
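For reference, the classical SFA optimization problem (our summary of Wiskott & Sejnowski 2002, not taken from the slide) is:

```latex
\min_{g_j}\ \big\langle \dot{y}_j(t)^2 \big\rangle_t
\quad \text{with } y_j(t) = g_j(\mathbf{x}(t)),
\qquad \text{s.t. }\ \langle y_j \rangle_t = 0,\ \ \langle y_j^2 \rangle_t = 1,\ \ \langle y_i\, y_j \rangle_t = 0 \ \ (i < j).
```

The zero-mean, unit-variance, and decorrelation constraints rule out the trivial constant solution and force each feature to carry distinct, slowly varying information.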
Background: Slow feature analysis [Wiskott & Sejnowski, 2002] • Existing work exploits “slowness” as temporal coherence in video → learn invariant representation [Hadsell et al. 2006; Mobahi et al. 2009; Bergstra & Bengio 2009; Goroshin et al. 2013; Wang & Gupta 2015,…] • Fails to capture how visual content changes over time in learned embedding Kristen Grauman, UT Austin
Our idea: Steady feature analysis. Higher-order temporal coherence in video → learn an equivariant representation. Second-order slowness operates on frame triplets in the learned embedding. [Jayaraman & Grauman, CVPR 2016] Kristen Grauman, UT Austin
Approach: Steady feature analysis. Learn the classifier W and representation θ jointly, with unsupervised regularization losses: a "slow" term on frame pairs and a "steady" term on frame triplets, each a contrastive loss that also exploits "negative" tuples. Kristen Grauman, UT Austin
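A hedged PyTorch-style sketch of the two regularizers, simplified from the CVPR 2016 formulation; the function names and the exact margin/contrastive form are our assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch of the unsupervised regularizers (our simplification). z_a, z_b, z_c
# are features of temporally adjacent frames (t, t+1, t+2); `pos` is 1 for
# true sequential tuples and 0 for "negative" tuples drawn out of order or
# from other videos.

def contrastive(d, pos, margin=1.0):
    # Pull distances down for positives, push above a margin for negatives.
    return torch.where(pos.bool(), d, F.relu(margin - d)).mean()

def slow_loss(z_a, z_b, pos):
    # First-order coherence: adjacent frames should be close in feature space.
    return contrastive((z_a - z_b).pow(2).sum(1), pos)

def steady_loss(z_a, z_b, z_c, pos):
    # Second-order coherence: the feature-space "velocity" should itself change
    # slowly, i.e. (z_c - z_b) should match (z_b - z_a) for true triplets.
    return contrastive(((z_c - z_b) - (z_b - z_a)).pow(2).sum(1), pos)
```

In training these terms would be added, with weights, to the supervised classification loss on the few labeled images.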
Approach: Steady feature analysis. Architecture: a supervised classification loss combined with the unsupervised slow (pair) and steady (triplet) losses. [Jayaraman & Grauman, CVPR 2016]
Recap: Steady feature analysis. Equivariance ≈ "steadily" varying frame features: d²z_θ(x_t)/dt² ≈ 0. [Jayaraman & Grauman, CVPR 2016] Kristen Grauman, UT Austin
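In discrete time over consecutive frames x_t, x_{t+1}, x_{t+2}, this steadiness condition amounts to (our restatement):

```latex
\frac{d^2 z_\theta(\mathbf{x}_t)}{dt^2} \approx 0
\;\;\Longleftrightarrow\;\;
z_\theta(\mathbf{x}_{t+2}) - z_\theta(\mathbf{x}_{t+1}) \;\approx\; z_\theta(\mathbf{x}_{t+1}) - z_\theta(\mathbf{x}_t),
```

i.e., the feature-space step from one frame to the next stays roughly constant over a triplet.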
Datasets: unlabeled video → target task (few labels). Human Motion Database (HMDB) → PASCAL 10 Actions; KITTI video → SUN 397 Scenes; NORB → NORB 25 Objects. (32 x 32 or 96 x 96 images)
Results: Sequence completion. Given a sequential frame pair, infer the next frame (in the embedding). Our top 3 estimates shown for the KITTI dataset. Kristen Grauman, UT Austin
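One simple way to realize such a query, consistent with the steady-feature assumption but not necessarily the exact protocol used in the paper; the tensor arguments are hypothetical.

```python
import torch

# Sketch of sequence completion in feature space: given features of two
# sequential frames, extrapolate linearly (consistent with "steady" features,
# where the second difference is ~0) and return the nearest database frames.

def complete_sequence(z_t1, z_t2, z_database, top_k=3):
    z_pred = z_t2 + (z_t2 - z_t1)                      # predicted next-frame embedding
    dists = torch.cdist(z_pred.unsqueeze(0), z_database).squeeze(0)
    return dists.topk(top_k, largest=False).indices    # indices of top-k candidate frames
```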