Self-supervised Learning of Interpretable Keypoints from Unlabelled Videos www.robots.ox.ac.uk/~vgg/research/ Oral presentation, CVPR 2020 unsupervised_pose/ Tomas Jakab Ankush Gupta VGG, University of Oxford DeepMind, London Hakan Bilen Andrea Vedaldi School of Informatics, VGG, University of Oxford University of Edinburgh Facebook AI Resesarch, London
Goal Learn object keypoints detectors from unlabelled videos and unpaired pose prior Self-supervised landmarks predicted by our model
Factorizing appearance and keypoints 2D keypoint keypoints not bottleneck input image reconstruction interpretable L ( 𝑦 " , 𝑧 " ) decoder encoder . . . ( 𝑦 $ , 𝑧 $ ) style image appearance encoder Related works on self-supervised landmarks: Thewlis, Bilen, Vedaldi. Proc. NIPS, 2017 Thewlis, Bilen, Vedaldi. Proc. ICCV, 2017 Jakab, Gupta, Bilen, Vedaldi. Proc. NeurIPS, 2018 Zhang, Guo, Jin, Luo, He, Lee. Proc. CVPR, 2018 ≈ reconstruction loss Thewlis, Albanie, Bilen, Vedaldi. Proc. CVPR, 2019 Wiles, Koepke, Zisserman. Proc. BMVC, 2018 Lorenz, Bereska, Milbich, Ommer. Proc. CVPR, 2019 Thewlis, Albanie, Bilen, Vedaldi. Proc. CVPR, 2019
Learning to label as image translation skeleton image input image reconstruction decoder encoder discriminator unpaired pose prior style image appearance encoder looks like a skeleton?
Reintroducing bottleneck pre-trained offline skeleton handcrafted bottleneck image input image reconstruction keypoint analytical decoder encoder detector renderer discriminator unpaired pose prior style image appearance encoder looks like a skeleton?
Landmarks estimation Human3.6M prediction Pennaction prediction
Unpaired transfer MPie landmarks unlabelled face videos Voxceleb2 300-W predictions
Evaluation: unsupervised to labeled keypoints other self-supervised methods our method discovered keypoints semantic keypoints directly predicting semantic keypoints supervised linear regression unlabelled videos/images unlabelled videos/images training time + unpaired posed prior images annotated with test time no additional data object keypoints
Human pose estimation unsupervised discovery + supervised regression unsupervised discovery 8.0 25.0 + supervised regression %-MSE norm. by image size 7.0 20.0 6.0 no paired data 5.0 MSE in pixels 15.0 4.0 no paired data 10.0 3.0 2.0 5.0 1.0 0.0 0.0 hourglass Thewlis et Zhang et al. Lorenz et al. ours hourglass Jakab & ours (supervised) al. (supervised) Gupta et al Simplified Human3.6M Human3.6M
Facial landmarks estimation unsupervised discovery + supervised regression %-MSE norm. by inter-ocular distance 10.0 no paired data 9.0 8.0 7.0 6.0 5.0 4.0 3.0 2.0 1.0 0.0 LBF TCDCN RAR Wing Loss Thewlis et Thewlis et Thewlis et Wiles et al. ours ours + al. al. al. supervised regression 300-W
Unpaired sample efficiency 4.5 %-MSE norm. by inter-ocular distance 10.0 4.0 9.0 %-MSE norm. by image size 3.5 8.0 7.0 3.0 6.0 2.5 5.0 2.0 4.0 1.5 3.0 1.0 2.0 0.5 1.0 0.0 0.0 6000 (full 500 50 400k (full 5000 500 50 dataset) dataset) number of unpaired samples number of unpaired samples 300-W Simplified Human3.6M
Disentangling appearance and geometry Mixing geometry and appearance by conditioning on a different identity geometry appearance reconstruction
Manipulating geometric representation
Self-supervised Learning of Interpretable Keypoints from Unlabelled Videos www.robots.ox.ac.uk/~vgg/research/ Oral presentation, CVPR 2020 unsupervised_pose/ Tomas Jakab Ankush Gupta VGG, University of Oxford DeepMind, London Hakan Bilen Andrea Vedaldi School of Informatics, VGG, University of Oxford University of Edinburgh Facebook AI Resesarch, London
Recommend
More recommend