Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation. Anurag Ranjan, Perceiving Systems, Max Planck Institute for Intelligent Systems 1
Varun Jampani, Lukas Balles, Deqing Sun, Kihwan Kim, Jonas Wulff, Michael Black 2
Tübingen, Germany 3
Outline (from Supervised to Unsupervised): Motion and Optical Flow; Deep Learning with Structure; Geometry; Competitive Collaboration; Unsupervised Learning of Everything 4
Motion and Optical Flow 5
Optical Flow: 2D velocity for all pixels between two frames of a video sequence. $I(x, y, t-1) = I(x+u, y+v, t)$ 6
Why do we need Optical Flow? SLAM, Action Recognition, Super-resolution, Video Compression, Slomo, VFX, Unsupervised Segmentation, Motion Magnification. Unsupervised Segmentation: Mahendran et al., VFX: Black et al., Motion Magnification: Liu et al., Action Recognition: Simonyan et al. 7
Optical Flow: 2D velocity for all pixels between two frames of a video sequence. $I(x, y, t-1) = I(x+u, y+v, t)$ 8
Estimating Optical Flow
$I(x, y, t-1) = I(x+u, y+v, t)$
$\min_{u,v} \| I(x, y, t-1) - I(x+u, y+v, t) \|$
Photometric Loss: $\min_{u,v} \rho(I(t-1) - \mathrm{warp}(I(t), u, v))$ 9
Photometric Loss: $\min_{u,v} \rho(I(t-1) - \mathrm{warp}(I(t), u, v))$ 10
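A minimal sketch of the photometric loss on this slide, in plain Python. It assumes bilinear sampling with border clamping for the warp and a Charbonnier penalty for the robust function; neither choice is stated in the slides, and the function names `bilinear`, `warp`, `charbonnier` and `photometric_loss` are illustrative only.

```python
import math

def bilinear(img, x, y):
    """Sample img (a list of rows) at real-valued (x, y), clamping to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom

def warp(img_t, u, v):
    """Backward warp of frame t: out(x, y) = img_t(x + u, y + v)."""
    h, w = len(img_t), len(img_t[0])
    return [[bilinear(img_t, x + u[y][x], y + v[y][x]) for x in range(w)]
            for y in range(h)]

def charbonnier(r, eps=1e-3):
    """Robust penalty (an assumed choice for the penalty in the loss)."""
    return math.sqrt(r * r + eps * eps)

def photometric_loss(img_prev, img_t, u, v):
    """Mean robust difference between frame t-1 and the warped frame t."""
    warped = warp(img_t, u, v)
    h, w = len(img_prev), len(img_prev[0])
    total = sum(charbonnier(img_prev[y][x] - warped[y][x])
                for y in range(h) for x in range(w))
    return total / (h * w)

# Toy frames: one bright pixel moves one pixel to the right.
frame_prev = [[0.0] * 5 for _ in range(5)]
frame_t = [[0.0] * 5 for _ in range(5)]
frame_prev[1][1] = 1.0
frame_t[1][2] = 1.0
zeros = [[0.0] * 5 for _ in range(5)]
ones = [[1.0] * 5 for _ in range(5)]

loss_true = photometric_loss(frame_prev, frame_t, ones, zeros)   # u = 1, v = 0
loss_zero = photometric_loss(frame_prev, frame_t, zeros, zeros)  # no motion
print(loss_true < loss_zero)
```

The correct flow explains the second frame, so its loss is lower than the loss of the zero-flow hypothesis.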
No prior on structure 11
Can we learn from data? 12
Optical Flow Estimation [Figure: network mapping an image pair to a flow field] Dosovitskiy et al. 2015 13
FlowNet Dosovitskiy et al. 2015 14
Problem FlowNet is too big. 33 M parameters. Needs to learn both large and small motions. Does not perform well. 15
Approach Image statistics are scale invariant. Use an image pyramid. Train a small network for each pyramid level. Compute residual flow at each level. Network captures small displacements. Pyramid captures large displacements. Burt and Adelson. The Laplacian pyramid as a compact image code. IEEE COM, 1983 16
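The coarse-to-fine idea above can be sketched in plain Python. This is a simplified illustration, not the SPyNet implementation: the flow is a single global integer translation instead of a per-pixel field, the pyramid uses 2x2 averaging, and `small_motion_estimator` is a hypothetical brute-force stand-in for the small per-level network that can only see displacements of at most one pixel.

```python
def downsample(img):
    """Halve the resolution by averaging 2x2 blocks."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]

def shifted(img, dx, dy):
    """Integer warp: out(x, y) = img(x + dx, y + dy), clamped at the border."""
    h, w = len(img), len(img[0])
    return [[img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
             for x in range(w)] for y in range(h)]

def small_motion_estimator(img1, img2_warped):
    """Hypothetical per-level 'network': searches displacements in {-1, 0, 1} only."""
    best, best_err = (0, 0), None
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            cand = shifted(img2_warped, dx, dy)
            err = sum((a - b) ** 2
                      for row_a, row_b in zip(img1, cand)
                      for a, b in zip(row_a, row_b))
            if best_err is None or err < best_err:
                best, best_err = (dx, dy), err
    return best

def coarse_to_fine(img1, img2, levels):
    """Estimate a large translation as a sum of small residuals over a pyramid."""
    pyr1, pyr2 = [img1], [img2]
    for _ in range(levels - 1):
        pyr1.append(downsample(pyr1[-1]))
        pyr2.append(downsample(pyr2[-1]))
    u, v = 0, 0
    for level in reversed(range(levels)):      # coarsest level first
        u, v = 2 * u, 2 * v                    # upsample the flow to this level
        warped = shifted(pyr2[level], u, v)    # warp with the current estimate
        du, dv = small_motion_estimator(pyr1[level], warped)
        u, v = u + du, v + dv                  # add the residual flow
    return u, v

def make_image(size, x_start):
    """Black image with a 4x4 white square in rows 4..7, left edge at x_start."""
    return [[1.0 if x_start <= x < x_start + 4 and 4 <= y < 8 else 0.0
             for x in range(size)] for y in range(size)]

img1 = make_image(16, 4)
img2 = make_image(16, 8)                       # square moved 4 pixels right
single_level = small_motion_estimator(img1, img2)
pyramid = coarse_to_fine(img1, img2, levels=3)
print(single_level, pyramid)
```

On this toy pair the single-level estimator cannot reach the 4-pixel motion, while three pyramid levels recover it from residuals of at most one pixel each.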
SPyNet Spatial Pyramid Network for Optical Flow Estimation Ranjan et al. Optical Flow estimation using a Spatial Pyramid Network. CVPR 2017. 17
[Figure: convolutional network with input $I_1, I_2$; layers 32x7x7, 64x7x7, 32x7x7, 16x7x7, 2x7x7; output $v_k$] 18
$G_k$ 19
[Figure: SPyNet pyramid, levels 0 and 1: networks $G_0$, $G_1$; residual flows $v_0$, $v_1$; upsampling $u$; warping $w$; images $I_1$, $I_2$ at each level] 20
[Figure: SPyNet pyramid, levels 0 to 2: networks $G_0$, $G_1$, $G_2$; residual flows $v_0$, $v_1$, $v_2$; upsampling $u$; warping $w$; images $I_1$, $I_2$ at each level] 21
[Figure: Spatial and Temporal panels for SPyNet and FlowNet] 22
[Figure columns: Frames, Ground Truth, FlowNetS, FlowNetC, SPyNet] 23
[Plot: Average EPE on Sintel (Clean + Final) vs. Number of Model Parameters (in Millions); methods shown: Voxel2Voxel*, FlowNetC, FlowNetS, SPyNet. *error metric not consistent with the benchmarks] 24
[Plot: Average EPE on Sintel (Clean + Final) vs. Number of Model Parameters (in Millions); methods shown: Voxel2Voxel* [2016], SPyNet [2017], FlowNetS [2015], FlowNetC [2015], PWC-Net [2018], FlowNet2 [2017]. *error metric not consistent with the benchmarks] 25
Sintel Clean
               Distance from Motion Boundaries    Average Displacement
Method         d0-10    d10-60   d60-140          s0-10    s10-40   s40+
SPyNet+ft      5.501    3.122    1.719            0.832    3.343    43.442
FlowNetS+ft    5.992    3.561    2.193            1.424    3.815    40.098
FlowNetC+ft    5.575    3.182    1.993            1.622    3.974    33.369
Sintel Final
               Distance from Motion Boundaries    Average Displacement
Method         d0-10    d10-60   d60-140          s0-10    s10-40   s40+
SPyNet+ft      6.694    4.368    3.290            1.395    5.534    49.707
FlowNetS+ft    7.252    4.610    2.993            1.873    5.826    43.236
FlowNetC+ft    7.190    4.619    3.298            2.305    6.169    40.779
26
Problem SPyNet [1] [1] Ranjan et al. Optical Flow estimation using a Spatial Pyramid Network. CVPR 2017. 28
Why humans?
• Useful for recognition problems. Scenes contain human actions.
• Two-stream architectures use fast classical optical flow methods.
• Deep Networks have massive GPU memory requirements.
Left Image: Delaitre et al. Recognizing human actions in still images. BMVC 2010. Right Image: Simonyan et al. Two-stream convolutional networks for action recognition in videos. NIPS 2014. 29
Problem: Flying Chairs [1], MPI Sintel [2], KITTI [3]. No dataset for human optical flow for training neural networks. [1] Dosovitskiy et al. Flownet: Learning optical flow with convolutional networks. ICCV 2015. [2] Butler et al. A naturalistic open source movie for optical flow evaluation. ECCV 2012. [3] Geiger et al. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research 32.11 (2013): 1231-1237. 30
Idea Create a new dataset for human optical flow. Use it to train an existing fast and compact optical flow method. 31
Human Flow Dataset: Human Motion Capture data [1] + Realistic Human Body Model [2] + Environment [3] + Cloth texture, Lighting, Noise, Motion Blur, Camera Blur. Blender: Simulate and Extract Motion Vectors. [1] Ionescu et al. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE PAMI 2014. [2] Loper et al. MoSh: Motion and Shape Capture from Sparse Markers. SIGGRAPH Asia 2014. [3] Yu et al. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015). 32
Human Flow Dataset 33
SPyNet [Figure: SPyNet pyramid, levels 0 to 2: networks $G_0$, $G_1$, $G_2$; residual flows $v_0$, $v_1$, $v_2$; upsampling $u$; warping $w$; images $I_1$, $I_2$ at each level] Ranjan et al. Optical Flow estimation using a Spatial Pyramid Network. CVPR 2017. 35
Evaluation of Optical Flow Networks [Plot: Average EPE on the Human Flow Dataset vs. Inference Time (s); methods shown: SPyNet, PWC-Net, SPyNet+HF, PWC-Net+HF] 36
Evaluation of Optical Flow Networks [Plot: Average EPE on the Human Flow Dataset vs. Inference Time (s); methods shown: FlowNetS, PCA Flow, SPyNet, Epic Flow, LDOF, PWC-Net, SPyNet+HF, FlowNet2, PWC-Net+HF, Flow Fields] 37
Visuals: Video [panels: Ground Truth, Human Flow, SPyNet] 38
Visuals: Video [panels: Ground Truth, Human Flow, SPyNet] 39
Visuals: Video [panels: Ground Truth, Human Flow, SPyNet] 40
Visuals: Video [panels: Human Flow, SPyNet] 41
Visuals: Video [panels: Human Flow, SPyNet] 42
Human Flow may not work on other parts of the scene. 43
Introduction to Scene Geometry 44
Motion of a Static Scene. For static scenes: Depth + Camera Motion = Optical Flow 45
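The relation "Depth + Camera Motion = Optical Flow" can be illustrated for a single pixel with a pinhole camera. The sketch below assumes intrinsics `fx, fy, cx, cy`, a rotation matrix `R` and a translation `t` that map a 3D point from the first camera frame to the second as P' = R P + t; these names and this sign convention are illustrative assumptions, not taken from the slides.

```python
def rigid_flow(x, y, depth, fx, fy, cx, cy, R, t):
    """Flow (u, v) of pixel (x, y) induced by camera motion in a static scene."""
    # Back-project the pixel to a 3D point in the first camera frame.
    P = ((x - cx) / fx * depth, (y - cy) / fy * depth, depth)
    # Express the point in the second camera frame: P' = R P + t.
    Xp, Yp, Zp = (sum(R[i][j] * P[j] for j in range(3)) + t[i] for i in range(3))
    # Project into the second image and subtract the original pixel position.
    x2 = fx * Xp / Zp + cx
    y2 = fy * Yp / Zp + cy
    return x2 - x, y2 - y

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
translation = (0.5, 0.0, 0.0)   # sideways motion, no rotation

# Same pixel, two different depths.
flow_near = rigid_flow(60.0, 40.0, 5.0, 100.0, 100.0, 50.0, 50.0, identity, translation)
flow_far = rigid_flow(60.0, 40.0, 10.0, 100.0, 100.0, 50.0, 50.0, identity, translation)
print(flow_near, flow_far)
```

Under a pure sideways translation the nearer point moves farther in the image than the distant one, which is why depth and camera motion together determine the flow of a static scene.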
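Multi-view Geometry
Pinhole Camera Matrix $K$; corresponding pixels $x_1$ in $I_1$ and $x_2$ in $I_2$, related through the camera motion (translation $t$)
$\| I_1(x_1) - I_2(x_2) \| = 0$
Photometric Loss: $\min \rho(I_1 - \mathrm{warp}(I_2, \cdot))$ over the unknown geometry and camera motion, including $t$ 46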
Static Scene and Moving Objects 47
How to decompose a scene? 48
Competitive Collaboration 49
[Figure: Competitive Collaboration diagram] 50
[Figure: Competitive Collaboration diagram with two Competitors] 51
Competition [Figure: two Competitors and a Moderator] 52
Collaboration [Figure: two Competitors and a Moderator] 53
Mixed Domain Learning [Figure: networks $A$ (Alice) and $B$ (Bob) with a Moderator] 54
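Competition Loss: $E_{com} = m \cdot H(A(x), 5) + (1 - m) \cdot H(B(x), 5)$, where $x$ is the input image (here a digit with label 5) and $m$ is the Moderator output 55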
Collaboration Loss: $E_{col} = E_{com} + \begin{cases} -\log(m + \epsilon) & \text{if } E_A < E_B \\ -\log(1 - m + \epsilon) & \text{if } E_A \ge E_B \end{cases}$, with $E_A = H(A(x), 5)$ and $m$ the Moderator output 56
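A small numeric sketch of the two losses for one training sample. It assumes that H is the cross-entropy between a predicted class distribution and the true label, that the Moderator output `m` lies in (0, 1), and that `eps` is a small constant for numerical stability; the function names and the toy probabilities are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of a predicted distribution against an integer label."""
    return -math.log(probs[label])

def competition_loss(m, probs_a, probs_b, label):
    """Moderator weight m goes to Alice, (1 - m) goes to Bob."""
    return (m * cross_entropy(probs_a, label)
            + (1 - m) * cross_entropy(probs_b, label))

def collaboration_loss(m, probs_a, probs_b, label, eps=1e-8):
    """Competition loss plus a term that pushes m toward the better competitor."""
    e_a = cross_entropy(probs_a, label)
    e_b = cross_entropy(probs_b, label)
    if e_a < e_b:
        moderator_term = -math.log(m + eps)        # Alice is better: want m near 1
    else:
        moderator_term = -math.log(1 - m + eps)    # Bob is better: want m near 0
    return competition_loss(m, probs_a, probs_b, label) + moderator_term

# Toy two-class sample with label 1; Alice is more confident in the right class.
alice = [0.1, 0.9]
bob = [0.6, 0.4]
loss_pick_alice = collaboration_loss(0.9, alice, bob, label=1)
loss_pick_bob = collaboration_loss(0.1, alice, bob, label=1)
print(loss_pick_alice < loss_pick_bob)
```

A Moderator that assigns the sample to the competitor with the lower error obtains the lower collaboration loss.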
[Figure: $A$, $B$ and the Moderator] 57
Accuracy
Model            Training   MNIST Error   SVHN Error   MNIST+SVHN Error
Alice            Basic      1.34          11.88        8.96
Alice            CC         1.41          11.55        8.74
Bob              CC         1.24          11.75        8.84
Alice+Bob+Mod    CC         1.24          11.55        8.70
Alice 3x         Basic      1.33          10.86        8.22
58
Moderator Behavior
          Alice    Bob
MNIST     0 %      100 %
SVHN      100 %    0 %
59
Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation 60
[Figure: Monocular Depth Prediction ($D$) and Camera Motion Estimation ($C$)] Zhou et al. CVPR 2017 61
[Figure: Monocular Depth Prediction ($D$) and Camera Motion Estimation ($C$), Zhou et al. CVPR 2017; Optical Flow Estimation ($F$), Meister et al. AAAI '18, Janai et al. ECCV '18] 62
[Figure: Monocular Depth Prediction ($D$), Optical Flow Estimation ($F$), Camera Motion Estimation ($C$), Motion Segmentation] 63
Photometric Loss (static scene; Monocular Depth Prediction $D$ and Camera Motion Estimation $C$): $E_R = \rho(I, \mathrm{warp}(I_+, \text{depth}, \text{camera motion})) \cdot m$
Photometric Loss (Optical Flow Estimation $F$): $E_F = \rho(I, \mathrm{warp}(I_+, u_+)) \cdot (1 - m)$
Loss for Motion Segmentation: $E_C = H(\mathbb{1}(\| u_R - u_F \| < \lambda_c), m)$, where $m$ is the motion segmentation mask 64
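A per-pixel sketch of how the three losses on this slide interact. It assumes that the mask value `m` in (0, 1) is close to 1 for static pixels, that H is the binary cross-entropy, and that the photometric errors and the two flow vectors (one from depth and camera motion, one from the flow network) are already computed; all names and the threshold value `lambda_c=0.5` are hypothetical.

```python
import math

def binary_cross_entropy(target, m, eps=1e-8):
    """Binary cross-entropy between a 0/1 target and a mask probability m."""
    return -(target * math.log(m + eps) + (1 - target) * math.log(1 - m + eps))

def pixel_losses(rho_static, rho_flow, flow_static, flow_net, m, lambda_c=0.5):
    """Return (E_R, E_F, E_C) for one pixel."""
    e_r = rho_static * m              # static-scene photometric error, weighted by m
    e_f = rho_flow * (1 - m)          # flow photometric error, weighted by 1 - m
    # Consensus target: 1 where the two flow estimates agree, 0 otherwise.
    diff = math.hypot(flow_static[0] - flow_net[0], flow_static[1] - flow_net[1])
    consensus = 1.0 if diff < lambda_c else 0.0
    e_c = binary_cross_entropy(consensus, m)
    return e_r, e_f, e_c

# Static pixel: both flow estimates agree, so a mask near 1 is rewarded.
static_high = pixel_losses(0.1, 0.1, (1.0, 0.0), (1.1, 0.0), m=0.9)
static_low = pixel_losses(0.1, 0.1, (1.0, 0.0), (1.1, 0.0), m=0.1)

# Moving pixel: the estimates disagree, so a mask near 0 is rewarded.
moving_high = pixel_losses(0.8, 0.1, (1.0, 0.0), (4.0, 0.0), m=0.9)
moving_low = pixel_losses(0.8, 0.1, (1.0, 0.0), (4.0, 0.0), m=0.1)
print(static_high[2] < static_low[2], moving_low[2] < moving_high[2])
```

Where the two flow estimates agree the consensus term favors labeling the pixel static; where they disagree it favors labeling it moving, which hands the pixel to the optical flow network.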