Videos Saurabh Gupta CS 543 / ECE 549 Computer Vision Spring 2020 - PowerPoint PPT Presentation


  1. Videos Saurabh Gupta CS 543 / ECE 549 Computer Vision Spring 2020

  2. Outline • Optical Flow • Tracking • Correspondence • Recognition in Videos

  3. Optical Flow • Data / Supervision • Architecture

  4. Datasets • Traditional datasets: Yosemite, Middlebury • KITTI: http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=flow • Sintel: http://sintel.is.tue.mpg.de/ • Synthetic Datasets • Flying Chairs et al: https://lmb.informatik.uni-freiburg.de/resources/datasets/FlyingChairs.en.html • Supervision: from Simulation • Metrics: End-point Error
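
The end-point error metric listed on this slide can be sketched as follows. This is a minimal illustration, assuming flow fields are stored as NumPy arrays of shape (H, W, 2); the function name `endpoint_error` is hypothetical and not taken from any benchmark toolkit.

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    """Average end-point error between two flow fields of shape (H, W, 2)."""
    # Per-pixel Euclidean distance between predicted and ground-truth vectors.
    per_pixel = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    return float(per_pixel.mean())

# Toy example: the prediction is off by (3, 4) pixels everywhere.
flow_gt = np.zeros((4, 5, 2))
flow_pred = np.zeros((4, 5, 2))
flow_pred[..., 0] = 3.0
flow_pred[..., 1] = 4.0
aepe = endpoint_error(flow_pred, flow_gt)
```

Benchmarks typically report this value averaged over all valid pixels of all test frames; masking of invalid pixels is omitted here.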

  5. “Classical Optical Flow Pipeline”

  6. PWC Net $\mathbf{cv}^l(\mathbf{x}_1, \mathbf{x}_2) = \frac{1}{N} \left( \mathbf{c}^l_1(\mathbf{x}_1) \right)^{\top} \mathbf{c}^l_w(\mathbf{x}_2)$ Models Matter, So Does Training: An Empirical Study of CNNs for Optical Flow Estimation. Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. arXiv 2018.
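
A cost volume of this kind, the normalized inner product between features of the first image and warped features of the second image over a limited displacement range, can be sketched in NumPy as below. This is an illustrative sketch, not the PWC-Net implementation: the function name `cost_volume`, the (H, W, N) feature layout, and the zero padding at image borders are assumptions.

```python
import numpy as np

def cost_volume(feat1, feat2_warped, max_disp=4):
    """Correlate feat1 with shifted copies of feat2_warped.

    feat1, feat2_warped: arrays of shape (H, W, N), N feature channels.
    Returns an array of shape (H, W, (2 * max_disp + 1) ** 2).
    """
    H, W, N = feat1.shape
    d = max_disp
    # Zero-pad so displacements that leave the image contribute zero cost.
    padded = np.pad(feat2_warped, ((d, d), (d, d), (0, 0)))
    costs = []
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            # shifted[y, x] holds feat2_warped[y + dy, x + dx].
            shifted = padded[d + dy : d + dy + H, d + dx : d + dx + W]
            # Inner product over channels, normalized by the channel count N.
            costs.append((feat1 * shifted).sum(axis=-1) / N)
    return np.stack(costs, axis=-1)

rng = np.random.default_rng(0)
features = rng.standard_normal((6, 7, 8))
cv = cost_volume(features, features, max_disp=2)  # shape (6, 7, 25)
```

Because the second image's features are already warped by the upsampled flow from the coarser pyramid level, only the residual motion has to fall inside the search range, which is why a small `max_disp` can suffice.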

  7. PWC Net [Figure: qualitative results. Panels: Frame 5 of “Ambush 3” (test, final), Frame 6, W/o context, W/o DenseNet, PWC-Net, PWC-Net-Sintel-ft.]

     Max. Disp.       Chairs  Sintel Clean  Sintel Final  KITTI 2012 AEPE  KITTI 2012 Fl-all  KITTI 2015 AEPE  KITTI 2015 Fl-all
     0                2.13    3.66          5.09          5.25             29.82%             13.85            43.52%
     2                2.09    3.30          4.50          5.26             25.99%             13.67            38.99%
     Full model (4)   2.00    3.33          4.59          5.14             28.67%             13.20            41.79%
     6                1.97    3.31          4.60          4.96             27.05%             12.97            40.94%

     (b) Cost volume. Removing the cost volume (0) results in moderate performance loss. PWC-Net can handle large motion using a small search range to compute the cost volume.

  8. Flying Chairs Dataset [Figure: data generation pipeline. Object prototype → initial object transform → object motion transform; background prototype → initial background transform → background motion transform; transform parameters come from random sampling. Outputs: first frame, second frame, optical flow.]
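
Supervision from simulation works because the motion transform is known, so the ground-truth flow can be computed exactly. The sketch below illustrates this for a single 2D affine motion; it is a simplified assumption-laden example (one layer, no occlusion, hypothetical function name `affine_flow`), not the dataset's actual generation code.

```python
import numpy as np

def affine_flow(height, width, A, t):
    """Ground-truth flow induced by the motion x' = A @ x + t.

    Returns an array of shape (height, width, 2) holding (dx, dy) per pixel.
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    coords = np.stack([xs, ys], axis=-1)   # pixel positions in (x, y) order
    moved = coords @ A.T + t               # positions in the second frame
    return moved - coords                  # flow = displacement per pixel

# Pure translation by (2, -1): every pixel gets the same flow vector.
translation_flow = affine_flow(3, 4, np.eye(2), np.array([2.0, -1.0]))

# Scaling by 2 about the origin: flow grows with distance from the origin.
zoom_flow = affine_flow(3, 4, 2.0 * np.eye(2), np.zeros(2))
```

In a layered scene such as the one in the figure, each object and the background would carry its own transform, and the visible layer at each pixel would determine which flow applies.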

  9. “FlyingChairs” (synth.), Dosovitskiy et al (2015) • “FlyingThings3D” (synth.), Mayer et al (2016) • “Monkaa” (synth.), Mayer et al (2016) • “Virtual KITTI” (synth.), Gaidon et al (2016)

     Training data     Test: Sintel  Test: KITTI2015  Test: FlyingChairs
     Sintel            6.42          18.13            5.49
     FlyingChairs      5.73          16.23            3.32
     FlyingThings3D    6.64          18.31            5.21
     Monkaa            8.47          16.17            7.08
     Driving           10.95         11.09            9.88

  10. Tracking • Problem Statements • Tracking by Detection • General Object Tracking

  11. Problem Statements • Single Object Tracking (eg: https://nanonets.com/blog/content/images/2019/07/messi_football_track.gif) • Multi-object Tracking (eg: https://motchallenge.net/vis/MOT20-02/gt/) • Multi-object Tracking and Segmentation (eg: https://www.youtube.com/watch?v=K38_pZw_P9s)

  12. Tracking by Detection [Figure: Video sequence → Detector (object detection) → Detections per frame → Tracker (data association) → Final trajectories.] Figure 2.2: Tracking-by-detection paradigm. Firstly, an independent detector is applied to all image frames to obtain likely pedestrian detections. Secondly, a tracker is run on the set of detections to perform data association, i.e., link the detections to obtain full trajectories. Source: Laura Leal-Taixé
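
The data association step can be illustrated with a very simple rule: link each existing track to the new detection that overlaps it most. The sketch below uses greedy matching on box IoU; this is one common baseline chosen here for illustration, not the method from the cited source, and the names `iou`, `associate`, and the threshold `min_iou` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, min_iou=0.3):
    """Greedily link each track's last box to at most one new detection.

    Returns a dict mapping track index -> detection index.
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, ti, di in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be linked
        if ti in used_tracks or di in used_dets:
            continue
        matches[ti] = di
        used_tracks.add(ti)
        used_dets.add(di)
    return matches

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]       # last box of each track
detections = [(21, 21, 31, 31), (1, 0, 11, 10)]   # boxes in the new frame
matches = associate(tracks, detections)
```

Unmatched detections would start new tracks and unmatched tracks would eventually be terminated; both are left out of this sketch.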

  13. Tracking by Detection Strike a Pose! Tracking People by Learning Their Appearance. D. Ramanan et al., PAMI 2007

  14. General Object Tracking [Figure: the previous frame is cropped around the target (“What to track”) and the current frame is cropped to a search region; each crop passes through conv layers, and fully-connected layers output the predicted location of the target within the search region.] Learning to Track at 100 FPS with Deep Regression Networks. D. Held et al., ECCV16.

  15. Correspondence in Time • Tracking (box-level, long-range): Human Annotations • Middle Ground (mid-level, long-range): Self-Supervised / Unsupervised Learning • Optical Flow (pixel-level, short-range): Synthetic Data Source: Xiaolong Wang

  16. Learning to Track • ℱ: a deep tracker • How to obtain supervision? Source: Xiaolong Wang

  17. Supervision: Cycle-Consistency in Time • Track backwards • Track forwards, back to the future [Figure: the tracker ℱ applied at each step of the cycle.] Source: Xiaolong Wang

  18. Supervision: Cycle-Consistency in Time • Backpropagation through time, along the cycle [Figure: the tracker ℱ applied at each step of the cycle.] Source: Xiaolong Wang
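
The cycle-consistency signal can be illustrated with a toy example: track a point backwards through several frames, then forwards again, and penalize the distance between where the cycle ends and where it started. The sketch below is an assumption-heavy illustration in which each tracker step is a plain function on 2D coordinates; in the actual setting ℱ is a learned network and this loss is backpropagated through every step. The name `cycle_consistency_loss` is hypothetical.

```python
import numpy as np

def cycle_consistency_loss(start_xy, backward_steps, forward_steps):
    """Squared distance between the start point and the end of the cycle."""
    start = np.asarray(start_xy, dtype=np.float64)
    xy = start
    for step in backward_steps:   # track backwards in time
        xy = step(xy)
    for step in forward_steps:    # track forwards, back to the start frame
        xy = step(xy)
    return float(np.sum((xy - start) ** 2))

# Toy trackers: each step shifts the point by a fixed offset.
backward = [lambda p: p + np.array([-2.0, 1.0]),
            lambda p: p + np.array([-1.0, 0.0])]
forward_consistent = [lambda p: p + np.array([1.0, 0.0]),
                      lambda p: p + np.array([2.0, -1.0])]
forward_drifting = [lambda p: p + np.array([1.0, 0.0]),
                    lambda p: p + np.array([2.0, 0.0])]

loss_consistent = cycle_consistency_loss([5.0, 5.0], backward, forward_consistent)
loss_drifting = cycle_consistency_loss([5.0, 5.0], backward, forward_drifting)
```

A tracker whose forward pass undoes its backward pass gets zero loss, while drift shows up as a positive loss, and no human annotation is needed to compute either.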

  19. Multiple Cycles Sub-cycles: a natural curriculum Source: Xiaolong Wang

  20. Multiple Cycles Shorter cycles: a natural curriculum Source: Xiaolong Wang

  21. Multiple Cycles Shorter cycles: a natural curriculum Source: Xiaolong Wang

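  22. Tracker ℱ • Densely match features in learned feature space φ [Figure: the patch p_t and the frame I_{t−1} are each encoded by φ; a correlation filter locates the match (x, y), and a crop at that location gives p_{t−1}.] Source: Xiaolong Wang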

  23. Visualization of Training Source: Xiaolong Wang

  24. Test Time: Nearest Neighbors in Feature Space φ (frames t − 1 and t) Source: Xiaolong Wang

  25. Test Time: Nearest Neighbors in Feature Space φ (frames t − 1 and t) Source: Xiaolong Wang

  26. Evaluation: Label Propagation Source: Xiaolong Wang

  27. Evaluation: Label Propagation Source: Xiaolong Wang

  28. Evaluation: Label Propagation Source: Xiaolong Wang

  29. Evaluation: Label Propagation Source: Xiaolong Wang
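
Label propagation by nearest neighbors in feature space can be sketched as follows: each location in the current frame copies (or averages) the labels of its most similar locations in the previous frame. This is a minimal sketch under stated assumptions: features are given as flat (num_locations, channels) arrays, similarity is cosine similarity, and the name `propagate_labels` and parameter `k` are hypothetical.

```python
import numpy as np

def propagate_labels(feat_prev, labels_prev, feat_cur, k=1):
    """Propagate labels from the previous frame to the current frame.

    feat_prev:   (M, C) features of the previous frame's locations.
    labels_prev: (M, K) one-hot or soft labels for those locations.
    feat_cur:    (N, C) features of the current frame's locations.
    Returns (N, K) labels averaged over the k nearest neighbors.
    """
    # Cosine similarity between every current and every previous location.
    prev = feat_prev / np.linalg.norm(feat_prev, axis=1, keepdims=True)
    cur = feat_cur / np.linalg.norm(feat_cur, axis=1, keepdims=True)
    similarity = cur @ prev.T                          # (N, M)
    nearest = np.argsort(-similarity, axis=1)[:, :k]   # (N, k)
    return labels_prev[nearest].mean(axis=1)           # (N, K)

feat_prev = np.array([[1.0, 0.0], [0.0, 1.0]])
labels_prev = np.array([[1.0, 0.0], [0.0, 1.0]])       # two classes, one-hot
feat_cur = np.array([[0.9, 0.1], [0.2, 0.8], [1.0, 0.05]])
labels_cur = propagate_labels(feat_prev, labels_prev, feat_cur)
predicted = labels_cur.argmax(axis=1)
```

The same routine applies to instance masks, pose keypoints, or texture coordinates, since only the content of `labels_prev` changes.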

  30. Instance Mask Tracking DAVIS Dataset Source: Xiaolong Wang DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.

  31. Pose Keypoint Tracking JHMDB Dataset Source: Xiaolong Wang

  32. Comparison Optical Flow Our Correspondence Source: Xiaolong Wang

  33. Texture Tracking DAVIS Dataset Source: Xiaolong Wang DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.

  34. Semantic Masks Tracking Video Instance Parsing Dataset Source: Xiaolong Wang Zhou et al. Adaptive Temporal Encoding Network for Video Instance-level Human Parsing. ACM MM 2018.

  35. Outline • Optical Flow • Tracking • Correspondence • Recognition in Videos • Tasks • Datasets • Models • Applications

  36. Recognition in Videos • Tasks / Datasets • Models

  37. Tasks and Datasets • Action Classification • Kinetics Dataset: https://arxiv.org/pdf/1705.06950.pdf • ActivityNet, Sports-8M, … • Action “Detection” • In space, in time. Eg: JHMDB, AV

  38. Tasks and Datasets • Time scale • Atomic Visual Actions (AVA) Dataset: https://research.google.com/ava/explore.html • Bias • Something Something Dataset: https://20bn.com/datasets/something-something We don’t quite know how to define good meaningful tasks for videos. More on this later.

  39. Models • Recurrent Neural Nets (See: https://colah.github.io/posts/2015-08-Understanding-LSTMs/) • Simple Extensions of 2D CNNs [Figure: (a) 2D convolution, (b) 2D convolution on multiple frames, (c) 3D convolution.] • 3D Convolution Networks • Two-Stream Networks • Inflated 3D Conv Nets • Slow Fast Networks • Non-local Networks

  40. Recurrent Neural Networks Source: https://colah.github.io/posts/2015-09-NN-Types-FP/

  41. 3D Convolutions Karpathy et al. Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014

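  42. 3D Convolutions [Figure: (a) 2D convolution: a k × k kernel over an H × W image gives an image as output. (b) 2D convolution on multiple frames: a k × k kernel spanning all L frames of an H × W × L input gives an image as output. (c) 3D convolution: a k × k × d kernel with d < L over an H × W × L input gives a volume as output.]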
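
The difference between cases (b) and (c) shows up most clearly in the output shapes. The naive NumPy sketch below assumes a single-channel video stored as an (L, H, W) array, no padding, and stride 1; the function names are hypothetical, and real networks would use an optimized library routine instead of Python loops.

```python
import numpy as np

def conv2d_multiframe(video, kernel):
    """2D convolution on multiple frames: the kernel spans all L frames.

    video: (L, H, W); kernel: (L, k, k). Output: (H-k+1, W-k+1) image.
    """
    L, H, W = video.shape
    k = kernel.shape[1]
    out = np.zeros((H - k + 1, W - k + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Time is summed out entirely, so temporal order is collapsed.
            out[y, x] = np.sum(video[:, y:y + k, x:x + k] * kernel)
    return out

def conv3d(video, kernel):
    """3D convolution: the kernel also slides along time.

    video: (L, H, W); kernel: (d, k, k) with d < L.
    Output: (L-d+1, H-k+1, W-k+1) volume.
    """
    L, H, W = video.shape
    d, k = kernel.shape[0], kernel.shape[1]
    out = np.zeros((L - d + 1, H - k + 1, W - k + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(video[t:t + d, y:y + k, x:x + k] * kernel)
    return out

video = np.ones((8, 6, 6))                              # L=8 frames of 6x6
image_out = conv2d_multiframe(video, np.ones((8, 3, 3)))
volume_out = conv3d(video, np.ones((3, 3, 3)))
```

Here `image_out` has shape (4, 4) while `volume_out` has shape (6, 4, 4): only the 3D convolution keeps a time axis that later layers can continue to process.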
