Learning Graph Representations for Video Understanding



  1. Learning Graph Representations for Video Understanding Xiaolong Wang Carnegie Mellon University

  2. Computer Vision. [Figure label: Dog.] He et al. Mask R-CNN. ICCV 2017. Güler et al. DensePose: Dense Human Pose Estimation In The Wild. CVPR 2018.

  3. Deep Learning. ImageNet: train a Convolutional Neural Network that maps an image to a class label. [Figure labels: Mushroom Dog Ant Jelly Fungus Nest.] Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. 2014.

  4. Convolutional Neural Networks • Convolution is local • Long-range pairwise relations are not modeled. Figure credit: Van Den Oord et al.

  5. Related Work: Relation Networks [Santoro et al, 2017]

  6. Related Work: Self-Attention [Vaswani et al, 2017]

  7. Related Work: Graph Convolution Networks [Kipf et al, 2017]

  8. This Tutorial • Draws connections between different graph/relation networks • In the context of video understanding • Covers both supervised and self-supervised methods

  9. Video Recognition. [Figure: video → 3D Conv → 3D Conv → 3D Conv → "Playing Soccer".]

  10. Reasoning for Action Recognition: long-range explicit reasoning. X. Wang, R. Girshick, A. Gupta, and K. He. Non-local Neural Networks. CVPR 2018.

  11. Non-local Means. [Figure: pixel $p$ and pixels $q_1$, $q_2$, $q_3$.] Buades et al. A non-local algorithm for image denoising. CVPR, 2005.

  12. Non-local Operator • Operation in feature space • Can be embedded into any ConvNet. [Figure: positions $x_i$ and $x_j$.]

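  13. Non-local Operator: $y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$, where $f(x_i, x_j)$ is the affinity between positions $x_i$ and $x_j$, and $g(x_j)$ is the feature at $x_j$.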

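  14. Non-local Operator: $y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$. [Diagram: the input $x$ has shape $T \times H \times W \times 512$. $\theta$ ($1 \times 1 \times 1$) and $\phi$ ($1 \times 1 \times 1$) each produce a $T \times H \times W \times 512$ tensor; these are reshaped to $THW \times 512$ and $512 \times THW$, and $(THW \times 512) \times (512 \times THW) = THW \times THW$.]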

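  15. Non-local Operator: $y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$. [Diagram, repeated: the input $x$ has shape $T \times H \times W \times 512$. $\theta$ ($1 \times 1 \times 1$) and $\phi$ ($1 \times 1 \times 1$) each produce a $T \times H \times W \times 512$ tensor; these are reshaped to $THW \times 512$ and $512 \times THW$, and $(THW \times 512) \times (512 \times THW) = THW \times THW$.]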

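  16. Non-local Operator: $y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$. [Diagram: $x$ ($T \times H \times W \times 512$) passes through $\theta$ ($1 \times 1 \times 1$) and $\phi$ ($1 \times 1 \times 1$), giving $THW \times 512$ and $512 \times THW$ matrices; their $THW \times THW$ product is then normalized.]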

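  17. Non-local Operator: $y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$, with $f(x_i, x_j) = \exp(x_i^T x_j)$ and $C(x) = \sum_{\forall j} f(x_i, x_j)$, so the normalized affinity is $\frac{f(x_i, x_j)}{C(x)} = \frac{\exp(x_i^T x_j)}{\sum_{\forall j} \exp(x_i^T x_j)}$. [Diagram: $x$ ($T \times H \times W \times 512$) passes through $\theta$ ($1 \times 1 \times 1$) and $\phi$ ($1 \times 1 \times 1$), giving $THW \times 512$ and $512 \times THW$ matrices; their $THW \times THW$ product is normalized.]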

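  18. Non-local Operator: $y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$. [Diagram: $x$ ($T \times H \times W \times 512$) passes through $\theta$, $\phi$ and $g$ (each $1 \times 1 \times 1$, each output $T \times H \times W \times 512$). The $\theta$ and $\phi$ outputs, reshaped to $THW \times 512$ and $512 \times THW$, give the normalized $THW \times THW$ affinity; multiplying it by the $g$ output ($THW \times 512$) yields a $THW \times 512$ result.]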

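  19. Non-local Operator: $y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$. [Diagram, repeated: $x$ ($T \times H \times W \times 512$) passes through $\theta$, $\phi$ and $g$ (each $1 \times 1 \times 1$, each output $T \times H \times W \times 512$). The $\theta$ and $\phi$ outputs, reshaped to $THW \times 512$ and $512 \times THW$, give the normalized $THW \times THW$ affinity; multiplying it by the $g$ output ($THW \times 512$) yields a $THW \times 512$ result.]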

  20. Non-local Operator as a Residual Block: $z_i = y_i W + x_i$. [Figure: Video → three 3D Conv stages with two Non-local blocks between them → Action Class.]
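
The shape walk-through on the preceding slides can be sketched in NumPy. This is a minimal sketch rather than the authors' implementation: it assumes the $1 \times 1 \times 1$ convolutions $\theta$, $\phi$, $g$ and the output weight $W$ act as plain channel-mixing matrices, applies $\theta$ and $\phi$ inside the exponential, and uses 8 channels instead of 512 to keep the example small. The names `non_local_block`, `w_theta`, `w_phi`, `w_g` and `w_out` are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, w_theta, w_phi, w_g, w_out):
    """x: (T, H, W, C) features; each weight is a (C, C) matrix (a 1x1x1 conv)."""
    T, H, W, C = x.shape
    flat = x.reshape(T * H * W, C)        # THW x C
    theta = flat @ w_theta                # THW x C
    phi = flat @ w_phi                    # THW x C
    g = flat @ w_g                        # THW x C
    affinity = softmax(theta @ phi.T)     # THW x THW, each row is f(x_i, .) / C(x)
    y = affinity @ g                      # THW x C, y_i = sum_j affinity_ij g(x_j)
    z = y @ w_out + flat                  # residual connection: z_i = y_i W + x_i
    return z.reshape(T, H, W, C), affinity

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 3, 8))     # toy input: T=2, H=3, W=3, C=8 (assumed sizes)
w_theta, w_phi, w_g = (0.1 * rng.standard_normal((8, 8)) for _ in range(3))
w_out = np.zeros((8, 8))                  # with W = 0 the block is an identity mapping
z, affinity = non_local_block(x, w_theta, w_phi, w_g, w_out)
print(affinity.shape, np.allclose(z, x))
```

Because of the residual form, setting `w_out` to zero leaves the input unchanged, which is one way such a block can be added to an existing network without altering its initial behaviour.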

  21. Examples

  22. Action Recognition in Daily Lives. We let people upload their own videos! Charades Dataset: 157 classes, 9.8k videos, 30s per video. Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ivan Laptev, Ali Farhadi, Abhinav Gupta. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. ECCV 2016.

  23. Action Recognition on Charades (method: mAP). 3D Conv: 31.8%; 3D Conv + Non-local: 33.5%.

  24. Opening A Book

  25. Opening A Book: The Non-local Block

  26. Opening A Book • Object states change over time • Human-object and object-object interactions. X. Wang and A. Gupta. Video as Space-Time Region Graphs. ECCV 2018.

  27. Opening A Book. [Figure: regions $A_1$, $A_2$, $A_3$, $A_4$ and $B_1$, $B_2$, $B_3$, $B_4$, marked "Highly Correlated".]

  28. Relations between Regions

  29. Relations between Regions: $f(x_i, x_j) = \phi(x_i)^T \phi'(x_j)$, normalized as $G_{ij} = \frac{\exp f(x_i, x_j)}{\sum_{\forall j} \exp f(x_i, x_j)}$.

  30. Graph Convolutional Network: $Z = GXW$, where $G$ is $N \times N$, $X$ is $N \times d$, $W$ is $d \times d$, and $Z$ is $N \times d$. Kipf. Semi-Supervised Classification with Graph Convolutional Networks. 2017.
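
The region affinity $G$ and the graph convolution $Z = GXW$ can be sketched together as follows. This is a minimal sketch under stated assumptions: $\phi$ and $\phi'$ are taken to be linear maps, the region features are random, and the names `region_affinity` and `graph_conv` are hypothetical.

```python
import numpy as np

def region_affinity(X, w_phi, w_phi_prime):
    """G_ij = exp f(x_i, x_j) / sum_j exp f(x_i, x_j), with f = phi(x_i)^T phi'(x_j)."""
    scores = (X @ w_phi) @ (X @ w_phi_prime).T     # N x N pairwise scores
    scores -= scores.max(axis=1, keepdims=True)    # stabilise the exponential
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)        # each row sums to 1

def graph_conv(G, X, W):
    """One graph convolution: (N x N) @ (N x d) @ (d x d) -> N x d."""
    return G @ X @ W

rng = np.random.default_rng(0)
N, d = 6, 4                                        # assumed: 6 regions, 4-dim features
X = rng.standard_normal((N, d))
G = region_affinity(X, rng.standard_normal((d, d)), rng.standard_normal((d, d)))
W = rng.standard_normal((d, d))
Z = graph_conv(G, X, W)
print(G.shape, Z.shape)
```

With $G$ equal to the identity matrix no information is exchanged between regions and the layer reduces to the per-region linear map $XW$.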

  31. Graph Convolutional Network Propagation

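  32. Connecting Non-local and GCN. The Non-local Operator: $z_i = y_i W + x_i$, with $y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) = \sum_{\forall j} \frac{f(x_i, x_j)}{\sum_{\forall j} f(x_i, x_j)}\, g(x_j) = \sum_{\forall j} G_{ij}\, g(x_j)$. Hence $z_i = \sum_{\forall j} G_{ij}\, g(x_j)\, W + x_i$, or in matrix form $Z = G\, g(X)\, W + X$: the Graph Convolution.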
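
The equivalence between the per-position non-local form and the matrix form can be checked numerically. This is a minimal sketch assuming $f(x_i, x_j) = \exp(x_i^T x_j)$ and a linear $g$; the weights `W_g` and `W` are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 4
X = 0.5 * rng.standard_normal((N, d))    # toy node features
W_g = rng.standard_normal((d, d))        # assumed linear g
W = rng.standard_normal((d, d))          # output weight

def f(xi, xj):
    return np.exp(xi @ xj)               # pairwise affinity

def g(x):
    return x @ W_g                       # works on a single row or on the whole matrix

# Per-position form: z_i = (sum_j f(x_i, x_j) / C(x) * g(x_j)) W + x_i
z_loop = np.zeros((N, d))
for i in range(N):
    c = sum(f(X[i], X[j]) for j in range(N))
    y_i = sum(f(X[i], X[j]) / c * g(X[j]) for j in range(N))
    z_loop[i] = y_i @ W + X[i]

# Matrix form: Z = G g(X) W + X, with G the row-normalised affinity matrix
F = np.exp(X @ X.T)
G = F / F.sum(axis=1, keepdims=True)
Z = G @ g(X) @ W + X

print(np.allclose(z_loop, Z))
```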

  33. Action Recognition on Charades (method: mean AP). 3D Conv: 31.8%; 3D Conv + Non-local: 33.5%; 3D Conv + Region Graph: 36.2% (+4.4% over 3D Conv).

  34. Action Recognition on Charades. [Bar chart: 3D Conv vs. 3D Conv + Graph, grouped by "Involves Objects?" (No / Yes); vertical axis from 30% to 45%.]

  35. Action Recognition on Charades. [Bar chart: 3D Conv vs. 3D Conv + Graph, grouped by pose variance; vertical axis from 30% to 45%.]

  36. Connection to Mean-Shift. The Non-local Operator: $y_i = \sum_{\forall j} \frac{f(x_i, x_j)}{\sum_{\forall j} f(x_i, x_j)}\, g(x_j)$. The Mean-Shift Clustering: $m(x) = \frac{\sum_{x_j \in N(x)} K(x, x_j)\, x_j}{\sum_{x_j \in N(x)} K(x, x_j)}$. Converging to the same mean? https://tw.rpi.edu/web/project/JeffersonProjectAtLakeGeorge/Clustering
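
The structural match between the two formulas can be illustrated numerically. This is a minimal sketch assuming a Gaussian kernel for $K$, a neighbourhood $N(x)$ that contains every data point, and an identity $g$; the two-cluster toy data and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, h=1.0):
    """Pairwise K(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 h^2))."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h * h))

def mean_shift_step(queries, data, h=1.0):
    """m(x) = sum_j K(x, x_j) x_j / sum_j K(x, x_j), for every query point x."""
    K = gaussian_kernel(queries, data, h)
    return (K @ data) / K.sum(axis=1, keepdims=True)

def non_local(x, f, g):
    """y_i = sum_j f(x_i, x_j) / sum_j f(x_i, x_j) * g(x_j)."""
    F = f(x, x)
    return (F / F.sum(axis=1, keepdims=True)) @ g(x)

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 0.3, (20, 2)),    # toy cluster near (0, 0)
                       rng.normal(5.0, 0.3, (20, 2))])   # toy cluster near (5, 5)

# One mean-shift step equals the non-local operator with f = K and g = identity.
one_step = mean_shift_step(data, data)
same = non_local(data, gaussian_kernel, lambda x: x)

# Repeating the step moves every point towards the mode of its own cluster.
modes = data.copy()
for _ in range(30):
    modes = mean_shift_step(modes, data)
print(np.allclose(one_step, same))
```

The sketch only shows that a single mean-shift update has the same form as the non-local operator; whether stacked non-local blocks with learned $f$ and $g$ converge in the same way is the open question the slide poses.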

  37. Recent Related Work • Actor-Centric Relation Network [Sun et al, 2018] • Video Action Transformer Network [Girdhar et al, 2019] • Long-Term Feature Banks for Detailed Video Understanding [Wu et al, 2019]

  38. Learning Affinity with Semantic Supervision

  39. Goal: Learn Correspondence without Human Supervision

  40. The visual world exhibits continuity

  41. Prior Work: Learning from Time (inputs → outputs) • Predict Color in Time [Vondrick et al, 2018] • Predict Pixel in Time [Mathieu et al, 2015] • Predict Arrow of Time [Wei et al, 2018]

  42. Using Tracking to Learn Features. Tracking → Similarity [Wang et al, 2015]. [Figure: two CNN branches joined by a similarity measure.]

  43. Using Tracking to Learn Features. Tracking → Similarity [Wang et al, 2015]. Limited by off-the-shelf trackers. [Figure: two CNN branches joined by a similarity measure.]

  44. Similarity requires tracking. Tracking requires similarity. Let’s jointly learn both!

  45. Learning to Track. $\mathcal{F}$: a deep tracker, applied from frame to frame. How to obtain supervision?

  46. Supervision: Cycle-Consistency in Time. Track backwards with $\mathcal{F}$, then track forwards, back to the future.

  47. Supervision: Cycle-Consistency in Time. Backpropagation through time along the cycle of $\mathcal{F}$ applications.

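  48. Differentiable Tracking. Patch feature at time $t$: $x_t^p$ ($100 \times c$), from Encoder $\phi$. Image feature at time $t-1$: $x_{t-1}^I$ ($900 \times c$), from Encoder $\phi$, then transposed. Their product: $(100 \times c) \times (c \times 900) = 100 \times 900$.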
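
The $100 \times 900$ product on this slide can be sketched with toy features. This is a minimal sketch under stated assumptions: the encoder outputs are replaced by random unit-norm vectors, the patch is a $10 \times 10$ crop of a $30 \times 30$ feature grid, and a per-row argmax is used to read the result, which is a stand-in for illustration and not the Transformer $\theta$ and spatial cropping used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 16                                             # assumed channel count
image_feat = rng.standard_normal((30, 30, c))      # stands in for Encoder phi on frame t-1
image_feat /= np.linalg.norm(image_feat, axis=-1, keepdims=True)

top, left = 5, 8                                   # where the toy patch is planted
patch_feat = image_feat[top:top + 10, left:left + 10]   # stands in for the patch feature

P = patch_feat.reshape(100, c)                     # 100 x c
I = image_feat.reshape(900, c)                     # 900 x c
affinity = P @ I.T                                 # (100 x c) @ (c x 900) = 100 x 900

# Each patch cell is most similar to the image cell it was cropped from.
best = affinity.argmax(axis=1)
rows, cols = np.divmod(best, 30)
print(affinity.shape, rows.min(), cols.min())
```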

  49. Differentiable Tracking. Inputs: patch feature at time $t$, $x_t^p$, and image feature at time $t-1$, $x_{t-1}^I$. Output: patch feature at time $t-1$, $x_{t-1}^p$. [Diagram components: Encoder $\phi$ on both inputs, transpose, Transformer $\theta$, Spatial Cropping.]

  50. Differentiable Tracking: $x_{t-1}^p = \mathcal{F}(x_{t-1}^I, x_t^p)$. [Diagram: $\mathcal{F}$ groups Encoder $\phi$, transpose, Transformer $\theta$ and Spatial Cropping.]

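  51. Recurrent Tracking. [Figure: starting from patch $x_t^p$, $\mathcal{F}$ is applied backward through frames $t-1$, $t-2$, $t-3$, then forward through frames $t-2$, $t-1$, $t$; $\mathcal{L}_{cycle}$ compares the patch at the start and end of the cycle.]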

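  52. Cycle-Consistency Loss Function: $\mathcal{L}_{cycle} = \| Loc(x_t^p) - Loc(\hat{x}_t^p) \|_2^2$, where $x_t^p$ is the starting patch and $\hat{x}_t^p$ is the patch returned after applying $\mathcal{F}$ backward and then forward along the cycle.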
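
The loss can be illustrated with a toy example in which a location is tracked backward and then forward through a short clip. This is a minimal sketch, not the talk's deep tracker: each "frame" is just the true 2D position of one object, $Loc$ is the tracked 2D location, and `perfect_tracker` and `drifting_tracker` are hypothetical stand-ins for $\mathcal{F}$.

```python
import numpy as np

def cycle_consistency_loss(start_loc, frames, track_step):
    """Track backward through `frames`, then forward, and compare with the start."""
    start = np.asarray(start_loc, dtype=float)
    loc = start.copy()
    last = len(frames) - 1
    for k in range(last, 0, -1):               # backward: t -> t-1 -> ...
        loc = track_step(loc, frames[k], frames[k - 1])
    for k in range(last):                      # forward: ... -> t-1 -> t
        loc = track_step(loc, frames[k], frames[k + 1])
    return float(np.sum((start - loc) ** 2))   # squared L2 distance between locations

# Toy clip: the object's true position in four consecutive frames (assumed data).
frames = [np.array(p, dtype=float) for p in [(0, 0), (1, 0), (2, 1), (3, 1)]]

def perfect_tracker(loc, src, dst):
    return loc + (dst - src)                   # follows the true motion exactly

def drifting_tracker(loc, src, dst):
    return loc + (dst - src) + np.array([0.5, 0.0])   # adds a constant error per step

loss_perfect = cycle_consistency_loss(frames[-1], frames, perfect_tracker)
loss_drift = cycle_consistency_loss(frames[-1], frames, drifting_tracker)
print(loss_perfect, loss_drift)
```

A tracker that follows the object returns to its starting location and incurs zero loss, while per-step errors accumulate around the cycle; this is the signal that replaces human labels.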

  53. Multiple Cycles Sub-cycles: a natural curriculum

  54. Skip Cycles Skip-cycles: skipping occlusions

  55. Visualization of Training

  56. Test Time: Nearest Neighbors in Feature Space. [Figure: frames $t-1$ and $t$.]

  57. Test Time: Nearest Neighbors in Feature Space. [Figure, continued: frames $t-1$ and $t$.]
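
Nearest-neighbour matching at test time can be sketched as label propagation between two frames. This is a minimal sketch under stated assumptions: features are random unit vectors rather than learned ones, frame $t$ is a shuffled and slightly perturbed copy of frame $t-1$, and each position copies the label of its single nearest neighbour; `propagate_labels` is a hypothetical name.

```python
import numpy as np

def propagate_labels(feat_prev, labels_prev, feat_cur):
    """Give each position in frame t the label of its nearest neighbour in frame t-1.

    feat_prev, feat_cur: (N, c) L2-normalised features; labels_prev: (N,) integers.
    """
    sim = feat_cur @ feat_prev.T          # (N_cur, N_prev) cosine similarities
    nearest = sim.argmax(axis=1)          # index of the best match in frame t-1
    return labels_prev[nearest]

rng = np.random.default_rng(0)
N, c = 50, 32                             # assumed number of positions and channels
feat_prev = rng.standard_normal((N, c))
feat_prev /= np.linalg.norm(feat_prev, axis=1, keepdims=True)
labels_prev = rng.integers(0, 3, size=N)  # e.g. instance ids in frame t-1

perm = rng.permutation(N)                 # positions move between the two frames
feat_cur = feat_prev[perm] + 0.01 * rng.standard_normal((N, c))
feat_cur /= np.linalg.norm(feat_cur, axis=1, keepdims=True)

labels_cur = propagate_labels(feat_prev, labels_prev, feat_cur)
print(np.array_equal(labels_cur, labels_prev[perm]))
```

The same matching step applies whether the labels are instance masks or pose keypoints, since only the features decide which position in frame $t-1$ each position in frame $t$ copies from.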

  58. Instance Mask Tracking on the DAVIS Dataset. Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.

  59. Pose Keypoint Tracking on the JHMDB Dataset

  60. Comparison: Our Correspondence vs. Optical Flow
