
  1. Towards efficient end-to-end architectures for action recognition and detection in videos. Limin Wang, Computer Vision Laboratory, ETH Zurich. 17/4/12

  2. Outline § 1. Overview of action understanding § 2. Temporal segment networks § 3. Structured segment networks § 4. UntrimmedNets § 5. Conclusion

  3. Action recognition in videos § 1. Action recognition “in the lab”: KTH, Weizmann, etc. § 2. Action recognition “in TV and movies”: UCF Sports, Hollywood, etc. § 3. Action recognition “in web videos”: HMDB, UCF101, THUMOS, ActivityNet, etc. Haroon Idrees et al. The THUMOS Challenge on Action Recognition for Videos “in the Wild”, in Computer Vision and Image Understanding (CVIU), 2017.

  4. How to define action categories Heilbron F. C. et al. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.

  5. How to define action categories Heilbron F. C. et al. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.

  6. How to label videos Heilbron F. C. et al. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.

  7. Dataset overview

  8. Action understanding § Action Recognition: classify a trimmed clip or untrimmed video into pre-defined classes. § Action Temporal Localization: detect the starting and ending times of action instances in untrimmed videos. § Action Spatial Detection: detect the bounding boxes of actors in trimmed videos. § Action Spatio-Temporal Detection: combine temporal and spatial localization in untrimmed videos.

  9. Action recognition – STIP+HOG/HOF (03, 08) [1] Ivan Laptev and Tony Lindeberg, Space-time Interest Points, in ICCV, 2003. [2] Ivan Laptev, Marcin Marszałek, Cordelia Schmid, Benjamin Rozenfeld, Learning realistic human actions from movies, in CVPR, 2008.

  10. Action recognition – Dense Trajectories (11, 13) [1] Heng Wang, Alexander Klaser, Cordelia Schmid, and Cheng-Lin Liu, Action Recognition by Dense Trajectories, in CVPR, 2011. [2] Heng Wang and Cordelia Schmid, Action Recognition with Improved Trajectories, in ICCV, 2013.

  11. Action recognition – two-stream CNN (2014) Karen Simonyan and Andrew Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos, in NIPS, 2014.

  12. Action recognition – 3D CNN (2015) Du Tran et al. Learning Spatiotemporal Features with 3D Convolutional Networks, in ICCV, 2015.

  13. Action recognition – TDD (2015) Limin Wang, Yu Qiao, Xiaoou Tang, Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors, in CVPR, 2015.

  14. Outline § 1. Overview of action understanding § 2. Temporal segment networks § 3. Structured segment networks § 4. UntrimmedNets § 5. Conclusion

  15. Motivation of TSN § Towards an end-to-end, video-level architecture. § Modeling issue: mainstream CNN frameworks focus on appearance and short-term motion only. § Learning issue: current action datasets are relatively small, so training deep CNNs is not easy.

  16. Overview of TSN TSN is a video-level framework based on two simple strategies: segment sampling and consensus aggregation.

  17. Segment sampling § Our segment sampling is based on the observation that videos contain high data redundancy. § It has two properties: § Sparse: processing efficiency. § Duration-invariant: a video-level framework that models the entire video content.
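The sampling scheme above can be sketched as follows. This is a minimal illustration, not the paper's released code; the function name and arguments are ours. The video is split into a fixed number of equal-duration segments, and one snippet index is drawn per segment, so a 10-second and a 10-minute video are both summarized by the same number of snippets.

```python
import random

def sample_snippets(num_frames, num_segments=3, deterministic=False):
    """Sparse, duration-invariant snippet sampling (TSN style).

    Splits the video into `num_segments` equal-duration segments and
    draws one frame index from each segment.
    """
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start, int((k + 1) * seg_len) - 1)
        if deterministic:           # segment center, as used at test time
            indices.append((start + end) // 2)
        else:                       # random position, as used during training
            indices.append(random.randint(start, end))
    return indices

# e.g. sample_snippets(300, 3, deterministic=True) -> [49, 149, 249]
```

Random positions during training act as a form of temporal data augmentation, while the deterministic centers make test-time predictions reproducible.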

  18. Aggregation function § The aggregation function summarizes the predictions of the different snippets to yield a video-level prediction. § Simple aggregation functions: § Mean pooling, max pooling, weighted average. § Advanced aggregation functions: § Top-k pooling, attention weighting.
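A sketch of the pooling-based aggregation functions listed above (our own illustrative code, assuming snippet scores are stacked into a `(num_snippets, num_classes)` array):

```python
import numpy as np

def aggregate(snippet_scores, method="mean", k=2):
    """Combine per-snippet class scores into one video-level prediction.

    snippet_scores: array of shape (num_snippets, num_classes).
    """
    if method == "mean":                  # mean pooling
        return snippet_scores.mean(axis=0)
    if method == "max":                   # max pooling
        return snippet_scores.max(axis=0)
    if method == "topk":                  # mean of the k largest scores per class
        top = np.sort(snippet_scores, axis=0)[-k:]
        return top.mean(axis=0)
    raise ValueError(f"unknown method: {method}")
```

Mean pooling uses every snippet equally; max pooling keys on the single most confident snippet; top-k pooling sits in between, which helps when only part of the video shows the action.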

  19. Formulation of TSN
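The formula image on this slide did not survive the transcript. As given in the TSN paper (Wang et al., ECCV 2016), the formulation is:

```latex
\mathrm{TSN}(T_1, T_2, \ldots, T_K) =
  \mathcal{H}\bigl(\mathcal{G}\bigl(\mathcal{F}(T_1;\mathbf{W}),
  \mathcal{F}(T_2;\mathbf{W}), \ldots, \mathcal{F}(T_K;\mathbf{W})\bigr)\bigr)
```

where $T_k$ is the snippet sampled from the $k$-th segment, $\mathcal{F}(\cdot;\mathbf{W})$ is the ConvNet with shared weights $\mathbf{W}$, $\mathcal{G}$ is the consensus (aggregation) function, and $\mathcal{H}$ (softmax) produces the video-level class probabilities.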

  20. Input modalities § The original two-stream CNNs take an RGB image and stacked optical flow fields. § We study two other input modalities. § Stacked RGB differences: § An approximation of motion information. § Efficient to compute. § Stacked warped optical flow fields: § Remove background (camera) motion.
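The RGB-difference modality amounts to a per-pixel subtraction of consecutive frames, which is why it is so cheap compared with optical flow. A minimal sketch (our own helper, not the authors' code):

```python
import numpy as np

def stacked_rgb_diff(frames):
    """Stack consecutive RGB differences as a cheap motion approximation.

    frames: array of shape (T, H, W, C), e.g. uint8 video frames.
    Returns an array of shape (T-1, H, W, C) where entry t is
    frames[t+1] - frames[t].
    """
    frames = frames.astype(np.int16)   # widen dtype to avoid uint8 wrap-around
    return frames[1:] - frames[:-1]
```

Note the dtype widening: subtracting uint8 arrays directly would wrap around at zero instead of producing negative differences.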

  21. Good practices § Cross-modality pre-training: pre-train both spatial and temporal nets with ImageNet models. § Partial batch normalization: only re-estimate the parameters of the first BN layer. § Smaller learning rate: since the networks are pre-trained, use a smaller learning rate for fine-tuning. § Data augmentation: use more augmentation, including corner cropping, scale jittering, and ratio jittering. § High dropout ratio: 0.7 for the temporal net and 0.8 for the spatial net.

  22. Experiment result -- training method

  23. Experiment result -- input modality

  24. Experiment result -- TSN framework

  25. Experiment result -- comparison with the state of the art

  26. ActivityNet Challenge -- 2016

  27. Model visualization

  28. Outline § 1. Overview of action understanding § 2. Temporal segment networks § 3. Structured segment networks § 4. UntrimmedNets § 5. Conclusion

  29. Motivation of Structured Segment Networks 1. Action detection in untrimmed videos is an important problem. 2. Snippet-level classifiers struggle to accurately localize the temporal extent of action instances. Context and structure modeling!

  30. Structured Segment Network

  31. Three-Stage Augmentation § Given a proposal, we extend its temporal duration by augmentation (context modeling). § Specifically, for a proposal denoted [s, e] with duration d = e - s, the temporal extension [s', e'] is: s' = s - d/2, e' = e + d/2
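The extension rule above is a one-liner; a sketch with an illustrative function name of our choosing:

```python
def augment_proposal(s, e):
    """Extend a temporal proposal [s, e] by half its duration on each side.

    With d = e - s, returns (s - d/2, e + d/2), so the augmented span
    covers the course of the action plus d/2 of starting context before
    it and d/2 of ending context after it.
    """
    d = e - s
    return s - d / 2, e + d / 2

# e.g. augment_proposal(10, 20) -> (5.0, 25.0)
```

The extra context on both sides is what lets the later completeness classifier see whether the proposal actually captures a full action instance rather than a fragment.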

  32. Temporal Pyramid Pooling § Given a proposal, we use temporal pyramid pooling to summarize its representation (structure modeling). § Specifically, for a proposal denoted [s, e], the i-th part at the k-th level is pooled as follows: § The overall representation is then as follows:
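The pooling equations on this slide were images and are lost in the transcript, but the mechanism can be sketched as follows (our own illustrative code, assuming average pooling within each part, which is the common choice):

```python
import numpy as np

def temporal_pyramid_pool(features, levels=(1, 2)):
    """Temporal pyramid pooling over per-snippet features.

    features: array of shape (T, D). For each level k, the span is split
    into k equal parts and each part is average-pooled; the pooled
    vectors are concatenated into a fixed-size representation of
    dimension D * sum(levels), independent of T.
    """
    T = features.shape[0]
    pooled = []
    for k in levels:
        bounds = np.linspace(0, T, k + 1).astype(int)
        for i in range(k):
            lo, hi = bounds[i], max(bounds[i] + 1, bounds[i + 1])
            pooled.append(features[lo:hi].mean(axis=0))
    return np.concatenate(pooled)
```

Because the output size depends only on the pyramid levels, proposals of any duration map to a representation a fixed-size classifier can consume, while the multi-level split preserves coarse temporal structure.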

  33. Two-Classifier Design § To model the action classes and the completeness of instances, we design a loss with two classifiers. § The action class classifier measures the likelihood of the action class: P(c|p). § The completeness classifier measures the likelihood of instance completeness: P(b|c,p). § A joint loss optimizes the two classifiers:
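The loss equation on this slide is lost in the transcript. A sketch consistent with the slide's notation (and with the SSN paper's formulation): the two classifiers factorize the joint likelihood, and the loss is the sum of the two negative log-likelihoods, with the completeness term applied only to non-background proposals:

```latex
P(c, b \mid p) = P(c \mid p)\, P(b \mid c, p), \qquad
\mathcal{L}(c_i, b_i; p_i) =
  -\log P(c_i \mid p_i)
  - \lambda \,\mathbb{1}(c_i \ge 1)\, \log P(b_i \mid c_i, p_i)
```

Here $c_i$ is the action class of proposal $p_i$ ($c_i = 0$ for background), $b_i$ indicates completeness, and $\lambda$ balances the two terms; treat the exact weighting as an assumption of this sketch.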

  34. Temporal Region Proposal Bottom-up proposal generation based on an actionness map.
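The bottom-up idea can be illustrated with a simplified threshold-and-group sketch (our own code; the paper's actionness grouping additionally sweeps multiple thresholds and merges regions):

```python
def group_proposals(actionness, threshold=0.5):
    """Group consecutive snippets with actionness >= threshold into
    candidate temporal proposals, returned as (start, end) index pairs
    with end exclusive.
    """
    proposals, start = [], None
    for t, a in enumerate(actionness):
        if a >= threshold and start is None:
            start = t                       # a high-actionness run begins
        elif a < threshold and start is not None:
            proposals.append((start, t))    # the run ends here
            start = None
    if start is not None:                   # run extends to the video end
        proposals.append((start, len(actionness)))
    return proposals
```

This is what "bottom-up" means here: proposals emerge from local per-snippet actionness scores rather than from a fixed grid of sliding windows.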

  35. Experiment result (1)

  36. Experiment result (2)

  37. Experiment result (3)

  38. Detection example (1)

  39. Detection example (2)

  40. Outline § 1. Overview of action understanding § 2. Temporal segment networks § 3. Structured segment networks § 4. UntrimmedNets § 5. Conclusion

  41. Motivation of UntrimmedNet 1. Labeling untrimmed videos is expensive and time-consuming. 2. Temporal annotation is subjective and not consistent across annotators and datasets.

  42. Overview of UntrimmedNet

  43. Clip Proposal § Uniform sampling: § Uniform sampling of clips of fixed duration. § Shot-based sampling: § First, perform shot detection based on HOG differences. § Then, within each shot, perform uniform sampling.
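The shot-detection step can be sketched as thresholding the distance between consecutive per-frame descriptors. This is our own simplified illustration: the slides use HOG differences, but the sketch works with any per-frame descriptor passed in:

```python
import numpy as np

def detect_shots(frame_descriptors, threshold):
    """Split a video into shots by thresholding the distance between
    consecutive frame descriptors (e.g. HOG vectors).

    Returns a list of (start, end) shot boundaries, end exclusive.
    """
    descs = np.asarray(frame_descriptors, dtype=float)
    dists = np.linalg.norm(descs[1:] - descs[:-1], axis=1)
    # A shot boundary is declared wherever the descriptor jump is large.
    cuts = [0] + [t + 1 for t, d in enumerate(dists) if d > threshold]
    cuts.append(len(descs))
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]
```

Sampling clips per shot rather than uniformly over the whole video keeps each clip proposal within a single visually coherent segment.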

  44. Clip-Level Representation and Classification § Following the TSN framework: § Sample a few snippets from each clip. § Aggregate snippet-level predictions with average pooling. § In practice, we use two-stream inputs: RGB and optical flow.

  45. Clip Selection § Selection aims to pick discriminative clips or rank them with attention weights. § Two selection methods: § Hard selection: top-k pooling over clip-level predictions. § Soft selection: learning attention weights for the different clips.
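The two selection schemes can be contrasted in a few lines (our own illustrative code; in the actual model the attention logits are produced by a learned branch, whereas here they are simply passed in):

```python
import numpy as np

def hard_select(clip_scores, k=2):
    """Hard selection: average the top-k clip predictions per class.

    clip_scores: array of shape (num_clips, num_classes).
    """
    top = np.sort(clip_scores, axis=0)[-k:]
    return top.mean(axis=0)

def soft_select(clip_scores, attention_logits):
    """Soft selection: weight every clip by a softmax attention weight."""
    w = np.exp(attention_logits - attention_logits.max())  # stable softmax
    w = w / w.sum()
    return (w[:, None] * clip_scores).sum(axis=0)
```

Hard selection discards all but the k most confident clips, while soft selection keeps every clip but lets the learned weights concentrate on the discriminative ones, which is what makes weakly supervised training with only video-level labels possible.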
