Towards efficient end-to-end architectures for action recognition and detection in videos
Limin Wang, Computer Vision Laboratory, ETH Zurich
Outline
§ 1. Overview of action understanding
§ 2. Temporal segment networks
§ 3. Structured segment networks
§ 4. UntrimmedNets
§ 5. Conclusion
Action recognition in videos
§ 1. Action recognition "in the lab": KTH, Weizmann, etc.
§ 2. Action recognition "in TV, movies": UCF Sports, Hollywood, etc.
§ 3. Action recognition "in web videos": HMDB, UCF101, THUMOS, ActivityNet, etc.
Haroon Idrees et al. The THUMOS Challenge on Action Recognition for Videos "in the Wild", in Computer Vision and Image Understanding (CVIU), 2017.
How to define action categories
Heilbron F. C. et al. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.
How to label videos
Heilbron F. C. et al. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.
Dataset overview
Action understanding
§ Action Recognition: classify a short clip or an untrimmed video into a pre-defined class.
§ Action Temporal Localization: detect the starting and ending times of action instances in untrimmed videos.
§ Action Spatial Detection: detect the bounding boxes of actors in trimmed videos.
§ Action Spatio-Temporal Detection: combine temporal and spatial localization in untrimmed videos.
Action recognition – STIP+HOG/HOF (2003, 2008)
[1] Ivan Laptev and Tony Lindeberg, Space-time Interest Points, in ICCV, 2003.
[2] Ivan Laptev, Marcin Marszałek, Cordelia Schmid, Benjamin Rozenfeld, Learning realistic human actions from movies, in CVPR, 2008.
Action recognition – Dense Trajectories (2011, 2013)
[1] Heng Wang, Alexander Klaser, Cordelia Schmid, and Cheng-Lin Liu, Action Recognition by Dense Trajectories, in CVPR, 2011.
[2] Heng Wang and Cordelia Schmid, Action Recognition with Improved Trajectories, in ICCV, 2013.
Action recognition – two-stream CNN (2014)
Karen Simonyan and Andrew Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos, in NIPS, 2014.
Action recognition – 3D CNN (2015)
Du Tran et al. Learning Spatiotemporal Features with 3D Convolutional Networks, in ICCV, 2015.
Action recognition – TDD (2015)
Limin Wang, Yu Qiao, Xiaoou Tang, Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors, in CVPR, 2015.
Outline
§ 1. Overview of action understanding
§ 2. Temporal segment networks
§ 3. Structured segment networks
§ 4. UntrimmedNets
§ 5. Conclusion
Motivation of TSN
§ Towards an end-to-end, video-level architecture.
§ Modeling issue: mainstream CNN frameworks focus on appearance and short-term motion.
§ Learning issue: current action datasets are relatively small, which makes it hard to train deep CNNs.
Overview of TSN
TSN is a video-level framework based on two simple strategies: segment sampling and consensus aggregation.
Segment sampling
§ Our segment sampling is based on the fact that there is high data redundancy in videos.
§ Our segment sampling has two properties (see the sketch below):
§ Sparse: processing efficiency.
§ Duration invariant: a video-level framework that models the entire video content.
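A minimal sketch of this sampling scheme, following the TSN convention of splitting the video into equal segments and drawing one snippet per segment; the function and variable names are illustrative:

```python
import numpy as np

def sample_snippet_indices(num_frames, num_segments=3):
    """Sparse, duration-invariant sampling: split the video into
    num_segments equal segments and draw one snippet index from each,
    so the cost is fixed regardless of video length."""
    seg_len = num_frames / num_segments
    starts = (np.arange(num_segments) * seg_len).astype(int)
    offsets = np.random.randint(0, max(int(seg_len), 1), size=num_segments)
    return starts + offsets
```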
Aggregation Function
§ The aggregation function summarizes the predictions of the different snippets to yield the video-level prediction (see the sketch below).
§ Simple aggregation functions:
§ Mean pooling, max pooling, weighted average
§ Advanced aggregation functions:
§ Top-k pooling, attention weighting
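A minimal sketch of these consensus functions over per-snippet class scores; the array shapes and names are illustrative assumptions:

```python
import numpy as np

def aggregate(snippet_scores, method="mean", k=3):
    """snippet_scores: (K, C) array of class scores for K snippets.
    Returns a (C,) video-level score via the chosen consensus."""
    if method == "mean":
        return snippet_scores.mean(axis=0)
    if method == "max":
        return snippet_scores.max(axis=0)
    if method == "topk":
        # top-k pooling: average the k largest scores per class
        return np.sort(snippet_scores, axis=0)[-k:].mean(axis=0)
    raise ValueError(f"unknown method: {method}")
```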
Formulation of TSN
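The formulation below is reproduced from the TSN paper (Wang et al., ECCV 2016): K snippets (T_1, ..., T_K) are sampled, one per segment; F(T_k; W) is the snippet-level ConvNet with shared weights W, G the segmental consensus (aggregation) function, and H the prediction function (e.g., softmax):

```latex
\mathrm{TSN}(T_1, T_2, \ldots, T_K)
  = \mathcal{H}\!\left( \mathcal{G}\!\left( \mathcal{F}(T_1; \mathbf{W}),
    \mathcal{F}(T_2; \mathbf{W}), \ldots, \mathcal{F}(T_K; \mathbf{W}) \right) \right)
```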
Input modalities
§ The original two-stream CNNs take an RGB image and stacked optical flow fields as input.
§ We study two other input modalities (see the sketch below):
§ Stacked RGB differences
§ Approximate motion information
§ Efficient to compute
§ Stacked warped optical flow fields
§ Remove background (camera) motion
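A minimal sketch of the stacked-RGB-difference modality, assuming consecutive frames are available as a NumPy array; the names are illustrative:

```python
import numpy as np

def stacked_rgb_diff(frames):
    """frames: (T, H, W, 3) uint8 RGB frames.
    Returns an (H, W, 3*(T-1)) stack of frame differences, a cheap
    approximation of motion that avoids optical flow computation."""
    frames = frames.astype(np.int16)          # avoid uint8 wrap-around
    diffs = frames[1:] - frames[:-1]          # (T-1, H, W, 3)
    return np.concatenate(list(diffs), axis=-1)
```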
Good practices
§ Cross-modality pre-training: pre-train both the spatial and temporal nets with ImageNet models.
§ Partial batch normalization: only re-estimate the parameters of the first BN layer (see the sketch below).
§ Smaller learning rate: since the networks are pre-trained, use a smaller learning rate for fine-tuning.
§ Data augmentation: use more augmentation, including corner cropping, scale jittering, and ratio jittering.
§ High dropout ratio: 0.7 dropout for the temporal net and 0.8 dropout for the spatial net.
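A minimal PyTorch sketch of partial batch normalization as described above; this is an illustrative reimplementation, not the original training code:

```python
import torch.nn as nn

def partial_bn(model):
    """Freeze every BN layer except the first: the first BN re-estimates
    statistics for the new input modality, the rest keep their pre-trained
    statistics. Call after model.train(), since train() resets BN layers
    to training mode."""
    first = True
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            if first:
                first = False
                continue
            m.eval()                        # stop updating running stats
            for p in m.parameters():
                p.requires_grad = False     # freeze affine parameters
```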
Experiment result -- training method
Experiment result -- input modality
Experiment result -- TSN framework
Experiment result -- comparison with the state of the art
ActivityNet Challenge -- 2016
Model Visualization
Outline
§ 1. Overview of action understanding
§ 2. Temporal segment networks
§ 3. Structured segment networks
§ 4. UntrimmedNets
§ 5. Conclusion
Motivation of Structured Segment Networks
1. Action detection in untrimmed videos is an important problem.
2. Snippet-level classifiers have difficulty accurately localizing the temporal extent of action instances.
Context and Structure Modeling!
Structured Segment Network
Three-Stage Augmentation
§ Given a proposal, we extend its temporal duration by augmentation (context modeling), as sketched below.
§ Specifically, for a proposal [s, e] with duration d = e - s, the extended interval [s', e'] is given by s' = s - d/2 and e' = e + d/2.
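A one-line sketch of this extension; the d/2 per side follows the slide, and the names are illustrative:

```python
def augment_proposal(s, e, ratio=0.5):
    """Extend proposal [s, e] by ratio * duration on each side,
    yielding the starting/course/ending stages for context modeling."""
    d = e - s
    return s - ratio * d, e + ratio * d
```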
Temporal Pyramid Pooling
§ Given a proposal, we use temporal pyramid pooling to summarize its representation (structure modeling).
§ Specifically, for a proposal [s, e], the i-th part in the k-th level is pooled by averaging, and the part-level features are then concatenated into the overall representation, as written out below.
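A reconstruction of these formulas, following the definitions in the SSN paper (Zhao et al., ICCV 2017); here v_t is the snippet-level feature at time t, the i-th part of level k covers snippets s_{ki} through e_{ki}, and B_k is the number of parts in level k:

```latex
% i-th part of the k-th level, covering snippets s_{ki} .. e_{ki}:
u_i^{(k)} = \frac{1}{e_{ki} - s_{ki} + 1} \sum_{t = s_{ki}}^{e_{ki}} v_t

% overall representation: concatenation over all parts and levels
u = \left( u_1^{(1)}, \ldots, u_{B_1}^{(1)}, \; \ldots, \;
           u_1^{(K)}, \ldots, u_{B_K}^{(K)} \right)
```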
Two-Classifier Design
§ To model the action classes and the completeness of instances, we design a two-classifier loss.
§ The action class classifier measures the likelihood of the action class: P(c|p).
§ The completeness classifier measures the likelihood of instance completeness: P(b|c,p).
§ A joint loss optimizes both classifiers, as written out below.
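A reconstruction of the joint loss following the SSN paper, where c is the class label (c = 0 for background), b the completeness label, p the proposal, and 1(·) the indicator function:

```latex
L_{\mathrm{cls}}(c, b; p) = -\log P(c \mid p)
  \; - \; \mathbb{1}(c \ge 1) \cdot \log P(b \mid c, p)
```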
Temporal Region Proposal
Bottom-up proposal generation based on an actionness map.
Experiment result (1)
Experiment result (2)
Experiment result (3)
Detection example (1)
Detection example (2)
Outline
§ 1. Overview of action understanding
§ 2. Temporal segment networks
§ 3. Structured segment networks
§ 4. UntrimmedNets
§ 5. Conclusion
Motivation of UntrimmedNet
1. Labeling untrimmed videos is expensive and time-consuming.
2. Temporal annotation is subjective and not consistent across annotators and datasets.
Overview of UntrimmedNet
Clip Proposal
§ Uniform sampling (see the sketch below):
§ Uniform sampling of clips with fixed duration.
§ Shot-based sampling:
§ First, perform shot detection based on HOG differences.
§ Then, for each shot, perform uniform sampling.
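A minimal sketch of uniform clip-proposal generation; clip_len and stride are illustrative parameters, not values from the paper:

```python
def uniform_clip_proposals(num_frames, clip_len, stride):
    """Slide fixed-duration windows over the video and return
    (start, end) frame indices for each candidate clip."""
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, stride)]
```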
Clip-level Representation and Classification
§ Following the TSN framework:
§ Sample a few snippets from each clip.
§ Aggregate snippet-level predictions with average pooling.
§ In practice, we use two-stream inputs: RGB and optical flow.
Clip Selection
§ Selection aims to pick out discriminative clips or rank them with attention weights.
§ Two selection methods (see the sketch below):
§ Hard selection: top-k pooling over clip-level predictions.
§ Soft selection: learning attention weights for different clips.
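A minimal sketch of both selection schemes over clip-level scores; the shapes and names are illustrative assumptions:

```python
import numpy as np

def hard_selection(clip_scores, k):
    """Top-k pooling: average the k highest-scoring clips per class.
    clip_scores: (N, C) array of class scores for N clips."""
    return np.sort(clip_scores, axis=0)[-k:].mean(axis=0)

def soft_selection(clip_scores, attention_logits):
    """Soft selection: weight clips by softmax attention weights.
    attention_logits: (N,) learned attention scores, one per clip."""
    w = np.exp(attention_logits - attention_logits.max())
    w /= w.sum()
    return (w[:, None] * clip_scores).sum(axis=0)
```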