Temporal Action Detection with Local and Global Context Tianwei Lin Baidu VIS
What is Temporal Action Detection (TAD)? • Image: Classification • Video: Classification – Which action? [Figure labels: People, Dog (image); Cricket Bowling (video)]
What is Temporal Action Detection (TAD)? • Image: Object Detection • Video: Temporal Action Detection – 1. Which action? 2. When does each action start/end? [Figure: image with People and Dog boxes; video timeline with Background segments and two Cricket Bowling instances; example detections marked X and √]
What is Temporal Action Proposal Generation (TAPG)? • Image: Object Proposal • Video: Temporal Proposal Generation – When does each action start/end? [Figure: video timeline with Background segments; example proposals marked X and √]
What is a high-quality proposal? • Flexible temporal duration at flexible positions • Precisely located temporal boundaries • Reliable confidence scores for proposals
Related Work • Anchor-based Approaches – top-down framework – global context – define multi-scale anchors at regular intervals as proposals – SSAD, SSTAD, CBR, TURN, etc. • Anchor-free Approaches – bottom-up framework – local context – first evaluate local clues such as boundary probability or actionness, then generate proposals by exploiting these local clues – TAG, BSN, etc.
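To make the anchor-based idea concrete, here is a minimal sketch of placing multi-scale anchors at regular intervals on a normalized timeline. The function name `make_anchors`, the normalized [0, 1] time axis, and the specific scales are assumptions for illustration; they are not taken from SSAD, TURN, or any other cited method.

```python
import numpy as np

def make_anchors(num_steps, scales):
    """Return (start, end) anchors: one anchor per scale at each regular step.

    Times are normalized to [0, 1] (an assumption made for this sketch).
    """
    # Anchor centers at a regular interval along the timeline.
    centers = (np.arange(num_steps) + 0.5) / num_steps
    # One anchor of every scale (duration) around every center.
    anchors = [(c - s / 2.0, c + s / 2.0) for c in centers for s in scales]
    # Clip anchors that stick out of the video.
    return np.clip(np.array(anchors), 0.0, 1.0)

anchors = make_anchors(num_steps=4, scales=[0.25, 0.5])
print(anchors.shape)  # one row per (center, scale) pair
```

Because durations and positions are fixed in advance, every proposal is limited to this grid, which is the weakness the anchor-free methods try to address.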
Anchor-based: SSAD Approach Overview Anchor Mechanism of SSAD Single Shot Temporal Action Detection. Tianwei Lin, Xu Zhao, Zheng Shou. In ACM Multimedia 2017.
Anchor-based: SSAD Single Shot Temporal Action Detection. Tianwei Lin, Xu Zhao, Zheng Shou. In ACM Multimedia 2017.
Anchor-based: TURN Gao J, Yang Z, Chen K, et al. TURN TAP: Temporal unit regression network for temporal action proposals[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 3628-3636.
Anchor-based: Strengths and Weaknesses • Strengths – can efficiently generate multi-scale proposals – can generate reliable confidence scores, since it considers the global context of all anchors simultaneously • Weaknesses – the default anchor settings are hard to design – usually not temporally precise – not flexible enough to cover various temporal durations
Anchor-free: TAG Temporal Actionness Grouping(TAG) Local Temporal Action Detection with Structured Segment Networks. Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Dahua Lin, Xiaoou Tang. In ICCV 2017.
Anchor-free: BSN Boundary-Sensitive Network (BSN) Local then Global BSN: Boundary sensitive network for temporal action proposal generation. T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. In European Conference on Computer Vision, 2018.
Anchor-free: BSN – temporal evaluation module
Anchor-free: BSN – proposal generation
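As a rough illustration of bottom-up proposal generation from local clues, the sketch below picks candidate boundary locations from per-location start and end probabilities and pairs them into proposals. The selection rule (local peak, or above half of the maximum), the `max_duration` limit, and all function names are assumptions chosen for this example and are not stated on the slide.

```python
import numpy as np

def candidate_locations(prob, ratio=0.5):
    """Indices that are local peaks or exceed ratio * max(prob) (assumed rule)."""
    prob = np.asarray(prob, dtype=float)
    picked = []
    for t in range(len(prob)):
        left = prob[t - 1] if t > 0 else -np.inf
        right = prob[t + 1] if t < len(prob) - 1 else -np.inf
        is_peak = prob[t] > left and prob[t] > right
        if is_peak or prob[t] > ratio * prob.max():
            picked.append(t)
    return picked

def pair_boundaries(start_prob, end_prob, max_duration):
    """Combine every candidate start with every later candidate end."""
    starts = candidate_locations(start_prob)
    ends = candidate_locations(end_prob)
    return [(s, e) for s in starts for e in ends if 0 < e - s <= max_duration]

start_prob = [0.1, 0.9, 0.2, 0.1, 0.1, 0.1]
end_prob = [0.1, 0.1, 0.1, 0.2, 0.8, 0.1]
proposals = pair_boundaries(start_prob, end_prob, max_duration=5)
print(proposals)
```

Since starts and ends are chosen independently, proposals can take any duration and position, but each one still needs a separate confidence score.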
Anchor-free: BSN – proposal evaluation module
Anchor-free: BSN – NMS
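The slide only names the suppression step, so the following is a sketch of plain greedy temporal NMS over 1D segments. The hard IoU threshold of 0.5 and the function names are assumptions for illustration; the actual method may use a different variant, such as a soft score decay.

```python
import numpy as np

def temporal_iou(seg, others):
    """IoU between one segment (start, end) and an array of segments (K, 2)."""
    inter = np.maximum(
        0.0, np.minimum(seg[1], others[:, 1]) - np.maximum(seg[0], others[:, 0])
    )
    union = (seg[1] - seg[0]) + (others[:, 1] - others[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)

def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring segment, drop heavy overlaps, repeat."""
    segments = np.asarray(segments, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        ious = temporal_iou(segments[best], segments[rest])
        order = rest[ious <= iou_threshold]
    return keep

keep = temporal_nms([[0, 10], [1, 11], [20, 30]], [0.9, 0.8, 0.7])
print(keep)
```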
Strengths and weaknesses of BSN • Strengths – can generate proposals with flexible durations and precise boundaries (locally) – can generate reliable confidence scores using the BSP feature (globally) • Weaknesses – feature construction and confidence evaluation are conducted for each proposal separately, which is inefficient – the proposal feature is too simple to capture enough temporal context – it is a multi-stage method rather than a unified framework
How can we improve BSN? • How can we evaluate confidence for all proposals simultaneously with rich context? – top-down methods achieve this via the anchor mechanism – the anchor mechanism is not suited to bottom-up methods like BSN
Boundary-Matching Network (BMN) Local and Global Lin T, Liu X, Li X, et al. BMN: Boundary-matching network for temporal action proposal generation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 3889-3898.
BM confidence map
How can we generate the BM confidence map? - BM mechanism! • Pipeline: temporal feature sequence $S_F \in \mathbb{R}^{C \times T}$ → BM feature map $M_F \in \mathbb{R}^{C \times N \times D \times T}$ → BM confidence map $M_C \in \mathbb{R}^{D \times T}$ • Key issue 1: How to generate the BM feature map from the temporal feature sequence? – target: convert $S_F \in \mathbb{R}^{C \times T}$ to $M_F \in \mathbb{R}^{C \times N \times D \times T}$ – for each proposal ($D \times T$ proposals in total), generate $m^f_{i,j} \in \mathbb{R}^{C \times N}$ by sampling from its temporal scope – precisely and efficiently • Difficulties in achieving this feature sampling procedure – how to sample features at non-integer points? (precisely) – how to sample features for all proposals simultaneously? (efficiently)
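BM mechanism • Our Solution: – dot product between $S_F \in \mathbb{R}^{C \times T}$ and sampling weight $W \in \mathbb{R}^{N \times T \times D \times T}$ in the temporal dimension – for each proposal $\varphi_{i,j}$ with duration $d = t_e - t_s$, construct $w_{i,j} \in \mathbb{R}^{N \times T}$ by uniformly sampling $N$ points in the expanded temporal region $[t_s - 0.25d,\ t_e + 0.25d]$; for a non-integer sampling point $t_n$, we define its sampling mask $w_{i,j,n} \in \mathbb{R}^{T}$ as:
$$w_{i,j,n}[t] = \begin{cases} 1 - \mathrm{dec}(t_n) & \text{if } t = \lfloor t_n \rfloor \\ \mathrm{dec}(t_n) & \text{if } t = \lfloor t_n \rfloor + 1 \\ 0 & \text{otherwise} \end{cases}$$
where $\mathrm{dec}(t_n)$ is the fractional part of $t_n$ – thus, we can get $w_{i,j} \in \mathbb{R}^{N \times T}$ for proposal $\varphi_{i,j}$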
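The sampling mask for a single proposal can be sketched as follows: N points are spread over the expanded region and each non-integer point is split between its two neighboring integer locations by linear interpolation. Including both endpoints via `np.linspace`, dropping weight that falls outside the sequence, and the function name `sampling_mask` are assumptions made for this sketch.

```python
import numpy as np

def sampling_mask(t_start, t_end, num_samples, T, expand=0.25):
    """Build the (N, T) sampling mask of one proposal via linear interpolation."""
    d = t_end - t_start
    lo, hi = t_start - expand * d, t_end + expand * d
    # Assumption: sampling points include both ends of the expanded region.
    points = np.linspace(lo, hi, num_samples)
    w = np.zeros((num_samples, T))
    for n, t_n in enumerate(points):
        left = int(np.floor(t_n))
        frac = t_n - left  # fractional part of the sampling point
        # Assumption: weight for locations outside [0, T) is dropped.
        if 0 <= left < T:
            w[n, left] += 1.0 - frac
        if 0 <= left + 1 < T:
            w[n, left + 1] += frac
    return w

w = sampling_mask(t_start=2.0, t_end=4.0, num_samples=3, T=10)
print(w[0])  # weights for the sampling point at t = 1.5
```

Each row holds at most two non-zero entries, so multiplying a feature sequence by this mask reads off interpolated features at the sampling points.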
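BM mechanism • Our Solution: – then, conduct a dot product in the temporal dimension between $S_F$ and $w_{i,j}$ to generate $m^f_{i,j} \in \mathbb{R}^{C \times N}$:
$$m^f_{i,j}[c, n] = \sum_{t=1}^{T} S_F[c, t] \cdot w_{i,j}[n, t]$$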
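BM mechanism • Our Solution: – expand $w_{i,j} \in \mathbb{R}^{N \times T}$ to $W \in \mathbb{R}^{N \times T \times D \times T}$ – dot product between $S_F \in \mathbb{R}^{C \times T}$ and $W \in \mathbb{R}^{N \times T \times D \times T}$ to generate $M_F \in \mathbb{R}^{C \times N \times D \times T}$ – sampled features of all proposals, with rich context, are generated precisely and efficiently – this procedure is denoted as the BM layer in our method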
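The step of sampling all proposals at once can be sketched as a single tensor contraction over the temporal dimension. In this example the indexing convention (first proposal index is duration minus one, second is the start location), the zero weights for proposals that run past the sequence, and the helper names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def build_sampling_weight(T, D, N, expand=0.25):
    """Stack the per-proposal (N, T) masks into one (N, T, D, T) weight tensor."""
    W = np.zeros((N, T, D, T))
    for i in range(D):        # duration index: duration = i + 1 (assumed convention)
        for j in range(T):    # start location
            t_s, t_e = j, j + i + 1
            if t_e > T:
                continue      # assumption: out-of-range proposals keep zero weights
            d = t_e - t_s
            points = np.linspace(t_s - expand * d, t_e + expand * d, N)
            for n, t_n in enumerate(points):
                left = int(np.floor(t_n))
                frac = t_n - left
                if 0 <= left < T:
                    W[n, left, i, j] += 1.0 - frac
                if 0 <= left + 1 < T:
                    W[n, left + 1, i, j] += frac
    return W

def bm_layer(S_F, W):
    """Contract the temporal axis: (C, T) x (N, T, D, T) -> (C, N, D, T)."""
    return np.einsum("ct,ntdj->cndj", S_F, W)

rng = np.random.default_rng(0)
C, T, D, N = 4, 8, 4, 5
S_F = rng.standard_normal((C, T))
W = build_sampling_weight(T, D, N)   # depends only on T, D, N, so it can be precomputed
M_F = bm_layer(S_F, W)
print(M_F.shape)
```

Because the weight tensor does not depend on the video, it is built once and the per-video cost reduces to one contraction.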
How can we generate the BM confidence map? - BM mechanism! • Key issue 2: How to generate the BM confidence map from the BM feature map? – target: convert $M_F \in \mathbb{R}^{C \times N \times D \times T}$ to $M_C \in \mathbb{R}^{D \times T}$ – via a series of 3D and 2D convolution layers • Key issue 3: What is the label for training the BM confidence map? – BM label map $G_C \in \mathbb{R}^{D \times T}$, where $g^c_{i,j} \in [0,1]$ is the maximum IoU between proposal $\varphi_{i,j}$ and the ground truths
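The label map can be sketched directly from its definition: each cell holds the maximum temporal IoU between the corresponding proposal and any ground-truth segment. The indexing convention (row is duration minus one, column is the start location), the zero label for proposals that run past the sequence, and the function name are assumptions for this example.

```python
import numpy as np

def bm_label_map(gt_segments, T, D):
    """G[i, j] = max IoU between proposal (start j, duration i + 1) and ground truths."""
    G = np.zeros((D, T))
    for i in range(D):
        for j in range(T):
            t_s, t_e = j, j + i + 1
            if t_e > T:
                continue  # assumption: out-of-range proposals are labeled 0
            best = 0.0
            for g_s, g_e in gt_segments:
                inter = max(0.0, min(t_e, g_e) - max(t_s, g_s))
                union = (t_e - t_s) + (g_e - g_s) - inter
                best = max(best, inter / union)
            G[i, j] = best
    return G

# One hypothetical ground-truth action from location 2 to location 5.
G_C = bm_label_map([(2, 5)], T=8, D=4)
print(G_C[2, 2])  # proposal [2, 5] matches the ground truth exactly
```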
BMN – Network Architecture
BMN – Experimental Results
Qualitative Results
Lessons Learned • Proposals are very important for accurate localization • High-quality proposals should have three properties: – Flexible durations and locations – Precise temporal boundaries – Reliable confidence scores • The Boundary-Matching mechanism we proposed can efficiently and effectively generate high-quality temporal action proposals
Applications • Video highlight detection • Dynamic video cover generation • Fine-grained highlight generation for sports videos, game videos, etc.
Recent Work: CTCN Li X, Lin T, Liu X, et al. Deep Concept-wise Temporal Convolutional Networks for Action Localization[J]. arXiv preprint arXiv:1908.09442, 2019.
Recent Work: Relation-Aware Pyramid Network Gao J, Shi Z, Li J, et al. Accurate Temporal Action Proposal Generation with Relation-Aware Pyramid Network[J]. AAAI, 2020.
Recent Work: PGCN Zeng R, Huang W, Tan M, et al. Graph convolutional networks for temporal action localization[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 7094-7103.
Our Work: StNet • Temporally sampling “super-images” • 2D-Conv on super-images for local S-T modeling • Stacked 3D/2D-Conv blocks for global S-T modeling • Temporal 1D-Xception for long-term dynamic modeling He D, Zhou Z, Gan C, et al. StNet: Local and global spatial-temporal modeling for action recognition[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 8401-8408.
Our Work: MARL Wu W, He D, Tan X, et al. Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 6222-6231.
Our Work: Label Graph Superimposing Wang Y, He D, Li F, et al. Multi-Label Classification with Label Graph Superimposing[J]. AAAI, 2020.
Our Work: Dynamic Inference Wu W, He D, Tan X, et al. Dynamic Inference: A New Approach Toward Efficient Video Action Recognition[J]. arXiv preprint arXiv:2002.03342, 2020.
PaddleVideo • Action Recognition: TSN/TSM/StNet/Non-Local/NeXtVLAD… • Action Detection: BSN/BMN/CTCN… • Video Description: ETS • Temporal Localization via Language Query: TALL • https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/PaddleVideo
THANKS