  1. Temporal Action Detection with Local and Global Context. Tianwei Lin, Baidu VIS

  2. What is Temporal Action Detection (TAD)? Image: classification. Video: classification, answering "Which action?" (Figure: example classes such as People, Dog, Cricket Bowling.)

  3. What is Temporal Action Detection (TAD)? Image: object detection. Video: temporal action detection, answering 1. Which action? 2. When does each action start/end? (Figure: a timeline alternating between Background and action segments such as Cricket Bowling, alongside image detections such as People and Dog.)

  4. What is Temporal Action Detection (TAD)? Image: object detection. Video: temporal action detection, answering 1. Which action? 2. When does each action start/end? (Figure: the same example, with predictions marked correct (√) or incorrect (X).)

  5. What is Temporal Action Proposal Generation (TAPG)? Image: object proposal. Video: temporal proposal generation, answering: when does each action start/end? (Figure: example proposals on a timeline with Background segments, marked correct (√) or incorrect (X).)

  6. What is a high-quality proposal? • Flexible temporal durations at flexible positions • Precisely located temporal boundaries • A reliable confidence score for each proposal

  7. Related Work • Anchor-based Approaches – top-down framework – global context – define multi-scale anchors at regular intervals as proposals – SSAD, SSTAD, CBR, TURN, etc. • Anchor-free Approaches – bottom-up framework – local context – first evaluate local cues such as boundary probability or actionness, then generate proposals by exploiting these local cues – TAG, BSN, etc.

  8. Anchor-based: SSAD. (Figures: approach overview and the anchor mechanism of SSAD.) Single Shot Temporal Action Detection. Tianwei Lin, Xu Zhao, Zheng Shou. In ACM Multimedia 2017.

  9. Anchor-based: SSAD Single Shot Temporal Action Detection. Tianwei Lin, Xu Zhao, Zheng Shou. In ACM Multimedia 2017.

  10. Anchor-based: TURN Gao J, Yang Z, Chen K, et al. Turn tap: Temporal unit regression network for temporal action proposals[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 3628-3636.

  11. Anchor-based: Strength and Weakness • Strength – can efficiently generate proposals at multiple scales – can produce reliable confidence scores, since all anchors are evaluated simultaneously with global contextual information • Weakness – the default anchor settings are hard to design – usually not temporally precise – not flexible enough to cover various temporal durations

  12. Anchor-free: TAG. Temporal Actionness Grouping (TAG). Local. Temporal Action Detection with Structured Segment Networks. Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Dahua Lin, Xiaoou Tang. In ICCV 2017.

  13. Anchor-free: BSN. Boundary-Sensitive Network (BSN). Local then Global. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. T. Lin, X. Zhao, and H. Su. In European Conference on Computer Vision, 2018.

  14. Anchor-free: BSN – temporal evaluation module

  15. Anchor-free: BSN – proposal generation

  16. Anchor-free: BSN – proposal generation

  17. Anchor-free: BSN – proposal evaluation module

  18. Anchor-free: BSN – NMS

  19. Strength and weakness of BSN • Strength – can generate proposals with flexible durations and precise boundaries (locally) – can produce reliable confidence scores using the BSP feature (globally) • Weakness – feature construction and confidence evaluation are conducted for each proposal separately, which is inefficient – the proposal feature is too simple to capture enough temporal context – it is a multi-stage method rather than a unified framework

  20. How can we improve BSN? • How can we evaluate confidence for all proposals simultaneously, with rich context? – top-down methods achieve this via the anchor mechanism – but the anchor mechanism is not suited to bottom-up methods like BSN

  21. Boundary-Matching Network (BMN). Local and Global. Lin T, Liu X, Li X, et al. BMN: Boundary-matching network for temporal action proposal generation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 3889-3898.

  22. BM confidence map

  23. How can we generate the BM confidence map? - The BM mechanism! • Pipeline: temporal feature sequence $S_F \in \mathbb{R}^{C \times T}$ → BM feature map $M_F \in \mathbb{R}^{C \times N \times D \times T}$ → BM confidence map $M_C \in \mathbb{R}^{D \times T}$ • Key issue 1: How to generate the BM feature map from the temporal feature sequence? – target: convert $S_F \in \mathbb{R}^{C \times T}$ to $M_F \in \mathbb{R}^{C \times N \times D \times T}$ – for each proposal ($D \times T$ proposals in total), generate its feature $m_{i,j} \in \mathbb{R}^{C \times N}$ by sampling from its temporal scope – do so precisely and efficiently • Difficulties of this feature sampling procedure – how to sample features at non-integer points? (precisely) – how to sample features for all proposals simultaneously? (efficiently)

  24. BM mechanism • Our Solution: – dot product between $S_F \in \mathbb{R}^{C \times T}$ and a sampling weight $W \in \mathbb{R}^{N \times T \times D \times T}$ along the temporal dimension – for each proposal $\varphi_{i,j}$, construct $w_{i,j} \in \mathbb{R}^{N \times T}$ by uniformly sampling $N$ points in the expanded temporal region $[t_s - 0.25d,\, t_e + 0.25d]$ – for a non-integer sampling point $t_n$, we define the weight vector $w_{i,j,n} \in \mathbb{R}^{T}$ via linear interpolation: $w_{i,j,n}[t] = 1 - \mathrm{dec}(t_n)$ if $t = \lfloor t_n \rfloor$, $\mathrm{dec}(t_n)$ if $t = \lfloor t_n \rfloor + 1$, and $0$ otherwise – thus we obtain $w_{i,j} \in \mathbb{R}^{N \times T}$ for proposal $\varphi_{i,j}$

  25. BM mechanism • Our Solution: – then conduct the dot product along the temporal dimension between $S_F \in \mathbb{R}^{C \times T}$ and $w_{i,j} \in \mathbb{R}^{N \times T}$ to generate the proposal feature $m_{i,j} \in \mathbb{R}^{C \times N}$
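
To make the construction on slides 24-25 concrete, here is a minimal NumPy sketch of the per-proposal sampling: it builds the linear-interpolation weight matrix $w_{i,j}$ for one proposal and multiplies it with the feature sequence to obtain $m_{i,j}$. The function name `proposal_weights` and the chosen sizes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def proposal_weights(t_s, t_e, T, num_samples=32):
    """Sampling-weight matrix w_{i,j} (N x T) for one proposal [t_s, t_e].

    N points are sampled uniformly over the expanded region
    [t_s - 0.25*d, t_e + 0.25*d]; each (possibly non-integer) point becomes
    a row of linear-interpolation weights over the T temporal positions.
    """
    d = t_e - t_s
    points = np.linspace(t_s - 0.25 * d, t_e + 0.25 * d, num_samples)
    w = np.zeros((num_samples, T), dtype=np.float32)
    for n, t_n in enumerate(points):
        lo = int(np.floor(t_n))
        frac = t_n - lo                       # dec(t_n): fractional part
        if 0 <= lo < T:
            w[n, lo] += 1.0 - frac
        if 0 <= lo + 1 < T:
            w[n, lo + 1] += frac
    return w

# Example: one proposal spanning snippets [10, 20] of a T=100 feature sequence
T, C, N = 100, 128, 32
S_F = np.random.randn(C, T).astype(np.float32)    # temporal feature sequence (C x T)
w_ij = proposal_weights(10.0, 20.0, T, N)         # (N, T)
m_ij = S_F @ w_ij.T                               # dot product over time -> (C, N)
print(m_ij.shape)                                 # (128, 32)
```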

  26. BM mechanism • Our Solution: – expand the per-proposal weights $w_{i,j} \in \mathbb{R}^{N \times T}$ over all proposals into $W \in \mathbb{R}^{N \times T \times D \times T}$ – dot product between $S_F \in \mathbb{R}^{C \times T}$ and $W \in \mathbb{R}^{N \times T \times D \times T}$ to generate $M_F \in \mathbb{R}^{C \times N \times D \times T}$ – the sampled features of all proposals, with rich context, are thus generated precisely and efficiently – this procedure is denoted as the BM layer in our method
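
Slide 26 turns the per-proposal operation into a single tensor contraction so that all $D \times T$ proposals are handled at once. A rough sketch of this BM-layer computation, reusing the `proposal_weights` helper from the sketch above and using deliberately small sizes (the shapes, not the values, are the point):

```python
import numpy as np

# Small illustrative sizes; proposal_weights() is the helper from the previous sketch.
C, T, N, D = 64, 50, 8, 50

S_F = np.random.randn(C, T).astype(np.float32)   # temporal feature sequence (C x T)
W = np.zeros((N, T, D, T), dtype=np.float32)      # sampling weights of all D*T proposals

for i in range(D):                                # i: duration index (duration = i + 1)
    for j in range(T):                            # j: start index
        if j + i + 1 <= T:                        # skip proposals running past the sequence
            W[:, :, i, j] = proposal_weights(float(j), float(j + i + 1), T, N)

# One dot product over the shared temporal axis yields the BM feature map:
# (C, T) x (N, T, D, T) -> (C, N, D, T), i.e. features of every proposal at once.
M_F = np.einsum('ct,ntij->cnij', S_F, W)
print(M_F.shape)                                  # (64, 8, 50, 50)
```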

  27. How can we generate the BM confidence map? - The BM mechanism! • Key issue 2: How to generate the BM confidence map from the BM feature map? – target: convert $M_F \in \mathbb{R}^{C \times N \times D \times T}$ to $M_C \in \mathbb{R}^{D \times T}$ – via a series of 3D and 2D convolutional layers • Key issue 3: What is the label for training the BM confidence map? – the BM label map $G_C \in \mathbb{R}^{D \times T}$, where $g_{i,j} \in [0,1]$ is the maximum IoU between proposal $\varphi_{i,j}$ and the ground-truth actions
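
For key issue 3, the training target can be computed directly from the annotations: each cell of the BM label map holds the maximum IoU between that proposal and any ground-truth action. A minimal sketch, under the illustrative assumption that cell (i, j) denotes the proposal starting at snippet j with duration i + 1:

```python
import numpy as np

def temporal_iou(seg, gts):
    """IoU between one [start, end] segment and an array of ground-truth segments."""
    inter = np.maximum(0.0, np.minimum(seg[1], gts[:, 1]) - np.maximum(seg[0], gts[:, 0]))
    union = (seg[1] - seg[0]) + (gts[:, 1] - gts[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)

def bm_label_map(gt_segments, T, D):
    """BM label map G_C (D x T): g_{i,j} is the max IoU of proposal
    (start=j, duration=i+1) with any ground truth; out-of-range proposals stay 0."""
    gts = np.asarray(gt_segments, dtype=np.float32)
    G = np.zeros((D, T), dtype=np.float32)
    for i in range(D):
        for j in range(T):
            if j + i + 1 <= T:
                seg = np.array([j, j + i + 1], dtype=np.float32)
                G[i, j] = temporal_iou(seg, gts).max()
    return G

# Example: two ground-truth actions in a 50-snippet video
G_C = bm_label_map([[5, 15], [30, 42]], T=50, D=50)
print(G_C.shape, G_C.max())   # (50, 50) 1.0 where a grid proposal exactly matches a GT
```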

  28. BMN – Network Architecture

  29. BMN – Experiment Results

  30. BMN – Experiment Results

  31. Qualitative Results

  32. Qualitative Results

  33. Lessons Learned • Proposals are very important for accurate localization • High-quality proposals should have three properties: – flexible durations and locations – precise temporal boundaries – reliable confidence scores • The Boundary-Matching mechanism we propose can efficiently and effectively generate high-quality temporal action proposals

  34. Applications • Video highlight detection • Dynamic video cover generation • Fine-grained highlight generation for sports videos, game videos, etc.

  35. Recent Work: CTCN Li X, Lin T, Liu X, et al. Deep Concept-wise Temporal Convolutional Networks for Action Localization[J]. arXiv preprint arXiv:1908.09442, 2019.

  36. Recent Work: Relation-Aware Pyramid Network Gao J, Shi Z, Li J, et al. Accurate Temporal Action Proposal Generation with Relation-Aware Pyramid Network[J]. AAAI, 2020.

  37. Recent Work: PGCN Zeng R, Huang W, Tan M, et al. Graph convolutional networks for temporal action localization[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 7094-7103.

  38. Our Work: StNet. (Pipeline: temporally sampling "super-images"; 2D-Conv on super-images for local spatial-temporal modeling; stacked 3D/2D-Conv blocks for global spatial-temporal modeling; temporal 1D-Xception for long-term dynamic modeling.) He D, Zhou Z, Gan C, et al. StNet: Local and global spatial-temporal modeling for action recognition[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 8401-8408.

  39. Our Work: MARL Wu W, He D, Tan X, et al. Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 6222-6231.

  40. Our Work: Label Graph Superimposing Wang Y, He D, Li F, et al. Multi-Label Classification with Label Graph Superimposing[J]. AAAI, 2020.

  41. Our Work: Dynamic Inference Wu W, He D, Tan X, et al. Dynamic Inference: A New Approach Toward Efficient Video Action Recognition[J]. arXiv preprint arXiv:2002.03342, 2020.

  42. PaddleVideo • Action Recognition: TSN/TSM/StNet/Non-Local/NeXtVLAD… • Action Detection: BSN/BMN/CTCN… • Video Description: ETS • Temporal Localization via Language Query: TALL • https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/PaddleVideo

  43. THANKS
