Tianwei Lin Baidu VIS What is Temporal Action Detection (TAD)? - PowerPoint PPT Presentation

Temporal Action Detection with Local and Global Context Tianwei Lin Baidu VIS

What is Temporal Action Detection (TAD)? Image: Classification (People, Dog). Video: Classification (Which action? Cricket Bowling).

What is Temporal Action Detection (TAD)? Image: Object Detection (People, Dog). Video: Temporal Action Detection (segments labeled Background and Cricket Bowling). 1. Which action? 2. When does each action start/end?

What is Temporal Action Detection (TAD)? Image: Object Detection (People, Dog). Video: Temporal Action Detection (segments labeled Background and Cricket Bowling). 1. Which action? 2. When does each action start/end? (figure marks: X, √)

What is Temporal Action Proposal Generation (TAPG)? Image: Object Proposal. Video: Temporal Proposal Generation (segments labeled Background). When does each action start/end? (figure marks: X, √)

What is a high-quality proposal?
• Flexible temporal duration at flexible position
• Locate temporal boundaries precisely
• Evaluate reliable confidence scores of proposals

Related Work
• Anchor-based Approaches
– top-down framework
– global-context
– define multi-scale anchors at regular intervals as proposals
– SSAD, SSTAD, CBR, TURN, etc.
• Anchor-free Approaches
– bottom-up framework
– local-context
– first evaluate local clues such as boundary probability or actionness, then generate proposals by exploiting these local clues
– TAG, BSN, etc.
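
The anchor-based idea above (multi-scale anchors at regular intervals) can be sketched as follows. This is a minimal illustration, not the anchor layout of SSAD or TURN: the function name `generate_anchors`, the choice of scales and the stride are all assumptions made for the example.

```python
def generate_anchors(seq_len, scales, stride):
    """Return (start, end) anchors centred every `stride` positions.

    Hypothetical helper: each centre gets one anchor per scale, and
    anchors are clipped to the range [0, seq_len].
    """
    anchors = []
    for center in range(0, seq_len, stride):
        for scale in scales:
            start = max(0.0, center - scale / 2.0)
            end = min(float(seq_len), center + scale / 2.0)
            anchors.append((start, end))
    return anchors


# Example: a 16-step sequence, two scales, one centre every 4 steps.
anchors = generate_anchors(seq_len=16, scales=[2, 4], stride=4)
```

Because the scales and stride are fixed in advance, the proposals cannot adapt to an action whose duration falls between two scales, which is the weakness the later slides attribute to anchor-based methods.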

Anchor-based: SSAD Approach Overview Anchor Mechanism of SSAD Single Shot Temporal Action Detection. Tianwei Lin, Xu Zhao, Zheng Shou. In ACM Multimedia 2017.

Anchor-based: SSAD Single Shot Temporal Action Detection. Tianwei Lin, Xu Zhao, Zheng Shou. In ACM Multimedia 2017.

Anchor-based: TURN Gao J, Yang Z, Chen K, et al. Turn tap: Temporal unit regression network for temporal action proposals[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 3628-3636.

Anchor-based: Strength and Weakness
• Strength
– can efficiently generate multi-scale proposals
– can generate reliable confidence scores, since it uses the global contextual information of all anchors simultaneously
• Weakness
– it is hard to design the default setting of anchors
– usually not temporally precise
– not flexible enough to cover various temporal durations

Anchor-free: TAG Temporal Actionness Grouping (TAG): Local. Temporal Action Detection with Structured Segment Networks. Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Dahua Lin, Xiaoou Tang. In ICCV 2017.

Anchor-free: BSN Boundary-Sensitive Network (BSN): Local then Global. BSN: Boundary sensitive network for temporal action proposal generation. T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. In European Conference on Computer Vision, 2018.

Anchor-free: BSN – temporal evaluation module

Anchor-free: BSN – proposal generation

Anchor-free: BSN – proposal evaluation module

Anchor-free: BSN – NMS
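
As a rough illustration of the suppression step named on this slide, here is a sketch of plain greedy temporal NMS. It is a generic version, not the exact procedure used in BSN: the tuple layout `(start, end, score)`, the function names and the threshold are assumptions, and score-decaying variants such as Soft-NMS lower the scores of overlapping proposals instead of discarding them.

```python
def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(proposals, iou_threshold):
    """Keep the highest-scoring proposal, drop its heavy overlaps, repeat."""
    remaining = sorted(proposals, key=lambda p: p[2], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [p for p in remaining
                     if temporal_iou(best, p) <= iou_threshold]
    return kept


# Example: the second proposal overlaps the first heavily and is removed.
proposals = [(0.0, 10.0, 0.9), (1.0, 11.0, 0.8), (20.0, 30.0, 0.7)]
kept = temporal_nms(proposals, iou_threshold=0.5)
```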

Strength and weakness of BSN
• Strength
– can generate proposals with flexible duration and precise boundaries (Locally)
– can generate reliable confidence scores using the BSP feature (Globally)
• Weakness
– feature construction and confidence evaluation are performed for each proposal separately, which is inefficient
– the proposal feature is too simple to capture enough temporal context
– it is a multi-stage method rather than a unified framework

How can we improve BSN?
• How can we evaluate confidence for all proposals simultaneously with rich context?
– top-down methods achieve this via the anchor mechanism
– the anchor mechanism is not suitable for bottom-up methods like BSN

Boundary-Matching Network (BMN): Local and Global. Lin T, Liu X, Li X, et al. Bmn: Boundary-matching network for temporal action proposal generation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 3889-3898.

BM confidence map

How can we generate the BM confidence map? - BM mechanism!
• Pipeline: temporal feature sequence $S_F \in \mathbb{R}^{C \times T}$ → BM feature map $M_F \in \mathbb{R}^{C \times N \times D \times T}$ → BM confidence map $M_C \in \mathbb{R}^{D \times T}$
• Key issue 1: How to generate the BM feature map from the temporal feature sequence?
– target: convert $S_F \in \mathbb{R}^{C \times T}$ to $M_F \in \mathbb{R}^{C \times N \times D \times T}$
– for each proposal ($D \times T$ proposals in total), generate $m^f_{i,j} \in \mathbb{R}^{C \times N}$ by sampling from its temporal scope
– precisely and efficiently
• Difficulties in achieving this feature sampling procedure
– how to sample a feature at a non-integer point? (precisely)
– how to sample features for all proposals simultaneously? (efficiently)

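BM mechanism
• Our Solution:
– dot product between $S_F \in \mathbb{R}^{C \times T}$ and sampling weight $W \in \mathbb{R}^{N \times T \times D \times T}$ in the temporal dimension
– for each proposal $\varphi_{i,j}$, construct $w_{i,j} \in \mathbb{R}^{N \times T}$ by uniformly sampling $N$ points in the temporal region $[t_s - 0.25d,\ t_e + 0.25d]$; for a non-integer sampling point $t_n$, we define $w_{i,j,n} \in \mathbb{R}^{T}$ as (linear interpolation between the two neighboring integer positions, as in the BMN paper):
$$w_{i,j,n}[t] = \begin{cases} 1 - \mathrm{dec}(t_n) & \text{if } t = \lfloor t_n \rfloor \\ \mathrm{dec}(t_n) & \text{if } t = \lfloor t_n \rfloor + 1 \\ 0 & \text{otherwise} \end{cases}$$
where $\mathrm{dec}(t_n)$ is the fractional part of $t_n$
– thus, we can get $w_{i,j} \in \mathbb{R}^{N \times T}$ for proposal $\varphi_{i,j}$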
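
The construction of the sampling mask for a single proposal can be sketched in NumPy as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sampling_mask` is hypothetical, the N points are taken with `np.linspace` (endpoints included), and sample points that fall outside the sequence simply receive no weight.

```python
import numpy as np


def sampling_mask(t_start, t_end, num_samples, seq_len):
    """Build w in R^{N x T} for one proposal via linear interpolation."""
    d = t_end - t_start
    # Expanded region [t_s - 0.25 d, t_e + 0.25 d], sampled uniformly.
    points = np.linspace(t_start - 0.25 * d, t_end + 0.25 * d, num_samples)
    w = np.zeros((num_samples, seq_len))
    for n, t_n in enumerate(points):
        lo = int(np.floor(t_n))
        frac = t_n - lo  # fractional part of the sampling point
        if 0 <= lo < seq_len:
            w[n, lo] += 1.0 - frac
        if 0 <= lo + 1 < seq_len:
            w[n, lo + 1] += frac
    return w


# Example: proposal [2, 6] in a 10-step sequence, 5 sample points.
# The expanded region is [1, 7], so the points are 1, 2.5, 4, 5.5, 7.
w = sampling_mask(2.0, 6.0, num_samples=5, seq_len=10)
```

Each row of `w` holds at most two non-zero entries, so multiplying a feature sequence by a row reads off the feature at a possibly non-integer time point.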

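BM mechanism
• Our Solution:
– then, conduct the dot product in the temporal dimension between $S_F$ and $w_{i,j}$ to generate $m^f_{i,j} \in \mathbb{R}^{C \times N}$:
$$m^f_{i,j}[c, n] = \sum_{t=1}^{T} S_F[c, t] \cdot w_{i,j}[n, t]$$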

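BM mechanism
• Our Solution:
– expand $w_{i,j} \in \mathbb{R}^{N \times T}$ to $W \in \mathbb{R}^{N \times T \times D \times T}$
– dot product between $S_F \in \mathbb{R}^{C \times T}$ and $W \in \mathbb{R}^{N \times T \times D \times T}$ to generate $M_F \in \mathbb{R}^{C \times N \times D \times T}$
– sampling features of all proposals with rich context are generated precisely and efficiently
– this procedure is denoted as the BM layer in our method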
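
The BM layer described above can be sketched as one tensor contraction over the temporal dimension. The following is an illustrative NumPy version under assumptions, not the released code: the helper names are hypothetical, proposal (i, j) is taken to start at position j with duration i + 1, and proposals that run past the end of the sequence are left as all-zero masks.

```python
import numpy as np


def interp_mask(t_start, t_end, num_samples, seq_len):
    """Linear-interpolation mask in R^{N x T} for one proposal."""
    d = t_end - t_start
    points = np.linspace(t_start - 0.25 * d, t_end + 0.25 * d, num_samples)
    w = np.zeros((num_samples, seq_len))
    for n, t_n in enumerate(points):
        lo = int(np.floor(t_n))
        frac = t_n - lo
        if 0 <= lo < seq_len:
            w[n, lo] += 1.0 - frac
        if 0 <= lo + 1 < seq_len:
            w[n, lo + 1] += frac
    return w


def build_sampling_weight(seq_len, max_duration, num_samples):
    """Stack the masks of all D x T proposals into W in R^{N x T x D x T}."""
    W = np.zeros((num_samples, seq_len, max_duration, seq_len))
    for i in range(max_duration):        # duration index
        for j in range(seq_len):         # start index
            t_s, t_e = j, j + i + 1      # assumed indexing convention
            if t_e <= seq_len:
                W[:, :, i, j] = interp_mask(t_s, t_e, num_samples, seq_len)
    return W


def bm_layer(S_F, W):
    """M_F[c, n, i, j] = sum_t S_F[c, t] * W[n, t, i, j]."""
    return np.einsum("ct,ntdj->cndj", S_F, W)


# Example: C=2 channels whose value at time t is t itself, T=8, D=4, N=5.
T, D, N = 8, 4, 5
S_F = np.tile(np.arange(T, dtype=float), (2, 1))
W = build_sampling_weight(T, D, N)
M_F = bm_layer(S_F, W)
```

`W` depends only on T, D and N, so it can be built once and reused for every video; sampling for all proposals then reduces to a single contraction.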

How can we generate the BM confidence map? - BM mechanism!
• Key issue 2: How to generate the BM confidence map from the BM feature map?
– Target: convert $M_F \in \mathbb{R}^{C \times N \times D \times T}$ to $M_C \in \mathbb{R}^{D \times T}$
– a series of 3D and 2D convolution layers
• Key issue 3: What is the label for training the BM confidence map?
– BM label map $G_C \in \mathbb{R}^{D \times T}$, where $g^c_{i,j} \in [0,1]$ is the maximum IoU between the proposal and the ground truths
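
The label map from key issue 3 can be sketched as follows. This is a minimal example under assumptions: the function names are hypothetical, ground-truth segments are given as (start, end) pairs on the same time axis as the feature sequence, and proposal (i, j) is taken to start at position j with duration i + 1.

```python
import numpy as np


def temporal_iou(s1, e1, s2, e2):
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0


def bm_label_map(gt_segments, seq_len, max_duration):
    """G_C[i, j] = max IoU between proposal (i, j) and any ground truth."""
    G = np.zeros((max_duration, seq_len))
    for i in range(max_duration):
        for j in range(seq_len):
            t_s, t_e = j, j + i + 1      # assumed indexing convention
            if t_e > seq_len:
                continue                 # proposal exceeds the sequence
            G[i, j] = max(
                (temporal_iou(t_s, t_e, gs, ge) for gs, ge in gt_segments),
                default=0.0,
            )
    return G


# Example: one ground-truth action spanning [2, 5] in an 8-step sequence.
G_C = bm_label_map([(2.0, 5.0)], seq_len=8, max_duration=4)
```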

BMN – Network Architecture

BMN – Experimental Results

Qualitative Results

Lessons Learned
• Proposals are very important for accurate localization
• High-quality proposals should have three properties:
– Flexible durations and locations
– Precise temporal boundaries
– Reliable confidence scores
• The Boundary-Matching mechanism we proposed can efficiently and effectively generate high-quality temporal action proposals

Applications
• Video highlight detection
• Dynamic video cover generation
• Fine-grained highlight generation for sports videos, game videos, etc.

Recent Work: CTCN Li X, Lin T, Liu X, et al. Deep Concept-wise Temporal Convolutional Networks for Action Localization[J]. arXiv preprint arXiv:1908.09442, 2019.

Recent Work: Relation-Aware Pyramid Network Gao J, Shi Z, Li J, et al. Accurate Temporal Action Proposal Generation with Relation-Aware Pyramid Network[J]. AAAI, 2020.

Recent Work: PGCN Zeng R, Huang W, Tan M, et al. Graph convolutional networks for temporal action localization[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 7094-7103.

Our Work: StNet (figure panels: temporally sampling “super-images”; 2D-Conv on super-images for local S-T modeling; stacked 3D/2D-Conv blocks for global S-T modeling; temporal 1D-Xception for long term dynamic modeling) He D, Zhou Z, Gan C, et al. Stnet: Local and global spatial-temporal modeling for action recognition[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 8401-8408.

Our Work: MARL Wu W, He D, Tan X, et al. Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 6222-6231.

Our Work: Label Graph Superimposing Wang Y, He D, Li F, et al. Multi-Label Classification with Label Graph Superimposing[J]. AAAI, 2020.

Our Work: Dynamic Inference Wu W, He D, Tan X, et al. Dynamic Inference: A New Approach Toward Efficient Video Action Recognition[J]. arXiv preprint arXiv:2002.03342, 2020.

PaddleVideo
• Action Recognition: TSN/TSM/StNet/Non-Local/NeXtVLAD…
• Action Detection: BSN/BMN/CTCN…
• Video Description: ETS
• Temporal Localization via Language Query: TALL
• https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/PaddleVideo

THANKS
