Learning to Detect Activity in Untrimmed Video Prof. Bernard Ghanem
An image is worth a thousand words. A video is worth a million words.
Image: “a tiger attacking a person on a grass field”
Video: “the tiger is being playful”
Source: YouTube
Fun facts about video
• 55% of people watch videos online every day (1)
• 45% of people watch more than an hour of Facebook or YouTube videos a week (2)
• By 2017, online video will account for 74% of all online traffic (3)
• Almost 50% of internet users look for videos related to a product or service before visiting a store (4)
• 85% of Facebook video is watched without sound (5)
Sources: 1) MWP Statistics, 2015; 2) HubSpot, 2016; 3) KPCB, 2016; 4) Google, 2016; 5) DIGIDAY, 2016
Problem: Detecting Human Activities in Video
Input: untrimmed video (… …)
Output: Class: Pole Vault; Bounds: (23.1s, 25.2s)
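The output on this slide (a class label plus start and end times) can be held in a small record type. The sketch below is only an illustration: the `Detection` type, the `temporal_iou` helper, and the ground-truth segment are assumptions that do not appear in the slides, and temporal intersection-over-union is shown as one common way to compare a predicted segment with an annotation, not as the metric the talk uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected activity instance (hypothetical record type)."""
    label: str    # activity class, e.g. "Pole Vault"
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds


def temporal_iou(a: Detection, b: Detection) -> float:
    """Intersection-over-union of two time intervals (0.0 when disjoint)."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


# The prediction uses the values from the slide; the annotation is made up.
pred = Detection("Pole Vault", 23.1, 25.2)
truth = Detection("Pole Vault", 23.0, 25.0)
print(round(temporal_iou(pred, truth), 3))
```

A detection would typically count as correct when the labels match and the overlap exceeds a chosen threshold; the threshold itself is a design choice outside the scope of the slide.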
Why Activity Detection?
Challenges of Detecting Human Activities
1. Not enough large-scale training data
2. Large number of activities
3. Real-time processing is not enough
1. Not enough large-scale training data
ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding [CVPR 2015]
1st version (R1.1):
• ~200 classes
• ~850 hours
• class hierarchy
1. Not enough large-scale training data
At CVPR 2017 (July 26, afternoon): http://activity-net.org/challenges/2017
Sponsored by: [sponsor logos]
ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding [CVPR 2015]
Classical Activity Detection Pipeline
[Figure: input video → Basketball Dunk Classifier … Volleyball Spiking Classifier]
Using proposals is important
[Figure: input video → Action Proposal → Basketball Dunk Classifier … Volleyball Spiking Classifier]
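One way to see why proposals matter is to count classifier calls. The sketch below compares an exhaustive sliding-window scan with a run over a handful of proposals. Everything in it is an assumption made for illustration: the helper names, the toy classifiers (which score a segment by its overlap with a made-up dunk interval), the window length and stride, and the two hand-written proposals. It is not the pipeline from the cited papers.

```python
from typing import Callable, Dict, List, Tuple

Segment = Tuple[float, float]            # (start, end) in seconds
Classifier = Callable[[Segment], float]  # returns a confidence in [0, 1]


def sliding_windows(duration: float, length: float, stride: float) -> List[Segment]:
    """Exhaustive candidate segments of one fixed length."""
    windows: List[Segment] = []
    start = 0.0
    while start + length <= duration:
        windows.append((start, start + length))
        start += stride
    return windows


def run_classifiers(segments: List[Segment],
                    classifiers: Dict[str, Classifier],
                    threshold: float = 0.5):
    """Score every segment with every classifier and count the calls made."""
    detections, calls = [], 0
    for seg in segments:
        for name, clf in classifiers.items():
            calls += 1
            score = clf(seg)
            if score >= threshold:
                detections.append((name, seg, score))
    return detections, calls


def toy_classifier(target: Segment) -> Classifier:
    """Hypothetical stand-in: confidence is the overlap (IoU) with `target`."""
    def score(seg: Segment) -> float:
        inter = max(0.0, min(seg[1], target[1]) - max(seg[0], target[0]))
        union = (seg[1] - seg[0]) + (target[1] - target[0]) - inter
        return inter / union if union > 0 else 0.0
    return score


classifiers = {
    "Basketball Dunk": toy_classifier((40.0, 44.0)),  # made-up dunk interval
    "Volleyball Spiking": lambda seg: 0.0,            # no spike in this toy video
}

exhaustive = sliding_windows(duration=60.0, length=4.0, stride=1.0)
proposals = [(39.0, 44.5), (10.0, 14.0)]              # hand-written proposals

dets_all, calls_all = run_classifiers(exhaustive, classifiers)
dets_prop, calls_prop = run_classifiers(proposals, classifiers)
print(calls_all, calls_prop)
```

In this toy setting the exhaustive scan makes 114 classifier calls and the proposal-based run makes 4, while both still find the dunk. The saving depends entirely on how few proposals are needed to cover the true activity segments.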
What have we done?
• Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos [CVPR 2016]: proposals are represented as sparse combinations of STIPs (10 FPS on a single CPU core)
• DAPs: Deep Action Proposals for Action Understanding [ECCV 2016]: multi-scale (sparse) proposals are output by an LSTM in one pass (130 FPS on a single GPU)
• SST: Single-Stream Temporal Action Proposals [CVPR 2017]: multi-scale (dense) proposals are scored by a GRU in one pass + streaming (300 FPS on a single GPU)
SST: Single Stream Temporal Action Proposals
[Figure: untrimmed input video → temporal action proposals → localized action detections (SST + classifier). Architecture: input video in steps of δ along the time axis → visual encoder ϕ → sequence encoder → output c_t, which scores k proposals at time step t; the maximum proposal size per output is k·δ.]
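The figure labels suggest that each time step t emits k scored proposals whose sizes go up to k·δ. A minimal sketch of one way to read that layout is given below, assuming that the j-th score at step t belongs to a proposal of length j·δ ending at the current time. The function name, the 1-indexed step convention, the value of δ, and the scores are all hypothetical; the slide does not spell out these details.

```python
from typing import List, Tuple

Proposal = Tuple[float, float, float]  # (start, end, score), times in seconds


def proposals_at_step(t: int, scores: List[float], delta: float) -> List[Proposal]:
    """Decode the k scores emitted at step t (1-indexed) into k proposals.

    Assumed layout: score j (1..k) belongs to the proposal of length j * delta
    that ends at the current time t * delta. Proposals that would start
    before the beginning of the video are dropped.
    """
    end = t * delta
    decoded: List[Proposal] = []
    for j, score in enumerate(scores, start=1):
        start = end - j * delta
        if start >= 0:
            decoded.append((start, end, score))
    return decoded


delta = 0.5                      # assumed step size in seconds
c_t = [0.10, 0.70, 0.20, 0.05]   # made-up scores, so k = 4

early = proposals_at_step(3, c_t, delta)    # only 3 proposals fit before t = 1.5 s
later = proposals_at_step(10, c_t, delta)   # all k proposals fit
print(early)
print(max(end - start for start, end, _ in later))  # longest proposal is k * delta
```

Because every step only looks backward from the current time, the proposals can be produced in a single pass as the video streams in, which matches the "one pass + streaming" description on the earlier slide.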
SS-TAD: Single Stream Temporal Action Detection
End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos [BMVC 2017]: multi-scale (dense) detections are scored in one pass + streaming (700 FPS on a TitanX GPU)
[Figure, panels (a), (b), (c): untrimmed video input; frame-level classifiers; proposals; classifiers; merging/smoothing; SS-TAD; action detections]
SS-TAD: Single Stream Temporal Action Detection
[Video demo: detections and ground truth shown over time. Actions are played at 1x speed; background video is sped up.]
2. Large number of activities
• Applying activity detectors to a large number of activity classes is expensive.
• Can we do better than linear computational growth with the number of activity classes?
Activity-Object and Activity-Scene Relations
SCC: Semantic Context Cascade for Efficient Action Detection [CVPR 2017]
DAPs: Deep Action Proposals for Action Understanding [ECCV 2016]
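To illustrate how activity-object relations could reduce the number of classifiers that need to run, the sketch below keeps only the activities that share at least one object with what was detected in a segment. The relation table, the object names, and the `candidate_activities` helper are invented for this example; this is a simplified illustration of the idea and not the SCC method itself.

```python
from typing import Dict, List, Set

# Hypothetical activity-object relations (not taken from the paper).
ACTIVITY_OBJECTS: Dict[str, Set[str]] = {
    "Basketball Dunk": {"ball", "hoop"},
    "Volleyball Spiking": {"ball", "net"},
    "Washing Dishes": {"sink", "plate"},
    "Shoveling": {"shovel"},
}


def candidate_activities(detected_objects: Set[str],
                         relations: Dict[str, Set[str]]) -> List[str]:
    """Keep only activities that share at least one object with the segment."""
    return [activity for activity, objects in relations.items()
            if objects & detected_objects]


kitchen = candidate_activities({"sink", "plate", "cup"}, ACTIVITY_OBJECTS)
court = candidate_activities({"ball"}, ACTIVITY_OBJECTS)
print(kitchen, len(kitchen), "of", len(ACTIVITY_OBJECTS), "classifiers needed")
print(court)
```

With such a filter, the number of activity classifiers evaluated per segment depends on the context found in the segment rather than on the total number of classes, which is one possible route to sub-linear growth.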
Typical Activity Detection Pipeline
[Figure: Video Sequence → Action Proposals (Stage 1) → Action Classifiers (Stage 2), with a Reject branch]
SCC: Semantic Context Cascade for Efficient Action Detection [CVPR 2017]
DAPs: Deep Action Proposals for Action Understanding [ECCV 2016]
SCC: Semantic Context Cascade
SCC: Semantic Context Cascade for Efficient Action Detection [CVPR 2017]
3. Real-time processing is not enough
• In the past, real-time processing was a “good-to-have”, i.e. 1 min video → 1 min processing
• But not anymore!
• We need to stay ahead of the increasing video upload rate. How?
  • hardware acceleration (GPUs)
  • more efficient implementation
  • do we need to visit every frame?
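The "1 min video → 1 min processing" rule of thumb can be turned into a small calculation. The sketch below converts a throughput in frames per second into processing time for a one-minute clip, using the throughputs quoted on the earlier slides (10, 130, 300 and 700 FPS). The 30 fps video frame rate is an assumption, and the function names are illustrative.

```python
def processing_seconds(video_seconds: float, video_fps: float,
                       throughput_fps: float) -> float:
    """Time needed to process a clip when every frame is visited once."""
    return video_seconds * video_fps / throughput_fps


def realtime_factor(video_fps: float, throughput_fps: float) -> float:
    """How many seconds of video are processed per second of compute."""
    return throughput_fps / video_fps


VIDEO_FPS = 30.0  # assumed frame rate of the input video

for throughput in (10.0, 130.0, 300.0, 700.0):  # FPS figures quoted on the slides
    seconds = processing_seconds(60.0, VIDEO_FPS, throughput)
    factor = realtime_factor(VIDEO_FPS, throughput)
    print(f"{throughput:>5.0f} FPS: {seconds:6.1f} s per minute of video "
          f"({factor:.1f}x real time)")
```

Under that assumption, a method needs at least 30 FPS to keep pace with a single stream, and anything above that determines how many streams one device can keep up with.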
Do we have to visit every frame?
• Log how the human annotator moves the time slider instead of throwing this information away
• Can we learn from how humans move the slider to localize activities?
[Figure: search history over time]
Action Search: Learning to Search for Human Activities in Untrimmed Videos [arXiv 2017] [To be submitted to CVPR 2018]
[Figure: Action Search model. A 3D ConvNet and an LSTM process a sequence of observations (indices 𝑗−3 … 𝑗+1) for a given target activity. Legend: 𝒀: Visual Observation; 𝒘: Feature Vector; 𝒊: LSTM State; 𝑔(𝒀): Temporal Location]
Action Search: Learning to Search for Human Activities in Untrimmed Videos [arXiv 2017] [To be submitted to CVPR 2018]
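The legend above describes a loop: observe the video at one temporal location, update a state, and output the next location to look at. The sketch below shows only that control flow. The `observe` and `policy` callables are toy stand-ins (the observation is a made-up cue that encodes the distance to a target time, and the policy jumps halfway toward it); in the model on the slide these roles would be played by the 3D ConvNet features and the LSTM, which are not reproduced here.

```python
from typing import Callable, List, Tuple

History = List[Tuple[float, float]]  # (location in seconds, observation)


def search(duration: float,
           observe: Callable[[float], float],
           policy: Callable[[History], float],
           start: float,
           max_steps: int) -> List[float]:
    """Visit at most `max_steps` locations, each chosen from the history so far."""
    history: History = []
    t = start
    for _ in range(max_steps):
        t = min(max(t, 0.0), duration)   # keep the location inside the video
        history.append((t, observe(t)))
        t = policy(history)              # next temporal location to observe
    return [location for location, _ in history]


TARGET = 42.0  # made-up time at which the activity occurs


def toy_observe(t: float) -> float:
    """Toy cue: signed distance to the target (a real model sees only pixels)."""
    return TARGET - t


def toy_policy(history: History) -> float:
    """Toy rule: jump halfway toward where the last cue points."""
    location, cue = history[-1]
    return location + 0.5 * cue


visited = search(duration=120.0, observe=toy_observe, policy=toy_policy,
                 start=0.0, max_steps=6)
print(visited)
```

The point of the sketch is the cost profile: the search touches six locations in a two-minute video instead of every frame. How close it lands to the activity depends on the learned policy, which the toy rule does not represent.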
Action Search or Action Spotting
[Video examples: Activity: “shot put”; Activity: “basketball dunk”; Activity: “shot put”]
Action Search: Learning to Search for Human Activities in Untrimmed Videos [arXiv 2017] [To be submitted to CVPR 2018]
SPONSORS
Prof. Bernard Ghanem
bernard.ghanem@kaust.edu.sa
ivul.kaust.edu.sa
[Images: baseball throw, dunk, shoveling, washing dishes, pole vault, dancing]