Weakly-supervised Video Recognition. Pascal Mettes, University of Amsterdam
Video recognition pipeline. Learn video labels from spatio-temporal inputs. Double scale challenge: order(s) of magnitude more inputs, order(s) of magnitude fewer training samples. [Feichtenhofer et al. NeurIPS 2016] 04-04-2019 Weakly-supervised action recognition 1
Spatio-Temporal Action Localization. Discover what, when, and where actions occur in videos. [Figure: example actions Diving and Skateboarding.] 06-02-2019 Understanding Actions with Minimal Supervision 2
The annotation burden of localization. Same double burden as video recognition in general. Additional burden from box annotation in each video frame. How can we learn action locations without all these burdens?
Part I: Towards action localization from video labels
Spatio-Temporal Proposals. At test time, actions can be anywhere in a video. Split videos into action tubes, such that at least one tube matches with each action. Apply the model to all proposals and select the best ones during testing. Illustration courtesy of Jan van Gemert.
Pointly-Supervised Action Localization. Goal: Alleviate the need for expensive bounding box annotations per frame. Hypothesis: Spatio-temporal proposals can be used as training examples, if guided properly with minimal extra supervision. P. Mettes, J.C. van Gemert, and C.G.M. Snoek, “Spot On: Action Localization from Pointly-Supervised Proposals”, ECCV, 2016. P. Mettes and C.G.M. Snoek, “Pointly-Supervised Action Localization”, IJCV, 2018 (in press).
Multiple Instance Learning. [Figure: traditional supervised learning with positive and negative instances, versus multiple-instance learning with positive and negative bags.] [Dietterich et al. 1997]
Multiple Instance Learning. [Figure: positive bags and negative bags.] Compute the optimal hyper-plane with sparse MIL. Re-weight positive instances. [Slide by Vijayanarasimhan and Grauman]
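The alternation behind Multiple Instance Learning can be sketched in a few lines of Python: pick the highest-scoring instance in each positive bag, refit the classifier on those picks, and repeat. This is only an illustration under stated assumptions. The nearest-centroid linear classifier is a stand-in chosen for brevity, not the sparse MIL hyper-plane named on the slide, and the function names `fit_centroid_classifier` and `mil_select` are hypothetical.

```python
import numpy as np


def fit_centroid_classifier(pos, neg):
    """Stand-in linear classifier (assumption): the direction between class means."""
    mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
    w = mu_pos - mu_neg
    # Bias places the decision boundary halfway between the two means.
    b = -(w @ (mu_pos + mu_neg)) / 2.0
    return w, b


def mil_select(pos_bags, neg_bags, n_iters=5):
    """Alternate between instance selection and classifier refitting.

    pos_bags / neg_bags: lists of (n_instances, n_features) arrays.
    Returns the selected instance index per positive bag and the classifier.
    """
    neg = np.vstack(neg_bags)
    # Initialisation: treat every instance of a positive bag as positive.
    w, b = fit_centroid_classifier(np.vstack(pos_bags), neg)
    selected = []
    for _ in range(n_iters):
        # Step 1: in each positive bag, keep the highest-scoring instance.
        selected = [int(np.argmax(bag @ w + b)) for bag in pos_bags]
        pos = np.stack([bag[i] for bag, i in zip(pos_bags, selected)])
        # Step 2: refit the classifier on the selected instances.
        w, b = fit_centroid_classifier(pos, neg)
    return selected, w, b
```

In the localization setting of this talk, a bag would correspond to a video and an instance to a spatio-temporal proposal, so the selection step is what decides which proposal is trained on.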
Proposed approach. [Figure: proposal mining, point supervision, proposal scoring.] Annotate actions simply by pointing at action centers. Use point supervision to help find the best proposals to train on.
Multiple Instance Learning with priors. Iteratively learn to discriminate actions (likelihood) and learn the spatio-temporal extent of actions (point priors).
Point and proposal overlap. We introduce a new overlap measure between points and proposals. Proposals should be small and their centers should match with points. [Figure: example proposals ordered by overlap, lowest to highest.]
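To make the two stated criteria concrete (centers should match the points, proposals should be small), here is a hypothetical scoring function in Python. It is not the measure from the cited papers, whose formula is not given on the slide. The center term, the size penalty, the `size_weight` parameter, and all function names are assumptions made for illustration.

```python
import math


def box_area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def point_box_match(box, point):
    """1.0 when the point sits on the box center, 0.0 at a corner or outside."""
    x1, y1, x2, y2 = box
    px, py = point
    if not (x1 <= px <= x2 and y1 <= py <= y2):
        return 0.0
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_diag = math.hypot(x2 - x1, y2 - y1) / 2.0
    if half_diag == 0.0:
        return 1.0
    return 1.0 - math.hypot(px - cx, py - cy) / half_diag


def point_proposal_overlap(tube, points, frame_area, size_weight=1.0):
    """Hypothetical point-to-proposal score (assumption, not the paper's formula).

    tube:   dict mapping frame index -> box (x1, y1, x2, y2)
    points: dict mapping annotated frame index -> (x, y)
    """
    frames = [f for f in points if f in tube]
    if not frames:
        return 0.0
    # Center term: averaged over the frames that carry a point annotation.
    match = sum(point_box_match(tube[f], points[f]) for f in frames) / len(frames)
    # Size term: mean fraction of the frame covered by the proposal.
    size = sum(box_area(b) for b in tube.values()) / (len(tube) * frame_area)
    return match - size_weight * size
```

Under this sketch, a tight tube centered on the points scores higher than a frame-sized tube with the same center, which in turn scores higher than a small tube whose center is far from the points.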
Experiments. Datasets: UCF-101, UCF Sports, J-HMDB, Hollywood2Tubes. Evaluation: Rank detections based on classifier score. A detection is positive if it has the correct class and enough overlap. Results reported in (mean) Average Precision.
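The evaluation protocol can be sketched as follows in Python: rank detections by score, mark each as positive when the class is correct and the overlap reaches a threshold, and average the precision at every true positive. This is a minimal non-interpolated variant. The function names and the tuple layout are assumptions, benchmark protocols may differ in details such as interpolation, and duplicate detections of the same ground-truth action are not handled here.

```python
def label_detections(detections, overlap_threshold):
    """Turn (score, correct_class, overlap) tuples into (score, is_positive)."""
    return [
        (score, correct_class and overlap >= overlap_threshold)
        for score, correct_class, overlap in detections
    ]


def average_precision(labeled, num_ground_truth):
    """Non-interpolated Average Precision for one action class."""
    if num_ground_truth == 0:
        return 0.0
    # Rank detections by classifier score, highest first.
    ranked = sorted(labeled, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth


def mean_average_precision(per_class_ap):
    """Mean of the per-class Average Precision values."""
    return sum(per_class_ap) / len(per_class_ap)
```

The overlap threshold is the quantity that is varied on the x-axis of the later result plots: a higher threshold demands tighter localization before a detection counts as positive.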
Can we train on spatio-temporal proposals? [Figure: UCF Sports results when training on box annotations, on the best proposal, and on point annotations.] Training on our approach with points is as effective as training on boxes.
How much point supervision is required? [Figure: UCF Sports.] Points are 15 times faster to annotate than boxes. Similar performance at 50-100x speed-up with fewer annotations.
Discovered actions. [Figure: actions from box supervision versus actions from point supervision.] Similar performance, different behavior.
Replacing manual point supervision. New hypothesis: Point supervision can be replaced with automatic visual cues that correlate with the action location. P. Mettes, C.G.M. Snoek, and S.F. Chang, “Localizing Actions from Video Labels and Pseudo-Annotations”, BMVC, 2017.
Pseudo-annotations.
Action proposals: Actions occur where proposals agree. Van Gemert et al. “APT: Action localization proposals from dense trajectories”, BMVC, 2015.
Objects: Actions occur where objects occur. Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV, 2014.
Center bias: Actions are central. Tseng et al. “Quantifying bias of observers in free viewing of dynamic natural scenes”, JOV, 2004.
Independent motion: Actions occur where motion deviates from global motion. Jain et al. “Action localization with tubelets from motion”, CVPR, 2014.
Person detection: Action location correlates with the actor location. Ren et al. “Faster R-CNN: Towards real-time object detection with region proposal networks”, NeurIPS, 2015.
Individual pseudo-annotation performance. [Figure: UCF Sports.] Supervision bounds: the upper bound is full box supervision; the lower bound is video labels with no priors. Axes: the x-axis is the minimal overlap threshold for positives; the y-axis is localization performance.
Individual pseudo-annotation performance. [Figure: UCF Sports.] All pseudo-annotations are informative. Most helpful: person detection and motion. Least helpful: center bias and objects.
Combining pseudo-annotations. We rank pseudo-annotations based on correlation to person detection. Correlation matches with individual performance. Performance akin to full box supervision using the top 3 pseudo-annotations.
On point- and pseudo-annotations. The good: In two papers, from box supervision to video label supervision only. The bad: The performance gap to the state-of-the-art is increasing; spatio-temporal proposals are a limiting factor. State-of-the-art approach: Train at box-level, link boxes over time. [Kalogeiton et al. ICCV, 2017; Singh et al. ICCV, 2017]
New goal in weakly-supervised localization. Localize actions spatio-temporally from video labels. Now, directly from boxes, instead of spatio-temporal proposals. P. Mettes and C.G.M. Snoek, “Spatio-Temporal Instance Learning: Action Tubes from Class Supervision”, arXiv, 2019.
Spatio-Temporal Instance Learning. An action consists of a set of boxes; MIL is no longer directly applicable. We propose a new instance learning for actions:
Condition 1: Each positive video contains at least one positive action instance, which can occur in precisely one tube.
Condition 2: For each positive video V, the positive action instance is a set of connected boxes of minimal length 1 and maximal length F_V, where F_V denotes the total video length.
Condition 3: For each negative video, all tubes and boxes are negative.
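As a reading aid for Conditions 1 and 2, the following Python sketch checks whether a candidate set of boxes forms a valid positive action instance. It assumes that "connected" means temporally consecutive frames within one tube, and that a box is identified by a `(tube_id, frame)` pair. Both the representation and the function name are hypothetical.

```python
def is_valid_positive_instance(selection, num_frames):
    """Check a candidate positive instance against Conditions 1 and 2.

    selection:  list of (tube_id, frame) pairs, one per selected box
    num_frames: total video length F_V
    """
    if not selection:
        return False  # minimal length is 1
    # Condition 1: the instance occurs in precisely one tube.
    if len({tube_id for tube_id, _ in selection}) != 1:
        return False
    frames = sorted(frame for _, frame in selection)
    if len(set(frames)) != len(frames):
        return False  # at most one box per frame
    if frames[0] < 0 or frames[-1] >= num_frames:
        return False  # boxes must lie inside the video
    # Condition 2: boxes are connected, read here as consecutive frames
    # (an assumption), which also bounds the length by F_V.
    return frames[-1] - frames[0] + 1 == len(frames)
```

Condition 3 needs no check of this kind: in a negative video every tube and every box is simply labeled negative.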
Spatio-Temporal Instance Learning. Max margin optimization: new objective and optimization with latent variables. Constraints: no more positive box proposals than the number of frames in the video; all boxes come from one tube; boxes are connected over time.
STIL on boxes versus MIL on tubes. State-of-the-art results, especially at high overlap thresholds.
Capturing spatio-temporal action extent. Action extent during training is uncovered for long actions, problematic for short ones. At test time, success in single-action videos, confusion in multi-action videos.
Conclusions on action localization. Action localization with box annotations does not scale to large settings. We can adapt Multiple Instance Learning to help solve this problem. The gap to fully supervised approaches is still large and remains an open problem.
Part II: Action localization and recognition without examples
Zero-shot recognition. [Slide by Zeynep Akata]
On objects and actions. An action involves an actor trying to act on its environment. Objects serve as tools to make changes. To what extent can we recognize and localize actions only from objects?