Segments, Residuals and Embeddings for Few-Example Video Event Detection
Dennis Koelma and Cees Snoek, University of Amsterdam, The Netherlands
Pipeline 10Ex 2016
[Diagram: videos are sampled at 2 frames/sec; an Inception CNN trained on the ImageNet Shuffle provides the visual features. Five modalities each feed a 10Ex SVM: M1 average-pooled pool5 features, M2 average-pooled class probabilities, M3 Fisher vectors on dense trajectories, M4 Fisher vectors on MFCCs, M5 the VideoStory embedding.]
Pipeline 10Ex 2017
[Diagram: videos are sampled at 2 frames/sec; ResNet and ResNeXt trained on the ImageNet Shuffle provide the visual features. Five modalities each feed a 10Ex SVM: M1 difference-coded pool5 features, M2 average-pooled features over sliding windows, M3 Fisher vectors on dense trajectories, M4 Fisher vectors on MFCCs, M5 the VideoStory embedding.]
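To make the per-modality 10Ex recipe above concrete, here is a minimal sketch, assuming average-pooled frame features, scikit-learn's LinearSVC and late fusion by averaging decision values; the function names and the exact fusion rule are illustrative, not the submission code.

```python
# Minimal sketch of the per-modality 10Ex recipe: average-pool frame-level CNN
# features into one video descriptor, train a linear SVM per modality on the
# 10 positive examples plus background negatives, and late-fuse the scores.
# Feature extraction itself (pool5 at 2 frames/sec) is assumed to be done elsewhere.
import numpy as np
from sklearn.svm import LinearSVC

def video_descriptor(frame_features):
    """Average-pool an (n_frames, dim) array of frame features into one vector."""
    return frame_features.mean(axis=0)

def train_modality_svm(pos_videos, neg_videos):
    """Train one linear SVM for a single modality."""
    X = np.stack([video_descriptor(v) for v in pos_videos + neg_videos])
    y = np.array([1] * len(pos_videos) + [0] * len(neg_videos))
    return LinearSVC(C=1.0).fit(X, y)

def late_fusion_score(svms, video_features_per_modality):
    """Average the decision values of the per-modality SVMs for one test video."""
    scores = [svm.decision_function(video_descriptor(f)[None, :])[0]
              for svm, f in zip(svms, video_features_per_modality)]
    return float(np.mean(scores))
```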
CNN Features from 22k ImageNet classes
- Use as many classes as possible, but many classes are irrelevant
- Find a balance between the level of abstraction of a class and the number of images in a class
- Example imbalance: Siderocyte (296 images) vs. Gametophyte (4 images); some classes have only 1 image
CNN training on a selection out of the 22k ImageNet classes
• Idea
  • Increase the level of abstraction of the classes
  • Incorporate classes with fewer than 200 samples
• Heuristics: Roll, Bind, Promote, Subsample (see the sketch below)
  • N < 200: Promote
  • N < 3000: Bind
  • N > 2000: Subsample
• Result: 12,988 classes, 13.6M images
Reference: The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection, Pascal Mettes, Dennis Koelma and Cees Snoek, International Conference on Multimedia Retrieval, 2016
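A minimal sketch of how the count-based heuristics could be applied, assuming a simple dictionary-based class tree; the Bind/Roll merging of siblings is omitted and the thresholds are taken as read off the slide, so this is an illustration rather than the exact procedure of the ICMR 2016 paper.

```python
# Sketch of the Shuffle heuristics: classes with fewer than 200 images are
# promoted (their images move up to the WordNet parent), and classes with more
# than 2000 images are subsampled. Bind/Roll (merging siblings until a parent
# reaches a workable size) is left out for brevity.
PROMOTE_BELOW = 200     # merge small classes into their parent
SUBSAMPLE_ABOVE = 2000  # cap very large classes

def shuffle_classes(tree):
    """tree: dict mapping class name to {'images': [...], 'parent': str or None}."""
    # Promote: move images of small classes up to their parent and drop the class.
    for name, node in list(tree.items()):
        if len(node['images']) < PROMOTE_BELOW and node['parent'] in tree:
            tree[node['parent']]['images'].extend(node['images'])
            del tree[name]
    # Subsample: cap classes that have more images than needed.
    for node in tree.values():
        if len(node['images']) > SUBSAMPLE_ABOVE:
            node['images'] = node['images'][:SUBSAMPLE_ABOVE]
    return tree
```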
Feature Difference Coding
• K-means clustering (k = 5) on the last fully connected layer before the probability layers (called flatten)
• Fisher-like encoding, but sigma is based on the distance of the points assigned to a cluster to its center
[Chart: MAP on the 2014 Test Set for flatten-avg vs. flatten-dc, per ResNet, ResNeXt and Fusion]
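A rough sketch of the difference-coding idea described above, assuming scikit-learn's KMeans and a per-cluster sigma equal to the mean distance of assigned points to their center; the exact normalization used in the submission may differ.

```python
# Cluster frame-level 'flatten' features with k-means (k = 5), estimate a
# per-cluster sigma from the distances of assigned points to their center, and
# encode a video by the sigma-normalized residuals of its frames to each
# center (Fisher-vector-like, but without GMM posteriors).
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(all_frame_features, k=5):
    km = KMeans(n_clusters=k, n_init=10).fit(all_frame_features)
    centers, labels = km.cluster_centers_, km.labels_
    # sigma per cluster: mean distance of its assigned points to the center
    sigmas = np.array([
        np.linalg.norm(all_frame_features[labels == c] - centers[c], axis=1).mean()
        for c in range(k)
    ])
    return centers, sigmas

def encode_video(frame_features, centers, sigmas):
    # hard-assign each frame to its nearest center
    dists = np.linalg.norm(frame_features[:, None, :] - centers[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    parts = []
    for c in range(len(centers)):
        assigned = frame_features[labels == c]
        if len(assigned) == 0:
            parts.append(np.zeros_like(centers[c]))
        else:
            parts.append(((assigned - centers[c]) / sigmas[c]).mean(axis=0))
    return np.concatenate(parts)  # k * feature_dim video descriptor
```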
VideoStory: Embed the story of a video
[Diagram: video features x_i and description terms y_i (e.g. stunt, bike, motorcycle) are linked through an embedding s_i via the mappings W and A]
Joint optimization of W and A to preserve
• Descriptiveness: preserve the video descriptions: L(A,S)
• Predictability: recognize terms from the video content: L(S,W)
Reference: VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events, Amirhossein Habibian, Thomas Mensink and Cees Snoek, Proceedings of the ACM International Conference on Multimedia, 2014
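The joint objective sketched above can be written roughly as follows; this is a hedged reconstruction from the slide and the cited paper, with regularizers and the exact matrix orientation conventions omitted (Y holds the term annotations, X the video features, S the embedding):

```latex
\min_{A,\,S,\,W}\;
\underbrace{\big\lVert Y - A\,S \big\rVert_F^2}_{\text{descriptiveness } L(A,S)}
\;+\;
\underbrace{\big\lVert S - W\,X \big\rVert_F^2}_{\text{predictability } L(S,W)}
```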
VideoStory Embedding as a Feature
[Chart: MAP on the 2014 Test Set for flatten-avg vs. the VideoStory embedding, per ResNet, ResNeXt and Fusion]
VideoStory for 0Ex
[Diagram: the event query "Attempting a bike trick" is mapped to query terms (attempt 1.0, bike 1.0, trick 1.0); video features x_i are embedded into s_i via W and decoded into term scores via A (e.g. bike 0.45, man 0.30); videos are ranked by cosine similarity between the two term vectors]
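A minimal sketch of this 0Ex ranking, using a row-per-video convention; the variable names, shapes and the binary query vector are illustrative assumptions, not the exact query processing of the submission.

```python
# Turn the event query text into a binary term vector, map video features into
# term space through the learned W and A, and rank videos by cosine similarity
# to the query.
import numpy as np

def rank_videos_zero_shot(query_terms, vocabulary, X, W, A):
    """query_terms: list of words; X: (n_videos, d); W: (d, e); A: (n_terms, e)."""
    q = np.array([1.0 if w in query_terms else 0.0 for w in vocabulary])
    term_scores = X @ W @ A.T                      # predicted term scores per video
    sims = (term_scores @ q) / (
        np.linalg.norm(term_scores, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)                       # best-matching videos first
```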
Finding Segments to Expand Training Material
[Diagram: a window slides over Example1; segments whose cosine similarity to the full example is high become additional training examples Example1_1, Example1_2, Example1_3]
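A minimal sketch of this segment expansion, assuming average-pooled window descriptors; the window length, stride and similarity threshold are placeholder values, not the settings used in the submission.

```python
# Slide a fixed-length window over the frame features of a positive example,
# average-pool each window, and keep the windows whose cosine similarity to
# the full-video descriptor is high enough to serve as extra positives.
import numpy as np

def expand_example(frame_features, window=50, stride=25, min_sim=0.8):
    full = frame_features.mean(axis=0)
    extra = []
    for start in range(0, max(1, len(frame_features) - window + 1), stride):
        seg = frame_features[start:start + window].mean(axis=0)
        sim = seg @ full / (np.linalg.norm(seg) * np.linalg.norm(full) + 1e-12)
        if sim >= min_sim:
            extra.append(seg)   # e.g. Example1_1, Example1_2, ...
    return extra
```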
Window-based Features
[Chart: MAP on the 2014 Test Set for flatten-avg vs. flatten-window, per ResNet, ResNeXt and Fusion]
Result: Individual Modalities on the 2014 Test Set
• DC is best (overfit?)
• VS > flatten
• window > avg
• ResNet < ResNeXt < Fusion
[Chart: MAP per modality (flatten-avg, softmax, trajectories, mfcc, video story, flatten-dc, flatten-window) for ResNet, ResNeXt and Fusion]
Fusion of Visual Modalities on the 2014 Test Set (ResNet + ResNeXt)
[Chart: MAP for VS, DC, Win, DC-VS, DC-Win and VS-DC-Win]
Fusion on the 2014 Test Set
[Chart: MAP comparing last year's multimedia fusion (AVG2-DT-MFCC-VS), the single new feature DC, the visual fusion VS-DC-Win, the multimedia fusion VS-DC-Win-DT-MFCC, and VS-DC-Win-DT-MFCC-AVG2, for ResNeXt and ResNet+ResNeXt]
Computational Efficiency
[Chart: feature extraction time, classification time and MAP for p-visualFusionTwoCNN, c-mmFusionTwoCNN, c-visualFusionOneCNN, c-mmFusionOneCNN and c-visualSingle]
Our MED Submissions on Test 2014
[Chart: PS and AH scores for p-visualFusionTwoCNN, c-mmFusionTwoCNN, c-visualFusionOneCNN, c-mmFusionOneCNN and c-visualSingle]
All MED Submissions
[Charts: PS and AH scores per submission, including MediaMill, TokyoTech, ITI-CERTH and INF]
Conclusions
• Visual features are still improving
• Fusion still works, but other modalities need work
• 0Ex helps to get more out of your examples
Thank You