
Segments, Residuals and Embeddings for Few-Example Video Event Detection



  1. Segments, Residuals and Embeddings for Few-Example Video Event Detection. Dennis Koelma and Cees Snoek, University of Amsterdam, The Netherlands.

  2. Pipeline 10Ex 2016. [Diagram: videos are sampled into frames at 2 / sec and passed through an Inception CNN pre-trained with the ImageNet Shuffle. Components shown: pool5 and prob features, avg pool, Video Story embedding, dense trajectories with Fisher vector, and MFCC (mfcc0, mfcc1, mfcc2) with Fisher vector. Each stream ends in an SVM, giving models 10Ex M1 to M5.]

  3. Pipeline 10Ex 2017. [Diagram: videos are sampled into frames at 2 / sec and passed through ResNet + ResNeXt CNNs pre-trained with the ImageNet Shuffle. Components shown: pool5 features, difference coding, avg pool, sliding window, Video Story embedding, dense trajectories with Fisher vector, and MFCC (mfcc0, mfcc1, mfcc2) with Fisher vector. Each stream ends in an SVM, giving models 10Ex M1 to M5.]

  4. CNN Features from 22k ImageNet classes
     - Use as many classes as possible
     - Find a balance between level of abstraction of classes and number of images in a class
     [Figure labels: Irrelevant classes; Example imbalance; Siderocyte; Gametophyte; 296 classes with 1 image]

  5. CNN training on selection out of 22k ImageNet classes
     • Idea
       • Increase level of abstraction of classes
       • Incorporate classes with fewer than 200 samples
     • Heuristics
       • Roll, Bind, Promote, Subsample
     • Result
       • 12,988 classes
       • 13.6M images
     [Diagram labels: N > 2000: Subsample; N < 200: Promote; Roll; N < 3000: Bind]
     The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection, Pascal Mettes and Dennis Koelma and Cees Snoek, International Conference on Multimedia Retrieval, 2016
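
As a rough illustration of two of the heuristics above, the sketch below applies Promote (N < 200) and Subsample (N > 2000) to a toy class hierarchy. Roll and Bind are left out because the slide does not spell out how they work. The function name `reorganize`, the dictionary layout, and the bottom-up visiting order are assumptions made for this example, not the authors' implementation.

```python
import random

def reorganize(parent, images, promote_below=200, subsample_above=2000, seed=0):
    """Toy Promote + Subsample pass (hypothetical helper, not the authors' code).

    parent: dict mapping every class to its parent class (root maps to None)
    images: dict mapping class to a list of image ids
    Returns a dict mapping each surviving class to its list of image ids.
    """
    rng = random.Random(seed)
    pools = {c: list(images.get(c, [])) for c in parent}

    def depth(c):
        d = 0
        while parent[c] is not None:
            c = parent[c]
            d += 1
        return d

    # Promote: deepest classes first, so small classes merge into their parent.
    for c in sorted(pools, key=depth, reverse=True):
        p = parent[c]
        if p is not None and len(pools[c]) < promote_below:
            pools[p].extend(pools.pop(c))

    # Subsample: cap very large classes at the threshold.
    result = {}
    for c, ims in pools.items():
        if len(ims) > subsample_above:
            ims = rng.sample(ims, subsample_above)
        if ims:
            result[c] = ims
    return result

# Toy hierarchy: animal -> dog -> rare_dog
parent = {"animal": None, "dog": "animal", "rare_dog": "dog"}
images = {
    "animal": [f"a{i}" for i in range(2500)],
    "dog": [f"d{i}" for i in range(180)],
    "rare_dog": [f"r{i}" for i in range(50)],
}
out = reorganize(parent, images)
print({c: len(ims) for c, ims in out.items()})
```

In this toy run `rare_dog` is promoted into `dog`, which then holds 230 images and survives, while `animal` is subsampled down to 2000 images.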

  6. Feature Difference Coding
     • K-means clustering (k = 5) on last fully connected layer before probability layers (called flatten)
     • Fisher-like encoding, but sigma is based on the distance of the points assigned to a cluster to its center
     [Bar chart: MAP on 2014 Test Set (axis 0.290 to 0.350), flatten-avg vs. flatten-dc, for ResNet, ResNeXt and Fusion]
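
The encoding described on this slide could look roughly like the sketch below: frame features are hard-assigned to one of k = 5 k-means centers, and the video is represented by the per-cluster mean residual, scaled by a per-cluster sigma taken as the mean distance of training points to their center. The hard assignment, the mean (rather than sum) of residuals, the absence of further normalization, and all function names are assumptions; the slide only states the clustering and the definition of sigma.

```python
import math
import random

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))

def kmeans(points, k=5, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def cluster_sigmas(points, centers):
    """Per-cluster sigma: mean distance of assigned points to their center."""
    dists = [[] for _ in centers]
    for p in points:
        j = nearest(p, centers)
        dists[j].append(math.dist(p, centers[j]))
    return [sum(d) / len(d) if d and sum(d) > 0 else 1.0 for d in dists]

def difference_coding(frames, centers, sigmas):
    """Concatenate the sigma-scaled mean residual of each cluster."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for f in frames:
        j = nearest(f, centers)
        counts[j] += 1
        for t in range(dim):
            sums[j][t] += (f[t] - centers[j][t]) / sigmas[j]
    code = []
    for j in range(k):
        n = max(counts[j], 1)  # clusters with no frames contribute zeros
        code.extend(s / n for s in sums[j])
    return code

# Synthetic stand-in for "flatten" features of training frames.
rng = random.Random(1)
train = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
centers = kmeans(train, k=5)
sigmas = cluster_sigmas(train, centers)
code = difference_coding(train[:30], centers, sigmas)
print(len(code))  # 5 clusters x 4 dimensions = 20
```

The resulting vector has k times the feature dimension entries, which is why a small k keeps the video representation compact.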

  7. Video Story: Embed the story of a video
     [Diagram: video features x_i, description terms y_i (e.g. Stunt, Bike, Motorcycle), and the embedding s_i, connected by projections W and A]
     Joint optimization of W and A to preserve:
     • Descriptiveness: preserve video descriptions: L(A,S)
     • Predictability: recognize terms from video content: L(S,W)
     Videostory: A new multimedia embedding for few-example recognition and translation of events, Amirhossein Habibian and Thomas Mensink and Cees Snoek, Proceedings of the ACM International Conference on Multimedia, 2014
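
The slide only names the two losses. As a hedged sketch, assuming squared-error terms in the spirit of the cited VideoStory paper, the joint objective could be written as below. The 1/N normalization, the orientation of W, and the omission of regularization terms are assumptions made here for brevity.

```latex
\min_{A,\,W,\,S}\; L(A,S) + L(S,W), \qquad
L(A,S) = \frac{1}{N}\sum_{i=1}^{N} \lVert y_i - A\,s_i \rVert_2^2, \qquad
L(S,W) = \frac{1}{N}\sum_{i=1}^{N} \lVert s_i - W^{\top} x_i \rVert_2^2
```

Here x_i is the video feature, y_i the term vector of its description, and s_i the shared embedding, so the first term rewards embeddings that reconstruct the description and the second rewards embeddings that are predictable from the video content.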

  8. VideoStory Embedding as a Feature. [Bar chart: MAP on 2014 Test Set (axis 0.300 to 0.335), flatten-avg vs. video story, for ResNet, ResNeXt and Fusion]

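  9. Video Story for 0Ex. [Diagram: a video x_i is embedded via W into s_i and mapped via A to term scores (0.45 bike, 0.30 man). The event description "Attempting a bike trick" becomes a term vector (1.0 attempt, 1.0 bike, 1.0 trick). The two vectors are compared by cosine similarity.]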
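
The zero-example matching step can be illustrated with a small cosine similarity over sparse term vectors. The sketch uses only the term weights visible on the slide; a real video would carry scores for many more terms, and the helper name `cosine` is chosen for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Event description "Attempting a bike trick" as a term vector.
query = {"attempt": 1.0, "bike": 1.0, "trick": 1.0}
# Term scores predicted for one video (only the values shown on the slide).
video_terms = {"bike": 0.45, "man": 0.30}

score = cosine(query, video_terms)
print(round(score, 3))
```

Only the shared term "bike" contributes to the dot product, so the score stays well below 1 even though the video is plausibly relevant.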

  10. Finding Segments to Expand Training Material. [Diagram: a window slides over Example1; each window is scored by cosine similarity, yielding the segments Example1_1, Example1_2 and Example1_3]
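
One way to read this slide is sketched below: slide a window over the frame features of an example video, average-pool each window, score it by cosine similarity against a reference vector, and keep the best-scoring windows as extra training segments. What the windows are compared against is not stated on the slide, so the `query` vector here is an assumption, as are the window length, stride, and function names.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def best_segments(frames, query, window=4, stride=2, top_k=3):
    """Return up to top_k (score, start, end) windows, best score first."""
    scored = []
    for start in range(0, len(frames) - window + 1, stride):
        pooled = mean_pool(frames[start:start + window])
        scored.append((cosine(pooled, query), start, start + window))
    scored.sort(reverse=True)
    return scored[:top_k]

# Toy video: the first half is unrelated to the query, the second half matches it.
frames = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
query = [0.0, 1.0]
segments = best_segments(frames, query)
for score, start, end in segments:
    print(f"frames {start}:{end} score {score:.3f}")
```

Each selected window could then be treated as an additional positive example (Example1_1, Example1_2, ...) when training the SVM.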

  11. Window-based Features. [Bar chart: MAP on 2014 Test Set (axis 0.295 to 0.345), flatten-avg vs. flatten-window, for ResNet, ResNeXt and Fusion]

  12. Result Individual Modalities on 2014 Test Set
      • DC is best (overfit?)
      • VS > flatten window > avg
      • R < Rx < F (ResNet < ResNeXt < Fusion)
      [Bar chart: MAP (axis 0 to 0.4) for flatten-avg, softmax, trajectories, mfcc, video story, flatten-dc and flatten-window; series ResNet, ResNeXt and Fusion]

  13. Fusion Visual Modalities on 2014 Test Set, ResNet + ResNeXt. [Bar chart: MAP (axis 0.315 to 0.360) for VS, DC, Win, DC-VS, DC-Win and VS-DC-Win]

  14. Fusion on 2014 Test Set. [Bar chart: MAP (axis 0.315 to 0.360) for AVG2-DT-MFCC-VS, DC, VS-DC-Win, VS-DC-Win-DT-MFCC and VS-DC-Win-DT-MFCC-AVG2; series ResNeXt and ResNet+ResNeXt. Annotations: last year; new features; single mod; visual fusion; MM fusion; + avg]

  15. Computational Efficiency. [Chart: Feature Extraction, Classification and MAP for the runs p-visualFusionTwoCNN, c-mmFusionTwoCNN, c-visualFusionOneCNN, c-mmFusionOneCNN and c-visualSingle]

  16. Our MED Submission. [Table: Test 2014, PS and AH results for the runs p-visualFusionTwoCNN, c-mmFusionTwoCNN, c-visualFusionOneCNN, c-mmFusionOneCNN and c-visualSingle]

  17. All MED Submissions. [Bar charts: PS (axis 0 to 50) and AH (axis 0 to 80) for the submissions of MediaMill, TokyoTech, ITICERTH and INF]

  18. Conclusions
      • Visual features are still improving
      • Fusion still works, but other modalities need work
      • 0Ex helps to get more out of your examples

  19. Thank You
