Query Understanding is Key for Zero-Example Video Search
Dennis Koelma and Cees Snoek
University of Amsterdam, The Netherlands
Pipeline
• Input: selected query terms and video frames, sampled at 2 frames/sec with window averaging.
• 0Ex M1 (VideoStory): ResNet frame features are embedded with VideoStory and flattened into a video-level term vector over the VS vocabulary; query terms are mapped to their closest VS-vocabulary terms via word2vec; videos are scored by cosine similarity. (A minimal scoring sketch follows below.)
• 0Ex M2 (Concepts): ResNeXt trained on the ImageNet Shuffle yields softmaxed concept scores; query terms are mapped via word2vec to the top-N closest concepts, scores are percentile-filtered, and videos are scored by dot similarity.
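A minimal sketch of the two scoring branches, assuming precomputed inputs: a gensim word2vec model `w2v`, the VS vocabulary `vs_vocab`, a video's predicted VideoStory term vector `video_term_vector`, concept names `concepts`, and the video's softmaxed concept scores `concept_probs`. Function names, the percentile value, and the exact weighting are illustrative, not the MediaMill implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, with an epsilon guarding against zero vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def m1_videostory_score(query_terms, w2v, vs_vocab, video_term_vector):
    """M1: replace each query term by its closest VS-vocabulary term in
    word2vec space, flatten into a bag-of-terms vector, and score the
    video by cosine similarity with its predicted term vector."""
    index = {t: i for i, t in enumerate(vs_vocab)}
    in_model = [t for t in vs_vocab if t in w2v]
    q = np.zeros(len(vs_vocab))
    for term in query_terms:
        if term not in w2v or not in_model:
            continue
        closest = max(in_model, key=lambda t: w2v.similarity(term, t))
        q[index[closest]] += 1.0
    return cosine(q, video_term_vector)

def m2_concept_score(query_terms, w2v, concepts, concept_probs,
                     top_n=5, pct=95):
    """M2: pick the top-N concepts closest to the query terms in word2vec
    space, zero out concept scores below a percentile threshold, and
    score the video by dot product."""
    sims = np.zeros(len(concepts))
    for i, name in enumerate(concepts):
        pairs = [w2v.similarity(qt, w)
                 for qt in query_terms if qt in w2v
                 for w in name.split() if w in w2v]
        sims[i] = np.mean(pairs) if pairs else 0.0
    weights = np.zeros(len(concepts))
    top = np.argsort(-sims)[:top_n]
    weights[top] = sims[top]
    # percentile filter: suppress weak concept responses per video
    thresh = np.percentile(concept_probs, pct)
    filtered = np.where(concept_probs >= thresh, concept_probs, 0.0)
    return float(weights @ filtered)
```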
22k ImageNet Classes
• Use as many classes as possible
• Irrelevant classes
• Find a balance between the level of abstraction of classes and the number of images in a class
• Example imbalance: Siderocyte (296 images) vs. Gametophyte (3 images); there are classes with only 1 image
CNN Training on a Selection out of 22k ImageNet Classes
• Idea
  • Increase the level of abstraction of classes
  • Incorporate classes with fewer than 200 samples
• Heuristics (a sketch follows below)
  • Roll
  • Promote: N < 200
  • Bind: N < 3000
  • Subsample: N > 2000
• Result
  • 12,988 classes
  • 13.6M images

The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection, Pascal Mettes, Dennis Koelma, and Cees Snoek, International Conference on Multimedia Retrieval, 2016
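A minimal sketch of how the four heuristics could be applied bottom-up over a WordNet-style class tree. The `Node` class and the order of operations are assumptions made here for illustration; the authoritative procedure is the one described in the ICMR 2016 paper.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    count: int = 0                      # images directly in this class
    children: list = field(default_factory=list)

PROMOTE_MIN = 200     # N < 200  : Promote (merge into parent)
BIND_MIN = 3000       # N < 3000 : Bind (group small siblings)
SUBSAMPLE_MAX = 2000  # N > 2000 : Subsample (cap images per class)

def shuffle(node: Node) -> None:
    for child in node.children:
        shuffle(child)                  # Roll: process the tree bottom-up
    keep = []
    for child in node.children:
        if child.count < PROMOTE_MIN:   # Promote: too few images,
            node.count += child.count   # fold them into the parent class
        else:
            keep.append(child)
    node.children = keep
    small = [c for c in node.children if c.count < BIND_MIN]
    if len(small) > 1:                  # Bind: merge small sibling classes
        merged = Node("+".join(c.name for c in small),
                      sum(c.count for c in small))
        node.children = [c for c in node.children if c.count >= BIND_MIN]
        node.children.append(merged)
    for child in node.children:         # Subsample: balance large classes
        child.count = min(child.count, SUBSAMPLE_MAX)
```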
Concept Bank
• Two networks: ResNet and ResNeXt
• Three datasets (subsets of ImageNet)
  • Roll, Bind (3000), Promote (200), Subsample: 13k classes, training with 1000 images/class
  • Roll, Bind (7000), Promote (1250), Subsample: 4k classes, training with 1706 images/class
  • Top 4000 classes by breadth-first search with >1200 images: training with 1324 images/class
Video Story: Embed the Story of a Video
[Figure: example video with story terms stunt, bike, motorcycle; visual features x_i are mapped by W into the embedding s_i, which A maps to the description terms y_i.]
• Joint optimization of W and A to preserve
  • Descriptiveness: preserve video descriptions: L(A, S)
  • Predictability: recognize terms from video content: L(S, W)

VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events, Amirhossein Habibian, Thomas Mensink, and Cees Snoek, Proceedings of the ACM International Conference on Multimedia, 2014
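A plausible written-out form of the joint objective, reconstructed from the slide's L(A,S) and L(S,W) notation under the assumption of the least-squares losses used in the VideoStory paper (regularization omitted); X, Y, and S stack the x_i, y_i, and s_i as rows:

```latex
\min_{A,\,S,\,W}\;
\underbrace{\lVert Y - S A \rVert_F^2}_{\text{descriptiveness } L(A,S)}
\;+\;
\underbrace{\lVert S - X W \rVert_F^2}_{\text{predictability } L(S,W)}
```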
Video Story Training Sets
• VideoStory46k (www.mediamill.nl): 45,826 videos from YouTube, based on 2013 MED research set terms
• FCVID (Fudan-Columbia Video Dataset): 87,609 videos
• EventNet: 88,542 videos
• Merged (VideoStory46k, FCVID, EventNet)
• Video Story dictionary: terms that occur more than 10 times in the dataset
  • Merged: 6,440 terms
• Using a vocabulary of stemmed terms that occur more than 100 times in a Wikipedia dump
  • With stemming: respect the Video Story dictionary
  • 267,836 terms
  • Use word2vec to expand them per video (sketch below)
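A hedged sketch of the dictionary construction and per-video word2vec expansion, using gensim. The occurrence thresholds follow the slide; the `topn` value and the exact restriction step are illustrative assumptions.

```python
from collections import Counter
from gensim.models import KeyedVectors

# e.g. w2v = KeyedVectors.load_word2vec_format(
#          "GoogleNews-vectors-negative300.bin", binary=True)

def build_dictionary(video_terms, min_count=10):
    """VS dictionary: keep terms occurring more than min_count times
    across all videos in the merged dataset."""
    counts = Counter(t for terms in video_terms for t in terms)
    return {t for t, c in counts.items() if c > min_count}

def expand_terms(terms, w2v, wiki_vocab, topn=5):
    """Expand one video's terms with their word2vec neighbours,
    restricted to the stemmed Wikipedia-derived vocabulary."""
    expanded = set(terms)
    for t in terms:
        if t in w2v:
            for neighbour, _sim in w2v.most_similar(t, topn=topn):
                if neighbour in wiki_vocab:
                    expanded.add(neighbour)
    return expanded
```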
Query Terms
• Experiments show it is important to select the right terms, instead of just taking the average of the terms in word2vec space
• Part-of-speech tagging
  • <noun1>, <verb>, <noun2>
  • <subject>, <predicate>, <remainder>
• Query plan (see the sketch after this list)
  A. Use nouns, verbs, and adjectives in <subject>, unless it concerns a person (noun1 = "person", "man", "woman", "child", ...)
  B. Use nouns in <remainder>, unless it concerns a person or the noun is a setting ("indoors", "outdoors", ...)
  C. Use <predicate>
  D. Use all nouns in the sentence, unless the noun is a person or a setting
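A hedged sketch of the query plan using spaCy's part-of-speech and dependency tags. The `PERSON_WORDS` and `SETTING_WORDS` sets merely extend the slide's examples, and the subject/remainder split is approximated with the `nsubj` relation rather than the full phrase-structure split the slide implies.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative extensions of the slide's examples, not exhaustive lists.
PERSON_WORDS = {"person", "man", "woman", "child", "people", "boy", "girl"}
SETTING_WORDS = {"indoors", "outdoors", "inside", "outside"}
SKIP = PERSON_WORDS | SETTING_WORDS

def select_query_terms(query):
    """Approximate the slide's query plan A-D with spaCy tags."""
    doc = nlp(query)
    subject = [t for t in doc if t.dep_ in ("nsubj", "nsubjpass")]
    # A. nouns, verbs, and adjectives in the subject, unless a person
    terms = [t.lemma_ for t in subject
             if t.pos_ in ("NOUN", "VERB", "ADJ") and t.lemma_ not in SKIP]
    # B. nouns in the remainder, unless a person or a setting
    terms += [t.lemma_ for t in doc
              if t.pos_ == "NOUN" and t not in subject
              and t.lemma_ not in SKIP]
    # C. fall back to the predicate (the root verb)
    if not terms:
        terms = [t.lemma_ for t in doc
                 if t.dep_ == "ROOT" and t.pos_ == "VERB"]
    # D. last resort: all nouns that are not persons or settings
    if not terms:
        terms = [t.lemma_ for t in doc
                 if t.pos_ == "NOUN" and t.lemma_ not in SKIP]
    return terms

# Expected, model permitting: "A person playing drums indoors" -> ["drum"]
```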
The Effect of Parsing on 2016 Topics
• MIAP using only the ResNet feature
[Bar chart: MIAP per vocabulary (EventNet, Merged, top4000, rbps13k), comparing avg and parse term selection.]
(Greedy) Oracle on 2016 Topics
• Fuse the top (max 5) words/concepts with the highest MIAP
• MIAP using only the ResNet feature
[Bar chart: MIAP per vocabulary (EventNet, Merged, top4000, rbps13k) for avg, parse, and oracle term selection.]
Query Examples: The Good
• Query: "A person playing drums indoors"
• VideoStory terms avg: person, plai, drum, indoor
• VideoStory terms parse: drum
• VideoStory terms oracle: beat, drum, snare, vibe
[Bar chart: MIAP for Merged, rbps13k, and bng under avg, parse, and oracle term selection.]
Query Examples: The Ambiguous
• Query: "A person playing drums indoors"
• Concepts top5 avg: guitarist, guitar player; outdoor game; drum, drumfish; sitar player; brake drum, drum
• Concepts top5 parse: drum, drumfish; brake drum, drum; barrel, drum; snare drum, snare, side drum; drum, membranophone, tympan
• Concepts top5 oracle: percussionist; cymbal; drummer; drum, membranophone, tympan; snare drum, snare, side drum
[Bar chart: MIAP for Merged and rbps13k under avg, parse, and oracle concept selection.]
Query Examples: The Bad
• Query: "A person sitting down with a laptop visible"
• VideoStory terms avg: person, sit, laptop
• VideoStory terms parse: laptop
• VideoStory terms oracle: monitor, aspir, acer, alienwar, vaio, asus; laptop only appears at rank 7
[Bar chart: MIAP for Merged and rbps13k under avg, parse, and oracle term selection.]
Query Examples: The Difficult
• Query: "A person wearing a helmet"
• Concept top5 parse: helmet (a protective headgear made of hard material to resist blows); helmet (armor plate that protects the head); pith hat, pith helmet, sun helmet, topee, topi; batting helmet; crash helmet
• Concept top5 oracle: hockey skate; hockey stick; ice hockey, hockey, hockey game; field hockey, hockey; rink, skating rink
[Bar chart: MIAP for Merged and rbps13k under avg, parse, and oracle concept selection.]
Query Examples: The Impossible
• Query: "A crowd demonstrating in a city street at night"
• Parsing "fails", and averaging wouldn't have helped either
• VideoStory oracle: vega, squar, gang, times, occupi
• Concept oracle: vigil light, vigil candle; motorcycle cop, motorcycle policeman, speed cop; rider; minibike, motorbike; freewheel
[Bar chart: MIAP for Merged and rbps13k under avg, parse, and oracle selection.]
Results: 5 Modalities x 2 Features
• VideoStory: ResNeXt is better than ResNet
• Concepts: ResNet is better than ResNeXt (overfit?)
• VideoStory is better than Concepts
[Bar chart: MIAP for EventNet, Merged, top4000, rbps4k, and rbps13k with ResNet, ResNeXt, and ResNet+ResNeXt features.]
Final Fusion
• Concept fusion is slightly better than VideoStory fusion
• The two are often complementary, with big differences between them on many topics
• Taking the top 2/4 for concepts is slightly better than the top 3/5 (a fusion sketch follows below)
[Bar chart: MIAP for ResNet, ResNeXt, and ResNet+ResNeXt after final fusion.]
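A hedged sketch of what score-level fusion over the modalities could look like; min-max normalization, averaging, and the top-k modality selection are assumptions standing in for the slide's "top 2/4" and "top 3/5" fusion, which the slide does not specify in detail.

```python
import numpy as np

def fuse(score_lists, top_k=None):
    """Late fusion over modalities: min-max normalise each modality's
    scores across all videos, optionally keep only the top_k modalities
    (here: those with the strongest peak response), then average."""
    mats = []
    for scores in score_lists:
        s = np.asarray(scores, dtype=float)
        mats.append((s - s.min()) / (s.max() - s.min() + 1e-12))
    m = np.stack(mats)                      # shape: (modalities, videos)
    if top_k is not None and top_k < len(m):
        order = np.argsort(-m.max(axis=1))  # illustrative stand-in for
        m = m[order[:top_k]]                # the slide's "top 2/4" choice
    return m.mean(axis=0)                   # fused score per video
```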
Our AVS Submission
[Bar chart: MIAP on the 2016 and 2017 topics for Fusion top24, Fusion top35, VideoStory, and Concepts.]
All Fully Automatic AVS Submissions
[Bar chart: MIAP of all fully automatic AVS submissions.]
All Automatic and Interactive AVS Submissions
[Bar chart: MIAP of all automatic and interactive AVS submissions, ordered by score; the top runs are M_D_Waseda_Meisei.17_1, M_D_Waseda_Meisei.17_3, F_D_MediaMill.17_1, and F_D_MediaMill.17_2.]
Conclusions
• Query parsing is important
• VideoStory and Concepts are good, but will not "solve" AVS
Thank You