TRECVID-2006 High-Level Feature task: Overview
Wessel Kraaij (TNO) & Paul Over (NIST)
Outline
- Task summary
- Evaluation details
- Inferred average precision vs. mean average precision
- Participants
- Evaluation results
  - Pool analysis
  - Results per category
  - Results per feature
  - Significance tests (category A)
  - Comparison with TV2005
- Global observations
- Issues
High-level feature task
Goal: build a benchmark collection for visual concept detection methods
Secondary goals:
- encourage generic (scalable) methods for detector development
- feature indexing could help search/browsing
Participants submitted runs for all 39 LSCOM-lite features
Used results of the 2005 collaborative training data annotation:
- tools from CMU and IBM (new tool)
- 39 features and about 100 annotators
- multiple annotations of each feature for a given shot
- range of frequencies in the common development data annotation
NIST evaluated 20 (medium-frequency) features from the 39, using a 50% random sample of the submission pools (inferred AP)
HLF is challenging for machine learning
- small, imbalanced training collection
- large variation in examples
- noisy annotations
Decisions to be made:
- find suitable representations
- find optimal fusion strategies
20 LSCOM-lite features evaluated

 1  sports                   26  animal
 3  weather                  27  computer/TV screen
 5  office                   28  US flag
 6  meeting                  29  airplane
10  desert                   30  car
12  mountain                 32  truck
17  waterscape/waterfront    35  people marching
22  corporate leader         36  explosion/fire
23  police/security          38  maps
24  military personnel       39  charts

Note: this is a departure from the numbering scheme used at previous TVs.
High-level feature evaluation
Each feature assumed to be binary: absent or present for each master reference shot
Task: find shots that contain a certain feature, rank them according to a confidence measure, and submit the top 2000
NIST pooled and judged top results from all submissions
Effectiveness evaluated by calculating the inferred average precision of each feature result
Runs compared in terms of mean inferred average precision across the 20 feature results
- to be used for comparison between TV2006 HLF runs
- not comparable with TV2005, TV2004, … figures
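As a concrete illustration of the ranking-and-truncation step, here is a minimal Python sketch (the shot IDs and field layout are hypothetical; the actual TRECVID run syntax is defined in the track guidelines):

```python
def format_run(feature_id, scored_shots, max_results=2000):
    """scored_shots: list of (shot_id, confidence) pairs for one feature.
    Rank by descending confidence and keep at most the top 2000,
    as the task requires."""
    ranked = sorted(scored_shots, key=lambda s: s[1], reverse=True)
    return [(feature_id, rank, shot_id)
            for rank, (shot_id, _) in enumerate(ranked[:max_results], start=1)]

# Hypothetical usage: shots scored by some detector for feature 30 (car)
run = format_run(30, [("shot1_23", 0.91), ("shot4_7", 0.55), ("shot2_88", 0.78)])
```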
Inferred average precision (infAP)
Just developed* by Emine Yilmaz and Javed A. Aslam at Northeastern University
Estimates average precision surprisingly well using a surprisingly small sample of judgments from the usual submission pools
Experiments on TRECVID 2005 feature submissions confirmed the quality of the estimate, in terms of both actual scores and system ranking
* J. A. Aslam, V. Pavlu, and E. Yilmaz. A Statistical Method for System Evaluation Using Incomplete Judgments. Proceedings of the 29th ACM SIGIR Conference, Seattle, 2006.
Inferred average precision (infAP): experiments with 2005 data
- Pool submitted results down to a depth of at least 200 items
- Manually judge the pools, forming a base set of judgments (100% judged)
- Create 4 sampled sets of judgments by randomly marking some results "unjudged":
  20% unjudged -> 80% sample
  40% unjudged -> 60% sample
  60% unjudged -> 40% sample
  80% unjudged -> 20% sample
- Evaluate all systems that submitted results for all features in 2005, using the base set and each of the 4 sampled judgment sets, with infAP
- By definition, infAP on a 100% sample of the base judgment set is identical to average precision (AP)
- Compare infAP measured on the sampled judgment sets to standard AP
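The estimator itself is compact. Below is a minimal Python sketch of infAP next to standard AP, following the expected-precision-at-rank-k derivation in the Aslam/Pavlu/Yilmaz paper; the smoothing constant `eps` and the handling of edge cases are assumptions here, and trec_eval's implementation may differ in detail:

```python
def average_precision(ranked, qrels):
    """Standard AP given complete 0/1 judgments (qrels: doc -> 0/1)."""
    num_rel = sum(qrels.values())
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if qrels.get(doc, 0):
            hits += 1
            total += hits / k
    return total / num_rel if num_rel else 0.0

def inferred_ap(ranked, pool, judged, eps=1e-6):
    """infAP: `pool` is the full judgment pool (set of docs); `judged`
    maps a random sample of the pool to 0/1. Pool docs absent from
    `judged` are 'unjudged'; docs outside the pool count as nonrelevant."""
    exp_prec = []               # E[precision@k] at each sampled relevant rank
    in_pool = rel = nonrel = 0  # counts over the docs at ranks 1..k-1
    for k, doc in enumerate(ranked, start=1):
        if judged.get(doc) == 1:
            if k == 1:
                exp_prec.append(1.0)
            else:
                within = in_pool / (k - 1)  # fraction above k inside the pool
                rel_rate = (rel + eps) / (rel + nonrel + 2 * eps)
                exp_prec.append(1.0 / k + (k - 1) / k * within * rel_rate)
        if doc in pool:
            in_pool += 1
            if doc in judged:
                rel += judged[doc]
                nonrel += 1 - judged[doc]
    # Sampled relevant ranks are a uniform sample of all relevant ranks,
    # so their mean expected precision estimates AP.
    return sum(exp_prec) / len(exp_prec) if exp_prec else 0.0
```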
[Figure] 2005: mean infAP computed on the 20%, 40%, 60%, and 80% sampled judgment sets, plotted against MAP on the full judgments — mean infAP scoring approximates MAP scoring very closely.
2005 system rankings change very little when determined by infAP versus AP.

Kendall's tau (normalizes pairwise swaps):
  80% sample  0.9862658
  60% sample  0.9871663
  40% sample  0.9700546
  20% sample  0.951566

Number of significant rank changes (randomization test, p < 0.01):
        Swap  Lose  Keep  Add
  80%     0    35   2018   37
  60%     0    57   1996   36
  40%     0   104   1949   45
  20%     0   170   1883   73
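For reference, Kendall's tau between two score-induced rankings can be computed directly from pairwise comparisons. A minimal Python sketch (tie handling is simplified here; the official analysis may use a tie-corrected variant):

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau between the rankings induced by two score dicts
    (system -> score), e.g. AP vs. infAP per system. Pairs ordered the
    same way in both rankings are concordant; swapped pairs are
    discordant. Ties are ignored (a simplification)."""
    concordant = discordant = 0
    for s1, s2 in combinations(scores_a, 2):
        d = (scores_a[s1] - scores_a[s2]) * (scores_b[s1] - scores_b[s2])
        if d > 0:
            concordant += 1
        elif d < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) / 2
    return (concordant - discordant) / n_pairs
```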
2006: inferred average precision (infAP)
Submissions for each of the 20 features were pooled down to about depth 120, varying the pool depth per feature so that each feature pool contained ~6,500 shots
A 50% random sample of each pool was then judged: 66,769 total judgments (~125 hours of video)
Judgment process: one assessor per feature, who watched the complete shot while listening to the audio
infAP was calculated over the judged and unjudged pool by trec_eval
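A minimal sketch of this pooling-and-sampling step (Python; the incremental depth search and its step size are a plausible reconstruction for illustration, not NIST's documented procedure):

```python
import random

def build_pool(runs, target_size=6500, step=10):
    """Union of the top-`depth` shots from every run for one feature,
    deepening until the pool reaches roughly target_size shots
    (assumed depth search; the 2006 per-feature depths were ~120)."""
    max_depth = max(len(r) for r in runs)
    depth = step
    while True:
        pool = {shot for run in runs for shot in run[:depth]}
        if len(pool) >= target_size or depth >= max_depth:
            return pool, depth
        depth += step

def sample_for_judging(pool, rate=0.5, seed=0):
    """The 50% random sample of the pool that assessors actually judge;
    the rest stays 'unjudged' and is handled by infAP."""
    rng = random.Random(seed)
    return {shot for shot in pool if rng.random() < rate}
```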
[Figure] Frequency of hits varies by feature — number of hits in the test data (y-axis, up to ~1,600, with 1% and 2% reference lines) for each evaluated feature (x-axis: 1, 3, 5, 6, 10, 12, 17, 22, 23, 24, 26, 27, 28, 29, 30, 32, 35, 36, 38, 39).
[Figure] Systems can find hits in video from programs not in the training data — % known vs. new-program video: test hours 68/32, pooled shots 68/32, hits 65/35.
2006: 30/54 participants (2005: 22/42; 2004: 12/33)

                                              SB  FE  SE  RU
Bilkent U.                                    --  FE  SE  --
Carnegie Mellon U.                            --  FE  SE  --
City University of Hong Kong (CityUHK)        SB  FE  SE  --
CLIPS-IMAG                                    SB  FE  SE  --
Columbia U.                                   --  FE  SE  --
COST292 (www.cost292.org)                     SB  FE  SE  RU
Fudan U.                                      --  FE  SE  --
FX Palo Alto Laboratory Inc.                  SB  FE  SE  --
Helsinki U. of Technology                     SB  FE  SE  --
IBM T. J. Watson Research Center              --  FE  SE  RU
Imperial College London / Johns Hopkins U.    --  FE  SE  --
NUS / I2R                                     --  FE  SE  --
Institut EURECOM                              --  FE  --  RU
KDDI / Tokushima U. / Tokyo U. of Technology  SB  FE  --  --
K-Space (kspace.qmul.net)                     --  FE  SE  --
2006: 30 participants (continued)

                                              SB  FE  SE  RU
LIP6 - Laboratoire d'Informatique de Paris 6  --  FE  --  --
Mediamill / U. of Amsterdam                   --  FE  SE  --
Microsoft Research Asia                       --  FE  --  --
National Taiwan U.                            --  FE  --  --
NII/ISM                                       --  FE  --  --
Tokyo Institute of Technology                 SB  FE  --  --
Tsinghua U.                                   SB  FE  SE  RU
U. of Bremen TZI                              --  FE  --  --
U. of California at Berkeley                  --  FE  --  --
U. of Central Florida                         --  FE  SE  --
U. of Electro-Communications                  --  FE  --  --
U. of Glasgow / U. of Sheffield               --  FE  SE  --
U. of Iowa                                    --  FE  SE  --
U. of Oxford                                  --  FE  SE  --
Zhejiang U.                                   SB  FE  SE  --

HLF keeps attracting more participants; most of them come back the next year.
Number of runs of each training type

Tr-Type   2006         2005         2004         2003
A         86 (68.8%)   79 (71.8%)   45 (54.2%)   22 (36.7%)
B         32 (25.6%)   24 (21.8%)   27 (32.5%)   20 (33.3%)
C          7 (5.6%)     7 (6.3%)    11 (13.3%)   18 (30.0%)
Total    125          110           83           60

System training type:
A - only on the common dev. collection and the common annotation
B - only on the common dev. collection but not on (just) the common annotation
C - not of type A or B
[Figure] % of true shots by source language (Arabic, Chinese, English) for each evaluated feature, compared against the language breakdown of all test shots.
[Figure] True shots contributed uniquely by team, for each feature.
[Figure] Category A results (top half): inferred AP per run (y-axis, 0 to 0.25); runs shown include those from Tsinghua, IBM, CMU, Columbia, UCF, UC Berkeley, CityUHK, MSRA, ICL/JHU, K-Space, and NTU.