MediaMill TRECVID 2009: Multi-Frame, Multi-Modal, and Multi-Kernel Concept Detection in Video (PDF document)

  1. MediaMill TRECVID 2009, 17-11-2009 (http://www.MediaMill.nl)
     Multi-Frame, Multi-Modal, and Multi-Kernel Concept Detection in Video
     Cees G.M. Snoek¹, Koen E.A. van de Sande¹, Jasper R.R. Uijlings¹, Miguel Bugalho², Isabel Trancoso², Fei Yan³, Muhammed A. Tahir³, Krystian Mikolajczyk³, Josef Kittler³, Theo Gevers¹, Dennis C. Koelma¹, Arnold W.M. Smeulders¹
     Conclusions TRECVID 2008
     • Good settings for Bag-of-Words
       – SIFT + colorSIFT improves ~8%
       – Soft codebook assignment improves ~7%
       – Multi-frame analysis improves ~20%

  2. Myth: TRECVID incremental only
     • > 100% improvement in just 3 years
     State-of-the-Art
     • Snoek et al, TRECVID 2008
     • Van de Sande et al, PAMI 2010
     • Van Gemert et al, PAMI 2010

  3. State-of-the-Art
     • Snoek et al, TRECVID 2008
     • Van de Sande et al, PAMI 2010
     • Van Gemert et al, PAMI 2010
     • Software available for download at http://colordescriptors.com
     Our TRECVID 2009 focus
     • Spatio-temporal sampling
     • Visual feature extraction
     • Codebook transform
     • Kernel-based learning
     • Audio concept detection

  4. Our TRECVID 2009 focus
     • Spatio-temporal sampling
     • Visual feature extraction
     • Codebook transform
     • Kernel-based learning
     • Multi-kernel learning
     • Audio concept detection
     Roadmap
     • Spatio-temporal sampling
     • Visual feature extraction
     • Codebook transform
     • Kernel-based learning
     • Audio concept detection

  5. 1,000,000 frames analyzed (Snoek et al, ICME 2005)
     • Multi-frame biggest improvement in 2008
       – Extend further by analyzing up to 10 extra i-frames/shot
       – Yields 1M frames to analyze for the test set collection
     • Need to speed-up by being “smart and strong”
       – Speed-up feature extraction
       – Speed-up quantization
       – Speed-up kernel-based learning
       – Speed-up by computing
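
The slide does not say how the extra i-frames are chosen within a shot. As a minimal sketch, assuming the i-frame indices of a shot are already known and that an even spread over the shot is acceptable, the selection of up to 10 extra frames could look like this. The function name and its parameters are hypothetical, not taken from the MediaMill system.

```python
def select_extra_iframes(iframes, keyframe, max_extra=10):
    """Pick up to max_extra i-frames from one shot, spread evenly.

    iframes  : sorted list of i-frame indices inside the shot (assumed given)
    keyframe : index of the shot's keyframe, which is analyzed anyway
    """
    candidates = [f for f in iframes if f != keyframe]
    n = len(candidates)
    if n <= max_extra:
        return candidates
    # Integer arithmetic keeps the picks exact and strictly increasing.
    return [candidates[(i * n) // max_extra] for i in range(max_extra)]


if __name__ == "__main__":
    shot_iframes = list(range(0, 300, 12))  # hypothetical shot with 25 i-frames
    print(select_extra_iframes(shot_iframes, keyframe=144))
```

With at most 11 frames per shot (keyframe plus 10 extra), the number of frames to analyze grows roughly linearly with the number of shots, which is what drives the speed-up requirements listed above.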

  6. Fast dense descriptors (Uijlings et al, CIVR 2009)
     [Figure: image patch → pixel-wise responses → final descriptor. Labels: linear interpolation (written as a product of matrices A, R and B), reuse subregions. Reported speed-ups: 2x and 16x.]

  7. Fast quantization (Moosmann, PAMI 2008; Uijlings et al, CIVR 2009)
     • Random forests
       – Randomized process makes it very fast to build
       – Tree structure allows fast vector quantization
       – Logarithmic rather than linear projection time
     • Real-time BoW
       – When used with fast dense sampling
       – SURF 2x2 descriptor instead of 4x4
       – RBF kernel
     GPU-empowered quantization (Van de Sande et al, ASCI 2009)
     • Achieve data-parallelism by writing Euclidean distance in vector form
     [Figure: time per image (s) versus number of SIFT descriptors per image (0 to 20,000). Series: CPU Xeon (3.4 GHz), CPU Opteron 250 (2.4 GHz), CPU Core 2 Duo 6400 (2.13 GHz), CPU Core i7 (2.66 GHz), GPU GeForce 8800GTX (128 cores), GPU GeForce GTX260 (216 cores). Annotation: 17x speed-up, GPU versus CPU.]
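
"Euclidean distance in vector form" most likely refers to the identity ||d − c||² = ||d||² + ||c||² − 2·d·c, which turns the distance computation between all descriptors and all codewords into squared norms plus one large batch of dot products (a matrix product on a GPU). The sketch below illustrates that identity in plain Python; it is an illustration under that assumption, not the authors' GPU code, and the function names are hypothetical.

```python
def squared_distances(descriptors, codebook):
    """All-pairs squared Euclidean distances via ||d||^2 + ||c||^2 - 2 d.c."""
    d_norms = [sum(x * x for x in d) for d in descriptors]  # computed once
    c_norms = [sum(x * x for x in c) for c in codebook]     # computed once
    table = []
    for d, dn in zip(descriptors, d_norms):
        row = []
        for c, cn in zip(codebook, c_norms):
            dot = sum(a * b for a, b in zip(d, c))  # the matrix-product term
            row.append(max(dn + cn - 2.0 * dot, 0.0))  # clamp rounding noise
        table.append(row)
    return table


def hard_assign(descriptors, codebook):
    """Index of the nearest codeword for every descriptor."""
    return [min(range(len(row)), key=row.__getitem__)
            for row in squared_distances(descriptors, codebook)]


if __name__ == "__main__":
    print(hard_assign([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [3.0, 3.0]]))
```

The norms depend on only one operand each, so the only term that scales with both the number of descriptors and the codebook size is the dot product, which is the part that parallelizes well.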

  8. SVM pre-computed kernel trick
     • Use distance between feature vectors
       – Feature length easily > 100,000
     • Increase efficiency significantly
       – Pre-compute the SVM kernel matrix
       – Long vectors possible as we only need 2 in memory
       – Parameter optimization re-uses pre-computed matrix
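
The bullets above can be sketched as follows: the pairwise distance matrix is built once while holding only two feature vectors at a time, and each kernel parameter setting then derives its kernel matrix from those stored distances. The χ² distance and the exponential kernel form are assumptions made for illustration (the slide only says "distance between feature vectors"), and `load_vector` is a hypothetical callback.

```python
import math


def chi2_distance(x, y):
    """Chi-square distance between two histograms (assumed distance measure)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def distance_matrix(load_vector, n):
    """Pairwise distances; only two feature vectors are in memory at once.

    load_vector(i) is a hypothetical callback returning feature vector i,
    for example by reading it from disk.
    """
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        xi = load_vector(i)
        for j in range(i + 1, n):
            xj = load_vector(j)
            dist[i][j] = dist[j][i] = chi2_distance(xi, xj)
    return dist


def kernel_from_distances(dist, gamma):
    """Kernel values exp(-gamma * d); cheap to redo for every gamma tried."""
    return [[math.exp(-gamma * d) for d in row] for row in dist]


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    dist = distance_matrix(lambda i: vectors[i], len(vectors))
    for gamma in (0.5, 1.0, 2.0):  # parameter search reuses `dist`
        print(gamma, kernel_from_distances(dist, gamma)[0])
```

A kernel matrix built this way can be handed to any SVM implementation that accepts pre-computed kernels, for instance scikit-learn's `SVC(kernel="precomputed")`; whether the MediaMill system used such a library is not stated on the slide.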

  9. GPU-empowered pre-computed kernel (Van de Sande et al, ASCI 2009)
     1. Compute average distances per N² kernel sub-block
     2. Compute kernel function values
     [Figure: time (s) versus total feature vector length (0 to 140,000). Series: 1x Core i7 (2.66 GHz), 1x Opteron 250 (2.4 GHz), 4x Opteron 250 (2.4 GHz), 16x Opteron 250 (2.4 GHz), 25x Opteron 250 (2.4 GHz), GPU. Annotations: 1 CPU, 4 CPU, 3x speed-up, 65x speed-up.]
     Computing
     • 2009 system much more efficient than 2008 system
       – 6x more visual data analyzed using less compute power
     • Some best estimates:
       – Visual feature extraction: 8400 Processor-Node-Hours (PNH)
       – Training concept detectors: 4000 PNH
       – Applying concept detectors: ~1 week GPU
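
The two numbered steps are terse, so the following is one possible reading rather than the authors' method: distances are filled in one sub-block of the kernel matrix at a time (each sub-block is independent, which suits a GPU), the mean distance is accumulated along the way, and the kernel values are then computed as exp(−d / mean). The L1 distance, the mean normalization and all names below are assumptions made for illustration.

```python
import math


def l1_distance(x, y):
    """Placeholder distance; the slide does not name the one actually used."""
    return sum(abs(a - b) for a, b in zip(x, y))


def blockwise_kernel(vectors, block_size, distance=l1_distance):
    """Fill the distance matrix block by block, then apply the kernel."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    total = 0.0
    # Step 1: each (r0, c0) sub-block is independent of all the others.
    for r0 in range(0, n, block_size):
        for c0 in range(0, n, block_size):
            for i in range(r0, min(r0 + block_size, n)):
                for j in range(c0, min(c0 + block_size, n)):
                    dist[i][j] = distance(vectors[i], vectors[j])
                    total += dist[i][j]
    mean = total / (n * n) if total > 0.0 else 1.0
    # Step 2: kernel function values, normalized by the average distance.
    return [[math.exp(-d / mean) for d in row] for row in dist]


if __name__ == "__main__":
    print(blockwise_kernel([[0.0], [2.0]], block_size=1))
```

The block size changes only the order in which distances are computed, not the result, so it can be tuned to the memory available on the device.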

  10. Audio concept detection (Bugalho et al, InterSpeech 2009; Trancoso et al, ICME 2009)
     • External sound corpus: ~100 concepts
     [Diagram: audio segmentation feeds four branches. Non-speech: feature extraction and SVM classification (sirens, water, …). Speech: speaker ID and reasoning (speech, female voice, …; monologue, dialogue, …). Music detector: music events. Telephone: low frequency detector.]
     • Early fusion of features
       – MFCCs (+ deltas), PLPs (+ deltas), Brightness, Bandwidth, ZCR, Pitch, Harmonicity, Shifted delta cepstra, Audio spectrum envelope and flatness
       – 0.50s window length, with 0.25s spacing
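
The window settings on the slide (0.50 s length, 0.25 s spacing) mean consecutive analysis windows overlap by half. A minimal sketch of that framing, together with one of the listed features (zero-crossing rate, ZCR), is given below; the sample rate and the function names are assumptions, since the slide gives neither.

```python
def frame_bounds(num_samples, sample_rate, window_s=0.50, hop_s=0.25):
    """(start, end) sample indices of every full analysis window."""
    win = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    return [(s, s + win) for s in range(0, num_samples - win + 1, hop)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


if __name__ == "__main__":
    # Two seconds of audio at an assumed 8 kHz sample rate.
    for start, end in frame_bounds(16000, 8000):
        print(start, end)
```

In an early-fusion setup such as the one described, each window would yield one value (or vector) per feature, and these would be concatenated into a single feature vector before classification.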
