Columbia-UCF MED2010: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching
Yu-Gang Jiang¹, Xiaohong Zeng¹, Guangnan Ye¹, Subh Bhattacharya², Dan Ellis¹, Mubarak Shah², Shih-Fu Chang¹
¹ Department of EE, Columbia University
² Department of EECS, University of Central Florida
TRECVID 2010 workshop, NIST, Gaithersburg, MD
The target events:
• Making a cake
• Assembling a shelter
• Batting a run in
Overview: 4 major components & 6 runs
[System diagram] Four components, applied to the three target events (Making a cake, Assembling a shelter, Batting a run in):
• Feature extraction: SIFT, spatial-temporal interest points (STIP), and MFCC audio features
• Classifiers: χ² SVM and EMD-SVM
• Contextual diffusion: semantic diffusion over 21 scene, action, and audio concepts
• Event-specific detection: “Batter” detection for re-ranking
Overview: overall performance
[Chart: Mean Minimal Normalized Cost for runs r1–r6; lower is better]
Run1: Run2 + “Batter” Reranking
Run2: Run3 + Scene/Audio/Action Context
Run3: Run6 + EMD Temporal Matching
Run4: Run6 + Scene/Audio/Action Context
Run5: Run6 + Scene/Audio Context
Run6: Baseline Classification with 3 features
Overview: per-event performance
[Charts: Minimal Normalized Cost (MNC) per run for each event — Batting a run in, Assembling a shelter, Making a cake]
Run1: Run2 + “Batter” Reranking
Run2: Run3 + Scene/Audio/Action Context
Run3: Run6 + EMD Temporal Matching
Run4: Run6 + Scene/Audio/Action Context
Run5: Run6 + Scene/Audio Context
Run6: Baseline Classification with 3 features
Roadmap > multiple modalities
[System diagram; the feature extraction component (SIFT, STIP, MFCC) is highlighted]
Three Feature Modalities
• SIFT (visual) – D. Lowe, IJCV 04.
• STIP (visual) – I. Laptev, IJCV 05.
• MFCC (audio) – computed over short (16 ms) audio frames; see the extraction sketch below.
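As a hedged illustration of the audio modality, here is a minimal MFCC extraction sketch using librosa. The 16 ms framing follows the slide; the number of coefficients, FFT size, and sampling-rate handling are assumptions, not taken from the talk.

```python
import librosa

def mfcc_frames(audio_path, frame_ms=16, n_mfcc=13):
    """Extract one MFCC descriptor per short audio frame.

    The 16 ms framing is from the slide; n_mfcc and the FFT size
    are hypothetical choices for illustration.
    """
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    hop = int(sr * frame_ms / 1000)          # 16 ms hop in samples
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=2 * hop, hop_length=hop)
    return mfcc.T  # (n_frames, n_mfcc); later quantized into a bag-of-MFCC
```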
Bag-of-X Representation
• X = SIFT or STIP or MFCC
• Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)
[Illustration: Bag-of-SIFT]
Soft-weighting in Bag-of-X
• Soft weighting is used for all three Bag-of-X representations (a sketch follows below):
– assign each feature to multiple visual words
– weights are determined by feature-to-word similarity
Details in: Jiang, Ngo and Yang, ACM CIVR 2007.
Image source: http://www.cs.joensuu.fi/pages/franti/vq/lkm15.gif
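A minimal sketch of soft-weighted bag-of-words assignment, assuming a precomputed codebook. The idea (multiple-word assignment with similarity-based weights) follows the CIVR 2007 reference above, but the number of nearest words k, the Gaussian weighting, and sigma are assumptions for illustration.

```python
import numpy as np

def soft_weight_bow(features, codebook, k=4, sigma=1.0):
    """Soft-weighted bag-of-words histogram for one video.

    features: (n, d) local descriptors (SIFT/STIP/MFCC)
    codebook: (w, d) visual/audio words, e.g., from k-means
    k, sigma: hypothetical parameters, not taken from the slides
    """
    hist = np.zeros(len(codebook))
    for f in features:
        dists = np.linalg.norm(codebook - f, axis=1)   # distance to every word
        nearest = np.argsort(dists)[:k]                # k closest words
        w = np.exp(-dists[nearest] ** 2 / (2 * sigma ** 2))
        hist[nearest] += w / w.sum()                   # similarity-weighted votes
    return hist / max(hist.sum(), 1e-12)               # L1-normalize
```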
Results on Dry-run Validation Set
• Measured by Average Precision (AP)

                   Assembling a shelter   Batting a run in   Making a cake   Mean AP
  Visual STIP             0.468                0.719             0.476        0.554
  Visual SIFT             0.353                0.787             0.396        0.512
  Audio MFCC              0.249                0.692             0.270        0.404
  STIP+SIFT               0.508                0.796             0.476        0.593
  STIP+SIFT+MFCC          0.533                0.873             0.493        0.633

• STIP works best for event detection
• The 3 features are highly complementary and should be jointly used for multimedia event detection (a fusion sketch follows below)
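A hedged sketch of the baseline classification with the three bag-of-X features. The slides specify χ² SVM classifiers; the fusion rule shown here (averaging the three χ² kernels) is an assumption, since the exact combination scheme is not given on the slide.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def fused_chi2_svm(feats_train, feats_test, y_train):
    """Train a χ² SVM on averaged kernels from STIP, SIFT, and MFCC.

    feats_train/feats_test: dicts {'stip': ..., 'sift': ..., 'mfcc': ...}
    holding (n_videos, n_words) bag-of-X histograms; y_train: 0/1 event labels.
    Kernel averaging is an assumed fusion rule, not the confirmed one.
    """
    K_train = np.mean([chi2_kernel(feats_train[m], feats_train[m])
                       for m in ('stip', 'sift', 'mfcc')], axis=0)
    K_test = np.mean([chi2_kernel(feats_test[m], feats_train[m])
                      for m in ('stip', 'sift', 'mfcc')], axis=0)
    clf = SVC(kernel='precomputed')
    clf.fit(K_train, y_train)
    return clf.decision_function(K_test)   # event detection scores
```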
Roadmap > temporal matching
[System diagram; the EMD-SVM temporal matching component is highlighted]
Temporal Matching With EMD Kernel
• Earth Mover's Distance (EMD)
Given two frame sets $P = \{(p_1, w_{p_1}), \ldots, (p_m, w_{p_m})\}$ and $Q = \{(q_1, w_{q_1}), \ldots, (q_n, w_{q_n})\}$, the EMD is computed as
$$\mathrm{EMD}(P, Q) = \frac{\sum_i \sum_j f_{ij} d_{ij}}{\sum_i \sum_j f_{ij}}$$
where $d_{ij}$ is the χ² visual-feature distance between frames $p_i$ and $q_j$, and $f_{ij}$ (the weight transferred from $p_i$ to $q_j$) is optimized by minimizing the overall transportation cost $\sum_i \sum_j f_{ij} d_{ij}$ (a computation sketch follows below).
• EMD Kernel: $K(P, Q) = \exp(-\rho \cdot \mathrm{EMD}(P, Q))$

Y. Rubner, C. Tomasi, L. J. Guibas, “A metric for distributions with applications to image databases”, ICCV, 1998.
D. Xu, S.-F. Chang, “Video event recognition using kernel methods with multi-level temporal alignment”, PAMI, 2008.
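A minimal sketch of the EMD as a transportation linear program, using scipy. The χ² ground distance follows the slide; the LP details (inequality marginals plus an equality on total flow) are the standard Rubner et al. formulation and are assumed here rather than taken from the talk.

```python
import numpy as np
from scipy.optimize import linprog

def chi2_dist(x, y, eps=1e-10):
    # χ² distance between two non-negative frame-level feature histograms
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def emd(P, wP, Q, wQ):
    """EMD between frame sets P (m x d) and Q (n x d) with frame
    weights wP (m,) and wQ (n,), solved as a transportation LP."""
    m, n = len(P), len(Q)
    D = np.array([[chi2_dist(p, q) for q in Q] for p in P])  # ground distances d_ij
    c = D.ravel()                                            # flatten in f_ij order
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):                    # supply: sum_j f_ij <= wP_i
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):                    # demand: sum_i f_ij <= wQ_j
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([wP, wQ])
    A_eq = np.ones((1, m * n))            # total flow = min(total supply, demand)
    b_eq = [min(wP.sum(), wQ.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None))
    f = res.x
    return float(f @ c) / f.sum()         # EMD = sum f_ij d_ij / sum f_ij

def emd_kernel(P, wP, Q, wQ, rho=1.0):
    # K(P, Q) = exp(-rho * EMD(P, Q)); the rho value is an assumption
    return np.exp(-rho * emd(P, wP, Q, wQ))
```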
Temporal Matching Results
• EMD is helpful for two events (~5% gain over the baseline)
– results measured by minimal normalized cost (lower is better)
[Chart: Minimal Normalized Cost per event, r6-baseline vs. r3-base+EMD]
Roadmap > contextual diffusion
[System diagram; the contextual diffusion component (21 scene, action, audio concepts) is highlighted]
Event Context
• Events generally occur under particular scene settings, with characteristic audio sounds
– understanding context may therefore help event detection
Example — Batting a run in:
• Action concepts: running, walking
• Scene concepts: baseball field, grass, sky
• Audio concepts: speech comprehensible, cheering/clapping
Contextual Concepts
• 21 concepts are defined and annotated over the MED development set.

Human Action Concepts:
– Person walking
– Person running
– Person squatting
– Person standing up
– Person making/assembling things with hands (hands visible)
– Person batting baseball

Scene Concepts:
– Indoor kitchen
– Outdoor with grass/trees visible
– Baseball field
– Crowd (a group of 3+ people)
– Cakes (close-up view)

Audio Concepts:
– Outdoor rural
– Outdoor urban
– Indoor quiet
– Indoor noisy
– Original audio
– Dubbed audio
– Speech comprehensible
– Music
– Cheering
– Clapping

• SVM classifiers for concept detection – STIP for action concepts, SIFT for scene concepts, and MFCC for audio concepts (see the sketch below)

Jingen Liu, Jiebo Luo & Mubarak Shah, “Recognizing Realistic Actions from Videos ‘in the Wild’”, CVPR 2009.
Shih-Fu Chang et al., “Columbia University/VIREO-CityU/IRIT TRECVID2008 High-Level Feature Extraction and Interactive Video Search”, TRECVID Workshop, 2008.
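A minimal sketch of one per-concept detector, assuming a binary χ² SVM over the bag-of-X histograms in the modality matched to the concept type; the probability calibration used to produce diffusion-ready scores is an assumption.

```python
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def concept_scores(X_train, y_train, X_test):
    """Binary χ² SVM for one contextual concept.

    X_*: (n_videos, n_words) bag-of-X histograms in the matched modality
    (STIP for actions, SIFT for scenes, MFCC for audio); y_train: 0/1 labels.
    """
    clf = SVC(kernel='precomputed', probability=True)
    clf.fit(chi2_kernel(X_train, X_train), y_train)
    # probability of concept presence, one score per test video
    return clf.predict_proba(chi2_kernel(X_test, X_train))[:, 1]
```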
Concept Detection: example results
[Example frames for: Baseball field, Cakes (close-up view), Crowd (3+ people), Grass/trees, Indoor kitchen]
Contextual Diffusion Model
• Semantic Diffusion [Jiang, Wang, Chang & Ngo, ICCV 2009] (a simplified sketch follows below)
– Semantic graph
  • Nodes are concepts/events
  • Edges represent concept/event correlation
– Graph diffusion
  • Smooth detection scores w.r.t. the correlations
[Example graph: “Batting a run in” connected to Baseball field, Running, and Cheering with correlation weights 0.9, 0.8, 0.7, 0.5]
Project page and source code: http://www.ee.columbia.edu/ln/dvmm/researchProjects/MultimediaIndexing/DASD/dasd.htm
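A simplified sketch of graph-based score smoothing in the spirit of semantic diffusion. The actual DASD algorithm (see the project page above) uses a different energy function and update rule, so the symmetric normalization, step size, and iteration count here are assumptions.

```python
import numpy as np

def semantic_diffusion(scores, W, alpha=0.1, n_iters=50):
    """Smooth per-video concept/event scores over a correlation graph.

    scores: (n_videos, n_concepts) initial detection scores
    W: (n_concepts, n_concepts) non-negative concept-correlation matrix
    alpha, n_iters: hypothetical step size and iteration count
    """
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))   # symmetric normalization of the graph
    Y = scores.copy()
    for _ in range(n_iters):
        # pull each concept's scores toward those of correlated concepts
        Y = (1 - alpha) * Y + alpha * (Y @ S)
    return Y
```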
Contextual Diffusion Results
• Context is slightly helpful for two events (2–3% gain)
– results measured by minimal normalized cost (lower is better)
[Chart: Minimal Normalized Cost per event, r3-baseEMD vs. r2-baseEMDSceAudAct]
Contextual Diffusion Results
• … but the improvement is much higher when the context is perfect (on a validation set)
– results measured by average precision (higher is better)
[Chart: Average Precision per event, baseline vs. context diffusion]
Roadmap > reranking with event-specific object detector
[System diagram; the “Batter” detection and re-ranking component is highlighted]
Reranking with Event-Specific Object Detector
• A “Batter” detector is trained within an AdaBoost framework
• Pipeline: initial ranking → “Batter” detection → reranking based on the ratio of detected objects (a fusion sketch follows below)
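A hedged sketch of the reranking step. The slides state that reranking is based on the ratio of detected objects but do not give the fusion rule; the linear combination and the weight beta below are assumptions for illustration.

```python
import numpy as np

def rerank(initial_scores, batter_ratios, beta=0.5):
    """Rerank 'Batting a run in' results with the batter detector.

    initial_scores: (n_videos,) scores from the classification pipeline
    batter_ratios: (n_videos,) fraction of frames where the AdaBoost
        "Batter" detector fires
    beta: hypothetical fusion weight, not specified on the slides
    """
    fused = (1 - beta) * initial_scores + beta * batter_ratios
    order = np.argsort(-fused)          # indices from best to worst
    return order, fused
```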
Lessons learned
1. STIP is powerful for event detection.
2. Combining multiple audio-visual features is very effective.
3. Temporal matching with EMD is useful for some events.
4. Diffusion with contextual concepts is promising and deserves deeper research.

Future Work
1. Explore deep joint audio-visual representations, e.g., Audio-Visual Atoms [Jiang et al., ACM MM 2009].
2. Investigate adaptive methods to find the best components for each event.