Recognizing Complex Events in Internet Videos with Audio-Visual Features
Yu-Gang Jiang (yjiang@ee.columbia.edu)
In collaboration with Xiaohong Zeng¹, Guangnan Ye¹, Subh Bhattacharya², Dan Ellis¹, Mubarak Shah², Shih-Fu Chang¹, Alexander C. Loui³
¹Columbia University, ²University of Central Florida, ³Kodak Research Labs
We take photos/videos every day, everywhere...
[Image: Barack Obama Rally, Texas, 2008. http://www.paulridenour.com/Obama14.JPG]
Outline
• A System for Recognizing Events in Internet Videos
  – Best performance in the TRECVID 2010 Multimedia Event Detection Task
  – Features, Kernels, Context, etc.
• Internet Consumer Video Analysis
  – A Benchmark Database
  – An Evaluation of Human & Machine Performance
The TRECVID Multimedia Event Detection Task
• Target: find videos containing an event of interest
• Data: unconstrained Internet videos
  – 1700+ training videos (~50 positives per event); 1700+ test videos
• Example events: Making a cake, Assembling a shelter, Batting a run in
The system: 3 major components
[System diagram: feature extraction (SIFT, spatial-temporal interest point, MFCC audio) → classifiers (χ² SVM, EMD-kernel SVM for temporal matching) → semantic diffusion with contextual detectors over 21 scene, action, and audio concepts → event scores for Making a cake, Assembling a shelter, Batting a run in]
Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, S. Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, "Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching," in TRECVID 2010.
Best performance in the TRECVID 2010 Multimedia Event Detection (MED) task
[Bar chart: Mean Minimal Normalized Cost for runs r1–r6; lower is better]
• Run1: Run2 + "Batter" Reranking
• Run2: Run3 + Scene/Audio/Action Context
• Run3: Run6 + EMD Temporal Matching
• Run4: Run6 + Scene/Audio/Action Context
• Run5: Run6 + Scene/Audio Context
• Run6: Baseline Classification with 3 features
Per-event performance
[Bar charts: Minimal Normalized Cost (MNC) per run for "Batting a run in", "Assembling a shelter", and "Making a cake"]
• Run1: Run2 + "Batter" Reranking
• Run2: Run3 + Scene/Audio/Action Context
• Run3: Run6 + EMD Temporal Matching
• Run4: Run6 + Scene/Audio/Action Context
• Run5: Run6 + Scene/Audio Context
• Run6: Baseline Classification with 3 features
Roadmap > audio-visual features
[System diagram repeated; this part covers the feature extraction component]
Three audio-visual features
• SIFT (visual) – D. Lowe, IJCV 2004
• STIP (visual) – I. Laptev, IJCV 2005
• MFCC (audio) – computed over short (~16 ms) frames of the soundtrack
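For concreteness, here is a minimal sketch of how two of these features could be extracted with common open-source tools (OpenCV for SIFT, librosa for MFCC). The library choices, parameters, and the ~16 ms frame length are illustrative assumptions rather than the system's exact extractors, and STIP is omitted.

```python
# Sketch only: extract SIFT descriptors from a frame and MFCCs from the soundtrack.
import cv2                # pip install opencv-python
import librosa            # pip install librosa
import numpy as np

def extract_sift(frame_bgr):
    """Return SIFT descriptors (n_keypoints x 128) for one video frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    _, descriptors = sift.detectAndCompute(gray, None)
    return descriptors if descriptors is not None else np.empty((0, 128))

def extract_mfcc(audio_path, n_mfcc=13):
    """Return MFCC frames (n_frames x n_mfcc) from a video's audio track."""
    y, sr = librosa.load(audio_path, sr=None)
    # Short analysis windows, roughly matching the ~16 ms frames mentioned on the slide.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.016 * sr), hop_length=int(0.008 * sr))
    return mfcc.T
```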
Bag-of-X representation (X = SIFT / STIP / MFCC)
• Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)
[Figure: bag-of-SIFT example]
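A minimal sketch of building a soft-weighted bag-of-X histogram is shown below, assuming a codebook has already been learned (e.g. via k-means over training descriptors). Following the CIVR 2007 soft-weighting idea, each descriptor votes for its K nearest codewords with geometrically decaying weights; the codebook size, K, and the distance metric are illustrative choices.

```python
# Sketch only: soft-weighted bag-of-X histogram for one video or keyframe.
import numpy as np
from scipy.spatial.distance import cdist

def soft_bow(descriptors, codebook, K=4):
    """descriptors: (n, d) local features (SIFT / STIP / MFCC frames).
    codebook: (vocab_size, d) visual/audio words.
    Returns an L1-normalized vocab_size-dim histogram."""
    hist = np.zeros(len(codebook))
    if len(descriptors) == 0:
        return hist
    dists = cdist(descriptors, codebook)          # distances to all codewords
    nearest = np.argsort(dists, axis=1)[:, :K]    # K nearest words per descriptor
    for rank in range(K):
        weight = 1.0 / (2 ** rank)                # 1, 1/2, 1/4, ... for the i-th nearest word
        np.add.at(hist, nearest[:, rank], weight)
    return hist / hist.sum()
```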
Results of audio-visual features
• Measured by Average Precision (AP)

                    Assembling a shelter   Batting a run in   Making a cake   Mean AP
  Visual STIP              0.468                0.719             0.476        0.554
  Visual SIFT              0.353                0.787             0.396        0.512
  Audio MFCC               0.249                0.692             0.270        0.404
  STIP+SIFT                0.508                0.796             0.476        0.593
  STIP+SIFT+MFCC           0.533                0.873             0.493        0.633

• STIP is the strongest single feature for event detection
• The three features are highly complementary
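The complementarity suggests fusing the three bag-of-X features. One common way to do this, sketched below, is to average per-feature χ² kernels and feed the fused kernel to a precomputed-kernel SVM; the equal kernel weights and the scikit-learn API are assumptions for illustration, not necessarily the fusion strategy used in the system.

```python
# Sketch only: fuse STIP/SIFT/MFCC bag-of-X histograms via averaged chi-square kernels.
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def fused_chi2_kernel(feats_a, feats_b, gamma=1.0):
    """feats_a, feats_b: lists of (n_videos x vocab_size) histograms,
    one entry per modality, e.g. [sift_bow, stip_bow, mfcc_bow]."""
    kernels = [chi2_kernel(Xa, Xb, gamma=gamma) for Xa, Xb in zip(feats_a, feats_b)]
    return np.mean(kernels, axis=0)               # equal-weight kernel averaging

# Hypothetical usage with precomputed histograms and binary event labels:
# K_train = fused_chi2_kernel(train_feats, train_feats)
# K_test  = fused_chi2_kernel(test_feats, train_feats)
# clf = SVC(kernel="precomputed", probability=True).fit(K_train, labels)
# scores = clf.predict_proba(K_test)[:, 1]
```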
Roadmap > temporal matching
[System diagram repeated; this part covers the EMD-based temporal matching]
Temporal matching with an EMD kernel
• Earth Mover's Distance (EMD) between two videos, each represented as a set of temporally ordered clips
  Given two clip sets P = {(p_1, w_p1), ..., (p_m, w_pm)} and Q = {(q_1, w_q1), ..., (q_n, w_qn)}, the EMD is computed as
    EMD(P, Q) = Σ_i Σ_j f_ij d_ij / Σ_i Σ_j f_ij
  where d_ij is the χ² visual-feature distance between clips p_i and q_j, and f_ij (the weight transferred from p_i to q_j) is optimized by minimizing the overall transportation cost Σ_i Σ_j f_ij d_ij.
• EMD kernel: K(P, Q) = exp(−ρ · EMD(P, Q))
Y. Rubner, C. Tomasi, L. J. Guibas, "A metric for distributions with applications to image databases," ICCV, 1998.
D. Xu, S.-F. Chang, "Video event recognition using kernel methods with multi-level temporal alignment," PAMI, 2008.
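Below is a small sketch of this matching: the EMD between two clip sets is solved as a transportation linear program and then turned into a kernel. The use of scipy.optimize.linprog and a χ² distance over clip-level histograms are illustrative choices consistent with the formulation above, not the original implementation.

```python
# Sketch only: EMD between two videos' clip sets, followed by the EMD kernel.
import numpy as np
from scipy.optimize import linprog

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two L1-normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def emd(P, wp, Q, wq):
    """P: (m x d) clip histograms with weights wp (m,); Q: (n x d) with wq (n,)."""
    m, n = len(P), len(Q)
    D = np.array([[chi2_distance(p, q) for q in Q] for p in P])  # ground distances d_ij
    c = D.ravel()                                                # minimize sum_ij f_ij * d_ij
    # Row constraints: sum_j f_ij <= wp_i ; column constraints: sum_i f_ij <= wq_j
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([wp, wq])
    # Total flow must equal the smaller of the two total weights.
    A_eq = np.ones((1, m * n))
    b_eq = [min(wp.sum(), wq.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    assert res.success
    flow = res.x
    return (flow * c).sum() / flow.sum()

def emd_kernel(P, wp, Q, wq, rho=1.0):
    """EMD kernel K(P, Q) = exp(-rho * EMD(P, Q))."""
    return np.exp(-rho * emd(P, wp, Q, wq))
```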
Temporal matching results
• EMD is helpful for two events; results measured by Minimal Normalized Cost (lower is better)
[Bar chart: r6-baseline vs. r3-base+EMD per event; roughly a 5% gain from EMD]
Roadmap > contextual diffusion
[System diagram repeated; this part covers the contextual concepts and semantic diffusion]
Event context
• Events generally occur in particular scene settings and with characteristic sounds
  – Understanding this context may help event detection
[Example: the event "Batting a run in" is linked to action concepts (running, walking), scene concepts (baseball field, grass, sky), and audio concepts (speech comprehensible, cheering/clapping)]
Contextual concepts
• 21 concepts are defined and annotated over the TRECVID MED development set

  Human action concepts: person walking; person running; person squatting; person standing up; person making/assembling stuff with hands (hands visible); person batting baseball
  Scene concepts: indoor kitchen; outdoor with grass/trees visible; baseball field; crowd (a group of 3+ people); cakes (close-up view)
  Audio concepts: outdoor rural; outdoor urban; indoor quiet; indoor noisy; original audio; dubbed audio; speech comprehensible; music; cheering; clapping

• SVM classifiers for concept detection
  – STIP for action concepts, SIFT for scene concepts, and MFCC for audio concepts
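A minimal sketch of training such detectors follows: one binary χ²-kernel SVM per concept, using the modality assigned above. The data structures and scikit-learn API are illustrative assumptions, not the original code.

```python
# Sketch only: independent binary concept detectors over precomputed chi-square kernels.
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_concept_detectors(feats_by_modality, labels_by_concept, concept_modality):
    """feats_by_modality: {'stip'|'sift'|'mfcc': (n_videos x vocab) histograms}
    labels_by_concept: {concept_name: (n_videos,) 0/1 annotations}
    concept_modality: {concept_name: 'stip'|'sift'|'mfcc'}"""
    detectors = {}
    for concept, y in labels_by_concept.items():
        X = feats_by_modality[concept_modality[concept]]
        K = chi2_kernel(X, X)                      # train-vs-train kernel
        detectors[concept] = (SVC(kernel="precomputed", probability=True).fit(K, y), X)
    return detectors

def score_concepts(detectors, test_feats_by_modality, concept_modality):
    """Return {concept_name: (n_test,) detection scores} for new videos."""
    scores = {}
    for concept, (clf, X_train) in detectors.items():
        X_test = test_feats_by_modality[concept_modality[concept]]
        scores[concept] = clf.predict_proba(chi2_kernel(X_test, X_train))[:, 1]
    return scores
```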
Concept detection: example results
[Example frames of detected concepts: baseball field, cakes (close-up view), crowd (3+ people), grass/trees, indoor kitchen]
Contextual diffusion model
• Semantic diffusion [Y.-G. Jiang, J. Wang, S.-F. Chang & C.-W. Ngo, ICCV 2009]
  – Semantic graph
    • Nodes are concepts/events
    • Edges represent concept/event correlation
  – Graph diffusion
    • Smooths detection scores w.r.t. the correlations
[Example graph: "Batting a run in" connected to concepts such as Baseball field, Running, and Cheering, with scores/correlations around 0.5–0.9]
Project page and source code: http://www.ee.columbia.edu/ln/dvmm/researchProjects/MultimediaIndexing/DASD/dasd.htm
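Below is a simplified sketch of the idea: detection scores are iteratively blended with the scores of correlated nodes on the semantic graph. This propagation rule is an illustration only, not the exact DASD update from the ICCV 2009 paper, and the graph weights and scores in the example are hypothetical.

```python
# Sketch only: graph-based score smoothing in the spirit of semantic diffusion.
import numpy as np

def diffuse_scores(scores, W, alpha=0.2, n_iter=10):
    """scores: (n_nodes,) initial detection scores for one video.
    W: (n_nodes x n_nodes) non-negative concept/event correlation matrix.
    alpha: how strongly correlated neighbors influence each node per iteration."""
    P = W / (W.sum(axis=1, keepdims=True) + 1e-10)   # row-normalize correlations
    s = scores.astype(float).copy()
    for _ in range(n_iter):
        s = (1 - alpha) * scores + alpha * P @ s      # stay anchored to the original scores
    return s

# Hypothetical example: the "batting a run in" score is pulled up when
# correlated concepts (baseball field, cheering) score highly.
nodes = ["batting a run in", "baseball field", "running", "cheering"]
W = np.array([[0.0, 0.9, 0.5, 0.7],
              [0.9, 0.0, 0.3, 0.6],
              [0.5, 0.3, 0.0, 0.4],
              [0.7, 0.6, 0.4, 0.0]])
raw = np.array([0.45, 0.80, 0.60, 0.75])
print(dict(zip(nodes, np.round(diffuse_scores(raw, W), 3))))
```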
Contextual diffusion results
• Context is slightly helpful for two events; results measured by Minimal Normalized Cost (lower is better)
[Bar chart: r3-base+EMD vs. r2-base+EMD+Scene/Audio/Action context per event; roughly a 3% gain from context]
Outline
• A System for Recognizing Events in Internet Videos
  – Best performance in the TRECVID 2010 Multimedia Event Detection Task
  – Features, Kernels, Context, etc.
• Internet Consumer Video Analysis
  – A Benchmark Database
  – An Evaluation of Human & Machine Performance
Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, "Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance," in ACM ICMR 2011.
What are consumer videos?
• Original unedited videos captured by ordinary consumers
  – Interesting and very diverse content
  – Very weakly indexed: consumer videos on YouTube carry only ~3 tags on average, vs. ~9 tags for a typical YouTube video
  – Original audio tracks are preserved, which is good for joint audio-visual analysis
Columbia Consumer Video (CCV) Database: 20 categories
Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground