Consumer Video Understanding: A Benchmark Database and an Evaluation of Human & Machine Performance
Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui
Columbia University / Kodak Research
ACM ICMR 2011, Trento, Italy, April 2011
We (consumers) take photos and videos every day, everywhere...
[Image: Barack Obama rally, Texas, 2008. http://www.paulridenour.com/Obama14.JPG]
What are Consumer Videos?
• Original, unedited videos captured by ordinary consumers
– Interesting and very diverse content
– Very weakly indexed: about 3 tags per consumer video on YouTube, vs. 9 tags per YouTube video on average
– Original audio tracks are preserved; good for joint audio-visual analysis
Part I: A Database
Columbia Consumer Video (CCV) Database
Columbia Consumer Video (CCV) Database: 20 categories
Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground
CCV Snapshot
• # videos: 9,317 (210 hrs in total)
• Video genre: unedited consumer videos
• Video source: YouTube.com
• Average length: 80 seconds
• # defined categories: 20
• Annotation method: Amazon Mechanical Turk
[Bar chart: number of videos per category (axis 0–800)]
The trick of digging out consumer videos from YouTube: use the default filename prefix of many digital cameras as a query keyword, e.g., “MVI and parade”.
Existing Databases vs. the CCV Database
• Human Action Recognition
– KTH & Weizmann (constrained environment), 2004–05 → CCV: unconstrained YouTube videos
– Hollywood Database (12 categories, movies), 2008
– UCF Database (50 categories, YouTube videos), 2010 → CCV: higher-level complex events
• Kodak Consumer Video (25 classes, 1,300+ videos), 2007 → CCV: more videos & better-defined categories
• LabelMe Video (many classes, 1,300+ videos), 2009 → CCV: more videos & larger content variations
• TRECVID MED 2010 (3 classes, 3,400+ videos), 2010 → CCV: more videos & categories
Crowdsourcing: Amazon Mechanical Turk
• A web-services API that allows developers to integrate human intelligence directly into their processing
• Example task: “Is this a ‘parade’ video? ○ Yes ○ No”
• Workers receive small financial rewards per task; the workforce is Internet-scale
MTurk: Annotation Interface
• Reward: $0.02 per HIT
• Reliability of labels: each video was assigned to four MTurk workers
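As an illustration only, the sketch below shows how such a yes/no labeling task could be posted today with boto3's MTurk client. This is not the interface the authors used (boto3 postdates this work); the question URL is a placeholder, and only the $0.02 reward and four assignments per video are taken from the slides.

```python
# Hypothetical sketch of posting one yes/no labeling HIT via boto3's MTurk client.
# NOT the original CCV annotation setup; the ExternalURL is a placeholder.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

external_question = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/label?video_id=abc123&amp;category=parade</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

hit = mturk.create_hit(
    Title='Is this a "parade" video?',
    Description="Watch a short YouTube clip and answer yes or no.",
    Reward="0.02",                      # $0.02 per HIT, as on the slide
    MaxAssignments=4,                   # each video labeled by four workers
    AssignmentDurationInSeconds=300,
    LifetimeInSeconds=7 * 24 * 3600,
    Question=external_question,
)
print(hit["HIT"]["HITId"])
```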
Part II: Not Just a Database
An Evaluation of Human & Machine Performance
Human Recognition Performance
• How do we measure human (MTurk worker) recognition accuracy?
– We manually and carefully labeled 896 videos: the golden ground truth
• Consolidation of the 4 sets of labels: precision and recall measured at vote thresholds of 1–4 votes
[Bar chart: precision and recall vs. vote threshold (1-vote to 4-votes)]
• Plus additional manual filtering of 6 positive sample sets: 94% final precision
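A minimal sketch of the vote-threshold consolidation described above: each video's four MTurk labels are merged with a threshold, then scored against the 896-video gold set. The function names and data layout are illustrative, not the authors' code.

```python
# Hedged sketch: consolidate 4 MTurk labels per video by a vote threshold and
# score against a manually verified gold set. Data layout is illustrative.
from typing import Dict, List

def consolidate(worker_labels: Dict[str, List[int]], min_votes: int) -> Dict[str, int]:
    """Mark a video positive if at least `min_votes` of its 4 workers said yes."""
    return {vid: int(sum(votes) >= min_votes) for vid, votes in worker_labels.items()}

def precision_recall(pred: Dict[str, int], gold: Dict[str, int]):
    tp = sum(1 for v in gold if pred.get(v) == 1 and gold[v] == 1)
    fp = sum(1 for v in gold if pred.get(v) == 1 and gold[v] == 0)
    fn = sum(1 for v in gold if pred.get(v) != 1 and gold[v] == 1)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# Sweep the threshold from 1 to 4 votes, as on the slide's chart (toy data here).
worker_labels = {"vid_a": [1, 1, 0, 1], "vid_b": [0, 0, 1, 0]}
gold = {"vid_a": 1, "vid_b": 0}
for t in range(1, 5):
    print(t, precision_recall(consolidate(worker_labels, t), gold))
```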
Human Recognition Performance (cont.)
[Bar chart: per-worker precision and recall, workers sorted by number of submitted HITs]
[Bar chart: per-worker precision and recall, workers sorted by average labeling time per HIT; time shown in seconds on top of the bars]
Confusion Matrices: Ground-truth Labels vs. Human Recognition
Machine Recognition System
• Feature extraction: SIFT, spatial-temporal interest points (STIP), MFCC audio feature
• Classifier: χ² kernel SVM per feature
• Average late fusion of the per-feature outputs
Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, “Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching,” NIST TRECVID Workshop, 2010.
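A hedged sketch of this pipeline using scikit-learn: one SVM with a precomputed χ² kernel per feature, combined by averaging the per-feature scores (late fusion). The kernel and SVM settings are placeholders, not the exact values used for CCV.

```python
# Hedged sketch of the slide's pipeline: one chi-square kernel SVM per feature
# (SIFT / STIP / MFCC bag-of-words), combined by average late fusion.
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_and_score(X_train, y_train, X_test, gamma=1.0, C=1.0):
    """Train an SVM with a precomputed chi-square kernel; return test scores."""
    K_train = chi2_kernel(X_train, X_train, gamma=gamma)   # requires non-negative BoW inputs
    K_test = chi2_kernel(X_test, X_train, gamma=gamma)
    clf = SVC(kernel="precomputed", C=C, probability=True)
    clf.fit(K_train, y_train)
    return clf.predict_proba(K_test)[:, 1]

def late_fusion(features, y_train):
    """features: dict mapping feature name -> (X_train, X_test) bag-of-words matrices."""
    scores = [train_and_score(Xtr, y_train, Xte) for Xtr, Xte in features.values()]
    return np.mean(scores, axis=0)   # average late fusion across the 3 features
```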
Best Performance in the TRECVID 2010 Multimedia Event Detection (MED) Task
• Run 1: Run 2 + “Batter” reranking
• Run 2: Run 3 + scene/audio/action context
• Run 3: Run 6 + EMD temporal matching
• Run 4: Run 6 + scene/audio/action context
• Run 5: Run 6 + scene/audio context
• Run 6: baseline classification with 3 features
[Bar chart: mean minimal normalized cost (lower is better) for runs r1–r6]
Three Audio-Visual Features
• SIFT (visual) – D. Lowe, IJCV ’04
• STIP (visual) – I. Laptev, IJCV ’05
• MFCC (audio) – computed over short (16 ms) audio frames
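For illustration, the sketch below extracts two of these features with common open-source libraries (OpenCV for SIFT, librosa for MFCC). These are stand-ins rather than the original extractors, STIP is omitted (it relies on Laptev's separate binary), and the 16 ms framing is taken from the slide's diagram.

```python
# Hedged sketch: raw SIFT and MFCC descriptors via OpenCV and librosa.
# Not the original CCV feature extractors; parameter choices are illustrative.
import cv2
import librosa

def sift_descriptors(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)   # 128-D per keypoint
    return descriptors

def mfcc_frames(audio_path, sr=22050, frame_ms=16):
    y, sr = librosa.load(audio_path, sr=sr)
    hop = int(sr * frame_ms / 1000)                  # ~16 ms hop between frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    return mfcc.T                                    # one 13-D MFCC vector per frame
```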
Bag-of-X Representation (X = SIFT / STIP / MFCC)
• Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)
• Bag-of-SIFT; bag of audio words / bag of frames: K. Lee and D. Ellis, “Audio-Based Semantic Concept Classification for Consumer Video,” IEEE Trans. on Audio, Speech, and Language Processing, 2010
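A minimal sketch of the soft-weighting idea from Jiang, Ngo & Yang (CIVR 2007): each local descriptor votes for its K nearest codewords, with the weight roughly halving at each rank. K, the distance metric, and all names are illustrative, not the exact settings used here.

```python
# Hedged sketch of soft-weighted bag-of-words assignment: the i-th nearest
# codeword receives weight 1/2^(i-1), following the CIVR 2007 scheme.
import numpy as np

def soft_bow(descriptors, codebook, K=4):
    """descriptors: (N, d) local features; codebook: (V, d) visual words."""
    hist = np.zeros(codebook.shape[0])
    for d in descriptors:
        dists = np.linalg.norm(codebook - d, axis=1)
        nearest = np.argsort(dists)[:K]              # K nearest visual words
        for rank, word in enumerate(nearest):
            hist[word] += 1.0 / (2 ** rank)          # weight halves at each rank
    return hist / max(hist.sum(), 1e-12)             # L1-normalized histogram
```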
Machine Recognition Accuracy
• Measured by average precision
• SIFT works the best for event detection
• The 3 features are highly complementary!
[Bar chart: per-category average precision (0–0.9) for Prior, MFCC, STIP, SIFT, SIFT+STIP, and SIFT+STIP+MFCC]
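A short sketch of how per-category average precision (and the mean over the 20 categories) could be computed from ranked test scores, e.g. with scikit-learn's standard AP metric; the dict layout is assumed, not taken from the released evaluation code.

```python
# Hedged sketch: per-category average precision over ranked test scores, plus mAP.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true_per_cat, scores_per_cat):
    """Both arguments: dicts mapping category name -> array over test videos."""
    aps = {c: average_precision_score(y_true_per_cat[c], scores_per_cat[c])
           for c in y_true_per_cat}
    return aps, float(np.mean(list(aps.values())))   # per-category APs and mAP
```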
Human vs. Machine
• Human has much better recall, and is much better on non-rigid objects
• Machine is close to human on top-of-list precision
[Bar charts: precision at 90% recall and precision at 59% recall, machine vs. human]
Human vs. Machine: Confusion Matrices (Human Recognition vs. Machine Recognition)
Human vs. Machine: Result Examples
Example frames grouped into: true positives found by human & machine, by human only, by machine only; false positives found by human only, by machine only (n/a where no examples were found).
• Wedding dance: 93.3% vs. 92.9% (human vs. machine)
• Soccer: 87.5% vs. 53.8%
• Cat: 93.5% vs. 46.8%
Download
• Unique YouTube video IDs
• Labels
• Training/test partition
• Three audio/visual features
http://www.ee.columbia.edu/dvmm/CCV/
Fill out this …
Thank you!