Adaptive Feature Discovery for TRECVID Broadcast News Video Story Segmentation


  1. Adaptive Feature Discovery for TRECVID Broadcast News Video Story Segmentation
     @TRECVID Workshop 2004, Nov. 15-16
     Winston Hsu¹, Lyndon Kennedy¹, Shih-Fu Chang¹, Martin Franz³, John Smith², Giridharan Iyengar³
     ¹ Dept. of Electrical Engineering, Columbia University, New York, NY
     ² IBM T. J. Watson Research Center, Hawthorne, NY
     ³ IBM T. J. Watson Research Center, Yorktown Heights, NY
     http://www.ee.columbia.edu/~winston
     digital video | multimedia lab - Winston H.-M. Hsu -

  2. trecvid workshop, 11/15/2004
     Outline
     - Features and fusion strategies
       - Multi-modal features at different observation windows (e.g., prosody, visual cues, text)
       - Fusion with Support Vector Machines
     - New focus in 2004
       - Automatic Visual Cue Cluster Construction (VC³ framework)
       - Ability to handle diverse production events
       - Thorough error analysis for different genres
     - Brief comparison with last year's results

  3. Story Segmentation Model
     - Determine the candidate points: the union of pauses and shot boundaries, with a fuzzy window of 2.5 sec

  4. Story Segmentation Model
     - Determine the candidate points: the union of pauses and shot boundaries, with a fuzzy window of 2.5 sec
     - Extract and aggregate relevant features from the surrounding windows, taking into account asynchronous multi-modal features, e.g., text and audio

  5. Story Segmentation Model
     - Determine the candidate points: the union of pauses and shot boundaries, with a fuzzy window of 2.5 sec
     - Extract and aggregate relevant features from the surrounding windows, taking into account asynchronous multi-modal features, e.g., text and audio
     - Classify the candidate points as "boundary" or "non-boundary" using SVMs with RBF kernels
     - Post-processing
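The candidate-point step can be sketched in a few lines of Python. This is a minimal sketch, assuming that "fuzzy window" means pause and shot-boundary timestamps lying within 2.5 s of each other are treated as a single candidate. The function name `candidate_points` and the choice to average merged timestamps are illustrative assumptions, not details given in the slides.

```python
def candidate_points(pauses, shot_boundaries, fuzzy_window=2.5):
    """Merge two lists of timestamps (in seconds) into candidate points.

    Assumption: timestamps closer than `fuzzy_window` to the previous
    timestamp in the sorted union are grouped into the same candidate,
    and each group is represented by its mean time.
    """
    groups = []
    for t in sorted(set(pauses) | set(shot_boundaries)):
        if groups and t - groups[-1][-1] <= fuzzy_window:
            groups[-1].append(t)   # within the window: same candidate
        else:
            groups.append([t])     # otherwise start a new candidate
    return [sum(g) / len(g) for g in groups]


# Hypothetical timestamps, for illustration only.
pauses = [10.0, 31.0, 60.5]
shot_boundaries = [11.5, 45.0, 60.0]
print(candidate_points(pauses, shot_boundaries))
# [10.75, 31.0, 45.0, 60.25]
```

Each resulting candidate would then receive a feature vector from its surrounding windows and be passed to the boundary/non-boundary classifier.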

  6. Raw Multi-Modal Features

     Modality   Raw Features              Dim.*
     Visual     Visual Cue Clusters       15~40
                commercial                2
                motion                    2
     Audio      pause                     1
                prosody features          30
                speaker change            1
                speech rapidity           1
     Text       text story seg. scores    1

     * before taking into account different observation windows

  7. Visual Cue Cluster Construction (VC³)
     - Motivation
       - News channels usually have different visual production events across channels or over time, and these events are statistically relevant to story boundaries
       - The usual approach is to manually enumerate all the production events by inspection and then train classifiers for them, e.g. ANCHOR, STUDIO, WEATHER, CNN_HEADLINE, etc.
       - Problem: deploying on multiple channels in multiple countries
     - We hope to find a systematic way to capture "visual cue clusters"
       - Analogous to cue words or cue word clusters in text
       - Automatically, rather than by human inspection
     - Avoid time-consuming news production annotations via Information Bottleneck clustering!

  8. VC³: the Information Bottleneck Principle
     - Cluster the features X into clusters C while still trying to preserve the mutual information I(C;Y) with the label space Y
     - If β → ∞, we get a hard partitioning; we only care about maximizing I(C;Y), that is, minimizing the information loss I(X;Y) − I(C;Y)
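For reference, the objective can be written out in the standard Information Bottleneck notation. The symbols below (X for features, C for clusters, Y for labels, β for the trade-off weight) follow the usual IB formulation and are an assumption about what the slide showed, not a transcription of it.

```latex
% Standard IB functional over the (soft) assignment p(c|x):
\[
  \mathcal{L}\bigl[p(c \mid x)\bigr] \;=\; I(C;X) \;-\; \beta \, I(C;Y),
  \qquad
  I(C;Y) \;=\; \sum_{c}\sum_{y} p(c,y)\,\log\frac{p(c,y)}{p(c)\,p(y)} .
\]
% As beta grows without bound the compression term is ignored and the
% assignment becomes hard; maximizing I(C;Y) is then the same as
% minimizing the (non-negative) information loss:
\[
  \beta \to \infty:
  \qquad
  \max_{C} \, I(C;Y)
  \;\Longleftrightarrow\;
  \min_{C} \, \bigl[\, I(X;Y) - I(C;Y) \,\bigr] .
\]
```

The equivalence holds because I(X;Y) is fixed by the data, and I(C;Y) can never exceed it when C is computed from X alone.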

  9. VC³ Overview: a Simple Example

  10. VC³ Overview: a Simple Example
      [Figure: example items grouped into clusters c1, c2, c3]
      - Items (features) in the same cluster tend to have similar probability distributions over the event labels Y -> semantic consistency!!
      - MI contributions from different clusters -> feature selection
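The idea of ranking clusters by their mutual-information contribution can be illustrated with a short Python sketch. It assumes a small joint table `joint[c][y] = p(c, y)` is already available; the function names and the toy numbers are hypothetical and serve only to show the computation.

```python
import math


def mi_contributions(joint):
    """Per-cluster contribution to I(C;Y), in bits.

    joint[c][y] is the joint probability p(c, y); all entries sum to 1.
    The contribution of cluster c is sum_y p(c,y) * log2(p(c,y) / (p(c) p(y))).
    """
    n_labels = len(joint[0])
    p_y = [sum(row[y] for row in joint) for y in range(n_labels)]
    contributions = []
    for row in joint:
        p_c = sum(row)
        contributions.append(sum(
            p * math.log2(p / (p_c * p_y[y]))
            for y, p in enumerate(row) if p > 0.0
        ))
    return contributions


def select_clusters(joint, k):
    """Indices of the k clusters with the largest MI contribution."""
    contributions = mi_contributions(joint)
    ranked = sorted(range(len(joint)), key=lambda c: contributions[c], reverse=True)
    return ranked[:k]


# Toy joint distribution: 3 clusters x 2 labels (boundary, non-boundary).
toy_joint = [
    [0.30, 0.05],  # c1: mostly "boundary"
    [0.05, 0.30],  # c2: mostly "non-boundary"
    [0.15, 0.15],  # c3: uninformative about the label
]
print(mi_contributions(toy_joint))
print(select_clusters(toy_joint, 2))
```

In this toy table the third cluster is independent of the label, so its contribution is essentially zero and it would be the first one dropped.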

  11. VC³ Overview: Joint Probability Approximation
      - For IB clustering, we essentially need the joint probability p(x, y)
      - However, video features are not discrete but continuous!
      - Approximate the joint probability via kernel density estimation from existing feature observations: a Gaussian kernel with a specific kernel bandwidth, weighted by the observed event probability conditioned on the feature
      - Embed prior knowledge in the kernel functions and the kernel bandwidth (D-dimensional)
        - Gaussian kernel (diagonal)
      - Raw features: autocorrelogram, color moments, and Gabor texture
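A rough Python sketch of the kernel-density approximation follows. It assumes the estimate has the form p(x_i, y) ∝ Σ_j K(x_i − x_j) · p(y | x_j), with a diagonal Gaussian kernel and with the observed event probability p(y | x_j) taken as a one-hot indicator of each sample's label. Both the exact form and the names `gaussian_kernel` and `joint_probability` are assumptions made for illustration.

```python
import math


def gaussian_kernel(x, xi, bandwidth):
    """Unnormalized diagonal Gaussian kernel; `bandwidth` has one value per dimension."""
    return math.exp(-0.5 * sum(
        ((a - b) / s) ** 2 for a, b, s in zip(x, xi, bandwidth)
    ))


def joint_probability(samples, labels, bandwidth, n_labels):
    """Estimate p(x_i, y) at every observed sample x_i.

    Assumption: each sample carries a single integer label, so the
    observed p(y | x_j) is 1 for that label and 0 otherwise.
    Returns a table joint[i][y] whose entries sum to 1.
    """
    joint = []
    for x in samples:
        row = [0.0] * n_labels
        for xj, yj in zip(samples, labels):
            row[yj] += gaussian_kernel(x, xj, bandwidth)
        joint.append(row)
    total = sum(sum(row) for row in joint)
    return [[v / total for v in row] for row in joint]


# Hypothetical 2-D features; label 1 = boundary, label 0 = non-boundary.
samples = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
labels = [1, 1, 0, 0]
table = joint_probability(samples, labels, bandwidth=[1.0, 1.0], n_labels=2)
for row in table:
    print(row)
```

The bandwidth is where prior knowledge enters: a wider kernel smooths label evidence across more distant observations, a narrower one keeps it local.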

  12. VC³ Overview: Cluster Examples-I
      - ABC VCs for story seg.
      - Cluster selection/feature reduction!!

  13. VC³ Overview: Cluster Examples-II
      - CNN VCs for story seg.

  14. VC³ Overview: Cluster Examples-III
      - CNN VCs for text association
        - TEMPERATURE, SHOWER, RAIN, THUNDERSTORM, PRESSURE, …
        - POINT, WIN, PLAY, MICHAEL, GAME, …
        - POINT, DOLLAR, PERCENT, WORLD, DOW, NASDAQ, STREET
        - SPORT, HEADLINE, JAMES, GAMES, …
        - PRESIDENT, CLINTON, WHITE, DOLLAR, LEWINSKY, HOUSE, …

  15. VC³ Overview: Feature Projection
      - In feature extraction, project an image onto the induced cue clusters by calculating its membership probabilities, which yields the K-dim. VC features
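One way to picture the projection step is the sketch below. It assumes that the membership probability of a new feature vector in cluster c is proportional to the summed Gaussian kernel weights of the training samples assigned to c. The slides do not spell out the formula, so this weighting, the function name `project_to_clusters`, and the uniform fallback are all illustrative assumptions.

```python
import math


def project_to_clusters(x, samples, assignments, n_clusters, bandwidth):
    """Return a K-dim vector of membership probabilities for feature x.

    samples[i] is a training feature vector and assignments[i] is the
    index of the cue cluster it belongs to (hard partition assumed).
    """
    weights = [0.0] * n_clusters
    for xi, c in zip(samples, assignments):
        # Diagonal Gaussian kernel between x and the training sample.
        weights[c] += math.exp(-0.5 * sum(
            ((a - b) / s) ** 2 for a, b, s in zip(x, xi, bandwidth)
        ))
    total = sum(weights)
    if total == 0.0:
        # x is far from every sample: fall back to a uniform vector.
        return [1.0 / n_clusters] * n_clusters
    return [w / total for w in weights]


# Hypothetical training data with two clusters.
samples = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
assignments = [0, 0, 1, 1]
vc_features = project_to_clusters([0.05, 0.0], samples, assignments, 2, [1.0, 1.0])
print(vc_features)
```

The resulting vector has one entry per cue cluster and sums to 1, so it can be used directly as the visual part of the feature vector at each candidate point.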

  16. Performance Overview (A+V, Validation Set)
      [Figure panels: A+V on CNN; A+V on ABC]

  17. Performance Overview (A+V, Validation Set)
      [Bar chart over story types. Labels: Ratio (Overall), ME, VCs, A+V. Story types: cont. shrt bref., anch. led, 2nd anch., in anch., sprt bref., sprt->comm, msc/anim, prev->comm, weather]
      - Annotate 749 stories into 9 story types from 22 CNN videos
      - Fixed 0.71 precision; VC(*) evaluated at shot boundaries ONLY

  18. Performance Overview (A+V+T, Validation Set)
      [Diagram: fusion schemes combining the V, A, and T streams; arrows denote SVM fusion]
      - Revised A+V+T fusion approach
      - Over-fitting in the training set!!

  19. TRECVID 2004 Test Result
      [Bar chart: TRECVID 2004 Story Segmentation NIST Submission. F1 (axis 0.00 to 0.80) for 10 Columbia_IBM submissions (dT, AV_efc+efc, AV_efc+ec, AV_fc+fc, AVmT, AVmT_fc+fc, AVdT_fc+c, AVdT_fc+fc, AVmT_fc_c, mT) and best_of_others; labeled values: 0.69, 0.65, 0.61, 0.57]
      - Significant degradation (10%) compared with our two validation sets (A+V, A+V+T: 0.72+)
      - Probably because (1) the visual patterns or raw features had changed a lot in the test set; (2) the fusion strategy; (3) the selection of the decision threshold

  20. Summary
      - Developed a novel information-theoretic framework to
        - discover visual cue clusters automatically
        - adapt to the diverse production events of different channels
        - avoid manual specification/annotation of salient visual cues
      - Results confirm the effectiveness of VCs on the validation set
        - But the performance degrades on the test set due to the time gap
      - Multi-modal fusion
        - Fusion of A and V yields a significant improvement
        - Fusion of AV and T improves performance on ABC only
        - Strategies for fusion are critical: simultaneous fusion is better
      - Major remaining errors
        - Short sports briefings
        - Suggest merging them into one continuous story in the ground truth

  21. < the end; thanks >
