TRECVID 2011 TokyoTech+Canon: Multimedia Event Detection using GS-SVMs and Audio HMMs
Shunsuke Sato (Canon Inc.); Nakamasa Inoue, Yusuke Kamishima, Koichi Shinoda (Department of Computer Science, Tokyo Institute of Technology)
Outline
• Motivation
• System Overview
• Method: feature extraction, GS-SVM, Audio HMMs
• Results (best result: Minimum NDC = 0.525)
Motivation
There are two categories of event features:
• Features that appear in every frame
• Features that appear only in some frames
Combining them can improve detection performance.
Example: Flash Mob Gathering clips
• Every frame: outdoor, dancers, road, crowd
• Some frames: crowd buzz, dancing, dance music, cheering voice
Method Overview
For every-frame features: GS-SVM (GMM Supervector Support Vector Machine)
• Uses several visual and audio features
• Soft clustering, robust against quantization errors
• Based on our system for the TRECVID 2010 SIN task
For some-frame features: HMM (Hidden Markov Model)
• Models temporal features in sound
• Applies word spotting as used in speech recognition
• Uses only audio, not video
System Overview
A test clip is processed in three stages:
1. Feature extraction: SIFT-Har, SIFT-Hes, HOG, STIP, and MFCC features are extracted.
2. GS-SVM: one GS-SVM is applied per feature type.
3. Audio HMM: an MFCC-HMM score is computed.
The GS-SVM and Audio-HMM scores are fused to produce the detection result.
Feature Extraction
Five types of features from three kinds of sources:
• Still images (frames sampled every 2 seconds from the clip): SIFT (Harris), SIFT (Hessian), HOG
• Spatio-temporal images: STIP
• Audio: MFCC
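The 2-second still-image sampling above can be sketched as follows (a minimal illustration; the frame rate and clip length here are made-up values):

```python
import numpy as np

def sampled_indices(n_frames, fps, step_sec=2.0):
    """Indices of the frames kept when sampling one still every step_sec seconds."""
    step = max(1, int(round(fps * step_sec)))
    return np.arange(0, n_frames, step)

# A 12-second clip at 25 fps yields stills at frames 0, 50, 100, ...
print(sampled_indices(300, 25.0).tolist())  # [0, 50, 100, 150, 200, 250]
```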
List of Features
• Still images: SIFT (Harris) and SIFT (Hessian), the Scale-Invariant Feature Transform with Harris-affine and Hessian-affine regions [Mikolajczyk, 2004]
• Still images: HOG, 32-dimensional, with dense sampling (every 4 pixels)
• Spatio-temporal images: STIP, Space-Time Interest Points with HOG and HOF features extracted [Laptev, 2005]
• Audio: MFCC, Mel-frequency cepstral coefficients, audio features used in speech recognition
GMM Supervector SVM (GS-SVM)
GS-SVM represents the distribution of each feature type:
1. Each clip is modeled by a GMM (Gaussian Mixture Model).
2. A supervector is derived from the GMM parameters.
3. An SVM (Support Vector Machine) is trained on the supervectors.
Pipeline: features → Gaussian Mixture Model → supervector → SVM → score
GMM Estimation
The clip GMM is estimated by maximum a posteriori (MAP) adaptation of the mean vectors:

  μ̂_k = (Σ_t γ_t(k) x_t + τ μ_k) / (Σ_t γ_t(k) + τ)

where μ_k is the mean of the k-th Gaussian of the UBM*, μ̂_k is the adapted mean, γ_t(k) is the posterior probability of Gaussian k given frame x_t, and τ controls the adaptation strength.
*Universal background model (UBM): a prior GMM estimated from all video data.
GMM Supervector
The GMM supervector is the combination (concatenation) of the normalized, MAP-adapted mean vectors:

  φ = [ √w_1 Σ_1^(−1/2) μ̂_1 ; … ; √w_K Σ_K^(−1/2) μ̂_K ]

where w_k and Σ_k are the weight and covariance of the k-th Gaussian of the UBM, and μ̂_k is the MAP-adapted mean.
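A minimal sketch of the MAP adaptation and supervector construction above, using scikit-learn's GaussianMixture as the UBM (the mixture size, τ, and the random toy data are illustrative assumptions, not the values used in the actual system):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, frames, tau=10.0):
    """MAP-adapt the UBM mean vectors to one clip's feature frames."""
    gamma = ubm.predict_proba(frames)        # (T, K) posteriors gamma_t(k)
    n_k = gamma.sum(axis=0)                  # soft count per Gaussian
    sum_x = gamma.T @ frames                 # (K, D) posterior-weighted sums
    return (sum_x + tau * ubm.means_) / (n_k + tau)[:, None]

def supervector(ubm, adapted_means):
    """Concatenate weight-scaled, variance-normalized adapted means."""
    scale = np.sqrt(ubm.weights_)[:, None] / np.sqrt(ubm.covariances_)
    return (scale * adapted_means).ravel()

# UBM trained on pooled data; one supervector per clip
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(rng.normal(size=(500, 4)))
clip_frames = rng.normal(size=(60, 4))
sv = supervector(ubm, map_adapt_means(ubm, clip_frames))
print(sv.shape)  # (32,) = K * D
```

As τ grows, the adapted means shrink back toward the UBM means, which is the "soft clustering" robustness mentioned earlier.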
Score Fusion in GS-SVM
GS-SVMs use RBF kernels:

  K(φ, φ') = exp(−γ ||φ − φ'||²)

The final score is a weighted average of the SVM outputs:

  score = Σ_{f ∈ F} λ_f s_f, where F = {SIFT-Har, SIFT-Hes, HOG, STIP, MFCC}

The weights λ_f are decided by 2-fold cross-validation based on:
• Minimum Normalized Detection Cost (Runs 1 and 2)
• Average Precision (Run 3)
In Run 4, λ_f is equal for all features.
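The fusion step can be sketched as below (a hypothetical setup with random toy data; the real system trains one RBF-kernel GS-SVM per feature type on supervectors):

```python
import numpy as np
from sklearn.svm import SVC

def fused_score(svms, weights, supervectors):
    """Weighted average of per-feature RBF-SVM decision values for one clip."""
    return sum(weights[f] * svms[f].decision_function(sv[None, :])[0]
               for f, sv in supervectors.items())

features = ["SIFT-Har", "SIFT-Hes", "HOG", "STIP", "MFCC"]
rng = np.random.default_rng(1)
svms, svs = {}, {}
for f in features:
    X, y = rng.normal(size=(40, 16)), rng.integers(0, 2, size=40)
    svms[f] = SVC(kernel="rbf", gamma="scale").fit(X, y)  # toy per-feature SVM
    svs[f] = rng.normal(size=16)                          # toy clip supervector
lam = {f: 1.0 / len(features) for f in features}          # Run 4: equal weights
print(fused_score(svms, lam, svs))
```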
Audio HMM
Training:
1. Manually label an event period in each event clip.
2. Train an event HMM on MFCC features of the labeled periods.
Test:
1. Find the likelihood L_E of the event period by word spotting with the event HMM.
2. Find the likelihood L_G of the event period under a garbage model estimated from all video data.
3. Use the likelihood ratio L_E / L_G as the detection score.
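The likelihood-ratio scoring can be sketched with a small forward-algorithm implementation (the hand-set two-state HMM parameters and synthetic "MFCC" frames below stand in for the trained event and garbage models; the real system trains these on labeled audio):

```python
import numpy as np

def _logsumexp(a, axis=None):
    m = np.max(a, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(s, axis=axis) if axis is not None else s.item()

def _log_gauss(x, means, variances):
    """Per-state log N(x; mean_k, diag var_k); returns shape (K,)."""
    d = x[None, :] - means
    return -0.5 * np.sum(np.log(2 * np.pi * variances) + d * d / variances, axis=1)

def hmm_loglik(X, pi, A, means, variances):
    """Log-likelihood of frame sequence X under a diagonal-Gaussian HMM."""
    log_alpha = np.log(pi) + _log_gauss(X[0], means, variances)
    for x in X[1:]:
        log_alpha = _logsumexp(log_alpha[:, None] + np.log(A), axis=0) \
                    + _log_gauss(x, means, variances)
    return _logsumexp(log_alpha)

D = 13                                  # MFCC-like dimensionality
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
event = dict(pi=pi, A=A, means=np.full((2, D), 1.0), variances=np.ones((2, D)))
garbage = dict(pi=pi, A=A, means=np.zeros((2, D)), variances=np.ones((2, D)))

seg = np.random.default_rng(3).normal(1.0, 1.0, size=(50, D))  # event-like audio
score = hmm_loglik(seg, **event) - hmm_loglik(seg, **garbage)  # log(L_E / L_G)
print(score > 0)  # True: segment matches the event model better
```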
Preliminary Results of Audio HMMs
The HMM score is fused with the GS-SVM score by weighted averaging. Audio HMMs were effective for 3 of the 10 events (Flash mob gathering, Parade, and Repairing an appliance), so they are used only for those events in Run 1.
[Bar chart comparing GS-SVM only vs. GS-SVM + HMM (scores 0.0 to 0.7) for the 10 events: Birthday party, Changing a vehicle tire, Flash mob gathering, Getting a vehicle unstuck, Grooming an animal, Making a sandwich, Parade, Parkour, Repairing an appliance, Working on a sewing project.]
Experiments
Run 3 was the best, showing that GS-SVM is effective.
• Run 1 (Audio HMM) did not show good performance.
• Run 2, with weights decided by Minimum NDC, was not good; simple cross-validation may have failed.
We ranked 3rd among the participating teams:
• Run 3 (Average Precision weighting): 7th
• Run 4 (no weighting): 8th
• Run 2 (Minimum NDC weighting): 10th
• Run 1 (Run 2 + Audio HMM; primary run): 12th
[Bar chart of Mean Minimum NDC (0 to 1.5) across TRECVID 2011 MED runs.]
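The Minimum NDC metric used to rank these runs can be sketched as a sweep over score thresholds. The default cost parameters below (C_miss = 80, C_FA = 1, P_target = 0.001) are my recollection of the MED'11 settings and should be treated as assumptions; check the official evaluation plan:

```python
import numpy as np

def minimum_ndc(scores, labels, c_miss=80.0, c_fa=1.0, p_target=0.001):
    """Minimum Normalized Detection Cost: min over thresholds of
    P_miss + beta * P_fa, with beta = C_FA*(1 - P_t) / (C_miss * P_t)."""
    beta = c_fa * (1.0 - p_target) / (c_miss * p_target)
    order = np.argsort(scores)[::-1]          # descending score
    y = np.asarray(labels)[order]
    tp = np.concatenate([[0], np.cumsum(y)])  # sweep: accept the top-k clips
    fp = np.concatenate([[0], np.cumsum(1 - y)])
    p_miss = 1.0 - tp / y.sum()
    p_fa = fp / (1 - y).sum()
    return float(np.min(p_miss + beta * p_fa))

# Perfectly separated scores give a Minimum NDC of 0
print(minimum_ndc([0.9, 0.8, 0.1], [1, 1, 0]))  # 0.0
```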
Effect of Each Feature in GS-SVM
STIP and HOG gave the better performance among single features; MFCC was effective when combined with STIP and HOG.
[Bar chart of Mean Minimum NDC (0.2 to 1.0) for feature combinations: each single feature (SIFT-Har, SIFT-Hes, HOG, STIP, MFCC), pairs such as STIP+HOG, the triple STIP+HOG+MFCC, larger subsets, and all 5 features.]
Why Did the Audio HMM Not Work?
• It failed to capture temporal features: each state represents a specific sound, such as a drum or cheering, which may also appear in non-event clips and/or at random positions.
• The test data include many sounds that do not appear in the training and development data.
[Bar chart of the difference in Minimum NDC with vs. without Audio HMMs (−0.1 to 0.1) for Flash mob gathering, Parade, and Repairing an appliance, in the preliminary experiment and in the official evaluation.]
Conclusion
We combined GS-SVM and Audio HMM.
• GS-SVMs are effective for MED; STIP, HOG, and MFCC are important features.
• Audio HMMs are not effective: they cannot capture temporal features, and the variety of sounds is larger than expected.
Future work:
• Include other features, such as Dense SIFT.
• Improve the HMM-based sound detection.
• Model event subclasses and their relationships.