CMU @ TRECVID 2009: Event Detection
Ming-yu Chen & Alex Hauptmann
School of Computer Science, Carnegie Mellon University
CMU @ TRECVID 2009 Event Detection
■ CMU submitted all 10 event detection tasks
■ Part-based generic approach
  • Local features extracted from videos
    - Local features describe both appearance and motion
    - Bag-of-words features represent video content
  • Robust to action deformation, occlusion, and illumination changes
■ Sliding window detection approach
  • Extends the part-based method to detection tasks
  • False alarm reduction is a critical task
System overview
MoSIFT – feature detection
■ MoSIFT detects spatial interest points at multiple scales
  • Local maxima of Difference of Gaussian (DoG)
■ MoSIFT computes optical flow to detect moving areas
■ MoSIFT keeps video interest points that are both DoG local maxima and carry sufficient optical flow
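A minimal sketch of this detection step, assuming grayscale frames and OpenCV; the Farneback flow estimator and the `flow_thresh` parameter are illustrative stand-ins, not details of the actual MoSIFT implementation:

```python
import cv2
import numpy as np

def mosift_like_keypoints(prev_gray, curr_gray, flow_thresh=0.5):
    # SIFT keypoints: local extrema of Difference-of-Gaussian over scales
    sift = cv2.SIFT_create()
    keypoints = sift.detect(curr_gray, None)

    # Dense optical flow between consecutive frames (Farneback here is a
    # stand-in; the flow estimator MoSIFT actually uses may differ)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)

    # Keep only interest points that sit on sufficiently moving regions
    moving = [kp for kp in keypoints
              if magnitude[int(kp.pt[1]), int(kp.pt[0])] > flow_thresh]
    return moving, flow
```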
MoSIFT – feature description
■ Descriptor of shape
  • Histogram of Gradient (HoG)
  • Aggregate neighboring areas as a 4x4 grid; each grid cell is described by 8 orientations
  • 4x4x8 = 128-dimensional vector describing the shape of an interest area
■ Descriptor of motion
  • Histogram of Optical Flow (HoF), in the same format as HoG
  • 128-dimensional vector describing the motion of an interest area
■ 256-dimensional vectors as feature descriptors
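A sketch of the 4x4x8 aggregation and the HoG/HoF concatenation; the exact interpolation and normalization MoSIFT applies are not given in the slides, so this illustration makes simplifying assumptions:

```python
import numpy as np

def grid_histogram(angles, weights, grid=4, n_bins=8):
    """4x4 cells x 8 orientation bins = 128 dims, SIFT-style aggregation.
    angles/weights: square patches around the interest point (radians)."""
    h, w = angles.shape
    ch, cw = h // grid, w // grid
    hist = np.zeros((grid, grid, n_bins))
    bins = ((angles % (2 * np.pi)) / (2 * np.pi) * n_bins).astype(int) % n_bins
    for i in range(grid):
        for j in range(grid):
            cell_b = bins[i*ch:(i+1)*ch, j*cw:(j+1)*cw]
            cell_w = weights[i*ch:(i+1)*ch, j*cw:(j+1)*cw]
            for b in range(n_bins):
                hist[i, j, b] = cell_w[cell_b == b].sum()
    return hist.ravel()                              # 128-dim

def mosift_descriptor(grad_angle, grad_mag, flow_angle, flow_mag):
    hog = grid_histogram(grad_angle, grad_mag)       # appearance half (HoG)
    hof = grid_histogram(flow_angle, flow_mag)       # motion half (HoF)
    return np.concatenate([hog, hof])                # 256-dim descriptor
```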
Event detection
■ The k-means clustering algorithm quantizes the feature points extracted from videos
  • k is chosen by cross-validation
■ A video codebook is built from the clustering result
  • A visual code is a category of similar video interest points
■ A bag-of-words (BoW) feature is constructed for each video sequence
  • Soft weighting is used to construct the BoW feature
■ Event models are trained with a Support Vector Machine (SVM)
  • A χ² kernel is applied
■ The sliding window approach creates video sequences in both the training and testing sets
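A rough sketch of the soft-weighted BoW construction and a χ² kernel for the SVM; the 1/2^rank down-weighting and the exponential kernel form are common choices assumed here, not confirmed details of the CMU system:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def soft_bow(descriptors, centers, n_neighbors=4):
    """Soft-weighted BoW: each descriptor votes for its n_neighbors nearest
    codewords, down-weighted by rank (one common soft-assignment scheme)."""
    hist = np.zeros(len(centers))
    for d in descriptors:
        dists = np.linalg.norm(centers - d, axis=1)
        for rank, idx in enumerate(np.argsort(dists)[:n_neighbors]):
            hist[idx] += 1.0 / (2 ** rank)
    return hist / max(hist.sum(), 1e-9)   # normalize to a distribution

def chi2_kernel(X, Y, gamma=1.0):
    """Exponential chi-square kernel over BoW histograms."""
    K = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        K[i] = np.exp(-gamma * 0.5 * (((x - Y) ** 2) / (x + Y + 1e-9)).sum(axis=1))
    return K

# codebook = KMeans(n_clusters=1000).fit(stacked_descriptors).cluster_centers_
# svm = SVC(kernel='precomputed').fit(chi2_kernel(X_train, X_train), y_train)
```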
Evaluation metric – DCR
■ Normalized Detection Cost Rate (NDCR) is used to evaluate performance:

  DetectionCost(S, E) = Cost_Miss · P_Miss(S, E) + Cost_FA · R_FA(S, E)

  where P_Miss(S, E) ∈ [0, 1] and R_FA(S, E) ∈ [0, ∞)
■ Strongly penalizes false alarms
  • NDCR rewards reducing false alarms far more than detecting additional positive examples
  • Reducing false alarms is therefore extremely important for improving NDCR scores
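A small helper that evaluates the normalized cost from the quantities above; the constants (Cost_Miss = 10, Cost_FA = 1, R_Target = 20 events/hour) are the commonly cited TRECVID SED settings and are an assumption here, not taken from the slides:

```python
def ndcr(n_missed, n_true, n_false_alarms, hours,
         cost_miss=10.0, cost_fa=1.0, r_target=20.0):
    """Normalized DCR; constants should be checked against the
    official TRECVID evaluation plan."""
    p_miss = n_missed / n_true                 # in [0, 1]
    r_fa = n_false_alarms / hours              # in [0, inf)
    beta = cost_fa / (cost_miss * r_target)    # normalization weight
    return p_miss + beta * r_fa
```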
False alarm reduction
■ Cascade architectures are widely used to reduce false alarms in detection tasks
■ We applied the cascade idea in the test phase to reduce false alarms
  • Two positively biased classifiers are built (due to computation; this can be extended to more layers)
  • Windows that pass both classifiers are predicted as positive

  All windows → M1 —(T)→ M2 —(T)→ Detected windows
  An (F) from either M1 or M2 → Rejected windows
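A minimal test-time cascade sketch; `stage1` and `stage2` are assumed to be the two positively biased classifiers, exposing a scikit-learn-style predict():

```python
def cascade_predict(windows, stage1, stage2):
    """A window is detected only if both classifiers accept it;
    a rejection by either one discards it early."""
    detected = []
    for feat in windows:
        if stage1.predict([feat])[0] == 1:      # first gate
            if stage2.predict([feat])[0] == 1:  # second confirmation
                detected.append(feat)
    return detected
```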
False alarm reduction (cont.)
■ Lesson from last year: the multi-scale sliding window approach produces many false alarms
■ We do not apply multi-scale windows this year
■ Instead of several short positive predictions, we aggregate consecutive positive predictions into one long positive segment
  • Reduces the number of positive predictions
■ Performance improves 80% with the cascade algorithm
■ Performance improves 40% by concatenating short predictions into long predictions
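One way to implement the aggregation, assuming each positive prediction is a (start_frame, end_frame) span:

```python
def merge_positive_windows(windows):
    """Aggregate overlapping or abutting positive sliding windows,
    sorted by start frame, into long segments."""
    segments = []
    for start, end in sorted(windows):
        if segments and start <= segments[-1][1]:
            segments[-1][1] = max(segments[-1][1], end)  # extend the run
        else:
            segments.append([start, end])                # start a new segment
    return [tuple(s) for s in segments]

# e.g. [(0, 25), (5, 30), (10, 35)] -> [(0, 35)]: one detection, not three
```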
System setup
■ MoSIFT features are extracted at 3 different scales every 5 frames
  • approximately 2160 hours for a single core to extract MoSIFT features
■ A sliding window (25 frames) slides every 5 frames
■ 1000 video codes
■ Soft-weighted BoW feature representation (4 nearest clusters)
■ One-against-all SVM model for each action and each camera view
  • 50 models are built (10 actions × 5 camera views)
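The window geometry above translates directly into code; a tiny illustration:

```python
def sliding_windows(n_frames, window=25, stride=5):
    """(start, end) frame spans: a 25-frame window advanced every
    5 frames, matching the setup above."""
    return [(s, s + window) for s in range(0, n_frames - window + 1, stride)]

# sliding_windows(60) -> [(0, 25), (5, 30), ..., (35, 60)]
```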
Performance comparison
[Figure: bar chart of DCR per action (y-axis 0–1.4), comparing CMU against the median and best submissions]
Correct detection comparison
[Figure: bar chart of the number of correct detections (CorDet) per action (y-axis 0–700), comparing CMU against the median and maximum]
Performance (2008 vs. 2009)
High level feature extraction
■ Motion-related high-level features
  • 7 motion-related concepts
  • Airplane flying, Person playing soccer, Hand, Person playing a musical instrument, Person riding a bicycle, Person eating, People dancing

  Team     MAP
  MM       0.24
  PKU      0.21
  TITG     0.20
  CMU      0.18
  FTRD     0.18
  VIREO    0.18
  Eurecom  0.18
Conclusion & future work
■ Conclusion:
  • A generic approach to detecting events
  • MoSIFT features capture both shape and motion information
  • Performs robustly across all tasks
  • False alarm reduction is critical to improving DCR
■ Future work:
  • The approach cannot localize where the action is
  • The approach can be further fused with people tracking and global features
  • The bag-of-words representation lacks spatial constraints