AT&T Research at TRECVID 2013: Surveillance Event Detection
Xiaodong Yang †*, Zhu Liu ‡, Eric Zavesky ‡, David Gibbon ‡, Behzad Shahraray ‡
† City College of New York, CUNY; ‡ AT&T Labs - Research
* This work was carried out while the author was a research intern at AT&T Labs - Research.
Team Members: Xiaodong Yang, Zhu Liu, Eric Zavesky, David Gibbon, Behzad Shahraray
Outline: System Overview, Low-Level Features, Video Representation, CascadeSVMs, Human Interactions, Performance Evaluation, Conclusion
System Overview
Low-Level Feature Extraction: STIP-HOG/HOF; MoSIFT; ActionHOG; Dense Trajectories (DT) with Trajectory, HOG, HOF, and Motion Boundary Histogram (MBH) descriptors
Low-Level Feature Extraction: STIP. Detector: 3D Harris corner detector. Descriptor: HOG-HOF. Reference: I. Laptev. On Space-Time Interest Points. IJCV, 2005.
Low-Level Feature Extraction: MoSIFT. Detector: SIFT detector + motion. Descriptor: SIFT descriptor over image gradient and optical flow. Reference: M. Chen and A. Hauptmann. MoSIFT: Recognizing Human Actions in Surveillance Videos. CMU-CS-09-161, 2009.
Low-Level Feature Extraction: ActionHOG. Detector: SURF detector + motion. Descriptor: HOG over image gradient, motion history image, and optical flow. Reference: X. Yang, C. Yi, L. Cao, and Y. Tian. MediaCCNY at TRECVID 2012: Surveillance Event Detection. NIST TRECVID Workshop, 2012.
Low-Level Feature Extraction: Dense Trajectories. Detection: dense sampling + tracking. Descriptors: Trajectory, HOG, HOF, MBH. Reference: H. Wang, A. Klaser, C. Schmid, and C. Liu. Action Recognition by Dense Trajectories. CVPR, 2011.
Video Representation: Fisher Vector. Low-level features are modeled with a GMM; the Fisher Vector is built from the gradient wrt. the mean and the gradient wrt. the variance. Reference: F. Perronnin, J. Sanchez, and T. Mensink. Improving The Fisher Kernel for Large-Scale Image Classification. ECCV, 2010.
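A minimal sketch of the encoding named on this slide, following the formulation in the cited Perronnin et al. paper and assuming a diagonal-covariance GMM that has already been fitted. The function names, the toy inputs, and the pure-Python implementation are illustrative assumptions, not the system's code; the power and L2 normalization steps from that paper are omitted for brevity.

```python
import math


def gmm_posteriors(x, weights, means, sigmas):
    """Soft-assign one descriptor x to each diagonal-covariance Gaussian."""
    log_p = []
    for w, mu, sd in zip(weights, means, sigmas):
        ll = math.log(w)
        for xi, m, s in zip(x, mu, sd):
            ll += -0.5 * math.log(2 * math.pi * s * s) - 0.5 * ((xi - m) / s) ** 2
        log_p.append(ll)
    top = max(log_p)  # subtract the max for numerical stability
    exp_p = [math.exp(v - top) for v in log_p]
    total = sum(exp_p)
    return [v / total for v in exp_p]


def fisher_vector(descriptors, weights, means, sigmas):
    """Concatenate gradients wrt. the means and wrt. the variances (length 2*K*D)."""
    T, K, D = len(descriptors), len(weights), len(means[0])
    g_mu = [[0.0] * D for _ in range(K)]
    g_sd = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        gamma = gmm_posteriors(x, weights, means, sigmas)
        for k in range(K):
            for d in range(D):
                z = (x[d] - means[k][d]) / sigmas[k][d]
                g_mu[k][d] += gamma[k] * z
                g_sd[k][d] += gamma[k] * (z * z - 1.0)
    fv = []
    for k in range(K):  # gradient wrt. mean
        fv.extend(v / (T * math.sqrt(weights[k])) for v in g_mu[k])
    for k in range(K):  # gradient wrt. variance
        fv.extend(v / (T * math.sqrt(2 * weights[k])) for v in g_sd[k])
    return fv


# Toy usage: 3 two-dimensional descriptors, 2 Gaussians (all values hypothetical).
toy_fv = fisher_vector(
    [[0.1, 0.2], [0.9, 1.1], [0.5, 0.4]],
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [1.0, 1.0]],
    sigmas=[[1.0, 1.0], [1.0, 1.0]],
)
```

The result has 2 x K x D entries, here 2 x 2 x 2 = 8, which is why the representation grows quickly with the number of Gaussian components.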
Video Representation: Fisher Vector. The Fisher Vector is the concatenation of the gradients wrt. mean and variance. Dimensions with a 128-component GMM (GMM-128):
Feature   STIP  MoSIFT  ActionHOG  DT-HOG  DT-HOF  DT-MBH  DT-Traj
Feat-Dim  162   256     216        96      108     192     30
FV-Dim    330K  520K    440K       200K    220K    400K    60K
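The table values can be sanity-checked with a short calculation. A Fisher Vector that concatenates the mean and variance gradients of a K-component GMM over D-dimensional descriptors has 2 x K x D entries (per the cited Perronnin et al. formulation). With K = 128 that gives about 41K for STIP, so the listed 330K is roughly 8 times larger. One possible explanation, which is an assumption and not stated on the slide, is pooling over a spatial pyramid with 8 cells; the `cells` parameter below exists only to explore that guess.

```python
# Descriptor dimensions and listed FV sizes, copied from the table above.
FEATURE_DIMS = {
    "STIP": 162, "MoSIFT": 256, "ActionHOG": 216,
    "DT-HOG": 96, "DT-HOF": 108, "DT-MBH": 192, "DT-Traj": 30,
}
LISTED_FV_DIMS = {
    "STIP": 330_000, "MoSIFT": 520_000, "ActionHOG": 440_000,
    "DT-HOG": 200_000, "DT-HOF": 220_000, "DT-MBH": 400_000, "DT-Traj": 60_000,
}
NUM_GAUSSIANS = 128        # "GMM-128" on the slide
ASSUMED_PYRAMID_CELLS = 8  # assumption: not stated in the source


def fisher_vector_dim(feat_dim, num_gaussians, cells=1):
    """Mean and variance gradients per Gaussian, optionally repeated per pyramid cell."""
    return 2 * num_gaussians * feat_dim * cells


for name, dim in FEATURE_DIMS.items():
    plain = fisher_vector_dim(dim, NUM_GAUSSIANS)
    pooled = fisher_vector_dim(dim, NUM_GAUSSIANS, ASSUMED_PYRAMID_CELLS)
    print(f"{name:10s} 2KD={plain:7d}  x{ASSUMED_PYRAMID_CELLS}={pooled:7d}  listed={LISTED_FV_DIMS[name]}")
```

Under that assumption every computed size falls within a few percent of the rounded figure in the table.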
Video Representation: Spatial Pyramids. Reference: S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bag of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR, 2006.
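As a rough illustration of spatial pyramid pooling, the sketch below maps a feature location to one cell per pyramid level, so that a separate representation can be accumulated for each cell. The slide does not state the grid layout; the grids used here (1x1, 2x2, and three horizontal strips) and the frame size in the usage lines are assumptions chosen only for illustration.

```python
def pyramid_cells(x, y, width, height, grids=((1, 1), (2, 2), (3, 1))):
    """Return one global cell index per grid level for a point at (x, y).

    Each grid is (rows, cols); cell indices are numbered consecutively
    across levels, so the default layout yields indices 0..7.
    """
    cells = []
    offset = 0
    for rows, cols in grids:
        row = min(int(y * rows / height), rows - 1)
        col = min(int(x * cols / width), cols - 1)
        cells.append(offset + row * cols + col)
        offset += rows * cols
    return cells


# Hypothetical 720x576 frame: top-left and bottom-right corner points.
top_left = pyramid_cells(0, 0, 720, 576)
bottom_right = pyramid_cells(719, 575, 720, 576)
```

Every feature contributes to the whole-frame cell and to one cell at each finer level, which preserves coarse location information that a single global pooling would discard.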
CascadeSVMs: Imbalanced Data [Figure: chart with a percentage (%) axis ranging from 0 to 4.5]
CascadeSVMs [Diagram: a sample is passed through Model-1, Model-2, Model-3, ..., Model-C; each model yields a positive or a negative prediction.] Reference: X. Yang, C. Yi, L. Cao, and Y. Tian. MediaCCNY at TRECVID 2012: Surveillance Event Detection. NIST TRECVID Workshop, 2012.
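The cascade in the diagram can be sketched as a chain of binary models applied in order. The slide does not state the decision rule, so the rule below (reject at the first negative prediction, accept only if all C models predict positive) is an assumption modeled on a conventional classifier cascade; the threshold lambdas are hypothetical stand-ins for trained SVMs.

```python
def cascade_predict(sample, models):
    """Return True only if every model in the cascade predicts positive.

    `models` is an ordered list of callables (Model-1 ... Model-C), each
    returning True for a positive prediction and False for a negative one.
    """
    for model in models:
        if not model(sample):
            return False  # assumed rule: first negative prediction rejects the sample
    return True


# Hypothetical stand-ins for trained SVMs: thresholds on a scalar score.
toy_models = [lambda s: s > 0.1, lambda s: s > 0.3, lambda s: s > 0.5]
accepted = cascade_predict(0.7, toy_models)
rejected = cascade_predict(0.2, toy_models)
```

With early rejection, most negative samples are discarded by the first few models, which suits data in which positives are rare.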
CascadeSVMs Feature Fusion
Human Interactions High Throughput UI
Human Interactions Triage UI
Performance Evaluation: Experimental Setup. Event: PersonRuns. Representation: Fisher Vector. Classifier: CascadeSVMs. Data: 40 hours of video for training, 10 hours of video for testing.
Performance Evaluation: Number of Gaussian Components (STIP)
Performance Evaluation: Comparisons of Low-Level Features: STIP, MoSIFT, ActionHOG, DT-Trajectory, DT-HOG, DT-HOF, DT-MBH
Performance Evaluation: How a Larger Training Set Helps: 40 vs. 90 hours of training video
Performance Evaluation: Feature Fusion. Data: 90 hours of training video. Features: STIP, DT-Trajectory, DT-MBH. Schemes: Early Fusion, Late Fusion, Early + Late Fusion.
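The two fusion schemes compared on this slide can be sketched in their usual form: early fusion concatenates the per-feature representations before a single classifier, while late fusion combines the scores of per-feature classifiers. The weighted-average rule and the function names are assumptions for illustration; the slides do not specify how scores were combined.

```python
def early_fusion(feature_vectors):
    """Concatenate per-feature representations into one vector before classification."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def late_fusion(scores, weights=None):
    """Combine per-classifier scores with a weighted average (assumed rule)."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Hypothetical values standing in for STIP, DT-Trajectory, and DT-MBH.
fused_vector = early_fusion([[0.1, 0.2], [0.3], [0.4, 0.5]])
fused_score = late_fusion([1.0, 3.0, 2.0])
```

An early + late scheme would apply `late_fusion` to the score from the classifier trained on `fused_vector` together with the per-feature scores.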
Performance Evaluation Formal Evaluation Comparative Results
Conclusion: Best ADCR [Figure: results for seven events, labeled by event type in order: Single Person, Multiple People, Multiple People, Multiple People, Person Object, Single Person, Person Object]
Conclusion. Multiple Features: fusion scheme, ranking and selection, event-specific investigation. Fisher Vector: accuracy and computation. Human Interaction: collaborative mode, cross-event mode, static gesture detection.