Learning Spatiotemporal Features with 3D Convolutional Networks
Learning Spatiotemporal Features with 3D Convolutional Networks Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri Çağdaş BAK 29.03.16
Effective Video Descriptor • Generic – Can represent different types of videos • Compact – Cheap to process and store • Efficient – Fast to compute • Simple – Easy to implement
3D Convolution and Pooling • 3D convolution models temporal information better than 2D convolution. – 2D CONV: performed only spatially, loses temporal information. – 3D CONV: performed spatio-temporally, preserves temporal information. • The same holds for pooling.
2D Convolution On 1-ch Input • Result : 2D Image.
2D Convolution On n-ch Input • Result : 2D Image.
3D Convolution On n-ch Input • Result : Volume
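The difference between the three cases above can be checked with a small shape experiment. The sketch below is only an illustration, not the C3D implementation: the helper names `conv2d_multichannel` and `conv3d_multichannel` are hypothetical, the loops are a naive "valid" cross-correlation with stride 1 and no padding, and the tiny 8x8 frames are chosen just to keep it fast.

```python
import numpy as np

def conv2d_multichannel(x, w):
    """Naive valid 2D convolution of a (C, H, W) input with a (C, kH, kW) kernel.

    All channels are summed into one value per position, so the result is a 2D map.
    """
    _, H, W = x.shape
    _, kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[:, i:i + kH, j:j + kW] * w)
    return out

def conv3d_multichannel(x, w):
    """Naive valid 3D convolution of a (C, L, H, W) input with a (C, kL, kH, kW) kernel.

    The kernel also slides along the temporal axis L, so the result is a volume.
    """
    _, L, H, W = x.shape
    _, kL, kH, kW = w.shape
    out = np.zeros((L - kL + 1, H - kH + 1, W - kW + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(x[:, t:t + kL, i:i + kH, j:j + kW] * w)
    return out

rng = np.random.default_rng(0)

# 2D convolution with 16 frames stacked as channels: the temporal axis collapses.
frames_as_channels = rng.standard_normal((16, 8, 8))
out2d = conv2d_multichannel(frames_as_channels, rng.standard_normal((16, 3, 3)))

# 3D convolution on a 3-channel, 16-frame clip: the temporal axis survives.
clip = rng.standard_normal((3, 16, 8, 8))
out3d = conv3d_multichannel(clip, rng.standard_normal((3, 3, 3, 3)))

print(out2d.shape)  # (6, 6)
print(out3d.shape)  # (14, 6, 6)
```

Even when many frames are fed to a 2D convolution as channels, a single filter produces one image, while a 3D filter produces an output with its own temporal extent.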
Identify Best Architecture For 3D ConvNets (On UCF101) • Common network settings – All video frames are resized to 128x171. – Videos are split into non-overlapping 16-frame clips. – Input: 3x16x128x171. – 5 convolution and pooling layers – 2 fully connected layers – Softmax loss layer to predict action labels
Identify Best Architecture For 3D ConvNets (On UCF101) • Varying network architecture – Homogeneous temporal depth: all convolution layers use the same kernel temporal depth d. • d = 1, 3, 5, 7 – Varying temporal depth across the 5 convolution layers. • Increasing: 3-3-5-5-7 • Decreasing: 7-5-5-3-3
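The clip-splitting step in the common settings can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `split_into_clips` is hypothetical, the video is assumed to be already decoded and resized into a (channels, frames, height, width) array, and trailing frames that do not fill a whole clip are simply dropped, which the slides do not specify.

```python
import numpy as np

def split_into_clips(video, clip_len=16):
    """Split a (C, T, H, W) video into non-overlapping clips.

    Returns an array of shape (N, C, clip_len, H, W) with N = T // clip_len.
    Assumption: leftover frames at the end are discarded.
    """
    c, t, h, w = video.shape
    n = t // clip_len
    trimmed = video[:, :n * clip_len]
    # Split the time axis into (clip index, frame within clip), then put clips first.
    return trimmed.reshape(c, n, clip_len, h, w).transpose(1, 0, 2, 3, 4)

# A dummy 40-frame RGB video at 128x171 where every pixel stores its frame index.
video = np.zeros((3, 40, 128, 171), dtype=np.float32)
video += np.arange(40, dtype=np.float32)[None, :, None, None]

clips = split_into_clips(video)
print(clips.shape)  # (2, 3, 16, 128, 171)
```

Each clip has the 3x16x128x171 input shape listed above; the 8 leftover frames of the 40-frame dummy video are not used in this sketch.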
3D Convolution Kernel Temporal Depth Search
Spatiotemporal Feature Learning • Best network architecture – 3x3x3 convolution kernels in all layers
Spatiotemporal Feature Learning • Dataset for training – Sports 1M Dataset • Largest video classification benchmark • 1.1 million sports videos • 487 categories
Sports 1M Classification Results
C3D Video Descriptor • The C3D model can be used as a feature extractor for various video analysis tasks. – Action recognition – Action similarity – Scene and object recognition • Features are taken from the fc6 activations – 4096 dimensions
Action Recognition • Dataset: UCF101 – 13,320 videos – 101 human action classes
Action Similarity Labeling • Dataset: ASLAN – 3,631 videos – 432 action classes
Scene and Object Recognition • Dataset: YUPENN – 420 videos – 14 scene categories • Dataset: Maryland – 130 videos – 13 scene categories
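A hedged sketch of how per-clip fc6 activations might be turned into a single video descriptor. The slide does not say how clips are aggregated; this example assumes the clip features are averaged and then L2-normalized. The function name `video_descriptor` is hypothetical, and the random array only stands in for real fc6 outputs of a C3D network.

```python
import numpy as np

def video_descriptor(clip_features, eps=1e-12):
    """Aggregate (num_clips, 4096) clip features into one unit-length vector.

    Assumption: mean over clips followed by L2 normalization.
    """
    clip_features = np.asarray(clip_features, dtype=np.float64)
    mean = clip_features.mean(axis=0)
    # eps guards against division by zero for an all-zero feature vector.
    return mean / max(np.linalg.norm(mean), eps)

rng = np.random.default_rng(0)
fake_fc6 = rng.random((5, 4096))  # placeholder for fc6 activations of 5 clips
desc = video_descriptor(fake_fc6)
print(desc.shape)  # (4096,)
```

A fixed-length, normalized vector like this can be passed to a simple classifier (for example a linear SVM) for the tasks listed above.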
Why C3D Features? • Generic • Compact • Efficient • Simple
What Does C3D Learn?
Useful Links • http://vlg.cs.dartmouth.edu/c3d/ • https://github.com/facebook/C3D