
Learning Spatiotemporal Features with 3D Convolutional Networks Du - PowerPoint PPT Presentation



  1. Learning Spatiotemporal Features with 3D Convolutional Networks Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri Çağdaş BAK 29.03.16

  2. Effective Video Descriptor • Generic – Can represent different types of videos • Compact – Processing, storage • Efficient – computation • Simple – implementation

  3. 3D Convolution and Pooling • 3D convolution models temporal information better than 2D convolution. – 2D CONV : performed only spatially, loses temporal information. – 3D CONV : performed spatio-temporally, preserves temporal information. • The same holds for pooling.

  4. 2D Convolution On 1-ch Input • Result : 2D Image.

  5. 2D Convolution On n-ch Input • Result : 2D Image.

  6. 3D Convolution On n-ch Input • Result : Volume
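The difference between slides 5 and 6 can be checked on output shapes alone. The sketch below is a naive, single-filter illustration in pure Python, not the C3D implementation; `conv3d_valid`, `shape`, `ones`, and the toy clip size are hypothetical names and values chosen for this example. It treats a 2D convolution over stacked frames as a kernel that spans every frame, so the temporal axis collapses to 1, while a kernel of temporal depth 3 leaves a volume.

```python
def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation of a [T][H][W] volume with a [d][kh][kw] kernel."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    d, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - d + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(d):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def shape(v):
    """Return (T, H, W) of a nested-list volume."""
    return (len(v), len(v[0]), len(v[0][0]))


def ones(t, h, w):
    """Build a [t][h][w] volume filled with 1.0."""
    return [[[1.0] * w for _ in range(h)] for _ in range(t)]


clip = ones(8, 6, 6)  # toy clip: 8 frames of 6x6 pixels (assumed sizes)

# 2D convolution with frames treated as channels: the kernel spans all 8 frames.
out_2d = conv3d_valid(clip, ones(8, 3, 3))
# 3D convolution: the kernel spans only 3 frames and slides along time.
out_3d = conv3d_valid(clip, ones(3, 3, 3))

print(shape(out_2d))  # temporal axis collapsed to a single 2D map
print(shape(out_3d))  # temporal axis preserved as a volume
```

With these toy sizes the first result has one temporal slice and the second has six, which is the "2D Image" versus "Volume" distinction on the slides.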

  7. Identify Best Architecture For 3D ConvNets (On UCF101) • Common network settings – All video frames resized to 128x171. – Videos are split into non-overlapping 16-frame clips. – Input : 3x16x128x171. – 5 Convolution and Pooling layers – 2 Fully Connected layers – Softmax Loss layer to predict action labels
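The clip-splitting step on slide 7 can be sketched as index arithmetic. This is a minimal illustration, not the authors' preprocessing code: `split_into_clips` is a hypothetical helper, and dropping a trailing partial clip is an assumption, since the slides do not say how leftover frames are handled.

```python
CLIP_LEN = 16                          # frames per clip (slide 7)
INPUT_SHAPE = (3, CLIP_LEN, 128, 171)  # channels x frames x height x width (slide 7)


def split_into_clips(num_frames, clip_len=CLIP_LEN):
    """Return (start, end) frame ranges for non-overlapping clips.

    Assumption: a trailing group of fewer than clip_len frames is dropped.
    """
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, clip_len)]


clips = split_into_clips(50)  # a hypothetical 50-frame video
print(clips)

# Number of values in one network input of shape 3x16x128x171.
values_per_clip = 1
for dim in INPUT_SHAPE:
    values_per_clip *= dim
print(values_per_clip)
```

For a 50-frame video this gives three clips and leaves the last two frames unused under the stated assumption.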

  8. Identify Best Architecture For 3D ConvNets (On UCF101) • Varying Network Architecture – Homogeneous temporal depth. • Depth-d networks for d = 1, 3, 5, 7 – Varying temporal depth. • Increasing : 3-3-5-5-7 • Decreasing : 7-5-5-3-3
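The architectures compared in the search can be written down as one temporal depth per convolution layer. The sketch below is only a bookkeeping aid: the dictionary keys and `weights_per_kernel` are hypothetical names, the 3x3 spatial kernel size is an assumption taken from the 3x3x3 kernel named on slide 10, and the decreasing schedule is written with five entries to match the five convolution layers.

```python
NUM_CONV_LAYERS = 5  # slide 7: 5 convolution and pooling layers

# Temporal kernel depth for each of the five convolution layers.
TEMPORAL_DEPTHS = {
    "depth-1": [1] * NUM_CONV_LAYERS,
    "depth-3": [3] * NUM_CONV_LAYERS,
    "depth-5": [5] * NUM_CONV_LAYERS,
    "depth-7": [7] * NUM_CONV_LAYERS,
    "increasing": [3, 3, 5, 5, 7],
    "decreasing": [7, 5, 5, 3, 3],
}


def weights_per_kernel(temporal_depth, spatial=3):
    """Weights in one d x k x k kernel slice, per input channel (assumes k = 3)."""
    return temporal_depth * spatial * spatial


for name, depths in TEMPORAL_DEPTHS.items():
    per_layer = [weights_per_kernel(d) for d in depths]
    print(name, depths, per_layer)
```

Note that depth-1 is equivalent to applying 2D convolution to each frame separately, which is why it serves as the 2D baseline in this kind of comparison.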

  9. 3D Convolution Kernel Temporal Depth Search

  10. Spatiotemporal Feature Learning • Best Network Architecture – With 3x3x3 kernel

  11. Spatiotemporal Feature Learning • Dataset for training – Sports-1M Dataset • Largest video classification benchmark • 1.1 million sports videos • 487 categories

  12. Sports-1M Classification Results

  13. C3D Video Descriptor • The C3D model can be used as a feature extractor for various video analysis tasks. – Action recognition – Action similarity – Scene and Object recognition • Uses fc6 activations – 4096 dimensions
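One plausible way to turn per-clip fc6 activations into a single video-level descriptor is to average them and L2-normalize the result. The slide does not spell out this step, so treat the pooling choice as an assumption; `video_descriptor` is a hypothetical helper, and the 2-dimensional toy vectors stand in for the 4096-dimensional fc6 activations.

```python
import math


def video_descriptor(clip_features):
    """Average clip-level feature vectors, then L2-normalize the mean.

    Assumption: all clips of a video contribute equally to its descriptor.
    """
    n = len(clip_features)
    dim = len(clip_features[0])
    mean = [sum(f[i] for f in clip_features) / n for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean))
    # Guard against an all-zero mean vector.
    return [v / norm for v in mean] if norm > 0 else mean


# Toy stand-in for fc6 activations of two clips from the same video.
toy_clips = [[3.0, 0.0], [3.0, 8.0]]
descriptor = video_descriptor(toy_clips)
print(descriptor)
```

The resulting fixed-length vector can then be fed to a simple classifier, which is what makes the descriptor usable across the tasks listed on the slide.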

  14. Action Recognition • Dataset : UCF101 – 13,320 videos – 101 human action categories

  15. Action Similarity Labeling • Dataset : ASLAN – 3,631 videos – 432 action classes

  16. Scene and Object Recognition • Dataset : YUPENN – 420 videos – 14 scene categories • Dataset : Maryland – 130 videos – 13 scene categories

  17. Why C3D Features? • Generic • Compact • Efficient • Simple

  18. What Does C3D Learn?

  19. Useful Links • http://vlg.cs.dartmouth.edu/c3d/ • https://github.com/facebook/C3D

  20. Learning Spatiotemporal Features with 3D Convolutional Networks Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri Çağdaş BAK 29.03.16
