

  1. Beyond Short Snippets: Deep Networks for Video Classification Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici Özge Yalçınkaya

  2. Introduction ✤ Many attempts have been made to apply CNNs to action recognition ✤ Video frames are treated as images and described with a CNN ✤ Per-frame predictions are averaged at the video level ✤ However, averaging discards the complete temporal information of the action
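The frame-averaging baseline described above can be sketched in a few lines of NumPy. This is an illustration only: `frame_probs` stands in for a CNN's per-frame softmax outputs, and all values are random placeholders.

```python
import numpy as np

# Hypothetical stand-in for per-frame CNN softmax outputs on one video.
rng = np.random.default_rng(0)
num_frames, num_classes = 30, 487                      # Sports-1M has 487 classes
frame_probs = rng.random((num_frames, num_classes))
frame_probs /= frame_probs.sum(axis=1, keepdims=True)  # normalize each frame

# Video-level prediction: average the frame-level class probabilities.
video_probs = frame_probs.mean(axis=0)
predicted_class = int(video_probs.argmax())
```

Averaging treats frames as an unordered bag, which is exactly the limitation the pooling and LSTM architectures below try to address.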

  3. Introduction ✤ For accurate video classification, learning a global description of the video’s temporal information is important ✤ Using an increasing number of frames improves classification ✤ Moreover, optical flow images may provide additional motion information

  4. Introduction ✤ Two approaches are introduced: ➡ Feature Pooling ➡ LSTM ✤ State-of-the-art performances on Sports-1M and UCF101 ✤ AlexNet and GoogLeNet are used

  5. Approach: Feature Pooling Architectures ✤ Conv Pooling: ➡ Performs max-pooling over the final CNN layer's outputs across all frames (blue) ➡ Feeds the pooled result to an FC layer (yellow)
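Conv Pooling can be sketched with NumPy as follows. Shapes and weights are illustrative stand-ins (not the paper's trained network): an element-wise max over the frame axis, followed by one FC layer.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, H = 30, 2048, 512                    # frames, conv feature dim, FC units
frame_feats = rng.standard_normal((T, D))  # stand-in for final conv-layer output per frame

# Conv Pooling: element-wise max across the T frames (the "blue" layer),
# then a fully connected layer on the pooled vector (the "yellow" layer).
pooled = frame_feats.max(axis=0)
W, b = rng.standard_normal((D, H)), np.zeros(H)  # hypothetical FC weights
fc_out = np.maximum(pooled @ W + b, 0.0)         # FC layer with ReLU
```

Because the max is taken before any FC layer, the network pools low-level spatial features over time and lets the FC layers reason about the combined result.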

  6. Approach: Feature Pooling Architectures ✤ Late Pooling: ➡ Performs max-pooling (blue) after two FC layers (yellow) ➡ Compared to Conv Pooling, it directly combines high-level information

  7. Approach: Feature Pooling Architectures ✤ Slow Pooling: ➡ First, max-pooling (blue) is applied over 10-frame windows after the CNN (like a size-10 temporal filter) ➡ Each window is followed by an FC layer (yellow) ➡ A single max-pooling combines the FC outputs ➡ Groups local features before combining high-level information
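The two-stage structure of Slow Pooling can be sketched as below, again with illustrative shapes and random stand-in weights; the sketch uses non-overlapping 10-frame windows and a shared FC layer for simplicity.

```python
import numpy as np

rng = np.random.default_rng(2)
T, D, H = 30, 2048, 512                    # frames, feature dim, FC units
frame_feats = rng.standard_normal((T, D))

# Stage 1: max-pool each 10-frame window (local temporal pooling, "blue").
windows = frame_feats.reshape(T // 10, 10, D).max(axis=1)   # (3, D)

# A shared FC layer ("yellow") on each window's pooled features.
W, b = rng.standard_normal((D, H)), np.zeros(H)             # hypothetical weights
window_out = np.maximum(windows @ W + b, 0.0)               # (3, H)

# Stage 2: a single max-pool combines the window-level outputs.
video_feat = window_out.max(axis=0)                         # (H,)
```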

  8. Approach: Feature Pooling Architectures ✤ Local Pooling: ➡ Combines frame-level features locally, as in Slow Pooling (blue) ➡ A softmax layer (orange) is connected to all FC layers (yellow) for the final prediction

  9. Approach: Feature Pooling Architectures ✤ Time-Domain Convolution: ➡ Extra time-domain conv layer (green) ➡ Max-pooling across frames in the temporal domain (blue) ➡ Captures local relationships between frames
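A time-domain convolution followed by temporal max-pooling can be sketched as a 1-D valid convolution over the frame axis. Kernel size and dimensions are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(3)
T, D, K = 30, 256, 3                      # frames, feature dim, temporal kernel size
frame_feats = rng.standard_normal((T, D))
kernel = rng.standard_normal((K, D))      # hypothetical per-dimension temporal filter

# 1-D "valid" convolution over time: each output mixes K adjacent frames,
# capturing local relationships between neighboring frames (the "green" layer).
conv_out = np.stack([(frame_feats[t:t + K] * kernel).sum(axis=0)
                     for t in range(T - K + 1)])   # (T - K + 1, D)

# Max-pool across the temporal outputs (the "blue" layer).
pooled = conv_out.max(axis=0)
```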

  10. Approach: Feature Pooling Architectures ✤ GoogLeNet Conv Pooling: ➡ Max-pooling is applied within the network ➡ This layer is then connected to the softmax layer ➡ The network is enhanced by adding FC layers

  11. Approach: LSTM Architecture

  12. Approach: LSTM Architecture The LSTM takes input from the CNN layer at each video frame. A softmax layer predicts the class at each time step.
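The recurrence above can be sketched as a single-cell LSTM stepping over per-frame CNN features. Sizes and randomly initialized weights are illustrative stand-ins; the paper's model is a deeper trained stack, and a softmax layer would read the hidden state at every step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
T, D, H = 30, 128, 64                       # frames, CNN feature dim, hidden size
feats = rng.standard_normal((T, D))         # stand-in per-frame CNN features
Wx = rng.standard_normal((D, 4 * H)) * 0.1  # input weights for i, f, o, g gates
Wh = rng.standard_normal((H, 4 * H)) * 0.1  # recurrent weights
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
hidden_states = []
for t in range(T):                          # one LSTM step per video frame
    z = feats[t] @ Wx + h @ Wh + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c + i * g                       # cell state carries temporal context
    h = o * np.tanh(c)
    hidden_states.append(h)                 # a softmax head would classify h here
hidden_states = np.stack(hidden_states)
```

Unlike the pooling architectures, the hidden state lets the model accumulate ordered temporal context across the whole clip.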

  13. Implementation Details ✤ Experiments done with both AlexNet and GoogLeNet ✤ Parameters are initialized from a pre-trained ImageNet model and fine-tuned on Sports-1M ✤ Single-frame networks are expanded to 30- and 120-frame models ✤ Optical flow images are used as an additional input

  14. Results: Sports-1M ✤ 1 million YouTube sports videos annotated with 487 classes ✤ 1000-3000 videos per class ✤ Optical flow quality varies wildly between videos ✤ The first 5 minutes of each video are sampled to obtain 300 frames (1 frame per second)

  15. Results: Sports-1M Feature-pooling architecture comparisons CNN network comparisons

  16. Results: Sports-1M Effect of the number of frames in model used in GoogLeNet Optical flow effect

  17. Results: Sports-1M Comparison with the work of Karpathy et al.

  18. Results: UCF-101 ✤ 13,320 videos with 101 activity classes ✤ More constrained camera movement; a cleaner, hand-curated dataset UCF-101 accuracy results for different frame numbers

  19. Results: UCF-101 State-of-the-art UCF-101 results

  20. Conclusion and Future Work ✤ Two video-classification methods that aggregate frame-level CNN outputs into a video-level prediction are presented ✤ Feature pooling and LSTM architectures for video classification are introduced ✤ Using optical flow is beneficial ✤ State-of-the-art results are obtained on two benchmark datasets ✤ Learning should take place over the entire video rather than short clips
