Beyond Short Snippets: Deep Networks for Video Classification Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici Presented by Özge Yalçınkaya
Introduction ✤ Many attempts have been made to apply CNNs to action recognition ✤ Video frames are treated as still images, with a CNN producing per-frame descriptions ✤ Per-frame predictions are averaged at the video level (a minimal sketch of this baseline follows) ✤ However, averaging discards temporal ordering, so complete action information is missing
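As a hedged illustration of this frame-averaging baseline (not the authors' code), the sketch below assumes a generic `cnn` module that maps a batch of frames to class logits, classifies each frame independently, and averages the softmax scores over time:

```python
import torch
import torch.nn.functional as F

def video_prediction(cnn, frames):
    """Frame-averaging baseline: frames is a (T, 3, H, W) tensor of T frames."""
    logits = cnn(frames)              # (T, num_classes) per-frame class logits
    probs = F.softmax(logits, dim=1)  # per-frame class probabilities
    return probs.mean(dim=0)          # average over time -> (num_classes,)
```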
Introduction ✤ For accurate video classification, learning a global description of the video's temporal information is important ✤ Using an increasing number of frames improves classification ✤ Moreover, optical flow images may provide additional motion information
Introduction ✤ Two approaches are introduced: ➡ Feature Pooling ➡ LSTM ✤ State-of-the-art performance on Sports-1M and UCF-101 ✤ AlexNet and GoogLeNet are used as the base CNNs
Approach: Feature Pooling Architectures ✤ Conv Pooling: ➡ Performs max-pooling over the final convolutional-layer features across all frames (blue) ➡ Feeds the pooled features to FC layers (yellow), as sketched below
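A minimal PyTorch sketch of Conv Pooling, assuming per-frame features from the final CNN layer are already flattened to `feat_dim`; 4096 and the 487 Sports-1M classes are placeholder sizes:

```python
import torch
import torch.nn as nn

class ConvPooling(nn.Module):
    """Max-pool the final CNN-layer features across time, then classify."""
    def __init__(self, feat_dim=4096, num_classes=487):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, frame_feats):         # (T, feat_dim), one row per frame
        pooled, _ = frame_feats.max(dim=0)  # element-wise max over all frames
        return self.fc(pooled)              # video-level class scores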
Approach: Feature Pooling Architectures ✤ Late Pooling: ➡ Performs max-pooling (blue) after two FC layers (yellow), as sketched below ➡ Compared to Conv Pooling, it directly combines high-level information across frames
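The same sketch reordered for Late Pooling: the FC stack runs per frame first and the max-pool comes last (sizes again are assumptions):

```python
import torch
import torch.nn as nn

class LatePooling(nn.Module):
    """Run two FC layers on each frame, then max-pool across time."""
    def __init__(self, feat_dim=4096, num_classes=487):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.classifier = nn.Linear(4096, num_classes)

    def forward(self, frame_feats):       # (T, feat_dim)
        per_frame = self.fc(frame_feats)  # high-level features per frame
        pooled, _ = per_frame.max(dim=0)  # max-pool after the FC layers
        return self.classifier(pooled)
```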
Approach: Feature Pooling Architectures ✤ Slow Pooling: ➡ First, max-pooling (blue) is applied over 10-frame windows on top of the CNN features (like a size-10 temporal filter) ➡ Each pooled window is followed by an FC layer (yellow) ➡ A single max-pooling layer then combines the FC outputs ➡ Groups local temporal features before combining high-level information (sketch below)
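A sketch of Slow Pooling, under the assumption of non-overlapping 10-frame windows (the window stride is not specified on the slide):

```python
import torch
import torch.nn as nn

class SlowPooling(nn.Module):
    """Max-pool 10-frame windows, apply an FC layer per window, then
    combine the window outputs with a final max-pool."""
    def __init__(self, feat_dim=4096, hidden=4096, num_classes=487, window=10):
        super().__init__()
        self.window = window
        self.fc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats):  # (T, feat_dim), T divisible by window
        T, D = frame_feats.shape
        groups = frame_feats.view(T // self.window, self.window, D)
        local, _ = groups.max(dim=1)  # local max-pool per 10-frame window
        h = self.fc(local)            # FC layer on each pooled window
        video, _ = h.max(dim=0)       # single max-pool combines the outputs
        return self.classifier(video)
```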
Approach: Feature Pooling Architectures ✤ Local Pooling: ➡ Combines frame-level features locally, as in Slow Pooling (blue) ➡ A softmax (orange) layer is connected to all FC (yellow) layers for the final prediction (sketch below)
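A sketch of Local Pooling with assumed sizes: frames are pooled in local windows as above, but the final classifier is connected to the concatenation of all the window-level FC outputs:

```python
import torch
import torch.nn as nn

class LocalPooling(nn.Module):
    """Pool frames locally, run an FC layer per window, and connect one
    classifier to all window outputs at once."""
    def __init__(self, feat_dim=4096, hidden=4096, num_classes=487,
                 window=10, num_windows=3):
        super().__init__()
        self.window = window
        self.fc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # the softmax classifier sees every window's FC output, concatenated
        self.classifier = nn.Linear(hidden * num_windows, num_classes)

    def forward(self, frame_feats):  # (window * num_windows, feat_dim)
        D = frame_feats.shape[1]
        groups = frame_feats.view(-1, self.window, D)
        local, _ = groups.max(dim=1)         # local max-pool (blue)
        h = self.fc(local)                   # (num_windows, hidden)
        return self.classifier(h.flatten())  # final prediction (orange)
```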
Approach: Feature Pooling Architectures ✤ Time-Domain Convolution: ➡ An extra time-domain convolutional layer (green) ➡ Max-pooling across frames in the temporal domain (blue) ➡ Captures local relationships between neighboring frames (sketch below)
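A sketch of Time-Domain Convolution with an assumed kernel size and channel count: a 1-D convolution over the time axis models local inter-frame structure before max-pooling across the remaining frames:

```python
import torch
import torch.nn as nn

class TimeDomainConv(nn.Module):
    """Temporal 1-D convolution (green), then max-pool across frames (blue)."""
    def __init__(self, feat_dim=4096, channels=256, num_classes=487, kernel=3):
        super().__init__()
        self.temporal_conv = nn.Conv1d(feat_dim, channels, kernel_size=kernel)
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, frame_feats):             # (T, feat_dim)
        x = frame_feats.t().unsqueeze(0)        # (1, feat_dim, T) for Conv1d
        x = torch.relu(self.temporal_conv(x))   # local relations between frames
        pooled, _ = x.max(dim=2)                # max-pool over the temporal axis
        return self.classifier(pooled.squeeze(0))
```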
Approach: Feature Pooling Architectures ✤ GoogLeNet Conv Pooling: ➡ Max-pooling across frames is applied inside the network ➡ The pooled layer is then connected to the softmax layer ➡ The model can be enhanced by adding FC layers before the softmax (sketch below)
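A sketch of the GoogLeNet variant, assuming the 1024-dimensional features GoogLeNet produces after its average-pooling layer; the optional FC head stands in for the "enhancement" the slide mentions:

```python
import torch
import torch.nn as nn

class GoogLeNetConvPooling(nn.Module):
    """Max-pool GoogLeNet frame features across time inside the network and
    connect the result to the classifier, optionally via extra FC layers."""
    def __init__(self, feat_dim=1024, num_classes=487, fc_hidden=None):
        super().__init__()
        if fc_hidden is None:
            self.head = nn.Linear(feat_dim, num_classes)  # straight to softmax
        else:
            self.head = nn.Sequential(                    # enhanced with FC layers
                nn.Linear(feat_dim, fc_hidden), nn.ReLU(),
                nn.Linear(fc_hidden, num_classes),
            )

    def forward(self, frame_feats):         # (T, feat_dim) GoogLeNet features
        pooled, _ = frame_feats.max(dim=0)  # in-network max-pool across frames
        return self.head(pooled)            # softmax is applied by the loss
```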
Approach: LSTM Architecture
Approach: LSTM Architecture The LSTM takes the CNN's output at each video frame as input. A softmax layer predicts the class at every time step, as in the sketch below.
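A minimal sketch of the recurrent model; the LSTM depth and hidden size are placeholders, and `frame_feats` stands in for the per-frame CNN outputs:

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """An LSTM consumes one CNN feature vector per frame; a linear layer
    (followed by softmax) scores the classes at every time step."""
    def __init__(self, feat_dim=4096, hidden=512, num_classes=487, layers=5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats):                       # (T, feat_dim)
        outputs, _ = self.lstm(frame_feats.unsqueeze(1))  # (T, 1, hidden)
        return self.classifier(outputs.squeeze(1))        # (T, num_classes)
```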
Implementation Details ✤ Experiments are done with both AlexNet and GoogLeNet ✤ Parameters are initialized from a model pre-trained on ImageNet and fine-tuned on Sports-1M ✤ Single-frame networks are expanded to 30-frame and 120-frame models ✤ Optical flow images are used as an additional input (a hedged example follows)
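A hedged example of producing optical flow images offline with OpenCV's Farneback method; the algorithm choice and parameters are illustrative assumptions, not the authors' exact pipeline:

```python
import cv2
import numpy as np

def flow_image(prev_bgr, next_bgr):
    """Compute dense optical flow between two frames and rescale it so the
    2-channel (dx, dy) field can be fed to a CNN like an image."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2)
    return cv2.normalize(flow, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```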
Results: Sports-1M ✤ 1 million YouTube sports videos annotated with 487 classes ✤ 1000–3000 videos per class ✤ Optical flow quality varies widely between videos ✤ The first 5 minutes of each video are sampled to obtain 300 frames (1 frame per second)
Results: Sports-1M [tables: feature-pooling architecture comparisons; CNN network comparisons]
Results: Sports-1M [tables: effect of the number of frames used with GoogLeNet; effect of optical flow]
Results: Sports-1M [table: comparison with the work of Karpathy et al.]
Results: UCF-101 ✤ 13,320 videos with 101 activity classes ✤ More constrained camera movement; a hand-curated dataset [table: UCF-101 accuracy for different numbers of frames]
Results: UCF-101 [table: state-of-the-art UCF-101 results]
Conclusion and Future Work ✤ Two video-classification methods, feature pooling and LSTM, that aggregate frame-level CNN outputs into video-level predictions are presented ✤ Using optical flow is beneficial ✤ State-of-the-art results are obtained on two benchmark datasets ✤ Learning should take place over the entire video rather than short clips