AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification
Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua
Convolutional networks are dominant
• C3D [ICCV 2015] • I3D [CVPR 2017] • S3D [ECCV 2018] • SlowFast [ICCV 2019]
What’s missing from convolution?
• Where to focus in images/videos: the same convolutional kernel is applied at every position.
• Long-range dependencies: these are modeled only by large receptive fields.
Photo credit: [Convolution arithmetic] [Receptive field arithmetic]
Attention is complementary to convolution
• Map-based Attention (CBAM [ECCV 2018]): where to focus. Learn a pointwise weighting factor for each position.
• Dot-product Attention (Attention Is All You Need [NeurIPS 2017]): long-range dependencies. Compute pairwise similarity between all the positions.
Challenge: many design choices need to be determined to apply attention to videos
• What is the right dimension to apply attention to videos? Three dimensions in video data: spatial, temporal, or spatiotemporal.
• How to compose multiple attention operations? Sequential, parallel, or others?
[Figure: sequential composition (Spatial Attention followed by Temporal Attention) vs. parallel composition (Spatial and Temporal Attention side by side)]
Proposal: automatically search for attention cells in a data-driven manner
• Novel attention cell search space
• Efficient differentiable search method
[Figure: a searched attention cell (Op1, Op2, Op3 feeding a Combine node) and the search supergraph of spatial/temporal attention nodes with a Sink Node]
Attention Cell Search Space
Attention cell:
• Composed of multiple attention operations
• Input shape == output shape; can be inserted anywhere in existing backbones
Search space:
• Cell-level search space: connectivity between the operations within the cell
• Operation-level search space: choices to instantiate an individual attention operation
Cell-Level Search Space: select the input to each operation
• Input to the 1st operation is fixed to the input of the cell
• Input to the k-th operation is a weighted sum of selected feature maps from the cell input and the outputs of previous operations
• Combine node: concatenate channels + CONV
[Figure: Op1, Op2, Op3 feeding a Combine node; the cell input at the bottom, the cell output at the top]
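A minimal sketch of this cell-level wiring in PyTorch (an assumption; the slide gives no implementation, so the edge parameterization, normalization, and all names here are illustrative):

```python
import torch
import torch.nn as nn

class AttentionCell(nn.Module):
    """Cell-level wiring sketch: op 1 reads the cell input; op k reads a
    weighted sum of the cell input and the outputs of ops 1..k-1; the
    Combine node concatenates all op outputs on channels and projects back
    to the input width, so output shape == input shape."""

    def __init__(self, channels, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)  # each op maps (B,C,T,H,W) -> (B,C,T,H,W)
        # one learnable weight per (operation, candidate input) edge (assumed)
        self.edge_w = nn.ParameterList(
            [nn.Parameter(torch.ones(k + 1)) for k in range(len(ops))])
        self.combine = nn.Conv3d(channels * len(ops), channels, kernel_size=1)

    def forward(self, x):
        feats = [x]  # candidate inputs so far: cell input + earlier op outputs
        for op, w in zip(self.ops, self.edge_w):
            alpha = torch.softmax(w, dim=0)            # normalize edge weights
            feats.append(op(sum(a * f for a, f in zip(alpha, feats))))
        # Combine node: concatenate channels of all op outputs + CONV
        return self.combine(torch.cat(feats[1:], dim=1))
```

For op 1 the softmax runs over a single edge, so its input reduces to the cell input, matching the fixed connection on the slide.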
Operation-Level Search Space
• Attention Dimension: 1. Spatial 2. Temporal 3. Spatiotemporal
• Attention Operation Type: Map-based Attention, Dot-product Attention
Map-based Attention and Dot-product Attention (assume attention dimension = temporal)
• Map-based Attention: where to focus. Learn a pointwise weighting factor for each position.
• Dot-product Attention: long-range dependencies. Compute pairwise similarity between all the positions.
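The slide describes both operation types only at a high level. Below is one plausible PyTorch instantiation with the attention dimension set to temporal, as the slide assumes; the projections (`score`, `q_proj`, `k_proj`, `v_proj`) and the choice of sigmoid (one of the searched activation functions) are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def map_based_temporal(x, score):
    """Map-based attention along time: learn one weighting factor per frame
    ("where to focus") and rescale the features.
    x: (B, C, T, H, W); score: e.g. nn.Linear(C, 1) on frame descriptors."""
    ctx = x.mean(dim=(3, 4)).transpose(1, 2)      # (B, T, C) frame descriptors
    w = torch.sigmoid(score(ctx))                 # (B, T, 1) per-frame weights
    return x * w.transpose(1, 2).unsqueeze(-1).unsqueeze(-1)  # broadcast

def dot_product_temporal(x, q_proj, k_proj, v_proj):
    """Dot-product attention along time: each spatial location attends over
    all T frames at that location (pairwise similarity between positions)."""
    b, c, t, h, w = x.shape
    seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)  # fold space into batch
    q, k, v = q_proj(seq), k_proj(seq), v_proj(seq)          # e.g. nn.Linear(C, C)
    attn = F.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)  # (B*H*W, T, T)
    out = (attn @ v).reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
    return x + out                                           # residual connection
```

Swapping the temporal axis for the spatial one (or attending over all T·H·W positions at once) gives the spatial and spatiotemporal variants of the same two operations.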
Search Space Summary
• Attention Dimension: Spatial, Temporal, Spatiotemporal
• Attention Operation Type: Map-based attention, Dot-product attention
• Activation Function: None, ReLU, Softmax, Sigmoid
• Connectivity between Operations: input to each operation
Insert Attention Cells into Backbone Networks
Since a cell's output shape matches its input shape, cells can be inserted between the convolutional layers of an existing backbone.
[Figure: Convolutional Layers, then Attention Cell, then Convolutional Layers]
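Because the cell is shape-preserving, inserting it is a purely mechanical edit to the backbone. A hedged sketch (all names are illustrative; the slide does not prescribe an API):

```python
import torch.nn as nn

def insert_cells(backbone_stages, make_cell, positions):
    """Interleave attention cells with backbone stages. Since a cell's
    output shape equals its input shape, it can be dropped in anywhere
    without changing the rest of the network."""
    layers = []
    for i, stage in enumerate(backbone_stages):
        layers.append(stage)
        if i in positions:               # e.g. after 5 chosen stages
            layers.append(make_cell())   # shape-preserving attention cell
    return nn.Sequential(*layers)
```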
Differentiable Formulation of the Search Space
• Search algorithm: differentiable architecture search
• Search cost: equal to the cost of training one network
[Figure: search supergraph. Legend: solid connection (no weights); level connection weights; sink connection weights; spatial/temporal map-based and dot-product attention nodes]
Supergraph and Connection Weights
• Supergraph: L levels; each level contains N nodes
• Node: an attention operation of a pre-defined attention dimension and type
[Figure: supergraph from Input to Sink Node. Legend: solid connection (no weights); level connection weights; sink connection weights; map-based and dot-product attention nodes]
Differentiable Search
• Jointly train the network weights and the connection weights with gradient descent
[Figure: the supergraph inserted between the convolutional layers of the backbone]
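A sketch of the continuous relaxation behind this (hypothetical PyTorch; the exact normalization of the connection weights is an assumption). Each node at level l > 0 reads a weighted sum of the previous level's outputs via the level connection weights, and the sink node mixes every node's output via the sink connection weights:

```python
import torch
import torch.nn as nn

class Supergraph(nn.Module):
    """Relaxed supergraph sketch: num_levels levels, each holding the same
    set of candidate attention nodes. Level 0 reads the input directly
    (solid connections); later levels mix the previous level via level
    connection weights; the sink node mixes all nodes via sink weights."""

    def __init__(self, make_level_ops, num_levels):
        super().__init__()
        self.levels = nn.ModuleList(
            nn.ModuleList(make_level_ops()) for _ in range(num_levels))
        n = len(self.levels[0])
        self.level_w = nn.Parameter(torch.zeros(num_levels, n, n))
        self.sink_w = nn.Parameter(torch.zeros(num_levels * n))

    def forward(self, x):
        outs, prev = [], None
        for l, ops in enumerate(self.levels):
            if l == 0:
                inputs = [x] * len(ops)                # solid connections
            else:
                alpha = torch.softmax(self.level_w[l], dim=-1)
                inputs = [sum(a * p for a, p in zip(alpha[j], prev))
                          for j in range(len(ops))]
            prev = [op(inp) for op, inp in zip(ops, inputs)]
            outs.extend(prev)
        beta = torch.softmax(self.sink_w, dim=0)       # sink connection weights
        return sum(b * o for b, o in zip(beta, outs))  # sink node output
```

Since `model.parameters()` already includes `level_w` and `sink_w`, a single optimizer over all parameters updates the architecture and the network weights in the same backward pass, which is why the search costs roughly as much as training one network.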
Attention Cell Design Derivation
How to derive the attention cell design from the learned connection weights?
[Figure: supergraph with the legend as above]
Attention Cell Design Derivation (step 1)
Choose the top k (e.g., 3) nodes based on the sink connection weights.
Attention Cell Design Derivation (step 2)
Recursively choose the top m (e.g., 2) predecessors of each selected node based on the level connection weights, until we reach the first level.
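The two-step derivation is easy to express in code. A sketch under the weight conventions implied by the slides (names and the exact edge-weight layout are assumptions):

```python
import torch

def derive_cell(level_w, sink_w, top_nodes=3, top_preds=2):
    """Keep the top `top_nodes` nodes by sink connection weight, then
    recursively keep the top `top_preds` predecessors of every kept node
    by level connection weight until the first level is reached.
    level_w: (L, N, N), where level_w[l, j, i] weights the edge from
    node i at level l-1 to node j at level l; sink_w: (L, N).
    Returns the kept (level, node) pairs."""
    L, N = sink_w.shape
    flat = sink_w.flatten().topk(top_nodes).indices
    kept = {(int(i) // N, int(i) % N) for i in flat}
    frontier = list(kept)
    while frontier:
        level, node = frontier.pop()
        if level == 0:
            continue                 # first level: reads the cell input
        for p in level_w[level, node].topk(top_preds).indices:
            key = (level - 1, int(p))
            if key not in kept:
                kept.add(key)
                frontier.append(key)
    return sorted(kept)
```

The defaults mirror the slides' examples (top 3 nodes, top 2 predecessors per node).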
Attention Cell Design Derivation (result)
[Figure: the derived attention cell, composed of the selected spatial/temporal map-based and dot-product attention operations feeding a Combine node]
Experimental Setup
• Backbones: Inception-based I3D [CVPR 2017] and S3D [ECCV 2018]; insert 5 attention cells into each
• Datasets: Kinetics-600 and Moments in Time (MiT)
Comparison with Non-local Blocks
Generalization across Modalities: RGB to optical flow
Generalization across Backbones
Generalization across Datasets
Comparison with State-of-the-art
Contributions
• Extend NAS beyond discovering convolutional cells to attention cells
• A search space for spatiotemporal attention cells
• A differentiable formulation of the search space
• State-of-the-art performance; outperforms non-local blocks
• Strong generalization across modalities, backbones, and datasets
• More analysis and visualizations of attention cells are available in the paper