AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification - PowerPoint PPT Presentation




  • AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua

  • Convolutional networks are dominant: C3D [ICCV 2015], I3D [CVPR 2017], S3D [ECCV 2018], SlowFast [ICCV 2019]

  • What’s missing from convolution?
    • Where to focus in images/videos: the same convolutional kernel is applied at every position.
    • Long-range dependencies: they are modeled only by large receptive fields. Photo credit: [Convolution arithmetic] [Receptive field arithmetic]

  • Attention is complementary to convolution
    • Map-based Attention (CBAM [ECCV 2018]). Where to focus: learn a pointwise weighting factor for each position.
    • Dot-product Attention (Attention is All You Need [NeurIPS 2017]). Long-range dependencies: compute pairwise similarity between all the positions.
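The two attention families above can be sketched on a toy feature matrix. This is a minimal illustrative sketch, not the paper's code: `map_based_attention` and `dot_product_attention` are hypothetical names, and the sigmoid-of-a-linear-map scoring is an assumption in the spirit of CBAM.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def map_based_attention(x, w):
    # "Where to focus": one pointwise weighting factor per position,
    # here a sigmoid of a learned linear map (w: (channels,)) -- an assumption.
    scores = 1.0 / (1.0 + np.exp(-(x @ w)))   # (n,) in (0, 1)
    return x * scores[:, None]                # reweight each position

def dot_product_attention(x):
    # "Long-range dependencies": pairwise similarity between all positions,
    # so every output position mixes information from every input position.
    sim = softmax(x @ x.T / np.sqrt(x.shape[1]))  # (n, n) attention matrix
    return sim @ x

np.random.seed(0)
x = np.random.randn(6, 4)   # 6 positions, 4 channels
w = np.random.randn(4)
```

Both operations preserve the input shape, which matters later: it is what lets attention cells be dropped into a backbone anywhere.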

  • Challenge: many design choices need to be determined to apply attention to videos
    • What is the right dimension to apply attention to? Three dimensions in video data: spatial, temporal, or spatiotemporal?
    • How to compose multiple attention operations? Sequential, parallel, or others?

  • Proposal: automatically search for attention cells in a data-driven manner
    • Novel attention cell search space
    • Efficient differentiable search method

  • Attention Cell Search Space
    • An attention cell is composed of multiple attention operations
    • Input shape == output shape; a cell can be inserted anywhere in existing backbones
    • Cell-level search space: connectivity between the operations within the cell
    • Operation-level search space: choices to instantiate an individual attention operation

  • Cell-Level Search Space: select the input to each operation
    • Input to the 1st operation is fixed to the input to the cell
    • Input to the k-th operation is a weighted sum of feature maps selected from the cell input and the outputs of earlier operations
    • Combine: concatenate channels + CONV to produce the output of the cell
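The cell-level wiring can be sketched as follows. This is an illustrative sketch, assuming (per the slide) that an operation reads a weighted sum of earlier feature maps and that Combine concatenates channels and projects back; `op_input`, `combine`, and the matrix-multiply stand-in for the CONV are hypothetical names, not the paper's code.

```python
import numpy as np

def op_input(feature_maps, weights):
    # Weighted sum of the selected feature maps (all shapes equal).
    return sum(w * f for w, f in zip(weights, feature_maps))

def combine(feature_maps, proj):
    # Concatenate along channels, then project back to the original
    # channel count (proj: (k*C, C) stands in for a 1x1 CONV).
    cat = np.concatenate(feature_maps, axis=-1)   # (..., k*C)
    return cat @ proj                             # (..., C)

fmaps = [np.ones((2, 3, 4)) for _ in range(2)]   # two equal-shape maps
mixed = op_input(fmaps, [0.25, 0.75])            # input to a later op
combined = combine(fmaps, np.random.randn(8, 4)) # cell output, shape (2, 3, 4)
```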

  • Operation-Level Search Space
    • Attention dimension: 1. Spatial, 2. Temporal, 3. Spatiotemporal
    • Attention operation type: Map-based Attention or Dot-product Attention

  • Map-based Attention and Dot-product Attention (assume attention dimension = temporal)
    • Map-based attention. Where to focus: learn a pointwise weighting factor for each position.
    • Dot-product attention. Long-range dependencies: compute pairwise similarity between all the positions.
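To make "attention dimension = temporal" concrete, here is a minimal sketch of map-based attention applied along the time axis of a video tensor. The spatial mean-pooling used to score each frame is my assumption for illustration; `temporal_map_attention` is a hypothetical name.

```python
import numpy as np

def temporal_map_attention(video, w):
    # video: (T, H, W, C); learn one weighting factor per frame (temporal position).
    pooled = video.mean(axis=(1, 2))              # (T, C): spatially pooled descriptor
    scores = 1.0 / (1.0 + np.exp(-(pooled @ w)))  # (T,): sigmoid weight per frame
    return video * scores[:, None, None, None]    # reweight whole frames
```

With a spatial attention dimension, the same idea would instead produce one weight per (H, W) location shared across frames.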

  • Search Space Summary
    • Attention dimension: Spatial, Temporal, or Spatiotemporal
    • Attention operation type: Map-based attention or Dot-product attention
    • Activation function: None, ReLU, Softmax, or Sigmoid
    • Connectivity between operations: input to each operation; Combine

  • Insert Attention Cells into Backbone Networks: the attention cell is inserted between the convolutional layers of the backbone
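Because a cell's output shape equals its input shape, insertion reduces to splicing a shape-preserving module between layers. A minimal sketch, with hypothetical names (`with_cells`, `make_cell`) and plain callables standing in for network layers:

```python
def with_cells(layers, make_cell, insert_after):
    # layers: list of callables applied in sequence;
    # make_cell(): builds a fresh shape-preserving attention cell;
    # insert_after: indices of layers after which to insert a cell.
    out = []
    for i, layer in enumerate(layers):
        out.append(layer)
        if i in insert_after:
            out.append(make_cell())
    return out

# Toy usage: two "layers" and an identity "cell" inserted after layer 0.
net = with_cells([lambda v: v + 1, lambda v: v * 2],
                 lambda: (lambda v: v), {0})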

  • Differentiable Formulation of the Search Space
    • Search algorithm: differentiable architecture search
    • Search cost: equal to the cost of training one network
    [Figure: supergraph of spatial/temporal, map-based/dot-product attention nodes feeding a sink node; solid connections carry no weights; level connection weights and sink connection weights are learned]

  • Supergraph and Connection Weights
    • Supergraph: L levels; each level has N nodes
    • Node: an attention operation of a pre-defined attention dimension and type
    • Connections into the sink node carry sink connection weights; connections between levels carry level connection weights; solid connections carry no weights
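The continuous relaxation behind the supergraph can be sketched as a softmax-weighted mixture: each node consumes a soft combination of candidate predecessor outputs, so the connection weights become ordinary differentiable parameters. A minimal sketch with illustrative names (`soft_node_input`, `alpha`), not the paper's implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def soft_node_input(prev_outputs, alpha):
    # prev_outputs: equal-shape outputs of the candidate predecessor nodes;
    # alpha: raw connection weights, one per candidate edge.
    # The softmax makes the mixture a smooth function of alpha,
    # so alpha can be trained with gradient descent.
    p = softmax(alpha)
    return sum(pi * o for pi, o in zip(p, prev_outputs))
```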

  • Differentiable Search
    • Jointly train the network weights and the connection weights with gradient descent
    • During the search, the supergraph sits between the convolutional layers of the backbone
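A toy illustration of what "train the connection weights with gradient descent" means: two candidate node outputs, and we descend on the connection weights so the softmax mixture matches a target. Entirely illustrative (hand-derived softmax gradient, scalar "features"), not the paper's training code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

a, b = 1.0, -1.0        # outputs of two candidate predecessor nodes
target = a              # pretend the first path carries the useful signal
alpha = np.zeros(2)     # raw connection weights
lr = 1.0
for _ in range(200):
    p = softmax(alpha)
    out = p[0] * a + p[1] * b
    grad_out = 2 * (out - target)                # d loss / d out, loss = (out - target)^2
    grad_p = np.array([grad_out * a, grad_out * b])
    grad_alpha = p * (grad_p - p @ grad_p)       # softmax Jacobian, chain rule
    alpha -= lr * grad_alpha
# After training, nearly all weight sits on the useful path.
```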

  • Attention Cell Design Derivation
    • How to derive the attention cell design from the learned connection weights?

  • Attention Cell Design Derivation
    • Choose the top k (e.g., 3) nodes based on the sink connection weights

  • Attention Cell Design Derivation
    • Choose the top m (e.g., 2) predecessors of each selected node recursively based on the level connection weights, until we reach the first level
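The two derivation steps above (top-k sink nodes, then top-m predecessors recursively) can be sketched as a small pruning routine. The graph encoding and the name `derive_cell` are illustrative assumptions, not the paper's code.

```python
import numpy as np

def derive_cell(sink_w, level_w, k=3, m=2):
    # sink_w: (N,) sink connection weights for the last level's N nodes.
    # level_w: per level, a matrix whose row i holds the level connection
    #          weights from the previous level's nodes into node i.
    keep = set(np.argsort(sink_w)[-k:])            # top-k nodes by sink weight
    cell = []
    for lw in reversed(level_w):                   # walk back toward level 1
        preds = set()
        for i in keep:
            preds |= set(np.argsort(lw[i])[-m:])   # top-m predecessors of node i
        cell.append(sorted(int(i) for i in keep))
        keep = preds
    cell.append(sorted(int(i) for i in keep))
    return cell[::-1]                              # kept node ids, level by level
```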

  • Attention Cell Design Derivation
    [Figure: the derived cell, with the selected spatial/temporal, map-based/dot-product attention operations and their kept connections feeding a Combine node]

  • Experimental Setup
    • Backbones: Inception-based I3D [CVPR 2017] and S3D [ECCV 2018]
    • Insert 5 cells into each backbone
    • Datasets: Kinetics-600 and Moments in Time (MiT)

  • Comparison with Non-local Blocks

  • Generalization across Modalities RGB to optical flow

  • Generalization across Backbones

  • Generalization across Datasets

  • Comparison with State-of-the-art

  • Contributions
    • Extend NAS beyond discovering convolutional cells to attention cells
    • A search space for spatiotemporal attention cells
    • A differentiable formulation of the search space
    • State-of-the-art performance; outperforms non-local blocks
    • Strong generalization across modalities, backbones, and datasets
    • More analysis and visualizations of attention cells are available in the paper