Gate-Shift Networks for Video Action Recognition
Swathikiran Sudhakaran (1), Sergio Escalera (2,3), Oswald Lanz (1)
(1) Fondazione Bruno Kessler, Italy; (2) Computer Vision Center, Spain; (3) Universitat de Barcelona, Spain
Motivation
Video action recognition requires spatio-temporal reasoning.
Example actions (video frames omitted): "Putting something similar to other things that are already on the table" and "Taking one of many similar things on the table".
Contribution
The large number of parameters in 3D CNNs requires large-scale annotated data for training.
Existing approaches address this problem with a hard-wired decomposition of the 3D kernels, which is suboptimal.
The Gate-Shift Module (GSM) leverages spatial gating for adaptive feature propagation.
[Figure: kernel structure over space (H x W), time (T), and channels (C) for C3D / S3D, R(2+1)D, TSM, and GSM]
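To make the parameter argument concrete, the following minimal sketch counts the weights of a single layer under three kernel layouts: a full 3D kernel, a factorized spatial-plus-temporal pair in the style of R(2+1)D, and a purely spatial kernel as used alongside a parameter-free temporal shift. The channel width of 64, the kernel size of 3, and the choice of an intermediate width equal to the channel count in the factorized variant are illustrative assumptions, not values from the slides; the published R(2+1)D design may size the intermediate layer differently.

```python
def conv_weights(c_in, c_out, kt, kh, kw):
    """Number of weights in a convolution kernel (bias terms ignored)."""
    return c_in * c_out * kt * kh * kw


def count_variants(channels, k=3):
    """Weight counts for one layer under three kernel layouts.

    Assumption: the factorized variant keeps `channels` as its
    intermediate width, which is a simplification for illustration.
    """
    full_3d = conv_weights(channels, channels, k, k, k)
    spatial = conv_weights(channels, channels, 1, k, k)
    temporal = conv_weights(channels, channels, k, 1, 1)
    return {
        "full 3D (k x k x k)": full_3d,
        "factorized (1 x k x k, then k x 1 x 1)": spatial + temporal,
        "spatial only (1 x k x k) + shift": spatial,
    }


counts = count_variants(64)
for name, n in counts.items():
    print(f"{name}: {n} weights")
```

Under these assumptions the full 3D kernel needs three times as many weights as the spatial-only kernel, which is the cost that decomposition approaches try to avoid.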
GSM learns a flexible, data-dependent decomposition of 3D kernels with reduced parameter count and computational overhead.
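One way to picture gating combined with temporal shifting is the simplified sketch below. It is an illustration under stated assumptions, not the authors' implementation: features are reduced to a single spatial position with shape [T][C], the gate logits are passed in directly (in a real module they would come from a learned convolution), the gate is a tanh, and the first half of the channels is shifted forward in time while the second half is shifted backward. The function name `gate_shift` is hypothetical.

```python
import math


def gate_shift(x, gate_logits):
    """Simplified gate-and-shift over features of shape [T][C].

    Assumptions (for illustration only): one spatial position,
    externally supplied gate logits, tanh gating, zero padding at
    the temporal boundaries.
    """
    T, C = len(x), len(x[0])
    half = C // 2

    # Split each feature into a gated part and a residual part.
    gated = [[math.tanh(gate_logits[t][c]) * x[t][c] for c in range(C)]
             for t in range(T)]
    resid = [[x[t][c] - gated[t][c] for c in range(C)] for t in range(T)]

    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            # First half of the channels reads from the previous frame,
            # second half reads from the next frame.
            src = t - 1 if c < half else t + 1
            shifted = gated[src][c] if 0 <= src < T else 0.0
            # Fuse the time-shifted gated part with the unshifted residual.
            out[t][c] = shifted + resid[t][c]
    return out


frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # T=3, C=2
closed = gate_shift(frames, [[0.0, 0.0]] * 3)      # gate closed
opened = gate_shift(frames, [[20.0, 20.0]] * 3)    # gate nearly fully open
```

With the gate closed, the features pass through unchanged, so the layer behaves like a purely spatial one. With the gate nearly open, almost all of each feature is routed to the neighbouring frame, which resembles a plain temporal shift. Intermediate gate values mix the two, and because the gate depends on the input, the mixing is data dependent.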
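Effectiveness of GSM
Ablation study on Something-Something-V1 (Sth-V1):
TSN: 10.45M parameters, 16.37G FLOPs
Gate-Shift TSN: 10.50M parameters, 16.46G FLOPs
Improvement with GSM: +29%
[Figure: TSN vs. Gate-Shift TSN, with example classes "Putting sth similar to other things that are already on the table" and "Unfolding sth"]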
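The relative cost of adding GSM can be worked out from the figures on this slide. The short sketch below assumes that 10.45M / 16.37G belong to the TSN baseline, that 10.50M / 16.46G belong to Gate-Shift TSN, and that "M" and "G" denote millions of parameters and giga floating-point operations; the slide text does not label them explicitly.

```python
def relative_overhead(baseline, with_module):
    """Relative increase over the baseline, in percent."""
    return (with_module - baseline) / baseline * 100.0


# Figures from the slide; the model assignment is an assumption.
tsn_params_m, gsm_params_m = 10.45, 10.50
tsn_flops_g, gsm_flops_g = 16.37, 16.46

param_overhead = relative_overhead(tsn_params_m, gsm_params_m)
flops_overhead = relative_overhead(tsn_flops_g, gsm_flops_g)

print(f"parameter overhead: {param_overhead:.2f}%")
print(f"compute overhead:   {flops_overhead:.2f}%")
```

Under these assumptions both overheads stay well below one percent of the baseline.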
State-of-the-art recognition accuracy of 55% on Something-Something-V1.