GPU Accelerated Sequence Learning for Action Recognition
Yemin Shi (shiyemin@pku.edu.cn), 2018-03
Background
 Action Recognition (Video Classification) vs. Object Recognition (Image Classification)
 Action recognition additionally involves the temporal domain, long-term dependence, and high computational complexity.
 General-purpose methods are not good enough for action recognition.
 Existing methods are still far from practical use.
Research Trends: Datasets

| Dataset | Year | Actions | Videos | Annotations | Source | Localization |
|---|---|---|---|---|---|---|
| HMDB51 | 2011 | 51 | 7K | 7K | YouTube/Movie | No |
| UCF101 | 2012 | 101 | 13K | 13K | YouTube | No |
| Sports-1M | 2014 | 487 | 1.1M | 1.1M | YouTube | No |
| THUMOS 15 | 2014 | 101 | 24K | 21K | YouTube | Yes |
| ActivityNet | 2015 | 200 | 20K | 23K | YouTube | Yes |
| Charades | 2016/2017 | 157 | 10K | 67K | 267 Homes | Yes |
| AVA | 2017 | 80 | 214 | 197K | Movie | Yes |
| Kinetics | 2017 | 400 | 305K | 305K | YouTube | No |
| MIT | 2017 | 339 | 1M | 1M | 10 sources | No |
| SLAC | 2017 | 200 | 520K | 1.75M | YouTube | Yes |
Action Recognition
 Modeling the temporal domain is one of the most important goals of action recognition.
 Shortcomings of existing methods:
  Actions have long durations, leading to high computational complexity.
  LSTM is not good enough.
 Therefore, we need more efficient sequence learning models to improve the ability to model temporal information.
Overview
 Hand-crafted features and deep features: Deep Trajectory Descriptor
 Importance of each frame: Temporal Attentive Network
 The ability to model the temporal domain: shuttleNet
 One-shot action recognition: Hierarchical Temporal Memory Enhanced One-shot Distance Learning
 Open-set action recognition: Open Deep Network
Overview
 Sequence learning for action recognition
  Deep Trajectory Descriptor
  Temporal Attentive Network
  shuttleNet
  Hierarchical Temporal Memory Enhanced One-shot Distance Learning
  Open Deep Network
Deep Trajectory Descriptor
 Problems and Solutions
  Hand-crafted features can hardly describe the movement process: more statistics, less structure. Structure is important, and CNNs are good at describing it.
  Integrate hand-crafted features and a CNN to improve performance.
Deep Trajectory Descriptor
 Improve Dense Trajectories with Background Subtraction
  Extract trajectories and optical flow only on the foreground, where S_fore is the sum over the foreground square area and (i, j) indexes positions around the square area.
  Pipeline: videos → masks → foreground.
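The foreground-only extraction above can be illustrated with a minimal NumPy sketch. This is not the thesis code; the function name and toy shapes are assumptions, and it shows only the masking step (zeroing flow outside a foreground mask):

```python
import numpy as np

def mask_foreground(flow, fg_mask):
    """Zero out optical flow outside the foreground mask.

    flow:    (H, W, 2) array of per-pixel flow vectors.
    fg_mask: (H, W) boolean array, True on foreground pixels.
    """
    return flow * fg_mask[..., None]

# Toy example: 4x4 flow field, foreground in the top-left 2x2 square.
flow = np.ones((4, 4, 2))
fg_mask = np.zeros((4, 4), dtype=bool)
fg_mask[:2, :2] = True
masked = mask_foreground(flow, fg_mask)
assert masked[:2, :2].sum() == 8.0   # 2*2 pixels * 2 channels survive
assert masked[2:, :].sum() == 0.0    # background flow is suppressed
```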
Deep Trajectory Descriptor
 Main Idea
  Trajectory Texture Image: project dense trajectories within an adaptive duration onto a 2D canvas.
  Pipeline: input video → dense trajectories → projection into 2D space → Trajectory Texture Image.
  A CNN (Conv, Pooling, LRN, ..., Conv, FC) is employed for structural feature learning.
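The projection step above can be sketched as follows. This is an illustrative NumPy toy (canvas size, normalization, and the accumulation rule are assumptions, not the thesis implementation):

```python
import numpy as np

def trajectory_texture_image(trajectories, size=32):
    """Project a set of trajectories onto a single 2D canvas.

    trajectories: list of (L, 2) arrays of normalized (x, y) points in [0, 1).
    Each visited pixel is incremented, so the canvas accumulates a
    'texture' of where motion occurred.
    """
    canvas = np.zeros((size, size), dtype=np.float32)
    for traj in trajectories:
        pts = (np.asarray(traj) * size).astype(int).clip(0, size - 1)
        for x, y in pts:
            canvas[y, x] += 1.0
    if canvas.max() > 0:
        canvas /= canvas.max()  # normalize to [0, 1] for the CNN input
    return canvas

# A diagonal trajectory leaves a diagonal streak on the canvas.
traj = np.linspace([0.0, 0.0], [0.9, 0.9], num=16)
img = trajectory_texture_image([traj])
assert img.shape == (32, 32)
assert img.max() == 1.0
```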
Deep Trajectory Descriptor
Deep Trajectory Descriptor  Improve trajectory projection method
Deep Trajectory Descriptor
 DTD with LSTM
  Treat each Trajectory Texture Image as the input of one time step; an LSTM is used to model the temporal domain.
  This improves the ability of DTD to model complex actions.
  x_t is the input and h_t the hidden state at time t; i_t, f_t, c_t, o_t are the input gate, forget gate, memory cell, and output gate at time t.
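The gate notation above corresponds to the standard LSTM update; a minimal NumPy sketch (shapes and the stacked-gate layout are illustrative, not the thesis code) is:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step with gates i, f, o and candidate g stacked in W, U, b."""
    d = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b            # (4d,) pre-activations
    i = sigmoid(z[:d])                      # input gate
    f = sigmoid(z[d:2*d])                   # forget gate
    o = sigmoid(z[2*d:3*d])                 # output gate
    g = np.tanh(z[3*d:])                    # candidate memory
    c_t = f * c_prev + i * g                # memory cell
    h_t = o * np.tanh(c_t)                  # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d = 8, 4                              # e.g. one DTD feature per time step
W = rng.normal(size=(4*d, d_in))
U = rng.normal(size=(4*d, d))
b = np.zeros(4*d)
h = c = np.zeros(d)
for t in range(3):                          # one Trajectory Texture Image per step
    h, c = lstm_step(rng.normal(size=d_in), h, c, W, U, b)
assert h.shape == (d,)
```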
Deep Trajectory Descriptor
 Learn a long-term action description (e.g., ApplyEyeMakeup)
  CNN for DTD feature learning;
  Sequential DTD for long-term action representation;
  RNN (LSTM) for temporal-domain modeling.
 Loss function (softmax loss with a weight-decay regularizer):

 J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y_i = j\}\,\log\frac{e^{\theta_j^\top x_i}}{\sum_{l=1}^{k} e^{\theta_l^\top x_i}} + \frac{\lambda}{2}\sum_{i=1}^{k}\sum_{j=0}^{n}\theta_{ij}^2
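The softmax loss with weight decay can be checked numerically with a small NumPy sketch (shapes and variable names are illustrative):

```python
import numpy as np

def softmax_loss(theta, X, y, lam):
    """Cross-entropy over m samples plus (lam/2) * ||theta||^2.

    theta: (K, n) class weights, X: (m, n) features, y: (m,) labels in [0, K).
    """
    scores = X @ theta.T                          # (m, K)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    m = X.shape[0]
    ce = -log_prob[np.arange(m), y].mean()
    return ce + 0.5 * lam * (theta ** 2).sum()

theta = np.zeros((3, 5))
X = np.ones((4, 5))
y = np.array([0, 1, 2, 0])
# With all-zero weights every class gets probability 1/3, so the loss is log 3.
assert np.isclose(softmax_loss(theta, X, y, lam=0.0), np.log(3))
```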
Deep Trajectory Descriptor  Three-stream Framework
Deep Trajectory Descriptor  Experiment results
Overview
 Sequence learning for action recognition
  Deep Trajectory Descriptor
  Temporal Attentive Network
  shuttleNet
  Hierarchical Temporal Memory Enhanced One-shot Distance Learning
  Open Deep Network
Temporal Attentive Network
 Problems and solutions
  Not all postures contribute equally to the successful recognition of an action.
  Texture and motion are not independent of each other.
  The most important frames for RGB and for optical flow may not correspond (i.e., may not share the same frame id).
Temporal Attentive Network  Attention mechanism
Temporal Attentive Network

Spatial domain:
 e_{ij} = v^\top \tanh(W' h_i + W' \bar{h}_j)
 \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{T} \exp(e_{lj})}  (weight for each input)
 c_j = \sum_{i=1}^{T} \alpha_{ij} h_i  (weighted sum over all inputs)

Temporal domain:
 f_{ji} = u^\top \tanh(W' h_j + W' \bar{h}_i)
 \beta_{ji} = \frac{\exp(f_{ji})}{\sum_{l=1}^{T} \exp(f_{jl})}
 o_i = \sum_{j=1}^{T} \beta_{ji} h_j
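The score/softmax/weighted-sum pattern above is additive attention; a minimal NumPy sketch (matrix names and shapes are assumptions, not the thesis code):

```python
import numpy as np

def additive_attention(H, s, W1, W2, v):
    """Additive (tanh-based) attention over T hidden states.

    H: (T, d) hidden states, s: (d,) query state.
    Returns the attention weights and the weighted sum of H.
    """
    e = np.tanh(H @ W1.T + s @ W2.T) @ v      # (T,) unnormalized scores
    e -= e.max()                              # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()       # softmax weight for each input
    context = alpha @ H                       # weighted sum over all inputs
    return alpha, context

rng = np.random.default_rng(1)
T, d = 6, 4
H = rng.normal(size=(T, d))
s = rng.normal(size=d)
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)
alpha, context = additive_attention(H, s, W1, W2, v)
assert np.isclose(alpha.sum(), 1.0)
assert context.shape == (d,)
```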
Temporal Attentive Network  Experiment results
Overview
 Sequence learning for action recognition
  Deep Trajectory Descriptor
  Temporal Attentive Network
  shuttleNet
  Hierarchical Temporal Memory Enhanced One-shot Distance Learning
  Open Deep Network
shuttleNet
 Problems and solutions
  Most deep neural networks contain only feed-forward connections, whereas the visual cortical pathways (V1, V2, V4, TEO, IT) contain both feed-forward connections (blue arrows) and feed-back connections (red arrows) [Siegelbaum'00].
  Existing RNNs are still not good enough in practice.
[Siegelbaum'00] Siegelbaum S A, Hudspeth A J. Principles of Neural Science. New York: McGraw-Hill, 2000.
shuttleNet architecture: input projection, loop connections among processors, and output selection.
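The loop-connection idea can be sketched in NumPy under heavy assumptions: a small set of shared "processor" functions is reused in ring order, with each candidate path starting from a different processor, and the path outputs are then combined (here a simple mean stands in for the learned output selection). This is an illustrative toy, not the shuttleNet implementation:

```python
import numpy as np

def ring_pass(x, cells, start):
    """Run shared 'processor' functions in ring order beginning at `start`.

    With N processors, the path starting at k visits cells k, k+1, ...,
    wrapping around the loop, so every processor is reused on several paths.
    """
    h = x
    n = len(cells)
    for step in range(n):
        h = cells[(start + step) % n](h)
    return h

# Toy processors: shared affine maps squashed by tanh.
rng = np.random.default_rng(2)
d, n = 4, 3
mats = [rng.normal(scale=0.5, size=(d, d)) for _ in range(n)]
cells = [lambda h, M=M: np.tanh(h @ M) for M in mats]

x = rng.normal(size=d)
paths = [ring_pass(x, cells, start=k) for k in range(n)]  # n candidate outputs
out = np.stack(paths).mean(axis=0)  # stand-in for the learned output selection
assert out.shape == (d,)
```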
shuttleNet
 Experiment results: comparison with existing RNNs; comparison with other action recognition methods.
Overview
 Sequence learning for action recognition
  Deep Trajectory Descriptor
  Temporal Attentive Network
  shuttleNet
  Hierarchical Temporal Memory Enhanced One-shot Distance Learning
  Open Deep Network
Motivation
 Videos are complicated because of temporal complexity and variation.
  Distance learning can decrease intra-class distance while increasing inter-class distance. Method: triplet loss.
 Not all frames contribute equally to recognition.
  The harder a frame is to predict, the more representative it is. Method: Hierarchical Temporal Memory (HTM).
Hawkins, Jeff (2004). On Intelligence (1st ed.). Times Books. p. 272. ISBN 0805074562.
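The triplet loss named above has the standard margin form; a minimal NumPy sketch (the margin value and squared-L2 distance are common choices, assumed here rather than taken from the thesis):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull same-class pairs together, push different-class pairs apart.

    loss = max(0, d(a, p) - d(a, n) + margin), with squared L2 distances.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])      # same class: close to the anchor
n = np.array([3.0, 0.0])      # different class: far away
assert triplet_loss(a, p, n) == 0.0   # margin already satisfied
assert triplet_loss(a, n, p) > 0.0    # violated when the roles are swapped
```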
Framework
Seen-class Stage
 Matching Network training
  Sample a target video and a support-set video from the seen classes, and maximize the probability of the class the target video belongs to.
 HTM training
  Make the HTM accustomed to seen-class videos.
Unseen-class Stage  Triplet loss
Experiments
Overview
 Sequence learning for action recognition
  Deep Trajectory Descriptor
  Temporal Attentive Network
  shuttleNet
  Hierarchical Temporal Memory Enhanced One-shot Distance Learning
  Open Deep Network
Open Deep Network
 Motivation
  Action recognition in the real world is essentially an open-set problem:
   Impossible to know all action categories beforehand;
   Infeasible to prepare sufficient training samples for emerging categories.
  Most recognition systems are designed for a static, closed world.
   Primary assumption: all categories are known a priori.
  Closed set: categories are known in both training and testing. Open set: unknown categories appear at test time.
Open Deep Network
 Multi-class unknown category detection
  The multi-class triplet thresholding method
   Consider inter-class relations for unknown-category detection: accept the knowns and reject the unknowns.
   Train a triplet of thresholds [\eta_i, \mu_i, \delta_i] per category.
   Apply the triplet thresholds to each sample during detection.
 Define [\eta_i, \mu_i, \delta_i]:
  Accept threshold: \eta_i = \alpha \cdot \mathrm{Mean}_{j=1}^{X}(f_{i,j})
  Reject threshold: \mu_i = \beta \cdot \eta_i
  Distance threshold: \delta_i = \sigma \cdot \mathrm{Mean}_{j=1}^{X}(f_{i,j} - s_{i,j})
 where f_{i,j} is the maximal score of the i-th category and s_{i,j} is the second-maximal score.
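The threshold computation can be sketched in NumPy. The three thresholds follow the definitions on this slide; the accept/reject decision rule in `detect` is an illustrative assumption (the slide only states that knowns are accepted and unknowns rejected), and the alpha/beta/sigma values are arbitrary:

```python
import numpy as np

def triplet_thresholds(top1, top2, alpha=0.8, beta=0.5, sigma=0.5):
    """Per-category accept/reject/distance thresholds from training scores.

    top1, top2: (X,) arrays of the highest and second-highest scores that
    the category's training samples received.
    """
    eta = alpha * top1.mean()              # accept threshold
    mu = beta * eta                        # reject threshold
    delta = sigma * (top1 - top2).mean()   # distance threshold
    return eta, mu, delta

def detect(score1, score2, eta, mu, delta):
    """Accept as known, reject as unknown, or leave undecided (assumed rule)."""
    if score1 >= eta or (score1 >= mu and score1 - score2 >= delta):
        return "known"
    if score1 < mu:
        return "unknown"
    return "undecided"

eta, mu, delta = triplet_thresholds(np.array([0.9, 0.8]), np.array([0.2, 0.3]))
assert detect(0.95, 0.10, eta, mu, delta) == "known"
assert detect(0.05, 0.04, eta, mu, delta) == "unknown"
```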