GPU Accelerated Sequence Learning for Action Recognition
Yemin Shi (shiyemin@pku.edu.cn)
2018-03
Background
Action Recognition (Video Classification) vs. Object Recognition (Image Classification)
Compared with object recognition, action recognition must handle the temporal domain, long-term dependence, and high computational complexity. General-purpose methods are therefore not good enough for action recognition, and existing methods are still far from practical use.
Research Trends: Datasets

Dataset      Year       Actions  Videos  Annotations  Source         Localization
HMDB51       2011       51       7K      7K           YouTube/Movie  No
UCF101       2012       101      13K     13K          YouTube        No
Sports-1M    2014       487      1.1M    1.1M         YouTube        No
THUMOS'15    2014       101      24K     21K          YouTube        Yes
ActivityNet  2015       200      20K     23K          YouTube        Yes
Charades     2016/2017  157      10K     67K          267 Homes      Yes
AVA          2017       80       214     197K         Movie          Yes
Kinetics     2017       400      305K    305K         YouTube        No
MIT          2017       339      1M      1M           10 sources     No
SLAC         2017       200      520K    1.75M        YouTube        Yes
Action Recognition
Modeling the temporal domain is one of the most important targets of action recognition. Shortcomings of existing methods: actions have long duration, which leads to high computational complexity, and LSTM is not good enough. Therefore, we need more efficient sequence-learning models to improve the ability to model temporal information.
Overview
- Hand-crafted features and deep features: Deep Trajectory Descriptor
- Importance of each frame: Temporal Attentive Network
- The ability of modeling the temporal domain: shuttleNet
- One-shot action recognition: Hierarchical Temporal Memory Enhanced One-shot Distance Learning
- Open-set action recognition: Open Deep Network
Overview: Sequence Learning for Action Recognition
- Deep Trajectory Descriptor
- Temporal Attentive Network
- shuttleNet
- Hierarchical Temporal Memory Enhanced One-shot Distance Learning
- Open Deep Network
Deep Trajectory Descriptor
Problems and Solutions
- Hand-crafted features can hardly describe the movement process: they capture statistics but little structure.
- CNNs are good at describing structure, and structure is important.
- Solution: integrate hand-crafted features and a CNN to improve performance.
Deep Trajectory Descriptor
Improving Dense Trajectories with Background Subtraction
Extract trajectories and optical flow only on the foreground, where S_fore is the sum over the foreground square area and (i, j) indexes the square area.
[Figure: input videos, foreground masks, and extracted foreground]
Deep Trajectory Descriptor
Main Idea
Trajectory Texture Image: project the dense trajectories of an input video onto a 2D canvas over an adaptive duration. A CNN (Conv, Pooling, LRN, Conv, FC) is then employed on the resulting Trajectory Texture Image for structural feature learning.
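A minimal sketch of the projection step. The accumulate-and-normalise rendering and the name `trajectory_texture_image` are illustrative assumptions, since the slide does not specify the exact drawing scheme:

```python
import numpy as np

def trajectory_texture_image(trajectories, height, width):
    """Project a set of 2D point tracks onto one grey-scale canvas.

    trajectories: list of (L, 2) arrays of (x, y) positions over L frames.
    Every visited pixel is brightened, so the canvas encodes the motion
    pattern of the whole clip as a single texture image for a CNN.
    """
    canvas = np.zeros((height, width), dtype=np.float32)
    for track in trajectories:
        for x, y in np.asarray(track, dtype=int):
            if 0 <= y < height and 0 <= x < width:
                canvas[y, x] += 1.0
    if canvas.max() > 0:
        canvas /= canvas.max()  # normalise intensities to [0, 1]
    return canvas
```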
Deep Trajectory Descriptor
Deep Trajectory Descriptor
Improved trajectory projection method
Deep Trajectory Descriptor
DTD with LSTM
Treat each Trajectory Texture Image as the input of one time step, and use an LSTM to model the temporal domain; this improves the ability of DTD to model complex actions. Here x_t is the input at time t, h_t is the hidden state at time t, and i_t, f_t, c_t, o_t are the input gate, forget gate, memory cell, and output gate at time t.
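The gate updates referred to above are the standard LSTM recurrences; a minimal NumPy sketch follows (the stacked-weight layout of W, U, b is an assumption of this sketch, not the thesis code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: gates i_t, f_t, o_t and candidate g_t are computed
    from input x_t and previous hidden state h_prev, then the memory cell
    c_t and hidden state h_t are updated.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i_t = sigmoid(z[0:H])            # input gate
    f_t = sigmoid(z[H:2 * H])        # forget gate
    o_t = sigmoid(z[2 * H:3 * H])    # output gate
    g_t = np.tanh(z[3 * H:4 * H])    # candidate memory
    c_t = f_t * c_prev + i_t * g_t   # memory cell at time t
    h_t = o_t * np.tanh(c_t)         # hidden state at time t
    return h_t, c_t
```

Running one such step per Trajectory Texture Image gives the sequence model described on the slide.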
Deep Trajectory Descriptor
Learning a long-term action description (example class: ApplyEyeMakeup)
- CNN for DTD feature learning;
- Sequential DTD for long-term action representation;
- RNN (LSTM) for temporal-domain modeling.
Loss function (softmax loss with L2 weight decay):
$$\mathcal{L}(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{K}\mathbf{1}\{y_i = j\}\log\frac{e^{\theta_j^{\top}x_i}}{\sum_{l=1}^{K}e^{\theta_l^{\top}x_i}} + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^{2}$$
Deep Trajectory Descriptor Three-stream Framework
Deep Trajectory Descriptor
Experimental results
Overview: Sequence Learning for Action Recognition
- Deep Trajectory Descriptor
- Temporal Attentive Network
- shuttleNet
- Hierarchical Temporal Memory Enhanced One-shot Distance Learning
- Open Deep Network
Temporal Attentive Network
Problems and Solutions
- Not all postures contribute equally to the successful recognition of an action.
- Texture and motion are not independent of each other.
- The most important frames for RGB and optical flow may not correspond (they need not share the same frame id).
Temporal Attentive Network Attention mechanism
Temporal Attentive Network
Spatial and Temporal Domains
For hidden states $h_1,\dots,h_T$, a score is computed for each pair of positions, normalised into a weight for each input, and used in a weighted sum over all inputs:
$$e_{ij} = v^{\top}\tanh(W_1 h_i + W_2 h_j), \qquad f_{ji} = u^{\top}\tanh(W_3 h_j + W_4 h_i)$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T}\exp(e_{kj})}, \qquad \beta_{ji} = \frac{\exp(f_{ji})}{\sum_{k=1}^{T}\exp(f_{jk})} \quad \text{(weight for each input)}$$
$$o_j = \sum_{i=1}^{T}\alpha_{ij}\, h_i, \qquad g_i = \sum_{j=1}^{T}\beta_{ji}\, h_j \quad \text{(weighted sum over all inputs)}$$
Temporal Attentive Network
Experimental results
Overview: Sequence Learning for Action Recognition
- Deep Trajectory Descriptor
- Temporal Attentive Network
- shuttleNet
- Hierarchical Temporal Memory Enhanced One-shot Distance Learning
- Open Deep Network
shuttleNet
Problems and Solutions
- Most deep neural networks contain only feed-forward connections, whereas the visual cortical pathways (V1, V2, V4, TEO, IT) contain both feed-forward (blue arrows) and feed-back (red arrows) connections [Siegelbaum'00].
- Existing RNNs are still not good enough in practice.
[Siegelbaum'00] Siegelbaum S. A., Hudspeth A. J. Principles of Neural Science. New York: McGraw-Hill, 2000.
shuttleNet
Architecture: input projection, loop connections, selection, and output.
shuttleNet
Experimental results: comparison with existing RNNs and with other action recognition methods.
Overview: Sequence Learning for Action Recognition
- Deep Trajectory Descriptor
- Temporal Attentive Network
- shuttleNet
- Hierarchical Temporal Memory Enhanced One-shot Distance Learning
- Open Deep Network
Motivation
- Videos are complicated because of temporal complexity and variation. Distance learning can decrease intra-class distance while increasing inter-class distance. Method: triplet loss.
- Not all frames contribute equally to recognition; the harder a frame is to predict, the more representative it is. Method: Hierarchical Temporal Memory (HTM).
Hawkins, Jeff (2004). On Intelligence (1st ed.). Times Books. p. 272. ISBN 0805074562.
Framework
Seen-class Stage
- Matching network training: sample a target video and a support-set video from the seen classes, and maximize the probability of the class that the target video belongs to.
- HTM training: make the HTM accustomed to seen-class videos.
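The matching-network objective can be illustrated with a softmax over cosine similarities to the support set. This is a generic matching-network sketch, not the thesis code, and `matching_probs` is a hypothetical name:

```python
import numpy as np

def matching_probs(target, support, support_labels, n_classes):
    """Matching-network-style class probabilities: softmax over cosine
    similarities between the target embedding and the support embeddings,
    accumulated per class. Training maximizes the probability of the
    target's true class.

    target: (D,), support: (N, D), support_labels: (N,) integer labels.
    """
    t = target / np.linalg.norm(target)
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    sims = s @ t                               # cosine similarities
    attn = np.exp(sims) / np.exp(sims).sum()   # softmax attention weights
    probs = np.zeros(n_classes)
    for a, lbl in zip(attn, support_labels):
        probs[lbl] += a                        # sum attention per class
    return probs
```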
Unseen-class Stage Triplet loss
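A minimal sketch of the triplet loss used in this stage; the squared Euclidean distance and the margin of 1.0 are assumptions of the sketch:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: pull the anchor toward the positive (same class) and
    push it away from the negative (different class) by at least `margin`,
    decreasing intra-class distance while increasing inter-class distance."""
    d_pos = np.sum((anchor - positive) ** 2)   # anchor-positive distance
    d_neg = np.sum((anchor - negative) ** 2)   # anchor-negative distance
    return max(d_pos - d_neg + margin, 0.0)    # hinge on the margin
```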
Experiments
Overview: Sequence Learning for Action Recognition
- Deep Trajectory Descriptor
- Temporal Attentive Network
- shuttleNet
- Hierarchical Temporal Memory Enhanced One-shot Distance Learning
- Open Deep Network
Open Deep Network
Motivation
Action recognition in the real world is essentially an open-set problem:
- It is impossible to know all action categories beforehand;
- It is infeasible to prepare sufficient training samples for emerging categories.
Most recognition systems are designed for a static, closed world, with the primary assumption that all categories are known a priori: they train and test only on known categories, whereas in the open world the test set also contains unknown categories.
Open Deep Network
Multi-class Unknown-Category Detection
The multi-class triplet thresholding method considers the inter-class relation for unknown-category detection: accept the knowns and reject the unknowns.
- Train a triplet threshold $[\eta_i, \mu_i, \delta_i]$ per category;
- Apply the triplet threshold to each sample during the detection process.
Define:
- Accept threshold: $\eta_i = \alpha \cdot \mathrm{Mean}(\sum_{j=1}^{X} f_{i,j})$
- Reject threshold: $\mu_i = \beta \cdot \eta_i$
- Distance threshold: $\delta_i = \sigma \cdot \mathrm{Mean}(\sum_{j=1}^{X} (f_{i,j} - s_{i,j}))$
where $f_{i,j}$ is the maximal score and $s_{i,j}$ the second maximal score of the i-th category over its X training samples.
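The thresholding can be sketched as follows. The decision order in `detect_unknown` (accept on eta, reject on mu, then fall back to the top-1/top-2 margin test with delta) is an assumed reading of the slide, not a verified reproduction of ODN:

```python
import numpy as np

def triplet_thresholds(train_scores, alpha=1.0, beta=0.5, sigma=1.0):
    """Per-category triplet threshold [eta, mu, delta] from the score
    matrix of one category's training samples.

    train_scores: (X, K) softmax scores of the X training samples of
    category i; f is the top score, s the second-highest score.
    """
    f = train_scores.max(axis=1)              # top score per sample
    s = np.sort(train_scores, axis=1)[:, -2]  # second-highest score
    eta = alpha * f.mean()                    # accept threshold
    mu = beta * eta                           # reject threshold
    delta = sigma * (f - s).mean()            # distance (margin) threshold
    return eta, mu, delta

def detect_unknown(score_vec, eta, mu, delta):
    """Accept as known if the top score reaches eta, reject as unknown if
    below mu, otherwise decide by the top-1/top-2 margin against delta."""
    order = np.sort(score_vec)
    top, second = order[-1], order[-2]
    if top >= eta:
        return "known"
    if top < mu:
        return "unknown"
    return "known" if top - second >= delta else "unknown"
```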