GPU Accelerated Sequence Learning for Action Recognition, Yemin Shi (shiyemin@pku.edu.cn) - PowerPoint PPT Presentation


  1. GPU Accelerated Sequence Learning for Action Recognition
     Yemin Shi, shiyemin@pku.edu.cn, 2018-03

  2. Background
     - Action recognition (video classification) vs. object recognition (image classification)
     - Action recognition adds a temporal domain, long-term dependencies, and high computational complexity.
     - General-purpose methods are not good enough for action recognition.
     - Existing methods are still far from practical use.

  3. Research Trends
     Dataset      Year       Actions  Videos  Annotations  Source         Localization
     HMDB51       2011       51       7K      7K           YouTube/Movie  No
     UCF101       2012       101      13K     13K          YouTube        No
     Sports 1M    2014       487      1.1M    1.1M         YouTube        No
     THUMOS 15    2014       101      24K     21K          YouTube        Yes
     ActivityNet  2015       200      20K     23K          YouTube        Yes
     Charades     2016/2017  157      10K     67K          267 Homes      Yes
     AVA          2017       80       214     197K         Movie          Yes
     Kinetics     2017       400      305K    305K         YouTube        No
     MIT          2017       339      1M      1M           10 sources     No
     SLAC         2017       200      520K    1.75M        YouTube        Yes

  4. Action Recognition
     - Modeling the temporal domain is one of the most important goals of action recognition.
     - Shortcomings of existing methods:
       - Actions have long durations, which leads to high complexity.
       - LSTM is not good enough.
     - Therefore, we need a more efficient sequence learning model that improves the ability to model temporal information.

  5. Overview (diagram)
     - Research problems: hand-crafted features and deep features; importance of each frame; the ability to model the temporal domain; one-shot action recognition; open-set action recognition
     - Methods: Deep Trajectory Descriptor; Temporal Attentive Network; shuttleNet; Hierarchical Temporal Memory Enhanced One-shot Distance Learning; Open Deep Network

  6. Overview
     - Sequence learning for action recognition
     - Deep Trajectory Descriptor
     - Temporal Attentive Network
     - shuttleNet
     - Hierarchical Temporal Memory Enhanced One-shot Distance Learning
     - Open Deep Network

  8. Deep Trajectory Descriptor
     - Problems and solutions
     - Hand-crafted features can hardly describe the movement process; CNNs are good at describing structure.
     - Integrate hand-crafted features and CNNs to improve performance.
     - Hand-crafted features: more statistics, less structure.
     - CNN: structure is important.

  9. Deep Trajectory Descriptor
     - Improve Dense Trajectory with background subtraction
     - Only extract trajectories and optical flow on the foreground.
     - Where $S_{fore}$ is the sum of the foreground square area and $(i, j)$ indexes positions around the square area.
     - Figure panels: Videos, Masks, Foreground

  10. Deep Trajectory Descriptor
      - Main idea
      - Trajectory Texture Image: project trajectories onto a canvas.
      - Pipeline (diagram): input video, dense trajectories, projection over an adaptive duration, projection into 2D space, Trajectory Texture Image
      - A CNN is employed for structural feature learning.
      - Network (diagram): Conv, Pooling, LRN, ..., Conv, FC
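
The projection step on slide 10 can be illustrated with a minimal sketch. The slide does not say how trajectory points are drawn, so counting how many points land in each pixel is an assumption made here for illustration, and the function name `trajectory_texture_image` is hypothetical.

```python
def trajectory_texture_image(trajectories, width, height):
    """Rasterize trajectories onto a 2D canvas.

    Each trajectory is a list of (x, y) points. Assumption: a pixel stores
    the number of trajectory points that fall on it; the slides do not
    specify the drawing rule.
    """
    canvas = [[0] * width for _ in range(height)]
    for trajectory in trajectories:
        for x, y in trajectory:
            col, row = int(round(x)), int(round(y))
            # Ignore points that fall outside the canvas.
            if 0 <= col < width and 0 <= row < height:
                canvas[row][col] += 1
    return canvas


image = trajectory_texture_image(
    [[(0, 0), (1, 1)], [(1, 1), (5, 5)]], width=3, height=3
)
```

The resulting canvas can then be fed to an image CNN, which is the role the slide assigns to the Trajectory Texture Image.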

  11. Deep Trajectory Descriptor

  12. Deep Trajectory Descriptor  Improve trajectory projection method

  13. Deep Trajectory Descriptor
      - DTD with LSTM
      - Each Trajectory Texture Image is treated as the input of one time step, and an LSTM models the temporal domain.
      - This improves the ability of DTD to model complex actions.
      - Notation: $x_t$ is the input at time $t$; $h_t$ is the hidden state at time $t$; $i_t$, $f_t$, $c_t$, $o_t$ are the input gate, forget gate, memory cell, and output gate at time $t$.
      - Figure: our LSTM model
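
One time step of an LSTM, using the slide's names for the input, hidden state, input gate, forget gate, memory cell, and output gate, might look like the sketch below. It uses the textbook LSTM equations, which is an assumption: the slide shows the model only as a figure, so the exact variant used by the author is unknown. The helper names and the `params` layout are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Return W x + U h + b for list-of-lists W, U and list b."""
    return [
        sum(w * xi for w, xi in zip(W[r], x))
        + sum(u * hi for u, hi in zip(U[r], h))
        + b[r]
        for r in range(len(b))
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One standard LSTM step (assumed variant, not confirmed by the slides).

    params maps "i", "f", "o", "g" to (W, U, b) triples.
    """
    i_t = [sigmoid(z) for z in affine(*params["i"], x_t, h_prev)]    # input gate
    f_t = [sigmoid(z) for z in affine(*params["f"], x_t, h_prev)]    # forget gate
    o_t = [sigmoid(z) for z in affine(*params["o"], x_t, h_prev)]    # output gate
    g_t = [math.tanh(z) for z in affine(*params["g"], x_t, h_prev)]  # candidate
    # Memory cell: keep part of the old cell, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy usage: with all-zero weights every gate is 0.5 and the candidate is 0.
zero = ([[0.0]], [[0.0]], [0.0])
h_t, c_t = lstm_step([1.0], [0.0], [2.0], {name: zero for name in "ifog"})
```

In the DTD setting, `x_t` would be the CNN feature of the Trajectory Texture Image at step `t`.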

  14. Deep Trajectory Descriptor
      - Learn long-term action description (example class: ApplyEyeMakeup)
      - CNN for DTD feature learning;
      - Sequential DTD for long-term action representation;
      - RNN (LSTM) for temporal domain modeling.
      - Loss function (softmax loss plus weight regularizer):
        $$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y_i = j\}\,\log\frac{e^{\theta_j^{T} x_i}}{\sum_{l=1}^{k} e^{\theta_l^{T} x_i}} + \frac{\lambda}{2}\sum_{i=1}^{k}\sum_{j=0}^{n}\theta_{ij}^{2}$$
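
The loss on slide 14 is a softmax cross-entropy averaged over m training examples plus an L2 weight regularizer scaled by lambda/2. A minimal sketch follows, assuming class labels are 0-indexed and `theta` holds one weight row per class; the function name `softmax_cost` is hypothetical.

```python
import math


def softmax_cost(theta, xs, ys, lam):
    """Softmax loss with weight decay.

    theta: one weight row per class; xs: feature vectors; ys: class indices
    (0-based, an assumption of this sketch); lam: regularization strength.
    """
    m = len(xs)
    log_likelihood = 0.0
    for x, y in zip(xs, ys):
        scores = [sum(t * v for t, v in zip(row, x)) for row in theta]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        log_likelihood += scores[y] - log_norm
    weight_decay = sum(t * t for row in theta for t in row)
    return -log_likelihood / m + 0.5 * lam * weight_decay


# With all-zero weights and two classes the cost is log(2).
cost_zero = softmax_cost([[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0]], [0], lam=0.1)
# A single unit weight adds lam / 2 to the cost when the input is zero.
cost_decay = softmax_cost([[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0]], [1], lam=0.5)
```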

  15. Deep Trajectory Descriptor  Three-stream Framework

  16. Deep Trajectory Descriptor  Experiment results

  17. Overview
      - Sequence learning for action recognition
      - Deep Trajectory Descriptor
      - Temporal Attentive Network
      - shuttleNet
      - Hierarchical Temporal Memory Enhanced One-shot Distance Learning
      - Open Deep Network

  18. Temporal Attentive Network
      - Problems and solutions
      - Not all postures contribute equally to the successful recognition of an action.
      - Texture and motion are not independent of each other.
      - The most important frames for RGB and for optical flow may not correspond (they need not share the same frame index).

  19. Temporal Attentive Network  Attention mechanism

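  20. Temporal Attentive Network
      - Spatial domain:
        $$e_{ij} = v^{T}\tanh\left(W'_1 h_i + W'_2 g_j\right)$$
        $$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T}\exp(e_{kj})}$$
        $$g^{o}_{j} = \sum_{i=1}^{T}\alpha_{ij}\, h_i$$
      - Temporal domain:
        $$f_{ji} = u^{T}\tanh\left(W'_3 g_i + W'_4 h_j\right)$$
        $$\beta_{ji} = \frac{\exp(f_{ji})}{\sum_{k=1}^{T}\exp(f_{jk})}$$
        $$h^{o}_{j} = \sum_{i=1}^{T}\beta_{ji}\, g_i$$
      - $\alpha_{ij}$ and $\beta_{ji}$ are the weights for each input; $g^{o}_{j}$ and $h^{o}_{j}$ are the weighted sums over all inputs.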
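
The attention on slide 20 scores every input of one stream against a query from the other stream with a tanh layer, normalizes the scores with a softmax, and returns the weighted sum of the inputs. The sketch below implements one direction under those assumptions; the function name `attend` and the argument names are hypothetical, and calling it with the two streams swapped (and the other weight matrices) gives the opposite direction.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(keys, queries, W_k, W_q, v):
    """Additive attention: for each query, a softmax-weighted sum of the keys.

    score(key, query) = v . tanh(W_k key + W_q query)
    """
    outputs = []
    for query in queries:
        q_proj = matvec(W_q, query)
        scores = []
        for key in keys:
            k_proj = matvec(W_k, key)
            hidden = [math.tanh(a + b) for a, b in zip(k_proj, q_proj)]
            scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
        # Softmax over the keys, shifted by the maximum for stability.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(keys[0])
        outputs.append(
            [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
        )
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
# With v = 0 every score is 0, so the weights are uniform and the
# output is the mean of the keys.
attended = attend([[0.0, 0.0], [2.0, 4.0]], [[1.0, 1.0]], identity, identity, [0.0, 0.0])
```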

  21. Temporal Attentive Network  Experiment results

  22. Overview
      - Sequence learning for action recognition
      - Deep Trajectory Descriptor
      - Temporal Attentive Network
      - shuttleNet
      - Hierarchical Temporal Memory Enhanced One-shot Distance Learning
      - Open Deep Network

  23. shuttleNet
      - Problems and solutions
      - Most deep neural networks are built from feed-forward connections only.
      - Existing RNNs are still not good enough in practice.
      - Figure: visual cortical pathways (V1, V2, V4, TEO, IT) [Siegelbaum'00]. Blue arrows: feed-forward connections. Red arrows: feedback connections.
      - [Siegelbaum'00] Siegelbaum S A, Hudspeth A J. Principles of neural science[M]. New York: McGraw-Hill, 2000.

  25. shuttleNet
      - Diagram labels: Input Projection, Loop Connections, Output Selection

  26. shuttleNet
      - Experiment results
      - Comparison with existing RNNs
      - Comparison with other action recognition methods

  27. Overview
      - Sequence learning for action recognition
      - Deep Trajectory Descriptor
      - Temporal Attentive Network
      - shuttleNet
      - Hierarchical Temporal Memory Enhanced One-shot Distance Learning
      - Open Deep Network

  28. Motivation
      - Videos are complicated because of temporal complexity and variation.
        - Distance learning can decrease intra-class distance while increasing inter-class distance.
        - Method: triplet loss
      - Not all frames contribute equally to recognition.
        - The harder a frame is to predict, the more representative it is.
        - Method: Hierarchical Temporal Memory (HTM)
      - Hawkins, Jeff (2004). On Intelligence (1st ed.). Times Books. p. 272. ISBN 0805074562.

  29. Framework

  30. Seen-class Stage
      - Matching Network training
        - Sample a target video and a support-set video from the seen classes, and maximize the probability of the class that the target video belongs to.
      - HTM training
        - Make the HTM accustomed to seen-class videos.

  31. Unseen-class Stage  Triplet loss
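
A triplet loss pulls an anchor toward a same-class (positive) sample and pushes it away from a different-class (negative) sample, which matches the stated goal of decreasing intra-class distance while increasing inter-class distance. The sketch below uses the common hinge form with squared Euclidean distance; the distance function and the `margin` value are assumptions, since the slides name the loss but do not give its formula.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: zero once the negative is at least `margin`
    farther from the anchor than the positive (in squared distance)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# The negative is far away, so the constraint is already satisfied.
loss_easy = triplet_loss([0.0, 0.0], [1.0, 0.0], [3.0, 0.0])
# The negative is as close as the positive, so the loss equals the margin.
loss_hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```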

  32. Experiments

  33. Overview
      - Sequence learning for action recognition
      - Deep Trajectory Descriptor
      - Temporal Attentive Network
      - shuttleNet
      - Hierarchical Temporal Memory Enhanced One-shot Distance Learning
      - Open Deep Network

  34. Open Deep Network
      - Motivation
      - Action recognition in the real world is essentially an open-set problem.
        - It is impossible to know all action categories beforehand.
        - It is infeasible to prepare sufficient training samples for emerging categories.
      - Most recognition systems are designed for a static closed world.
        - Primary assumption: all categories are known a priori.
      - Figure labels: Train, Test, Train/Test; Known, Known, Unknown

  35. Open Deep Network
      - Multi-class unknown category detection
      - The multi-class triplet thresholding method
        - Consider the inter-class relation for unknown category detection: accept the knowns and reject the unknowns.
        - Train a triplet threshold $[\eta_i, \mu_i, \delta_i]$ per category.
        - Apply the triplet threshold to each sample during the detection process.
      - Definition of $[\eta_i, \mu_i, \delta_i]$:
        - Accept threshold: $\eta_i = \alpha \cdot \frac{1}{X}\sum_{j=1}^{X} f_{i,j}$
        - Reject threshold: $\mu_i = \beta \cdot \eta_i$
        - Distance threshold: $\delta_i = \sigma \cdot \frac{1}{X}\sum_{j=1}^{X}\left(f_{i,j} - s_{i,j}\right)$
      - where $f_{i,j}$ is the maximal score for sample $j$ of the $i$-th category and $s_{i,j}$ is the second maximal score for the same sample.
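
The three thresholds for one category can be computed from the top and second-highest scores of its training samples, as sketched below. `triplet_threshold` follows the slide's definitions, reading "Mean" as the average over the category's samples. The decision rule in `is_known` is only one plausible way to apply the thresholds and is an assumption: the slide says the triplet is applied to each sample but does not spell out the rule. Both function names are hypothetical.

```python
def triplet_threshold(top1, top2, alpha, beta, sigma):
    """Return (accept, reject, distance) thresholds for one category.

    top1[j] and top2[j] are the highest and second-highest scores of
    training sample j of that category.
    """
    count = len(top1)
    accept = alpha * sum(top1) / count
    reject = beta * accept
    distance = sigma * sum(f - s for f, s in zip(top1, top2)) / count
    return accept, reject, distance


def is_known(first, second, accept, reject, distance):
    """Assumed rule: accept confident samples, reject weak ones, and use
    the gap between the two best scores for the samples in between."""
    if first >= accept:
        return True
    if first < reject:
        return False
    return (first - second) >= distance


accept, reject, distance = triplet_threshold(
    [1.0, 0.5], [0.5, 0.25], alpha=1.0, beta=0.5, sigma=1.0
)
decisions = [
    is_known(0.8, 0.1, accept, reject, distance),  # above the accept threshold
    is_known(0.3, 0.1, accept, reject, distance),  # below the reject threshold
    is_known(0.5, 0.1, accept, reject, distance),  # in between, large gap
    is_known(0.5, 0.4, accept, reject, distance),  # in between, small gap
]
```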
