Video-based Action Recognition Ying Wu Electrical Engineering and Computer Science Northwestern University, Evanston, IL 60208
Outline
◮ Introduction
  ◮ The Task of Action Recognition
  ◮ Main Challenges in Action Recognition
  ◮ Categorization of Existing Methods
  ◮ Commonly-Used Action Datasets
◮ Action Recognition by Appearance Representation - I
  ◮ On Space-Time Interest Points
◮ Action Recognition by Appearance Representation - II
  ◮ Recognizing Human Actions: A Local SVM Approach
◮ Action Recognition by Dynamic Modeling
  ◮ Coupled Hidden Markov Models for Complex Action Recognition
What is an Action?
◮ Action: an atomic motion that can be unambiguously distinguished (e.g., sitting down, running).
◮ An activity is composed of several actions performed in succession (e.g., dining, meeting a person).
◮ An event is a combination of activities (e.g., a football match, a traffic accident).
What is Action Recognition?
◮ What is recognition?
  ◮ Verification: Is the walking man Obama?
  ◮ Identification: Who is the walking man?
  ◮ Recognition: What is the man doing?
◮ Action recognition matches an observation (e.g., a video) against previously defined patterns and assigns it a label, i.e., an action type.
  ◮ Input: an action video;
  ◮ Output: an action label;
Why Do We Need Action Recognition?
◮ Handling the rapidly growing volume of video recordings by hand requires expensive human effort;
◮ Large number of potential applications:
  ◮ visual surveillance
  ◮ crowd behavior analysis
  ◮ human-machine interfaces
  ◮ sports video analysis
  ◮ video retrieval
  ◮ etc.
Main Challenges in Action Recognition
◮ Different scales
  ◮ People may appear at different scales in different videos, yet perform the same action.
◮ Movement of the camera
◮ Background “clutter”
  ◮ Other objects/humans present in the video frame.
◮ Partial occlusions
◮ Human/action variation (large intra-class variation)
  ◮ Walking movements can differ in speed and stride length.
◮ Etc.
Categories of Action Recognition Methods: Appearance Representation
◮ Focus on extracting “better” appearance representations from action videos;
◮ hand-crafted features: HOG [7], HOF [4], MBH [18], or combinations [18];
◮ learned features: deep neural networks [20, 5, 16]
Categories of Action Recognition Methods: Dynamic Modeling
◮ Focus on modeling the dynamics and motions in action videos;
◮ Deterministic models: dynamic time warping [24], maximum margin temporal warping [19], the actom sequence model [6], graphs [3] and deep neural architectures [14, 17];
◮ Probabilistic models: HMMs [10], coupled HMMs [2], CRFs [21] and dynamic Bayes nets [23].
Small-Scale Datasets
◮ The KTH Dataset [13]
  ◮ 6 actions (walking, jogging, running, boxing, hand waving and hand clapping)
◮ The Weizmann Dataset [1]
  ◮ 10 actions (walk, run, jump, gallop sideways, bend, one-hand wave, two-hands wave, jump in place, jumping jack and skip)
◮ The UCF Sports Action Dataset
  ◮ 9 actions (diving, golf swinging, kicking, weightlifting, horseback riding, running, skating, swinging a baseball bat and walking)
Large-Scale Datasets
◮ The IXMAS Dataset [22]
  ◮ 14 actions (check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, point, pick up, throw over head and throw from bottom up)
◮ The Hollywood Human Action Dataset [11]
  ◮ 12 actions (answer phone, get out of car, handshake, hug, kiss, sit down, sit up, stand up, drive car, eat, fight and run)
◮ The UCF50 Dataset [12]: 50 different actions/activities
◮ The HMDB51 Dataset [8]: 51 different actions/activities
◮ The UCF101 Dataset [15]: 101 different actions/activities
Data Samples
Figure: Sample frames from (a) the KTH Dataset and (b) the Hollywood Dataset.
Outline
◮ Introduction
  ◮ The Task of Action Recognition
  ◮ Main Challenges in Action Recognition
  ◮ Categorization of Existing Methods
  ◮ Commonly-Used Action Datasets
◮ Action Recognition by Appearance Representation - I
  ◮ On Space-Time Interest Points
◮ Action Recognition by Appearance Representation - II
  ◮ Recognizing Human Actions: A Local SVM Approach
◮ Action Recognition by Dynamic Modeling
  ◮ Coupled Hidden Markov Models for Complex Action Recognition
Overview
◮ Title: On Space-Time Interest Points (2005) 1
◮ Motivated by the Harris and Förstner spatial interest point operators, which are extended into the spatio-temporal domain;
◮ Aims to find “good” spatio-temporal positions in a sequence for feature extraction;
◮ Distinctive and stable descriptors are extracted at the detected interest points;
◮ Author: Ivan Laptev
1 I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005
Spatio-Temporal Interest Points
◮ Points that exhibit large variations along both the spatial and the temporal directions within local spatio-temporal volumes.
Figure: Detecting the strongest spatio-temporal interest points in a football sequence with a player heading the ball.
Spatio-Temporal Interest Point Detection
◮ In the spatial domain, an image $f^{sp}$ can be modeled by its linear scale-space representation $L^{sp}$:
$$L^{sp}(x, y; \sigma_l^2) = g^{sp}(x, y; \sigma_l^2) * f^{sp}(x, y)$$
◮ Analogously, a video sequence $f$ can be modeled by a linear spatio-temporal scale-space representation $L$:
$$L(\cdot; \sigma_l^2, \tau_l^2) = g(\cdot; \sigma_l^2, \tau_l^2) * f(\cdot)$$
where the spatio-temporal Gaussian kernel is
$$g(x, y, t; \sigma_l^2, \tau_l^2) = \frac{\exp\left(-(x^2 + y^2)/2\sigma_l^2 - t^2/2\tau_l^2\right)}{\sqrt{(2\pi)^3 \sigma_l^4 \tau_l^2}}$$
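As a concrete illustration (not the paper's code), this scale-space can be computed by separable Gaussian filtering of the video volume. The function and variable names below are illustrative, and SciPy's gaussian_filter does the convolution:

```python
# A minimal sketch of the spatio-temporal scale-space L(.; sigma_l^2, tau_l^2):
# smooth a video volume f(x, y, t) with a separable 3-D Gaussian whose
# spatial variance is sigma2 and temporal variance is tau2.
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(video, sigma2, tau2):
    """video: array of shape (T, H, W); returns L = g * f."""
    # gaussian_filter takes standard deviations, one per axis (t, y, x).
    sigma = np.sqrt(sigma2)
    tau = np.sqrt(tau2)
    return gaussian_filter(video, sigma=(tau, sigma, sigma))
```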
Spatio-Temporal Interest Point Detection
◮ Construct a $3 \times 3$ spatio-temporal second-moment matrix:
$$\mu = g(\cdot; \sigma_i^2, \tau_i^2) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix}$$
◮ The first-order derivatives are defined as ($\xi \in \{x, y, t\}$):
$$L_\xi = \partial_\xi \left( g(\cdot; \sigma_l^2, \tau_l^2) * f \right)$$
◮ Given the three eigenvalues $\lambda_1$, $\lambda_2$ and $\lambda_3$ of $\mu$, the extended Harris corner function is defined as:
$$H = \det(\mu) - k \cdot \operatorname{trace}^3(\mu) = \lambda_1 \lambda_2 \lambda_3 - k (\lambda_1 + \lambda_2 + \lambda_3)^3$$
◮ Interest points are detected as the positive local maxima of $H$.
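The detector itself can then be sketched as follows, building on the scale_space() helper above. The integration-scale factor s, the constant k, and the neighborhood size are illustrative choices, and np.gradient stands in for the paper's Gaussian derivative filters:

```python
# A hedged sketch of the space-time Harris detector: form the smoothed
# second-moment matrix mu at every voxel, evaluate H = det(mu) - k*trace^3(mu),
# and keep the positive local maxima. Parameter values are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris3d(video, sigma2=4.0, tau2=2.0, s=2.0, k=0.005):
    L = scale_space(video, sigma2, tau2)
    Lt, Ly, Lx = np.gradient(L)                       # first-order derivatives
    g_i = lambda a: gaussian_filter(                  # integration-scale smoothing
        a, sigma=(np.sqrt(s * tau2), np.sqrt(s * sigma2), np.sqrt(s * sigma2)))
    # entries of the symmetric 3x3 second-moment matrix mu
    A, B, C = g_i(Lx * Lx), g_i(Ly * Ly), g_i(Lt * Lt)
    D, E, F = g_i(Lx * Ly), g_i(Lx * Lt), g_i(Ly * Lt)
    det = A * (B * C - F * F) - D * (D * C - F * E) + E * (D * F - B * E)
    trace = A + B + C
    H = det - k * trace ** 3
    peaks = (H == maximum_filter(H, size=5)) & (H > 0)  # positive local maxima
    return np.argwhere(peaks)                           # (t, y, x) coordinates
```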
Space-Time Interest Points: Examples (a) Action : clapping hands (b) The detected interst points
Spatio-Temporal Scale Adaptation
◮ Recall the scale-space representation $L(\cdot; \sigma_l^2, \tau_l^2)$: the two scale factors $\sigma_l^2$ and $\tau_l^2$ strongly influence the detection result;
◮ The larger $\tau_l^2$ is, the more easily space-time structures with long temporal extents are detected;
◮ The larger $\sigma_l^2$ is, the more easily space-time structures with large spatial extents are detected;
Spatio-Temporal Scale Adaptation
◮ The scale factors can be determined automatically by finding the extrema of the scale-normalized Laplacian $\nabla^2_{norm} L$ over both spatial and temporal scales.
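For reference, the scale-normalized Laplacian driving this selection combines scale-normalized second derivatives; with the normalization exponents reported in the paper it reads (as we understand Laptev's formulation):
$$\nabla^2_{norm} L = \sigma^2 \tau^{1/2} \left( L_{xx} + L_{yy} \right) + \sigma \tau^{3/2} L_{tt}$$
A detected point is scale-adapted when $H$ attains a positive local maximum in space-time and $\nabla^2_{norm} L$ simultaneously attains an extremum over the scales $(\sigma_l^2, \tau_l^2)$.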
Result Figure : Results of spatial/spatio-temporal interest point detection for a zoom-in sequence of a walking person.
Result Figure : (top): Correct matches in sequences with leg actions; (bottom): Correct matches in sequences with arm actions;
Outline
◮ Introduction
  ◮ The Task of Action Recognition
  ◮ Main Challenges in Action Recognition
  ◮ Categorization of Existing Methods
  ◮ Commonly-Used Action Datasets
◮ Action Recognition by Appearance Representation - I
  ◮ On Space-Time Interest Points
◮ Action Recognition by Appearance Representation - II
  ◮ Recognizing Human Actions: A Local SVM Approach
◮ Action Recognition by Dynamic Modeling
  ◮ Coupled Hidden Markov Models for Complex Action Recognition
Overview ◮ Title : Recognizing Human Actions: A Local SVM Approach (2004) 2 ◮ Use local space-time features to represent video sequences that contain actions. ◮ Classification is done via an SVM. ◮ Author : Christian Schuldt, Ivan Laptev and Barbara Caputo 2 C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local svm approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on , volume 3, pages 32–36. IEEE, 2004
Local Space-time Features Figure : Local space-time features detected for a walking pattern 3 3 C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local svm approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on , volume 3, pages 32–36. IEEE, 2004
Representation of Features
◮ Spatio-temporal “jets” (of order 4) are computed at the center of each detected feature:
$$j = \left( L_x, L_y, L_t, L_{xx}, \cdots, L_{tttt} \right)\Big|_{\sigma^2 = \tilde{\sigma}_i^2,\, \tau^2 = \tilde{\tau}_i^2}$$
where the scale-normalized derivatives are
$$L_{x^m y^n t^k} = \sigma^{m+n} \tau^k \left( \partial_{x^m y^n t^k}\, g \right) * f$$
◮ Using k-means clustering over the jet descriptors $j$, a vocabulary of words $h_i$ is created;
◮ Finally, a given video is represented by a histogram of occurrence counts of the features corresponding to each $h_i$ in that video:
$$H = (h_1, \dots, h_n)$$
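As an illustration of the vocabulary and histogram steps, here is a minimal sketch assuming the jet descriptors have already been extracted. scikit-learn's KMeans and the vocabulary size n_words=128 are illustrative choices, not taken from the paper:

```python
# A minimal bag-of-words sketch: cluster jet descriptors with k-means,
# then describe each video by a normalized histogram of word counts.
# All names (build_vocabulary, video_histogram, n_words) are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_jets, n_words=128):
    """train_jets: (N, d) jet descriptors pooled over all training videos."""
    return KMeans(n_clusters=n_words, n_init=10).fit(train_jets)

def video_histogram(vocab, jets):
    """jets: (m, d) jets from one video -> normalized word histogram H."""
    words = vocab.predict(jets)             # assign each jet to its nearest word
    H, _ = np.histogram(words, bins=np.arange(vocab.n_clusters + 1))
    return H / max(H.sum(), 1)              # guard against empty videos
```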
Recognition by Support Vector Machines
◮ For action recognition, the obtained local space-time features are combined with an SVM;
◮ Given a set of training data from different action classes $\{(H_i, y_i)\}_{i=1}^n$, an SVM classifier is learned for each action class:
$$f(H) = \operatorname{sgn}\left( \sum_{i=1}^n \alpha_i y_i \langle H_i, H \rangle + b \right)$$
◮ This is easy to extend to a kernelized version by replacing the inner product $\langle H_i, H \rangle$ with a kernel $K(H_i, H)$;
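To make the classification stage concrete, here is a minimal sketch using scikit-learn's SVC in place of the paper's local-kernel formulation; the RBF kernel and the one-vs-rest scheme are illustrative substitutions, not the authors' exact setup:

```python
# A hedged sketch of the classification stage: SVMs trained on video
# histograms via scikit-learn's SVC (one-vs-rest). The RBF kernel is an
# illustrative stand-in for the paper's local feature kernel.
from sklearn.svm import SVC

def train_action_svm(H_train, y_train):
    """H_train: (n_videos, n_words) histograms; y_train: action labels."""
    clf = SVC(kernel="rbf", gamma="scale", decision_function_shape="ovr")
    clf.fit(H_train, y_train)
    return clf

# Usage: predict the action label of each test video from its histogram.
# y_pred = train_action_svm(H_train, y_train).predict(H_test)
```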
Results
Figure: Results of action recognition for different methods and scenarios on the KTH dataset.