

  1. Reconnaissance d’objets et vision artificielle 2013. Motion and Human Actions. Ivan Laptev, ivan.laptev@inria.fr, INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548, Laboratoire d’Informatique, École Normale Supérieure, Paris

  2. Class overview: Motivation; Historic review; Modern applications. Appearance-based methods: motion history images, active shape models, tracking and motion priors. Motion-based methods: generic and parametric optical flow, motion templates. Space-time methods: local space-time features, action classification and detection, weakly-supervised action learning.

  3. What have we seen so far? Temporal templates: + simple, fast; - sensitive to segmentation errors. Active shape models: + shape regularization; - sensitive to initialization and tracking failures. Tracking with motion priors: + improved tracking and simultaneous action recognition; - sensitive to initialization and tracking failures. Motion-based recognition: + generic descriptors, less dependent on appearance; - sensitive to localization/tracking errors.

  4. Motivation. Goal: interpreting complex dynamic scenes. Common methods: segmentation, tracking. Common problems: complex and changing background, changing appearance. ⇒ Make no global assumptions about the scene.

  5. Space-time. No global assumptions ⇒ consider local spatio-temporal neighborhoods (e.g. hand waving, boxing).

  6. Actions == Space-time objects?

  7. Local approach: Bag of Visual Words Airplanes Motorbikes Faces Wild Cats Leaves People Bikes

  8. Space-time local features

  9. Space-Time Interest Points: Detection. What neighborhoods to consider? Look at the distribution of the gradient: high image variation in space and time ⇒ distinctive neighborhoods. Definitions: f(x, y, t) - original image sequence; g(x, y, t; σ², τ²) - space-time Gaussian with covariance Σ = diag(σ², σ², τ²); L = g ∗ f and its Gaussian derivatives Lx, Ly, Lt; ∇L = (Lx, Ly, Lt)ᵀ - space-time gradient; μ = g ∗ (∇L ∇Lᵀ) - second-moment matrix.
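The definitions above can be sketched in a few lines of NumPy/SciPy (an illustrative approximation, not the original implementation; the values of σ, τ and the integration-scale factor s are arbitrary toy choices):

```python
import numpy as np
from scipy import ndimage

def second_moment_matrix(f, sigma=1.5, tau=1.2, s=2.0):
    """Smooth the sequence f(t, y, x) with a space-time Gaussian, take
    gradients, and average the outer product of the gradient with a
    larger integration-scale Gaussian (sketch of the STIP definitions)."""
    # L = g(.; sigma^2, tau^2) * f  -- scale-space representation
    L = ndimage.gaussian_filter(f.astype(float), sigma=(tau, sigma, sigma))
    # space-time gradient; axes of f are ordered (t, y, x)
    Lt, Ly, Lx = np.gradient(L)
    grads = [Lx, Ly, Lt]
    # mu = g(.; s*sigma^2, s*tau^2) * (grad L  grad L^T): a 3x3 matrix per voxel
    mu = np.empty(f.shape + (3, 3))
    for i in range(3):
        for j in range(3):
            mu[..., i, j] = ndimage.gaussian_filter(
                grads[i] * grads[j], sigma=(s * tau, s * sigma, s * sigma))
    return mu

# toy sequence: a bright square appearing halfway through
f = np.zeros((8, 16, 16))
f[4:, 6:10, 6:10] = 1.0
mu = second_moment_matrix(f)
```

By construction μ is symmetric at every voxel, which the toy run confirms.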

  10. Space-Time Interest Points: Detection. Properties of μ: μ defines a second-order approximation of the local distribution of ∇L within a neighborhood: 1D space-time variation of f, e.g. a moving bar; 2D space-time variation of f, e.g. a moving ball; 3D space-time variation of f, e.g. a jumping ball. Large eigenvalues of μ can be detected at the local maxima of H = det(μ) − k·trace³(μ) over (x, y, t) (similar to the Harris operator [Harris and Stephens, 1988]).
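Given per-voxel second-moment matrices, the response H = det(μ) − k·trace³(μ) is direct to evaluate. A minimal sketch (the non-maximum-suppression step over (x, y, t) is omitted; k = 0.005 is a commonly used value):

```python
import numpy as np

def harris3d_response(mu, k=0.005):
    """Extended Harris response H = det(mu) - k * trace(mu)^3 for every
    voxel, where mu has shape (..., 3, 3). Interest points would then be
    the local maxima of H over (x, y, t)."""
    det = np.linalg.det(mu)
    tr = np.trace(mu, axis1=-2, axis2=-1)
    return det - k * tr ** 3

# toy check: identity matrices give det = 1, trace = 3,
# hence H = 1 - 0.005 * 27 = 0.865 everywhere
H_toy = harris3d_response(np.broadcast_to(np.eye(3), (2, 2, 2, 3, 3)))
```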

  11. Space-Time interest points respond to events such as appearance/disappearance, velocity changes, and split/merge.

  12. Space-Time Interest Points: Examples Motion event detection

  13. Spatio-temporal scale selection. Local features can be adapted to scale changes; the selection of temporal scales captures the frequency of events.

  14. Relative camera motion. Local features can be adapted to motion changes.

  15. Local features for human actions

  16. Local features for human actions boxing walking hand waving

  17. Local space-time descriptor: HOG/HOF. Multi-scale space-time patches are described by histograms of oriented spatial gradients (HOG descriptor, 3×3×2×4 bins) and histograms of optical flow (HOF descriptor, 3×3×2×5 bins).
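Each cell of such a descriptor reduces to a magnitude-weighted orientation histogram. A minimal sketch of one cell (the 3×3×2 grid layout and the exact binning of the original descriptor are not reproduced here):

```python
import numpy as np

def orientation_histogram(gx, gy, n_bins=4):
    """Quantize gradient orientation into n_bins, weighted by gradient
    magnitude, and L1-normalize. One such histogram per cell of the
    3x3x2 patch grid, concatenated, gives 3*3*2*4 HOG values (HOF adds
    a fifth 'no motion' bin per cell)."""
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi          # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / max(hist.sum(), 1e-12)

# toy cell: a purely horizontal gradient puts all mass in bin 0
h = orientation_histogram(np.ones((8, 8)), np.zeros((8, 8)))
```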

  18. Visual Vocabulary: K-means clustering. • Group similar points in the space of image descriptors using K-means clustering. • Select significant clusters (c1, c2, c3, c4) for classification.

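The clustering-plus-assignment step can be illustrated with a tiny NumPy k-means (in practice a library implementation such as scikit-learn's KMeans would be used; the cluster-significance selection is omitted):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means for building a visual vocabulary (sketch)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):      # keep empty clusters unchanged
                centers[j] = X[labels == j].mean(0)
    return centers

def bow_histogram(descriptors, centers):
    """Assign each local descriptor to its nearest visual word and count
    occurrences: the bag-of-words representation of a video."""
    d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.bincount(d.argmin(1), minlength=len(centers))

# toy: two well-separated clusters of 2-D 'descriptors'
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
centers = kmeans(X, k=2)
h = bow_histogram(X, centers)
```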

  20. Local features: Matching - finding similar events in pairs of video sequences.

  21. Action Classification: Overview. Bag of space-time features + multi-channel SVM [Laptev’03, Schuldt’04, Niebles’06, Zhang’07]: collection of space-time patches → HOG & HOF patch descriptors → histogram of visual words → multi-channel SVM classifier.
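The multi-channel SVM typically combines per-channel chi-square distances into one kernel, K(a, b) = exp(−Σc Dc(a, b)/Ac). A sketch with the normalizers Ac fixed to 1 for simplicity (in practice Ac is usually the mean chi-square distance over training pairs, and the kernel matrix is fed to an SVM as a precomputed kernel):

```python
import numpy as np

def chi2_dist(h1, h2, eps=1e-10):
    """Chi-square distance between two (L1-normalized) BoF histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def multichannel_kernel(channels_a, channels_b):
    """K = exp(-sum_c D_c(a, b) / A_c) with A_c = 1 for this sketch;
    channels are e.g. the HOG-word and HOF-word histograms of a video."""
    D = sum(chi2_dist(a, b) for a, b in zip(channels_a, channels_b))
    return np.exp(-D)

hog = np.array([0.5, 0.5])
hof = np.array([1.0, 0.0])
k_same = multichannel_kernel([hog, hof], [hog, hof])   # identical videos
k_diff = multichannel_kernel([hog, hof], [hof, hog])   # different videos
```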

  22. Action classification results: KTH dataset; Hollywood-2 dataset (AnswerPhone, GetOutCar, HandShake, StandUp, DriveCar, Kiss) [Laptev, Marszałek, Schmid, Rozenfeld 2008]

  23. Action classification Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”

  24. Evaluation of local feature detectors and descriptors. Four types of detectors: • Harris3D [Laptev 2003] • Cuboids [Dollar et al. 2005] • Hessian [Willems et al. 2008] • Regular dense sampling. Four types of descriptors: • HOG/HOF [Laptev et al. 2008] • Cuboids [Dollar et al. 2005] • HOG3D [Kläser et al. 2008] • Extended SURF [Willems et al. 2008]. Three human action datasets: • KTH actions [Schuldt et al. 2004] • UCF Sports [Rodriguez et al. 2008] • Hollywood 2 [Marszałek et al. 2009]

  25. Space-time feature detectors Harris3D Hessian Cuboids Dense

  26. Results on KTH Actions: 6 action classes, 4 scenarios, staged. Average accuracy (descriptor × detector):

             Harris3D  Cuboids  Hessian  Dense
    HOG3D      89.0%    90.0%    84.6%   85.3%
    HOG/HOF    91.8%    88.7%    88.7%   86.1%
    HOG        80.9%    82.3%    77.7%   79.0%
    HOF        92.1%    88.2%    88.6%   88.0%
    Cuboids      -      89.1%      -       -
    E-SURF       -        -      81.4%     -

  • Best results for sparse Harris3D + HOF. • Dense features perform relatively poorly compared to sparse features. [Wang, Ullah, Kläser, Laptev, Schmid, 2009]

  27. Results on UCF Sports (Diving, Kicking, Walking, Skateboarding, High-Bar-Swinging, Golf-Swinging): 10 action classes, videos from TV broadcasts. Average precision (descriptor × detector):

             Harris3D  Cuboids  Hessian  Dense
    HOG3D      79.7%    82.9%    79.0%   85.6%
    HOG/HOF    78.1%    77.7%    79.3%   81.6%
    HOG        71.4%    72.7%    66.0%   77.4%
    HOF        75.4%    76.7%    75.3%   82.6%
    Cuboids      -      76.6%      -       -
    E-SURF       -        -      77.3%     -

  • Best results for dense + HOG3D. [Wang, Ullah, Kläser, Laptev, Schmid, 2009]

  28. Results on Hollywood-2 (AnswerPhone, GetOutCar, Kiss, HandShake, StandUp, DriveCar): 12 action classes collected from 69 movies. Average precision (descriptor × detector):

             Harris3D  Cuboids  Hessian  Dense
    HOG3D      43.7%    45.7%    41.3%   45.3%
    HOG/HOF    45.2%    46.2%    46.0%   47.4%
    HOG        32.8%    39.4%    36.2%   39.4%
    HOF        43.3%    42.9%    43.0%   45.5%
    Cuboids      -      45.0%      -       -
    E-SURF       -        -      38.2%     -

  • Best results for dense + HOG/HOF. [Wang, Ullah, Kläser, Laptev, Schmid, 2009]

  29. Other recent local representations: • L. Yeffet and L. Wolf, "Local Trinary Patterns for Human Action Recognition", ICCV 2009. • P. Matikainen, R. Sukthankar and M. Hebert, "Trajectons: Action Recognition Through the Motion Analysis of Tracked Features", ICCV VOEC Workshop 2009. • H. Wang, A. Kläser, C. Schmid, C.-L. Liu, "Action Recognition by Dense Trajectories", CVPR 2011. • J. Liu, B. Kuipers, S. Savarese, "Recognizing Human Actions by Attributes", CVPR 2011.

  30. Dense trajectory descriptors [Wang et al. CVPR’ 11]

  31. Dense trajectory descriptors [Wang et al. CVPR’11]

  32. Dense trajectory descriptors [Wang et al. CVPR’11]: computational cost.
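The core of the dense-trajectory idea, propagating densely sampled points through per-frame optical-flow fields, can be sketched as follows (the flow fields are assumed given; the median filtering of the flow and the trajectory pruning of the paper are omitted):

```python
import numpy as np

def track_dense_points(points, flows):
    """Propagate points (N, 2) arrays of (x, y) through a sequence of
    optical-flow fields flows[t] of shape (H, W, 2) holding (dx, dy)
    per pixel. Returns trajectories of shape (T+1, N, 2)."""
    traj = [np.asarray(points, float)]
    for flow in flows:
        p = traj[-1]
        # nearest-pixel flow lookup, clipped to the image bounds
        ix = np.clip(p[:, 0].round().astype(int), 0, flow.shape[1] - 1)
        iy = np.clip(p[:, 1].round().astype(int), 0, flow.shape[0] - 1)
        traj.append(p + flow[iy, ix])
    return np.stack(traj)

# toy: constant flow of (1, 0) everywhere for 3 frames moves every
# point 3 pixels to the right
flows = [np.tile([1.0, 0.0], (10, 10, 1)) for _ in range(3)]
traj = track_dense_points([[2.0, 2.0], [5.0, 5.0]], flows)
```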

  33. Highly-efficient video descriptors Optical flow from MPEG video compression

  34. Highly-efficient video descriptors. Evaluation on Hollywood2 [Wang et al.’11]. Evaluation on UCF50 [Wang et al.’11]. [Kantorov & Laptev, 2013]

  35. Beyond BOF: Temporal structure • Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification, J.C. Niebles, C.-W. Chen and L. Fei-Fei, ECCV 2010 • Learning Latent Temporal Structure for Complex Event Detection. Kevin Tang, Li Fei-Fei and Daphne Koller, CVPR 2012

  36. Beyond BOF: Social roles • T. Yu, S.-N. Lim, K. Patwardhan, and N. Krahnstoever. Monitoring, recognizing and discovering social networks. In CVPR, 2009. • L. Ding and A. Yilmaz. Learning relations among movie characters: A social network perspective. In ECCV, 2010 • V. Ramanathan, B. Yao, and L. Fei-Fei. Social Role Discovery in Human Events. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2013.

  37. Beyond BOF: Egocentric activities • A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In ICCV, 2011. • H. Pirsiavash, D. Ramanan. Recognizing Activities of Daily Living in First-Person Camera Views, In CVPR, 2012.

  38. Beyond BOF: Action localization. Manual annotation of drinking actions in the movies “Coffee and Cigarettes” and “Sea of Love”: “Drinking” - 159 annotated samples; “Smoking” - 149 annotated samples. Temporal annotation: first frame, keyframe, last frame. Spatial annotation: head rectangle, torso rectangle.
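A hypothetical container for one such annotated sample (the field names and layout are illustrative, not the actual annotation format used for the dataset):

```python
from dataclasses import dataclass

@dataclass
class Rect:
    """Axis-aligned box: top-left corner plus width and height, in pixels."""
    x: int
    y: int
    w: int
    h: int

@dataclass
class ActionAnnotation:
    """One annotated action sample: temporal extent (first/key/last frame)
    plus head and torso rectangles on the keyframe."""
    label: str
    first_frame: int
    keyframe: int
    last_frame: int
    head: Rect
    torso: Rect

# illustrative sample with made-up coordinates
ann = ActionAnnotation("Drinking", 120, 131, 140,
                       head=Rect(40, 30, 24, 24),
                       torso=Rect(30, 55, 48, 60))
```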

  39. Action representation Hist. of Gradient Hist. of Optic Flow

  40. Action learning. AdaBoost: • efficient discriminative classifier [Freund & Schapire’97]; • good performance for face detection [Viola & Jones’01]. Boosting selects features via weak classifiers on pre-aligned samples: Haar features with an optimal threshold; here, histogram features with a Fisher discriminant. [Laptev, Perez 2007]
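A minimal AdaBoost with threshold stumps as weak classifiers, in the spirit of the slide (illustrative only; the actual system boosts Fisher discriminants over histogram features rather than single-feature stumps):

```python
import numpy as np

def adaboost_stumps(X, y, rounds=10):
    """AdaBoost [Freund & Schapire '97] over threshold stumps. X is
    (n, d), y in {-1, +1}. Each round picks the (feature, threshold,
    polarity) stump with lowest weighted error, then reweights samples."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []                          # (feature, threshold, polarity, alpha)
    for _ in range(rounds):
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)     # upweight misclassified samples
        w /= w.sum()
        ensemble.append((j, thr, pol, alpha))
    return ensemble

def predict(ensemble, X):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(a * np.where(p * (X[:, j] - t) >= 0, 1, -1)
                for j, t, p, a in ensemble)
    return np.where(score >= 0, 1, -1)

# toy 1-D problem, separable at x = 0
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1, -1, 1, 1])
clf = adaboost_stumps(X, y, rounds=3)
```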

  41. Action Detection Test episodes from the movie “Coffee and cigarettes” [Laptev, Perez 2007]

  42. 20 most confident detections

  43. Where to get training data? Weakly-supervised learning

  44. Actions in movies • Realistic variation of human actions • Many classes and many examples per class • Typically only a few class-samples per movie • Manual annotation is very time consuming
