  1. ENS/INRIA CVML Summer School, 45 rue d'Ulm, Paris, July 26, 2013. Modeling and visual recognition of human actions. Ivan Laptev ivan.laptev@inria.fr WILLOW, INRIA/ENS/CNRS, Paris

  2. Objects: cars, glasses, people, etc. Actions: drinking, running, door exit, car enter, etc. Scene categories: indoors, outdoors, street scene, etc. Geometry: street, wall, field, stair, etc. Constraints relate objects, actions, scenes and geometry.

  3. Human Actions: Why do we care?

  4. Why video analysis? Data: TV channels recorded since the 1960s; >34K hours of video uploaded every day; ~30M surveillance cameras in the US, i.e. ~700K video hours/day.

  5. Why video analysis? Applications: first appearance of N. Sarkozy on TV; sociology research (influence of character smoking in movies); education ("How do I make a pizza?"); predicting crowd behavior; "Where is my cat?"; motion capture and animation; counting people.

  6. Why human actions? How many person-pixels are in the video? Movies TV YouTube

  7. Why human actions? How many person-pixels are in the video? Movies: ~35%, TV: ~34%, YouTube: ~40%.

  8. How many person pixels in our daily life?  Wearable camera data: Microsoft SenseCam dataset

  9. How many person pixels in our daily life?  Wearable camera data: Microsoft SenseCam dataset ~4%

  10. Why do we prefer to watch other people?  Why do we watch TV, Movies, … at all?  Why do we read books? “… books teach us new patterns of behavior…” Olga Slavnikova Russian journalist and writer

  11. Why is action recognition difficult?

  12. Challenges:  Large variations in appearance: occlusions, non-rigid motion, viewpoint changes, clothing (e.g. action "Hugging").  Manual collection of training samples is prohibitive: many action classes, rare occurrence.  The action vocabulary is not well-defined (e.g. action "Open").

  13. How to recognize actions?

  14. Activities characterized by a pose Slide credit: A. Zisserman

  15. Activities characterized by a pose. Examples from the VOC action recognition challenge.

  16. Human pose estimation (1990–2000) • Finding People by Sampling, Ioffe & Forsyth, ICCV 1999 • Pictorial Structure Models for Object Recognition, Felzenszwalb & Huttenlocher, 2000 • Learning to Parse Pictures of People, Ronfard, Schmid & Triggs, ECCV 2002

  17. Human pose estimation • Y. Yang and D. Ramanan. Articulated Pose Estimation with Flexible Mixtures-of-Parts. In Proc. CVPR 2011. Extension of the LSVM model of Felzenszwalb et al. • Y. Wang, D. Tran and Z. Liao. Learning Hierarchical Poselets for Human Parsing. In Proc. CVPR 2011. Builds on the Poselets idea of Bourdev et al. • S. Johnson and M. Everingham. Learning Effective Human Pose Estimation from Inaccurate Annotation. In Proc. CVPR 2011. Learns from lots of noisy annotations. • B. Sapp, D. Weiss and B. Taskar. Parsing Human Motion with Stretchable Models. In Proc. CVPR 2011. Explores temporal continuity.

  18. Human pose estimation J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman and A. Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. (Best paper award at CVPR 2011)

  19. Pose estimation is still a hard problem. Issues: • occlusions • clothing and pose variations

  20. Appearance methods: Shape. Idea: summarize motion in video in a Motion History Image (MHI) [A.F. Bobick and J.W. Davis, PAMI 2001]. See also: L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as Space-Time Shapes, 2007.
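The MHI idea above has a very small core: pixels that moved in the current frame are stamped with a history length τ, and all other pixels decay by one per frame. A minimal NumPy sketch of that update rule (the function name, the value τ=30, and the toy mask are illustrative, not from the slides):

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau=30):
    """One step of the Bobick & Davis Motion History Image update.

    mhi         -- float array (H x W), current motion history
    motion_mask -- bool array (H x W), pixels that moved in the new frame
    tau         -- history length in frames (illustrative value)
    """
    # moving pixels are reset to tau; all others decay toward zero
    return np.where(motion_mask, float(tau), np.maximum(mhi - 1.0, 0.0))

# toy usage: motion in a small region, then one frame with no motion
mhi = np.zeros((4, 4))
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
mhi = update_mhi(mhi, mask)                           # moving pixels set to tau
mhi = update_mhi(mhi, np.zeros((4, 4), dtype=bool))   # then decay by 1
```

Recent motion thus appears bright and older motion progressively darker, which is why a single MHI summarizes a whole action.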

  21. Appearance methods: Shape. Pros: + simple and fast; + works in controlled settings. Cons: - prone to errors of background subtraction (variations in light, shadows, clothing; what is the background here?); - does not capture interior structure and motion (the silhouette tells little about the action).

  22. Appearance methods: Motion. Learning Parameterized Models of Image Motion, M.J. Black, Y. Yacoob, A.D. Jepson and D.J. Fleet, 1997. Recognizing Action at a Distance, A.A. Efros, A.C. Berg, G. Mori, and J. Malik, 2003: the optical flow is split into four half-wave rectified channels F_x+, F_x-, F_y+, F_y-, each blurred with a Gaussian.
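The four-channel flow descriptor of Efros et al. can be sketched in a few lines: rectify each flow component into its positive and negative parts, then blur. A minimal NumPy/SciPy version (function name, σ value, and the toy input are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def flow_channels(fx, fy, sigma=2.0):
    """Split an optical-flow field into four half-wave rectified, blurred
    channels Fx+, Fx-, Fy+, Fy-, as in Efros et al. 2003.

    fx, fy -- optical-flow components, arrays of shape (H, W)
    """
    channels = [np.maximum(fx, 0), np.maximum(-fx, 0),   # rightward / leftward
                np.maximum(fy, 0), np.maximum(-fy, 0)]   # downward / upward
    return [gaussian_filter(c, sigma) for c in channels]

# toy usage: uniform rightward motion
fxp, fxm, fyp, fym = flow_channels(np.ones((8, 8)), np.zeros((8, 8)))
```

The rectification keeps opposite motion directions in separate non-negative channels, so blurring cannot cancel them against each other.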

  23. Action recognition with local features

  24. Local space-time features + No segmentation needed + No object detection/tracking needed - Loss of global structure [Laptev 2005]

  25. Local approach: Bag of Visual Words Airplanes Motorbikes Faces Wild Cats Leaves People Bikes

  26. Space-Time Interest Points: Detection. What neighborhoods to consider? Distinctive neighborhoods with high image variation in space and time; look at the distribution of the gradient. Definitions: original image sequence; space-time Gaussian with covariance; Gaussian derivatives; space-time gradient; second-moment matrix. [Laptev 2005]
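The second-moment matrix mentioned above is built from Gaussian-smoothed products of space-time derivatives. A rough NumPy/SciPy sketch under stated assumptions (scale values, the integration-scale factor s, and the function name are illustrative; this is not Laptev's reference implementation):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def second_moment_matrix(f, sigma=1.5, tau=1.5, s=2.0):
    """Second-moment matrix of a video volume f (T x H x W), the quantity
    analyzed by the Harris3D space-time interest point detector.

    sigma, tau -- spatial / temporal smoothing scales (illustrative values)
    s          -- factor relating derivative and integration scales
    Returns the six distinct entries of the symmetric 3x3 matrix per voxel.
    """
    g = gaussian_filter(f.astype(float), (tau, sigma, sigma))  # space-time Gaussian
    Lt, Ly, Lx = np.gradient(g)                                # space-time gradient
    w = (s * tau, s * sigma, s * sigma)                        # integration scale
    avg = lambda a: gaussian_filter(a, w)                      # local averaging
    return {'xx': avg(Lx * Lx), 'yy': avg(Ly * Ly), 'tt': avg(Lt * Lt),
            'xy': avg(Lx * Ly), 'xt': avg(Lx * Lt), 'yt': avg(Ly * Lt)}

mu = second_moment_matrix(np.random.rand(6, 12, 12))
```

Points where all three eigenvalues of this matrix are large have strong, distinctive variation in both space and time, which is exactly the interest-point criterion the slide describes.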

  27. Local features: Proof of concept  Finds similar events in pairs of video sequences

  28. Bag-of-Features action recognition: extraction of local space-time patches → feature description → feature quantization (k-means clustering, k=4000) → occurrence histogram of visual words → non-linear SVM with χ² kernel. [Laptev, Marszałek, Schmid, Rozenfeld 2008]
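The quantization and kernel steps of this pipeline are compact enough to sketch. A toy NumPy version (function names, the tiny 2-word vocabulary, and γ are illustrative; the real system uses k=4000 words and multi-channel kernels):

```python
import numpy as np

def bof_histogram(descriptors, vocabulary):
    """Assign each local descriptor to its nearest visual word (feature
    quantization) and build a normalized occurrence histogram."""
    # squared distances: (num_descriptors, num_words)
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

def chi2_kernel(h1, h2, gamma=1.0):
    """Exponential chi-square kernel between two histograms, the kind of
    non-linear SVM kernel used in the pipeline above."""
    chi2 = 0.5 * ((h1 - h2) ** 2 / (h1 + h2 + 1e-10)).sum()
    return np.exp(-gamma * chi2)

# toy vocabulary of 2 "visual words" and 3 local descriptors
vocab = np.array([[0.0, 0.0], [1.0, 1.0]])
desc = np.array([[0.1, 0.0], [0.9, 1.0], [1.0, 1.0]])
hist = bof_histogram(desc, vocab)
```

The χ² kernel compares histograms bin by bin and is a standard choice for bag-of-words representations because it down-weights differences in heavily populated bins.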

  29. Action classification results KTH dataset Hollywood-2 dataset AnswerPhone GetOutCar HandShake StandUp DriveCar Kiss [Laptev, Marszałek , Schmid, Rozenfeld 2008]

  30. Action classification Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”

  31. Evaluation of local feature detectors and descriptors. Four types of detectors: • Harris3D [Laptev 2003] • Cuboids [Dollar et al. 2005] • Hessian [Willems et al. 2008] • Regular dense sampling. Four types of descriptors: • HOG/HOF [Laptev et al. 2008] • Cuboids [Dollar et al. 2005] • HOG3D [Kläser et al. 2008] • Extended SURF [Willems et al. 2008]. Three human action datasets: • KTH actions [Schuldt et al. 2004] • UCF Sports [Rodriguez et al. 2008] • Hollywood-2 [Marszałek et al. 2009]

  32. Space-time feature detectors Harris3D Hessian Cuboids Dense

  33. Results on KTH Actions (6 action classes, 4 scenarios, staged). Average accuracy:

                 Harris3D   Cuboids   Hessian   Dense
      HOG3D       89.0%      90.0%     84.6%    85.3%
      HOG/HOF     91.8%      88.7%     88.7%    86.1%
      HOG         80.9%      82.3%     77.7%    79.0%
      HOF         92.1%      88.2%     88.6%    88.0%
      Cuboids       -        89.1%       -        -
      E-SURF        -          -       81.4%      -

  • Best results for sparse Harris3D + HOF
  • Dense features perform relatively poorly compared to sparse features
  [Wang, Ullah, Kläser, Laptev, Schmid, 2009]

  34. Results on UCF Sports (10 action classes, videos from TV broadcasts: diving, kicking, walking, skateboarding, high-bar swinging, golf swinging, …). Average precision:

                 Harris3D   Cuboids   Hessian   Dense
      HOG3D       79.7%      82.9%     79.0%    85.6%
      HOG/HOF     78.1%      77.7%     79.3%    81.6%
      HOG         71.4%      72.7%     66.0%    77.4%
      HOF         75.4%      76.7%     75.3%    82.6%
      Cuboids       -        76.6%       -        -
      E-SURF        -          -       77.3%      -

  • Best results for dense + HOG3D
  [Wang, Ullah, Kläser, Laptev, Schmid, 2009]

  35. Results on Hollywood-2 (12 action classes collected from 69 movies: AnswerPhone, GetOutCar, Kiss, HandShake, StandUp, DriveCar, …). Average precision:

                 Harris3D   Cuboids   Hessian   Dense
      HOG3D       43.7%      45.7%     41.3%    45.3%
      HOG/HOF     45.2%      46.2%     46.0%    47.4%
      HOG         32.8%      39.4%     36.2%    39.4%
      HOF         43.3%      42.9%     43.0%    45.5%
      Cuboids       -        45.0%       -        -
      E-SURF        -          -       38.2%      -

  • Best results for dense + HOG/HOF
  [Wang, Ullah, Kläser, Laptev, Schmid, 2009]

  36. Other recent local representations • L. Yeffet and L. Wolf, "Local Trinary Patterns for Human Action Recognition", ICCV 2009 • P. Matikainen, R. Sukthankar and M. Hebert, "Trajectons: Action Recognition Through the Motion Analysis of Tracked Features", ICCV VOEC Workshop 2009 • H. Wang, A. Kläser, C. Schmid, C.-L. Liu, "Action Recognition by Dense Trajectories", CVPR 2011 • J. Liu, B. Kuipers, S. Savarese, "Recognizing Human Actions by Attributes", CVPR 2011

  37. Dense trajectory descriptors [Wang et al. CVPR’11]
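The slide gives only the title; the core step of the dense trajectories method is advancing densely sampled points through a median-filtered dense optical-flow field. A rough sketch under stated assumptions (the function name, kernel size, and toy flow field are illustrative; this is not the authors' implementation):

```python
import numpy as np
from scipy.ndimage import median_filter

def propagate_points(points, flow, ksize=3):
    """Advance tracked points by one frame through a dense flow field,
    smoothing the flow with a median filter as in Wang et al. CVPR'11.

    points -- (N, 2) array of (x, y) positions
    flow   -- (H, W, 2) array of (dx, dy) flow vectors
    """
    fx = median_filter(flow[..., 0], size=ksize)  # median filtering reduces
    fy = median_filter(flow[..., 1], size=ksize)  # trajectory drift
    # sample the flow at each point's nearest pixel and step forward
    xi = np.clip(points[:, 0].round().astype(int), 0, flow.shape[1] - 1)
    yi = np.clip(points[:, 1].round().astype(int), 0, flow.shape[0] - 1)
    return points + np.stack([fx[yi, xi], fy[yi, xi]], axis=1)

# toy usage: uniform rightward flow moves a point one pixel to the right
flow = np.zeros((8, 8, 2))
flow[..., 0] = 1.0
pts = propagate_points(np.array([[2.0, 2.0]]), flow)
```

Concatenating these per-frame displacements over ~15 frames yields the trajectory shape descriptor; HOG, HOF and MBH descriptors are then computed in a tube around each trajectory.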

  38. Dense trajectory descriptors [Wang et al. CVPR’11]

  39. Dense trajectory descriptors [Wang et al. CVPR’11] Computational cost:

  40. Highly-efficient video descriptors Optical flow from MPEG video compression

  41. Highly-efficient video descriptors. Evaluation on Hollywood2 and UCF50 (setup of [Wang et al.’11]). [Kantorov & Laptev, 2013]

  42. Beyond BOF: Temporal structure • Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification, J.C. Niebles, C.-W. Chen and L. Fei-Fei, ECCV 2010 • Learning Latent Temporal Structure for Complex Event Detection, Kevin Tang, Li Fei-Fei and Daphne Koller, CVPR 2012

  43. Beyond BOF: Social roles • T. Yu, S.-N. Lim, K. Patwardhan, and N. Krahnstoever. Monitoring, recognizing and discovering social networks. In CVPR, 2009. • L. Ding and A. Yilmaz. Learning relations among movie characters: A social network perspective. In ECCV, 2010 • V. Ramanathan, B. Yao, and L. Fei-Fei. Social Role Discovery in Human Events. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2013.
