ENS/INRIA CVML Summer School, 45 rue d’Ulm, Paris, July 26, 2013
Modeling and visual recognition of human actions
Ivan Laptev (ivan.laptev@inria.fr), WILLOW, INRIA/ENS/CNRS, Paris
Objects: cars, glasses, people, etc.
Actions: drinking, running, door exit, car enter, etc.
Scene categories: indoors, outdoors, street scene, etc.
Geometry: street, wall, field, stair, etc.
(mutual constraints between these cues)
Human Actions: Why do we care?
Why video analysis? Data:
• TV channels recorded since the 60’s
• >34K hours of video uploaded every day
• ~30M surveillance cameras in the US => ~700K video hours/day
Why video analysis? Applications:
• First appearance of N. Sarkozy on TV
• Sociology research: influence of character smoking in movies
• Education: How do I make a pizza?
• Predicting crowd behavior
• Where is my cat?
• Motion capture and animation
• Counting people
Why human actions? How many person-pixels are in the video? Movies: ~35%, TV: ~34%, YouTube: ~40%
How many person-pixels in our daily life? Wearable camera data (Microsoft SenseCam dataset): ~4%
Why do we prefer to watch other people? Why do we watch TV, movies, … at all? Why do we read books? “… books teach us new patterns of behavior …” — Olga Slavnikova, Russian journalist and writer
Why is action recognition difficult?
Challenges:
• Large variations in appearance: occlusions, non-rigid motion, viewpoint changes, clothing, …
• Manual collection of training samples is prohibitive: many action classes, rare occurrence
• Action vocabulary is not well-defined (e.g., actions “Hugging”, “Open”)
How to recognize actions?
Activities characterized by a pose Slide credit: A. Zisserman
Activities characterized by a pose: examples from the VOC action recognition challenge
Human pose estimation (1990-2000)
• Finding People by Sampling. Ioffe & Forsyth, ICCV 1999
• Pictorial Structure Models for Object Recognition. Felzenszwalb & Huttenlocher, 2000
• Learning to Parse Pictures of People. Ronfard, Schmid & Triggs, ECCV 2002
Human pose estimation
• Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. CVPR 2011. Extension of the LSVM model of Felzenszwalb et al.
• Y. Wang, D. Tran and Z. Liao. Learning Hierarchical Poselets for Human Parsing. CVPR 2011. Builds on the Poselets idea of Bourdev et al.
• S. Johnson and M. Everingham. Learning Effective Human Pose Estimation from Inaccurate Annotation. CVPR 2011. Learns from lots of noisy annotations.
• B. Sapp, D. Weiss and B. Taskar. Parsing Human Motion with Stretchable Models. CVPR 2011. Explores temporal continuity.
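Pictorial structure models, which several of the papers above build on, score each part by local appearance plus pairwise “spring” terms between connected parts; on a tree, the best pose is found exactly by dynamic programming. A minimal sketch under stated assumptions (toy random appearance maps, a hypothetical 3-part skeleton, brute-force maximization instead of the generalized distance transform used in practice):

```python
import numpy as np

def best_pose(parts, edges, unary, spring=1.0):
    """Exact MAP inference for a tree-structured pictorial structure.
    parts  : part names, parts[0] is the root, children have higher indices
    edges  : (parent_index, child_index) pairs forming a tree
    unary  : dict part -> (H, W) appearance score map
    spring : weight of the squared-distance deformation penalty
    """
    H, W = next(iter(unary.values())).shape
    ys, xs = np.mgrid[0:H, 0:W]
    msgs = {p: unary[p].copy() for p in parts}
    children = {i: [c for (p, c) in edges if p == i] for i in range(len(parts))}

    # Pass messages from leaves to root (children processed before parents).
    for i in reversed(range(len(parts))):
        for c in children[i]:
            child = msgs[parts[c]]
            best = np.empty((H, W))
            for y in range(H):          # brute force; real systems use
                for x in range(W):      # the generalized distance transform
                    pen = spring * ((ys - y) ** 2 + (xs - x) ** 2)
                    best[y, x] = (child - pen).max()
            msgs[parts[i]] += best
    root = msgs[parts[0]]
    return np.unravel_index(root.argmax(), root.shape), root.max()

# Toy example: 3 hypothetical parts on a 20x20 grid, random appearance.
rng = np.random.default_rng(0)
parts = ["torso", "head", "arm"]
unary = {p: rng.random((20, 20)) for p in parts}
loc, score = best_pose(parts, [(0, 1), (0, 2)], unary, spring=0.05)
print("root location:", loc, "score:", round(score, 3))
```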
Human pose estimation J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman and A. Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. (Best paper award at CVPR 2011)
Pose estimation is still a hard problem. Issues: • occlusions • clothing and pose variations
Appearance methods: Shape
Idea: summarize the motion in a video in a Motion History Image (MHI) [A.F. Bobick and J.W. Davis, PAMI 2001]
L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. 2007
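An MHI stacks recent frame differences so that newer motion is brighter and older motion fades out. A minimal numpy sketch in the Bobick & Davis style; the frame-differencing threshold and the decay duration tau are illustrative assumptions:

```python
import numpy as np

def motion_history(frames, tau=20, thresh=15):
    """Motion History Image: pixels hit by recent motion are bright,
    older motion decays linearly to zero."""
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    prev = frames[0].astype(np.int16)
    for frame in frames[1:]:
        cur = frame.astype(np.int16)
        moving = np.abs(cur - prev) > thresh   # crude change detection
        mhi = np.where(moving, tau, np.maximum(mhi - 1, 0))
        prev = cur
    return mhi / tau                            # normalize to [0, 1]

# Toy usage: a bright square moving right across a dark background.
frames = []
for t in range(10):
    f = np.zeros((64, 64), dtype=np.uint8)
    f[20:30, 5 + 4 * t: 15 + 4 * t] = 255
    frames.append(f)
print(motion_history(frames).max())  # 1.0 at the most recent motion
```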
Appearance methods: Shape
Pros:
+ Simple and fast
+ Works in controlled settings
Cons:
- Prone to errors of background subtraction (variations in light, shadows, clothing; what is the background here?)
- Does not capture interior structure and motion (the silhouette tells little about actions)
Appearance methods: Motion
Learning Parameterized Models of Image Motion. M.J. Black, Y. Yacoob, A.D. Jepson and D.J. Fleet, 1997
Recognizing action at a distance. A.A. Efros, A.C. Berg, G. Mori, and J. Malik, 2003: optical flow is split into blurred, half-wave rectified channels $F_x^+, F_x^-, F_y^+, F_y^-$
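The Efros et al. channels rectify each flow component into positive and negative parts and blur them, which makes the descriptor robust to small tracking jitter. A minimal sketch, assuming the flow field (fx, fy) has already been computed by some optical flow routine:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def motion_channels(fx, fy, sigma=2.0):
    """Blurred, half-wave rectified flow channels F_x+, F_x-, F_y+, F_y-."""
    channels = [np.maximum(fx, 0), np.maximum(-fx, 0),
                np.maximum(fy, 0), np.maximum(-fy, 0)]
    return [gaussian_filter(c, sigma) for c in channels]

# Toy flow: uniform rightward motion -> only the F_x+ channel responds.
fx = np.ones((32, 32)); fy = np.zeros((32, 32))
fxp, fxm, fyp, fym = motion_channels(fx, fy)
print(fxp.mean(), fxm.mean())  # ~1.0 and 0.0
```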
Action recognition with local features
Local space-time features [Laptev 2005]
+ No segmentation needed
+ No object detection/tracking needed
- Loss of global structure
Local approach: Bag of Visual Words (example categories: Airplanes, Motorbikes, Faces, Wild Cats, Leaves, People, Bikes)
Space-Time Interest Points: Detection
What neighborhoods to consider? Distinctive neighborhoods with high image variation in space and time => look at the distribution of the gradient.
Definitions:
Original image sequence: $f(x,y,t)$
Space-time Gaussian with covariance $\Sigma$: $g(x,y,t;\Sigma)$
Gaussian derivative of $f$: $L_\xi = \partial_\xi (g * f)$
Space-time gradient: $\nabla L = (L_x, L_y, L_t)^T$
Second-moment matrix: $\mu = g * \left(\nabla L \, (\nabla L)^T\right)$
[Laptev 2005]
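A minimal numpy sketch of these quantities: space-time gradients of a smoothed sequence, the locally averaged second-moment matrix $\mu$, and an interest operator of the Harris3D form $H = \det\mu - k\,\mathrm{trace}^3\mu$. The smoothing scales and k are illustrative, not the paper's exact settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d(f, sigma=1.5, tau=1.5, s=2.0, k=0.005):
    """Space-time interest operator on a (T, H, W) sequence f.
    sigma/tau: spatial/temporal derivation scales; s: integration scale."""
    L = gaussian_filter(f.astype(np.float32), (tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)                 # space-time gradient of L
    mu = {}                                     # entries of the 3x3 matrix mu
    for a, b, da, db in [("x","x",Lx,Lx), ("x","y",Lx,Ly), ("x","t",Lx,Lt),
                         ("y","y",Ly,Ly), ("y","t",Ly,Lt), ("t","t",Lt,Lt)]:
        mu[a+b] = gaussian_filter(da * db, (s*tau, s*sigma, s*sigma))
    det = (mu["xx"]*(mu["yy"]*mu["tt"] - mu["yt"]**2)
           - mu["xy"]*(mu["xy"]*mu["tt"] - mu["yt"]*mu["xt"])
           + mu["xt"]*(mu["xy"]*mu["yt"] - mu["yy"]*mu["xt"]))
    trace = mu["xx"] + mu["yy"] + mu["tt"]
    return det - k * trace**3                   # large at space-time corners

# Toy sequence: a bright dot that appears mid-sequence (a space-time event).
f = np.zeros((9, 32, 32)); f[4, 16, 16] = 1.0
H = harris3d(f)
print(np.unravel_index(H.argmax(), H.shape))    # near (4, 16, 16)
```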
Local features: Proof of concept Finds similar events in pairs of video sequences
Bag-of-Features action recognition [Laptev, Marszałek, Schmid, Rozenfeld 2008]:
extraction of local space-time patches => feature description => feature quantization by k-means clustering (k=4000) => occurrence histogram of visual words => non-linear SVM with a χ² kernel
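The pipeline maps each video to a fixed-length histogram over a learned vocabulary and classifies with a χ² kernel SVM. A minimal scikit-learn sketch with random stand-in descriptors and a small vocabulary instead of k = 4000:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
labels = np.array([0] * 10 + [1] * 10)
# Stand-ins for local HOG/HOF descriptors: one (n_features, 72) array per
# video, shifted by the class label so the toy classes are separable.
videos = [rng.random((100, 72)) + 0.3 * y for y in labels]

# 1) Vocabulary: k-means over all training descriptors.
vocab = KMeans(n_clusters=20, n_init=5, random_state=0)
vocab.fit(np.vstack(videos))

# 2) Each video -> normalized histogram of visual-word occurrences.
def bof_histogram(desc):
    words = vocab.predict(desc)
    h = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return h / h.sum()

X = np.array([bof_histogram(v) for v in videos])

# 3) Non-linear SVM with a precomputed exponential chi-square kernel.
K = chi2_kernel(X, gamma=0.5)
clf = SVC(kernel="precomputed").fit(K, labels)
print("train accuracy:", clf.score(K, labels))
```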
Action classification results: KTH dataset and Hollywood-2 dataset (AnswerPhone, GetOutCar, HandShake, StandUp, DriveCar, Kiss) [Laptev, Marszałek, Schmid, Rozenfeld 2008]
Action classification Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”
Evaluation of local feature detectors and descriptors
Four types of detectors:
• Harris3D [Laptev 2003]
• Cuboids [Dollar et al. 2005]
• Hessian [Willems et al. 2008]
• Regular dense sampling
Four types of descriptors:
• HOG/HOF [Laptev et al. 2008]
• Cuboids [Dollar et al. 2005]
• HOG3D [Kläser et al. 2008]
• Extended SURF [Willems et al. 2008]
Three human action datasets:
• KTH actions [Schuldt et al. 2004]
• UCF Sports [Rodriguez et al. 2008]
• Hollywood-2 [Marszałek et al. 2009]
Space-time feature detectors Harris3D Hessian Cuboids Dense
Results on KTH Actions (6 action classes, 4 scenarios, staged)
Average accuracy scores:
Descriptor   Harris3D  Cuboids  Hessian  Dense
HOG3D        89.0%     90.0%    84.6%    85.3%
HOG/HOF      91.8%     88.7%    88.7%    86.1%
HOG          80.9%     82.3%    77.7%    79.0%
HOF          92.1%     88.2%    88.6%    88.0%
Cuboids      -         89.1%    -        -
E-SURF       -         -        81.4%    -
• Best results for sparse Harris3D + HOF
• Dense features perform relatively poorly compared to sparse features
[Wang, Ullah, Kläser, Laptev, Schmid, 2009]
Results on UCF Sports (10 action classes, videos from TV broadcasts: Diving, Kicking, Walking, Skateboarding, High-Bar-Swinging, Golf-Swinging, …)
Average precision scores:
Descriptor   Harris3D  Cuboids  Hessian  Dense
HOG3D        79.7%     82.9%    79.0%    85.6%
HOG/HOF      78.1%     77.7%    79.3%    81.6%
HOG          71.4%     72.7%    66.0%    77.4%
HOF          75.4%     76.7%    75.3%    82.6%
Cuboids      -         76.6%    -        -
E-SURF       -         -        77.3%    -
• Best results for dense + HOG3D
[Wang, Ullah, Kläser, Laptev, Schmid, 2009]
Results on Hollywood-2 (12 action classes collected from 69 movies: AnswerPhone, GetOutCar, Kiss, HandShake, StandUp, DriveCar, …)
Average precision scores:
Descriptor   Harris3D  Cuboids  Hessian  Dense
HOG3D        43.7%     45.7%    41.3%    45.3%
HOG/HOF      45.2%     46.2%    46.0%    47.4%
HOG          32.8%     39.4%    36.2%    39.4%
HOF          43.3%     42.9%    43.0%    45.5%
Cuboids      -         45.0%    -        -
E-SURF       -         -        38.2%    -
• Best results for dense + HOG/HOF
[Wang, Ullah, Kläser, Laptev, Schmid, 2009]
Other recent local representations
• L. Yeffet and L. Wolf. Local Trinary Patterns for Human Action Recognition. ICCV 2009
• P. Matikainen, R. Sukthankar and M. Hebert. Trajectons: Action Recognition Through the Motion Analysis of Tracked Features. ICCV VOEC Workshop 2009
• H. Wang, A. Kläser, C. Schmid, C.-L. Liu. Action Recognition by Dense Trajectories. CVPR 2011
• J. Liu, B. Kuipers, S. Savarese. Recognizing Human Actions by Attributes. CVPR 2011
Dense trajectory descriptors [Wang et al. CVPR’11]
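Dense trajectories sample points on a regular grid and carry each point forward through the optical-flow field, collecting descriptors (HOG/HOF/MBH) along the way. A minimal tracking sketch, assuming per-frame flow fields are already given; the median filtering of the flow and the descriptor extraction from the original method are omitted:

```python
import numpy as np

def track_dense(flows, step=8, max_len=15):
    """Track a dense grid of points through a list of (H, W, 2) flow fields;
    returns trajectories of shape (n_points, n_frames+1, 2) as (x, y)."""
    H, W, _ = flows[0].shape
    ys, xs = np.mgrid[step//2:H:step, step//2:W:step]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    traj = [pts.copy()]
    for flow in flows[:max_len]:
        xi = np.clip(pts[:, 0].round().astype(int), 0, W - 1)
        yi = np.clip(pts[:, 1].round().astype(int), 0, H - 1)
        pts = pts + flow[yi, xi]        # advect each point by the local flow
        traj.append(pts.copy())
    return np.stack(traj, axis=1)

# Toy flow: uniform motion of (+1, 0) pixels per frame over 5 frames.
flows = [np.stack([np.ones((48, 48)), np.zeros((48, 48))], axis=-1)] * 5
T = track_dense(flows)
print(T.shape, T[0, 0], "->", T[0, -1])   # each point drifts 5 px right
```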
Dense trajectory descriptors [Wang et al. CVPR’11]: computational cost
Highly-efficient video descriptors Optical flow from MPEG video compression
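Video codecs already store block-level motion vectors, so an approximate flow field comes almost for free at decode time. A minimal sketch, assuming the per-macroblock vectors have already been exported from the decoder as a (H/block, W/block, 2) array; nearest-neighbor upsampling turns them into a dense per-pixel flow:

```python
import numpy as np

def mv_to_flow(mv, block=16):
    """Nearest-neighbor upsampling of per-macroblock motion vectors
    (shape (H/block, W/block, 2)) to a dense per-pixel flow field."""
    return np.repeat(np.repeat(mv, block, axis=0), block, axis=1)

# Toy 2x2 macroblock grid -> 32x32 per-pixel flow.
mv = np.array([[[1, 0], [0, 1]],
               [[-1, 0], [0, -1]]], dtype=np.float32)
flow = mv_to_flow(mv)
print(flow.shape)        # (32, 32, 2)
```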
Highly-efficient video descriptors: evaluation on Hollywood-2 and UCF50, compared against [Wang et al.’11] [Kantorov & Laptev, 2013]
Beyond BOF: Temporal structure
• Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification. J.C. Niebles, C.-W. Chen and L. Fei-Fei, ECCV 2010
• Learning Latent Temporal Structure for Complex Event Detection. K. Tang, L. Fei-Fei and D. Koller, CVPR 2012
Beyond BOF: Social roles
• T. Yu, S.-N. Lim, K. Patwardhan, and N. Krahnstoever. Monitoring, recognizing and discovering social networks. CVPR 2009
• L. Ding and A. Yilmaz. Learning relations among movie characters: A social network perspective. ECCV 2010
• V. Ramanathan, B. Yao, and L. Fei-Fei. Social Role Discovery in Human Events. CVPR 2013