Computer Vision by Learning: Motion in Action, Jan van Gemert, UvA


  1. Computer Vision by Learning: Motion in Action Jan van Gemert, UvA

  2. Motion and perceptual organization • Even “impoverished” motion data can evoke a strong percept

  3. Motion and perceptual organization • Even “impoverished” motion data can evoke a strong percept

  4. Uses of motion • Estimating 3D structure • Segmenting objects based on motion cues • Learning dynamical models • Improving video quality (motion stabilization) • Recognizing actions, activities, events

  5. Action Recognition Pipeline • Spatio-temporal interest point detection, then space-time patch/trajectory, then space-time descriptor • Followed by Bag-of-Words/Fisher vector and SVM • Similar setup to static image classification

  6. Measuring Motion • Lagrangian and Eulerian perspectives: fluid dynamics (figure panels: Lagrangian, Eulerian)

  7. 1. Lagrangian Perspective: Optical Flow • Track each pixel as it moves through a video

  8. Problem definition: optical flow [Lucas-Kanade, 1981] • How to estimate pixel motion from image H to image I? • Solve the pixel correspondence problem: given a pixel in H, look for nearby pixels of the same color in I • Key assumptions: – color constancy: a point in H looks the same in I (for grayscale images, this is brightness constancy) – small motion: points do not move very far • This is called the optical flow problem
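
The two key assumptions above can be turned into a small least-squares problem. The sketch below is a minimal single-window estimator in the spirit of Lucas-Kanade, not the full method: it assumes NumPy, grayscale windows `H` and `I` of equal size, and one shared displacement `(u, v)` for the whole window. The function name `lucas_kanade_window` is hypothetical.

```python
import numpy as np

def lucas_kanade_window(H, I):
    """Estimate one (u, v) displacement for a small grayscale window pair."""
    H = np.asarray(H, dtype=float)
    I = np.asarray(I, dtype=float)
    # Spatial derivatives of the first image (np.gradient returns d/dy, d/dx).
    Iy, Ix = np.gradient(H)
    # Temporal derivative approximated by the frame difference.
    It = I - H
    # Brightness constancy + small motion: Ix*u + Iy*v + It = 0 at every pixel.
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    # Least-squares solution over all pixels in the window.
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```

The estimate is only meaningful when the window contains gradients in more than one direction and the true motion is well below a pixel or two; larger motions would need a coarse-to-fine scheme, which this sketch leaves out.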

  9. Visualizing optical flow (hue is an angular color space; see color legend) • Q: What do you notice? • Camera motion • Parallax
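
Because hue is angular, flow direction maps naturally onto it. The following sketch shows one common convention, assumed here rather than taken from the slides: hue encodes direction, saturation encodes speed, and value is fixed at 1. The function name `flow_to_hsv` and the `max_mag` parameter are hypothetical.

```python
import numpy as np

def flow_to_hsv(u, v, max_mag=None):
    """Map a flow field to HSV: hue = direction, saturation = speed."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    mag = np.hypot(u, v)
    angle = np.arctan2(v, u)                      # direction in radians
    hue = (angle % (2 * np.pi)) / (2 * np.pi)     # wrap the angle onto the hue circle
    if max_mag is None:
        max_mag = mag.max() if mag.max() > 0 else 1.0
    sat = np.clip(mag / max_mag, 0.0, 1.0)        # faster pixels are more saturated
    val = np.ones_like(sat)
    return np.stack([hue, sat, val], axis=-1)
```

Static pixels come out white (zero saturation), so a constant camera pan shows up as one uniform color and parallax as a change in saturation with depth.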

  10. Optical flow and parallax • P(t) is a moving 3D point • Velocity of scene point: V = dP/dt • p(t) = (x(t), y(t)) is the projection of P in the image • Apparent velocity v in the image: given by components v_x = dx/dt and v_y = dy/dt • Length of v is inversely proportional to the depth Z of the 3D point
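
The inverse dependence on depth can be made explicit with a short derivation. It assumes a pinhole camera with focal length f, which the slide does not state, and writes P = (X, Y, Z) and V = (V_X, V_Y, V_Z):

```latex
x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}
\quad\Longrightarrow\quad
v_x = \frac{dx}{dt} = \frac{f\,V_X - x\,V_Z}{Z}, \qquad
v_y = \frac{dy}{dt} = \frac{f\,V_Y - y\,V_Z}{Z}
```

For motion parallel to the image plane (V_Z = 0) this reduces to v = f (V_X, V_Y) / Z, so two points moving with the same 3D velocity produce image velocities whose lengths scale with 1/Z. That difference is the parallax.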

  11. Optical flow and camera motion • Q: Name the camera motion • Zoom out • Zoom in • Pan right to left

  12. Motion boundaries [Dalal, eccv06] • Figure panels: video frame, spatial gradient, optical flow (quivers), optical flow (hue), horizontal motion boundaries, vertical motion boundaries, color legend • Q: What do you notice? • Motion boundaries are invariant to constant camera motion

  13. Animated Example • Figure panels: video frame, spatial gradient, optical flow (quivers), optical flow (hue), horizontal motion boundaries, vertical motion boundaries, color legend • Motion boundaries are the spatial gradients of the x and y flow images • Q: What do you notice? • Similar properties as the spatial gradient • No motion: motion boundaries disappear • Parallax
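
Since motion boundaries are spatial gradients of the two flow images, they take only a few lines to compute. This is a minimal sketch assuming NumPy and a dense flow field given as two 2D arrays `u` (x flow) and `v` (y flow); the function name is hypothetical.

```python
import numpy as np

def motion_boundaries(u, v):
    """Spatial gradients of the horizontal (u) and vertical (v) flow images."""
    # np.gradient on a 2D array returns the derivative along y first, then x.
    du_dy, du_dx = np.gradient(np.asarray(u, dtype=float))
    dv_dy, dv_dx = np.gradient(np.asarray(v, dtype=float))
    return (du_dx, du_dy), (dv_dx, dv_dy)
```

Adding the same constant to every flow vector leaves these gradients unchanged, which is the invariance to constant camera motion, and a zero flow field gives zero boundaries.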

  14. Modeling camera motion [Wang, iccv13] [Jain, cvpr13] • Remove global motion (assumes homography): 1. Globally align frames 2. Optical flow on the aligned frames • Subtract background motion: assume the background is where the human is not (human detector) • Figure panels: video frame, optical flow, subtracted camera motion
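
The "background is where the human is not" idea can be illustrated with a deliberately simplified model. The sketch below swaps the homography for a single constant background flow vector, which is an assumption made to keep the example short, and takes a hypothetical boolean `human_mask` that would come from a human detector.

```python
import numpy as np

def subtract_background_motion(u, v, human_mask):
    """Remove a constant background flow estimated outside the human mask."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    background = ~np.asarray(human_mask, dtype=bool)
    # The median is robust to a few independently moving background pixels.
    bg_u = np.median(u[background])
    bg_v = np.median(v[background])
    return u - bg_u, v - bg_v
```

A constant model only captures pure translation of the image; zoom, rotation, and perspective effects are the reason the slide's approach estimates a homography instead.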

  15. Flow trajectory descriptors [Wang, ijcv13]

  16. 2. Eulerian Perspective: stationary • Treat each pixel as a time series through a video (axes: x, y, time)

  17. Spatio-Temporal Interest Points (STIP) [Laptev, ijcv03] • Spatio-temporal Harris corners

  18. Spatio-Temporal Blobs (hes-STIP) [Willems, eccv08] • Spatio-temporal extension of the Hessian blob detector • Strength S of the interest point is computed from the determinant of the Hessian matrix H • Approximations with integral videos
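
Integral videos make box-filter approximations cheap: once the cumulative sums are stored, the sum over any space-time box costs eight lookups regardless of its size. The sketch below shows only that building block, assuming NumPy and a video array indexed as `video[t, y, x]`; it does not implement the Hessian approximation itself, and both function names are hypothetical.

```python
import numpy as np

def integral_video(video):
    """Cumulative sum over t, y, x with a zero border for easy lookups."""
    ii = np.asarray(video, dtype=float).cumsum(0).cumsum(1).cumsum(2)
    # Pad one zero plane in front of every axis so index 0 means "empty sum".
    return np.pad(ii, ((1, 0), (1, 0), (1, 0)))

def box_sum(ii, t0, t1, y0, y1, x0, x1):
    """Sum of video[t0:t1, y0:y1, x0:x1] using eight lookups."""
    # 3D inclusion-exclusion over the eight corners of the box.
    return (ii[t1, y1, x1]
            - ii[t0, y1, x1] - ii[t1, y0, x1] - ii[t1, y1, x0]
            + ii[t0, y0, x1] + ii[t0, y1, x0] + ii[t1, y0, x0]
            - ii[t0, y0, x0])
```

Second-derivative box filters can then be assembled as weighted combinations of a few such box sums.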

  19. Periodic Interest Points (Cuboids) [Dollar, vspets05] • Beyond spatio-temporal corners • 2D Gaussian smoothing kernel • 1D Gabor filters applied temporally
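
The slide's own equations did not survive, so the block below is a reconstruction of the form in which the cuboid detector's response is commonly stated, not a copy of the slide. The symbols are assumptions: R is the response, I the video, g a 2D spatial Gaussian, and h_ev, h_od a quadrature pair of 1D temporal Gabor filters with time scale τ and frequency ω.

```latex
R = (I * g * h_{ev})^2 + (I * g * h_{od})^2,
\qquad
h_{ev}(t;\tau,\omega) = -\cos(2\pi t\omega)\, e^{-t^2/\tau^2},
\qquad
h_{od}(t;\tau,\omega) = -\sin(2\pi t\omega)\, e^{-t^2/\tau^2}
```

Squaring and summing the even and odd filter outputs gives a phase-independent energy, so the response is high wherever intensity varies periodically over time rather than only at corner-like structures.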

  20. Dense Sampling • Motivation: dense sampling outperforms interest points for object recognition • Extract 3D cubes at regular positions (x, y, t) with varying scales
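
Dense sampling amounts to enumerating cuboids on a regular space-time grid. The sketch below is one possible enumeration; the base size, temporal length, scale factors, and 50% overlap are illustrative assumptions and not values from the slides.

```python
from itertools import product

def dense_cuboids(width, height, frames, size=16, length=8,
                  scales=(1, 2), overlap=0.5):
    """Yield (x, y, t, w, h, l) cuboids on a regular grid at several scales."""
    for s in scales:
        w, h, l = size * s, size * s, length * s
        # Step between neighbouring cuboids, derived from the overlap fraction.
        step_xy = max(1, int(w * (1 - overlap)))
        step_t = max(1, int(l * (1 - overlap)))
        for x, y, t in product(range(0, width - w + 1, step_xy),
                               range(0, height - h + 1, step_xy),
                               range(0, frames - l + 1, step_t)):
            yield x, y, t, w, h, l
```

For example, `list(dense_cuboids(32, 32, 16))` enumerates every cuboid that fits inside a 32x32 video of 16 frames; a descriptor would then be computed on each one.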

  21. Spatio-Temporal Gradient Descriptor (HOG3D) [Kläser, bmvc08] • HOG3D: quantization of the 3D gradient • 2D HOG/SIFT: polygon • 3D HOG/SIFT: polyhedrons • Extensions to color [Everts, cvpr13]: concatenation, integration (tensors)

  22. 3. Action Recognition • Automatically recognizing actions, activities, events • Learn from training data • Apply to unseen test data

  23. Video Coding • Create feature vocabulary – K-means (BOW) – GMM (Fisher) • Assign features to vocabulary – Hard assignment (BOW) – Vector differences (Fisher) • Aggregate over whole video – Spatio-Temporal Pyramid • Classifier (SVM)
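
The two assignment strategies can be contrasted in a few lines. The sketch below assumes NumPy and a vocabulary that has already been learned. `bow_histogram` is plain hard assignment; `difference_encoding` is a simplified stand-in for the Fisher vector that keeps only hard-assigned first-order differences and omits the GMM posteriors and variance terms of the real encoding. All function names are hypothetical.

```python
import numpy as np

def nearest_word(features, vocabulary):
    """Index of the closest vocabulary entry for every feature."""
    d2 = ((features[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def bow_histogram(features, vocabulary):
    """Hard assignment: L1-normalised count of features per visual word."""
    idx = nearest_word(features, vocabulary)
    hist = np.bincount(idx, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

def difference_encoding(features, vocabulary):
    """Sum of (feature - word) differences per word, concatenated."""
    idx = nearest_word(features, vocabulary)
    enc = np.zeros(vocabulary.shape, dtype=float)
    # Accumulate each feature's difference into the row of its nearest word.
    np.add.at(enc, idx, features - vocabulary[idx])
    return enc.ravel()
```

The histogram has one entry per word, while the difference encoding has one entry per word and feature dimension, which is why difference-based encodings can work with much smaller vocabularies.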

  24. Detectors and Descriptors • Pipeline: interest point detection, space-time patch/trajectory, space-time descriptor • Detectors: Dense, Harris3D STIP, Hessian STIP, Cuboids, KLTtraj, DenseTraj • Descriptors: HOG3D, HOG, HOF, MBH • Modeling the camera: yes/no

  25. Action Recognition Datasets • Hollywood2, 12 classes, 1,707 vids: movie actions • UCF50, 50 classes, 6,618 vids: sports, daily exercises • HMDB51, 51 classes, 6,766 vids: body motion, facial expressions, human interactions

  26. Results Hollywood2 • Columns: ref, Detector, HOG3D, HOG, HOF, MBH • Callout: Motion is important
      [wang,bmvc09] Harris3D STIP 43,7 32,8 43,3
      [wang,bmvc09] Cuboids 45,7 39,4 42,9
      [wang,bmvc09] Hessian STIP 41,3 36,2 43
      [wang,bmvc09] Dense 45,3 39,4 45,5
      [klaser,bmvc08] Cuboids 48,6 38,2 43,8

  27. Results Hollywood2 • Columns: ref, Detector, HOG3D, HOG, HOF, MBH • Callouts: Motion is important, Camera motion invariance, Dense Traject
      [wang,bmvc09] Harris3D STIP 43,7 32,8 43,3
      [wang,bmvc09] Cuboids 45,7 39,4 42,9
      [wang,bmvc09] Hessian STIP 41,3 36,2 43
      [wang,bmvc09] Dense 45,3 39,4 45,5
      [klaser,bmvc08] Cuboids 48,6 38,2 43,8
      [wang,ijcv13] Dense 43,3 48 52,1
      [wang,ijcv13] KLTtraj 41 48,4 48,6
      [wang,ijcv13] DenseTraj 41,2 50,3 55,1
      [wang,ijcv13] Harris3D STIP 40,4 44,9
      [jain,cvpr13] DenseTrajCam 45,6 54,1 54,2

  28. Results Hollywood2 • Columns: ref, Detector, HOG3D, HOG, HOF, MBH • Callouts: Motion is important, Camera motion invariance, Dense Traject, Fisher
      [wang,bmvc09] Harris3D STIP 43,7 32,8 43,3
      [wang,bmvc09] Cuboids 45,7 39,4 42,9
      [wang,bmvc09] Hessian STIP 41,3 36,2 43
      [wang,bmvc09] Dense 45,3 39,4 45,5
      [klaser,bmvc08] Cuboids 48,6 38,2 43,8
      [wang,ijcv13] Dense 43,3 48 52,1
      [wang,ijcv13] KLTtraj 41 48,4 48,6
      [wang,ijcv13] DenseTraj 41,2 50,3 55,1
      [wang,ijcv13] Harris3D STIP 40,4 44,9
      [jain,cvpr13] DenseTrajCam 45,6 54,1 54,2
      [oneata,iccv13] DenseTrajFisher 42,5 61,9
      [wang,iccv13] DenseTrajFisher 46,9 51,4 57,4
      [wang,iccv13] DenseTrajFisherCam 47,1 58,8 60,5

  29. Results UCF50 and HMDB51 • Columns: ref, Detector, HOG3D, HOG, HOF, MBH • Callouts: Dense, Camera motion invariance, Fisher
      UCF50:
      [everts,cvpr13] Cuboids 68,3
      [everts,cvpr13] CuboidsColor 72,9
      [wang,ijcv13] Dense 64,4 65,9 78,3
      [wang,ijcv13] KLTtraj 57,4 57,9 71,1
      [wang,ijcv13] DenseTraj 68 68,2 82,2
      [shi,cvpr13] Dense10k 72,4 58,6 69,7 80,1
      [oneata,iccv13] DenseTrajFisher 76,3 87,8
      [wang,iccv13] DenseTrajFisher 81,8 74,3 86,5
      [wang,iccv13] DenseTrajFisherCam 82,6 85,1 88,9
      HMDB51:
      [wang,ijcv13] Dense 25,2 29,4 40,9
      [wang,ijcv13] KLTtraj 22,2 23,7 33,7
      [wang,ijcv13] DenseTraj 27,9 31,5 43,2
      [shi,cvpr13] Dense10k 34,7 21 33,5 43
      [oneata,iccv13] DenseTrajFisher 34,8 51,9
      [jain,cvpr13] DenseTrajCam 29,1 38,6 40,9
      [wang,iccv13] DenseTrajFisher 38,4 39,5 49,1
      [wang,iccv13] DenseTrajFisherCam 40,2 48,9 52,1

  30. Reflection • Detector: dense (trajectories) • Descriptor: camera motion invariance – MBH descriptor • Fisher Vector • Ignored: – Combinatorics of HOG+HOF+MBH (muddies the analysis) – Human pose modeling literature – Deep learning still performs below the state of the art • Gaps: – Fisher on HOG3D? – Camera motion invariance for Eulerian methods? – Parallax?

  31. 4. Action Localisation • Goal: finding actions in videos: where, when and what is happening (tube) • Challenges: exponential search space, occlusion, motion, non-rigid deformations • Applications: video indexing, security, sport statistics, animal monitoring, elderly safety, marketing research

  32. Inspired by Object Localization in Static Images • Sliding Window: image [Rowley, pami98], video [Rodriguez, cvpr08] • Boosting Cascade: image [ViolaJones, ijcv04], video [Ke, iccv05] • Branch and Bound: image [Lampert, pami09], video [Yuan, pami11] • Deformable Parts: image [Felzenszwalb, pami10], video [Tian, cvpr13]

  33. Selective Search for Static Image Object Localization [Uijlings, ijcv13] • Object hypotheses based on hierarchical grouping of super-pixels • High recall with a modest number of object hypotheses • Train an expensive classifier for a single hypothesis • Q: How would you extend it to video?

  34. Selective Search for Action Localisation in Video [Jain, cvpr14] • Super-voxel video segmentation with high boundary recall [Xu, cvpr12] • Tubelet hypothesis generation by merging independent cues • Tubelet classification based on MBH and SVM
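
Hypothesis generation by hierarchical grouping can be sketched as greedy merging of adjacent super-voxels. The code below is a toy version under strong assumptions: each super-voxel is described by a single normalised histogram (one cue, rather than several independent cues), similarity is histogram intersection, and all super-voxels are treated as equally sized when averaging. The function and argument names are hypothetical.

```python
import numpy as np

def hierarchical_grouping(histograms, neighbours):
    """Greedily merge the most similar adjacent regions; return all regions seen."""
    regions = {i: ({i}, np.asarray(h, dtype=float)) for i, h in enumerate(histograms)}
    adj = {i: set() for i in regions}
    for a, b in neighbours:
        adj[a].add(b)
        adj[b].add(a)
    hypotheses = [frozenset(members) for members, _ in regions.values()]
    next_id = len(regions)
    while any(adj.values()):
        # Pick the adjacent pair with the largest histogram intersection.
        a, b = max(((a, b) for a in adj for b in adj[a] if a < b),
                   key=lambda p: np.minimum(regions[p[0]][1], regions[p[1]][1]).sum())
        (ma, ha), (mb, hb) = regions[a], regions[b]
        members = ma | mb
        hist = (len(ma) * ha + len(mb) * hb) / len(members)
        new_adj = (adj[a] | adj[b]) - {a, b}
        # Remove the two merged regions and rewire their neighbours.
        for r in (a, b):
            for n in adj.pop(r):
                adj[n].discard(r)
            del regions[r]
        regions[next_id] = (members, hist)
        adj[next_id] = new_adj
        for n in new_adj:
            adj[n].add(next_id)
        hypotheses.append(frozenset(members))
        next_id += 1
    return hypotheses
```

Every intermediate region is kept as a hypothesis, so n connected super-voxels yield 2n - 1 candidate tubelets instead of an exponential number of sub-volumes; each candidate would then be scored by the classifier.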

  35. Example of Super-Voxel Segmentation

  36. Example of Super-Voxel Segmentation
