
Hands, objects, and videotape: recognizing object interactions from streaming wearable cameras - PowerPoint PPT Presentation



  1. Hands, objects, and videotape: recognizing object interactions from streaming wearable cameras Deva Ramanan UC Irvine

  2. Students doing the work: Greg Rogez (post-doc, looking to come back to France!), Hamed Pirsiavash (former student, now post-doc at MIT), Mohsen Hejrati (current student)

  3. Motivation 1: integrated perception and actuation

  4. Motivation 2: wearable (mobile) cameras Google Glass

  5. Outline: - Egocentric hand estimation - Data analysis: “Making tea” (analyzing big temporal data) - Functional prediction: what can the user do in a scene? (“Grab here”)

  6. Egocentric hand pose estimation. [Note from Deva: Perhaps the most relevant would be [6], but I found the description of the text hard to follow. Perhaps [8] would be the easiest to implement.] Scenarios: Easy: third-person HCI/gesture (8); Egocentric (4). Challenges: - hands have more (effective) DOFs than bodies - self-occlusion due to the egocentric viewpoint - occlusion by objects

  7. Past approaches. Skin-pixel classification: Li & Kitani, CVPR13, ICCV13. Motion segmentation: Ren & Gu, CVPR10; Fathi et al., CVPR11

  8. Observation: RGB-D saves the day. Produces accurate depth over the “near-field workspace”; mimics near-field depth from human vision (stereopsis). TOF (time-of-flight) camera

  9. Does depth solve it all? Hand detection in egocentric views PXC = Intel’s Perceptual Computing Software

  10. Our approach Make use of massive synthetic training set Mount avatar with virtual egocentric cameras Use animation library of household objects and scenes

  11. Our approach Make use of massive synthetic training set Mount avatar with virtual egocentric cameras Use animation library of household objects and scenes Naturally enforces “egocentric” priors over viewpoint, grasping poses, etc.

  12. Recognition Decision / regression trees Nearest-neighbor on volumetric depth features
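As a rough illustration of the nearest-neighbor branch, here is a minimal Python/NumPy sketch of retrieving a hand pose by matching volumetric depth features against a library of synthetic exemplars; the voxelization scheme, grid size, depth range, and exemplar library are assumptions for illustration, not the exact pipeline from the talk.

```python
# Minimal sketch, assuming a library of synthetic (depth volume, pose) exemplars.
import numpy as np

def voxelize_depth(depth_patch, grid=(16, 16, 16), z_range=(0.2, 0.8)):
    """Quantize a metric depth patch (H x W, in meters) into a binary occupancy
    grid over the near-field workspace; grid size and z-range are assumed."""
    H, W = depth_patch.shape
    gx, gy, gz = grid
    vol = np.zeros(grid, dtype=np.float32)
    z = np.clip((depth_patch - z_range[0]) / (z_range[1] - z_range[0]), 0, 1 - 1e-6)
    xs = (np.linspace(0, 1, W, endpoint=False)[None, :] * gx).astype(int)  # (1, W)
    ys = (np.linspace(0, 1, H, endpoint=False)[:, None] * gy).astype(int)  # (H, 1)
    zs = (z * gz).astype(int)                                              # (H, W)
    vol[xs, ys, zs] = 1.0        # index arrays broadcast to (H, W)
    return vol.ravel()

def nearest_neighbor_pose(query_vol, exemplar_vols, exemplar_poses):
    """Return the pose of the closest synthetic exemplar (L2 distance)."""
    d = np.linalg.norm(exemplar_vols - query_vol[None, :], axis=1)
    return exemplar_poses[int(np.argmin(d))]
```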

  13. Results Rogez et al, ECCV 14 Workshop on Consumer Depth Cameras

  14. Ablative analysis Depth & egocentric priors (over viewpoint & grasping poses) are crucial

  15. Ongoing work: hand grasp region prediction Functionally-motivated pose classes Disc grasp Dynamic lateral tripod Lumbrical grasp (Though we are finding it hard to publish in computer vision venues!)

  16. Outline: - Egocentric hand estimation - Data analysis: “Making tea” (analyzing big temporal data) - Functional prediction: what can the user do in a scene? (“Grab here”)

  17. Temporal data analysis. Challenges: - analyze large collections of temporal big data (vs. YouTube clips) - some daily activities can take a long time (interrupted) - some daily activities exhibit “internal structure” (more on this). [Timeline figure along a time axis: start boiling water → do other things (while waiting) → pour in cup → drink tea; sub-intervals: start, boiling water, wait, steep tea leaves.]

  18. Classic models for capturing temporal structure: Markov models [state diagram: Wait, Boil, Steep]

  19. Classic models for capturing temporal structure: Markov models [state diagram: Wait, Boil, Steep] ... but does this really matter? Maybe local bag-of-feature templates suffice (a “making tea” template over time). P. Smyth: “Oftentimes a strong data model will do the job”

  20. But some annoying details... How to find multiple actions of differing lengths? Can we do better than a window scan of O(NL) + heuristic NMS? (N = length of video, L = maximum temporal length)
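For concreteness, a minimal sketch of this baseline: score every window of length up to L with a per-action template, then apply heuristic 1-D non-maximum suppression. The averaged-feature scoring and the overlap threshold are assumptions for illustration, not the talk's models.

```python
# Sketch of the O(NL) window-scan baseline + heuristic NMS.
import numpy as np

def window_scan(frame_feats, template, max_len):
    """Score all segments [i, j) with j - i <= max_len as a dot product between
    the template and the length-normalized sum of per-frame features."""
    N = len(frame_feats)
    cum = np.vstack([np.zeros_like(frame_feats[0]), np.cumsum(frame_feats, axis=0)])
    dets = []
    for i in range(N):
        for j in range(i + 1, min(i + max_len, N) + 1):
            seg_feat = (cum[j] - cum[i]) / (j - i)
            dets.append((float(template @ seg_feat), i, j))
    return dets

def nms_1d(dets, overlap=0.5):
    """Greedy 1-D non-maximum suppression on (score, start, end) tuples."""
    keep = []
    for s, i, j in sorted(dets, reverse=True):
        ok = True
        for _, ki, kj in keep:
            inter = max(0, min(j, kj) - max(i, ki))
            union = (j - i) + (kj - ki) - inter
            if inter / union > overlap:
                ok = False
                break
        if ok:
            keep.append((s, i, j))
    return keep
```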

  21. Insufficiently well-known fact We can do all this for 1-D (temporal) signals with grammars “The hungry rabbit eats quickly”

  22. Application to actions. Production rules: S → SA, S → SB; A → “snatch” action rule (2 latent subactions); B → “clean and jerk” action rule (3 latent subactions); terminals include background (bg) segments and latent subactions such as press, yank, and pause. [Parse-tree figure over these rules.] Context-free grammars (CFGs): surprisingly simple to implement but poor scalability, O(N³). Our contribution: many restricted grammars (like the one above) can be parsed in O(NL). In theory & practice, no more expensive than a sliding window!

  23. Real power of CFGs: recursion. E.g., rules for generating valid sequences of parentheses, “((()())())()”: S → {}, S → (S), S → SS. If we don’t make use of this recursion, we can often make do with a simpler grammar. Regular grammar: X → uvw, X → Y uvw

  24. Intuition: compile the regular grammar into a semi-Markov model. Production rules: S → SA, S → SB; A → “snatch” action rule (2 latent subactions); B → “clean and jerk” action rule (3 latent subactions). Semi-Markov models = Markov models with “counting” states

  25. But aren’t semi-markov models already standard? Action segmentation with 2-state semi-markov model: (Shi et al IJCV10, Hoai et al CVPR11) Model subactions with 3-state semi-markov model: Tang et al CVPR12 (+ NMS?)

  26. Our work: a single model enforces temporal constraints at multiple scales (actions, sub-actions). [Parse-tree figure: S expands into action segments A and B over bg/press/yank/pause subactions.] Use production rules to implicitly manage the additional dummy / counting states used by the underlying Markov model

  27. Inference. [Parse-tree figure as on the previous slide; possible symbols at the current frame t.] With maximum segment length L: O(NL) time, O(L) storage, naturally online
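A minimal sketch of the kind of O(NL) segmental (semi-Markov) dynamic program the slide alludes to, assuming a regular-like grammar encoded as allowed symbol-to-symbol transitions and a user-supplied segment-score callback. Symbol names, the transition encoding, and the callback are placeholders, not the talk's exact formulation.

```python
# Sketch: online segmental parsing. Cost is O(N * L * |symbols|^2),
# i.e. O(NL) for a fixed grammar.
def parse_online(N, L, symbols, transitions, segment_score):
    """
    N             : number of frames
    L             : maximum segment length
    symbols       : segment labels, e.g. ['bg', 'snatch', 'clean_jerk']
    transitions   : dict (prev_symbol, symbol) -> transition score (allowed rules)
    segment_score : function (symbol, i, j) -> score of labeling frames [i, j)
    Returns (best total score, list of (i, j, symbol) segments).
    """
    NEG = float('-inf')
    # best[j][s]: best score of a parse of frames [0, j) whose last segment has label s
    best = [{s: (0.0 if j == 0 else NEG) for s in symbols} for j in range(N + 1)]
    back = [{s: None for s in symbols} for _ in range(N + 1)]
    for j in range(1, N + 1):                      # frames arrive one at a time (online)
        for s in symbols:
            for i in range(max(0, j - L), j):      # the last segment is [i, j)
                for p in symbols:
                    trans = 0.0 if i == 0 else transitions.get((p, s), NEG)
                    if best[i][p] == NEG or trans == NEG:
                        continue
                    cand = best[i][p] + trans + segment_score(s, i, j)
                    if cand > best[j][s]:
                        best[j][s] = cand
                        back[j][s] = (i, p)
    s = max(symbols, key=lambda x: best[N][x])     # backtrack from the best final label
    score, segs, j = best[N][s], [], N
    while j > 0:
        i, p = back[j][s]
        segs.append((i, j, s))
        j, s = i, p
    return score, list(reversed(segs))
```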

  28. Scoring each segment [i, j] of video data D over time: S(D, r, i, j) = α_r · φ(D, i, j) + β_r · ψ(j − i) + γ_r, where α_r is the data model, β_r is the prior over the length of the segment, and γ_r is the prior on the transition rule r = X → Y
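A small sketch of this per-segment score with linear parameters; the concrete choices of φ (averaged per-frame features) and ψ (a one-hot encoding of segment length) are assumptions made only for illustration.

```python
# Sketch of S(D, r, i, j) = alpha_r . phi(D, i, j) + beta_r . psi(j - i) + gamma_r.
import numpy as np

def segment_score(D, rule_params, i, j, max_len):
    """D: (N, F) array of per-frame features; rule_params: dict with
    'alpha' (F,), 'beta' (max_len,), 'gamma' (scalar) for a single rule r."""
    phi = D[i:j].mean(axis=0)                   # data feature of segment [i, j)
    psi = np.zeros(max_len)
    psi[min(j - i, max_len) - 1] = 1.0          # one-hot length feature
    return (rule_params['alpha'] @ phi
            + rule_params['beta'] @ psi
            + rule_params['gamma'])
```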

  29. Learning. The score is linear in the parameters (segment data model α, segment length prior β, and rule transition prior γ), so we use a structured SVM solver. Supervised: structured SVM. Weakly-supervised: latent structured SVM
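A sketch of why a (latent) structured SVM applies: since each term in the segment score is an inner product with fixed features, the total parse score is linear in the stacked parameter vector (the notation beyond the slides is assumed).

```latex
% Sketch (assumed notation): for a parse y = {(r_k, i_k, j_k)} of video D,
\[
S(D, y) \;=\; \sum_{k} \Big( \alpha_{r_k} \cdot \phi(D, i_k, j_k)
      \;+\; \beta_{r_k} \cdot \psi(j_k - i_k) \;+\; \gamma_{r_k} \Big)
      \;=\; w \cdot \Phi(D, y),
\]
% where w = (alpha, beta, gamma) stacks all rule parameters and Phi(D, y) stacks
% the corresponding summed features; linearity in w is what allows a structured
% SVM (latent when subaction labels are unobserved).
```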

  30. Results Latently inferred subactions appear to be run, release, and throw.

  31. Results Latently inferred subactions appear to be bend and jump.

  32. Baselines Action segmentation with 2-state semi-markov model: (Shi et al IJCV10, Hoai et al CVPR11) Model subactions with 3-state semi-markov model: Tang et al CVPR12 + NMS

  33. Results for action segment detection. [Figure, left: segment detection AP (0 to 0.4) vs. overlap threshold (0.1 to 0.5), comparing subaction scan-win [28], segmental actions [25], our model w/o prior, and our model; frame labeling accuracy also reported.] Pirsiavash & Ramanan, CVPR14

  34. Outline: - Egocentric hand estimation - Data analysis: “Making tea” (analyzing big temporal data) - Functional prediction: what can the user do in a scene? (“Grab here”)

  35. Object touch (interaction) codes: label object surfaces with the body parts that come in contact with them [figure labels: arms, hands, back, bum, mouth, feet]

  36. Dataset of interaction region masks [examples: monitor, bottle, chair, sofa]

  37. Alternate perspective Prediction of functional landmarks

  38. How hard is this problem? Benchmark evaluation of several standard approaches: blind regression (from bounding-box coordinates); regression from part locations; bottom-up geometric labeling of superpixels; nearest-neighbor matching + label transfer; ... Desai & Ramanan, “Predicting Functional Regions of Objects”, Beyond Semantics Workshop, CVPR13
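A minimal sketch of the nearest-neighbor + label-transfer baseline, assuming HOG descriptors from scikit-image and per-exemplar functional-region masks; this is an illustration under those assumptions, not the benchmark's exact setup.

```python
# Sketch: match a query object window to training windows by appearance and
# copy the matched exemplar's functional-region mask onto the query window.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def describe(window, size=(64, 64)):
    """Resize a grayscale object window (H x W array) and compute a HOG descriptor."""
    return hog(resize(window, size), pixels_per_cell=(8, 8))

def transfer_label(query_window, train_windows, train_masks):
    """Return the functional-region mask of the nearest training exemplar,
    resized onto the query window (a crude form of label transfer)."""
    q = describe(query_window)
    feats = np.stack([describe(w) for w in train_windows])
    nn = int(np.argmin(np.linalg.norm(feats - q, axis=1)))
    return resize(train_masks[nn].astype(float), query_window.shape) > 0.5
```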

  39. Some initial conclusions. [Bar chart: % correctly-parsed objects for bikes, chairs, bottles, sofas, and TVs.] - Difficulty varies greatly per object: harder to ride a bike than to sit on a sofa (or watch TV)! Blind prediction of bottle & TV regions works just as well as anything else. - Nearest neighbor + label transfer is the winner: simple and works annoyingly well (though considerable room for improvement)

  40. Strategic question How to build models that produce detailed 3D landmark reports for general objects?

  41. Recognition by 3D Reconstruction. Input: 2D image. Output: 3D shape, camera viewpoint

  42. Overall approach: “brute-force” enumeration of 3D hypotheses θ1, θ2, θ3, ... Enumerate hypotheses θ = (shape, viewpoint) and their rendered HOG templates w(θ); find the one that correlates best with the query image
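A minimal sketch of the enumeration step, assuming the HOG templates w(θ) have already been rendered for each hypothesis; scoring is just a correlation (dot product), matching the formalities on the later slides. The rendering and HOG extraction steps are stand-ins, not the exact pipeline.

```python
# Sketch: brute-force hypothesis enumeration by template correlation.
import numpy as np

def best_hypothesis(query_hog, hypotheses, templates):
    """
    query_hog  : flattened HOG feature of the query window, shape (F,)
    hypotheses : list of theta = (shape_coeffs, viewpoint) tuples
    templates  : array (K, F) of rendered HOG templates w(theta), one per hypothesis
    Returns (best_theta, best_score).
    """
    scores = templates @ query_hog           # S(I, theta) = w(theta) . I
    k = int(np.argmax(scores))
    return hypotheses[k], float(scores[k])
```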

  43. A model of 3D shape and viewpoint. 1) 3D shape of object = linear combination of 3D basis shapes: B = Σ_i α_i B_i. 2) Standard perspective camera model: p(θ) ∼ C(R, t, f) B, where θ = (α, R, t, f) (shape coefficients, camera rotation, translation, focal length)
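A small NumPy sketch of this shape-plus-camera model, using a plain pinhole projection as a stand-in for the camera C(R, t, f); the exact camera parameterization is an assumption.

```python
# Sketch: B(alpha) = sum_i alpha_i * B_i, projected with a pinhole camera (R, t, f).
import numpy as np

def shape_from_basis(alpha, basis):
    """alpha: (K,) coefficients; basis: (K, P, 3) basis shapes -> (P, 3) shape."""
    return np.tensordot(alpha, basis, axes=1)

def project(points3d, R, t, f):
    """Perspective projection of (P, 3) points with rotation R (3x3),
    translation t (3,), focal length f -> (P, 2) image coordinates."""
    cam = points3d @ R.T + t                 # points in camera coordinates
    return f * cam[:, :2] / cam[:, 2:3]      # divide by depth

def landmarks(theta, basis):
    """theta = (alpha, R, t, f) as on the slide: p(theta) ~ C(R, t, f) B."""
    alpha, R, t, f = theta
    return project(shape_from_basis(alpha, basis), R, t, f)
```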

  44. View- & shape-specific templates: θ1 → w(θ1), θ2 → w(θ2), θ3 → w(θ3), ..., θn. Treat each as a unique subcategory (e.g., side-view SUVs) and learn a template for it

  45. Challenge: rare shapes & views We need lots of templates, but will likely have little data of ‘rare’ car views Zhu, Anguelov, & Ramanan “Capturing long-tail distributions of object subcategories” CVPR14

  46. Long-tail distributions of categories (cf. LabelMe). [Bar chart: PASCAL 2010 training-data counts per category, from person, chair, plane, train down to boat, sofa, cow.] “Zero-shot” learning

  47. Solution: share information with parts. Use ‘wheels’ from common views/shapes to help model rare ones

  48. Some formalities. Cast recognition and reconstruction as a maximization problem over hypotheses θ1, θ2, θ3, ...: S(I, θ) = w(θ) · I, θ* = argmax_{θ ∈ Ω} S(I, θ)

  49. Templates with shared parts: S(I, θ) = Σ_{i ∈ V(θ)} w_i^{m_i(θ)} · φ(I, p_i(θ)), where V(θ) is the set of visible parts, m_i(θ) is the local mixture of part i, and p_i(θ) is the pixel location of part i (all depend on θ)
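A minimal sketch of this shared-part score, assuming the visibility set V(θ), mixtures m_i(θ), and part locations p_i(θ) have already been evaluated from θ, and that φ extracts a patch from a dense HOG-like feature map; these representational choices are assumptions, not the exact model.

```python
# Sketch: S(I, theta) = sum over visible parts i of w_i^{m_i(theta)} . phi(I, p_i(theta)).
import numpy as np

def score_with_parts(feat_map, theta_parts, part_templates):
    """
    feat_map       : (H, W, F) dense feature map of image I (e.g. HOG cells)
    theta_parts    : list of (part_id, mixture_id, (y, x)) for the visible parts,
                     i.e. V(theta), m_i(theta), p_i(theta) already evaluated
    part_templates : dict (part_id, mixture_id) -> template array (h, w, F)
    Assumes each part window lies inside the feature map.
    """
    total = 0.0
    for part_id, mix_id, (y, x) in theta_parts:
        w = part_templates[(part_id, mix_id)]
        h, wd, _ = w.shape
        patch = feat_map[y:y + h, x:x + wd]      # phi(I, p_i(theta))
        total += float(np.sum(w * patch))        # w_i^{m_i} . phi
    return total
```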
