Rich representations for learning visual recognition - PowerPoint PPT Presentation

Rich representations for learning visual recognition. Jitendra Malik, University of California at Berkeley


  1. Rich representations for learning visual recognition. Jitendra Malik, University of California at Berkeley

  2. Detection can be very fast
     • On a task of judging animal vs. no animal, humans can make mostly correct saccades in 150 ms (Kirchner & Thorpe, 2006).
     • Comparable to synaptic delay in the retina, LGN, V1, V2, V4, IT pathway.
     • Doesn’t rule out feedback, but shows that feedforward only is very powerful.
     • Detection and categorization are practically simultaneous (Grill-Spector & Kanwisher, 2005).

  3. Rolls et al. (2000)

  4. Some opinions…
     • A hierarchical, mostly feedforward network is the right model; the question is how to train it.
     • Unsupervised, sparsity-encouraging techniques are promising for lower layers.
     • But so far the success of this approach at the higher stages has not been demonstrated.

  5. Insights from child development
     • Trying to learn object recognition from bounding boxes is like trying to learn language from a list of sentences.
     • The development of visual recognition, like language acquisition, benefits from supportive “scaffolding”.
     • Grouping and tracking can play an important role by helping solve the correspondence problem. In a machine vision system, we can “cheat” by supplying keypoint correspondences.

  6. Detecting and Segmenting People: Where are they? What are they wearing? What are they doing? Jitendra Malik, UC Berkeley. This is joint work with L. Bourdev, S. Maji and T. Brox.

  7. Trying to extract stick figures is hard (and unnecessary!)
     • Generalized cylinders (Marr & Nishihara, Binford)
     • Pictorial Structures (Felzenszwalb & Huttenlocher)

  8. All the wrong limbs…

  9. High-Level Computer Vision

  10. High-Level Computer Vision: Object Recognition
     • Image labels: person, van, person, dog

  11. High-Level Computer Vision: Object Recognition, Semantic Segmentation
     • Image labels: person, van, person, dog

  12. High-Level Computer Vision: Object Recognition, Semantic Segmentation, Pose Estimation
     • Image labels: “Facing the camera”; “In a back view”; “Facing back, head to the right”

  13. High-Level Computer Vision: Object Recognition, Semantic Segmentation, Pose Estimation, Action Recognition
     • Image labels: “Walking away”; “talking”

  14. High-Level Computer Vision: Object Recognition, Semantic Segmentation, Pose Estimation, Action Recognition, Attribute Classification
     • Image labels: “blue GMC van”; “Man with glasses and a coat”; “elderly white man with a baseball hat”; “Entlebucher mountain dog”

  15. High-Level Computer Vision: Object Recognition, Semantic Segmentation, Pose Estimation, Action Recognition, Attribute Classification
     • “A blue GMC van parked, in a back view”
     • “A man with glasses and a coat, facing back, walking away”
     • “An elderly man with a hat and glasses, facing the camera and talking”
     • “An Entlebucher mountain dog sitting in a bag”

  16. Person Detection is Challenging: Clothing, Occlusion, No silhouette, Accessories, Articulation, Viewpoint, Wrinkles

  17. How can we make the problem harder?
     • Solution: Severely limit the supervision

  18. The best approach in such a setup?
     • Divide-and-conquer: one global template + five parts
     • Positions and appearance of parts trained jointly (Latent SVM)
     • Mixture of models for various poses (standing, sitting, etc.)
     • Parts are not well localized and have large appearance variations
     • Figure annotations: Part 2 fires on left torso… but sometimes on ½ of the head. Part 5 fires on one leg… or both legs. Learned part location penalty.
     [Felzenszwalb et al., PAMI 2010]
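
The scoring idea on this slide (one global template plus parts, each with a learned location penalty) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the dense response grids, the shared resolution for root and parts, the search radius, and the quadratic penalty weights are all assumptions made for the example. Only the general form is taken from the deformable part model of Felzenszwalb et al.: the root response plus, for each part, the best part response near its anchor minus a displacement cost.

```python
def part_contribution(part_resp, anchor, deform, root_pos, radius=2):
    """Best part response near its anchor, minus a quadratic displacement cost.

    part_resp: 2D list of filter responses for one part (hypothetical input).
    anchor:    (ax, ay) ideal offset of the part relative to the root.
    deform:    (a, b) assumed penalty weights for dx^2 and dy^2.
    """
    ax, ay = anchor
    rx, ry = root_pos
    a, b = deform
    best = float("-inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = rx + ax + dx, ry + ay + dy
            if 0 <= y < len(part_resp) and 0 <= x < len(part_resp[0]):
                cost = a * dx * dx + b * dy * dy
                best = max(best, part_resp[y][x] - cost)
    return best


def dpm_score(root_resp, parts, root_pos, bias=0.0):
    """Root response plus the best penalized placement of every part."""
    rx, ry = root_pos
    score = root_resp[ry][rx] + bias
    for part_resp, anchor, deform in parts:
        score += part_contribution(part_resp, anchor, deform, root_pos)
    return score


# Toy example: a 5x5 root response map and a single part.
root_resp = [[0.0] * 5 for _ in range(5)]
root_resp[2][2] = 1.0
part_resp = [[0.0] * 5 for _ in range(5)]
part_resp[1][3] = 3.0  # the part fires one cell right and one cell up of its anchor
parts = [(part_resp, (0, 0), (0.5, 0.5))]

print(dpm_score(root_resp, parts, root_pos=(2, 2)))  # 3.0
```

In the toy example the part is found one cell away from its anchor, so its response of 3.0 is reduced by a penalty of 1.0 before being added to the root response. In a latent SVM the part placements are the latent variables that this inner maximization fills in.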

  19. Radical idea: What if, instead, we try to make the problem easier?
     • Keypoint labels: Nose, Right Shoulder, Left Shoulder, Right Elbow, Left Elbow
     [Bourdev and Malik, ICCV 2009]

  20. Can we build upon the success of faces and pedestrians?
     • Both do template matching
     • Capture salient and common patterns
     • Are these the only two salient & common patterns?
     • But how are we going to create the training set?
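
Template matching of the kind this slide refers to can be sketched as a sliding-window dot product between a learned template and a grid of image features. The sketch below is only an illustration under stated assumptions: the feature grid, the template weights, and the threshold are hypothetical toy values, and the slide does not say which features or classifiers the face and pedestrian detectors use. Real detectors also scan over multiple scales, which is omitted here.

```python
def sliding_window_scores(features, template, bias=0.0):
    """Score every window of the feature grid against a linear template."""
    th, tw = len(template), len(template[0])
    fh, fw = len(features), len(features[0])
    scores = []
    for y in range(fh - th + 1):
        row = []
        for x in range(fw - tw + 1):
            s = bias
            for j in range(th):
                for i in range(tw):
                    s += template[j][i] * features[y + j][x + i]
            row.append(s)
        scores.append(row)
    return scores


def detections(scores, threshold):
    """Return (x, y, score) for every window whose score exceeds the threshold."""
    return [(x, y, s)
            for y, row in enumerate(scores)
            for x, s in enumerate(row)
            if s > threshold]


# Toy example: a diagonal pattern and a template that responds to it.
features = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]]
template = [[1, -1],
            [-1, 1]]

scores = sliding_window_scores(features, template)
print(detections(scores, threshold=1.0))  # [(0, 0, 2.0), (1, 1, 2.0)]
```

A single rigid template works when the pattern is both salient and common, which is the property the slide highlights for frontal faces and pedestrians.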

  21. Agenda
     • Poselets
     • Training a poselet
     • Selecting a good set of poselets
     • Improving poselets with context
     • Detection with poselets
     • Segmentation
     • Attributes
     • Action Recognition

  22. Agenda
     • Poselets
     • Training a poselet
     • Selecting a good set of poselets
     • Improving poselets with context
     • Detection with poselets
     • Segmentation
     • Attributes
     • Action Recognition

  23. Examples of poselets
     • Patches are often far visually, but they are close semantically
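
One way to make "close semantically" concrete is to compare two patches by their annotated keypoints rather than by their pixels. The sketch below is a simplified illustration, not the distance used in the poselets work: it assumes each patch comes with a dictionary of 2D keypoints (names such as `nose` are hypothetical labels), it normalizes only for translation and scale, and it ignores rotation and keypoint visibility.

```python
import math


def normalize(kps):
    """Center keypoints on their centroid and rescale to unit RMS size."""
    n = len(kps)
    cx = sum(x for x, _ in kps.values()) / n
    cy = sum(y for _, y in kps.values()) / n
    centered = {k: (x - cx, y - cy) for k, (x, y) in kps.items()}
    scale = math.sqrt(sum(x * x + y * y for x, y in centered.values()) / n)
    if scale == 0:
        raise ValueError("keypoints must not all coincide")
    return {k: (x / scale, y / scale) for k, (x, y) in centered.items()}


def config_distance(a, b):
    """Mean squared distance between shared keypoints after normalization."""
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        raise ValueError("need at least two shared keypoints")
    na = normalize({k: a[k] for k in shared})
    nb = normalize({k: b[k] for k in shared})
    return sum((na[k][0] - nb[k][0]) ** 2 + (na[k][1] - nb[k][1]) ** 2
               for k in shared) / len(shared)


# Toy example: pose_b is pose_a shifted and doubled in size; pose_c differs.
pose_a = {"nose": (0, 0), "left_shoulder": (-1, 1), "right_shoulder": (1, 1)}
pose_b = {"nose": (10, 5), "left_shoulder": (8, 7), "right_shoulder": (12, 7)}
pose_c = {"nose": (0, 0), "left_shoulder": (-1, 1), "right_shoulder": (1, -1)}

print(config_distance(pose_a, pose_b))  # near zero: same configuration
print(config_distance(pose_a, pose_c))  # clearly larger: different configuration
```

Under such a measure, two patches of people in different clothing or lighting can be far apart in pixel space and still be near neighbors in keypoint space, which is the contrast the slide draws.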

  24. Agenda
     • Poselets
     • Training a poselet
     • Selecting a good set of poselets
     • Improving poselets with context
     • Detection with poselets
     • Segmentation
     • Attributes
     • Action Recognition

  25. How do we train a poselet for a given pose configuration?
