Rich representations for learning visual recognition
Jitendra Malik
University of California at Berkeley
Detection can be very fast
• On a task of judging animal vs. no animal, humans can make mostly correct saccades in 150 ms (Kirchner & Thorpe, 2006)
• Comparable to synaptic delay in the retina, LGN, V1, V2, V4, IT pathway.
• Doesn’t rule out feedback, but shows that feedforward-only processing is very powerful
• Detection and categorization are practically simultaneous (Grill-Spector & Kanwisher, 2005)
Rolls et al. (2000)
Some opinions…
• A hierarchical, mostly feedforward network is the right model; the question is how to train it
• Unsupervised, sparsity-encouraging techniques are promising for lower layers
• But so far the success of this approach at the higher stages has not yet been demonstrated
Insights from child development
• Trying to learn object recognition from bounding boxes is like trying to learn language from a list of sentences.
• The development of visual recognition, like language acquisition, benefits from supportive “scaffolding”
• Grouping and tracking can play an important role by helping solve the correspondence problem. In a machine vision system, we can “cheat” by supplying keypoint correspondences
Detecting and Segmenting People
Where are they? What are they wearing? What are they doing?
Jitendra Malik, UC Berkeley
This is joint work with L. Bourdev, S. Maji and T. Brox.
Trying to extract stick figures is hard (and unnecessary!)
• Generalized cylinders (Marr & Nishihara, Binford)
• Pictorial Structures (Felzenszwalb & Huttenlocher)
All the wrong limbs…
High-Level Computer Vision
High-Level Computer Vision
• Object Recognition: person, person, van, dog
High-Level Computer Vision
• Object Recognition: person, person, van, dog
• Semantic Segmentation
High-Level Computer Vision
• Object Recognition
• Semantic Segmentation
• Pose Estimation: “Facing the camera”; “In a back view”; “Facing back, head to the right”
High-Level Computer Vision
• Object Recognition
• Semantic Segmentation
• Pose Estimation
• Action Recognition: “Walking away”; “talking”
High-Level Computer Vision
• Object Recognition
• Semantic Segmentation
• Pose Estimation
• Action Recognition
• Attribute Classification: “blue GMC van”; “Man with glasses and a coat”; “elderly white man with a baseball hat”; “Entlebucher mountain dog”
High-Level Computer Vision
• Object Recognition
• Semantic Segmentation
• Pose Estimation
• Action Recognition
• Attribute Classification
Combined descriptions:
• “A blue GMC van parked, in a back view”
• “A man with glasses and a coat, facing back, walking away”
• “An elderly man with a hat and glasses, facing the camera and talking”
• “An Entlebucher mountain dog sitting in a bag”
Person Detection is Challenging
• Clothing
• Occlusion
• No silhouette
• Accessories
• Articulation
• Viewpoint
• Wrinkles
How can we make the problem harder?
Solution: Severely limit the supervision
The best approach in such a setup?
• Divide-and-conquer: One global template + five parts
• Positions and appearance of parts trained jointly (Latent SVM)
• Mixture of models for various poses (standing, sitting, etc.)
• Parts are not well localized and have large appearance variations
Figure annotations: learned part location penalty; Part 2 fires on left torso …but sometimes on ½ of the head; Part 5 fires on one leg… …or both legs
[Felzenszwalb et al. PAMI 2010]
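The "global template + parts with a location penalty" idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation of Felzenszwalb et al.: the function name `dpm_score`, the quadratic form of the penalty, and the toy response map are all made up for the example.

```python
def dpm_score(root_score, part_maps, anchors, penalties):
    """Score one placement of the global template.

    root_score : response of the global template at this placement
    part_maps  : part_maps[i][y][x] is the response of part i at (x, y)
    anchors    : (ax, ay) ideal position of each part
    penalties  : (wx, wy) weights of an assumed quadratic location penalty
    Returns the total score and the chosen (x, y) for each part.
    """
    total = root_score
    placements = []
    for resp, (ax, ay), (wx, wy) in zip(part_maps, anchors, penalties):
        best, best_pos = float("-inf"), None
        for y, row in enumerate(resp):
            for x, r in enumerate(row):
                # Part response minus the cost of drifting from the anchor.
                s = r - wx * (x - ax) ** 2 - wy * (y - ay) ** 2
                if s > best:
                    best, best_pos = s, (x, y)
        total += best
        placements.append(best_pos)
    return total, placements


# Toy example: one part whose strongest response is one pixel off its anchor.
part_map = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0],
]
score, where = dpm_score(2.0, [part_map], [(1, 1)], [(0.5, 0.5)])
print(score, where)
```

In the toy example the part settles one pixel away from its anchor because the stronger response there outweighs the penalty, which is the same freedom that lets a part fire on "one leg… or both legs".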
Radical idea: What if, instead, we try to make the problem easier?
Annotated keypoints: Nose, Right Shoulder, Left Shoulder, Right Elbow, Left Elbow
[Bourdev and Malik, ICCV 2009]
Can we build upon the success of faces and pedestrians?
• Both do template matching
• Capture salient and common patterns
• Are these the only two salient & common patterns?
• But how are we going to create the training set?
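As a reminder of what "template matching" means operationally, here is a minimal sliding-window sketch. It is an assumption-laden toy: the function name `match_template` is hypothetical, the score is a plain dot product, and it runs on raw values, whereas practical face and pedestrian detectors typically score learned templates over image features.

```python
def match_template(image, template):
    """Slide `template` over `image` and return (best_score, (x, y)).

    Both arguments are 2D lists of numbers; the score at each window
    is the dot product between the window and the template.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best, best_pos = float("-inf"), None
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            s = sum(
                image[y + dy][x + dx] * template[dy][dx]
                for dy in range(th)
                for dx in range(tw)
            )
            if s > best:
                best, best_pos = s, (x, y)
    return best, best_pos


image = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]
template = [[1, 2], [3, 4]]
best_score, best_xy = match_template(image, template)
print(best_score, best_xy)
```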
Agenda
• Poselets
• Training a poselet
• Selecting a good set of poselets
• Improving poselets with context
• Detection with poselets
• Segmentation
• Attributes
• Action Recognition
Examples of poselets
Patches are often far visually, but they are close semantically
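One way to make "close semantically" concrete is to compare the annotated keypoints of two patches rather than their pixels. The sketch below assumes a simple translation- and scale-normalised squared distance over 2D keypoints; the function names and the distance itself are illustrative assumptions, since the slides do not specify the measure actually used for poselets.

```python
import math


def normalise(kps):
    """Centre keypoints at their mean and scale them to unit RMS radius."""
    n = len(kps)
    cx = sum(x for x, _ in kps.values()) / n
    cy = sum(y for _, y in kps.values()) / n
    centred = {k: (x - cx, y - cy) for k, (x, y) in kps.items()}
    scale = math.sqrt(sum(x * x + y * y for x, y in centred.values()) / n)
    if scale == 0:
        raise ValueError("keypoints coincide; cannot normalise scale")
    return {k: (x / scale, y / scale) for k, (x, y) in centred.items()}


def config_distance(a, b):
    """Mean squared distance between keypoints shared by two configurations."""
    common = sorted(set(a) & set(b))
    na = normalise({k: a[k] for k in common})
    nb = normalise({k: b[k] for k in common})
    return sum(
        (na[k][0] - nb[k][0]) ** 2 + (na[k][1] - nb[k][1]) ** 2 for k in common
    ) / len(common)


# Same pose at a different image position and scale: distance close to zero.
a = {"nose": (10, 0), "left_shoulder": (14, 6), "right_shoulder": (6, 6)}
b = {"nose": (120, 50), "left_shoulder": (128, 62), "right_shoulder": (112, 62)}
# A different arrangement of the same keypoints: clearly larger distance.
c = {"nose": (10, 0), "left_shoulder": (12, 6), "right_shoulder": (8, 12)}
print(config_distance(a, b), config_distance(a, c))
```

Two patches with very different clothing or lighting can still score as near-identical under a keypoint-based measure like this, which is the sense in which they are far visually but close semantically.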
How do we train a poselet for a given pose configuration?