the three r s of vision
play

The Three R s of Vision Recognition Reconstruction Reorganization - PowerPoint PPT Presentation

The Three R s of Vision Recognition Reconstruction Reorganization Jitendra Malik UC Berkeley Recognition, Reconstruction & Reorganization Recognition Reconstruction Reorganization Fifty years of computer vision 1963-2013


  1. The Three R ’ s of Vision Recognition Reconstruction Reorganization Jitendra Malik UC Berkeley

  2. Recognition, Reconstruction & Reorganization Recognition Reconstruction Reorganization

  3. Fifty years of computer vision 1963-2013 • 1960s: Beginnings in artificial intelligence, image processing and pattern recognition • 1970s: Foundational work on image formation: Horn, Koenderink, Longuet- Higgins … • 1980s: Vision as applied mathematics: geometry, multi-scale analysis, probabilistic modeling, control theory, optimization • 1990s: Geometric analysis largely completed, vision meets graphics, statistical learning approaches resurface • 2000s: Significant advances in visual recognition, range of practical applications

  4. Different aspects of vision • Perception: study the “ laws of seeing ” -predict what a human would perceive in an image. • Neuroscience: understand the mechanisms in the retina and the brain • Function: how laws of optics, and the statistics of the world we live in, make certain interpretations of an image more likely to be valid The match between human and computer vision is strongest at the level of function, but since typically the results of computer vision are meant to be conveyed to humans makes it useful to be consistent with human perception. Neuroscience is a source of ideas but being bio-mimetic is not a requirement.

  5. The Three R ’ s of Vision Recognition Reconstruction Reorganization

  6. The Three R ’ s of Vision Recognition Reconstruction Reorganization Each of the 6 directed arcs in this diagram is a useful direction of information flow

  7. Review • Reconstruction – Feature matching + multiple view geometry has led to city scale point cloud reconstructions • Recognition – 2D problems such as handwriting recognition, face detection successfully fielded in applications. – Partial progress on 3d object category recognition • Reorganization – Progress on bottom-up segmentation hitting diminishing returns – Semantic segmentation is the key problem now

  8. Image-based Modeling • Façade (1996) Debevec, Taylor & Malik – Acquire photographs – Recover geometry (explicit or implicit) – Texture map

  9. Campus Model of UC Berkeley Campanile + 40 Buildings (Debevec et al, 1997)

  10. Arc de Triomphe

  11. The Taj Mahal Taj Mahal modeled from one photograph by G. Borshukov

  12. State of the Art in Reconstruction • Multiple photographs • Range Sensors Kinect (PrimeSense) Velodyne Lidar Agarwal et al (2010) Frahm et al, (2010) Semantic Segmentation is needed to make this more useful…

  13. Shape, Albedo, and Illumination from Shading Jonathan Barron Jitendra Malik UC Berkeley

  14. Forward Optics Far Near shape / depth

  15. Forward Optics Far Near shape / depth illumination

  16. Forward Optics Far Near shape / depth illumination log-shading image of Z and L

  17. Forward Optics Far Near shape / depth illumination log-shading image of Z and L log-albedo / log-reflectance

  18. Forward Optics Far Near shape / depth illumination log-shading image of Z and L log-albedo / log-reflectance Lambertian reflectance in log-intensity

  19. Shape, Albedo, and Illumination from Shading SAIFS ( “ safes ” ) Far ? ? ? Near shape / depth illumination log-shading image of Z and L ? log-albedo / log-reflectance Lambertian reflectance in log-intensity

  20. Problem Formulation : Known Lighting ? ? ? “ Find the most likely explanation (shape Z and log-albedo A) that together exactly reconstructs log-image I , given rendering engine S () and known illumination L . ”

  21. Demo!

  22. What do we know about reflectance ? 1) Piecewise smooth (variation is small and sparse) 2) Palette is small (distribution is low-entropy) 3) Some colors are common (maximize likelihood under density model)

  23. Reflectance : Absolute Color

  24. What do we know about shapes ? 1) Piecewise smooth (variation in mean curvature is small and sparse) 2) Face outward at the occluding contour 3) Tend to be fronto-parallel (slant tends to be small)

  25. Evaluation : Real World Images

  26. Evaluation : Real World Images

  27. Recognition helps reconstruction Blanz & Vetter (1999) Geometric Context (Hoiem, Efros, Hebert) for outdoor scenes; recent work on rooms (CMU, UIUC) is another example

  28. The Three R ’ s of Vision Recognition Reconstruction Reorganization

  29. Caltech-101 [Fei-Fei et al. 04] • 102 classes, 31-300 images/class

  30. Caltech 101 classification results (even better by combining cues..)

  31. Texton Histogram Model for Recognition (Leung & Malik, 1999) cf. Bag of Words Rough Plastic Pebbles Plaster-b Terrycloth ICCV '99, Corfu, Greece

  32. Lazebnik, Schmid & Ponce (2006) They proposed using vector-quantized SIFT descriptors as “ words ”

  33. PASCAL Visual Object Challenge (Everingham et al)

  34. A good building block is a linear SVM trained on HOG features (Dalal & Triggs)

  35. AP=0.23

  36. Problems with current recognition approaches • Performance is quite poor compared to that at 2d recognition tasks and the needs of many applications. • Pose Estimation / Localization of parts or keypoints is even worse. We can ’ t isolate decent stick figures from radiance images, making use of depth data necessary. • Progress has slowed down. Variations of HOG/Deformable part models dominate.

  37. PCA Results on APs of 20 VOC classes 1 BROOKES(11) UoCTTI(10) 0.8 UoCTTI(09) UoCTTI(11) 0.6 MISSOURI(11) 0.4 UMNECUIUC(10) NLPR(10) NLPR(11) Principal Component 2 NUS(10) 0.2 NUS(11) UCLA(10) 0 CORNELL(11) OXFORD(11) OXFORD(09) -0.2 UCLA(11) BONN_FGT(10) -0.4 BONN_SVR(10) NECUIUC(09) -0.6 UVA(11) -0.8 UVA(10) FGBG(10) -1 -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 Principal Component 1

  38. Next steps in recognition • Richer features than SIFT/HOG (deep learning ?) • Incorporate the “ shape bias ” known from child development literature to improve generalization – This requires monocular computation of shape, as once posited in the 2.5D sketch, and distinguishing albedo and illumination changes from geometric contours • Top down templates should predict keypoint locations and image support, not just information about category • Recognition and figure-ground inference need to co- evolve. Occlusion is signal, not noise.

  39. High-Level Computer Vision

  40. High-Level Computer Vision Object Recognition person van person dog

  41. High-Level Computer Vision Object Recognition person van Semantic Segmentation person dog

  42. High-Level Computer Vision Object Recognition Facing the camera Semantic Segmentation Pose Estimation In a back view Facing back, head to the right

  43. High-Level Computer Vision Walking away Object Recognition talking Semantic Segmentation Pose Estimation Action Recognition

  44. High-Level Computer Vision Object Recognition blue GMC van Semantic Segmentation Pose Estimation Action Recognition Man with elderly white glasses and a Attribute Classification man with a coat baseball hat Entlebucher mountain dog

  45. High-Level Computer Vision “ A blue GMC van Object Recognition parked, in a back view ” Semantic Segmentation Pose Estimation Action Recognition “ A man with glasses Attribute Classification “ An elderly man with a and a coat, facing back, hat and glasses, facing walking away ” the camera and talking ” “ An entlebucher mountain dog sitting in a bag ”

  46. Trying to extract stick figures is hard (and unnecessary!) Generalized cylinders (Marr & Nishihara, Binford) Pictorial Structures (Felszenswalb & Huttenlocher)

  47. All the wrong limbs…

  48. Motivation

  49. Face Detection Carnegie Mellon University various images submitted to the CMU on-lin http://www.vasc.ri.cmu.edu/cgi-bin/demos/findface.cgi

  50. Examples of poselets (Bourdev & Malik , 2009) Patches are often far visually , but they are close semantically

  51. How do we train a poselet for a given pose configuration?

  52. Finding Correspondences Given part of a human How do we find a similar pose pose configuration in the training set?

  53. Finding Correspondences Left Shoulder Left Hip We use keypoints to annotate the joints, eyes, nose, etc. of people

  54. Finding Correspondences Residual Error

  55. Training poselet classifiers Residual 0.15 0.20 0.10 0.85 0.15 0.35 Error: 1. Given a seed patch 2. Find the closest patch for every other person 3. Sort them by residual error 4. Threshold them

  56. Male or female?

  57. How do we train attribute classifiers “ in the wild ” ?  Effective prediction requires inferring the pose and camera view  Pose reconstruction is itself a hard problem, but we don ’ t need perfect solution.  We train attribute classifiers for each poselet  Poselets implicitly decompose the pose

  58. Gender classifier per poselet is much easier to train

  59. Is male

  60. Has long hair

  61. Wears a hat

  62. Wears glasses

  63. Wears long pants

  64. Wears long sleeves

Recommend


More recommend