
Object Recognition. Computer Vision, Fall 2018, Columbia University



  1. Object Recognition. Computer Vision, Fall 2018, Columbia University

  2. The Big Picture: low-level, mid-level, high-level vision (after David Marr)

  3. Discussion: 1) What does it mean to understand this picture? 2) How to make software understand this picture?

  4. Classification: Is there a dog in this image?

  5. Detection: Where are the people?

  6. Segmentation: Where, exactly, are the people?

  7. Attributes: What properties do objects have? E.g., 45° rotation, sideways; soft vs. hard; plastic, furry.

  8. Actions: What are they doing? E.g., sitting, playing, sleeping.

  9. How many visual object categories are there? Biederman 1987

  10. Rapid scene categorization. People can distinguish high-level concepts (animal vs. transport) in under 150 ms (Thorpe). This appears to suggest that feed-forward computation suffices (or at least dominates).

  11. What do we perceive in a glance of a real-world scene? Journal of Vision (2007) 7(1):10, 1–29. http://journalofvision.org/7/1/10/

  12. Should language be the right output?

  13. Object recognition: is it really so hard? “This is a chair.” Find the chair in this image; the output of normalized correlation is shown.

  14. Object recognition: is it really so hard? Find the chair in this image. The result is pretty much garbage: simple template matching is not going to make it.
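
The failure the slide illustrates is easy to reproduce. Below is a minimal sketch of the normalized-correlation baseline using OpenCV; the filenames are placeholders.

```python
# A minimal sketch of template matching with normalized correlation,
# the baseline the slide shows failing. Filenames are placeholders.
import cv2

scene = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)     # image to search
template = cv2.imread("chair.jpg", cv2.IMREAD_GRAYSCALE)  # the "this is a chair" patch

# Slide the template over every position; each score is the normalized
# cross-correlation between the template and the underlying window.
response = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)

# Best match location (top-left corner of the template).
_, max_val, _, max_loc = cv2.minMaxLoc(response)
print(f"best score {max_val:.2f} at {max_loc}")
# Any change in viewpoint, scale, or chair instance drives the score toward
# noise, which is exactly the "pretty much garbage" output on the slide.
```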

  15. Challenges: viewpoint variation. Michelangelo, 1475–1564.

  16. Challenges: illumination.

  17. Challenges: scale.

  18. Challenges: background clutter. Kilmeny Niland, 1995.

  19. Within-class variations Svetlana Lazebnik

  20. Supervised Visual Recognition

  21. Can we define a canonical list of objects, attributes, actions, materials…? ImageNet (cf. WordNet, VerbNet, FrameNet, …)

  22. Crowdsourcing

  23. The value of data. Amazon Mechanical Turk: $10^2 to $10^4. The Large Hadron Collider: $10^10.

  24. Mechanical Turk • von Kempelen, 1770. • Robotic chess player. • Clockwork routines. • Magnetic induction (not vision) • Toured the world; played Napoleon Bonaparte and Benjamin Franklin.

  25. Mechanical Turk • It was all a ruse! • Ho ho ho.

  26. Amazon Mechanical Turk Artificial artificial intelligence. Launched 2005. Small tasks, small pay. Used extensively in data collection. Image: Gizmodo

  27. Beware of the human in your loop. • What do you know about them? • Will they do the work you pay for? Let’s look at a few simple experiments.

  28. Workers are given 1 cent to randomly pick a number between 1 and 10

  29. Workers are given 1 cent to randomly pick a number between 1 and 10. ~850 Turkers responded. Experiment by Greg Little, from http://groups.csail.mit.edu/uid/deneme/

  30. Please choose one of the following:

  31. Please choose one of the following: [Figure: distribution of responses.] Experiment by Greg Little, from http://groups.csail.mit.edu/uid/deneme/

  32. Please flip an actual coin and report the result

  33. Please flip an actual coin and report the result. After 50 HITs: 34 heads, 16 tails. And 50 more: 31 heads, 19 tails. Experiment by Rob Miller, from http://groups.csail.mit.edu/uid/deneme/
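
As a side note, such reports are easy to sanity-check: under a fair coin, 65 heads in 100 flips is already quite unlikely. A quick check with SciPy (this analysis is not on the slides):

```python
# Are 65 reported heads out of 100 flips consistent with workers
# actually flipping fair coins? A two-sided exact binomial test.
from scipy.stats import binomtest

result = binomtest(k=34 + 31, n=100, p=0.5)  # 65 heads across both batches
print(f"p-value under a fair coin: {result.pvalue:.4f}")
# A small p-value suggests many workers reported "heads" without flipping.
```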

  34. Please click option B: A B C

  35. Please click option B: A B C. Results of 100 HITs: A: 2, B: 96, C: 2. Experiment by Greg Little, from http://groups.csail.mit.edu/uid/deneme/

  36. How do we annotate this?

  37. Notes on image annotation. Adela Barriuso, Antonio Torralba. Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology. arXiv:1210.3448v1 [cs.CV], 12 Oct 2012. “I can see the ceiling, a wall and a ladder, but I do not know how to annotate what is on the right side of the picture. Maybe I just need to admit that I can not solve this picture in an easy and fast way. But if I was forced…” Semantic blindspots.

  38. Jia Deng, Fei-Fei Li, and many collaborators

  39. What is WordNet? Organizes over 150,000 words into 117,000 categories called synsets. Establishes ontological and lexical relationships used in NLP and related tasks. Original paper [George Miller et al., 1990] cited over 5,000 times.

  40. Individually Illustrated WordNet Nodes: a massive ontology of images to transform computer vision. jacket: a short coat. German shepherd: breed of large shepherd dogs used in police work and as a guide for the blind. microwave: kitchen appliance that cooks food by passing an electromagnetic wave through it. mountain: a land mass that projects well above its surroundings; higher than a hill.

  41. Example of the hierarchy: OBJECTS splits into ANIMALS, PLANTS, and INANIMATE (NATURAL vs. MAN-MADE, e.g. CAMERA); ANIMALS → VERTEBRATE → MAMMALS (e.g. TAPIR, BOAR) and BIRDS (e.g. GROUSE).
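
For readers who want to poke at this hierarchy themselves, here is a minimal sketch using NLTK’s WordNet interface (it assumes the wordnet corpus has been downloaded); the synset mirrors the German shepherd example above.

```python
# Walk the WordNet hierarchy that ImageNet populates with images.
# Requires the corpus: python -c "import nltk; nltk.download('wordnet')"
from nltk.corpus import wordnet as wn

synset = wn.synset("german_shepherd.n.01")
print(synset.definition())

# Climb hypernyms up to the root, mirroring the OBJECTS -> ANIMALS -> ... tree.
node = synset
while node.hypernyms():
    node = node.hypernyms()[0]
    print(node.name())
```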

  42. [Figure: bar chart of the target label distribution over object classes.]

  43. What’s wrong here? [Figure: raw CNN outputs, ranging from about −6 to 12, next to the target label distribution.] The raw scores are not probabilities: they can be negative and need not sum to one.

  44. Normalize the outputs to sum to unity with softmax: σ(z)_j = exp(z_j) / Σ_{k=1}^{K} exp(z_k). [Figure: CNN outputs before and after softmax, next to the target label distribution.]
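
A minimal NumPy sketch of this normalization; the max-subtraction is a standard numerical-stability trick, not on the slide:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()    # stability shift; the exp ratios are unchanged
    e = np.exp(z)
    return e / e.sum()

logits = np.array([12.0, 6.0, 0.0, -6.0])  # raw CNN outputs like slide 43
probs = softmax(logits)
print(probs, probs.sum())                  # non-negative and sums to one
```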

  45. Cross-entropy loss: ℒ(x, y) = −Σ_i y_i log x_i. [Figure: softmax outputs next to the target label distribution.]

  46. Follow a gradient step to lower the loss ℒ(x, y) = −Σ_i y_i log x_i. [Figure: CNN outputs and target labels as before.]
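
Putting slides 44 to 46 together, here is a minimal sketch of one gradient step; the logits and learning rate are illustrative, and the gradient uses the standard softmax-plus-cross-entropy identity dℒ/dz = softmax(z) − y:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([12.0, 6.0, 0.0, -6.0])  # logits for one example
y = np.array([0.0, 1.0, 0.0, 0.0])    # one-hot target label

probs = softmax(z)
loss = -np.sum(y * np.log(probs))     # cross-entropy from slide 45
grad = probs - y                      # gradient of the loss w.r.t. the logits
z = z - 0.5 * grad                    # one gradient step (learning rate 0.5)
print(f"loss before {loss:.3f}, after {-np.sum(y * np.log(softmax(z))):.3f}")
```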

  47. [Figure: CNN class predictions.] Question: how do we localize where objects are?

  48. How much data do you need? Systematic evaluation of CNN advances on the ImageNet

  49. How much data do you need? CNN Features off-the-shelf: an Astounding Baseline for Recognition

  50. Shortcuts to AI. With billions of images on the web, it’s often possible to find a close nearest neighbor. We can shortcut hard problems by “looking up” the answer, stealing the label from our nearest neighbor.
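
A minimal sketch of this label-transfer idea with scikit-learn; the feature matrix and labels below are random stand-ins for a real database of image features:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
db_features = rng.random((100_000, 128), dtype=np.float32)  # stand-in image features
db_labels = rng.integers(0, 1000, size=100_000)             # stand-in class labels

index = NearestNeighbors(n_neighbors=1).fit(db_features)

def lookup_label(query_feature):
    """Return the label of the closest database image ("steal" its label)."""
    _, idx = index.kneighbors(query_feature[None, :])
    return db_labels[idx[0, 0]]

print(lookup_label(rng.random(128, dtype=np.float32)))
```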

  51. Chinese Room experiment, John Searle (1980). The input to the program is Chinese, and the output is also Chinese. It passes the Turing test. Does the computer “understand” Chinese or just “simulate” it? What if the software is just a lookup table?

  52. History

  53. Recognition as an alignment problem: Block world L. G. Roberts Machine Perception of Three Dimensional Solids, Ph.D. thesis, MIT Department of Electrical Engineering, 1963. J. Mundy, Object Recognition in the Geometric Era: a Retrospective, 2006

  54. Representing and recognizing object categories is harder... ACRONYM (Brooks and Binford, 1981) Binford (1971), Nevatia & Binford (1972), Marr & Nishihara (1978)

  55. Binford and generalized cylinders Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006

  56. General shape primitives? Generalized cylinders Ponce et al. (1989) Forsyth (2000) Zisserman et al. (1995) Svetlana Lazebnik

  57. Recognition by components Biederman (1987) Primitives (geons) Objects http://en.wikipedia.org/wiki/Recognition_by_Components_Theory Svetlana Lazebnik

  58. Scenes and geons Mezzanotte & Biederman

  59. Bag-of-features models Bag of Object ‘words’ Svetlana Lazebnik

  60. Origin 1: Bag-of-words models • Orderless document representation: frequencies of words from a dictionary Salton & McGill (1983) US Presidential Speeches Tag Cloud http://chir.ag/phernalia/preztags/

  61. Origin 2: Texture recognition • Characterized by repetition of basic elements or textons • For stochastic textures, the identity of textons matters, not their spatial arrangement Julesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003

  62. Origin 2: Texture recognition. [Figure: texture image → histogram over a universal texton dictionary.] Julesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003
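
A minimal sketch of the texton-histogram idea under simplifying assumptions: a toy Gaussian-derivative filter bank stands in for the banks used in the cited papers, and the image is a random placeholder.

```python
# Filter responses are clustered into "textons"; a texture is then described
# by how often each texton occurs, ignoring spatial arrangement.
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def filter_responses(image):
    """Stack a few Gaussian-derivative responses per pixel (toy filter bank)."""
    maps = [ndimage.gaussian_filter(image, sigma=s, order=o)
            for s in (1, 2, 4) for o in ((0, 1), (1, 0))]
    return np.stack(maps, axis=-1).reshape(-1, len(maps))

image = np.random.rand(128, 128)        # placeholder texture image
responses = filter_responses(image)

kmeans = KMeans(n_clusters=32, n_init=10).fit(responses)  # "universal" textons
hist = np.bincount(kmeans.labels_, minlength=32)          # texton histogram
print(hist / hist.sum())
```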

  63. Bag-of-features models Svetlana Lazebnik

  64. Objects as texture • All of these are treated as being the same • No distinction between foreground and background: scene recognition? Svetlana Lazebnik

  65. Bag-of-features steps: 1. feature extraction; 2. learn a “visual vocabulary”; 3. quantize features using the visual vocabulary; 4. represent images by frequencies of “visual words”. (A code sketch of the full pipeline follows below.)
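
A compact sketch of all four steps; it uses ORB descriptors and k-means purely for convenience (the cited work typically used SIFT), with placeholder image paths:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def orb_descriptors(path):
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.ORB_create().detectAndCompute(image, None)  # step 1
    return desc.astype(np.float32)

train_paths = ["img0.jpg", "img1.jpg"]  # placeholder image paths
all_desc = np.vstack([orb_descriptors(p) for p in train_paths])

vocab = KMeans(n_clusters=100, n_init=10).fit(all_desc)       # step 2

def bag_of_words(path):
    words = vocab.predict(orb_descriptors(path))              # step 3
    hist = np.bincount(words, minlength=100)                  # step 4
    return hist / hist.sum()
```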

  66. 1. Feature extraction • Regular grid or interest regions

  67. 1. Feature extraction: detect patches → extract patch → compute descriptor. Slide credit: Josef Sivic

  68. 1. Feature extraction … Slide credit: Josef Sivic

  69. 2. Learning the visual vocabulary … Slide credit: Josef Sivic

  70. 2. Learning the visual vocabulary … Clustering Slide credit: Josef Sivic

  71. 3. Quantize the visual vocabulary Visual vocabulary … Clustering Slide credit: Josef Sivic

  72. Example codebook … Appearance codebook Source: B. Leibe
