Instance recognition
Thurs Oct 29

Last time
• Depth from stereo: main idea is to triangulate from corresponding image points.
• Epipolar geometry defined by two cameras
  – We've assumed known extrinsic parameters relating their poses
• Epipolar constraint limits where points from one view will be imaged in the other
  – Makes search for correspondences quicker
• To estimate depth
  – Limit search by the epipolar constraint
  – Compute correspondences, incorporate matching preferences
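As a quick refresher on the last point, here is a minimal sketch of turning disparities into depth for a rectified stereo pair via Z = f·B / d. The function name and the example numbers are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def depth_from_disparity(disparity, focal_length_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to depth (meters).

    Assumes a rectified stereo pair, so depth Z = f * B / d.
    Pixels with (near-)zero disparity get np.nan: their depth is undefined.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.nan)
    valid = disparity > eps
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Toy example: a 2x3 disparity map from a rig with f = 700 px, B = 0.12 m
d = np.array([[10.0, 5.0, 0.0],
              [20.0, 2.5, 1.0]])
print(depth_from_disparity(d, focal_length_px=700.0, baseline_m=0.12))
```

Note how a zero disparity (a point at infinity, or a failed correspondence) yields an undefined depth, which ties into the review question below.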
Virtual viewpoint video
C. L. Zitnick et al., High-quality video view interpolation using a layered representation, SIGGRAPH 2004.
http://research.microsoft.com/IVM/VVV/
Review question: What stereo rig yielded these epipolar lines?
• The epipole has the same coordinates in both images.
• Points move along lines radiating from e: the "focus of expansion."
(Figure from Hartley & Zisserman)

Review questions
• When solving for stereo, when is it necessary to break the soft disparity gradient constraint?
• What can cause a disparity value to be undefined?
• What parameters relating the two cameras in the stereo rig must be known (or inferred) to compute depth?
Today
• Instance recognition
  – Indexing local features efficiently
  – Spatial verification models

Recognizing or retrieving specific objects
Example I: Visual search in feature films
• Visually defined query: "Groundhog Day" [Ramis, 1993]
• "Find this clock" / "Find this place"
(Slide credit: J. Sivic)
Recognizing or retrieving specific objects
Example II: Search photos on the web for particular places
• "Find these landmarks ... in these images and 1M more."
(Slide credit: J. Sivic)
Recall: matching local features
• To generate candidate matches, find patches that have the most similar appearance (e.g., lowest SSD).
• Simplest approach: compare them all, take the closest (or closest k, or all within a thresholded distance); see the sketch below.

Multi-view matching
• Matching two given views for depth vs. searching among many views for a matching view for recognition.
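A minimal sketch of that brute-force matcher, assuming descriptors are stored as rows of NumPy arrays; the SSD threshold here is an arbitrary illustrative value:

```python
import numpy as np

def match_features_ssd(desc1, desc2, max_ssd=30.0):
    """Brute-force matching: for each descriptor in image 1, compare
    against every descriptor in image 2 (sum of squared differences)
    and keep the closest one if it falls below a threshold.

    desc1: (N1, D) array, desc2: (N2, D) array.
    Returns a list of (index_in_image1, index_in_image2) pairs.
    """
    matches = []
    for i, d1 in enumerate(desc1):
        ssd = np.sum((desc2 - d1) ** 2, axis=1)   # distance to every candidate
        j = int(np.argmin(ssd))                   # take the closest
        if ssd[j] < max_ssd:                      # or keep only "good enough" matches
            matches.append((i, j))
    return matches

# Toy example with random 128-D (SIFT-sized) descriptors
rng = np.random.default_rng(0)
d1, d2 = rng.random((5, 128)), rng.random((8, 128))
print(match_features_ssd(d1, d2))
```

This is exactly the "compare them all" strategy: fine for two images, but too slow once the database grows, which motivates the indexing schemes that follow.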
Indexing local features

Indexing local features
• Each patch / region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT).
(Figure: points in the descriptor's feature space)
Indexing local features
• When we see close points in feature space, we have similar descriptors, which indicates similar local content.
(Figure: a query image and database images mapped into the descriptor's feature space)

Indexing local features
• With potentially thousands of features per image, and hundreds to millions of images to search, how do we efficiently find those that are relevant to a new image?
• Possible solutions (a kd-tree sketch follows below):
  – Inverted file
  – Nearest neighbor data structures
    • Kd-trees
    • Hashing
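As one illustration of the nearest-neighbor route, a sketch using SciPy's kd-tree; this is an off-the-shelf stand-in chosen for brevity, not necessarily the structure the slides have in mind, and the data is synthetic:

```python
import numpy as np
from scipy.spatial import cKDTree

# Pretend database: 10,000 descriptors of dimension 128 (SIFT-sized)
rng = np.random.default_rng(0)
db_descriptors = rng.random((10_000, 128))

# Build the kd-tree once over all database descriptors
tree = cKDTree(db_descriptors)

# For each query descriptor, retrieve its k closest database descriptors
query_descriptors = rng.random((5, 128))
distances, indices = tree.query(query_descriptors, k=3)

print(indices.shape)    # (5, 3): 3 nearest database descriptors per query feature
print(distances[0])     # Euclidean distances for the first query descriptor
```

In very high dimensions exact kd-tree search degrades toward brute force, which is one reason approximate methods (hashing, hierarchical quantization) are also listed.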
Indexing local features: inverted file index
• For text documents, an efficient way to find all pages on which a word occurs is to use an index.
• We want to find all images in which a feature occurs.
• To use this idea, we'll need to map our features to "visual words."

Visual words
• Map high-dimensional descriptors to tokens/words by quantizing the feature space.
• Quantize via clustering; let cluster centers be the prototype "words."
• Determine which word to assign to each new image region by finding the closest cluster center (see the sketch below).
(Figure: cluster centers, e.g. Word #2, in the descriptor's feature space)
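A minimal sketch of vocabulary building and word assignment via k-means clustering, using scikit-learn as one illustrative choice of library; the vocabulary size and data here are made up:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Pretend training set: descriptors pooled from many images
rng = np.random.default_rng(0)
training_descriptors = rng.random((50_000, 128))

# 1) Build the visual vocabulary: cluster centers become the prototype "words"
n_words = 1000
kmeans = MiniBatchKMeans(n_clusters=n_words, random_state=0)
kmeans.fit(training_descriptors)

# 2) Quantize new descriptors: each one is assigned the ID of its closest center
new_image_descriptors = rng.random((300, 128))
word_ids = kmeans.predict(new_image_descriptors)

print(word_ids[:10])   # visual-word indices in [0, n_words)
```

Once every descriptor is replaced by a word ID, an image is just a list of words, and the text-retrieval machinery (inverted files, tf-idf) applies directly.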
Visual words: main idea
• Extract some local features from a number of images.
• E.g., SIFT descriptor space: each point is 128-dimensional.
• Each point is a local descriptor, e.g. a SIFT vector.
(Figure sequence: descriptors from many images plotted in the feature space and grouped into clusters; slide credit: D. Nister, CVPR 2006)
Visual words
• Example: each group of patches belongs to the same visual word.
(Figure from Sivic & Zisserman, ICCV 2003)

Visual words
• Also used for describing scenes and object categories for the sake of indexing or classification.
(Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others)
Visual words and textons
• First explored for texture and material representations.
• Texton = cluster center of filter responses over a collection of images.
• Describe textures and materials based on the distribution of prototypical texture elements.
(Leung & Malik 1999; Varma & Zisserman 2002)

Recall: Texture representation example
• Statistics are computed to summarize patterns in small windows:

  Window    mean d/dx value   mean d/dy value
  Win. #1         4                10
  Win. #2        18                 7
  ...           ...               ...
  Win. #9        20                20

• Plotted with Dimension 1 = mean d/dx value and Dimension 2 = mean d/dy value, windows group by appearance: primarily vertical edges, primarily horizontal edges, both edge directions, or small gradient in both directions.
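A small sketch of the window-statistics idea recalled above (mean |d/dx| and mean |d/dy| per window); the window size and helper name are illustrative assumptions:

```python
import numpy as np

def window_gradient_stats(image, window=16):
    """Summarize each window of a grayscale image by two statistics:
    mean |d/dx| and mean |d/dy| (a 2-D texture descriptor per window).

    image: 2-D float array. Returns an (n_windows, 2) array.
    """
    dy, dx = np.gradient(image)               # d/dy along rows, d/dx along columns
    h, w = image.shape
    stats = []
    for r in range(0, h - window + 1, window):
        for c in range(0, w - window + 1, window):
            win_dx = np.abs(dx[r:r + window, c:c + window]).mean()
            win_dy = np.abs(dy[r:r + window, c:c + window]).mean()
            stats.append((win_dx, win_dy))
    return np.array(stats)

# Toy image: vertical stripes -> large mean |d/dx|, small mean |d/dy|
img = np.tile(np.array([0.0, 1.0] * 32), (64, 1))
print(window_gradient_stats(img, window=16)[:3])
```

Clustering these 2-D window statistics (or, more generally, full filter-bank responses) gives exactly the texton vocabulary described in Leung & Malik.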
Visual vocabulary formation
Issues:
• Sampling strategy: where to extract features?
• Clustering / quantization algorithm
• Unsupervised vs. supervised
• What corpus provides features (universal vocabulary?)
• Vocabulary size, number of words

Inverted file index
• Database images are loaded into the index mapping words to image numbers (a sketch of building and querying such an index follows below).
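A minimal sketch, under the assumption that each database image has already been quantized into a list of visual-word IDs, of how such an inverted index might be built and queried; the ranking-by-shared-words step is a simplification of the weighted scoring discussed later:

```python
from collections import defaultdict, Counter

def build_inverted_index(db_word_lists):
    """Map each visual-word ID to the set of database image IDs containing it.

    db_word_lists: dict {image_id: iterable of visual-word IDs in that image}.
    """
    index = defaultdict(set)
    for image_id, words in db_word_lists.items():
        for w in words:
            index[w].add(image_id)
    return index

def query_inverted_index(index, query_words):
    """Return database images sharing at least one word with the query,
    ranked by how many distinct query words they share (a crude score)."""
    votes = Counter()
    for w in set(query_words):
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return votes.most_common()

# Toy example with 3 database images and a small vocabulary
db = {0: [3, 7, 7, 12], 1: [7, 19], 2: [1, 3, 19, 42]}
index = build_inverted_index(db)
print(query_inverted_index(index, query_words=[3, 7, 100]))
# [(0, 2), (1, 1), (2, 1)] -- image 0 shares the most query words
```

Only images sharing at least one word with the query are ever touched, which is where the efficiency gain comes from when words are reasonably discriminative.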
Inverted file index
• New query image is mapped to indices of database images that share a word.
• When will this give us a significant gain in efficiency?

Instance recognition: remaining issues
• How to summarize the content of an entire image? And gauge overall similarity?
• How large should the vocabulary be? How to perform quantization efficiently?
• Is having the same set of visual words enough to identify the object/scene? How to verify spatial agreement?
• How to score the retrieval results?
(Kristen Grauman)
Analogy to documents
• A document can be summarized by the words it contains, ignoring their order.
• Example 1: a passage on visual perception (how the retinal image is analyzed by cells in the visual cortex, per Hubel and Wiesel) reduces to words such as: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel.
• Example 2: a news passage on China's growing trade surplus and the debate over the yuan's value reduces to words such as: China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, value.
(ICCV 2005 short course, L. Fei-Fei)

Bag of 'words'
(Figure: an object image represented as a collection of its local patches, its "visual words")
(ICCV 2005 short course, L. Fei-Fei)
Bags of visual words
• Summarize the entire image by its distribution (histogram) of word occurrences.
• Analogous to the bag-of-words representation commonly used for documents.
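A minimal sketch of forming the bag-of-words histogram for one image, assuming its features have already been quantized to word IDs; the toy vocabulary size is arbitrary:

```python
import numpy as np

def bag_of_words_histogram(word_ids, vocab_size, normalize=True):
    """Represent an image as a histogram of visual-word occurrences.

    word_ids: 1-D sequence of visual-word indices assigned to the image's features.
    vocab_size: number of words in the visual vocabulary.
    """
    hist = np.bincount(np.asarray(word_ids), minlength=vocab_size).astype(np.float64)
    if normalize and hist.sum() > 0:
        hist /= hist.sum()   # so images with different feature counts are comparable
    return hist

# Toy example: an image whose features quantized to these word IDs, vocab of 8 words
print(bag_of_words_histogram([0, 3, 3, 7, 3, 1], vocab_size=8))
# -> approximately [0.167, 0.167, 0., 0.5, 0., 0., 0., 0.167]
```

The whole image is thus a single fixed-length vector, which is what gets compared and ranked on the next slide.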
Comparing bags of words
• Rank frames by the normalized scalar product between their (possibly weighted) occurrence counts: a nearest-neighbor search for similar images.
• Example count vectors: d_j = [1 8 1 4], q = [5 1 1 0]
• For a vocabulary of V words:

  sim(d_j, q) = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert}
              = \frac{\sum_{i=1}^{V} d_j(i)\, q(i)}{\sqrt{\sum_{i=1}^{V} d_j(i)^2}\; \sqrt{\sum_{i=1}^{V} q(i)^2}}

tf-idf weighting
• Term frequency - inverse document frequency.
• Describe a frame by the frequency of each word within it, and downweight words that appear often in the database.
• (Standard weighting for text retrieval)

  t_i = \frac{n_{id}}{n_d} \, \log\frac{N}{n_i}

  where n_{id} = number of occurrences of word i in document d,
        n_d = number of words in document d,
        n_i = number of documents in the database in which word i occurs,
        N = total number of documents in the database.
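A sketch of tf-idf weighting plus the normalized scalar product ranking, under the simplifying assumption that the idf term is computed from the database histograms alone; function names and toy counts are illustrative:

```python
import numpy as np

def compute_idf(db_counts):
    """Inverse document frequency from the database: log(N / n_i),
    where n_i = number of database images containing word i."""
    n_images = db_counts.shape[0]
    n_i = np.count_nonzero(db_counts > 0, axis=0)
    return np.log(n_images / np.maximum(n_i, 1))

def tfidf_vector(counts, idf):
    """Term frequency (n_id / n_d) times idf, then L2-normalized."""
    counts = np.asarray(counts, dtype=np.float64)
    tf = counts / max(counts.sum(), 1)
    v = tf * idf
    return v / max(np.linalg.norm(v), 1e-12)

def rank_database(query_counts, db_counts):
    """Rank database images by the normalized scalar product (cosine
    similarity) between tf-idf weighted bag-of-words vectors."""
    idf = compute_idf(db_counts)
    q = tfidf_vector(query_counts, idf)
    d = np.vstack([tfidf_vector(row, idf) for row in db_counts])
    scores = d @ q              # one cosine-similarity score per database image
    return np.argsort(-scores), scores

db = np.array([[1, 8, 1, 4], [5, 1, 1, 0], [0, 0, 3, 3]])
order, scores = rank_database(np.array([5, 1, 1, 0]), db)
print(order, np.round(scores, 3))   # database image 1 matches the query exactly
```

Because the vectors are L2-normalized, the dot product equals the normalized scalar product in the formula above, and words occurring in most database images contribute little to the score.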