Instance-level Recognition Pingmei Xu
Object Recognition Friends SE01EP02
Recognition: Find the Ring! Friends SE01EP02
Recognition: Find the Ring! Instance Recognition (Friends SE01EP02)
Recognition: Find the Ring! Category Recognition (Friends SE01EP02)
Recognition: Find the Ring! Recognition Algorithm (Friends SE01EP02)
Recognition: Find the Ring! Scene Understanding (Friends SE01EP02)
History of Ideas in Recognition Some Slides are borrowed from Svetlana Lazebnik
The Religious Wars • Geometry vs. Appearance • Parts vs. the Whole • …and the standard answer: probably both or neither (Alexei Efros)
1960s ~ late 1990s the Geometric Era
Blocks World (1960s), L.G. Roberts • Constrained 3D scene models to allow object recognition from very simple image features.
Generalized Cylinders (1970s), T. Binford • Representing 3D shapes and parts in terms of “Generalized Cylinders”.
Recognition by Parts, Biederman (1987), Pentland (1986) • Geons: shape primitives + deformations, with predictable edge properties under perspective.
Recognition by Parts: Primitives (geons) → Objects, Biederman (1987) • Hypothesis: there is a small number of geometric components that constitute the primitive elements of the object recognition system.
Parts + Spatial Configurations, Fischler & Elschlager (1973) • There is more to shape than just the right part primitives: their spatial relationships matter too.
Alignment Huttenlocher & Ullman (1987)
1990s Appearance-Based
Eigenfaces Turk & Pentland (1991)
Color Histogram Swain & Ballard (1991)
1990s ~ present Sliding Window
Sliding Window Approaches • Viola & Jones (2000) • Dalal & Triggs (2005): HOG feature map → template → response map
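A minimal sketch of the sliding-window idea (illustrative only, not the Viola & Jones or Dalal & Triggs implementation): a learned template is correlated with a dense feature map at every position to produce a response map. The feature map, template, and linear scoring are assumed inputs here.

```python
# Sliding-window sketch: correlate a template with a dense feature map
# (e.g. HOG cells) to obtain a detection response map.
import numpy as np

def response_map(feature_map, template):
    """feature_map: (H, W, D) dense features; template: (h, w, D) object template.
    Returns an (H-h+1, W-w+1) score map."""
    H, W, D = feature_map.shape
    h, w, _ = template.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            window = feature_map[y:y + h, x:x + w]
            scores[y, x] = np.sum(window * template)  # linear (SVM-style) score
    return scores

# Detections are local maxima of `scores` above a threshold, typically followed
# by non-maximum suppression across positions and scales.
```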
late 1990s ~ present Local Features
Local Features D. Lowe (1999, 2004)
early 2000s ~ present Parts and Shape
Constellation Models Weber, Welling & Perona (2000), Fergus, Perona & Zisserman (2003)
Deformable Part Model Felzenszwalb, Girshick, McAllester & Ramanan (2009)
mid 2000s ~ present Bag of Features
Bag-of-Features Models • Objects are represented as a bag of visual “words”
Present Trends
Data-driven Method Malisiewicz et al. (2011)
Recognition from RGBD Images Shotton et al. (2011)
Deep Learning http://deeplearning.net/
History of Ideas in Recognition • 1960s – late 1990s: the geometric era • 1990s: appearance-based models • 1990s – present: sliding window approaches • late 1990s – present: local features • early 2000s – present: parts-and-shape models • mid 2000s – present: bags of features • present trends: “big data”, context, attributes, combining geometry and recognition, advanced scene understanding tasks, deep learning
History of Ideas in Recognition: The Computer Vision Evolution?
3D Object Modeling and Recognition A test image Instances of 5 models Rothganger, Lazebnik, Schmid, & Ponce (2006)
3D Modeling: Pairwise Matching
Affine Patches • Step 1 (detection): detect salient image regions • Step 2 (description): extract a descriptor • Idea: although smooth surfaces are almost never planar in the large, they are always planar in the small.
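A small sketch of the detect-then-describe pipeline; OpenCV's DoG + SIFT is used here as a stand-in for the affine-covariant detectors in the paper, and the image path is hypothetical.

```python
# Detect-then-describe sketch (DoG/SIFT stand-in for the paper's affine detectors).
import cv2

img = cv2.imread("view.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# keypoints: salient regions (Step 1); descriptors: 128-D vectors (Step 2).
```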
Affine Patches Harris-Laplacian DoG
Affine Patches • Patch rectification
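A sketch of patch rectification, assuming the detector supplies an affine frame A (2×2) and centre c for each region; the canonical patch size is an arbitrary choice.

```python
# Patch rectification sketch: warp an affine-covariant region (frame A, centre c)
# to a canonical square patch so descriptors can be compared across viewpoints.
import cv2
import numpy as np

def rectify_patch(image, A, c, size=41):
    """A: 2x2 affine frame (np.ndarray), c: 2-vector centre, size: output side."""
    scale = 2.0 / (size - 1)  # map output pixels to canonical coords in [-1, 1]^2
    M = np.hstack([A * scale, (c - A @ np.ones(2)).reshape(2, 1)])
    # M maps destination patch pixels to source image coordinates.
    return cv2.warpAffine(image, M, (size, size),
                          flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
```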
Geometric Constraints
3D Object Modeling: Matching Procedure RANSAC: 1) sampling stage 2) consensus stage
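A generic RANSAC sketch showing the two stages; fit_model, residual, and the parameters are placeholders rather than the paper's specific geometric model.

```python
# RANSAC sketch: 1) sampling stage, 2) consensus stage.
import random

def ransac(matches, fit_model, residual, seed_size, n_samples, threshold):
    best_model, best_inliers = None, []
    for _ in range(n_samples):                      # 1) sampling stage
        seed = random.sample(matches, seed_size)
        model = fit_model(seed)
        inliers = [m for m in matches               # 2) consensus stage
                   if residual(model, m) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = fit_model(inliers), inliers
    return best_model, best_inliers
```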
Application: 3D Object Modeling
3D Modeling: Input Images • The 20 images used to construct the teddy bear model.
3D Modeling: Partial Model from Image Pairs • Matches between two images.
3D Modeling: Partial Model → Composite Ones • Convert a collection of matches to a 3D model: 1. Chaining 2. Stitching 3. Bundle adjustment 4. Euclidean upgrade
3D Modeling: Partial → Composite: Chaining Chaining: link matches across multiple images. • Construction of the patch-view matrix. A (subsampled) patch-view matrix for the teddy bear. Each black square indicates the presence of a given patch in a given image.
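A sketch of the chaining step, assuming matches arrive as pairs of (image, patch) observations; union-find groups them into tracks, and the rows of the resulting boolean matrix correspond to the black squares in the patch-view figure.

```python
# Chaining sketch: link pairwise matches into tracks with union-find, then
# build a boolean patch-view matrix (tracks x images).
import numpy as np

def chain_matches(matches, n_images):
    """matches: list of ((img_i, patch_i), (img_j, patch_j)) pairwise matches."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]        # path halving
            x = parent[x]
        return x

    for a, b in matches:                         # link matched observations
        parent[find(a)] = find(b)

    tracks = {}
    for node in list(parent):                    # group observations by track root
        tracks.setdefault(find(node), []).append(node)

    patch_view = np.zeros((len(tracks), n_images), dtype=bool)
    for row, obs in enumerate(tracks.values()):
        for img, _patch in obs:
            patch_view[row, img] = True          # "black square": patch seen in image
    return patch_view
```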
3D Modeling: Partial → Composite: Stitching Stitching: solve for the affine structure and motion while coping with missing data. Common patches of adjacent modeling views presented in a common coordinate frame.
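A sketch of affine structure from motion via Tomasi-Kanade-style factorization on one complete block of observations; the paper's stitching step additionally copes with missing data, which this sketch does not.

```python
# Affine structure-from-motion sketch: factor a complete measurement matrix.
import numpy as np

def affine_sfm(W):
    """W: (2m, n) matrix stacking x/y coordinates of n patches in m views."""
    t = W.mean(axis=1, keepdims=True)            # translation = centroid per view
    W0 = W - t
    U, S, Vt = np.linalg.svd(W0, full_matrices=False)
    M = U[:, :3] * S[:3]                         # (2m, 3) affine cameras
    X = Vt[:3]                                   # (3, n) affine structure
    return M, X, t                               # recovered up to an affine ambiguity
```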
3D Modeling: Partial → Composite: Bundle Adjustment Bundle adjustment: refine the model using non-linear least squares. The bear model along with the recovered affine camera configurations.
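A minimal bundle-adjustment sketch using scipy's non-linear least squares on the affine reprojection error; the parameterization (2×4 affine cameras, 3D points) is an assumption for illustration.

```python
# Bundle-adjustment sketch: refine cameras and structure by non-linear least
# squares over the observed reprojection residuals.
import numpy as np
from scipy.optimize import least_squares

def refine(M0, X0, observations):
    """M0: (2m, 4) initial affine cameras (last column = translation),
    X0: (3, n) initial structure,
    observations: list of (view v, patch p, observed 2-vector xy)."""
    m2, n = M0.shape[0], X0.shape[1]

    def residuals(params):
        M = params[:m2 * 4].reshape(m2, 4)
        X = params[m2 * 4:].reshape(3, n)
        res = []
        for v, p, xy in observations:
            proj = M[2 * v:2 * v + 2, :3] @ X[:, p] + M[2 * v:2 * v + 2, 3]
            res.extend(proj - xy)
        return np.asarray(res)

    sol = least_squares(residuals, np.concatenate([M0.ravel(), X0.ravel()]))
    return sol.x[:m2 * 4].reshape(m2, 4), sol.x[m2 * 4:].reshape(3, n)
```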
3D Modeling: Object Gallery
Application: 3D Object Recognition
3D Object Recognition: Select Potential Matches Features: 1) a measure of the contrast (average squared gradient norm) in the patch 2) a 10 × 10 color histogram drawn from the UV portion of YUV space, and 3) SIFT
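A sketch of the first two appearance cues (patch contrast and the UV colour histogram), assuming OpenCV and an 8-bit BGR patch; SIFT would be computed separately on the rectified patch.

```python
# Appearance cues used to shortlist potential matches.
import cv2
import numpy as np

def contrast(gray_patch):
    gx = cv2.Sobel(gray_patch, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(gray_patch, cv2.CV_64F, 0, 1)
    return np.mean(gx ** 2 + gy ** 2)            # average squared gradient norm

def uv_histogram(bgr_patch, bins=10):
    yuv = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2YUV)
    hist = cv2.calcHist([yuv], [1, 2], None, [bins, bins],
                        [0, 256, 0, 256])        # U-V channels only, luminance ignored
    return cv2.normalize(hist, None).ravel()
```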
3D Object Recognition: Robust Estimation • RANSAC • Greedy Here |P| denotes the size of the set P of match hypotheses, K is the number of best matches kept per model patch, M is the number of samples drawn, and N is the size of one seed.
3D Object Recognition: Object Detection • Criteria used to decide whether the object is present or not • Measure of distortion: reflects how close the recovered matrix is to the top part of a scaled rotation.
3D Object Recognition: Successful Examples
3D Object Recognition: Failed Examples
3D Object Modeling and Recognition • Paper: 3D Object Modeling and Recognition Using Local Affine-Invariant Image Descriptors and Multi-View Spatial Constraints. F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce, IJCV 2006 • Pipeline: input images → pairwise matching → 3D modeling; test image → 3D recognition
Large Scale Retrieval
Large Scale Image Retrieval • Combining local features, indexing, and spatial constraints K. Grauman and B. Leibe
Video Google (VG) • J. Sivic et al. (2003), Philbin et al. (2007, 2008), Chum et al. (2007) • Is the text retrieval approach applicable to object recognition? (Query image → search results)
Text Retrieval: Word, Stem, Document, Corpus
The Visual Analogy ? ? Frame/Image Film/Image Set
VG: Local Descriptor • Viewpoint covariant regions: 1) ’Maximally Stable’ (yellow), 2) ’Shape Adapted’ (cyan) • 128-dimensional SIFT
The Visual Analogy Descriptor ? Frame/Image Film/Image Set
VG: Visual Vocabulary • Vector quantize the descriptors of the affine covariant regions into clusters by k-means.
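A vocabulary-construction sketch with scikit-learn's MiniBatchKMeans as a scalable stand-in for plain k-means; the descriptor file and the 10,000-word vocabulary size are assumptions.

```python
# Vocabulary-construction sketch: quantize SIFT descriptors into visual words.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

descriptors = np.load("sift_descriptors.npy")     # hypothetical (N, 128) array
kmeans = MiniBatchKMeans(n_clusters=10_000, random_state=0).fit(descriptors)
vocabulary = kmeans.cluster_centers_              # one centroid per visual word
words = kmeans.predict(descriptors)               # visual-word id per descriptor
```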
VG: Visual Words • Each group of patches belongs to the same visual word.
The Visual Analogy Descriptor Centroid Frame/Image Film/Image Set
Image Retrieval Using Visual Words • Vocabulary construction (offline) • Database construction (offline) • Image retrieval (online)
VG: Stop List • The most frequent visual words, those that occur in almost all images, are suppressed (matches shown before and after applying the stop list).
VG: Soft Assignment • The count in one bin is spread to neighbouring bins.
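A soft-assignment sketch: the single count is spread over the r nearest centroids with a Gaussian weighting on descriptor distance; r and sigma here are illustrative values, not the published settings.

```python
# Soft-assignment sketch: weight the r nearest visual words instead of one.
import numpy as np

def soft_assign(descriptor, vocabulary, r=3, sigma=100.0):
    dists = np.linalg.norm(vocabulary - descriptor, axis=1)
    nearest = np.argsort(dists)[:r]
    weights = np.exp(-dists[nearest] ** 2 / (2 * sigma ** 2))
    hist = np.zeros(len(vocabulary))
    hist[nearest] = weights / weights.sum()   # spread the single count over bins
    return hist
```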
Vocabulary Construction Summary • A subset of 48 shots is selected (~10% of the movie, ~10k frames) • Select regions (SA+MS): 10k frames × ~1,600 regions = ~1,600,000 regions • Frame tracking and rejection of unstable regions → ~200k regions • Compute SIFT descriptors • Cluster descriptors using k-means (parameter tuning)
Image Retrieval Using Visual Words • Vocabulary construction (offline) • Database construction (offline) • Image retrieval (online)
tf-idf Vector • $t_{id} = \frac{n_{id}}{n_d}\,\log\frac{N}{n_i}$, where $n_{id}$ is the number of occurrences of word $i$ in document $d$, $n_d$ the total number of words in document $d$, $n_i$ the total number of occurrences of word $i$ in the database, and $N$ the total number of documents in the database • Documents -> vectors of word frequencies • Term frequency – inverse document frequency • Downweight words that appear often in the database
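A direct transcription of the tf-idf formula above, assuming a documents-by-words count matrix.

```python
# tf-idf weighting sketch: t_id = (n_id / n_d) * log(N / n_i).
import numpy as np

def tfidf(counts):
    """counts: (N, V) matrix of visual-word counts, one row per document."""
    n_d = counts.sum(axis=1, keepdims=True)   # n_d: total words in document d
    n_i = counts.sum(axis=0)                  # n_i: occurrences of word i in database
    N = counts.shape[0]                       # N: number of documents in database
    return (counts / np.maximum(n_d, 1)) * np.log(N / np.maximum(n_i, 1))
```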
VG: Inverted File Index • Table: Word ID (1 … N) → Document ID (1 … K) • Word -> a list of all documents (with frequencies)
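A minimal inverted-file sketch mapping each visual word to the documents (with frequencies) that contain it.

```python
# Inverted-file sketch: word -> {document: frequency}, so a query only touches
# documents that share at least one visual word with it.
from collections import defaultdict

def build_inverted_index(doc_words):
    """doc_words: dict doc_id -> list of visual-word ids in that document."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, words in doc_words.items():
        for w in words:
            index[w][doc_id] += 1
    return index

# Candidate documents for a query are the union of index[w] over the query's
# visual words, scored with the tf-idf weights above.
```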
Crawling Movies Summary • Vocabulary construction: select key frames → select regions (SA+MS) → frame tracking → reject unstable regions → SIFT descriptors → vector quantization • Database construction: stop list → tf-idf weighting → inverted index