ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/
2 Outline • Knowledge Representation • Statistical Pattern Recognition • Neural Networks • Boosting
3 Object Recognition • Pattern recognition is a fundamental component of machine vision • Recognition is high-level image analysis ▫ From the bottom-up perspective (pixels → objects) ▫ Many software packages exist to easily implement recognition algorithms (e.g. Weka Project, R packages) • Goal of object recognition is to “learn” characteristics that help distinguish objects of interest ▫ Most are binary problems
4 Knowledge Representation • Syntax – specifies the symbols that may be used and ways they may be arranged • Semantics – specifies how meaning is embodied in syntax • Representation – set of syntactic and semantic conventions used to describe things • Sonka book focuses on artificial intelligence (AI) representations ▫ More closely related to human cognition modeling (e.g. how humans represent things) ▫ Not as popular in vision community
5 Descriptors/Features • Most common representation in vision • Descriptors (features) usually represent some scalar property of an object ▫ These are often combined into feature vectors • Numerical feature vectors are inputs for statistical pattern recognition techniques ▫ A feature vector represents a point in feature space
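As a concrete illustration, the sketch below builds a small feature vector from a binary region mask with NumPy; the particular descriptors (area, centroid, spread) are illustrative choices, not ones prescribed here.

```python
import numpy as np

# Minimal sketch: build a feature vector for a binary region mask.
# The specific descriptors are illustrative, not prescribed by the slides.
def region_features(mask):
    ys, xs = np.nonzero(mask)                  # pixel coordinates of the object
    area = float(len(xs))                      # scalar descriptor 1: area
    cy, cx = ys.mean(), xs.mean()              # scalar descriptors 2-3: centroid
    spread = np.sqrt(((ys - cy)**2 + (xs - cx)**2).mean())  # descriptor 4: spatial spread
    return np.array([area, cy, cx, spread])    # feature vector = point in feature space

mask = np.zeros((32, 32), dtype=bool)
mask[8:20, 10:22] = True                       # a toy rectangular "object"
print(region_features(mask))
```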
6 Statistical Pattern Recognition • Object recognition = pattern recognition ▫ Pattern – measurable properties of an object • Pattern recognition steps: ▫ Description – determine the right features for the task ▫ Classification – technique to separate different object “classes” • Separable classes – a hyper-surface exists that perfectly distinguishes the classes ▫ Hyper-planes are used for linearly separable classes ▫ This is unlikely in real-world scenarios
7 General Classification Principles • A statistical classifier takes an n-dimensional feature vector describing an object and produces a single output ▫ The output is one of the R available class symbols (identifiers) ω_1, …, ω_R • Decision rule – describes the relation between classifier input and output ▫ d(x) = ω_r ▫ Divides the feature space into R disjoint subsets K_r • Discrimination hyper-surface is the border between subsets • Discrimination functions g_r(x) ▫ g_r(x) ≥ g_s(x), s ≠ r, for x ∈ K_r ▫ Discrimination hyper-surface between class regions: g_r(x) − g_s(x) = 0 • Decision rule in terms of discrimination functions ▫ d(x) = ω_r ⇔ g_r(x) = max_{s=1,…,R} g_s(x) ▫ The class whose discrimination function is maximal wins • Linear discriminant functions are simple and often used in linear classifiers ▫ g_r(x) = q_{r0} + q_{r1} x_1 + ⋯ + q_{rn} x_n • Must use non-linear classifiers for more complex problems ▫ The trick is to transform the original feature space into a higher-dimensional space, where a linear classifier can be used ▫ g_r(x) = w_r ⋅ Φ(x), with Φ(x) a non-linear mapping to the higher-dimensional space
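A minimal sketch of the linear decision rule above, assuming toy weight vectors q_r rather than learned ones: each discrimination function g_r(x) is evaluated and the class with the maximum response is chosen.

```python
import numpy as np

# Linear classifier with R discrimination functions g_r(x) = q_r0 + q_r . x,
# assigning x to the class with the maximum g_r(x). Weights are toy values.
Q0 = np.array([0.5, -0.2, 0.1])            # offsets q_r0 for R = 3 classes
Q = np.array([[ 1.0, -0.5],                # weight vectors q_r (n = 2 features)
              [-0.3,  0.8],
              [ 0.2,  0.2]])

def classify(x):
    g = Q0 + Q @ x                         # evaluate all discrimination functions
    return int(np.argmax(g))               # d(x) = omega_r with maximal g_r(x)

print(classify(np.array([0.4, 0.7])))      # index of the winning class
```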
8 Nearest Neighbors • Classifier based on the minimum distance principle ▫ Very simple classifier • Minimum distance classifier labels pattern x with the class of the closest exemplar ▫ d(x) = argmin_s |v_s − x| ▫ v_s – exemplar (sample pattern) for class ω_s • With a single exemplar per class, this results in a linear classifier • Nearest neighbor (NN) classifier uses multiple exemplars per class ▫ Take the same label as the closest exemplar • k-NN classifier ▫ More robust version that examines the k closest points and takes the most frequently occurring label • Advantage: easy “training” • Problems: computational complexity ▫ Scales with the number of exemplars and dimensions ▫ Must do many comparisons ▫ Can improve performance with k-d trees
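A minimal k-NN sketch in NumPy, assuming Euclidean distance and toy exemplars and labels.

```python
import numpy as np

# Minimal k-NN sketch: brute-force distances to all stored exemplars.
def knn_classify(x, exemplars, labels, k=3):
    d = np.linalg.norm(exemplars - x, axis=1)      # distance to every exemplar
    nearest = np.argsort(d)[:k]                    # indices of the k closest exemplars
    votes = labels[nearest]
    return np.bincount(votes).argmax()             # most frequent label among the k

exemplars = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
labels = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.85, 0.75]), exemplars, labels, k=3))   # -> 1
```

For large exemplar sets, the brute-force distance computation above could be swapped for a k-d tree query (e.g. scipy.spatial.cKDTree), which is the performance improvement the slide refers to.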
9 Classifier Optimization • Discriminative classifiers are deterministic ▫ Pattern x is always mapped to the same class • Would like to have an optimal classifier ▫ The classifier that minimizes classification errors • Posterior probability P(ω_r|x) describes how often pattern x comes from class ω_r • Optimal decision is to classify x to the class ω_r with the highest posterior P(ω_r|x) ▫ However, we do not know the posterior • Bayes theorem ▫ P(ω_s|x) = p(x|ω_s) P(ω_s) / p(x) ▫ Since p(x) is a constant and the prior P(ω_s) is known, we just need to maximize the likelihood p(x|ω_s) ▫ This is desirable because the likelihood is something we can learn using training data • Discriminative function ▫ g_r(x) = p(x|ω_r) P(ω_r) ▫ Corresponds to the posterior probability P(ω_r|x) • Define a loss function to optimize based on the classifier parameters q ▫ q* = argmin_q J(q), with decision d(x, q) = ω • Minimum error criterion (Bayes criterion, maximum likelihood) loss function ▫ λ(ω_r|ω_s) – loss incurred if the classifier incorrectly labels an object of class ω_s as ω_r ▫ λ(ω_r|ω_s) = 1 for r ≠ s • Mean loss ▫ J(q) = ∫_X Σ_{s=1}^{R} λ(d(x,q)|ω_s) p(x|ω_s) P(ω_s) dx ▫ P(ω_s) – prior probability of the class ▫ p(x|ω_s) – class-conditional probability density
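To make the minimum-error rule concrete, here is a small sketch that assumes Gaussian class-conditional densities with made-up means, covariances, and priors; it picks the class maximizing g_r(x) = p(x|ω_r) P(ω_r), which is equivalent to maximizing the posterior.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Bayes (minimum-error) decision rule with Gaussian class-conditional densities.
# Means, covariances, and priors are toy values standing in for learned ones.
params = [
    {"mean": [0.0, 0.0], "cov": np.eye(2), "prior": 0.6},   # class omega_1
    {"mean": [2.0, 2.0], "cov": np.eye(2), "prior": 0.4},   # class omega_2
]

def bayes_classify(x):
    # g_r(x) = p(x | omega_r) * P(omega_r); p(x) is the same for all classes
    g = [multivariate_normal.pdf(x, p["mean"], p["cov"]) * p["prior"] for p in params]
    return int(np.argmax(g))

print(bayes_classify([0.3, -0.1]))   # -> 0
print(bayes_classify([1.8, 2.2]))    # -> 1
```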
10 Classifier Training • Supervised approach ▫ Training set is given with features and associated class labels ▫ T = {(x_i, y_i)} ▫ Used to set the classifier parameters q • Learning methods should be inductive to generalize well ▫ Represent the entire feature space ▫ E.g. work even on unseen examples • Usually, larger datasets result in better generalization ▫ Some state-of-the-art classifiers use millions of examples ▫ Try to have enough samples to statistically cover the feature space • N-fold cross-validation/testing ▫ Divide the training data into a train and a validation set ▫ Only train using the training portion and check results on the validation set ▫ Can be used for “bootstrapping” or to select the best parameters after partitioning the data N times
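A minimal N-fold cross-validation sketch; scikit-learn and the synthetic data are assumptions for illustration, not tools named on the slide.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy supervised training set T = {(x_i, y_i)} with two well-separated classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])  # features
y = np.array([0] * 50 + [1] * 50)                                      # class labels

clf = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(clf, X, y, cv=5)        # 5 train/validation partitions
print(scores.mean())                             # average validation accuracy
```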
11 Classifier Learning • Probability density estimation ▫ Estimate the probability densities p(x|ω_s) and priors P(ω_s) • Parametric learning ▫ Typically, the shape of the distribution p(x|ω_s) is known but its parameters must be learned E.g. Gaussian mixture model ▫ Prefer to select a distribution family that can be efficiently estimated, such as Gaussians ▫ Prior estimation by relative frequency P(ω_s) = K_s / K Number of objects in class ω_s over the total number of objects in the training database
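A minimal parametric-learning sketch for a single class: fit a Gaussian to that class's training samples and estimate its prior by relative frequency. The sample data and counts are toy values.

```python
import numpy as np

# Fit a Gaussian to the training samples of class omega_s and estimate its
# prior by relative frequency P(omega_s) = K_s / K.
rng = np.random.default_rng(1)
X_s = rng.normal(loc=[1.0, 2.0], scale=0.5, size=(40, 2))   # samples from class omega_s
K_s, K = len(X_s), 100                                      # class count / total training count

mean_s = X_s.mean(axis=0)                 # Gaussian mean for p(x | omega_s)
cov_s = np.cov(X_s, rowvar=False)         # Gaussian covariance for p(x | omega_s)
prior_s = K_s / K                         # P(omega_s) by relative frequency

print(mean_s, prior_s)
```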
12 Support Vector Machines (SVM) • Maybe the most popular classifier in CV today • SVM is an optimal classifier for the separable two-class problem ▫ Maximizes the margin (separation) between the two classes, which helps generalization and avoids overfitting ▫ Relaxed constraints for non-separable classes ▫ Can use the kernel trick to provide non-linear separating hyper-surfaces • Support vectors – vectors from each class that are closest to the discriminating surface ▫ Define the margin • Rather than explicitly model the likelihood, search directly for the discrimination function ▫ Don’t waste time modeling densities when the class label is all we need
13 SVM Insight • SVM is designed for binary classification of linearly separable classes • Input x is n-dimensional (scaled to [0,1] to normalize) and the class label is y ∈ {−1, 1} • Discrimination between classes is defined by a hyperplane such that no training samples are misclassified ▫ w ⋅ x + b = 0 w – plane normal, b – offset ▫ Optimization finds the “best” separating hyperplane
14 SVM Power • Final discrimination function ▫ f(x) = w ⋅ x + b • Re-written using the training data ▫ f(x) = Σ_{i∈SV} α_i y_i (x_i ⋅ x) + b α_i – weight of support vector x_i ▫ Only need to keep the support vectors for classification • Kernel trick ▫ Replace x_i ⋅ x with a non-linear mapping kernel k(x_i, x) = Φ(x_i) ⋅ Φ(x) ▫ For specific kernels this can be computed efficiently without explicitly evaluating the mapping Φ Can even map into an infinite-dimensional space ▫ Allows linear separation in a higher-dimensional space
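A minimal kernel-SVM sketch on synthetic data; scikit-learn's SVC (which wraps LibSVM) and the RBF kernel choice are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two-class toy problem with labels in {-1, +1}, classified with a kernel SVM.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0)                 # kernel trick: RBF instead of x_i . x
clf.fit(X, y)
print(len(clf.support_vectors_))               # only the support vectors define f(x)
print(clf.predict([[0.2, 0.1], [2.1, 1.9]]))   # expected: [-1  1]
```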
15 SVM Resources • More detailed treatment can be found in ▫ Duda, Hart, Stork, “Pattern Classification” • Lecture notes from Nuno Vasconcelos (UCSD) ▫ http://www.svcl.ucsd.edu/courses/ece271B-F09/handouts/SVMs.pdf • SVM software ▫ LibSVM [link] ▫ SVMLight [link]
16 Cluster Analysis • Unsupervised learning method that does not require labeled training data • Divide the training set into subsets (clusters) based on the mutual similarity of subset elements ▫ Similar objects are in a single cluster, dissimilar objects in separate clusters • Clustering can be performed hierarchically or non-hierarchically • Hierarchical clustering ▫ Agglomerative – each sample starts as its own cluster and clusters are merged ▫ Divisive – the whole dataset starts as a single cluster and is divided • Non-hierarchical clustering ▫ Parametric approaches – assume a known class-conditional distribution (similar to classifier learning) ▫ Non-parametric approaches – avoid a strict definition of the distribution
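A minimal sketch contrasting agglomerative (hierarchical) and k-means (non-hierarchical) clustering on unlabeled toy data; scikit-learn is an assumed dependency.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Unlabeled toy data drawn from two well-separated groups.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(2, 0.3, (30, 2))])

agglo = AgglomerativeClustering(n_clusters=2).fit_predict(X)   # merge samples bottom-up
kmeans = KMeans(n_clusters=2, n_init=10).fit_predict(X)        # partition around centroids
print(np.bincount(agglo), np.bincount(kmeans))                 # cluster sizes
```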