Bayes Decision Theory - I Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A – Winter 2012 - UCSD
Statistical Learning from Data
• Goal: given a relationship between a feature vector x and a vector y, and iid data samples (x_i, y_i), find an approximating function f such that ŷ = f(x) ≈ y.
• This is called training or learning.
• Two major types of learning:
  – Unsupervised Classification (aka Clustering) or Regression ("blind" curve fitting): only X is known.
  – Supervised Classification or Regression: both X and the target value Y are known during training; only X is known at test time.
Nearest Neighbor Classifier
• The simplest possible classifier that one could think of:
  – It consists of assigning to a new, unclassified vector the same class label as that of the closest vector in the labeled training set.
  – E.g., to classify the unlabeled point "Red": measure Red's distance to all labeled training points. If the closest point to Red is labeled "A = square", assign Red to class A; otherwise assign Red to the "B = circle" class.
• This works a lot better than one might expect, particularly if there are a lot of labeled training points.
Nearest Neighbor Classifier
• To define this classification procedure rigorously, define:
  – a training set D = {(x_1, y_1), ..., (x_n, y_n)}
  – x_i is a vector of observations, y_i is the class label
  – a new vector x to classify
• The decision rule is: set y = y_{i*}, where i* = argmin_{i ∈ {1,...,n}} d(x, x_i)
  – argmin means: "the i that minimizes the distance"
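A minimal sketch of this decision rule in Python (assuming numpy is available; the toy training set and labels below are made up purely for illustration):

```python
import numpy as np

def nn_classify(x, X_train, y_train):
    """Assign to x the label of its closest training point (Euclidean metric)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # d(x, x_i) for all i
    i_star = np.argmin(dists)                     # i* = argmin_i d(x, x_i)
    return y_train[i_star]

# toy usage: two classes, "A" = squares, "B" = circles
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array(["A", "A", "B", "B"])
print(nn_classify(np.array([0.5, 0.2]), X_train, y_train))   # -> "A"
```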
Metrics
• We have seen some examples:
  – R^d:
      inner product: ⟨x, y⟩ = x^T y = Σ_{i=1}^d x_i y_i
      Euclidean norm (norm² = "energy"): ||x||^2 = x^T x = Σ_{i=1}^d x_i^2
      Euclidean distance: d^2(x, y) = Σ_{i=1}^d (x_i - y_i)^2
  – Continuous functions:
      inner product: ⟨f, g⟩ = ∫ f(x) g(x) dx
      norm² = "energy": ||f||^2 = ∫ f(x)^2 dx
      distance² = "energy" of the difference: d^2(f, g) = ∫ [f(x) - g(x)]^2 dx
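These formulas can be sanity-checked numerically; the sketch below (assuming numpy) computes the R^d quantities directly and approximates the function-space inner product by a Riemann sum on a grid. All numbers are illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([0.0, 1.0, 2.0])

inner  = x @ y                         # <x, y> = sum_i x_i y_i
energy = x @ x                         # ||x||^2 = sum_i x_i^2
dist   = np.sqrt(np.sum((x - y)**2))   # d(x, y)

# crude approximation of <f, g> = integral of f(x) g(x) dx over [0, 1]
t  = np.linspace(0.0, 1.0, 1001)
f  = np.sin(2 * np.pi * t)
g  = np.cos(2 * np.pi * t)
fg = np.sum(f * g) * (t[1] - t[0])     # ~ 0: sin and cos are orthogonal

print(inner, energy, dist, fg)
```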
Euclidean distance
• We considered in detail the Euclidean distance: d(x, y) = sqrt( Σ_{i=1}^d (x_i - y_i)^2 )
• Which points are equidistant from x? Those satisfying d^2(x, y) = Σ_{i=1}^d (x_i - y_i)^2 = r^2
  – e.g., in 2D: (x_1 - y_1)^2 + (x_2 - y_2)^2 = r^2
• The points equidistant from x lie on spheres centered at x.
• Why would we need any other metric?
Inner Products
• Fish example:
  – features are L = fish length, W = scale width
  – measure L in meters and W in millimeters
      typical L: 0.70 m for salmon, 0.40 m for sea-bass
      typical W: 35 mm for salmon, 40 mm for sea-bass
  – I have three fish: F1 = (.7, 35), F2 = (.4, 40), F3 = (.75, 37.8)
      F1 is clearly a salmon, F2 clearly a sea-bass, F3 looks like a salmon, yet d(F1, F3) = 2.8 > d(F2, F3) = 2.23
  – There seems to be something wrong here.
  – But if scale width is also measured in meters: F1 = (.7, .035), F2 = (.4, .040), F3 = (.75, .0378), and now d(F1, F3) = .05 < d(F2, F3) = 0.35, which seems right: the units are commensurate.
Inner Product
• Suppose the scale width is also measured in meters:
  – for the three fish F1 = (.7, .035), F2 = (.4, .040), F3 = (.75, .0378) we now have d(F1, F3) = .05 < d(F2, F3) = 0.35, which seems right.
• The problem is that the Euclidean distance depends on the units (or scaling) of each axis.
  – e.g., if I multiply the second coordinate by 1,000: d'^2(x, y) = (x_1 - y_1)^2 + 1,000,000 (x_2 - y_2)^2
  – The 2nd coordinate's influence on the distance increases 1,000-fold!
• Often the "right" units are not clear (e.g., car speed vs. weight).
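As a numeric check of this unit dependence, the following sketch (assuming numpy) recomputes the fish distances from the example above in both unit systems:

```python
import numpy as np

def d(a, b):
    """Euclidean distance between two feature vectors."""
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

# length in meters, scale width in millimeters
F1, F2, F3 = (0.7, 35.0), (0.4, 40.0), (0.75, 37.8)
print(d(F1, F3), d(F2, F3))   # ~2.80 > ~2.23: F3 looks closer to the sea-bass

# both coordinates in meters
F1, F2, F3 = (0.7, 0.035), (0.4, 0.040), (0.75, 0.0378)
print(d(F1, F3), d(F2, F3))   # ~0.05 < ~0.35: F3 is now closer to the salmon
```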
Inner Products
• We need to work with the "right", or at least "better", units.
• Apply a transformation to get a "better" feature space: x' = Ax
• Examples:
  – Taking A = R, with R proper and orthogonal, is equivalent to a rotation.
  – Another important special case is scaling (A = S, for S diagonal):
      with S = diag(s_1, ..., s_n), Sx = (s_1 x_1, ..., s_n x_n)^T
  – We can combine these two transformations by taking A = SR.
Inner Products
• What is the Euclidean inner product in the transformed space?
    ⟨x', y'⟩ = x'^T y' = (Ax)^T (Ay) = x^T A^T A y = x^T M y, with M = A^T A
• So using the weighted inner product ⟨x, y⟩_M = x^T M y in the original space is equivalent to working in the transformed space.
• More generally, what is a "good" M?
  – Let the data tell us!
  – One possibility is to take M to be the inverse of the covariance matrix:
      d^2(x, y) = (x - y)^T Σ^{-1} (x - y)
• This is the Mahalanobis distance.
  – This distance is adapted to the data "scatter" and thereby yields "natural" units under a Gaussian assumption.
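A minimal sketch of the Mahalanobis distance with M taken as the inverse sample covariance (assuming numpy); the synthetic data matrix and test point below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[4.0, 1.5], [1.5, 1.0]], size=500)

Sigma     = np.cov(X, rowvar=False)     # sample covariance estimated from the data
Sigma_inv = np.linalg.inv(Sigma)        # M = Sigma^{-1}

def mahalanobis2(x, y, M):
    """d^2(x, y) = (x - y)^T M (x - y)"""
    diff = x - y
    return diff @ M @ diff

mu = X.mean(axis=0)
x  = np.array([2.0, 1.0])
print(mahalanobis2(x, mu, Sigma_inv))   # distance in the data's "natural" units
print(np.sum((x - mu)**2))              # plain squared Euclidean, for contrast
```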
The multivariate Gaussian
• Using the Mahalanobis distance = assuming Gaussian data.
• Mahalanobis distance:  d^2(x, μ) = (x - μ)^T Σ^{-1} (x - μ)
  Gaussian:  P_X(x) = 1 / sqrt( (2π)^d |Σ| ) exp( -1/2 (x - μ)^T Σ^{-1} (x - μ) )
  – Points of high probability are those at small distance to the center of the data distribution (the mean).
  – Thus the Mahalanobis distance can be interpreted as the "right" norm for a certain type of non-Cartesian space.
The multivariate Gaussian
• For Gaussian data, the Mahalanobis distance tells us all we could possibly know statistically about the data:
  – The pdf for a d-dimensional Gaussian of mean μ and covariance Σ is
      P_X(x) = 1 / sqrt( (2π)^d |Σ| ) exp( -1/2 (x - μ)^T Σ^{-1} (x - μ) )
  – This is equivalent to
      P_X(x) = (1/K) exp( -1/2 d^2(x, μ) ),
    which is the exponential of the negative squared Mahalanobis distance, up to a constant scaling factor. The constant K = sqrt((2π)^d |Σ|) is needed only to ensure that the pdf integrates to 1.
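A short sketch (assuming numpy and scipy are available) checking that the density is exp(-d²/2)/K with K = sqrt((2π)^d |Σ|), by comparing against scipy's implementation; the mean, covariance, and test point are made-up values:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu    = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])
x     = np.array([0.5, 0.0])

d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)        # squared Mahalanobis distance
K  = np.sqrt((2 * np.pi) ** len(mu) * np.linalg.det(Sigma))
p  = np.exp(-0.5 * d2) / K

print(p, multivariate_normal(mu, Sigma).pdf(x))        # the two values should agree
```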
“Optimal” Classifiers
• Some metrics are “better” than others.
• The meaning of “better” is connected to how well adapted the metric is to the properties of the data.
• Can we be more rigorous? Can we have an “optimal” metric? What could we mean by “optimal”?
• To talk about optimality we start by defining a cost or loss L(y, ŷ) for a predictor ŷ = f(x):
  – The cost is a real-valued loss function that we want to minimize.
  – It depends on the true y and the prediction ŷ.
  – The value of the cost tells us how good our predictor f(·) is.
Loss Functions for Classification
• Classification problem: the loss is a function of classification errors.
  – What types of errors can we have?
  – Two types: false positives and false negatives. Consider a face detection problem: if you see an image that is not a face and say “face”, you have a false positive; if you see a face and say “non-face”, you have a false negative (a miss).
  – Obviously, we have similar sub-classes for non-errors: true positives and true negatives.
  – The positive/negative part reflects what we say (predict).
  – The true/false part reflects the reality of the situation.
Loss Functions
• Are some errors more important than others?
  – It depends on the problem.
  – Consider a snake looking for lunch. The snake likes to eat frogs, but dart frogs are highly poisonous. The snake must classify each frog that it sees, Y ∈ {“dart”, “regular”}.
  – The losses are clearly different:

      prediction \ frog     dart    regular
      “regular”             ∞       0
      “dart”                0       10
Loss Functions
• But not all snakes are the same.
  – The one to the right is a dart frog predator. It also classifies each frog it sees, Y ∈ {“dart”, “regular”}, but it actually prefers to eat dart frogs and thus might pass up a regular frog in its search for a tastier meal. However, other frogs are OK to eat too:

      prediction \ frog     dart    regular
      “regular”             10      0
      “dart”                0       10
(Conditional) Risk as Average Cost
• Given a loss function, denote the cost of classifying a data vector x generated from class j as class i by L(j, i).
• Conditioned on an observed data vector x, to measure how good the classifier is on average if one (always) decides i, use the (conditional) expected value of the loss, aka the (data-conditional) Risk:

      R(x, i) := E[ L(Y, i) | x ] = Σ_j L(j, i) P_{Y|X}(j | x)

• This means that the risk of classifying x as i is equal to
  – the sum, over all classes j, of the cost of classifying x as i when the truth is j, times the conditional probability that the true class is j (where the conditioning is on the observed value of x).
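A minimal sketch of this conditional risk computation (assuming numpy), using the regular snake's loss table from above; the posterior probabilities are hypothetical numbers chosen only to illustrate the sum:

```python
import numpy as np

classes = ["dart", "regular"]

# L[j][i] = loss of deciding i when the true class is j (regular snake's table)
L = {"dart":    {"regular": np.inf, "dart": 0.0},
     "regular": {"regular": 0.0,    "dart": 10.0}}

# hypothetical posterior P_{Y|X}(j | x) for the frog the snake is looking at
post = {"dart": 0.1, "regular": 0.9}

# R(x, i) = sum_j L(j, i) P(j | x), for each possible decision i
risk = {i: sum(L[j][i] * post[j] for j in classes) for i in classes}
print(risk)                         # {'dart': 9.0, 'regular': inf}

best = min(risk, key=risk.get)      # decision that minimizes the conditional risk
print(best)                         # "dart": never risk eating the poisonous frog
```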