

  1. Bayes Decision Theory - I Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A – Winter 2012 - UCSD

  2. Statistical Learning from Data • Goal: Given a relationship between a feature vector x and a vector y, and i.i.d. data samples (x_i, y_i), find an approximating function f(x) ≈ y, i.e. a predictor ŷ = f(x). This is called training or learning. • Two major types of learning: – Unsupervised Classification (aka Clustering) or Regression (“blind” curve fitting): only X is known. – Supervised Classification or Regression: both X and the target value Y are known during training; only X is known at test time.

  3. Nearest Neighbor Classifier • The simplest possible classifier that one could think of: – It consists of assigning to a new, unclassified vector the same class label as that of the closest vector in the labeled training set – E.g., to classify the unlabeled point “Red”: measure Red’s distance to all the labeled training points; if the closest point to Red is labeled “A = square”, assign Red to class A; otherwise assign Red to the “B = circle” class • This works a lot better than one might expect, particularly if there are many labeled training points

  4. Nearest Neighbor Classifier • To define this classification procedure rigorously, define: – a training set D = {(x_1, y_1), …, (x_n, y_n)} – x_i is a vector of observations, y_i is the class label – a new vector x to classify • The decision rule is: set y = y_{i*}, where i* = argmin_{i ∈ {1, …, n}} d(x, x_i) – argmin means: “the i that minimizes the distance”
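The decision rule above is short enough to sketch directly. A minimal 1-NN implementation, assuming NumPy and the Euclidean distance (the function name and toy data are illustrative, not from the slides):

```python
import numpy as np

def nearest_neighbor_classify(x, X_train, y_train):
    """Assign x the label of its closest training point: y = y_{i*}."""
    # d(x, x_i) = ||x - x_i|| for every training vector x_i
    dists = np.linalg.norm(X_train - x, axis=1)
    i_star = np.argmin(dists)          # i* = argmin_i d(x, x_i)
    return y_train[i_star]

# Toy training set D: class "A" clustered near (0,0), class "B" near (5,5)
X_train = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["A", "A", "B", "B"])

label = nearest_neighbor_classify(np.array([0.3, 0.1]), X_train, y_train)
```

Note that the classifier stores the entire training set and defers all work to test time; there is no training step beyond memorizing D.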

  5. Metrics • We have seen some examples: – R^d — Inner product: ⟨x, y⟩ = x^T y = Σ_{i=1}^d x_i y_i; Euclidean norm (norm² = “energy”): ||x|| = √(x^T x) = √(Σ_{i=1}^d x_i²); Euclidean distance (distance² = “energy” of the difference): d(x, y) = ||x − y|| = √(Σ_{i=1}^d (x_i − y_i)²) – Continuous functions — Inner product: ⟨f(·), g(·)⟩ = ∫ f(x) g(x) dx; norm² = ‘energy’: ||f||² = ∫ f(x)² dx; distance² = ‘energy’ of the difference: d²(f, g) = ∫ [f(x) − g(x)]² dx

  6. Euclidean distance • We considered in detail the Euclidean distance d(x, y) = √(Σ_{i=1}^d (x_i − y_i)²) • Equidistant points to x? d(x, y) = r ⇔ Σ_{i=1}^d (x_i − y_i)² = r² – E.g., in 2D: (x_1 − y_1)² + (x_2 − y_2)² = r² • The points equidistant to x lie on spheres around x • Why would we need any other metric?

  7. Inner Products • Fish example: – features are L = fish length, W = scale width – measure L in meters and W in millimeters; typical L: 0.70 m for salmon, 0.40 m for sea bass; typical W: 35 mm for salmon, 40 mm for sea bass – I have three fish: F1 = (0.7, 35), F2 = (0.4, 40), F3 = (0.75, 37.8) – F1 is clearly a salmon, F2 clearly a sea bass, and F3 looks like a salmon – yet d(F1, F3) = 2.8 > d(F2, F3) = 2.23 – there seems to be something wrong here – but if scale width is also measured in meters: F1 = (0.7, 0.035), F2 = (0.4, 0.040), F3 = (0.75, 0.0378), and now d(F1, F3) = 0.05 < d(F2, F3) = 0.35 – which seems right – the units are commensurate

  8. Inner Product • Suppose the scale width is also measured in meters: – I have three fish: F1 = (0.7, 0.035), F2 = (0.4, 0.040), F3 = (0.75, 0.0378), and now d(F1, F3) = 0.05 < d(F2, F3) = 0.35 – which seems right • The problem is that the Euclidean distance depends on the units (or scaling) of each axis – e.g., if I multiply the second coordinate by 1,000: d'(x, y) = √((x_1 − y_1)² + 1,000,000 (x_2 − y_2)²) – the 2nd coordinate’s influence on the distance increases 1,000-fold! • Often the “right” units are not clear (e.g., car speed vs. weight)
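The unit sensitivity described above is easy to reproduce numerically. A short sketch, assuming NumPy, using the slide's fish measurements (the variable names are illustrative):

```python
import numpy as np

def euclidean(a, b):
    """Plain Euclidean distance d(x, y) = ||x - y||."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Scale width W in millimeters: incommensurate with length L in meters
F1, F2, F3 = (0.70, 35.0), (0.40, 40.0), (0.75, 37.8)
d13_mm = euclidean(F1, F3)   # ~2.80: F3 looks farther from the salmon F1...
d23_mm = euclidean(F2, F3)   # ~2.23: ...than from the sea bass F2

# Scale width W in meters: commensurate units
G1, G2, G3 = (0.70, 0.035), (0.40, 0.040), (0.75, 0.0378)
d13_m = euclidean(G1, G3)    # ~0.05: F3 is now closest to the salmon F1
d23_m = euclidean(G2, G3)    # ~0.35
```

The ordering of the two distances flips with the change of units, which is exactly the problem the next slides address by transforming the feature space.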

  9. Inner Products • We need to work with the “right”, or at least “better”, units • Apply a transformation to get a “better” feature space: x' = Ax • Examples: – Taking A = R, with R proper and orthogonal, is equivalent to a rotation – Another important special case is scaling (A = S, for S diagonal): Sx = diag(σ_1, …, σ_n)(x_1, …, x_n)^T = (σ_1 x_1, …, σ_n x_n)^T – We can combine these two transformations by taking A = SR
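The two special cases above can be checked numerically. A small sketch, assuming NumPy; the angle and scale factors are arbitrary illustrative choices:

```python
import numpy as np

theta = np.pi / 6                                   # 30-degree rotation (illustrative)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])     # proper orthogonal: R^T R = I
S = np.diag([1.0, 1000.0])                          # per-axis rescaling, e.g. m -> mm
A = S @ R                                           # combined transformation x' = A x

x = np.array([0.70, 0.035])
x_prime = A @ x

# A rotation preserves Euclidean length; a non-trivial scaling generally does not
rotation_preserves_length = np.isclose(np.linalg.norm(R @ x), np.linalg.norm(x))
scaling_preserves_length = np.isclose(np.linalg.norm(x_prime), np.linalg.norm(x))
```

This is why scaling (not rotation) is what changes the "units" of the space.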

  10. Inner Products • What is the Euclidean inner product in the transformed space? ⟨x', y'⟩ = ⟨Ax, Ay⟩ = (Ax)^T (Ay) = x^T (A^T A) y = x^T M y, with M = A^T A • Using the weighted inner product ⟨x, y⟩_M = x^T M y in the original space is thus equivalent to working in the transformed space • More generally, what is a “good” M? – Let the data tell us! – One possibility is to take M to be the inverse of the covariance matrix Σ: d²(x, y) = (x − y)^T Σ^{−1} (x − y) • This is the Mahalanobis distance – This distance is adapted to the data “scatter” and thereby yields “natural” units under a Gaussian assumption
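A sketch of "letting the data tell us" M, assuming NumPy; the synthetic elongated data set is purely illustrative:

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared weighted distance d^2(x, y) = (x - y)^T M (x - y)."""
    diff = np.asarray(x) - np.asarray(y)
    return float(diff @ M @ diff)

# Synthetic data with very different spread per axis: std ~2 along axis 0, ~0.1 along axis 1
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2)) @ np.diag([2.0, 0.1])
M = np.linalg.inv(np.cov(data, rowvar=False))   # M = Sigma^{-1}

mu = data.mean(axis=0)
# Two points at the SAME Euclidean distance (1.0) from the mean...
a = mu + np.array([1.0, 0.0])   # along the high-variance axis
b = mu + np.array([0.0, 1.0])   # along the low-variance axis
# ...but a is "close" in Mahalanobis terms while b is far
d_a = mahalanobis_sq(a, mu, M)
d_b = mahalanobis_sq(b, mu, M)
```

The distance effectively measures displacement in units of standard deviations per direction, which is the "natural units" remark on the slide.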

  11. The multivariate Gaussian • Using the Mahalanobis distance = assuming Gaussian data • Mahalanobis distance: d²(x, μ) = (x − μ)^T Σ^{−1} (x − μ); Gaussian: P_X(x) = (1 / √((2π)^d |Σ|)) exp( −½ (x − μ)^T Σ^{−1} (x − μ) ) – Points of high probability are those at small distance from the center of the data distribution (the mean) – Thus the Mahalanobis distance can be interpreted as the “right” norm for a certain type of non-Cartesian space

  12. The multivariate Gaussian • For Gaussian data, the Mahalanobis distance tells us all we could possibly know statistically about the data: – The pdf of a d-dimensional Gaussian with mean μ and covariance Σ is P_X(x) = (1 / √((2π)^d |Σ|)) exp( −½ (x − μ)^T Σ^{−1} (x − μ) ) – This is equivalent to P_X(x) = (1/K) exp( −½ d²(x, μ) ), which is the exponential of the negative Mahalanobis distance-squared, up to a constant scaling factor K – The constant K = √((2π)^d |Σ|) is needed only to ensure that the pdf integrates to 1
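The pdf above translates directly into code. A minimal sketch, assuming NumPy; the mean and covariance values are illustrative:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density written via the Mahalanobis distance:
    P_X(x) = (1/K) exp(-0.5 * d^2(x, mu)), with K = sqrt((2 pi)^d |Sigma|)."""
    d = len(mu)
    diff = x - mu
    maha_sq = diff @ np.linalg.inv(Sigma) @ diff           # d^2(x, mu)
    K = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))   # normalizing constant
    return np.exp(-0.5 * maha_sq) / K

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.0], [0.0, 0.5]])   # |Sigma| = 1, so K = 2*pi here
p_center = gaussian_pdf(mu, mu, Sigma)       # density peaks at the mean
p_far = gaussian_pdf(mu + 3.0, mu, Sigma)    # larger Mahalanobis distance, lower density
```

The density depends on x only through d²(x, μ), which is the slide's point: for Gaussian data, the Mahalanobis distance is a sufficient summary.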

  13. “Optimal” Classifiers • Some metrics are “better” than others • The meaning of “better” is connected to how well adapted the metric is to the properties of the data • Can we be more rigorous? Can we have an “optimal” metric? What could we mean by “optimal”? • To talk about optimality we start by defining a cost or loss: given a predictor ŷ = f(x), the loss is L(y, ŷ) – The cost is a real-valued loss function that we want to minimize – It depends on the true y and the prediction ŷ = f(x) – The value of the cost tells us how good our predictor is

  14. Loss Functions for Classification • Classification problem: the loss is a function of classification errors – What types of errors can we have? – Two types: false positives and false negatives – Consider a face detection problem: if you say “face” for an image that contains no face, you have a false positive; if you say “non-face” for an image that contains a face, you have a false negative (a miss) – Obviously, we have similar sub-classes for non-errors: true positives and true negatives – The positive/negative part reflects what we say (predict) – The true/false part reflects the reality of the situation

  15. Loss Functions • Are some errors more important than others? – Depends on the problem – Consider a snake looking for lunch: the snake likes to eat frogs, but dart frogs are highly poisonous – The snake must classify each frog that it sees, Y ∈ {“dart”, “regular”} – The losses are clearly different:

  snake prediction     frog = dart     frog = regular
  “regular” (eat)      ∞ (poisoned)    0
  “dart” (pass up)     0               10

  16. Loss Functions • But not all snakes are the same – The one to the right is a dart frog predator – It too can classify each frog it sees, Y ∈ {“dart”, “regular”}, but it actually prefers to eat dart frogs, and thus it might pass up a regular frog in its search for a tastier meal – However, other frogs are OK to eat too:

  snake prediction     dart frog     regular frog
  “regular”            10            0
  “dart”               0             10

  17. (Conditional) Risk as Average Cost • Given a loss function, denote the cost of classifying a data vector x generated from class j as class i by L[j → i] • Conditioned on an observed data vector x, to measure how good the classifier is on average if one (always) decides i, use the (conditional) expected value of the loss, aka the (data-conditional) Risk: R(x, i) ≝ E{ L[Y → i] | x } = Σ_j L[j → i] P_{Y|X}(j | x) • This means that the risk of classifying x as i is equal to – the sum, over all classes j, of the cost of classifying x as i when the truth is j, times the conditional probability that the true class is j (where the conditioning is on the observed value of x)
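The risk sum above can be sketched in a few lines. This example, assuming NumPy, uses the dart-frog-predator loss matrix from slide 16; the posterior probabilities are hypothetical values chosen for illustration:

```python
import numpy as np

# L[j, i]: cost of deciding class i when the true class is j,
# with classes ordered as ["dart", "regular"] (dart-frog-predator losses, slide 16)
L = np.array([[0.0, 10.0],    # truth = dart:    deciding "dart" costs 0, "regular" costs 10
              [10.0, 0.0]])   # truth = regular: deciding "dart" costs 10, "regular" costs 0

def conditional_risk(i, posterior, L):
    """R(x, i) = sum_j L[j -> i] * P_{Y|X}(j | x)."""
    return float(np.dot(L[:, i], posterior))

# Hypothetical posterior given x: the observed frog is a dart frog with probability 0.7
posterior = np.array([0.7, 0.3])
risk_dart = conditional_risk(0, posterior, L)      # 0.7*0  + 0.3*10 = 3.0
risk_regular = conditional_risk(1, posterior, L)   # 0.7*10 + 0.3*0  = 7.0
```

Deciding "dart" has the lower conditional risk here, which previews the Bayes decision rule: pick the class i that minimizes R(x, i).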
