  1. Bayes Decision Theory - II Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175 – Winter 2012 - UCSD

  2. Nearest Neighbor Classifier
  • We are considering supervised classification.
  • Nearest Neighbor (NN) Classifier:
    – a training set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$
    – $x_i$ is a vector of observations, $y_i$ is the corresponding class label
    – a vector $x$ to classify
  • The "NN decision rule" is: set $y = y_{i^*}$, where $i^* = \arg\min_{i \in \{1, \dots, n\}} d(x, x_i)$
    – argmin means: "the $i$ that minimizes the distance"
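
A minimal sketch of the NN rule in code, assuming NumPy and the Euclidean distance as the metric $d$ (the function names and toy data are illustrative, not from the slides):

```python
import numpy as np

def nn_classify(x, X_train, y_train):
    """NN rule: return the label of the training point closest to x."""
    distances = np.linalg.norm(X_train - x, axis=1)  # d(x, x_i) for every training point
    i_star = np.argmin(distances)                    # i* = argmin_i d(x, x_i)
    return y_train[i_star]                           # set y = y_{i*}

# Toy usage: two classes in 2-D.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y_train = np.array([0, 0, 1, 1])
print(nn_classify(np.array([0.8, 0.9]), X_train, y_train))  # -> 1
```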

  3. Optimal Classifiers
  • We have seen that performance depends on the metric.
  • Some metrics are "better" than others.
  • The meaning of "better" is connected to how well adapted the metric is to the properties of the data.
  • But can we be more rigorous? What do we mean by optimal?
  • To talk about optimality we define a predictor $\hat{y} = f(x)$ and a cost or loss function $L(y, \hat{y})$.
    – Loss is the function that we want to minimize.
    – Loss depends on the true $y$ and the prediction $\hat{y}$.
    – Loss tells us how good the predictor $f(\cdot)$ is.

  4. Loss Functions & Classification Errors
  • Loss is a function of classification errors.
    – What errors can we have?
    – Two types: false positives and false negatives.
      · Consider a face detection problem (decide "face" or "non-face").
      · If you see a non-face and say "face", you have a false positive (a false alarm); if you see a face and say "non-face", you have a false negative (a miss, a failure to detect).
    – Obviously, we have corresponding sub-classes for non-errors: true positives and true negatives.
    – The positive/negative part reflects what we say or decide.
    – The true/false part reflects the true class label ("true state of the world").

  5. (Conditional) Risk
  • To weigh different errors differently:
    – We introduce a loss function.
    – Denote the cost of classifying an $X$ from class $i$ as class $j$ by $L[i \to j]$.
    – One way to measure how good the classifier is: use the (data-conditional) expected value of the loss, aka the (conditional) risk,
      $R(x, i) = E\{L[Y \to i] \mid x\} = \sum_j L[j \to i]\, P_{Y|X}(j \mid x)$
    – Note that the (data-conditional) risk is a function of both the decision "decide class $i$" and the conditioning data (measured feature vector) $x$.
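
As a sketch of the formula above: if the loss is stored as a matrix with loss[i, j] = L[j → i] (an indexing convention assumed here, not fixed by the slides) and the posterior as a vector, the conditional risk is a dot product:

```python
import numpy as np

def conditional_risk(i, loss, posterior):
    """R(x, i) = sum_j L[j -> i] P_{Y|X}(j | x).

    loss[i, j] holds L[j -> i] (cost of deciding i when the true class is j);
    posterior[j] holds P_{Y|X}(j | x) for the observed x.
    """
    return float(np.dot(loss[i], posterior))

# Illustrative usage with the 0/1 loss on three classes: R(x, i) = 1 - P(i | x).
loss01 = 1.0 - np.eye(3)
posterior = np.array([0.25, 0.5, 0.25])
print([conditional_risk(i, loss01, posterior) for i in range(3)])  # [0.75, 0.5, 0.75]
```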

  6. Loss Functions
  • Example: two snakes and eating poisonous dart frogs
    – A regular snake will die if it eats a dart frog.
    – Frogs are a good snack for the predator dart-snake.
    – This leads to the losses $L[j \to i]$ (rows: decision, columns: true frog class):

        Regular snake    | dart frog | regular frog
        decide "regular" |     ∞     |      0
        decide "dart"    |     0     |     10

        Predator dart-snake | dart frog | regular frog
        decide "regular"    |    10     |      0
        decide "dart"       |     0     |     10

    – What is the optimal decision when the snakes find a frog like these?

  7. Minimum Risk Classification
  • We have seen that
    – if both snakes have
      $P_{Y|X}(\text{dart} \mid x) = 0$ and $P_{Y|X}(\text{regular} \mid x) = 1$,
      then both say "regular";
    – however, if
      $P_{Y|X}(\text{dart} \mid x) = 0.1$ and $P_{Y|X}(\text{regular} \mid x) = 0.9$,
      then the vulnerable snake says "dart" while the predator says "regular".
  • The infinite loss for saying "regular" when the frog is a dart frog makes the vulnerable snake much more cautious!

  8. BDR = Minimizing Conditional Risk
  • Note that the definition of risk
    – immediately defines the optimal classifier as the one that minimizes the conditional risk for a given observation $x$.
    – The optimal decision is the Bayes Decision Rule (BDR):
      $i^*(x) = \arg\min_i R(x, i) = \arg\min_i \sum_j L[j \to i]\, P_{Y|X}(j \mid x)$
    – The BDR yields the optimal (minimal) risk:
      $R^*(x) = R(x, i^*(x)) = \min_i \sum_j L[j \to i]\, P_{Y|X}(j \mid x)$
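
A sketch of the BDR as an argmin over conditional risks, exercised on the snake loss tables of slide 6 and the posteriors of slide 7 (the indexing convention loss[i, j] = L[j → i] and the large finite stand-in for the infinite loss are assumptions made for the code):

```python
import numpy as np

def bayes_decision(loss, posterior):
    """BDR: i*(x) = argmin_i sum_j L[j -> i] P_{Y|X}(j | x).

    loss[i, j] = L[j -> i]; posterior[j] = P_{Y|X}(j | x).
    Returns the index of the minimum-risk decision.
    """
    risks = loss @ posterior          # R(x, i) for every candidate decision i
    return int(np.argmin(risks))

classes = ["dart", "regular"]         # order used for both decisions and true labels
BIG = 1e9                             # finite stand-in for the infinite loss

# Rows: decision ("dart", "regular"); columns: true frog ("dart", "regular").
loss_regular_snake = np.array([[0.0, 10.0], [BIG, 0.0]])
loss_predator      = np.array([[0.0, 10.0], [10.0, 0.0]])

for posterior in (np.array([0.0, 1.0]), np.array([0.1, 0.9])):
    reg  = classes[bayes_decision(loss_regular_snake, posterior)]
    pred = classes[bayes_decision(loss_predator, posterior)]
    print(f"P(dart|x)={posterior[0]:.1f}: regular snake says {reg!r}, predator says {pred!r}")
# P(dart|x)=0.0: regular snake says 'regular', predator says 'regular'
# P(dart|x)=0.1: regular snake says 'dart',    predator says 'regular'
```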

  9. What is a Decision Rule?
  • Consider the $c$-ary classification problem with class labels $C = \{1, \dots, c\}$.
  • Given an observation (feature) $x$ to be classified, a decision rule is a function $d = d(\cdot)$ of the observation that takes its values in the set of class labels: $d(x) \in \{1, \dots, c\}$.
  • Note that $d^*(x) = i^*(x)$ defined on the previous slide is an optimal decision rule, in the sense that for a specific value of $x$ it minimizes the conditional risk $R(x, i)$ over all possible decisions $i$ in $C$.

  10. (d-Dependent) Total Average Risk
  • Given a decision rule $d$ and the conditional risk $R(x, i)$, we can consider the (d-dependent) conditional risk $R(x, d(x))$.
  • We can now define the total (d-dependent) expected or average risk (aka the d-risk):
    $R(d) = E\{ R(x, d(x)) \}$
    – Note that we have averaged over all possible measurements (features) $x$ that we might encounter in the world.
    – Note that $R(d)$ is a function of a function! (A function of $d$.)
    – The d-risk $R(d)$ is a measure of how we expect to perform on average when we use the fixed decision rule $d$ over and over again on a large set of real-world data.
    – It is natural to ask if there is an "optimal decision rule" which minimizes the average risk $R(d)$ over the class of all possible decision rules.
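
As a sketch of what R(d) means in practice: assuming labeled samples drawn from the joint distribution of (X, Y), iterated expectation gives R(d) = E{L[Y → d(X)]}, so a sample average of the loss estimates it (the function names and the synthetic data below are illustrative):

```python
import numpy as np

def empirical_risk(decision_rule, X, y, loss):
    """Estimate R(d) = E{ R(X, d(X)) } = E{ L[Y -> d(X)] } by a sample average.

    decision_rule maps an observation x to a class index;
    loss[i, j] = L[j -> i], the cost of deciding i when the true class is j.
    """
    decisions = np.array([decision_rule(x) for x in X])
    return float(np.mean(loss[decisions, y]))

# Illustrative usage with the 0/1 loss (the empirical risk is then just the error rate).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)                              # class labels
X = rng.normal(loc=y[:, None].astype(float), size=(500, 1))   # 1-D feature: x | y ~ N(y, 1)
loss01 = 1.0 - np.eye(2)
d = lambda x: int(x[0] > 0.5)                                 # a fixed decision rule d
print(empirical_risk(d, X, y, loss01))                        # roughly 0.3 for this setup
```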

  11. Minimizing the Average Risk R(d)
  • Optimizing the total risk $R(d)$ seems hard because we are trying to minimize it over a family of functions (decision rules) $d$.
  • However, since
    $R(d) = E\{R(x, d(x))\} = \int R(x, d(x))\, p_X(x)\, dx$, with $p_X(x) \ge 0$,
    one can equivalently minimize the data-conditional risk $R(x, d(x))$ point-wise in $x$.
  • I.e., solve for the value of the optimal decision rule at each $x$:
    $d^*(x) = \arg\min_{d(x)} R(x, d(x)) = \arg\min_i R(x, i)$
  • Thus $d^*(x) = i^*(x)$!! I.e., the BDR, which we already know optimizes the data-conditional risk, ALSO optimizes the average risk $R(d)$ over ALL possible decision rules $d$!!
  • This makes sense: if the BDR is optimal for every single situation $x$, it must be optimal on average over all $x$.

  12. The 0/1 Loss Function
  • An important special case of interest:
    – zero loss for no error and equal loss for the two error types.
  • This is the "zero/one" loss:
    $L[j \to i] = \begin{cases} 0, & i = j \\ 1, & i \ne j \end{cases}$
    which for the snake example is (rows: prediction, columns: true frog class):

        prediction | dart frog | regular frog
        regular    |     1     |      0
        dart       |     0     |      1

  • Under this loss the optimal Bayes decision rule (BDR) is
    $d^*(x) = i^*(x) = \arg\min_i \sum_j L[j \to i]\, P_{Y|X}(j \mid x) = \arg\min_i \sum_{j \ne i} P_{Y|X}(j \mid x)$

  13. 0/1 Loss yields MAP Decision Rule
  • Note that:
    $i^*(x) = \arg\min_i \sum_{j \ne i} P_{Y|X}(j \mid x) = \arg\min_i \left[ 1 - P_{Y|X}(i \mid x) \right] = \arg\max_i P_{Y|X}(i \mid x)$
  • Thus the optimal decision for the 0/1 loss is:
    – pick the class that is most probable given the observation $x$;
    – $i^*(x)$ is known as the Maximum a Posteriori Probability (MAP) solution.
  • This is also known as the Bayes Decision Rule (BDR) for the 0/1 loss.
    – We will often simplify our discussion by assuming this loss.
    – But you should always be aware that other losses may be used.
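
A quick numerical check of this equivalence (a sketch with illustrative random posteriors): under the 0/1 loss, the argmin of the conditional risk coincides with the argmax of the posterior.

```python
import numpy as np

rng = np.random.default_rng(1)
c = 4                                  # number of classes
loss01 = 1.0 - np.eye(c)               # L[j -> i] = 0 if i == j, 1 otherwise

for _ in range(1000):
    posterior = rng.dirichlet(np.ones(c))   # a random posterior P_{Y|X}(. | x)
    risks = loss01 @ posterior              # R(x, i) = 1 - P_{Y|X}(i | x)
    assert np.argmin(risks) == np.argmax(posterior)
print("argmin of the 0/1 risk == argmax of the posterior (MAP) on every trial")
```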

  14. BDR for the 0/1 Loss
  • Consider the evaluation of the BDR for the 0/1 loss:
    $i^*(x) = \arg\max_i P_{Y|X}(i \mid x)$
    – This is also called the Maximum a Posteriori Probability (MAP) rule.
    – It is usually not trivial to evaluate the posterior probabilities $P_{Y|X}(i \mid x)$.
    – This is due to the fact that we are trying to infer the cause (class $i$) from the consequence (observation $x$),
    – i.e. we are trying to solve a non-trivial inverse problem.
      · E.g., imagine that I want to evaluate $P_{Y|X}(\text{person} \mid \text{"has two eyes"})$.
      · This strongly depends on what the other classes are.

  15. Posterior Probabilities and Detection
  • If the two classes are "people" and "cars",
    – then $P_{Y|X}(\text{person} \mid \text{"has two eyes"}) = 1$.
  • But if the classes are "people" and "cats",
    – then $P_{Y|X}(\text{person} \mid \text{"has two eyes"}) = 1/2$, if there are equal numbers of cats and people to uniformly choose from [this is additional info!].
  • How do we deal with this problem?
    – We note that it is much easier to infer the consequence from the cause.
    – E.g., it is easy to infer that $P_{X|Y}(\text{"has two eyes"} \mid \text{person}) = 1$.
    – This does not depend on any other classes.
    – We do not need any additional information.
    – Given a class, just count the frequency of the observation.
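
A sketch of the last point for a discrete observation: estimate $P_{X|Y}(x \mid i)$ as the relative frequency of $x$ among the training samples of class $i$ (the toy data and names are illustrative):

```python
import numpy as np

def class_conditional(X, y, x_value, class_i):
    """Estimate P_{X|Y}(x_value | class_i) as a relative frequency within class_i."""
    in_class = (y == class_i)
    return float(np.mean(X[in_class] == x_value))

# Illustrative discrete data: X = number of eyes observed, Y = 0 (cat) or 1 (person).
X = np.array([2, 2, 2, 2, 2, 2, 2, 2])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(class_conditional(X, y, x_value=2, class_i=1))  # 1.0 -- does not depend on other classes
```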

  16. Bayes Rule
  • How do we go from $P_{X|Y}(x \mid j)$ to $P_{Y|X}(j \mid x)$?
  • We use Bayes rule:
    $P_{Y|X}(i \mid x) = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_X(x)}$
  • Consider the two-class problem, i.e. $Y = 0$ or $Y = 1$.
    – The BDR under 0/1 loss is
      $i^*(x) = \arg\max_i P_{Y|X}(i \mid x) = \begin{cases} 0, & \text{if } P_{Y|X}(0 \mid x) \ge P_{Y|X}(1 \mid x) \\ 1, & \text{if } P_{Y|X}(0 \mid x) < P_{Y|X}(1 \mid x) \end{cases}$
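
A sketch of Bayes rule in code for a discrete class variable, where the evidence $P_X(x)$ is obtained by summing likelihood × prior over the classes (the names are illustrative):

```python
import numpy as np

def posterior_from_bayes_rule(likelihoods, priors):
    """P_{Y|X}(i | x) = P_{X|Y}(x | i) P_Y(i) / P_X(x), with P_X(x) = sum_i P_{X|Y}(x | i) P_Y(i).

    likelihoods[i] = P_{X|Y}(x | i) at the observed x; priors[i] = P_Y(i).
    """
    joint = likelihoods * priors
    return joint / joint.sum()

# "Two eyes" example from the previous slide: classes (person, cat), uniform prior,
# and both classes always show two eyes.
likelihoods = np.array([1.0, 1.0])
priors = np.array([0.5, 0.5])
print(posterior_from_bayes_rule(likelihoods, priors))  # [0.5 0.5] -> P(person | "two eyes") = 1/2
```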

  17. BDR for 0/1 Loss Binary Classification
  • Pick "0" when $P_{Y|X}(0 \mid x) \ge P_{Y|X}(1 \mid x)$ and "1" otherwise.
  • Using Bayes rule on both sides of this inequality yields
    $\frac{P_{X|Y}(x \mid 0)\, P_Y(0)}{P_X(x)} \ge \frac{P_{X|Y}(x \mid 1)\, P_Y(1)}{P_X(x)}$
    – Noting that $P_X(x)$ is a non-negative quantity, this is the same as the rule: pick "0" when
      $P_{X|Y}(x \mid 0)\, P_Y(0) \ge P_{X|Y}(x \mid 1)\, P_Y(1)$,
      i.e.
      $i^*(x) = \arg\max_i P_{X|Y}(x \mid i)\, P_Y(i)$
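
A sketch of this final form of the binary BDR, with Gaussian class-conditional densities as an illustrative assumption (the slides do not specify the densities):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def bdr_binary(x, prior0=0.5, prior1=0.5):
    """Pick "0" when P_X|Y(x|0) P_Y(0) >= P_X|Y(x|1) P_Y(1), else "1".

    Class conditionals assumed for illustration: class 0 ~ N(0, 1), class 1 ~ N(2, 1).
    """
    like0 = gauss_pdf(x, 0.0, 1.0)
    like1 = gauss_pdf(x, 2.0, 1.0)
    return 0 if like0 * prior0 >= like1 * prior1 else 1

for x in (-1.0, 0.9, 1.1, 3.0):
    print(x, bdr_binary(x))
# With equal priors the decision boundary is at x = 1: "0" below it, "1" above it.
```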
