 
Bayes Decision Theory - II
Ken Kreutz-Delgado (Nuno Vasconcelos)
ECE 175 – Winter 2012 - UCSD
Nearest Neighbor Classifier
• We are considering supervised classification
• Nearest Neighbor (NN) Classifier
  – A training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
  – $x_i$ is a vector of observations, $y_i$ is the corresponding class label
  – a vector $x$ to classify
• The “NN Decision Rule” is: set $y = y_{i^*}$, where
  $$ i^* = \arg\min_{i \in \{1, \ldots, n\}} d(x, x_i) $$
  – argmin means: “the $i$ that minimizes the distance”
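The NN decision rule above can be sketched in a few lines of Python. This is only an illustration: the Euclidean metric, the function names `euclidean` and `nn_classify`, and the toy training set are assumptions made for the example, not part of the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nn_classify(x, train_x, train_y, dist=euclidean):
    """Return y_{i*}, where i* is the index of the training point closest to x."""
    i_star = min(range(len(train_x)), key=lambda i: dist(x, train_x[i]))
    return train_y[i_star]

# Toy training set (made-up values, for illustration only).
train_x = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
train_y = ["a", "a", "b"]

label = nn_classify((4.0, 4.5), train_x, train_y)
print(label)
```

Passing a different `dist` function changes the metric, which is exactly the design choice that the following slides question.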
Optimal Classifiers
• We have seen that performance depends on the metric
• Some metrics are “better” than others
• The meaning of “better” is connected to how well adapted the metric is to the properties of the data
• But can we be more rigorous? What do we mean by optimal?
• To talk about optimality we define a cost or loss
  $$ x \;\longrightarrow\; f(\cdot) \;\longrightarrow\; \hat{y} = f(x), \qquad L(y, \hat{y}) $$
  – Loss is the function that we want to minimize
  – Loss depends on the true $y$ and the prediction $\hat{y}$
  – Loss tells us how good our predictor is
Loss Functions & Classification Errors
• Loss is a function of classification errors
  – What errors can we have?
  – Two types: false positives and false negatives
    ▪ consider a face detection problem (decide “face” or “non-face”)
    ▪ if you see a non-face image and say “face”, you have a false positive (false alarm)
    ▪ if you see a face image and say “non-face”, you have a false negative (miss, failure to detect)
  – Obviously, we have corresponding sub-classes for non-errors
    ▪ true positives and true negatives
  – the positive/negative part reflects what we say or decide,
  – the true/false part reflects whether that decision agrees with the true class label (“true state of the world”)
(Conditional) Risk
• To weigh different errors differently
  – We introduce a loss function
  – Denote the cost of classifying X from class i as j by $L[i \to j]$
  – One way to measure how good the classifier is, is to use the (data-conditional) expected value of the loss, aka the (conditional) Risk,
    $$ R(x, i) = E\{ L[Y \to i] \mid x \} = \sum_j L[j \to i] \, P_{Y|X}(j \mid x) $$
  – Note that the (data-conditional) risk is a function of both the decision “decide class i” and the conditioning data (measured feature vector), x.
Loss Functions
• Example: two snakes and eating poisonous dart frogs
  – Regular snake will die
  – Frogs are a good snack for the predator dart-snake
  – This leads to the losses (rows: what the snake says; columns: what the frog is)

    Regular snake      dart frog    regular frog
    “regular”              ∞             0
    “dart”                 0            10

    Predator snake     dart frog    regular frog
    “regular”             10             0
    “dart”                 0            10

  – What is the optimal decision when the snakes find a frog like these?
Minimum Risk Classification
• We have seen that
  – if both snakes have
    $$ P_{Y|X}(j \mid x) = \begin{cases} 0, & j = \text{dart} \\ 1, & j = \text{regular} \end{cases} $$
    then both say “regular”
  – However, if
    $$ P_{Y|X}(j \mid x) = \begin{cases} 0.1, & j = \text{dart} \\ 0.9, & j = \text{regular} \end{cases} $$
    then the vulnerable snake says “dart” while the predator says “regular”
• Its infinite loss for saying “regular” when the frog is dart makes the vulnerable snake much more cautious!
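The two snake decisions can be checked numerically with a minimal sketch of the conditional risk. The dictionary layout `loss[(true class, decision)]` and the helper names are assumptions made for this example; the loss values are the ones from the snake example, with an infinite loss for the regular snake saying “regular” to a dart frog. Classes with zero posterior probability are skipped so that 0 · ∞ is treated as 0.

```python
import math

def conditional_risk(loss, posterior, decision):
    """R(x, i) = sum_j L[j -> i] * P(j | x); zero-probability classes are skipped."""
    return sum(loss[(j, decision)] * p for j, p in posterior.items() if p > 0)

def min_risk_decision(loss, posterior):
    """Return the decision i that minimizes the conditional risk R(x, i)."""
    decisions = sorted({i for (_, i) in loss})
    return min(decisions, key=lambda i: conditional_risk(loss, posterior, i))

# loss[(true class j, decision i)], values taken from the snake example.
regular_snake = {("dart", "regular"): math.inf, ("regular", "regular"): 0,
                 ("dart", "dart"): 0, ("regular", "dart"): 10}
predator_snake = {("dart", "regular"): 10, ("regular", "regular"): 0,
                  ("dart", "dart"): 0, ("regular", "dart"): 10}

certain = {"dart": 0.0, "regular": 1.0}
uncertain = {"dart": 0.1, "regular": 0.9}

print(min_risk_decision(regular_snake, certain),
      min_risk_decision(predator_snake, certain))
print(min_risk_decision(regular_snake, uncertain),
      min_risk_decision(predator_snake, uncertain))
```

With the uncertain posterior, the regular snake's risk for saying “regular” is infinite while saying “dart” costs 9, so it says “dart”; the predator's risks are 1 and 9, so it says “regular”.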
BDR = Minimizing Conditional Risk
• Note that the definition of risk:
  – Immediately defines the optimal classifier as the one that minimizes the conditional risk for a given observation x
  – The Optimal Decision is the Bayes Decision Rule (BDR):
    $$ i^*(x) = \arg\min_i R(x, i) = \arg\min_i \sum_j L[j \to i] \, P_{Y|X}(j \mid x) $$
  – The BDR yields the optimal (minimal) risk:
    $$ R^*(x) = R(x, i^*) = \min_i \sum_j L[j \to i] \, P_{Y|X}(j \mid x) $$
What is a Decision Rule?
• Consider the c-ary classification problem with class labels, $\mathcal{C} = \{1, \ldots, c\}$.
• Given an observation (feature), x, to be classified, a decision rule is a function d = d(·) of the observation that takes its values in the set of class labels,
  $$ d(x) \in \mathcal{C} = \{1, \ldots, c\} $$
• Note that $d^*(x) = i^*(x)$, defined on the previous slide, is an optimal decision rule in the sense that for a specific value of x it minimizes the conditional risk R(x,i) over all possible decisions i in $\mathcal{C}$
(d-Dependent) Total Average Risk
• Given a decision rule d and the conditional risk R(x,i), we can consider the (d-dependent) conditional risk R(x,d(x)).
• We can now define the total (d-dependent) Expected or Average Risk (aka d-Risk):
  $$ R(d) = E\{ R(x, d(x)) \} $$
  – Note that we have averaged over all possible measurements (features) x that we might encounter in the world.
  – Note that R(d) is a function of a function! (A function of d)
  – The d-risk R(d) is a measure of how we expect to perform on average when we use the fixed decision rule d over and over again on a large set of real-world data.
  – It is natural to ask if there is an “optimal decision rule” which minimizes the average risk R(d) over the class of all possible decision rules.
Minimizing the Average Risk R(d)
• Optimizing the total risk R(d) seems hard because we are trying to minimize it over a family of functions (decision rules), d.
• However, since
  $$ R(d) = E\{ R(x, d(x)) \} = \int R(x, d(x)) \, p_X(x) \, dx, \qquad p_X(x) \ge 0, $$
  one can equivalently minimize the data-conditional risk R(x,d(x)) point-wise in x.
• I.e. solve for the value of the optimal decision rule at each x:
  $$ d^*(x) = \arg\min_{d(x)} R(x, d(x)) = \arg\min_i R(x, i) $$
• Thus d*(x) = i*(x)!! I.e. the BDR, which we already know optimizes the Data-Conditional Risk, ALSO optimizes the Average Risk R(d) over ALL possible decision rules d!!
• This makes sense: if the BDR is optimal for every single situation, x, it must be optimal on the average over all x
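The point-wise argument can be checked by brute force on a small discrete problem, where the integral becomes a sum over x. The sketch below is illustrative only: the three-valued feature, the probabilities in `p_x` and `posterior`, and the asymmetric loss matrix are made-up numbers, and all names are assumptions. It enumerates every possible decision rule and compares the best one with the rule built point-wise from the conditional risk.

```python
from itertools import product

# Made-up discrete problem: x in {0, 1, 2}, two classes {0, 1}.
p_x = {0: 0.5, 1: 0.3, 2: 0.2}                        # P_X(x)
posterior = {0: [0.9, 0.1], 1: [0.4, 0.6], 2: [0.2, 0.8]}  # P_{Y|X}(j | x)
loss = [[0, 1], [5, 0]]                               # loss[j][i] = L[j -> i]

def cond_risk(x, i):
    """Conditional risk R(x, i) = sum_j L[j -> i] P(j | x)."""
    return sum(loss[j][i] * posterior[x][j] for j in range(2))

def average_risk(rule):
    """Average risk R(d) = sum_x P_X(x) R(x, d(x)) for a rule given as a dict."""
    return sum(p_x[x] * cond_risk(x, rule[x]) for x in p_x)

# Bayes decision rule: minimize the conditional risk separately at each x.
bdr = {x: min(range(2), key=lambda i: cond_risk(x, i)) for x in p_x}

# Brute force: every function from {0, 1, 2} to {0, 1} (2**3 = 8 rules).
all_rules = [dict(zip(p_x, choice)) for choice in product(range(2), repeat=len(p_x))]
best = min(all_rules, key=average_risk)

print(bdr, best, average_risk(bdr))
```

Enumeration only works because the feature space here is tiny; the point of the slide is that the point-wise rule gives the same answer without searching over rules at all.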
The 0/1 Loss Function
• An important special case of interest:
  – zero loss for no error and equal loss for two error types
• This is equivalent to the “zero/one” loss:
  $$ L[i \to j] = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} $$

    snake prediction   dart frog    regular frog
    “regular”              1             0
    “dart”                 0             1

• Under this loss the optimal Bayes decision rule (BDR) is
  $$ d^*(x) = i^*(x) = \arg\min_i \sum_j L[j \to i] \, P_{Y|X}(j \mid x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j \mid x) $$
0/1 Loss yields MAP Decision Rule
• Note that:
  $$ i^*(x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j \mid x) = \arg\min_i \left[ 1 - P_{Y|X}(i \mid x) \right] = \arg\max_i P_{Y|X}(i \mid x) $$
• Thus the Optimal Decision for the 0/1 loss is:
  – Pick the class that is most probable given the observation x
  – i*(x) is known as the Maximum a Posteriori Probability (MAP) solution
• This is also known as the Bayes Decision Rule (BDR) for the 0/1 loss
  – We will often simplify our discussion by assuming this loss
  – But you should always be aware that other losses may be used
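A quick numerical check of this equivalence, as a sketch: the minimum-risk decision under a 0/1 loss matrix is compared with the argmax of the posterior. The three-class posterior values and the function names `bdr` and `map_decision` are assumptions made for the example.

```python
def bdr(loss, posterior):
    """Minimum conditional risk decision; loss[j][i] = L[j -> i]."""
    c = len(posterior)
    risks = [sum(loss[j][i] * posterior[j] for j in range(c)) for i in range(c)]
    return min(range(c), key=risks.__getitem__)

def map_decision(posterior):
    """Maximum a posteriori decision: the most probable class given x."""
    return max(range(len(posterior)), key=posterior.__getitem__)

# 0/1 loss for three classes: 0 on the diagonal, 1 everywhere else.
zero_one = [[0 if i == j else 1 for i in range(3)] for j in range(3)]

posterior = [0.2, 0.5, 0.3]  # made-up P(i | x), for illustration only
print(bdr(zero_one, posterior), map_decision(posterior))
```

Under any other loss matrix the two functions can disagree, which is why the slides stress that MAP is the BDR only for the 0/1 loss.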
BDR for the 0/1 Loss • Consider the evaluation of the BDR for 0/1 loss  * ( ) argmax ( | ) i x P i x | Y X i – This is also called the Maximum a Posteriori Probability (MAP) rule – It is usually not trivial to evaluate the posterior probabilities P Y|X ( i | x ) – This is due to the fact that we are trying to infer the cause (class i ) from the consequence (observation x ) – i.e. we are trying to solve a nontrivial inverse problem  E.g. imagine that I want to evaluate P Y|X ( person | “has two eyes”)  This strongly depends on what the other classes are 14
Posterior Probabilities and Detection
• If the two classes are “people” and “cars”
  – then $P_{Y|X}(\text{person} \mid \text{“has two eyes”}) = 1$
• But if the classes are “people” and “cats”
  – then $P_{Y|X}(\text{person} \mid \text{“has two eyes”}) = 1/2$ if there are equal numbers of cats and people to uniformly choose from [this is additional info!]
• How do we deal with this problem?
  – We note that it is much easier to infer consequence from cause
  – E.g., it is easy to infer that $P_{X|Y}(\text{“has two eyes”} \mid \text{person}) = 1$
  – This does not depend on any other classes
  – We do not need any additional information
  – Given a class, just count the frequency of observation
Bayes Rule
• How do we go from $P_{X|Y}(x \mid j)$ to $P_{Y|X}(j \mid x)$?
• We use Bayes rule:
  $$ P_{Y|X}(i \mid x) = \frac{P_{X|Y}(x \mid i) \, P_Y(i)}{P_X(x)} $$
• Consider the two-class problem, i.e. Y=0 or Y=1
  – the BDR under 0/1 loss is
    $$ i^*(x) = \arg\max_i P_{Y|X}(i \mid x) = \begin{cases} 0, & \text{if } P_{Y|X}(0 \mid x) \ge P_{Y|X}(1 \mid x) \\ 1, & \text{if } P_{Y|X}(0 \mid x) < P_{Y|X}(1 \mid x) \end{cases} $$
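Bayes rule can be applied to the “has two eyes” example as a small sketch. The function name `posterior_from_bayes` and the dictionary layout are assumptions; so are the likelihood of 0 for a car having two eyes and the equal priors in the people/cars case, which are chosen only to reproduce the posteriors quoted earlier.

```python
def posterior_from_bayes(likelihood, prior):
    """P(i | x) = P(x | i) P(i) / P(x), with P(x) = sum_i P(x | i) P(i)."""
    joint = {c: likelihood[c] * prior[c] for c in prior}
    evidence = sum(joint.values())  # P_X(x), assumed to be > 0
    return {c: joint[c] / evidence for c in joint}

# Classes "person" and "cat": both have two eyes, equal priors.
people_vs_cats = posterior_from_bayes(
    likelihood={"person": 1.0, "cat": 1.0},
    prior={"person": 0.5, "cat": 0.5},
)

# Classes "person" and "car": P("has two eyes" | car) = 0 is an assumption.
people_vs_cars = posterior_from_bayes(
    likelihood={"person": 1.0, "car": 0.0},
    prior={"person": 0.5, "car": 0.5},
)

print(people_vs_cats["person"], people_vs_cars["person"])
```

The class-conditional likelihood for “person” is the same in both cases; only the competing class and the priors change the posterior.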
BDR for 0/1 Loss Binary Classification
• Pick “0” when $P_{Y|X}(0 \mid x) \ge P_{Y|X}(1 \mid x)$ and “1” otherwise
• Using Bayes rule on both sides of this inequality yields
  $$ P_{Y|X}(0 \mid x) \ge P_{Y|X}(1 \mid x) \;\Longleftrightarrow\; \frac{P_{X|Y}(x \mid 0) \, P_Y(0)}{P_X(x)} \ge \frac{P_{X|Y}(x \mid 1) \, P_Y(1)}{P_X(x)} $$
  – Noting that $P_X(x)$ is a non-negative quantity, this is the same as the rule: pick “0” when
    $$ P_{X|Y}(x \mid 0) \, P_Y(0) \ge P_{X|Y}(x \mid 1) \, P_Y(1) $$
    i.e.
    $$ i^*(x) = \arg\max_i P_{X|Y}(x \mid i) \, P_Y(i) $$
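The practical consequence is that the decision never needs $P_X(x)$. The sketch below compares the rule based on $P_{X|Y}(x \mid i) P_Y(i)$ with the rule based on the normalized posterior; the discrete feature values "a", "b", "c", the likelihood table, and the priors are made-up numbers, and the function names are assumptions.

```python
# Made-up class-conditional likelihoods P(x | i) and priors P(i).
likelihood = {0: {"a": 0.7, "b": 0.2, "c": 0.1},
              1: {"a": 0.1, "b": 0.3, "c": 0.6}}
prior = {0: 0.3, 1: 0.7}

def decide_joint(x):
    """Pick 0 when P(x|0) P(0) >= P(x|1) P(1); no normalization needed."""
    return 0 if likelihood[0][x] * prior[0] >= likelihood[1][x] * prior[1] else 1

def decide_posterior(x):
    """Same decision, computed from the posteriors via Bayes rule."""
    evidence = sum(likelihood[i][x] * prior[i] for i in prior)  # P_X(x)
    p0 = likelihood[0][x] * prior[0] / evidence
    p1 = likelihood[1][x] * prior[1] / evidence
    return 0 if p0 >= p1 else 1

decisions = {x: decide_joint(x) for x in ("a", "b", "c")}
print(decisions)
```

Both functions divide or compare the same two products, so they agree for every x with nonzero evidence; `decide_joint` simply skips the division.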