Lecture 3: Bayesian Decision Theory
Dr. Chengjiang Long
Computer Vision Researcher at Kitware Inc.; Adjunct Professor at RPI
Email: longc3@rpi.edu
Recap: Previous Lecture
Outline
• What's Bayesian Decision Theory?
• A More General Theory
• Discriminant Function and Decision Boundary
• Multivariate Gaussian Density
Bayesian Decision Theory
• Design classifiers to make decisions subject to minimizing an expected "risk".
• The simplest risk is the classification error (i.e., assuming that misclassification costs are equal).
• When misclassification costs are not equal, the risk can include the cost associated with the different misclassifications.
Terminology
• State of nature ω (class label): e.g., ω1 for sea bass, ω2 for salmon.
• Prior probabilities P(ω1) and P(ω2): e.g., prior knowledge of how likely it is to get a sea bass or a salmon.
• Probability density function p(x) (evidence): e.g., how frequently we will measure a pattern with feature value x (e.g., x corresponds to lightness).
Terminology
• Class-conditional probability density p(x|ωj) (likelihood): e.g., how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj.
Terminology
• Conditional probability P(ωj|x) (posterior): e.g., the probability that the fish belongs to class ωj given feature x.
• Ultimately, we are interested in computing P(ωj|x) for each class ωj.
Decision Rule
• Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
• Favours the most likely class.
• This rule makes the same decision every time.
• i.e., optimum if no other information is available.
Decision Rule
• Using Bayes' rule:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}, \qquad \text{where } p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\, P(\omega_j)$$

• Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2
• or, equivalently, decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2
• or, equivalently, decide ω1 if p(x|ω1)/p(x|ω2) > P(ω2)/P(ω1); otherwise decide ω2
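A minimal sketch of this rule in Python, assuming two 1-D Gaussian class-conditional densities and priors chosen purely for illustration (none of these numbers come from the lecture):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical likelihoods and priors (illustration only):
# p(x|w1) ~ N(2, 1), p(x|w2) ~ N(4, 1); P(w1) = 0.6, P(w2) = 0.4.
likelihoods = [norm(loc=2.0, scale=1.0), norm(loc=4.0, scale=1.0)]
priors = np.array([0.6, 0.4])

def posterior(x):
    """P(w_j | x) via Bayes' rule: likelihood * prior / evidence."""
    joint = np.array([pdf.pdf(x) for pdf in likelihoods]) * priors
    return joint / joint.sum()          # dividing by the evidence p(x)

def decide(x):
    """Decide w1 if P(w1|x) > P(w2|x); otherwise decide w2."""
    post = posterior(x)
    return 1 if post[0] > post[1] else 2

print(posterior(2.5), decide(2.5))      # x near the w1 mean -> decide w1
print(posterior(4.5), decide(4.5))      # x near the w2 mean -> decide w2
```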
Decision Rule
[Figure: class-conditional densities p(x|ωj) and the resulting posteriors P(ωj|x), for priors P(ω1) = 2/3 and P(ω2) = 1/3.]
Probability of Error
• The probability of error is defined as:

$$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}$$

or, under the Bayes decision rule, $P(\text{error} \mid x) = \min\left[P(\omega_1 \mid x),\, P(\omega_2 \mid x)\right]$.
• What is the average probability of error?

$$P(\text{error}) = \int P(\text{error} \mid x)\, p(x)\, dx$$

• The Bayes rule is optimum, that is, it minimizes the average probability of error!
Where do Probabilities come from?
• There are two competing answers:
• Relative frequency (objective) approach: probabilities can only come from experiments.
• Bayesian (subjective) approach: probabilities may reflect degrees of belief and can be based on opinion.
Example: Objective approach
• Classify cars as costing more or less than $50K:
  Classes: C1 if price > $50K, C2 if price ≤ $50K
  Feature: x, the height of the car
• Use Bayes' rule to compute the posterior probabilities:

$$P(C_i \mid x) = \frac{p(x \mid C_i)\, P(C_i)}{p(x)}$$

• We need to estimate p(x|C1), p(x|C2), P(C1), and P(C2).
Example: Objective approach
• Collect data: ask drivers how much their car cost and measure its height.
• Determine the prior probabilities P(C1), P(C2), e.g., from 1209 samples with #C1 = 221 and #C2 = 988:

$$P(C_1) = \frac{221}{1209} = 0.183, \qquad P(C_2) = \frac{988}{1209} = 0.817$$
Example: Objective approach
• Determine the class-conditional probabilities (likelihoods) p(x|Ci): discretize car height into bins and use a normalized histogram.
• Calculate the posterior probability for each bin, e.g., for the bin x = 1.0:

$$P(C_1 \mid x = 1.0) = \frac{p(x = 1.0 \mid C_1)\, P(C_1)}{p(x = 1.0 \mid C_1)\, P(C_1) + p(x = 1.0 \mid C_2)\, P(C_2)} = \frac{0.2081 \times 0.183}{0.2081 \times 0.183 + 0.0597 \times 0.817} = 0.438$$
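The same computation as a short Python check; the histogram values 0.2081 and 0.0597 are the slide's estimates of p(x = 1.0 | C1) and p(x = 1.0 | C2):

```python
# Priors estimated from the 1209 samples.
p_c1, p_c2 = 221 / 1209, 988 / 1209       # ~0.183, ~0.817

# Normalized-histogram likelihoods for the bin x = 1.0 (values from the slide).
lik_c1, lik_c2 = 0.2081, 0.0597

evidence = lik_c1 * p_c1 + lik_c2 * p_c2  # p(x = 1.0)
posterior_c1 = lik_c1 * p_c1 / evidence
print(round(posterior_c1, 3))             # -> 0.438
```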
Outline
• What's Bayesian Decision Theory?
• A More General Theory
• Discriminant Function and Decision Boundary
• Multivariate Gaussian Density
A More General Theory
• Use more than one feature.
• Allow more than two categories.
• Allow actions other than classifying the input to one of the possible categories (e.g., rejection).
• Employ a more general error function (i.e., expected "risk") by associating a "cost" (based on a "loss" function) with different errors.
Terminology
• Features form a vector x ∈ R^d.
• A set of c categories ω1, ω2, …, ωc.
• A finite set of l actions α1, α2, …, αl.
• A loss function λ(αi|ωj): the cost associated with taking action αi when the correct classification category is ωj.
Conditional Risk (or Expected Loss)
• Suppose we observe x and take action αi.
• The conditional risk (or expected loss) of taking action αi is defined as:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)$$

• Example: from a medical image, we want to classify (determine) whether it contains cancer tissue or not.
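A minimal sketch of this formula for the cancer-screening example, assuming a hypothetical loss matrix and posterior (neither is from the lecture); it also previews the next slides by picking the minimum-risk action:

```python
import numpy as np

# Hypothetical loss matrix: rows = actions (a1: report "cancer", a2: report "no cancer"),
# columns = true classes (w1: cancer, w2: no cancer). Missing a cancer is assumed far costlier.
loss = np.array([[0.0, 1.0],     # lambda(a1|w1), lambda(a1|w2)
                 [10.0, 0.0]])   # lambda(a2|w1), lambda(a2|w2)

def conditional_risk(posteriors):
    """R(a_i | x) = sum_j lambda(a_i | w_j) * P(w_j | x), for every action a_i."""
    return loss @ posteriors

posteriors = np.array([0.2, 0.8])     # illustrative P(w1|x), P(w2|x)
risks = conditional_risk(posteriors)  # -> [0.8, 2.0]
print(risks, np.argmin(risks))        # even at 20% cancer probability, "report cancer" has lower risk
```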
Overall Risk
• Suppose α(x) is a general decision rule that determines which action α1, α2, …, αl to take for every x. The overall risk is defined as:

$$R = \int R(\alpha(x) \mid x)\, p(x)\, dx$$

• The optimum decision rule is the Bayes rule.
Overall Risk
• The Bayes rule minimizes R by:
  (i) computing R(αi|x) for every αi given an x, and
  (ii) choosing the action αi with the minimum R(αi|x).
• The resulting minimum R* is called the Bayes risk and is the best (i.e., optimum) performance that can be achieved:

$$R^* = \min R$$
Example: Two-category classification
• Define:
  α1: decide ω1
  α2: decide ω2
  λij = λ(αi|ωj)
• The conditional risks are:

$$R(\alpha_i \mid x) = \sum_{j=1}^{2} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x), \quad \text{i.e.,}$$
$$R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x), \qquad R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$$
Example: Two-category classification
• Minimum-risk decision rule: decide ω1 if R(α1|x) < R(α2|x); otherwise decide ω2
• or, equivalently, decide ω1 if (λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x)
• or, equivalently (i.e., using the likelihood ratio), decide ω1 if

$$\underbrace{\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}}_{\text{likelihood ratio}} > \underbrace{\frac{(\lambda_{12} - \lambda_{22})\, P(\omega_2)}{(\lambda_{21} - \lambda_{11})\, P(\omega_1)}}_{\text{threshold}}$$
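A quick sketch of this likelihood-ratio test in Python, with made-up losses and priors (not from the lecture):

```python
# Illustrative losses: lambda_ij = cost of taking action i when the true class is j.
lam11, lam12, lam21, lam22 = 0.0, 5.0, 1.0, 0.0
prior_w1, prior_w2 = 0.5, 0.5

threshold = (lam12 - lam22) * prior_w2 / ((lam21 - lam11) * prior_w1)   # -> 5.0

def decide(px_w1, px_w2):
    """Decide w1 if the likelihood ratio p(x|w1)/p(x|w2) exceeds the threshold."""
    return 1 if px_w1 / px_w2 > threshold else 2

print(decide(0.9, 0.1))   # ratio 9.0 > 5.0 -> decide w1
print(decide(0.6, 0.4))   # ratio 1.5 < 5.0 -> decide w2
```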
Special Case: Zero-One Loss Function
• Assign the same loss to all errors:

$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \neq j \end{cases}$$

• The conditional risk corresponding to this loss function:

$$R(\alpha_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$
Special Case: Zero-One Loss Function
• The decision rule becomes: decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2
• or, equivalently, decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2
• or, equivalently, decide ω1 if p(x|ω1)/p(x|ω2) > P(ω2)/P(ω1); otherwise decide ω2
• The overall risk turns out to be the average probability of error!
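A small numeric check that, under zero-one loss, minimizing the conditional risk is the same as picking the maximum posterior (the loss-matrix construction is general; the posterior values are illustrative):

```python
import numpy as np

c = 3
zero_one_loss = np.ones((c, c)) - np.eye(c)       # lambda(a_i|w_j) = 0 if i == j, else 1

posteriors = np.array([0.2, 0.5, 0.3])            # illustrative P(w_j | x)
risks = zero_one_loss @ posteriors                # R(a_i|x) = 1 - P(w_i|x)

print(risks)                                      # [0.8, 0.5, 0.7]
print(np.argmin(risks) == np.argmax(posteriors))  # True: minimum risk == maximum posterior
```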
Example
• Assuming zero-one loss: decide ω1 if p(x|ω1)/p(x|ω2) > θa; otherwise decide ω2, where

$$\theta_a = \frac{P(\omega_2)}{P(\omega_1)}$$

• Assuming a general loss (with λ12 > λ21): decide ω1 if p(x|ω1)/p(x|ω2) > θb; otherwise decide ω2, where

$$\theta_b = \frac{P(\omega_2)(\lambda_{12} - \lambda_{22})}{P(\omega_1)(\lambda_{21} - \lambda_{11})}$$
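A quick numeric illustration (all numbers made up): take equal priors, λ11 = λ22 = 0, λ12 = 2, and λ21 = 1; then

$$\theta_a = \frac{0.5}{0.5} = 1, \qquad \theta_b = \frac{0.5 \times 2}{0.5 \times 1} = 2,$$

so the larger cost of deciding ω1 when the true class is ω2 raises the likelihood-ratio threshold from 1 to 2 and shrinks the region where we decide ω1.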
Outline
• What's Bayesian Decision Theory?
• A More General Theory
• Discriminant Function and Decision Boundary
• Multivariate Gaussian Density
• Error Bound, ROC, Missing Features and Compound Bayesian Decision Theory
• Summary
Discriminant Functions
• A useful way to represent a classifier is through discriminant functions gi(x), i = 1, …, c, where a feature vector x is assigned to class ωi if gi(x) > gj(x) for all j ≠ i.
Discriminants for Bayes Classifier
• Is the choice of gi unique? No: replacing gi(x) with f(gi(x)), where f(·) is monotonically increasing, does not change the classification results.
• Examples:

$$g_i(x) = P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)}$$
$$g_i(x) = p(x \mid \omega_i)\, P(\omega_i)$$
$$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$

• We'll use this last discriminant extensively!
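A minimal sketch of classification with the log discriminant g_i(x) = ln p(x|ωi) + ln P(ωi), assuming 1-D Gaussian likelihoods and priors chosen only for illustration:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities and priors for three classes.
likelihoods = [norm(2.0, 1.0), norm(5.0, 2.0), norm(8.0, 1.0)]
priors = np.array([0.5, 0.3, 0.2])

def discriminants(x):
    """g_i(x) = ln p(x | w_i) + ln P(w_i) for every class i."""
    return np.array([pdf.logpdf(x) for pdf in likelihoods]) + np.log(priors)

def classify(x):
    """Assign x to the class with the largest discriminant."""
    return int(np.argmax(discriminants(x))) + 1   # classes numbered 1..c

print(classify(1.8))   # near the first mean -> class 1
print(classify(7.5))   # near the third mean -> class 3
```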
Case of two categories
• It is more common to use a single discriminant function (dichotomizer) instead of two: decide ω1 if g(x) > 0; otherwise decide ω2.
• Examples:

$$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$$
$$g(x) = \ln \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$
Decision Regions and Boundaries
• Discriminant functions divide the feature space into decision regions R1, R2, …, Rc, separated by decision boundaries.
• The decision boundary between two regions is defined by g1(x) = g2(x).