Bayesian decision theory
Andrea Passerini (passerini@disi.unitn.it)
Machine Learning
Introduction: Overview
Bayesian decision theory allows us to take optimal decisions in a fully probabilistic setting. It assumes that all relevant probabilities are known. It characterizes the lowest achievable error (the Bayes error), against which classifiers can be evaluated. Bayesian reasoning can be generalized to cases in which the probabilistic structure is not entirely known.
Input-output pair: Binary classification
Assume examples (x, y) ∈ X × {−1, 1} are drawn from a known distribution p(x, y). The task is predicting the class y of an example given the input x. Bayes rule allows us to write it in probabilistic terms as:
$$P(y|x) = \frac{p(x|y)\,P(y)}{p(x)}$$
Output given input: Bayes rule
Bayes rule allows us to compute the posterior probability given the likelihood, the prior, and the evidence:
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
The posterior P(y|x) is the probability that the class is y given that x was observed.
The likelihood p(x|y) is the probability of observing x given that its class is y.
The prior P(y) is the prior probability of the class, before seeing any evidence.
The evidence p(x) is the probability of the observation; by the law of total probability it can be computed as:
$$p(x) = \sum_{i=1}^{2} p(x|y_i)\,P(y_i)$$
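As a minimal sketch of how these quantities combine (not part of the lecture), the snippet below computes the posterior for a two-class problem; the Gaussian likelihoods and the priors are illustrative assumptions.

```python
from scipy.stats import norm

# Illustrative 1D class-conditional densities and priors (assumed, not from the lecture)
priors = {+1: 0.6, -1: 0.4}
likelihood = {+1: norm(loc=2.0, scale=1.0), -1: norm(loc=-1.0, scale=1.5)}

def posterior(y, x):
    """P(y|x) = p(x|y) P(y) / p(x), with p(x) from the law of total probability."""
    evidence = sum(likelihood[c].pdf(x) * priors[c] for c in priors)
    return likelihood[y].pdf(x) * priors[y] / evidence

x = 0.5
print(posterior(+1, x), posterior(-1, x))  # the two posteriors sum to 1
```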
Expected error: Probability of error
Probability of error given x:
$$P(\text{error}|x) = \begin{cases} P(y_2|x) & \text{if we decide } y_1 \\ P(y_1|x) & \text{if we decide } y_2 \end{cases}$$
Average probability of error:
$$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error}|x)\,p(x)\,dx$$
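A small numerical sketch, under assumed one-dimensional Gaussian class-conditionals: since the optimal rule picks the class with the larger joint p(x|y)P(y), the integrand P(error|x)p(x) equals the smaller of the two joints, and the average error can be approximated on a grid.

```python
import numpy as np
from scipy.stats import norm

# Illustrative setup (assumed): two 1D Gaussian class-conditionals and their priors
p_pos, p_neg = 0.6, 0.4
lik_pos, lik_neg = norm(2.0, 1.0), norm(-1.0, 1.5)

xs = np.linspace(-10.0, 10.0, 20001)   # dense grid covering most of the support
dx = xs[1] - xs[0]
joint_pos = lik_pos.pdf(xs) * p_pos    # p(x, y=+1)
joint_neg = lik_neg.pdf(xs) * p_neg    # p(x, y=-1)

# The Bayes rule picks the larger joint, so P(error|x) p(x) is the smaller one
bayes_error = np.sum(np.minimum(joint_pos, joint_neg)) * dx
print(f"estimated Bayes error: {bayes_error:.4f}")
```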
Bayes decision rule
Binary case:
$$y_B = \operatorname*{argmax}_{y_i \in \{-1,1\}} P(y_i|x) = \operatorname*{argmax}_{y_i \in \{-1,1\}} p(x|y_i)\,P(y_i)$$
Multiclass case:
$$y_B = \operatorname*{argmax}_{y_i \in \{1,\dots,c\}} P(y_i|x) = \operatorname*{argmax}_{y_i \in \{1,\dots,c\}} p(x|y_i)\,P(y_i)$$
Optimal rule
The probability of error given x is:
$$P(\text{error}|x) = 1 - P(y_B|x)$$
The Bayes decision rule minimizes the probability of error.
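A minimal sketch of the multiclass rule, assuming illustrative one-dimensional Gaussian class-conditionals and priors (not taken from the slides):

```python
import numpy as np
from scipy.stats import norm

# Illustrative three-class setup (assumed)
classes = [1, 2, 3]
priors = {1: 0.5, 2: 0.3, 3: 0.2}
likelihood = {1: norm(-2, 1), 2: norm(0, 1), 3: norm(3, 2)}

def bayes_decision(x):
    """y_B = argmax_i p(x|y_i) P(y_i); maximising the joint is equivalent to
    maximising the posterior because the evidence p(x) does not depend on i."""
    return max(classes, key=lambda c: likelihood[c].pdf(x) * priors[c])

def error_prob(x):
    """P(error|x) = 1 - P(y_B|x)."""
    joint = {c: likelihood[c].pdf(x) * priors[c] for c in classes}
    return 1 - max(joint.values()) / sum(joint.values())

print(bayes_decision(0.5), error_prob(0.5))
```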
Representing classifiers: Discriminant functions
A classifier can be represented as a set of discriminant functions g_i(x), i ∈ 1, ..., c, giving:
$$y = \operatorname*{argmax}_{i \in 1,\dots,c} g_i(x)$$
A discriminant function is not unique ⇒ the most convenient one for computational or explanatory reasons can be used:
$$g_i(x) = P(y_i|x) = \frac{p(x|y_i)\,P(y_i)}{p(x)}$$
$$g_i(x) = p(x|y_i)\,P(y_i)$$
$$g_i(x) = \ln p(x|y_i) + \ln P(y_i)$$
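The following sketch, with assumed densities and priors, checks numerically that the three discriminant functions above yield the same decision, since they are monotonically related:

```python
import numpy as np
from scipy.stats import norm

# Illustrative binary problem (assumed densities and priors)
priors = {0: 0.7, 1: 0.3}
likelihood = {0: norm(0, 1), 1: norm(2, 1)}

def g_posterior(i, x):   # g_i(x) = P(y_i|x)
    evidence = sum(likelihood[c].pdf(x) * priors[c] for c in priors)
    return likelihood[i].pdf(x) * priors[i] / evidence

def g_joint(i, x):       # g_i(x) = p(x|y_i) P(y_i)
    return likelihood[i].pdf(x) * priors[i]

def g_log(i, x):         # g_i(x) = ln p(x|y_i) + ln P(y_i)
    return likelihood[i].logpdf(x) + np.log(priors[i])

# All three give the same decision for every x
for x in (-1.0, 0.8, 3.0):
    decisions = {max(priors, key=lambda i: g(i, x)) for g in (g_posterior, g_joint, g_log)}
    assert len(decisions) == 1
```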
Representing classifiers
[Figure: joint densities p(x|ω_1)P(ω_1) and p(x|ω_2)P(ω_2) over a two-dimensional feature space, with decision regions R_1, R_2 and the decision boundary between them.]
Decision regions
The feature space is divided into decision regions R_1, ..., R_c such that:
x ∈ R_i if g_i(x) > g_j(x) ∀ j ≠ i
Decision regions are separated by decision boundaries, regions in which ties occur among the largest discriminant functions.
Normal density
Multivariate normal density:
$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^t \Sigma^{-1} (x-\mu)\right)$$
The covariance matrix Σ is always symmetric and positive semi-definite. It is strictly positive definite when the distribution truly spans all d dimensions of the feature space (otherwise |Σ| = 0 and the density is degenerate).
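A short sketch that evaluates the density directly from the formula and compares it with scipy.stats.multivariate_normal; the values of µ and Σ are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters (assumed): a 2D Gaussian
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # symmetric, positive definite

def gaussian_pdf(x, mu, Sigma):
    """p(x) = exp(-1/2 (x-mu)^t Sigma^{-1} (x-mu)) / ((2 pi)^{d/2} |Sigma|^{1/2})"""
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

x = np.array([0.0, 0.0])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).pdf(x))   # matches the manual computation
```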
Normal density
[Figure: equal-density ellipses of a two-dimensional Gaussian centred at µ in the (x_1, x_2) plane.]
Hyperellipsoids
The loci of points of constant density are hyperellipsoids of constant Mahalanobis distance from x to µ. The principal axes of such hyperellipsoids are the eigenvectors of Σ; their lengths are given by the corresponding eigenvalues.
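A minimal sketch, with an assumed covariance matrix, of how the principal axes of the equal-density hyperellipsoids and the Mahalanobis distance can be computed:

```python
import numpy as np

# Illustrative covariance (assumed)
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# Eigenvectors give the directions of the principal axes of the
# constant-density hyperellipsoids, eigenvalues their relative lengths
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh: Sigma is symmetric
print("axis directions (columns):\n", eigvecs)
print("eigenvalues:", eigvals)

# Mahalanobis distance from x to mu, constant on each hyperellipsoid
mu = np.zeros(2)
x = np.array([1.0, -1.0])
mahalanobis = np.sqrt((x - mu) @ np.linalg.solve(Sigma, x - mu))
print("Mahalanobis distance:", mahalanobis)
```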
Discriminant functions for normal density
Discriminant functions:
$$g_i(x) = \ln p(x|y_i) + \ln P(y_i) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(y_i)$$
Discarding terms which are independent of i we obtain:
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(y_i)$$
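A sketch of the reduced discriminant function above, with assumed class parameters:

```python
import numpy as np

def gaussian_discriminant(x, mu_i, Sigma_i, prior_i):
    """g_i(x) = -1/2 (x-mu_i)^t Sigma_i^{-1} (x-mu_i) - 1/2 ln|Sigma_i| + ln P(y_i)
    (the term -d/2 ln 2*pi is dropped, being independent of i)."""
    diff = x - mu_i
    return (-0.5 * diff @ np.linalg.solve(Sigma_i, diff)
            - 0.5 * np.log(np.linalg.det(Sigma_i))
            + np.log(prior_i))

# Illustrative two-class 2D problem (assumed parameters)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.eye(2), np.array([[1.5, 0.3], [0.3, 0.8]])]
priors = [0.6, 0.4]

x = np.array([1.0, 0.5])
scores = [gaussian_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print("predicted class:", int(np.argmax(scores)))
```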
Discriminant functions for normal density: case Σ_i = σ²I
Features are statistically independent and all features have the same variance σ². The covariance determinant |Σ_i| = σ^{2d} can be ignored, being independent of i. The covariance inverse is given by Σ_i^{-1} = (1/σ²)I. The discriminant functions become:
$$g_i(x) = -\frac{\lVert x - \mu_i \rVert^2}{2\sigma^2} + \ln P(y_i)$$
Discriminant functions for normal density: case Σ_i = σ²I
Expansion of the quadratic form leads to:
$$g_i(x) = -\frac{1}{2\sigma^2}\left[x^t x - 2\mu_i^t x + \mu_i^t\mu_i\right] + \ln P(y_i)$$
Discarding terms which are independent of i we obtain linear discriminant functions:
$$g_i(x) = \underbrace{\frac{1}{\sigma^2}\mu_i^t}_{w_i^t}\,x\;\underbrace{-\,\frac{1}{2\sigma^2}\mu_i^t\mu_i + \ln P(y_i)}_{w_{i0}}$$
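A minimal sketch of the resulting linear classifier, with illustrative means, priors, and σ² (assumed values):

```python
import numpy as np

# Illustrative parameters (assumed): shared isotropic covariance sigma^2 I
sigma2 = 1.5
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
priors = [0.5, 0.5]

# Linear discriminant g_i(x) = w_i^t x + w_i0 with
#   w_i  = mu_i / sigma^2
#   w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(y_i)
ws = [mu / sigma2 for mu in mus]
w0s = [-mu @ mu / (2 * sigma2) + np.log(p) for mu, p in zip(mus, priors)]

def g(i, x):
    return ws[i] @ x + w0s[i]

x = np.array([1.0, 0.2])
print("predicted class:", int(np.argmax([g(0, x), g(1, x)])))
```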
case Σ_i = σ²I: Separating hyperplane
Setting g_i(x) = g_j(x) we note that the decision boundaries are pieces of hyperplanes:
$$\underbrace{(\mu_i - \mu_j)^t}_{w^t}\Big(x - \underbrace{\Big[\tfrac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\lVert\mu_i - \mu_j\rVert^2}\,\ln\frac{P(y_i)}{P(y_j)}\,(\mu_i - \mu_j)\Big]}_{x_0}\Big) = 0$$
The hyperplane is orthogonal to the vector w ⇒ orthogonal to the line linking the means. The hyperplane passes through x_0: if the prior probabilities of the classes are equal, x_0 is halfway between the means; otherwise, x_0 shifts away from the more likely mean.
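The sketch below computes w and x_0 for assumed parameters; with equal priors the logarithmic term vanishes and x_0 is the midpoint of the means.

```python
import numpy as np

# Illustrative parameters (assumed)
sigma2 = 1.0
mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 1.0])
P_i, P_j = 0.7, 0.3

# Decision boundary w^t (x - x_0) = 0 with
#   w   = mu_i - mu_j
#   x_0 = 1/2 (mu_i + mu_j) - sigma^2 / ||mu_i - mu_j||^2 * ln(P_i/P_j) * (mu_i - mu_j)
w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / (w @ w) * np.log(P_i / P_j) * w
print("w =", w, " x_0 =", x0)
# Since P_i > P_j here, x_0 moves toward mu_j, i.e. away from the more likely mean.
```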
case Σ_i = σ²I
[Figure: class-conditional densities p(x|ω_i) with equal priors P(ω_1) = P(ω_2) = .5, in one, two, and three dimensions; the decision boundary between R_1 and R_2 lies halfway between the means.]
case Σ_i = σ²I
[Figure: the same setting with unequal priors, e.g. P(ω_1) = .7, .8, .9, .99 versus P(ω_2) = .3, .2, .1, .01; the decision boundary between R_1 and R_2 shifts away from the more likely mean.]
case Σ_i = Σ
[Figure: two classes sharing the same covariance Σ, shown for equal priors (P(ω_1) = P(ω_2) = .5) and for unequal priors (P(ω_1) = .1, P(ω_2) = .9); the linear boundary between R_1 and R_2 shifts away from the more likely mean.]
case Σ_i = arbitrary
[Figure: decision boundaries in the general case of class-specific covariance matrices.]
Appendix
Additional reference material
case Σ_i = σ²I: Separating hyperplane, derivation (1)
$$g_i(x) - g_j(x) = 0$$
$$\frac{1}{\sigma^2}\mu_i^t x - \frac{1}{2\sigma^2}\mu_i^t\mu_i + \ln P(y_i) - \frac{1}{\sigma^2}\mu_j^t x + \frac{1}{2\sigma^2}\mu_j^t\mu_j - \ln P(y_j) = 0$$
$$(\mu_i - \mu_j)^t x - \frac{1}{2}(\mu_i^t\mu_i - \mu_j^t\mu_j) + \sigma^2\ln\frac{P(y_i)}{P(y_j)} = 0$$
$$w^t(x - x_0) = 0$$
with
$$w = \mu_i - \mu_j \qquad (\mu_i - \mu_j)^t x_0 = \frac{1}{2}(\mu_i^t\mu_i - \mu_j^t\mu_j) - \sigma^2\ln\frac{P(y_i)}{P(y_j)}$$
case Σ_i = σ²I: Separating hyperplane, derivation (2)
$$(\mu_i - \mu_j)^t x_0 = \frac{1}{2}(\mu_i^t\mu_i - \mu_j^t\mu_j) - \sigma^2\ln\frac{P(y_i)}{P(y_j)}$$
Using
$$\mu_i^t\mu_i - \mu_j^t\mu_j = (\mu_i - \mu_j)^t(\mu_i + \mu_j)$$
and
$$\ln\frac{P(y_i)}{P(y_j)} = \frac{(\mu_i - \mu_j)^t(\mu_i - \mu_j)}{\lVert\mu_i - \mu_j\rVert^2}\,\ln\frac{P(y_i)}{P(y_j)}$$
we obtain
$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\lVert\mu_i - \mu_j\rVert^2}\,\ln\frac{P(y_i)}{P(y_j)}\,(\mu_i - \mu_j)$$
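As a numerical sanity check of the derivation (with assumed parameters), the point x_0 obtained above makes the two linear discriminants coincide:

```python
import numpy as np

# Illustrative parameters (assumed), used to check that g_i(x_0) = g_j(x_0)
sigma2 = 2.0
mu_i, mu_j = np.array([1.0, -1.0]), np.array([-2.0, 3.0])
P_i, P_j = 0.6, 0.4

def g(x, mu, prior):
    # Linear discriminant for the sigma^2 I case
    return mu @ x / sigma2 - mu @ mu / (2 * sigma2) + np.log(prior)

w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / (w @ w) * np.log(P_i / P_j) * w

# x_0 lies on the decision boundary: the two discriminants coincide there
assert np.isclose(g(x0, mu_i, P_i), g(x0, mu_j, P_j))
print("g_i(x_0) = g_j(x_0) =", g(x0, mu_i, P_i))
```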
Discriminant functions for normal density: case Σ_i = Σ
All classes have the same covariance matrix. The discriminant functions become:
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma^{-1}(x-\mu_i) + \ln P(y_i)$$
Expanding the quadratic form and discarding terms independent of i we again obtain linear discriminant functions:
$$g_i(x) = \underbrace{\mu_i^t \Sigma^{-1}}_{w_i^t}\,x\;\underbrace{-\,\frac{1}{2}\mu_i^t \Sigma^{-1}\mu_i + \ln P(y_i)}_{w_{i0}}$$
The separating hyperplanes are not necessarily orthogonal to the line linking the means:
$$\underbrace{(\mu_i - \mu_j)^t \Sigma^{-1}}_{w^t}\Big(x - \underbrace{\Big[\tfrac{1}{2}(\mu_i + \mu_j) - \frac{\ln P(y_i)/P(y_j)}{(\mu_i - \mu_j)^t\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)\Big]}_{x_0}\Big) = 0$$
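A sketch with an assumed shared covariance, showing that the boundary normal Σ^{-1}(µ_i − µ_j) is in general not parallel to µ_i − µ_j and that x_0 lies on the boundary:

```python
import numpy as np

# Illustrative parameters (assumed): shared, non-isotropic covariance
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 1.0])
P_i, P_j = 0.7, 0.3

Sigma_inv = np.linalg.inv(Sigma)

# Linear discriminants: w_i = Sigma^{-1} mu_i, w_i0 = -1/2 mu_i^t Sigma^{-1} mu_i + ln P(y_i)
def g(x, mu, prior):
    return mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(prior)

# Boundary (mu_i - mu_j)^t Sigma^{-1} (x - x_0) = 0: its normal is Sigma^{-1}(mu_i - mu_j),
# which need not be parallel to mu_i - mu_j, so the hyperplane is not necessarily
# orthogonal to the line linking the means.
diff = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - np.log(P_i / P_j) / (diff @ Sigma_inv @ diff) * diff
normal = Sigma_inv @ diff
print("boundary normal:", normal, " x_0:", x0)
assert np.isclose(g(x0, mu_i, P_i), g(x0, mu_j, P_j))   # x_0 lies on the boundary
```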