


  1. Bayesian decision theory. Andrea Passerini (passerini@disi.unitn.it), Machine Learning.

  2. Introduction: overview
     - Bayesian decision theory allows making optimal decisions in a fully probabilistic setting
     - It assumes all relevant probabilities are known
     - It allows deriving upper bounds on achievable errors and evaluating classifiers accordingly
     - Bayesian reasoning can be generalized to cases in which the probabilistic structure is not entirely known

  3. Input-output pair: binary classification
     Assume examples $(x, y) \in \mathcal{X} \times \{-1, 1\}$ are drawn from a known distribution $p(x, y)$. The task is predicting the class $y$ of an example given its input $x$. Bayes' rule allows us to write this in probabilistic terms as:
     $$P(y|x) = \frac{p(x|y)\, P(y)}{p(x)}$$

  4. Output given input: Bayes' rule
     Bayes' rule allows computing the posterior probability from likelihood, prior and evidence:
     $$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
     - posterior: $P(y|x)$ is the probability that the class is $y$ given that $x$ was observed
     - likelihood: $p(x|y)$ is the probability of observing $x$ given that its class is $y$
     - prior: $P(y)$ is the prior probability of the class, before any evidence is seen
     - evidence: $p(x)$ is the probability of the observation, which by the law of total probability can be computed as:
     $$p(x) = \sum_{i=1}^{2} p(x|y_i)\, P(y_i)$$
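A minimal sketch of this computation (not from the slides): a hypothetical 1-D binary problem with Gaussian class-conditional densities and assumed priors, with the posterior obtained exactly as in the formula above.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D binary problem (all parameters are assumptions, not from the slides):
# class-conditional densities p(x|y) are Gaussians, priors P(y) are known.
priors = {+1: 0.6, -1: 0.4}
likelihoods = {+1: norm(loc=2.0, scale=1.0), -1: norm(loc=-1.0, scale=1.5)}

def posteriors(x):
    """P(y|x) for both classes via Bayes' rule."""
    joint = {y: likelihoods[y].pdf(x) * priors[y] for y in priors}  # p(x|y) P(y)
    evidence = sum(joint.values())                                  # p(x), law of total probability
    return {y: joint[y] / evidence for y in priors}

print(posteriors(0.5))  # the two posteriors sum to 1
```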

  5. Expected error: probability of error
     Probability of error given $x$:
     $$P(\text{error}|x) = \begin{cases} P(y_2|x) & \text{if we decide } y_1 \\ P(y_1|x) & \text{if we decide } y_2 \end{cases}$$
     Average probability of error:
     $$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error}|x)\, p(x)\, dx$$
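As an illustrative sketch (same hypothetical 1-D setup as above, assumed parameters), the average error of the rule that picks the most probable class can be approximated by discretizing the integral:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D binary problem; all parameters are assumptions for illustration.
priors = {+1: 0.6, -1: 0.4}
lik = {+1: norm(2.0, 1.0), -1: norm(-1.0, 1.5)}

xs = np.linspace(-10.0, 10.0, 4001)
dx = xs[1] - xs[0]
px = sum(lik[y].pdf(xs) * priors[y] for y in priors)           # evidence p(x)
post = {y: lik[y].pdf(xs) * priors[y] / px for y in priors}    # posteriors P(y|x)

# Deciding for the most probable class, P(error|x) is the posterior of the other class,
# i.e. the smaller of the two posteriors; integrate it against p(x) on the grid.
p_error = np.sum(np.minimum(post[+1], post[-1]) * px) * dx
print(f"average probability of error ~ {p_error:.3f}")
```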

  6. Bayes decision rule
     Binary case:
     $$y_B = \operatorname{argmax}_{y_i \in \{-1, 1\}} P(y_i|x) = \operatorname{argmax}_{y_i \in \{-1, 1\}} p(x|y_i)\, P(y_i)$$
     Multiclass case:
     $$y_B = \operatorname{argmax}_{y_i \in \{1, \dots, c\}} P(y_i|x) = \operatorname{argmax}_{y_i \in \{1, \dots, c\}} p(x|y_i)\, P(y_i)$$
     Optimal rule: the probability of error given $x$ is $P(\text{error}|x) = 1 - P(y_B|x)$, so the Bayes decision rule minimizes the probability of error.
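A sketch of the multiclass rule, assuming Gaussian class-conditional densities; the number of classes, means, covariances and priors below are illustrative, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 3-class problem with 2-D Gaussian likelihoods (assumed parameters).
classes = [1, 2, 3]
priors = {1: 0.5, 2: 0.3, 3: 0.2}
likelihoods = {
    1: multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)),
    2: multivariate_normal(mean=[3.0, 0.0], cov=np.eye(2)),
    3: multivariate_normal(mean=[0.0, 3.0], cov=np.eye(2)),
}

def bayes_decision(x):
    """y_B = argmax_i p(x|y_i) P(y_i), equivalent to maximizing the posterior P(y_i|x)."""
    return max(classes, key=lambda y: likelihoods[y].pdf(x) * priors[y])

print(bayes_decision([2.0, 1.0]))
```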

  7. Representing classifiers: discriminant functions
     A classifier can be represented as a set of discriminant functions $g_i(x)$, $i \in 1, \dots, c$, giving:
     $$y = \operatorname{argmax}_{i \in 1, \dots, c} g_i(x)$$
     A discriminant function is not unique, so the most convenient one for computational or explanatory reasons can be used:
     $$g_i(x) = P(y_i|x) = \frac{p(x|y_i)\, P(y_i)}{p(x)}$$
     $$g_i(x) = p(x|y_i)\, P(y_i)$$
     $$g_i(x) = \ln p(x|y_i) + \ln P(y_i)$$
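A small check, on a hypothetical two-class problem with assumed Gaussian likelihoods, that the three forms of $g_i$ select the same class, since they differ only by a monotone transformation or by a term independent of $i$:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-class, 2-D problem (assumed parameters).
priors = {1: 0.6, 2: 0.4}
likelihoods = {1: multivariate_normal([0.0, 0.0], np.eye(2)),
               2: multivariate_normal([2.0, 2.0], np.eye(2))}
x = np.array([1.5, 0.5])

g_joint = {y: likelihoods[y].pdf(x) * priors[y] for y in priors}            # p(x|y_i) P(y_i)
g_post = {y: g_joint[y] / sum(g_joint.values()) for y in priors}            # P(y_i|x)
g_log = {y: likelihoods[y].logpdf(x) + np.log(priors[y]) for y in priors}   # ln p(x|y_i) + ln P(y_i)

# All three discriminant functions pick the same class.
assert max(g_joint, key=g_joint.get) == max(g_post, key=g_post.get) == max(g_log, key=g_log.get)
```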

  8. Representing classifiers: decision regions
     [Figure: $p(x|\omega_1)P(\omega_1)$ and $p(x|\omega_2)P(\omega_2)$ over the feature space, with decision regions $R_1$, $R_2$ and the decision boundary]
     The feature space is divided into decision regions $R_1, \dots, R_c$ such that:
     $$x \in R_i \ \text{ if } \ g_i(x) > g_j(x) \quad \forall j \neq i$$
     Decision regions are separated by decision boundaries, regions in which ties occur among the largest discriminant functions.

  9. Normal density
     Multivariate normal density:
     $$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu) \right]$$
     The covariance matrix $\Sigma$ is always symmetric and positive semi-definite. It is strictly positive definite when the distribution actually spans all $d$ dimensions of the feature space (otherwise $|\Sigma| = 0$).
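A sketch evaluating the density directly from the formula and checking it against scipy's implementation; the mean and covariance below are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def normal_density(x, mu, sigma):
    """Multivariate normal density, written exactly as in the formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.linalg.det(sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm_const

# Hypothetical 2-D example.
mu = np.array([1.0, -1.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.0, 0.0])

print(normal_density(x, mu, sigma))
print(multivariate_normal(mean=mu, cov=sigma).pdf(x))  # same value
```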

  10. Normal density: hyperellipsoids
     [Figure: a constant-density ellipse centred at $\mu$ in the $(x_1, x_2)$ plane]
     The loci of points of constant density are hyperellipsoids of constant Mahalanobis distance from $x$ to $\mu$. The principal axes of these hyperellipsoids are the eigenvectors of $\Sigma$; their lengths are given by the corresponding eigenvalues.
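A sketch (with an assumed $\Sigma$) of the Mahalanobis distance and of the eigen-decomposition of $\Sigma$ that gives the principal axes of the constant-density hyperellipsoids:

```python
import numpy as np

# Assumed mean and covariance for illustration.
mu = np.array([1.0, -1.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance (x - mu)^t Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ np.linalg.inv(sigma) @ diff

# Eigenvectors of Sigma are the principal axes of the hyperellipsoids;
# the corresponding eigenvalues determine the axis lengths.
eigvals, eigvecs = np.linalg.eigh(sigma)
print("squared Mahalanobis distance:", mahalanobis_sq(np.zeros(2), mu, sigma))
print("principal axes (columns):\n", eigvecs)
print("eigenvalues:", eigvals)
```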

  11. Discriminant functions for normal density
     Discriminant functions:
     $$g_i(x) = \ln p(x|y_i) + \ln P(y_i) = -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(y_i)$$
     Discarding terms which are independent of $i$ we obtain:
     $$g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2} \ln |\Sigma_i| + \ln P(y_i)$$
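A direct sketch of this general discriminant; the per-class means, covariances and priors are assumptions for illustration.

```python
import numpy as np

def discriminant(x, mu_i, sigma_i, prior_i):
    """g_i(x) = -1/2 (x - mu_i)^t Sigma_i^{-1} (x - mu_i) - 1/2 ln|Sigma_i| + ln P(y_i)
    (the term -d/2 ln 2pi is dropped, being independent of i)."""
    diff = x - mu_i
    return (-0.5 * diff @ np.linalg.inv(sigma_i) @ diff
            - 0.5 * np.log(np.linalg.det(sigma_i))
            + np.log(prior_i))

# Hypothetical two-class example with different covariances.
x = np.array([1.0, 1.0])
g1 = discriminant(x, np.array([0.0, 0.0]), np.eye(2), 0.6)
g2 = discriminant(x, np.array([2.0, 2.0]), 2 * np.eye(2), 0.4)
print("decide class 1" if g1 > g2 else "decide class 2")
```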

  12. Discriminant functions for normal density: case $\Sigma_i = \sigma^2 I$
     Features are statistically independent and all features have the same variance $\sigma^2$. The covariance determinant $|\Sigma_i| = \sigma^{2d}$ can be ignored, being independent of $i$, and the covariance inverse is $\Sigma_i^{-1} = (1/\sigma^2) I$. The discriminant functions become:
     $$g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \ln P(y_i)$$

  13. Discriminant functions for normal density: case $\Sigma_i = \sigma^2 I$
     Expansion of the quadratic form leads to:
     $$g_i(x) = -\frac{1}{2\sigma^2} \left[ x^t x - 2 \mu_i^t x + \mu_i^t \mu_i \right] + \ln P(y_i)$$
     Discarding terms which are independent of $i$ we obtain linear discriminant functions:
     $$g_i(x) = \underbrace{\frac{1}{\sigma^2} \mu_i^t}_{w_i^t} x \; \underbrace{- \frac{1}{2\sigma^2} \mu_i^t \mu_i + \ln P(y_i)}_{w_{i0}}$$
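A sketch computing this linear form explicitly for assumed means, priors and $\sigma^2$:

```python
import numpy as np

# Assumed parameters for the shared isotropic covariance case Sigma_i = sigma^2 I.
sigma2 = 1.5
mus = {1: np.array([0.0, 0.0]), 2: np.array([3.0, 1.0])}
priors = {1: 0.7, 2: 0.3}

def linear_discriminant(x, mu, prior):
    """g_i(x) = w_i^t x + w_i0 with w_i = mu_i / sigma^2 and
    w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(y_i)."""
    w = mu / sigma2
    w0 = -(mu @ mu) / (2 * sigma2) + np.log(prior)
    return w @ x + w0

x = np.array([1.0, 0.5])
print(max(mus, key=lambda i: linear_discriminant(x, mus[i], priors[i])))
```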

  14. Case $\Sigma_i = \sigma^2 I$: separating hyperplane
     Setting $g_i(x) = g_j(x)$ we note that the decision boundaries are pieces of hyperplanes:
     $$\underbrace{(\mu_i - \mu_j)^t}_{w^t} \Big( x - \underbrace{\Big( \tfrac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln \frac{P(y_i)}{P(y_j)} (\mu_i - \mu_j) \Big)}_{x_0} \Big) = 0$$
     The hyperplane is orthogonal to the vector $w$, hence orthogonal to the line linking the means. The hyperplane passes through $x_0$: if the prior probabilities of the classes are equal, $x_0$ is halfway between the means; otherwise, $x_0$ shifts away from the more likely mean.
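A sketch computing $w$ and $x_0$ for two classes with assumed parameters, and checking that both discriminants coincide at $x_0$, i.e. that $x_0$ lies on the decision boundary:

```python
import numpy as np

# Assumed parameters, continuing the Sigma_i = sigma^2 I case.
sigma2 = 1.5
mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 1.0])
p_i, p_j = 0.7, 0.3

w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / np.sum(w ** 2) * np.log(p_i / p_j) * w

def g(x, mu, prior):
    """Linear discriminant for the sigma^2 I case."""
    return mu @ x / sigma2 - (mu @ mu) / (2 * sigma2) + np.log(prior)

# x0 lies on the decision boundary: the two discriminants are equal there.
print(np.isclose(g(x0, mu_i, p_i), g(x0, mu_j, p_j)))  # True
```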

  15. Case $\Sigma_i = \sigma^2 I$
     [Figure: class-conditional densities $p(x|\omega_i)$ in one, two and three dimensions with equal priors $P(\omega_1) = P(\omega_2) = 0.5$; the decision boundary between $R_1$ and $R_2$ lies halfway between the means]

  16. Case $\Sigma_i = \sigma^2 I$
     [Figure: the same class-conditional densities with unequal priors ($P(\omega_1) = .7, .9, .8, .99$); the decision boundary between $R_1$ and $R_2$ shifts away from the more likely mean]

  17. Case $\Sigma_i = \Sigma$
     [Figure: classes sharing the same covariance $\Sigma$, with priors $P(\omega_1)/P(\omega_2)$ equal to $.5/.5$ and $.1/.9$; the linear decision boundaries between $R_1$ and $R_2$ are not orthogonal to the line linking the means]

  18. Case $\Sigma_i$ = arbitrary

  19. Appendix: additional reference material

  20. Case $\Sigma_i = \sigma^2 I$: separating hyperplane, derivation (1)
     $$g_i(x) - g_j(x) = 0$$
     $$\frac{1}{\sigma^2} \mu_i^t x - \frac{1}{2\sigma^2} \mu_i^t \mu_i + \ln P(y_i) - \frac{1}{\sigma^2} \mu_j^t x + \frac{1}{2\sigma^2} \mu_j^t \mu_j - \ln P(y_j) = 0$$
     $$(\mu_i - \mu_j)^t x - \tfrac{1}{2} (\mu_i^t \mu_i - \mu_j^t \mu_j) + \sigma^2 \ln \frac{P(y_i)}{P(y_j)} = 0$$
     $$w^t (x - x_0) = 0 \quad \text{with} \quad w = (\mu_i - \mu_j), \qquad (\mu_i - \mu_j)^t x_0 = \tfrac{1}{2} (\mu_i^t \mu_i - \mu_j^t \mu_j) - \sigma^2 \ln \frac{P(y_i)}{P(y_j)}$$

  21. Case $\Sigma_i = \sigma^2 I$: separating hyperplane, derivation (2)
     $$(\mu_i - \mu_j)^t x_0 = \tfrac{1}{2} (\mu_i^t \mu_i - \mu_j^t \mu_j) - \sigma^2 \ln \frac{P(y_i)}{P(y_j)}$$
     Using
     $$(\mu_i^t \mu_i - \mu_j^t \mu_j) = (\mu_i - \mu_j)^t (\mu_i + \mu_j) \qquad \text{and} \qquad \ln \frac{P(y_i)}{P(y_j)} = \frac{(\mu_i - \mu_j)^t (\mu_i - \mu_j)}{\|\mu_i - \mu_j\|^2} \ln \frac{P(y_i)}{P(y_j)}$$
     we obtain:
     $$x_0 = \tfrac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln \frac{P(y_i)}{P(y_j)} (\mu_i - \mu_j)$$

  22. Case $\Sigma_i = \sigma^2 I$
     [Figure: same plots as slide 16, class-conditional densities with unequal priors and the correspondingly shifted decision boundaries]

  23. Discriminant functions for normal density: case $\Sigma_i = \Sigma$
     All classes have the same covariance matrix. The discriminant functions become:
     $$g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma^{-1} (x - \mu_i) + \ln P(y_i)$$
     Expanding the quadratic form and discarding terms independent of $i$ we again obtain linear discriminant functions:
     $$g_i(x) = \underbrace{\mu_i^t \Sigma^{-1}}_{w_i^t} x \; \underbrace{- \frac{1}{2} \mu_i^t \Sigma^{-1} \mu_i + \ln P(y_i)}_{w_{i0}}$$
     The separating hyperplanes are not necessarily orthogonal to the line linking the means:
     $$\underbrace{(\mu_i - \mu_j)^t \Sigma^{-1}}_{w^t} \Big( x - \underbrace{\Big( \tfrac{1}{2} (\mu_i + \mu_j) - \frac{\ln P(y_i)/P(y_j)}{(\mu_i - \mu_j)^t \Sigma^{-1} (\mu_i - \mu_j)} (\mu_i - \mu_j) \Big)}_{x_0} \Big) = 0$$
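A sketch for this shared-covariance case with an assumed $\Sigma$, means and priors, computing $w_i = \Sigma^{-1} \mu_i$ and the bias term, and showing that the boundary normal $\Sigma^{-1}(\mu_i - \mu_j)$ is in general not parallel to $\mu_i - \mu_j$:

```python
import numpy as np

# Assumed parameters for the shared-covariance case Sigma_i = Sigma.
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
sigma_inv = np.linalg.inv(sigma)
mus = {1: np.array([0.0, 0.0]), 2: np.array([3.0, 1.0])}
priors = {1: 0.6, 2: 0.4}

def g(x, mu, prior):
    """g_i(x) = w_i^t x + w_i0 with w_i = Sigma^{-1} mu_i and
    w_i0 = -1/2 mu_i^t Sigma^{-1} mu_i + ln P(y_i)."""
    w = sigma_inv @ mu
    w0 = -0.5 * (mu @ sigma_inv @ mu) + np.log(prior)
    return w @ x + w0

# The boundary normal Sigma^{-1} (mu_1 - mu_2) is generally not parallel to (mu_1 - mu_2),
# so the hyperplane need not be orthogonal to the line linking the means.
print("boundary normal:", sigma_inv @ (mus[1] - mus[2]))
print("decision at [1, 1]:", max(mus, key=lambda i: g(np.array([1.0, 1.0]), mus[i], priors[i])))
```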
