Bayesian decision theory
Andrea Passerini (passerini@disi.unitn.it)
Machine Learning
Introduction: Overview
Bayesian decision theory allows us to take optimal decisions in a fully probabilistic setting. It assumes that all relevant probabilities are known. It characterizes the lowest achievable error (the Bayes error), against which classifiers can be evaluated. Bayesian reasoning can be generalized to cases in which the probabilistic structure is not entirely known.
Input-output pair: Binary classification
Assume examples (x, y) ∈ X × {−1, 1} are drawn from a known distribution p(x, y). The task is predicting the class y of an example given the input x. Bayes rule allows us to write it in probabilistic terms as:
$$P(y|x) = \frac{p(x|y)\,P(y)}{p(x)}$$
Output given input: Bayes rule
Bayes rule allows us to compute the posterior probability given the likelihood, the prior, and the evidence:
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
The posterior P(y|x) is the probability that the class is y given that x was observed.
The likelihood p(x|y) is the probability of observing x given that its class is y.
The prior P(y) is the prior probability of the class, before seeing any evidence.
The evidence p(x) is the probability of the observation; by the law of total probability it can be computed as:
$$p(x) = \sum_{i=1}^{2} p(x|y_i)\,P(y_i)$$
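As a minimal sketch of how these quantities combine (not part of the lecture), the snippet below computes the posterior for a two-class problem; the Gaussian likelihoods and the priors are illustrative assumptions.

```python
from scipy.stats import norm

# Illustrative 1D class-conditional densities and priors (assumed, not from the lecture)
priors = {+1: 0.6, -1: 0.4}
likelihood = {+1: norm(loc=2.0, scale=1.0), -1: norm(loc=-1.0, scale=1.5)}

def posterior(y, x):
    """P(y|x) = p(x|y) P(y) / p(x), with p(x) from the law of total probability."""
    evidence = sum(likelihood[c].pdf(x) * priors[c] for c in priors)
    return likelihood[y].pdf(x) * priors[y] / evidence

x = 0.5
print(posterior(+1, x), posterior(-1, x))  # the two posteriors sum to 1
```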
Expected error: Probability of error
Probability of error given x:
$$P(\text{error}|x) = \begin{cases} P(y_2|x) & \text{if we decide } y_1 \\ P(y_1|x) & \text{if we decide } y_2 \end{cases}$$
Average probability of error:
$$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error}|x)\,p(x)\,dx$$
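A small numerical sketch, under assumed one-dimensional Gaussian class-conditionals: since the optimal rule picks the class with the larger joint p(x|y)P(y), the integrand P(error|x)p(x) equals the smaller of the two joints, and the average error can be approximated on a grid.

```python
import numpy as np
from scipy.stats import norm

# Illustrative setup (assumed): two 1D Gaussian class-conditionals and their priors
p_pos, p_neg = 0.6, 0.4
lik_pos, lik_neg = norm(2.0, 1.0), norm(-1.0, 1.5)

xs = np.linspace(-10.0, 10.0, 20001)   # dense grid covering most of the support
dx = xs[1] - xs[0]
joint_pos = lik_pos.pdf(xs) * p_pos    # p(x, y=+1)
joint_neg = lik_neg.pdf(xs) * p_neg    # p(x, y=-1)

# The Bayes rule picks the larger joint, so P(error|x) p(x) is the smaller one
bayes_error = np.sum(np.minimum(joint_pos, joint_neg)) * dx
print(f"estimated Bayes error: {bayes_error:.4f}")
```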
Bayes decision rule
Binary case:
$$y_B = \operatorname*{argmax}_{y_i \in \{-1,1\}} P(y_i|x) = \operatorname*{argmax}_{y_i \in \{-1,1\}} p(x|y_i)\,P(y_i)$$
Multiclass case:
$$y_B = \operatorname*{argmax}_{y_i \in \{1,\dots,c\}} P(y_i|x) = \operatorname*{argmax}_{y_i \in \{1,\dots,c\}} p(x|y_i)\,P(y_i)$$
Optimal rule
The probability of error given x is:
$$P(\text{error}|x) = 1 - P(y_B|x)$$
The Bayes decision rule minimizes the probability of error.
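A minimal sketch of the multiclass rule, assuming illustrative one-dimensional Gaussian class-conditionals and priors (not taken from the slides):

```python
import numpy as np
from scipy.stats import norm

# Illustrative three-class setup (assumed)
classes = [1, 2, 3]
priors = {1: 0.5, 2: 0.3, 3: 0.2}
likelihood = {1: norm(-2, 1), 2: norm(0, 1), 3: norm(3, 2)}

def bayes_decision(x):
    """y_B = argmax_i p(x|y_i) P(y_i); maximising the joint is equivalent to
    maximising the posterior because the evidence p(x) does not depend on i."""
    return max(classes, key=lambda c: likelihood[c].pdf(x) * priors[c])

def error_prob(x):
    """P(error|x) = 1 - P(y_B|x)."""
    joint = {c: likelihood[c].pdf(x) * priors[c] for c in classes}
    return 1 - max(joint.values()) / sum(joint.values())

print(bayes_decision(0.5), error_prob(0.5))
```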
Representing classifiers: Discriminant functions
A classifier can be represented as a set of discriminant functions g_i(x), i ∈ 1, ..., c, giving:
$$y = \operatorname*{argmax}_{i \in 1,\dots,c} g_i(x)$$
A discriminant function is not unique ⇒ the most convenient one for computational or explanatory reasons can be used:
$$g_i(x) = P(y_i|x) = \frac{p(x|y_i)\,P(y_i)}{p(x)}$$
$$g_i(x) = p(x|y_i)\,P(y_i)$$
$$g_i(x) = \ln p(x|y_i) + \ln P(y_i)$$
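The following sketch, with assumed densities and priors, checks numerically that the three discriminant functions above yield the same decision, since they are monotonically related:

```python
import numpy as np
from scipy.stats import norm

# Illustrative binary problem (assumed densities and priors)
priors = {0: 0.7, 1: 0.3}
likelihood = {0: norm(0, 1), 1: norm(2, 1)}

def g_posterior(i, x):   # g_i(x) = P(y_i|x)
    evidence = sum(likelihood[c].pdf(x) * priors[c] for c in priors)
    return likelihood[i].pdf(x) * priors[i] / evidence

def g_joint(i, x):       # g_i(x) = p(x|y_i) P(y_i)
    return likelihood[i].pdf(x) * priors[i]

def g_log(i, x):         # g_i(x) = ln p(x|y_i) + ln P(y_i)
    return likelihood[i].logpdf(x) + np.log(priors[i])

# All three give the same decision for every x
for x in (-1.0, 0.8, 3.0):
    decisions = {max(priors, key=lambda i: g(i, x)) for g in (g_posterior, g_joint, g_log)}
    assert len(decisions) == 1
```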
Representing classifiers
[Figure: joint densities p(x|ω_1)P(ω_1) and p(x|ω_2)P(ω_2) over a two-dimensional feature space, with decision regions R_1, R_2 and the decision boundary between them.]
Decision regions
The feature space is divided into decision regions R_1, ..., R_c such that:
x ∈ R_i if g_i(x) > g_j(x) ∀ j ≠ i
Decision regions are separated by decision boundaries, regions in which ties occur among the largest discriminant functions.
Normal density
Multivariate normal density:
$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^t \Sigma^{-1} (x-\mu)\right)$$
The covariance matrix Σ is always symmetric and positive semi-definite. It is strictly positive definite when the distribution truly spans all d dimensions of the feature space (otherwise |Σ| = 0 and the density is degenerate).
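A short sketch that evaluates the density directly from the formula and compares it with scipy.stats.multivariate_normal; the values of µ and Σ are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters (assumed): a 2D Gaussian
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # symmetric, positive definite

def gaussian_pdf(x, mu, Sigma):
    """p(x) = exp(-1/2 (x-mu)^t Sigma^{-1} (x-mu)) / ((2 pi)^{d/2} |Sigma|^{1/2})"""
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

x = np.array([0.0, 0.0])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).pdf(x))   # matches the manual computation
```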
Normal density
[Figure: equal-density ellipses of a two-dimensional Gaussian centred at µ in the (x_1, x_2) plane.]
Hyperellipsoids
The loci of points of constant density are hyperellipsoids of constant Mahalanobis distance from x to µ. The principal axes of such hyperellipsoids are the eigenvectors of Σ; their lengths are given by the corresponding eigenvalues.
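A minimal sketch, with an assumed covariance matrix, of how the principal axes of the equal-density hyperellipsoids and the Mahalanobis distance can be computed:

```python
import numpy as np

# Illustrative covariance (assumed)
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# Eigenvectors give the directions of the principal axes of the
# constant-density hyperellipsoids, eigenvalues their relative lengths
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh: Sigma is symmetric
print("axis directions (columns):\n", eigvecs)
print("eigenvalues:", eigvals)

# Mahalanobis distance from x to mu, constant on each hyperellipsoid
mu = np.zeros(2)
x = np.array([1.0, -1.0])
mahalanobis = np.sqrt((x - mu) @ np.linalg.solve(Sigma, x - mu))
print("Mahalanobis distance:", mahalanobis)
```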
Discriminant functions for normal density
Discriminant functions:
$$g_i(x) = \ln p(x|y_i) + \ln P(y_i) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(y_i)$$
Discarding terms which are independent of i we obtain:
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(y_i)$$
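A sketch of the reduced discriminant function above, with assumed class parameters:

```python
import numpy as np

def gaussian_discriminant(x, mu_i, Sigma_i, prior_i):
    """g_i(x) = -1/2 (x-mu_i)^t Sigma_i^{-1} (x-mu_i) - 1/2 ln|Sigma_i| + ln P(y_i)
    (the term -d/2 ln 2*pi is dropped, being independent of i)."""
    diff = x - mu_i
    return (-0.5 * diff @ np.linalg.solve(Sigma_i, diff)
            - 0.5 * np.log(np.linalg.det(Sigma_i))
            + np.log(prior_i))

# Illustrative two-class 2D problem (assumed parameters)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.eye(2), np.array([[1.5, 0.3], [0.3, 0.8]])]
priors = [0.6, 0.4]

x = np.array([1.0, 0.5])
scores = [gaussian_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print("predicted class:", int(np.argmax(scores)))
```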
Discriminant functions for normal density: case Σ_i = σ²I
Features are statistically independent and all features have the same variance σ². The covariance determinant |Σ_i| = σ^{2d} can be ignored, being independent of i. The covariance inverse is given by Σ_i^{-1} = (1/σ²)I. The discriminant functions become:
$$g_i(x) = -\frac{\lVert x - \mu_i \rVert^2}{2\sigma^2} + \ln P(y_i)$$
Discriminant functions for normal density: case Σ_i = σ²I
Expansion of the quadratic form leads to:
$$g_i(x) = -\frac{1}{2\sigma^2}\left[x^t x - 2\mu_i^t x + \mu_i^t\mu_i\right] + \ln P(y_i)$$
Discarding terms which are independent of i we obtain linear discriminant functions:
$$g_i(x) = \underbrace{\frac{1}{\sigma^2}\mu_i^t}_{w_i^t}\,x\;\underbrace{-\,\frac{1}{2\sigma^2}\mu_i^t\mu_i + \ln P(y_i)}_{w_{i0}}$$
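A minimal sketch of the resulting linear classifier, with illustrative means, priors, and σ² (assumed values):

```python
import numpy as np

# Illustrative parameters (assumed): shared isotropic covariance sigma^2 I
sigma2 = 1.5
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
priors = [0.5, 0.5]

# Linear discriminant g_i(x) = w_i^t x + w_i0 with
#   w_i  = mu_i / sigma^2
#   w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(y_i)
ws = [mu / sigma2 for mu in mus]
w0s = [-mu @ mu / (2 * sigma2) + np.log(p) for mu, p in zip(mus, priors)]

def g(i, x):
    return ws[i] @ x + w0s[i]

x = np.array([1.0, 0.2])
print("predicted class:", int(np.argmax([g(0, x), g(1, x)])))
```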
case Σ_i = σ²I: Separating hyperplane
Setting g_i(x) = g_j(x) we note that the decision boundaries are pieces of hyperplanes:
$$\underbrace{(\mu_i - \mu_j)^t}_{w^t}\Big(x - \underbrace{\Big[\tfrac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\lVert\mu_i - \mu_j\rVert^2}\,\ln\frac{P(y_i)}{P(y_j)}\,(\mu_i - \mu_j)\Big]}_{x_0}\Big) = 0$$
The hyperplane is orthogonal to the vector w ⇒ orthogonal to the line linking the means. The hyperplane passes through x_0: if the prior probabilities of the classes are equal, x_0 is halfway between the means; otherwise, x_0 shifts away from the more likely mean.
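The sketch below computes w and x_0 for assumed parameters; with equal priors the logarithmic term vanishes and x_0 is the midpoint of the means.

```python
import numpy as np

# Illustrative parameters (assumed)
sigma2 = 1.0
mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 1.0])
P_i, P_j = 0.7, 0.3

# Decision boundary w^t (x - x_0) = 0 with
#   w   = mu_i - mu_j
#   x_0 = 1/2 (mu_i + mu_j) - sigma^2 / ||mu_i - mu_j||^2 * ln(P_i/P_j) * (mu_i - mu_j)
w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / (w @ w) * np.log(P_i / P_j) * w
print("w =", w, " x_0 =", x0)
# Since P_i > P_j here, x_0 moves toward mu_j, i.e. away from the more likely mean.
```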
case Σ_i = σ²I
[Figure: class-conditional densities p(x|ω_i) with equal priors P(ω_1) = P(ω_2) = .5, in one, two, and three dimensions; the decision boundary between R_1 and R_2 lies halfway between the means.]
case Σ_i = σ²I
[Figure: the same setting with unequal priors, e.g. P(ω_1) = .7, .8, .9, .99 versus P(ω_2) = .3, .2, .1, .01; the decision boundary between R_1 and R_2 shifts away from the more likely mean.]
case Σ_i = Σ
[Figure: two classes sharing the same covariance Σ, shown for equal priors (P(ω_1) = P(ω_2) = .5) and for unequal priors (P(ω_1) = .1, P(ω_2) = .9); the linear boundary between R_1 and R_2 shifts away from the more likely mean.]
case Σ_i = arbitrary
[Figure: decision boundaries in the general case of class-specific covariance matrices.]
Appendix
Additional reference material
case Σ_i = σ²I: Separating hyperplane, derivation (1)
$$g_i(x) - g_j(x) = 0$$
$$\frac{1}{\sigma^2}\mu_i^t x - \frac{1}{2\sigma^2}\mu_i^t\mu_i + \ln P(y_i) - \frac{1}{\sigma^2}\mu_j^t x + \frac{1}{2\sigma^2}\mu_j^t\mu_j - \ln P(y_j) = 0$$
$$(\mu_i - \mu_j)^t x - \frac{1}{2}(\mu_i^t\mu_i - \mu_j^t\mu_j) + \sigma^2\ln\frac{P(y_i)}{P(y_j)} = 0$$
$$w^t(x - x_0) = 0$$
with
$$w = \mu_i - \mu_j \qquad (\mu_i - \mu_j)^t x_0 = \frac{1}{2}(\mu_i^t\mu_i - \mu_j^t\mu_j) - \sigma^2\ln\frac{P(y_i)}{P(y_j)}$$
case Σ_i = σ²I: Separating hyperplane, derivation (2)
$$(\mu_i - \mu_j)^t x_0 = \frac{1}{2}(\mu_i^t\mu_i - \mu_j^t\mu_j) - \sigma^2\ln\frac{P(y_i)}{P(y_j)}$$
Using
$$\mu_i^t\mu_i - \mu_j^t\mu_j = (\mu_i - \mu_j)^t(\mu_i + \mu_j)$$
and
$$\ln\frac{P(y_i)}{P(y_j)} = \frac{(\mu_i - \mu_j)^t(\mu_i - \mu_j)}{\lVert\mu_i - \mu_j\rVert^2}\,\ln\frac{P(y_i)}{P(y_j)}$$
we obtain
$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\lVert\mu_i - \mu_j\rVert^2}\,\ln\frac{P(y_i)}{P(y_j)}\,(\mu_i - \mu_j)$$
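As a numerical sanity check of the derivation (with assumed parameters), the point x_0 obtained above makes the two linear discriminants coincide:

```python
import numpy as np

# Illustrative parameters (assumed), used to check that g_i(x_0) = g_j(x_0)
sigma2 = 2.0
mu_i, mu_j = np.array([1.0, -1.0]), np.array([-2.0, 3.0])
P_i, P_j = 0.6, 0.4

def g(x, mu, prior):
    # Linear discriminant for the sigma^2 I case
    return mu @ x / sigma2 - mu @ mu / (2 * sigma2) + np.log(prior)

w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / (w @ w) * np.log(P_i / P_j) * w

# x_0 lies on the decision boundary: the two discriminants coincide there
assert np.isclose(g(x0, mu_i, P_i), g(x0, mu_j, P_j))
print("g_i(x_0) = g_j(x_0) =", g(x0, mu_i, P_i))
```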
Discriminant functions for normal density: case Σ_i = Σ
All classes have the same covariance matrix. The discriminant functions become:
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma^{-1}(x-\mu_i) + \ln P(y_i)$$
Expanding the quadratic form and discarding terms independent of i we again obtain linear discriminant functions:
$$g_i(x) = \underbrace{\mu_i^t \Sigma^{-1}}_{w_i^t}\,x\;\underbrace{-\,\frac{1}{2}\mu_i^t \Sigma^{-1}\mu_i + \ln P(y_i)}_{w_{i0}}$$
The separating hyperplanes are not necessarily orthogonal to the line linking the means:
$$\underbrace{(\mu_i - \mu_j)^t \Sigma^{-1}}_{w^t}\Big(x - \underbrace{\Big[\tfrac{1}{2}(\mu_i + \mu_j) - \frac{\ln P(y_i)/P(y_j)}{(\mu_i - \mu_j)^t\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)\Big]}_{x_0}\Big) = 0$$
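A sketch with an assumed shared covariance, showing that the boundary normal Σ^{-1}(µ_i − µ_j) is in general not parallel to µ_i − µ_j and that x_0 lies on the boundary:

```python
import numpy as np

# Illustrative parameters (assumed): shared, non-isotropic covariance
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 1.0])
P_i, P_j = 0.7, 0.3

Sigma_inv = np.linalg.inv(Sigma)

# Linear discriminants: w_i = Sigma^{-1} mu_i, w_i0 = -1/2 mu_i^t Sigma^{-1} mu_i + ln P(y_i)
def g(x, mu, prior):
    return mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(prior)

# Boundary (mu_i - mu_j)^t Sigma^{-1} (x - x_0) = 0: its normal is Sigma^{-1}(mu_i - mu_j),
# which need not be parallel to mu_i - mu_j, so the hyperplane is not necessarily
# orthogonal to the line linking the means.
diff = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - np.log(P_i / P_j) / (diff @ Sigma_inv @ diff) * diff
normal = Sigma_inv @ diff
print("boundary normal:", normal, " x_0:", x0)
assert np.isclose(g(x0, mu_i, P_i), g(x0, mu_j, P_j))   # x_0 lies on the boundary
```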