Classification based on Bayes decision theory
Machine Learning
Hamid Beigy
Sharif University of Technology
Fall 1393
Outline

1 Introduction
2 Bayes decision theory
  Minimizing the classification error probability
  Minimizing the average risk
  Discriminant function and decision surface
  Bayesian classifiers for Normally distributed classes
  Minimum distance classifier
  Bayesian classifiers for independent binary features
  Supervised learning of the Bayesian classifiers
3 Parametric methods for density estimation
  Maximum likelihood parameter estimation
  Bayesian estimation
  Maximum a posteriori estimation
  Mixture models for density estimation
4 Nonparametric methods for density estimation
  Histogram estimator
  Naive estimator
  Kernel estimator
  k-Nearest neighbor estimator
  k-Nearest neighbor classifier
  Naive Bayes classifier
Introduction

In classification, the goal is to find a mapping from inputs X to outputs t, given a labeled set of input-output pairs S = {(x_1, t_1), (x_2, t_2), ..., (x_N, t_N)}. S is called the training set.
In the simplest setting, each training input x is a D-dimensional vector of numbers. Each component of x is called a feature, attribute, or variable, and x itself is called a feature vector.
The output t takes values in {1, 2, ..., C}, where C is the number of classes.
When C = 2, the problem is called binary classification. In this case, we often assume that t ∈ {−1, +1} or t ∈ {0, 1}.
When C > 2, the problem is called multi-class classification.
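As an illustration (not part of the original slides), such a training set with N = 4 examples and D = 2 features can be stored as an N × D matrix of feature vectors together with a length-N label vector; all values below are made up:

```python
import numpy as np

# Hypothetical training set: N = 4 samples, D = 2 features each.
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [5.0, 8.0],
              [6.0, 9.0]])   # feature vectors x_n, one per row
t = np.array([0, 0, 1, 1])   # class labels t_n in {0, 1}

N, D = X.shape  # N = 4 training pairs, D = 2 features
```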
Introduction (cont.)

Bayes theorem:
p(C_k | X) = \frac{p(X | C_k) p(C_k)}{p(X)} = \frac{p(X | C_k) p(C_k)}{\sum_j p(X | C_j) p(C_j)}
p(C_k) is called the prior of C_k.
p(X | C_k) is called the likelihood of the data.
p(C_k | X) is called the posterior probability.
Since p(X) is the same for all classes, we can write
p(C_k | X) \propto p(X | C_k) p(C_k)
Approaches for building a classifier:
Generative approach: first build a joint model of the form p(x, C_k), then condition on x to derive p(C_k | x).
Discriminative approach: model p(C_k | x) directly.
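A minimal sketch of how the posterior follows from the prior and the class-conditional likelihoods, assuming one-dimensional Gaussian class-conditional densities (the densities and priors below are illustrative, not taken from the slides):

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities p(x | C_k) and priors p(C_k) for two classes.
priors = np.array([0.5, 0.5])
likelihood_fns = [norm(loc=0.0, scale=1.0).pdf,   # p(x | C_1)
                  norm(loc=2.0, scale=1.0).pdf]   # p(x | C_2)

def posterior(x):
    """Return p(C_k | x) for k = 1, 2 via Bayes' theorem."""
    joint = np.array([f(x) for f in likelihood_fns]) * priors  # p(x | C_k) p(C_k)
    return joint / joint.sum()   # divide by p(x) = sum_j p(x | C_j) p(C_j)

print(posterior(0.5))  # the two posteriors sum to 1
```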
Bayes decision theory

Given a classification task with M classes, C_1, C_2, ..., C_M, and an input vector x, we can form the M conditional probabilities
p(C_k | x)   ∀ k = 1, 2, ..., M
Without loss of generality, consider the two-class classification problem. From Bayes' theorem, we have
p(C_k | x) \propto p(x | C_k) p(C_k)
The Bayes classification rule is
if p(C_1 | x) > p(C_2 | x) then x is classified to C_1
if p(C_1 | x) < p(C_2 | x) then x is classified to C_2
if p(C_1 | x) = p(C_2 | x) then x is classified to either C_1 or C_2
Since p(x) is the same for all classes, it can be dropped. Hence
p(x | C_1) p(C_1) ≶ p(x | C_2) p(C_2)
If p(C_1) = p(C_2) = 1/2, then we have
p(x | C_1) ≶ p(x | C_2)
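As a sketch of this two-class rule, assume equal priors and one-dimensional Gaussian class-conditional densities (all numbers are illustrative, not from the slides); the rule then reduces to comparing p(x | C_1) p(C_1) against p(x | C_2) p(C_2):

```python
from scipy.stats import norm

# Assumed two-class setup: equal priors and 1-D Gaussian class-conditional densities.
p_C1, p_C2 = 0.5, 0.5
p_x_given_C1 = norm(loc=0.0, scale=1.0).pdf
p_x_given_C2 = norm(loc=2.0, scale=1.0).pdf

def bayes_decide(x):
    """Classify x by comparing p(x | C_1) p(C_1) with p(x | C_2) p(C_2)."""
    s1 = p_x_given_C1(x) * p_C1
    s2 = p_x_given_C2(x) * p_C2
    return "C1" if s1 > s2 else "C2"   # a tie may be broken either way

print(bayes_decide(0.3))  # closer to the C_1 mean -> C1
print(bayes_decide(1.7))  # closer to the C_2 mean -> C2
```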
Bayes decision theory (cont.)

If p(C_1) = p(C_2) = 1/2, the rule reduces to comparing p(x | C_1) with p(x | C_2) (see the figure of the two class-conditional densities and the decision regions R_1 and R_2). The coloured region of that figure is where errors may occur.
The probability of error equals
P_e = p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1) = \frac{1}{2} \int_{R_1} p(x | C_2)\,dx + \frac{1}{2} \int_{R_2} p(x | C_1)\,dx
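To make the error integral concrete, here is a short numerical check under the same illustrative assumptions (equal priors, p(x | C_1) = N(0, 1), p(x | C_2) = N(2, 1)); with equal priors and equal variances the decision boundary lies midway between the means:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Assumed setup: equal priors, N(0, 1) for C_1 and N(2, 1) for C_2.
p1 = norm(loc=0.0, scale=1.0)
p2 = norm(loc=2.0, scale=1.0)
boundary = 1.0  # R_1 = (-inf, 1), R_2 = (1, inf) for this illustrative choice

# P_e = 1/2 * integral over R_1 of p(x|C_2) dx + 1/2 * integral over R_2 of p(x|C_1) dx
err_R1, _ = quad(p2.pdf, -np.inf, boundary)
err_R2, _ = quad(p1.pdf, boundary, np.inf)
P_e = 0.5 * err_R1 + 0.5 * err_R2
print(P_e)  # ~0.1587 for these assumed densities
```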
Minimizing the classification error probability

We now show that the Bayesian classifier is optimal with respect to minimizing the classification error probability.
Let R_1 (R_2) be the region of the feature space in which we decide in favor of C_1 (C_2). Then an error is made if x ∈ R_1 although it belongs to C_2, or if x ∈ R_2 although it belongs to C_1. That is,
P_e = p(x ∈ R_2, C_1) + p(x ∈ R_1, C_2)
    = p(C_1)\, p(x ∈ R_2 | C_1) + p(C_2)\, p(x ∈ R_1 | C_2)
    = p(C_1) \int_{R_2} p(x | C_1)\,dx + p(C_2) \int_{R_1} p(x | C_2)\,dx
    = \int_{R_2} p(C_1 | x) p(x)\,dx + \int_{R_1} p(C_2 | x) p(x)\,dx
Since R_1 ∪ R_2 covers all of the feature space, from the definition of a probability density function we have
p(C_1) = \int_{R_1} p(C_1 | x) p(x)\,dx + \int_{R_2} p(C_1 | x) p(x)\,dx
By combining these two equations, we obtain
P_e = p(C_1) − \int_{R_1} [p(C_1 | x) − p(C_2 | x)]\, p(x)\,dx
Minimizing the classification error probability (cont.)

The probability of error equals
P_e = p(C_1) − \int_{R_1} [p(C_1 | x) − p(C_2 | x)]\, p(x)\,dx
The probability of error is minimized if R_1 is the region of the space in which
p(C_1 | x) − p(C_2 | x) > 0
Then R_2 becomes the region where the reverse is true, i.e. the region of the space in which
p(C_1 | x) − p(C_2 | x) < 0
This completes the proof of the theorem.
For a classification task with M classes, x is assigned to class C_k with the following rule:
if p(C_k | x) > p(C_j | x)   ∀ j ≠ k
Exercise: Show that this rule also minimizes the classification error probability for a classification task with M classes.
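A sketch of the M-class rule: assign x to the class with the largest posterior, which by Bayes' theorem is the class maximizing p(x | C_k) p(C_k). The priors and Gaussian class-conditional densities below are placeholders for illustration only:

```python
import numpy as np
from scipy.stats import norm

# Assumed M = 3 classes with 1-D Gaussian class-conditional densities and given priors.
priors = np.array([0.3, 0.3, 0.4])
densities = [norm(loc=-2.0, scale=1.0),
             norm(loc=0.0, scale=1.0),
             norm(loc=3.0, scale=1.5)]

def classify(x):
    """Assign x to the class C_k with the largest p(C_k | x), i.e. argmax_k p(x | C_k) p(C_k)."""
    scores = np.array([d.pdf(x) for d in densities]) * priors
    return int(np.argmax(scores)) + 1   # class index k in {1, ..., M}

print(classify(-1.5))  # -> 1 for these assumed densities
print(classify(2.5))   # -> 3 for these assumed densities
```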