Probabilistic Classification
CE-717: Machine Learning, Sharif University of Technology
M. Soleymani, Fall 2016
Topics:
- Probabilistic approach
- Bayes decision theory
- Generative models: Gaussian Bayes classifier, Naïve Bayes
- Discriminative models: logistic regression
Classification problem, probabilistic view: each feature is a random variable, and the class label is also a random variable. We observe the feature values for a random sample and want to infer its class label. Evidence: feature vector $\mathbf{x}$. Query: class label $y$.
Definitions:
- Posterior probability: $p(\mathcal{C}_k|\mathbf{x})$
- Likelihood (class-conditional) probability: $p(\mathbf{x}|\mathcal{C}_k)$, the pdf of feature vector $\mathbf{x}$ for samples of class $\mathcal{C}_k$
- Prior probability: $p(\mathcal{C}_k)$, the probability that the label is $\mathcal{C}_k$
- Evidence: $p(\mathbf{x})$, the pdf of feature vector $\mathbf{x}$, where $p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x}|\mathcal{C}_k)\, p(\mathcal{C}_k)$
Bayes decision rule ($K = 2$): if $P(\mathcal{C}_1|\mathbf{x}) > P(\mathcal{C}_2|\mathbf{x})$ decide $\mathcal{C}_1$; otherwise decide $\mathcal{C}_2$. Then
$$P(\text{error}|\mathbf{x}) = \begin{cases} P(\mathcal{C}_2|\mathbf{x}) & \text{if we decide } \mathcal{C}_1 \\ P(\mathcal{C}_1|\mathbf{x}) & \text{if we decide } \mathcal{C}_2 \end{cases}$$
If we use the Bayes decision rule, $P(\text{error}|\mathbf{x}) = \min\{P(\mathcal{C}_1|\mathbf{x}), P(\mathcal{C}_2|\mathbf{x})\}$. Under this rule, $P(\text{error}|\mathbf{x})$ is as small as possible for each $\mathbf{x}$, so the rule minimizes the overall probability of error.
Optimal classifier: the optimal decision is the one that minimizes the expected number of mistakes. We show that the Bayes classifier is optimal in this sense.
Bayes decision rule: minimizing the misclassification rate. Decision regions: $\mathcal{R}_k = \{\mathbf{x} \mid \alpha(\mathbf{x}) = k\}$; all points in $\mathcal{R}_k$ are assigned to class $\mathcal{C}_k$. For $K = 2$:
$$p(\text{error}) = E_{\mathbf{x},y}\left[I(\alpha(\mathbf{x}) \neq y)\right] = p(\mathbf{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x} \in \mathcal{R}_2, \mathcal{C}_1)$$
$$= \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\, d\mathbf{x} = \int_{\mathcal{R}_1} p(\mathcal{C}_2|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathcal{C}_1|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$$
To minimize this, choose the class with the highest $p(\mathcal{C}_k|\mathbf{x})$ as $\alpha(\mathbf{x})$.
Bayes minimum error classifier (zero-one loss):
$$\min_{\alpha(\cdot)} E_{\mathbf{x},y}\left[I(\alpha(\mathbf{x}) \neq y)\right]$$
If we know the probabilities in advance, this optimization problem is solved easily:
$$\alpha(\mathbf{x}) = \arg\max_{y} p(y|\mathbf{x})$$
In practice, we estimate $p(y|\mathbf{x})$ from a set of training samples.
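When the densities and priors are known, the rule is straightforward to implement. Below is a minimal sketch in Python; the two Gaussian class-conditionals and the priors are illustrative assumptions, not values from the lecture:

```python
# A minimal sketch of the Bayes minimum-error rule for two classes,
# assuming known densities. The Gaussian parameters and priors are
# illustrative assumptions, not values from the lecture.
from scipy.stats import norm

priors = [2 / 3, 1 / 3]                         # P(C_1), P(C_2)
likelihoods = [norm(0.0, 1.0), norm(2.0, 1.0)]  # p(x|C_1), p(x|C_2)

def bayes_decide(x):
    """Return the class (1 or 2) with the highest posterior P(C_k|x).

    p(x) is a common denominator, so comparing p(x|C_k) * P(C_k)
    is equivalent to comparing the posteriors themselves.
    """
    scores = [lik.pdf(x) * pr for lik, pr in zip(likelihoods, priors)]
    return 1 + scores.index(max(scores))

print(bayes_decide(0.3))  # closer to mu_1 = 0 -> class 1
print(bayes_decide(2.5))  # closer to mu_2 = 2 -> class 2
```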
Bayes' theorem:
$$p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})} \qquad \left(\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}\right)$$
Here $p(\mathcal{C}_k|\mathbf{x})$ is the posterior, $p(\mathbf{x}|\mathcal{C}_k)$ the likelihood (the class-conditional pdf of $\mathbf{x}$ for samples of class $\mathcal{C}_k$), $p(\mathcal{C}_k)$ the prior, and $p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x}|\mathcal{C}_k)\, p(\mathcal{C}_k)$ the evidence, as defined above.
Bayes decision rule, example: choose the class with the highest $p(\mathcal{C}_k|x)$, where
$$p(\mathcal{C}_k|x) = \frac{p(x|\mathcal{C}_k)\, p(\mathcal{C}_k)}{p(x)}, \qquad p(x) = p(\mathcal{C}_1)\, p(x|\mathcal{C}_1) + p(\mathcal{C}_2)\, p(x|\mathcal{C}_2)$$
with priors $p(\mathcal{C}_1) = 2/3$ and $p(\mathcal{C}_2) = 1/3$.
[Figure: class-conditional densities $p(x|\mathcal{C}_1)$, $p(x|\mathcal{C}_2)$ and the resulting posteriors $p(\mathcal{C}_1|x)$, $p(\mathcal{C}_2|x)$, with the decision regions $\mathcal{R}_1$, $\mathcal{R}_2$ marked.]
Bayesian decision rule, equivalent forms:
- If $P(\mathcal{C}_1|\mathbf{x}) > P(\mathcal{C}_2|\mathbf{x})$ decide $\mathcal{C}_1$; otherwise decide $\mathcal{C}_2$.
- Equivalently: if $\dfrac{p(\mathbf{x}|\mathcal{C}_1)\, P(\mathcal{C}_1)}{p(\mathbf{x})} > \dfrac{p(\mathbf{x}|\mathcal{C}_2)\, P(\mathcal{C}_2)}{p(\mathbf{x})}$ decide $\mathcal{C}_1$; otherwise decide $\mathcal{C}_2$.
- Equivalently: if $p(\mathbf{x}|\mathcal{C}_1)\, P(\mathcal{C}_1) > p(\mathbf{x}|\mathcal{C}_2)\, P(\mathcal{C}_2)$ decide $\mathcal{C}_1$; otherwise decide $\mathcal{C}_2$.
Bayes decision rule, example (continued): since $p(x)$ is common to both posteriors, comparing the posteriors is the same as comparing $p(x|\mathcal{C}_k)\, p(\mathcal{C}_k)$, i.e. comparing $2 \times p(x|\mathcal{C}_1)$ against $p(x|\mathcal{C}_2)$ for the priors $p(\mathcal{C}_1) = 2/3$, $p(\mathcal{C}_2) = 1/3$.
[Figure: the scaled likelihood $2 \times p(x|\mathcal{C}_1)$ versus $p(x|\mathcal{C}_2)$, and the corresponding posteriors, yielding the same decision regions as before.]
Bayes classifier: estimate the posterior probability of each class. What should the decision criterion be? Choose the class with the highest $p(\mathcal{C}_k|\mathbf{x})$; this is the optimal decision, the one that minimizes the expected number of mistakes.
Diabetes example. [Figure: distribution of white blood cell counts for the two classes.] (This example is adapted from Sanja Fidler's slides, University of Toronto, CSC411.)
Diabetes example: the doctor has a prior $p(y = 1) = 0.2$. The prior captures what we know about the probability of the classes in the absence of any observation. A patient comes in with white blood cell count $x$. Does the patient have diabetes, i.e. what is $p(y = 1|x)$? Given a new observation, we still need to compute the posterior.
Diabetes example 𝑞 𝑦 𝑧 = 0 𝑞 𝑦 𝑧 = 1 16 This example has been adopted from Sanja Fidler ’ s slides, University of Toronto, CSC411
Estimating probability densities from data: assume Gaussian distributions for $p(x|\mathcal{C}_1)$ and $p(x|\mathcal{C}_2)$. Recall that for samples $\{x^{(1)}, \ldots, x^{(N)}\}$, under a Gaussian assumption the MLE estimates are
$$\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x^{(n)}, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} \left(x^{(n)} - \hat{\mu}\right)^2$$
Diabetes example: fit a Gaussian to each class, e.g. for the diabetic class
$$p(x|y = 1) = \mathcal{N}(\mu_1, \sigma_1^2), \qquad \mu_1 = \frac{\sum_{n:\, y^{(n)}=1} x^{(n)}}{N_1}, \qquad \sigma_1^2 = \frac{1}{N_1} \sum_{n:\, y^{(n)}=1} \left(x^{(n)} - \mu_1\right)^2$$
where $N_1 = \sum_{n:\, y^{(n)}=1} 1$ is the number of samples in class 1.
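A sketch of this estimation in Python, using synthetic data as a stand-in for the real white-blood-cell counts; the class proportions are a hypothetical choice that mirrors the 0.2 prior above:

```python
# Per-class Gaussian fit by MLE, plus the posterior p(y=1|x) via Bayes'
# theorem. The data here are synthetic, standing in for the lecture's
# white-blood-cell counts.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(5.0, 1.0, 80),    # y = 0 (no diabetes)
                    rng.normal(8.0, 1.5, 20)])   # y = 1 (diabetes)
y = np.concatenate([np.zeros(80), np.ones(20)])

# MLE: per-class sample mean and (1/N-normalized) sample variance
mu = [x[y == k].mean() for k in (0, 1)]
sigma = [x[y == k].std() for k in (0, 1)]    # np.std divides by N by default
prior = [np.mean(y == k) for k in (0, 1)]    # 0.8 and 0.2 here

def posterior_diabetes(x_new):
    """p(y=1|x) by Bayes' theorem, with p(x) as the normalizer."""
    joint = [norm(mu[k], sigma[k]).pdf(x_new) * prior[k] for k in (0, 1)]
    return joint[1] / (joint[0] + joint[1])

print(posterior_diabetes(7.5))   # a fairly high posterior for diabetes
```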
Diabetes example: add a second observation, the plasma glucose value, so each patient is now described by a two-dimensional feature vector. [Figure: the two classes in the (white blood cell count, plasma glucose) plane.]
Generative approach for this example: multivariate Gaussian distributions for $p(\mathbf{x}|\mathcal{C}_k)$:
$$p(\mathbf{x}|y = k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \Sigma_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right\}, \qquad k = 1, 2$$
Prior distribution on the class: $p(y = 1) = \pi$, $p(y = 0) = 1 - \pi$.
MLE for a multivariate Gaussian: for samples $\{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$, the MLE estimates are
$$\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}^{(n)}, \qquad \hat{\Sigma} = \frac{1}{N} \sum_{n=1}^{N} \left(\mathbf{x}^{(n)} - \hat{\boldsymbol{\mu}}\right) \left(\mathbf{x}^{(n)} - \hat{\boldsymbol{\mu}}\right)^T$$
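These two formulas translate directly into NumPy; a small sketch (the function name and test parameters are ours):

```python
# The multivariate MLE formulas above, written directly in NumPy.
import numpy as np

def gaussian_mle(X):
    """X has shape (N, d). Returns the MLE mean (d,) and covariance (d, d)."""
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / X.shape[0]   # note 1/N, not 1/(N-1)
    return mu, Sigma

# Example: recover the parameters of a known Gaussian from samples
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.5], [0.5, 1.0]], size=5000)
mu_hat, Sigma_hat = gaussian_mle(X)
print(mu_hat, Sigma_hat, sep="\n")
```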
Generative approach, example: maximum likelihood estimation on $D = \{(\mathbf{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$ gives
$$\pi = \frac{N_1}{N}, \qquad \boldsymbol{\mu}_1 = \frac{\sum_{n=1}^{N} y^{(n)} \mathbf{x}^{(n)}}{N_1}, \qquad \boldsymbol{\mu}_2 = \frac{\sum_{n=1}^{N} (1 - y^{(n)}) \mathbf{x}^{(n)}}{N_2}$$
$$\Sigma_1 = \frac{1}{N_1} \sum_{n=1}^{N} y^{(n)} \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_1\right) \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_1\right)^T, \qquad \Sigma_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - y^{(n)}) \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_2\right) \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_2\right)^T$$
where $N_1 = \sum_{n=1}^{N} y^{(n)}$ and $N_2 = N - N_1$.
Decision boundary for the Gaussian Bayes classifier: the boundary is where $p(\mathcal{C}_1|\mathbf{x}) = p(\mathcal{C}_2|\mathbf{x})$. Using $p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})}$ and taking logs:
$$\ln p(\mathcal{C}_1|\mathbf{x}) = \ln p(\mathcal{C}_2|\mathbf{x})$$
$$\ln p(\mathbf{x}|\mathcal{C}_1) + \ln p(\mathcal{C}_1) - \ln p(\mathbf{x}) = \ln p(\mathbf{x}|\mathcal{C}_2) + \ln p(\mathcal{C}_2) - \ln p(\mathbf{x})$$
$$\ln p(\mathbf{x}|\mathcal{C}_1) + \ln p(\mathcal{C}_1) = \ln p(\mathbf{x}|\mathcal{C}_2) + \ln p(\mathcal{C}_2)$$
with
$$\ln p(\mathbf{x}|\mathcal{C}_k) = -\frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_k| - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \Sigma_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)$$
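Putting the MLE fit and the log-discriminant together gives the full classifier. A sketch under stated assumptions: the function names and the synthetic data below are ours, not the lecture's, and with per-class covariances the implied boundary is quadratic:

```python
# A sketch of the Gaussian Bayes classifier: per-class MLE, then classify
# by the log-discriminant ln p(x|C_k) + ln p(C_k).
import numpy as np

def fit_gaussian_bayes(X, y):
    """Per-class MLE of (prior, mean, covariance). X: (N, d), y: class ids."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        Sigma = (Xk - mu).T @ (Xk - mu) / len(Xk)
        params[k] = (len(Xk) / len(X), mu, Sigma)
    return params

def log_discriminant(x, prior, mu, Sigma):
    """ln p(x|C_k) + ln p(C_k), dropping the constant -(d/2) ln(2*pi)."""
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * quad - 0.5 * logdet + np.log(prior)

def predict(x, params):
    """Choose the class with the largest log-discriminant."""
    return max(params, key=lambda k: log_discriminant(x, *params[k]))

rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 60),
               rng.multivariate_normal([2, 2], [[1, 0.5], [0.5, 1]], 40)])
y = np.array([0] * 60 + [1] * 40)
print(predict(np.array([1.8, 2.1]), fit_gaussian_bayes(X, y)))  # likely 1
```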
[Figure: class-conditional densities $p(\mathbf{x}|\mathcal{C}_1)$, $p(\mathbf{x}|\mathcal{C}_2)$, the posterior $p(\mathcal{C}_1|\mathbf{x})$, and the decision boundary where $p(\mathcal{C}_1|\mathbf{x}) = p(\mathcal{C}_2|\mathbf{x})$.]
Shared covariance matrix: when the classes share a single covariance matrix $\Sigma = \Sigma_1 = \Sigma_2$:
$$p(\mathbf{x}|\mathcal{C}_k) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right\}, \qquad k = 1, 2$$
$$p(\mathcal{C}_1) = \pi, \qquad p(\mathcal{C}_2) = 1 - \pi$$
Likelihood of the data:
$$\prod_{n=1}^{N} p(\mathbf{x}^{(n)}, y^{(n)} | \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma) = \prod_{n=1}^{N} p(\mathbf{x}^{(n)} | y^{(n)}, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma)\, p(y^{(n)} | \pi)$$
Shared covariance matrix: maximum likelihood estimation on $D = \{(\mathbf{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$ gives
$$\pi = \frac{N_1}{N}, \qquad \boldsymbol{\mu}_1 = \frac{\sum_{n=1}^{N} y^{(n)} \mathbf{x}^{(n)}}{N_1}, \qquad \boldsymbol{\mu}_2 = \frac{\sum_{n=1}^{N} (1 - y^{(n)}) \mathbf{x}^{(n)}}{N_2}$$
$$\Sigma = \frac{1}{N} \left[\sum_{n \in \mathcal{C}_1} \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_1\right) \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_1\right)^T + \sum_{n \in \mathcal{C}_2} \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_2\right) \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_2\right)^T\right]$$
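A minimal sketch of these shared-covariance estimates (the function name is ours; $y = 1$ is taken to mark class $\mathcal{C}_1$, matching the formulas above):

```python
# Shared-covariance MLE: class means as before, plus a single Sigma that
# pools the scatter of both classes and divides by N.
import numpy as np

def fit_shared_covariance(X, y):
    """X: (N, d); y in {0, 1}, where y = 1 marks class C_1.

    Returns (pi, mu_1, mu_2, Sigma) per the formulas above.
    """
    X1, X2 = X[y == 1], X[y == 0]
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)   # scatter of class 1 about mu_1
    S2 = (X2 - mu2).T @ (X2 - mu2)   # scatter of class 2 about mu_2
    Sigma = (S1 + S2) / len(X)       # pooled over both classes
    return len(X1) / len(X), mu1, mu2, Sigma
```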
Decision boundary with a shared covariance matrix:
$$\ln p(\mathbf{x}|\mathcal{C}_1) + \ln p(\mathcal{C}_1) = \ln p(\mathbf{x}|\mathcal{C}_2) + \ln p(\mathcal{C}_2)$$
$$\ln p(\mathbf{x}|\mathcal{C}_k) = -\frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma| - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)$$
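Expanding this condition, a step the slide leaves implicit: because $\Sigma$ is shared, the $\ln|\Sigma|$ terms and the quadratic terms $\mathbf{x}^T \Sigma^{-1} \mathbf{x}$ cancel from both sides, leaving a linear boundary:
$$\mathbf{w}^T \mathbf{x} + w_0 = 0, \qquad \mathbf{w} = \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$
$$w_0 = -\frac{1}{2} \boldsymbol{\mu}_1^T \Sigma^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_2^T \Sigma^{-1} \boldsymbol{\mu}_2 + \ln \frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}$$
With per-class covariances (previous slides), the quadratic terms do not cancel and the boundary is quadratic.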
Bayes decision rule: multi-class misclassification rate. For a multi-class problem, the probability of error of the Bayesian decision rule is simpler to compute through the probability of a correct decision: $P(\text{error}) = 1 - P(\text{correct})$, where
$$P(\text{correct}) = \sum_{i=1}^{K} \int_{\mathcal{R}_i} p(\mathbf{x}, \mathcal{C}_i)\, d\mathbf{x} = \sum_{i=1}^{K} \int_{\mathcal{R}_i} p(\mathcal{C}_i|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$$
Here $\mathcal{R}_i$ is the subset of feature space assigned to class $\mathcal{C}_i$ by the classifier.
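The integral rarely has a closed form, but it can be estimated by Monte Carlo. A sketch under stated assumptions: the three unit-variance Gaussian classes and their priors below are illustrative, and the estimator uses the fact that the Bayes rule is correct at $\mathbf{x}$ with probability $\max_i p(\mathcal{C}_i|\mathbf{x})$:

```python
# Monte Carlo estimate of P(correct) for the Bayes rule. Averaging
# max_k p(C_k|x) over x ~ p(x) estimates sum_k \int_{R_k} p(C_k|x) p(x) dx.
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.3, 0.2])
means = np.array([0.0, 2.0, 4.0])   # three unit-variance Gaussian classes

rng = np.random.default_rng(0)
ks = rng.choice(3, size=50_000, p=priors)   # draw a class label...
xs = rng.normal(means[ks], 1.0)             # ...then x | class, so x ~ p(x)

# Posterior of each class at every sampled x (rows: classes)
joint = np.stack([norm(m, 1.0).pdf(xs) * p for m, p in zip(means, priors)])
post = joint / joint.sum(axis=0)

p_correct = post.max(axis=0).mean()
print(f"P(correct) ~ {p_correct:.3f}, P(error) ~ {1 - p_correct:.3f}")
```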
Recap, Bayes minimum error classifier (zero-one loss):
$$\min_{\alpha(\cdot)} E_{\mathbf{x},y}\left[I(\alpha(\mathbf{x}) \neq y)\right] \qquad \Rightarrow \qquad \alpha(\mathbf{x}) = \arg\max_{y} p(y|\mathbf{x})$$