Probabilistic classification - CE-717: Machine Learning, Sharif University of Technology, M. Soleymani, Fall 2016


  1. Probabilistic classification
     CE-717: Machine Learning
     Sharif University of Technology
     M. Soleymani, Fall 2016

  2. Topics
     - Probabilistic approach
     - Bayes decision theory
     - Generative models
     - Gaussian Bayes classifier
     - Naïve Bayes
     - Discriminative models
     - Logistic regression

  3. Classification problem: probabilistic view
     - Each feature is treated as a random variable.
     - The class label is also treated as a random variable.
     - We observe the feature values of a random sample and want to infer its class label.
     - Evidence: feature vector $\mathbf{x}$
     - Query: class label

  4. Definitions
     - Posterior probability: $p(\mathcal{C}_k \mid \mathbf{x})$
     - Likelihood (class-conditional probability): $p(\mathbf{x} \mid \mathcal{C}_k)$
     - Prior probability: $p(\mathcal{C}_k)$
     - $p(\mathbf{x})$: pdf of the feature vector $\mathbf{x}$, with $p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)$
     - $p(\mathbf{x} \mid \mathcal{C}_k)$: pdf of the feature vector $\mathbf{x}$ for samples of class $\mathcal{C}_k$
     - $p(\mathcal{C}_k)$: probability that the label is $\mathcal{C}_k$

  5. Bayes decision rule ($K = 2$)
     - If $P(\mathcal{C}_1 \mid \mathbf{x}) > P(\mathcal{C}_2 \mid \mathbf{x})$, decide $\mathcal{C}_1$; otherwise decide $\mathcal{C}_2$.
     - $p(\text{error} \mid \mathbf{x}) = \begin{cases} P(\mathcal{C}_2 \mid \mathbf{x}) & \text{if we decide } \mathcal{C}_1 \\ P(\mathcal{C}_1 \mid \mathbf{x}) & \text{if we decide } \mathcal{C}_2 \end{cases}$
     - If we use the Bayes decision rule: $P(\text{error} \mid \mathbf{x}) = \min\{P(\mathcal{C}_1 \mid \mathbf{x}),\, P(\mathcal{C}_2 \mid \mathbf{x})\}$
     - Using the Bayes rule, $P(\text{error} \mid \mathbf{x})$ is as small as possible for each $\mathbf{x}$, and thus this rule minimizes the probability of error.
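
A minimal sketch of this two-class rule in code, assuming the posteriors $P(\mathcal{C}_1 \mid \mathbf{x})$ are already known at a few feature values (the numbers below are purely illustrative):

```python
import numpy as np

# Hypothetical posteriors P(C1|x) at a few feature values (illustrative numbers only).
post_c1 = np.array([0.9, 0.7, 0.55, 0.4, 0.1])
post_c2 = 1.0 - post_c1

# Bayes decision rule for K = 2: pick the class with the larger posterior.
decision = np.where(post_c1 > post_c2, 1, 2)

# Conditional probability of error at each x: the posterior of the class we did NOT pick,
# which equals min{P(C1|x), P(C2|x)} under the Bayes rule.
p_error_given_x = np.minimum(post_c1, post_c2)

print(decision)          # [1 1 1 2 2]
print(p_error_given_x)   # roughly [0.1 0.3 0.45 0.4 0.1]
```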

  6. Optimal classifier
     - The optimal decision is the one that minimizes the expected number of mistakes.
     - We show that the Bayes classifier is an optimal classifier.

  7. Bayes decision rule: minimizing the misclassification rate ($K = 2$)
     - Decision regions: $\mathcal{R}_k = \{\mathbf{x} \mid \alpha(\mathbf{x}) = k\}$; all points in $\mathcal{R}_k$ are assigned to class $\mathcal{C}_k$.
     - $p(\text{error}) = E_{\mathbf{x},y}\!\left[ I(\alpha(\mathbf{x}) \neq y) \right] = p(\mathbf{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x} \in \mathcal{R}_2, \mathcal{C}_1)$
       $= \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\, d\mathbf{x} = \int_{\mathcal{R}_1} p(\mathcal{C}_2 \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathcal{C}_1 \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$
     - To minimize this, choose the class with the highest $p(\mathcal{C}_k \mid \mathbf{x})$ as $\alpha(\mathbf{x})$.

  8. Bayes minimum error
     - Bayes minimum error classifier (zero-one loss): $\min_{\alpha(\cdot)}\; E_{\mathbf{x},y}\!\left[ I(\alpha(\mathbf{x}) \neq y) \right]$
     - If we know the probabilities in advance, this optimization problem is solved easily:
       $\alpha(\mathbf{x}) = \operatorname*{argmax}_{y}\; p(y \mid \mathbf{x})$
     - In practice, we estimate $p(y \mid \mathbf{x})$ from a set of training samples $\mathcal{D}$.

  9. Bayes theorem
     - Bayes' theorem: $p(\mathcal{C}_k \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})}$   (posterior = likelihood × prior / evidence)
     - Posterior probability: $p(\mathcal{C}_k \mid \mathbf{x})$
     - Likelihood (class-conditional probability): $p(\mathbf{x} \mid \mathcal{C}_k)$
     - Prior probability: $p(\mathcal{C}_k)$
     - $p(\mathbf{x})$: pdf of the feature vector $\mathbf{x}$, with $p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)$
     - $p(\mathbf{x} \mid \mathcal{C}_k)$: pdf of the feature vector $\mathbf{x}$ for samples of class $\mathcal{C}_k$
     - $p(\mathcal{C}_k)$: probability that the label is $\mathcal{C}_k$
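
A small sketch of Bayes' theorem in code. The Gaussian class-conditional densities and the priors below are made up for illustration (the priors mirror the 2/3 and 1/3 used in the next example):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D class-conditional densities and priors (illustrative only).
prior = {1: 2/3, 2: 1/3}
likelihood = {1: norm(loc=0.0, scale=1.0), 2: norm(loc=2.0, scale=1.0)}

def posterior(x):
    """p(C_k | x) = p(x | C_k) p(C_k) / p(x), with p(x) = sum_k p(x | C_k) p(C_k)."""
    joint = {k: likelihood[k].pdf(x) * prior[k] for k in prior}
    evidence = sum(joint.values())
    return {k: joint[k] / evidence for k in joint}

print(posterior(0.5))  # class 1 dominates near its mean
print(posterior(2.5))  # class 2 dominates near its mean
```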

  10. Bayes decision rule: example
      - Bayes decision: choose the class with the highest $p(\mathcal{C}_k \mid \mathbf{x})$, where $p(\mathcal{C}_k \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})}$.
      - Priors: $p(\mathcal{C}_1) = \tfrac{2}{3}$, $p(\mathcal{C}_2) = \tfrac{1}{3}$; evidence: $p(\mathbf{x}) = p(\mathcal{C}_1)\, p(\mathbf{x} \mid \mathcal{C}_1) + p(\mathcal{C}_2)\, p(\mathbf{x} \mid \mathcal{C}_2)$.
      [Figure: class-conditional densities $p(x \mid \mathcal{C}_1)$, $p(x \mid \mathcal{C}_2)$, posteriors $p(\mathcal{C}_1 \mid x)$, $p(\mathcal{C}_2 \mid x)$, and the resulting decision regions $\mathcal{R}_1$, $\mathcal{R}_2$.]

  11. Bayesian decision rule: equivalent forms
      - If $P(\mathcal{C}_1 \mid \mathbf{x}) > P(\mathcal{C}_2 \mid \mathbf{x})$, decide $\mathcal{C}_1$; otherwise decide $\mathcal{C}_2$.
      - Equivalently: if $\dfrac{p(\mathbf{x} \mid \mathcal{C}_1)\, P(\mathcal{C}_1)}{p(\mathbf{x})} > \dfrac{p(\mathbf{x} \mid \mathcal{C}_2)\, P(\mathcal{C}_2)}{p(\mathbf{x})}$, decide $\mathcal{C}_1$; otherwise decide $\mathcal{C}_2$.
      - Equivalently: if $p(\mathbf{x} \mid \mathcal{C}_1)\, P(\mathcal{C}_1) > p(\mathbf{x} \mid \mathcal{C}_2)\, P(\mathcal{C}_2)$, decide $\mathcal{C}_1$; otherwise decide $\mathcal{C}_2$.

  12. Bayes decision rule: example
      - Bayes decision: choose the class with the highest $p(\mathcal{C}_k \mid \mathbf{x})$.
      - Priors: $p(\mathcal{C}_1) = \tfrac{2}{3}$, $p(\mathcal{C}_2) = \tfrac{1}{3}$.
      [Figure: unnormalized quantities $2 \times p(x \mid \mathcal{C}_1)$ and $p(x \mid \mathcal{C}_2)$, the posteriors $p(\mathcal{C}_1 \mid x)$ and $p(\mathcal{C}_2 \mid x)$, and the resulting decision regions.]

  13. Bayes classifier
      - Simple Bayes classifier: estimate the posterior probability of each class.
      - What should the decision criterion be? Choose the class with the highest $p(\mathcal{C}_k \mid \mathbf{x})$.
      - The optimal decision is the one that minimizes the expected number of mistakes.

  14. Diabetes example
      - Observed feature: white blood cell count.
      (This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.)

  15. Diabetes example
      - The doctor has a prior: $p(y = 1) = 0.2$.
      - Prior: in the absence of any observation, what do I know about the probability of the classes?
      - A patient comes in with white blood cell count $x$.
      - Does the patient have diabetes, i.e. what is $p(y = 1 \mid x)$?
      - Given a new observation, we still need to compute the posterior.

  16. Diabetes example
      [Figure: class-conditional densities $p(x \mid y = 0)$ and $p(x \mid y = 1)$ over white blood cell count.]
      (This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.)

  17. Estimating probability densities from data
      - Assume Gaussian distributions for $p(x \mid \mathcal{C}_1)$ and $p(x \mid \mathcal{C}_2)$.
      - Recall that for samples $\{x^{(1)}, \dots, x^{(N)}\}$, under a Gaussian assumption the MLE estimates are the sample mean and the sample variance:
        $\mu = \dfrac{1}{N} \sum_{n=1}^{N} x^{(n)}, \qquad \sigma^2 = \dfrac{1}{N} \sum_{n=1}^{N} \left(x^{(n)} - \mu\right)^2$

  18. Diabetes example
      - Fit $p(x \mid y = 1) = \mathcal{N}(\mu_1, \sigma_1^2)$ (and likewise $p(x \mid y = 0)$) by maximum likelihood:
        $\mu_1 = \dfrac{\sum_{n:\, y^{(n)} = 1} x^{(n)}}{N_1}, \qquad \sigma_1^2 = \dfrac{\sum_{n:\, y^{(n)} = 1} \left(x^{(n)} - \mu_1\right)^2}{N_1}, \qquad N_1 = \sum_{n:\, y^{(n)} = 1} 1$
      [Figure: fitted Gaussians for $p(x \mid y = 0)$ and $p(x \mid y = 1)$.]
      (This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.)
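
A minimal sketch of these per-class MLE formulas, assuming hypothetical arrays x (white blood cell counts) and y (0/1 labels); the numbers are not from a real dataset:

```python
import numpy as np

# Hypothetical training data (illustrative numbers, not the real diabetes dataset).
x = np.array([3.2, 4.1, 5.0, 7.8, 8.5, 9.1])   # white blood cell counts
y = np.array([0,   0,   0,   1,   1,   1  ])   # 1 = diabetes, 0 = no diabetes

def fit_gaussian_mle(x_class):
    """MLE for a 1-D Gaussian: sample mean and (biased, 1/N) sample variance."""
    mu = x_class.mean()
    sigma2 = ((x_class - mu) ** 2).mean()
    return mu, sigma2

mu1, sigma2_1 = fit_gaussian_mle(x[y == 1])   # p(x | y=1) = N(mu1, sigma2_1)
mu0, sigma2_0 = fit_gaussian_mle(x[y == 0])   # p(x | y=0) = N(mu0, sigma2_0)
print(mu1, sigma2_1, mu0, sigma2_0)
```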

  19. Diabetes example
      - Add a second observation: plasma glucose value.
      (This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.)

  20. Generative approach for this example
      - Multivariate Gaussian distributions for $p(\mathbf{x} \mid \mathcal{C}_k)$:
        $p(\mathbf{x} \mid y = k) = \dfrac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_k|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right\}, \qquad k = 1, 2$
      - Prior distribution $p(y)$:
        $p(y = 1) = \pi, \qquad p(y = 0) = 1 - \pi$

  21. MLE for the multivariate Gaussian
      - For samples $\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$, under a multivariate Gaussian assumption the MLE estimates are:
        $\boldsymbol{\mu} = \dfrac{1}{N} \sum_{n=1}^{N} \mathbf{x}^{(n)}$
        $\boldsymbol{\Sigma} = \dfrac{1}{N} \sum_{n=1}^{N} \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}\right) \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}\right)^T$
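
A short numpy sketch of these two estimators, assuming X is a hypothetical N x d data matrix (note the 1/N normalization of the MLE, unlike numpy's default unbiased covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # hypothetical N x d data matrix

mu = X.mean(axis=0)                       # MLE mean: (1/N) * sum_n x^(n)
centered = X - mu
Sigma = (centered.T @ centered) / len(X)  # MLE covariance: (1/N) * sum_n (x - mu)(x - mu)^T
print(mu, Sigma, sep="\n")
```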

  22. Generative approach: example
      - Maximum likelihood estimation with $\mathcal{D} = \{(\mathbf{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$:
        $\pi = \dfrac{N_1}{N}$
        $\boldsymbol{\mu}_1 = \dfrac{\sum_{n=1}^{N} y^{(n)} \mathbf{x}^{(n)}}{N_1}, \qquad \boldsymbol{\mu}_2 = \dfrac{\sum_{n=1}^{N} (1 - y^{(n)}) \mathbf{x}^{(n)}}{N_2}$
        $\boldsymbol{\Sigma}_1 = \dfrac{1}{N_1} \sum_{n=1}^{N} y^{(n)} \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_1\right) \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_1\right)^T$
        $\boldsymbol{\Sigma}_2 = \dfrac{1}{N_2} \sum_{n=1}^{N} (1 - y^{(n)}) \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_2\right) \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_2\right)^T$
        where $N_1 = \sum_{n=1}^{N} y^{(n)}$ and $N_2 = N - N_1$.
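
A compact sketch that puts the pieces together: estimate $\pi$, the class means, and the per-class covariances by MLE, then classify a new point with Bayes' rule. The data, the helper names fit_gaussian_bayes and predict, and the test point are all hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_bayes(X, y):
    """MLE for a two-class Gaussian Bayes classifier (class-specific covariances)."""
    pi = y.mean()                                   # pi = N1 / N
    params = {}
    for label in (0, 1):
        Xc = X[y == label]
        mu = Xc.mean(axis=0)
        centered = Xc - mu
        Sigma = (centered.T @ centered) / len(Xc)   # per-class MLE covariance
        params[label] = (mu, Sigma)
    return pi, params

def predict(x_new, pi, params):
    """Return p(y=1 | x) via Bayes' rule."""
    prior = {0: 1 - pi, 1: pi}
    joint = {c: prior[c] * multivariate_normal(mean=m, cov=S).pdf(x_new)
             for c, (m, S) in params.items()}
    return joint[1] / (joint[0] + joint[1])

# Hypothetical 2-D data: class 1 centered at (8, 9), class 0 at (4, 5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([4, 5], 1.0, size=(50, 2)),
               rng.normal([8, 9], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

pi, params = fit_gaussian_bayes(X, y)
print(predict(np.array([7.5, 8.5]), pi, params))   # close to 1 for a point near class 1
```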

  23. Decision boundary for the Gaussian Bayes classifier
      - The boundary is where $p(\mathcal{C}_1 \mid \mathbf{x}) = p(\mathcal{C}_2 \mid \mathbf{x})$, with $p(\mathcal{C}_k \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})}$.
      - Taking logs: $\ln p(\mathcal{C}_1 \mid \mathbf{x}) = \ln p(\mathcal{C}_2 \mid \mathbf{x})$
        $\Rightarrow \ln p(\mathbf{x} \mid \mathcal{C}_1) + \ln p(\mathcal{C}_1) - \ln p(\mathbf{x}) = \ln p(\mathbf{x} \mid \mathcal{C}_2) + \ln p(\mathcal{C}_2) - \ln p(\mathbf{x})$
        $\Rightarrow \ln p(\mathbf{x} \mid \mathcal{C}_1) + \ln p(\mathcal{C}_1) = \ln p(\mathbf{x} \mid \mathcal{C}_2) + \ln p(\mathcal{C}_2)$
      - For Gaussian class-conditionals:
        $\ln p(\mathbf{x} \mid \mathcal{C}_k) = -\dfrac{d}{2} \ln 2\pi - \dfrac{1}{2} \ln |\boldsymbol{\Sigma}_k| - \dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)$
      - With class-specific covariances, the resulting boundary is quadratic in $\mathbf{x}$.

  24. Decision boundary
      [Figure: contours of $p(\mathbf{x} \mid \mathcal{C}_1)$ and $p(\mathbf{x} \mid \mathcal{C}_2)$, the posterior $p(\mathcal{C}_1 \mid \mathbf{x})$, and the boundary where $p(\mathcal{C}_1 \mid \mathbf{x}) = p(\mathcal{C}_2 \mid \mathbf{x})$.]

  25. Shared covariance matrix
      - When the classes share a single covariance matrix $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$:
        $p(\mathbf{x} \mid \mathcal{C}_k) = \dfrac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right\}, \qquad k = 1, 2$
      - $p(\mathcal{C}_1) = \pi, \qquad p(\mathcal{C}_2) = 1 - \pi$

  26. Likelihood
      - Likelihood of the training data:
        $\prod_{n=1}^{N} p(\mathbf{x}^{(n)}, y^{(n)} \mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} p(\mathbf{x}^{(n)} \mid y^{(n)}, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma})\, p(y^{(n)} \mid \pi)$

  27. Shared covariance matrix
      - Maximum likelihood estimation with $\mathcal{D} = \{(\mathbf{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$:
        $\pi = \dfrac{N_1}{N}$
        $\boldsymbol{\mu}_1 = \dfrac{\sum_{n=1}^{N} y^{(n)} \mathbf{x}^{(n)}}{N_1}, \qquad \boldsymbol{\mu}_2 = \dfrac{\sum_{n=1}^{N} (1 - y^{(n)}) \mathbf{x}^{(n)}}{N_2}$
        $\boldsymbol{\Sigma} = \dfrac{1}{N} \left[ \sum_{n \in \mathcal{C}_1} \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_1\right)\left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_1\right)^T + \sum_{n \in \mathcal{C}_2} \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_2\right)\left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_2\right)^T \right]$
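
A minimal sketch of the pooled-covariance estimate, assuming hypothetical arrays X (N x d features) and y (0/1 labels) as before:

```python
import numpy as np

def fit_shared_covariance(X, y):
    """MLE for the shared-covariance Gaussian model: pi, mu1, mu2, pooled Sigma."""
    pi = y.mean()
    mu1 = X[y == 1].mean(axis=0)
    mu2 = X[y == 0].mean(axis=0)
    # Pool the per-class scatter matrices and divide by the total sample size N.
    d1 = X[y == 1] - mu1
    d2 = X[y == 0] - mu2
    Sigma = (d1.T @ d1 + d2.T @ d2) / len(X)
    return pi, mu1, mu2, Sigma

# Usage: pi, mu1, mu2, Sigma = fit_shared_covariance(X, y)
```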

  28. Decision boundary with a shared covariance matrix
      - The boundary is still $\ln p(\mathbf{x} \mid \mathcal{C}_1) + \ln p(\mathcal{C}_1) = \ln p(\mathbf{x} \mid \mathcal{C}_2) + \ln p(\mathcal{C}_2)$, now with
        $\ln p(\mathbf{x} \mid \mathcal{C}_k) = -\dfrac{d}{2} \ln 2\pi - \dfrac{1}{2} \ln |\boldsymbol{\Sigma}| - \dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)$
      - The quadratic terms $\mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x}$ cancel, so the decision boundary is linear in $\mathbf{x}$.
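
As a sketch of what "linear" means here: expanding the log-posterior difference under a single shared $\boldsymbol{\Sigma}$ leaves a discriminant of the form $\mathbf{w}^T \mathbf{x} + b$. The code below assumes pi, mu1, mu2, Sigma come from a fit like the shared-covariance MLE sketched above:

```python
import numpy as np

def linear_boundary(pi, mu1, mu2, Sigma):
    """Discriminant a(x) = w^T x + b; decide C1 when a(x) > 0 (shared-covariance case)."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    b = (-0.5 * mu1 @ Sigma_inv @ mu1
         + 0.5 * mu2 @ Sigma_inv @ mu2
         + np.log(pi / (1 - pi)))
    return w, b

# Usage: decide class 1 for a new point x_new if w @ x_new + b > 0.
```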

  29. Bayes decision rule: multi-class misclassification rate
      - Multi-class problem: probability of error of the Bayes decision rule.
      - It is simpler to compute the probability of a correct decision: $P(\text{error}) = 1 - P(\text{correct})$, where
        $P(\text{correct}) = \sum_{i=1}^{K} \int_{\mathcal{R}_i} p(\mathbf{x}, \mathcal{C}_i)\, d\mathbf{x} = \sum_{i=1}^{K} \int_{\mathcal{R}_i} p(\mathcal{C}_i \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$
      - $\mathcal{R}_i$: the subset of the feature space assigned to class $\mathcal{C}_i$ by the classifier.
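
A toy sketch of this quantity for a discrete feature, where the integrals become sums: under the Bayes rule, $P(\text{correct})$ is the sum over $x$ of $\max_k p(x, \mathcal{C}_k)$. The joint probability table below is made up:

```python
import numpy as np

# Hypothetical joint probabilities p(x, C_k): rows = discrete feature values x, columns = classes.
joint = np.array([[0.20, 0.05, 0.05],
                  [0.05, 0.25, 0.05],
                  [0.05, 0.10, 0.20]])   # entries sum to 1

# The Bayes rule assigns each x to the class with the largest posterior (equivalently,
# the largest joint), so P(correct) is the sum of the row-wise maxima.
p_correct = joint.max(axis=1).sum()
p_error = 1 - p_correct
print(p_correct, p_error)   # about 0.65 and 0.35
```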

  30. Bayes minimum error
      - Bayes minimum error classifier (zero-one loss): $\min_{\alpha(\cdot)}\; E_{\mathbf{x},y}\!\left[ I(\alpha(\mathbf{x}) \neq y) \right]$
      - $\alpha(\mathbf{x}) = \operatorname*{argmax}_{y}\; p(y \mid \mathbf{x})$
