

  1. Pattern Recognition 2019: Linear Models for Classification (2)
     Ad Feelders, Universiteit Utrecht, December 11, 2019

  2. Two types of approaches to classification
     Discriminative models (“regression”; section 4.3): only model the conditional distribution of t given x.
     Generative models (“density estimation”; section 4.2): model the joint distribution of t and x.

  3. Generative Models
     In classification we want to estimate p(C_k | x). In generative models, we use Bayes’ rule

         p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{\sum_{j=1}^{K} p(C_j)\, p(x \mid C_j)},

     where p(x | C_j) are the class-conditional probability distributions and p(C_j) are the unconditional (“prior”) probabilities of each class.

  4. Generative Models
     The training data are partitioned into subsets D = {D_1, ..., D_K}, one subset per class label. Data in D_j are used to estimate p(x | C_j). Prior probabilities p(C_j) are estimated from the observed class frequencies. These estimates are plugged into Bayes’ formula to obtain probability estimates \hat{p}(C_k | x).

  5. Generative Models: example (discrete features)
     Test mailing data:

                    respondents             non-respondents
     age        male  female  total      male  female  total
     18-25        15      10     25         7       3     10
     26-35        15      20     35        10      10     20
     36-50        10      10     20        10      20     30
     51-64        10       5     15        40      40     80
     65+           5       0      5        40      20     60
     total        55      45    100       107      93    200

  6. Generative Models: example
     \hat{p}(respondent) = 100/300 = 1/3 and \hat{p}(non-respondent) = 200/300 = 2/3.

                    respondents             non-respondents
     age        male  female  total      male  female  total
     18-25      0.15    0.10   0.25     0.035   0.015   0.05
     26-35      0.15    0.20   0.35     0.05    0.05    0.10
     36-50      0.10    0.10   0.20     0.05    0.10    0.15
     51-64      0.10    0.05   0.15     0.20    0.20    0.40
     65+        0.05    0      0.05     0.20    0.10    0.30
     total      0.55    0.45   1        0.535   0.465   1

  7. Using Bayes’ Rule
     Estimated probability of response for an 18-25 year old male (R = Respondent, M = Male):

         \hat{p}(R \mid \text{18-25}, M) = \frac{\hat{p}(\text{18-25}, M \mid R)\, \hat{p}(R)}{\hat{p}(\text{18-25}, M)} = \frac{0.15 \times 1/3}{0.15 \times 1/3 + 0.035 \times 2/3} \approx 0.68

     Assign the person to the respondents, because this is the class with the highest estimated probability for 18-25 year old males.
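As a quick check (not part of the slides), the Bayes-rule computation above can be reproduced with the joint cell proportions from slide 6:

```python
# Posterior p(R | 18-25, male) via Bayes' rule with the full joint table.
p_R = 1 / 3            # prior p(respondent)
p_NR = 2 / 3           # prior p(non-respondent)
p_cell_R = 0.15        # p(18-25, male | respondent), from slide 6
p_cell_NR = 0.035      # p(18-25, male | non-respondent), from slide 6

numerator = p_cell_R * p_R
posterior = numerator / (numerator + p_cell_NR * p_NR)
print(round(posterior, 2))  # 0.68
```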

  8. Curse of Dimensionality
     With D input variables of m possible values each, we have to estimate m^D − 1 probabilities per group. For D = 10 and m = 5: 5^{10} − 1 = 9,765,624 probabilities. If N = 1000, almost all cells are empty; we have 1000 / 9,765,624 ≈ 0.0001 observations per cell. Curse of dimensionality: in high dimensions almost all of the input space is empty.
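The cell-count arithmetic on this slide is easy to verify directly:

```python
# m**D - 1 free probabilities per class for a full joint table
# over D discrete variables with m values each.
D, m, N = 10, 5, 1000
n_cells = m ** D - 1
print(n_cells)        # 9765624
print(N / n_cells)    # roughly 0.0001 observations per cell
```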

  9. Naive Bayes Assumption
     Assume the input variables are independent within each group, i.e.

         p(x \mid C_k) = p(x_1 \mid C_k)\, p(x_2 \mid C_k) \cdots p(x_D \mid C_k)

     Instead of m^D − 1 parameters, we only have to estimate D(m − 1) parameters per group. So with D = 10 and m = 5, we only have to estimate 40 probabilities per group.

  10. Using Naive Bayes
     Estimated probability of response for an 18-25 year old male with naive Bayes:

         \hat{p}(R \mid \text{18-25}, M) = \frac{\hat{p}(\text{18-25} \mid R)\, \hat{p}(M \mid R)\, \hat{p}(R)}{\hat{p}(\text{18-25}, M)} = \frac{0.25 \times 0.55 \times 1/3}{0.25 \times 0.55 \times 1/3 + 0.05 \times 0.535 \times 2/3} \approx 0.72

     The probability estimate is higher, but both models lead to the same allocation.
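The naive Bayes version of the same posterior can be checked the same way, using the marginal class-conditional proportions from slide 6:

```python
# Posterior p(R | 18-25, male) under the naive Bayes independence assumption.
p_R, p_NR = 1 / 3, 2 / 3
p_age_R, p_male_R = 0.25, 0.55      # p(18-25 | R), p(M | R), from slide 6
p_age_NR, p_male_NR = 0.05, 0.535   # p(18-25 | NR), p(M | NR), from slide 6

numerator = p_age_R * p_male_R * p_R
posterior = numerator / (numerator + p_age_NR * p_male_NR * p_NR)
print(round(posterior, 2))  # 0.72
```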

  11. Continuous features: normal distribution
     Suppose x ∼ N(μ, Σ), with

         x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix}

     Correlation coefficient: \rho_{12} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}

  12. Contour Plot 1: independent, same variance
     [Contour plot of the bivariate normal density over (x_1, x_2): circular contours]
     μ_1 = 0, μ_2 = 0, ρ_{12} = 0, σ_1^2 = σ_2^2 = 1

  13. Contour Plot 2: positive correlation
     [Contour plot of the bivariate normal density over (x_1, x_2): ellipses tilted upward]
     μ_1 = 10, μ_2 = 25, ρ_{12} = 0.7, σ_1^2 = σ_2^2 = 1

  14. Contour Plot 3: negative correlation
     [Contour plot of the bivariate normal density over (x_1, x_2): ellipses tilted downward]
     μ_1 = 15, μ_2 = 5, ρ_{12} = −0.6, σ_1^2 = σ_2^2 = 1

  15. Multivariate Normal Distribution
     D variables, i.e. x = [x_1, ..., x_D]^⊤:

         \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_D \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1D} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} & \cdots & \sigma_{2D} \\ \vdots & & & \ddots & \vdots \\ \sigma_{D1} & \sigma_{D2} & \sigma_{D3} & \cdots & \sigma_D^2 \end{bmatrix}

     Formula for the normal probability density:

         p(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)
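A minimal sketch of this density (pure Python, 2-D case only, so the matrix inverse and determinant can be written by hand):

```python
import math

def mvn_density_2d(x, mu, sigma):
    """Bivariate normal density; sigma is a 2x2 covariance matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

# Standard bivariate normal at the origin: 1 / (2*pi).
p = mvn_density_2d([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(round(p, 4))  # 0.1592
```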

  16. Normality Assumption in Classification
     If in class k we have x ∼ N(μ_k, Σ_k), then the form of p(x | C_k) is

         p(x \mid C_k) = \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right)

  17. Normality Assumption in Classification
     Estimating p(x | C_k) comes down to estimating the mean vector μ_k and the covariance matrix Σ_k for each class. If there are D variables in x, then there are D means in the mean vector and D(D + 1)/2 distinct elements in the covariance matrix, making a total of (D^2 + 3D)/2 parameters to be estimated for each class.
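This parameter count is worth contrasting with the full discrete table of slide 8:

```python
# D means plus D(D+1)/2 distinct covariance entries, i.e. (D**2 + 3*D) / 2.
def n_params_per_class(D):
    return D + D * (D + 1) // 2

print(n_params_per_class(10))  # 65, versus 5**10 - 1 cells for a full table
print(n_params_per_class(2))   # 5, as in the numeric example later on
```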

  18. Optimal Allocation Rule
     Assign x to group k if p(C_k | x) is larger than p(C_j | x) for all j ≠ k. Via Bayes’ formula this leads to the rule: assign to group k if

         p(x \mid C_k)\, p(C_k) > p(x \mid C_j)\, p(C_j) \quad \forall j \neq k

     (since the denominator cancels).

  19. Optimal Allocation Rule for Normal Densities
     Fill in the formula of the normal density for p(x | C_k). Then we get the following optimal allocation rule: assign to group k if

         \frac{p(C_k)}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right) > \frac{p(C_j)}{(2\pi)^{D/2} |\Sigma_j|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_j)^\top \Sigma_j^{-1} (x - \mu_j) \right)

     for all j ≠ k.

  20. Optimal Allocation Rule for Normal Densities
     Take the natural logarithm:

         \ln\!\left[ \frac{p(C_k)}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right) \right] = -\tfrac{D}{2} \ln(2\pi) - \tfrac{1}{2} \ln|\Sigma_k| - \tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \ln p(C_k)

     Cancel the term that is common to all groups and multiply by −2:

         \ln|\Sigma_k| + (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) - 2 \ln p(C_k)

  21. Optimal Allocation Rule for Normal Densities
     Discriminant function for class k:

         d_k(x) = \ln|\Sigma_k| - 2 \ln p(C_k) + (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)
                = \underbrace{\ln|\Sigma_k| - 2 \ln p(C_k) + \mu_k^\top \Sigma_k^{-1} \mu_k}_{\text{constant}} \; \underbrace{- \, 2 \mu_k^\top \Sigma_k^{-1} x}_{\text{linear}} \; + \underbrace{x^\top \Sigma_k^{-1} x}_{\text{quadratic}}

     Assign to class k if d_k(x) < d_j(x) for all j ≠ k.
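A sketch of this discriminant in the 2-D case (pure Python; the two classes and their parameters below are illustrative, not from the slides):

```python
import math

def discriminant(x, mu, sigma, prior):
    """d_k(x) = ln|Sigma| - 2 ln p(C_k) + (x-mu)^T Sigma^{-1} (x-mu), 2-D only.
    Assign x to the class with the SMALLEST d_k."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.log(det) - 2 * math.log(prior) + quad

# Two hypothetical classes with equal priors and unit covariance.
d1 = discriminant([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5)
d2 = discriminant([0.0, 0.0], [3.0, 3.0], [[1.0, 0.0], [0.0, 1.0]], 0.5)
print(d1 < d2)  # True: x = (0, 0) is assigned to class 1
```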

  22. Estimation
     Estimate p(C_k), μ_k, Σ_k from the training data:

         \hat{p}(C_k) = \hat{p}(t = k) = \frac{N_k}{N}

     where N_k is the number of observations from group k. The mean of x_i in group k is estimated by

         \hat{\mu}_{i,k} = \bar{x}_{i,k} = \frac{1}{N_k} \sum_{n: t_n = k} x_{n,i}

     for k = 1, ..., K and i = 1, ..., D.

  23. Estimation
     Unbiased estimate of the covariance between x_i and x_j in group k:

         \hat{\Sigma}_k^{ij} = \frac{1}{N_k - 1} \sum_{n: t_n = k} (x_{n,i} - \bar{x}_{i,k})(x_{n,j} - \bar{x}_{j,k})

     for k = 1, ..., K and i, j = 1, ..., D. If j = i, this is the variance of x_i in group k.

  24. Numeric Example
     Training data:

     x_1  x_2  t
      2    4   1
      3    6   1
      4   14   1
      4   18   2
      5   10   2
      6    8   2

  25. Estimates
     Group 1:

         \hat{p}(C_1) = \frac{3}{6} = \frac{1}{2}, \quad \bar{x}_1 = \begin{bmatrix} 3 \\ 8 \end{bmatrix}, \quad \hat{\Sigma}_1 = \begin{bmatrix} 1 & 5 \\ 5 & 28 \end{bmatrix}

     Group 2:

         \hat{p}(C_2) = \frac{3}{6} = \frac{1}{2}, \quad \bar{x}_2 = \begin{bmatrix} 5 \\ 12 \end{bmatrix}, \quad \hat{\Sigma}_2 = \begin{bmatrix} 1 & -5 \\ -5 & 28 \end{bmatrix}
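These estimates can be reproduced from the six training points with the plug-in formulas of slides 22-23 (unbiased covariance, divisor N_k − 1):

```python
# Training data from slide 24: (x1, x2, t) triples.
data = [(2, 4, 1), (3, 6, 1), (4, 14, 1), (4, 18, 2), (5, 10, 2), (6, 8, 2)]

def estimates(k):
    """Return (prior, mean vector, covariance matrix) for group k."""
    pts = [(x1, x2) for x1, x2, t in data if t == k]
    n = len(pts)
    mean = [sum(p[i] for p in pts) / n for i in range(2)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in pts) / (n - 1)
            for j in range(2)] for i in range(2)]
    return n / len(data), mean, cov

print(estimates(1))  # (0.5, [3.0, 8.0], [[1.0, 5.0], [5.0, 28.0]])
print(estimates(2))  # (0.5, [5.0, 12.0], [[1.0, -5.0], [-5.0, 28.0]])
```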
