
Lecture 7: Multiclass Classification, Princeton University COS 495



  1. Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang

  2. Example: image classification (binary): indoor vs. outdoor

  3. Example: image classification (multiclass). ImageNet figure borrowed from vision.stanford.edu

  4. Multiclass classification
     • Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
     • x_i ∈ ℝ^d, y_i ∈ {1, 2, …, K}
     • Find f(x): ℝ^d → {1, 2, …, K} that outputs correct labels
     • What kind of f?

  5. Approaches for multiclass classification

  6. Approach 1: reduce to regression
     • Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
     • Find f(x) = w^T x that minimizes L̂(f) = (1/n) Σ_{i=1}^n (w^T x_i − y_i)²
     • Bad idea even for binary classification: it reduces to linear regression and ignores the fact that y ∈ {1, 2, …, K} (see the sketch below)
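A minimal numpy sketch of the reduction above (synthetic data; all names and parameters are illustrative assumptions, not from the lecture). It fits ordinary least squares to the integer labels and rounds the output to the nearest class, which is exactly the step that ignores y ∈ {1, 2, …, K}:

```python
import numpy as np

# Synthetic 3-class data: one Gaussian blob per class (illustrative only).
rng = np.random.default_rng(0)
K, d, n = 3, 2, 300
mus = rng.normal(size=(K, d)) * 4
X = np.vstack([rng.normal(mu, 1.0, size=(n // K, d)) for mu in mus])
y = np.repeat(np.arange(1, K + 1), n // K)          # labels in {1, ..., K}

Xb = np.hstack([X, np.ones((n, 1))])                # add a bias column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)          # minimize (1/n) * sum_i (w^T x_i - y_i)^2

# "Classify" by rounding the real-valued prediction to the nearest label.
# This treats the unordered labels 1, 2, 3 as ordered quantities, which is the flaw the slide points out.
pred = np.clip(np.rint(Xb @ w), 1, K)
print("training accuracy:", np.mean(pred == y))
```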

  7. Approach 1: reduce to regression. Bad idea even for binary classification. Figure from Pattern Recognition and Machine Learning, Bishop

  8. Approach 2: one-versus-the-rest
     • Find K − 1 classifiers f_1, f_2, …, f_{K−1}
     • f_1 classifies 1 vs {2, 3, …, K}
     • f_2 classifies 2 vs {1, 3, …, K}
     • …
     • f_{K−1} classifies K − 1 vs {1, 2, …, K − 2}
     • Points not classified into classes {1, 2, …, K − 1} are put into class K
     • Problem of ambiguous regions: some points may be classified into more than one class (see the sketch below)
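A minimal scikit-learn sketch of one-versus-the-rest. It follows the common variant that trains K binary classifiers and resolves the ambiguous regions by taking the highest decision score, rather than the K − 1 classifiers plus a default class described on the slide; the function names and the choice of LogisticRegression as the base classifier are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_rest_fit(X, y, K):
    """Train one binary classifier per class k: class k vs. all other classes."""
    return [LogisticRegression().fit(X, (y == k).astype(int)) for k in range(K)]

def one_vs_rest_predict(classifiers, X):
    # Taking the argmax of the decision scores resolves points that the raw
    # binary decisions would assign to zero classes or to several classes.
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return np.argmax(scores, axis=1)
```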

  9. Approach 2: one-versus-the-rest. Figure from Pattern Recognition and Machine Learning, Bishop

  10. Approach 3: one-versus-one
      • Find K(K − 1)/2 classifiers f_(1,2), f_(1,3), …, f_(K−1,K)
      • f_(1,2) classifies 1 vs 2
      • f_(1,3) classifies 1 vs 3
      • …
      • f_(K−1,K) classifies K − 1 vs K
      • Computationally expensive: think of K = 1000
      • Problem of ambiguous regions (a voting sketch follows below)
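A matching sketch of one-versus-one with majority voting over the K(K − 1)/2 pairwise classifiers (again, names and the base classifier are illustrative assumptions; ties in the vote are one face of the ambiguous-region problem):

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_one_fit(X, y, K):
    """Train one binary classifier per pair of classes (j, k): K(K-1)/2 in total."""
    classifiers = {}
    for j, k in combinations(range(K), 2):
        mask = (y == j) | (y == k)
        classifiers[(j, k)] = LogisticRegression().fit(X[mask], (y[mask] == k).astype(int))
    return classifiers

def one_vs_one_predict(classifiers, X, K):
    votes = np.zeros((len(X), K), dtype=int)
    for (j, k), clf in classifiers.items():
        pred = clf.predict(X)                      # 0 votes for class j, 1 votes for class k
        votes[np.arange(len(X)), np.where(pred == 1, k, j)] += 1
    return np.argmax(votes, axis=1)                # majority vote; ties remain ambiguous
```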

  11. Approach 3: one-versus-one. Figure from Pattern Recognition and Machine Learning, Bishop

  12. Approach 4: discriminant functions
      • Find K scoring functions s_1, s_2, …, s_K
      • Classify x to class y = argmax_k s_k(x)
      • Computationally cheap
      • No ambiguous regions

  13. Linear discriminant functions
      • Find K discriminant functions s_1, s_2, …, s_K
      • Classify x to class y = argmax_k s_k(x)
      • Linear discriminant: s_k(x) = w_k^T x, with w_k ∈ ℝ^d (sketched below)
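The argmax decision rule over linear scores is a one-liner; a minimal sketch (the array shapes are assumptions for illustration):

```python
import numpy as np

def linear_discriminant_predict(W, X):
    """Linear discriminants s_k(x) = w_k^T x, classified by the largest score.
    W has shape (K, d), one row w_k per class; X has shape (n, d)."""
    scores = X @ W.T                  # scores[i, k] = w_k^T x_i
    return np.argmax(scores, axis=1)  # every point gets exactly one class: no ambiguous regions
```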

  14. Linear discriminant functions
      • Linear discriminant: s_k(x) = w_k^T x, with w_k ∈ ℝ^d
      • Leads to a convex region for each class, by y = argmax_k w_k^T x
      Figure from Pattern Recognition and Machine Learning, Bishop

  15. Conditional distribution as discriminant
      • Find K discriminant functions s_1, s_2, …, s_K
      • Classify x to class y = argmax_k s_k(x)
      • Conditional distributions: s_k(x) = p(y = k | x)
      • Parametrize by w_k: s_k(x) = p_{w_k}(y = k | x)

  16. Multiclass logistic regression

  17. Review: binary logistic regression
      • Sigmoid: σ(w^T x + b) = 1 / (1 + exp(−(w^T x + b)))
      • Interpret as conditional probability (sketched below):
        p_w(y = 1 | x) = σ(w^T x + b)
        p_w(y = 0 | x) = 1 − p_w(y = 1 | x) = 1 − σ(w^T x + b)
      • How to extend to multiclass?
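A small numeric sketch of the binary case (function names and inputs are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_logistic_proba(w, b, x):
    """Return (p(y = 0 | x), p(y = 1 | x)) under binary logistic regression."""
    p1 = sigmoid(w @ x + b)
    return np.array([1.0 - p1, p1])

print(binary_logistic_proba(np.array([2.0, -1.0]), 0.5, np.array([1.0, 3.0])))
```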

  18. Review: binary logistic regression
      • Suppose we model the class-conditional densities p(x | y = k) and class probabilities p(y = k)
      • Conditional probability by Bayes' rule:
        p(y = 1 | x) = p(x | y = 1) p(y = 1) / [ p(x | y = 1) p(y = 1) + p(x | y = 2) p(y = 2) ] = 1 / (1 + exp(−a)) = σ(a)
        where we define a ≔ ln [ p(x | y = 1) p(y = 1) / ( p(x | y = 2) p(y = 2) ) ] = ln [ p(y = 1 | x) / p(y = 2 | x) ]

  19. Review: binary logistic regression
      • Suppose we model the class-conditional densities p(x | y = k) and class probabilities p(y = k)
      • Setting p(y = 1 | x) = σ(a) = σ(w^T x + b) is equivalent to setting the log odds
        a = ln [ p(y = 1 | x) / p(y = 2 | x) ] = w^T x + b
      • Why linear log odds?

  20. Review: binary logistic regression
      • Suppose the class-conditional densities p(x | y = k) are normal:
        p(x | y = k) = N(x | μ_k, I) = 1 / (2π)^{d/2} · exp{ −½ ‖x − μ_k‖² }
      • Then the log odds are linear (checked numerically below):
        a = ln [ p(x | y = 1) p(y = 1) / ( p(x | y = 2) p(y = 2) ) ] = w^T x + b
        where w = μ_1 − μ_2 and b = −½ μ_1^T μ_1 + ½ μ_2^T μ_2 + ln [ p(y = 1) / p(y = 2) ]
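A quick numerical check of this claim under identity-covariance Gaussians (synthetic means and priors; illustrative only). The shared normalizing constant cancels in the ratio, so it is dropped from the log-density helper:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
p1, p2 = 0.3, 0.7                                  # class priors p(y = 1), p(y = 2)

def log_gauss(x, mu):
    # log N(x | mu, I) up to the constant -d/2 * log(2*pi), which cancels in the log odds
    return -0.5 * np.sum((x - mu) ** 2)

w = mu1 - mu2
b = -0.5 * mu1 @ mu1 + 0.5 * mu2 @ mu2 + np.log(p1 / p2)

x = rng.normal(size=d)
log_odds = (log_gauss(x, mu1) + np.log(p1)) - (log_gauss(x, mu2) + np.log(p2))
print(np.isclose(log_odds, w @ x + b))             # True: the log odds is linear in x
```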

  21. Multiclass logistic regression
      • Suppose we model the class-conditional densities p(x | y = k) and class probabilities p(y = k)
      • Conditional probability by Bayes' rule:
        p(y = k | x) = p(x | y = k) p(y = k) / Σ_j p(x | y = j) p(y = j) = exp(a_k) / Σ_j exp(a_j)
        where we define a_k ≔ ln [ p(x | y = k) p(y = k) ]

  22. Multiclass logistic regression
      • Suppose the class-conditional densities p(x | y = k) are normal:
        p(x | y = k) = N(x | μ_k, I) = 1 / (2π)^{d/2} · exp{ −½ ‖x − μ_k‖² }
      • Then
        a_k ≔ ln [ p(x | y = k) p(y = k) ] = −½ x^T x + w_k^T x + b_k
        where w_k = μ_k and b_k = −½ μ_k^T μ_k + ln p(y = k) + ln [ 1 / (2π)^{d/2} ]

  23. Multiclass logistic regression
      • Suppose the class-conditional densities p(x | y = k) are normal:
        p(x | y = k) = N(x | μ_k, I) = 1 / (2π)^{d/2} · exp{ −½ ‖x − μ_k‖² }
      • The term −½ x^T x is the same for every class and cancels out, so we have
        p(y = k | x) = exp(a_k) / Σ_j exp(a_j), where a_k ≔ w_k^T x + b_k
        with w_k = μ_k and b_k = −½ μ_k^T μ_k + ln p(y = k) + ln [ 1 / (2π)^{d/2} ]

  24. Multiclass logistic regression: conclusion
      • Suppose the class-conditional densities p(x | y = k) are normal:
        p(x | y = k) = N(x | μ_k, I) = 1 / (2π)^{d/2} · exp{ −½ ‖x − μ_k‖² }
      • Then
        p(y = k | x) = exp(w_k^T x + b_k) / Σ_j exp(w_j^T x + b_j)
        which is the hypothesis class for multiclass logistic regression
      • It is a softmax applied to a linear transformation; it can be used to derive the negative log-likelihood loss (cross entropy)

  25. Softmax
      • A way to squash a = (a_1, a_2, …, a_k, …) into a probability vector p:
        softmax(a) = ( exp(a_1) / Σ_j exp(a_j), exp(a_2) / Σ_j exp(a_j), …, exp(a_k) / Σ_j exp(a_j), … )
      • Behaves like max: when a_k ≫ a_j for all j ≠ k, p_k ≅ 1 and p_j ≅ 0 (see the sketch below)
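A minimal softmax implementation. Subtracting the maximum score before exponentiating is a standard numerical-stability detail, not something on the slide; it does not change the output because the shift cancels in the ratio:

```python
import numpy as np

def softmax(a):
    """Squash a score vector a = (a_1, ..., a_K) into a probability vector."""
    z = np.exp(a - np.max(a))      # shift by max(a) to avoid overflow; the result is unchanged
    return z / np.sum(z)

print(softmax(np.array([10.0, 0.0, -2.0])))   # approximately (1, 0, 0): behaves like max
```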

  26. Cross entropy for conditional distribution
      • Let p_data(y | x) denote the empirical distribution of the data
      • The negative log-likelihood
        (1/n) Σ_{i=1}^n −log p(y = y_i | x_i) = −E_{p_data(y|x)} [ log p(y | x) ]
        is the cross entropy between p_data and the model output p
      • Information theory viewpoint: KL divergence
        D(p_data ‖ p) = E_{p_data} [ log (p_data / p) ] = E_{p_data} [ log p_data ] − E_{p_data} [ log p ]
        The first term is a constant determined by the entropy of the data; the second term is the negative cross entropy, so minimizing the cross entropy minimizes the KL divergence (a computational sketch follows below)
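A minimal sketch of the cross-entropy (negative log-likelihood) computation, assuming labels are encoded as integers 0, …, K − 1 and the model already outputs a probability matrix (the names are illustrative):

```python
import numpy as np

def cross_entropy_loss(P, y):
    """Average negative log-likelihood of labels y under model probabilities P,
    where P[i, k] = p(y = k | x_i) and y[i] is the integer label of example i."""
    n = len(y)
    return -np.mean(np.log(P[np.arange(n), y]))
```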

  27. Cross entropy for full distribution
      • Let p_data(x, y) denote the empirical distribution of the data
      • The negative log-likelihood
        (1/n) Σ_{i=1}^n −log p(x_i, y_i) = −E_{p_data(x,y)} [ log p(x, y) ]
        is the cross entropy between p_data and the model output p

  28. Multiclass logistic regression: summary
      • Pipeline: last hidden layer h → linear scores a_k = w_k^T h + b_k → softmax converts the scores to probabilities p_k → cross-entropy loss against the label y_i (an end-to-end sketch follows below)
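A minimal end-to-end sketch of this pipeline, training multiclass logistic regression with plain gradient descent on synthetic data (all data, names, and hyperparameters here are illustrative assumptions, not from the lecture). The input x plays the role of the last hidden layer h:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n = 3, 2, 300
mus = rng.normal(size=(K, d)) * 4
X = np.vstack([rng.normal(mu, 1.0, size=(n // K, d)) for mu in mus])
y = np.repeat(np.arange(K), n // K)

W, b, lr = np.zeros((K, d)), np.zeros(K), 0.1

for _ in range(500):
    A = X @ W.T + b                               # linear scores a_k = w_k^T x + b_k
    A -= A.max(axis=1, keepdims=True)             # stabilize softmax
    P = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)    # softmax: convert to probabilities
    loss = -np.mean(np.log(P[np.arange(n), y]))   # cross-entropy loss
    G = P.copy()
    G[np.arange(n), y] -= 1                       # gradient of the total loss w.r.t. the scores
    W -= lr * (G.T @ X) / n
    b -= lr * G.mean(axis=0)

pred = np.argmax(X @ W.T + b, axis=1)
print("final loss:", round(loss, 3), "training accuracy:", np.mean(pred == y))
```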
