Machine Learning Basics
Lecture 7: Multiclass Classification
Princeton University COS 495
Instructor: Yingyu Liang
Example: image classification (binary)
[Figure: indoor vs. outdoor scene photos]
Example: image classification (multiclass)
[ImageNet figure borrowed from vision.stanford.edu]
Multiclass classification
• Given training data $(x_i, y_i) : 1 \le i \le n$ i.i.d. from distribution $D$
• $x_i \in \mathbb{R}^d$, $y_i \in \{1, 2, \dots, K\}$
• Find $f(x) : \mathbb{R}^d \to \{1, 2, \dots, K\}$ that outputs correct labels
• What kind of $f$?
Approaches for multiclass classification
Approach 1: reduce to regression
• Given training data $(x_i, y_i) : 1 \le i \le n$ i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$
• Bad idea even for binary classification: it reduces the problem to linear regression and ignores the fact that $y \in \{1, 2, \dots, K\}$ (a code sketch follows the figure below)
Approach 1: reduce to regression
• Bad idea even for binary classification
[Figure from Pattern Recognition and Machine Learning, Bishop]
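Below is a minimal sketch of Approach 1 on made-up synthetic data (three Gaussian blobs; all names and numbers are illustrative, not from the lecture): fit ordinary least squares to the raw labels and round predictions to the nearest class. It runs, but as the figure above suggests, the squared loss also penalizes points that are far on the correct side of the boundary, so outliers can drag the fit.

```python
# A sketch of "reduce to regression": treat labels 1..K as real numbers,
# fit least squares, round predictions. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
K, n, d = 3, 300, 2
means = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])  # one blob per class
y = rng.integers(1, K + 1, size=n)                      # labels in {1,...,K}
X = means[y - 1] + rng.normal(size=(n, d))

Xb = np.hstack([X, np.ones((n, 1))])                    # append bias feature
w, *_ = np.linalg.lstsq(Xb, y.astype(float), rcond=None)
pred = np.clip(np.rint(Xb @ w), 1, K).astype(int)       # round to a class
print("train accuracy:", (pred == y).mean())
```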
Approach 2: one-versus-the-rest
• Find $K - 1$ classifiers $f_1, f_2, \dots, f_{K-1}$
• $f_1$ classifies 1 vs $\{2, 3, \dots, K\}$
• $f_2$ classifies 2 vs $\{1, 3, \dots, K\}$
• …
• $f_{K-1}$ classifies $K - 1$ vs $\{1, 2, \dots, K - 2\}$
• Points not classified to any of the classes $\{1, 2, \dots, K - 1\}$ are assigned to class $K$
• Problem of ambiguous regions: some points may be classified to more than one class (a code sketch follows the figure below)
Approach 2: one-versus-the-rest
[Figure from Pattern Recognition and Machine Learning, Bishop]
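A minimal sketch of one-versus-the-rest, assuming synthetic blob data and scikit-learn's LogisticRegression as the base binary learner (both choices are mine, not the lecture's). It makes the ambiguity concrete: a point can be claimed by zero classifiers (defaulting to class $K$) or by several.

```python
# One-versus-the-rest with K-1 binary classifiers; illustrative sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
K, n_per = 3, 100
means = np.array([[0, 0], [4, 0], [2, 4]], dtype=float)
X = np.vstack([m + rng.normal(size=(n_per, 2)) for m in means])
y = np.repeat(np.arange(1, K + 1), n_per)

# f_k separates class k from all other classes, for k = 1..K-1.
clfs = {k: LogisticRegression().fit(X, (y == k).astype(int))
        for k in range(1, K)}

def predict_ovr(x):
    claims = [k for k, clf in clfs.items()
              if clf.predict(x.reshape(1, -1))[0] == 1]
    if not claims:
        return K          # claimed by nobody: default to class K
    if len(claims) == 1:
        return claims[0]
    return claims         # ambiguous region: more than one claim

print(predict_ovr(X[0]))
```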
Approach 3: one-versus-one
• Find $K(K - 1)/2$ classifiers $f_{(1,2)}, f_{(1,3)}, \dots, f_{(K-1,K)}$
• $f_{(1,2)}$ classifies 1 vs 2
• $f_{(1,3)}$ classifies 1 vs 3
• …
• $f_{(K-1,K)}$ classifies $K - 1$ vs $K$
• Computationally expensive: think of $K = 1000$, which needs about 500,000 classifiers
• Problem of ambiguous regions (a code sketch follows the figure below)
Approach 3: one-versus-one
[Figure from Pattern Recognition and Machine Learning, Bishop]
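A matching sketch of one-versus-one under the same made-up data: one binary classifier per unordered pair of classes, combined by voting. The $K(K-1)/2$ count is what makes $K = 1000$ expensive, and tied votes reproduce the ambiguous region.

```python
# One-versus-one with K(K-1)/2 pairwise classifiers; illustrative sketch.
from itertools import combinations
from collections import Counter
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
K, n_per = 3, 100
means = np.array([[0, 0], [4, 0], [2, 4]], dtype=float)
X = np.vstack([m + rng.normal(size=(n_per, 2)) for m in means])
y = np.repeat(np.arange(1, K + 1), n_per)

# f_(a,b) is trained only on the examples of classes a and b.
pair_clfs = {}
for a, b in combinations(range(1, K + 1), 2):
    mask = (y == a) | (y == b)
    pair_clfs[(a, b)] = LogisticRegression().fit(X[mask], y[mask])

def predict_ovo(x):
    votes = Counter(int(clf.predict(x.reshape(1, -1))[0])
                    for clf in pair_clfs.values())
    return votes.most_common(1)[0][0]  # ties broken arbitrarily here

print(predict_ovo(X[0]))
```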
Approach 4: discriminant functions
• Find $K$ scoring functions $s_1, s_2, \dots, s_K$
• Classify $x$ to class $y = \operatorname{argmax}_i\, s_i(x)$
• Computationally cheap
• No ambiguous regions
Linear discriminant functions
• Find $K$ discriminant functions $s_1, s_2, \dots, s_K$
• Classify $x$ to class $y = \operatorname{argmax}_i\, s_i(x)$
• Linear discriminant: $s_i(x) = w_i^T x$, with $w_i \in \mathbb{R}^d$
Linear discriminant functions
• Linear discriminant: $s_i(x) = w_i^T x$, with $w_i \in \mathbb{R}^d$
• Leads to a convex decision region for each class, since $y = \operatorname{argmax}_i\, w_i^T x$
[Figure from Pattern Recognition and Machine Learning, Bishop]
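A minimal sketch of classification with linear discriminants: stack the $w_i$ into a matrix and take the argmax of the $K$ scores. The weights and the query point below are made-up numbers.

```python
# Linear discriminant functions: y = argmax_i w_i^T x. Illustrative values.
import numpy as np

W = np.array([[ 1.0,  0.0],    # w_1
              [ 0.0,  1.0],    # w_2
              [-1.0, -1.0]])   # w_3; shape (K, d)

def classify(x):
    scores = W @ x                      # all s_i(x) = w_i^T x at once
    return int(np.argmax(scores)) + 1   # classes numbered 1..K

print(classify(np.array([2.0, 0.5])))   # -> 1
```

Each decision region is an intersection of half-spaces $\{x : w_i^T x \ge w_j^T x\}$, which is why it is convex.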
Conditional distribution as discriminant
• Find $K$ discriminant functions $s_1, s_2, \dots, s_K$
• Classify $x$ to class $y = \operatorname{argmax}_i\, s_i(x)$
• Conditional distributions: $s_i(x) = \mathbb{P}(y = i \mid x)$
• Parametrized by $w_i$: $s_i(x) = \mathbb{P}_{w_i}(y = i \mid x)$
Multiclass logistic regression
Review: binary logistic regression
• Sigmoid: $\sigma(w^T x + b) = \dfrac{1}{1 + \exp(-(w^T x + b))}$
• Interpret as a conditional probability:
  $p_w(y = 1 \mid x) = \sigma(w^T x + b)$
  $p_w(y = 0 \mid x) = 1 - p_w(y = 1 \mid x) = 1 - \sigma(w^T x + b)$
• How to extend to multiclass?
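A minimal sketch of the sigmoid as a conditional probability; $w$, $b$, and $x$ below are made-up values.

```python
# Binary logistic regression hypothesis: p(y=1|x) = sigmoid(w^T x + b).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.5, -0.5]), 0.2
x = np.array([1.0, 2.0])
p1 = sigmoid(w @ x + b)      # p(y = 1 | x)
print(p1, 1.0 - p1)          # the two class probabilities sum to 1
```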
Review: binary logistic regression
• Suppose we model the class-conditional densities $p(x \mid y = i)$ and the class probabilities $p(y = i)$
• Conditional probability by Bayes' rule:
  $p(y = 1 \mid x) = \dfrac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 2)\, p(y = 2)} = \dfrac{1}{1 + \exp(-a)} = \sigma(a)$
  where we define
  $a \coloneqq \ln \dfrac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = \ln \dfrac{p(y = 1 \mid x)}{p(y = 2 \mid x)}$
Review: binary logistic regression
• Suppose we model the class-conditional densities $p(x \mid y = i)$ and the class probabilities $p(y = i)$
• Setting $p(y = 1 \mid x) = \sigma(a) = \sigma(w^T x + b)$ is equivalent to setting the log odds
  $a = \ln \dfrac{p(y = 1 \mid x)}{p(y = 2 \mid x)} = w^T x + b$
• Why linear log odds?
Review: binary logistic regression
• Suppose the class-conditional densities $p(x \mid y = i)$ are normal with identity covariance:
  $p(x \mid y = i) = N(x \mid \mu_i, I) = \dfrac{1}{(2\pi)^{d/2}} \exp\left\{-\tfrac{1}{2} \|x - \mu_i\|^2\right\}$
• Then the log odds is linear:
  $a = \ln \dfrac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = w^T x + b$
  where $w = \mu_1 - \mu_2$ and $b = -\tfrac{1}{2} \mu_1^T \mu_1 + \tfrac{1}{2} \mu_2^T \mu_2 + \ln \dfrac{p(y = 1)}{p(y = 2)}$
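A quick numeric check of the claim, with made-up means and priors: for identity-covariance Gaussians, the log odds computed from Bayes' rule agrees exactly with $w^T x + b$ for the stated $w$ and $b$.

```python
# Verify: ln[N(x|mu1,I)p1 / (N(x|mu2,I)p2)] == w^T x + b
# with w = mu1 - mu2, b = -mu1.mu1/2 + mu2.mu2/2 + ln(p1/p2).
import numpy as np

mu1, mu2 = np.array([1.0, 2.0]), np.array([-1.0, 0.5])
p1, p2 = 0.3, 0.7
x = np.array([0.4, -1.2])

def log_gauss(x, mu):  # log N(x | mu, I)
    return -0.5 * np.sum((x - mu) ** 2) - 0.5 * x.size * np.log(2 * np.pi)

lhs = log_gauss(x, mu1) + np.log(p1) - log_gauss(x, mu2) - np.log(p2)
w = mu1 - mu2
b = -0.5 * mu1 @ mu1 + 0.5 * mu2 @ mu2 + np.log(p1 / p2)
print(np.isclose(lhs, w @ x + b))   # True
```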
Multiclass logistic regression
• Suppose we model the class-conditional densities $p(x \mid y = i)$ and the class probabilities $p(y = i)$
• Conditional probability by Bayes' rule:
  $p(y = i \mid x) = \dfrac{p(x \mid y = i)\, p(y = i)}{\sum_j p(x \mid y = j)\, p(y = j)} = \dfrac{\exp(a_i)}{\sum_j \exp(a_j)}$
  where we define $a_i \coloneqq \ln\left[p(x \mid y = i)\, p(y = i)\right]$
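The identity is worth seeing numerically: normalizing $\exp(a_i)$ with $a_i = \ln[p(x \mid y = i)\, p(y = i)]$ recovers the Bayes posterior exactly. The likelihood and prior values below are made up.

```python
# Bayes' rule posterior == softmax of a_i = ln[p(x|y=i) p(y=i)].
import numpy as np

lik = np.array([0.05, 0.20, 0.01])   # p(x | y = i) at some fixed x
prior = np.array([0.5, 0.3, 0.2])    # p(y = i)

bayes = lik * prior / np.sum(lik * prior)
a = np.log(lik * prior)
softmax = np.exp(a) / np.sum(np.exp(a))
print(np.allclose(bayes, softmax))   # True
```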
Multiclass logistic regression
• Suppose the class-conditional densities $p(x \mid y = i)$ are normal:
  $p(x \mid y = i) = N(x \mid \mu_i, I) = \dfrac{1}{(2\pi)^{d/2}} \exp\left\{-\tfrac{1}{2} \|x - \mu_i\|^2\right\}$
• Then
  $a_i \coloneqq \ln\left[p(x \mid y = i)\, p(y = i)\right] = -\tfrac{1}{2} x^T x + w_i^T x + b_i$
  where $w_i = \mu_i$ and $b_i = -\tfrac{1}{2} \mu_i^T \mu_i + \ln p(y = i) + \ln \dfrac{1}{(2\pi)^{d/2}}$
Multiclass logistic regression
• Suppose the class-conditional densities $p(x \mid y = i)$ are normal:
  $p(x \mid y = i) = N(x \mid \mu_i, I) = \dfrac{1}{(2\pi)^{d/2}} \exp\left\{-\tfrac{1}{2} \|x - \mu_i\|^2\right\}$
• The term $-\tfrac{1}{2} x^T x$ is the same for every class, so it cancels in the ratio and we have
  $p(y = i \mid x) = \dfrac{\exp(a_i)}{\sum_j \exp(a_j)}, \qquad a_i \coloneqq w_i^T x + b_i$
  where $w_i = \mu_i$ and $b_i = -\tfrac{1}{2} \mu_i^T \mu_i + \ln p(y = i) + \ln \dfrac{1}{(2\pi)^{d/2}}$
Multiclass logistic regression: conclusion
• Suppose the class-conditional densities $p(x \mid y = i)$ are normal:
  $p(x \mid y = i) = N(x \mid \mu_i, I) = \dfrac{1}{(2\pi)^{d/2}} \exp\left\{-\tfrac{1}{2} \|x - \mu_i\|^2\right\}$
• Then
  $p(y = i \mid x) = \dfrac{\exp(w_i^T x + b_i)}{\sum_j \exp(w_j^T x + b_j)}$
  which is the hypothesis class for multiclass logistic regression
• It is the softmax of a linear transformation (a sketch follows below); it can be used to derive the negative log-likelihood loss (cross entropy)
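A minimal sketch of the resulting hypothesis class, softmax of a linear transformation; $W$, $b$, and the query point are made-up values. Subtracting the max score before exponentiating is a standard numerical-stability trick and does not change the result, since softmax is shift-invariant.

```python
# Multiclass logistic regression hypothesis: p(y=i|x) = softmax(Wx + b)_i.
import numpy as np

def predict_proba(W, b, x):
    a = W @ x + b
    a -= a.max()            # shift-invariance: same output, no overflow
    e = np.exp(a)
    return e / e.sum()

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # shape (K, d)
b = np.zeros(3)
p = predict_proba(W, b, np.array([2.0, 0.5]))
print(p, p.sum())           # a probability vector over K = 3 classes
```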
Softmax
• A way to squash $a = (a_1, a_2, \dots, a_i, \dots)$ into a probability vector:
  $\operatorname{softmax}(a) = \left(\dfrac{\exp(a_1)}{\sum_j \exp(a_j)}, \dfrac{\exp(a_2)}{\sum_j \exp(a_j)}, \dots, \dfrac{\exp(a_i)}{\sum_j \exp(a_j)}, \dots\right)$
• Behaves like max: when $a_i \gg a_j$ for all $j \ne i$, then $p_i \approx 1$ and $p_j \approx 0$
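A small sketch of the "behaves like max" property with made-up scores: with moderate scores the probabilities are spread out; once one score dominates, its probability approaches 1.

```python
# Softmax squashes scores into probabilities and approximates argmax.
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))   # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))    # soft weighting
print(softmax(np.array([1.0, 2.0, 30.0])))   # ~ (0, 0, 1): acts like max
```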
Cross entropy for conditional distribution
• Let $\hat{p}_{\text{data}}(y \mid x)$ denote the empirical distribution of the data
• The negative log-likelihood
  $-\dfrac{1}{n} \sum_{i=1}^{n} \log p(y = y_i \mid x_i) = -\mathbb{E}_{\hat{p}_{\text{data}}(y \mid x)} \log p(y \mid x)$
  is the cross entropy between $\hat{p}_{\text{data}}$ and the model output $p$
• Information theory viewpoint: KL divergence
  $D(\hat{p}_{\text{data}} \,\|\, p) = \mathbb{E}_{\hat{p}_{\text{data}}}\left[\log \dfrac{\hat{p}_{\text{data}}}{p}\right] = \mathbb{E}_{\hat{p}_{\text{data}}}[\log \hat{p}_{\text{data}}] - \mathbb{E}_{\hat{p}_{\text{data}}}[\log p]$
  The first term is the (negative) entropy of the data, a constant, so minimizing the KL divergence is equivalent to minimizing the cross entropy $-\mathbb{E}_{\hat{p}_{\text{data}}}[\log p]$
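A minimal sketch of the negative log-likelihood / cross-entropy loss for a softmax model, with made-up logits and labels (indexed 0..K-1 here for array indexing, while the slides number classes 1..K).

```python
# Cross entropy: -1/n * sum_i log p(y = y_i | x_i) from raw scores (logits).
import numpy as np

def cross_entropy(logits, labels):
    a = logits - logits.max(axis=1, keepdims=True)          # stability
    log_probs = a - np.log(np.exp(a).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0,  0.2]])   # n = 2 examples, K = 3 classes
labels = np.array([0, 1])
print(cross_entropy(logits, labels))
```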
Cross entropy for full distribution
• Let $\hat{p}_{\text{data}}(x, y)$ denote the empirical distribution of the data
• The negative log-likelihood
  $-\dfrac{1}{n} \sum_{i=1}^{n} \log p(x_i, y_i) = -\mathbb{E}_{\hat{p}_{\text{data}}(x, y)} \log p(x, y)$
  is the cross entropy between $\hat{p}_{\text{data}}$ and the model output $p$
Multiclass logistic regression: summary
[Diagram: last hidden layer $h$ → linear transformation $w_i^T h + b_i$ → softmax (convert to probability) → cross entropy against the label $y_i$ (loss)]
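To tie the pipeline together, here is a minimal end-to-end sketch on made-up synthetic data, using plain gradient descent (the optimizer choice is mine, not the lecture's). The "last hidden layer" $h$ is just the raw features here; the gradient of the cross entropy with respect to the scores is the well-known softmax-minus-one-hot expression.

```python
# Summary pipeline: h -> linear scores -> softmax -> cross-entropy loss.
import numpy as np

rng = np.random.default_rng(0)
K, d, n = 3, 2, 300
means = np.array([[0, 0], [4, 0], [2, 4]], dtype=float)
y = rng.integers(0, K, size=n)            # classes 0..K-1 for indexing
H = means[y] + rng.normal(size=(n, d))    # features ("last hidden layer")

W, b, lr = np.zeros((K, d)), np.zeros(K), 0.1
for _ in range(200):
    A = H @ W.T + b                                       # linear scores
    A -= A.max(axis=1, keepdims=True)
    P = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # softmax
    G = (P - np.eye(K)[y]) / n      # d(loss)/d(scores): softmax - onehot
    W -= lr * G.T @ H
    b -= lr * G.sum(axis=0)

scores = H @ W.T + b
print("train accuracy:", (scores.argmax(axis=1) == y).mean())
```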