Machine Learning Basics Lecture 7: Multiclass Classification
Princeton University COS 495 Instructor: Yingyu Liang
Example: image classification (binary): indoor vs. outdoor.
Example: image classification (multiclass): ImageNet. Figure borrowed from vision.stanford.edu.
Approach 1: reduce to linear regression. Find $f(x) = w^T x$ that minimizes

$$\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2$$

ignoring the fact that $y \in \{1, 2, \dots, L\}$.
Figure from Pattern Recognition and Machine Learning, Bishop
Bad idea even for binary classification: the squared loss also penalizes points that lie far on the correct side of the boundary, so a few outliers can pull the decision boundary far from where it should be.
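A minimal numpy sketch of this reduction (the data and shapes below are made up for illustration): fit ordinary least squares to the integer labels, then round the predictions back to classes.

```python
import numpy as np

# Toy data: n points in d dimensions, labels in {1, ..., L}
rng = np.random.default_rng(0)
n, d, L = 100, 5, 3
X = rng.normal(size=(n, d))
y = rng.integers(1, L + 1, size=n)

# Fit w by least squares, treating the class labels as real-valued targets
w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)

# Predict by rounding the regression output to the nearest valid class
y_hat = np.clip(np.rint(X @ w), 1, L).astype(int)
```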
Approach 2: one-versus-the-rest. Train $L-1$ binary classifiers $f_1, f_2, \dots, f_{L-1}$:
$f_1$ classifies 1 vs. $\{2, 3, \dots, L\}$
$f_2$ classifies 2 vs. $\{1, 3, \dots, L\}$
...
$f_{L-1}$ classifies $L-1$ vs. $\{1, 2, \dots, L-2\}$
Problem: some regions of the input space are ambiguous, assigned to more than one class (a code sketch follows the figure below).
Figure from Pattern Recognition and Machine Learning, Bishop
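One way to write the scheme down, as a sketch: `fit_binary` is a hypothetical trainer (not from any particular library) that returns a scoring function. The slide trains $L-1$ classifiers with the last class as the default; the variant below trains one per class and resolves ambiguity by taking the class with the highest score, which is a common choice.

```python
import numpy as np

def fit_one_vs_rest(X, y, L, fit_binary):
    # One binary problem per class: class k (positive) vs. all the rest
    return [fit_binary(X, (y == k).astype(int)) for k in range(1, L + 1)]

def predict_one_vs_rest(classifiers, X):
    # Stack the per-class scores and break ambiguity with an argmax
    scores = np.stack([f(X) for f in classifiers], axis=1)
    return np.argmax(scores, axis=1) + 1
```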
Approach 3: one-versus-one. Train $L(L-1)/2$ binary classifiers $f_{(1,2)}, f_{(1,3)}, \dots, f_{(L-1,L)}$, one per pair of classes:
$f_{(1,2)}$ classifies 1 vs. 2
$f_{(1,3)}$ classifies 1 vs. 3
...
$f_{(L-1,L)}$ classifies $L-1$ vs. $L$
Problem: again there are ambiguous regions, where the pairwise decisions do not agree on a single class (see the sketch after the figure below).
Figure from Pattern Recognition and Machine Learning, Bishop
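A sketch of one-versus-one with majority voting over the pairs (again, `fit_binary` is a hypothetical trainer returning a 0/1 predictor; voting is one common way to combine the pairwise decisions):

```python
import numpy as np
from itertools import combinations

def fit_one_vs_one(X, y, L, fit_binary):
    # One classifier per pair (i, j), trained only on the data of classes i, j
    models = {}
    for i, j in combinations(range(1, L + 1), 2):
        mask = (y == i) | (y == j)
        models[(i, j)] = fit_binary(X[mask], (y[mask] == i).astype(int))
    return models

def predict_one_vs_one(models, X, L):
    votes = np.zeros((X.shape[0], L))
    for (i, j), f in models.items():
        pred = f(X)                       # 1 -> vote for class i, 0 -> for j
        votes[:, i - 1] += pred
        votes[:, j - 1] += 1 - pred
    return np.argmax(votes, axis=1) + 1   # class with the most votes
```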
Recall: binary logistic regression uses the sigmoid

$$\sigma(w^T x + b) = \frac{1}{1 + \exp\left( -(w^T x + b) \right)}$$

to model the class posterior:

$$P_w(y = 1 \mid x) = \sigma(w^T x + b), \qquad P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^T x + b)$$
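These two formulas in numpy (a sketch; the clipping constant is an arbitrary choice to keep exp from overflowing):

```python
import numpy as np

def sigmoid(t):
    # Logistic function 1 / (1 + exp(-t)), clipped for numerical safety
    t = np.clip(t, -500, 500)
    return 1.0 / (1.0 + np.exp(-t))

def binary_posterior(w, b, x):
    p1 = sigmoid(w @ x + b)    # P(y = 1 | x)
    return p1, 1.0 - p1        # (P(y = 1 | x), P(y = 0 | x))
```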
Where does this form come from? Model the class probabilities $p(y = k \mid x)$. For two classes, Bayes' rule gives

$$p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 2)\, p(y = 2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$

(divide the numerator and denominator by $p(x \mid y = 1)\, p(y = 1)$ to get the middle expression), where we define

$$a \equiv \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)}$$

Logistic regression amounts to assuming that this log-odds is linear in $x$:

$$a = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)} = w^T x + b$$
Example: Gaussian class-conditional densities with identity covariance,

$$p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\frac{1}{2} \| x - \mu_k \|^2 \right\}$$

Then the log-odds is indeed linear:

$$a = \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = w^T x + b, \quad \text{where } w = \mu_1 - \mu_2, \quad b = -\frac{1}{2} \mu_1^T \mu_1 + \frac{1}{2} \mu_2^T \mu_2 + \ln \frac{p(y = 1)}{p(y = 2)}$$
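A quick numeric check of this identity, under the stated Gaussian assumption (all numbers below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
p1, p2 = 0.3, 0.7                      # class priors
x = rng.normal(size=d)

def log_gauss(x, mu):
    # log N(x | mu, I), normalization constant included
    return -0.5 * np.sum((x - mu) ** 2) - 0.5 * d * np.log(2 * np.pi)

# Log-odds computed directly from the densities and priors
a_direct = log_gauss(x, mu1) + np.log(p1) - log_gauss(x, mu2) - np.log(p2)

# Linear form: w = mu1 - mu2, b = -1/2 mu1'mu1 + 1/2 mu2'mu2 + ln(p1/p2)
w = mu1 - mu2
b = -0.5 * mu1 @ mu1 + 0.5 * mu2 @ mu2 + np.log(p1 / p2)
a_linear = w @ x + b

assert np.isclose(a_direct, a_linear)  # the two expressions agree
```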
The same derivation extends to $L$ classes. Model the class probabilities $p(y = k \mid x)$ by Bayes' rule:

$$p(y = k \mid x) = \frac{p(x \mid y = k)\, p(y = k)}{\sum_j p(x \mid y = j)\, p(y = j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \quad \text{where we define } a_k \equiv \ln\left[ p(x \mid y = k)\, p(y = k) \right]$$

Under the same Gaussian assumption $p(x \mid y = k) = N(x \mid \mu_k, I)$,

$$a_k = \ln\left[ p(x \mid y = k)\, p(y = k) \right] = -\frac{1}{2} x^T x + w_k^T x + b_k, \quad \text{where } w_k = \mu_k, \quad b_k = -\frac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$$
The quadratic term $-\frac{1}{2} x^T x$ is the same for every $a_k$, so it cancels between the numerator and denominator. Dropping it, we have

$$p(y = k \mid x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \quad a_k \equiv w_k^T x + b_k$$

with $w_k = \mu_k$ and $b_k = -\frac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$ as before.
Written out,

$$p(y = k \mid x) = \frac{\exp(w_k^T x + b_k)}{\sum_j \exp(w_j^T x + b_j)}$$

which is the hypothesis class for multiclass logistic regression.
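A numpy sketch of this hypothesis class (the parameter shapes are my assumption: one weight vector $w_k$ and bias $b_k$ per class, stacked into a matrix and a vector). Subtracting the max before exponentiating uses the same cancellation as above: shifting every $a_k$ by a constant leaves the ratio unchanged.

```python
import numpy as np

def multiclass_logistic_posterior(W, b, x):
    """p(y = k | x) for all k; W: (L, d), b: (L,), x: (d,)."""
    a = W @ x + b            # a_k = w_k^T x + b_k
    a = a - a.max()          # shift for numerical stability; cancels in ratio
    e = np.exp(a)
    return e / e.sum()       # softmax over the L classes
```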
Maximum likelihood loss (cross entropy). Define the softmax function

$$\text{softmax}(a) = \left( \frac{\exp(a_1)}{\sum_j \exp(a_j)}, \frac{\exp(a_2)}{\sum_j \exp(a_j)}, \dots, \frac{\exp(a_L)}{\sum_j \exp(a_j)} \right)$$
The negative log-likelihood

$$-\frac{1}{n} \sum_{i=1}^{n} \log p(y = y_i \mid x_i) = -\mathbb{E}_{p_\text{data}(y \mid x)} \log p(y \mid x)$$

is the cross entropy between $p_\text{data}$ and the model output $p$.
Compare with the KL divergence

$$KL(p_\text{data} \,\|\, p) = \mathbb{E}_{p_\text{data}}\left[ \log \frac{p_\text{data}}{p} \right] = \mathbb{E}_{p_\text{data}}[\log p_\text{data}] - \mathbb{E}_{p_\text{data}}[\log p]$$

The first term is the (negative) entropy of the data distribution, a constant that does not depend on the model; the second term, $-\mathbb{E}_{p_\text{data}}[\log p]$, is the cross entropy. So minimizing the cross entropy is equivalent to minimizing the KL divergence to the data distribution.
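A quick numeric check of this decomposition (the two distributions below are made up):

```python
import numpy as np

p_data = np.array([0.7, 0.2, 0.1])    # hypothetical data distribution
p_model = np.array([0.5, 0.3, 0.2])   # hypothetical model output

cross_entropy = -(p_data * np.log(p_model)).sum()
entropy = -(p_data * np.log(p_data)).sum()
kl = (p_data * np.log(p_data / p_model)).sum()

assert np.isclose(kl, cross_entropy - entropy)   # KL = CE - H(p_data)
```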
Similarly, for the joint likelihood,

$$-\frac{1}{n} \sum_{i=1}^{n} \log p(x_i, y_i) = -\mathbb{E}_{p_\text{data}(x, y)} \log p(x, y)$$

is the cross entropy between $p_\text{data}$ and the model output $p$.
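A sketch of the conditional cross-entropy loss on a batch of score vectors, with the log-sum-exp written out for stability (the shapes and label convention are assumptions):

```python
import numpy as np

def cross_entropy_loss(A, y):
    """-1/n sum_i log p(y_i | x_i); A: (n, L) scores a_k, y: labels in {1..L}."""
    A = A - A.max(axis=1, keepdims=True)                      # stability shift
    log_p = A - np.log(np.exp(A).sum(axis=1, keepdims=True))  # log softmax
    n = A.shape[0]
    return -log_p[np.arange(n), y - 1].mean()
```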
Putting it together, the final diagram: the last hidden layer $h$ feeds a linear layer $a_k = (w_k)^T h + b_k$; softmax converts the scores $a_k$ to probabilities; the cross entropy against the label $y_i$ gives the loss.
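The whole pipeline from the diagram as one forward pass (a sketch with made-up shapes and random values standing in for a trained network):

```python
import numpy as np

rng = np.random.default_rng(2)
n, hidden, L = 8, 16, 4
H = rng.normal(size=(n, hidden))      # last hidden layer activations h
W = rng.normal(size=(L, hidden))      # one weight vector w_k per class
b = np.zeros(L)                       # biases b_k
y = rng.integers(1, L + 1, size=n)    # labels y_i

A = H @ W.T + b                                  # linear: a_k = w_k^T h + b_k
A -= A.max(axis=1, keepdims=True)                # stability shift
P = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # softmax -> probabilities
loss = -np.log(P[np.arange(n), y - 1]).mean()    # cross entropy with labels
```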