SLIDE 1

Machine Learning Basics Lecture 7: Multiclass Classification

Princeton University COS 495 Instructor: Yingyu Liang

SLIDE 2

Example: image classification

[Image to be labeled indoor or outdoor (binary classification); the correct label here is indoor]

SLIDE 3

Example: image classification (multiclass)

ImageNet figure borrowed from vision.stanford.edu

SLIDE 4

Multiclass classification

  • Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • $x_i \in \mathbb{R}^d$, $y_i \in \{1, 2, \dots, K\}$
  • Find $f(x) : \mathbb{R}^d \to \{1, 2, \dots, K\}$ that outputs correct labels
  • What kind of $f$?
SLIDE 5

Approaches for multiclass classification

SLIDE 6

Approach 1: reduce to regression

  • Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ that minimizes $\widehat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$
  • Bad idea even for binary classification

Reduce to linear regression; ignore the fact that $y \in \{1, 2, \dots, K\}$
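A minimal numpy sketch of this reduction (the function names are illustrative, not from the lecture): fit $w$ by ordinary least squares on the integer labels and predict by rounding to the nearest class. It runs, but it treats the labels as ordered real numbers, which is exactly why it is a bad idea.

```python
import numpy as np

def fit_regression_classifier(X, y):
    """Least-squares fit on integer labels y in {1, ..., K}; a bias is folded in as a constant feature."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y.astype(float), rcond=None)
    return w

def predict_regression_classifier(w, X, K):
    """Round the real-valued regression output to the nearest class in {1, ..., K}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.clip(np.rint(Xb @ w), 1, K).astype(int)
```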

SLIDE 7

Approach 1: reduce to regression

Figure from Pattern Recognition and Machine Learning, Bishop

Bad idea even for binary classification

SLIDE 8

Approach 2: one-versus-the-rest

  • Find $K - 1$ classifiers $f_1, f_2, \dots, f_{K-1}$
  • $f_1$ classifies $1$ vs. $\{2, 3, \dots, K\}$
  • $f_2$ classifies $2$ vs. $\{1, 3, \dots, K\}$
  • …
  • $f_{K-1}$ classifies $K - 1$ vs. $\{1, 2, \dots, K - 2\}$
  • Points not classified to classes $\{1, 2, \dots, K - 1\}$ are assigned to class $K$
  • Problem of ambiguous region: some points may be classified to more than one class
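A sketch of this scheme (the helper names and the use of scikit-learn's LogisticRegression as the base binary classifier are my own illustrative choices, not the lecture's). Ties in the ambiguous region are broken here by the largest margin, which is one common workaround:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_one_vs_rest(X, y, K):
    """Train K-1 binary classifiers; classifier k separates class k from all remaining classes."""
    return [LogisticRegression().fit(X, (y == k).astype(int)) for k in range(1, K)]

def predict_one_vs_rest(classifiers, X, K):
    """Pick the positive class with the largest margin; if no classifier fires, fall back to class K."""
    margins = np.column_stack([clf.decision_function(X) for clf in classifiers])  # shape (n, K-1)
    preds = np.full(X.shape[0], K)          # default: the leftover class K
    fired = margins.max(axis=1) > 0         # at least one "class k vs. rest" classifier is positive
    preds[fired] = margins[fired].argmax(axis=1) + 1
    return preds
```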

SLIDE 9

Approach 2: one-versus-the-rest

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 10

Approach 3: one-versus-one

  • Find $K(K-1)/2$ classifiers $f_{(1,2)}, f_{(1,3)}, \dots, f_{(K-1, K)}$
  • $f_{(1,2)}$ classifies $1$ vs. $2$
  • $f_{(1,3)}$ classifies $1$ vs. $3$
  • …
  • $f_{(K-1,K)}$ classifies $K - 1$ vs. $K$
  • Computationally expensive: think of $K = 1000$
  • Problem of ambiguous region
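A sketch of one-versus-one with majority voting (illustrative names; voting is a standard way to turn the pairwise decisions into a single label, though ambiguous ties can still occur):

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def fit_one_vs_one(X, y, K):
    """Train K(K-1)/2 pairwise classifiers, one per unordered pair of classes (j, k)."""
    classifiers = {}
    for j, k in combinations(range(1, K + 1), 2):
        mask = (y == j) | (y == k)                     # keep only examples from the two classes
        classifiers[(j, k)] = LogisticRegression().fit(X[mask], (y[mask] == k).astype(int))
    return classifiers

def predict_one_vs_one(classifiers, X, K):
    """Each pairwise classifier casts a vote; predict the class with the most votes."""
    votes = np.zeros((X.shape[0], K), dtype=int)
    for (j, k), clf in classifiers.items():
        winner = np.where(clf.predict(X) == 1, k, j)   # output 1 means "class k beats class j"
        votes[np.arange(X.shape[0]), winner - 1] += 1
    return votes.argmax(axis=1) + 1
```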
SLIDE 11

Approach 3: one-versus-one

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 12

Approach 4: discriminant functions

  • Find $K$ scoring functions $s_1, s_2, \dots, s_K$
  • Classify $x$ to class $y = \arg\max_k s_k(x)$
  • Computationally cheap
  • No ambiguous regions
SLIDE 13

Linear discriminant functions

  • Find $K$ discriminant functions $s_1, s_2, \dots, s_K$
  • Classify $x$ to class $y = \arg\max_k s_k(x)$
  • Linear discriminant: $s_k(x) = w_k^T x$, with $w_k \in \mathbb{R}^d$
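The argmax rule is a one-liner; a minimal sketch (hypothetical names), with the $K$ weight vectors stacked as the rows of a matrix:

```python
import numpy as np

def predict_linear_discriminant(W, X):
    """W has shape (K, d): row k is w_k. Score every class with s_k(x) = w_k^T x and take the argmax."""
    scores = X @ W.T                    # shape (n, K): one score per example and class
    return scores.argmax(axis=1) + 1    # classes numbered 1 .. K

# Example: 3 classes in 2 dimensions; the point (1, 0) scores highest under class 1's weights.
W = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
print(predict_linear_discriminant(W, np.array([[1.0, 0.0]])))   # -> [1]
```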
SLIDE 14

Linear discriminant functions

  • Linear discriminant: $s_k(x) = w_k^T x$, with $w_k \in \mathbb{R}^d$
  • Leads to a convex decision region for each class (by $y = \arg\max_k w_k^T x$)

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 15

Conditional distribution as discriminant

  • Find $K$ discriminant functions $s_1, s_2, \dots, s_K$
  • Classify $x$ to class $y = \arg\max_k s_k(x)$
  • Conditional distributions: $s_k(x) = p(y = k \mid x)$
  • Parametrize by $w_k$: $s_k(x) = p_{w_k}(y = k \mid x)$
SLIDE 16

Multiclass logistic regression

SLIDE 17

Review: binary logistic regression

  • Sigmoid

$$\sigma(w^T x + b) = \frac{1}{1 + \exp(-(w^T x + b))}$$

  • Interpret as conditional probability

$$p_w(y = 1 \mid x) = \sigma(w^T x + b), \qquad p_w(y = 0 \mid x) = 1 - p_w(y = 1 \mid x) = 1 - \sigma(w^T x + b)$$

  • How to extend to multiclass?
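A tiny numpy sketch of this interpretation (the names are illustrative): the model outputs a probability for class 1 and its complement for class 0.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid; inputs are clipped to keep exp() from overflowing."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def binary_logistic_proba(w, b, X):
    """Columns are [p(y=0 | x), p(y=1 | x)], with p(y=1 | x) = sigmoid(w^T x + b)."""
    p1 = sigmoid(X @ w + b)
    return np.column_stack([1.0 - p1, p1])
```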
SLIDE 18

Review: binary logistic regression

  • Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
  • Conditional probability by Bayes' rule:

$$p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 2)\, p(y = 2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$

where we define

$$a := \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)}$$

SLIDE 19

Review: binary logistic regression

  • Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
  • $p(y = 1 \mid x) = \sigma(a) = \sigma(w^T x + b)$ is equivalent to setting the log odds

$$a = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)} = w^T x + b$$

  • Why linear log odds?
SLIDE 20

Review: binary logistic regression

  • Suppose the class-conditional densities $p(x \mid y = k)$ are normal

$$p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\tfrac{1}{2} \lVert x - \mu_k \rVert^2 \right\}$$

  • The log odds is then

$$a = \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = w^T x + b, \quad \text{where } w = \mu_1 - \mu_2, \ \ b = -\tfrac{1}{2} \mu_1^T \mu_1 + \tfrac{1}{2} \mu_2^T \mu_2 + \ln \frac{p(y = 1)}{p(y = 2)}$$

SLIDE 21

Multiclass logistic regression

  • Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
  • Conditional probability by Bayes' rule:

$$p(y = k \mid x) = \frac{p(x \mid y = k)\, p(y = k)}{\sum_j p(x \mid y = j)\, p(y = j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \quad \text{where we define } a_k := \ln\left[ p(x \mid y = k)\, p(y = k) \right]$$

SLIDE 22

Multiclass logistic regression

  • Suppose the class-conditional densities $p(x \mid y = k)$ are normal

$$p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\tfrac{1}{2} \lVert x - \mu_k \rVert^2 \right\}$$

  • Then

$$a_k := \ln\left[ p(x \mid y = k)\, p(y = k) \right] = -\tfrac{1}{2} x^T x + w_k^T x + b_k, \quad \text{where } w_k = \mu_k, \ \ b_k = -\tfrac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$$

SLIDE 23

Multiclass logistic regression

  • Suppose the class-conditional densities $p(x \mid y = k)$ are normal

$$p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\tfrac{1}{2} \lVert x - \mu_k \rVert^2 \right\}$$

  • Canceling the common $-\tfrac{1}{2} x^T x$ term, we have

$$p(y = k \mid x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \quad a_k := w_k^T x + b_k, \quad \text{where } w_k = \mu_k, \ \ b_k = -\tfrac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$$

SLIDE 24

Multiclass logistic regression: conclusion

  • Suppose the class-conditional densities $p(x \mid y = k)$ are normal

$$p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\tfrac{1}{2} \lVert x - \mu_k \rVert^2 \right\}$$

  • Then

$$p(y = k \mid x) = \frac{\exp(w_k^T x + b_k)}{\sum_j \exp(w_j^T x + b_j)}$$

which is the hypothesis class for multiclass logistic regression

  • It is a softmax on a linear transformation; it can be used to derive the negative log-likelihood loss (cross entropy)
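A minimal numpy sketch of this hypothesis class (illustrative names): stack the per-class weights as the rows of $W$, compute the linear scores, and apply softmax to get $p(y = k \mid x)$.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the per-row max is the usual trick to avoid overflow."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def multiclass_logistic_proba(W, b, X):
    """p(y = k | x) = exp(w_k^T x + b_k) / sum_j exp(w_j^T x + b_j).
    W has shape (K, d), b has shape (K,), X has shape (n, d); returns an (n, K) probability matrix."""
    return softmax(X @ W.T + b)
```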

SLIDE 25

Softmax

  • A way to squash $a = (a_1, a_2, \dots, a_k, \dots)$ into a probability vector $p$

$$\mathrm{softmax}(a) = \left( \frac{\exp(a_1)}{\sum_j \exp(a_j)}, \frac{\exp(a_2)}{\sum_j \exp(a_j)}, \dots, \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \dots \right)$$

  • Behaves like max: when $a_k \gg a_j$ for all $j \ne k$, we get $p_k \approx 1$ and $p_j \approx 0$
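A quick numerical illustration of the "behaves like max" claim (printed values rounded to three decimals):

```python
import numpy as np

a = np.array([5.0, 1.0, 0.0])
p = np.exp(a) / np.exp(a).sum()            # softmax with a moderate gap between scores
print(np.round(p, 3))                      # -> [0.976 0.018 0.007]: already close to a one-hot max

a_big = np.array([50.0, 1.0, 0.0])
p_big = np.exp(a_big) / np.exp(a_big).sum()
print(np.round(p_big, 3))                  # -> [1. 0. 0.]: with a large gap softmax is effectively argmax
```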
SLIDE 26

Cross entropy for conditional distribution

  • Let $p_{\text{data}}(y \mid x)$ denote the empirical distribution of the data
  • Negative log-likelihood

$$-\frac{1}{n} \sum_{i=1}^{n} \log p(y = y_i \mid x_i) = -\mathbb{E}_{p_{\text{data}}(y \mid x)} \log p(y \mid x)$$

is the cross entropy between $p_{\text{data}}$ and the model output $p$

  • Information theory viewpoint: KL divergence

$$\mathrm{KL}(p_{\text{data}} \,\|\, p) = \mathbb{E}_{p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}}{p}\right] = \underbrace{\mathbb{E}_{p_{\text{data}}}[\log p_{\text{data}}]}_{\text{negative entropy; constant}} \underbrace{-\ \mathbb{E}_{p_{\text{data}}}[\log p]}_{\text{cross entropy}}$$
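A minimal sketch of the loss computation (illustrative names): given the model's $(n, K)$ probability matrix and the integer labels, the cross entropy is just the average negative log-probability assigned to the correct class.

```python
import numpy as np

def cross_entropy_loss(probs, y):
    """Average negative log-likelihood: -1/n * sum_i log p(y = y_i | x_i).
    probs has shape (n, K), row i is p(. | x_i); y holds integer labels in {1, ..., K}."""
    n = y.shape[0]
    return -np.mean(np.log(probs[np.arange(n), y - 1]))
```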

SLIDE 27

Cross entropy for full distribution

  • Let $p_{\text{data}}(x, y)$ denote the empirical distribution of the data
  • Negative log-likelihood

$$-\frac{1}{n} \sum_{i=1}^{n} \log p(x_i, y_i) = -\mathbb{E}_{p_{\text{data}}(x, y)} \log p(x, y)$$

is the cross entropy between $p_{\text{data}}$ and the model output $p$

SLIDE 28

Multiclass logistic regression: summary

Pipeline: last hidden layer $h$ → linear scores $(w_k)^T h + b_k$ → softmax (convert to probability $p_k$) → cross-entropy loss against the label $y_i$
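An end-to-end sketch of this pipeline trained with plain batch gradient descent (the optimizer and all names are illustrative choices; the slide only specifies the linear → softmax → cross-entropy structure):

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_regression(H, y, K, lr=0.1, steps=1000):
    """Last hidden layer H (n, d) -> linear scores -> softmax -> cross entropy with labels y in {1..K}."""
    n, d = H.shape
    W, b = np.zeros((K, d)), np.zeros(K)
    Y = np.eye(K)[y - 1]                       # one-hot encoding of the labels
    for _ in range(steps):
        P = softmax(H @ W.T + b)               # p(y = k | h) for every example
        G = (P - Y) / n                        # gradient of the cross-entropy loss w.r.t. the scores
        W -= lr * (G.T @ H)                    # gradient step through the linear layer
        b -= lr * G.sum(axis=0)
    return W, b
```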