  1. Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang

  2. Review: machine learning basics

  3. Math formulation • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Find $y = f(x) \in \mathcal{H}$ that minimizes $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$ • s.t. the expected loss is small: $L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$

  4. Machine learning 1-2-3 • Collect data and extract features • Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$ • Optimization: minimize the empirical loss

  5. Machine learning 1-2-3 • Collect data and extract features (experience) • Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$ (prior knowledge) • Optimization: minimize the empirical loss

  6. Example: Linear regression • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Find $f_w(x) = w^T x$ in the linear model $\mathcal{H}$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$ ($l_2$ loss)

  7. Why $l_2$ loss • Why not choose another loss: $l_1$ loss, hinge loss, exponential loss, … • Empirical reason: easy to optimize; for the linear case, $w = (X^T X)^{-1} X^T y$ • Theoretical reason: a way to encode prior knowledge • Questions: What kind of prior knowledge? Is there a principled way to derive the loss?
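As an illustration of the closed-form solution $w = (X^T X)^{-1} X^T y$ on the slide, here is a minimal NumPy sketch; the toy data matrix and targets are assumptions made for the example.

```python
import numpy as np

# Toy data (assumed): n = 5 examples, d = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.1, 2.4, 4.6, 7.2, 7.4])

# Closed-form least squares: solve (X^T X) w = X^T y.
# np.linalg.solve is preferred over forming an explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)

predictions = X @ w
print("w =", w)
print("training l2 loss =", np.mean((predictions - y) ** 2))
```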

  8. Maximum likelihood Estimation

  9. Maximum likelihood Estimation (MLE) • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Let $\{P_\theta(x, y): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$ • Would like to pick $\theta$ so that $P_\theta(x, y)$ fits the data well

  10. Maximum likelihood Estimation (MLE) • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Let $\{P_\theta(x, y): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$ • "Fitness" of $\theta$ to one data point $(x_i, y_i)$: $\mathrm{likelihood}(\theta; x_i, y_i) := P_\theta(x_i, y_i)$

  11. Maximum likelihood Estimation (MLE) • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Let $\{P_\theta(x, y): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$ • "Fitness" of $\theta$ to the i.i.d. data points $\{(x_i, y_i)\}$: $\mathrm{likelihood}(\theta; \{x_i, y_i\}) := P_\theta(\{x_i, y_i\}) = \prod_i P_\theta(x_i, y_i)$

  12. Maximum likelihood Estimation (MLE) • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Let $\{P_\theta(x, y): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$ • MLE: maximize the "fitness" of $\theta$ to the i.i.d. data points $\{(x_i, y_i)\}$: $\theta_{ML} = \arg\max_{\theta \in \Theta} \prod_i P_\theta(x_i, y_i)$

  13. Maximum likelihood Estimation (MLE) • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Let $\{P_\theta(x, y): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$ • MLE: maximize the "fitness" of $\theta$ to the i.i.d. data points $\{(x_i, y_i)\}$: $\theta_{ML} = \arg\max_{\theta \in \Theta} \log\left[\prod_i P_\theta(x_i, y_i)\right] = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(x_i, y_i)$
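A quick numerical aside on why the log form is used in practice: the raw product of many probabilities underflows to zero, while the sum of logs stays finite and has the same maximizer. A minimal sketch, with a toy Bernoulli model and simulated coin flips that are assumptions for the example:

```python
import numpy as np

# Toy data (assumed): 1000 coin flips with roughly 60% heads.
rng = np.random.default_rng(0)
flips = rng.random(1000) < 0.6

def likelihood(theta, flips):
    # Raw product prod_i P_theta(x_i): underflows to 0.0 for large n.
    probs = np.where(flips, theta, 1.0 - theta)
    return np.prod(probs)

def log_likelihood(theta, flips):
    # sum_i log P_theta(x_i): numerically stable, same argmax over theta.
    probs = np.where(flips, theta, 1.0 - theta)
    return np.sum(np.log(probs))

print(likelihood(0.6, flips))      # 0.0 due to floating-point underflow
print(log_likelihood(0.6, flips))  # a finite value, roughly -670
```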

  14. Maximum likelihood Estimation (MLE) • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Let $\{P_\theta(x, y): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$ • MLE as a loss: negative log-likelihood. From $\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(x_i, y_i)$, define $l(P_\theta, x_i, y_i) = -\log P_\theta(x_i, y_i)$, so that $\hat{L}(P_\theta) = -\sum_i \log P_\theta(x_i, y_i)$

  15. MLE: conditional log-likelihood • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Let $\{P_\theta(y \mid x): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$ (only care about predicting $y$ from $x$; do not care about $p(x)$) • MLE: negative conditional log-likelihood loss. $\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(y_i \mid x_i)$, with $l(P_\theta, x_i, y_i) = -\log P_\theta(y_i \mid x_i)$ and $\hat{L}(P_\theta) = -\sum_i \log P_\theta(y_i \mid x_i)$

  16. MLE: conditional log-likelihood • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Let $\{P_\theta(y \mid x): \theta \in \Theta\}$ be a family of distributions indexed by $\theta$ ($P_\theta(y \mid x)$: discriminative; $P_\theta(x, y)$: generative) • MLE: negative conditional log-likelihood loss. $\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(y_i \mid x_i)$, with $l(P_\theta, x_i, y_i) = -\log P_\theta(y_i \mid x_i)$ and $\hat{L}(P_\theta) = -\sum_i \log P_\theta(y_i \mid x_i)$

  17. Example: $l_2$ loss • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Find $f_\theta(x)$ that minimizes $\hat{L}(f_\theta) = \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$

  18. Example: $l_2$ loss • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Find $f_\theta(x)$ that minimizes $\hat{L}(f_\theta) = \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$ • $l_2$ loss = Normal model + MLE: define $P_\theta(y \mid x) = \mathrm{Normal}(y; f_\theta(x), \sigma^2)$ • Then $\log P_\theta(y_i \mid x_i) = -\frac{1}{2\sigma^2} (f_\theta(x_i) - y_i)^2 - \log \sigma - \frac{1}{2}\log(2\pi)$ • So $\theta_{ML} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$
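Filling in the step from the Gaussian density to the minimization above, a short derivation (same symbols as the slide):

```latex
\begin{align*}
P_\theta(y_i \mid x_i) &= \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{(f_\theta(x_i) - y_i)^2}{2\sigma^2}\right) \\
-\log P_\theta(y_i \mid x_i) &= \frac{1}{2\sigma^2}(f_\theta(x_i) - y_i)^2
  + \log \sigma + \tfrac{1}{2}\log(2\pi) \\
\theta_{ML} &= \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2
  \quad \text{(the constants and the factor } \tfrac{1}{2\sigma^2}
  \text{ do not change the minimizer)}
\end{align*}
```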

  19. Linear classification

  20. Example 1: image classification • Classify a photo as indoor or outdoor (slide shows one indoor and one outdoor example image)

  21. Example 2: Spam detection

                  #"$"   #"Mr."   #"sale"   …   Spam?
      Email 1       2       1        1           Yes
      Email 2       0       1        0           No
      Email 3       1       1        1           Yes
      …
      Email n       0       0        0           No
      New email     0       0        1           ??

  22. Why classification • Classification: a kind of summary • Easy to interpret • Easy for making decisions

  23. Linear classification • Decision boundary: $w^T x = 0$ • Predict Class 1 where $w^T x > 0$ and Class 0 where $w^T x < 0$ • $w$ is the normal vector of the boundary

  24. Linear classification: natural attempt • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Hypothesis $f_w(x) = w^T x$ in the linear model $\mathcal{H}$: $y = 1$ if $w^T x > 0$, $y = 0$ if $w^T x < 0$ • Prediction: $y = \mathrm{step}(f_w(x)) = \mathrm{step}(w^T x)$

  25. Linear classification: natural attempt • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Find $f_w(x) = w^T x$ to minimize the 0-1 loss $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}[\mathrm{step}(w^T x_i) \ne y_i]$ • Drawback: difficult to optimize; NP-hard in the worst case
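A minimal sketch of the step-function predictor and its empirical 0-1 loss; the toy data and the candidate weight vector are assumptions. Note this only evaluates the loss, while the slide's point is that minimizing it over $w$ is the hard part.

```python
import numpy as np

def predict_step(w, X):
    # y_hat = step(w^T x): 1 if w^T x > 0, else 0.
    return (X @ w > 0).astype(int)

def zero_one_loss(w, X, y):
    # Empirical 0-1 loss: fraction of misclassified examples.
    return np.mean(predict_step(w, X) != y)

# Toy data (assumed): four 2D points with a bias feature appended.
X = np.array([[ 1.0,  2.0, 1.0],
              [ 2.0,  1.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1, 1, 0, 0])
w = np.array([1.0, 1.0, 0.0])

print(zero_one_loss(w, X, y))  # 0.0 on this separable toy set
```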

  26. Linear classification: simple approach • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$ • Reduces to linear regression; ignores the fact that $y \in \{0, 1\}$

  27. Linear classification: simple approach • Drawback: not robust to "outliers" (figure borrowed from Pattern Recognition and Machine Learning, Bishop)

  28. Compare the two • $y = w^T x$ vs. $y = \mathrm{step}(w^T x)$, plotted as functions of $w^T x$

  29. Between the two • Prediction bounded in $[0, 1]$ • Smooth • Sigmoid: $\sigma(a) = \frac{1}{1 + \exp(-a)}$ (figure borrowed from Pattern Recognition and Machine Learning, Bishop)

  30. Linear classification: sigmoid prediction • Squash the output of the linear function: $\sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$ • Find $w$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (\sigma(w^T x_i) - y_i)^2$
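A minimal sketch of the sigmoid-squashed predictor and the squared loss on this slide; how to fit $w$ with gradient descent is shown after the logistic-regression slides.

```python
import numpy as np

def sigmoid(a):
    # sigma(a) = 1 / (1 + exp(-a)): smooth and bounded in (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_squared_loss(w, X, y):
    # L(w) = (1/n) * sum_i (sigma(w^T x_i) - y_i)^2
    p = sigmoid(X @ w)
    return np.mean((p - y) ** 2)
```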

  31. Linear classification: logistic regression • Squash the output of the linear function: $\sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$ • A better approach: interpret the output as a probability, $P_w(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$ and $P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^T x)$

  32. Linear classification: logistic regression • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Find $w$ that minimizes $\hat{L}(w) = -\frac{1}{n} \sum_{i=1}^{n} \log P_w(y_i \mid x_i) = -\frac{1}{n} \sum_{i: y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n} \sum_{i: y_i = 0} \log[1 - \sigma(w^T x_i)]$ • Logistic regression: MLE with sigmoid

  33. Linear classification: logistic regression • Given training data $(x_i, y_i): 1 \le i \le n$ i.i.d. from distribution $D$ • Find $w$ that minimizes $\hat{L}(w) = -\frac{1}{n} \sum_{i: y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n} \sum_{i: y_i = 0} \log[1 - \sigma(w^T x_i)]$ • No closed-form solution; need to use gradient descent
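A minimal gradient-descent sketch for the loss above. The toy data, learning rate, and iteration count are assumptions for the example; the gradient uses the standard identity $\nabla \hat{L}(w) = \frac{1}{n} \sum_i (\sigma(w^T x_i) - y_i)\, x_i$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, X, y):
    # Negative conditional log-likelihood (the loss on the slide).
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    # grad L(w) = (1/n) * X^T (sigma(Xw) - y)
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# Toy data (assumed): four 2D points with a bias feature appended.
X = np.array([[ 1.0,  2.0, 1.0],
              [ 2.0,  1.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1, 1, 0, 0])

w = np.zeros(X.shape[1])
lr = 0.1                      # learning rate (assumed)
for _ in range(1000):         # fixed iteration budget (assumed)
    w -= lr * gradient(w, X, y)

print("w =", w, "loss =", nll(w, X, y))
```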
