  1. Machine Learning Basics, Lecture 4: SVM I. Princeton University COS 495. Instructor: Yingyu Liang

  2. Review: machine learning basics

  3. Math formulation
     • Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$
     • Find $y = f(x) \in \mathcal{H}$ that minimizes $\hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i)$
     • s.t. the expected loss is small: $L(f) = \mathbb{E}_{(x,y)\sim D}[\, l(f, x, y) \,]$
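To make the two quantities concrete, here is a minimal numpy sketch (made-up data and a squared loss, purely illustrative) of the empirical loss $\hat{L}(f)$ as an average over the training sample; the expected loss $L(f)$ is the same average taken over the whole distribution $D$, which we can only estimate.

```python
import numpy as np

# Minimal sketch: empirical loss of a linear predictor f(x) = w^T x
# under squared loss, i.e. L_hat(f) = (1/n) * sum_i (f(x_i) - y_i)^2.
# Data and weights below are made up for illustration.

def empirical_loss(w, X, y):
    """Average squared loss of f(x) = x^T w over the sample (X, y)."""
    return np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # n = 100 points, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

print(empirical_loss(w_true, X, y))      # roughly the noise variance, ~0.01
```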

  4. Machine learning 1-2-3
     • Collect data and extract features
     • Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$
     • Optimization: minimize the empirical loss

  5. Loss function
     • $l_2$ loss: linear regression
     • Cross-entropy: logistic regression
     • Hinge loss: Perceptron
     • General principle: maximum likelihood estimation (MLE)
       • $l_2$ loss: corresponds to a Normal distribution
       • Logistic regression: corresponds to a sigmoid conditional distribution
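For concreteness, a small Python sketch (not from the slides) of the per-example losses named above, for a linear score $f(x) = w^T x$ and labels $y \in \{+1, -1\}$ (real-valued $y$ for the squared loss). Note the slide pairs "hinge loss" with the Perceptron; the classical Perceptron criterion is the unshifted variant $\max(0, -y f(x))$ of the standard hinge $\max(0, 1 - y f(x))$.

```python
import numpy as np

def squared_loss(score, y):
    """l2 loss used for linear regression: (f(x) - y)^2."""
    return (score - y) ** 2

def logistic_loss(score, y):
    """Cross-entropy for logistic regression with y in {+1, -1}:
    -log sigmoid(y * f(x)) = log(1 + exp(-y * f(x)))."""
    return np.log1p(np.exp(-y * score))

def hinge_loss(score, y, margin=1.0):
    """max(0, margin - y * f(x)); margin=0 gives the Perceptron criterion."""
    return np.maximum(0.0, margin - y * score)

score = 0.8  # a hypothetical f(x) = w^T x
print(squared_loss(score, 1.0), logistic_loss(score, +1), hinge_loss(score, +1))
```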

  6. Optimization
     • Linear regression: closed form solution
     • Logistic regression: gradient descent
     • Perceptron: stochastic gradient descent (SGD)
     • General principle: local improvement
     • SGD: Perceptron; can also be applied to linear regression/logistic regression
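As a concrete instance of the "local improvement" principle, a minimal SGD loop for logistic regression (a sketch assuming numpy arrays, not the lecture's code); the per-example gradient of $\log(1 + e^{-y\, w^T x})$ with respect to $w$ is $-y\,x / (1 + e^{y\, w^T x})$.

```python
import numpy as np

def sgd_logistic_regression(X, y, lr=0.1, epochs=20, seed=0):
    """Stochastic gradient descent on the logistic loss.
    X: (n, d) feature matrix, y: (n,) labels in {+1, -1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):          # one pass in random order
            margin = y[i] * (w @ X[i])
            grad = -y[i] * X[i] / (1.0 + np.exp(margin))
            w -= lr * grad                    # local improvement step
    return w
```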

  7. Principle for hypothesis class?
     • Yes, there exists a general principle (at least philosophically)
     • Different names/faces/connections:
       • Occam's razor
       • VC dimension theory
       • Minimum description length
       • Tradeoff between bias and variance; uniform convergence
       • The curse of dimensionality
     • Running example: Support Vector Machine (SVM)

  8. Motivation

  9. Linear classification
     (Figure: separating hyperplane $(w^*)^T x = 0$ with normal vector $w^*$; $(w^*)^T x > 0$ on the Class +1 side, $(w^*)^T x < 0$ on the Class -1 side)
     • Assume perfect separation between the two classes

  10. Attempt
     • Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$
     • Hypothesis $y = \text{sign}(f_w(x)) = \text{sign}(w^T x)$
       • $y = +1$ if $w^T x > 0$
       • $y = -1$ if $w^T x < 0$
     • Let's assume that we can optimize to find $w$
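The hypothesis is just a linear threshold function; in numpy (a sketch, with ties at exactly $w^T x = 0$ arbitrarily mapped to -1):

```python
import numpy as np

def predict(w, X):
    """Hypothesis y = sign(w^T x), returning +1 or -1 per row of X."""
    return np.where(X @ w > 0, 1, -1)
```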

  11. Multiple optimal solutions?
     (Figure: three separators $w_1$, $w_2$, $w_3$ between Class +1 and Class -1)
     • Same on empirical loss; different on test/expected loss

  12. What about $w_1$?
     (Figure: new test data near $w_1$, Class +1 vs. Class -1)

  13. What about $w_3$?
     (Figure: new test data near $w_3$, Class +1 vs. Class -1)

  14. Most confident: $w_2$
     (Figure: new test data, Class +1 vs. Class -1)

  15. Intuition: margin
     (Figure: $w_2$ separates Class +1 and Class -1 with a large margin)

  16. Margin

  17. Margin
     • Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^T x = 0$
     Proof:
     • $w$ is orthogonal to the hyperplane
     • The unit direction is $\frac{w}{\|w\|}$
     • The projection of $x$ onto it is $\left(\frac{w}{\|w\|}\right)^T x = \frac{f_w(x)}{\|w\|}$

  18. Margin: with bias
     • Claim 1: $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^T x + b = 0$
     Proof:
     • Pick any $x_1$ and $x_2$ on the hyperplane
     • $w^T x_1 + b = 0$
     • $w^T x_2 + b = 0$
     • So $w^T (x_1 - x_2) = 0$

  19. Margin: with bias
     • Claim 2: the origin $0$ has (signed) distance $\frac{-b}{\|w\|}$ to the hyperplane $w^T x + b = 0$
     Proof:
     • Pick any $x_1$ on the hyperplane
     • Project $x_1$ onto the unit direction $\frac{w}{\|w\|}$ to get the distance
     • $\left(\frac{w}{\|w\|}\right)^T x_1 = \frac{-b}{\|w\|}$ since $w^T x_1 + b = 0$

  20. Margin: with bias
     • Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^T x + b = 0$
     Proof:
     • Let $x = x_\perp + r \frac{w}{\|w\|}$ with $x_\perp$ on the hyperplane; then $|r|$ is the distance
     • Multiply both sides by $w^T$ and add $b$
     • Left hand side: $w^T x + b = f_{w,b}(x)$
     • Right hand side: $w^T x_\perp + r \frac{w^T w}{\|w\|} + b = 0 + r\|w\|$
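A quick numerical sanity check of Lemma 2 with made-up numbers (a sketch, assuming numpy): the formula $|f_{w,b}(x)|/\|w\|$ matches the distance from $x$ to its orthogonal projection onto the hyperplane.

```python
import numpy as np

w = np.array([3.0, 4.0])   # hypothetical normal vector
b = -2.0                   # hypothetical bias
x = np.array([1.0, 5.0])   # hypothetical query point

# Distance via Lemma 2: |w^T x + b| / ||w||
dist_lemma = abs(w @ x + b) / np.linalg.norm(w)

# Distance via the explicit orthogonal projection onto the hyperplane:
# x_perp = x - ((w^T x + b) / ||w||^2) * w satisfies w^T x_perp + b = 0.
x_perp = x - ((w @ x + b) / (w @ w)) * w
dist_proj = np.linalg.norm(x - x_perp)

print(dist_lemma, dist_proj)   # both 4.2
```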

  21. The notation here is $y(x) = w^T x + w_0$. Figure from Pattern Recognition and Machine Learning, Bishop.

  22. Support Vector Machine (SVM)

  23. SVM: objective
     • Margin over all training data points: $\gamma = \min_i \frac{|f_{w,b}(x_i)|}{\|w\|}$
     • Since we only want a correct $f_{w,b}$, and recall $y_i \in \{+1, -1\}$, we have $\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$
     • If $f_{w,b}$ is incorrect on some $x_i$, the margin is negative
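The margin $\gamma$ of a candidate $(w, b)$ is easy to compute directly; a minimal sketch with made-up separable data (assuming numpy):

```python
import numpy as np

def margin(w, b, X, y):
    """Signed margin min_i y_i (w^T x_i + b) / ||w||.
    Negative if (w, b) misclassifies some training point."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
print(margin(np.array([1.0, 1.0]), 0.0, X, y))   # positive: all points correct
```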

  24. SVM: objective
     • Maximize margin over all training data points:
       $\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|} = \max_{w,b} \min_i \frac{y_i (w^T x_i + b)}{\|w\|}$
     • A bit complicated ...

  25. SVM: simplified objective
     • Observation: when $(w, b)$ is scaled by a factor $c$, the margin is unchanged:
       $\frac{y_i(c\,w^T x_i + c\,b)}{\|c\,w\|} = \frac{y_i(w^T x_i + b)}{\|w\|}$
     • Let's consider a fixed scale such that $y_{i^*}(w^T x_{i^*} + b) = 1$, where $x_{i^*}$ is the point closest to the hyperplane
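Writing out the one intermediate step behind the observation, for a positive scale $c$:

$$\frac{y_i\,(c\,w^T x_i + c\,b)}{\|c\,w\|} \;=\; \frac{c\,\big(y_i\,(w^T x_i + b)\big)}{c\,\|w\|} \;=\; \frac{y_i\,(w^T x_i + b)}{\|w\|}.$$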

  26. SVM: simplified objective
     • Let's consider a fixed scale such that $y_{i^*}(w^T x_{i^*} + b) = 1$, where $x_{i^*}$ is the point closest to the hyperplane
     • Now we have $y_i(w^T x_i + b) \ge 1$ for all data, and the equality holds for at least one $i$
     • Then the margin is $\frac{1}{\|w\|}$

  27. SVM: simplified objective
     • Optimization simplified to
       $\min_{w,b} \frac{1}{2}\|w\|^2$
       s.t. $y_i(w^T x_i + b) \ge 1, \; \forall i$
     • How to find the optimum $\hat{w}^*$?
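One direct way to compute $\hat{w}^*$ numerically is to hand the quadratic program to a convex solver; below is a sketch assuming the cvxpy package and linearly separable data (the lecture instead reasons about the optimum via the thought experiment and the Lagrange multiplier method on the following slides).

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve min 1/2 ||w||^2  s.t.  y_i (w^T x_i + b) >= 1 for all i.
    X: (n, d) array, y: (n,) array with entries in {+1, -1}.
    Assumes the data are linearly separable, else the QP is infeasible."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    problem.solve()
    return w.value, b.value

# Hypothetical separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_hat, b_hat = hard_margin_svm(X, y)
print(w_hat, b_hat, 1.0 / np.linalg.norm(w_hat))   # last value = margin
```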

  28. SVM: principle for hypothesis class

  29. Thought experiment
     • Suppose we pick an $R$, and suppose we can decide whether there exists a $w$ satisfying
       $\frac{1}{2}\|w\|^2 \le R$, $\quad y_i(w^T x_i + b) \ge 1, \; \forall i$
     • Decrease $R$ until we cannot find a $w$ satisfying the inequalities
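The thought experiment can be simulated literally: a feasibility check for a given $R$, then shrink $R$ until the constraints become unsatisfiable. A rough sketch (my own, assuming cvxpy and a bisection search; the slide only describes the idea):

```python
import cvxpy as cp
import numpy as np

def feasible(R, X, y):
    """Is there (w, b) with 1/2 ||w||^2 <= R and y_i (w^T x_i + b) >= 1 for all i?"""
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    constraints = [0.5 * cp.sum_squares(w) <= R,
                   cp.multiply(y, X @ w + b) >= 1]
    prob = cp.Problem(cp.Minimize(0), constraints)   # pure feasibility problem
    prob.solve()
    return prob.status == cp.OPTIMAL

def smallest_R(X, y, R_hi=1e3, tol=1e-4):
    """Bisection on R. Assumes R_hi is large enough to be feasible
    (i.e., the data are separable). The limit equals 1/2 ||w_hat*||^2."""
    R_lo = 0.0
    while R_hi - R_lo > tol:
        R_mid = 0.5 * (R_lo + R_hi)
        if feasible(R_mid, X, y):
            R_hi = R_mid
        else:
            R_lo = R_mid
    return R_hi
```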

  30-34. Thought experiment
     • $\hat{w}^*$ is the best weight (i.e., the one satisfying the smallest $R$)
     (Slides 30-34 repeat this point; the accompanying figures are not reproduced here)

  35. Thought experiment
     • To handle the difference between empirical and expected losses →
     • choose a large-margin hypothesis (high confidence) →
     • choose a small hypothesis class
     • $\hat{w}^*$ corresponds to the hypothesis class

  36. Thought experiment
     • Principle: use the smallest hypothesis class that still contains a correct/good hypothesis
     • Also true beyond SVM
     • Also true for the case without perfect separation between the two classes
     • Math formulation: VC-dimension theory, etc.
     • $\hat{w}^*$ corresponds to the hypothesis class

  37. Thought experiment
     • Principle: use the smallest hypothesis class that still contains a correct/good hypothesis
     • Whatever you know about the ground truth, add it as a constraint/regularizer
     • $\hat{w}^*$ corresponds to the hypothesis class

  38. SVM: optimization
     • Optimization (Quadratic Programming):
       $\min_{w,b} \frac{1}{2}\|w\|^2$
       s.t. $y_i(w^T x_i + b) \ge 1, \; \forall i$
     • Solved by the Lagrange multiplier method:
       $\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$
       where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers
     • Details in next lecture
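A standard first step with this Lagrangian (covered in detail next lecture; stated here only as a preview, not taken from this slide): setting the gradients of $\mathcal{L}$ with respect to $w$ and $b$ to zero gives

$$\frac{\partial \mathcal{L}}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\sum_i \alpha_i y_i = 0,$$

which is the starting point for the dual formulation.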

  39. Reading
     • Review the Lagrange multiplier method
     • E.g., Section 5 in Andrew Ng's note on SVM
     • Posted on the course website: http://www.cs.princeton.edu/courses/archive/spring16/cos495/
