  1. Support Vector Machines Part 1 CS 760@UW-Madison

  2. Goals for the lecture: you should understand the following concepts • the margin • the linear support vector machine • the primal and dual formulations of SVM learning • support vectors • Optional: variants of SVM • Optional: Lagrange multipliers

  3. Motivation

  4. Linear classification. A separating hyperplane $(w^*)^\top x = 0$ with normal vector $w^*$: points with $(w^*)^\top x > 0$ fall on the Class +1 side, points with $(w^*)^\top x < 0$ on the Class -1 side. Assume perfect separation between the two classes.

  5. Attempt • Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from a distribution $D$ • Hypothesis: $y = \mathrm{sign}(f_w(x)) = \mathrm{sign}(w^\top x)$ • $y = +1$ if $w^\top x > 0$ • $y = -1$ if $w^\top x < 0$ • Let's assume that we can optimize to find $w$
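To make the hypothesis concrete, here is a minimal sketch in Python (NumPy assumed; the weight vector and query points below are hypothetical, since the slides defer how $w$ is found):

```python
import numpy as np

def predict(w, X):
    """Linear classifier y = sign(w^T x), applied to each row of X.
    Ties (w^T x == 0) are mapped to -1 here; the slides leave that case undefined."""
    return np.where(X @ w > 0, 1, -1)

w = np.array([1.0, -2.0])                 # hypothetical weight vector
X = np.array([[3.0, 1.0], [0.0, 2.0]])    # two query points
print(predict(w, X))                      # [ 1 -1 ]
```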

  6. Multiple optimal solutions? Three separators $w_1, w_2, w_3$ all separate Class +1 from Class -1: the same empirical loss, but different test/expected loss.

  7. What about $w_1$? (Figure: $w_1$ shown with new test data from Class +1 and Class -1.)

  8. What about $w_3$? (Figure: $w_3$ shown with new test data from Class +1 and Class -1.)

  9. Most confident: $w_2$ (Figure: $w_2$ shown with new test data from Class +1 and Class -1.)

  10. Intuition: margin (Figure: $w_2$ with a large margin between Class +1 and Class -1.)

  11. Margin

  12. Margin. We are going to prove the following expressions for the margin using a geometric argument. • Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^\top x = 0$ • Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$ • We need two geometric facts: $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$; and if $u$ is a direction (i.e., a unit vector), then the length of the projection of $x$ onto $u$ is $u^\top x$

  13. Margin • Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^\top x = 0$. Proof: • $w$ is orthogonal to the hyperplane • The unit direction along $w$ is $\frac{w}{\|w\|}$ • The projection of $x$ onto this direction is $\left(\frac{w}{\|w\|}\right)^\top x = \frac{f_w(x)}{\|w\|}$

  14. Margin: with bias • Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$. Proof: • Write $x = x_\perp + r \frac{w}{\|w\|}$ with $x_\perp$ on the hyperplane; then $|r|$ is the distance • Multiply both sides by $w^\top$ and add $b$ • Left-hand side: $w^\top x + b = f_{w,b}(x)$ • Right-hand side: $w^\top x_\perp + r \frac{w^\top w}{\|w\|} + b = 0 + r\|w\|$ • Hence $r = \frac{f_{w,b}(x)}{\|w\|}$
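A quick numerical sanity check of Lemma 2 (a sketch only; NumPy and the example values of $w$, $b$, $x$ are assumptions, not from the slides):

```python
import numpy as np

w = np.array([3.0, 4.0])   # normal vector of the hyperplane w^T z + b = 0
b = -5.0
x = np.array([2.0, 1.0])   # query point

# Distance according to Lemma 2: |w^T x + b| / ||w||
dist_lemma = abs(w @ x + b) / np.linalg.norm(w)

# Distance computed directly: drop a perpendicular from x onto the hyperplane.
x_perp = x - ((w @ x + b) / (w @ w)) * w   # foot of the perpendicular
dist_direct = np.linalg.norm(x - x_perp)

print(dist_lemma, dist_direct)             # both equal 1.0 for these values
```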

  15. Margin: with bias. The notation in the figure is $y(x) = w^\top x + w_0$. Figure from Pattern Recognition and Machine Learning, Bishop.

  16. Support Vector Machine (SVM)

  17. SVM: objective • Absolute margin over all training data points: $\gamma = \min_i \frac{|f_{w,b}(x_i)|}{\|w\|}$ • Since we only want a correct $f_{w,b}$, and recall $y_i \in \{+1, -1\}$, we define the margin to be $\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$ • If $f_{w,b}$ is incorrect on some $x_i$, the margin is negative

  18. SVM: objective • Maximize the margin over all training data points: $\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|} = \max_{w,b} \min_i \frac{y_i (w^\top x_i + b)}{\|w\|}$ • A bit complicated …
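The margin of a given $(w, b)$ on a dataset is simple to compute; a sketch with hypothetical toy data (NumPy assumed):

```python
import numpy as np

def margin(w, b, X, y):
    """min_i y_i (w^T x_i + b) / ||w||; negative if any point is misclassified."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Toy data (hypothetical), labels in {+1, -1}.
X = np.array([[2.0, 2.0], [5.0, 5.0], [6.0, 2.0],
              [-1.0, -1.0], [-4.0, -4.0], [-2.0, -5.0]])
y = np.array([1, 1, 1, -1, -1, -1])

print(margin(np.array([1.0, 1.0]), 0.0, X, y))   # ~1.414 for this separator
```

Maximizing this quantity over $(w, b)$ is exactly the objective on slide 18; the following slides simplify it.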

  19. SVM: simplified objective • Observation: when $(w, b)$ is scaled by a factor $c > 0$, the margin is unchanged: $\frac{y_i (c w^\top x_i + c b)}{\|c w\|} = \frac{y_i (w^\top x_i + b)}{\|w\|}$ • Let's consider a fixed scale such that $y_{i^*} \left( w^\top x_{i^*} + b \right) = 1$, where $x_{i^*}$ is the point closest to the hyperplane

  20. SVM: simplified objective • Let's consider a fixed scale such that $y_{i^*} \left( w^\top x_{i^*} + b \right) = 1$, where $x_{i^*}$ is the point closest to the hyperplane • Now we have $y_i \left( w^\top x_i + b \right) \ge 1$ for all data points, and the equality holds for at least one $i$ • Then the margin over all training points is $\frac{1}{\|w\|}$

  21. SVM: simplified objective • The optimization simplifies to $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i \left( w^\top x_i + b \right) \ge 1, \forall i$ • How to find the optimum $\hat{w}^*$? • Solved by the Lagrange multiplier method
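As a concrete illustration of this quadratic program, here is a sketch that solves it directly with an off-the-shelf solver rather than by hand via Lagrange multipliers. It assumes the cvxpy package and a small linearly separable toy dataset, neither of which is prescribed by the lecture:

```python
import cvxpy as cp
import numpy as np

# Toy, linearly separable data (hypothetical).
X = np.array([[2.0, 2.0], [5.0, 5.0], [6.0, 2.0],
              [-1.0, -1.0], [-4.0, -4.0], [-2.0, -5.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]          # y_i (w^T x_i + b) >= 1 for all i
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)                   # expect w ~ [0.33, 0.33], b ~ -0.33
print("margin = 1/||w|| =", 1 / np.linalg.norm(w.value))  # ~2.12 for this toy data
```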

  22. SVM: optimization

  23. SVM: optimization • Optimization (Quadratic Programming): $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i \left( w^\top x_i + b \right) \ge 1, \forall i$ • Generalized Lagrangian: $\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i \left( w^\top x_i + b \right) - 1 \right]$, where the $\alpha_i$ are the Lagrange multipliers

  24. SVM: optimization • KKT conditions: $\frac{\partial \mathcal{L}}{\partial w} = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i$ (1); $\frac{\partial \mathcal{L}}{\partial b} = 0 \Rightarrow 0 = \sum_i \alpha_i y_i$ (2) • Plug into $\mathcal{L}$: $\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,k} \alpha_i \alpha_k y_i y_k\, x_i^\top x_k$ (3), combined with $0 = \sum_i \alpha_i y_i$ and $\alpha_i \ge 0$
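Spelling out the substitution that produces (3), using only (1), (2), and the Lagrangian from slide 23:

```latex
\begin{aligned}
\mathcal{L}(w, b, \boldsymbol{\alpha})
 &= \tfrac{1}{2}\Big\|\textstyle\sum_i \alpha_i y_i x_i\Big\|^2
    - \sum_i \alpha_i \Big[ y_i \Big(\textstyle\sum_k \alpha_k y_k x_k^\top x_i + b\Big) - 1 \Big] \\
 &= \tfrac{1}{2}\sum_{i,k} \alpha_i \alpha_k y_i y_k\, x_i^\top x_k
    - \sum_{i,k} \alpha_i \alpha_k y_i y_k\, x_i^\top x_k
    - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
 &= \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,k} \alpha_i \alpha_k y_i y_k\, x_i^\top x_k ,
\end{aligned}
```

where the term $b \sum_i \alpha_i y_i$ vanishes because of (2).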

  25. SVM: optimization • Reduces to the dual problem: maximize $\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,k} \alpha_i \alpha_k y_i y_k\, x_i^\top x_k$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$ (the objective depends on the data only through inner products) • Since $w = \sum_i \alpha_i y_i x_i$, we have $w^\top x + b = \sum_i \alpha_i y_i\, x_i^\top x + b$
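A numerical sketch of the dual, under the same assumptions as the primal sketch above (cvxpy, hypothetical toy data). It uses the identity $\frac{1}{2}\sum_{i,k}\alpha_i\alpha_k y_i y_k\, x_i^\top x_k = \frac{1}{2}\big\|\sum_i \alpha_i y_i x_i\big\|^2$ to write the quadratic term:

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [5.0, 5.0], [6.0, 2.0],
              [-1.0, -1.0], [-4.0, -4.0], [-2.0, -5.0]])   # toy data
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = X.shape[0]

alpha = cp.Variable(n)
# Dual objective: sum_i alpha_i - (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [alpha >= 0, cp.sum(cp.multiply(alpha, y)) == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X                          # recover w = sum_i alpha_i y_i x_i
sv = a > 1e-6                            # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)           # y_i (w^T x_i + b) = 1 on the margin boundary
print("alpha =", np.round(a, 4))         # nonzero only for the support vectors
print("w =", w, "b =", b)                # expect w ~ [0.33, 0.33], b ~ -0.33
```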

  26. Support Vectors • the final solution is a sparse linear combination of the training instances • those instances with $\alpha_i > 0$ are called support vectors • they lie on the margin boundary • the solution is NOT changed if we delete the instances with $\alpha_i = 0$
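For illustration, a scikit-learn sketch (an assumption; the lecture does not prescribe a library). With a very large C, the soft-margin SVC behaves like the hard-margin SVM on separable data, and the fitted model exposes exactly the objects named on this slide:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [5.0, 5.0], [6.0, 2.0],
              [-1.0, -1.0], [-4.0, -4.0], [-2.0, -5.0]])   # toy data
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e10).fit(X, y)       # ~hard margin for separable data
print("support vectors:\n", clf.support_vectors_)  # instances with alpha_i > 0
print("y_i * alpha_i:", clf.dual_coef_)            # their signed dual coefficients
print("w:", clf.coef_, "b:", clf.intercept_)

# Deleting the instances with alpha_i = 0 leaves the solution unchanged.
non_sv = np.setdiff1d(np.arange(len(X)), clf.support_)
clf2 = SVC(kernel="linear", C=1e10).fit(np.delete(X, non_sv, axis=0),
                                        np.delete(y, non_sv))
print(clf.coef_, clf2.coef_)                       # essentially identical
```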

  27. Optional: Lagrange Multiplier

  28. Lagrangian • Consider the optimization problem: $\min_w f(w)$ subject to $h_j(w) = 0, \forall 1 \le j \le l$ • Lagrangian: $\mathcal{L}(w, \boldsymbol{\beta}) = f(w) + \sum_j \beta_j h_j(w)$, where the $\beta_j$'s are called Lagrange multipliers

  29. Lagrangian • Consider the optimization problem: $\min_w f(w)$ subject to $h_j(w) = 0, \forall 1 \le j \le l$ • Solved by setting the derivatives of the Lagrangian to 0: $\frac{\partial \mathcal{L}}{\partial w_i} = 0; \quad \frac{\partial \mathcal{L}}{\partial \beta_j} = 0$
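A tiny worked example of the equality-constrained case (chosen for illustration, not from the slides): minimize $f(w) = w_1^2 + w_2^2$ subject to $h(w) = w_1 + w_2 - 1 = 0$.

```latex
\mathcal{L}(w, \beta) = w_1^2 + w_2^2 + \beta (w_1 + w_2 - 1), \qquad
\frac{\partial \mathcal{L}}{\partial w_1} = 2 w_1 + \beta = 0, \quad
\frac{\partial \mathcal{L}}{\partial w_2} = 2 w_2 + \beta = 0, \quad
\frac{\partial \mathcal{L}}{\partial \beta} = w_1 + w_2 - 1 = 0
```

Solving gives $w_1 = w_2 = 1/2$ and $\beta = -1$, i.e., the closest point to the origin on the line $w_1 + w_2 = 1$.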

  30. Generalized Lagrangian • Consider the optimization problem: $\min_w f(w)$ subject to $g_i(w) \le 0, \forall 1 \le i \le k$ and $h_j(w) = 0, \forall 1 \le j \le l$ • Generalized Lagrangian: $\mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(w) + \sum_i \alpha_i g_i(w) + \sum_j \beta_j h_j(w)$, where the $\alpha_i$'s and $\beta_j$'s are called Lagrange multipliers

  31. Generalized Lagrangian • Consider the quantity: $\theta_P(w) := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$ • Why? $\theta_P(w) = f(w)$ if $w$ satisfies all the constraints, and $\theta_P(w) = +\infty$ if $w$ does not satisfy the constraints • So minimizing $f(w)$ is the same as minimizing $\theta_P(w)$: $\min_w f(w) = \min_w \theta_P(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$

  32. Lagrange duality • The primal problem: $p^* := \min_w f(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$ • The dual problem: $d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \min_w \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$ • Always true (weak duality): $d^* \le p^*$
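Why $d^* \le p^*$ always holds: a one-line argument using $\theta_P$ from slide 31 and its dual counterpart $\theta_D(\boldsymbol{\alpha}, \boldsymbol{\beta}) := \min_w \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$ (the name $\theta_D$ is introduced here for convenience).

```latex
\theta_D(\boldsymbol{\alpha}, \boldsymbol{\beta})
  = \min_{w'} \mathcal{L}(w', \boldsymbol{\alpha}, \boldsymbol{\beta})
  \le \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})
  \le \max_{\boldsymbol{\alpha}', \boldsymbol{\beta}' : \alpha'_i \ge 0}
      \mathcal{L}(w, \boldsymbol{\alpha}', \boldsymbol{\beta}')
  = \theta_P(w)
  \quad \text{for every } w \text{ and every } \boldsymbol{\alpha} \ge 0,\ \boldsymbol{\beta}.
```

Taking the max over $(\boldsymbol{\alpha}, \boldsymbol{\beta})$ on the left and the min over $w$ on the right gives $d^* \le p^*$.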

  33. Lagrange duality • The primal problem: $p^* := \min_w f(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$ • The dual problem: $d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \min_w \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$ • Interesting case: when do we have $d^* = p^*$?

  34. Lagrange duality • Theorem: under proper conditions, there exist $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ such that $d^* = \mathcal{L}(w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = p^*$. Moreover, $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions: $\frac{\partial \mathcal{L}}{\partial w_i} = 0$, $\alpha_i g_i(w) = 0$, $g_i(w) \le 0$, $h_j(w) = 0$, $\alpha_i \ge 0$

  35. Lagrange duality • The same theorem, annotated: the condition $\alpha_i g_i(w) = 0$ in the KKT conditions is called dual complementarity.

  36. Lagrange duality • The same theorem, annotated: $g_i(w) \le 0$ and $h_j(w) = 0$ are the primal constraints, and $\alpha_i \ge 0$ are the dual constraints.

  37. Lagrange duality • What are the proper conditions? • One set of conditions (Slater's conditions): $f$ and the $g_i$ are convex, the $h_j$ are affine, and there exists $w$ satisfying $g_i(w) < 0$ for all $i$ • There exist other sets of conditions • Check textbooks, e.g., Convex Optimization by Boyd and Vandenberghe

  38. Optional: Variants of SVM

  39. Hard-margin SVM • Optimization (Quadratic Programming): $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i \left( w^\top x_i + b \right) \ge 1, \forall i$
