Support Vector Machines Part 1 CS 760@UW-Madison
Goals for the lecture
You should understand the following concepts:
• the margin
• the linear support vector machine
• the primal and dual formulations of SVM learning
• support vectors
• Optional: variants of SVM
• Optional: Lagrange multipliers
Motivation
Linear classification
(Figure: a separating hyperplane $(w^*)^\top x = 0$, with the half-space $(w^*)^\top x > 0$ containing Class +1, the half-space $(w^*)^\top x < 0$ containing Class -1, and $w^*$ normal to the hyperplane.)
Assume perfect separation between the two classes.
Attempt
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Hypothesis $y = \mathrm{sign}(f_w(x)) = \mathrm{sign}(w^\top x)$
  • $y = +1$ if $w^\top x > 0$
  • $y = -1$ if $w^\top x < 0$
• Let's assume that we can optimize to find $w$
Multiple optimal solutions?
(Figure: three separating hyperplanes $w_1, w_2, w_3$ between Class +1 and Class -1.)
Same empirical loss; different test/expected loss.
What about $w_1$?
(Figure: $w_1$ shown with new test data from Class +1 and Class -1.)
What about $w_3$?
(Figure: $w_3$ shown with new test data from Class +1 and Class -1.)
Most confident: $w_2$
(Figure: $w_2$ shown with new test data from Class +1 and Class -1.)
Intuition: margin
(Figure: $w_2$ separates Class +1 and Class -1 with a large margin.)
Margin
Margin
We are going to prove the following expression for the margin using a geometric argument.
• Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^\top x = 0$
• Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$
Need two geometric facts:
• $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$
• Let $v$ be a direction (i.e., a unit vector). Then the length of the projection of $x$ onto $v$ is $v^\top x$
Margin
• Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^\top x = 0$
Proof:
• $w$ is orthogonal to the hyperplane
• The unit direction is $\frac{w}{\|w\|}$
• The projection of $x$ onto this direction has length $\left(\frac{w}{\|w\|}\right)^\top x = \frac{f_w(x)}{\|w\|}$
Margin: with bias
• Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$
Proof:
• Let $x = x_\perp + r \frac{w}{\|w\|}$, where $x_\perp$ lies on the hyperplane; then $|r|$ is the distance
• Multiply both sides by $w^\top$ and add $b$
• Left hand side: $w^\top x + b = f_{w,b}(x)$
• Right hand side: $w^\top x_\perp + r \frac{w^\top w}{\|w\|} + b = 0 + r\|w\|$
• So $r = \frac{f_{w,b}(x)}{\|w\|}$, and the distance is $|r| = \frac{|f_{w,b}(x)|}{\|w\|}$
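As a sanity check on Lemma 2, here is a minimal NumPy sketch with made-up values for $w$, $b$, and $x$ (not from the lecture), comparing the formula against an explicit projection onto the hyperplane:

```python
import numpy as np

# Made-up example values (not from the lecture).
w = np.array([3.0, 4.0])   # normal vector of the hyperplane w^T x + b = 0
b = -5.0                   # bias term
x = np.array([4.0, 6.0])   # query point

# Distance by Lemma 2: |f_{w,b}(x)| / ||w||
dist_formula = abs(w @ x + b) / np.linalg.norm(w)

# Distance by explicit projection: x_perp = x - r * w/||w||, with r = f_{w,b}(x)/||w||
r = (w @ x + b) / np.linalg.norm(w)
x_perp = x - r * w / np.linalg.norm(w)
dist_projection = np.linalg.norm(x - x_perp)

print(dist_formula, dist_projection)        # both are 6.2 for these values
assert np.isclose(w @ x_perp + b, 0.0)      # x_perp indeed lies on the hyperplane
assert np.isclose(dist_formula, dist_projection)
```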
Margin: with bias
The notation here is: $y(x) = w^\top x + w_0$
Figure from Pattern Recognition and Machine Learning, Bishop
Support Vector Machine (SVM)
SVM: objective
• Absolute margin over all training data points:
$$\gamma = \min_i \frac{|f_{w,b}(x_i)|}{\|w\|}$$
• Since we only want a correct $f_{w,b}$, and recall $y_i \in \{+1, -1\}$, we define the margin to be
$$\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$$
• If $f_{w,b}$ is incorrect on some $x_i$, the margin is negative
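As a quick illustration, here is a minimal NumPy sketch (toy data and hyperplanes made up for illustration) that evaluates this margin definition; note that it becomes negative as soon as some point is misclassified:

```python
import numpy as np

def margin(w, b, X, y):
    """Margin of the hyperplane (w, b) over a dataset: min_i y_i (w^T x_i + b) / ||w||."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

print(margin(np.array([1.0, 1.0]), 0.0, X, y))    # positive: all points on the correct side
print(margin(np.array([-1.0, -1.0]), 0.0, X, y))  # negative: every point misclassified
```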
SVM: objective
• Maximize the margin over all training data points:
$$\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|} = \max_{w,b} \min_i \frac{y_i (w^\top x_i + b)}{\|w\|}$$
• A bit complicated …
SVM: simplified objective
• Observation: when $(w, b)$ is scaled by a factor $c > 0$, the margin is unchanged:
$$\frac{y_i (c\,w^\top x_i + c\,b)}{\|c\,w\|} = \frac{y_i (w^\top x_i + b)}{\|w\|}$$
• Let's fix the scale such that
$$y_{i^*} (w^\top x_{i^*} + b) = 1$$
where $x_{i^*}$ is the point closest to the hyperplane
SVM: simplified objective
• Let's fix the scale such that $y_{i^*} (w^\top x_{i^*} + b) = 1$, where $x_{i^*}$ is the point closest to the hyperplane
• Now for all data points $y_i (w^\top x_i + b) \ge 1$, and equality holds for at least one $i$
• Then the margin over all training points is $\frac{1}{\|w\|}$
SVM: simplified objective
• Optimization simplified to
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \ \forall i$$
• How to find the optimum $\widehat{w}^*$?
• Solved by the Lagrange multiplier method
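For concreteness, here is a minimal sketch of this quadratic program using the cvxpy modeling library (the toy data is made up, and cvxpy is my choice here, not part of the lecture):

```python
import cvxpy as cp
import numpy as np

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# min (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin =", 1.0 / np.linalg.norm(w.value))  # equals 1/||w|| at the optimum
```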
SVM: optimization
SVM: optimization
• Optimization (Quadratic Programming):
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \ \forall i$$
• Generalized Lagrangian:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right]$$
where the $\alpha_i$ are the Lagrange multipliers
SVM: optimization
• KKT conditions:
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \ \Rightarrow \ w = \sum_i \alpha_i y_i x_i \quad (1)$$
$$\frac{\partial \mathcal{L}}{\partial b} = 0 \ \Rightarrow \ 0 = \sum_i \alpha_i y_i \quad (2)$$
• Plug into $\mathcal{L}$:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad (3)$$
combined with $0 = \sum_i \alpha_i y_i$, $\alpha_i \ge 0$
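To fill in the step between (1)–(2) and (3), the substitution works out as follows:

```latex
\begin{align*}
\mathcal{L}(w, b, \boldsymbol{\alpha})
  &= \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \big[ y_i (w^\top x_i + b) - 1 \big] \\
  &= \tfrac{1}{2} w^\top w - w^\top \sum_i \alpha_i y_i x_i - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
  &= \tfrac{1}{2} w^\top w - w^\top w - 0 + \sum_i \alpha_i
     && \text{using (1) } w = \textstyle\sum_i \alpha_i y_i x_i \text{ and (2) } \textstyle\sum_i \alpha_i y_i = 0 \\
  &= \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j.
\end{align*}
```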
SVM: optimization
• Reduces to the dual problem:
$$\max_{\boldsymbol{\alpha}} \ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \ \alpha_i \ge 0$$
• Note: the dual only depends on the inner products $x_i^\top x_j$
• Since $w = \sum_i \alpha_i y_i x_i$, we have $w^\top x + b = \sum_i \alpha_i y_i \, x_i^\top x + b$
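To see the dual in action, here is a minimal sketch (toy data; the helper names and the use of scipy.optimize.minimize are my own choices, not from the lecture) that solves the dual numerically and recovers $(w, b)$ via (1):

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n = len(y)

# K[i, j] = y_i y_j x_i^T x_j
K = (y[:, None] * X) @ (y[:, None] * X).T

# Dual objective, negated because scipy minimizes: -(sum_i a_i - 1/2 a^T K a)
def neg_dual(a):
    return 0.5 * a @ K @ a - np.sum(a)

constraints = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i a_i y_i = 0
bounds = [(0.0, None)] * n                              # a_i >= 0

alpha = minimize(neg_dual, np.zeros(n), bounds=bounds, constraints=constraints).x

# Recover the primal solution: w = sum_i a_i y_i x_i; b from any support vector (a_i > 0)
w = (alpha * y) @ X
sv = alpha > 1e-6
b = np.mean(y[sv] - X[sv] @ w)

print("alpha =", np.round(alpha, 3))
print("w =", w, "b =", b)
```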
Support Vectors
• The final solution is a sparse linear combination of the training instances
• Those instances with $\alpha_i > 0$ are called support vectors
• They lie on the margin boundary
• The solution is NOT changed if we delete the instances with $\alpha_i = 0$
(Figure: the support vectors lie on the margin boundary.)
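For comparison, scikit-learn's SVC (not part of the lecture) exposes this sparsity directly; with kernel='linear' and a very large C it approximates the hard-margin SVM:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [4.0, 5.0]])
y = np.array([+1, +1, -1, -1, +1])

# Large C approximates the hard margin; kernel='linear' keeps the model w^T x + b.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vector indices:", clf.support_)        # typically only a few instances
print("support vectors:", clf.support_vectors_)
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("alpha_i * y_i =", clf.dual_coef_[0])            # nonzero only for support vectors
```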
Optional: Lagrange Multiplier
Lagrangian
• Consider the optimization problem:
$$\min_x f(x) \quad \text{s.t.} \quad h_j(x) = 0, \ \forall\, 1 \le j \le l$$
• Lagrangian:
$$\mathcal{L}(x, \boldsymbol{\beta}) = f(x) + \sum_j \beta_j h_j(x)$$
where the $\beta_j$'s are called Lagrange multipliers
Lagrangian
• Consider the optimization problem:
$$\min_x f(x) \quad \text{s.t.} \quad h_j(x) = 0, \ \forall\, 1 \le j \le l$$
• Solved by setting the derivatives of the Lagrangian to 0:
$$\frac{\partial \mathcal{L}}{\partial x_i} = 0; \qquad \frac{\partial \mathcal{L}}{\partial \beta_j} = 0$$
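A tiny worked example of this recipe (not from the slides): minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$:

```latex
\begin{align*}
\mathcal{L}(x, \beta) &= x_1^2 + x_2^2 + \beta\,(x_1 + x_2 - 1), \\
\frac{\partial \mathcal{L}}{\partial x_1} &= 2x_1 + \beta = 0, \qquad
\frac{\partial \mathcal{L}}{\partial x_2} = 2x_2 + \beta = 0, \qquad
\frac{\partial \mathcal{L}}{\partial \beta} = x_1 + x_2 - 1 = 0, \\
&\Rightarrow\ x_1 = x_2 = \tfrac{1}{2}, \quad \beta = -1.
\end{align*}
```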
Generalized Lagrangian
• Consider the optimization problem:
$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \ \forall\, 1 \le i \le k, \qquad h_j(x) = 0, \ \forall\, 1 \le j \le l$$
• Generalized Lagrangian:
$$\mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(x) + \sum_i \alpha_i g_i(x) + \sum_j \beta_j h_j(x)$$
where the $\alpha_i$'s and $\beta_j$'s are called Lagrange multipliers
Generalized Lagrangian
• Consider the quantity:
$$\theta_P(x) := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• Why?
$$\theta_P(x) = \begin{cases} f(x), & \text{if } x \text{ satisfies all the constraints} \\ +\infty, & \text{if } x \text{ does not satisfy the constraints} \end{cases}$$
• So minimizing $f(x)$ subject to the constraints is the same as minimizing $\theta_P(x)$:
$$\min_x f(x) = \min_x \theta_P(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
Lagrange duality
• The primal problem:
$$p^* := \min_x f(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• The dual problem:
$$d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \min_x \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• Always true (weak duality): $d^* \le p^*$
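Weak duality follows from a general max–min inequality; a short argument:

```latex
\text{For all } \boldsymbol{\alpha} \ge 0,\ \boldsymbol{\beta},\ x:\qquad
\min_{x'} \mathcal{L}(x', \boldsymbol{\alpha}, \boldsymbol{\beta})
\;\le\; \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})
\;\le\; \max_{\boldsymbol{\alpha}',\boldsymbol{\beta}':\, \alpha_i' \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}', \boldsymbol{\beta}') = \theta_P(x).
```

Taking the max over $(\boldsymbol{\alpha}, \boldsymbol{\beta})$ on the left and the min over $x$ on the right gives $d^* \le p^*$.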
Lagrange duality
• The primal problem:
$$p^* := \min_x f(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• The dual problem:
$$d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \min_x \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• Interesting case: when do we have $d^* = p^*$?
Lagrange duality
• Theorem: under proper conditions, there exist $x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ such that
$$d^* = \mathcal{L}(x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = p^*$$
• Moreover, $x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:
$$\frac{\partial \mathcal{L}}{\partial x_i} = 0, \qquad \underbrace{\alpha_i g_i(x) = 0}_{\text{dual complementarity}}, \qquad \underbrace{g_i(x) \le 0, \quad h_j(x) = 0}_{\text{primal constraints}}, \qquad \underbrace{\alpha_i \ge 0}_{\text{dual constraints}}$$
Lagrange duality
• What are the proper conditions?
• One set of conditions (Slater's conditions):
  • $f$ and the $g_i$ are convex, the $h_j$ are affine, and there exists an $x$ satisfying $g_i(x) < 0$ for all $i$
• There exist other sets of conditions
• Check textbooks, e.g., Convex Optimization by Boyd and Vandenberghe
Optional: Variants of SVM
Hard-margin SVM
• Optimization (Quadratic Programming):
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \ \forall i$$