Machine Learning: A Geometric Approach (CIML book Chap. 7.7)
Linear Classification: Support Vector Machines (SVM)
Professor Liang Huang (some slides from Alex Smola, CMU)
Linear Separator (figure: a linear separator between ham and spam emails)
From Perceptron to SVM
(timeline figure; annotations in the original: kernels, max margin, conservative updates, inseparable case, fall of USSR, AT&T Research / ex-AT&T and students)
• 1959 Rosenblatt: invention of the perceptron
• 1962 Novikoff: convergence proof
• 1964 Vapnik & Chervonenkis: max margin
• 1997 Cortes & Vapnik: SVM (batch, +soft-margin)
• 1999 Freund & Schapire: voted/averaged perceptron (revived)
• 2002 Collins: structured perceptron (inseparable case)
• 2003 Crammer & Singer: MIRA (online approximation, conservative updates)
• 2005* McDonald, Crammer & Pereira: structured MIRA
• 2006 Singer group: aggressive MIRA
• 2007–2010* Singer group: Pegasos (+subgradient descent, +minibatch online)
*mentioned in lecture but optional (the other papers are all covered in detail)
Large Margin Classifier
⟨w, x⟩ + b ≥ 1 (positive side), ⟨w, x⟩ + b ≤ −1 (negative side)
linear function f(x) = ⟨w, x⟩ + b
Why large margins?
• Maximum robustness relative to uncertainty
• Symmetry breaking
• Independent of correctly classified instances
• Easy to find for easy problems
(figure: positive and negative points separated with margin ρ)
Feature Map Φ • SVM is often used with kernels
Large Margin Classifier
margin boundaries: ⟨w, x⟩ + b = −1 and ⟨w, x⟩ + b = 1; decision regions: ⟨w, x⟩ + b ≥ 1 and ⟨w, x⟩ + b ≤ −1
functional margin: y_i (w · x_i)
geometric margin: y_i (w · x_i) / ‖w‖ = 1 / ‖w‖ (for points on the margin boundaries)
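A quick note on scale (our own addition, not on the slide): for any c > 0, rescaling w to cw multiplies the functional margin but leaves the geometric margin unchanged:
  y_i ((cw) · x_i) = c · y_i (w · x_i),   but   y_i ((cw) · x_i) / ‖cw‖ = y_i (w · x_i) / ‖w‖
This is why we can fix the functional margin to 1 and maximize 1/‖w‖ on the next slide.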
Large Margin Classifier
Q1: what if we want a functional margin of 2?
Q2: what if we want a geometric margin of 1?
SVM objective (max version): maximize the geometric margin subject to the functional margin being at least 1:
  max_w 1/‖w‖   s.t.   ∀(x, y) ∈ D,  y(w · x) ≥ 1
Large Margin Classifier
SVM objective (min version): minimize the weight vector subject to the functional margin being at least 1:
  min_w ‖w‖   s.t.   ∀(x, y) ∈ D,  y(w · x) ≥ 1
interpretation: small models generalize better
Large Margin Classifier
SVM objective (min version): minimize the weight vector subject to the functional margin being at least 1:
  min_w ½‖w‖²   s.t.   ∀(x, y) ∈ D,  y(w · x) ≥ 1
‖w‖ is not differentiable, but ‖w‖² is.
SVM vs. MIRA
• SVM: minimize the weight vector to enforce a functional margin of at least 1 on ALL EXAMPLES:
  min_w ½‖w‖²   s.t.   ∀(x, y) ∈ D,  y(w · x) ≥ 1
• MIRA: minimize the weight change to enforce a functional margin of at least 1 on THIS EXAMPLE x_i:
  min_{w′} ‖w′ − w‖²   s.t.   w′ · x_i ≥ 1
• MIRA is a 1-step or online approximation of SVM
• Aggressive MIRA → SVM as p → 1
Convex Hull Interpretation
• max. distance between the two convex hulls
• how many support vectors in 2D?
• the weight vector is determined by the support vectors alone
• c.f. perceptron: w = Σ_{(x, y) ∈ errors} y · x
• why don't we use convex hulls for SVMs in practice? what about MIRA?
Convexity and Convex Hulls (figure: a convex combination of points)
Optimization
• Primal optimization problem:
  min_w ½‖w‖²   s.t.   ∀(x, y) ∈ D,  y(w · x) ≥ 1   (constraint)
• Convex optimization: a convex function over a convex set!
• Quadratic programming: a quadratic objective with linear constraints
MIRA as QP
• MIRA is a trivial QP and can be solved geometrically:
  min_{w′} ‖w′ − w‖²   s.t.   w′ · x_i ≥ 1
  (figure: project w_i onto the constraint plane w′ · x_i = 1; the distance is (1 − w_i · x_i)/‖x_i‖, along the direction x_i/‖x_i‖)
• what about multiple constraints (e.g., minibatch)?
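As a concrete illustration (our own sketch, not from the slides), the single-constraint MIRA update has the closed form w′ = w + ((1 − y(w · x)) / ‖x‖²) · y · x whenever the constraint is violated; a minimal NumPy version (the function name mira_update is ours):

```python
import numpy as np

def mira_update(w, x, y):
    """One MIRA step: the smallest change to w such that y * (w . x) >= 1.

    Closed form of  min_{w'} ||w' - w||^2  s.t.  y * (w' . x) >= 1:
    if the constraint is violated, project w onto the constraint plane.
    """
    margin = y * np.dot(w, x)
    if margin >= 1:                      # constraint already satisfied: no change
        return w
    tau = (1 - margin) / np.dot(x, x)    # step size: (1 - margin) / ||x||^2
    return w + tau * y * x
```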
Optimization
• Primal optimization problem:
  min_w ½‖w‖²   s.t.   ∀(x, y) ∈ D,  y(w · x) ≥ 1   (constraint)
• Convex optimization: a convex function over a convex set!
• Lagrange function:
  L(w, b, α) = ½‖w‖² − Σ_i α_i [y_i (⟨x_i, w⟩ + b) − 1]
• Derivatives in w and b need to vanish:
  ∂_w L(w, b, α) = w − Σ_i α_i y_i x_i = 0   ⟹   w = Σ_i y_i α_i x_i
  ∂_b L(w, b, α) = Σ_i α_i y_i = 0
• the model is a linear combination of a small subset of the input (the support vectors), i.e., those with α_i > 0
Lagrangian & Saddle Point
• equality constraint: min x²  s.t.  x = 1
• inequality constraint: min x²  s.t.  x ≥ 1
• Lagrangian: L(x, α) = x² − α(x − 1)
• the derivative in x needs to vanish
• optimality is at a saddle point: min over x in the primal ⟹ max over α in the dual
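Working this toy example out (our own derivation, added for clarity):
  ∂_x L(x, α) = 2x − α = 0   ⟹   x = α/2
  plugging back:  g(α) = (α/2)² − α(α/2 − 1) = −α²/4 + α
  max_{α ≥ 0} g(α) gives α = 2, hence x = α/2 = 1, recovering the primal solution.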
Constrained Optimization
  minimize_{w,b} ½‖w‖²   subject to   y_i (⟨x_i, w⟩ + b) ≥ 1   (constraints)
  w = Σ_i y_i α_i x_i
• Quadratic Programming: quadratic objective, linear constraints
• Karush–Kuhn–Tucker (KKT) condition (complementary slackness):
  α_i [y_i (⟨w, x_i⟩ + b) − 1] = 0
• the optimum is achieved at the active constraints, where α_i > 0 (α_i = 0 ⟹ inactive)
KKT ⟹ Support Vectors
  minimize_{w,b} ½‖w‖²   subject to   y_i (⟨x_i, w⟩ + b) ≥ 1
  w = Σ_i y_i α_i x_i
Karush–Kuhn–Tucker (KKT) optimality condition:
  α_i [y_i (⟨w, x_i⟩ + b) − 1] = 0
so α_i > 0   ⟹   y_i (⟨w, x_i⟩ + b) = 1 (the point lies exactly on the margin); otherwise α_i = 0.
Properties
  w = Σ_i y_i α_i x_i
• Weight vector w is a weighted linear combination of instances
• Only points on the margin matter (ignore the rest and get the same solution)
• Only inner products matter
  • Quadratic program
  • We can replace the inner product by a kernel
• Keeps instances away from the margin
Alternative: Primal ⟹ Dual
• Lagrange function:
  L(w, b, α) = ½‖w‖² − Σ_i α_i [y_i (⟨x_i, w⟩ + b) − 1]
• Derivatives in w and b need to vanish:
  ∂_w L(w, b, α) = w − Σ_i α_i y_i x_i = 0   ⟹   w = Σ_i y_i α_i x_i
  ∂_b L(w, b, α) = Σ_i α_i y_i = 0
• Plugging w back into L yields the dual problem (in the dual variables α):
  maximize_α   −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
  subject to   Σ_i α_i y_i = 0   and   α_i ≥ 0
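The substitution step in detail (our own expansion, added for clarity): with w = Σ_j α_j y_j x_j,
  ½‖w‖² = ½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩,
  Σ_i α_i y_i ⟨x_i, w⟩ = Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩,
and the b term vanishes because Σ_i α_i y_i = 0, leaving L = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩.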
Primal vs. Dual
Primal:
  minimize_{w,b} ½‖w‖²   subject to   y_i (⟨x_i, w⟩ + b) ≥ 1
  w = Σ_i y_i α_i x_i
Dual (in the dual variables α):
  maximize_α   −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
  subject to   Σ_i α_i y_i = 0   and   α_i ≥ 0
Solving the optimization problem
• Dual problem:
  maximize_α   −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
  subject to   Σ_i α_i y_i = 0   and   α_i ≥ 0
• If the problem is small enough (1000s of variables) we can use an off-the-shelf solver (CVXOPT, CPLEX, OOQP, LOQO)
• For larger problems, use the fact that only the SVs matter and solve in blocks (active set methods)
Quadratic Program in Dual
• Dual problem (in matrix form):
  maximize_α   −½ αᵀQα − αᵀb   subject to   α ≥ 0
• Quadratic Programming:
  • Objective: a quadratic function, where Q is positive semidefinite
  • Constraints: linear functions
• Methods:
  • Gradient Descent
  • Coordinate Descent (aka the Hildreth Algorithm)
  • Sequential Minimal Optimization (SMO)
• Q: what is Q in the SVM primal? how about Q in the SVM dual?
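One way to answer the question above (our own illustration, not from the slides): in the primal, the quadratic term ½‖w‖² corresponds to Q being the identity; in the dual, Q_ij = y_i y_j ⟨x_i, x_j⟩ (the label-signed Gram matrix) and the linear-term vector is all −1s, so that −αᵀb = Σ_i α_i. A minimal sketch, assuming a data matrix X with rows x_i and a label vector y (the function name svm_dual_Qb is ours):

```python
import numpy as np

def svm_dual_Qb(X, y):
    """Build Q and b for the SVM dual  max_a -1/2 a^T Q a - a^T b,  a >= 0.

    Q[i, j] = y_i * y_j * <x_i, x_j>  (label-signed Gram matrix),
    b = -1 vector, so that -a^T b = sum_i a_i.
    """
    G = X @ X.T                              # Gram matrix of inner products
    Q = (y[:, None] * y[None, :]) * G
    b = -np.ones(len(y))
    return Q, b
```

Note that this matrix form has only α ≥ 0, matching the bias-free formulation min_w ½‖w‖² s.t. y(w · x) ≥ 1 used earlier; keeping the bias would add the equality constraint Σ_i α_i y_i = 0.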
Convex QP
• if Q is positive (semi)definite, i.e., xᵀQx ≥ 0 for all x, then the QP is convex ⟹ any local min/max is a global min/max
• if Q = 0, the QP reduces to linear programming (LP)
• if Q is indefinite ⟹ saddle point
• general QP is NP-hard; convex QP is polynomial-time
(figure: a Venn diagram of LP, QP, and CP (convex programming), with SVM in the convex-QP region)
QP: Hildreth Algorithm
• idea 1: update one coordinate while fixing all the other coordinates
• e.g., updating coordinate i means solving:
  argmax_{α_i}   −½ αᵀQα − αᵀb   subject to   α ≥ 0
• this is a quadratic function in a single variable
• maximum ⟹ the first-order derivative is 0
QP: Hildreth Algorithm
• idea 2: choose another coordinate and repeat until a stopping criterion is met:
  • the maximum is reached, or
  • the increase between 2 consecutive iterations is very small, or
  • after some # of iterations
• how to choose the coordinate: sweep pattern
  • Sequential:
    • 1, 2, ..., n, 1, 2, ..., n, ...
    • 1, 2, ..., n, n−1, n−2, ..., 1, 2, ...
  • Random: a permutation of 1, 2, ..., n
  • Maximal Descent: choose the i with the maximal descent in the objective
QP: Hildreth Algorithm
  initialize α_i = 0 for all i
  repeat
    pick i following the sweep pattern
    solve α_i ← argmax_{α_i}   −½ αᵀQα − αᵀb   subject to   α ≥ 0
  until the stopping criterion is met
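A minimal Python sketch of this procedure (our own illustration, not from the slides), assuming a symmetric positive-definite Q (so Q[i, i] > 0), a sequential sweep pattern, and a small-change stopping criterion; the function name hildreth is ours:

```python
import numpy as np

def hildreth(Q, b, n_iters=100, tol=1e-8):
    """Coordinate ascent for:  max_a  -1/2 a^T Q a - a^T b   s.t.  a >= 0."""
    n = len(b)
    a = np.zeros(n)                          # initialize alpha_i = 0 for all i
    for _ in range(n_iters):
        old = a.copy()
        for i in range(n):                   # sequential sweep 1, 2, ..., n
            # set d/da_i [-1/2 a^T Q a - a^T b] = -(Q a)_i - b_i to 0 with the
            # other coordinates fixed, then clip at the constraint a_i >= 0
            rest = Q[i] @ a - Q[i, i] * a[i]     # sum_{j != i} Q_ij a_j
            a[i] = max(0.0, (-b[i] - rest) / Q[i, i])
        if np.max(np.abs(a - old)) < tol:    # change between sweeps is very small
            break
    return a
```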
QP: Hildreth Algorithm (example)
  maximize_α   −½ αᵀQα − αᵀb,   with Q = ( 4  1 ; 1  2 )  and  b = ( −6 ; −4 )
  subject to   α ≥ 0
• choose coordinates 1, 2, 1, 2, ...
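Continuing the hildreth sketch above on this example (our own usage note; Q and b as reconstructed from the slide):

```python
Q = np.array([[4., 1.], [1., 2.]])
b = np.array([-6., -4.])
alpha = hildreth(Q, b)   # sweeps coordinates 1, 2, 1, 2, ...
```

For these values the unconstrained optimum already has both coordinates positive, so the iterates simply converge to the solution of Qα = −b.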
QP: Hildreth Algorithm
• pros:
  • extremely simple
  • no gradient calculation
  • easy to implement
• cons:
  • converges slowly compared to other methods
  • can't deal with too many constraints
  • works for minibatch MIRA but not for SVM
Linear Separator (figure: separating ham from spam, revisited)
Large Margin Classifier
⟨w, x⟩ + b ≥ 1,  ⟨w, x⟩ + b ≤ −1
a linear separator is impossible (figure: the two classes overlap)
linear function f(x) = ⟨w, x⟩ + b
Large Margin Classifier
⟨w, x⟩ + b ≥ 1,  ⟨w, x⟩ + b ≤ −1
even a minimum-error separator is hard to find:
Theorem (Minsky & Papert). Finding the minimum-error separating hyperplane is NP-hard.
Adding slack variables
⟨w, x⟩ + b ≥ 1 − ξ,  ⟨w, x⟩ + b ≤ −1 + ξ
minimize the amount of slack: a convex optimization problem
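Written out in full (the standard soft-margin objective, added here for completeness; the slack-penalty hyperparameter C is not introduced on this slide):
  min_{w,b,ξ}  ½‖w‖² + C Σ_i ξ_i   s.t.   y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0 for all i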
margin violation vs. misclassification: a misclassification is also a margin violation (ξ > 0), but a margin violation with ξ < 1 is not yet a misclassification