Introduction to Machine Learning
5. Support Vector Classification
Alex Smola, Carnegie Mellon University
10-701, http://alex.smola.org/teaching/cmu2013-10-701
Outline
• Support Vector Classification: large margin separation, optimization problem
• Properties: support vectors, kernel expansion
• Soft margin classifier: dual problem, robustness
Support Vector Machines
(cartoon: http://maktoons.blogspot.com/2009/03/support-vector-machine.html)
Linear Separator
[Figure sequence: candidate linear separators between "Ham" and "Spam" points]
Large Margin Classifier
linear function $f(x) = \langle w, x \rangle + b$
classification constraints: $\langle w, x \rangle + b \geq 1$ for one class, $\langle w, x \rangle + b \leq -1$ for the other
Large Margin Classifier
margin hyperplanes: $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$
margin:
$\left\langle x_+ - x_-, \tfrac{w}{\|w\|} \right\rangle = \tfrac{1}{\|w\|}\big[ [\langle x_+, w \rangle + b] - [\langle x_-, w \rangle + b] \big] = \tfrac{2}{\|w\|}$
Large Margin Classifier
optimization problem:
$\underset{w,b}{\text{maximize}} \; \tfrac{1}{\|w\|} \quad \text{subject to} \quad y_i [\langle x_i, w \rangle + b] \geq 1$
Large Margin Classifier
equivalent optimization problem:
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i [\langle x_i, w \rangle + b] \geq 1$
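To make the optimization problem concrete, here is a minimal sketch that solves this hard-margin primal on synthetic data. It assumes the cvxpy package and a toy two-cluster data set; neither is part of the lecture.

```python
import numpy as np
import cvxpy as cp

# Toy 2-D data: two well separated clusters, labels y in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

w, b = cp.Variable(2), cp.Variable()
# minimize (1/2)||w||^2  subject to  y_i [<x_i, w> + b] >= 1
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print("w =", w.value, " b =", b.value, " margin =", 2 / np.linalg.norm(w.value))
```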
Dual Problem
• Primal optimization problem
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i [\langle x_i, w \rangle + b] \geq 1$ (constraint)
• Lagrange function
$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] - 1 \big]$
Optimality in $w, b$ is at a saddle point with $\alpha$.
• Derivatives in $w, b$ need to vanish.
Dual Problem
• Lagrange function
$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] - 1 \big]$
• Derivatives in $w, b$ need to vanish:
$\partial_w L(w, b, \alpha) = w - \sum_i \alpha_i y_i x_i = 0$
$\partial_b L(w, b, \alpha) = \sum_i \alpha_i y_i = 0$
• Plugging these back into $L$ yields
$\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$
$\text{subject to} \; \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \geq 0$
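The dual is a quadratic program in $\alpha$. A sketch of handing it to an off-the-shelf QP solver, assuming the cvxopt package and placeholder arrays X, y for the training data (none of this is from the slides):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard(X, y):
    """Solve the hard-margin dual; returns the Lagrange multipliers alpha."""
    n = len(y)
    K = X @ X.T                                 # Gram matrix of inner products
    P = matrix(np.outer(y, y) * K)              # P_ij = y_i y_j <x_i, x_j>
    q = matrix(-np.ones(n))                     # maximizing sum(alpha) = minimizing -sum(alpha)
    G = matrix(-np.eye(n))                      # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))  # equality constraint sum_i alpha_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])
```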
Support Vector Machines
primal: $\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i [\langle x_i, w \rangle + b] \geq 1$
weight vector: $w = \sum_i y_i \alpha_i x_i$
dual: $\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \geq 0$
Support Vectors
primal: $\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i [\langle x_i, w \rangle + b] \geq 1$
weight vector: $w = \sum_i y_i \alpha_i x_i$
Karush-Kuhn-Tucker optimality condition: $\alpha_i \big[ y_i [\langle w, x_i \rangle + b] - 1 \big] = 0$, hence either $\alpha_i = 0$, or $\alpha_i > 0 \Rightarrow y_i [\langle w, x_i \rangle + b] = 1$
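Once the dual is solved, the KKT conditions above give a direct recipe for recovering $w$ and $b$ from the nonzero $\alpha_i$. A sketch (the helper name and tolerance are mine, not from the slides):

```python
import numpy as np

def recover_primal(X, y, alpha, tol=1e-6):
    sv = alpha > tol                      # support vectors: alpha_i > 0
    w = (alpha[sv] * y[sv]) @ X[sv]       # w = sum_i y_i alpha_i x_i
    # KKT: y_i [<w, x_i> + b] = 1 for support vectors  =>  b = y_i - <w, x_i>
    b = np.mean(y[sv] - X[sv] @ w)        # average over SVs for numerical stability
    return w, b, sv
```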
Properties
$w = \sum_i y_i \alpha_i x_i$
• Weight vector $w$ is a weighted linear combination of training instances.
• Only points on the margin matter (ignore the rest and get the same solution).
• Only inner products matter:
  • the dual is a quadratic program
  • we can replace the inner product by a kernel (see the sketch below)
• Keeps instances away from the margin.
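Because only inner products appear in the dual, swapping the Gram matrix $X X^\top$ for a kernel matrix is a one-line change. A sketch with a Gaussian RBF kernel; the kernel choice and the bandwidth gamma are illustrative, not from this slide:

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """Kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * d2)

# In svm_dual_hard above, replace  K = X @ X.T  with  K = rbf_gram(X).
```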
Example
[Figure sequence: large margin classifier on example data]
Why large margins?
• Maximum robustness relative to uncertainty
• Symmetry breaking
• Independent of correctly classified instances
• Easy to find for easy problems
[Figure: margin of width $\rho$ separating the 'o' and '+' points]
Support Vector Classifiers
Large Margin Classifier
linear function $f(x) = \langle w, x \rangle + b$ with constraints $\langle w, x \rangle + b \geq 1$ and $\langle w, x \rangle + b \leq -1$
[Figure sequence: data for which a linear separator is impossible]
Large Margin Classifier
Theorem (Minsky & Papert): finding the minimum error separating hyperplane is NP hard.
[Figure: non-separable data; computing the minimum error separator is not a practical route]
Adding slack variables
relaxed constraints: $\langle w, x \rangle + b \geq 1 - \xi$ and $\langle w, x \rangle + b \leq -(1 - \xi)$
Convex optimization problem: minimize the amount of slack.
[Figure sequence: margin violations absorbed by slack variables $\xi$]
Intermezzo: Convex Programs for Dummies
• Primal optimization problem
$\underset{x}{\text{minimize}} \; f(x) \quad \text{subject to} \quad c_i(x) \leq 0$
• Lagrange function
$L(x, \alpha) = f(x) + \sum_i \alpha_i c_i(x)$
• First order optimality conditions in $x$
$\partial_x L(x, \alpha) = \partial_x f(x) + \sum_i \alpha_i \partial_x c_i(x) = 0$
• Solve for $x$ and plug it back into $L$:
$\underset{\alpha}{\text{maximize}} \; L(x(\alpha), \alpha)$ (keep explicit constraints)
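A one-dimensional worked example of this recipe (not on the slide): minimize $f(x) = x^2$ subject to $c(x) = 1 - x \leq 0$. The Lagrangian is $L(x, \alpha) = x^2 + \alpha(1 - x)$; setting $\partial_x L = 2x - \alpha = 0$ gives $x(\alpha) = \alpha/2$. Substituting back, $L(x(\alpha), \alpha) = \alpha - \alpha^2/4$, which is maximized over $\alpha \geq 0$ at $\alpha = 2$, yielding $x = 1$, exactly the constrained minimizer.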
Adding slack variables
• Hard margin problem
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i [\langle w, x_i \rangle + b] \geq 1$
• With slack variables
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i [\langle w, x_i \rangle + b] \geq 1 - \xi_i \;\text{and}\; \xi_i \geq 0$
The problem is always feasible. Proof: take $w = 0$, $b = 0$, $\xi_i = 1$ (this also yields an upper bound on the objective).
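A sketch of this soft-margin primal, again assuming cvxpy; C is the slack penalty from the slide and X, y are placeholder training arrays:

```python
import numpy as np
import cvxpy as cp

def soft_margin_primal(X, y, C=1.0):
    n, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value   # xi_i > 0 marks a margin violation
```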
Dual Problem
• Primal optimization problem
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i [\langle w, x_i \rangle + b] \geq 1 - \xi_i \;\text{and}\; \xi_i \geq 0$
• Lagrange function
$L(w, b, \xi, \alpha, \eta) = \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] + \xi_i - 1 \big] - \sum_i \eta_i \xi_i$
Optimality in $w, b, \xi$ is at a saddle point with $\alpha, \eta$.
• Derivatives in $w, b, \xi$ need to vanish.
Dual Problem
• Lagrange function
$L(w, b, \xi, \alpha, \eta) = \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] + \xi_i - 1 \big] - \sum_i \eta_i \xi_i$
• Derivatives in $w, b, \xi$ need to vanish:
$\partial_w L(w, b, \xi, \alpha, \eta) = w - \sum_i \alpha_i y_i x_i = 0$
$\partial_b L(w, b, \xi, \alpha, \eta) = \sum_i \alpha_i y_i = 0$
$\partial_{\xi_i} L(w, b, \xi, \alpha, \eta) = C - \alpha_i - \eta_i = 0$
• Plugging these back into $L$ yields the dual; the bound $\alpha_i \leq C$ limits the influence of any single point:
$\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$
$\text{subject to} \; \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \in [0, C]$
Karush-Kuhn-Tucker Conditions
dual: $\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \in [0, C]$
weight vector: $w = \sum_i y_i \alpha_i x_i$
KKT conditions: $\alpha_i \big[ y_i [\langle w, x_i \rangle + b] + \xi_i - 1 \big] = 0$ and $\eta_i \xi_i = 0$, hence
$\alpha_i = 0 \;\Rightarrow\; y_i [\langle w, x_i \rangle + b] \geq 1$
$0 < \alpha_i < C \;\Rightarrow\; y_i [\langle w, x_i \rangle + b] = 1$
$\alpha_i = C \;\Rightarrow\; y_i [\langle w, x_i \rangle + b] \leq 1$
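On the solver side, the only change from the hard-margin dual is the box constraint $\alpha_i \in [0, C]$. A sketch reusing the cvxopt setup from earlier (assumed, not from the slides):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_soft(X, y, C=1.0):
    n = len(y)
    P = matrix(np.outer(y, y) * (X @ X.T))
    q = matrix(-np.ones(n))
    # Stack  -alpha_i <= 0  and  alpha_i <= C  into a single inequality G alpha <= h.
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    return np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
```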
[Figure sequence: linear soft-margin solutions for C = 1, 2, 5, 10, 20, 50, 100 on four example data sets]
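The effect of C in these figures can be reproduced with any SVM package. A sketch assuming scikit-learn and a synthetic data set (not the one used in the lecture):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=80, centers=2, cluster_std=1.5, random_state=0)
for C in [1, 2, 5, 10, 20, 50, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Larger C penalizes slack more heavily, typically leaving fewer support vectors.
    print(f"C={C:4d}: {clf.n_support_.sum()} support vectors")
```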
Solving the optimization problem
• Dual problem
$\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \in [0, C]$
• If the problem is small enough (1000s of variables) we can use an off-the-shelf solver (CVXOPT, CPLEX, OOQP, LOQO).
• For larger problems, use the fact that only the support vectors matter and solve in blocks (active set methods).
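For small problems almost any constrained solver works. A sketch using SciPy's SLSQP on the dual (an assumed alternative, not mentioned on the slide), with the Gram or kernel matrix K passed in directly:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual_scipy(K, y, C=1.0):
    n = len(y)
    Q = np.outer(y, y) * K
    fun = lambda a: 0.5 * a @ Q @ a - a.sum()        # negated dual objective (minimize)
    jac = lambda a: Q @ a - np.ones(n)
    cons = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i alpha_i y_i = 0
    res = minimize(fun, np.zeros(n), jac=jac, method="SLSQP",
                   bounds=[(0.0, C)] * n, constraints=cons)
    return res.x
```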
Nonlinear Separation
The Kernel Trick
• Linear soft margin problem
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i [\langle w, x_i \rangle + b] \geq 1 - \xi_i \;\text{and}\; \xi_i \geq 0$
• Dual problem
$\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \in [0, C]$
• Support vector expansion
$f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$
The Kernel Trick
• Soft margin problem with feature map $\phi$
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i [\langle w, \phi(x_i) \rangle + b] \geq 1 - \xi_i \;\text{and}\; \xi_i \geq 0$
• Dual problem
$\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \in [0, C]$
• Support vector expansion
$f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$
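In practice the kernelized classifier is usually fit with a library. A sketch assuming scikit-learn's SVC, with an RBF kernel and parameter values chosen only for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)

# decision_function(x) evaluates sum_i alpha_i y_i k(x_i, x) + b over the support vectors.
print("support vectors per class:", clf.n_support_)
print("f(x) on three points:", clf.decision_function(X[:3]))
```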
[Figure sequence: kernelized (nonlinear) soft-margin solutions for C = 1, 2, 5, 10, 20, 50, 100 on four example data sets]