Support vector machines
Course of Machine Learning
Master Degree in Computer Science
University of Rome “Tor Vergata”
Giorgio Gambosi
a.a. 2018-2019
Idea

The binary classification problem is approached in a direct way, that is: we try to find a plane that separates the classes in feature space (indeed, a “best” plane, according to a reasonable criterion). If this is not possible, we get creative in two ways:
• we soften what we mean by “separates”, and
• we enrich and enlarge the feature space so that separation is (more) possible.
Margins

A can be assigned to C_1 with greater confidence than B, and with even greater confidence than C.
Binary classifiers

Consider a binary classifier which, for any element x, returns a value y ∈ {−1, 1}, where we assume that x is assigned to C_0 if y = −1 and to C_1 if y = 1.

Moreover, we consider linear classifiers of the form
$$h(\mathbf{x}) = g(\mathbf{w}^T \phi(\mathbf{x}) + w_0)$$
where g(z) = 1 if z ≥ 0 and g(z) = −1 if z < 0.

The prediction of the class of x is then obtained by deriving a value in {−1, 1} just as in the case of a perceptron, that is, with no estimation of the probabilities p(C_i | x) that x belongs to each class.
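As a minimal illustrative sketch (not part of the course material), the classifier above can be written directly in code; here phi is an assumed feature map and w, w0 are arbitrary given parameters:

import numpy as np

def g(z):
    # threshold function: +1 if z >= 0, -1 otherwise
    return np.where(z >= 0, 1, -1)

def h(x, w, w0, phi):
    # linear classifier in feature space: sign of w^T phi(x) + w_0
    return g(w @ phi(x) + w0)

# example usage with an identity feature map and arbitrary parameters
phi = lambda x: x
w, w0 = np.array([1.0, -2.0]), 0.5
print(h(np.array([3.0, 1.0]), w, w0, phi))   # -> 1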
Margins

For any training set item (x_i, t_i), the functional margin of (w, w_0) wrt such an item is defined as
$$\hat{\gamma}_i = t_i(\mathbf{w}^T \phi(\mathbf{x}_i) + w_0)$$
Observe that the resulting prediction is correct iff \hat{\gamma}_i > 0. Moreover, larger values of \hat{\gamma}_i denote greater confidence in the prediction.

Given a training set T = {(x_1, t_1), ..., (x_n, t_n)}, the functional margin of (w, w_0) wrt T is the minimum functional margin over all items in T:
$$\hat{\gamma} = \min_i \hat{\gamma}_i$$
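A small sketch (assuming the same kind of phi, w, w0 as above, and a hypothetical toy training set) of how the functional margins and their minimum could be computed:

import numpy as np

def functional_margins(X, t, w, w0, phi):
    # hat_gamma_i = t_i * (w^T phi(x_i) + w_0) for every training item
    scores = np.array([w @ phi(x) + w0 for x in X])
    return t * scores

# toy training set: rows of X are the x_i, t holds the labels in {-1, +1}
X = np.array([[2.0, 0.0], [0.0, 3.0], [-2.0, -1.0]])
t = np.array([1, -1, -1])
phi = lambda x: x
w, w0 = np.array([1.0, -1.0]), 0.0

gammas = functional_margins(X, t, w, w0, phi)
print(gammas)           # per-item functional margins
print(gammas.min())     # functional margin of (w, w0) wrt the whole set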
Margins

The geometric margin γ_i of a training set item (x_i, t_i) is defined as the product of t_i and the distance from x_i to the boundary hyperplane, that is, as the length of the line segment from x_i to its projection on the boundary hyperplane.

[Figure: a point A, its projection B on the boundary hyperplane β, and the segment of length γ_i between them, among the other training points]
Margins

Since, in general, the distance of a point x from a hyperplane w^T x = 0 is w^T x / ||w||, it results
$$\gamma_i = t_i \left( \frac{\mathbf{w}^T}{\|\mathbf{w}\|} \phi(\mathbf{x}_i) + \frac{w_0}{\|\mathbf{w}\|} \right) = \frac{\hat{\gamma}_i}{\|\mathbf{w}\|}$$
So, differently from the functional margin \hat{\gamma}_i, the geometric margin γ_i is invariant wrt parameter scaling. In fact, by substituting c w for w and c w_0 for w_0, we get
$$\gamma_i = t_i \left( \frac{c\mathbf{w}^T}{\|c\mathbf{w}\|} \phi(\mathbf{x}_i) + \frac{c w_0}{\|c\mathbf{w}\|} \right) = \frac{c\, t_i(\mathbf{w}^T \phi(\mathbf{x}_i) + w_0)}{c\,\|\mathbf{w}\|} = t_i \left( \frac{\mathbf{w}^T}{\|\mathbf{w}\|} \phi(\mathbf{x}_i) + \frac{w_0}{\|\mathbf{w}\|} \right)$$
Margins

• The geometric margin wrt the training set T = {(x_1, t_1), ..., (x_n, t_n)} is then defined as the smallest geometric margin over all items (x_i, t_i):
$$\gamma = \min_i \gamma_i$$
• A useful interpretation of γ is as half the width of the largest strip, centered on the hyperplane w^T φ(x) + w_0 = 0, containing none of the points x_1, ..., x_n.
• The hyperplanes on the boundary of such a strip, each at distance γ from the separating hyperplane and passing (at least one of them) through some point x_i, are called maximum margin hyperplanes.
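Continuing the sketch above (same assumed toy data and identity feature map), the geometric margin of the training set just divides the smallest functional margin by ||w||; the snippet also checks numerically that rescaling (w, w_0) by a constant leaves it unchanged:

import numpy as np

def geometric_margin(X, t, w, w0, phi):
    # gamma = min_i t_i * (w^T phi(x_i) + w_0) / ||w||
    scores = np.array([w @ phi(x) + w0 for x in X])
    return (t * scores).min() / np.linalg.norm(w)

X = np.array([[2.0, 0.0], [0.0, 3.0], [-2.0, -1.0]])
t = np.array([1, -1, -1])
phi = lambda x: x
w, w0 = np.array([1.0, -1.0]), 0.0

print(geometric_margin(X, t, w, w0, phi))            # some value gamma
print(geometric_margin(X, t, 10 * w, 10 * w0, phi))  # same value: scale invariance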
Margins

[Figure: the two classes of points and the maximum margin strip of width 2γ around the separating hyperplane]
Optimal margin classifiers

Assume classes are linearly separable in the training set: hence, there exists a hyperplane (an infinity of them, indeed) separating elements in C_1 from elements in C_2.

Given a training set T, we wish to find the hyperplane which separates the two classes (if one does exist) and has maximum γ: by making the distance between the hyperplane and the set of points corresponding to elements as large as possible, the confidence on the provided classification increases.

In order to find the one among those hyperplanes which maximizes γ, we have to solve the following optimization problem
$$\max_{\mathbf{w}, w_0} \gamma \qquad \text{where } \gamma_i = \frac{t_i(\mathbf{w}^T \phi(\mathbf{x}_i) + w_0)}{\|\mathbf{w}\|} \geq \gamma \quad i = 1, \ldots, n$$
That is,
$$\max_{\mathbf{w}, w_0} \gamma \qquad \text{where } t_i(\mathbf{w}^T \phi(\mathbf{x}_i) + w_0) \geq \gamma \|\mathbf{w}\| \quad i = 1, \ldots, n$$
Optimal margin classifiers

As observed, if all parameters are scaled by any constant c, all geometric margins γ_i between elements and the hyperplane are unchanged: we may then exploit this freedom to introduce the constraint
$$\hat{\gamma} = \min_i t_i(\mathbf{w}^T \phi(\mathbf{x}_i) + w_0) = 1$$
This can be obtained by assuming ||w|| = 1/γ, which corresponds to considering a scale where the maximum margin strip has width 2.

This results, for each element (x_i, t_i), in the constraint
$$\hat{\gamma}_i = t_i(\mathbf{w}^T \phi(\mathbf{x}_i) + w_0) \geq 1$$
An element (point) is said to be active if the equality holds, that is, if t_i(w^T φ(x_i) + w_0) = 1, and inactive if it does not. Observe that, by definition, there must exist at least one active point.
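A small numeric illustration (a sketch using the same toy data as above, with arbitrary parameters): rescaling (w, w_0) by the inverse of the functional margin makes the smallest functional margin equal to 1 without changing the classifier:

import numpy as np

X = np.array([[2.0, 0.0], [0.0, 3.0], [-2.0, -1.0]])
t = np.array([1, -1, -1])
w, w0 = np.array([2.0, -2.0]), 0.0

# functional margins and their minimum with the original parameters
hat_gammas = t * (X @ w + w0)
c = 1.0 / hat_gammas.min()

# rescaled parameters: the smallest functional margin is now exactly 1
w_s, w0_s = c * w, c * w0
print(t * (X @ w_s + w0_s))            # [2. 3. 1.]
print((t * (X @ w_s + w0_s)).min())    # 1.0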
Optimal margin classifiers

For any element (x, t):
1. t(w^T φ(x) + w_0) > 1 if φ(x) is in the region corresponding to its class, outside the margin strip
2. t(w^T φ(x) + w_0) = 1 if φ(x) is in the region corresponding to its class, on the maximum margin hyperplane
3. 0 < t(w^T φ(x) + w_0) < 1 if φ(x) is in the region corresponding to its class, inside the margin strip
4. t(w^T φ(x) + w_0) = 0 if φ(x) is on the separating hyperplane
5. −1 < t(w^T φ(x) + w_0) < 0 if φ(x) is in the region corresponding to the other class, inside the margin strip
6. t(w^T φ(x) + w_0) = −1 if φ(x) is in the region corresponding to the other class, on the maximum margin hyperplane
7. t(w^T φ(x) + w_0) < −1 if φ(x) is in the region corresponding to the other class, outside the margin strip
Optimal margin classifiers

The optimization problem is then transformed into
$$\max_{\mathbf{w}, w_0} \gamma = \|\mathbf{w}\|^{-1} \qquad \text{where } t_i(\mathbf{w}^T \phi(\mathbf{x}_i) + w_0) \geq 1 \quad i = 1, \ldots, n$$
Maximizing ||w||^{-1} is equivalent to minimizing ||w||^2 (we prefer minimizing ||w||^2 instead of ||w|| since it is smooth everywhere): hence we may formulate the problem as
$$\min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{where } t_i(\mathbf{w}^T \phi(\mathbf{x}_i) + w_0) \geq 1 \quad i = 1, \ldots, n$$
This is a convex quadratic optimization problem: the function to be minimized is in fact convex, and the set of points satisfying the constraints is a convex polyhedron (intersection of half-spaces).
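As a sketch of how this primal problem could be solved numerically (cvxpy is used here only as one possible off-the-shelf convex QP solver; Phi is a hypothetical matrix whose rows are φ(x_i), assumed linearly separable):

import numpy as np
import cvxpy as cp

# toy, linearly separable data in feature space (rows of Phi are phi(x_i))
Phi = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
t = np.array([1, 1, -1, -1])
n, d = Phi.shape

w = cp.Variable(d)
w0 = cp.Variable()

# min (1/2)||w||^2  subject to  t_i (w^T phi(x_i) + w0) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(t, Phi @ w + w0) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, w0.value)               # maximum margin separating hyperplane
print(1.0 / np.linalg.norm(w.value))   # the geometric margin gamma = 1/||w||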
Duality

From optimization theory it derives that, given the problem structure (linear constraints + convexity):
• there exists a dual formulation of the problem
• the optimum of the dual problem is the same as that of the original (primal) problem
Karush-Kuhn-Tucker theorem

Consider the optimization problem
$$\min_{\mathbf{x} \in \Omega} f(\mathbf{x})$$
$$g_i(\mathbf{x}) \geq 0 \quad i = 1, \ldots, k$$
$$h_j(\mathbf{x}) = 0 \quad j = 1, \ldots, k'$$
where f(x), g_i(x), h_j(x) are convex functions and Ω is a convex set. Define the Lagrangian
$$L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) - \sum_{i=1}^{k} \lambda_i g_i(\mathbf{x}) + \sum_{j=1}^{k'} \mu_j h_j(\mathbf{x})$$
and the minimum
$$\theta(\boldsymbol{\lambda}, \boldsymbol{\mu}) = \min_{\mathbf{x}} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu})$$
Then, the solution of the original problem is the same as the solution of
$$\max_{\boldsymbol{\lambda}, \boldsymbol{\mu}} \theta(\boldsymbol{\lambda}, \boldsymbol{\mu}) \qquad \lambda_i \geq 0 \quad i = 1, \ldots, k$$
Karush-Kuhn-Tucker theorem

The following necessary and sufficient conditions apply for the existence of an optimum (x*, λ*, µ*):
$$\left.\frac{\partial L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu})}{\partial \mathbf{x}}\right|_{\mathbf{x}^*, \boldsymbol{\lambda}^*, \boldsymbol{\mu}^*} = 0$$
$$g_i(\mathbf{x}^*) \geq 0 \quad i = 1, \ldots, k$$
$$h_j(\mathbf{x}^*) = 0 \quad j = 1, \ldots, k'$$
$$\lambda_i^* \geq 0 \quad i = 1, \ldots, k$$
$$\lambda_i^* g_i(\mathbf{x}^*) = 0 \quad i = 1, \ldots, k$$
Note: the last condition states that a Lagrange multiplier λ*_i can be non-zero only if g_i(x*) = 0, that is, if x* is “at the limit” for the constraint g_i(x) ≥ 0. In this case, the constraint is said to be active.
Applying the KKT theorem

In our case,
• f(x) corresponds to (1/2)||w||^2
• g_i(x) corresponds to t_i(w^T φ(x_i) + w_0) − 1 ≥ 0
• there is no h_j(x)
• Ω is the intersection of a set of half-spaces, that is a polyhedron, hence convex.

By the KKT theorem, the solution is then the same as the solution of
$$\max_{\boldsymbol{\lambda}} \min_{\mathbf{w}, w_0} L(\mathbf{w}, w_0, \boldsymbol{\lambda}) = \max_{\boldsymbol{\lambda}} \min_{\mathbf{w}, w_0} \left( \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{n} \lambda_i \left( t_i(\mathbf{w}^T \phi(\mathbf{x}_i) + w_0) - 1 \right) \right)$$
$$= \max_{\boldsymbol{\lambda}} \min_{\mathbf{w}, w_0} \left( \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{n} \lambda_i t_i(\mathbf{w}^T \phi(\mathbf{x}_i) + w_0) + \sum_{i=1}^{n} \lambda_i \right)$$
under the constraints
$$\lambda_i \geq 0 \quad i = 1, \ldots, n$$
Applying the KKT conditions

Since the KKT conditions hold at the optimal point, it must be, at that point:
$$\frac{\partial L(\mathbf{w}, w_0, \boldsymbol{\lambda})}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \lambda_i t_i \phi(\mathbf{x}_i) = 0$$
$$\frac{\partial L(\mathbf{w}, w_0, \boldsymbol{\lambda})}{\partial w_0} = -\sum_{i=1}^{n} \lambda_i t_i = 0$$
$$t_i(\mathbf{w}^T \phi(\mathbf{x}_i) + w_0) - 1 \geq 0 \quad i = 1, \ldots, n$$
$$\lambda_i \geq 0 \quad i = 1, \ldots, n$$
$$\lambda_i \left( t_i(\mathbf{w}^T \phi(\mathbf{x}_i) + w_0) - 1 \right) = 0 \quad i = 1, \ldots, n$$
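To make the step to the next slide explicit, here is a short derivation (a sketch following the standard argument, not part of the original slides) of how substituting the first two conditions into L(w, w_0, λ) eliminates w and w_0:

% substitute w = sum_i lambda_i t_i phi(x_i) and use sum_i lambda_i t_i = 0
\begin{aligned}
L(\mathbf{w}, w_0, \boldsymbol{\lambda})
  &= \frac{1}{2}\mathbf{w}^T\mathbf{w}
     - \sum_{i=1}^{n} \lambda_i t_i \mathbf{w}^T \phi(\mathbf{x}_i)
     - w_0 \sum_{i=1}^{n} \lambda_i t_i
     + \sum_{i=1}^{n} \lambda_i \\
  &= \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j t_i t_j \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)
     - \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j t_i t_j \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)
     - 0
     + \sum_{i=1}^{n} \lambda_i \\
  &= \sum_{i=1}^{n} \lambda_i
     - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j t_i t_j \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)
     \;=\; \tilde{L}(\boldsymbol{\lambda})
\end{aligned}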
Lagrange method: dual problem

We may apply the above relations to drop w and w_0 from L(w, w_0, λ) and from all constraints. As a result, we get a new, dual formulation of the problem
$$\max_{\boldsymbol{\lambda}} \tilde{L}(\boldsymbol{\lambda}) = \max_{\boldsymbol{\lambda}} \left( \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j t_i t_j \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) \right)$$
where
$$\lambda_i \geq 0 \quad i = 1, \ldots, n$$
$$\sum_{i=1}^{n} \lambda_i t_i = 0$$
Dual problem and kernel function

By defining the kernel function
$$\kappa(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$
the dual problem's formulation can be given as
$$\max_{\boldsymbol{\lambda}} \tilde{L}(\boldsymbol{\lambda}) = \max_{\boldsymbol{\lambda}} \left( \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j t_i t_j \kappa(\mathbf{x}_i, \mathbf{x}_j) \right)$$
where
$$\lambda_i \geq 0 \quad i = 1, \ldots, n$$
$$\sum_{i=1}^{n} \lambda_i t_i = 0$$
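A sketch of solving this dual numerically, again with cvxpy as an assumed QP solver and the same toy separable data as before, using the linear kernel κ(x_i, x_j) = φ(x_i)^T φ(x_j). The resulting multipliers λ give w via the stationarity condition w = Σ_i λ_i t_i φ(x_i), and w_0 from any active (support) point:

import numpy as np
import cvxpy as cp

Phi = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
n = len(t)

lam = cp.Variable(n)

# dual objective: sum_i lambda_i - (1/2) sum_ij lambda_i lambda_j t_i t_j kappa(x_i, x_j);
# with the linear kernel the quadratic term equals ||sum_i lambda_i t_i phi(x_i)||^2,
# which is what sum_squares computes below
objective = cp.Maximize(cp.sum(lam)
                        - 0.5 * cp.sum_squares(Phi.T @ cp.multiply(lam, t)))
constraints = [lam >= 0, t @ lam == 0]
cp.Problem(objective, constraints).solve()

lam_v = lam.value
w = (lam_v * t) @ Phi        # stationarity condition: w = sum_i lambda_i t_i phi(x_i)
sv = np.argmax(lam_v)        # index of an active point (lambda_i > 0): a support vector
w0 = t[sv] - w @ Phi[sv]     # from t_sv (w^T phi(x_sv) + w0) = 1, using t_sv^2 = 1
print(w, w0)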