1. NPFL129, Lecture 7: SMO Algorithm
Milan Straka, December 02, 2019
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

2. Kernel Linear Regression
When the dimensionality of the input is $D$, one step of SGD takes $\mathcal{O}(D)$. Surprisingly, we can do better under some circumstances. We start by noting that we can write the parameters $w$ as a linear combination of the input features $\varphi(x_i)$.

By induction, $w = 0 = \sum_i 0 \cdot \varphi(x_i)$, and assuming $w = \sum_i \beta_i \varphi(x_i)$, after an SGD update we get
$$w \leftarrow w + \alpha \sum_i \big(t_i - w^T \varphi(x_i)\big)\varphi(x_i) = \sum_i \big(\beta_i + \alpha(t_i - w^T \varphi(x_i))\big)\varphi(x_i).$$

An individual update is $\beta_i \leftarrow \beta_i + \alpha(t_i - w^T \varphi(x_i))$, and substituting for $w$ we get
$$\beta_i \leftarrow \beta_i + \alpha\Big(t_i - \sum\nolimits_j \beta_j \varphi(x_j)^T \varphi(x_i)\Big).$$
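The equivalence between the primal update of $w$ and the dual update of $\beta$ can be checked directly. The following minimal Python sketch (illustrative only, with arbitrary random data) performs one full-gradient step in both representations and compares the results:

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 5, 3
    Phi = rng.normal(size=(N, D))      # rows are feature vectors phi(x_i)
    t = rng.normal(size=N)             # regression targets
    alpha = 0.1

    beta = rng.normal(size=N)          # any coefficients
    w = Phi.T @ beta                   # the corresponding primal parameters

    # One primal full-gradient step: w <- w + alpha * sum_i (t_i - w^T phi(x_i)) phi(x_i)
    w_new = w + alpha * Phi.T @ (t - Phi @ w)

    # The matching dual step: beta_i <- beta_i + alpha * (t_i - sum_j beta_j phi(x_j)^T phi(x_i))
    K = Phi @ Phi.T                    # K[i, j] = phi(x_i)^T phi(x_j)
    beta_new = beta + alpha * (t - K @ beta)

    # The updated beta still represents the updated w.
    assert np.allclose(w_new, Phi.T @ beta_new)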

3. Kernel Linear Regression
We can formulate the alternative linear regression algorithm (it would be called a dual formulation):

Input: dataset ($X = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{N \times D}$, $t \in \mathbb{R}^N$), learning rate $\alpha \in \mathbb{R}^+$.
- Set $\beta_i \leftarrow 0$.
- Compute all values $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.
- Repeat: update the coordinates, either according to a full gradient update
  $$\beta \leftarrow \beta + \alpha(t - K\beta),$$
  or alternatively use single-batch SGD, arriving at: for $i$ in a random permutation of $\{1, \ldots, N\}$:
  $$\beta_i \leftarrow \beta_i + \alpha\Big(t_i - \sum\nolimits_j K(x_i, x_j)\beta_j\Big);$$
  in vector notation, we can write $\beta_i \leftarrow \beta_i + \alpha(t_i - K_i \beta)$, where $K_i$ is the $i$-th row of $K$.

The predictions are then performed by computing $y(x) = w^T \varphi(x) = \sum_i \beta_i \varphi(x_i)^T \varphi(x)$.
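The whole dual algorithm fits in a few lines of NumPy. The sketch below is an illustration of the pseudocode above, not the official lecture implementation; the RBF kernel, the hyperparameter values and the toy sine data are arbitrary choices:

    import numpy as np

    def rbf_kernel(X, Z, gamma=1.0):
        # K[i, j] = exp(-gamma * ||X_i - Z_j||^2)
        sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dists)

    def fit_dual_regression(X, t, alpha=0.1, epochs=200, gamma=1.0, seed=42):
        N = len(X)
        K = rbf_kernel(X, X, gamma)          # precompute all K(x_i, x_j)
        beta = np.zeros(N)                   # beta_i <- 0
        rng = np.random.default_rng(seed)
        for _ in range(epochs):
            for i in rng.permutation(N):     # single-batch SGD over a random permutation
                beta[i] += alpha * (t[i] - K[i] @ beta)
        return beta

    def predict(X_train, beta, X_new, gamma=1.0):
        # y(x) = sum_i beta_i * K(x_i, x)
        return rbf_kernel(X_new, X_train, gamma) @ beta

    # Toy usage: fit a noisy sine curve.
    X = np.linspace(0, 2 * np.pi, 40)[:, None]
    t = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(0).normal(size=40)
    beta = fit_dual_regression(X, t, alpha=0.1, epochs=200, gamma=0.5)
    print(predict(X, beta, X[:5]))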

4. Support Vector Machines
Assume we have a dataset $X \in \mathbb{R}^{N \times D}$, $t \in \{-1, 1\}^N$, a feature map $\varphi$ and a model
$$y(x) \stackrel{\text{def}}{=} \varphi(x)^T w + b.$$

We already know that the distance of a point $x_i$ to the decision boundary is
$$\frac{|y(x_i)|}{\|w\|} = \frac{t_i y(x_i)}{\|w\|}.$$

We therefore want to maximize the margin of the closest point:
$$\arg\max_{w,b} \frac{1}{\|w\|} \min_i \big[t_i(\varphi(x_i)^T w + b)\big].$$
However, this problem is difficult to optimize directly.

(Figure 4.1 of Pattern Recognition and Machine Learning.)
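To make the maximized quantity concrete, the following small sketch (with illustrative data, not from the lecture) computes the geometric margin $\min_i t_i y(x_i)/\|w\|$ of a linear model and shows that it is unchanged when $w$ and $b$ are rescaled, which is the invariance used on the next slide:

    import numpy as np

    def geometric_margin(Phi, t, w, b):
        # Phi: (N, D) matrix of features phi(x_i); t: labels in {-1, +1}.
        y = Phi @ w + b                  # y(x_i) = phi(x_i)^T w + b
        return np.min(t * y) / np.linalg.norm(w)

    Phi = np.array([[2.0, 1.0], [-1.0, -2.0], [3.0, 3.0]])
    t = np.array([1, -1, 1])
    w, b = np.array([1.0, 1.0]), -0.5
    print(geometric_margin(Phi, t, w, b))
    print(geometric_margin(Phi, t, 10 * w, 10 * b))   # same value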

5. Support Vector Machines
Because the model is invariant to multiplying $w$ and $b$ by a constant, we can say that for the points closest to the decision boundary, it will hold that
$$t_i y(x_i) = 1.$$

Then for all the points we will have $t_i y(x_i) \ge 1$, and we can simplify
$$\arg\max_{w,b} \frac{1}{\|w\|} \min_i \big[t_i(\varphi(x_i)^T w + b)\big]$$
to
$$\arg\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{given that} \quad t_i y(x_i) \ge 1.$$
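For completeness, the invariance argument behind this normalization can be written out explicitly: for any $\kappa > 0$,
$$\frac{t_i\big(\varphi(x_i)^T(\kappa w) + \kappa b\big)}{\|\kappa w\|} = \frac{\kappa\, t_i\big(\varphi(x_i)^T w + b\big)}{\kappa \|w\|} = \frac{t_i\, y(x_i)}{\|w\|},$$
so choosing $\kappa = 1 / \min_i t_i y(x_i)$ rescales the closest points to $t_i y(x_i) = 1$ without changing the margin, and maximizing $1/\|w\|$ is then the same as minimizing $\tfrac{1}{2}\|w\|^2$.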

6. Support Vector Machines
In order to solve the constrained problem of
$$\arg\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{given that} \quad t_i y(x_i) \ge 1,$$
we write the Lagrangian with multipliers $a = (a_1, \ldots, a_N)$ as
$$L = \frac{1}{2}\|w\|^2 - \sum_i a_i \big[t_i y(x_i) - 1\big].$$

Setting the derivatives with respect to $w$ and $b$ to zero, we get
$$w = \sum_i a_i t_i \varphi(x_i), \qquad 0 = \sum_i a_i t_i.$$
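The two conditions follow directly from differentiating the Lagrangian, using $y(x_i) = \varphi(x_i)^T w + b$, so that $\partial\, t_i y(x_i)/\partial w = t_i \varphi(x_i)$ and $\partial\, t_i y(x_i)/\partial b = t_i$:
$$\frac{\partial L}{\partial w} = w - \sum_i a_i t_i \varphi(x_i) = 0 \;\Rightarrow\; w = \sum_i a_i t_i \varphi(x_i), \qquad \frac{\partial L}{\partial b} = -\sum_i a_i t_i = 0 \;\Rightarrow\; \sum_i a_i t_i = 0.$$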

7. Support Vector Machines
Substituting these into the Lagrangian, we get
$$L = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j)$$
with respect to the constraints $\forall_i: a_i \ge 0$, $\sum_i a_i t_i = 0$, and kernel $K(x, z) = \varphi(x)^T \varphi(z)$.

The solution of this Lagrangian will fulfil the KKT conditions, meaning that
$$a_i \ge 0, \qquad t_i y(x_i) - 1 \ge 0, \qquad a_i\big(t_i y(x_i) - 1\big) = 0.$$

Therefore, either a point is on the boundary, or $a_i = 0$. Given that the predictions for a point $x$ are given by
$$y(x) = \sum_i a_i t_i K(x, x_i) + b,$$
we need to keep only the points on the boundary, the so-called support vectors.
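Once the multipliers are known, prediction only touches the support vectors. The following sketch is an illustrative helper (the names `a`, `b` and `kernel` are assumed to come from a separate training step, which is not shown here):

    import numpy as np

    def svm_predict(X_train, t_train, a, b, X_new, kernel):
        # Keep only support vectors (a_i > 0); the other terms contribute nothing.
        sv = a > 1e-8
        K = kernel(X_new, X_train[sv])        # K[n, i] = K(x_new_n, x_i)
        return K @ (a[sv] * t_train[sv]) + b  # y(x) = sum_i a_i t_i K(x, x_i) + b

    def linear_kernel(X, Z):
        return X @ Z.T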

8. Support Vector Machines
The dual formulation allows us to use non-linear kernels.

(Figure 7.2 of Pattern Recognition and Machine Learning.)

9. Support Vector Machines for Non-linearly Separable Data
Until now, we assumed the data to be linearly separable – the hard-margin SVM variant. We now relax this condition to arrive at soft-margin SVM. The idea is to allow points to be in the margin or even on the wrong side of the decision boundary.

We introduce slack variables $\xi_i \ge 0$, one for each training instance, defined as
$$\xi_i = \begin{cases} 0 & \text{for points fulfilling } t_i y(x_i) \ge 1, \\ |t_i - y(x_i)| & \text{otherwise.} \end{cases}$$

Therefore, $\xi_i = 0$ signifies a point outside of the margin, $0 < \xi_i < 1$ denotes a point inside the margin, $\xi_i = 1$ is a point on the decision boundary, and $\xi_i > 1$ indicates the point is on the opposite side of the separating hyperplane.

Therefore, we want to optimize
$$\arg\min_{w,b} C \sum_i \xi_i + \frac{1}{2}\|w\|^2 \quad \text{given that} \quad t_i y(x_i) \ge 1 - \xi_i \text{ and } \xi_i \ge 0.$$
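The slack variables and the soft-margin objective are straightforward to evaluate for a given model. The sketch below (illustrative, for an explicit feature matrix) uses the fact that for violating points $|t_i - y(x_i)| = 1 - t_i y(x_i)$, since $t_i \in \{-1, 1\}$:

    import numpy as np

    def soft_margin_objective(Phi, t, w, b, C=1.0):
        y = Phi @ w + b
        # xi_i = 0 when t_i y(x_i) >= 1, otherwise |t_i - y(x_i)| = 1 - t_i y(x_i).
        xi = np.maximum(0.0, 1.0 - t * y)
        return C * xi.sum() + 0.5 * w @ w, xi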

10. Support Vector Machines for Non-linearly Separable Data
We again create a Lagrangian, this time with multipliers $a = (a_1, \ldots, a_N)$ and also $\mu = (\mu_1, \ldots, \mu_N)$:
$$L = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i a_i \big[t_i y(x_i) - 1 + \xi_i\big] - \sum_i \mu_i \xi_i.$$

Solving for the critical points and substituting for $w$, $b$ and $\xi$ (obtaining an additional constraint $\mu_i = C - a_i$ compared to the previous case), we obtain the Lagrangian in the form
$$L = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j),$$
which is identical to the previous case, but the constraints are a bit different:
$$\forall_i: C \ge a_i \ge 0 \quad \text{and} \quad \sum_i a_i t_i = 0.$$
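The additional constraint comes from the derivative with respect to $\xi_i$; combined with $\mu_i \ge 0$ and $a_i \ge 0$, it yields the box constraint:
$$\frac{\partial L}{\partial \xi_i} = C - a_i - \mu_i = 0 \;\Rightarrow\; \mu_i = C - a_i, \qquad \mu_i \ge 0 \;\Rightarrow\; a_i \le C, \qquad \text{hence } 0 \le a_i \le C.$$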

11. Support Vector Machines for Non-linearly Separable Data
Using the KKT conditions, we can see that the support vectors (examples with $a_i > 0$) are the ones with $t_i y(x_i) = 1 - \xi_i$, i.e., the examples on the margin boundary, inside the margin, and on the opposite side of the decision boundary.

(Figure 7.4 of Pattern Recognition and Machine Learning.)

12. Sequential Minimal Optimization Algorithm
To solve the dual formulation of an SVM, the Sequential Minimal Optimization (SMO; John Platt, 1998) algorithm is usually used. Before we introduce it, we start by introducing the coordinate descent optimization algorithm.

Consider solving the unconstrained optimization problem
$$\arg\min_w L(w_1, w_2, \ldots, w_D).$$

Instead of the usual SGD approach, we could optimize the weights one by one, using the following algorithm:
- loop until convergence:
  - for $i$ in $\{1, 2, \ldots, D\}$:
    - $w_i \leftarrow \arg\min_{w_i} L(w_1, w_2, \ldots, w_D)$
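A minimal sketch of coordinate descent, using an illustrative quadratic objective $L(w) = \frac{1}{2} w^T A w - b^T w$ for which the inner $\arg\min$ over a single coordinate has the closed form $w_i = (b_i - \sum_{j \ne i} A_{ij} w_j)/A_{ii}$ (this is not the lecture's code, just a demonstration of the loop structure above):

    import numpy as np

    def coordinate_descent(A, b, iterations=100):
        D = len(b)
        w = np.zeros(D)
        for _ in range(iterations):          # loop until convergence (fixed count here)
            for i in range(D):               # for i in {1, ..., D}
                # argmin over w_i with all other coordinates held fixed
                w[i] = (b[i] - A[i] @ w + A[i, i] * w[i]) / A[i, i]
        return w

    # Toy usage: the minimizer of L is the solution of A w = b.
    A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
    b = np.array([1.0, 2.0])
    print(coordinate_descent(A, b))          # close to np.linalg.solve(A, b)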

13. Sequential Minimal Optimization Algorithm
- loop until convergence:
  - for $i$ in $\{1, 2, \ldots, D\}$:
    - $w_i \leftarrow \arg\min_{w_i} L(w_1, w_2, \ldots, w_D)$

If the inner $\arg\min$ can be performed efficiently, the coordinate descent can be fairly efficient. Note that we might want to choose the $w_i$ in a different order, for example by trying to choose the $w_i$ providing the largest decrease of $L$.

(CS229 Lecture 3 Notes, http://cs229.stanford.edu/notes/cs229-notes3.pdf)

14. Sequential Minimal Optimization Algorithm
In soft-margin SVM, we try to maximize
$$L = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j)$$
such that
$$\forall_i: C \ge a_i \ge 0 \quad \text{and} \quad \sum_i a_i t_i = 0.$$

The KKT conditions for the solution can be reformulated (while staying equivalent) as
- $a_i > 0 \Rightarrow t_i y(x_i) \le 1$, because $a_i > 0 \Rightarrow t_i y(x_i) = 1 - \xi_i$ and we have $\xi_i \ge 0$,
- $a_i < C \Rightarrow t_i y(x_i) \ge 1$, because $a_i < C \Rightarrow \mu_i > 0 \Rightarrow \xi_i = 0$ and $t_i y(x_i) \ge 1 - \xi_i$,
- $0 < a_i < C \Rightarrow t_i y(x_i) = 1$, a combination of both.
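These reformulated conditions are exactly what SMO checks when selecting which multiplier to optimize next. A small sketch of such a check (the tolerance `tol` and the function name are illustrative, not taken from the lecture):

    def violates_kkt(a_i, t_i, y_i, C, tol=1e-3):
        # a_i > 0 requires t_i * y(x_i) <= 1; a_i < C requires t_i * y(x_i) >= 1.
        margin = t_i * y_i
        return (a_i > 0 and margin > 1 + tol) or (a_i < C and margin < 1 - tol)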
