NPFL129, Lecture 6
Soft-margin SVM, SMO Algorithm, Decision Trees

Milan Straka, November 25, 2019
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Kernel Linear Regression

When the dimensionality of the input is $D$, one step of SGD takes $\mathcal{O}(D)$. Surprisingly, we can do better under some circumstances. We start by noting that we can write the parameters $w$ as a linear combination of the input features $\varphi(x_i)$.

By induction, $w = 0 = \sum_i 0 \cdot \varphi(x_i)$, and assuming $w = \sum_i \beta_i \cdot \varphi(x_i)$, after an SGD update we get
$$w \leftarrow w + \alpha \sum_i \big(t_i - w^T \varphi(x_i)\big) \varphi(x_i) = \sum_i \Big(\beta_i + \alpha\big(t_i - w^T \varphi(x_i)\big)\Big) \varphi(x_i).$$

An individual update is therefore $\beta_i \leftarrow \beta_i + \alpha\big(t_i - w^T \varphi(x_i)\big)$, and substituting for $w$ we get
$$\beta_i \leftarrow \beta_i + \alpha\Big(t_i - \sum\nolimits_j \beta_j \varphi(x_j)^T \varphi(x_i)\Big).$$
Kernel Linear Regression

We can formulate an alternative linear regression algorithm (it would be called a dual formulation):

Input: Dataset ($X = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{N \times D}$, $t \in \mathbb{R}^N$), learning rate $\alpha \in \mathbb{R}^+$.

- Set $\beta_i \leftarrow 0$.
- Compute all values $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.
- Repeat:
  - Update the coordinates, either according to a full gradient update:
    $$\beta \leftarrow \beta + \alpha(t - K\beta),$$
  - or alternatively use single-batch SGD, arriving at:
    for $i$ in random permutation of $\{1, \ldots, N\}$:
    $$\beta_i \leftarrow \beta_i + \alpha\Big(t_i - \sum\nolimits_j \beta_j K(x_i, x_j)\Big).$$
    In vector notation, performing the updates for all $i$ at once can again be written as $\beta \leftarrow \beta + \alpha(t - K\beta)$.

The predictions are then performed by computing
$$y(x) = w^T \varphi(x) = \sum\nolimits_i \beta_i \varphi(x_i)^T \varphi(x).$$
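The following is a minimal NumPy sketch of this dual algorithm (not the course's reference implementation); the choice of an RBF kernel, the synthetic data and all hyperparameter values are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Pairwise Gaussian kernel values K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2).
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def fit_dual(X, t, alpha=0.01, epochs=2000, gamma=1.0):
    # Dual (kernelized) linear regression trained by full gradient updates on beta.
    K = rbf_kernel(X, X, gamma)          # precompute all K(x_i, x_j)
    beta = np.zeros(len(X))
    for _ in range(epochs):
        beta += alpha * (t - K @ beta)   # beta <- beta + alpha * (t - K beta)
    return beta

def predict(train_X, beta, new_X, gamma=1.0):
    # y(x) = sum_i beta_i K(x_i, x)
    return rbf_kernel(new_X, train_X, gamma) @ beta

# Tiny usage example on synthetic one-dimensional data.
rng = np.random.RandomState(42)
X = rng.uniform(-1, 1, size=(50, 1))
t = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=50)
beta = fit_dual(X, t)
print(predict(X, beta, np.array([[0.5]])))   # roughly sin(1.5), up to noise and finite training
```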
Kernels

We define a kernel corresponding to a feature map $\varphi$ as a function
$$K(x, z) \stackrel{\text{def}}{=} \varphi(x)^T \varphi(z).$$

There is quite a lot of theory behind kernel construction. The most often used kernels are:

- polynomial kernel of degree $d$
  $$K(x, z) = (\gamma x^T z + 1)^d,$$
  which corresponds to a feature map generating all combinations of up to $d$ input features;
- Gaussian (or RBF) kernel
  $$K(x, z) = e^{-\gamma ||x - z||^2},$$
  corresponding to a scalar product in an infinite-dimensional space (it is in a sense a combination of polynomial kernels of all degrees).
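As a quick illustration (not part of the original slides), the sketch below evaluates both kernels and checks that the degree-2 polynomial kernel in two dimensions agrees with a hand-derived explicit feature map of all monomials up to degree two:

```python
import numpy as np

def poly_kernel(x, z, degree=2, gamma=1.0):
    # Polynomial kernel K(x, z) = (gamma * x^T z + 1)^d.
    return (gamma * np.dot(x, z) + 1) ** degree

def rbf_kernel(x, z, gamma=1.0):
    # Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2).
    return np.exp(-gamma * np.sum((x - z) ** 2))

def phi_degree2(x):
    # Explicit feature map matching the degree-2 polynomial kernel (gamma = 1) in two dimensions:
    # all monomials of the inputs up to degree 2, scaled so the dot product reproduces the kernel.
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, z), phi_degree2(x) @ phi_degree2(z))   # the two values coincide
print(rbf_kernel(x, z, gamma=0.5))
```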
Support Vector Machines

Assume we have a dataset $X \in \mathbb{R}^{N \times D}$, $t \in \{-1, 1\}^N$, feature map $\varphi$ and model
$$y(x) \stackrel{\text{def}}{=} \varphi(x)^T w + b.$$

We already know that the distance of a point $x_i$ to the decision boundary is
$$\frac{|y(x_i)|}{||w||} = \frac{t_i y(x_i)}{||w||}.$$

We therefore want to maximize
$$\arg\max_{w,b} \frac{1}{||w||} \min_i \big[t_i (\varphi(x_i)^T w + b)\big].$$

However, this problem is difficult to optimize directly.

Figure 4.1 of Pattern Recognition and Machine Learning.
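For concreteness, here is a tiny NumPy computation of this distance for a hypothetical linear model (identity feature map; the weights, bias and points are made up):

```python
import numpy as np

# Hypothetical linear model y(x) = w^T x + b and two labeled points.
w, b = np.array([2.0, -1.0]), 0.5
X = np.array([[1.0, 1.0], [-0.5, 2.0]])
t = np.array([1, -1])

y = X @ w + b                              # model outputs y(x_i)
distances = t * y / np.linalg.norm(w)      # t_i y(x_i) / ||w||, positive iff classified correctly
print(distances)
```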
Support Vector Machines

Because the model is invariant to multiplying $w$ and $b$ by a constant, we can say that for the points closest to the decision boundary, it will hold that
$$t_i y(x_i) = 1.$$

Then for all the points we will have $t_i y(x_i) \ge 1$ and we can simplify
$$\arg\max_{w,b} \frac{1}{||w||} \min_i \big[t_i (\varphi(x_i)^T w + b)\big]$$
to
$$\arg\min_{w,b} \frac{1}{2} ||w||^2 \quad\text{given that}\quad t_i y(x_i) \ge 1.$$
Support Vector Machines

In order to solve the constrained problem of
$$\arg\min_{w,b} \frac{1}{2} ||w||^2 \quad\text{given that}\quad t_i y(x_i) \ge 1,$$
we write the Lagrangian with multipliers $a = (a_1, \ldots, a_N)$ as
$$L = \frac{1}{2} ||w||^2 - \sum_i a_i \big[t_i y(x_i) - 1\big].$$

Setting the derivatives with respect to $w$ and $b$ to zero, we get
$$w = \sum_i a_i t_i \varphi(x_i),$$
$$0 = \sum_i a_i t_i.$$
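Substituting $w = \sum_i a_i t_i \varphi(x_i)$ and $\sum_i a_i t_i = 0$ back into the Lagrangian is the step that yields the dual on the next slide; a brief sketch of the substitution, using $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$:
$$L = \frac{1}{2}\Big\|\sum_i a_i t_i \varphi(x_i)\Big\|^2 - \sum_i a_i \Big[t_i \Big(\varphi(x_i)^T \sum_j a_j t_j \varphi(x_j) + b\Big) - 1\Big]$$
$$= \frac{1}{2}\sum_{i,j} a_i a_j t_i t_j K(x_i, x_j) - \sum_{i,j} a_i a_j t_i t_j K(x_i, x_j) - b \sum_i a_i t_i + \sum_i a_i$$
$$= \sum_i a_i - \frac{1}{2}\sum_{i,j} a_i a_j t_i t_j K(x_i, x_j),$$
where the term $b \sum_i a_i t_i$ vanishes because of the second stationarity condition.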
Support Vector Machines

Substituting these into the Lagrangian, we get
$$L = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j)$$
with respect to the constraints $\forall_i: a_i \ge 0$, $\sum_i a_i t_i = 0$ and kernel $K(x, z) = \varphi(x)^T \varphi(z)$.

The solution of this Lagrangian will fulfil the KKT conditions, meaning that
$$a_i \ge 0,$$
$$t_i y(x_i) - 1 \ge 0,$$
$$a_i \big(t_i y(x_i) - 1\big) = 0.$$

Therefore, either a point is on the margin boundary, or $a_i = 0$. Given that the predictions for point $x$ are given by $y(x) = \sum_i a_i t_i K(x, x_i) + b$, we need to keep only the points on the boundary, the so-called support vectors.
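A minimal sketch of the resulting prediction rule, assuming the dual variables $a_i$ and bias $b$ were already obtained by some solver; the support vectors, multipliers and bias below are made-up values:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Pairwise Gaussian kernel matrix K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2).
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def svm_predict(new_X, support_X, support_a, support_t, b, gamma=1.0):
    # y(x) = sum_i a_i t_i K(x, x_i) + b, summing only over the support vectors (a_i > 0).
    return rbf_kernel(new_X, support_X, gamma) @ (support_a * support_t) + b

# Made-up dual solution: two support vectors with multipliers a_i, targets t_i and bias b.
support_X = np.array([[0.0, 1.0], [1.0, 0.0]])
support_a = np.array([0.7, 0.7])
support_t = np.array([1, -1])
b = 0.0

queries = np.array([[0.2, 0.9], [0.9, 0.1]])
print(np.sign(svm_predict(queries, support_X, support_a, support_t, b)))
```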
Support Vector Machines

The dual formulation allows us to use non-linear kernels.

Figure 7.2 of Pattern Recognition and Machine Learning.
Support Vector Machines for Non-linearly Separable Data

Until now, we assumed the data to be linearly separable – the hard-margin SVM variant. We now relax this condition to arrive at soft-margin SVM. The idea is to allow points to be in the margin or even on the wrong side of the decision boundary. We introduce slack variables $\xi_i \ge 0$, one for each training instance, defined as
$$\xi_i = \begin{cases} 0 & \text{for points fulfilling } t_i y(x_i) \ge 1, \\ |t_i - y(x_i)| & \text{otherwise.} \end{cases}$$

Therefore, $\xi_i = 0$ signifies a point outside of the margin, $0 < \xi_i < 1$ denotes a point inside the margin, $\xi_i = 1$ is a point on the decision boundary, and $\xi_i > 1$ indicates the point is on the opposite side of the separating hyperplane.

Therefore, we want to optimize
$$\arg\min_{w,b} C \sum_i \xi_i + \frac{1}{2} ||w||^2 \quad\text{given that}\quad t_i y(x_i) \ge 1 - \xi_i \text{ and } \xi_i \ge 0.$$

Figure 7.3 of Pattern Recognition and Machine Learning.
Support Vector Machines for Non-linearly Separable Data

We again create a Lagrangian, this time with multipliers $a = (a_1, \ldots, a_N)$ and also $\mu = (\mu_1, \ldots, \mu_N)$:
$$L = \frac{1}{2} ||w||^2 + C \sum_i \xi_i - \sum_i a_i \big[t_i y(x_i) - 1 + \xi_i\big] - \sum_i \mu_i \xi_i.$$

Solving for the critical points and substituting for $w$, $b$ and $\xi$ (obtaining an additional constraint $\mu_i = C - a_i$ compared to the previous case), we obtain the Lagrangian in the form
$$L = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j),$$
which is identical to the previous case, but the constraints are a bit different:
$$\forall_i: C \ge a_i \ge 0 \quad\text{and}\quad \sum_i a_i t_i = 0.$$
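In practice this dual problem is solved by existing libraries; below is a minimal sketch using scikit-learn's SVC (which optimizes the soft-margin dual), with illustrative data and hyperparameter values:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs, so a soft margin (finite C) is necessary.
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(-1, 1.0, size=(50, 2)), rng.normal(+1, 1.0, size=(50, 2))])
t = np.array([-1] * 50 + [+1] * 50)

# C is the inverse regularization strength of the soft-margin objective.
model = SVC(C=1.0, kernel="rbf", gamma=0.5)
model.fit(X, t)

# Only the support vectors (examples with 0 < a_i <= C) are stored;
# dual_coef_ holds the products a_i * t_i for those examples.
print("number of support vectors:", model.support_vectors_.shape[0])
print("predictions:", model.predict([[-2.0, -2.0], [2.0, 2.0]]))
```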
Support Vector Machines for Non-linearly Separable Data

Using the KKT conditions, we can see that the support vectors (examples with $a_i > 0$) are the ones with $t_i y(x_i) = 1 - \xi_i$, i.e., the examples on the margin boundary, inside the margin, or on the opposite side of the decision boundary.

Figure 7.4 of Pattern Recognition and Machine Learning.
SGD-like Formulation of Soft-Margin SVM

Note that the slack variables can be written as
$$\xi_i = \max\big(0, 1 - t_i y(x_i)\big),$$
so we can reformulate the soft-margin SVM objective using the hinge loss
$$L_{\textrm{hinge}}(t, y) \stackrel{\text{def}}{=} \max(0, 1 - ty)$$
to
$$\arg\min_{w,b} C \sum_i L_{\textrm{hinge}}\big(t_i, y(x_i)\big) + \frac{1}{2} ||w||^2.$$

Such a formulation is analogous to a regularized loss, where $C$ is an inverse regularization strength, so $C = \infty$ implies no regularization and $C = 0$ ignores the data entirely.
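A minimal NumPy sketch of this SGD-like formulation with the identity feature map; the learning rate, epoch count and the per-example split of the regularization term are illustrative choices, not the course's reference implementation:

```python
import numpy as np

def svm_sgd(X, t, C=1.0, lr=0.05, epochs=200):
    # SGD on the primal objective C * sum_i max(0, 1 - t_i (w^T x_i + b)) + ||w||^2 / 2.
    # The hinge-loss subgradient w.r.t. w is -t_i x_i when t_i y(x_i) < 1 and 0 otherwise;
    # the regularization gradient w is split evenly across the N per-example updates of an epoch.
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(N):
            grad_w, grad_b = w / N, 0.0
            if t[i] * (X[i] @ w + b) < 1:      # hinge loss is active for this example
                grad_w = grad_w - C * t[i] * X[i]
                grad_b = grad_b - C * t[i]
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Tiny linearly separable toy set.
X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])
w, b = svm_sgd(X, t)
print(np.sign(X @ w + b))   # expected to match t on this toy set
```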