Class #13: Support Vector Machines (SVMs) and Kernel Functions
Machine Learning (COMP 135): M. Allen, 02 Mar. 20

Data Separation
- Linear classification with a perceptron or logistic function looks for a dividing line in the data (or a plane, or other linearly defined structure).
- Often multiple lines are possible. Essentially, the algorithms are indifferent: they don't care which line we pick.
- In the example seen here, either classification line separates the data perfectly well.
[Figure: two different lines, each perfectly separating the same two classes in the (x1, x2) plane.]

"Fragile" Separation
- As more data comes in, these classifiers may start to fail.
- A separator that is too close to one cluster or the other now makes mistakes.
- This may happen even if the new data follows the same distribution seen in the training set.
[Figure: new data points fall on the wrong side of a separator drawn too close to one cluster.]

"Robust" Separation
- What we want is a large margin separator: a separation that has the largest distance possible from each part of our data-set.
- This will often give much better performance when used on new data.
[Figure: a separator placed as far as possible from both clusters still classifies the new data correctly.]
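The slides contrast a classifier that stops at any separating line with one that seeks the largest margin. As a minimal illustrative sketch (not part of the original slides), the snippet below fits both a perceptron and a linear SVM from scikit-learn on synthetic 2-D data; the data and parameter choices are assumptions for demonstration only.

```python
# Sketch: any-separator (perceptron) vs. large-margin separator (linear SVM).
# Assumes numpy and scikit-learn are installed; the blob data is illustrative.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

perceptron = Perceptron().fit(X, y)           # stops at *some* separating line
svm = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin

# Both may fit the training data perfectly, but the SVM's line is the one
# with the widest margin, which tends to hold up better on new points.
print("Perceptron training accuracy:", perceptron.score(X, y))
print("SVM training accuracy:      ", svm.score(X, y))
print("SVM margin width 2/||w||:   ", 2 / np.linalg.norm(svm.coef_))
```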
Large Margin Separation
- A new learning problem: find the separator with the largest margin.
- This will be measured from the data points, on opposite sides, that are closest together.
- This is sometimes called the "widest road" approach. A support vector machine (SVM) is a technique that finds this road. The points that define the edges of the road are known as the support vectors.
[Figure: the SVM separator drawn as the widest possible road between the two classes in the (x1, x2) plane.]

Linear Classifiers and SVMs
- Linear classifier:
    Weight equation:     w · x = w_0 + w_1 x_1 + w_2 x_2 + · · · + w_n x_n
    Threshold function:  h_w(x) = 1 if w · x ≥ 0;  0 if w · x < 0
- SVM:
    Weight equation:     w · x + b = (w_1 x_1 + w_2 x_2 + · · · + w_n x_n) + b
    Threshold function:  h_w(x) = +1 if w · x + b ≥ 0;  −1 if w · x + b < 0

Large Margin Separation
- Like a linear classifier, the SVM separates the data at the line where its weighted sum is zero (w · x + b = 0).
- A key difference: the SVM is going to do this without learning and remembering the weight vector w. Instead, it will use features of the data-items themselves.

Mathematics of SVMs
- It turns out that the weight-vector w for the largest margin separator has some important properties relative to the closest data-points on each side (x+ and x−):
    w · x+ + b = +1
    w · x− + b = −1
    w · x + b = 0        (on the separator itself)
- Subtracting the first two equations, then dividing by the length of w:
    w · (x+ − x−) = 2
    (w / ||w||) · (x+ − x−) = 2 / ||w||,   where ||w|| = sqrt(w_1^2 + w_2^2 + · · · + w_n^2)
- The width of the margin is therefore 2 / ||w||, so maximizing the margin means minimizing ||w||.
[Figure: the margin boundaries w · x + b = ±1 passing through x+ and x−, with the separator w · x + b = 0 between them.]
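To make the threshold function and the 2/||w|| margin width concrete, here is a small sketch using a hand-picked weight vector and bias; the numbers are illustrative assumptions, not values from the slides.

```python
# Sketch: SVM decision rule and margin width for an assumed w and b.
import numpy as np

w = np.array([1.0, -2.0])   # illustrative weight vector (assumption)
b = 0.5                     # illustrative bias term (assumption)

def h(x, w, b):
    """SVM threshold function: +1 if w.x + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Points on the two margin boundaries satisfy w.x + b = +1 and w.x + b = -1;
# the distance between those boundaries (the width of the "road") is 2 / ||w||.
margin_width = 2.0 / np.linalg.norm(w)

print(h(np.array([2.0, 0.0]), w, b))    # -> +1 side
print(h(np.array([-2.0, 1.0]), w, b))   # -> -1 side
print(margin_width)                     # -> 2 / sqrt(5)
```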
Mathematics of SVMs
- Through the magic of mathematics (Lagrangian multipliers, to be specific), we can derive a quadratic programming problem.
- 1. We start with our data-set:
       { (x_1, y_1), (x_2, y_2), . . . , (x_n, y_n) }    [ ∀ i, y_i ∈ {+1, −1} ]
- 2. We then solve a constrained optimization problem:
       W(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
       subject to:  ∀ i, α_i ≥ 0    and    Σ_i α_i y_i = 0
- The goal: based on the known values (x_i, y_i), find the values we don't know (the α_i) that:
    1. maximize the value of the margin objective W(α); and
    2. satisfy the two numerical constraints.

Mathematics of SVMs
- Although complex, a constrained optimization problem like this can be solved algorithmically to get the α_i values we want.
- A note about notation: these equations involve two different, necessary products:
    1. The usual application of weights to points:
         w · x_i = w_1 x_{i,1} + w_2 x_{i,2} + · · · + w_n x_{i,n}
    2. Products of points with other points:
         x_i · x_j = x_{i,1} x_{j,1} + x_{i,2} x_{j,2} + · · · + x_{i,n} x_{j,n}
- Once done, we can recover the weight-vector and bias term if we want:
       w = Σ_i α_i y_i x_i
       b = −(1/2) ( max_{i : y_i = −1} w · x_i  +  min_{j : y_j = +1} w · x_j )
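The slides state that this constrained problem can be solved algorithmically. The sketch below shows one possible way to do so for a tiny, hand-made data-set, using scipy's general-purpose optimizer rather than a dedicated QP or SVM solver; the data points and the choice of solver are assumptions for illustration only.

```python
# Sketch (assumed, not the course's implementation): solve the dual QP for a
# toy 2-D data-set, then recover w and b with the formulas from the slide.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])  # toy points
y = np.array([+1.0, +1.0, -1.0, -1.0])                             # labels in {+1, -1}

K = (X @ X.T) * np.outer(y, y)     # matrix of y_i y_j (x_i . x_j) terms

def neg_W(alpha):                  # minimize -W(alpha) to maximize W(alpha)
    return -(alpha.sum() - 0.5 * alpha @ K @ alpha)

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0, None)] * len(y)                            # alpha_i >= 0
alpha = minimize(neg_W, np.zeros(len(y)), bounds=bounds,
                 constraints=constraints).x

w = (alpha * y) @ X                # w = sum_i alpha_i y_i x_i
b = -0.5 * ((X[y == -1] @ w).max() + (X[y == +1] @ w).min())
print("alpha:", np.round(alpha, 3), "w:", np.round(w, 3), "b:", round(b, 3))
```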
The Dual Formulation
- It turns out that we don't need to use the weights at all. Instead, we can simply use the α_i values directly:
       w · x_i + b = Σ_j α_j y_j (x_i · x_j) + b
- What we usually look for in a parametric method: the weights, w, and offset, b, defining the classifier.
- What we can use instead: an equivalent result computed from the α parameters, the outputs y, and products between data-points themselves (along with the standard offset).

The Dual Formulation
- Now, if we had to sum over every data-point, as on the right-hand side of this equation, this would look very bad for a large data-set.
- It turns out that these α_i values have a special property, however, that makes it feasible to use them as part of our classification function…
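As an illustrative sketch (not part of the slides), the snippet below evaluates the dual-form decision value Σ_j α_j y_j (x_j · x) + b using a linear SVC fit by scikit-learn and checks it against the library's own w · x + b. The synthetic data is an assumption; scikit-learn's dual_coef_ stores the products α_j y_j for the support vectors, which is exactly what the sum needs.

```python
# Sketch: classifying with the dual form, using a fitted scikit-learn linear SVC.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=1)
svm = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_y = svm.dual_coef_.ravel()   # alpha_j * y_j, for the support vectors only
sv = svm.support_vectors_          # the support vectors themselves
b = svm.intercept_[0]              # the offset term

x_new = X[0]
dual_score = np.sum(alpha_y * (sv @ x_new)) + b    # sum_j alpha_j y_j (x_j . x_new) + b
primal_score = svm.decision_function([x_new])[0]   # w . x_new + b, for comparison

# Only the support vectors contribute: every other data-point ends up with
# alpha = 0, so the "sum over all points" is really a sum over a few of them.
print(len(sv), "support vectors")
print(dual_score, "~", primal_score)
```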