
Support Vector Manifold Learning for Solving Regression Problems via Clustering



  1. Support Vector Manifold Learning for Solving Regression Problems via Clustering. Marcin Orchel, AGH University of Science and Technology in Poland.


  3. [Figure, panel (a): plot with axes x and y.]

  4. New kernel:

     $$
     \begin{aligned}
     \left(\varphi(\vec{x}_i) + y_i t \varphi(\vec{c})\right)^T \left(\varphi(\vec{x}_j) + y_j t \varphi(\vec{c})\right)
     &= \varphi(\vec{x}_i)^T \varphi(\vec{x}_j) + y_j t \varphi(\vec{x}_i)^T \varphi(\vec{c}) \\
     &\quad + y_i t \varphi(\vec{c})^T \varphi(\vec{x}_j) + y_j y_i t^2 \varphi(\vec{c})^T \varphi(\vec{c}) \\
     &= K(\vec{x}_i, \vec{x}_j) + y_j t K(\vec{x}_i, \vec{c}) \\
     &\quad + y_i t K(\vec{c}, \vec{x}_j) + y_j y_i t^2 K(\vec{c}, \vec{c}).
     \end{aligned}
     \tag{1--5}
     $$

     Cross kernel:

     $$
     \left(\varphi(\vec{x}_i) + y_i t \varphi(\vec{c})\right)^T \varphi(\vec{x})
     = \varphi(\vec{x}_i)^T \varphi(\vec{x}) + y_i t \varphi(\vec{c})^T \varphi(\vec{x})
     = K(\vec{x}_i, \vec{x}) + y_i t K(\vec{c}, \vec{x}).
     \tag{6--8}
     $$

     The number of support vectors is therefore at most $n + 1$.
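
The new kernel and the cross kernel above can be evaluated from any base kernel without touching the feature map. The following is a minimal sketch, not the author's implementation: the function names (`shifted_kernel`, `cross_kernel`) and the RBF parameterization $\exp(-\|u - v\|^2 / (2\sigma^2))$ are assumptions made for illustration.

```python
import math


def linear(u, v):
    """Linear kernel: the plain dot product."""
    return sum(a * b for a, b in zip(u, v))


def rbf(u, v, sigma=1.5):
    """RBF kernel; the 1/(2 sigma^2) scaling is an assumed convention."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def shifted_kernel(xi, yi, xj, yj, c, t, kernel):
    """Kernel between two shifted points, following eqs. (4)-(5)."""
    return (kernel(xi, xj)
            + yj * t * kernel(xi, c)
            + yi * t * kernel(c, xj)
            + yi * yj * t ** 2 * kernel(c, c))


def cross_kernel(xi, yi, x, c, t, kernel):
    """Kernel between a shifted training point and an unshifted point, eq. (8)."""
    return kernel(xi, x) + yi * t * kernel(c, x)
```

With the linear kernel the shift can be carried out explicitly in input space, which gives a simple way to sanity-check both functions.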

  5. Proposition 1. Shifting a hyperplane by any vector $\vec{c}$ gives a new hyperplane that differs from the original only by the free term $b$.

     Lemma 1. After duplicating and shifting an $(n-1)$-dimensional hyperplane constrained by an $n$-dimensional hypersphere, the maximal distance from the original center of the hypersphere to any point belonging to the shifted hyperplane is attained at a point such that, after projecting this point onto the $(n-1)$-dimensional hyperplane (before the shift), the vector from $\vec{0}$ to the projected point is parallel to the vector from $\vec{0}$ to the projected center of one of the new hyperspheres (a shifted hyperplane).

  6. Proposition 2. The radius $R_n$ of a minimal hypersphere containing all points after shifting is equal to

     $$
     R_n = \left\| \vec{c} + R \, \vec{c}_m / \|\vec{c}_m\| \right\| \tag{9}
     $$

     where $\vec{c}_m$ is defined as

     $$
     \vec{c}_m = \vec{c} - \frac{b + \vec{w} \cdot \vec{c}}{\|\vec{w}\|^2} \vec{w} \tag{10}
     $$

     and $\|\vec{c}_m\| \neq 0$. For $\|\vec{c}_m\| = 0$, we get $R_n = \|\vec{c}\| = \|\vec{c}_p\|$.
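
As a worked illustration of Proposition 2, the sketch below evaluates (9) and (10) for plain vectors. The function names and the tolerance used to detect $\|\vec{c}_m\| = 0$ are assumptions introduced here, not part of the slides.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def shift_component(c, w, b):
    """c_m from eq. (10): c minus its (b + w.c)/||w||^2 multiple of w."""
    coef = (b + dot(w, c)) / dot(w, w)
    return [ci - coef * wi for ci, wi in zip(c, w)]


def shifted_radius(c, w, b, R, tol=1e-12):
    """R_n from eq. (9); falls back to ||c|| when c_m vanishes."""
    c_m = shift_component(c, w, b)
    length = norm(c_m)
    if length <= tol:  # special case ||c_m|| = 0
        return norm(c)
    return norm([ci + R * mi / length for ci, mi in zip(c, c_m)])
```

For example, with the hyperplane $y = 0$ (`w = [0, 1]`, `b = 0`), `c = [3, 4]` and `R = 2`, the component $\vec{c}_m$ is `[3, 0]` and the radius is $\|(5, 4)\| = \sqrt{41}$.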

  7. Proposition 3. Consider hyperplanes $\vec{w}_c \cdot \vec{x} = 0$, where $\vec{w}_c$ is normalized such that they are in a canonical form, that is, for a set of points $A = \{\vec{x}_1, \ldots, \vec{x}_n\}$,

     $$
     \min_i |\vec{w}_c \cdot \vec{x}_i| = 1. \tag{11}
     $$

     The set of decision functions $f_w(\vec{x}) = \operatorname{sgn}(\vec{x} \cdot \vec{w}_c)$ defined on $A$, satisfying the constraint $\|\vec{w}_c\| \le D$, has a Vapnik-Chervonenkis (VC) dimension $h$ satisfying

     $$
     h \le \min\left(R^2 D^2, m\right) + 1, \tag{12}
     $$

     where $R$ is the radius of the smallest sphere centered at the origin and containing $A$.

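  8. We can improve generalization bounds when

     $$
     \frac{D^2 \left\| \vec{c} + R \, \vec{c}_m / \|\vec{c}_m\| \right\|^2}{\left(1 + D \|\vec{c}_p\|\right)^2} \le R^2 D^2, \tag{13}
     $$

     that is, when

     $$
     \frac{\left\| \vec{c} + R \, \vec{c}_m / \|\vec{c}_m\| \right\|^2}{\left(1 + D \|\vec{c}_p\|\right)^2} \le R^2. \tag{14}
     $$

     For the special case $\|\vec{c}_m\| = 0$, where $R_n = \|\vec{c}_p\|$, we get

     $$
     \frac{\|\vec{c}_p\|^2}{\left(1 + D \|\vec{c}_p\|\right)^2} \le R^2. \tag{15}
     $$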

  9. Performance measure

     For OCSVM, the distance between a point $\vec{r}$ and the minimal hypersphere in a kernel-induced feature space can be computed as

     $$
     R - \left[ \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(\vec{x}_i, \vec{x}_j) - 2 \sum_{j=1}^{n} \alpha_j K(\vec{x}_j, \vec{r}) + K(\vec{r}, \vec{r}) \right]^{1/2}. \tag{16--17}
     $$

     For kernels for which $K(\vec{x}, \vec{x})$ is constant, such as the radial basis function (RBF) kernel, the radius $R$ can be computed as

     $$
     R = \sqrt{ K(\vec{x}, \vec{x}) + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(\vec{x}_i, \vec{x}_j) + 2 b^{*} }. \tag{18}
     $$
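
The bracketed term in (16)-(17) is the feature-space distance from $\vec{r}$ to the hypersphere center $\sum_i \alpha_i \varphi(\vec{x}_i)$, so it can be evaluated with kernel calls only. The sketch below assumes the coefficients $\alpha_i$ and the radius $R$ are already available from a trained model; the function names are hypothetical.

```python
import math


def linear(u, v):
    """Linear kernel, used here only to make the example checkable by hand."""
    return sum(a * b for a, b in zip(u, v))


def center_distance(points, alphas, r, kernel):
    """Square-root term of eqs. (16)-(17): distance from r to the center."""
    n = len(points)
    aa = sum(alphas[i] * alphas[j] * kernel(points[i], points[j])
             for i in range(n) for j in range(n))
    ar = sum(alphas[j] * kernel(points[j], r) for j in range(n))
    squared = aa - 2.0 * ar + kernel(r, r)
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative rounding


def distance_to_hypersphere(points, alphas, r, R, kernel):
    """Full expression (16)-(17): R minus the distance to the center."""
    return R - center_distance(points, alphas, r, kernel)
```

With the linear kernel, points `(0, 0)` and `(2, 0)` and equal weights `0.5`, the center is `(1, 0)`, so the point `(1, 3)` lies at distance 3 from it.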

  10. Performance measure

      For SVML, the distance between a point $\vec{r}$ and the hyperplane in a kernel-induced feature space can be computed as

      $$
      \frac{|\vec{w}_c \cdot \vec{r} + b_c|}{\|\vec{w}_c\|_2}
      = \frac{\left| \sum_{i=1}^{n_c} y_c^i \alpha_i^{*} K(\vec{x}_i, \vec{r}) + b_c \right|}{\sqrt{ \sum_{i=1}^{n_c} \sum_{j=1}^{n_c} y_c^i y_c^j \alpha_i^{*} \alpha_j^{*} K(\vec{x}_i, \vec{x}_j) }}.
      \tag{19--20}
      $$
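
The SVML distance in (19)-(20) can be sketched in the same kernel-only style. The function name and argument layout below are assumptions; the labels, dual coefficients and free term are taken as given inputs from a trained classifier.

```python
import math


def linear(u, v):
    """Linear kernel, used here only to make the example checkable by hand."""
    return sum(a * b for a, b in zip(u, v))


def hyperplane_distance(points, labels, alphas, b, r, kernel):
    """Eqs. (19)-(20): |w.r + b| / ||w|| expressed through kernel values."""
    n = len(points)
    # Numerator: decision value at r.
    decision = sum(labels[i] * alphas[i] * kernel(points[i], r)
                   for i in range(n)) + b
    # Denominator: squared norm of w in feature space.
    w_sq = sum(labels[i] * labels[j] * alphas[i] * alphas[j]
               * kernel(points[i], points[j])
               for i in range(n) for j in range(n))
    return abs(decision) / math.sqrt(w_sq)
```

For the linear kernel with points `(1, 0)` and `(0, 1)`, labels `+1` and `-1`, unit coefficients and `b = 0`, the normal vector is `(1, -1)` and the point `(2, 0)` is at distance $\sqrt{2}$.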

  11. [Figure, panels (a) and (b): plots with axes x and y.]

  12. Fig. 3: Clustering based on curve learning. Points: examples. (a) Solid line: solution of SVCL. (b) Solid line: solution of SVMLC. (c) Solid line: solution of KPCA.

  13. Fig. 4: Regression via clustering. (a) Clustering with SVCL. (b) Clustering with SVMLC. (c) Corresponding two regression functions for (a). (d) Corresponding two regression functions for (b).

  14. Fig. 5: Regression via clustering. (a) Clustering with SVCL. (b) Clustering with SVMLC. (c) Corresponding two regression functions for (a). (d) Corresponding two regression functions for (b).

  15. Goal

      - Develop a method for dimensionality reduction based on support vector machines (SVM).
      - Reduce dimensionality by fitting a curve to data given as vectors (not classification data and not regression data).
      - It can be seen as a generalization of regression: regression fits a function to data, curve fitting fits a curve to data.
      - Idea: duplicate the points, shift them in a kernel space and use support vector classification (SVC).
      - Use recursive dimensionality reduction for the linear decision boundary in kernel space: project the points onto the solution curve, then repeat all steps.
      - We could also use it for clustering, similarly to self-organizing maps.
      - We could use it for visualization.

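  16. Shifting in kernel space

      Shifting points in kernel space:

      $$
      \begin{aligned}
      \left(\varphi(\vec{x}_i) + y_i t \varphi(\vec{c})\right)^T \left(\varphi(\vec{x}_j) + y_j t \varphi(\vec{c})\right)
      &= \varphi(\vec{x}_i)^T \varphi(\vec{x}_j) + y_j t \varphi(\vec{x}_i)^T \varphi(\vec{c}) \\
      &\quad + y_i t \varphi(\vec{c})^T \varphi(\vec{x}_j) + y_j y_i t^2 \varphi(\vec{c})^T \varphi(\vec{c})
      \end{aligned}
      \tag{21--22}
      $$

      where $t$ is a translation parameter, $\vec{c}$ is a shifting point, and $\varphi(\cdot)$ is the feature map of some symmetric kernel.

      Cross kernel:

      $$
      \left(\varphi(\vec{x}_i) + y_i t \varphi(\vec{c})\right)^T \varphi(\vec{x})
      = \varphi(\vec{x}_i)^T \varphi(\vec{x}) + y_i t \varphi(\vec{c})^T \varphi(\vec{x})
      \tag{23}
      $$

      We preserve sparsity. For two duplicated points, where $y_i = 1$ and $y_{i+\mathrm{size}} = -1$:

      $$
      \begin{aligned}
      & y_i \alpha_i \left( \varphi(\vec{x}_i)^T \varphi(\vec{x}) + t \varphi(\vec{c})^T \varphi(\vec{x}) \right) \\
      &\quad + y_{i+\mathrm{size}} \alpha_{i+\mathrm{size}} \left( \varphi(\vec{x}_i)^T \varphi(\vec{x}) + y_{i+\mathrm{size}} t \varphi(\vec{c})^T \varphi(\vec{x}) \right) \\
      &= \left( y_i \alpha_i + y_{i+\mathrm{size}} \alpha_{i+\mathrm{size}} \right) \varphi(\vec{x}_i)^T \varphi(\vec{x})
      + \left( y_i \alpha_i + \alpha_{i+\mathrm{size}} \right) t \varphi(\vec{c})^T \varphi(\vec{x})
      \end{aligned}
      \tag{24--26}
      $$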

  17. Shifting in a kernel space, cont.

      The second term can be summed over all $i$:

      $$
      \sum_i \left( y_i \alpha_i + \alpha_{i+\mathrm{size}} \right) t \varphi(\vec{c})^T \varphi(\vec{x}). \tag{27}
      $$

      When $\alpha_i = \alpha_{i+\mathrm{size}} = C$, the contribution of one pair is

      $$
      2 C t \varphi(\vec{c})^T \varphi(\vec{x}), \tag{28}
      $$

      so it is like adding an artificial point $\vec{c}$ to the solution curve with the parameter $2Ct$. We can sum these contributions for multiple points.
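
The collapse of a duplicated pair in (24)-(26), and the special case (28), can be checked numerically. The sketch below works directly on kernel values $K(\vec{x}_i, \vec{x})$ and $K(\vec{c}, \vec{x})$, passed in as plain numbers; the function names are hypothetical.

```python
def pair_contribution(alpha_pos, alpha_neg, k_ix, k_cx, t):
    """Left-hand side of (24)-(25) for labels y_i = +1 and y_{i+size} = -1."""
    y_pos, y_neg = 1, -1
    return (y_pos * alpha_pos * (k_ix + y_pos * t * k_cx)
            + y_neg * alpha_neg * (k_ix + y_neg * t * k_cx))


def collapsed_contribution(alpha_pos, alpha_neg, k_ix, k_cx, t):
    """Right-hand side (26): one term per original point plus one term on c."""
    y_pos, y_neg = 1, -1
    return ((y_pos * alpha_pos + y_neg * alpha_neg) * k_ix
            + (y_pos * alpha_pos + alpha_neg) * t * k_cx)


def saturated_pair(C, k_cx, t):
    """Eq. (28): both multipliers at the bound C leave only 2*C*t*K(c, x)."""
    return 2.0 * C * t * k_cx
```

When both multipliers equal `C`, the term on $K(\vec{x}_i, \vec{x})$ cancels and only the term on the shifting point remains, which is why the pair behaves like a single artificial point.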

  18. Shifting in kernel space

      - When $\varphi$ corresponds to a linear kernel, we get $\delta$ support vector regression ($\delta$-SVR).
      - Hypothesis: because the decision boundary is linear in kernel space, the choice of the shifting point does not matter. For example, in three dimensions we can shift in one direction only: $\vec{c} = (0.0, 0.0, 1.0)$.
      - The shifting strategy has already been tested in input space for regression in $\delta$-SVR and works well.

  19. Dimensionality reduction

      The parametric form of a straight line through the point $\varphi(\vec{x}_1)$ in the direction of $\vec{w}$ is $\vec{l} = \varphi(\vec{x}_1) + t \vec{w}$. The point $\vec{l}$ must belong to the hyperplane, so after substituting:

      $$
      \vec{w}^T \vec{l} + b = 0 \tag{29}
      $$

      $$
      \vec{w}^T \left( \varphi(\vec{x}_1) + t \vec{w} \right) + b = 0 \tag{30}
      $$

      We need to compute $t$, so

      $$
      t = - \frac{b + \vec{w}^T \varphi(\vec{x}_1)}{\|\vec{w}\|^2}. \tag{31}
      $$

      After substituting $t$ we get the projected point

      $$
      \vec{z} = \varphi(\vec{x}_1) - \frac{b + \vec{w}^T \varphi(\vec{x}_1)}{\|\vec{w}\|^2} \vec{w}. \tag{32}
      $$
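
For an explicit feature vector (for instance the linear-kernel case, where $\varphi$ is the identity), the projection in (31)-(32) is a few lines of arithmetic. This is a minimal sketch with a hypothetical function name, shown only to make the derivation concrete.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_onto_hyperplane(p, w, b):
    """Project p onto {z : w.z + b = 0} along w, following eqs. (31)-(32)."""
    t = -(b + dot(w, p)) / dot(w, w)   # eq. (31)
    return [pi + t * wi for pi, wi in zip(p, w)]  # eq. (32)
```

For `w = [0, 2]` and `b = -2` the hyperplane is the line $y = 1$, so the point `(3, 4)` is projected to `(3, 1)`.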

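  20. Dimensionality reduction

      $\vec{z}_1$ and $\vec{z}_2$ are new points in a kernel space, so in order to compute a kernel we just compute a dot product:

      $$
      \vec{z}_1^T \vec{z}_2 = \left( \varphi(\vec{x}_1) - \frac{b + \vec{w}^T \varphi(\vec{x}_1)}{\|\vec{w}\|^2} \vec{w} \right)^T \left( \varphi(\vec{x}_2) - \frac{b + \vec{w}^T \varphi(\vec{x}_2)}{\|\vec{w}\|^2} \vec{w} \right) \tag{33}
      $$

      $$
      \begin{aligned}
      \vec{z}_1^T \vec{z}_2 &= \varphi(\vec{x}_1)^T \varphi(\vec{x}_2) - \frac{b + \vec{w}^T \varphi(\vec{x}_1)}{\|\vec{w}\|^2} \vec{w}^T \varphi(\vec{x}_2) \\
      &\quad - \varphi(\vec{x}_1)^T \frac{b + \vec{w}^T \varphi(\vec{x}_2)}{\|\vec{w}\|^2} \vec{w}
      + \frac{b + \vec{w}^T \varphi(\vec{x}_1)}{\|\vec{w}\|^2} \vec{w}^T \frac{b + \vec{w}^T \varphi(\vec{x}_2)}{\|\vec{w}\|^2} \vec{w}
      \end{aligned}
      \tag{34--35}
      $$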

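  21. Dimensionality reduction

      $$
      \begin{aligned}
      \vec{z}_1^T \vec{z}_2 &= \varphi(\vec{x}_1)^T \varphi(\vec{x}_2) - \frac{b + \vec{w}^T \varphi(\vec{x}_1)}{\|\vec{w}\|^2} \vec{w}^T \varphi(\vec{x}_2) \\
      &\quad - \varphi(\vec{x}_1)^T \vec{w} \, \frac{b + \vec{w}^T \varphi(\vec{x}_2)}{\|\vec{w}\|^2}
      + \frac{\left( b + \vec{w}^T \varphi(\vec{x}_1) \right) \left( b + \vec{w}^T \varphi(\vec{x}_2) \right)}{\|\vec{w}\|^2}
      \end{aligned}
      \tag{36--37}
      $$

      $$
      \begin{aligned}
      \vec{z}_1^T \vec{z}_2 &= \varphi(\vec{x}_1)^T \varphi(\vec{x}_2) - \frac{b \, \vec{w}^T \varphi(\vec{x}_2)}{\|\vec{w}\|^2} - \frac{2 \, \vec{w}^T \varphi(\vec{x}_1) \, \vec{w}^T \varphi(\vec{x}_2)}{\|\vec{w}\|^2} - \frac{b \, \varphi(\vec{x}_1)^T \vec{w}}{\|\vec{w}\|^2} \\
      &\quad + \frac{b^2 + b \, \vec{w}^T \varphi(\vec{x}_2) + b \, \vec{w}^T \varphi(\vec{x}_1) + \vec{w}^T \varphi(\vec{x}_1) \, \vec{w}^T \varphi(\vec{x}_2)}{\|\vec{w}\|^2}
      \end{aligned}
      \tag{38--39}
      $$

      We use this iteratively: in the next reduction we use the kernel values from the previous reduction, and in the first iteration we use the shift kernel. The new $\vec{w}$ will be perpendicular to the previously computed $\vec{w}$.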
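
Collecting the terms of (38)-(39) gives the shorter form $\vec{z}_1^T \vec{z}_2 = K(\vec{x}_1, \vec{x}_2) + \left( b^2 - g_1 g_2 \right) / \|\vec{w}\|^2$ with $g_k = \vec{w}^T \varphi(\vec{x}_k)$; this simplification is not stated on the slide. The sketch below evaluates both forms using kernel values only. It assumes $\vec{w} = \sum_k \beta_k \varphi(\vec{x}_k)$ with known coefficients $\beta_k$ (for example $y_k \alpha_k$ from a trained classifier); all function and parameter names are hypothetical.

```python
def linear(u, v):
    """Linear kernel, used here only to make the example checkable by hand."""
    return sum(a * b for a, b in zip(u, v))


def w_dot_phi(x, sv, beta, kernel):
    """w^T phi(x) for w = sum_k beta_k phi(sv_k)."""
    return sum(bk * kernel(xk, x) for xk, bk in zip(sv, beta))


def w_norm_sq(sv, beta, kernel):
    """||w||^2 expressed through kernel values."""
    return sum(bi * bj * kernel(xi, xj)
               for xi, bi in zip(sv, beta) for xj, bj in zip(sv, beta))


def projected_kernel_expanded(x1, x2, sv, beta, b, kernel):
    """Term-by-term evaluation of eqs. (38)-(39)."""
    g1, g2 = w_dot_phi(x1, sv, beta, kernel), w_dot_phi(x2, sv, beta, kernel)
    w2 = w_norm_sq(sv, beta, kernel)
    return (kernel(x1, x2) - b * g2 / w2 - 2.0 * g1 * g2 / w2 - b * g1 / w2
            + (b * b + b * g2 + b * g1 + g1 * g2) / w2)


def projected_kernel(x1, x2, sv, beta, b, kernel):
    """Simplified form: K(x1, x2) + (b^2 - g1 * g2) / ||w||^2."""
    g1, g2 = w_dot_phi(x1, sv, beta, kernel), w_dot_phi(x2, sv, beta, kernel)
    return kernel(x1, x2) + (b * b - g1 * g2) / w_norm_sq(sv, beta, kernel)
```

With the linear kernel, `w = (0, 2)` and `b = -2`, the points `(3, 4)` and `(1, 5)` project to `(3, 1)` and `(1, 1)`, whose dot product is 4; both functions reproduce that value. For the iterative scheme, `kernel` would be replaced by the projected kernel of the previous reduction.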

  22. Example of proposed curve fitting for the folium of Descartes

      Fig. 6: Prediction of the folium of Descartes. Parameters: RBF kernel, $\sigma = 1.5$, $C = 100.0$, $t = 0.005$, $\vec{c} = (0.1, 0.0)$.

  23. Example of proposed curve fitting for the folium of Descartes

      Fig. 7: Prediction of the folium of Descartes. Parameters: RBF kernel, $\sigma = 1.5$, $C = 10000.0$, $t = 0.005$, $\vec{c} = (0.1, 0.0)$.
