Clustering by Support Vector Manifold Learning

Marcin Orchel
AGH University of Science and Technology, Poland
Problem and My Contributions

Problem
Characterizations of clusters include the boundary, the center (prototype), the cluster core, and the characteristic manifold of a cluster. The multiple manifold learning problem is to fit multiple manifolds (hypersurfaces) to data points and to generalize to unseen data.

Approach
Support vector manifold learning (SVML) transforms a feature space into a kernel-induced feature space and then fits the data using a hypothesis space containing only hyperplanes, which generalizes well. For fitting the data with SVML, we need a regression method that works entirely in a kernel-induced feature space. SVML duplicates the points in the kernel-induced feature space, shifts the copies in the direction of any training vector, and solves a classification problem (see the sketch below).
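A minimal sketch of the duplicate-and-shift construction, restricted to a linear kernel so that the kernel-induced feature space coincides with the input space; the shift direction c, the parameter values, and the use of scikit-learn's SVC are illustrative assumptions rather than the exact procedure of the paper.

```python
import numpy as np
from sklearn.svm import SVC

def svml_linear_sketch(X, c, t=0.01, C=100.0):
    """Duplicate each point, shift the copies by +/- t*c, and solve a binary
    classification problem; the separating hyperplane then fits the data.
    Linear-kernel illustration of the SVML construction (assumption)."""
    X_up = X + t * c                      # points shifted "up", labeled +1
    X_down = X - t * c                    # points shifted "down", labeled -1
    X_shifted = np.vstack([X_up, X_down])
    y = np.hstack([np.ones(len(X)), -np.ones(len(X))])
    clf = SVC(kernel="linear", C=C).fit(X_shifted, y)
    return clf                            # clf.decision_function(x) ~ 0 on the fitted manifold

# usage: points sampled near the line y = x in 2-D
rng = np.random.default_rng(0)
X = np.column_stack([np.linspace(0, 1, 50)] * 2) + 0.01 * rng.normal(size=(50, 2))
c = X[0]                                  # shift in the direction of a training vector
model = svml_linear_sketch(X, c)
```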
Comparison of Manifold Learning Methods

Fig. 1: Manifold learning. Points—examples. (a) For points generated from a circle. Solid line—solution of one-class support vector machines (OCSVM) for C = 1.0, σ = 0.9; dashed line—solution of SVML for C = 100.0, σ = 0.9, t = 0.01; thin dotted line—solution of kernel principal component analysis (KPCA) for σ = 0.9. (b) For points generated from a Lissajous curve. Solid line—solution of OCSVM for C = 1000.0, σ = 0.5; dashed line—solution of SVML for C = 100000.0, σ = 0.8, t = 0.01; thin dotted line—solution of KPCA for σ = 0.5. (c) Solid line—solution of SVML for $\vec{c} = \vec{0}$, C = 100.0, σ = 0.9, t = 0.01; dashed line—solution of SVML for random values of $\vec{c}$, C = 100.0, σ = 0.9, t = 0.01.
Support Vector Manifold Learning (SVML)

The kernel function for two data points $\vec{x}_i$ and $\vec{x}_j$, for $i, j = 1, \ldots, n$, is

$$K(\vec{x}_i, \vec{x}_j) = K_o(\vec{x}_i, \vec{x}_j) + y_j t K_o(\vec{x}_i, \vec{c}) + y_i t K_o(\vec{c}, \vec{x}_j) + y_i y_j t^2 K_o(\vec{c}, \vec{c}),$$ (1)–(2)

where $\vec{c}$ is the shifting direction defined in the original feature space, $t$ is the translation parameter, $y_i = 1$ for a point shifted up, and $y_i = -1$ for a point shifted down. The cross kernel is

$$K(\vec{x}_i, \vec{x}) = K_o(\vec{x}_i, \vec{x}) + y_i t K_o(\vec{c}, \vec{x}).$$ (3)

The number of support vectors is at most $n + 1$. The solution is

$$\sum_{i=1}^{n} (\alpha_i - \alpha_{i+n}) K(\vec{x}_i, \vec{x}) + \sum_{i=1}^{n} (\alpha_i + \alpha_{i+n}) t K(\vec{c}, \vec{x}) + b = 0.$$ (4)
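A minimal sketch of the shifted kernel of Eqs. (1)–(2), assuming an RBF base kernel $K_o$ and a precomputed-kernel SVC for the classification step; function and variable names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def rbf(A, B, sigma=0.9):
    """Base kernel K_o: RBF between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def shifted_kernel(X, y, c, t=0.01, sigma=0.9):
    """K(x_i, x_j) = K_o(x_i, x_j) + y_j t K_o(x_i, c)
                     + y_i t K_o(c, x_j) + y_i y_j t^2 K_o(c, c)   (Eqs. 1-2)."""
    c = c.reshape(1, -1)
    Ko = rbf(X, X, sigma)
    Kxc = rbf(X, c, sigma)[:, 0]          # K_o(x_i, c)
    Kcc = rbf(c, c, sigma)[0, 0]          # K_o(c, c)
    return (Ko
            + t * (y[None, :] * Kxc[:, None])
            + t * (y[:, None] * Kxc[None, :])
            + (t ** 2) * np.outer(y, y) * Kcc)

# duplicated points: the first n copies labeled +1, the next n labeled -1
X = np.random.default_rng(0).normal(size=(30, 2))
Xd = np.vstack([X, X])
y = np.hstack([np.ones(30), -np.ones(30)])
c = X[0]                                  # shift in the direction of a training vector
K = shifted_kernel(Xd, y, c)
clf = SVC(kernel="precomputed", C=100.0).fit(K, y)
```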
Model with Shifted Hyperplanes

Proposition 1
Shifting a hyperplane with any value of $\vec{c}$ gives a new hyperplane that differs from the original one only in the free term $b$.

Lemma 1
After duplicating and shifting an $(n-1)$-dimensional hyperplane constrained by an $(n-1)$-dimensional hypersphere, the maximal distance from the original center of the hypersphere to any point of a shifted $(n-2)$-dimensional hypersphere is attained by a point such that, after projecting this point onto the $(n-1)$-dimensional hyperplane (before the shift), the vector from $\vec{0}$ to this point is parallel to the vector from $\vec{0}$ to the projected center of one of the shifted $(n-2)$-dimensional hyperspheres.
Model with Shifted Hyperplanes

Lemma 2
The radius $R_n$ of a minimal hypersphere containing both hyperplanes constrained by an $(n-1)$-dimensional hypersphere after shifting is equal to

$$R_n = \left\| \vec{c} + R \, \vec{c}_m / \|\vec{c}_m\| \right\|,$$ (5)

where $\vec{c}_m$ is defined as

$$\vec{c}_m = \vec{c} - \frac{b + \vec{w} \cdot \vec{c}}{\|\vec{w}\|^2} \, \vec{w}$$ (6)

and $\|\vec{c}_m\| \neq 0$. For $\|\vec{c}_m\| = 0$, we get $R_n = \sqrt{\|\vec{c}\|^2 + R^2}$.
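A small numeric sketch of Eqs. (5)–(6), assuming the hyperplane is given by $\vec{w} \cdot \vec{x} + b = 0$ in the original feature space; the example values are arbitrary.

```python
import numpy as np

def shifted_radius(c, w, b, R):
    """R_n from Lemma 2: project c onto the hyperplane w.x + b = 0 (Eq. 6),
    then take the norm of c + R * c_m / ||c_m||  (Eq. 5)."""
    c_m = c - (b + w @ c) / (w @ w) * w
    norm_cm = np.linalg.norm(c_m)
    if norm_cm == 0.0:                    # special case of Lemma 2
        return np.sqrt(c @ c + R ** 2)
    return np.linalg.norm(c + R * c_m / norm_cm)

# arbitrary example values
print(shifted_radius(np.array([1.0, 0.5]), np.array([0.0, 1.0]), -0.5, 1.0))
```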
Generalization Bounds for Shifted Hyperplanes

We can improve the generalization bounds when

$$\frac{D^2 \left\| \vec{c} + R \, \vec{c}_m / \|\vec{c}_m\| \right\|^2}{\left( 1 + D \|\vec{c}_p\| \right)^2} \leq R^2 D^2,$$ (7)

$$\frac{\left\| \vec{c} + R \, \vec{c}_m / \|\vec{c}_m\| \right\|^2}{\left( 1 + D \|\vec{c}_p\| \right)^2} \leq R^2.$$ (8)

For the special case when $\|\vec{c}_m\| = 0$, we get

$$\frac{\|\vec{c}_p\|^2}{\left( 1 + D \|\vec{c}_p\| \right)^2} \leq R^2.$$ (9)
Model with Shifted Hyperplanes

Proposition 2
When $\vec{c}_p$ is constant and $2\|\vec{c}_m\| \leq R$, then the solution of maximizing a margin between the two $(n-2)$-dimensional hyperspheres is equivalent to the hyperplane that contains the $(n-2)$-dimensional hypersphere before duplicating and shifting.
Performance Measure

For OCSVM, the distance between a point $\vec{r}$ and the minimal hypersphere in a kernel-induced feature space can be computed as

$$R - \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(\vec{x}_i, \vec{x}_j) - 2 \sum_{j=1}^{n} \alpha_j K(\vec{x}_j, \vec{r}) + K(\vec{r}, \vec{r}) \right)^{1/2}.$$ (10)–(11)

For kernels for which $K(\vec{x}, \vec{x})$ is constant, such as the radial basis function (RBF) kernel, the radius $R$ can be computed as

$$R = \sqrt{ K(\vec{x}, \vec{x}) + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(\vec{x}_i, \vec{x}_j) + 2 b^* }.$$ (12)
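A minimal sketch of Eqs. (10)–(12) for an RBF kernel, assuming the dual coefficients alpha and the offset b* are taken from an already trained OCSVM; the names and signatures here are illustrative, not a library API.

```python
import numpy as np

def rbf(A, B, sigma=0.9):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def ocsvm_radius(X, alpha, b_star, sigma=0.9):
    """Eq. (12): R for kernels with constant K(x, x), such as the RBF kernel."""
    quad = alpha @ rbf(X, X, sigma) @ alpha
    return np.sqrt(1.0 + quad + 2.0 * b_star)         # K(x, x) = 1 for the RBF kernel

def ocsvm_distance(r, X, alpha, b_star, sigma=0.9):
    """Eqs. (10)-(11): signed distance of r to the minimal hypersphere."""
    r = r.reshape(1, -1)
    quad = alpha @ rbf(X, X, sigma) @ alpha
    cross = alpha @ rbf(X, r, sigma)[:, 0]
    dist_to_center = np.sqrt(quad - 2.0 * cross + 1.0)  # K(r, r) = 1
    return ocsvm_radius(X, alpha, b_star, sigma) - dist_to_center
```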
Performance Measure

For SVML, the distance between a point $\vec{r}$ and the hyperplane in a kernel-induced feature space can be computed as

$$\frac{|\vec{w}_c \cdot \vec{r} + b_c|}{\|\vec{w}_c\|_2} = \frac{\left| \sum_{i=1}^{n_c} y_i \alpha_i^* K(\vec{x}_i, \vec{r}) + b_c \right|}{\sqrt{\sum_{i=1}^{n_c} \sum_{j=1}^{n_c} y_i y_j \alpha_i^* \alpha_j^* K(\vec{x}_i, \vec{x}_j)}}.$$ (13)–(14)
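A minimal sketch of Eqs. (13)–(14), assuming the coefficients $y_i \alpha_i^*$ and the free term $b_c$ come from the classifier trained on the shifted kernel (in scikit-learn these correspond to dual_coef_ and intercept_), and that the cross kernel of Eq. (3) is built from an RBF base kernel; names are illustrative.

```python
import numpy as np

def rbf(A, B, sigma=0.9):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cross_kernel(Xd, y, c, r, t=0.01, sigma=0.9):
    """Eq. (3): K(x_i, r) = K_o(x_i, r) + y_i t K_o(c, r)."""
    r = r.reshape(1, -1)
    c = c.reshape(1, -1)
    return rbf(Xd, r, sigma)[:, 0] + y * t * rbf(c, r, sigma)[0, 0]

def svml_distance(r, Xd, y, alpha_star, b_c, K_shifted, c, t=0.01, sigma=0.9):
    """Eqs. (13)-(14): distance of r to the SVML hyperplane in the
    kernel-induced feature space. K_shifted is the training kernel matrix of
    Eqs. (1)-(2); alpha_star holds the dual coefficients of the training points."""
    numer = abs(np.sum(y * alpha_star * cross_kernel(Xd, y, c, r, t, sigma)) + b_c)
    denom = np.sqrt((y * alpha_star) @ K_shifted @ (y * alpha_star))
    return numer / denom
```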
Comparison of Clustering Methods

First, we assign any two points to the same cluster if there do not exist two points between them with opposite signs of the functional margin. Second, we assign the remaining unassigned points to the clusters of their nearest neighbors among the already assigned points (see the sketch below).

Fig. 2: Clustering by manifold learning. Points—examples; filled points—support vectors. (a) Solid line—solution of support vector clustering (SVCL) for C = 10000.0, σ = 0.35. (b) Solid line—solution of support vector manifold learning clustering (SVMLC) for C = 100000.0, σ = 1.1, t = 0.01. (c) Solid line—solution of KPCA.
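A minimal sketch of the first assignment step, assuming a vectorized decision function f whose sign plays the role of the functional margin and a fixed number of sample points on each segment; the second, nearest-neighbor step is omitted, and all parameters and names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_assignment(X, f, n_samples=10):
    """Connect two points when no two sampled points between them have
    functional margins of opposite sign, then take connected components."""
    n = len(X)
    adj = np.zeros((n, n), dtype=int)
    lam = np.linspace(0.0, 1.0, n_samples)[:, None]
    for i in range(n):
        for j in range(i + 1, n):
            seg = (1 - lam) * X[i] + lam * X[j]     # points between x_i and x_j
            signs = np.sign(f(seg))
            if not (np.any(signs > 0) and np.any(signs < 0)):
                adj[i, j] = adj[j, i] = 1
    _, labels = connected_components(csr_matrix(adj), directed=False)
    return labels
```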
Results

For the manifold learning experiment, we report the average distance between the points and the solution in a kernel-induced feature space. We validate clustering on classification data sets, assuming that data samples belonging to the same cluster share the same class in the classification problem.

Table 1: Performance of SVMLC, SVCL, KPCA, SVML, OCSVM for real-world data, part 2. The numbers in the column descriptions denote the methods: 1 - SVMLC, 2 - SVCL, 3 - KPCA for the first row; 1 - SVML, 2 - OCSVM, 3 - KPCA for the second row. The test with id=0 aggregates all tests of the clustering experiment; the test with id=1 aggregates all tests of the manifold learning experiment. Column descriptions: rs – the average rank of the method for the mean error (the best method is marked with *); tsf – the Friedman statistic for the average ranks for the mean error (a significant value is marked with *); tsn – the Nemenyi statistic for the average ranks for the mean error, reported only when the Friedman statistic is significant (significant values are marked with *); sv – the average rank for the number of nonzero coefficients (support vectors for support vector machine (SVM) methods); the smallest value is marked with *.

id  rs1    rs2   rs3   tsf     tsn12   tsn13  tsn23  sv1    sv2   sv3
0   1.71*  1.93  2.36  4.5     –       –      –      2.83   1.67  1.5*
1   1.49*  2.98  1.53  33.09*  -4.82*  0.3    5.13*  1.51*  2.38  2.11