Kernel K-Means Low Rank Approximation for Spectral Clustering and Diffusion Maps
IDEAL 2014, Salamanca, Spain
Carlos M. Alaíz, Ángela Fernández, Yvonne Gala, José R. Dorronsoro
Departamento de Ingeniería Informática, Universidad Autónoma de Madrid
September 10, 2014
Contents
1 Introduction
2 SC, DM and Nyström
3 Kernel KASP
4 Numerical Experiments
5 Conclusions
1 Introduction
Introduction
- Spectral Clustering (SC) and Diffusion Maps (DM) are two of the leading methods for advanced clustering and dimensionality reduction.
  - They require the eigenanalysis of a matrix whose dimension $N$ equals the sample size, at a cost of $O(N^3)$.
  - It is difficult to compute the SC or DM projections of new patterns, as these projections are eigenvector components.
- The Nyström approach makes it possible to extend an eigenanalysis to new points, so it can be used for new patterns.
- To deal with the cost, a common approach is to subsample the original patterns, retaining a small subset that is used to define a first embedding, which is then extended to the entire sample.
- A proper subsampling can be critical for the performance of this approach.
C. M. Alaíz et al. (EPS–UAM), KKM Approximation for SC and DM, September 10, 2014
2 SC, DM and Nyström
- Spectral Clustering
- Diffusion Maps
- Nyström Extension
Spectral Clustering
SC is a manifold learning method for clustering. Scheme:
1. An appropriate similarity matrix $W$ is built over the sample $S = \{x_1, \dots, x_N\}$. This defines a weighted graph $G$.
2. The random walk Laplacian is defined as $L_{rw} = I - D^{-1} W = I - P$, where $D$ is the diagonal degree matrix, $D_{ii} = d_i = \sum_j w_{ij}$.
3. $K$-means is applied over the spectral projections $v(x_i) = (v^1_i, \dots, v^m_i)^\top$ of each sample point $x_i$, where $\{v^p\}_{p=0}^{N-1}$ are the right eigenvectors of $L_{rw}$ (or $P$) and $m$ is the chosen projection dimension.
The SC coordinates $(v^1_i, \dots, v^m_i)^\top$ can also be used for dimensionality reduction purposes.
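A minimal NumPy sketch of steps 1 and 2 and of the spectral projections used in step 3. The Gaussian similarity, the function names, and the route through the symmetric matrix $D^{-1/2} W D^{-1/2}$ to obtain the right eigenvectors of $P$ are assumptions made for illustration; they are not stated on the slide.

```python
import numpy as np

def gaussian_similarity(X, sigma):
    """Dense similarity matrix with w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def spectral_embedding(W, m):
    """Return the m leading nontrivial eigenvalues and right eigenvectors of P = D^{-1} W."""
    d = W.sum(axis=1)                      # degrees d_i = sum_j w_ij
    s = 1.0 / np.sqrt(d)
    # P is similar to the symmetric matrix D^{-1/2} W D^{-1/2}, so eigh can be used.
    lam, U = np.linalg.eigh(W * s[:, None] * s[None, :])
    order = np.argsort(lam)[::-1]          # sort eigenvalues in decreasing order
    lam, U = lam[order], U[:, order]
    V = U * s[:, None]                     # right eigenvectors of P: v = D^{-1/2} u
    # Skip the trivial pair (eigenvalue 1, constant eigenvector).
    return lam[1:m + 1], V[:, 1:m + 1]
```

Row i of the returned matrix plays the role of $v(x_i)$; any K-means implementation can then be run on these rows.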
Diffusion Maps
Diffusion Maps add some improvements to SC. Scheme:
1. $W$ is normalized to reflect the role of the sample density. In particular, $w^{(\alpha)}_{ij} = w_{ij} / (d_i^\alpha d_j^\alpha)$ for $0 \le \alpha \le 1$.
   - If $\alpha = 0$, $W^{(\alpha)}$ is the previously defined $W$.
   - If $\alpha = 1$, the effect of the density is compensated.
2. A Markov probability matrix is defined on the graph $G$ as $P^{(\alpha)} = (D^{(\alpha)})^{-1} W^{(\alpha)}$.
3. The diffusion distance for $t$ steps over the graph $G$ is given by $D_t(x_i, x_j)^2 = \sum_{k=1}^{N-1} \lambda_k^{2t} (v^k_i - v^k_j)^2$, with $v^k$ and $\lambda_k$ the eigenvectors and eigenvalues of $P^{(\alpha)}$.
4. The embedding is given by $\Psi_t(x_i) = (\lambda_1^t v^1_i, \dots, \lambda_{N-1}^t v^{N-1}_i)^\top$; the Euclidean distance between $\Psi_t(x_i)$ and $\Psi_t(x_j)$ is precisely $D_t(x_i, x_j)$.
DM lends itself to dimensionality reduction and clustering, selecting the first $m$ coordinates and using $K$-means on the $\Psi$ projections.
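The four steps can be sketched in NumPy as follows. This is an illustration under assumptions: the function name `diffusion_map` is hypothetical, $t$ is taken to be a positive integer, and the eigenvectors are scaled as $v = (D^{(\alpha)})^{-1/2} u$ with $u$ orthonormal, a choice the slide does not specify.

```python
import numpy as np

def diffusion_map(W, alpha=1.0, t=1, m=2):
    """Return the first m diffusion coordinates Psi_t(x_i), one row per pattern."""
    d = W.sum(axis=1)
    # Step 1: density normalization, w_ij / (d_i^alpha d_j^alpha).
    W_a = W / np.outer(d**alpha, d**alpha)
    # Step 2: Markov matrix P = D_a^{-1} W_a, handled through its symmetric conjugate.
    d_a = W_a.sum(axis=1)
    s = 1.0 / np.sqrt(d_a)
    lam, U = np.linalg.eigh(W_a * s[:, None] * s[None, :])
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    V = U * s[:, None]                     # right eigenvectors of P
    # Steps 3 and 4: drop the trivial pair and scale coordinate k by lambda_k^t.
    return (lam[1:m + 1] ** t) * V[:, 1:m + 1]
```

With this scaling and $m = N - 1$, squared Euclidean distances between rows coincide with the degree-weighted squared distance between the corresponding rows of the $t$-step transition matrix.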
Nyström Extension
SC and DM share two drawbacks:
- The cost of the eigenanalysis they require.
- The difficulty of computing the SC or DM projections of new, unseen patterns.
Both can be dealt with using the Nyström extension.
For a kernel $a(x_i, x_j)$ and its kernel matrix $A$, with $AU = U\Lambda$ its eigendecomposition, the Nyström extension to a new pattern $x$ is the approximation $\tilde{u}^k(x)$ to the true $u^k(x)$ given by $\tilde{u}^k(x) = \frac{1}{\lambda_k} \sum_{j=1}^{N} a(x, x_j) u^k_j$.
This approach can also be applied to the asymmetric matrix $P$, so its eigenvectors can be extended as $\tilde{v}^k(x) = \frac{1}{\lambda_k} \sum_{j=1}^{N} P(x, x_j) v^k_j$.
Therefore, an embedding can be built using just a subsample and then extended to new points using Nyström.
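The extension formula for a symmetric kernel matrix is a single matrix product. A small sketch, assuming a Gaussian kernel and the hypothetical helper names `gaussian_kernel` and `nystrom_extend`:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Kernel values a(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

def nystrom_extend(a_new, U, lam):
    """Nystrom extension u~^k(x) = (1 / lam_k) * sum_j a(x, x_j) u^k_j.

    a_new: (M, N) kernel values between M new patterns and the N sample patterns.
    U:     (N, m) eigenvectors of the sample kernel matrix, one per column.
    lam:   (m,) matching eigenvalues, assumed to be nonzero.
    """
    return (a_new @ U) / lam
```

Applied to a pattern that already belongs to the sample, the formula returns that pattern's own eigenvector components, since $AU = U\Lambda$.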
Reconstruction Error
In order to compare different subsamples, some quality measure is needed.
Let $W$ and $P$ be structured as
$$W = \begin{pmatrix} \tilde{W} & B^\top \\ B & C \end{pmatrix}, \qquad P = D^{-1} W = \begin{pmatrix} \tilde{P} & B'_P \\ B_P & C_P \end{pmatrix},$$
where $\tilde{W}$ is the $K \times K$ similarity matrix of a $K$-pattern subsample $\tilde{S}$.
Considering only the subsample of the first $K$ patterns, the eigenanalysis $\tilde{W} = \tilde{U} \tilde{\Lambda} \tilde{U}^\top$ can be used to approximate that of $W$ using Nyström, with
$$U' = \begin{pmatrix} \tilde{U} \\ B \tilde{U} \tilde{\Lambda}^{-1} \end{pmatrix}, \qquad W' = U' \tilde{\Lambda} U'^\top = \begin{pmatrix} \tilde{W} & B^\top \\ B & B \tilde{W}^{-1} B^\top \end{pmatrix},$$
$$P' = D^{-1} W' = \begin{pmatrix} \tilde{P} & B'_P \\ B_P & B_P \tilde{P}^{-1} B'_P \end{pmatrix}.$$
A possible measure to compare different ways of selecting $\tilde{S}$ is the reconstruction error between the real $P$ and the Nyström approximation $P'$,
$$d_F(P, P') = \|P - P'\|_F = \|C_P - B_P \tilde{P}^{-1} B'_P\|_F.$$
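The error $d_F(P, P')$ only involves the four blocks of $P$, so it can be computed without forming $P'$. A sketch, assuming that $\tilde{P}$ is invertible and reasonably well conditioned (a pseudo-inverse may be safer otherwise); the function name and the use of an arbitrary index set instead of the first $K$ patterns are illustrative choices.

```python
import numpy as np

def nystrom_reconstruction_error(W, idx):
    """Frobenius error ||C_P - B_P P~^{-1} B'_P||_F for the subsample given by idx."""
    idx = np.asarray(idx)
    rest = np.setdiff1d(np.arange(W.shape[0]), idx)
    P = W / W.sum(axis=1, keepdims=True)   # P = D^{-1} W with the full-sample degrees
    P_tilde = P[np.ix_(idx, idx)]          # subsample block
    B_prime = P[np.ix_(idx, rest)]         # upper-right block B'_P
    B_P = P[np.ix_(rest, idx)]             # lower-left block
    C_P = P[np.ix_(rest, rest)]            # block that Nystrom has to reconstruct
    approx = B_P @ np.linalg.solve(P_tilde, B_prime)
    return np.linalg.norm(C_P - approx, ord="fro")
```

When $W$ has rank $K$ and the subsample block has full rank, the reconstruction is exact and the error vanishes up to rounding; this gives a convenient sanity check.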
3 Kernel KASP
- KASP
- kKASP
K-Means and KASP
K-Means
Scheme:
1. $K$ initial centroids are chosen, $\{C^0_k\}_{k=1}^K$.
2. Sample patterns $x_p$ are associated to their nearest centroid, giving a first set of clusters $\{\mathcal{C}^0_k\}_{k=1}^K$, with $x_p \in \mathcal{C}^0_k$ if $k = \arg\min_\ell \|x_p - C^0_\ell\|$.
3. The new centroids $C^1_k$ are the means of the clusters $\mathcal{C}^0_k$, and they are used to define a new set of clusters $\mathcal{C}^1_k$.
4. This is repeated until no changes are made.
This algorithm progressively minimizes the within-cluster sum of squares $\sum_{k=1}^{K} \sum_{x_p \in \mathcal{C}^i_k} \|x_p - C^i_k\|^2$.
K-Means-based Approximate Spectral Clustering (KASP)
KASP uses standard $K$-means to build a set of representative centroids, over which spectral clustering is performed.
In order to compute $d_F(P, P')$, each centroid is approximated by its nearest pattern, and these pseudo-centroids are used as the subsample.
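A compact sketch of the scheme and of the pseudo-centroid selection. The random choice of initial centroids among the patterns, the handling of empty clusters, and the function names are assumptions made for this example.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means (Lloyd iterations); returns the centroids and the cluster labels."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Assign every pattern to its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                          # no assignment changed: converged
        labels = new_labels
        # Move each centroid to the mean of its cluster (kept as is if the cluster is empty).
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids, labels

def pseudo_centroids(X, centroids):
    """Indices of the patterns nearest to each centroid, without repetitions."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.unique(d2.argmin(axis=0))
```

The indices returned by `pseudo_centroids` identify the subsample $\tilde{S}$ on which the spectral analysis would then be carried out.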
Kernel K-Means and Kernel KASP
Kernel K-Means
$K$-means can be enhanced in a kernel setting by replacing the sample patterns $x$ with nonlinear extensions $\Phi(x)$.
If $\Phi$ corresponds to a reproducing kernel $K$, the distances $\|\Phi(x_p) - C^i_k\|^2$ can be computed without working explicitly with $\Phi(x)$:
$$\|\Phi(x_p) - C^i_k\|^2 = K(x_p, x_p) + \frac{1}{|\mathcal{C}^i_k|^2} \sum_{x_q, x_r \in \mathcal{C}^i_k} K(x_q, x_r) - \frac{2}{|\mathcal{C}^i_k|} \sum_{x_q \in \mathcal{C}^i_k} K(x_p, x_q).$$
Thus the previous Euclidean $K$-means procedure extends straightforwardly to a kernel setting.
Our Proposal: Kernel KASP (kKASP)
The proposal is analogous to the KASP approach, but based on kernel $K$-means.
The centroids are not available explicitly, so they are substituted by the pseudo-centroids (with respect to the kernel).
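The distance identity follows from writing the implicit centroid as the feature-space mean of its cluster and expanding the squared norm. A short derivation, using only $K(x, y) = \langle \Phi(x), \Phi(y) \rangle$, with all sums running over the cluster $\mathcal{C}^i_k$:

```latex
\begin{aligned}
C^i_k &= \frac{1}{|\mathcal{C}^i_k|} \sum_{x_q \in \mathcal{C}^i_k} \Phi(x_q), \\
\|\Phi(x_p) - C^i_k\|^2
  &= \langle \Phi(x_p), \Phi(x_p) \rangle
   - 2 \langle \Phi(x_p), C^i_k \rangle
   + \langle C^i_k, C^i_k \rangle \\
  &= K(x_p, x_p)
   - \frac{2}{|\mathcal{C}^i_k|} \sum_{x_q \in \mathcal{C}^i_k} K(x_p, x_q)
   + \frac{1}{|\mathcal{C}^i_k|^2} \sum_{x_q, x_r \in \mathcal{C}^i_k} K(x_q, x_r).
\end{aligned}
```

The middle term of the last line is the only one that depends jointly on the pattern and the cluster; the last term is a per-cluster constant that can be computed once per iteration.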
kKASP Algorithm
Require: $S = (x_1, \dots, x_N)$; $K$, the subsample size;
1: Apply kernel $K$-means on $S$ and select $K$ pseudo-centroids $\tilde{S}_K = \{z_1, \dots, z_K\}$;
2: Perform the eigenanalysis of the matrix $P_K$ associated to $\tilde{S}_K$;
3: Compute the Nyström extensions $\tilde{V}_K$;
4: If desired, perform dimensionality reduction on the $\tilde{V}_K$ and clustering;
The complexity analysis of the kKASP approach is straightforward:
- Kernel $K$-means: $O(KNI)$, with $I$ the number of iterations, plus the cost $O(N^2)$ of pre-computing the similarity matrix.
- Eigenanalysis of $P_K$: $O(K^3)$.
- Nyström extensions: $O(KN)$.
A DM over the entire sample would require the eigenanalysis of the complete matrix: $O(N^3)$.
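Steps 1 to 3 can be put together in a short NumPy sketch. Several details are assumptions, since the slide does not fix them: the kernel K-means initialization from $K$ random patterns, the row normalization of the similarities to the pseudo-centroids that is used as $P(x, z_j)$ in the extension, and both function names.

```python
import numpy as np

def kernel_kmeans_pseudo_centroids(Kmat, K, n_iter=50, seed=0):
    """Kernel K-means on a precomputed kernel matrix; returns pseudo-centroid indices."""
    N = Kmat.shape[0]
    diag = np.diag(Kmat)
    rng = np.random.default_rng(seed)
    seeds = rng.choice(N, size=K, replace=False)     # initial centroids: K random patterns
    labels = (diag[:, None] + diag[seeds][None, :] - 2.0 * Kmat[:, seeds]).argmin(axis=1)
    for _ in range(n_iter):
        dist2 = np.full((N, K), np.inf)              # collapsed clusters stay at infinity
        for k in np.unique(labels):
            members = np.flatnonzero(labels == k)
            within = Kmat[np.ix_(members, members)].sum() / members.size**2
            cross = Kmat[:, members].sum(axis=1) / members.size
            dist2[:, k] = diag + within - 2.0 * cross
        new_labels = dist2.argmin(axis=1)
        converged = np.array_equal(new_labels, labels)
        labels = new_labels
        if converged:
            break
    # Pattern nearest (in feature space) to each surviving centroid, without repetitions.
    return np.unique(dist2[:, np.unique(labels)].argmin(axis=0))

def kkasp_embedding(W, K, m=2, seed=0):
    """Steps 1-3: subsample, eigenanalysis of P_K, Nystrom extension to all patterns."""
    idx = kernel_kmeans_pseudo_centroids(W, K, seed=seed)
    W_K = W[np.ix_(idx, idx)]
    s = 1.0 / np.sqrt(W_K.sum(axis=1))
    lam, U = np.linalg.eigh(W_K * s[:, None] * s[None, :])
    order = np.argsort(lam)[::-1][1:m + 1]           # skip the trivial eigenvalue 1
    lam, V = lam[order], (U * s[:, None])[:, order]  # eigenpairs of P_K = D_K^{-1} W_K
    B = W[:, idx]                                    # similarities to the pseudo-centroids
    P_ext = B / B.sum(axis=1, keepdims=True)         # assumed form of P(x, z_j)
    return idx, lam, (P_ext @ V) / lam               # extended eigenvectors, one row per pattern
```

Because kernel $K$-means can collapse clusters, the returned subsample may hold fewer than $K$ indices. On the pseudo-centroids themselves the extension reproduces the eigenvectors of $P_K$.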
4 Numerical Experiments
- Experimental Framework
- Results
Framework (I)
- The similarity matrix $W$ is defined with a Gaussian kernel whose width parameter $\sigma$ is set to the 10th percentile of all pairwise distances.
- The distance $d_F(P, P')$ is used as a quality measure, where $P = D^{-1} W$ is the transition probability matrix of SC.
- Models:
  - $S_r$: random selection.
  - $S_k$: KASP selection.
  - $S_{kk1}$: kKASP selection using as kernel parameter $\sigma$ the 1st percentile. It is more local, thus producing more clusters.
  - $S_{kk10}$: kKASP selection using as kernel parameter $\sigma$ the 10th percentile. The kernel matrix is the similarity matrix $W$.
- Sizes: 10, 50, 100, 200, 300, 400, 500, 750 and 1,000. For $S_r$ and $S_k$ these are the final sizes but, for $S_{kk1}$ and $S_{kk10}$, kernel $K$-means can collapse some of the clusters, giving a smaller subsample.
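The percentile rule for $\sigma$ can be sketched as below. Two points are assumptions: each unordered pair of patterns is counted once and self-distances are excluded, and the percentile is taken over distances rather than squared distances. The function name is hypothetical.

```python
import numpy as np

def percentile_sigma(X, q):
    """Kernel width taken as the q-th percentile of the pairwise Euclidean distances."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    iu = np.triu_indices(len(X), k=1)      # every unordered pair once, no self-distances
    return np.percentile(np.sqrt(d2[iu]), q)
```

Under these assumptions, `percentile_sigma(X, 10)` would give the width used for $W$ and `percentile_sigma(X, 1)` the more local width of the first kKASP model.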