SPECTRAL CLUSTERING OF LARGE NETWORKS
A. Fender, N. Emad, S. Petiton, M. Naumov
May 8th, 2017
AGENDA
• Introduction
• Laplacian
• Modularity
• Conclusions
THE CLUSTERING PROBLEM
Example: detect relevant groups based on frequent co-purchasing on Amazon.com.
Data: V. Krebs, 2004.
Visualization: M. Bastian, S. Heymann, and M. Jacomy. “Gephi: An Open Source Software for exploring and manipulating networks.” 2009.
THE CLUSTERING PROBLEM
Legend: pink = liberal, yellow = neutral, green = conservative.
Data: V. Krebs, 2004.
Visualization: M. Bastian, S. Heymann, and M. Jacomy. “Gephi: An Open Source Software for exploring and manipulating networks.” 2009.
CLUSTERING ALGORITHMS
• Spectral: build a matrix, solve an eigenvalue problem ($A x = \lambda x$), use the eigenvectors for clustering
• Hierarchical / Agglomerative: build a hierarchy (fine to coarse), partition the coarse level, propagate the results back to the fine level
• Local refinements: switch one node at a time
Laplacian
RATIO CUT COST FUNCTION
Objective function:
$$\mathrm{Cost} = \sum_{i=1}^{p} \frac{|\partial E_i|}{|V_i|}$$
where $|\partial E_i|$ is the number of edges cut and $|V_i|$ is the number of nodes in the i-th partition.
Example: $\mathrm{Cost} = \frac{2}{2} + \frac{2}{3} = \frac{5}{3}$
A compromise between a small edge cut and balanced partitions.
M. Naumov, T. Moon. “Parallel spectral graph partitioning.” Nvidia Technical Report, 2016.
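The cost function above can be checked by hand on a small graph. The sketch below is only an illustration: the edge list and the two-way split are assumptions chosen so that the result matches the 2/2 + 2/3 example, and `ratio_cut_cost` is a hypothetical helper name, not part of any library.

```python
from fractions import Fraction

def ratio_cut_cost(edges, partition):
    """Sum over partitions of (# edges leaving the partition) / (# nodes in it)."""
    cost = Fraction(0)
    for part in sorted(set(partition.values())):
        nodes = {n for n, p in partition.items() if p == part}
        # An edge is cut when exactly one endpoint lies inside the partition.
        cut = sum(1 for u, v in edges if (u in nodes) != (v in nodes))
        cost += Fraction(cut, len(nodes))
    return cost

# Assumed 5-node example graph (1-based node labels) and 2-way split.
edges = [(1, 2), (2, 3), (2, 4), (3, 4), (3, 5)]
partition = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

cost = ratio_cut_cost(edges, partition)
print(cost)  # 5/3
```

Exact fractions are used so that the comparison with 5/3 does not depend on floating-point rounding.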
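GRAPH LAPLACIAN
L = D − A, where D is the degree matrix and A is the adjacency matrix.
For the example graph with 5 nodes:
$$
\underbrace{\begin{pmatrix}
 1 & -1 &  0 &  0 &  0 \\
-1 &  3 & -1 & -1 &  0 \\
 0 & -1 &  3 & -1 & -1 \\
 0 & -1 & -1 &  2 &  0 \\
 0 &  0 & -1 &  0 &  1
\end{pmatrix}}_{L}
=
\underbrace{\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 3 & 0 & 0 & 0 \\
0 & 0 & 3 & 0 & 0 \\
0 & 0 & 0 & 2 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}}_{D}
-
\underbrace{\begin{pmatrix}
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0
\end{pmatrix}}_{A}
$$
M. Naumov, T. Moon. “Parallel spectral graph partitioning.” Nvidia Technical Report, 2016.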
GRAPH LAPLACIAN
For a vector x with elements that are 0 or 1, marking the nodes that belong to $V_1$:
• Number of edges cut: $x^T L x = |\partial E_1| = 2$
• Number of elements: $x^T x = |V_1| = 2$
M. Naumov, T. Moon. “Parallel spectral graph partitioning.” Nvidia Technical Report, 2016.
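The two identities can be verified numerically. This is a minimal sketch under stated assumptions: the edge list (0-based) is taken to be the 5-node example graph, x is assumed to mark the first two nodes, and `laplacian` and `quad_form` are hypothetical helper names.

```python
def laplacian(n, edges):
    """Dense L = D - A for an undirected, unweighted graph on nodes 0..n-1."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1  # degree contributions (D)
        L[v][v] += 1
        L[u][v] -= 1  # adjacency contributions (-A)
        L[v][u] -= 1
    return L

def quad_form(M, x):
    """Return x^T M x."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))

# Assumed example graph (0-based) and indicator vector of V_1 = {0, 1}.
edges = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4)]
L = laplacian(5, edges)
x = [1, 1, 0, 0, 0]

edges_cut = quad_form(L, x)           # x^T L x
num_nodes = sum(xi * xi for xi in x)  # x^T x
print(edges_cut, num_nodes)  # 2 2
```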
MINIMIZATION PROBLEM
$$\min_{V_i} \sum_{i=1}^{p} \frac{|\partial E_i|}{|V_i|} = \min_{x_i} \sum_{i=1}^{p} \frac{x_i^T L x_i}{x_i^T x_i}$$
where $x_i \in \{0,1\}^n$ and $x_i \perp x_j$.
Next step: relax the requirements on x, and let $x_i$ take real values.
M. Naumov, T. Moon. “Parallel spectral graph partitioning.” Nvidia Technical Report, 2016.
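Once the $x_i$ may take real values, each term is a Rayleigh quotient of L, and such quotients are minimized by eigenvectors belonging to the smallest eigenvalues. The sketch below illustrates this on the 5-node example Laplacian. It is not the eigensolver used in the talk: it swaps in a plain shifted power iteration, assumes a connected graph, and `fiedler_vector` is a hypothetical helper name.

```python
import math

def fiedler_vector(L, iters=500):
    """Approximate the eigenvector of L for its smallest nonzero eigenvalue.

    Runs power iteration on (c*I - L) while projecting out the constant
    vector, which is the eigenvector of L for eigenvalue 0.
    """
    n = len(L)
    c = 2 * max(L[i][i] for i in range(n))  # Gershgorin bound: eigenvalues of L <= 2 * max degree
    x = [float(i) for i in range(n)]        # deterministic, non-constant start vector
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]         # remove the constant component
        y = [c * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]
    return x

# Assumed Laplacian of the 5-node example graph.
L = [
    [1, -1, 0, 0, 0],
    [-1, 3, -1, -1, 0],
    [0, -1, 3, -1, -1],
    [0, -1, -1, 2, 0],
    [0, 0, -1, 0, 1],
]
v = fiedler_vector(L)
print([round(vi / max(abs(w) for w in v), 1) for vi in v])
```

The signs of the entries already suggest a split of the nodes. An entry close to zero has no reliable sign, which is one reason to cluster the points rather than threshold them.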
K-MEANS POINTS CLUSTERING
Points: $p_1, p_2, \dots, p_l$. Centroids: $c_1, \dots, c_k$.
Lloyd’s algorithm:
• Select centroids
• Compute distance of points to centroids
• Assign points to the closest centroid
• Recompute centroid position
M. Naumov, T. Moon. “Parallel spectral graph partitioning.” Nvidia Technical Report, 2016.
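The four steps of Lloyd's algorithm can be sketched as follows. The sample points, the choice of initial centroids, and the stopping rule (stop when assignments no longer change) are assumptions made for illustration, and `lloyd_kmeans` is a hypothetical helper name.

```python
def lloyd_kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm on points given as tuples; returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]  # step 1: initial centroids are supplied
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 3: compute squared distances and assign to the closest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = lloyd_kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the initial centroids, so practical implementations usually try several initializations.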
EDGE CUT MINIMIZATION PIPELINE
Graph → (preprocessing) → Laplacian → (eigensolver) → Points → (clustering) → Clustering
Example for the 5-node graph: the eigensolver returns $x_1 = (1.0, 1.0, 1.0, 1.0, 1.0)^T$ and $x_2 = (-1.0, -0.3, 0.3, 0.0, 1.0)^T$, and row i of $[x_1, x_2]$ is the point for node i.
SPECTRAL EDGE CUT MINIMIZATION
[Figure: balanced cut minimization result next to the ground truth]
80% hit rate
Modularity
MODULARITY FUNCTION
Measures the difference between how well vertices are assigned into clusters for the current graph G = (V,E) versus a random graph R = (V,F).
$$Q = \frac{1}{2\omega} \sum_{i} \sum_{j} \left( w_{ij} - \frac{v_i v_j}{2\omega} \right) \delta_{c(i)\,c(j)}$$
for some assignment c(.) into clusters.
A. Fender, N. Emad, S. Petiton, M. Naumov. “Parallel Modularity Clustering.” ICCS, 2017.
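The double sum can be evaluated directly on a small graph. This sketch assumes the usual reading of the symbols, which the slide does not spell out: $w_{ij}$ is the adjacency weight (1 for each edge here), $v_i$ is the degree of vertex i, and $2\omega$ is the sum of all degrees. The graph, the assignment, and the function name `modularity` are illustrative.

```python
def modularity(n, edges, cluster):
    """Q = 1/(2w) * sum_ij (w_ij - v_i * v_j / (2w)) * delta(c(i), c(j))."""
    W = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        W[u][v] = W[v][u] = 1.0       # assumption: unweighted, undirected edges
    deg = [sum(row) for row in W]     # v_i
    two_w = sum(deg)                  # 2 * omega
    q = 0.0
    for i in range(n):
        for j in range(n):
            if cluster[i] == cluster[j]:  # the delta term
                q += W[i][j] - deg[i] * deg[j] / two_w
    return q / two_w

# Assumed 5-node example graph (0-based) and a 2-cluster assignment c(.).
edges = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4)]
cluster = [0, 0, 1, 1, 1]
q = modularity(5, edges, cluster)
print(round(q, 6))  # 0.08
```

Putting every vertex into one cluster gives Q = 0 under these definitions, which is a quick sanity check for an implementation.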
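MODULARITY MATRIX
Let matrix
$$B = A - \frac{1}{2\omega} v v^T$$
then modularity
$$Q = \frac{1}{2\omega} \mathrm{Tr}(X^T B X)$$
where Tr(.) is the trace (sum of diagonal elements) and $X = [x_1, \dots, x_p]$ is such that $x_{ik} = 1$ if $c(i) = k$.
For the example graph with 5 nodes:
$$
v = \begin{pmatrix} 1 \\ 3 \\ 3 \\ 2 \\ 1 \end{pmatrix},
\qquad
A = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0
\end{pmatrix}
$$
A. Fender, N. Emad, S. Petiton, M. Naumov. “Parallel Modularity Clustering.” ICCS, 2017.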
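The trace form can be checked on the same small example. In this sketch, v is assumed to hold the vertex degrees with $2\omega$ their sum, and entries of X are assumed to be 0 wherever $c(i) \neq k$. `modularity_matrix` and `trace_modularity` are hypothetical helper names.

```python
def modularity_matrix(A):
    """Return (B, 2w) with B = A - v v^T / (2w), v = row sums of A."""
    n = len(A)
    v = [sum(row) for row in A]
    two_w = sum(v)
    B = [[A[i][j] - v[i] * v[j] / two_w for j in range(n)] for i in range(n)]
    return B, two_w

def trace_modularity(A, cluster, p):
    """Q = Tr(X^T B X) / (2w), with X the n-by-p cluster indicator matrix."""
    n = len(A)
    B, two_w = modularity_matrix(A)
    X = [[1.0 if cluster[i] == k else 0.0 for k in range(p)] for i in range(n)]
    # Tr(X^T B X) equals the sum over clusters k of x_k^T B x_k.
    tr = sum(X[i][k] * B[i][j] * X[j][k]
             for k in range(p) for i in range(n) for j in range(n))
    return tr / two_w

# Assumed adjacency matrix of the 5-node example graph and a 2-cluster assignment.
A = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
B, two_w = modularity_matrix(A)
q = trace_modularity(A, cluster=[0, 0, 1, 1, 1], p=2)
print(round(q, 6))  # 0.08
```

Every row of B sums to zero under these definitions, so the constant vector is an eigenvector of B with eigenvalue 0.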
MODULARITY MAXIMIZATION PIPELINE
Graph → (preprocessing) → Modularity matrix $A - \frac{1}{2\omega} v v^T$ → (eigensolver) → Points → (clustering) → Clustering
[Figure: adjacency matrix of the example graph, the eigenvectors $x_4$ and $x_5$ of the modularity matrix, and the resulting points]
SPECTRAL MODULARITY MAXIMIZATION
[Figure: spectral modularity maximization result next to the ground truth]
84% hit rate
PROFILING
• The eigensolver takes 90% of the time
• The sparse matrix-vector multiplication takes 90% of the time in the eigensolver
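The kernel that dominates the eigensolver is a sparse matrix-vector product y = M x. The slides do not say which storage format is used, so the sketch below assumes the common compressed sparse row (CSR) layout. It is a sequential illustration of the arithmetic, not GPU code, and `csr_from_dense` and `csr_spmv` are hypothetical helper names.

```python
def csr_from_dense(M):
    """Convert a dense matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in M:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row in values/col_idx
    return values, col_idx, row_ptr

def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = M x, touching only the stored nonzeros of each row."""
    y = []
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        y.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return y

# Assumed Laplacian of the 5-node example graph.
L = [
    [1, -1, 0, 0, 0],
    [-1, 3, -1, -1, 0],
    [0, -1, 3, -1, -1],
    [0, -1, -1, 2, 0],
    [0, 0, -1, 0, 1],
]
values, col_idx, row_ptr = csr_from_dense(L)
y = csr_spmv(values, col_idx, row_ptr, [1, 1, 0, 0, 0])
print(y)  # [0, 2, -1, -1, 0]
```

The work per product is proportional to the number of stored nonzeros, which is why this kernel sets the cost of each eigensolver iteration on large sparse graphs.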
MODULARITY VS. LAPLACIAN CLUSTERING
• Modularity: higher and steadier modularity score; 3x speedup over Laplacian
• Laplacian: homogeneous cluster sizes
Hardware: Nvidia Titan X (Pascal), Intel Core i7-3930K @3.2 GHz
SPEEDUP AND QUALITY VS. AGGLOMERATIVE*
• Speedup: 3x over the agglomerative* scheme; 0.8s on a network with 100 million edges on a single Titan X GPU
• Tradeoff: speed vs. quality
*: D. LaSalle and G. Karypis. “Multi-threaded Modularity Based Graph Clustering Using the Multilevel Paradigm.” Parallel Distrib. Comput., Vol. 76, pp. 66-80, 2015.
Hardware: Nvidia Titan X (Pascal), Intel Core i7-3930K @3.2 GHz
Spectral Clustering in CUDA Toolkit 9.0 release of nvGRAPH
CUDA TOOLKIT 9.0 nvGRAPH API

    nvgraphStatus_t nvgraphSpectralClustering(
        nvgraphHandle_t handle,
        const nvgraphGraphDescr_t graph_descr,
        const size_t weight_index,
        const struct SpectralClusteringParameter *params,
        int *clustering,
        void *eig_vals,
        void *eig_vects);

    struct SpectralClusteringParameter {
        int n_clusters;
        int n_eig_vects;
        nvgraphSpectralClusteringType_t alg;
        float evs_tolerance;
        int evs_max_iter;
        float kmean_tolerance;
        …
    };
Conclusions
SPECTRAL CLUSTERING
• Software framework similar for both
• Laplacian - minimum balanced cut [1]
  • Probably the most common metric, with balancing involved in the cost function
  • Requires careful choice of eigensolver
• Modularity maximization [2]
  • Widely used in analysis of social networks
  • Faster to compute
[1] M. Naumov, T. Moon. “Parallel spectral graph partitioning.” Nvidia Technical Report, 2016.
[2] A. Fender, N. Emad, S. Petiton, M. Naumov. “Parallel Modularity Clustering.” ICCS, 2017.
Thank you
H7129 - Accelerated Libraries
Monday and Wednesday @ 4:00, Pod B