Clustering in Popularity Adjusted Stochastic Block Model Majid Noroozi and Marianna Pensky Department of Mathematics University of Central Florida Majid Noroozi (UCF) Clustering in PABM 1 / 25
Introduction
Clustering is a central problem in machine learning and data mining.
Many data sets can be represented as a network of interacting items.
One of the first questions about such a network is which items are alike, and clustering is the main tool for answering it.
Source: Abbe, "Community detection and stochastic block models: recent developments."
Introduction
Table: Some examples of real networks

Network                Node                  Link               Network type
Citation Network       Papers                Citations          Directed
Email                  Email Addresses       Emails             Directed
Social Network         Users                 Interactions       Undirected
Coauthorship Network   Research Scientists   Coauthor a paper   Undirected
Random Graph Models
Random Graph Models
Let $A \in \{0,1\}^{n \times n}$ be the symmetric adjacency matrix of the network, with $A_{i,j} = 1$ if there is a connection between nodes $i$ and $j$, and $A_{i,j} = 0$ otherwise.
Assume that
\[ A_{i,j} \sim \mathrm{Bernoulli}(P_{i,j}), \quad 1 \le i \le j \le n, \]
where the $A_{i,j}$ are conditionally independent given $P_{i,j}$, and $A_{i,j} = A_{j,i}$, $P_{i,j} = P_{j,i}$ for $i > j$.
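The sampling scheme above can be sketched in a few lines of NumPy. This is only an illustration: the function name `sample_adjacency` and the constant matrix `P` are assumptions made for the example, not part of the talk.

```python
import numpy as np

def sample_adjacency(P, rng):
    """Draw a symmetric 0/1 matrix with A[i, j] ~ Bernoulli(P[i, j]) for i <= j."""
    n = P.shape[0]
    # Independent Bernoulli draws, kept only on the upper triangle (diagonal included).
    upper = np.triu(rng.random((n, n)) < P)
    # Mirror the strict upper triangle so that A[j, i] = A[i, j].
    A = upper | np.triu(upper, k=1).T
    return A.astype(int)

rng = np.random.default_rng(0)
P = np.full((5, 5), 0.3)   # hypothetical probability matrix, chosen for illustration
A = sample_adjacency(P, rng)
```

Only the entries with $i \le j$ are drawn; the lower triangle is a copy, which matches the requirement $A_{i,j} = A_{j,i}$.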
Community structure in networks
Block models assume that each node $i$ belongs to one of $K$ distinct blocks, or communities, $N_k$, $k = 1, \dots, K$.
Community assignment is described by a function $c : \{1, \dots, n\} \to \{1, \dots, K\}$, where $c(i) = k$ if $i \in N_k$.
Alternatively, one considers the corresponding membership (or clustering) matrix $Z \in \{0,1\}^{n \times K}$ such that
\[ Z_{i,k} = 1 \iff i \in N_k, \quad i = 1, \dots, n, \; k = 1, \dots, K. \]
The popularity of a node in a community is defined as the number of edges between that node and that community.
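As a small illustration of these definitions, the sketch below builds $Z$ from an assignment vector and counts node-to-community edges as $AZ$. The helper name `membership_matrix`, the 0-based labels, and the toy adjacency matrix are assumptions made for the example.

```python
import numpy as np

def membership_matrix(c, K):
    """Return Z in {0,1}^{n x K} with Z[i, k] = 1 iff node i is in community k (0-based labels)."""
    c = np.asarray(c)
    Z = np.zeros((c.size, K), dtype=int)
    Z[np.arange(c.size), c] = 1
    return Z

# Toy network on 5 nodes (hypothetical data); nodes 0, 2, 3 form one community, nodes 1, 4 the other.
c = [0, 1, 0, 0, 1]
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 0, 0, 1],
              [1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0],
              [0, 1, 0, 0, 0]])
Z = membership_matrix(c, K=2)
# popularity[i, k] = number of edges between node i and community k
popularity = A @ Z
```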
Stochastic Block Model (SBM)
A classical random graph model for networks with community structure is the Stochastic Block Model (SBM).
Under this model, the probability of a connection between two nodes is completely defined by the communities to which they belong:
\[
P(Z,K) =
\begin{bmatrix}
b_{1,1}\, 1_{n_1} 1_{n_1}^T & b_{1,2}\, 1_{n_1} 1_{n_2}^T & \cdots & b_{1,K}\, 1_{n_1} 1_{n_K}^T \\
b_{2,1}\, 1_{n_2} 1_{n_1}^T & b_{2,2}\, 1_{n_2} 1_{n_2}^T & \cdots & b_{2,K}\, 1_{n_2} 1_{n_K}^T \\
\vdots & \vdots & \cdots & \vdots \\
b_{K,1}\, 1_{n_K} 1_{n_1}^T & b_{K,2}\, 1_{n_K} 1_{n_2}^T & \cdots & b_{K,K}\, 1_{n_K} 1_{n_K}^T
\end{bmatrix}
\]
where $1_{n_k}$ is the vector of $n_k$ ones and $P(Z,K)$ is a rearranged version of the matrix $P$ in which the first $n_1$ rows correspond to nodes from class 1, the next $n_2$ rows correspond to nodes from class 2, and the last $n_K$ rows correspond to nodes from class $K$.
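A minimal sketch of the SBM probability matrix, assuming 0-based community labels and a small illustrative block matrix `B` (both are assumptions of the example). Entry $(i, j)$ is simply $b_{c(i), c(j)}$, which is the same as $Z B Z^T$.

```python
import numpy as np

def sbm_probabilities(c, B):
    """Return P with P[i, j] = B[c[i], c[j]]."""
    c = np.asarray(c)
    B = np.asarray(B, dtype=float)
    return B[np.ix_(c, c)]

c = [0, 0, 1]                 # hypothetical assignment of 3 nodes to 2 communities
B = [[0.8, 0.1],
     [0.1, 0.5]]              # hypothetical block probabilities b_{k,l}
P = sbm_probabilities(c, B)

# Equivalent matrix form: P = Z B Z^T with Z the membership matrix.
Z = np.eye(2)[c]
P_matrix_form = Z @ np.asarray(B) @ Z.T
```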
Degree Corrected Block Model (DCBM)
Real-life networks usually contain a very small number of high-degree nodes, while the rest of the nodes have very few connections (low degree). The SBM therefore fails to explain the structure of many networks that occur in practice.
The Degree-Corrected Block Model (DCBM) addresses this deficiency by allowing the connection probabilities to be multiplied by node-dependent weights:
\[
P(Z,K) =
\begin{bmatrix}
b_{1,1}\, \theta_{n_1} \theta_{n_1}^T & b_{1,2}\, \theta_{n_1} \theta_{n_2}^T & \cdots & b_{1,K}\, \theta_{n_1} \theta_{n_K}^T \\
b_{2,1}\, \theta_{n_2} \theta_{n_1}^T & b_{2,2}\, \theta_{n_2} \theta_{n_2}^T & \cdots & b_{2,K}\, \theta_{n_2} \theta_{n_K}^T \\
\vdots & \vdots & \cdots & \vdots \\
b_{K,1}\, \theta_{n_K} \theta_{n_1}^T & b_{K,2}\, \theta_{n_K} \theta_{n_2}^T & \cdots & b_{K,K}\, \theta_{n_K} \theta_{n_K}^T
\end{bmatrix}
\]
where $\theta_{n_k}$ denotes the vector of weights of the $n_k$ nodes in community $k$.
Degree Corrected Block Model (DCBM)
DCBM allows a general degree distribution, in which nodes can have different expected degrees.
Nodes that are popular have a higher value of $\theta$.
DCBM fails to model node popularities in a flexible and realistic way: of two nodes in the same community, the one with the higher $\theta$ must be uniformly more popular in all communities.
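The limitation can be seen numerically. The sketch below assumes the usual entrywise form $P_{i,j} = \theta_i \theta_j\, b_{c(i), c(j)}$ of the DCBM; the function name and all numbers are illustrative assumptions. For two nodes in the same community, the ratio of their rows is the constant $\theta_i / \theta_j$ in every column.

```python
import numpy as np

def dcbm_probabilities(c, B, theta):
    """Return P with P[i, j] = theta[i] * theta[j] * B[c[i], c[j]]."""
    c = np.asarray(c)
    B = np.asarray(B, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return np.outer(theta, theta) * B[np.ix_(c, c)]

c = [0, 0, 1, 1]                  # hypothetical assignment
B = [[0.8, 0.2],
     [0.2, 0.6]]                  # hypothetical block probabilities
theta = [1.0, 0.5, 1.0, 0.8]      # hypothetical node weights
P = dcbm_probabilities(c, B, theta)

# Nodes 0 and 1 share a community: node 0 is theta[0] / theta[1] = 2 times
# more likely to connect to every other node, whatever that node's community.
ratio = P[0, :] / P[1, :]
```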
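Popularity Adjusted Block Model (PABM)
The Popularity Adjusted Stochastic Block Model (PABM), introduced by Sengupta and Chen (2018), generalizes the SBM and its more general form, the DCBM.
For a $K$-block network, let $\Lambda = (\lambda_{ik})$ be the $n \times K$ matrix of popularity scaling parameters. Then, writing $c_i = c(i)$,
\[ P_{ij} = \lambda_{i c_j}\, \lambda_{j c_i}, \qquad 0 \le P_{ij} \le 1 \quad \text{for all } i < j. \]
In rearranged block form,
\[
P(Z,K) =
\begin{bmatrix}
\Lambda^{(1,1)} [\Lambda^{(1,1)}]^T & \Lambda^{(1,2)} [\Lambda^{(2,1)}]^T & \cdots & \Lambda^{(1,K)} [\Lambda^{(K,1)}]^T \\
\Lambda^{(2,1)} [\Lambda^{(1,2)}]^T & \Lambda^{(2,2)} [\Lambda^{(2,2)}]^T & \cdots & \Lambda^{(2,K)} [\Lambda^{(K,2)}]^T \\
\vdots & \vdots & \cdots & \vdots \\
\Lambda^{(K,1)} [\Lambda^{(1,K)}]^T & \Lambda^{(K,2)} [\Lambda^{(2,K)}]^T & \cdots & \Lambda^{(K,K)} [\Lambda^{(K,K)}]^T
\end{bmatrix}
\]
where
\[
\Lambda =
\begin{bmatrix}
\Lambda^{(1,1)} & \Lambda^{(1,2)} & \cdots & \Lambda^{(1,K)} \\
\Lambda^{(2,1)} & \Lambda^{(2,2)} & \cdots & \Lambda^{(2,K)} \\
\vdots & \vdots & \cdots & \vdots \\
\Lambda^{(K,1)} & \Lambda^{(K,2)} & \cdots & \Lambda^{(K,K)}
\end{bmatrix}
\]
and $\Lambda^{(k,l)}$ is the column vector of the parameters $\lambda_{il}$ for the nodes $i$ in community $k$.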
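The entrywise formula $P_{ij} = \lambda_{i c_j} \lambda_{j c_i}$ translates directly into NumPy. The sketch below is illustrative only: the function name, the 0-based labels, and the parameter values are assumptions. It also shows the extra flexibility over the DCBM: two nodes of the same community can swap their popularity ranking from one community to another.

```python
import numpy as np

def pabm_probabilities(c, Lam):
    """Return P with P[i, j] = Lam[i, c[j]] * Lam[j, c[i]]."""
    c = np.asarray(c)
    Lam = np.asarray(Lam, dtype=float)
    L = Lam[:, c]          # L[i, j] = lambda_{i, c(j)}
    return L * L.T         # L.T[i, j] = lambda_{j, c(i)}

c = [0, 0, 0, 1, 1]        # hypothetical assignment
Lam = [[0.9, 0.2],         # node 0: popular in community 0, unpopular in community 1
       [0.3, 0.8],         # node 1: the opposite pattern
       [0.6, 0.6],
       [0.5, 0.5],
       [0.4, 0.7]]         # hypothetical popularity parameters
P = pabm_probabilities(c, Lam)

# Nodes 0 and 1 are both in community 0, yet their ranking depends on the target community.
more_popular_in_own = P[0, 2] > P[1, 2]     # node 2 is in community 0
less_popular_in_other = P[0, 3] < P[1, 3]   # node 3 is in community 1
```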
Understanding the PABM
[Figure] Left panel: matrix $\Lambda$; $\Lambda^{(1,1)}$ (red), $\Lambda^{(2,1)}$ (blue), $\Lambda^{(1,2)}$ (yellow), $\Lambda^{(2,2)}$ (purple). Right panel: assembling the re-organized probability matrix $P(Z,K)$; $P^{(1,1)}(Z,K)$ (red), $P^{(2,1)}(Z,K)$ (green), $P^{(2,2)}(Z,K)$ (purple).
Understanding the PABM
[Figure] Left panel: re-organized probability matrix $P(Z,2)$. Right panel: probability matrix $P$; community 1: nodes 1, 3, 4; community 2: nodes 2 and 5.
Subspace Clustering
Subspace clustering is designed to separate points that lie in a union of subspaces.
The matrix $P$ consists of $K$ clusters of columns (rows) that lie in the union of $K$ distinct subspaces, each of dimension $K$.
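The subspace structure can be checked numerically on a synthetic PABM matrix. The sketch below uses randomly drawn popularity parameters (an assumption of the example) and verifies that the columns of $P$ belonging to one community span a subspace whose dimension does not exceed $K$.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
c = np.repeat(np.arange(K), 4)                  # 12 nodes, 4 per community (hypothetical)
Lam = rng.uniform(0.1, 0.9, size=(c.size, K))   # hypothetical popularity parameters

# PABM probabilities: P[i, j] = Lam[i, c[j]] * Lam[j, c[i]]
L = Lam[:, c]
P = L * L.T

# Rank of the block of columns that belongs to each community.
ranks = [int(np.linalg.matrix_rank(P[:, c == k])) for k in range(K)]
```

Each community contributes 4 columns in a 12-dimensional space, yet every entry of `ranks` is at most $K = 3$.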
Sparse Subspace Clustering (SSC)
If the matrix $P$ were known, the coefficient matrix $W$ of the SSC would be obtained by writing every data point as a sparse linear combination of all other points, minimizing the number of nonzero coefficients:
\[ \min_{W_j} \|W_j\|_0 \quad \text{s.t.} \quad (P)_j = \sum_{k \ne j} W_{kj}\, (P)_k . \]
The sparse coefficients of the contaminated data are found as a solution of
\[ \min_{W_j} \left\{ \|W_j\|_0 + \gamma\, \|A_j - A W_j\|_2^2 \right\} \quad \text{s.t.} \quad W_{jj} = 0, \quad j = 1, \dots, n. \]
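Minimizing the $\ell_0$ norm exactly is combinatorial, so in practice a heuristic is needed. The sketch below uses greedy orthogonal matching pursuit (OMP) as one common stand-in; this choice, the function name, and the sparsity cap `max_nonzeros` (assumed to be at least 1, and playing the role that $\gamma$ plays in the penalized form) are assumptions of the example, not the solver described in the talk.

```python
import numpy as np

def omp_self_representation(X, j, max_nonzeros):
    """Greedily approximate column j of X by at most max_nonzeros other columns of X."""
    n = X.shape[1]
    target = X[:, j]
    residual = target.copy()
    col_norms = np.linalg.norm(X, axis=0) + 1e-12
    support = []
    for _ in range(max_nonzeros):
        # Normalized correlation of every column with the current residual.
        scores = np.abs(X.T @ residual) / col_norms
        scores[j] = -np.inf                 # enforce W_jj = 0
        if support:
            scores[support] = -np.inf       # do not pick a column twice
        support.append(int(np.argmax(scores)))
        # Refit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(X[:, support], target, rcond=None)
        residual = target - X[:, support] @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    w = np.zeros(n)
    w[support] = coef
    return w

rng = np.random.default_rng(1)
X = rng.random((8, 8))                      # hypothetical data matrix standing in for A
w = omp_self_representation(X, j=0, max_nonzeros=3)
```

Running the function for every $j$ and stacking the results as columns gives a coefficient matrix $W$ with zero diagonal.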