CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Networks of tightly connected groups
Network communities: sets of nodes with lots of connections inside and few to outside (the rest of the network)
Communities, clusters, groups, modules
11/3/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
[Onnela et al. '07]
Figure panels: edge strengths (call volume) in a real network; edge betweenness in a real network
[Girvan-Newman PNAS '02]
Divisive hierarchical clustering based on edge betweenness: the number of shortest paths passing through the edge
Girvan-Newman Algorithm:
• Repeat until no edges are left:
-- Calculate betweenness of edges
-- Remove edges with highest betweenness
• Connected components are communities
• Gives a hierarchical decomposition of the network
Example: (figure)
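The loop above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the graph representation (a dict mapping each node to a set of neighbors), the function names, and the tie tolerance are all assumptions made here, and the graph is assumed to be undirected, unweighted, and free of self-loops.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness for an undirected, unweighted graph {node: set(neighbors)}."""
    score = {}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record BFS predecessors.
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds, order, queue = {v: [] for v in adj}, [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back up the BFS levels, splitting credit by path counts.
        credit = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                flow = (1.0 + credit[w]) * sigma[v] / sigma[w]
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + flow
                credit[v] += flow
    # Every unordered node pair was counted once from each endpoint.
    return {edge: b / 2.0 for edge, b in score.items()}

def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

def girvan_newman(adj):
    """Return the successive partitions produced by removing top-betweenness edges."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    levels = [components(adj)]
    while any(adj.values()):
        bet = edge_betweenness(adj)
        top = max(bet.values())
        for edge, b in bet.items():
            if b >= top - 1e-9:  # remove all edges tied for the highest score
                u, v = tuple(edge)
                adj[u].discard(v)
                adj[v].discard(u)
        comps = components(adj)
        if len(comps) > len(levels[-1]):
            levels.append(comps)
    return levels

# Hypothetical test graph: two triangles joined by the bridge 2-3.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
levels = girvan_newman(example)
```

On this toy graph the bridge carries the most shortest paths, so the first split separates the two triangles. Recomputing betweenness after every removal is what makes the method expensive on large networks.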
[Newman-Girvan PhysRevE '03]
Zachary's Karate club: hierarchical decomposition
[Newman-Girvan PhysRevE '03]
Communities in physics collaborations
Breadth-first search starting from A:
Want to compute betweenness of paths starting at node A
Count the number of shortest paths from A to all other nodes of the network:
Compute betweenness by working up the tree: if there are multiple paths, count them fractionally
Figure annotations:
• 1+1 paths to H: split evenly
• 1+0.5 paths to J: split 1:2
• 1 path to K: split evenly
Then:
• Repeat the BFS procedure for each node of the network
• Add edge scores
• Runtime (all pairs shortest path):
-- Weighted graphs: $O(N^3)$
-- Unweighted graphs: $O(N^2)$
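The single-source step (BFS from A, count shortest paths, then push credit back up the tree with fractional splits) can be sketched as follows. The function name, the dict-of-sets graph format, and the five-node example graph are assumptions for illustration; the example is not the graph from the slide figure.

```python
from collections import deque

def single_source_edge_credit(adj, source):
    """Return (shortest-path counts, per-edge credit) for paths starting at `source`."""
    dist = {source: 0}
    paths = {source: 1}      # number of shortest paths from source to each node
    parents = {source: []}   # BFS predecessors on shortest paths
    order = []               # nodes in the order BFS dequeues them
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                paths[w] = 0
                parents[w] = []
                queue.append(w)
            if dist[w] == dist[v] + 1:
                paths[w] += paths[v]
                parents[w].append(v)
    # Work up the tree: each node passes on 1 (for itself) plus the credit it
    # received from below, split among parents in proportion to path counts.
    credit = {v: 0.0 for v in order}
    edge_score = {}
    for w in reversed(order):
        for v in parents[w]:
            share = (1.0 + credit[w]) * paths[v] / paths[w]
            edge_score[frozenset((v, w))] = share
            credit[v] += share
    return paths, edge_score

# Hypothetical example: A reaches D by two equally short routes (via B or C).
adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"},
       "D": {"B", "C", "E"}, "E": {"D"}}
paths, edge_score = single_source_edge_credit(adj, "A")
```

Here D and E each have two shortest paths from A, so the credit arriving at D is split evenly between edges B-D and C-D. Summing `edge_score` over every choice of source and halving the totals gives the edge betweenness used by Girvan-Newman.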
Define modularity to be
Q = (number of edges within groups) - (expected number within groups)
• Actual number of edges between i and j is $A_{ij}$
• Expected number of edges between i and j is $\frac{k_i k_j}{2m}$
• m … number of edges
Q = (number of edges within groups) - (expected number within groups)
Then, normalized by $2m$ so that both terms are fractions of all edge ends:
$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$
• m … number of edges
• $A_{ij}$ … 1 if (i,j) is an edge, else 0
• $k_i$ … degree of node i
• $c_i$ … group id of node i
• $\delta(a, b)$ … 1 if a=b, else 0
Modularity lies in the range [-1, 1]
It is positive if the number of edges within groups exceeds the expected number
0.3 < Q < 0.7 means significant community structure
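As a worked check of the definition, here is a small sketch that evaluates Q for a given assignment of nodes to groups. It assumes the standard normalization $Q = \frac{1}{2m}\sum_{i,j}(A_{ij} - \frac{k_i k_j}{2m})\,\delta(c_i, c_j)$ and an undirected simple graph given as an edge list; the function name and the example graph are illustrative only.

```python
def modularity(edges, community):
    """Q for an undirected simple graph (list of (u, v) edges) and a {node: group} map."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Fraction of edges that fall inside a group: (1/2m) * sum_ij A_ij * delta.
    inside = sum(1 for u, v in edges if community[u] == community[v]) / m
    # Expected fraction: (1/2m) * sum_ij (k_i k_j / 2m) * delta
    # = sum over groups of (total degree of the group / 2m)^2.
    group_degree = {}
    for node, k in degree.items():
        c = community[node]
        group_degree[c] = group_degree.get(c, 0) + k
    expected = sum((d / (2 * m)) ** 2 for d in group_degree.values())
    return inside - expected

# Hypothetical example: two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "all" for v in range(6)}
q_split = modularity(edges, two_groups)   # 6/7 - 1/2
q_single = modularity(edges, one_group)   # 1 - 1
```

Placing every node in one group always gives Q = 0, since all edges are internal and the expected fraction is also 1. The two-triangle split scores 6/7 - 1/2, about 0.36, which falls in the range the slide calls significant.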
Modularity is useful for selecting the number of clusters
Why not optimize modularity directly?
Consider splitting the graph into two communities
Modularity Q is:
$Q = \frac{1}{2m} \sum_{i,j \text{ in same group}} \left( A_{ij} - \frac{k_i k_j}{2m} \right)$
Or we can write it in matrix form as:
$Q = \frac{1}{4m} \sum_{i,j} B_{ij} s_i s_j = \frac{1}{4m} s^T B s$
• s … vector of group memberships, $s_i \in \{+1, -1\}$
• B … modularity matrix, $B_{ij} = A_{ij} - \frac{k_i k_j}{2m}$
Note: each row (column) of B sums to 0, so replacing the same-group indicator $\frac{1}{2}(s_i s_j + 1)$ by $\frac{1}{2} s_i s_j$ does not change the sum
Task: find $s \in \{-1, +1\}^n$ that maximizes Q
Rewrite Q in terms of the eigenvalues $\beta_i$ and orthonormal eigenvectors $u_i$ of B:
$Q = \frac{1}{4m} s^T \left[ \sum_{i=1}^{n} u_i \beta_i u_i^T \right] s = \frac{1}{4m} \sum_{i=1}^{n} (s^T u_i)^2 \beta_i$
To maximize Q, the easiest way is to make s parallel to $u_1$
• Assigns all weight in the sum to $\beta_1$ (largest eigenvalue)
• (all other $s^T u_i$ terms are zero because of orthonormality)
Unfortunately, elements of s must be ±1
In general, finding the optimal s is NP-hard
$Q = \frac{1}{4m} \sum_{i=1}^{n} (s^T u_i)^2 \beta_i$
Heuristic: try to maximize only the $\beta_1$ term
Similar in spirit to the spectral partitioning algorithm (we will explore it next time)
Continue the bisection hierarchically
Fast Modularity Optimization Algorithm:
• Find the leading eigenvector $u_1$ of the modularity matrix B
• Divide the nodes by the signs of the elements of $u_1$
• Repeat hierarchically until:
-- If a proposed split does not cause modularity to increase, declare the community indivisible and do not split it
-- If all communities are indivisible, stop
How to find $u_1$? Power method!
• Iterative multiplication and normalization
• Start with a random v and iterate until convergence: $v_{k+1} = \frac{B v_k}{\lVert B v_k \rVert}$
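A single bisection step (build B, run the power method, split by sign) might look like the sketch below. Two things here are assumptions added for the example rather than taken from the slide. First, B is shifted by a constant multiple of the identity before iterating, because plain power iteration converges to the eigenvalue of largest magnitude, which for B can be a negative one. Second, the start vector is seeded so the run is repeatable. Function names and the test graph are illustrative.

```python
import random

def modularity_matrix(adj_matrix):
    """B_ij = A_ij - k_i k_j / (2m) for a symmetric 0/1 adjacency matrix."""
    n = len(adj_matrix)
    k = [sum(row) for row in adj_matrix]
    two_m = sum(k)
    return [[adj_matrix[i][j] - k[i] * k[j] / two_m for j in range(n)]
            for i in range(n)]

def leading_eigenvector(B, iters=1000, tol=1e-10, seed=0):
    """Power iteration on B + shift*I (assumed shift keeps all eigenvalues >= 0)."""
    n = len(B)
    # Largest absolute row sum bounds every |eigenvalue| of B (Gershgorin).
    shift = max(sum(abs(x) for x in row) for row in B)
    rng = random.Random(seed)
    v = [rng.random() for _ in range(n)]
    norm = sum(x * x for x in v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iters):
        # Multiply by (B + shift*I), then normalize.
        w = [sum(B[i][j] * v[j] for j in range(n)) + shift * v[i] for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
        if max(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
    return v

# Hypothetical example: two triangles joined by the bridge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = [[0] * 6 for _ in range(6)]
for a, b in edges:
    A[a][b] = A[b][a] = 1
B = modularity_matrix(A)
u1 = leading_eigenvector(B)
s = [1 if x > 0 else -1 for x in u1]  # divide the nodes by sign
```

The shift leaves the eigenvectors of B unchanged and only moves the eigenvalues, so the sign pattern of the result is the one the algorithm needs. The overall sign of `u1` is arbitrary, so only the grouping matters, not which side is labeled +1.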
Also, can combine with other methods:
• Randomly divide the nodes into two groups
• Move the node that, if moved, will increase Q the most
• Repeat for all nodes, with each node only moved once
• Once complete, find the intermediate state with the highest Q
• Start from this state and repeat until Q stops increasing
Good results for "fine-tuning" the spectral method
CNM Algorithm (Clauset-Newman-Moore '04):
(1) Start with each vertex in its own community (n communities)
(2) Calculate the change in Q for all possible community pairs
(3) Merge the pair with the largest increase in Q
Repeat (2) and (3) until one community remains
Cut the dendrogram where Q is maximum
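The greedy merging steps (1) to (3) can be sketched naively as below. This is an illustration of the merging rule only: it rescans all connected community pairs at every step and so does not have the running time of the actual Clauset-Newman-Moore data structures. It assumes an undirected simple graph with sortable node labels, and uses the quantities $e_{cd}$ (fraction of edge ends joining communities c and d) and $a_c$ (fraction of edge ends attached to c), for which merging c and d changes Q by $2(e_{cd} - a_c a_d)$. All names are hypothetical.

```python
def greedy_merge(edges):
    """Agglomerative modularity merging; returns (best Q, best partition)."""
    m = len(edges)
    nodes = sorted({v for edge in edges for v in edge})
    comm = {v: {v} for v in nodes}   # community id -> member nodes
    e = {v: {} for v in nodes}       # e[c][d]: fraction of edge ends joining c and d
    a = {v: 0.0 for v in nodes}      # a[c]: fraction of edge ends attached to c
    for u, v in edges:
        e[u][v] = e[u].get(v, 0.0) + 1 / (2 * m)
        e[v][u] = e[v].get(u, 0.0) + 1 / (2 * m)
        a[u] += 1 / (2 * m)
        a[v] += 1 / (2 * m)
    q = -sum(x * x for x in a.values())  # Q with every node on its own
    best_q, best = q, [set(c) for c in comm.values()]
    while len(comm) > 1:
        # Step (2): change in Q for every pair of connected communities.
        candidates = [(2 * (w - a[c] * a[d]), c, d)
                      for c in e for d, w in e[c].items() if c < d]
        if not candidates:
            break  # remaining communities are disconnected from each other
        # Step (3): merge the pair with the largest change (d is absorbed into c).
        dq, c, d = max(candidates)
        for x, w in e.pop(d).items():
            del e[x][d]
            if x != c:
                e[c][x] = e[c].get(x, 0.0) + w
                e[x][c] = e[c][x]
        a[c] += a.pop(d)
        comm[c] |= comm.pop(d)
        q += dq
        # Remember the level of the dendrogram where Q peaks.
        if q > best_q + 1e-12:
            best_q, best = q, [set(members) for members in comm.values()]
    return best_q, best

# Hypothetical example: two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
best_q, best = greedy_merge(edges)
```

Merging continues past the best level until one community remains, which mirrors building the full dendrogram and then cutting it where Q is maximum.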
Fast modularity
• GN = Girvan-Newman, $O(n^3)$
• CNM = greedy merging, $O(n \log^2 n)$
• DA = extremal optimization, $O(n^2 \log^2 n)$
Issues with modularity:
• May not find communities with fewer than roughly $\sqrt{m}$ links (the resolution limit)
• NP-hard to optimize exactly [Brandes et al. '07]
[Kumar et al. '99]
Searching for small communities in a Web graph
(1) The signature of a community/discussion in the context of a Web graph: a dense 2-layer graph
Intuition: a bunch of people all talking about the same things
(2) A more well-defined problem:
Enumerate complete bipartite subgraphs $K_{s,t}$
Where $K_{s,t}$ = s nodes where each links to the same t other nodes
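To make the definition concrete, here is a brute-force sketch that lists every $K_{s,t}$ in a small directed graph. It is not the procedure from Kumar et al., which would need pruning to work at Web scale; it simply tries every set of s linking nodes and intersects their out-links. The graph format (dict of out-link sets), the function name, and the variable names `fans` and `centers` are assumptions for illustration.

```python
from itertools import combinations

def complete_bipartite_subgraphs(out_links, s, t):
    """List (fans, centers) pairs where each of the s fans links to all t centers."""
    found = []
    # A node can only take part as a fan if it has at least t out-links.
    candidates = sorted(v for v, outs in out_links.items() if len(outs) >= t)
    for fans in combinations(candidates, s):
        # Pages linked to by every fan, excluding the fans themselves.
        common = set.intersection(*(set(out_links[f]) for f in fans)) - set(fans)
        for centers in combinations(sorted(common), t):
            found.append((fans, centers))
    return found

# Hypothetical Web graph: pages a, b, c link to pages x, y, z.
out_links = {
    "a": {"x", "y", "z"},
    "b": {"x", "y"},
    "c": {"y", "z"},
    "x": set(), "y": set(), "z": set(),
}
k22 = complete_bipartite_subgraphs(out_links, 2, 2)
```

In this example a and b both link to x and y, and a and c both link to y and z, so two $K_{2,2}$ subgraphs are found. The cost grows combinatorially with s and t, so this form is only suitable for checking the definition on toy inputs.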