CS224W: Analysis of Networks Jure Leskovec, Stanford University http://cs224w.stanford.edu
• Degree distribution: P(k)
• Path length: h
• Clustering coefficient: C
• Connected components: s

Definitions are presented for undirected graphs. Some extensions to directed graphs are mentioned explicitly; others are obvious.

9/29/19 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, cs224w.stanford.edu
• Degree distribution $P(k)$: probability that a randomly chosen node has degree $k$
  - $N_k$ = number of nodes with degree $k$
• Normalized histogram: plot $P(k) = N_k / N$
[Figure: histogram of P(k) versus k, for k = 1 to 4]
For directed graphs we have separate in- and out-degree distributions.
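The normalized histogram can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `degree_distribution` and the toy edge list are assumptions made for the example.

```python
from collections import Counter

def degree_distribution(edges, nodes=None):
    """Return {k: P(k)} for an undirected graph given as (u, v) edge pairs."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Isolated nodes (degree 0) only appear if an explicit node list is given.
    all_nodes = set(degree) | set(nodes or [])
    n = len(all_nodes)
    counts = Counter(degree[v] for v in all_nodes)  # counts[k] is N_k
    return {k: counts[k] / n for k in sorted(counts)}

# Hypothetical toy graph: degrees are A=2, B=2, C=3, D=1.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
dist = degree_distribution(edges)
print(dist)
```

For a directed graph the same counting would be done twice, once over edge sources (out-degree) and once over edge targets (in-degree).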
• A path is a sequence of nodes in which each node is linked to the next one
  - As a node sequence: $P_n = \{i_0, i_1, i_2, \ldots, i_n\}$
  - As an edge sequence: $P_n = \{(i_0, i_1), (i_1, i_2), (i_2, i_3), \ldots, (i_{n-1}, i_n)\}$
• A path can intersect itself and pass through the same edge multiple times
  - E.g.: ACBDCDEG
[Figure: example graph on nodes A, B, C, D, E, F, G, H, X]
• Distance (shortest path, geodesic) between a pair of nodes is defined as the number of edges along the shortest path connecting the nodes
  - If the two nodes are not connected, the distance is usually defined as infinite (or zero)
  - Example: $h_{B,D} = 2$, $h_{A,X} = \infty$
• In directed graphs, paths need to follow the direction of the arrows
  - Consequence: distance is not symmetric: $h_{B,C} \neq h_{C,B}$
  - Example: $h_{B,C} = 1$, $h_{C,B} = 2$
[Figures: an undirected example graph on nodes A, B, C, D, X and a directed example graph on nodes A, B, C, D]
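Shortest-path distance in an unweighted graph can be computed with breadth-first search. The sketch below is illustrative only: the helper names and the small directed graph are hypothetical (the slide's figure is not recoverable), with the graph chosen so that it reproduces the asymmetry $h_{B,C} = 1$, $h_{C,B} = 2$.

```python
from collections import deque
import math

def bfs_distances(adj, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance(adj, s, t):
    """Distance from s to t, or infinity if t is unreachable from s."""
    return bfs_distances(adj, s).get(t, math.inf)

# Hypothetical directed graph (adjacency lists follow arrow direction).
directed = {"B": ["C"], "C": ["A"], "A": ["B"]}
print(distance(directed, "B", "C"))  # one hop along B -> C
print(distance(directed, "C", "B"))  # two hops along C -> A -> B
print(distance(directed, "A", "X"))  # X is not reachable: infinite distance
```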
• Diameter: the maximum (shortest path) distance between any pair of nodes in a graph
• Average path length for a connected graph or a strongly connected directed graph:
  $\bar{h} = \frac{1}{2 E_{\max}} \sum_{i,\, j \neq i} h_{ij}$
  - $h_{ij}$ is the distance from node $i$ to node $j$
  - $E_{\max}$ is the max number of edges (total number of node pairs) $= n(n-1)/2$
  - Many times we compute the average only over the connected pairs of nodes (that is, we ignore "infinite" length paths)
  - Note that this measure also applies to (strongly) connected components of a graph
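As a rough sketch of both measures, the code below runs a BFS from every node and averages over connected ordered pairs only. For a connected graph the number of such pairs is $n(n-1) = 2E_{\max}$, so the result matches the formula. The function names and the four-node path graph are assumptions for illustration.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_stats(adj):
    """Return (diameter, average path length) over connected ordered pairs."""
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return diameter, (total / pairs if pairs else 0.0)

# Hypothetical path graph A - B - C - D.
path_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
diameter, avg_h = path_stats(path_graph)
print(diameter, avg_h)
```

Running one BFS per node costs O(N(N + E)), which is why large networks are usually handled by sampling source nodes instead.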
• Clustering coefficient (for undirected graphs):
  - How connected are $i$'s neighbors to each other?
  - Node $i$ with degree $k_i$:
    $C_i = \frac{2 e_i}{k_i (k_i - 1)}$, with $C_i \in [0, 1]$
  - where $e_i$ is the number of edges between the neighbors of node $i$
  - Note: $k_i (k_i - 1)/2$ is the max number of edges between the $k_i$ neighbors
  - The clustering coefficient is undefined (or defined to be 0) for nodes with degree 0 or 1
• Average clustering coefficient:
  $C = \frac{1}{N} \sum_{i}^{N} C_i$
• Clustering coefficient (for undirected graphs), worked example with $C_i = 2 e_i / (k_i (k_i - 1))$:
  - $k_B = 2$, $e_B = 1$, $C_B = 2/2 = 1$
  - $k_D = 4$, $e_D = 2$, $C_D = 4/12 = 1/3$
  - Avg. clustering: $C = 0.33$
[Figure: example graph on nodes A, B, C, D, E, F, G, H]
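The per-node computation might be sketched as follows. The graph here is hypothetical, not the one in the slide's figure (its edges are not recoverable); it is built so that node D has $k_D = 4$, $e_D = 2$ and node B has $k_B = 2$, $e_B = 1$, matching the two worked values. Nodes of degree 0 or 1 are assigned 0, one of the two conventions mentioned.

```python
from itertools import combinations

def clustering(adj, i):
    """C_i = 2 e_i / (k_i (k_i - 1)); returns 0.0 when k_i < 2 by convention."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # e_i: number of edges among the neighbors of i.
    e = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2 * e / (k * (k - 1))

def average_clustering(adj):
    return sum(clustering(adj, i) for i in adj) / len(adj)

# Hypothetical undirected graph stored as neighbor sets.
adj = {
    "D": {"B", "C", "E", "G"},
    "B": {"D", "C"},
    "C": {"D", "B"},
    "E": {"D", "G"},
    "G": {"D", "E"},
}
print(clustering(adj, "B"), clustering(adj, "D"))
```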
• Size of the largest connected component
  - Largest set where any two vertices can be joined by a path
• Largest component = giant component

How to find connected components:
• Start from a random node and perform Breadth First Search (BFS)
• Label the nodes that BFS visits
• If all nodes are visited, the network is connected
• Otherwise find an unvisited node and repeat BFS
[Figure: example graph on nodes A, B, C, D, F, G, H]
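The BFS procedure above can be sketched as below. One simplification to note: instead of picking a random start node, this version takes unvisited nodes in dictionary order, which yields the same components. The function name and the toy graph are assumptions for illustration.

```python
from collections import deque

def connected_components(adj):
    """Return a list of components, each a list of node labels."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue  # already labeled by an earlier BFS
        seen.add(start)
        component = [start]
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    component.append(v)
                    queue.append(v)
        components.append(component)
    return components

# Hypothetical undirected graph with three components.
graph = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B"},
    "F": {"G"}, "G": {"F"},
    "H": set(),
}
components = connected_components(graph)
largest = max(components, key=len)
print(len(components), sorted(largest))
```

The network is connected exactly when a single component is returned.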
Degree distribution: P(k); Path length: h; Clustering coefficient: C; Connected components: s
MSN Messenger:
• 1 month of activity
  - 245 million users logged in
  - 180 million users engaged in conversations
  - More than 30 billion conversations
  - More than 255 billion exchanged messages
Network: 180M people, 1.3B edges
Messaging as an undirected graph
• Edge (u,v) if users u and v exchanged at least 1 msg
• N = 180 million people
• E = 1.3 billion edges
Note: We plotted the same data as on the previous slide, just the axes are now logarithmic.
Avg. clustering of the MSN: C = 0.1140
• $C_k$: average $C_i$ of nodes $i$ of degree $k$:
  $C_k = \frac{1}{N_k} \sum_{i:\, k_i = k} C_i$
Number of links between pairs of nodes in the largest connected component
(# nodes as we do BFS out of a random node)

Steps   #Nodes
0       1
1       10
2       78
3       3,96
4       8,648
5       3,299,252
6       28,395,849
7       79,059,497
8       52,995,778
9       10,321,008
10      1,955,007
11      518,410
12      149,945
13      44,616
14      13,740
15      4,476
16      1,542
17      536
18      167
19      71
20      29
21      16
22      10
23      3
24      2
25      3

• Avg. path length: 6.6
• 90% of the nodes can be reached in < 8 hops
• Degree distribution: heavily skewed; avg. degree = 14.4
• Path length: 6.6
• Clustering coefficient: 0.11
• Connectivity: giant component

Are these values "expected"? Are they "surprising"? To answer this we need a model!
a. Undirected network: N = 2,018 proteins as nodes, E = 2,930 binding interactions as links
b. Degree distribution: skewed; average degree <k> = 2.90
c. Path length: avg. path length = 5.8
d. Clustering: avg. clustering = 0.12
Connectivity: 185 components; the largest component has 1,647 nodes (81% of nodes)
• Erdős-Rényi Random Graphs [Erdős-Rényi, '60]
• Two variants:
  - $G_{np}$: undirected graph on $n$ nodes where each edge $(u,v)$ appears i.i.d. with probability $p$
  - $G_{nm}$: undirected graph with $n$ nodes, and $m$ edges picked uniformly at random

What kind of networks do such models produce?
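Both variants can be sketched with the Python standard library. This is an illustrative sketch under simple assumptions: nodes are labeled 0 to n-1, the function names `gnp` and `gnm` are made up for the example, and all node pairs are enumerated, which is only practical for small n.

```python
import random
from itertools import combinations

def gnp(n, p, rng):
    """G_np: include each of the n(n-1)/2 possible edges independently with prob. p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def gnm(n, m, rng):
    """G_nm: choose exactly m distinct edges uniformly at random."""
    # rng.sample raises ValueError if m exceeds the number of node pairs.
    return rng.sample(list(combinations(range(n), 2)), m)

rng = random.Random(0)  # seeded only so the example is repeatable
edges_np = gnp(10, 1 / 6, rng)
edges_nm = gnm(10, 7, rng)
print(len(edges_np), "edges in this G_np sample")
print(len(edges_nm), "edges in this G_nm sample")
```

The edge count of a $G_{np}$ sample varies from run to run, while a $G_{nm}$ sample always has exactly $m$ edges.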
• $n$ and $p$ do not uniquely determine the graph!
  - The graph is a result of a random process
• We can have many different realizations given the same $n$ and $p$
[Figure: several realizations of $G_{np}$ for n = 10, p = 1/6]
• Degree distribution: P(k)
• Path length: h
• Clustering coefficient: C

What are the values of these properties for $G_{np}$?
• Fact: the degree distribution of $G_{np}$ is binomial.
• Let $P(k)$ denote the fraction of nodes with degree $k$:
  $P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
  - $\binom{n-1}{k}$: select $k$ nodes out of $n-1$
  - $p^k$: probability of having $k$ edges
  - $(1-p)^{n-1-k}$: probability of missing the rest of the $n-1-k$ edges
• Mean and variance of a binomial distribution:
  $\bar{k} = p(n-1)$, $\sigma^2 = p(1-p)(n-1)$
  $\frac{\sigma}{\bar{k}} = \left[\frac{1-p}{p} \cdot \frac{1}{n-1}\right]^{1/2} \approx \frac{1}{(n-1)^{1/2}}$
• By the law of large numbers, as the network size increases, the distribution becomes increasingly narrow: we are increasingly confident that the degree of a node is in the vicinity of $\bar{k}$.
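A quick numerical check of the binomial formula and its moments, as a sketch only: the names `degree_pmf` and `relative_width` are assumptions for this example, and the parameters n = 10, p = 1/6 are chosen arbitrarily.

```python
from math import comb, sqrt

def degree_pmf(n, p):
    """P(k) for k = 0..n-1 in G_np: Binomial(n-1, p)."""
    return [comb(n - 1, k) * p**k * (1 - p) ** (n - 1 - k) for k in range(n)]

def relative_width(n, p):
    """sigma / mean = sqrt((1-p) / (p (n-1)))."""
    return sqrt((1 - p) / (p * (n - 1)))

n, p = 10, 1 / 6
pmf = degree_pmf(n, p)
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
print(mean, p * (n - 1))            # both give the mean degree
print(var, p * (1 - p) * (n - 1))   # both give the variance

# The distribution narrows relative to its mean as n grows (p held fixed).
print(relative_width(100, 0.1), relative_width(10_000, 0.1))
```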
• Remember: $C_i = \frac{2 e_i}{k_i (k_i - 1)}$, where $e_i$ is the number of edges between $i$'s neighbors
• Edges in $G_{np}$ appear i.i.d. with prob. $p$
• So, the expected $E[e_i]$ is:
  $E[e_i] = p \cdot \frac{k_i (k_i - 1)}{2}$
  - $k_i (k_i - 1)/2$: number of distinct pairs of neighbors of node $i$ of degree $k_i$
  - $p$: each pair is connected with prob. $p$
• Then $E[C_i]$:
  $E[C_i] = \frac{p \cdot k_i (k_i - 1)}{k_i (k_i - 1)} = p = \frac{\bar{k}}{n-1} \approx \frac{\bar{k}}{n}$
• The clustering coefficient of a random graph is small. If we generate bigger and bigger graphs with fixed avg. degree $\bar{k}$ (that is, we set $p = \bar{k} \cdot 1/n$), then $C$ decreases with the graph size $n$.
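The result $E[C_i] = p$ can be checked empirically on a single sampled graph. The sketch below is not from the lecture: the function names, the seed, and the parameters n = 200, p = 0.1 are arbitrary choices, and nodes with degree below 2 are counted as having clustering 0.

```python
import random
from itertools import combinations

def gnp_adjacency(n, p, rng):
    """Sample G_np and return it as a dict of neighbor sets."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def average_clustering(adj):
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # contributes 0 by convention
        e = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2 * e / (k * (k - 1))
    return total / len(adj)

rng = random.Random(42)  # seeded only so the example is repeatable
n, p = 200, 0.1
c = average_clustering(gnp_adjacency(n, p, rng))
print(f"p = {p}, measured average clustering = {c:.3f}")
```

The measured value should land close to p; repeating the experiment with a fixed average degree and growing n would show the clustering shrinking roughly like $\bar{k}/n$.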
• Degree distribution: $P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
• Clustering coefficient: $C = p \approx \bar{k}/n$
• Path length: next!
• Connectivity: