Sampling Large Graphs: Algorithms and Applications Don Towsley College of Information & Computer Science Umass - Amherst Collaborators: P.H. Wang, J.C.S. Lui, J.Z. Zhou, X. Guan
Measuring, analyzing large networks - large networks can be represented by graphs - Facebook: 1+ billion - WWW: 50 billion - Twitter: 300 million - Ebay: 233 million - Curse of data dimensionality!
Challenges in measurement: information distortion - “World Map” in 1459 - incomplete (Columbus et al. 1492; Australia, 17th century) - wrong proportions (Africa & Asia) - www.flickr.com/
Why do we want to understand these networks? Want to understand or find out: how did these networks evolve? who are the influential users? how does influence propagate? communities in these networks? … etc. [Figure: high school friendship network]
Goals and challenges - Goals: generate statistically valid characterization of network structure (node pairs in this work) - Challenges: large networks; correcting for biases
How to measure: sampling algorithms - Random sampling (uniform & independent): node sampling, edge sampling - Crawling: breadth-first sampling (BFS), random walk sampling (RW)
Breadth first search sampling - Orkut data set (Mislove 2007), 3M vertices, 200M edges - [Figure: CCDF, true distribution vs. BFS with depth = 3] - BFS sampling highly biased; difficult to remove bias
Random walk sampling - Bias removal? Markov model: at steady state, RW visits edges uniformly at random (edge sampling) - Model: $\theta_i$ = P[node degree = i], $\pi_i$ = P[visited degree = i] - $\pi_i = i\,\theta_i / \text{avg degree}$, or $\theta_i = \text{Norm} \cdot \pi_i / i$ - [Figure: CCDF, RW sampling]
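The reweighting on this slide can be illustrated with a minimal sketch. It assumes the graph is available as an adjacency-list dictionary; the toy graph and the helper names `random_walk_degrees` and `corrected_degree_distribution` are hypothetical and not from the talk. Each visit is weighted by 1/degree and the weights are normalized, which is the relation $\theta_i = \text{Norm} \cdot \pi_i / i$.

```python
import random
from collections import Counter


def random_walk_degrees(adj, start, steps, rng):
    """Run a simple random walk and record the degree of every visited node."""
    node = start
    degrees = []
    for _ in range(steps):
        node = rng.choice(adj[node])  # move to a uniformly random neighbor
        degrees.append(len(adj[node]))
    return degrees


def corrected_degree_distribution(visited_degrees):
    """Weight each visit by 1/degree, then normalize (theta_i = Norm * pi_i / i)."""
    weights = Counter()
    for d in visited_degrees:
        weights[d] += 1.0 / d
    norm = sum(weights.values())
    return {d: w / norm for d, w in sorted(weights.items())}


# Hypothetical toy graph: triangle 0-1-2 with a pendant node 3 attached to 0.
# True degree distribution: P[deg=1] = 0.25, P[deg=2] = 0.5, P[deg=3] = 0.25.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
rng = random.Random(0)
visited = random_walk_degrees(adj, start=0, steps=100_000, rng=rng)
raw = Counter(visited)
pi_hat = {d: c / len(visited) for d, c in sorted(raw.items())}  # biased toward high degree
theta_hat = corrected_degree_distribution(visited)              # bias removed
```

On this toy graph the raw visit frequencies over-represent the degree-3 node, while the reweighted estimate should approach the true distribution as the walk lengthens.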
Node sampling vs. RW: Orkut - [Figure: two panels, random walk and node sampling; axes log(CCDF) vs. log(degree)] - RW: estimates tail well - node sampling: estimates small degrees well
Focus of talk - Measure node pair statistics: important for many applications!
Classification of node pairs - Classify node pair [u, v] using shortest path • 1-hop node pair class if distance(u, v) = 1 • 2-hop node pair class if distance(u, v) = 2 • …
Homophily - Homophily: tendency of users to connect to others with common interests (P. Singla and M. Richardson, “Yes, there is a correlation: from social networks to personal behavior on the web,” WWW 2008; MSN) - Can infer characteristics and make recommendations - Compare homophily(u, v) between different node pair classes
Pair similarity: proximity - Proximity(u, v): number of common neighbors of u and v; closeness of u and v - knowing proximity distribution of node pairs important for: friendship prediction, interest recommendation, …
Pair similarity: distance - Distance(u, v): length of shortest path between u and v in graph - measure distance distribution of all node pairs to calculate: average distance (• Twitter: 4.1 • MSN: 6.6); effective diameter (the 90th percentile of all distances) - small world
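As a small illustration of these two summary statistics, the sketch below computes the exact distance distribution of a tiny, fully known graph with breadth-first search. The toy path graph and the function names are hypothetical, and the percentile convention used here (smallest distance covering at least 90% of connected pairs) is an assumption, since the slide does not spell one out.

```python
from collections import Counter, deque


def bfs_distances(adj, source):
    """Shortest-path distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def distance_histogram(adj):
    """Count unordered connected node pairs by shortest-path distance."""
    hist = Counter()
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if u < v:  # count each unordered pair once
                hist[d] += 1
    return hist


def average_distance(hist):
    total = sum(hist.values())
    return sum(d * c for d, c in hist.items()) / total


def effective_diameter(hist, q=0.9):
    """Smallest distance d such that at least a fraction q of pairs are within d."""
    total = sum(hist.values())
    running = 0
    for d in sorted(hist):
        running += hist[d]
        if running >= q * total:
            return d


# Hypothetical toy graph: a path 0-1-2-3.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
hist = distance_histogram(path_graph)
avg = average_distance(hist)
eff = effective_diameter(hist)
```

Exhaustive BFS from every node is only feasible for small graphs, which is why the rest of the talk turns to sampling.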
Problem formulation - undirected graph $G = (V, E)$ - measure node pair characteristics in following sets: all pairs: $S = \{[u, v] : u, v \in V, u \neq v\}$; one-hop pairs (pairs of connected nodes): $S^{(1)} = \{[u, v] : (u, v) \in E\}$; two-hop pairs (pairs of nodes with at least one common neighbor): $S^{(2)} = \{[u, v] : u, v \in V, u \neq v; \exists x \in V \text{ s.t. } (x, u), (x, v) \in E\}$
Problem formulation - $F(u, v)$: similarity of node pair under study, e.g., # of common neighbors of $u, v$ - $\{a_1, \dots, a_K\}$: range of $F(u, v)$ - distribution of $F(u, v)$ over $S$: $(\omega_1, \dots, \omega_K)$; over $S^{(1)}$: $(\omega_1^{(1)}, \dots, \omega_K^{(1)})$; over $S^{(2)}$: $(\omega_1^{(2)}, \dots, \omega_K^{(2)})$ - $\omega_k, \omega_k^{(1)}, \omega_k^{(2)}$: fractions of node pairs in $S, S^{(1)}, S^{(2)}$ with property $F(u, v) = a_k$
Challenges - OSNs large: Facebook, Google+, Twitter, LinkedIn, …, $|V| > 500$ million users - huge number of node pairs, $|V|^2 > 10^{16}$ - topology not available ⇒ sampling required - UVS (Uniform Vertex Sampling): • unbiased for $S$ • sampling bias for $S^{(1)}$, $S^{(2)}$ • sometimes UVS not allowed - crawling (RW): sampling bias - need to construct unbiased estimates
Node pair sampling based on UVS - Basic sampling techniques - UVS: sample nodes from $V$ uniformly - weighted vertex sampling (WVS): sample nodes from $V$ with desired probability distribution $(\pi_x : x \in V)$ - independent WVS (IWVS) (if we have topology) - Metropolis-Hastings WVS (MHWVS) (if not): at each step, MHWVS selects a node $v$ using UVS and then accepts the sample with probability $\min(\pi_v / \pi_u, 1)$, where $u$ is the previous sample; otherwise tries again
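The MHWVS step can be sketched as follows. This is a minimal illustration, not the talk's implementation: the function name, the toy degree table, and the choice of degree as the target weight are assumptions. On rejection the sketch repeats the previous sample as the current one, which is the standard Metropolis-Hastings convention and one reading of "otherwise tries again". Only the ratio of weights is needed, so the target does not have to be normalized.

```python
import random
from collections import Counter


def mh_weighted_vertex_sampling(nodes, weight, num_samples, rng):
    """Metropolis-Hastings sampler with uniform (UVS) proposals.

    `weight(x)` is proportional to the desired probability of node x.
    """
    current = rng.choice(nodes)
    samples = []
    for _ in range(num_samples):
        candidate = rng.choice(nodes)  # UVS proposal
        accept_prob = min(weight(candidate) / weight(current), 1.0)
        if rng.random() < accept_prob:
            current = candidate
        samples.append(current)  # on rejection the previous sample is repeated
    return samples


# Hypothetical example: target probability proportional to node degree.
degree = {0: 3, 1: 2, 2: 2, 3: 1}
rng = random.Random(1)
samples = mh_weighted_vertex_sampling(list(degree), degree.get, 100_000, rng)
freq = {x: c / len(samples) for x, c in sorted(Counter(samples).items())}
total_degree = sum(degree.values())
target = {x: d / total_degree for x, d in degree.items()}
```

With enough samples `freq` should be close to `target`, without the sampler ever needing the normalizing constant `total_degree`.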
Node pair sampling based on UVS - All pairs $S$ - Sampling method: select two different nodes $u$ and $v$ uniformly at random - Estimator: given sampled pairs $[u_i, v_i]$, $i = 1, \dots, n$: $\hat{\omega}_k = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(F(u_i, v_i) = a_k)$, $k = 1, \dots, K$ - Accuracy (unbiased): $E[\hat{\omega}_k] = \omega_k$, $k = 1, \dots, K$
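A minimal sketch of the all-pairs estimator, taking $F(u, v)$ to be the number of common neighbors as in the slide's example. The toy graph and function names are hypothetical; the graph is held in memory here purely for illustration.

```python
import random
from collections import Counter


def common_neighbors(adj, u, v):
    """F(u, v): number of common neighbors of u and v."""
    return len(set(adj[u]) & set(adj[v]))


def estimate_all_pairs(adj, n, rng):
    """Estimate the distribution of F over all pairs from n uniformly sampled pairs."""
    nodes = list(adj)
    counts = Counter()
    for _ in range(n):
        u, v = rng.sample(nodes, 2)  # two different nodes, uniformly at random
        counts[common_neighbors(adj, u, v)] += 1
    return {a: c / n for a, c in sorted(counts.items())}


# Hypothetical toy graph: triangle 0-1-2 with a pendant node 3 attached to 0.
# Of the 6 node pairs, 5 have one common neighbor and 1 (the pair 0, 3) has none.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
rng = random.Random(2)
omega_hat = estimate_all_pairs(adj, n=50_000, rng=rng)
```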
Node pair sampling based on UVS - One-hop pairs $S^{(1)}$ - Sampling node pair $[u, v]$: 1) sample node $u$ according to probability distribution $(\pi_u^{(1)} : u \in V)$, where $\pi_u^{(1)} = d_u / (2|E|)$ and $d_u$ is the degree of node $u$; 2) select neighbor $v$ at random - Each $[u, v]$ sampled uniformly from $S^{(1)}$
Node pair sampling based on UVS - One-hop pairs $S^{(1)}$ - Estimator: given sampled pairs $[u_i, v_i]$, $i = 1, \dots, n$: $\hat{\omega}_k^{(1)} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(F(u_i, v_i) = a_k)$, $k = 1, \dots, K$ - Accuracy (unbiased): $E[\hat{\omega}_k^{(1)}] = \omega_k^{(1)}$, $k = 1, \dots, K$
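The one-hop procedure (degree-proportional node, then a uniformly random neighbor) and its estimator can be sketched as below. The sketch assumes the topology is available, so the weighted draw is done directly (the IWVS case); the toy graph and function names are hypothetical. Each ordered edge is drawn with probability $d_u / (2|E|) \cdot 1/d_u = 1/(2|E|)$, which is why no reweighting is needed.

```python
import random
from collections import Counter


def common_neighbors(adj, u, v):
    """F(u, v): number of common neighbors of u and v."""
    return len(set(adj[u]) & set(adj[v]))


def estimate_one_hop(adj, n, rng):
    """Estimate the distribution of F over connected pairs."""
    nodes = list(adj)
    degrees = [len(adj[u]) for u in nodes]
    counts = Counter()
    # Step 1: sample u with probability d_u / (2|E|).
    for u in rng.choices(nodes, weights=degrees, k=n):
        v = rng.choice(adj[u])  # Step 2: uniformly random neighbor of u
        counts[common_neighbors(adj, u, v)] += 1
    return {a: c / n for a, c in sorted(counts.items())}


# Hypothetical toy graph: triangle 0-1-2 with a pendant node 3 attached to 0.
# Of the 4 edges, 3 (the triangle) have one common neighbor; edge 0-3 has none.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
rng = random.Random(3)
omega1_hat = estimate_one_hop(adj, n=50_000, rng=rng)
```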
Node pair sampling based on UVS - Two-hop pairs $S^{(2)}$ - Sampling node pair $[u, v]$: 1) sample node $x$; 2) select two neighbors $u$ and $v$ of node $x$ at random - produces asymptotically unbiased estimate of $\omega_k^{(2)}$, $k = 1, \dots, K$ - tight convergence rate
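The slide does not state the distribution used to sample $x$ or how the estimate is corrected, so the following is only one possible instantiation under stated assumptions, not necessarily the authors' method. It assumes $x$ is drawn with probability proportional to $d_x(d_x - 1)$, so that a pair $[u, v]$ is drawn with probability proportional to its number of common neighbors $c(u, v)$; each sample is then weighted by $1/c(u, v)$ and the weights are normalized. The toy graph and function names are hypothetical.

```python
import random
from collections import Counter


def common_neighbors(adj, u, v):
    """Number of common neighbors of u and v (used both as F and as the weight)."""
    return len(set(adj[u]) & set(adj[v]))


def estimate_two_hop(adj, n, rng):
    """Self-normalized estimate of the distribution of F over two-hop pairs."""
    nodes = [x for x in adj if len(adj[x]) >= 2]
    # Assumption: sample x with probability proportional to d_x * (d_x - 1).
    weights = [len(adj[x]) * (len(adj[x]) - 1) for x in nodes]
    totals = Counter()
    for x in rng.choices(nodes, weights=weights, k=n):
        u, v = rng.sample(adj[x], 2)       # two distinct neighbors of x
        c = common_neighbors(adj, u, v)    # at least 1, since x is a common neighbor
        totals[c] += 1.0 / c               # undo the bias toward pairs with large c
    norm = sum(totals.values())
    return {a: w / norm for a, w in sorted(totals.items())}


# Hypothetical toy graph: square 0-1-2-3 with a pendant node 4 attached to 0.
# Two-hop pairs: [0,2] and [1,3] (two common neighbors), [1,4] and [3,4] (one).
adj = {0: [1, 3, 4], 1: [0, 2], 2: [1, 3], 3: [0, 2], 4: [0]}
rng = random.Random(4)
omega2_hat = estimate_two_hop(adj, n=60_000, rng=rng)
```

Without the $1/c$ weights the pairs with two common neighbors would be counted twice as often as they should be; the normalization makes the estimate consistent rather than exactly unbiased for finite $n$.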
Node pair sampling based on RW - Why? UVS not available, too costly: API not provided; user IDs sparsely distributed - only crawling techniques can be used - random walk: walker moves to random neighbor, samples its information - we saw for connected non-bipartite graph: $\pi_v = d_v / (2|E|)$, $v \in V$
Node pair sampling based on RW - All pairs $S$ - sample node pair $[u_i, v_i]$ by two independent RWs, where $u_i$, $v_i$ are nodes sampled by two RWs at step $i$ - node pair $[u, v]$ sampled according to stationary distribution $\pi_{[u,v]} = d_u d_v / (4|E|^2)$, $u, v \in V$
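The slide gives only the stationary distribution of the sampled pairs. A natural way to turn it into an estimate, shown here as an assumption rather than as the talk's estimator, is to weight each sampled pair by $1/(d_u d_v)$ and normalize. The sketch also skips steps where both walkers sit on the same node, since $S$ contains only pairs of distinct nodes. The toy graph and function names are hypothetical.

```python
import random
from collections import Counter


def common_neighbors(adj, u, v):
    """F(u, v): number of common neighbors of u and v."""
    return len(set(adj[u]) & set(adj[v]))


def estimate_all_pairs_rw(adj, start_u, start_v, steps, rng):
    """Estimate the distribution of F over all pairs using two independent RWs."""
    u, v = start_u, start_v
    totals = Counter()
    for _ in range(steps):
        u = rng.choice(adj[u])  # first walker moves to a random neighbor
        v = rng.choice(adj[v])  # second walker moves independently
        if u == v:
            continue  # not a pair of distinct nodes
        # Assumed correction: a pair is visited with probability proportional
        # to d_u * d_v, so weight it by the inverse.
        totals[common_neighbors(adj, u, v)] += 1.0 / (len(adj[u]) * len(adj[v]))
    norm = sum(totals.values())
    return {a: w / norm for a, w in sorted(totals.items())}


# Hypothetical toy graph: triangle 0-1-2 with a pendant node 3 attached to 0
# (connected and non-bipartite). True values: 1/6 of pairs have no common
# neighbor, 5/6 have exactly one.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
rng = random.Random(5)
omega_rw_hat = estimate_all_pairs_rw(adj, start_u=0, start_v=1, steps=200_000, rng=rng)
```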