
Sampling Large Graphs: Algorithms and Applications (Don Towsley)



  1. Sampling Large Graphs: Algorithms and Applications. Don Towsley, College of Information & Computer Science, UMass Amherst. Collaborators: P.H. Wang, J.C.S. Lui, J.Z. Zhou, X. Guan

  2-6. Measuring, analyzing large networks: large networks can be represented by graphs. Examples: Facebook (1+ billion), WWW (50 billion), Twitter (300 million), eBay (233 million). Curse of data dimensionality!

  7. Challenges in measurement: information distortion. “World Map” in 1459: incomplete (Columbus et al. 1492; Australia, 17th century) and wrong proportions (Africa & Asia). www.flickr.com/

  8-10. Why do we want to understand these networks? We want to understand or find out: how did these networks evolve? Who are the influential users? How does influence propagate? What communities exist in these networks? Etc. [Figure: high school friendship network.]

  11. Goals and challenges. Goals: generate a statistically valid characterization of network structure (node pairs in this work). Challenges: large networks; correcting for biases.

  12-16. How to measure: sampling algorithms. Random sampling (uniform & independent): node sampling, edge sampling. Crawling: breadth-first sampling (BFS), random walk sampling (RW).

  17. Breadth-first search sampling. Orkut data set (Mislove 2007): 3M vertices, 200M edges. [Figure: CCDF; true distribution vs. BFS, depth = 3.] BFS sampling is highly biased, and the bias is difficult to remove.

  18-19. Random walk sampling: bias removal? Markov model: at steady state the walk visits edges uniformly at random (edge sampling). Model: $\theta_i$ = P[node degree = i], $\pi_i$ = P[visited degree = i]. Then $\pi_i = \theta_i \, i / \text{avg degree}$, or $\theta_i = \mathrm{Norm} \cdot \pi_i / i$. [Figure: CCDF of $\theta_i$ and $\pi_i$ under RW sampling.]
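
The degree correction on this slide can be tried on a toy graph. The sketch below is illustrative only: the graph, the function names, and the number of steps are assumptions, not taken from the talk. It runs a simple random walk, which visits nodes in proportion to their degree, and then weights every visit by 1/degree to recover the node degree distribution.

```python
import random
from collections import Counter

def random_walk_degrees(adj, start, steps, rng):
    """Walk `steps` steps from `start`; return the degree of each visited node."""
    node, degrees = start, []
    for _ in range(steps):
        node = rng.choice(adj[node])      # move to a uniformly chosen neighbor
        degrees.append(len(adj[node]))
    return degrees

def reweighted_degree_distribution(visited_degrees):
    """Weight each visit by 1/degree, then normalize (the 'Norm' factor)."""
    weights = Counter()
    for d in visited_degrees:
        weights[d] += 1.0 / d
    norm = sum(weights.values())
    return {d: w / norm for d, w in weights.items()}

# Hypothetical toy graph: triangle 0-1-2 plus a pendant node 3 attached to 0.
# It is connected and non-bipartite, so the walk has a stationary distribution.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
rng = random.Random(0)
samples = random_walk_degrees(adj, start=0, steps=200_000, rng=rng)
theta_hat = reweighted_degree_distribution(samples)
```

In this graph one node has degree 1, two have degree 2, and one has degree 3, so the re-weighted estimate should settle near 0.25, 0.5, and 0.25, while the raw visit frequencies over-represent the degree-3 node.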

  20. Node sampling vs. RW: Orkut. [Figure: two panels, random walk and node sampling; log(CCDF) vs. log(degree).] RW estimates the tail well; node sampling estimates small degrees well.

  21. Focus of talk: measure node pair statistics, which are important for many applications!

  22. Classification of node pairs. Classify node pair [u, v] using the shortest path: 1-hop node pair class if distance(u, v) = 1; 2-hop node pair class if distance(u, v) = 2; …

  23. Homophily: tendency of users to connect to others with common interests (P. Singla and M. Richardson, "Yes, there is a correlation: from social networks to personal behavior on the web," in WWW 2008; MSN). Can infer characteristics and make recommendations. Compare homophily(u, v) between different node pair classes.

  24. Pair similarity: proximity. Proximity(u, v): number of common neighbors of u and v; closeness of u and v. Knowing the proximity distribution of node pairs is important for friendship prediction, interest recommendation, …

  25. Pair similarity: distance. Distance(u, v): length of the shortest path between u and v in the graph. Measure the distance distribution of all node pairs to calculate: average distance (Twitter: 4.1; MSN: 6.6); effective diameter (the 90th percentile of all distances); small world.
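
As a small worked example of these two statistics, the sketch below computes the exact distance distribution of a toy graph with breadth-first search. The graph and function names are hypothetical, and reading "90th percentile" as the smallest distance that covers at least 90% of the pairs is an assumption; other percentile conventions interpolate.

```python
import math
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Shortest-path length from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def pair_distances(adj):
    """Sorted shortest-path lengths over all unordered connected node pairs."""
    dist_from = {v: bfs_distances(adj, v) for v in adj}
    return sorted(dist_from[v][w] for v, w in combinations(adj, 2)
                  if w in dist_from[v])

def effective_diameter(distances, q=0.9):
    """Smallest distance d such that at least a fraction q of pairs are within d."""
    return distances[math.ceil(q * len(distances)) - 1]

# Hypothetical toy graph: a path 0-1-2-3-4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
distances = pair_distances(adj)
average_distance = sum(distances) / len(distances)
eff_diam = effective_diameter(distances)
```

Running BFS from every node is only feasible on small graphs, which is the reason the talk turns to sampling for graphs with hundreds of millions of nodes.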

  26. Problem formulation. Undirected graph $H = (W, F)$. Measure node pair characteristics in the following sets: all pairs, $T = \{[v, w] : v, w \in W,\ v \neq w\}$; one-hop pairs (pairs of connected nodes), $T^{(1)} = \{[v, w] : (v, w) \in F\}$; two-hop pairs (pairs of nodes with at least one common neighbor), $T^{(2)} = \{[v, w] : v, w \in W,\ v \neq w;\ \exists y \in W \text{ such that } (y, v), (y, w) \in F\}$.
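
To make the three pair sets concrete, here is a minimal sketch that enumerates them on a toy graph. The graph and the function name `pair_sets` are illustrative assumptions. Note that, following the definition on the slide, a pair of adjacent nodes also counts as a two-hop pair whenever the two nodes share a neighbor.

```python
from itertools import combinations

def pair_sets(adj):
    """Return (all pairs, one-hop pairs, two-hop pairs) as sets of sorted tuples."""
    nodes = sorted(adj)
    all_pairs = set(combinations(nodes, 2))
    one_hop = {(v, w) for v, w in all_pairs if w in adj[v]}      # connected pairs
    two_hop = {(v, w) for v, w in all_pairs if adj[v] & adj[w]}  # share a neighbor
    return all_pairs, one_hop, two_hop

# Hypothetical toy graph: triangle 0-1-2 plus a pendant node 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
all_pairs, one_hop, two_hop = pair_sets(adj)
```

Here the pair (0, 3) is one-hop but not two-hop, because nodes 0 and 3 have no common neighbor, while (1, 3) is two-hop only.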

  27. Problem formulation. $G(v, w)$: similarity of the node pair under study, e.g., number of common neighbors of $v$, $w$. $\{b_1, \dots, b_L\}$: range of $G(v, w)$. Distribution of $G(v, w)$ over $T$: $(\partial_1, \dots, \partial_L)$; over $T^{(1)}$: $(\partial_1^{(1)}, \dots, \partial_L^{(1)})$; over $T^{(2)}$: $(\partial_1^{(2)}, \dots, \partial_L^{(2)})$. Here $\partial_l$, $\partial_l^{(1)}$, $\partial_l^{(2)}$ are the fractions of node pairs in $T$, $T^{(1)}$, $T^{(2)}$ with property $G(v, w) = b_l$.

  28. Challenges. OSNs are large: Facebook, Google+, Twitter, LinkedIn, …, $|W| > 500$ million users; huge number of node pairs, $|W|^2 > 10^{16}$. Topology not available ⇒ sampling required. UVS (Uniform Vertex Sampling): unbiased for $T$; sampling bias for $T^{(1)}$, $T^{(2)}$; sometimes UVS is not allowed. Crawling (RW): sampling bias. Need to construct unbiased estimates.

  29. Node pair sampling based on UVS: basic sampling techniques. UVS: sample nodes from $W$ uniformly. Weighted vertex sampling (WVS): sample nodes from $W$ with desired probability distribution $(\rho_y : y \in W)$. Independent WVS (IWVS), if we have the topology. Metropolis-Hastings WVS (MHWVS), if not: at each step, MHWVS selects a node $w$ using UVS and accepts the sample with probability $\min(\rho_w/\rho_v, 1)$, where $v$ is the previous sample; otherwise it tries again.
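
The Metropolis-Hastings step can be sketched as follows. This is one reading of the slide, not the authors' code: it assumes the standard Metropolis-Hastings rule in which a rejected proposal repeats the previous sample, and the function name, the `weight` callback, and the toy degree table are hypothetical. Only ratios of weights are used, so the target distribution does not need to be normalized.

```python
import random
from collections import Counter

def mh_weighted_vertex_sampling(nodes, weight, n_samples, rng):
    """Draw nodes with probability proportional to weight(node), using only UVS."""
    current = rng.choice(nodes)               # initial state drawn by UVS
    samples = []
    for _ in range(n_samples):
        candidate = rng.choice(nodes)         # UVS proposal
        accept_prob = min(weight(candidate) / weight(current), 1.0)
        if rng.random() < accept_prob:
            current = candidate
        samples.append(current)               # a rejection repeats the previous sample
    return samples

# Hypothetical target: probability proportional to node degree (all weights > 0).
degree = {0: 3, 1: 2, 2: 2, 3: 1}
rng = random.Random(1)
n = 100_000
samples = mh_weighted_vertex_sampling(list(degree), degree.get, n, rng)
freq = {node: count / n for node, count in Counter(samples).items()}
```

With these weights the empirical frequencies should approach 3/8, 2/8, 2/8, and 1/8. Consecutive samples are correlated, so more draws are needed than with independent weighted sampling.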

  30. Node pair sampling based on UVS: all pairs $T$. Sampling method: select two different nodes $v$ and $w$ uniformly at random. Estimator: given sampled pairs $[v_j, w_j]$, $j = 1, \dots, n$, $\hat{\partial}_l = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}(G(v_j, w_j) = b_l)$, $l = 1, \dots, L$. Accuracy (unbiased): $\mathbb{E}[\hat{\partial}_l] = \partial_l$, $l = 1, \dots, L$.
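
A minimal sketch of this all-pairs estimator, assuming the similarity is the number of common neighbors and that the graph is small enough to hold in memory for the check. The toy graph and function names are illustrative, not from the talk.

```python
import random
from collections import Counter

def common_neighbors(adj, v, w):
    """Similarity used here: number of common neighbors of v and w."""
    return len(adj[v] & adj[w])

def estimate_all_pairs_distribution(adj, n_samples, rng, similarity=common_neighbors):
    """Fraction of uniformly sampled node pairs taking each similarity value."""
    nodes = list(adj)
    counts = Counter()
    for _ in range(n_samples):
        v, w = rng.sample(nodes, 2)           # two different nodes, uniformly at random
        counts[similarity(adj, v, w)] += 1
    return {value: c / n_samples for value, c in counts.items()}

# Hypothetical toy graph: triangle 0-1-2 plus a pendant node 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
estimate = estimate_all_pairs_distribution(adj, 50_000, random.Random(0))
```

In this graph five of the six node pairs have exactly one common neighbor and one pair, (0, 3), has none, so the estimate should be close to 5/6 and 1/6.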

  31-32. Node pair sampling based on UVS: one-hop pairs $T^{(1)}$. Sampling node pair $[v, w]$: 1) sample node $v$ according to the probability distribution $(\rho_v^{(1)} : v \in W)$, where $\rho_v^{(1)} = e_v / (2|F|)$ and $e_v$ is the degree of node $v$; 2) select a neighbor $w$ of $v$ at random. Each $[v, w]$ is sampled uniformly from $T^{(1)}$.

  33. Node pair sampling based on UVS: one-hop pairs $T^{(1)}$. Estimator: given sampled pairs $[v_j, w_j]$, $j = 1, \dots, n$, $\hat{\partial}_l^{(1)} = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}(G(v_j, w_j) = b_l)$, $l = 1, \dots, L$. Accuracy (unbiased): $\mathbb{E}[\hat{\partial}_l^{(1)}] = \partial_l^{(1)}$, $l = 1, \dots, L$.
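
The claim that degree-weighted node sampling followed by a uniform neighbor choice yields uniformly distributed one-hop pairs can be checked empirically. The sketch below assumes the topology is available, so the degree-weighted draw is done directly (the independent WVS case); the toy graph and names are hypothetical.

```python
import random
from collections import Counter

def sample_one_hop_pair(adj, rng):
    """Pick v with probability degree(v) / (2 * #edges), then a uniform neighbor w."""
    nodes = list(adj)
    degrees = [len(adj[v]) for v in nodes]
    v = rng.choices(nodes, weights=degrees, k=1)[0]
    w = rng.choice(adj[v])
    return v, w

# Hypothetical toy graph with 4 edges: triangle 0-1-2 plus pendant node 3.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
rng = random.Random(2)
n = 40_000
edge_counts = Counter(frozenset(sample_one_hop_pair(adj, rng)) for _ in range(n))
edge_freq = {tuple(sorted(edge)): c / n for edge, c in edge_counts.items()}
```

Each ordered pair (v, w) is drawn with probability (degree(v) / (2 · #edges)) · (1 / degree(v)) = 1 / (2 · #edges), so every undirected edge should appear with frequency close to 1/4 here, and the plain sample average from the slide needs no re-weighting.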

  34-35. Node pair sampling based on UVS: two-hop pairs $T^{(2)}$. Sampling node pair $[v, w]$: 1) sample node $y$; 2) select two neighbors $v$ and $w$ of node $y$ at random. Produces an asymptotically unbiased estimate of $\partial_l^{(2)}$, $l = 1, \dots, L$; tight convergence rate.
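
The slide text does not say how the middle node is drawn or how the samples are re-weighted, so the following is only one possible construction, stated as an assumption rather than as the method from the talk. If the middle node is drawn with probability proportional to degree × (degree − 1) and two of its neighbors are picked uniformly, a two-hop pair is produced with probability proportional to its number of common neighbors; weighting each sample by the reciprocal of that number and normalizing then gives an asymptotically unbiased estimate. All names and the toy graph are hypothetical.

```python
import random
from collections import defaultdict

def estimate_two_hop_distribution(adj, n_samples, rng):
    """Estimate the common-neighbor distribution over pairs sharing a neighbor."""
    nodes = [y for y in adj if len(adj[y]) >= 2]
    # Assumed sampling weights: number of ordered neighbor pairs of y.
    weights = [len(adj[y]) * (len(adj[y]) - 1) for y in nodes]
    mass = defaultdict(float)
    for _ in range(n_samples):
        y = rng.choices(nodes, weights=weights, k=1)[0]
        v, w = rng.sample(sorted(adj[y]), 2)  # two distinct neighbors of y
        c = len(adj[v] & adj[w])              # >= 1, because y is a common neighbor
        mass[c] += 1.0 / c                    # undo the bias toward large c
    total = sum(mass.values())
    return {value: m / total for value, m in mass.items()}

# Hypothetical toy graph with edges 0-1, 0-2, 1-2, 1-3, 2-3, 3-4.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
estimate = estimate_two_hop_distribution(adj, 50_000, random.Random(3))
```

In this graph eight pairs share at least one neighbor: six of them share exactly one and two share exactly two, so the estimate should approach 0.75 and 0.25. Without the 1/c weights the raw sample frequencies would be 0.6 and 0.4.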

  36. Node pair sampling based on RW. Why? UVS not available, too costly; API not provided; user IDs sparsely distributed; only crawling techniques can be used. Random walk: the walker moves to a random neighbor and samples its information. We saw that for a connected non-bipartite graph, $\rho_w = e_w / (2|F|)$, $w \in W$.

  37. Node pair sampling based on RW: all pairs $T$. Sample node pair $[v_j, w_j]$ by two independent RWs, where $v_j$ and $w_j$ are the nodes sampled by the two RWs at step $j$. Node pair $[v, w]$ is sampled according to the stationary distribution $\rho_{[v,w]} = e_v e_w / (4|F|^2)$, $v, w \in W$.
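
The pair distribution on this slide follows from the independence of the two walks. Spelled out with the slide's symbols (stationary probability $\rho$, degree $e$, edge set $F$, node set $W$), and using the fact that the degrees sum to $2|F|$:

```latex
\rho_{[v,w]} \;=\; \rho_v \,\rho_w
  \;=\; \frac{e_v}{2|F|} \cdot \frac{e_w}{2|F|}
  \;=\; \frac{e_v\, e_w}{4|F|^2},
\qquad
\sum_{v \in W} \sum_{w \in W} \rho_{[v,w]}
  \;=\; \Bigl( \sum_{v \in W} \frac{e_v}{2|F|} \Bigr)^{2} \;=\; 1 .
```

Pairs of high-degree nodes are therefore over-represented by the factor $e_v e_w$. The slide text stops before the estimator, so as an assumption only: a standard importance-weighting correction would weight each sampled pair in proportion to $1/(e_v e_w)$ and then normalize.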
