
DS504/CS586: Big Data Analytics, Graph Mining. Prof. Yanhua Li



  1. Welcome to DS504/CS586: Big Data Analytics, Graph Mining. Prof. Yanhua Li. Time: 6:00pm–8:50pm R. Location: AK232. Fall 2016.

  2. Graph Data: Social Networks. Facebook social graph: 4 degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]. (J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

  3. Graph Data: Media Networks. Connections between political blogs: polarization of the network [Adamic-Glance, 2005].

  4. Graph Data: Information Nets. Citation networks and maps of science [Börner et al., 2012].

  5. Graph Data: Communication Nets. Internet. [figure: a router connecting domain1, domain2, and domain3]

  6. Graph Data: Topological Networks. Seven Bridges of Königsberg [Euler, 1735]: return to the starting point by traveling each link of the graph once and only once.

  7. Graph representation of networks
      - Undirected links: friendship, resistance, co-authorship.
      - Directed links: following, one-way road, wireless channel, hyperlinks.
      - Signed links (+/-): friend & foe, trust & distrust, repulsion & cohesion.
      - Multi-relational links.

  8. Mining in Big Graphs
      - Network statistic analysis (this lecture): network size, degree distribution.
      - Node ranking (next lecture): identifying the most influential nodes; viral marketing, resource allocation.

  9. Graph Data: Social Networks. Facebook social graph: 4 degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011].

  10. Sampling graphs
      - Random sampling (uniform & independent): vertex sampling, edge sampling.
      - Crawling: BFS sampling, random walk sampling.

  11. Random Walks on Graphs. Random walk: random walk sampling, routing, molecule in liquid, influence diffusion.

  12. Undirected Graphs. Undirected! [figure: example graph on nodes 1 to 6]

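  13. Random Walk
      - Adjacency matrix $A$ (symmetric, since the graph is undirected) and degree matrix $D$ for the example graph on nodes 1 to 4:
        $$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix}$$
      - Transition probability matrix: $P_{ij} = 1/k_i$ if $j$ is a neighbor of $i$ and $0$ otherwise, where $k_i$ is the degree of node $i$:
        $$P = D^{-1} A = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix}$$
      - $|E|$: number of links.
      - Stationary distribution: $\pi_i = \dfrac{d_i}{2|E|}$, where $d_i = k_i$ is the degree of node $i$.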
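The matrices on this slide can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the variable names are illustrative and not part of the lecture. It builds the row-stochastic matrix $P_{ij} = A_{ij}/k_i$ for the 4-node example and verifies that $\pi_i = d_i / (2|E|)$ is left unchanged by one step of the walk.

```python
import numpy as np

# Adjacency matrix of the 4-node example (edges 1-2, 1-3, 1-4, 2-3, 3-4).
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

degrees = A.sum(axis=1)         # k_i = [3, 2, 3, 2]
num_edges = A.sum() / 2         # |E| = 5, each edge is counted twice in A

# Row-stochastic transition matrix: P_ij = A_ij / k_i.
P = A / degrees[:, None]

# Candidate stationary distribution pi_i = d_i / (2|E|).
pi = degrees / (2 * num_edges)

# pi is stationary if one step of the walk leaves it unchanged: pi P = pi.
is_stationary = np.allclose(pi @ P, pi)
print(P)
print(pi, is_stationary)
```

The check works because each node $i$ receives probability mass $\pi_j / k_j = 1/(2|E|)$ from each of its $d_i$ neighbors.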

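  14. Metropolis-Hastings Random Walk
      - Adjacency matrix $A$ (symmetric, since the graph is undirected) and degree matrix $D$ for the example graph on nodes 1 to 4:
        $$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix}$$
      - Transition probability matrix:
        $$P^{MH}_{\upsilon,w} = \begin{cases} \dfrac{1}{k_\upsilon} \min\left(1, \dfrac{k_\upsilon}{k_w}\right) & \text{if } w \text{ is a neighbor of } \upsilon \\ 1 - \sum_{y \neq \upsilon} P^{MH}_{\upsilon,y} & \text{if } w = \upsilon \end{cases}$$
        $$P^{MH} = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/3 & 0 & 1/3 & 1/3 \end{pmatrix}$$
      - $|E|$: number of links.
      - Stationary distribution: $\pi_\upsilon = \dfrac{1}{|V|}$.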
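As a sanity check on the Metropolis-Hastings matrix, here is a small sketch, assuming NumPy; the function name `mh_transition_matrix` is hypothetical. It applies the two-case formula from the slide to the same 4-node adjacency matrix and confirms that the uniform distribution $1/|V|$ is stationary.

```python
import numpy as np

def mh_transition_matrix(A):
    """Build P^MH from a symmetric 0/1 adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                  # node degrees
    n = len(k)
    P = np.zeros((n, n))
    for v in range(n):
        for w in range(n):
            if v != w and A[v, w] > 0:
                # Propose w with probability 1/k_v, accept with min(1, k_v/k_w).
                P[v, w] = (1.0 / k[v]) * min(1.0, k[v] / k[w])
        # Rejected proposals keep the walk at v (self-loop).
        P[v, v] = 1.0 - P[v].sum()
    return P

A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]
P_mh = mh_transition_matrix(A)
uniform = np.full(4, 0.25)             # pi = 1/|V|
print(P_mh)
print(np.allclose(uniform @ P_mh, uniform))
```

Because $P^{MH}$ is symmetric here, its columns also sum to 1, which is exactly why the uniform distribution is preserved.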

  15. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. Minas Gjoka, Maciej Kurant‡, Carter Butts, Athina Markopoulou. UC Irvine, EPFL‡. (Minas Gjoka, UC Irvine, Walking in Facebook)

  16. Outline
      - Motivation and Problem Statement
      - Sampling Methodology
      - Data Analysis
      - Conclusion

  17. Online Social Networks (OSNs)
      - A network of declared friendships between users.
      - Allows users to maintain relationships.
      - Many popular OSNs with different focus: Facebook, LinkedIn, Flickr, …
      - [figure: social graph with nodes A to H]

  18. Why Sample OSNs?
      - Representative samples are desirable: to study properties and to test algorithms.
      - Obtaining the complete dataset is difficult: companies are usually unwilling to share data, and measuring everything carries tremendous overhead (~100TB for Facebook).

  19. Problem statement
      - Obtain a representative sample of users in a given OSN by exploration of the social graph.
      - In this work we sample Facebook (FB).
      - Explore the graph using various crawling techniques.

  20. Related Work
      - Graph traversal (BFS): A. Mislove et al., IMC 2007; Y. Ahn et al., WWW 2007; C. Wilson et al., EuroSys 2009.
      - Random walks (MHRW, RDS): M. Henzinger et al., WWW 2000; D. Stutzbach et al., IMC 2006; A. Rasti et al., Mini Infocom 2009.

  21. Outline
      - Motivation and Problem Statement
      - Sampling Methodology: crawling methods, data collection, convergence evaluation, method comparisons
      - Data Analysis
      - Conclusion

  22. (1) Breadth-First-Search (BFS)
      - Starting from a seed, BFS explores all neighbor nodes. The process continues iteratively, without replacement.
      - BFS leads to bias towards high-degree nodes. Lee et al., “Statistical properties of Sampled Networks”, Phys Review E, 2006.
      - Early measurement studies of OSNs use BFS as the primary sampling technique, e.g. [Mislove et al.], [Ahn et al.], [Wilson et al.].
      - [figure: social graph with nodes A to H, marked as unexplored, explored, or visited]
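The BFS crawl described on this slide can be sketched as follows. This is an illustrative sketch only: the function name `bfs_sample`, the `budget` parameter, and the toy graph are assumptions, not part of the original study, which crawled Facebook rather than an in-memory dictionary.

```python
from collections import deque

def bfs_sample(neighbors, seed, budget):
    """Collect up to `budget` distinct nodes in breadth-first order from `seed`.

    `neighbors` maps each node to a list of its adjacent nodes.
    """
    visited = [seed]          # sampled nodes, in the order they were reached
    seen = {seed}             # "without replacement": each node is taken once
    queue = deque([seed])
    while queue and len(visited) < budget:
        node = queue.popleft()
        for nb in neighbors[node]:
            if nb not in seen:
                seen.add(nb)
                visited.append(nb)
                queue.append(nb)
                if len(visited) >= budget:
                    break
    return visited

# Hypothetical toy graph: A is a hub, F is a low-degree node far from the seed.
toy_graph = {
    "A": ["B", "C", "D"],
    "B": ["A", "E"],
    "C": ["A"],
    "D": ["A"],
    "E": ["B", "F"],
    "F": ["E"],
}
print(bfs_sample(toy_graph, "A", 4))
```

When the budget runs out before the whole graph is covered, the sample is the neighborhood of the seed, which is one intuition for why an incomplete BFS does not yield a representative sample.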

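  23. (2) Random Walk (RW)
      - Explores the graph one node at a time, with replacement.
      - Transition probability from the current node $\upsilon$ to the next candidate $w$: $P^{RW}_{\upsilon,w} = \dfrac{1}{k_\upsilon}$, where $k_\upsilon$ is the degree of node $\upsilon$.
      - In the stationary distribution: $\pi_\upsilon = \dfrac{k_\upsilon}{2 \cdot |E|}$, where $|E|$ is the number of edges.
      - [figure: social graph with nodes A to H; from the current node A, each neighbor is chosen with probability 1/3]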
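A simple simulation illustrates the degree bias of the plain random walk. The sketch below is an assumption-laden toy: it reuses the 4-node example graph from the earlier lecture slides (not the Facebook graph), and the function name `random_walk` and the step count are illustrative. Visit frequencies should approach $k_\upsilon / (2|E|)$, i.e. 0.3, 0.2, 0.3, 0.2.

```python
import random
from collections import Counter

def random_walk(neighbors, start, steps, rng):
    """Simple random walk: move to a uniformly chosen neighbor at every step."""
    node = start
    visits = []
    for _ in range(steps):
        node = rng.choice(neighbors[node])   # P_{v,w} = 1 / k_v
        visits.append(node)                  # nodes may repeat (with replacement)
    return visits

# 4-node example graph: degrees 3, 2, 3, 2 and |E| = 5.
EXAMPLE_GRAPH = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}

walk = random_walk(EXAMPLE_GRAPH, start=1, steps=100_000, rng=random.Random(0))
counts = Counter(walk)
for node in sorted(EXAMPLE_GRAPH):
    empirical = counts[node] / len(walk)
    theoretical = len(EXAMPLE_GRAPH[node]) / (2 * 5)
    print(node, round(empirical, 3), theoretical)
```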

  24. (3) Re-Weighted Random Walk (RWRW): Hansen-Hurwitz estimator
      - Corrects for degree bias at the end of collection.
      - Without re-weighting, the probability distribution for node property $A$ is
        $$p(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1} = \frac{|A_i|}{|V|}$$
      - Re-weighted probability distribution:
        $$p(A_i) = \frac{\sum_{u \in A_i} 1/k_u}{\sum_{u \in V} 1/k_u}$$
      - Here $A_i$ is the subset of sampled nodes with value $i$, $V$ is the set of all sampled nodes, and $k_u$ is the degree of node $u$.
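The two estimators on this slide differ only in the weight given to each sample. A minimal sketch, assuming the walk output is available as a list of `(value, degree)` pairs; that representation and the function names are hypothetical choices for illustration.

```python
from collections import defaultdict

def unweighted_estimate(samples):
    """p(A_i) = |A_i| / |V| over the sampled nodes (biased for a plain RW)."""
    counts = defaultdict(float)
    for value, _degree in samples:
        counts[value] += 1.0
    total = sum(counts.values())
    return {value: c / total for value, c in counts.items()}

def reweighted_estimate(samples):
    """Hansen-Hurwitz style estimate: each sample u is weighted by 1 / k_u."""
    weights = defaultdict(float)
    for value, degree in samples:
        weights[value] += 1.0 / degree
    total = sum(weights.values())
    return {value: w / total for value, w in weights.items()}

# Toy walk output: the degree-2 node with value "y" was visited twice as often
# as the degree-1 node with value "x", as expected from pi proportional to degree.
samples = [("x", 1), ("y", 2), ("y", 2)]
print(unweighted_estimate(samples))   # over-represents "y"
print(reweighted_estimate(samples))   # degree bias removed
```

Dividing by $k_u$ cancels the factor $k_u / (2|E|)$ with which the random walk visits node $u$, and the unknown constant $2|E|$ drops out in the ratio.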

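  25. (4) Metropolis-Hastings Random Walk (MHRW)
      - Explores the graph one node at a time, with replacement:
        $$P^{MH}_{\upsilon,w} = \begin{cases} \dfrac{1}{k_\upsilon} \min\left(1, \dfrac{k_\upsilon}{k_w}\right) & \text{if } w \text{ is a neighbor of } \upsilon \\ 1 - \sum_{y \neq \upsilon} P^{MH}_{\upsilon,y} & \text{if } w = \upsilon \end{cases}$$
      - Example with current node A and next candidate C: $P^{MH}_{AC} = \dfrac{1}{3} \cdot \dfrac{3}{5} = \dfrac{1}{5}$ and $P^{MH}_{AA} = 1 - \left(\dfrac{1}{3} + \dfrac{1}{3} + \dfrac{1}{5}\right) = \dfrac{2}{15}$.
      - In the stationary distribution: $\pi_\upsilon = \dfrac{1}{|V|}$.
      - [figure: social graph with nodes A to H; from the current node A the walk moves with probabilities 1/3, 1/3, and 1/5, and stays at A with probability 2/15]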
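The slide's arithmetic and the walk itself can both be sketched in a few lines. Everything named here is an illustrative assumption: `mh_move_probability` and `mhrw` are hypothetical helper names, the two neighbors of A other than C are assumed to have degree at most 3 (any such degrees give the 1/3 shown on the slide), and the 4-node graph is the lecture's earlier toy example rather than Facebook.

```python
import random
from collections import Counter
from fractions import Fraction

def mh_move_probability(k_v, k_w):
    """P^MH_{v,w} for a neighbor w of v: (1/k_v) * min(1, k_v/k_w)."""
    return Fraction(1, k_v) * min(Fraction(1), Fraction(k_v, k_w))

# Slide example: A has degree 3, C has degree 5; the other two neighbors
# are assumed to have degrees 3 and 2.
p_AC = mh_move_probability(3, 5)
p_AA = 1 - (p_AC + mh_move_probability(3, 3) + mh_move_probability(3, 2))

def mhrw(neighbors, start, steps, rng):
    """Metropolis-Hastings random walk targeting the uniform distribution."""
    node = start
    visits = []
    for _ in range(steps):
        candidate = rng.choice(neighbors[node])
        accept = min(1.0, len(neighbors[node]) / len(neighbors[candidate]))
        if rng.random() < accept:
            node = candidate          # accepted: move to the candidate
        visits.append(node)           # rejected: the current node is sampled again
    return visits

EXAMPLE_GRAPH = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}
walk = mhrw(EXAMPLE_GRAPH, start=1, steps=100_000, rng=random.Random(0))
counts = Counter(walk)
print(p_AC, p_AA)
print({node: round(counts[node] / len(walk), 3) for node in sorted(EXAMPLE_GRAPH)})
```

Note that a rejected proposal still produces a sample (the current node again); dropping those repeats would reintroduce the degree bias.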

  26. Uniform userID Sampling (UNI)
      - As a basis for comparison, we collect a uniform sample of Facebook userIDs (UNI), using rejection sampling on the 32-bit userID space.
      - UNI is not a general solution for sampling OSNs: the userID space must not be sparse, and some OSNs use names instead of numbers.
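Rejection sampling on an ID space can be sketched as below. This is a toy illustration under stated assumptions: `is_valid` is a hypothetical callback standing in for "does this userID belong to an existing account" (in the study this required querying Facebook), draws are made with replacement, and the 8-bit toy ID space replaces the 32-bit one.

```python
import random

def uniform_userid_sample(is_valid, n_samples, rng, id_bits=32):
    """Draw IDs uniformly from [0, 2**id_bits) and keep only the valid ones.

    Every valid ID is accepted with the same probability, so the accepted
    IDs are a uniform sample of the valid users.
    """
    samples = []
    attempts = 0
    while len(samples) < n_samples:
        candidate = rng.randrange(2 ** id_bits)
        attempts += 1
        if is_valid(candidate):        # reject IDs that map to no user
            samples.append(candidate)
    return samples, attempts

# Toy ID space: 8-bit IDs, of which only multiples of 5 are allocated.
allocated = set(range(0, 256, 5))
ids, attempts = uniform_userid_sample(
    lambda uid: uid in allocated, n_samples=50, rng=random.Random(1), id_bits=8
)
print(len(ids), attempts)
```

The ratio of attempts to accepted samples grows as the ID space gets sparser, which is the practical reason the slide requires a space that is not sparse.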

  27. Summary of Datasets

      | Sampling method | MHRW   | RW     | BFS    | UNI  |
      |-----------------|--------|--------|--------|------|
      | # Valid users   | 28x81K | 28x81K | 28x81K | 984K |
      | # Unique users  | 957K   | 2.19M  | 2.20M  | 984K |

      - Egonets for a subsample of MHRW: local properties of nodes.
      - Datasets available at: http://odysseas.calit2.uci.edu/research/osn.html

  28. Data Collection: Basic Node Information
      - What information do we collect for each sampled node $u$?
      - For $u$ itself: UserID, Name, Networks, Privacy settings.
      - Friend list: UserID, Name, Networks, and Privacy settings of each friend.
      - Networks: Regional, School/Workplace.
      - Privacy settings (four bits, e.g. 1 1 1 1): Send Message, View Friends, Profile Photo, Add as Friend.

  29. Detecting Convergence. How many samples (iterations) are needed to lose dependence on the starting point?

  30. Online Convergence Diagnostics: Geweke
      - Detects convergence for a single walk. Let $X$ be a sequence of samples for the metric of interest, with $X_a$ an early segment and $X_b$ a late segment of the walk:
        $$z = \frac{E(X_a) - E(X_b)}{\sqrt{Var(X_a) + Var(X_b)}}$$
      - J. Geweke, “Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments”, in Bayesian Statistics 4, 1992.
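A rough sketch of the Geweke z-score follows, assuming NumPy. Several choices here are assumptions not stated on the slide: the segments are the first 10% and last 50% of the walk (a common default), and $Var(\cdot)$ is read as the variance of the segment mean, estimated naively as sample variance divided by segment length. That naive estimate ignores autocorrelation; Geweke's original diagnostic uses spectral density estimates instead, so treat this as an illustration rather than a faithful implementation.

```python
import numpy as np

def geweke_z(x, first=0.1, last=0.5):
    """Compare the mean of an early segment X_a with a late segment X_b."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xa = x[: int(first * n)]            # early part of the walk
    xb = x[n - int(last * n):]          # late part of the walk
    # Naive variance of each segment mean (assumes weakly correlated samples).
    var_a = xa.var(ddof=1) / len(xa)
    var_b = xb.var(ddof=1) / len(xb)
    return (xa.mean() - xb.mean()) / np.sqrt(var_a + var_b)

rng = np.random.default_rng(0)
stationary = rng.normal(size=5000)       # no drift: z should be small
drifting = np.linspace(0.0, 10.0, 5000)  # steady drift: |z| should be large
print(geweke_z(stationary), geweke_z(drifting))
```

A walk is usually declared converged for this metric once $z$ stays within a band such as $[-1, 1]$, a threshold that is itself a convention rather than something fixed by the formula.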

  31. Online Convergence Diagnostics: Gelman-Rubin
      - Detects convergence for $m > 1$ walks of length $n$, where $B$ is the between-walks variance and $W$ is the within-walks variance:
        $$R = \frac{n-1}{n} + \frac{m+1}{mn} \cdot \frac{B}{W}$$
      - [figure: three walks (Walk 1, Walk 2, Walk 3) illustrating between-walks and within-walks variance]
      - A. Gelman, D. Rubin, “Inference from iterative simulation using multiple sequences”, in Statistical Science, Volume 7, 1992.
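The Gelman-Rubin score can be computed directly from the formula. The sketch below assumes NumPy and the standard definitions $B = \frac{n}{m-1}\sum_j (\bar{x}_j - \bar{x})^2$ and $W$ = mean of the per-walk sample variances; the slide names $B$ and $W$ but does not spell these out, so the definitions are an assumption. The function name `gelman_rubin` is illustrative.

```python
import numpy as np

def gelman_rubin(chains):
    """R = (n-1)/n + (m+1)/(m*n) * B/W for m walks of equal length n."""
    chains = np.asarray(chains, dtype=float)     # shape (m, n)
    m, n = chains.shape
    walk_means = chains.mean(axis=1)
    B = n * walk_means.var(ddof=1)               # between-walks variance
    W = chains.var(axis=1, ddof=1).mean()        # within-walks variance
    return (n - 1) / n + (m + 1) / (m * n) * B / W

rng = np.random.default_rng(0)
mixed = rng.normal(size=(3, 2000))                      # walks agree with each other
stuck = mixed + np.array([[0.0], [5.0], [10.0]])        # walks sit in different regions
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Values of $R$ close to 1 indicate that the walks have forgotten their starting points; values well above 1 indicate that they are still exploring different parts of the graph.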

  32. When do we reach equilibrium? Burn-in determined to be 3K. [figure: convergence plot for node degree]

  33. Methods Comparison: Node Degree (28 crawls)
      - Poor performance for BFS and RW.
      - MHRW and RWRW produce good estimates, both per chain and overall.

  34. Sampling Bias: BFS
      - Low-degree nodes are under-represented by two orders of magnitude.
      - BFS is biased.
