Generating Large Graphs for Benchmarking
Ali Pinar, Tamara G. Kolda, C. Seshadhri, Todd Plantenga
U.S. Department of Energy, Office of Advanced Scientific Computing Research
U.S. Department of Defense, Defense Advanced Research Projects Agency
2/21/2014
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Pinar @ SIAM PP 14

Modeling graphs is a crucial challenge
Our understanding of network structure is still limited.
• We do not have the first principles.
Why model graphs?
• Real data will rarely be available.
• Understanding normal helps in identifying abnormal.
• Benchmarking requires controlled experiments.
Challenges
• Data analysis: Identifying metrics that can help in characterization (e.g., degree distribution, clustering coefficients)
• Theoretical analysis: Understanding the structure inferred by these metrics
• Algorithms: Designing algorithms to compute these metrics, generate graphs, etc.
[Diagram labels: Real Data, Measurements, Inherent Properties, Calibration, Mathematical Generative Model, Generated Data, Measurements, Useful]

A Good Network Model…
Encapsulates underlying driving principles ("Physics")
Captures measurable characteristics of real-world data
• Degree distribution
• Clustering coefficients
• Community structure
• Connectedness, Diameter
• Eigenvalues
Calibrates to specific data sets
• Quantitative vs. qualitative
Surrogate for real data, protecting privacy and security
• Provides results “like” the real data
• Easy to share, reproduce
Yields understanding
• Serves as null model
• Statistical sampling guidance
• Predictive capabilities

Story-driven models
Example: Preferential Attachment (Barabasi & Albert, Science, 1999)
• New nodes join the graph one at a time, in sequence
• Each new node chooses k neighbors, according to degree
• Node degrees updated after each addition: rich get richer!
[Figure: a new node and its new edge(s), k = 1]

Structure-driven models
Example: CL (aka Configuration) (Chung & Lu, PNAS, 2002)
• Desired node degrees specified in advance
• New edges inserted, choosing endpoints by desired degree
• Higher-degree nodes are more likely to be selected
[Figure: nodes labeled with desired degrees, and a new edge]
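
The CL bullets above can be sketched in a few lines. This is a minimal illustration, not the authors' generator: the function name `chung_lu_edges` and the choice of m = (sum of desired degrees) / 2 edge insertions are assumptions made here, and self-loops and repeated edges are left in place rather than filtered.

```python
import random

def chung_lu_edges(desired_degrees, seed=0):
    """Sketch of CL-style edge insertion (hypothetical helper, not from the talk).

    Each edge picks both endpoints independently, with probability
    proportional to the desired degree of each node.
    """
    rng = random.Random(seed)
    nodes = range(len(desired_degrees))
    # Assumption: insert half the total desired degree as edges.
    m = sum(desired_degrees) // 2
    edges = []
    for _ in range(m):
        # Higher-degree nodes are more likely to be selected.
        u, v = rng.choices(nodes, weights=desired_degrees, k=2)
        edges.append((u, v))  # may contain self-loops and duplicates
    return edges

edges = chung_lu_edges([0, 3, 3, 2], seed=1)
```

Because every edge is drawn independently of the others, the loop body can run on any processor without coordination, which is the property the later slides rely on.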

Degree Dist. Measures Connectivity
The degree distribution is one way to characterize a graph.
Barabasi & Albert, Science, 1999: “A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution”
[Figure: example graph with nodes A, B, C, D, E, F, G, H, J, K, L]
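
As a small illustration of what is being measured, the sketch below tallies a degree distribution from an edge list. The function name and the toy edge list are made up for this example (the edges of the figure's graph are not recoverable from the slide), and nodes that appear in no edge are simply not counted.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {degree: number of nodes with that degree} for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Count how many nodes share each degree value.
    return Counter(degree.values())

# Toy graph (hypothetical): a triangle 0-1-2 with a pendant node 3 attached to 2.
dist = degree_distribution([(0, 1), (1, 2), (0, 2), (2, 3)])
```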

Clustering Coeff. Measures Cohesion
The clustering coefficient measures the rate of wedge closure.
In social networks, the clustering coefficients decrease smoothly as the degree increases. High-degree nodes generally have little social cohesion.
[Figure: example graph with nodes A, B, C, D, E, F, G, H, J, K, L]
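
One way to make "rate of wedge closure" concrete: a wedge is a pair of neighbors of the same node, and it is closed when those two neighbors are themselves connected. The sketch below computes the closure rate per degree and over the whole graph. The names are illustrative, and the brute-force pair enumeration is only meant for small inputs.

```python
from collections import defaultdict
from itertools import combinations

def clustering_by_degree(edges):
    """Return ({degree: closed wedges / wedges}, global closure rate)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    wedges = defaultdict(int)
    closed = defaultdict(int)
    for node, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            continue  # a node needs two neighbors to center a wedge
        wedges[d] += d * (d - 1) // 2
        closed[d] += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    per_degree = {d: closed[d] / wedges[d] for d in wedges}
    total = sum(wedges.values())
    global_cc = sum(closed.values()) / total if total else 0.0
    return per_degree, global_cc

# Toy graph (hypothetical): a triangle 0-1-2 with a pendant node 3 attached to 2.
per_degree, global_cc = clustering_by_degree([(0, 1), (1, 2), (0, 2), (2, 3)])
```

On the toy graph the degree-2 nodes close all of their wedges, while the degree-3 node closes one of three, which mirrors the decreasing trend described on the slide.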

Current State-of-the-Art Falls Short

Story-Driven Models
Examples
• Preferential Attachment (Barabasi & Albert, Science 1999)
• Forest Fire (Leskovec, Kleinberg, Faloutsos, KDD 2005)
• Random Walk (Vazquez, Phys. Rev. E 2003)
Pros & Cons
• Poor fits to real data
• Expensive to calibrate to real data
• Do not scale: inherently sequential
Survey: Sala et al., WWW 2010

Structure-Driven Models
Examples
• CL: Chung-Lu; aka Configuration Model, Weighted Erdös-Rényi (PNAS 2002)
• SKG: Stochastic Kronecker Graphs; R-MAT is a special case (Leskovec et al., JMLR 2010; Chakrabarti, Zhan, Faloutsos, SDM 2004)
• Graph 500 Generator!
Pros & Cons
• Do not capture clustering coefficients
• SKG expensive to calibrate
• Scales: generation cost O(m log n)
• CL & SKG very similar in behavior (Pinar, Seshadhri, Kolda, SDM 2012)
[Plot: clustering coefficient vs. degree]

Stochastic Kronecker Graph (SKG) as Graph 500 Generator
Pros
• Only 5 parameters
  • 2x2 generator matrix (sums to 1)
  • n = 2^L = # nodes
  • m = 16n = # edges
• O(m log n) generation cost
• Edge generation fully parallelizable, except de-duplication
Cons
• Oscillations in degree distribution (fixed by adding special noise)
• Limited degree distribution (noisy version is lognormal)
• Half the nodes are isolated!
• Tiny clustering coefficients!

  L    Isolated   d avg
  26   51%        32
  29   57%        37
  32   62%        41
  36   67%        49
  39   71%        55
  42   74%        62

Seshadhri, Pinar, Kolda, Journal of the ACM, April 2012
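
The O(m log n) cost follows from how a single edge is drawn: at each of the L = log2(n) levels, one quadrant of the 2x2 generator matrix is chosen, which fixes one bit of the row index and one bit of the column index. The sketch below shows that recursion. It is an illustration only: the function names are invented here, the generator values in the usage line are placeholders rather than values taken from the slide, and no noise or de-duplication is applied.

```python
import random

def skg_edge(levels, a, b, c, rng):
    """Draw one edge by choosing a quadrant of the 2x2 generator at each level."""
    row = col = 0
    for _ in range(levels):
        r = rng.random()
        row <<= 1
        col <<= 1
        if r < a:
            pass              # top-left quadrant: both bits 0
        elif r < a + b:
            col |= 1          # top-right quadrant
        elif r < a + b + c:
            row |= 1          # bottom-left quadrant
        else:
            row |= 1          # bottom-right quadrant
            col |= 1
    return row, col

def skg_edges(levels, edges_per_node, generator, seed=0):
    """Generate m = edges_per_node * 2**levels edges; duplicates are kept."""
    (a, b), (c, d) = generator
    assert abs(a + b + c + d - 1.0) < 1e-9, "generator entries must sum to 1"
    rng = random.Random(seed)
    m = edges_per_node * (1 << levels)
    return [skg_edge(levels, a, b, c, rng) for _ in range(m)]

# Placeholder generator values, chosen only for illustration.
edges = skg_edges(4, 16, [[0.57, 0.19], [0.19, 0.05]], seed=0)
```

Each call to `skg_edge` depends only on the random stream, so edges can be produced independently; removing repeats (for example with `set(edges)`) is the one step that needs a global view.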

The Physics of Graphs
Random graph:
(1) Formed according to CL Model
(2) “High” clustering coefficient
Thm: Must contain a “substantive” subgraph that is a dense Erdös-Rényi graph.
[Diagram labels: CL Model, Global Clustering Coefficient, Dense Erdös-Rényi Subgraph]
A heavy-tailed network with a high clustering coefficient contains many Erdös-Rényi affinity blocks. (The distribution of the block sizes is also heavy tailed.)
Basic measurements lead to inferences about larger structures (communities) that are consistent with literature.
Seshadhri, Kolda, Pinar, Phys. Rev. E, 2012

BTER: Block Two-Level Erdös-Rényi
Preprocessing
• Create affinity blocks of nodes with (nearly) same degree, determined by degree distribution
• Connectivity per block based on clustering coefficient
• For each node, compute desired
  • within-block degree
  • excess degree
Phase 1
• Erdös-Rényi graphs in each block
• Need to insert extra links to ensure enough unique links per block
Phase 2
• CL model on excess degree (a sort of weighted Erdös-Rényi)
• Creates connections across blocks
Phases 1 and 2 occur independently.
Seshadhri, Kolda, Pinar, Phys. Rev. E, 2012
Kolda, Pinar, Plantenga, Seshadhri, arXiv:1302.6636, Feb. 2013
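
The preprocessing bullets can be sketched as follows. This is a rough illustration under several assumptions that the slide does not state: blocks hold d + 1 nodes, where d is the smallest degree in the block; degree-1 nodes are left out of the blocks; the last block may be short; and the per-block connectivity is supplied by the caller as a hypothetical function `rho(d)` instead of being derived from measured clustering coefficients.

```python
def affinity_blocks(degrees):
    """Group nodes of (nearly) equal degree into blocks of size d_min + 1.

    Assumption: nodes with degree < 2 are not placed in any block.
    """
    order = sorted((v for v, d in enumerate(degrees) if d >= 2),
                   key=lambda v: degrees[v])
    blocks, i = [], 0
    while i < len(order):
        size = degrees[order[i]] + 1   # d_min + 1 nodes per block
        blocks.append(order[i:i + size])
        i += size
    return blocks

def split_degrees(degrees, blocks, rho):
    """Split each desired degree into within-block and excess parts.

    rho(d) is a caller-supplied (hypothetical) connectivity in [0, 1]
    for a block whose smallest degree is d.
    """
    within = [0.0] * len(degrees)
    for block in blocks:
        p = rho(degrees[block[0]])
        for v in block:
            # Expected within-block degree in an Erdos-Renyi block.
            within[v] = p * (len(block) - 1)
    excess = [max(d - w, 0.0) for d, w in zip(degrees, within)]
    return within, excess

degrees = [1, 2, 2, 2, 3, 3, 3, 3, 5]
blocks = affinity_blocks(degrees)
within, excess = split_degrees(degrees, blocks, rho=lambda d: 0.5)
```

The within-block degrees would then drive Phase 1 and the excess degrees Phase 2; since neither phase reads the other's output, both can be generated independently.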

BTER vs. SKG: Co-authorship
[Plots: Degree Distribution; Clustering Coefficients]
SKG & CL lack enough triangles.
SKG parameters from Leskovec et al., JMLR, 2010

BTER vs. SKG: Social Website
[Plots: Degree Distribution; Clustering Coefficients]
Note oscillations in SKG.
SKG parameters from Leskovec et al., JMLR, 2010

Community Structure of BTER Improves Eigenvalue Fit
[Two plots, each titled: Leading E-vals of Adjacency Matrix]

Making BTER Scalable
Requirements:
• Extreme scalability requires independent edge insertion.
• Data structures should be o(|V|) to be duplicated at each processor.
Data Structures:
• Given the degree distribution, compute <block size, #blocks>, which requires O(dmax) memory.
• Given the clustering coefficients, compute the number of edges per block, hence the Phase 1 degrees.
• Given Phase 1 degrees, we can compute residual (Phase 2) degrees.
Challenge: Adjust for repetitions

Adjusting for repeated edges
• Parallel edge insertion leads to multiple (repeated) edges.
• This is negligible if edge probabilities are small.
  • This is the case for SKG and CL,
  • but not for BTER.
• BTER has dense blocks, hence many repeats.
• We add extra edges to guarantee that the number of unique edges is as expected.
  • Coupon collector problem.
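
The coupon-collector adjustment can be worked through numerically. Assuming edges inside a block are drawn uniformly at random, with replacement, from N possible node pairs (an assumption made for this sketch), the expected number of distinct edges after k draws is N(1 - (1 - 1/N)^k). Solving for k gives the number of insertions needed to expect a target number of unique edges. The function names below are illustrative.

```python
import math

def expected_unique(n_slots, draws):
    """Expected number of distinct items after uniform draws with replacement."""
    return n_slots * (1.0 - (1.0 - 1.0 / n_slots) ** draws)

def draws_for_unique(n_slots, n_unique):
    """Smallest k whose expected number of distinct items is at least n_unique."""
    if n_slots < 2 or not 0 <= n_unique < n_slots:
        raise ValueError("need n_slots >= 2 and 0 <= n_unique < n_slots")
    k = math.log(1.0 - n_unique / n_slots) / math.log(1.0 - 1.0 / n_slots)
    return math.ceil(k)

# A block of 10 nodes has 45 possible pairs; suppose 30 unique edges are wanted.
k = draws_for_unique(45, 30)
```

In this example the draw count comes out well above the target of 30, which illustrates why the correction matters for dense blocks while being negligible when edge probabilities are small.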