The Hierarchical Structure of Networks
Aaron Clauset, Santa Fe Institute
4 August 2008
SFI / CAIDA Workshop: Networks and Navigation
First, Some Pictures
social groups or communities: teenage friendships*, research collaborations* (*image stolen from elsewhere)
functional(?) clusters, hierarchies: metabolites*, proteins* (*image stolen from elsewhere)
co-purchasing (topical?) groups: communities of amazon.com books on politics* (*image stolen from elsewhere)
A Question
How can we extract
• structural patterns
• at many scales
• in a rigorous fashion
from complex networks?
What is Structure? some stylized ideas
• no structure
• modular structure (one scale)
• hierarchical structure (multi-scale)
A Question
How can we extract
• hierarchical structure
• in a rigorous fashion
from complex networks?
network data → ? → hierarchy
One Approach
Model-based inference
1. describe how to generate hierarchies (a model)
2. “fit” model to empirical data
3. test “fitted” model
4. extract predictions + insight
A Model of Hierarchy
a dendrogram $D$ with probabilities $\{p_r\}$ at its internal nodes $r$
assortative modules → probability $p_r$
model: “inhomogeneous” random graph
model → instance
$\Pr(i, j \text{ connected}) = p_r = p(\text{lowest common ancestor of } i, j)$
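The generative rule above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the nested-tuple encoding `(p_r, left, right)` for internal nodes, the helper names `leaves` and `sample_graph`, and the example dendrogram `D` with its probabilities are all assumptions made for the sketch.

```python
import itertools
import random

def leaves(tree):
    """Return the leaf labels under a dendrogram node (a leaf is any non-tuple)."""
    if not isinstance(tree, tuple):
        return [tree]
    _, left, right = tree
    return leaves(left) + leaves(right)

def sample_graph(tree, rng):
    """Draw one graph (a set of undirected edges) from the model."""
    if not isinstance(tree, tuple):
        return set()
    p, left, right = tree
    edges = sample_graph(left, rng) | sample_graph(right, rng)
    # Pairs split across the two subtrees have this node as their
    # lowest common ancestor, so each is connected with probability p.
    for i, j in itertools.product(leaves(left), leaves(right)):
        if rng.random() < p:
            edges.add((min(i, j), max(i, j)))
    return edges

# Hypothetical assortative example: dense inside two modules, sparse between them.
D = (0.05, (0.9, (1.0, 0, 1), 2), (0.9, 3, (1.0, 4, 5)))
G = sample_graph(D, random.Random(0))
print(sorted(G))
```

Setting a high probability deep in the tree and a low one at the root gives assortative modules; reversing that pattern gives disassortative structure.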
Model Features
• explicit model = explicit assumptions
• very flexible (many parameters)
• captures structure at all scales
• arbitrary mixtures of assortativity, disassortativity
• learnable directly from data
Learning From Data
• We use a Bayesian approach:
  • likelihood function L = Pr(data | model) scores quality of model
  • sample high-quality models via MCMC
• technical details in arXiv:physics/0610051 and Nature 453, p. 98 (2008)
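As a rough illustration of the likelihood score, the sketch below assumes the product form $L = \prod_r p_r^{E_r} (1 - p_r)^{L_r R_r - E_r}$, where $E_r$ counts observed edges whose lowest common ancestor is $r$ and $L_r$, $R_r$ are the leaf counts of its two subtrees. The nested-tuple dendrogram encoding `(p_r, left, right)` and the function names are assumptions made for the example.

```python
import math

def leaves(tree):
    """Return the leaf labels under a dendrogram node (a leaf is any non-tuple)."""
    if not isinstance(tree, tuple):
        return [tree]
    _, left, right = tree
    return leaves(left) + leaves(right)

def _count_log(count, prob):
    """count * log(prob), treating 0 * log(0) as 0 and log(0) as -infinity."""
    if count == 0:
        return 0.0
    return count * math.log(prob) if prob > 0 else -math.inf

def log_likelihood(tree, edges):
    """Log of Pr(edges | dendrogram, probabilities), summed over internal nodes."""
    if not isinstance(tree, tuple):
        return 0.0
    p, left, right = tree
    left_set, right_set = set(leaves(left)), set(leaves(right))
    # E_r: observed edges that cross between the two subtrees of this node.
    e_r = sum(
        1 for i, j in edges
        if (i in left_set and j in right_set) or (i in right_set and j in left_set)
    )
    n_pairs = len(left_set) * len(right_set)  # L_r * R_r
    here = _count_log(e_r, p) + _count_log(n_pairs - e_r, 1.0 - p)
    return here + log_likelihood(left, edges) + log_likelihood(right, edges)

# Hypothetical three-vertex example.
tree = (0.5, (1.0, 0, 1), 2)
edges = {(0, 1), (0, 2)}
print(log_likelihood(tree, edges))
```

Working in log space keeps the score numerically stable for larger graphs, where the raw product would underflow.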
From Graph to Ensemble
• Given graph $G$
• run MCMC to equilibrium
• then, for each sampled $D$, draw a resampled graph $G'$ from ensemble
A test: do resampled graphs look like original?
Grassland species* [Figure: network with plant, herbivore, and parasite species labeled]
*thank you: Jennifer Dunne
Degree Distribution [Figure: fraction of vertices with degree k vs. degree k, log-log axes; original vs. resampled]
Clustering Coefficient [Figure: fraction of graphs with clustering coefficient c vs. clustering coefficient c; original vs. resampled]
Distance Distribution [Figure: fraction of vertex pairs at distance d vs. distance d, log vertical axis; original vs. resampled]
Missing Links A test: can model predict missing links?
Predicting is Hard
• remove $k$ edges from $G$
• how easy to guess a missing link?
$p_{\text{guess}} \approx \dfrac{k}{\binom{n}{2} - m + k} = O(n^{-2})$
$n = 75$, $m = 113$: $p_{\text{guess}} = k / (2662 + k)$
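The arithmetic behind the baseline can be checked directly. The sketch below uses the slide's values of $n$ and $m$; the function name `p_guess` is just a label chosen for the example.

```python
from math import comb

n, m = 75, 113  # vertices and edges, as given on the slide

def p_guess(k, n=n, m=m):
    """Chance that one uniformly random unconnected pair is one of the k removed edges."""
    # After removing k edges, m - k edges remain, so the number of
    # unconnected pairs is C(n, 2) - (m - k) = C(n, 2) - m + k.
    unconnected_pairs = comb(n, 2) - m + k
    return k / unconnected_pairs

print(comb(n, 2) - m)  # 2662
print(p_guess(1))
```

Because the number of candidate pairs grows like $n^2$ while $k$ is at most $m$, blind guessing becomes hopeless quickly on sparse networks.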
Predicting Missing Links
• Given incomplete graph $G$
• run MCMC to equilibrium
• then, over sampled $D$, compute average $\langle p_r \rangle$ for links $(i, j) \notin G$
• predict links with high $\langle p_r \rangle$ values are missing
Test idea via leave-$k$-out cross-validation
perfect accuracy: AUC = 1
no better than chance: AUC = 1/2
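The two ingredients of this test, averaging the connection probability over sampled dendrograms and scoring the ranking by AUC, might look like the following. It is a sketch under assumptions: dendrograms are encoded as nested tuples `(p_r, left, right)`, the two sampled trees are made up, and AUC is computed as the probability that a removed edge outscores a true non-edge, with ties counted as one half.

```python
def leaves(tree):
    """Return the leaf labels under a dendrogram node (a leaf is any non-tuple)."""
    if not isinstance(tree, tuple):
        return [tree]
    _, left, right = tree
    return leaves(left) + leaves(right)

def connection_prob(tree, i, j):
    """p_r at the lowest common ancestor of two distinct leaves i and j."""
    p, left, right = tree
    left_set = set(leaves(left))
    i_left, j_left = i in left_set, j in left_set
    if i_left != j_left:
        return p  # i and j separate here, so this node is their ancestor
    return connection_prob(left if i_left else right, i, j)

def mean_prob(samples, i, j):
    """Average connection probability for the pair (i, j) over sampled dendrograms."""
    return sum(connection_prob(t, i, j) for t in samples) / len(samples)

def auc(missing_scores, nonedge_scores):
    """Probability that a removed edge outscores a true non-edge (ties count 1/2)."""
    wins = 0.0
    for s_missing in missing_scores:
        for s_non in nonedge_scores:
            if s_missing > s_non:
                wins += 1.0
            elif s_missing == s_non:
                wins += 0.5
    return wins / (len(missing_scores) * len(nonedge_scores))

# Two hypothetical dendrograms standing in for MCMC samples.
samples = [(0.2, (0.8, 0, 1), 2), (0.4, 0, (0.6, 1, 2))]
print(mean_prob(samples, 0, 1))
print(auc([0.9, 0.8], [0.1, 0.2, 0.3]))
```

In a leave-$k$-out run, `missing_scores` would hold the averaged scores of the $k$ held-out edges and `nonedge_scores` those of pairs that were never connected.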
Missing Structure [Figure: Grassland species network; AUC (area under ROC curve) vs. fraction of edges observed, k/m; curves: pure chance, common neighbors, Jaccard coeff., degree product, shortest paths, hierarchical structure; annotations: hierarchy, simple predictors, pure chance]
Other Networks [Figure: (a) Terrorist association network, (b) T. pallidum metabolic network; AUC vs. fraction of edges observed; curves: pure chance, common neighbors, Jaccard coefficient, degree product, shortest paths, hierarchical structure]
Summary
• Many real networks are hierarchically modular
• Hierarchies can
  • model multi-scale structure
  • generalize a single network
  • predict missing links
• Model-based inference is very powerful
Acknowledgments: C. Moore, M.E.J. Newman, C.H. Wiggins, and C.R. Shalizi
Fin
Markov chain Monte Carlo (MCMC)
• Given $D$, choose random internal node
• Choose random reconfiguration of subtrees [ergodicity]
• Recompute probabilities $\{p_r\}$ and likelihood $L$
• Sampling states according to their likelihood [detailed balance]
[Figure: three subtree configurations (up to relabeling)]
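One step of such a sampler can be sketched as follows. The sketch assumes the standard Metropolis rule (accept with probability $\min(1, L'/L)$, evaluated in log space) and labels the three subtrees around the chosen internal node as `s`, `t`, `u`; both function names are hypothetical, and the likelihood computation itself is left out.

```python
import math
import random

def alternative_configurations(s, t, u):
    """Given the current arrangement ((s, t), u), return the two other arrangements."""
    return [((s, u), t), ((t, u), s)]

def metropolis_accept(log_l_current, log_l_proposed, rng):
    """Accept a proposed dendrogram with probability min(1, L_proposed / L_current)."""
    delta = log_l_proposed - log_l_current
    # Uphill or equal moves are always accepted; downhill moves sometimes.
    return delta >= 0 or rng.random() < math.exp(delta)

rng = random.Random(0)
proposal = rng.choice(alternative_configurations("s", "t", "u"))
print(proposal)
print(metropolis_accept(-10.0, -5.0, rng))
```

Choosing uniformly between the two alternatives makes the proposal symmetric, which is what lets the simple likelihood-ratio acceptance rule satisfy detailed balance.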
Graph Resampling
1. Summary Statistics
degree distribution, distance distribution, rich-club distribution, short-loop distribution, betweenness function, degree-degree correlations, ... etc.
[Figure: example degree distribution P(x) vs. x (log-log axes) and distance distribution p(d) vs. distance d]
1. Summary Statistics
The good
• good for exploratory analysis
• often quick calculations
The bad
• throw away important information
• can make different networks appear similar
• what are right statistics to measure?
• different statistics often highly correlated
• indirect measures of large-scale structure, function
2. Algorithmic Analysis
global modularity Q, local modularity R, network motifs, box covering, clique covering, ... etc.
2. Algorithmic Analysis
The good
• good for exploratory analysis
• illustrate large-scale structure, heterogeneity
The bad
• often (NP-)hard optimizations
• can be sensitive to noise, uncertainty
• ad hoc or heuristic measures of structure, function
• algorithm = theory
• implied physics often unclear
3. Statistical Inference
hierarchical random graphs, latent space models, correlation reconstruction, community mixtures, information bottlenecks, network classification
$I(X; Y) = H(X) - H(X \mid Y)$
3. Statistical Inference
The good
• model-based measures of structure
• concrete, testable predictions
• better robustness to noise, uncertainty
• well-grounded in computer science, statistics
The bad
• models must be explicit, precise
• often hard computations
• data intensive
Two Case Studies
• NCAA Schedule 2000: n = 115, m = 613
• Zachary’s Karate Club: n = 34, m = 78
[Figure: the two networks drawn with numbered vertices]
Mixing Times
• MCMC mixes relatively quickly
• Equilibrium in $O(n^2)$ steps
[Figure: two panels of MCMC traces reaching equilibrium]
Hierarchies [Figure: two dendrograms with numbered leaves, labeled point estimate and consensus hierarchy]