Network Topology Inference Gonzalo Mateos Dept. of ECE and Goergen Institute for Data Science University of Rochester gmateosb@ece.rochester.edu http://www.ece.rochester.edu/~gmateosb/ April 9, 2019 Network Science Analytics Network Topology Inference 1
Network topology inference
Network topology inference problems
Link prediction
  Case study: Predicting lawyer collaboration
Inference of association networks
  Case study: Inferring genetic regulatory interactions
Tomographic network topology inference
  Case study: Computer network topology identification

Network topology inference
◮ So far dealt with modeling and inference of observed network graphs
  ⇒ Q: If a portion of $G$ is unobserved, can we infer it from data?
◮ Discussed construction of representations $G(V,E)$ for network mapping
  ⇒ Largely informal methodology, lacking an element of validation
◮ Formulate instead as a statistical inference task, i.e., given
  ◮ Measurements $x_i$ of attributes at some or all vertices $i \in V$
  ◮ Indicators $y_{ij}$ of edge status for some vertex pairs $\{i,j\} \in V^{(2)}$
  ◮ A collection $\mathcal{G}$ of candidate graphs $G$
  Goal: infer the topology of the network graph $G(V,E)$
◮ Three canonical network topology inference problems
  (i) Link prediction
  (ii) Association network inference
  (iii) Tomographic network topology inference

Link prediction
[Figure panels: Original graph; Link prediction]
◮ Suppose we observe vertex attributes $x = [x_1, \ldots, x_{N_v}]^\top$; and
◮ Edge status is only observed for some subset of pairs $V^{(2)}_{obs} \subset V^{(2)}$
◮ Goal: predict edge status for all other pairs, i.e., $V^{(2)}_{miss} = V^{(2)} \setminus V^{(2)}_{obs}$

Association network inference
[Figure panels: Original graph; Association network inference]
◮ Suppose we only observe vertex attributes $x = [x_1, \ldots, x_{N_v}]^\top$; and
◮ Assume edge $(i,j)$ defined by nontrivial ‘level of association’ among $x_i, x_j$
◮ Goal: predict edge status for all vertex pairs $V^{(2)}$

Tomographic network topology inference
[Figure panels: Original graph; Tomographic inference]
◮ Suppose we only observe $x_i$ for vertices $i \in V$ in the ‘perimeter’ of $G$
◮ Goal: predict edge and vertex status in the ‘interior’ of $G$

Link prediction
◮ Let $G(V,E)$ be a random graph, with adjacency matrix $Y \in \{0,1\}^{N_v \times N_v}$
  ⇒ $Y^{obs}$ and $Y^{miss}$ denote entries in $V^{(2)}_{obs}$ and $V^{(2)}_{miss}$
Link prediction: Predict entries in $Y^{miss}$, given observations $Y^{obs} = y^{obs}$ and possibly various vertex attributes $X = x \in \mathbb{R}^{N_v}$
◮ Edge status information may be missing due to:
  ⇒ Difficulty in observation, issues of sampling
  ⇒ Edge is not yet present, wish to predict future status
◮ Given a model for $X$ and $(Y^{obs}, Y^{miss})$, jointly predict $Y^{miss}$ based on
  $$\mathbb{P}\left( Y^{miss} \mid Y^{obs} = y^{obs}, X = x \right)$$
  ⇒ More manageable to predict the variables $Y^{miss}_{ij}$ individually

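The split of vertex pairs into observed and missing sets can be sketched in a few lines of Python. This is only an illustration under assumptions: vertices are labeled 0, ..., N_v - 1, pairs are stored as tuples (i, j) with i < j, and the function name `split_pairs` and the toy observations are made up for the example.

```python
from itertools import combinations

def split_pairs(n_vertices, observed_status):
    """Partition all unordered vertex pairs into observed and missing sets.

    observed_status maps a pair (i, j), i < j, to its known edge status (0 or 1).
    """
    all_pairs = set(combinations(range(n_vertices), 2))   # V^(2)
    obs_pairs = set(observed_status)                       # V^(2)_obs
    if not obs_pairs <= all_pairs:
        raise ValueError("observed pairs must be tuples (i, j) with i < j")
    miss_pairs = all_pairs - obs_pairs                     # V^(2)_miss
    return sorted(obs_pairs), sorted(miss_pairs)

# Toy observations (hypothetical): edge status known for three pairs only
y_obs = {(0, 1): 1, (0, 2): 0, (1, 3): 1}
obs_pairs, miss_pairs = split_pairs(4, y_obs)
print(miss_pairs)
```

Link prediction then amounts to assigning a status to every pair in `miss_pairs`.
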
Informal scoring methods
◮ Idea: compute score $s(i,j)$ for missing ‘potential edges’ $\{i,j\} \in V^{(2)}_{miss}$
  ⇒ Predicted edges returned by retaining the top $n^*$ scores
◮ Scores designed to assess certain local structural properties of $G^{obs}$
  ⇒ Distance-based, inspired by the small-world principle
  $$s(i,j) = -\mathrm{dist}_{G^{obs}}(i,j)$$
  ⇒ Neighborhood-based, e.g., the number of common neighbors
  $$s(i,j) = |\mathcal{N}^{obs}_i \cap \mathcal{N}^{obs}_j| \quad \text{or} \quad s(i,j) = \frac{|\mathcal{N}^{obs}_i \cap \mathcal{N}^{obs}_j|}{|\mathcal{N}^{obs}_i \cup \mathcal{N}^{obs}_j|}$$
  ⇒ Favor loosely-connected common neighbors [Adamic-Adar’03]
  $$s(i,j) = \sum_{k \in \mathcal{N}^{obs}_i \cap \mathcal{N}^{obs}_j} \frac{1}{\log |\mathcal{N}^{obs}_k|}$$

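The neighborhood-based scores above can be sketched in plain Python. The graph, the candidate pairs and all function names below are assumptions made for illustration; the observed graph is taken to be simple and undirected.

```python
import math

def neighbors(edges, n_vertices):
    """Build neighbor sets N_i^obs from a list of observed undirected edges."""
    nbrs = {v: set() for v in range(n_vertices)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    return nbrs

def common_neighbors(nbrs, i, j):
    return len(nbrs[i] & nbrs[j])

def jaccard(nbrs, i, j):
    union = nbrs[i] | nbrs[j]
    return len(nbrs[i] & nbrs[j]) / len(union) if union else 0.0

def adamic_adar(nbrs, i, j):
    # A common neighbor k touches both i and j, so |N_k| >= 2 and log|N_k| > 0
    return sum(1.0 / math.log(len(nbrs[k])) for k in nbrs[i] & nbrs[j])

def top_scores(nbrs, candidate_pairs, score, n_star):
    """Return the n_star candidate pairs with the largest scores."""
    ranked = sorted(candidate_pairs, key=lambda p: score(nbrs, *p), reverse=True)
    return ranked[:n_star]

# Toy observed graph (hypothetical) and candidate missing pairs
edges_obs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
nbrs = neighbors(edges_obs, 5)
candidates = [(0, 3), (0, 4), (1, 4), (2, 4)]
predicted = top_scores(nbrs, candidates, adamic_adar, n_star=1)
print(predicted)
```

Vertices 0 and 3 share two neighbors, so the pair (0, 3) ranks first under each of the three scores in this toy graph.
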
Tests on co-authorship networks
◮ Results from a link prediction study in [Liben-Nowell-Kleinberg’03]

Classification methods
◮ Idea: use training data $y^{obs}$ and $x$ to build a binary classifier
  ⇒ Classifier is in turn used to predict the entries in $Y^{miss}$
◮ Logistic regression classifiers most popular, based on the model
  $$\log\left[ \frac{\mathbb{P}_\beta(Y_{ij} = 1 \mid Z_{ij} = z)}{\mathbb{P}_\beta(Y_{ij} = 0 \mid Z_{ij} = z)} \right] = \beta^\top z, \quad \text{where}$$
  (i) $\beta \in \mathbb{R}^K$ is a vector of regression coefficients; and
  (ii) $Z_{ij}$ is a vector of explanatory variables indexed by $\{i,j\}$
  $$Z_{ij} = [g_1(Y^{obs}_{(-ij)}, X), \ldots, g_K(Y^{obs}_{(-ij)}, X)]^\top$$
◮ Functions $g_k(\cdot)$ encode useful predictive information in $y^{obs}_{(-ij)}$ and $x$
  Ex: vertex attributes, score functions, network statistics in ERGMs

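One way to assemble the explanatory vector $Z_{ij}$ is sketched below. The three functions $g_k$ chosen here (an intercept, a common-neighbor score, and a shared-attribute indicator) are illustrative assumptions, as are the name `build_features` and the toy data. The pair's own edge is dropped first, so the features depend only on $Y^{obs}_{(-ij)}$.

```python
def build_features(i, j, nbrs, attrs):
    """Return Z_ij = [g_1, g_2, g_3] for the pair {i, j}."""
    # Remove the pair's own edge so features use only Y^obs_(-ij)
    n_i = nbrs[i] - {j}
    n_j = nbrs[j] - {i}
    g1 = 1.0                                     # intercept
    g2 = float(len(n_i & n_j))                   # common-neighbor score
    g3 = 1.0 if attrs[i] == attrs[j] else 0.0    # shared vertex attribute
    return [g1, g2, g3]

# Toy neighbor sets and vertex attributes (hypothetical)
nbrs = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
attrs = {0: "litigation", 1: "litigation", 2: "corporate", 3: "corporate"}
z_01 = build_features(0, 1, nbrs, attrs)
z_03 = build_features(0, 3, nbrs, attrs)
print(z_01, z_03)
```
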
Logistic regression classifier
◮ Train: Obtain MLE $\hat{\beta}$ via iteratively-reweighted LS
◮ Test: Potential edges $(i,j)$ declared present based on probabilities
  $$\mathbb{P}_{\hat{\beta}}(Y_{ij} = 1 \mid Z_{ij} = z) = \frac{\exp(\hat{\beta}^\top z)}{1 + \exp(\hat{\beta}^\top z)}$$
◮ Logistic regression assumes $Y_{ij}$ conditionally independent given $z$
  ⇒ Seldom the case with relational network data
◮ Underlying mechanism of data missingness is important
  ⇒ Classification for link prediction reminiscent of cross-validation
  ⇒ Assumption that data are missing at random is fundamental

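The train and test steps can be sketched as follows. For logistic regression, iteratively-reweighted LS coincides with Newton's method, and the update is written here in its Newton form. The toy training data, the fixed iteration count and all function names are assumptions for illustration; the data were chosen to be non-separable so that the MLE exists.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def predict_prob(beta, z):
    """P(Y_ij = 1 | Z_ij = z) = exp(beta'z) / (1 + exp(beta'z))."""
    return 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, z))))

def fit_irls(zs, ys, n_iter=25):
    """Newton / IRLS updates: beta <- beta + (Z'WZ)^{-1} Z'(y - p)."""
    k = len(zs[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        ps = [predict_prob(beta, z) for z in zs]
        grad = [sum(z[a] * (y - p) for z, y, p in zip(zs, ys, ps)) for a in range(k)]
        hess = [[sum(z[a] * z[c] * p * (1.0 - p) for z, p in zip(zs, ps))
                 for c in range(k)] for a in range(k)]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Toy training set (hypothetical): z = [intercept, number of common neighbors]
zs = [[1.0, 0.0]] * 4 + [[1.0, 1.0]] * 2 + [[1.0, 2.0]] * 4
ys = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]
beta_hat = fit_irls(zs, ys)

# Declare a potential edge present when its probability exceeds 0.5 (assumed cutoff)
p_edge = predict_prob(beta_hat, [1.0, 2.0])
print(beta_hat, p_edge, p_edge > 0.5)
```

In this toy set the observed edge frequencies are 1/4, 1/2 and 3/4 for zero, one and two common neighbors, so their log-odds are exactly linear in the feature and the fitted probabilities reproduce them.
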
Latent variable models
◮ In addition to a linear predictor $\beta^\top z$, latent models describe $Y_{ij}$
  ⇒ As a function of vertex-specific latent variables $u_i$ and $u_j$
[Figure panels: Homophily; Stochastic equivalence]
◮ Latent models are flexible enough to capture underlying social mechanisms
  Ex: homophily (transitivity) and stochastic equivalence (groups)

Latent class and distance models
◮ Latent distance model: node $i$ has unobserved position $U_i \in \mathbb{R}^d$
  ◮ Positions $U_i$ in latent space assumed i.i.d., e.g., Gaussian distributed
  ◮ Model cond. probability of edge $Y_{ij}$ as function of $\beta^\top z - \|u_i - u_j\|_2$
  ◮ Homophily: Nearby nodes in latent space more likely to link
◮ Latent class model: node $i$ belongs to unobserved class $U_i \in \{1, \ldots, k\}$
  ◮ Classes $U_i$ assumed i.i.d., e.g., multinomial distributed
  ◮ Model cond. probability of edge $Y_{ij}$ as function of $\beta^\top z - \theta_{u_i, u_j}$
  ◮ Stochastic equivalence: Nodes in same class equally likely to link
◮ P. D. Hoff, “Modeling homophily and stochastic equivalence in symmetric relational data,” NIPS, 2008

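A minimal sketch of the two predictors follows. The slide only says the edge probability is a function of the predictor; the logistic link used here is an assumption, and so are the function names, latent positions, classes and the matrix `theta`.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def edge_prob_distance(beta, z, u_i, u_j):
    """Edge probability with predictor beta'z - ||u_i - u_j||_2 (logit link assumed)."""
    linear = sum(b * v for b, v in zip(beta, z))
    return sigmoid(linear - math.dist(u_i, u_j))

def edge_prob_class(beta, z, c_i, c_j, theta):
    """Edge probability with predictor beta'z - theta[c_i][c_j] (logit link assumed)."""
    linear = sum(b * v for b, v in zip(beta, z))
    return sigmoid(linear - theta[c_i][c_j])

beta, z = [1.0], [1.0]

# Latent distance model: nearby positions link with higher probability (homophily)
p_near = edge_prob_distance(beta, z, (0.0, 0.0), (0.5, 0.0))
p_far = edge_prob_distance(beta, z, (0.0, 0.0), (3.0, 4.0))

# Latent class model: the probability depends on the two classes only
theta = [[0.0, 2.0], [2.0, 0.0]]
p_same = edge_prob_class(beta, z, 0, 0, theta)
p_diff = edge_prob_class(beta, z, 0, 1, theta)
print(p_near, p_far, p_same, p_diff)
```

Any two nodes in the same class receive identical probabilities of linking to a third node, which is the stochastic equivalence property.
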
Logistic regression with latent variables
◮ Let $M \in \mathbb{R}^{N_v \times N_v}$ be unknown, random, and symmetric of the form $M = U^\top \Lambda U + E$, where
  (i) $U = [u_1, \ldots, u_{N_v}]$ is a random orthonormal matrix of latent variables;
  (ii) $\Lambda$ is a random diagonal matrix; and
  (iii) $E$ is a symmetric matrix of i.i.d. noise entries $\epsilon_{ij}$
◮ Latent eigenmodel subsumes the class and distance variants [Hoff’08]
  ⇒ Notice that $M_{ij} = u_i^\top \Lambda u_j + \epsilon_{ij}$
◮ The logistic regression model with latent variables is
  $$\log\left[ \frac{\mathbb{P}_\beta(Y_{ij} = 1 \mid Z_{ij} = z, M_{ij} = m)}{\mathbb{P}_\beta(Y_{ij} = 0 \mid Z_{ij} = z, M_{ij} = m)} \right] = \beta^\top z + m$$
◮ $Y_{ij}$ still assumed conditionally independent given $Z_{ij}$ and $M_{ij}$
  ⇒ But they are conditionally dependent given only $Z_{ij}$

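The entry $M_{ij} = u_i^\top \Lambda u_j + \epsilon_{ij}$ and the resulting log-odds can be computed directly, as in the sketch below. The latent vectors, the diagonal of $\Lambda$, the coefficient values and the function names are all made up for illustration, and the noise term defaults to zero.

```python
def latent_effect(u_i, u_j, lam_diag, eps_ij=0.0):
    """M_ij = u_i' Lambda u_j + eps_ij, with Lambda given by its diagonal."""
    return sum(l * a * b for l, a, b in zip(lam_diag, u_i, u_j)) + eps_ij

def log_odds(beta, z, m):
    """Log-odds of an edge: beta'z + m."""
    return sum(b * v for b, v in zip(beta, z)) + m

# Toy latent vectors in R^2 and diagonal of Lambda (hypothetical values)
lam_diag = [2.0, -1.0]
u = {0: (1.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0)}

m_01 = latent_effect(u[0], u[1], lam_diag)
m_02 = latent_effect(u[0], u[2], lam_diag)
eta_01 = log_odds([-1.0], [1.0], m_01)
print(m_01, m_02, eta_01)
```

Because $\Lambda$ is diagonal, `latent_effect(u_i, u_j, ...)` equals `latent_effect(u_j, u_i, ...)` when the noise is symmetric, consistent with $M$ being symmetric.
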
Bayesian link prediction
◮ Specify distributions for $U$, $\Lambda$, $E$ to make statistical link predictions
◮ Bayesian inference natural ⇒ Specify a prior for $\beta$ as well
◮ To predict those entries in $Y^{miss}$, threshold the posterior mean
  $$\mathbb{E}\left[ \frac{\exp(\beta^\top Z_{ij} + M_{ij})}{1 + \exp(\beta^\top Z_{ij} + M_{ij})} \,\Big|\, Y^{obs} = y^{obs}, Z_{ij} = z \right]$$
◮ Use MCMC algorithms to approximate the posterior distribution
  ◮ Gaussian distributions attractive for their conjugacy properties
◮ Higher complexity than MLE for standard logistic regression
  ⇒ Need to generate draws for $N_v^2$ unobserved variables $\{U_{ij}\}$
  ⇒ Major cost reduction with reduced-rank models, $\mathrm{rank}(U) = k \ll N_v$

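Given posterior draws of $(\beta, M_{ij})$, the posterior mean above can be approximated by a Monte Carlo average, as sketched here. The sampler itself is not shown: the draws below are hand-written placeholders rather than real MCMC output, and the 0.5 threshold and function name are assumptions.

```python
import math

def posterior_edge_prob(draws, z):
    """Average the logistic edge probability over draws (beta, m_ij)."""
    total = 0.0
    for beta, m_ij in draws:
        eta = sum(b * v for b, v in zip(beta, z)) + m_ij
        total += 1.0 / (1.0 + math.exp(-eta))
    return total / len(draws)

# Placeholder draws (hypothetical), standing in for the output of an MCMC sampler
draws = [([0.0], 0.0), ([1.0], -1.0), ([0.5], 0.5)]
p_post = posterior_edge_prob(draws, [1.0])

# Threshold the approximate posterior mean to predict the edge status
edge_predicted = p_post > 0.5
print(p_post, edge_predicted)
```
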
Case study: Predicting lawyer collaboration

Lawyer collaboration network
◮ Network $G^{obs}$ of working relationships among lawyers [Lazega’01]
◮ Nodes are $N_v = 36$ partners, edges indicate partners worked together
[Figure: lawyer collaboration network, nodes labeled 1 to 36]
◮ Data includes various node-level attributes:
  ◮ Seniority (node labels indicate rank ordering)
  ◮ Office location (triangle, square or pentagon)
  ◮ Type of practice, i.e., litigation (red) and corporate (cyan)
  ◮ Gender (three partners are female, labeled 27, 29 and 34)
◮ Goal: predict cooperation among social actors in an organization