Sampling and Estimation in Network Graphs
Gonzalo Mateos
Dept. of ECE and Goergen Institute for Data Science
University of Rochester
gmateosb@ece.rochester.edu
http://www.ece.rochester.edu/~gmateosb/
March 27, 2020
Network Science Analytics Sampling and Estimation in Network Graphs
Network sampling
◮ Network sampling and challenges
◮ Background on statistical sampling theory
◮ Network graph sampling designs
◮ Estimation of network totals and group size
◮ Estimation of degree distributions
Sampling network graphs
◮ Measurements are often gathered from only a portion of a complex system
  ◮ Ex: social study of a high-school class vs. a large corporation or the Internet
◮ Network graph → sample from a larger underlying network
◮ Goal: use sampled network data to infer properties of the whole system
  ◮ Approach using principles of statistical sampling theory
◮ Sampling in network contexts introduces various potential challenges
[Diagram: population graph $G(V, E)$ (system under study) → random procedure → sampled graph $G^*(V^*, E^*)$ (available measurements)]
◮ $G^*$ is often a subgraph of $G$ (i.e., $V^* \subseteq V$, $E^* \subseteq E$), but may not be
The fundamental problem
◮ Suppose a given graph characteristic or summary $\eta(G)$ is of interest
  ◮ Ex: order $N_v$, size $N_e$, degree $d_v$, clustering coefficient $\mathrm{cl}(G)$, ...
◮ Typically impossible to recover $\eta(G)$ exactly from $G^*$
  ⇒ Q: Can we still form a useful estimate $\hat{\eta} = \hat{\eta}(G^*)$ of $\eta(G)$?
◮ Plug-in estimator $\hat{\eta} := \eta(G^*)$
  ◮ Boils down to computing the characteristic of interest in $G^*$
  ◮ Many familiar estimators in statistical practice are of this type
    Ex: sample means, standard deviations, covariances, quantiles, ...
◮ Oftentimes $\eta(G^*)$ is a poor representation of $\eta(G)$
Example: Estimating average degree
◮ Let $G(V, E)$ be a network of protein interactions in yeast
  ⇒ Characteristic of interest is the average degree
  $$\eta(G) = \frac{1}{N_v} \sum_{i \in V} d_i$$
◮ Here $N_v = 5{,}151$, $N_e = 31{,}201$ ⇒ $\eta(G) = 2N_e/N_v = 12.115$
◮ Consider two sampling designs to obtain $G^*$
  ◮ First sample $n$ vertices $V^* = \{i_1, \ldots, i_n\}$ without replacement
  ◮ Design 1: For each $i \in V^*$, observe all incident edges $(i, j) \in E$
  ◮ Design 2: Observe edge $(i, j)$ only when both $i, j \in V^*$
◮ Estimate $\eta(G)$ by averaging the observed degree sequence $\{d_i^*\}_{i \in V^*}$
  $$\eta(G^*) = \frac{1}{n} \sum_{i \in V^*} d_i^*$$
Example: Estimating average degree (cont.)
◮ Random sample of $n = 1{,}500$ vertices, Designs 1 and 2 for edges
  ⇒ Process repeated for 10,000 trials ⇒ histogram of $\eta(G^*)$
[Figure: histograms of the estimate of average degree under Design 1 and Design 2; x-axis: estimate of average degree (0 to 15), y-axis: density]
◮ Design 2 underestimates $\eta(G)$, but Design 1 is on target. Why?
  ◮ Design 1: sample vertex degree explicitly, i.e., $d_i^* = d_i$
  ◮ Design 2: (implicitly) sample vertex degree with bias, i.e., $d_i^* \approx \frac{n}{N_v} d_i$
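One way to see where the Design 2 bias comes from, sketched under the assumption that $V^*$ is a simple random sample of $n$ vertices drawn without replacement: once vertex $i$ is in $V^*$, the remaining $n-1$ sampled vertices are drawn from the other $N_v - 1$, so each neighbor of $i$ is observed with probability $(n-1)/(N_v-1)$. Summing over the $d_i$ neighbors gives

```latex
\mathbb{E}\left[ d_i^* \mid i \in V^* \right]
  = \sum_{j : (i,j) \in E} P\left( j \in V^* \mid i \in V^* \right)
  = \frac{n-1}{N_v-1}\, d_i
  \approx \frac{n}{N_v}\, d_i .
```

Under Design 1 every incident edge of a sampled vertex is observed, so no such shrinkage factor appears.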
Improving estimation accuracy
◮ To do better, we need to incorporate the effects of
  ⇒ Random sampling; and/or
  ⇒ Measurement error
◮ Sampling design, topology of $G$, nature of $\eta(\cdot)$ all critical
◮ Model-based inference → Likelihood-based and Bayesian paradigms
◮ Design-based methods → Statistical sampling theory
  ◮ Assume observations made without measurement error
  ◮ Only source of randomness → sampling procedure
◮ Ex: Estimating average degree
  ◮ Under Design 2 the estimate is biased, with mean of only 3.528
  ◮ Adjusting $\eta(G^*)$ upward by a factor $N_v/n = 3.434$ yields 12.115
  ◮ Will see how statistical sampling theory justifies this correction
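The Design 1 versus Design 2 comparison can be reproduced in miniature. The sketch below is only illustrative: it assumes a small Erdős–Rényi random graph as a stand-in for the yeast network (which is not available here), and the names `random_graph` and `average_degree_estimates` as well as the sizes `N_v = 300`, `n = 90` are hypothetical choices.

```python
import random

def random_graph(num_vertices, edge_prob, rng):
    """Adjacency sets of an Erdos-Renyi graph (assumed stand-in for a real network)."""
    adj = {v: set() for v in range(num_vertices)}
    for u in range(num_vertices):
        for v in range(u + 1, num_vertices):
            if rng.random() < edge_prob:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def average_degree_estimates(adj, n, rng):
    """Plug-in estimates of average degree from one vertex sample, for both designs."""
    sample = set(rng.sample(sorted(adj), n))  # n vertices without replacement
    # Design 1: all edges incident to a sampled vertex are observed.
    design1 = sum(len(adj[i]) for i in sample) / n
    # Design 2: an edge is observed only if both endpoints were sampled.
    design2 = sum(len(adj[i] & sample) for i in sample) / n
    return design1, design2

rng = random.Random(0)
N_v, n, trials = 300, 90, 500
adj = random_graph(N_v, 0.04, rng)
true_avg = sum(len(neighbors) for neighbors in adj.values()) / N_v

runs = [average_degree_estimates(adj, n, rng) for _ in range(trials)]
mean_d1 = sum(r[0] for r in runs) / trials
mean_d2 = sum(r[1] for r in runs) / trials
corrected_d2 = mean_d2 * N_v / n  # upward adjustment by the factor N_v / n

print(f"true average degree:        {true_avg:.3f}")
print(f"Design 1, mean estimate:    {mean_d1:.3f}")
print(f"Design 2, mean estimate:    {mean_d2:.3f}")
print(f"Design 2 scaled by N_v / n: {corrected_d2:.3f}")
```

With these settings the Design 2 average should sit well below the true value, while the rescaled version should land close to it.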
Background on statistical sampling theory
Statistical sampling theory
◮ Suppose we have a population $\mathcal{U} = \{1, \ldots, N_u\}$ of $N_u$ units
  ◮ Ex: People, animals, objects, vertices, ...
◮ A value $y_i$ is associated with each unit $i \in \mathcal{U}$
  ◮ Ex: Height, age, gender, infected, membership, ...
◮ Typical interest in the population totals $\tau$ and averages $\mu$
  $$\tau := \sum_{i \in \mathcal{U}} y_i \quad \text{and} \quad \mu := \frac{1}{N_u} \sum_{i \in \mathcal{U}} y_i = \frac{\tau}{N_u}$$
◮ Basic sampling theory paradigm oriented around these steps:
  S1: Randomly sample $n$ units $\mathcal{S} = \{i_1, \ldots, i_n\}$ from $\mathcal{U}$
  S2: Observe the value $y_{i_k}$ for $k = 1, \ldots, n$
  S3: Form an unbiased estimator $\hat{\mu}$ of $\mu$, i.e., $\mathbb{E}[\hat{\mu}] = \mu$
  S4: Evaluate or estimate the variance $\operatorname{var}[\hat{\mu}]$
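Steps S1 to S4 can be sketched for the simplest case, simple random sampling without replacement. The population of heights and the function name `srs_mean_estimate` are made up for illustration. The variance estimate in S4 uses the standard textbook formula $(1 - n/N_u)\, s^2 / n$ for this design, which the slide itself does not state.

```python
import random
import statistics

def srs_mean_estimate(values, n, rng):
    """Steps S1-S4 under simple random sampling without replacement."""
    N_u = len(values)
    sample_ids = rng.sample(range(N_u), n)   # S1: draw n distinct units
    y = [values[i] for i in sample_ids]      # S2: observe their values
    mu_hat = sum(y) / n                      # S3: sample mean (unbiased under SRS)
    # S4: standard SRS variance estimate with finite-population correction
    # (textbook formula, assumed here; not given on the slide).
    var_hat = (1 - n / N_u) * statistics.variance(y) / n
    return mu_hat, var_hat

rng = random.Random(1)
population = [rng.gauss(170.0, 10.0) for _ in range(1000)]  # hypothetical heights
mu = sum(population) / len(population)

mu_hat, var_hat = srs_mean_estimate(population, 100, rng)
print(f"population mean: {mu:.2f}")
print(f"estimate: {mu_hat:.2f} (estimated std. error {var_hat ** 0.5:.2f})")
```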
Inclusion probabilities
◮ Def: For a given sampling design, the inclusion probability $\pi_i$ of unit $i$ is
  $$\pi_i := P(\text{unit } i \text{ belongs to the sample } \mathcal{S})$$
◮ Simple random sampling (SRS): $n$ units sampled uniformly from $\mathcal{U}$
  Without replacement: $i_1$ chosen from $\mathcal{U}$, $i_2$ from $\mathcal{U} \setminus \{i_1\}$, and so on
  ⇒ There are $\binom{N_u}{n}$ such possible samples of size $n$
  ⇒ There are $\binom{N_u - 1}{n - 1}$ samples which include a given unit $i$
◮ The inclusion probability is
  $$\pi_i = \frac{\binom{N_u - 1}{n - 1}}{\binom{N_u}{n}} = \frac{n}{N_u}$$
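The counting argument can be checked by brute force on a toy population. This is a small sketch with assumed sizes ($N_u = 6$, $n = 3$) and a hypothetical helper `srs_inclusion_probability`; it enumerates every equally likely sample and counts those containing a given unit.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def srs_inclusion_probability(N_u, n, unit):
    """Fraction of all size-n samples (without replacement) that contain `unit`."""
    samples = list(combinations(range(N_u), n))
    containing = sum(1 for s in samples if unit in s)
    return Fraction(containing, len(samples))

N_u, n = 6, 3
pi_0 = srs_inclusion_probability(N_u, n, unit=0)

print(comb(N_u, n))          # number of samples of size n: 20
print(comb(N_u - 1, n - 1))  # samples containing a given unit: 10
print(pi_0)                  # 1/2, i.e. n / N_u
```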
Sample mean estimator
◮ Definition of sample mean estimator
  $$\hat{\mu} = \frac{1}{n} \sum_{i \in \mathcal{S}} y_i$$
◮ Using indicator RVs $\mathbb{I}\{i \in \mathcal{S}\}$ for $i \in \mathcal{U}$, where $\mathbb{E}[\mathbb{I}\{i \in \mathcal{S}\}] = \pi_i$
  $$\mathbb{E}[\hat{\mu}] = \mathbb{E}\left[ \frac{1}{n} \sum_{i \in \mathcal{S}} y_i \right] = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{N_u} y_i\, \mathbb{I}\{i \in \mathcal{S}\} \right] = \frac{1}{n} \sum_{i=1}^{N_u} y_i\, \mathbb{E}[\mathbb{I}\{i \in \mathcal{S}\}] = \frac{1}{n} \sum_{i=1}^{N_u} y_i \pi_i$$
◮ SRS without replacement → unbiased because $\pi_i = n / N_u$
◮ Unequal probability sampling
  ◮ More common than SRS, especially with networks. (More soon)
  ◮ Sample mean can be a poor (i.e., biased) estimator for $\mu$
Horvitz-Thompson estimation for totals
◮ Idea: weighted average using inverse inclusion probabilities as weights
  Horvitz-Thompson (HT) estimator
  $$\hat{\mu}_\pi = \frac{1}{N_u} \sum_{i \in \mathcal{S}} \frac{y_i}{\pi_i} \quad \text{and} \quad \hat{\tau}_\pi = N_u \hat{\mu}_\pi$$
◮ Remedies the bias problem
  $$\mathbb{E}[\hat{\mu}_\pi] = \frac{1}{N_u} \sum_{i=1}^{N_u} \frac{y_i}{\pi_i}\, \mathbb{E}[\mathbb{I}\{i \in \mathcal{S}\}] = \frac{1}{N_u} \sum_{i=1}^{N_u} y_i = \mu$$
  ⇒ Size of the population $N_u$ assumed known
  ⇒ Broad applicability, but $\pi_i$ may be difficult to compute
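Unbiasedness of the HT estimator can be verified exactly on a toy example. The sketch below assumes a design chosen purely for illustration: each unit is included independently with its own probability $\pi_i$ (often called Poisson sampling), so every subset of the population has a probability that is easy to write down. The values `y`, the probabilities `pi`, and the helper `ht_mean` are hypothetical.

```python
from fractions import Fraction
from itertools import product

def ht_mean(sample, y, pi):
    """Horvitz-Thompson estimate of the population mean from the sampled indices."""
    N_u = len(y)
    return sum((y[i] / pi[i] for i in sample), Fraction(0)) / N_u

# Hypothetical population values and unequal inclusion probabilities.
y = [Fraction(v) for v in (2, 5, 1, 8)]
pi = [Fraction(1, 2), Fraction(1, 4), Fraction(3, 4), Fraction(1, 3)]

# Assumed design: unit i enters the sample independently with probability pi[i].
# Enumerate all 2^N_u possible samples and average the estimator exactly.
expected = Fraction(0)
for flags in product((0, 1), repeat=len(y)):
    prob = Fraction(1)
    for included, p in zip(flags, pi):
        prob *= p if included else 1 - p
    sample = [i for i, included in enumerate(flags) if included]
    expected += prob * ht_mean(sample, y, pi)

mu = sum(y) / len(y)
print(expected, mu)  # both equal 4
```

Exact rational arithmetic is used so that the equality $\mathbb{E}[\hat{\mu}_\pi] = \mu$ holds without rounding error.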
Horvitz-Thompson estimator variance
◮ Def: Joint inclusion probability $\pi_{ij}$ of units $i$ and $j$ is
  $$\pi_{ij} := P(\text{units } i \text{ and } j \text{ both belong to the sample } \mathcal{S})$$
◮ If the inclusions of units $i$ and $j$ are independent events ⇒ $\pi_{ij} = \pi_i \pi_j$
◮ Ex: Simple random sampling without replacement yields, for $i \neq j$,
  $$\pi_{ij} = \frac{n(n-1)}{N_u(N_u-1)}$$
◮ Variance of the HT estimator (with the convention $\pi_{ii} = \pi_i$):
  $$\operatorname{var}[\hat{\tau}_\pi] = \sum_{i \in \mathcal{U}} \sum_{j \in \mathcal{U}} y_i y_j \left( \frac{\pi_{ij}}{\pi_i \pi_j} - 1 \right), \qquad \operatorname{var}[\hat{\mu}_\pi] = \frac{\operatorname{var}[\hat{\tau}_\pi]}{N_u^2}$$
  ⇒ Typically estimated in an unbiased fashion from the sample $\mathcal{S}$
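As a sanity check on the variance formula, the sketch below compares it against a brute-force computation under SRS without replacement on a toy population. The values in `y` and the sizes are assumptions made for the example, and the diagonal terms use $\pi_{ii} = \pi_i$.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population values; SRS of n = 2 out of N_u = 5 units.
y = [Fraction(v) for v in (3, 1, 4, 1, 5)]
N_u, n = len(y), 2
pi = Fraction(n, N_u)                                 # pi_i = n / N_u
pi_joint = Fraction(n * (n - 1), N_u * (N_u - 1))     # pi_ij for i != j

def ht_total(sample):
    """Horvitz-Thompson estimate of the population total."""
    return sum((y[i] / pi for i in sample), Fraction(0))

# Brute force: every sample is equally likely under SRS.
samples = list(combinations(range(N_u), n))
estimates = [ht_total(s) for s in samples]
mean_est = sum(estimates) / len(samples)
var_enum = sum((t - mean_est) ** 2 for t in estimates) / len(samples)

# Closed form: double sum over all pairs (i, j), with pi_ii = pi_i.
var_formula = Fraction(0)
for i in range(N_u):
    for j in range(N_u):
        p_ij = pi if i == j else pi_joint
        var_formula += y[i] * y[j] * (p_ij / (pi * pi) - 1)

print(mean_est, sum(y))        # estimator mean equals the true total
print(var_enum, var_formula)   # the two variance computations agree
```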
Probability proportional to size sampling
◮ Unequal probability sampling
  ⇒ $n$ units selected w.r.t. a distribution $\{p_1, \ldots, p_{N_u}\}$ on $\mathcal{U}$
  ⇒ Uniform sampling: special case with $p_i = \frac{1}{N_u}$ for all $i \in \mathcal{U}$
◮ Probability proportional to size (PPS) sampling
  ⇒ Probabilities $p_i$ proportional to a characteristic $c_i$
  Ex: households chosen by drawing names from a database
◮ If sampling with replacement, PPS inclusion probabilities are
  $$\pi_i = 1 - (1 - p_i)^n, \quad \text{where } p_i = \frac{c_i}{\sum_k c_k}$$
◮ Joint inclusion probabilities for variance calculations
  $$\pi_{ij} = \pi_i + \pi_j - \left[ 1 - (1 - p_i - p_j)^n \right]$$
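The PPS formulas can be checked against a direct enumeration of all draw sequences. This is a minimal sketch with made-up sizes `c` (think of household sizes) and hypothetical helper names; it assumes $n$ independent draws with replacement, as in the formulas.

```python
from fractions import Fraction
from itertools import product

def pps_inclusion(c, n):
    """Draw probabilities p_i and inclusion probabilities pi_i for n draws with replacement."""
    total = sum(c)
    p = [Fraction(ci, total) for ci in c]
    pi = [1 - (1 - pk) ** n for pk in p]
    return p, pi

def pps_joint_inclusion(p, pi, i, j, n):
    """Joint inclusion probability of distinct units i and j."""
    return pi[i] + pi[j] - (1 - (1 - p[i] - p[j]) ** n)

c = [1, 2, 3]  # hypothetical size characteristic of each unit
n = 2
p, pi = pps_inclusion(c, n)
pi_01 = pps_joint_inclusion(p, pi, 0, 1, n)

# Brute force over all ordered sequences of n draws.
pi_enum = [Fraction(0)] * len(c)
pi_01_enum = Fraction(0)
for draws in product(range(len(c)), repeat=n):
    prob = Fraction(1)
    for k in draws:
        prob *= p[k]
    for i in set(draws):          # each distinct unit appearing in the sample
        pi_enum[i] += prob
    if 0 in draws and 1 in draws:
        pi_01_enum += prob

print(pi, pi_enum)
print(pi_01, pi_01_enum)
```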
Estimation of group size
◮ So far implicitly assumed $N_u$ known → Often not the case!
  Ex: endangered animal species, people at risk of rare disease
◮ Special population total often of interest is the group size
  $$N_u = \sum_{i \in \mathcal{U}} 1$$
◮ Suggests the following HT estimator of $N_u$
  $$\hat{N}_u = \sum_{i \in \mathcal{S}} \pi_i^{-1}$$
  ⇒ Infeasible, since knowledge of $N_u$ needed to compute $\pi_i$
Capture-recapture estimator
◮ Capture-recapture estimators overcome HT limitations in this setting
◮ Two rounds of SRS without replacement ⇒ Two samples $\mathcal{S}_1$, $\mathcal{S}_2$
  Round 1: Mark all units in sample $\mathcal{S}_1$ of size $n_1$ from $\mathcal{U}$
    ◮ Ex: tagging a fish, noting the ID number, ...
    ◮ All units in $\mathcal{S}_1$ are returned to the population
  Round 2: Obtain a sample $\mathcal{S}_2$ of size $n_2$ from $\mathcal{U}$
  Capture-recapture estimator of $N_u$
  $$\hat{N}_u := \frac{n_2}{m}\, n_1, \quad \text{where } m := |\mathcal{S}_1 \cap \mathcal{S}_2|$$
◮ Factor $m / n_2$ indicative of marked fraction of the overall population, i.e., $m / n_2 \approx n_1 / N_u$
  ⇒ Can derive using model-based arguments as an ML estimator
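The two-round procedure is easy to simulate. The sketch below uses assumed sizes ($N_u = 2000$, $n_1 = n_2 = 300$) and a hypothetical function name `capture_recapture`; it returns `None` when no marked unit is recaptured, since the estimator divides by $m$ and is undefined in that case.

```python
import random

def capture_recapture(population_size, n1, n2, rng):
    """One run of the two-round design; returns the estimate of the population size."""
    units = range(population_size)
    marked = set(rng.sample(units, n1))        # Round 1: mark sample S1, then release
    second = rng.sample(units, n2)             # Round 2: independent sample S2
    m = sum(1 for u in second if u in marked)  # m = |S1 ∩ S2|
    if m == 0:
        return None                            # estimator undefined without recaptures
    return n1 * n2 / m

rng = random.Random(7)
N_u, n1, n2 = 2000, 300, 300                   # hypothetical sizes
estimates = [capture_recapture(N_u, n1, n2, rng) for _ in range(200)]
valid = sorted(e for e in estimates if e is not None)
median_estimate = valid[len(valid) // 2]

print(f"true size: {N_u}, median estimate over {len(valid)} runs: {median_estimate:.0f}")
```

The median over repeated runs is reported rather than the mean because individual estimates can be large when $m$ happens to be small.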
Common network graph sampling designs