Sampling and Estimation in Network Graphs
Gonzalo Mateos
Dept. of ECE and Goergen Institute for Data Science
University of Rochester
gmateosb@ece.rochester.edu
http://www.ece.rochester.edu/~gmateosb/
March 27, 2020
Network Science Analytics

Network sampling
Network sampling and challenges
Background on statistical sampling theory
Network graph sampling designs
Estimation of network totals and group size
Estimation of degree distributions

Sampling network graphs
◮ Measurements are often gathered only from a portion of a complex system
◮ Ex: social study of a high-school class vs. a large corporation or the Internet
◮ Network graph → sample from a larger underlying network
◮ Goal: use sampled network data to infer properties of the whole system
◮ Approach using principles of statistical sampling theory
◮ Sampling in network contexts introduces various potential challenges
[Diagram: population graph $G(V, E)$ (system under study) → random procedure → sampled graph $G^*(V^*, E^*)$ (available measurements)]
◮ $G^*$ is often a subgraph of $G$ (i.e., $V^* \subseteq V$, $E^* \subseteq E$), but may not be

The fundamental problem
◮ Suppose a given graph characteristic or summary $\eta(G)$ is of interest
◮ Ex: order $N_v$, size $N_e$, degree $d_v$, clustering coefficient $\mathrm{cl}(G)$, ...
◮ Typically impossible to recover $\eta(G)$ exactly from $G^*$
⇒ Q: Can we still form a useful estimate $\hat{\eta} = \hat{\eta}(G^*)$ of $\eta(G)$?
◮ Plug-in estimator $\hat{\eta} := \eta(G^*)$
◮ Boils down to computing the characteristic of interest in $G^*$
◮ Many familiar estimators in statistical practice are of this type
Ex: sample means, standard deviations, covariances, quantiles, ...
◮ Oftentimes $\eta(G^*)$ is a poor representation of $\eta(G)$

Example: Estimating average degree
◮ Let $G(V, E)$ be a network of protein interactions in yeast
⇒ Characteristic of interest is the average degree
$$\eta(G) = \frac{1}{N_v} \sum_{i \in V} d_i$$
◮ Here $N_v = 5{,}151$, $N_e = 31{,}201$ ⇒ $\eta(G) = 12.115$
◮ Consider two sampling designs to obtain $G^*$
◮ First sample $n$ vertices $V^* = \{i_1, \ldots, i_n\}$ without replacement
◮ Design 1: For each $i \in V^*$, observe all incident edges $(i, j) \in E$
◮ Design 2: Observe edge $(i, j)$ only when both $i, j \in V^*$
◮ Estimate $\eta(G)$ by averaging the observed degree sequence $\{d_i^*\}_{i \in V^*}$
$$\eta(G^*) = \frac{1}{n} \sum_{i \in V^*} d_i^*$$

Example: Estimating average degree (cont.)
◮ Random sample of $n = 1{,}500$ vertices, Designs 1 and 2 for edges
⇒ Process repeated for 10,000 trials ⇒ histogram of $\eta(G^*)$
[Figure: histograms of the estimates under Design 1 and Design 2; x-axis: estimate of average degree, y-axis: density]
◮ Design 2 underestimates $\eta(G)$, but Design 1 is on target. Why?
◮ Design 1: sample vertex degree explicitly, i.e., $d_i^* = d_i$
◮ Design 2: (implicitly) sample vertex degree with bias, i.e., $d_i^* \approx \frac{n}{N_v} d_i$
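The contrast between the two designs can be reproduced on a small synthetic graph. The sketch below is illustrative only: the graph is a uniformly random edge set, not the yeast interaction network, and the sizes `N_V`, `N_E`, `N_SAMPLE` and the helper names are hypothetical choices made for this example.

```python
import random


def random_graph(num_vertices, num_edges, rng):
    """Return a sorted list of distinct undirected edges (i, j) with i < j."""
    edges = set()
    while len(edges) < num_edges:
        i, j = rng.sample(range(num_vertices), 2)  # two distinct endpoints
        edges.add((min(i, j), max(i, j)))
    return sorted(edges)


def mean_estimates(num_vertices, edges, n, trials, rng):
    """Average the plug-in estimate of mean degree over repeated samples.

    Returns (design1_mean, design2_mean).
    """
    degree = [0] * num_vertices
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1

    total1 = total2 = 0.0
    for _ in range(trials):
        sample = set(rng.sample(range(num_vertices), n))
        # Design 1: all incident edges are observed, so d*_i = d_i.
        total1 += sum(degree[i] for i in sample) / n
        # Design 2: only edges with both endpoints in the sample are observed.
        induced = sum(1 for i, j in edges if i in sample and j in sample)
        total2 += 2 * induced / n  # each induced edge adds 1 to two degrees
    return total1 / trials, total2 / trials


rng = random.Random(0)
N_V, N_E, N_SAMPLE = 400, 2000, 100  # hypothetical sizes, not the yeast data
edges = random_graph(N_V, N_E, rng)
true_avg = 2 * N_E / N_V
d1, d2 = mean_estimates(N_V, edges, N_SAMPLE, trials=300, rng=rng)
print(f"true average degree:    {true_avg:.3f}")
print(f"Design 1 mean estimate: {d1:.3f}")
print(f"Design 2 mean estimate: {d2:.3f}")
print(f"n/N_v * true average:   {N_SAMPLE / N_V * true_avg:.3f}")
```

Under these assumptions the Design 1 average should sit near the true value, while the Design 2 average should sit near the true value scaled by $n/N_v$.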

Improving estimation accuracy
◮ To do better, we need to incorporate the effects of
⇒ Random sampling; and/or
⇒ Measurement error
◮ Sampling design, topology of $G$, nature of $\eta(\cdot)$ all critical
◮ Model-based inference → Likelihood-based and Bayesian paradigms
◮ Design-based methods → Statistical sampling theory
◮ Assume observations made without measurement error
◮ Only source of randomness → sampling procedure
◮ Ex: Estimating average degree
◮ Under Design 2 the estimate is biased, with mean of only 3.528
◮ Adjusting $\eta(G^*)$ upward by a factor $N_v / n = 3.434$ yields 12.115
◮ Will see how statistical sampling theory justifies this correction
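One way to see where the correction factor comes from is the short calculation below. It is a sketch that assumes Design 2 with vertices drawn by simple random sampling without replacement: given that vertex $i$ is sampled, each of its $d_i$ neighbors is among the remaining $n - 1$ sampled vertices with probability $(n-1)/(N_v-1)$.

```latex
\begin{align*}
\mathbb{E}\left[ d_i^* \mid i \in V^* \right]
  &= \sum_{j : (i,j) \in E} P\left( j \in V^* \mid i \in V^* \right)
   = d_i \, \frac{n-1}{N_v-1}
   \approx \frac{n}{N_v} \, d_i , \\
\frac{N_v}{n} \times 3.528
  &= 3.434 \times 3.528 \approx 12.115 .
\end{align*}
```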

Background
Network sampling and challenges
Background on statistical sampling theory
Network graph sampling designs
Estimation of network totals and group size
Estimation of degree distributions

Statistical sampling theory
◮ Suppose we have a population $\mathcal{U} = \{1, \ldots, N_u\}$ of $N_u$ units
◮ Ex: People, animals, objects, vertices, ...
◮ A value $y_i$ is associated with each unit $i \in \mathcal{U}$
◮ Ex: Height, age, gender, infected, membership, ...
◮ Typical interest in the population totals $\tau$ and averages $\mu$
$$\tau := \sum_{i \in \mathcal{U}} y_i \quad \text{and} \quad \mu := \frac{1}{N_u} \sum_{i \in \mathcal{U}} y_i = \frac{\tau}{N_u}$$
◮ Basic sampling theory paradigm oriented around these steps:
S1: Randomly sample $n$ units $\mathcal{S} = \{i_1, \ldots, i_n\}$ from $\mathcal{U}$
S2: Observe the value $y_{i_k}$ for $k = 1, \ldots, n$
S3: Form an unbiased estimator $\hat{\mu}$ of $\mu$, i.e., $\mathbb{E}[\hat{\mu}] = \mu$
S4: Evaluate or estimate the variance $\mathrm{var}[\hat{\mu}]$

Inclusion probabilities
◮ Def: For a given sampling design, the inclusion probability $\pi_i$ of unit $i$ is
$$\pi_i := P(\text{unit } i \text{ belongs in the sample } \mathcal{S})$$
◮ Simple random sampling (SRS): $n$ units sampled uniformly from $\mathcal{U}$
Without replacement: $i_1$ chosen from $\mathcal{U}$, $i_2$ from $\mathcal{U} \setminus \{i_1\}$, and so on
⇒ There are $\binom{N_u}{n}$ such possible samples of size $n$
⇒ There are $\binom{N_u - 1}{n - 1}$ samples which include a given unit $i$
◮ The inclusion probability is
$$\pi_i = \frac{\binom{N_u - 1}{n - 1}}{\binom{N_u}{n}} = \frac{n}{N_u}$$
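The counting argument can be checked numerically. The sketch below uses only the Python standard library; the function names are hypothetical and the small population sizes are chosen so that brute-force enumeration stays cheap.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def srs_inclusion_probability(N_u, n):
    """pi_i for SRS without replacement, from the ratio of binomial counts."""
    return Fraction(comb(N_u - 1, n - 1), comb(N_u, n))


def inclusion_probability_by_enumeration(N_u, n, unit):
    """Fraction of all size-n samples of {0, ..., N_u - 1} that contain `unit`."""
    samples = list(combinations(range(N_u), n))
    hits = sum(1 for s in samples if unit in s)
    return Fraction(hits, len(samples))


# Both routes agree with n / N_u on a small population.
print(srs_inclusion_probability(10, 3))
print(inclusion_probability_by_enumeration(10, 3, unit=0))
```

Exact rational arithmetic via `Fraction` avoids any floating-point tolerance when comparing against $n / N_u$.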

Sample mean estimator
◮ Definition of sample mean estimator
$$\hat{\mu} = \frac{1}{n} \sum_{i \in \mathcal{S}} y_i$$
◮ Using indicator RVs $\mathbb{I}\{i \in \mathcal{S}\}$ for $i \in \mathcal{U}$, where $\mathbb{E}[\mathbb{I}\{i \in \mathcal{S}\}] = \pi_i$
$$\Rightarrow \mathbb{E}[\hat{\mu}] = \mathbb{E}\left[ \frac{1}{n} \sum_{i \in \mathcal{S}} y_i \right] = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{N_u} y_i \, \mathbb{I}\{i \in \mathcal{S}\} \right] = \frac{1}{n} \sum_{i=1}^{N_u} y_i \, \mathbb{E}[\mathbb{I}\{i \in \mathcal{S}\}] = \frac{1}{n} \sum_{i=1}^{N_u} y_i \pi_i$$
◮ SRS without replacement → unbiased because $\pi_i = n / N_u$
◮ Unequal probability sampling
◮ More common than SRS, especially with networks. (More soon)
◮ Sample mean can be a poor (i.e., biased) estimator for $\mu$

Horvitz-Thompson estimation for totals
◮ Idea: weighted average with weights given by the inverse inclusion probabilities
Horvitz-Thompson (HT) estimator
$$\hat{\mu}_\pi = \frac{1}{N_u} \sum_{i \in \mathcal{S}} \frac{y_i}{\pi_i} \quad \text{and} \quad \hat{\tau}_\pi = N_u \hat{\mu}_\pi$$
◮ Remedies the bias problem
$$\mathbb{E}[\hat{\mu}_\pi] = \frac{1}{N_u} \sum_{i=1}^{N_u} \frac{y_i}{\pi_i} \, \mathbb{E}[\mathbb{I}\{i \in \mathcal{S}\}] = \frac{1}{N_u} \sum_{i=1}^{N_u} y_i = \mu$$
⇒ Size of the population $N_u$ assumed known
⇒ Broad applicability, but $\pi_i$ may be difficult to compute
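The bias of the sample mean and the unbiasedness of the HT estimator can both be verified exactly on a toy population. The sketch below is an illustration under assumed inputs: the values `y`, the size measures `c`, and the design (each size-2 sample drawn with probability proportional to the sum of its `c` values) are hypothetical and not taken from the slides.

```python
from fractions import Fraction
from itertools import combinations

y = [1, 2, 3, 10]   # hypothetical unit values
c = [1, 1, 1, 3]    # hypothetical size measures driving the unequal design
n = 2
N_u = len(y)

# Design: P(sample) proportional to the sum of c over the units in the sample.
samples = list(combinations(range(N_u), n))
weight = {s: sum(c[i] for i in s) for s in samples}
total_weight = sum(weight.values())
design = {s: Fraction(w, total_weight) for s, w in weight.items()}

# Inclusion probabilities: total probability of the samples containing unit i.
pi = [sum(p for s, p in design.items() if i in s) for i in range(N_u)]


def sample_mean(s):
    return Fraction(sum(y[i] for i in s), n)


def ht_mean(s):
    return sum(Fraction(y[i]) / pi[i] for i in s) / N_u


mu = Fraction(sum(y), N_u)
expected_sample_mean = sum(p * sample_mean(s) for s, p in design.items())
expected_ht_mean = sum(p * ht_mean(s) for s, p in design.items())

print("mu                 =", mu)
print("E[sample mean]     =", expected_sample_mean)
print("E[HT estimator]    =", expected_ht_mean)
```

Because the expectations are computed by summing over every possible sample with exact fractions, the comparison with $\mu$ involves no simulation noise.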

Horvitz-Thompson estimator variance
◮ Def: Joint inclusion probability $\pi_{ij}$ of units $i$ and $j$ is
$$\pi_{ij} := P(\text{units } i \text{ and } j \text{ belong in the sample } \mathcal{S})$$
◮ If the inclusions of units $i$ and $j$ are independent events ⇒ $\pi_{ij} = \pi_i \pi_j$
◮ Ex: Simple random sampling without replacement yields
$$\pi_{ij} = \frac{n(n-1)}{N_u(N_u-1)}$$
◮ Variance of the HT estimator (with the convention $\pi_{ii} = \pi_i$):
$$\mathrm{var}[\hat{\tau}_\pi] = \sum_{i \in \mathcal{U}} \sum_{j \in \mathcal{U}} y_i y_j \left( \frac{\pi_{ij}}{\pi_i \pi_j} - 1 \right), \qquad \mathrm{var}[\hat{\mu}_\pi] = \frac{\mathrm{var}[\hat{\tau}_\pi]}{N_u^2}$$
⇒ Typically estimated in an unbiased fashion from the sample $\mathcal{S}$
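As a sanity check, the double-sum variance formula can be compared against a brute-force variance computed over all possible samples. The sketch below assumes SRS without replacement on a hypothetical four-unit population and takes $\pi_{ii} = \pi_i$ for the diagonal terms; variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

y = [1, 2, 3, 10]   # hypothetical unit values
N_u, n = len(y), 2

pi = Fraction(n, N_u)                               # pi_i under SRS
pi_pair = Fraction(n * (n - 1), N_u * (N_u - 1))    # pi_ij for i != j


def joint(i, j):
    """Joint inclusion probability, with pi_ii = pi_i on the diagonal."""
    return pi if i == j else pi_pair


# Closed-form variance of the HT total estimator.
var_formula = sum(
    y[i] * y[j] * (joint(i, j) / (pi * pi) - 1)
    for i in range(N_u)
    for j in range(N_u)
)

# Brute force: every size-n sample is equally likely under SRS.
samples = list(combinations(range(N_u), n))
estimates = [sum(Fraction(y[i]) / pi for i in s) for s in samples]
mean_est = sum(estimates) / len(samples)
var_enum = sum((t - mean_est) ** 2 for t in estimates) / len(samples)

print("E[tau_hat]        =", mean_est)
print("var (formula)     =", var_formula)
print("var (enumeration) =", var_enum)
```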

Probability proportional to size sampling
◮ Unequal probability sampling
⇒ $n$ units selected w.r.t. a distribution $\{p_1, \ldots, p_{N_u}\}$ on $\mathcal{U}$
⇒ Uniform sampling: special case with $p_i = \frac{1}{N_u}$ for all $i \in \mathcal{U}$
◮ Probability proportional to size (PPS) sampling
⇒ Probabilities $p_i$ proportional to a characteristic $c_i$
Ex: households chosen by drawing names from a database
◮ If sampling with replacement, PPS inclusion probabilities are
$$\pi_i = 1 - (1 - p_i)^n, \quad \text{where } p_i = \frac{c_i}{\sum_k c_k}$$
◮ Joint inclusion probabilities for variance calculations
$$\pi_{ij} = \pi_i + \pi_j - \left[ 1 - (1 - p_i - p_j)^n \right]$$
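Both PPS formulas can be confirmed by enumerating every sequence of $n$ draws with replacement. The sketch below uses hypothetical size measures `c` and a small `n` so that the enumeration is exact and fast; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

c = [1, 2, 3, 4]    # hypothetical size measures
n = 3               # number of draws with replacement
total = sum(c)
p = [Fraction(ck, total) for ck in c]


def pi_single(i):
    """Closed form: unit i appears in at least one of the n draws."""
    return 1 - (1 - p[i]) ** n


def pi_joint(i, j):
    """Closed form: units i and j (i != j) both appear among the n draws."""
    return pi_single(i) + pi_single(j) - (1 - (1 - p[i] - p[j]) ** n)


def enumerate_inclusion(units):
    """Exact probability that every unit in `units` appears among the draws."""
    prob = Fraction(0)
    for draws in product(range(len(c)), repeat=n):
        if all(u in draws for u in units):
            w = Fraction(1)
            for d in draws:
                w *= p[d]   # draws are independent with probabilities p
            prob += w
    return prob


print(pi_single(0), enumerate_inclusion([0]))
print(pi_joint(1, 3), enumerate_inclusion([1, 3]))
```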

Estimation of group size
◮ So far implicitly assumed $N_u$ known → Often not the case!
Ex: endangered animal species, people at risk of rare disease
◮ Special population total often of interest is the group size
$$N_u = \sum_{i \in \mathcal{U}} 1$$
◮ Suggests the following HT estimator of $N_u$
$$\hat{N}_u = \sum_{i \in \mathcal{S}} \pi_i^{-1}$$
⇒ Infeasible, since knowledge of $N_u$ needed to compute $\pi_i$

Capture-recapture estimator
◮ Capture-recapture estimators overcome HT limitations in this setting
◮ Two rounds of SRS without replacement ⇒ Two samples $\mathcal{S}_1$, $\mathcal{S}_2$
Round 1: Mark all units in sample $\mathcal{S}_1$ of size $n_1$ from $\mathcal{U}$
◮ Ex: tagging a fish, noting the ID number, ...
◮ All units in $\mathcal{S}_1$ are returned to the population
Round 2: Obtain a sample $\mathcal{S}_2$ of size $n_2$ from $\mathcal{U}$
Capture-recapture estimator of $N_u$
$$\hat{N}_u := \frac{n_2}{m} \, n_1, \quad \text{where } m := |\mathcal{S}_1 \cap \mathcal{S}_2|$$
◮ Factor $m / n_2$ indicative of marked fraction of the overall population
⇒ Can derive using model-based arguments as an ML estimator
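The two-round procedure is easy to simulate. The sketch below is a minimal illustration with hypothetical values for the population size and the two sample sizes; it returns `None` when no marked unit is recaptured, since the estimator is undefined for $m = 0$.

```python
import random


def capture_recapture(population_size, n1, n2, rng):
    """One capture-recapture experiment; returns n1 * n2 / m, or None if m == 0."""
    units = range(population_size)
    marked = set(rng.sample(units, n1))         # Round 1: mark and release
    second = rng.sample(units, n2)              # Round 2: independent SRS
    m = sum(1 for u in second if u in marked)   # marked units seen again
    if m == 0:
        return None
    return n1 * n2 / m


rng = random.Random(1)
TRUE_N = 1000   # hypothetical population size, unknown to the estimator
estimates = [capture_recapture(TRUE_N, 200, 200, rng) for _ in range(500)]
estimates = [e for e in estimates if e is not None]
avg = sum(estimates) / len(estimates)
print(f"average estimate over {len(estimates)} trials: {avg:.1f}")
```

Individual estimates fluctuate with $m$, so the average over repeated experiments gives a better sense of how close the estimator tends to land to the true size.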

Common network graph sampling designs
Network sampling and challenges
Background on statistical sampling theory
Network graph sampling designs
Estimation of network totals and group size
Estimation of degree distributions
