Estimating the Size of Hidden Populations based on Partially-Observed Network Data

Mark S. Handcock, Department of Statistics, University of California - Los Angeles
Krista J. Gile, Department of Mathematics, University of Massachusetts - Amherst
Corinne M. Mar, Center for Studies in Demography and Ecology, University of Washington

Supported by the DoD ONR MURI award N00014-08-1-1015.
Working Papers available at http://www.stat.ucla.edu/~handcock and http://arXiv.org
MURI Annual Review Meeting, Jan 10 2012
Hard-to-Reach Population Methods Research Group
- Krista J. Gile, UMass
- Mark S. Handcock, UCLA
- Lisa G. Johnston, Tulane University, UCSF
- Corinne M. Mar, University of Washington
- http://hpmrg.org
Sampling Hard-to-reach Populations
Many motivating fields:
- epidemiology
  - CDC HIV surveillance program
  - UNAIDS requires HIV prevalence estimates for all countries
  - Most countries: concentrated in high-risk populations: injecting drug users, men who have sex with men, and sex workers
  - Hard-to-reach networked populations
- labor economics: unregulated workers
- demography: displaced populations, immigrant populations
Traditional Survey Sampling:
- Probability sample (e.g. simple random sampling, stratified random sampling)
- Analyze data using sampling weights
Hard-to-reach populations: No practical conventional sampling frame.
Link-Tracing Sampling
Suppose:
- Each population is joined by an informal social network of relationships.
- Researchers can access some members of the population.
Then:
- Begin with a reachable convenience sample (the seeds)
- Expand the sample by following social network ties
This is Link-tracing Network Sampling
Stylized population
Start with seeds . . .
Seeds recruit the first wave . . .
First wave recruit the second wave . . .
and so on . . .
(and with un-sampled)
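The wave-by-wave process illustrated above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the sampling design of any particular study: the ring-shaped network, the function names `link_trace_sample` and `ring_network`, the limit of two followed ties per person, and the choice of seeds are all hypothetical.

```python
import random
from collections import deque


def link_trace_sample(adjacency, seeds, max_links, target_size, rng):
    """Sample by following ties outward from the seeds, wave by wave."""
    order = list(seeds)                    # nodes in the order they were sampled
    wave = {s: 0 for s in seeds}           # seeds form wave 0
    recruiter = {s: None for s in seeds}   # who brought each node into the sample
    queue = deque(seeds)                   # sampled nodes that have not yet recruited
    while queue and len(order) < target_size:
        current = queue.popleft()
        # Ties leading to people not yet sampled (sampling is without replacement).
        candidates = [v for v in adjacency[current] if v not in wave]
        rng.shuffle(candidates)
        for v in candidates[:max_links]:
            if len(order) >= target_size:
                break
            order.append(v)
            wave[v] = wave[current] + 1
            recruiter[v] = current
            queue.append(v)
    return order, wave, recruiter


def ring_network(n, k):
    """Toy network (an assumption): each node is tied to its k nearest neighbours on each side."""
    return {i: [(i + d) % n for d in range(-k, k + 1) if d != 0] for i in range(n)}


rng = random.Random(0)
adjacency = ring_network(200, 3)
order, wave, recruiter = link_trace_sample(
    adjacency, seeds=[0, 100], max_links=2, target_size=50, rng=rng
)
print(len(order), "sampled; deepest wave:", max(wave.values()))
```

The sketch records the sampling order, the wave of each node, and its recruiter, which are the pieces of partially-observed network data that later inference relies on. Ties to un-sampled members are never observed beyond the ones actually followed.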
Respondent-Driven Sampling - Link-tracing variant:
- Seed Dependence: Follow only a few links from each sampled respondent
- Confidentiality: Respondents distribute uniquely identified coupons. No names. (respondent-driven)
- Inference based on network positions: Under rapid development
- Effective at obtaining large varied samples in many populations.
- Widely used: over 100 studies, in over 30 countries. Often HIV-risk populations.

Heckathorn, D.D., "Respondent-driven sampling: A new approach to the study of hidden populations." Social Problems, 1997.
Salganik, M.J. and D.D. Heckathorn, "Sampling and estimation in hidden populations using respondent-driven sampling." Sociological Methodology, 2004.
Link-Tracing Sampling: Challenges
- Sampling depends on (typically) partially-observed network data
- Convenience mechanism for initial sample leads to non-probability sample
- Unknown population size and unknown sampling frame
- Sampling designs have much in common, but no consensus on inferential approach
Respondent-Driven Sampling is subject to all of these.
Statistical Assessments of RDS
- Many critics in subject fields (Wang et al. 2005, ...)
- Wejnert and Heckathorn (2008): compare in known population (web-based)
- Gile (2008): uses CDC data as basis for simulated population (ERGM). Evaluates sensitivity (1) to bias induced by the initial sample, (2) to uncontrollable features of respondent behavior, and (3) to the without-replacement structure of sampling.
- Goel and Salganik (2009): using a Markov chain model, examine effects of clustering and the non-branching assumption.
- Gile and Handcock (2010): use realistic but simulated populations to show that (1) the number of sample waves typically used is too small; (2) preferential referral behavior leads to bias; (3) finite population effects can be large.
- Goel and Salganik (2010): simulate RDS over (largely known) friendship and IDU networks to show high variance of original estimators.
- Thomas and Gile (2011): effect of differential recruitment, non-response and non-recruitment.
Inferential approaches
The key is the modeling of the sampling process
- Salganik and Heckathorn (2004): simple Markov chain model over classes
- Volz and Heckathorn (2008): Markov chain model over people
- Gile (2008, 2011): adjusts for the bias introduced by assuming with-replacement sampling
- Gile and Handcock (2008, 2011): a network model-assisted estimator
  - better performance, realistic representation of the RDS process
  - unlike other link-tracing methods, does not require an initial probability sample
  - is able to adjust for the bias from the selection of the seeds
- Still subject to many assumptions:
  - Self-reported infected and uninfected contacts
  - Known population size
  - Adequate working network structure and sampling structure
  - Measurement error
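As a concrete instance of how a sampling model turns into an estimator: the Volz and Heckathorn (2008) approach treats each respondent's sampling probability as proportional to their reported degree, which leads to weighting each response by the inverse of that degree. The sketch below is a minimal illustration of that weighting only; the function name `inverse_degree_estimate` and the four-person data set are hypothetical.

```python
def inverse_degree_estimate(outcomes, degrees):
    """Estimate a population proportion, weighting each respondent by 1/degree.

    outcomes: 1 if the respondent has the trait (e.g. infected), else 0
    degrees:  self-reported number of contacts, assumed proportional
              to the respondent's probability of being sampled
    """
    weights = [1.0 / d for d in degrees]
    weighted_sum = sum(w * y for w, y in zip(weights, outcomes))
    return weighted_sum / sum(weights)


# Hypothetical sample: the two respondents with the trait are well connected.
outcomes = [1, 1, 0, 0]
degrees = [10, 10, 2, 2]

naive = sum(outcomes) / len(outcomes)                    # unweighted sample mean
weighted = inverse_degree_estimate(outcomes, degrees)    # down-weights high-degree respondents
print(naive, weighted)
```

High-degree respondents are over-represented in a link-tracing sample, so the weighted estimate falls below the naive mean whenever the trait is more common among well-connected people. The weighting still assumes with-replacement sampling, which is one of the assumptions the later estimators on this slide relax.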
Why estimate the population size?
- We want to know the size of the population under study
- We want to estimate population totals rather than averages
- We want to estimate population counts rather than proportions
- We need it to improve new estimators that require it (e.g. Gile (2008, 2011)).
Is there information in RDS data about population size?
Idea: use Gile's sequential sampling model and leverage the information in the ordered sequence of degrees in the sample. Intuitively, the change in the degree distribution over successive waves indicates the depletion of the population, and this can be quantified to estimate N. If high-degree people tend to be sampled early, a rapid decline in degree over the sampling order suggests that the sample is a large fraction of a small population.
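The depletion intuition can be checked with a small simulation. The sketch below is an assumption-laden toy, not the estimator from the talk: it draws people one at a time without replacement, with probability proportional to degree among those not yet sampled, and compares a small and a large population that share the same degree distribution. The function names, the degree values 1 to 15, and the large population size of 5000 are hypothetical; the small size of 555 and the sample of 500 are chosen only to loosely mirror the figure.

```python
import random


def successive_sample(degrees, n, rng):
    """Draw n units without replacement, each time with probability
    proportional to degree among the units still remaining."""
    remaining = list(degrees)
    drawn = []
    for _ in range(n):
        threshold = rng.random() * sum(remaining)
        running = 0
        pick = len(remaining) - 1          # fallback guards against rounding at the top end
        for idx, d in enumerate(remaining):
            running += d
            if threshold < running:
                pick = idx
                break
        drawn.append(remaining.pop(pick))  # record the degree and deplete the population
    return drawn


def degree_drop(drawn, block=100):
    """Mean degree of the first `block` draws minus mean degree of the last `block`."""
    early = sum(drawn[:block]) / block
    late = sum(drawn[-block:]) / block
    return early - late


def toy_degrees(size):
    """Hypothetical population: degrees 1..15 in equal proportions."""
    return [1 + (i % 15) for i in range(size)]


rng = random.Random(1)
drop_small = degree_drop(successive_sample(toy_degrees(555), 500, rng))
drop_large = degree_drop(successive_sample(toy_degrees(5000), 500, rng))
print(round(drop_small, 2), round(drop_large, 2))
```

With the same sample size, the decline in degree over the sampling order is much steeper in the small population, because its high-degree members are used up. That difference in slope is the signal about N that the ordered degree sequence carries.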
[Figure: scatter plot titled "Population Size 555". Vertical axis: Degree of Sampled Node. Horizontal axis: Time (order of being sampled).]