Using Hierarchical Models to Calibrate Selection Bias

Douglas Rivers
Stanford University and YouGov

February 26, 2016
Margins of error

We agree that margin of sampling error in surveys has an accepted meaning and that this measure is not appropriate for non-probability samples. . . . We believe that users of non-probability samples should be encouraged to report measures of the precision of their estimates, but suggest that to avoid confusion, the set of terms be distinct from those currently used in probability sample surveys.

    AAPOR Report on Non-probability Sampling (2013)
Is inference possible with unknown selection probabilities?

◮ It had better be, since we certainly don't know what the selection probabilities are for most public opinion polls and market research surveys. With single-digit response rates, actual sample inclusion probabilities differ by two orders of magnitude from the initial unit selection probabilities.

◮ The usual approach is to assume ignorable selection (conditional independence of selection and survey variables given a set of covariates). Such inferences are made conditional upon the selection model, which is unlikely to hold exactly. Shouldn't this be reflected somehow in the margin of error?

◮ Empirically, calculated standard errors in pre-election polls substantially underestimate the RMSE. Gelman, Goel and Rothschild (2016) find that the actual RMSE was understated by between 25% and 50% (depending upon the type of election) in 4,221 polls.
Three questions

A $100(1-\alpha)\%$ confidence interval for a descriptive population parameter $\theta_0$ is usually computed as
$$ \hat\theta \pm z_{1-\alpha/2}\,\widehat{\mathrm{s.e.}}(\hat\theta) $$
where $\hat\theta$ is a sample mean or proportion, possibly weighted.

1. Does $\hat\theta$ have a normal sampling distribution?
2. Can we estimate $\mathrm{s.e.}(\hat\theta)$ without knowing the selection probabilities?
3. Is the sampling distribution of $\hat\theta$ centered on the population parameter $\theta_0$?

If the answer to all three questions is "yes," then the confidence interval will have the stated level of coverage.
1. Is $\hat\theta$ normally distributed?

Suppose $\{y_i\}_{i=1}^N$ is a bounded sequence of real numbers and $\{D_i\}_{i=1}^N$ is a sequence of independent Bernoulli random variables with $E(D_i) = \pi_i$. Let
$$ n = \sum_{i=1}^N D_i, \qquad \hat\theta_N = \frac{1}{n}\sum_{i=1}^N D_i y_i, \qquad \bar\pi_N = \frac{1}{N}\sum_{i=1}^N \pi_i, $$
$$ \theta^*_N = \frac{\sum_{i=1}^N \pi_i y_i}{N \bar\pi_N}, \qquad \omega^2_N = \frac{1}{N \bar\pi_N}\sum_{i=1}^N \pi_i (1-\pi_i)(y_i - \theta^*_N)^2. $$

If
(i) $\lim_{N\to\infty} \bar\pi_N = \bar\pi$ where $0 < \bar\pi < 1$,
(ii) $\lim_{N\to\infty} \theta^*_N = \theta^*$,
(iii) $\lim_{N\to\infty} \omega^2_N = \omega^2$ where $0 < \omega^2 < \infty$,
then
$$ \sqrt{n}\,(\hat\theta_N - \theta^*_N) \xrightarrow{\;L\;} N(0, \omega^2). $$
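The limit theorem above can be checked by simulation. A minimal sketch, using an illustrative population and made-up selection probabilities $\pi_i$ that depend on $y_i$ (so selection is informative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: y bounded in [0, 1]; selection probabilities
# pi_i increase with y_i, so selection is informative.
N = 100_000
y = rng.uniform(0, 1, N)
pi = 0.05 + 0.10 * y                       # E(D_i) = pi_i

# Population quantities from the theorem.
pi_bar = pi.mean()
theta_star = (pi * y).sum() / (N * pi_bar)
omega2 = (pi * (1 - pi) * (y - theta_star) ** 2).sum() / (N * pi_bar)

# Replicate the Bernoulli selection and standardize theta_hat.
z = []
for _ in range(2000):
    D = rng.random(N) < pi                 # independent Bernoulli draws
    n = D.sum()
    theta_hat = y[D].mean()
    z.append(np.sqrt(n) * (theta_hat - theta_star) / np.sqrt(omega2))
z = np.asarray(z)

# The standardized values should be approximately standard normal.
print(z.mean(), z.std())
```

The printed mean and standard deviation should be close to 0 and 1, consistent with the stated limiting distribution.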
2. Can we estimate $\mathrm{s.e.}(\hat\theta)$?

Under the same assumptions as the preceding result,
$$ \widehat{\mathrm{s.e.}}(\hat\theta) = \left[ \frac{1}{n^2} \sum_{i \in s} (y_i - \hat\theta)^2 \right]^{1/2} $$
is a conservative estimator of $\mathrm{s.e.}(\hat\theta)$ with asymptotic bias $O(\bar\pi_N)$. This also works with weighting, except that $y_i$ is replaced everywhere by $w_i y_i$ (where $w_i$ is the weight) and
$$ \widehat{\mathrm{s.e.}}(\hat\theta) = \left[ \frac{\sum_{i \in s} w_i^2 (y_i - \hat\theta)^2}{n^2 \bar w^2} \right]^{1/2}. $$
Independence of the draws is enough. You don't need to know the selection probabilities.
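The point of the slide is that neither estimator involves the $\pi_i$. A minimal sketch of both, on an illustrative sample drawn by the same Bernoulli selection scheme (the weights here are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population and one Bernoulli-selected sample.
N = 100_000
y = rng.uniform(0, 1, N)
pi = 0.05 + 0.10 * y
D = rng.random(N) < pi
ys = y[D]                                   # the observed sample s
n = len(ys)

theta_hat = ys.mean()

# Unweighted: [ (1/n^2) * sum (y_i - theta_hat)^2 ]^(1/2),
# i.e. the familiar sd / sqrt(n). No pi_i appear anywhere.
se_unweighted = np.sqrt(((ys - theta_hat) ** 2).sum() / n**2)

# Weighted version: y_i replaced by w_i y_i, denominator n^2 * wbar^2.
w = rng.uniform(0.5, 1.5, n)                # illustrative weights
theta_w = (w * ys).sum() / w.sum()          # weighted mean
se_weighted = np.sqrt((w**2 * (ys - theta_w) ** 2).sum()
                      / (n**2 * w.mean() ** 2))

print(se_unweighted, se_weighted)
```

With all weights equal to one, the weighted formula collapses to the unweighted one, which is a quick consistency check on the reconstruction above.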
3. Is the distribution of $\hat\theta$ centered on $\theta_0$?

Unfortunately, no. The sampling distribution of $\hat\theta$ is approximately
$$ \hat\theta \overset{a}{\sim} N\!\left( \theta^*_N,\; \frac{\omega^2}{n} \right) $$
so the confidence interval $\hat\theta \pm z_{1-\alpha/2}\,\widehat{\mathrm{s.e.}}(\hat\theta)$ is shifted to the right by the quantity
$$ \mathrm{Bias}(\hat\theta) = \theta^*_N - \theta_0. $$
The margin of error has approximately correct coverage for $\theta^*_N$. The interval is still useful for quantifying sampling error (how much variation could be expected from selecting another sample using the same process), but actual coverage for the population parameter is overstated (sometimes by a lot).
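The coverage gap can be made concrete by simulation. A sketch under the same illustrative selection scheme as before: nominal 95% intervals are computed in each replicate, and coverage is tallied separately for $\theta^*_N$ and for the population mean $\theta_0$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population; selection favors large y, so theta* > theta_0.
N = 100_000
y = rng.uniform(0, 1, N)
pi = 0.05 + 0.10 * y
theta_0 = y.mean()                          # population parameter
theta_star = (pi * y).sum() / pi.sum()      # what the sample estimates

cover_star = cover_0 = 0
reps = 1000
for _ in range(reps):
    D = rng.random(N) < pi
    ys = y[D]
    n = len(ys)
    theta_hat = ys.mean()
    half = 1.96 * ys.std() / np.sqrt(n)     # nominal 95% half-width
    cover_star += abs(theta_hat - theta_star) <= half
    cover_0 += abs(theta_hat - theta_0) <= half

print(cover_star / reps, cover_0 / reps)
```

Coverage for $\theta^*_N$ comes out near (or slightly above, since the estimator is conservative) 95%, while coverage for $\theta_0$ collapses: the bias dwarfs the half-width, which is exactly the overstatement the slide describes.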
Post-stratification to correct for selection bias

Bias can be eliminated if we can identify a set of covariates that make selection conditionally independent of the survey variables. The conditional independence (ignorability) assumption is more plausible if the number of covariates is large. However, post-stratification involves a bias-variance tradeoff. Post-stratifying on a large number of variables is a form of over-fitting which, while it may reduce bias, can increase the mean square error by inflating the variance.
$$ \mathrm{MSE}(\hat\theta) = \mathrm{Bias}^2(\hat\theta) + V(\hat\theta) $$
Is the problem bias or variance?
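The "number of raking variables" on the horizontal axis of the figures that follow refers to iterative proportional fitting (raking) of weights to known population margins. A minimal sketch with two made-up binary covariates and invented target margins, just to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sample in which x1 = True is over-represented.
n = 5000
x1 = rng.random(n) < 0.70                   # sample share ~70%
x2 = rng.random(n) < 0.40                   # sample share ~40%
target1, target2 = 0.50, 0.50              # illustrative population margins

# Iterative proportional fitting: rescale weights to match each margin
# in turn, cycling until the weighted margins stabilize.
w = np.ones(n)
for _ in range(50):
    for x, target in ((x1, target1), (x2, target2)):
        p = w[x].sum() / w.sum()            # current weighted margin
        w[x] *= target / p
        w[~x] *= (1 - target) / (1 - p)

margin1 = w[x1].sum() / w.sum()
margin2 = w[x2].sum() / w.sum()
print(margin1, margin2)                     # both close to the 0.50 targets
```

Each additional raking variable adds a constraint of this form; the resulting weight variation is what inflates $V(\hat\theta)$ even as the bias shrinks.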
Data

Seven opt-in internet surveys, a probability internet panel study, and an RDD phone survey fielded almost identical questionnaires in 2004-05, including eight items also included in the 2004 American Community Survey (ACS).

Primary demographics: gender, age, race, education.
Secondary demographics: marital status, home ownership, number of bedrooms, number of vehicles.

Six of the opt-in surveys used online panels, while one (SPSS-AOL) used a "river sample." All of the opt-in samples used some form of quota sampling on gender, sometimes on age and/or region, and only one on race. The probability internet panel (KN) uses purposive sampling for within-panel selection, while it appears that the phone survey may have used gender quotas. Only one of the opt-in survey vendors (Harris Interactive) provided post-stratification weights.
Parallel estimates with different post-stratification schemes

[Figure: nine panels, one per survey (Phone (SRBI), Probability Web (KN), 1. Harris, 2. Luth, 3. Greenfield, 4. SSI, 5. Survey Direct, 6. SPSS/AOL, 7. GoZing), each plotting error (percent) against the number of raking variables, 0 to 5.]
95% confidence intervals for estimates

[Figure: nine panels, one per survey (Phone (SRBI), Probability Web (KN), 1. Harris, 2. Luth, 3. Greenfield, 4. SSI, 5. Survey Direct, 6. SPSS/AOL, 7. GoZing), each plotting error (percent), with 95% confidence intervals, against the number of raking variables, 0 to 5.]