A Modeling Approach to Compensate for Nonresponse and Selection Bias in Surveys?
Tien-Huan (Amy) Lin, Ismael Flores Cervantes
Westat
JSM 2019 • Denver, Colorado • July 27-August 1
The views presented in this paper are those of the author(s) and do not represent the views of any Government Agency/Department or Westat
Objectives
❯ Can we reduce nonresponse bias for different types of nonresponse …
• by adjusting on response propensity alone (i.e., the propensity-only approach)?
• by incorporating (predicted) survey outcome(s) in response propensity models (i.e., the two-step approach, ŷ → propensity)?
- Vartivarian & Little, 2002
- Using modeling tools from the statistical learning area (i.e., gradient boosting) (Morral et al., 2015; Fay & Riddles, 2016; Lin & Flores Cervantes, 2018)
❯ Empirical study: simulation with realistic survey design
• Unequal sampling probability
• Indirect correlation between survey outcome and response propensity
Types of Nonresponse
❯ Missing Completely At Random (MCAR)
• Nonresponse is unrelated to any variable in the data
❯ Missing At Random (MAR)
• Probability to respond depends only on the covariates
❯ Not Missing At Random (NMAR)
• Probability to respond depends on the unobserved data (Sikov, 2018)
[Figure: magnitude of bias across MCAR, MAR, and NMAR]
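The three mechanisms can be illustrated with a small simulation. This is a minimal sketch, not the authors' simulation: the population, the observed covariate `x`, the unobserved covariate `u`, the propensity coefficients, and the helper `class_adjusted_mean` are all assumptions made for illustration. It compares the respondent mean with a simple weighting-class adjustment that can only use the observed covariate.

```python
import numpy as np

rng = np.random.default_rng(2019)
n = 200_000

# Hypothetical population: x is observed, u is NOT available to the analyst.
x = rng.normal(size=n)
u = rng.normal(size=n)
y = 1.0 + x + u + rng.normal(size=n)   # artificial survey outcome


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


# Response propensities under the three mechanisms (coefficients are assumed).
propensity = {
    "MCAR": np.full(n, 0.6),        # unrelated to any variable
    "MAR": logistic(0.4 + x),       # depends only on the observed covariate
    "NMAR": logistic(0.4 + u),      # depends on unobserved data
}


def class_adjusted_mean(y, x, r, n_classes=10):
    """Weighting-class adjustment using quantile classes of the observed x."""
    edges = np.quantile(x, np.linspace(0, 1, n_classes + 1)[1:-1])
    cell = np.digitize(x, edges)                      # class index 0..n_classes-1
    n_all = np.bincount(cell, minlength=n_classes)
    n_resp = np.bincount(cell[r], minlength=n_classes)
    w = (n_all / n_resp)[cell[r]]                     # inverse response rate per class
    return np.sum(w * y[r]) / np.sum(w)


bias = {}
for name, p in propensity.items():
    r = rng.random(n) < p                             # response indicator
    bias[name] = {
        "unadjusted": y[r].mean() - y.mean(),
        "adjusted": class_adjusted_mean(y, x, r) - y.mean(),
    }

for name, b in bias.items():
    print(f"{name}: unadjusted {b['unadjusted']:+.3f}, adjusted {b['adjusted']:+.3f}")
```

In this toy setup the adjustment should remove most of the MAR bias, because the driver of nonresponse is observed, and leave the NMAR bias largely untouched, because `u` never enters the adjustment.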
Simulation Details
❯ Population
• 2012 National Health Interview Survey
• Target population: ages 18 and over
• N: 57,356
❯ Sample Design
• Complex sample design:
- One-stage cluster sample with implicit stratification
• Bernoulli or Poisson sample selection of ≈500 households
- Poisson sample selection with probability of selection proportional to household size (with differential error)
• All persons in sampled households selected (n ≈ 800)
- Nonresponse (with and without selection bias) at the person level
- Survey outcome (y) is artificial
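The household selection step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the household frame, the household sizes, the 20% outcome rate, and the helpers `poisson_sample` and `ht_mean` are hypothetical, and the "differential error" mentioned on the slide is not modeled. Each household is selected independently with probability proportional to its size, every person in a sampled household is taken, and the Horvitz-Thompson estimator weights by the inverse selection probability.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical household frame (sizes and outcome rate are assumptions).
n_households = 30_000
hh_size = rng.integers(1, 7, size=n_households)       # 1 to 6 persons
hh_y_total = rng.binomial(hh_size, 0.2)               # persons with y = 1 per household

# Selection probability proportional to household size, expected sample of 500.
target_households = 500
pi = np.minimum(1.0, target_households * hh_size / hh_size.sum())


def poisson_sample(pi, rng):
    """Independent Bernoulli trial per household, each with its own probability."""
    return rng.random(pi.size) < pi


def ht_mean(hh_y_total, hh_size, pi, sampled):
    """Horvitz-Thompson total of y divided by the known number of persons."""
    return np.sum(hh_y_total[sampled] / pi[sampled]) / hh_size.sum()


true_mean = hh_y_total.sum() / hh_size.sum()
estimates = np.array([
    ht_mean(hh_y_total, hh_size, pi, poisson_sample(pi, rng))
    for _ in range(500)
])
print(f"expected sample size: {pi.sum():.0f} households")
print(f"true mean {true_mean:.4f}, average HT estimate {estimates.mean():.4f}")
```

Bernoulli selection is the special case where `pi` is the same constant for every household. Under either scheme the realized sample size is random, which is why the slide gives approximate counts.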
Model Specifications for Survey Outcome and Response Propensity
Synthetic y (e.g. estimate of smokers)
Covariates (models: y, r_mar, r_nmar)
✓ ✓ Education*income (* workclass for outcome model) †
✓ ✓ #kids in HH (* sex for response models) †
✓ ✓ ✓ Age
✓ ✓ ✓ Age*Sex
✓ Race/Ethnicity †
✓ Race/Ethnicity*Sex †
✓ Esophagus cancer †
✓ Lung cancer †
✓ Throat cancer †
✓ Kidney cancer †
✓ Heart condition/disease
✓ Coronary heart disease
✓ Heart attack
✓ COPD
✓ ✓ Sex
✓ Family type †
✓ ✓ Work limitation due to health problem (family member)
✓ ✓ Worried food would run out before got money to buy more
✓ ✓ Unable to work now due to health problem (individual)
✓ ✓ Any limitation – all conditions
✓ ✓ Region
✓ Significant predictor and available for modeling
† Removed from dataset and not available for modeling
Model Specifications for Survey Outcome and Response Propensity (covariate table as on the previous slide)
Source 1 of NMAR: some covariates removed, introducing unobserved data
Model Specifications for Survey Outcome and Response Propensity (covariate table as on the previous slide)
Source 2 of NMAR: selection bias
Observed Selection Bias for Expected y for NMAR
[Figure: two bar charts of synthetic y (0% to 70%) comparing the population (pop) with NMAR respondents (nmar). Left panel: Income x Education (low income, high income; no high school, high school and above). Right panel: #kids in HH x Sex (0 to 5 kids; M, F).]
Simulation Scenarios
❯ MCAR
• Bernoulli sample selection
• 100% household RR
• 70% person RR
❯ MAR
• Poisson sample selection
• 100% household RR
• 59% person RR
❯ NMAR
• Poisson sample selection
• 100% household RR
• 43% person RR
❯ 10,000 simulations for each response propensity assumption
• Knottnerus, P. (2003). Sample Survey Theory. New York, NY: Springer
1 HT (true unbiased estimate)
• Horvitz-Thompson on full sample ignoring nonresponse
2 BWGT
• Adjusted using inverse of selection probability
3 rpms (propensity-only)
• Tree algorithm to model response propensity
4 xgboost → rpms (two-step, ŷ → propensity)
• Uses xgboost to model the outcome (ŷ) and rpms to model response propensity (with ŷ + all x)
5 xgboost → xgboost (two-step, ŷ → propensity)
• Uses xgboost to model the outcome (ŷ) and xgboost to model response propensity (with ŷ + all x)
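The two-step idea behind methods 4 and 5 can be sketched in a few lines. This is not the authors' implementation: ordinary least squares stands in for the xgboost outcome model, and cross-classified weighting cells stand in for the rpms or xgboost propensity model, so that the example needs only NumPy. The function names, the number of classes `k`, and the simulated data are assumptions.

```python
import numpy as np


def quantile_classes(v, k):
    """Assign each value to one of k quantile classes (labels 0..k-1)."""
    edges = np.quantile(v, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(v, edges)


def two_step_weights(X, y, r, base_w, k=5):
    """Nonresponse-adjusted weights for respondents (illustrative sketch)."""
    # Step 1: outcome model fit on respondents, then predicted for every sampled unit.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A[r], y[r], rcond=None)
    y_hat = A @ beta

    # Step 2: propensity model on y_hat plus all x, here as cross-classified cells.
    cells = quantile_classes(y_hat, k)
    for j in range(X.shape[1]):
        cells = cells * k + quantile_classes(X[:, j], k)
    _, cells = np.unique(cells, return_inverse=True)   # relabel occupied cells 0..C-1

    w_all = np.bincount(cells, weights=base_w)
    w_resp = np.bincount(cells, weights=base_w * r)
    p_hat = (w_resp / w_all)[cells]                    # weighted response rate per cell

    return base_w[r] / p_hat[r]


# Assumed toy sample with MAR nonresponse driven by the first covariate.
rng = np.random.default_rng(7)
n = 20_000
X = rng.normal(size=(n, 2))
y = 2.0 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
base_w = rng.uniform(1.0, 3.0, size=n)                 # inverse selection probabilities
r = rng.random(n) < 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * X[:, 0])))

full = np.average(y, weights=base_w)                   # full sample, no nonresponse
bwgt = np.average(y[r], weights=base_w[r])             # base weights only
two_step = np.average(y[r], weights=two_step_weights(X, y, r, base_w))
print(f"full {full:.3f}, base-weight only {bwgt:.3f}, two-step {two_step:.3f}")
```

A cell that contains no respondents contributes no adjusted weight in this sketch; a production adjustment would collapse such cells. Note also that the outcome model is fit on respondents only, which is the limitation the slides point to under NMAR.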
Bias Assessment for MCAR and MAR
[Figure: bias by method (propensity-only and two-step, ŷ → propensity), with a baseline to measure improvement in nonresponse bias]
❯ Missing Completely At Random (MCAR)
• Nonresponse is unrelated to any variable in the data → unbiased estimate regardless of auxiliary variables and adjustment methods! CONFIRMED!
❯ Missing At Random (MAR)
• Probability to respond depends only on the covariates → covariates are observed for all sampled units and estimates should be unbiased. Mostly true
Bias Assessment for NMAR
[Figure: bias by method (propensity-only and two-step, ŷ → propensity). bwgt: worst case scenario, baseline to measure improvement in bias and rmse]
❯ Can we reduce nonresponse bias for different types of nonresponse … (Vartivarian & Little, 2002)
• by adjusting on response propensity alone (i.e., propensity-only)?
• by incorporating (predicted) survey outcome(s) in response propensity models (i.e., two-step, ŷ → propensity)?
- Using modeling tools from the statistical learning area
- rpart, rpms, gradient boosting (Morral et al., 2015; Fay & Riddles, 2016; Lin & Flores Cervantes, 2018)
❯ None of the adjustment methods yields “unbiased” estimates.
❯ All methods have some impact on nonresponse bias, with the level of correction: rpms > xgb+rpms > xgb+xgb.
❯ Comparing the two-step (i.e., xgb+rpms, xgb+xgb) methods to the propensity-only (i.e., rpms) method, under these settings, the propensity-only method yields the lowest bias and rmse.
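Bias and rmse for a method are computed over the simulation replicates against the known population value. A minimal sketch follows; the helper name `bias_and_rmse` and the four replicate values are made up for illustration and are not results from the study.

```python
import numpy as np


def bias_and_rmse(estimates, truth):
    """Bias and root mean squared error of replicate estimates against the truth."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - truth
    rmse = np.sqrt(np.mean((estimates - truth) ** 2))
    return bias, rmse


# Illustrative replicate estimates (hypothetical numbers, not study results).
replicates = [0.18, 0.22, 0.20, 0.24]
truth = 0.20
bias, rmse = bias_and_rmse(replicates, truth)

# rmse^2 decomposes into bias^2 plus the variance of the replicate estimates.
variance = np.var(replicates)
print(f"bias {bias:.4f}, rmse {rmse:.4f}, bias^2 + variance {bias**2 + variance:.6f}")
```

The decomposition explains why a method can rank differently on bias and on rmse: an adjustment that lowers bias but produces highly variable weights may still have a larger rmse.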
Conclusion
MCAR: under this baseline assumption, the two-step (ŷ → propensity) methods yield the same unbiased results as the Horvitz-Thompson and propensity-only methods; they do not have a negative impact on the estimate.
MAR: under this assumption, having all data available for modeling should yield unbiased estimates, and the two-step methods show no benefit over the propensity-only method.
NMAR: in this setting, the ŷ of the two-step methods predicts the estimate for respondents and not for the population; consequently, the estimates under these methods show worse results than the propensity-only method in terms of bias and mse reduction.
Results
Contact information:
AmyLin@westat.com
IsmaelFloresCervantes@westat.com