Stop or Continue Data Collection: A Nonignorable Missing Data Approach for Continuous Variables
Thaís Paiva (Federal University of Minas Gerais, Brazil)
Jerry Reiter (Duke University)
March 14, 2018
JOS and DC-AAPOR Workshop on Responsive and Adaptive Survey Design
Outline
• Introduction
• Methodology
• Illustration with Census of Manufactures Data
• Conclusions
• References

This research was supported by the NSF NCRN grant (SES-11-31897) awarded to Duke University. Any opinions and conclusions expressed in this article are those of the authors and do not necessarily represent the views of the U.S. Census Bureau. All results have been reviewed to ensure that no confidential information is disclosed.
Introduction
Motivation
Census of Manufactures Data
• Survey administered by the U.S. Census Bureau annually, with sample estimates of statistics for all manufacturing establishments with one or more paid employees.
• Provides statistics on employment, payroll, cost of materials consumed, operating expenses, value of shipments, etc.
Adaptive Design
• Methods that use auxiliary information to tailor and update the sampling scheme throughout the survey:
  • administrative records;
  • paradata (data about the data collection process);
  • actual responses as they are collected.
• Changes to the survey design can be applied to individuals or to the entire survey.
• In an ongoing survey, decide whether to:
  1. stop the data collection, or
  2. invest in collecting more data.
Decision rule
How to decide whether to stop or not?
• Information measure: How different is the non-respondents' distribution from the respondents'?
• Cost measure: How much does it cost to collect more data, and what is the budget?
Stopping Rules
Rao et al. (2008): stopping rules for surveys with multiple waves, for binary response variables.
• Based on standardized differences of the response proportions at each wave, where the proportions are estimated with multiple imputation of the nonresponses.
Wagner and Raghunathan (2010): stopping rules based on the probability of additional data changing the estimates, also for binary response variables.
• Compare the estimates obtained if data collection stops with the estimates obtained if a follow-up sample is collected.
Methodology
Model for the observed data
• Continuous multivariate data
• The variables are likely correlated, with heavily skewed distributions
• The model has to be flexible enough to capture any distributional features of the data
➡
• Mixture of multivariate normal distributions
• Dirichlet Process prior to allow for more flexibility and better density estimation (Ishwaran and James, 2001)
Dirichlet Process Mixture Model
$Y_n = (y_1, \dots, y_n)$: $n$ complete $p$-dimensional observations. Assume each variable is standardized.
$z_i \in \{1, \dots, K\}$: component indicator of the $i$-th observation, with probability $\pi_k = P(z_i = k)$.
Each component $k$ follows a multivariate normal (MVN) distribution $N(\mu_k, \Sigma_k)$.
Mixture model:
$$y_i \mid z_i, \mu, \Sigma \sim N(y_i \mid \mu_{z_i}, \Sigma_{z_i})$$
$$z_i \mid \pi \sim \text{Multinomial}(\pi_1, \dots, \pi_K)$$
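To make the generative reading of the mixture model concrete, here is a minimal sketch that draws $z_i$ from the weights and then $y_i$ from the selected component. It uses only the Python standard library; the function names `cholesky` and `sample_mixture` and all numeric values are illustrative assumptions, not part of the talk.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def sample_mixture(n, pi, mus, sigmas, rng):
    """Draw n points: z_i ~ Multinomial(pi), then y_i ~ N(mu_{z_i}, Sigma_{z_i})."""
    chols = [cholesky(s) for s in sigmas]
    p = len(mus[0])
    ys, zs = [], []
    for _ in range(n):
        z = rng.choices(range(len(pi)), weights=pi)[0]
        eps = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # y = mu_z + L_z eps has covariance L_z L_z^T = Sigma_z
        y = [mus[z][d] + sum(chols[z][d][k] * eps[k] for k in range(d + 1))
             for d in range(p)]
        ys.append(y)
        zs.append(z)
    return ys, zs

# Hypothetical two-component mixture in p = 2 dimensions.
rng = random.Random(0)
pi = [0.7, 0.3]
mus = [[0.0, 0.0], [2.0, 2.0]]
sigmas = [[[1.0, 0.5], [0.5, 1.0]], [[0.5, 0.0], [0.0, 0.5]]]
ys, zs = sample_mixture(500, pi, mus, sigmas, rng)
```

In the talk the parameters are estimated from the respondents' data; here they are fixed by hand purely to show the two-stage draw.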
Prior specification
Components:
$$\mu_k \mid \Sigma_k \sim N(\mu_0, h^{-1} \Sigma_k)$$
$$\Sigma_k \sim IW(f, \Phi)$$
$$\Phi = \text{diag}(\phi_1, \dots, \phi_p), \quad \phi_j \sim \text{Gamma}(a_\phi, b_\phi)$$
with $\mu_0 = 0$, $h = 1$, degrees of freedom $f = p + 1$, and $a_\phi = b_\phi = 0.25$.
Alternative: $\Sigma_k = \sigma I_p$, for all $k$ and $\sigma > 0$ ⇒ to control the size of the clusters.
Stick-breaking representation for the weights:
$$\pi_k = v_k \prod_{g < k} (1 - v_g) \quad \text{for } k = 1, \dots, K$$
$$v_k \sim \text{Beta}(1, \alpha) \quad \text{for } k = 1, \dots, K - 1; \quad v_K = 1$$
$$\alpha \sim \text{Gamma}(a_\alpha, b_\alpha), \quad a_\alpha = b_\alpha = 0.25$$
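The stick-breaking construction can be sketched in a few lines. This is an illustration under stated assumptions: $\alpha$ is passed in as a fixed number rather than drawn from its Gamma prior, and the function name `stick_breaking_weights` is hypothetical.

```python
import random

def stick_breaking_weights(K, alpha, rng):
    """Truncated stick-breaking: pi_k = v_k * prod_{g<k} (1 - v_g), with v_K = 1."""
    v = [rng.betavariate(1.0, alpha) for _ in range(K - 1)] + [1.0]
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for vk in v:
        weights.append(vk * remaining)
        remaining *= 1.0 - vk
    return weights

rng = random.Random(1)
pi = stick_breaking_weights(K=10, alpha=1.0, rng=rng)
```

Setting $v_K = 1$ assigns whatever is left of the stick to the last component, so the $K$ weights sum to one. Smaller values of `alpha` concentrate the mass on the first few components.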
Imputation under MNAR
• MAR: generate imputed data from the posterior predictive distribution.
• MNAR: generate imputed data from an altered posterior predictive distribution.
Diagram: the mixture model fitted to the respondents $D_R$ yields $(\mu, \Sigma, \pi)$; the weights $\pi$ are replaced by $\pi^*$, which reflect a hypothesis for the non-respondents' pattern; the non-respondents $D_{NR}$ are then imputed using $(\mu, \Sigma, \pi^*)$.
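The idea of imputing from an altered predictive distribution can be sketched as follows: keep the component parameters estimated from the respondents, but draw the component labels of the non-respondents from $\pi^*$ instead of $\pi$. For brevity this sketch uses a univariate mixture with fixed parameters, which is a simplification of the multivariate model in the talk; the function name and all numbers are hypothetical.

```python
import random

def impute_nonrespondents(n_nr, pi_star, mus, sds, rng):
    """Draw n_nr imputed values: label from pi_star, value from that component."""
    labels = rng.choices(range(len(pi_star)), weights=pi_star, k=n_nr)
    values = [rng.gauss(mus[z], sds[z]) for z in labels]
    return values, labels

rng = random.Random(2)
mus, sds = [-1.0, 1.5], [0.5, 0.8]  # component parameters (assumed known here)
pi = [0.5, 0.5]                     # weights estimated from respondents (hypothetical)
pi_star = [0.9, 0.1]                # altered weights: non-respondents favor component 0

mar_values, mar_labels = impute_nonrespondents(1000, pi, mus, sds, rng)
mnar_values, mnar_labels = impute_nonrespondents(1000, pi_star, mus, sds, rng)
```

Using `pi` reproduces the MAR case, and any other choice of `pi_star` encodes one hypothesis about how the non-respondents differ.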
Sensitivity Analysis
• We need to consider different plausible missingness scenarios for sensitivity analysis.
• For each scenario $s$, specify the mixture probabilities $\pi^{*(s)}$.
• Consider a hypothetical population generated by imputing the nonrespondents following each scenario.
• Evaluate the impact on inferences if we collect follow-up samples (FUS) with varying sizes.
FUS size: $n_F = \delta\, n_{MAX}$, where $\delta \in [0, 1]$ and $n_{MAX}$ is the maximum sample size given the budget.
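One way to set up the scenarios and the grid of follow-up sizes is sketched below. The talk does not say how $\pi^{*(s)}$ is built from $\pi$, so the multiply-and-renormalize rule in `tilt_weights` is an assumption made only for illustration, as are the weights, the factor 2, and $n_{MAX} = 200$. Rounding $n_F$ down to an integer is also an assumption.

```python
import math

def tilt_weights(pi, favored, factor):
    """Multiply the weights of the favored clusters by factor, then renormalize."""
    raw = [w * factor if k in favored else w for k, w in enumerate(pi)]
    total = sum(raw)
    return [w / total for w in raw]

def followup_size(delta, n_max):
    """n_F = delta * n_MAX, rounded down to an integer sample size."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return math.floor(delta * n_max)

pi = [0.5, 0.3, 0.2]  # hypothetical weights, clusters ordered from bottom to top
scenarios = {
    "MAR": pi,                              # weights left unaltered
    "Bottom": tilt_weights(pi, {0}, 2.0),   # favor the bottom-ranked cluster
    "Top": tilt_weights(pi, {2}, 2.0),      # favor the top-ranked cluster
}
sizes = [followup_size(d, 200) for d in (0.0, 0.25, 0.5, 0.75, 1.0)]
```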
Imputation
For each scenario $(s)$, generate $m_P$ hypothetical populations by imputation with $\pi^{*(s)}$:
• From the model fitted to $D_R$, with $(\mu, \Sigma)$ and weights $\pi^{*(s)}$, multiply impute the nonrespondents to obtain $\tilde{D}_{NR}^{(s,1)}, \tilde{D}_{NR}^{(s,2)}, \dots, \tilde{D}_{NR}^{(s,m_P)}$.
• Each hypothetical population combines the respondents with one imputed set: $D_R \cup \tilde{D}_{NR}^{(s,j)}$, for $j = 1, \dots, m_P$.
Imputation
For each scenario $(s)$ and for each imputation $j$, consider a follow-up sample of relative size $\delta$:
• Split the imputed nonrespondents $\tilde{D}_{NR}^{(s,j)}$ into the follow-up sample $D_{F,\delta}^{(s,j)}$ and the remaining nonrespondents $D_{NF}$.
• Option A) and Option B): fit a new model $(\mu, \Sigma, \pi)$, then multiply impute $D_{NF}$ under MAR to obtain $\tilde{D}_{NF}^{(s,j,1)}, \tilde{D}_{NF}^{(s,j,2)}, \dots, \tilde{D}_{NF}^{(s,j,m_F)}$.
[Diagram: in Options A) and B) the new model attaches differently to $D_R$ and $D_{F,\delta}^{(s,j)}$; the original figure is not recoverable.]
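The split of the imputed nonrespondents into a follow-up sample and a remainder can be sketched as below. Simple random sampling without replacement is an assumption here, since the slides do not state how the follow-up cases are selected; the function name and the toy records are hypothetical.

```python
import random

def draw_followup(imputed_nr, n_f, rng):
    """Split imputed nonrespondents into a follow-up sample and the remainder."""
    chosen = set(rng.sample(range(len(imputed_nr)), n_f))
    followup = [row for i, row in enumerate(imputed_nr) if i in chosen]
    remaining = [row for i, row in enumerate(imputed_nr) if i not in chosen]
    return followup, remaining

rng = random.Random(3)
imputed_nr = list(range(20))  # stand-in for the records of one imputed data set
d_f, d_nf = draw_followup(imputed_nr, n_f=5, rng=rng)
```

The records in `d_f` play the role of newly observed data, while those in `d_nf` are treated as still missing and are imputed again under the new model.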
Imputation
Compare the data sets:
$$P^{(s,j)} \quad \text{and} \quad \left\{ \tilde{D}_{\delta}^{(s,j,1)}, \dots, \tilde{D}_{\delta}^{(s,j,m_F)} \right\}$$
for each scenario $(s)$, and for all imputations $j = 1, \dots, m_P$.
Utility measures
Propensity scores:
• Used in observational studies to match on covariate characteristics and reduce the impact of confounding factors.
• The propensity score is the probability of being assigned to the treatment group $T$ given the variables $x$: $e(x) = P(T = 1 \mid x)$.
• Calculate the propensity score on the merged data set consisting of the population $P^{(s,j)}$ (with $T = 1$) and the data set $\tilde{D}_{\delta}^{(s,j,l)}$ (with $T = 0$) (Woo et al., 2009).
• Use generalized additive models (GAM), where the linear component of the regression is replaced by a flexible additive function, such as splines.
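The propensity-score step can be illustrated with a small sketch. Note the substitution: the talk uses a GAM, whereas this sketch fits a plain logistic regression by gradient ascent so that it stays within the standard library. The function names, learning rate, and toy data are all hypothetical.

```python
import math

def _sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def fit_logistic(xs, ts, steps=500, lr=0.1):
    """Gradient ascent on the logistic log-likelihood; returns (intercept, coefs)."""
    n, p = len(xs), len(xs[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(steps):
        g0, g = 0.0, [0.0] * p
        for x, t in zip(xs, ts):
            e = _sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, x)))
            g0 += t - e
            for j in range(p):
                g[j] += (t - e) * x[j]
        b0 += lr * g0 / n
        b = [bj + lr * gj / n for bj, gj in zip(b, g)]
    return b0, b

def propensity_scores(xs, b0, b):
    """Predicted P(T = 1 | x) for each row of xs."""
    return [_sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, x))) for x in xs]

# Merged data set: population rows get T = 1, completed-data rows get T = 0.
population = [[-1.0], [0.0], [1.0], [2.0]]
completed = [[-1.0], [0.0], [1.0], [2.0]]  # identical copy, for illustration
xs = population + completed
ts = [1] * len(population) + [0] * len(completed)
b0, b = fit_logistic(xs, ts)
e_hat = propensity_scores(xs, b0, b)
```

When the two data sets are identical, as in this toy example, the model cannot tell them apart and every predicted score equals 0.5.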
Utility measures
Based on the predicted values of the propensity scores calculated on the merged data set of size $2N$.
Measure $\rho$: Let the summary measure be
$$\rho_{\delta}^{(s,j,l)} = \frac{\sum_{i=1}^{2N} (\hat{e}_i - 0.5)^2}{2N}$$
for each value of $\delta$, scenario $s$, population $j$, and imputation $l$.
The predicted values should be around 0.5 if the two data sets are comparable.
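The summary measure is a mean squared deviation of the predicted scores from 0.5, which the following sketch computes directly. The function name `rho` and the example scores are illustrative assumptions.

```python
def rho(e_hat):
    """Mean squared deviation of predicted propensity scores from 0.5.

    e_hat holds the predicted scores for all 2N rows of the merged data set.
    """
    return sum((e - 0.5) ** 2 for e in e_hat) / len(e_hat)

indistinguishable = rho([0.5, 0.5, 0.5, 0.5])  # scores carry no information: 0.0
fully_separated = rho([0.0, 1.0])              # perfect discrimination: 0.25
```

Values near 0 indicate that the completed data set is hard to distinguish from the hypothetical population, and 0.25 is the largest value the measure can take.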
Illustration with Census of Manufactures Data
Variables: total value of shipments (TVS), total employment (TE), and salary/wages (SW).
Industry: plastics products manufacturing.
Scenarios:
• MAR;
• MNAR with higher probabilities for bottom ranked clusters;
• MNAR with higher probabilities for top ranked clusters.
Plastic industry - MAR scenario. The values are log-transformed and standardized.
Plastic industry - MNAR scenario with higher probabilities for bottom ranked clusters
Plastic industry - MNAR scenario with higher probabilities for top ranked clusters
[Figure: Plastic industry. Two panels, $D_R \cup D_{F,\delta}$ and $D_{F,\delta}$, each plotting $\rho$ against $\delta$ (from 0 to 1) for the MAR, Bottom, and Top scenarios.]
Conclusions