
Variance Estimation in Presence of Imputation: an Application to ISTAT Survey Data
Marco Di Zio, Stefano Falorsi, Ugo Guarnera, Orietta Luzi, Paolo Righi (paolo.righi@istat.it)
European Conference on Quality in Official Statistics 2008, Rome, 8-11 July 2008

OUTLINE
• Imputation and Variance Estimation in Official Statistics
• Adjusted Jackknife (AJ) variance estimation under Hot Deck (HD) imputation
• Delete-A-Group Jackknife (DAGJK) and Extended DAGJK (EDAGJK)
• DAGJK and EDAGJK with Rao and Shao adjustment for imputation
• Description of a Monte Carlo simulation study on real survey data
• Results

Imputation and Variance Estimation in Official Statistics
• In Official Statistics, item nonresponse in survey data is generally dealt with by imputation
• In variance estimation, imputed data are usually treated as if they were true values, without taking into account the additional source of variability due to the imputation process. Standard formulas can therefore lead to serious underestimation of the variance
• National Statistical Institutes usually do not apply variance estimation methods that take the imputed data into account, because of both theoretical and computational problems
• Our goal is to study methods belonging to the jackknife family, focusing on their feasibility for official statistical data

Adjusted Jackknife (AJ) variance estimator under Hot Deck (HD) imputation
• Rao and Shao (1992) proposed an AJ variance estimator under HD imputation that is consistent under some assumptions on the response model
• For a stratified multistage sampling design with negligible finite population correction factor, the AJ estimator is:

$$\widehat{Var}(\hat{Y}_I) = \sum_h \frac{n_h - 1}{n_h} \sum_{k \in h} \left( \hat{Y}_{Ih}^{(k)} - \hat{Y}_I \right)^2 \qquad (1)$$

where

$$\hat{Y}_I = \sum_h \left( \sum_{k \in A_{Rh}} w_k y_k + \sum_{k \in A_{Mh}} w_k y_k^* \right) \qquad (2)$$

and $A_{Rh}$ and $A_{Mh}$ are the sets of respondents and nonrespondents in stratum $h$

Adjusted Jackknife (AJ) variance estimator under HD imputation
• The term $\hat{Y}_{Ih}^{(k)}$ in (1) is the estimate of $\hat{Y}$ when unit $k \in h$ is omitted:

$$\hat{Y}_{Ih}^{(k)} = \sum_c \left[ \sum_{i \in A_{Rc}} w_i^{(k)} y_i + \sum_{i \in A_{Mc}} w_i^{(k)} \left( y_i^* + \hat{y}_{Rc}^{(k)} - \bar{y}_{Rc} \right) \right] \qquad (3)$$

with $\hat{y}_{Rc}^{(k)} = \sum_{i \in A_{Rc}} w_i^{(k)} y_i \big/ \sum_{i \in A_{Rc}} w_i^{(k)}$ and $\bar{y}_{Rc} = \sum_{i \in A_{Rc}} w_i y_i \big/ \sum_{i \in A_{Rc}} w_i$
• HD imputation consists in replacing the missing values $y_k$ of an incomplete unit (recipient) with the observed values $y_k^*$ from another record (donor) chosen among the complete units of the same survey
• In Random HD, the donor is randomly selected among a pool of units belonging to a subset of records (imputation cell $c$) having the same level of some categorical variables
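Equations (1) to (3) and the Random HD rule can be illustrated with a minimal Python sketch. It assumes a simplified design in which every sampled unit is its own PSU, and the function names, the toy data and the use of `None` for a missing value are illustrative choices, not taken from the presentation. The delete-one replicate weights (0 for the omitted unit, $w_i n_h/(n_h-1)$ for the other units of its stratum) are the usual jackknife weights and are also an assumption here.

```python
import random
from collections import Counter, defaultdict


def random_hot_deck(y_obs, cell, rng):
    """Replace each missing value (None) with the value of a donor drawn at
    random among the respondents of the same imputation cell."""
    donors = defaultdict(list)
    for y, c in zip(y_obs, cell):
        if y is not None:
            donors[c].append(y)
    return [y if y is not None else rng.choice(donors[c])
            for y, c in zip(y_obs, cell)]


def respondent_mean(w, y_obs, cell, c):
    """Weighted mean of the observed values among respondents of cell c."""
    pairs = [(wi, y) for wi, y, ci in zip(w, y_obs, cell)
             if ci == c and y is not None]
    return sum(wi * y for wi, y in pairs) / sum(wi for wi, _ in pairs)


def adjusted_total(w_rep, w_full, y_obs, y_imp, cell):
    """Equation (3): imputed values are shifted by the difference between the
    replicate and the full-sample respondent mean of their cell."""
    shift = {c: respondent_mean(w_rep, y_obs, cell, c)
                - respondent_mean(w_full, y_obs, cell, c)
             for c in set(cell)}
    return sum(wr * (y if y is not None else y_star + shift[c])
               for wr, y, y_star, c in zip(w_rep, y_obs, y_imp, cell))


def aj_variance(stratum, w, y_obs, y_imp, cell):
    """Equation (1), deleting one unit at a time within its stratum."""
    n_h = Counter(stratum)
    y_hat = sum(wi * y for wi, y in zip(w, y_imp))  # equation (2)
    var = 0.0
    for k, h in enumerate(stratum):
        n = n_h[h]
        # Delete-one weights: 0 for unit k, inflated within its stratum.
        w_rep = [0.0 if i == k else wi * n / (n - 1) if s == h else wi
                 for i, (wi, s) in enumerate(zip(w, stratum))]
        y_k = adjusted_total(w_rep, w, y_obs, y_imp, cell)
        var += (n - 1) / n * (y_k - y_hat) ** 2
    return var


# Toy data (hypothetical): 2 strata, 2 imputation cells, 3 missing values.
rng = random.Random(0)
stratum = [0] * 6 + [1] * 6
cell = [0, 0, 0, 1, 1, 1] * 2
w = [10.0] * 6 + [20.0] * 6
y_obs = [1, 0, None, 1, 1, None, 0, 1, 1, None, 0, 1]
y_imp = random_hot_deck(y_obs, cell, rng)
v_aj = aj_variance(stratum, w, y_obs, y_imp, cell)
print(y_imp, v_aj)
```

With complete data the shift term never applies, and the sketch reduces to the ordinary stratified delete-one jackknife. The sketch requires every cell to keep at least one respondent in each replicate.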

Adjusted Jackknife (AJ) variance estimator under HD imputation
• Advantages of using the jackknife in Official Statistics:
  – No model assumptions are needed;
  – Unit and item nonresponse are easily handled by this method;
  – Variances of nonlinear statistics and of domain estimates can be easily calculated by external users.
• Drawbacks:
  – The jackknife becomes computer intensive for large-scale surveys;
  – It is sometimes not suitable for the typical sampling designs adopted in Official Statistics (strata with small sample sizes lead to upward-biased estimates).

Delete-A-Group Jackknife (DAGJK) and Extended DAGJK (EDAGJK)
• To overcome these problems we consider the DAGJK method
• DAGJK is based on the following jackknife procedure:
  – Primary Sample Units (PSUs) in the same stratum are randomly ordered;
  – From this ordering, the PSUs are systematically allocated into G groups;
  – Considering the g-th group, the replicate-g weights for the elementary unit k are computed:

Delete-A-Group Jackknife (DAGJK) and Extended DAGJK (EDAGJK)

$$w_k^{(g)} = \begin{cases} w_k, & \text{when } k \in h \text{ and no PSU} \in h \text{ belongs to the group } g \\ 0, & \text{when } k \in \text{PSU in group } g \\ \dfrac{n_h}{n_h - n_h^{(g)}}\, w_k, & \text{otherwise} \end{cases} \qquad (4)$$

where $n_h^{(g)}$ is the number of PSUs of stratum $h$ allocated to group $g$
• The precision of the variance estimates improves when the number of random groups increases
• DAGJK produces biased estimates when the number of sample PSUs in the strata is small (less than 5)
• Kott (2001) proposed the EDAGJK to handle the latter case
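The grouping procedure and the replicate weights (4) can be sketched in Python as follows. The function names and the toy design are hypothetical, and continuing the systematic allocation across strata without restarting the counter is an assumption of this sketch, since the slides do not specify it.

```python
import random
from collections import defaultdict


def allocate_groups(psu_stratum, G, rng):
    """Randomly order the PSUs within each stratum, then deal them
    systematically into G groups. Returns a dict {psu: group}."""
    by_stratum = defaultdict(list)
    for psu, h in psu_stratum.items():
        by_stratum[h].append(psu)
    group, position = {}, 0
    for h in sorted(by_stratum):
        rng.shuffle(by_stratum[h])
        for psu in by_stratum[h]:
            group[psu] = position % G  # counter runs on across strata (assumption)
            position += 1
    return group


def dagjk_weights(w, unit_psu, psu_stratum, group, g):
    """Replicate-g weights following equation (4)."""
    n_h = defaultdict(int)   # PSUs per stratum
    n_hg = defaultdict(int)  # PSUs per stratum that fall in group g
    for psu, h in psu_stratum.items():
        n_h[h] += 1
        if group[psu] == g:
            n_hg[h] += 1
    replicate = []
    for wk, psu in zip(w, unit_psu):
        h = psu_stratum[psu]
        if n_hg[h] == 0:          # stratum untouched by group g
            replicate.append(wk)
        elif group[psu] == g:     # unit belongs to a deleted PSU
            replicate.append(0.0)
        else:                     # remaining PSUs of the stratum are inflated
            replicate.append(wk * n_h[h] / (n_h[h] - n_hg[h]))
    return replicate


# Toy design (hypothetical): 2 strata with 6 PSUs each, one unit per PSU.
rng = random.Random(0)
psu_stratum = {p: 0 if p < 6 else 1 for p in range(12)}
unit_psu = list(range(12))
w = [10.0] * 6 + [20.0] * 6
G = 3
group = allocate_groups(psu_stratum, G, rng)
for g in range(G):
    print(g, dagjk_weights(w, unit_psu, psu_stratum, group, g))
```

In this balanced toy design each replicate drops two PSUs per stratum and inflates the remaining four by 6/4, so the sum of the weights is preserved in every replicate.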

Delete-A-Group Jackknife (DAGJK) and Extended DAGJK (EDAGJK)
• EDAGJK is based on the following replicate weights

$$w_k^{(g)} = \begin{cases} w_k, & \text{when } k \in h \text{ and no PSU} \in h \text{ belongs to the group } g \\ w_k \left[ 1 - (n_h - 1) Z \right], & \text{when } k \in \text{PSU in group } g \\ w_k (1 + Z), & \text{otherwise} \end{cases} \qquad (5)$$

where $Z^2 = G / \left[ (G - 1)\, n_h (n_h - 1) \right]$
• The DAGJK or EDAGJK variance estimator using the weights in formulas (4) or (5) is

$$\widehat{Var}(\hat{Y}) = \frac{G - 1}{G} \sum_g \left( \hat{Y}^{(g)} - \hat{Y} \right)^2 \qquad (6)$$

with $\hat{Y}^{(g)} = \sum_{k \in s} w_k^{(g)} y_k$
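A small Python sketch of the weights (5) and the estimator (6) is given below, under stated assumptions: every stratum has at least two PSUs, no two PSUs of the same stratum share a group, and the group allocation is supplied directly. The function names and data are illustrative only. In this toy design each PSU forms its own group, and with the factor $(G-1)/G$ the result coincides with the usual stratified with-replacement formula $\sum_h \frac{n_h}{n_h-1} \sum_j (t_{hj} - \bar{t}_h)^2$ computed on weighted PSU totals.

```python
import math
from collections import defaultdict


def edagjk_weights(w, unit_psu, psu_stratum, group, g, G):
    """Replicate-g weights following equation (5)."""
    n_h = defaultdict(int)
    hit = set()  # strata having a PSU in group g
    for psu, h in psu_stratum.items():
        n_h[h] += 1
        if group[psu] == g:
            hit.add(h)
    replicate = []
    for wk, psu in zip(w, unit_psu):
        h = psu_stratum[psu]
        if h not in hit:
            replicate.append(wk)
            continue
        z = math.sqrt(G / ((G - 1) * n_h[h] * (n_h[h] - 1)))
        if group[psu] == g:
            replicate.append(wk * (1 - (n_h[h] - 1) * z))
        else:
            replicate.append(wk * (1 + z))
    return replicate


def replicate_variance(w, y, replicate_sets):
    """Equation (6): (G - 1)/G times the sum of squared deviations of the
    replicate totals from the full-sample total."""
    G = len(replicate_sets)
    y_hat = sum(wk * yk for wk, yk in zip(w, y))
    return (G - 1) / G * sum(
        (sum(a * b for a, b in zip(wg, y)) - y_hat) ** 2
        for wg in replicate_sets)


# Toy design (hypothetical): 3 strata with 2 PSUs each, every PSU its own group.
psu_stratum = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
group = {p: p for p in psu_stratum}
G = 6
unit_psu = [0, 0, 1, 2, 3, 3, 4, 5]
w = [5.0] * 8
y = [1, 1, 0, 1, 0, 0, 1, 0]
sets = [edagjk_weights(w, unit_psu, psu_stratum, group, g, G) for g in range(G)]
v_edagjk = replicate_variance(w, y, sets)
print(v_edagjk)
```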

DAGJK and EDAGJK with Rao and Shao adjustment for imputation
• We propose a DAGJK or EDAGJK version with the Rao and Shao adjustment for HD imputation
• The method obtains $\hat{Y}_I^{(g)}$ by substituting the DAGJK or EDAGJK replicate weights into (3):

$$\hat{Y}_I^{(g)} = \sum_c \left[ \sum_{i \in A_{Rc}} w_i^{(g)} y_i + \sum_{i \in A_{Mc}} w_i^{(g)} \left( y_i^* + \hat{y}_{Rc}^{(g)} - \bar{y}_{Rc} \right) \right] \qquad (7)$$

with $\hat{y}_{Rc}^{(g)} = \sum_{i \in A_{Rc}} w_i^{(g)} y_i \big/ \sum_{i \in A_{Rc}} w_i^{(g)}$
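The effect of the adjustment in (7) can be seen by comparing it with a naive replicate total that treats imputed values as observed. The Python sketch below does this for a hypothetical single-stratum sample of 12 units, with groups assigned as `i % 4` and DAGJK weights as in (4). The hand-picked imputed values, the function names and the data are assumptions made for illustration.

```python
def respondent_mean(w, y_obs, cell, c):
    """Weighted mean of the observed values among respondents of cell c."""
    pairs = [(wi, y) for wi, y, ci in zip(w, y_obs, cell)
             if ci == c and y is not None]
    return sum(wi * y for wi, y in pairs) / sum(wi for wi, _ in pairs)


def adjusted_total(w_rep, w_full, y_obs, y_imp, cell):
    """Equation (7): each imputed value is re-centred on the replicate
    respondent mean of its imputation cell."""
    shift = {c: respondent_mean(w_rep, y_obs, cell, c)
                - respondent_mean(w_full, y_obs, cell, c)
             for c in set(cell)}
    return sum(wr * (y if y is not None else y_star + shift[c])
               for wr, y, y_star, c in zip(w_rep, y_obs, y_imp, cell))


def naive_total(w_rep, y_imp):
    """Replicate total that treats imputed values as if they were observed."""
    return sum(wr * y for wr, y in zip(w_rep, y_imp))


def group_variance(replicate_totals, y_hat):
    """Equation (6) applied to a list of replicate totals."""
    G = len(replicate_totals)
    return (G - 1) / G * sum((t - y_hat) ** 2 for t in replicate_totals)


# Hypothetical data: one stratum, 12 units (each its own PSU), 2 cells.
n, G = 12, 4
w = [10.0] * n
cell = [i % 2 for i in range(n)]
group = [i % G for i in range(n)]
y_obs = [1, 0, None, 1, 1, None, 0, 1, 1, 0, None, 1]
y_imp = [1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1]  # donors picked by hand

# DAGJK weights (4): 3 units are deleted per group, the others get 12/9.
sets = [[0.0 if group[i] == g else w[i] * n / (n - 3) for i in range(n)]
        for g in range(G)]
y_hat = naive_total(w, y_imp)
v_adj = group_variance(
    [adjusted_total(wg, w, y_obs, y_imp, cell) for wg in sets], y_hat)
v_naive = group_variance([naive_total(wg, y_imp) for wg in sets], y_hat)
print(v_naive, v_adj)
```

In this toy data set the adjusted estimate is larger than the naive one, which is the direction expected when imputation variability is ignored. With complete data the two estimates coincide.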

Description of a Monte Carlo simulation study on real survey data
• The population of an Italian geographical region, Lazio (except for the province of Rome), with 1,372,572 units has been considered
• 250 samples according to the Italian Labour Force sampling design have been selected:
  – The municipalities of each province are ordered by population size and strata of municipalities with population size equal to a given threshold are formed. Strata with only one municipality are referred to as self-representing (S-R) strata (7);
  – In each S-R stratum a sample of households (PSUs) is selected (stratified cluster design);

Description of a Monte Carlo simulation study on real survey data
  – In each non-self-representing (NS-R) stratum a pps sample of municipalities (PSUs) of size 2 is drawn, and a sample of households is selected (two-stage stratified design);
• There are many NS-R strata with a non-negligible PSU sampling fraction

Frequency of NS-R strata by PSU sampling fraction
PSU sampling fraction   < 20%   20%-40%   40%-60%   > 60%   Total
Frequency               8       5         2         3       18

• The total of the variable employment (employed/not employed) has been considered

Description of a Monte Carlo simulation study on real survey data
• A Missing at Random (MAR) mechanism has been simulated by using 8 different missing rates depending on the values of 2 covariates: X1 (levels: 1, 2, 3, 4), referring to the household type, and the domain indicator variable, depending on whether the unit belongs to an S-R or an NS-R stratum.

Missing rate for the simulated nonresponse mechanism
        X1 = 1   X1 = 2   X1 = 3   X1 = 4
NS-R    10%      20%      30%      40%
S-R     40%      30%      20%      10%

• The number of PSUs (municipalities + households) is 552
• The HD method is applied with imputation cells defined as above
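The nonresponse mechanism in the table can be simulated with a few lines of Python. This is a minimal sketch: the function name, the data layout and the use of `None` for a missing value are assumptions, and only the eight rates come from the slide.

```python
import random

# Missing rates from the table, indexed by (domain, X1).
MISSING_RATE = {
    ("NS-R", 1): 0.10, ("NS-R", 2): 0.20, ("NS-R", 3): 0.30, ("NS-R", 4): 0.40,
    ("S-R", 1): 0.40, ("S-R", 2): 0.30, ("S-R", 3): 0.20, ("S-R", 4): 0.10,
}


def simulate_mar(y, x1, domain, rng):
    """Set y to None with a probability that depends only on the observed
    covariates (domain, X1), which makes the mechanism MAR."""
    return [None if rng.random() < MISSING_RATE[(d, x)] else value
            for value, x, d in zip(y, x1, domain)]


# Hypothetical micro-data: employment indicator with its two covariates.
rng = random.Random(2008)
y = [1, 0, 1, 1, 0, 1, 0, 0]
x1 = [1, 2, 3, 4, 1, 2, 3, 4]
domain = ["NS-R"] * 4 + ["S-R"] * 4
y_mis = simulate_mar(y, x1, domain, rng)
print(y_mis)
```

The eight (domain, X1) combinations are also the natural imputation cells for the HD step, because within each of them the response probability is constant.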

Results: Relative Bias (RB) and Relative Root Mean Square Error (RRMSE) of EDAGJK by number of random groups (RG)

               Without missing data    With imputed data
Number of RG   RB      RRMSE           RB      RRMSE
5              0.07    0.83            0.07    0.83
15             0.09    0.50            0.11    0.57
30             0.09    0.44            0.11    0.48
50             0.08    0.38            0.10    0.43
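The slides do not spell out how RB and RRMSE are computed. The sketch below uses the usual Monte Carlo definitions, which are an assumption here: RB is the mean of the variance estimates minus the reference variance, divided by the reference variance, and RRMSE is the root mean squared deviation from the reference variance, divided by the reference variance. The reference value would typically be the empirical variance of the point estimator over the simulated samples. Function names and numbers are illustrative.

```python
import math


def relative_bias(variance_estimates, true_variance):
    """Mean of the variance estimates minus the reference variance, relative
    to the reference variance."""
    mean_est = sum(variance_estimates) / len(variance_estimates)
    return (mean_est - true_variance) / true_variance


def relative_rmse(variance_estimates, true_variance):
    """Root mean squared deviation of the variance estimates from the
    reference variance, relative to the reference variance."""
    mse = sum((v - true_variance) ** 2
              for v in variance_estimates) / len(variance_estimates)
    return math.sqrt(mse) / true_variance


# Hypothetical variance estimates from three simulated samples.
estimates = [8.0, 10.0, 12.0]
rb = relative_bias(estimates, 10.0)
rrmse = relative_rmse(estimates, 10.0)
print(rb, rrmse)
```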

Results: Boxplots of the variance estimates of the methods EDAGJK, DAGJK, STANDARD FORMULA and JACKKNIFE [figure not available]

Results: Confidence intervals of the methods, 95% CI coverage and CI relative length (CI RL)

                    Without missing data       With imputed data
METHODS             95% CI coverage   CI RL    95% CI coverage   CI RL
EDAGJK - 30 RG      90.5              18.7     92.5              23.1
DAGJK - 30 RG       97.5              24.5     98.0              29.8
STANDARD FORMULA    91.5              18.8     88.0              19.5
JACKKNIFE           97.5              24.9     98.5              30.8
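Coverage and relative length can be computed from the simulated samples along the following lines. The exact definitions used for the table are not given in the slides, so this Python sketch rests on assumptions: normal-theory intervals with the quantile 1.96, coverage as the percentage of samples whose interval contains the true total, and relative length as the average interval length divided by the true total, in percent. Names and data are illustrative.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval (assumption)


def ci_coverage(estimates, variances, true_total):
    """Percentage of samples whose interval estimate +/- 1.96 * se contains
    the true total."""
    hits = sum(abs(est - true_total) <= Z_95 * math.sqrt(var)
               for est, var in zip(estimates, variances))
    return 100.0 * hits / len(estimates)


def ci_relative_length(variances, true_total):
    """Average interval length as a percentage of the true total."""
    lengths = [2 * Z_95 * math.sqrt(var) for var in variances]
    return 100.0 * sum(lengths) / len(lengths) / true_total


# Hypothetical point and variance estimates from four simulated samples.
estimates = [100.0, 110.0, 90.0, 130.0]
variances = [25.0, 25.0, 25.0, 25.0]
coverage = ci_coverage(estimates, variances, 100.0)
rel_length = ci_relative_length(variances, 100.0)
print(coverage, rel_length)
```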

Conclusion
• Variance estimation that takes imputed data into account is a pressing target in Official Statistics
• The proposed approach based on EDAGJK with the Rao and Shao adjustment seems to perform well in terms of precision of the variance estimates while remaining computationally feasible
• The empirical results suggest that the approach is suitable for the complex designs usually adopted by National Statistical Institutes
• Further analyses are needed to take into account a finite population correction factor in the variance estimator
• Finally, an empirical study with calibration estimators is needed

References
• Brick, J.M., Jones, M.E., Kalton, G., Valliant, R. (2005). Variance estimation with hot deck imputation: a simulation study of three methods. Survey Methodology, 31, 151-159.
• Kott, P.S. (2001). The delete-a-group jackknife. Journal of Official Statistics, 17, 521-526.
• Kott, P.S. (2006). Delete-a-group variance estimation for the general regression estimator under Poisson sampling. Journal of Official Statistics, 22, 759-767.
• Lee, H., Rancourt, E., Sarndal, C.-E. (1995). Jackknife variance for data with imputed values. Proceedings of the Statistical Society of Canada Survey Methods Section, 111-115.
• Rao, J.N.K., Shao, J. (1992). Jackknife variance estimation with survey data under hot deck imputation. Biometrika, 79, 811-822.
• Rust, K. (1985). Variance estimation for complex estimators in sample. Journal of Official Statistics, 1, 381-397.
• Shao, J., Steel, P. (1999). Variance estimation for survey data with composite estimation and nonnegligible sampling fractions. Journal of the American Statistical Association, 94, 254-265.
• Wolter, K.M. (1985). Introduction to variance estimation. New York, Springer Verlag.
