Variance Estimation in Presence of Imputation: an Application to ISTAT Survey Data
Marco Di Zio, Stefano Falorsi, Ugo Guarnera, Orietta Luzi, Paolo Righi (paolo.righi@istat.it)
European Conference on Quality in Official Statistics 2008, Rome, 8-11 July 2008
OUTLINE
• Imputation and Variance Estimation in Official Statistics
• Adjusted Jackknife (AJ) variance estimation under Hot Deck (HD) imputation
• Delete-A-Group Jackknife (DAGJK) and Extended DAGJK (EDAGJK)
• DAGJK and EDAGJK with Rao and Shao adjustment for imputation
• Description of a Monte Carlo simulation study on real survey data
• Results
Imputation and Variance Estimation in Official Statistics
• In Official Statistics, item nonresponse in survey data is generally handled by imputation
• In variance estimation, imputed data are usually treated as if they were true values, ignoring the additional variability introduced by the imputation process. Standard formulas can therefore seriously underestimate the variance
• National Statistical Institutes usually do not apply variance estimation methods that account for imputed data, because of both theoretical and computational problems
• Our goal is to study methods of the jackknife family, focusing on their feasibility for official statistical data
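Adjusted Jackknife (AJ) variance estimator under Hot Deck (HD) imputation
• Rao and Shao (1992) proposed an AJ variance estimator under HD imputation that is consistent under some assumptions on the response model
• For a stratified multistage sampling design with negligible finite population correction factor, the AJ variance estimator is:
$$\widehat{Var}(\hat{Y}_I) = \sum_h \frac{n_h - 1}{n_h} \sum_{k \in h} \left( \hat{Y}_{Ih}^{(k)} - \hat{Y}_I \right)^2 \qquad (1)$$
where
$$\hat{Y}_I = \sum_h \left( \sum_{k \in A_{Rh}} w_k y_k + \sum_{k \in A_{Mh}} w_k y_k^* \right) \qquad (2)$$
with $A_{Rh}$ and $A_{Mh}$ the sets of responding and nonresponding (imputed) units in stratum $h$, and $y_k^*$ the imputed value.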
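The outer sum in formula (1) can be sketched in a few lines of Python. This is only an illustration: the function name `aj_variance` and the dictionary layout (one list of delete-one replicate estimates per stratum) are assumptions made for this example, and the replicate estimates are taken as given.

```python
def aj_variance(full_estimate, replicates_by_stratum):
    """Formula (1): sum over strata of (n_h - 1)/n_h times the squared
    deviations of the delete-one replicate estimates from the full estimate."""
    variance = 0.0
    for replicates in replicates_by_stratum.values():
        n_h = len(replicates)  # one replicate per sampled unit of stratum h
        variance += (n_h - 1) / n_h * sum(
            (rep - full_estimate) ** 2 for rep in replicates
        )
    return variance


# Toy stratum with two units, y = (1, 3), unit weights 1, no missing data.
# Deleting one unit doubles the weight of the other, giving totals 6 and 2.
print(aj_variance(4.0, {"h1": [6.0, 2.0]}))  # prints 4.0
```

In this toy case the result equals the textbook estimator n/(n - 1) times the sum of squared deviations of the unit values from their mean, 2 × (1 + 1) = 4.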
• The term $\hat{Y}_{Ih}^{(k)}$ in (1) is the estimate of $\hat{Y}$ when unit $k \in h$ is omitted:
$$\hat{Y}_{Ih}^{(k)} = \sum_c \left[ \sum_{i \in A_{Rc}} w_i^{(k)} y_i + \sum_{i \in A_{Mc}} w_i^{(k)} \left( y_i^* + \hat{y}_{Rc}^{(k)} - \bar{y}_{Rc} \right) \right] \qquad (3)$$
with $\hat{y}_{Rc}^{(k)} = \sum_{i \in A_{Rc}} w_i^{(k)} y_i \, / \sum_{i \in A_{Rc}} w_i^{(k)}$ and $\bar{y}_{Rc} = \sum_{i \in A_{Rc}} w_i y_i \, / \sum_{i \in A_{Rc}} w_i$
• HD imputation consists in replacing the missing value $y_k$ of an incomplete unit (recipient) with the observed value $y_k^*$ from another record (donor) chosen among the complete units of the same survey
• In Random HD, the donor is randomly selected from a pool of units belonging to a subset of records (imputation cell $c$) that share the same levels of some categorical variables
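Random HD imputation within cells can be illustrated as follows. This is a minimal sketch, assuming that missing values are coded as `None` and that donors are drawn with equal probability (a weighted draw would be another option); the function name `random_hot_deck` is hypothetical.

```python
import random


def random_hot_deck(y, cell, rng):
    """Replace each missing value (None) with the value of a donor drawn
    at random among the respondents of the same imputation cell."""
    donors = {}
    for value, c in zip(y, cell):
        if value is not None:
            donors.setdefault(c, []).append(value)
    imputed = []
    for value, c in zip(y, cell):
        if value is None:
            # Assumption: every cell has at least one donor (KeyError otherwise).
            imputed.append(rng.choice(donors[c]))
        else:
            imputed.append(value)
    return imputed


y = [1, None, 0, None, 1]          # employed = 1, not employed = 0
cell = ["a", "a", "b", "b", "a"]   # imputation cells
print(random_hot_deck(y, cell, random.Random(0)))
```

Because all donors in cell "a" have value 1 and the only donor in cell "b" has value 0, this toy call returns the same completed vector whatever the seed.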
• Advantages of using the jackknife in Official Statistics:
  – No model assumptions are needed;
  – Unit and item nonresponse are easily handled;
  – Variances of nonlinear statistics and of domain estimates can be easily calculated by external users.
• Drawbacks:
  – The jackknife becomes computationally intensive for large-scale surveys;
  – It is sometimes unsuitable for the sampling designs typically adopted in Official Statistics (strata with small sample sizes lead to upward-biased estimates).
Delete-A-Group Jackknife (DAGJK) and Extended DAGJK (EDAGJK)
• To overcome these problems we consider the DAGJK method
• DAGJK is based on the following jackknife procedure:
  – The Primary Sampling Units (PSUs) in the same stratum are randomly ordered;
  – Following this ordering, the PSUs are systematically allocated to $G$ groups;
  – For the $g$-th group, the replicate-$g$ weights of elementary unit $k$ are computed as:
$$w_k^{(g)} = \begin{cases} w_k, & \text{when } k \in h \text{ and no PSU} \in h \text{ belongs to group } g \\ 0, & \text{when } k \in \text{PSU in group } g \\ \left[ n_h / \left( n_h - n_h^{(g)} \right) \right] w_k, & \text{otherwise} \end{cases} \qquad (4)$$
where $n_h^{(g)}$ is the number of PSUs of stratum $h$ that belong to group $g$.
• The precision of the variance estimates improves as the number of random groups increases
• DAGJK produces biased estimates when the number of sample PSUs in the strata is small (fewer than 5)
• Kott (2001) proposed the EDAGJK to handle the latter case
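The grouping step and the replicate weights of formula (4) can be sketched as below. The function names and data layout (a dictionary mapping each PSU to its stratum, one PSU label per elementary unit) are assumptions for illustration. Two further assumptions: the systematic allocation continues its count across strata, and $n_h^{(g)}$ is read as the number of PSUs of stratum $h$ falling in group $g$.

```python
import random


def assign_groups(psu_stratum, G, rng):
    """Randomly order the PSUs within each stratum, then deal them
    systematically into G groups (the count continues across strata)."""
    by_stratum = {}
    for psu, h in psu_stratum.items():
        by_stratum.setdefault(h, []).append(psu)
    group, position = {}, 0
    for h in sorted(by_stratum):
        psus = by_stratum[h]
        rng.shuffle(psus)
        for psu in psus:
            group[psu] = position % G
            position += 1
    return group


def dagjk_weights(w, unit_psu, psu_stratum, group, g):
    """Replicate-g weights of formula (4) for every elementary unit."""
    n_h, n_hg = {}, {}
    for psu, h in psu_stratum.items():
        n_h[h] = n_h.get(h, 0) + 1
        if group[psu] == g:
            n_hg[h] = n_hg.get(h, 0) + 1
    replicate = []
    for w_k, psu in zip(w, unit_psu):
        h = psu_stratum[psu]
        dropped = n_hg.get(h, 0)
        if dropped == 0:            # stratum untouched by group g
            replicate.append(w_k)
        elif group[psu] == g:       # unit belongs to a deleted PSU
            replicate.append(0.0)
        else:                       # remaining PSUs of the stratum are reweighted
            replicate.append(w_k * n_h[h] / (n_h[h] - dropped))
    return replicate


psu_stratum = {"p1": "A", "p2": "A", "p3": "A", "p4": "B", "p5": "B"}
group = {"p1": 0, "p2": 1, "p3": 2, "p4": 0, "p5": 1}   # fixed for the example
w = [10.0, 10.0, 10.0, 20.0, 20.0]
print(dagjk_weights(w, ["p1", "p2", "p3", "p4", "p5"], psu_stratum, group, 2))
# prints [15.0, 15.0, 0.0, 20.0, 20.0]
```

In the toy call, deleting group 2 removes PSU p3 from stratum A, so the two remaining PSUs are inflated by 3/2 and the stratum's total weight is preserved.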
• EDAGJK is based on the following replicate weights:
$$w_k^{(g)} = \begin{cases} w_k, & \text{when } k \in h \text{ and no PSU} \in h \text{ belongs to group } g \\ w_k \left[ 1 - (n_h - 1) Z \right], & \text{when } k \in \text{PSU in group } g \\ w_k (1 + Z), & \text{otherwise} \end{cases} \qquad (5)$$
where $Z^2 = G / [(G - 1) \, n_h (n_h - 1)]$
• The DAGJK or EDAGJK variance estimator using the weights in formulas (4) or (5) is
$$\widehat{Var}(\hat{Y}) = \frac{G - 1}{G} \sum_g \left( \hat{Y}^{(g)} - \hat{Y} \right)^2 \qquad (6)$$
with $\hat{Y}^{(g)} = \sum_{k \in s} w_k^{(g)} y_k$.
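A small numerical check of the EDAGJK weights in (5) and the estimator in (6) is sketched below. The function names are hypothetical. The sketch assumes that every stratum has at least two PSUs and at most one PSU per group, and it uses the usual delete-a-group multiplier (G - 1)/G, with which the toy result matches the standard with-replacement formula.

```python
import math


def edagjk_weights(w, unit_psu, psu_stratum, group, g, G):
    """Replicate-g weights of formula (5)."""
    n_h, hit = {}, set()
    for psu, h in psu_stratum.items():
        n_h[h] = n_h.get(h, 0) + 1
        if group[psu] == g:
            hit.add(h)              # strata with a PSU in group g
    replicate = []
    for w_k, psu in zip(w, unit_psu):
        h = psu_stratum[psu]
        if h not in hit:
            replicate.append(w_k)
            continue
        z = math.sqrt(G / ((G - 1) * n_h[h] * (n_h[h] - 1)))
        if group[psu] == g:
            replicate.append(w_k * (1 - (n_h[h] - 1) * z))
        else:
            replicate.append(w_k * (1 + z))
    return replicate


def replicate_variance(estimate, replicate_estimates):
    """Formula (6) with multiplier (G - 1)/G, G = number of replicates."""
    G = len(replicate_estimates)
    return (G - 1) / G * sum((r - estimate) ** 2 for r in replicate_estimates)


# One stratum, two PSUs with weighted totals 3 and 7, G = 5 random groups.
psu_stratum = {"p1": "A", "p2": "A"}
group = {"p1": 0, "p2": 1}
w, y, units, G = [1.0, 1.0], [3.0, 7.0], ["p1", "p2"], 5
replicates = [
    sum(wr * yk for wr, yk in zip(edagjk_weights(w, units, psu_stratum, group, g, G), y))
    for g in range(G)
]
print(replicate_variance(sum(y), replicates))  # approximately 16
```

For comparison, the standard estimator n/(n - 1) times the sum of squared deviations of the PSU totals from their mean gives 2 × (4 + 4) = 16 for the same data.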
DAGJK and EDAGJK with Rao and Shao adjustment for imputation
• We propose a version of DAGJK or EDAGJK with the Rao and Shao adjustment for HD imputation
• The method obtains $\hat{Y}_I^{(g)}$ by substituting the DAGJK or EDAGJK replicate weights into (3):
$$\hat{Y}_I^{(g)} = \sum_c \left[ \sum_{i \in A_{Rc}} w_i^{(g)} y_i + \sum_{i \in A_{Mc}} w_i^{(g)} \left( y_i^* + \hat{y}_{Rc}^{(g)} - \bar{y}_{Rc} \right) \right] \qquad (7)$$
with $\hat{y}_{Rc}^{(g)} = \sum_{i \in A_{Rc}} w_i^{(g)} y_i \, / \sum_{i \in A_{Rc}} w_i^{(g)}$
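Formula (7) can be sketched for a single replicate as follows. The function name and argument layout are assumptions for illustration: `y` holds observed values for respondents and hot deck values for imputed units, and `is_imputed` flags the latter. The handling of a cell whose respondents all receive zero replicate weight (no shift applied) is also an assumption, since the slides do not cover that case.

```python
def adjusted_replicate_total(y, is_imputed, cell, w, w_rep):
    """Replicate estimate of formula (7): imputed values are shifted by the
    difference between the replicate and the full-sample respondent means
    of their imputation cell."""
    num_full, den_full, num_rep, den_rep = {}, {}, {}, {}
    for y_i, miss, c, w_i, wr_i in zip(y, is_imputed, cell, w, w_rep):
        if not miss:  # cell means use respondents only
            num_full[c] = num_full.get(c, 0.0) + w_i * y_i
            den_full[c] = den_full.get(c, 0.0) + w_i
            num_rep[c] = num_rep.get(c, 0.0) + wr_i * y_i
            den_rep[c] = den_rep.get(c, 0.0) + wr_i
    total = 0.0
    for y_i, miss, c, wr_i in zip(y, is_imputed, cell, w_rep):
        if miss:
            if den_rep.get(c, 0.0) != 0.0:
                shift = num_rep[c] / den_rep[c] - num_full[c] / den_full[c]
            else:
                shift = 0.0  # assumption: no adjustment without replicate donors
            total += wr_i * (y_i + shift)
        else:
            total += wr_i * y_i
    return total


# Two respondents (2 and 4) and one imputed unit (donor value 4) in one cell.
y = [2.0, 4.0, 4.0]
is_imputed = [False, False, True]
cell = ["c", "c", "c"]
print(adjusted_replicate_total(y, is_imputed, cell, [1.0, 1.0, 1.0], [0.0, 1.5, 1.5]))
# prints 13.5
```

In the toy call the replicate drops the first respondent, so the cell mean moves from 3 to 4 and the imputed value is shifted from 4 to 5 before weighting: 1.5 × 4 + 1.5 × 5 = 13.5. With the full-sample weights as replicate weights the shift is zero and the function returns the ordinary imputed total.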
Description of a Monte Carlo simulation study on real survey data
• The population of the Italian region of Lazio (excluding the province of Rome), with 1,372,572 units, has been considered
• 250 samples have been selected according to the Italian Labour Force sampling design:
  – The municipalities of each province are ordered by population size, and strata of municipalities with population size equal to a given threshold are formed. Strata with only one municipality are referred to as self-representing (S-R) strata (7 strata);
  – In each S-R stratum a sample of households (PSUs) is selected (stratified cluster design);
  – In each non-self-representing (NS-R) stratum a pps sample of 2 municipalities (PSUs) is drawn, and a sample of households is then selected (two-stage stratified design);
• Many NS-R strata have a non-negligible PSU sampling fraction

Frequency of NS-R strata by PSU sampling fraction
PSU sampling fraction   < 20%   20%-40%   40%-60%   > 60%   Total
Frequency               8       5         2         3       18

• The parameter of interest is the total of the variable employment (employed/not employed)
• A Missing At Random (MAR) mechanism has been simulated using 8 different missing rates, depending on the values of 2 covariates: X1 (levels 1, 2, 3, 4), the household type; and a domain indicator of whether the unit belongs to an S-R or an NS-R stratum.

Missing rate for the simulated nonresponse mechanism
        X1 = 1   X1 = 2   X1 = 3   X1 = 4
NS-R    10%      20%      30%      40%
S-R     40%      30%      20%      10%

• The number of PSUs (municipalities + households) is 552
• The HD method is applied with imputation cells defined by the two covariates above
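The simulated nonresponse mechanism can be sketched as an independent Bernoulli draw per unit, with the missing rate read from the table above. The function name, the `MISSING_RATE` dictionary layout and the independence of the draws across units are assumptions for illustration.

```python
import random

# Missing rates by (domain, X1), copied from the table of the simulated mechanism.
MISSING_RATE = {
    ("NS-R", 1): 0.10, ("NS-R", 2): 0.20, ("NS-R", 3): 0.30, ("NS-R", 4): 0.40,
    ("S-R", 1): 0.40, ("S-R", 2): 0.30, ("S-R", 3): 0.20, ("S-R", 4): 0.10,
}


def simulate_nonresponse(x1, domain, rng):
    """Return one flag per unit: True if the employment value is set to missing."""
    return [rng.random() < MISSING_RATE[(d, x)] for x, d in zip(x1, domain)]


rng = random.Random(2008)
flags = simulate_nonresponse([4] * 20000, ["NS-R"] * 20000, rng)
print(sum(flags) / len(flags))  # close to 0.40
```

The response probability depends only on the two covariates, not on employment itself, which is what makes the mechanism MAR given the imputation cells.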
Results: Relative Bias (RB) and Relative Root Mean Square Error (RRMSE) of EDAGJK by number of random groups (RG)

               Without missing data   With imputed data
Number of RG   RB      RRMSE          RB      RRMSE
5              0.07    0.83           0.07    0.83
15             0.09    0.50           0.11    0.57
30             0.09    0.44           0.11    0.48
50             0.08    0.38           0.10    0.43
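The slides do not spell out how RB and RRMSE are computed. A common choice, assumed here, compares the variance estimates from the Monte Carlo samples with a reference ("true") variance, for instance the empirical variance of the point estimates over the replications. The function names are hypothetical.

```python
import math


def relative_bias(variance_estimates, true_variance):
    """Assumed definition: (mean of the estimates - true value) / true value."""
    mean_estimate = sum(variance_estimates) / len(variance_estimates)
    return (mean_estimate - true_variance) / true_variance


def relative_rmse(variance_estimates, true_variance):
    """Assumed definition: root mean square error divided by the true value."""
    mse = sum((v - true_variance) ** 2 for v in variance_estimates) / len(variance_estimates)
    return math.sqrt(mse) / true_variance


estimates = [8.0, 12.0]   # toy variance estimates from two simulated samples
print(relative_bias(estimates, 10.0), relative_rmse(estimates, 10.0))
```

In the toy call the estimates are centred on the reference value, so the relative bias is zero while the RRMSE is 2/10 = 0.2.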
Results: Boxplots of the variance estimates of the methods EDAGJK, DAGJK, STANDARD FORMULA and JACKKNIFE (figure not reproduced)
Results: Confidence intervals of the methods - 95% CI coverage and CI relative length (RL)

                   Without missing data      With imputed data
Method             95% CI coverage   CI RL   95% CI coverage   CI RL
EDAGJK - 30 RG     90.5              18.7    92.5              23.1
DAGJK - 30 RG      97.5              24.5    98.0              29.8
STANDARD FORMULA   91.5              18.8    88.0              19.5
JACKKNIFE          97.5              24.9    98.5              30.8
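Coverage and relative length can be computed along the following lines. The definitions are assumptions, since the slides do not state them: intervals are taken as the point estimate plus or minus 1.96 standard errors, coverage is the percentage of intervals containing the true total, and relative length is the average interval length as a percentage of the true total. The function name `ci_summary` is hypothetical.

```python
import math


def ci_summary(estimates, variance_estimates, true_total, z=1.96):
    """Return (coverage %, mean relative length %) of normal-theory intervals."""
    covered = 0
    relative_lengths = []
    for est, var in zip(estimates, variance_estimates):
        half = z * math.sqrt(var)                # half-width of the interval
        if est - half <= true_total <= est + half:
            covered += 1
        relative_lengths.append(2 * half / true_total)
    n = len(estimates)
    return 100 * covered / n, 100 * sum(relative_lengths) / n


coverage, rel_length = ci_summary([100.0, 110.0], [25.0, 4.0], 105.0)
print(coverage, rel_length)
```

In the toy call the first interval (90.2 to 109.8) contains the true total 105 and the second (106.08 to 113.92) does not, so the coverage is 50%.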
Conclusion
• Variance estimation that takes imputed data into account is a pressing goal in Official Statistics
• The proposed approach, based on EDAGJK with the Rao and Shao adjustment, appears to perform well in terms of precision of the variance estimates while remaining computationally feasible
• The empirical results suggest that the approach is suitable for the complex designs commonly used by National Statistical Institutes
• Further analyses are needed to take a finite population correction factor into account in the variance estimator
• Finally, an empirical study with calibration estimators is needed
References
• Brick, J.M., Jones, M.E., Kalton, G., Valliant, R. (2005). Variance estimation with hot deck imputation: a simulation study of three methods. Survey Methodology, 31, 151-159.
• Kott, P.S. (2001). The delete-a-group jackknife. Journal of Official Statistics, 17, 521-526.
• Kott, P.S. (2006). Delete-a-group variance estimation for the general regression estimator under Poisson sampling. Journal of Official Statistics, 22, 759-767.
• Lee, H., Rancourt, E., Särndal, C.-E. (1995). Jackknife variance for data with imputed values. Proceedings of the Statistical Society of Canada Survey Methods Section, 111-115.
• Rao, J.N.K., Shao, J. (1992). Jackknife variance estimation with survey data under hot deck imputation. Biometrika, 79, 811-822.
• Rust, K. (1985). Variance estimation for complex estimators in sample surveys. Journal of Official Statistics, 1, 381-397.
• Shao, J., Steel, P. (1999). Variance estimation for survey data with composite estimation and nonnegligible sampling fractions. Journal of the American Statistical Association, 94, 254-265.
• Wolter, K.M. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.