Modelling Measurement Error in Administrative and Survey Variables
Sander Scholtus, Bart Bakker, Arnout van Delden (s.scholtus@cbs.nl)

Outline
– Introduction
– Modelling measurement error
  ‐ structural equation models
  ‐ identification by means of an audit sample
– Application
  ‐ VAT data for Dutch quarterly turnover statistics
– Summary / discussion

Introduction
– Quality of administrative data for statistical purposes
  ‐ coverage of target population, timeliness, …
  ‐ measurement issues
– Administrative data: possible conceptual differences
– Compare admin. data to survey data
  ‐ previous presentation: survey data = gold standard
  ‐ current presentation: measurement errors in both sources

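Modelling measurement error
– Basic approach:
  ‐ link administrative data to survey data
  ‐ allow for measurement errors in both sources
  ‐ fit a structural equation model (SEM)
  ‐ latent variables represent “true” concepts
  ‐ standardised factor loadings reflect validity of measurement
– Measurement equations:
  $y_1 = \tau_1 + \lambda_{11} \eta_1 + \varepsilon_1$ (admin)
  $y_2 = \tau_2 + \lambda_{21} \eta_1 + \varepsilon_2$ (survey)
[Path diagram: the latent “true” variable $\eta_1$ with observed variables $y_1$ (admin) and $y_2$ (survey), loadings $\lambda_{11}$, $\lambda_{21}$, intercepts $\tau_1$, $\tau_2$ (paths from the constant 1) and errors $\varepsilon_1$, $\varepsilon_2$. Legend: latent, observed, constant.]
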
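The link between the measurement equations above and "validity" can be made concrete with a small simulation. The sketch below is illustrative only: all parameter values are hypothetical, and it treats the latent variable as known, which is never the case in practice (the SEM estimates the loadings from the observed indicators). It shows that the standardised loading $\lambda \sqrt{\mathrm{Var}(\eta)} / \sqrt{\lambda^2 \mathrm{Var}(\eta) + \mathrm{Var}(\varepsilon)}$ equals the correlation between an indicator and the latent variable it measures.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standardised_loading(lam, var_eta, var_eps):
    """Standardised loading implied by y = tau + lam * eta + eps."""
    return lam * math.sqrt(var_eta) / math.sqrt(lam ** 2 * var_eta + var_eps)


rng = random.Random(1)
n = 50_000

# Hypothetical "true" values (latent variable eta_1), sd = 2.
eta = [rng.gauss(10.0, 2.0) for _ in range(n)]

# Hypothetical admin indicator: tau = -0.5, lambda = 0.8, error sd = 1.0.
y1 = [-0.5 + 0.8 * e + rng.gauss(0.0, 1.0) for e in eta]
# Hypothetical survey indicator: tau = 0.2, lambda = 1.0, error sd = 0.5.
y2 = [0.2 + 1.0 * e + rng.gauss(0.0, 0.5) for e in eta]

theory_admin = standardised_loading(0.8, 4.0, 1.0)
theory_survey = standardised_loading(1.0, 4.0, 0.25)
empirical_admin = pearson(y1, eta)
empirical_survey = pearson(y2, eta)

print(f"admin : theory {theory_admin:.3f}, simulated {empirical_admin:.3f}")
print(f"survey: theory {theory_survey:.3f}, simulated {empirical_survey:.3f}")
```

In this made-up setting the survey indicator has the higher validity, simply because its error variance was chosen to be smaller relative to the signal; the intercepts play no role in the standardised loading.
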
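Modelling measurement error
– Complications: model identification ‘requires’
  ‐ multiple (≥ 3) related concepts
  ‐ multiple (≥ 2) observed variables for each concept
  ‐ choice of a metric for each latent variable (for evaluating bias)
[Path diagram: example SEM with latent variables $\xi_1$, $\eta_1$, $\eta_2$; observed variables $x_1$, $x_2$ (errors $\delta_1$, $\delta_2$) and $y_1$, …, $y_4$ (errors $\varepsilon_1$, …, $\varepsilon_4$); disturbances $\zeta_1$, $\zeta_2$. Legend: latent, observed, constant.]
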
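Modelling measurement error
– Standard identification solutions yield ‘arbitrary’ metrics:
  ‐ reference indicators [e.g., $\tau_1 = 0$ and $\lambda_{11} = 1$]
  ‐ standardise latent variables [$E(\eta_1) = 0$ and $\mathrm{Var}(\eta_1) = 1$]
– Alternative solution: calibration
  ‐ collect additional gold standard data for a random subsample (audit sample / verification study)
  ‐ simulation results suggest: audit sample of 50 units is sufficient
– Measurement equations with an audit variable:
  $y_1 = \tau_1 + \lambda_{11} \eta_1 + \varepsilon_1$ (admin)
  $y_2 = \tau_2 + \lambda_{21} \eta_1 + \varepsilon_2$ (survey)
  $y_3 = \eta_1$ (audit)
[Path diagram: the latent “true” variable $\eta_1$ with observed variables $y_1$ (admin), $y_2$ (survey) and $y_3$ (audit); the audit variable has loading 1 and no intercept or error term.]
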
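The calibration idea can be sketched in a few lines. Because the audit variable satisfies $y_3 = \eta_1$, the audited units reveal the true metric, so the intercept and slope of the administrative indicator become estimable. The example below is a deliberate simplification and not the estimation procedure of the presentation: it uses an ordinary least-squares fit on the audited units only, whereas the SEM uses all units and all indicators jointly. The population size, parameter values and error variance are hypothetical; only the audit sample size of 50 is taken from the slide.

```python
import random


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


rng = random.Random(42)

# Hypothetical parameters of the admin measurement equation.
TAU1, LAMBDA11, ERROR_SD = 0.5, 0.9, 0.3
POPULATION_SIZE, AUDIT_SIZE = 2000, 50

# Hypothetical true values and the admin indicator observed for every unit.
eta = [rng.gauss(0.0, 1.0) for _ in range(POPULATION_SIZE)]
y1 = [TAU1 + LAMBDA11 * e + rng.gauss(0.0, ERROR_SD) for e in eta]

# Random audit sample: for these units only, the gold standard y3 = eta is observed.
audit_ids = rng.sample(range(POPULATION_SIZE), AUDIT_SIZE)
y3_audit = [eta[i] for i in audit_ids]
y1_audit = [y1[i] for i in audit_ids]

# Regressing the admin indicator on the audit variable recovers tau_1 and lambda_11
# in the metric of the true variable, up to sampling error.
tau1_hat, lambda11_hat = ols_fit(y3_audit, y1_audit)
print(f"tau_1: true {TAU1:.2f}, estimated {tau1_hat:.2f}")
print(f"lambda_11: true {LAMBDA11:.2f}, estimated {lambda11_hat:.2f}")
```

Without the audit variable, the same data would fit equally well under any linear rescaling of $\eta_1$, which is why the reference-indicator and standardisation conventions give metrics that cannot be used to evaluate bias.
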
Application: VAT data
– Dutch quarterly turnover statistics
– Main question: VAT turnover fit for use?
  ‐ base cells in car trade and transport sector
  ‐ tax regulations exist, previous analysis inconclusive
  ‐ large and complex units excluded
– Sources of data:
  ‐ Business Register (BR)
  ‐ Profit Declarations (PD; admin. source)
  ‐ VAT data (admin. source)
  ‐ Structural Business Statistics (SBS; sample survey)
  ‐ Audit sample: re-edited SBS data (50 units per base cell)

Application: VAT data
– Model: (SBS data removed to avoid multicollinearity with audit data)
[Path diagram: latent variables No. Empl., Tot. Costs, Turnover and Purchase, with observed indicators from the BR, PD, VAT, SBS and audit sources.]

Application: VAT data
– Model estimation
  ‐ used Pseudo Maximum Likelihood to account for
    • complex survey design (SBS + audit sample)
    • skewness of the data
  ‐ examined data transformations:
    • variables on original scale
    • variables divided by number of legal units (heteroscedasticity)
  ‐ used R packages lavaan and lavaan.survey

Application: VAT data
Results for NACE 45112 (“Sale/repair of passenger cars”)
– Robust (PML) fit measures: χ² = 66 (df = 47, p = 0.03); CFI = 0.998; TLI = 0.999; RMSEA = 0.032
[Path diagram: fitted model with estimated intercepts and loadings for the BR, PD, VAT and audit indicators of No. Empl., Tot. Costs, Turnover and Purchase; the individual estimates cannot be matched to their paths here.]

Application: VAT data
– Result from SEM on previous slide:
  Turnover(VAT) = -0.02 + 0.80 × Turnover(true) + ε
– Derive a correction formula through a second SEM:
  Turnover(true) = 0.18 + 1.13 × Turnover(VAT) + ζ (R² = 0.90)
  Values printed beneath the formula: (σ = 0.08), (σ = 0.06)
[Path diagram: second SEM relating Turnover to VAT with intercept α, slope β and disturbance ζ; PD indicator with λ* = 1.03, τ* = -0.01, θ* = 0.06 and error ε.]

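A short numerical sketch may help to see why the correction formula is not simply the measurement equation solved for the true value. The coefficients below are the ones reported on the slide; the function names and the input values are hypothetical, and the inputs are assumed to be on whatever scale the fitted model used. Solving the measurement equation algebraically gives a slope of 1 / 0.80 = 1.25, whereas the regression of true on observed turnover has slope 1.13. For a simple linear regression the product of the two slopes equals the squared correlation, and 1.13 × 0.80 ≈ 0.90 is indeed close to the reported R².

```python
# Coefficients reported on the slide.
TAU_VAT, LAMBDA_VAT = -0.02, 0.80  # Turnover(VAT) = tau + lambda * Turnover(true) + eps
ALPHA, BETA = 0.18, 1.13           # Turnover(true) = alpha + beta * Turnover(VAT) + zeta


def corrected_turnover(vat):
    """Predicted true turnover from the correction formula (second SEM)."""
    return ALPHA + BETA * vat


def naive_inverse(vat):
    """Algebraic inversion of the measurement equation, ignoring the error term."""
    return (vat - TAU_VAT) / LAMBDA_VAT


# Hypothetical VAT turnover values, on the scale used in the model.
for vat in (0.5, 1.0, 2.0):
    print(f"VAT {vat:.2f}: correction formula {corrected_turnover(vat):.3f}, "
          f"naive inverse {naive_inverse(vat):.3f}")

# Consistency check: product of the two slopes versus the reported R-squared of 0.90.
slope_product = BETA * LAMBDA_VAT
print(f"beta * lambda = {slope_product:.3f}")
```

The naive inverse overcorrects because part of the variation in VAT turnover is measurement error; the regression-based formula shrinks the slope accordingly.
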
Summary / discussion
– Can assess validity and bias of admin. data with SEMs
– Advantages over direct comparison to survey data:
  ‐ allow for measurement errors in all sources
  ‐ objective evaluation of measurement quality
– Possible disadvantages:
  ‐ need multiple related concepts
  ‐ need an audit sample to identify bias
– Suggestion: apply a multi-stage approach
  1. Make a direct comparison to survey data (linear regression)
  2. If inconclusive, determine validity with SEM approach
  3. If validity high, collect audit sample to estimate bias as well