The Use of Uncertainty to Choose the Matching Variables in Statistical Matching

Marcello D’Orazio* (madorazi@istat.it)
Marco Di Zio* (dizio@istat.it)
Mauro Scanu* (scanu@istat.it)
*Italian National Institute of Statistics (Istat)

NTTS 2015 conference, Brussels, 10-12 March 2015
Statistical Matching (data fusion or synthetic matching)

A series of statistical methods for integrating two data sources (usually samples) referring to the same target population.

Objective: study the relationship between variables not jointly observed in a single data source.

[Diagram: source A contains Y and X; source B contains X and Z. The X variables are common to both sources; Y and Z are NOT jointly observed.]

Uncertainty to choose the matching variables, M. D’Orazio, M. Di Zio, M. Scanu – NTTS 2015, Brussels
Objectives of Statistical Matching

micro: derive a “synthetic” data set with X, Y and Z; for instance:
• A filled in with Z
• the concatenation of A and B, with Z filled in A and Y filled in B (file concatenation)

macro: estimation of parameters describing the relationship between Y and Z; for instance:
• a correlation coefficient
• a regression coefficient
• a contingency table

Various methods are available, depending on the objective (micro or macro) and on the framework (parametric, nonparametric or mixed).
Matching Variables

A and B may share many common variables X, but NOT all the X variables will be used. It is necessary to select just the most relevant Xs, called matching variables, i.e. the subset of the Xs connected, at the same time, with Y and Z.

Many methods can be applied to identify the best predictors of Y and the best predictors of Z. They imply separate analyses on A and B.

Proposal: perform a single analysis for choosing the matching variables, by searching for the set of common variables that is most effective in reducing the uncertainty on the relationship between Y and Z.

Uncertainty is due to lack of information: Y and Z are NOT jointly observed.
Uncertainty bounds

Focus: the X, Y and Z variables are all categorical.

Objective of SM: estimation of the probabilities

$$p_{jk} = \Pr(Y = j, Z = k), \qquad j = 1, \ldots, J; \quad k = 1, \ldots, K$$

In this case the uncertainty set can be computed by resorting to the Fréchet bounds. By conditioning on the X variables, it is possible to conclude that the probability $p_{jk}$ will lie in the interval

$$\left[ p_{jk}^{(low)},\; p_{jk}^{(up)} \right]$$

with

$$p_{jk}^{(low)} = \sum_{h} p_{h} \max\left\{ 0,\; p_{j|h} + p_{k|h} - 1 \right\}, \qquad p_{jk}^{(up)} = \sum_{h} p_{h} \min\left\{ p_{j|h},\; p_{k|h} \right\}$$

where $h$ indexes the categories of X, $p_{h} = \Pr(X = h)$, $p_{j|h} = \Pr(Y = j \mid X = h)$ and $p_{k|h} = \Pr(Z = k \mid X = h)$.
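As an illustration of the conditional Fréchet bounds, the following is a minimal Python sketch, not the authors' R code. The function name `frechet_bounds`, the array layout and the toy probabilities are all assumptions made for this example.

```python
import numpy as np

def frechet_bounds(p_x, p_y_given_x, p_z_given_x):
    """Conditional Frechet bounds for p_jk = Pr(Y=j, Z=k).

    p_x:         shape (H,),   Pr(X=h)
    p_y_given_x: shape (H, J), Pr(Y=j | X=h)
    p_z_given_x: shape (H, K), Pr(Z=k | X=h)
    Returns two (J, K) arrays: lower and upper bounds.
    """
    py = p_y_given_x[:, :, None]   # (H, J, 1)
    pz = p_z_given_x[:, None, :]   # (H, 1, K)
    w = p_x[:, None, None]         # (H, 1, 1)
    # Bounds within each category h of X, then averaged with weights Pr(X=h)
    low = (w * np.maximum(0.0, py + pz - 1.0)).sum(axis=0)
    up = (w * np.minimum(py, pz)).sum(axis=0)
    return low, up

# Toy (made-up) distributions: one binary X, binary Y and Z
p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1], [0.2, 0.8]])
p_z_given_x = np.array([[0.8, 0.2], [0.3, 0.7]])

low, up = frechet_bounds(p_x, p_y_given_x, p_z_given_x)
print(np.round(low, 2))
print(np.round(up, 2))
```

With these toy values, Pr(Y=1st category, Z=1st category) is bounded between 0.35 and 0.50 once X is used, whereas the unconditional Fréchet bounds computed from the margins alone (0.55 and 0.55) would only give the interval from 0.10 to 0.55.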
Proposed method for choosing the matching variables

Step 0) Order the Xs according to their ability to minimize the average width of the estimated uncertainty intervals:

$$d = \frac{1}{J \times K} \sum_{j,k} \left( \hat{p}_{jk}^{(up)} - \hat{p}_{jk}^{(low)} \right)$$

where $\hat{p}_{jk}^{(low)}$ and $\hat{p}_{jk}^{(up)}$ are the estimated lower and upper Fréchet bounds for $p_{jk}$. The X with the smallest d is the starting variable.

Step 1) Evaluate d for all the possible combinations of the starting variable(s) with each of the remaining ones, ordered as in step (0).

Step 2) Select the combination of variables that yields the largest decrease in d and go back to step (1).

The method was tested with artificial data.
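The stepwise procedure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' R implementation: the function names `avg_width` and `select_matching_variables` are hypothetical, Pr(X=h) is estimated from the pooled samples, cells of X not observed in both samples are dropped and the weights renormalized, and the demo data are simulated.

```python
import numpy as np
import pandas as pd

def avg_width(A, B, xs, y="Y", z="Z"):
    """Average width d of the estimated Frechet intervals given the variables xs."""
    # One string key per record identifies its cell h of the crossed Xs
    key_a = A[xs].astype(str).apply("|".join, axis=1)
    key_b = B[xs].astype(str).apply("|".join, axis=1)
    p_y = pd.crosstab(key_a, A[y], normalize="index")   # Pr(Y=j | X=h) from A
    p_z = pd.crosstab(key_b, B[z], normalize="index")   # Pr(Z=k | X=h) from B
    p_h = pd.concat([key_a, key_b]).value_counts(normalize=True)
    # Assumption: keep only cells seen in both samples, renormalize the weights
    cells = p_y.index.intersection(p_z.index)
    w = p_h.loc[cells].to_numpy()
    w = (w / w.sum())[:, None, None]
    py = p_y.loc[cells].to_numpy()[:, :, None]
    pz = p_z.loc[cells].to_numpy()[:, None, :]
    low = (w * np.maximum(0.0, py + pz - 1.0)).sum(axis=0)
    up = (w * np.minimum(py, pz)).sum(axis=0)
    return float((up - low).mean())

def select_matching_variables(A, B, x_vars, y="Y", z="Z"):
    # Step 0: order the Xs by the d each one achieves alone
    order = sorted(x_vars, key=lambda x: avg_width(A, B, [x], y, z))
    chosen, remaining = [order[0]], order[1:]
    path = [(tuple(chosen), avg_width(A, B, chosen, y, z))]
    while remaining:
        # Step 1: d for the current set plus each remaining variable
        scores = [(avg_width(A, B, chosen + [x], y, z), x) for x in remaining]
        # Step 2: keep the addition with the smallest d (ties: step-0 order)
        d_best, x_best = min(scores, key=lambda t: t[0])
        chosen.append(x_best)
        remaining.remove(x_best)
        path.append((tuple(chosen), d_best))
    return path

# Simulated demo: X1 drives both Y and Z, while X2 and X3 are pure noise
rng = np.random.default_rng(0)

def make_sample(n, target):
    df = pd.DataFrame(rng.integers(0, 2, size=(n, 3)), columns=["X1", "X2", "X3"])
    flip = rng.random(n) < 0.1
    df[target] = np.where(flip, 1 - df["X1"], df["X1"])
    return df

A, B = make_sample(2000, "Y"), make_sample(2000, "Z")
path = select_matching_variables(A, B, ["X1", "X2", "X3"])
for xs, d in path:
    print("*".join(xs), round(d, 4))
```

Like the procedure on the slide, this sketch has no stopping rule: it returns the whole path of nested combinations with their d values and leaves the choice of where to stop to the analyst.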
The data (1)

Bayesian networks are used to generate two artificial samples sharing 3 binary Xs with the following association structure:

[Figure: three panels showing the true association structure, the association structure in A, and the association structure in B]

Output of the procedure:

X variables   No. of Xs   d
X1            1           0.1703
X1*X3         2           0.1703
X1*X3*X2      3           0.1699
The data (2)

Artificial data resembling EU-SILC: two artificial samples and 7 common variables.

Output of the procedure:

No. of Xs   d        Best   Combination of X variables
1           0.0878   Yes    c.age
2           0.0781   Yes    c.age*sex
3           0.0714   Yes    c.age*sex*edu7
4           0.0608   No     c.age*sex*edu7*area5
5           0.0411   No     c.age*sex*edu7*area5*hsize5
6           0.0225   Yes    c.age*sex*edu7*area5*hsize5*urb
7           0.0162   Yes    c.age*edu7*marital*sex*hsize5*area5*urb

The combinations found with 4 and 5 Xs are not the best ones, but they are very close to optimal.
Conclusions

Pros:
- avoids separate analyses
- is able to find the best solutions or solutions close to them
- is fully automatic, with code written in R and related to the package StatMatch (D’Orazio, 2015)

Cons:
- dependence on the initial ordering of the variables
- absence of a stopping rule: by increasing the number of Xs the uncertainty always decreases, but the tables become very sparse
Essential References

D’Orazio, M., Di Zio, M., and Scanu, M. (2006) Statistical Matching, Theory and Practice. Wiley, New York.

D’Orazio, M. (2015) “StatMatch: Statistical Matching”, R package version 1.2.3. http://CRAN.R-project.org/package=StatMatch