New proposal for linkage error estimation Tiziana Tuoto Istat - PowerPoint PPT Presentation

New proposal for linkage error estimation Tiziana Tuoto Istat – Italian National Statistical Institute Joint work with: Niki Stylianidou Brussels, 11th March 2015

Outline 1. Motivations: linkage errors and their effects 2. Drawbacks of current solutions for evaluating linkage quality 3. A new proposal 4. Concluding remarks and future works Tuoto "New proposals for linkage error estimation" Brussels, 11 March 2015

Motivations Why we are still studying linkage errors in order to:  provide necessary information for proper statistical analyses on linked data  Increase the level of automatism of integration process → Move from black box towards toolbox Tuoto "New proposals for linkage error estimation" Brussels, 11 March 2015

Solutions for linkage quality Approaches based Latent Mixture Model:  Fellegi and Sunter (1969) The procedure is relaible in identifying matches, not so accurate in assensing errors. A preciuos outcome of the F-S procedure is the posterior probability of being match for couple (a,b)   [ ( , ) ] [ ] P a b M P M   _ ( , | ) post prob a b      [ ( , ) ] [ ] [ ( , ) ] [ ] P a b M P M P a b U P U  Belin and Rubin (1995) After the F-S procedure, their suggestion assumes tranformation of the Matches and Unmatches distributions in order to normalize them. They only achieve estimate of false match rate. Actually at the end, normality is not guaranteed. Tuoto "New proposals for linkage error estimation" Brussels, 11 March 2015

Scenario at the end of Fellegi-Sunter application... Tuoto "New proposals for linkage error estimation" Brussels, 11 March 2015

A new proposal for evaluating matching errorrs Main idea: enriching F-S results via auxiliary info in a second step The Fellegi and Sunter probabilistic record linkage provides estimates of the ► probability of being correctly matched – the post_prob – applying log-linear model on the most informative variables while other variables are not included for modeling issues due to their (relative) small power to distinguish between matches and non-matches (as sex, marital status, … few categories uniformly distributed ) Those variables could be exploited in a second step in order enrich the F-S ► results particularly to measure matching errors Tuoto "New proposals for linkage error estimation" Brussels, 11 March 2015

The proposal: second step From matches of the F-S procedure, sampling a training set for which the true ► matching status is assigned and modeling such status via the post_prob and the comparisons on other variables not included in the F-S model 𝑧 (𝑏,𝑐) ~𝑚𝑝𝑕𝑗𝑢 𝑞𝑝𝑡𝑢_𝑞𝑠𝑝𝑐 (𝑏,𝑐) + 𝐽 𝑌 2(𝑏,𝑐) + 𝐽 𝑌 3(𝑏,𝑐) + ⋯ 𝑢𝑡 The model is applied on the full data ► 𝑞𝑝𝑡𝑢_𝑞𝑠𝑝𝑐 (𝑏,𝑐) + 𝐽 𝑌 2(𝑏,𝑐) + 𝐽 𝑌 3(𝑏,𝑐) + ⋯ 𝑧 (𝑏,𝑐) ~ 𝛾 1 𝛾 2 𝛾 3 𝑢𝑡 𝑢𝑡 𝑢𝑡 Matches are assigned on the basis of 𝑧 (𝑏,𝑐) ► If 𝑧 (𝑏,𝑐) > 0.5 then the pair is matched Errors are estimated on the basis of 𝑧 (𝑏,𝑐) ► false matches (1 − 𝑧 (𝑏,𝑐) ) 𝑧 𝑏,𝑐 >0 missing matches 𝑧 (𝑏,𝑐) 𝑏,𝑐 <0 𝑧 Tuoto "New proposals for linkage error estimation" Brussels, 11 March 2015

Some results on fictitious data First Step:  Full data: 1500 units from fictitious population census data and administrative register, the true matching status is known (ESSnet DI, 2011)  Probabilistic record linkage, according Fellegi and Sunter theory, using the most informative variables (name, surname, day and year of birth), other variables remain out of the F-S model (sex, month of birth, postal code...) Results of the F-S procedure on the basis of post_prob >0.5: Tuoto "New proposals for linkage error estimation" Brussels, 11 March 2015

Some results on fictitious data Second Step:  Training set randomly sampled from the F-S results (size 10%)  Logistic model selection to predict the true matching status using post_prob plus other variables remain out of the F-S model: sex, month of birth, postal code Some notes on second step: 1. the best model (in terms of AIC, area under the ROC curve, entropy and 0-1 loss functions) on the TS and the one on full data (if true linkage status were known) could not be the same: this seems to not compromise the prediction power of the selected model 2. from first evidence, the logistic model should include as explanatory variable the post_prob Tuoto "New proposals for linkage error estimation" Brussels, 11 March 2015

Some results Tuoto "New proposals for linkage error estimation" Brussels, 11 March 2015

Concluding remarks Proposal: two-step procedure in order to assign linkage status and estimate linkage errors as well ► Step 1. Fellegi-Sunter procedure is applied on the most strong variables ► Step 2. on the F-S results, a training set is sampled, the true matching status is modeled on the basis of the post_prob and still not-exploited comparison variables Advantages: all available comparison variables are exploited Advantages: Estimates in the same time false and missing matches Further works: ► 1. analyze the impact of training set selection ► 2. analyze the impact of model selection on the TS and the projection on the full data Tuoto "New proposals for linkage error estimation" Brussels, 11 March 2015

References • Belin T.R., Rubin D.B., A Method for Calibrating False-Match Rates in Record Linkage, Journal of American Statistical Association, June 1995, vol.90, no 430, pp.81-94. • Fellegi, I.P., Sunter, A.B. (1969), A Theory for Record Linkage , Journal of the American Statistical Association, 64, pp. 1183-1210. • Essnet DI - McLeod, Heasman and Forbes, (2011) Simulated data for the on the job training, http://www.cros-portal.eu/content/job-training Tuoto "New proposals for linkage error estimation" Brussels, 11 March 2015

New proposal for linkage error estimation Tiziana Tuoto Istat - PowerPoint PPT Presentation

New proposal for linkage error estimation Tiziana Tuoto Istat Italian National Statistical Institute Joint work with: Niki Stylianidou Brussels, 11th March 2015 Outline 1. Motivations: linkage errors and their effects 2. Drawbacks of

Modeling Offsets and Linkage in a Modeling Offsets and Linkage in a Modeling Offsets and Linkage

Linkage Disequilibrium Linkage Disequilibrium Linkage Equilibrium Consider two linked loci Locus

Building the Linkage Tree (LT) in LTGA 1. Start with singleton linkage sets Thierens, D. (2010).

Linkage graphs and what they look like Stephen Kell Stephen.Kell@cl.cam.ac.uk Linkage graphs. .

What is data (or record) linkage? Recent interest in data linkage The process of linking and

Record Linkage Record Linkage Craig Knoblock University of Southern California These slides are

Genealogical Record Linkage: Features for Automated Person Matching Randy Wilson

Using Structured Neural Networks for Record Linkage Burdette Pixton Christophe Giraud-Carrier

Data linkage in Victoria 7 August 2017 Sharon Williams, Manager, Centre for Victorian Data

Performing linkage analysis using MERLIN David Duffy Queensland Institute of Medical Research

The Machinery of Parametric Linkage Analysis David Duffy Queensland Institute of Medical

Privacy Preserving Record Linkage Linkage Elizabeth Ashley Durham Health Information Privacy

Contrastive Entity Linkage : Mining Variational Attributes from Large Catalogs for Entity Linkage

Option for Hybrid members to opt out of DB Salary Linkage Employee Guide PUBLIC Option for

January 2017 Data Linkage: An Overview Natalie Shlomo University of Manchester 1

Writing Your CAREER Proposal Writing Your CAREER Proposal 2016 NSF/CMMI CAREER Proposal Writing

with schizophrenia. Arkhipov Andrei Y. Kudryashov P. Nurbekov M. Strelets V.B. Maslennikova A.

Compaa de Transporte de Energa Elctrica en Alta Tensin TRANSENER S.A. October 2019

PRESENTATION OF DAVID M. GROGAN ON HOT TOPICS IN BANKRUPTCY AND CREDITORS RIGHTS TO CHARLOTTE

Su Surv rvey, ey, Sea Searc rch & Seizu h & Seizure re un under r Inc ncom ome

Progress towards graduation Session 1 Roland Mollerus Secretary Committee for Development Policy

This is Tomorrow Just what is it that made yesterdays homes so different, so

Introduction why psychoanalytic? why a training in psychoanalytic psychotherapy?

A Rasch Analysis of Math Tests using M.In.E.R.Va. Platform Grazia Messineo - Salvatore Vassallo

Explore More Topics

Sambuz

Useful Links

Newsletter

Mail Us

New proposal for linkage error estimation Tiziana Tuoto Istat - PowerPoint PPT Presentation

New proposal for linkage error estimation Tiziana Tuoto Istat Italian National Statistical Institute Joint work with: Niki Stylianidou Brussels, 11th March 2015 Outline 1. Motivations: linkage errors and their effects 2. Drawbacks of

Modeling Offsets and Linkage in a Modeling Offsets and Linkage in a Modeling Offsets and Linkage

Linkage Disequilibrium Linkage Disequilibrium Linkage Equilibrium Consider two linked loci Locus

Building the Linkage Tree (LT) in LTGA 1. Start with singleton linkage sets Thierens, D. (2010).

Linkage graphs and what they look like Stephen Kell Stephen.Kell@cl.cam.ac.uk Linkage graphs. .

What is data (or record) linkage? Recent interest in data linkage The process of linking and

Record Linkage Record Linkage Craig Knoblock University of Southern California These slides are

Genealogical Record Linkage: Features for Automated Person Matching Randy Wilson

Using Structured Neural Networks for Record Linkage Burdette Pixton Christophe Giraud-Carrier

Data linkage in Victoria 7 August 2017 Sharon Williams, Manager, Centre for Victorian Data

Performing linkage analysis using MERLIN David Duffy Queensland Institute of Medical Research

The Machinery of Parametric Linkage Analysis David Duffy Queensland Institute of Medical

Privacy Preserving Record Linkage Linkage Elizabeth Ashley Durham Health Information Privacy

Contrastive Entity Linkage : Mining Variational Attributes from Large Catalogs for Entity Linkage

Option for Hybrid members to opt out of DB Salary Linkage Employee Guide PUBLIC Option for

January 2017 Data Linkage: An Overview Natalie Shlomo University of Manchester 1

Writing Your CAREER Proposal Writing Your CAREER Proposal 2016 NSF/CMMI CAREER Proposal Writing

with schizophrenia. Arkhipov Andrei Y. Kudryashov P. Nurbekov M. Strelets V.B. Maslennikova A.

Compaa de Transporte de Energa Elctrica en Alta Tensin TRANSENER S.A. October 2019

PRESENTATION OF DAVID M. GROGAN ON HOT TOPICS IN BANKRUPTCY AND CREDITORS RIGHTS TO CHARLOTTE

Su Surv rvey, ey, Sea Searc rch &amp; Seizu h &amp; Seizure re un under r Inc ncom ome

Progress towards graduation Session 1 Roland Mollerus Secretary Committee for Development Policy

This is Tomorrow Just what is it that made yesterdays homes so different, so

Introduction why psychoanalytic? why a training in psychoanalytic psychotherapy?

A Rasch Analysis of Math Tests using M.In.E.R.Va. Platform Grazia Messineo - Salvatore Vassallo

Explore More Topics

Sambuz

Useful Links

Newsletter

Mail Us

Su Surv rvey, ey, Sea Searc rch & Seizu h & Seizure re un under r Inc ncom ome