Data Linkage Techniques: Past, Present and Future

Peter Christen
Department of Computer Science, The Australian National University
Contact: peter.christen@anu.edu.au
Project Web site: http://datamining.anu.edu.au/linkage.html

Funded by the Australian National University, the NSW Department of Health, the Australian Research Council (ARC) under Linkage Project 0453463, and the Australian Partnership for Advanced Computing (APAC)

Peter Christen, August 2006 – p.1/32

Outline
- What is data linkage? Applications and challenges
- The past: A short history of data linkage
- The present: Computer science based approaches: Learning to link
- The future: Scalability, automation, and privacy and confidentiality
- Our project: Febrl (Freely extensible biomedical record linkage)

What is data (or record) linkage?
- The process of linking and aggregating records from one or more data sources that represent the same entity (patient, customer, business name, etc.)
- Also called data matching, data integration, data scrubbing, ETL (extraction, transformation and loading), object identification, merge-purge, etc.
- Challenging if no unique entity identifiers are available
- E.g., which of these records represent the same person?
    Dr Smith, Peter   42 Miller Street   2602 O'Connor
    Pete Smith        42 Miller St       2600 Canberra A.C.T.
    P. Smithers       24 Mill Street     2600 Canberra ACT

Recent interest in data linkage
- Traditionally, data linkage has been used in health (epidemiology) and statistics (census)
- In recent years, increased interest from businesses and governments:
  - A lot of data is being collected by many organisations
  - Increased computing power and storage capacities
  - Data warehousing and distributed databases
  - Data mining of large data collections
  - E-commerce and Web applications (for example online product comparisons: http://froogle.com)
  - Geocoding and spatial data analysis

Applications and usage
- Applications of data linkage:
  - Remove duplicates in a data set (internal linkage)
  - Merge new records into a larger master data set
  - Create patient or customer oriented statistics
  - Compile data for longitudinal (over time) studies
  - Geocode matching (with reference address data)
- Widespread use of data linkage:
  - Immigration, taxation, social security, census
  - Fraud, crime and terrorism intelligence
  - Business mailing lists, exchange of customer data
  - Social, health and biomedical research

Challenge 1: Dirty data
- Real-world data is often dirty:
  - Missing values, inconsistencies
  - Typographical errors and other variations
  - Different coding schemes / formats
  - Out-of-date data
- Names and addresses are especially prone to data entry errors (taken over the phone, hand-written, scanned)
- Cleaned and standardised data is needed for:
  - loading into databases and data warehouses
  - data mining and other data analysis studies
  - data linkage and deduplication

Challenge 2: Scalability
- Data collections with tens or even hundreds of millions of records are not uncommon
- The number of possible record pairs to compare equals the product of the sizes of the two data sets (linking two data sets with 1,000,000 records each results in 10^6 × 10^6 = 10^12 record pairs)
- The performance bottleneck in a data linkage system is usually the (expensive) comparison of attribute (field) values between record pairs
- Blocking / indexing / filtering techniques are used to reduce the large number of comparisons
- The linkage process should be automatic

Challenge 3: Privacy and confidentiality
- The general public is worried about their information being linked and shared between organisations
  - Good: research, health, statistics, crime and fraud detection (taxation, social security, etc.)
  - Scary: intelligence, surveillance, commercial data mining (not much information from businesses, no regulation)
  - Bad: identity fraud, re-identification
- Traditionally, identified data has to be given to the person or organisation performing the linkage
  - The privacy of individuals in the data sets is invaded
  - Consent of the individuals involved is needed; alternatively, seek approval from ethics committees
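The name variations in the example records above (Dr Smith, Peter / Pete Smith / P. Smithers) are what make Challenge 1 hard, and are typically handled with approximate string comparison functions. A minimal sketch of one such measure, a character-bigram Dice coefficient (illustrative only, not the specific comparators described in these slides):

```python
def bigrams(s):
    """Set of character bigrams of a string, lowercased, spaces removed."""
    s = s.lower().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice_similarity(a, b):
    """Dice coefficient over bigram sets: 1.0 = identical, 0.0 = disjoint."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

print(dice_similarity("Peter Smith", "Pete Smith"))   # ≈ 0.82
print(dice_similarity("Peter Smith", "P. Smithers"))  # ≈ 0.67
```

A linkage system computes one such similarity per compared attribute; the resulting values feed into the matching weights discussed in the probabilistic linkage slides.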

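The blocking idea from Challenge 2 can be sketched in a few lines: instead of comparing the full cross product of two data sets, build an index on a blocking variable and compare only pairs that share its value. The toy records and the choice of postcode as blocking key are made up for illustration:

```python
from collections import defaultdict

def candidate_pairs(recs_a, recs_b, key):
    """Yield only record pairs that share a blocking key value,
    instead of the full len(recs_a) * len(recs_b) cross product."""
    index = defaultdict(list)
    for r in recs_b:
        index[key(r)].append(r)
    for r in recs_a:
        for s in index.get(key(r), []):
            yield r, s

# Toy data: (name, postcode) tuples, blocking on postcode.
a = [("pete smith", "2600"), ("joe bloggs", "2602")]
b = [("peter smith", "2600"), ("p smithers", "2600"), ("jo bloggs", "2602")]
pairs = list(candidate_pairs(a, b, key=lambda r: r[1]))
print(len(pairs))  # 3 candidate pairs instead of 2 * 3 = 6
```

This also makes the two problems on the traditional blocking slide concrete: a wrong postcode puts a record in the wrong block, and a very frequent key value (such as the surname 'Smith') produces one very large block.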
The past: A short history of data linkage

- Computer assisted data linkage goes back as far as the 1950s (based on ad-hoc heuristic methods)
- Deterministic linkage
  - Exact linkage, if a unique identifier of high quality is available (it has to be precise, robust and stable over time); examples: Medicare, ABN or Tax file number (are they really unique, stable, trustworthy?)
  - Rules based linkage (complex to build and maintain)
- Probabilistic linkage
  - Apply linkage using available (personal) information (like names, addresses, dates of birth, etc.)

Probabilistic data linkage
- The basic ideas of probabilistic linkage were introduced by Newcombe & Kennedy (1962)
- Theoretical foundation by Fellegi & Sunter (1969)
- No unique entity identifiers available
- Compare common record attributes (or fields)
- Compute matching weights based on frequency ratios (global or value specific) and error estimates
- The sum of the matching weights is used to classify a pair of records as match, non-match, or possible match
- Problems: estimating errors and threshold values, the assumption of independence, and manual clerical review
- Still the basis of many linkage systems

Fellegi and Sunter classification
- For each compared record pair a vector containing matching weights is calculated, for example:
    Record A: ['dr', 'peter', 'paul', 'miller']
    Record B: ['mr', 'john', '', 'miller']
    Matching weights: [0.2, -3.2, 0.0, 2.4]
- The Fellegi & Sunter approach sums all weights, then uses two thresholds (a lower and an upper) to classify record pairs as non-matches, possible matches, or matches
- (Slide figure: histogram of total matching weights from -5 to 15 with the lower and upper thresholds marked; many more pairs have lower weights)

Traditional blocking
- Traditional blocking works by only comparing record pairs that have the same value for a blocking variable (for example, only compare records that have the same postcode value)
- Problems with traditional blocking:
  - An erroneous value in a blocking variable results in a record being inserted into the wrong block (several passes with different blocking variables can solve this)
  - The values of the blocking variable should be uniformly distributed, as the most frequent values determine the size of the largest blocks (example: frequency of 'Smith' in NSW: 25,425)

The present: Computer science based approaches: Learning to link

Improved classification
- Summing the matching weights results in a loss of information (e.g. two record pairs: same name but different address ⇔ different address but same name)
- View record pair classification as a multi-dimensional binary classification problem (use the matching weight vectors to classify record pairs into matches or non-matches, but no possible matches)
- Different machine learning techniques can be used:
  - Supervised: manually prepared training data is needed (record pairs and their match status), almost like manual clerical review before the linkage
  - Un-supervised: find (local) structure in the data (similar ...)

Classification challenges
- In many cases there is no training data available
  - Possible to use the results of earlier linkage projects? Or from the clerical review process?
  - How confident can we be about correct manual classification of possible links?
- Often there is no gold standard available (no data sets with true known linkage status)
- No test data set collection is available (like in information retrieval or data mining)
- Recent small repository: RIDDLE, http://www.cs.utexas.edu/users/ml/riddle/ (Repository of Information on Duplicate Detection, Record Linkage, and Identity Uncertainty)
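The Fellegi & Sunter two-threshold decision rule described above can be sketched in a few lines. The weight vector is the one from the slide example; the threshold values here are illustrative, not taken from the slides:

```python
def fellegi_sunter(weight_vector, lower, upper):
    """Sum the attribute matching weights and classify the record pair
    with two thresholds (Fellegi & Sunter, 1969)."""
    total = sum(weight_vector)
    if total >= upper:
        return "match"
    if total <= lower:
        return "non-match"
    return "possible match"

# Weights from the slide: title, given name, middle name, surname.
print(fellegi_sunter([0.2, -3.2, 0.0, 2.4], lower=-2.0, upper=5.0))
# prints "possible match" (total weight -0.6 lies between the thresholds)
```

Pairs that fall between the two thresholds are exactly the ones routed to manual clerical review, which is one of the problems listed on the probabilistic linkage slide.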

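Treating the whole weight vector as a point in a multi-dimensional space, as the improved classification slide suggests, can be illustrated with a simple nearest-centroid classifier. This is only a toy sketch with made-up training vectors, not the actual learning methods used in Febrl:

```python
def train_centroids(examples):
    """Supervised sketch: average the weight vectors of each class."""
    sums, counts = {}, {}
    for vec, label in examples:
        sums.setdefault(label, [0.0] * len(vec))
        sums[label] = [a + b for a, b in zip(sums[label], vec)]
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}

def classify(vec, centroids):
    """Assign a weight vector to the class with the nearest centroid."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids, key=lambda lab: sq_dist(vec, centroids[lab]))

# Hypothetical training pairs: per-attribute weight vectors and labels.
train = [([2.1, 3.0, 2.4], "match"), ([1.8, 2.6, 2.9], "match"),
         ([-3.0, -2.5, 0.1], "non-match"), ([-2.7, -3.1, -0.4], "non-match")]
cents = train_centroids(train)
print(classify([2.0, 2.0, 2.0], cents))  # prints "match"
```

Unlike weight summing, this keeps the per-attribute information, so "same name, different address" and "different name, same address" remain distinguishable points even when their totals are equal.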