Data Linkage Techniques: Past, Present and Future

Peter Christen
Department of Computer Science, The Australian National University
Contact: peter.christen@anu.edu.au
Project Web site: http://datamining.anu.edu.au/linkage.html

Funded by the Australian National University, the NSW Department of Health, the Australian Research Council (ARC) under Linkage Project 0453463, and the Australian Partnership for Advanced Computing (APAC)

Peter Christen, August 2006 – p.1/32

Outline
- What is data linkage? Applications and challenges
- The past: A short history of data linkage
- The present: Computer science based approaches: Learning to link
- The future: Scalability, automation, and privacy and confidentiality
- Our project: Febrl (Freely extensible biomedical record linkage)

What is data (or record) linkage?
- The process of linking and aggregating records from one or more data sources that represent the same entity (patient, customer, business name, etc.)
- Also called data matching, data integration, data scrubbing, ETL (extraction, transformation and loading), object identification, merge-purge, etc.
- Challenging if no unique entity identifiers are available
- E.g., which of these records represent the same person?
    Dr Smith, Peter   42 Miller Street   2602 O'Connor
    Pete Smith        42 Miller St       2600 Canberra A.C.T.
    P. Smithers       24 Mill Street     2600 Canberra ACT

Recent interest in data linkage
- Traditionally, data linkage has been used in health (epidemiology) and statistics (census)
- In recent years, increased interest from businesses and governments:
  - A lot of data is being collected by many organisations
  - Increased computing power and storage capacities
  - Data warehousing and distributed databases
  - Data mining of large data collections
  - E-commerce and Web applications (for example online product comparisons: http://froogle.com)
  - Geocoding and spatial data analysis

Applications and usage
- Applications of data linkage:
  - Remove duplicates in a data set (internal linkage)
  - Merge new records into a larger master data set
  - Create patient or customer oriented statistics
  - Compile data for longitudinal (over time) studies
  - Geocode matching (with reference address data)
- Widespread use of data linkage:
  - Immigration, taxation, social security, census
  - Fraud, crime and terrorism intelligence
  - Business mailing lists, exchange of customer data
  - Social, health and biomedical research

Challenge 1: Dirty data
- Real-world data is often dirty:
  - Missing values, inconsistencies
  - Typographical errors and other variations
  - Different coding schemes / formats
  - Out-of-date data
- Names and addresses are especially prone to data entry errors (taken over the phone, hand-written, scanned)
- Cleaned and standardised data is needed for:
  - loading into databases and data warehouses
  - data mining and other data analysis studies
  - data linkage and deduplication

Challenge 2: Scalability
- Data collections with tens or even hundreds of millions of records are not uncommon
- The number of possible record pairs to compare equals the product of the sizes of the two data sets (linking two data sets with 1,000,000 records each results in 10^6 × 10^6 = 10^12 record pairs)
- The performance bottleneck in a data linkage system is usually the (expensive) comparison of attribute (field) values between record pairs
- Blocking / indexing / filtering techniques are used to reduce the large number of comparisons
- The linkage process should be automatic

Challenge 3: Privacy and confidentiality
- The general public is worried about their information being linked and shared between organisations
  - Good: research, health, statistics, crime and fraud detection (taxation, social security, etc.)
  - Scary: intelligence, surveillance, commercial data mining (not much information from businesses, no regulation)
  - Bad: identity fraud, re-identification
- Traditionally, identified data has to be given to the person or organisation performing the linkage
  - The privacy of individuals in the data sets is invaded
  - Consent of the individuals involved is needed; alternatively, seek approval from ethics committees
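The name variations in the example records above (Dr Smith, Peter / Pete Smith / P. Smithers) are what make Challenge 1 hard, and are typically handled with approximate string comparison functions. A minimal sketch of one such measure, a character-bigram Dice coefficient (illustrative only, not the specific comparators described in these slides):

```python
def bigrams(s):
    """Set of character bigrams of a string, lowercased, spaces removed."""
    s = s.lower().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice_similarity(a, b):
    """Dice coefficient over bigram sets: 1.0 = identical, 0.0 = disjoint."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

print(dice_similarity("Peter Smith", "Pete Smith"))   # ≈ 0.82
print(dice_similarity("Peter Smith", "P. Smithers"))  # ≈ 0.67
```

A linkage system computes one such similarity per compared attribute; the resulting values feed into the matching weights discussed in the probabilistic linkage slides.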

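The blocking idea from Challenge 2 can be sketched in a few lines: instead of comparing the full cross product of two data sets, build an index on a blocking variable and compare only pairs that share its value. The toy records and the choice of postcode as blocking key are made up for illustration:

```python
from collections import defaultdict

def candidate_pairs(recs_a, recs_b, key):
    """Yield only record pairs that share a blocking key value,
    instead of the full len(recs_a) * len(recs_b) cross product."""
    index = defaultdict(list)
    for r in recs_b:
        index[key(r)].append(r)
    for r in recs_a:
        for s in index.get(key(r), []):
            yield r, s

# Toy data: (name, postcode) tuples, blocking on postcode.
a = [("pete smith", "2600"), ("joe bloggs", "2602")]
b = [("peter smith", "2600"), ("p smithers", "2600"), ("jo bloggs", "2602")]
pairs = list(candidate_pairs(a, b, key=lambda r: r[1]))
print(len(pairs))  # 3 candidate pairs instead of 2 * 3 = 6
```

This also makes the two problems on the traditional blocking slide concrete: a wrong postcode puts a record in the wrong block, and a very frequent key value (such as the surname 'Smith') produces one very large block.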
The past: A short history of data linkage

- Computer assisted data linkage goes back as far as the 1950s (based on ad-hoc heuristic methods)
- Deterministic linkage
  - Exact linkage, if a unique identifier of high quality is available (it has to be precise, robust and stable over time); examples: Medicare, ABN or Tax file number (are they really unique, stable, trustworthy?)
  - Rules based linkage (complex to build and maintain)
- Probabilistic linkage
  - Apply linkage using available (personal) information (like names, addresses, dates of birth, etc.)

Probabilistic data linkage
- The basic ideas of probabilistic linkage were introduced by Newcombe & Kennedy (1962)
- Theoretical foundation by Fellegi & Sunter (1969)
- No unique entity identifiers available
- Compare common record attributes (or fields)
- Compute matching weights based on frequency ratios (global or value specific) and error estimates
- The sum of the matching weights is used to classify a pair of records as match, non-match, or possible match
- Problems: estimating errors and threshold values, the assumption of independence, and manual clerical review
- Still the basis of many linkage systems

Fellegi and Sunter classification
- For each compared record pair a vector containing matching weights is calculated, for example:
    Record A: ['dr', 'peter', 'paul', 'miller']
    Record B: ['mr', 'john', '', 'miller']
    Matching weights: [0.2, -3.2, 0.0, 2.4]
- The Fellegi & Sunter approach sums all weights, then uses two thresholds (a lower and an upper) to classify record pairs as non-matches, possible matches, or matches
- (Slide figure: histogram of total matching weights from -5 to 15 with the lower and upper thresholds marked; many more pairs have lower weights)

Traditional blocking
- Traditional blocking works by only comparing record pairs that have the same value for a blocking variable (for example, only compare records that have the same postcode value)
- Problems with traditional blocking:
  - An erroneous value in a blocking variable results in a record being inserted into the wrong block (several passes with different blocking variables can solve this)
  - The values of the blocking variable should be uniformly distributed, as the most frequent values determine the size of the largest blocks (example: frequency of 'Smith' in NSW: 25,425)

The present: Computer science based approaches: Learning to link

Improved classification
- Summing the matching weights results in a loss of information (e.g. two record pairs: same name but different address ⇔ different address but same name)
- View record pair classification as a multi-dimensional binary classification problem (use the matching weight vectors to classify record pairs into matches or non-matches, but no possible matches)
- Different machine learning techniques can be used:
  - Supervised: manually prepared training data is needed (record pairs and their match status), almost like manual clerical review before the linkage
  - Un-supervised: find (local) structure in the data (similar ...)

Classification challenges
- In many cases there is no training data available
  - Possible to use the results of earlier linkage projects? Or from the clerical review process?
  - How confident can we be about correct manual classification of possible links?
- Often there is no gold standard available (no data sets with true known linkage status)
- No test data set collection is available (like in information retrieval or data mining)
- Recent small repository: RIDDLE, http://www.cs.utexas.edu/users/ml/riddle/ (Repository of Information on Duplicate Detection, Record Linkage, and Identity Uncertainty)
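The Fellegi & Sunter two-threshold decision rule described above can be sketched in a few lines. The weight vector is the one from the slide example; the threshold values here are illustrative, not taken from the slides:

```python
def fellegi_sunter(weight_vector, lower, upper):
    """Sum the attribute matching weights and classify the record pair
    with two thresholds (Fellegi & Sunter, 1969)."""
    total = sum(weight_vector)
    if total >= upper:
        return "match"
    if total <= lower:
        return "non-match"
    return "possible match"

# Weights from the slide: title, given name, middle name, surname.
print(fellegi_sunter([0.2, -3.2, 0.0, 2.4], lower=-2.0, upper=5.0))
# prints "possible match" (total weight -0.6 lies between the thresholds)
```

Pairs that fall between the two thresholds are exactly the ones routed to manual clerical review, which is one of the problems listed on the probabilistic linkage slide.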

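Treating the whole weight vector as a point in a multi-dimensional space, as the improved classification slide suggests, can be illustrated with a simple nearest-centroid classifier. This is only a toy sketch with made-up training vectors, not the actual learning methods used in Febrl:

```python
def train_centroids(examples):
    """Supervised sketch: average the weight vectors of each class."""
    sums, counts = {}, {}
    for vec, label in examples:
        sums.setdefault(label, [0.0] * len(vec))
        sums[label] = [a + b for a, b in zip(sums[label], vec)]
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}

def classify(vec, centroids):
    """Assign a weight vector to the class with the nearest centroid."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids, key=lambda lab: sq_dist(vec, centroids[lab]))

# Hypothetical training pairs: per-attribute weight vectors and labels.
train = [([2.1, 3.0, 2.4], "match"), ([1.8, 2.6, 2.9], "match"),
         ([-3.0, -2.5, 0.1], "non-match"), ([-2.7, -3.1, -0.4], "non-match")]
cents = train_centroids(train)
print(classify([2.0, 2.0, 2.0], cents))  # prints "match"
```

Unlike weight summing, this keeps the per-attribute information, so "same name, different address" and "different name, same address" remain distinguishable points even when their totals are equal.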