Data Matching – Overview of Computer Science Methods and Research at the ANU
Peter Christen
Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University
Contact: peter.christen@anu.edu.au
March 2013
Outline
- Recent interest in data matching
- Data matching applications and challenges
- The data matching process
- Types of data matching techniques
- Improving scalability: indexing techniques
- Improving matching quality: learning techniques
- Privacy-preserving record linkage
- Research at the ANU: Febrl, privacy, matching historical census data, and real-time matching
- Challenges and research directions
Recent interest in data matching
- Traditionally, data matching has been used in statistics (census) and health (epidemiology)
- In recent years there has been increased interest from businesses and governments:
  - Massive amounts of data are being collected
  - Increased computing power and storage capacities
  - Data from different sources often need to be integrated
  - Need for data sharing between organisations
  - Data mining (analysis) of large data collections
  - E-commerce and Web applications
  - Geocode matching and spatial data analysis
Applications of data matching
- Remove duplicates in one data set (deduplication)
- Merge new records into a larger master data set
- Create patient- or customer-oriented statistics (for example, for longitudinal studies)
- Clean and enrich data for analysis and mining
- Geocode matching (with reference address data)
Widespread use of data matching:
- Immigration, taxation, social security, census
- Fraud, crime, and terrorism intelligence
- Business mailing lists, exchange of customer data
- Biomedical and social science research
Data matching challenges
- Often no unique entity identifiers are available
- Real-world data are dirty (typographical errors and variations, missing and out-of-date values, different coding schemes, etc.)
- Scalability:
  - Naïve comparison of all record pairs is quadratic
  - Blocking, searching, or filtering (indexing) is needed
- No training data in many data matching applications (no record pairs or groups with known true match status)
- Privacy and confidentiality (personal information, such as names and addresses, is commonly required for matching)
The data matching process
[Figure: flow diagram of the data matching process. Database A and Database B each undergo data pre-processing, followed by indexing / searching, comparison of candidate record pairs, and classification into matches, non-matches, and potential matches; potential matches go to clerical review, and all outcomes feed into evaluation.]
Types of data matching techniques
- Deterministic matching:
  - Exact matching (if a unique identifier of high quality is available: precise, robust, stable over time); examples: social security or Medicare numbers
  - Rule-based matching (complex to build and maintain)
- Probabilistic record linkage (Fellegi and Sunter, 1969):
  - Use available attributes for matching (often personal information, such as names, addresses, dates of birth, etc.)
  - Calculate matching weights for attributes (a sketch of this calculation follows below)
- 'Computer science' approaches (based on machine learning, data mining, database, or information retrieval techniques)
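The slides name the Fellegi and Sunter model but do not show the weight calculation; the following is a minimal sketch of the standard log-likelihood match weights, where the m- and u-probabilities per attribute are hypothetical values chosen for illustration.

```python
import math

# Hypothetical m- and u-probabilities per attribute:
# m = P(attribute values agree | record pair is a true match)
# u = P(attribute values agree | record pair is a non-match)
M_U = {
    'surname':    (0.95, 0.01),
    'given_name': (0.90, 0.05),
    'birth_date': (0.98, 0.001),
}

def match_weight(agreements):
    """Sum log2 agreement/disagreement weights over all attributes.

    agreements: dict mapping attribute name -> True (values agree)
                or False (values disagree) for one record pair.
    """
    total = 0.0
    for attr, agrees in agreements.items():
        m, u = M_U[attr]
        if agrees:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total

# A pair agreeing on surname and birth date but not given name;
# the summed weight is compared against an upper and a lower
# threshold to classify the pair as match / potential match / non-match:
print(match_weight({'surname': True, 'given_name': False, 'birth_date': True}))
```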
Improving scalability: Indexing
- The number of record pair comparisons equals the product of the sizes of the two databases (matching two databases containing 1 and 5 million records results in 5 × 10^12 – 5 trillion – record pairs)
- The number of true matches is generally less than the number of records in the smaller of the two databases (assuming no duplicate records)
- The performance bottleneck is usually the (expensive) detailed comparison of attribute values between records, using approximate string comparison functions (one example comparator is sketched below)
- Aim of indexing: cheaply remove record pairs that are obviously not matches
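The slides do not single out one approximate string comparator, so as one illustration of why detailed comparison is expensive, here is a minimal normalised edit-distance similarity (real systems such as Febrl offer many comparators, e.g. Jaro-Winkler or q-gram based ones):

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance, O(|s|*|t|) per pair."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + cost))    # substitution
        prev = curr
    return prev[-1]

def string_sim(s, t):
    """Normalise edit distance into a similarity in [0, 1]."""
    if not s and not t:
        return 1.0
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))

print(string_sim('gail', 'gayle'))  # 0.6: similar but not identical names
```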
Traditional blocking
- Traditional blocking works by only comparing record pairs that have the same value for a blocking variable (for example, only compare records that have the same postcode value); a minimal sketch follows below
- Problems with traditional blocking:
  - An erroneous value in a blocking variable results in a record being inserted into the wrong block (several passes with different blocking variables can solve this)
  - Values of the blocking variable should have uniform frequencies (as the most frequent values determine the sizes of the largest blocks)
  - Example: frequency of 'Smith' in NSW: 25,425
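A minimal sketch of traditional blocking; the records and blocking variables here are invented for illustration:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (record id, surname, postcode)
records = [
    ('r1', 'smith',  '2600'),
    ('r2', 'smyth',  '2600'),
    ('r3', 'miller', '2601'),
    ('r4', 'smith',  '2601'),
]

def block(records, key_index):
    """Group record ids by the value of one blocking variable."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key_index]].append(rec[0])
    return blocks

def candidate_pairs(blocks):
    """Only pairs within the same block are compared in detail."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# One pass blocking on postcode; a second pass on surname would
# recover pairs missed because of postcode errors:
print(candidate_pairs(block(records, 2)))  # {('r1', 'r2'), ('r3', 'r4')}
```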
Recent indexing approaches (1)
- Sorted neighbourhood approach:
  - Slide a window over the sorted databases
  - Use several passes with different blocking variables
- Q-gram based blocking (e.g. 2-grams / bigrams):
  - Convert values into q-gram lists, then generate sub-lists
  - 'peter' → ['pe','et','te','er'], ['pe','et','te'], ['pe','et','er'], ...
  - 'pete' → ['pe','et','te'], ['pe','et'], ['pe','te'], ['et','te'], ...
  - Each record will be inserted into several blocks (see the sketch below)
- Overlapping canopy clustering:
  - Based on q-grams and a 'cheap' similarity measure, such as Jaccard (set intersection) or TF-IDF/cosine
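A minimal sketch of the q-gram sub-list generation from the example above; the threshold parameter (how many q-grams may be dropped) is an assumption typical for this scheme, not a value from the slides:

```python
from itertools import combinations

def qgrams(value, q=2):
    """Split a string into overlapping q-grams."""
    return [value[i:i + q] for i in range(len(value) - q + 1)]

def qgram_blocking_keys(value, q=2, threshold=0.8):
    """Generate the full q-gram list and all sub-lists down to a
    minimum length; each concatenated sub-list becomes one blocking
    key, so every record is inserted into several blocks."""
    grams = qgrams(value, q)
    min_len = max(1, int(round(len(grams) * threshold)))
    keys = []
    for length in range(len(grams), min_len - 1, -1):
        for combo in combinations(grams, length):
            keys.append(''.join(combo))
    return keys

# 'peter' and 'pete' share the blocking key 'peette' (from the
# sub-list ['pe','et','te']), so they land in the same block:
print(qgram_blocking_keys('peter'))
print(qgram_blocking_keys('pete'))
```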
Recent indexing approaches (2)
- StringMap based blocking:
  - Map strings into a multi-dimensional space such that the distances between pairs of strings are preserved
  - Use a similarity join to find similar pairs (close strings)
- Suffix array based blocking:
  - Generate an inverted index based on suffix arrays (suffixes of 'peter': 'eter', 'ter', 'er', 'r'); a sketch follows below
- Post-blocking filtering (for example, on string length or q-gram count differences)
- US Census Bureau: BigMatch (pre-process the 'smaller' data set so its values can be directly accessed, with all blocking passes in one go)
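A minimal sketch of a suffix-based inverted index, assuming (as is common for this technique, though not stated on the slide) a minimum suffix length so that very short suffixes do not create huge blocks:

```python
from collections import defaultdict

def suffixes(value, min_len=3):
    """All suffixes of a string down to a minimum length."""
    return [value[i:] for i in range(len(value) - min_len + 1)]

def suffix_index(records):
    """Inverted index: each suffix maps to the record ids that end
    with it, so records sharing a suffix share a block."""
    index = defaultdict(set)
    for rec_id, value in records:
        for suffix in suffixes(value):
            index[suffix].add(rec_id)
    return index

# 'christen' and 'kristen' share the suffixes 'risten', 'isten',
# 'sten', and 'ten', so an error at the start of a name still
# blocks the two records together:
index = suffix_index([('r1', 'christen'), ('r2', 'kristen')])
print({s: ids for s, ids in index.items() if len(ids) > 1})
```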
Improving matching quality: Learning techniques
- View record pair classification as a multi-dimensional binary classification problem: use the numerical attribute similarities of each pair to classify it as a match or non-match (a sketch follows below)
- Many machine learning techniques can be used:
  - Supervised: decision trees, SVMs, neural networks, learnable string comparisons, active learning, etc.
  - Unsupervised: various clustering algorithms
- Recently, collective classification techniques have been investigated (build a graph of the database and conduct an overall classification, rather than classifying each record pair independently)
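As one illustration of the supervised setting, here is a minimal sketch using a decision tree from scikit-learn (my choice of library; the similarity vectors and labels are invented for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Each row is one record pair, represented by its attribute
# similarities (e.g. surname, given name, birth date), each in [0, 1].
similarity_vectors = [
    [0.95, 0.90, 1.0],   # strongly agreeing pair
    [0.90, 0.75, 1.0],
    [0.20, 0.10, 0.0],   # strongly disagreeing pair
    [0.30, 0.25, 0.0],
]
labels = [1, 1, 0, 0]    # 1 = match, 0 = non-match (training data)

clf = DecisionTreeClassifier().fit(similarity_vectors, labels)

# Classify an unseen pair from its similarity vector:
print(clf.predict([[0.85, 0.80, 1.0]]))  # -> [1], a match
```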
Collective classification example
[Figure: graph linking author records A1–A6 to the papers they co-authored, with four unresolved match decisions w1–w4 between ambiguous author references such as 'D. White' in P2 and P6, which could refer to either Dave White or Don White; collective classification resolves all such decisions jointly over the co-author graph.]
Author records: (A1, Dave White, Intel), (A2, Don White, CMU), (A3, Susan Grey, MIT), (A4, John Black, MIT), (A5, Joe Brown, unknown), (A6, Liz Pink, unknown)
Paper records: (P1, John Black / Don White), (P2, Sue Grey / D. White), (P3, Dave White), (P4, Don White / Joe Brown), (P5, Joe Brown / Liz Pink), (P6, Liz Pink / D. White)
Adapted from Kalashnikov and Mehrotra, ACM TODS, 31(2), 2006
Managing transitive closure
[Figure: a chain of records a1 – a2 – a3 – a4 linked by pairwise match decisions]
- If record a1 is classified as matching record a2, and record a2 as matching record a3, then records a1 and a3 must also match
- There is a possibility of record chains occurring
- Various algorithms have been developed to find optimal solutions (special clustering algorithms); a minimal grouping sketch follows below
- Collective classification deals with this problem by default
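A minimal sketch of computing the transitive closure of pairwise match decisions with union-find (the slides only say "special clustering algorithms"; union-find is one common way to form the groups those algorithms then refine):

```python
def transitive_closure(pairs):
    """Group records into entities via union-find: any two records
    connected by a chain of matched pairs end up in one group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())

# The pairwise decisions chain a1 through a4 into a single group,
# even though a1 and a4 may be quite dissimilar -- which is exactly
# why special clustering algorithms are needed:
print(transitive_closure([('a1', 'a2'), ('a2', 'a3'), ('a3', 'a4')]))
```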
Classification challenges
- In many cases there is no training data available:
  - Is it possible to use the results of earlier data matching projects, or of a manual clerical review process?
  - How confident can we be that the manual classification of potential matches is correct?
- Often there is no gold standard available (no data sets with known true match status)
- No large collections of test data sets are available (unlike in information retrieval or machine learning)
- Many data matching researchers use synthetic or bibliographic data (which have very different characteristics)
Privacy-preserving record linkage (1)
[Figure: two protocol settings, a two-party protocol between Alice and Bob, and a three-party protocol in which a third party, Carol, conducts the matching; the numbers (1)–(3) indicate protocol steps]
- Assume two data sources, and possibly a third (trusted) party to conduct the matching
- Objective: no party learns about the other parties' private data; only matched records are revealed
- Various approaches exist, with different assumptions about threats, about what can be inferred by the parties, and about what is being released
- Based on some form of encoding or encryption technique
Research at ANU 1: Collaboration with NSW Health
- From 2002 to 2009, funded by the ANU, APAC, and an ARC Linkage Project
- Developed the open source software Febrl (Freely extensible biomedical record linkage)
- Several research areas:
  - Probabilistic techniques for automated data cleaning and standardisation (mainly of addresses)
  - Novel geocode matching techniques
  - New and improved blocking and indexing techniques
  - Improved record pair classification using unsupervised machine learning techniques
  - Improved performance (scalability and parallelism)
Research at ANU 2: Privacy-preserving record linkage
- Currently one PhD student, with an ARC Discovery Project starting this year (a collaboration with Vassilios Verykios, Greece)
- Work so far has focused on scalability to large databases and on two-party protocols
- Protocols based on Bloom filters (bit strings used to calculate Dice/Jaccard similarities); a sketch follows below
- We developed a taxonomy for PPRL techniques
- Current work is on privacy measures for PPRL
- Future work will focus on matching data from multiple parties, and on assessing matching quality and completeness in PPRL
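A minimal sketch of Bloom filter encoding with a Dice similarity, as mentioned above. The filter length, number of hash functions, and use of unkeyed SHA-256 are simplifying assumptions for illustration; real PPRL protocols use much longer bit strings and keyed hashing (e.g. HMACs with a secret shared by the database owners).

```python
import hashlib

BITS = 64       # Bloom filter length (real protocols use ~1000 bits)
NUM_HASHES = 2  # number of hash functions per q-gram

def bigrams(value):
    return [value[i:i + 2] for i in range(len(value) - 1)]

def bloom_encode(value):
    """Hash each bigram of a string into a fixed-length set of bit
    positions. Parties exchange only these encodings, never raw names."""
    bits = set()
    for gram in bigrams(value):
        for seed in range(NUM_HASHES):
            digest = hashlib.sha256(f'{seed}:{gram}'.encode()).hexdigest()
            bits.add(int(digest, 16) % BITS)
    return bits

def dice_similarity(bits_a, bits_b):
    """Dice coefficient on the encoded bit sets approximates the
    bigram similarity of the underlying (hidden) strings."""
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))

# Similar names yield similar Bloom filters, so the linkage party
# can score the pair without ever seeing 'christen' or 'kristen':
print(dice_similarity(bloom_encode('christen'), bloom_encode('kristen')))
```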