privacy preserving probabilistic record linkage

Privacy Preserving Probabilistic Record Linkage Duncan Smith - PowerPoint PPT Presentation


  1. Privacy Preserving Probabilistic Record Linkage
     Duncan Smith (Duncan.G.Smith@Manchester.ac.uk)
     Natalie Shlomo (Natalie.Shlomo@Manchester.ac.uk)
     Social Statistics, School of Social Sciences, University of Manchester
     The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 262608 (DwB - Data without Boundaries).

  2. Topics Covered
     • Introduction
     • Probabilistic Record Linkage
     • String Anonymisation
     • Putting the probabilities back into Privacy-Preserving Record Linkage
     • Experiment
     • Discussion

  3. Introduction
     • Probabilistic record linkage was developed by Fellegi and Sunter (1969).
     • Administrative sources are being used to improve the quality of surveys or to replace traditional censuses.
     • Traditionally, all datasets are held in one location (NSI) and matching variables (first name, last name, address) are used to link data without the need for anonymisation.
     • Data on individuals may be in distinct databases and may be owned by different custodians: Alice (A) and Bob (B).
     • Privacy restrictions prevent the release of certain variables, or information that uniquely identifies an individual is suppressed or coarsened.

  4. Introduction
     • The computer science (CS) literature provides techniques for anonymising identifying variables.
     • A third party (Carole) only sees matching variables and returns pairs of unique record IDs (assigned by Alice and Bob).
     • Two possible scenarios (there are more…):
         • Trusted Carole: sees the true values of a single matching variable
         • Non-trusted Carole: sees anonymised values of a single matching variable
     • Privacy preserving record linkage (PPRL) allows exact matching and can allow linkage based on similarity scores generated from anonymised values.
     • F&S probabilistic record linkage is typically not used in PPRL.

  5. Introduction
     • Alice and Bob clean, harmonize and standardize their data and anonymise the matching variables (using the same method and seed).
     • In our new approach, we apply probabilistic record linkage to anonymised values to obtain a probability of a correct match (PPPRL).
     • Motivation:
         • Data can be held within an archive, and users can carry out PPPRL within a 'black box' for dynamic database integration.
         • Three-party Alice, Bob, Carole scenario as set out in the UK Beyond 2011 project, where Carole has access to original values and can calculate string comparators.
     • In PPRL there is no possibility of clerical review, and links are classified into 2 classes: true matches and false matches.

  6. Probabilistic Record Linkage
     • F&S probabilistic record linkage uses a Binomial EM algorithm, based on an agree/disagree indicator $\gamma_i$ ($i \in \{1, \dots, p\}$) for each matching variable, to estimate the likelihood ratio.
     • The matching score is the sum of the logs of the likelihood ratios, $\sum_{i=1}^{p} \log\big(m(\gamma_i)/u(\gamma_i)\big)$, where $m(\gamma_i)$ is the probability of agreement given that the pair is a match and $u(\gamma_i)$ is the probability of agreement given that it is not a match.
     • String comparators, e.g. Jaro-Winkler, are used to adjust the matching score for partial agreements, e.g. typing errors.
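
The matching score on this slide can be illustrated with a minimal Python sketch. The slide only defines the agreement case; the disagreement weight log((1-m)/(1-u)) used below is the usual Fellegi-Sunter counterpart and is added here as an assumption, as are the function name `match_score` and all parameter values.

```python
import math

def match_score(agreements, m_probs, u_probs):
    """Sum of log likelihood ratios over the matching variables.

    agreements[i]: True if the pair agrees on variable i
    m_probs[i]:    P(agree on variable i | pair is a match)
    u_probs[i]:    P(agree on variable i | pair is not a match)
    """
    score = 0.0
    for agree, m, u in zip(agreements, m_probs, u_probs):
        if agree:
            score += math.log(m / u)
        else:
            # Assumed disagreement weight (standard F&S form, not on the slide).
            score += math.log((1.0 - m) / (1.0 - u))
    return score

# Hypothetical m and u values for three variables; not taken from the slides.
m_probs = [0.95, 0.90, 0.85]
u_probs = [0.50, 0.08, 0.01]

all_agree = match_score([True, True, True], m_probs, u_probs)
one_disagree = match_score([True, True, False], m_probs, u_probs)
print(all_agree > one_disagree)  # True
```

In practice m and u would come from the EM algorithm rather than being fixed by hand.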

  7. String Anonymisation
     • String anonymisation can use hash functions on bigrams:
         'john' → {'jo', 'oh', 'hn'} → {21299418, 21496024, 20971735}
         'jon' → {'jo', 'on'} → {21299418, 21889246}
     • Minwise hashing (Broder 1997) generates a random permutation of a set of elements and returns the hash of the first element in the permuted order.
     • The probability of a hash collision on the first ordered element is the Jaccard similarity score: $J_{A,B} = \frac{|A \cap B|}{|A \cup B|}$
     • The Jaccard similarity score is estimated from many hash values, where the number of collisions $n$ is distributed as $n \sim \mathrm{Bin}(m, J_{A,B})$ ($m$ is the number of hash functions).
     • It is estimated by $\hat{J}_{A,B} = \frac{n}{m}$
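
A minimal Python sketch of the minwise hashing estimate described on this slide. The seeded SHA-256 hash stands in for a random permutation and is an assumption; it will not reproduce the hash values shown on the slide. The helper names (`bigrams`, `minhash_signature`, `estimate_jaccard`) are hypothetical.

```python
import hashlib

def bigrams(s):
    """Set of adjacent character pairs, e.g. 'john' -> {'jo', 'oh', 'hn'}."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def seeded_hash(token, seed):
    # Assumed hash family: each seed plays the role of one random permutation.
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def minhash_signature(tokens, m):
    """For each of m hash functions, keep the minimum hash over the set."""
    return [min(seeded_hash(t, seed) for t in tokens) for seed in range(m)]

def estimate_jaccard(sig_a, sig_b):
    """n / m, where n is the number of positions at which the minima collide."""
    n = sum(x == y for x, y in zip(sig_a, sig_b))
    return n / len(sig_a)

a, b = bigrams("john"), bigrams("jon")
print(jaccard(a, b))  # 0.25
print(estimate_jaccard(minhash_signature(a, 1000), minhash_signature(b, 1000)))
```

Both parties must use the same hash family and seeds, otherwise their signatures are not comparable.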

  8. String Anonymisation
     • Proposed method: concatenated 1-bit minwise hashing
     • The estimate of the Jaccard similarity score is: $\hat{J}_{A,B} = 2\frac{n}{m} - 1$
     Example: minwise hashes and 1-bit minwise hashes under a binary representation for S1 = {'jo', 'oh', 'hn'} and S2 = {'jo', 'on'}

           H1          H2          H3          H4          H5          …  Hm
     S1    451153726   1123790273  2501120381  2030682762  965995804
     S2    797504823   1123790273  262296169   1744666338  965995804
     …
     Sn

           H1  H2  H3  H4  H5  …  Hm
     S1    0   1   1   0   0
     S2    1   1   1   0   0
     …
     Sn

     • With 5 hash functions, the estimate of the Jaccard similarity score is 2/5 for minwise hashes and 3/5 for 1-bit hashes; the true value is 1/4.
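
The slide's example can be checked with a short Python sketch. Taking the lowest bit of each minwise hash reproduces the bit table above, but the slides do not state which bit is kept, so that choice is an assumption. The correction 2n/m - 1 follows because two different minima still agree on one bit with probability 1/2, giving P(bits agree) = (1 + J)/2.

```python
# Minwise hashes H1..H5 for S1 and S2, copied from the slide.
S1 = [451153726, 1123790273, 2501120381, 2030682762, 965995804]
S2 = [797504823, 1123790273, 262296169, 1744666338, 965995804]

def minwise_estimate(h_a, h_b):
    """n / m, with n the number of colliding minwise hashes."""
    n = sum(x == y for x, y in zip(h_a, h_b))
    return n / len(h_a)

def one_bit(hashes):
    """Keep one bit per hash (assumed here: the least significant bit)."""
    return [h & 1 for h in hashes]

def one_bit_estimate(bits_a, bits_b):
    """2n/m - 1, with n the number of agreeing bits (written as (2n - m)/m)."""
    m = len(bits_a)
    n = sum(x == y for x, y in zip(bits_a, bits_b))
    return (2 * n - m) / m

print(one_bit(S1), one_bit(S2))                 # [0, 1, 1, 0, 0] [1, 1, 1, 0, 0]
print(minwise_estimate(S1, S2))                 # 0.4
print(one_bit_estimate(one_bit(S1), one_bit(S2)))  # 0.6
```

With only 5 hash functions both estimates are far from the true value of 1/4; the estimate tightens as m grows.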

  9. String Anonymisation
     • Simulation study: File A contains 300 names; File B is obtained by perturbing File A to simulate typographical errors.
     • Names are tokenized into bigrams with leading and trailing underscores.
     • True Jaccard scores are compared with estimated scores on all pairs in A x B.
     • Bias in Bloom filter approaches
     • Smaller variance in the minwise hash compared to the concatenated 1-bit hash, but it requires more storage
     • The concatenated hash has approximately the same MSE as the Bloom filter
     • Precision can be controlled by the choice of m, the number of hash functions
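
The padded tokenization used in the simulation study might look like the following Python sketch. A single underscore on each side is an assumption (the slides do not say how many are added), and the function names are hypothetical.

```python
def padded_bigrams(name):
    """Bigrams of the name with one leading and one trailing underscore."""
    padded = f"_{name}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

print(sorted(padded_bigrams("jon")))  # ['_j', 'jo', 'n_', 'on']
print(jaccard(padded_bigrams("john"), padded_bigrams("jon")))  # 0.5
```

Padding makes the first and last characters contribute their own tokens, so 'john' and 'jon' score 0.5 here rather than the 0.25 obtained from unpadded bigrams.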

  10. Privacy Preserving Probabilistic Record Linkage
      • Extend the Binomial EM algorithm to K categories, k = 1, …, K, where each category is a grouping of similarity scores (Jaro for original values; Jaccard for anonymised values), e.g. 8 categories with (inclusive) upper bounds: [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1]
      • The element of the agreement vector for variable q of pair j is $\gamma^{q}_{j,k} = 1$ if the similarity score is in category k, and 0 otherwise.
      • A Multinomial EM algorithm is used to estimate the matching parameters: $\hat{p}$, $\hat{m}_{q,k}$ and $\hat{u}_{q,k}$
      • Blocking: methods in the PPRL literature include canopy clustering (McCallum et al., 2000), which divides the pairs into overlapping subsets before classification; multibit tree structures to identify similar comparison vectors under the Bloom filter framework (Bachteler et al., 2013); and more.
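
Mapping a similarity score to its category and one-hot agreement vector can be sketched in Python as below. The bounds are the 8 upper bounds from the slide; zero-based category indices and the function names are choices made for this illustration.

```python
from bisect import bisect_left

# Inclusive upper bounds of the 8 categories, as listed on the slide.
UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1.0]

def category(score, bounds=UPPER_BOUNDS):
    """Zero-based index of the first bound >= score (upper bounds inclusive)."""
    return bisect_left(bounds, score)

def agreement_vector(score, bounds=UPPER_BOUNDS):
    """One-hot vector: 1 in the score's category, 0 elsewhere."""
    k = category(score, bounds)
    return [1 if i == k else 0 for i in range(len(bounds))]

print(category(0.2))           # 0
print(category(1.0))           # 7
print(agreement_vector(0.93))  # [0, 0, 0, 0, 0, 1, 0, 0]
```

`bisect_left` is used rather than `bisect_right` so that a score exactly equal to a bound falls into that bound's category.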

  11. Experiment
      • 1000 records from a Census database with attached English names (File A)
      • File B generated by perturbing File A under a probabilistic approach, including swapping, deleting and transposing characters, on the variables Gender, Year of Birth, Month of Birth and First Name
      • Four perturbed datasets generated at different levels of perturbation
      • A random sample of 700 records from File A and a random sample of 400 records from the perturbed files were used for matching
      • No blocking was carried out

  12. Experiment
      PPPRL:
      • Binary EM: standard EM approach based on exact matching of strings. No similarity score is used.
      • LR weighted: takes the outputs of Binary EM and down-weights the likelihood ratios
      • Log LR weighted: takes the outputs of Binary EM and down-weights the log likelihood ratios
      • EM (8): multinomial EM approach with 8 bins having upper bounds [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1]. Jaccard similarity score (with padded underscores on bigrams)
      • EM (15): multinomial EM approach with 15 bins having upper bounds [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.925, 0.95, 0.975, 0.999, 1]. As above
      PRL:
      • Jaro: multinomial EM approach with 8 bins using the Jaro string comparator
      • Jaro-Winkler: as above but with the Jaro-Winkler string comparator

  13. Experiment
      • Correct links are identified and used to construct precision-recall plots.
      • The plots show, for any given threshold, the precision and recall based on false positives, true positives, false negatives and true negatives, and can be used to compare approaches.
      • Good approaches will produce curves in the upper right of the plot.
      $\mathrm{Precision} = \frac{tp}{tp + fp}$
      $\mathrm{Recall} = \frac{tp}{tp + fn}$
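
The precision and recall formulas can be evaluated at a given threshold as in this Python sketch. The scores and match labels are made-up illustration data, not results from the experiment, and returning precision 1.0 when no pair is linked is a convention assumed here.

```python
def precision_recall(scores, is_match, threshold):
    """Precision and recall when pairs with score >= threshold are linked."""
    tp = sum(1 for s, y in zip(scores, is_match) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, is_match) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, is_match) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical matching scores and true match status for five record pairs.
scores = [9.1, 7.4, 3.2, 1.0, -2.5]
is_match = [True, True, False, True, False]

# Sweeping the threshold traces out one precision-recall curve.
for threshold in (5.0, 3.0, 0.0):
    print(threshold, precision_recall(scores, is_match, threshold))
```

Lowering the threshold links more pairs, which can only raise recall but usually costs precision.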

  14. Experiment
      [Figure: precision-recall plots, panels "low perturbation" and "high perturbation"]
      • All approaches perform better with a low level of perturbation.
      • Binary EM without similarity scores performs the worst.
      • Down-weighting log likelihood ratios outperforms down-weighting of likelihood ratios.
      • Multinomial EM outperforms Binary EM, with no clear difference between the 8-category and 15-category Jaccard score schemes.
      • Jaro schemes provide the best performance, although these are not privacy preserving.

  15. Discussion
      • PPPRL does not allow clerical review, and one threshold is determined based on the posterior probability of a correct link.
      • PPPRL can be tailored to different types of variables via the choice/design of the tokenization scheme.
      • So far, only 1-to-1 matching has been dealt with.
      • Multinomial EM offers improved classification over the unweighted and weighted binary EM schemes.
      • Under trusted Carole, the Jaro and Jaro-Winkler schemes outperformed the padded bigram tokenization scheme under PPPRL.
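
The posterior probability of a correct link mentioned in the first bullet can be sketched in Python with Bayes' rule. This assumes conditional independence of the matching variables given match status, which is the usual Fellegi-Sunter assumption but is not stated on the slides; the function name and parameter values are hypothetical.

```python
def posterior_match_probability(categories, p, m, u):
    """P(match | observed categories), assuming conditional independence.

    categories[q]: observed similarity category k for variable q
    p:             prior probability that a pair is a match
    m[q][k]:       P(category k on variable q | match)
    u[q][k]:       P(category k on variable q | non-match)
    """
    like_match = p
    like_nonmatch = 1.0 - p
    for q, k in enumerate(categories):
        like_match *= m[q][k]
        like_nonmatch *= u[q][k]
    return like_match / (like_match + like_nonmatch)

# Hypothetical parameters: one variable with two categories, equal prior.
m = [[0.1, 0.9]]
u = [[0.9, 0.1]]
print(posterior_match_probability([1], 0.5, m, u))
```

A pair would then be classified as a link when this probability exceeds the single chosen threshold.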

  16. Thank you for your attention
