11/25/2010

Privacy-Preserving Record Linkage
Elizabeth Ashley Durham
Health Information Privacy Lab
Department of Biomedical Informatics
Vanderbilt University
Wednesday, 24 November, 2010

Record linkage

Set of records from Vanderbilt     Set of records from Emory
First Name   Last Name             First Name   Last Name
john         smith                 jon          smyth
lucille      ball                  taylor       swift
bill         clinton               william      clinton
hillary      clinton               jon          bon jovi

Privacy-preserving record linkage (PPRL)

Set of records from Vanderbilt     Set of records from Emory
First Name   Last Name             First Name   Last Name
john         smith                 jon          smyth
lucille      ball                  taylor       swift
bill         clinton               william      clinton
hillary      clinton               jon          bon jovi

[Figure: a barrier labeled POLICY separates the two sets of records]

PPRL applications in healthcare: sharing patient data for research

The NIH requires researchers to share de-identified patient data
• U.S. National Institutes of Health (NIH) data sharing policy
  • “Data should be made as widely & freely available as possible”
  • Researchers who receive $500,000 must develop a data sharing plan or describe why data sharing is not possible
  • Derived data must be shared in a manner that is devoid of “identifiable information”
• NIH supported genome-wide association studies policy
  • Researchers funded for genome-wide association studies must share data

Duplicates: a flaw in the current model for sharing de-identified data

Vanderbilt
ID   First Name   Last Name   Diagnosis   Outcome
V1   john         smith       flu         fatal
V2   lucille      ball        flu         fatal

Emory
ID   First Name   Last Name   Diagnosis   Outcome
E1   jon          smyth       flu         fatal
E2   taylor       swift       flu         surv

Shared with NIH: V1:flu,fatal   V2:flu,fatal   E1:flu,fatal   E2:flu,surv

NIH
Diagnosis   Outcome
flu         fatal
flu         fatal
flu         fatal
flu         surv

Fragmented data: a flaw in the current model for sharing de-identified data

Vanderbilt
ID   First Name   Last Name   Diagnosis   Outcome
V1   john         smith       flu         ??
V2   lucille      ball        flu         fatal

Emory
ID   First Name   Last Name   Diagnosis   Outcome
E1   jon          smyth       ??          fatal
E2   taylor       swift       flu         surv

Shared with NIH: V1:flu,??   V2:flu,fatal   E1:??,fatal   E2:flu,surv

NIH
Diagnosis   Outcome
flu         ??
flu         fatal
??          fatal
flu         surv

PPRL can improve the model for sharing de-identified data and enable more effective medical research

Vanderbilt
ID   First Name   Last Name   Diagnosis   Outcome
V1   john         smith       flu         ??
V2   lucille      ball        flu         fatal

Emory
ID   First Name   Last Name   Diagnosis   Outcome
E1   jon          smyth       ??          fatal
E2   taylor       swift       flu         surv

Hashed identifiers shared for linkage: V1:H(john),H(smith)   V2:H(lucille),H(ball)   E1:H(jon),H(smyth)   E2:H(taylor),H(swift)
Linked pair: V1-E1
De-identified data shared: V1:flu,??   V2:flu,fatal   E1:??,fatal   E2:flu,surv

NIH
Diagnosis   Outcome
flu         fatal     (V1-E1)
flu         fatal
flu         surv

where H denotes a hash function

PPRL applications in healthcare: improving patient care

[Figure: several separate records for "john"]

Other PPRL applications
• Business
• Counter-terrorism efforts

Roadmap
• Definition
• Motivation
• Record linkage
• Privacy-preserving record linkage
  – Background
  – Experimental design
  – Experimental results
  – Discussion
  – Open questions in record linkage
  – Conclusion

Steps in record linkage

Blocking → Field comparison → Record pair comparison → Record pair classification → Matches / Non-matches

A few assumptions…
1) common schema
2) common method of data standardization
3) records from an institution have been deduplicated (i.e., record linkage has been applied within each institution such that an individual is represented by only a single record within an institution)

Blocking: sample dataset

Set of records from Vanderbilt     Set of records from Emory
First Name   Last Name             First Name   Last Name
john         smith                 jon          smyth
lucille      ball                  taylor       swift
bill         clinton               william      clinton
hillary      clinton               jon          bon jovi

Blocking

No blocking: |Vanderbilt||Emory| = 16 record pair comparisons
Blocking (first letter of last name): 5 record pair comparisons

[Figure: each Vanderbilt record (john smith, lucille ball, bill clinton, hillary clinton) connected to the Emory records it is compared with, with matches and non-matches marked]

Blocking: another perspective

Block S (2 record pair comparisons)
Vanderbilt                 Emory
First Name   Last Name     First Name   Last Name
john         smith         jon          smyth
                           taylor       swift

Block B (1 record pair comparison)
Vanderbilt                 Emory
First Name   Last Name     First Name   Last Name
lucille      ball          jon          bon jovi

Block C (2 record pair comparisons)
Vanderbilt                 Emory
First Name   Last Name     First Name   Last Name
bill         clinton       william      clinton
hillary      clinton
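
The blocking step above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the function names `blocking_key` and `candidate_pairs` are hypothetical, and the records are the sample dataset from the slides. It groups Emory records by the first letter of the last name and pairs each Vanderbilt record only with records in the same block.

```python
from collections import defaultdict
from itertools import product

# Sample dataset from the slides: (first name, last name).
vanderbilt = [("john", "smith"), ("lucille", "ball"),
              ("bill", "clinton"), ("hillary", "clinton")]
emory = [("jon", "smyth"), ("taylor", "swift"),
         ("william", "clinton"), ("jon", "bon jovi")]


def blocking_key(record):
    """Blocking key used on the slides: first letter of the last name."""
    _first_name, last_name = record
    return last_name[0]


def candidate_pairs(left, right, key=blocking_key):
    """Return only the cross-institution pairs that share a blocking key."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[key(rec)].append(rec)
    return [(l, r) for l in left for r in blocks.get(key(l), [])]


all_pairs = list(product(vanderbilt, emory))          # no blocking
blocked_pairs = candidate_pairs(vanderbilt, emory)    # with blocking

print(len(all_pairs), len(blocked_pairs))  # 16 5
```

The counts agree with the slides: 16 comparisons without blocking, and 2 + 1 + 2 = 5 comparisons across blocks S, B, and C.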

The field comparison step of record linkage

Fields:                     First Name   Last Name
Record V1:                  john         smith
Record E1:                  jon          smyth
                            (Similarity Function)
Field comparison vector:    0.75         0.8
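
The slide does not name the similarity function. As an assumption, the sketch below uses a normalized edit-distance similarity, 1 - distance / (length of the longer string), which happens to reproduce the 0.75 and 0.8 shown for this record pair. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed similarity: 1 - edit distance / length of the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def field_comparison_vector(record_a, record_b):
    """Compare two records field by field."""
    return [similarity(x, y) for x, y in zip(record_a, record_b)]


vector = field_comparison_vector(("john", "smith"), ("jon", "smyth"))
print([round(s, 2) for s in vector])  # [0.75, 0.8]
```

Other string comparators would give different values for the same pair, so the agreement with the slide should be read as suggestive only.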

The record pair comparison step of record linkage

Fields:                            First Name   Last Name
Record V1:                         john         smith
Record E1:                         jon          smyth
                                   (Similarity Function)
Field comparison vector:           0.75         0.8
Record pair similarity “score”:    0.79
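
The slide does not say how the field comparison vector is collapsed into the single score of 0.79. One possibility is a weighted average of the field similarities. The weights 0.2 and 0.8 below are purely an assumption, chosen only because they reproduce the number on the slide; `record_pair_score` is a hypothetical name.

```python
def record_pair_score(field_similarities, field_weights):
    """Weighted average of per-field similarities (an assumed scoring rule)."""
    if len(field_similarities) != len(field_weights):
        raise ValueError("one weight is needed per field")
    total_weight = sum(field_weights)
    weighted = sum(s * w for s, w in zip(field_similarities, field_weights))
    return weighted / total_weight


# Field comparison vector from the slide; weights are hypothetical.
score = record_pair_score([0.75, 0.8], [0.2, 0.8])
print(round(score, 2))  # 0.79
```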

The record pair classification step of record linkage

Vanderbilt records   Emory records   Record pair similarity “score”   Record pair classification
john smith           jon smyth       +7                               Match
john smith           taylor swift    +3                               Non-match
lucille ball         jon smyth       +0                               Non-match
lucille ball         taylor swift    +0                               Non-match
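
A simple way to turn scores into classifications is a fixed threshold. The slide does not state the decision rule, so the threshold of 5 below is hypothetical: any value above 3 and at most 7 yields the same labels as the table.

```python
# Hypothetical threshold; the slide does not state the decision rule.
MATCH_THRESHOLD = 5

# (Vanderbilt record, Emory record, similarity score) from the slide.
scored_pairs = [
    ("john smith", "jon smyth", 7),
    ("john smith", "taylor swift", 3),
    ("lucille ball", "jon smyth", 0),
    ("lucille ball", "taylor swift", 0),
]


def classify(score, threshold=MATCH_THRESHOLD):
    """Label a record pair as a match when its score reaches the threshold."""
    return "Match" if score >= threshold else "Non-match"


labels = [classify(score) for _v, _e, score in scored_pairs]
for (v, e, score), label in zip(scored_pairs, labels):
    print(f"{v:<14}{e:<14}{score:+d}  {label}")
```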

How do we do all of this in a privacy-preserving manner?

Background

Field comparison: binary, approximate
Record pair comparison: Fellegi-Sunter, Winkler FS

Binary Field Comparison

Fields:                     First Name   Last Name   City
Record V1:                  john         smith       nashville
SHA(Record V1):             xy9l         br3f        xt0uv
Record E1:                  jon          smyth       nashville
SHA(Record E1):             nw2          vwer        xt0uv
equal?
Field Comparison Vector:    0            0           1

where SHA refers to the Secure Hash Algorithm
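
The binary comparison above can be sketched with Python's standard `hashlib`. The slide says only "SHA", so the choice of SHA-256 here is an assumption, and the helper names are hypothetical. Each institution hashes its own field values, and only the hashes are compared.

```python
import hashlib


def sha_hex(value: str) -> str:
    """Hash one field value (SHA-256 is an assumed choice of SHA variant)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def hash_record(record):
    """Hash every field of a record independently."""
    return [sha_hex(field) for field in record]


def binary_comparison_vector(hashed_a, hashed_b):
    """1 where the hashed fields are identical, 0 otherwise."""
    return [1 if x == y else 0 for x, y in zip(hashed_a, hashed_b)]


v1 = hash_record(("john", "smith", "nashville"))
e1 = hash_record(("jon", "smyth", "nashville"))
print(binary_comparison_vector(v1, e1))  # [0, 0, 1]
```

Because a one-character difference changes the whole hash, "john" and "jon" disagree completely, which is why this comparison is binary. Plain unkeyed hashes of names can also be guessed by hashing a dictionary of common names, so a keyed hash such as HMAC with a secret shared by the data holders is a common hardening step.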

Approximate Field Comparison

record V1: john → bigrams _j, jo, oh, hn, n_
record E1: jon → bigrams _j, jo, on, n_

[Figure: each bigram is mapped by hash functions h1 and h2 to bit positions in a Bloom filter, giving bit vector α for john and bit vector β for jon]

Dice coefficient: $D(\alpha, \beta) = \frac{2\,|\alpha \cap \beta|}{|\alpha| + |\beta|} = \frac{2 \cdot 5}{13} \approx 0.77$

Schnell 2009
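
A minimal sketch of the Bloom filter encoding and Dice comparison follows. The filter length of 32 bits, the use of SHA-1 and SHA-256 to derive the two hash functions, and all function names are assumptions for illustration, so the resulting similarity will generally differ from the 0.77 on the slide.

```python
import hashlib


def bigrams(name: str):
    """Pad with underscores and split into overlapping 2-character grams."""
    padded = f"_{name}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def bloom_bits(name: str, size: int = 32, num_hashes: int = 2):
    """Return the set of bit positions set when the name's bigrams are hashed.

    Assumed construction: position_i = (h1 + i * h2) mod size, with h1 and h2
    derived from SHA-1 and SHA-256 digests of the bigram.
    """
    positions = set()
    for gram in bigrams(name):
        data = gram.encode("utf-8")
        h1 = int.from_bytes(hashlib.sha1(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha256(data).digest(), "big")
        for i in range(num_hashes):
            positions.add((h1 + i * h2) % size)
    return positions


def dice(a: set, b: set) -> float:
    """Dice coefficient: 2 * |a AND b| / (|a| + |b|) over set bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


alpha = bloom_bits("john")
beta = bloom_bits("jon")
print(bigrams("john"), bigrams("jon"))
print(round(dice(alpha, beta), 2))
```

Names that share bigrams set some of the same bits, so similar names score above zero even though neither party sees the other's plaintext.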

Fellegi-Sunter (FS)

Conditional probability vectors:
Fields:       First Name   Last Name
Match:        0.8          0.9
Non-match:    0.05         0.02

Weight vectors:
Fields:          First Name   Last Name
Agreement:       1.2          1.95
Disagreement:    -0.68        -1

Last Name agreement weight: $\log \frac{0.9}{0.02}$
Last Name disagreement weight: $\log \frac{1 - 0.9}{1 - 0.02}$

Fellegi 1969

Fellegi-Sunter (FS)

Fields:                   First Name   Last Name
Agreement weights:        1.2          1.95
Disagreement weights:     -0.68        -1

Fields:                      First Name   Last Name
Record V1:                   john         smith
Record E1:                   jon          smyth
Field comparison vector:     0            0
Weight vector:               -0.68        -1
Record pair similarity score (Σ): -1.68

Fellegi 1969
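
The Fellegi-Sunter weights can be recomputed from the conditional probabilities on the slide. The sketch below assumes base-10 logarithms, since that base reproduces the 1.2, -0.68, and roughly -1 shown; the function names are hypothetical. With the stated non-match probability of 0.02, the Last Name agreement weight comes out at about 1.65 rather than the 1.95 on the slide, a value that would instead correspond to a non-match probability of 0.01.

```python
import math

# Conditional probabilities of field agreement, as given on the slide.
m = {"first_name": 0.8, "last_name": 0.9}    # among true matches
u = {"first_name": 0.05, "last_name": 0.02}  # among true non-matches


def agreement_weight(m_prob: float, u_prob: float) -> float:
    """log10(m / u): evidence contributed when the field agrees."""
    return math.log10(m_prob / u_prob)


def disagreement_weight(m_prob: float, u_prob: float) -> float:
    """log10((1 - m) / (1 - u)): evidence contributed when it disagrees."""
    return math.log10((1 - m_prob) / (1 - u_prob))


def fs_score(comparison: dict, m: dict, u: dict) -> float:
    """Sum the agreement or disagreement weight of every field."""
    total = 0.0
    for field, agrees in comparison.items():
        if agrees:
            total += agreement_weight(m[field], u[field])
        else:
            total += disagreement_weight(m[field], u[field])
    return total


# john smith vs. jon smyth under binary comparison: both fields disagree.
comparison = {"first_name": 0, "last_name": 0}
for field in m:
    print(field,
          round(agreement_weight(m[field], u[field]), 2),
          round(disagreement_weight(m[field], u[field]), 2))
print(round(fs_score(comparison, m, u), 2))
```

The unrounded score is about -1.67; the slide's -1.68 results from adding the already rounded weights -0.68 and -1.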