
Quantitative Methods to Measure the Risk of Re-identification



  1. Quantitative Methods to Measure the Risk of Re-identification: Methodology Review. Bradley Malin, Ph.D., Professor (Biomedical Informatics, Biostatistics, & Computer Science), Vanderbilt University. November 30, 2017.

  2. A Very Simplified View on Risk: P(reid) ≈ Data (• Distinguishability • Replicability • Availability)

  3. Risk is Contextual. Sample: Bill, Mike, Brad

  4. Risk is Contextual. Population (multiple clinical trials): Mitch, Bill, Will, Mike, Brad, Abe. Sample: Bill, Mike, Brad

  5. Risk is Contextual. Population (all eligible people): Mitch, Bill, Will, Mike, Brad, Abe. Sample: Bill, Mike, Brad

  6. Risk Measures
     ● Worst Case Risk Measures
       ● Prosecutor: Most risky record in the sample
       ● Journalist: Most risky record in the population
     ● Amortized (or Average) Measure
       ● Marketer: Expected risk of an arbitrary record

  7. A Very Simplified View on Risk: P(reid) ≈ P(knowledge) * P(reid | knowledge)
     ● P(knowledge), the Prior: • Neighbor (friend) • Prosecutor • Journalist • … pick a framework
     ● P(reid | knowledge), from the Data: • Distinguishability • Replicability • Availability

  8. Prosecutor Risk = 1 (because of Mike). Sample: Bill, Mike, Brad

  9. Journalist Risk = ½ = 0.5 (because of Bill & Brad). Population: Mitch, Bill, Will, Mike, Brad, Abe. Sample: Bill, Mike, Brad

  10. Marketer Risk (Sample) = (½ + ½ + 1) / 3 = 2/3. Sample: Bill (½), Brad (½), Mike (1)

  11. Marketer Risk (Population) = (.5 + .5 + .33) / 3 = .44. Sample: Bill (1/2), Brad (1/2), Mike (1/3, grouped with Will and Abe in the population)
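The arithmetic on slides 8 to 11 can be reproduced with a short sketch. It assumes each record's risk is 1 divided by the number of records it is indistinguishable from. The group labels (`group_a`, `group_b`, `group_x`) are hypothetical stand-ins for quasi-identifier values, chosen only so that the group sizes match the slides.

```python
from collections import Counter

# Hypothetical quasi-identifier groups (not from the slides): Bill and Brad
# look alike; Mike looks like Will and Abe; Mitch matches nobody.
population = {
    "Mitch": "group_x",
    "Bill": "group_a",
    "Brad": "group_a",
    "Mike": "group_b",
    "Will": "group_b",
    "Abe": "group_b",
}
sample = {name: population[name] for name in ["Bill", "Mike", "Brad"]}


def record_risks(records, reference):
    """Risk of each record = 1 / size of its matching group in `reference`."""
    sizes = Counter(reference.values())
    return {name: 1 / sizes[group] for name, group in records.items()}


sample_risk = record_risks(sample, sample)          # groups counted in the sample
population_risk = record_risks(sample, population)  # groups counted in the population

prosecutor = max(sample_risk.values())       # riskiest record, sample as reference
journalist = max(population_risk.values())   # riskiest record, population as reference
marketer_sample = sum(sample_risk.values()) / len(sample)
marketer_population = sum(population_risk.values()) / len(sample)

print(prosecutor, journalist, marketer_sample, marketer_population)
```

Under these assumptions the four values come out as 1, 0.5, 2/3 and roughly 0.44, matching the slides.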

  12. Central Dogma of Re-identification. Three necessary conditions: Anonymised Data, Linking Mechanism, Identified Data. (Malin, Benitez, Loukides, & Clayton. Human Genetics. 2011.)

  13. A Famous Linkage Attack: High Profile Re-identification (Sweeney, JLME. 1997)
     ● U.S. Hospital Discharge Data only: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
     ● Shared by both: ZIP Code, Birthdate, Gender
     ● U.S. Voter List only: Name, Address, Date registered, Party affiliation, Date last voted

  14. But Availability of Demographics Varies… (Benitez & Malin. JAMIA. 2010.)
     ● IL. Who: Registered Political Committees (ANYONE – In Person). Format: Disk. Cost: $500
     ● MN. Who: MN Voters. Format: Disk. Cost: $46; “use ONLY for elections, political activities, or law enforcement”
     ● TN. Who: Anyone. Format: Disk. Cost: $2500
     ● WA. Who: Anyone. Format: Disk. Cost: $30
     ● WI. Who: Anyone. Format: Disk. Cost: $12,500
     ● Fields compared across states: Name, Address, Date of Birth, Sex, Race, Phone Number

  15. Adversaries are Not All Knowing
     ● Research subject demographics
       ● Unique (potential)
       ● Replicable
       ● Available
     ● Series of drug doses administered
       ● Unique (potential)
       ● Not replicable
       ● Not available

  16. Adversaries are Not All Knowing (Assume knowledge between x and y features)
     Example record, repeated on the slide (Feature: Observation): Date of Birth: 1/1/1970; Gender: Male; Race: White; Country: France; Death: Yes; Occupation: Teacher; Married: Yes

  17. … Which Means No Single Risk Score
     Example record, repeated on the slide (Feature: Observation): Date of Birth: 1/1/1970; Gender: Male; Race: White; Country: France; Death: Yes; Occupation: Teacher; Married: Yes
     Plot axes: Frequency, Re-identification Risk
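One way to see why slides 16 and 17 lead to a distribution of risks rather than a single score: enumerate every set of between x and y features the adversary might know, and compute the target's risk under each. The sketch below is illustrative only. The four toy records and the helper names (`risk_given_knowledge`, `risk_profile`) are hypothetical; only the first record borrows values from the slide.

```python
from itertools import combinations

# Hypothetical toy dataset; only the first record borrows values from the slide.
records = [
    {"dob": "1/1/1970", "gender": "Male", "country": "France", "occupation": "Teacher"},
    {"dob": "1/1/1970", "gender": "Male", "country": "France", "occupation": "Nurse"},
    {"dob": "1/1/1970", "gender": "Female", "country": "France", "occupation": "Teacher"},
    {"dob": "2/2/1980", "gender": "Male", "country": "Spain", "occupation": "Teacher"},
]


def risk_given_knowledge(target, records, known):
    """1 / number of records that match the target on the known features."""
    matches = sum(all(r[f] == target[f] for f in known) for r in records)
    return 1 / matches


def risk_profile(target, records, x, y):
    """Risk for every feature subset of size x..y the adversary might know."""
    features = sorted(target)
    return {
        known: risk_given_knowledge(target, records, known)
        for k in range(x, y + 1)
        for known in combinations(features, k)
    }


profile = risk_profile(records[0], records, 1, 2)
for known, risk in profile.items():
    print(known, round(risk, 2))
```

Counting how often each risk value occurs in `profile` gives a frequency-versus-risk picture of the kind the slide plots, under the assumption that every feature subset is equally plausible.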

  18. You May Not See Everything
     ● Field structured databases – you can issue exact queries!
     ● Semi-structured reports and narratives – you must rely on a mix of
       ● Artificial intelligence
       ● Human review
       ● Sampling
     ● Problem has become a bit more tricky because leaks can now include explicitly identifying values

  19. A Natural Language Setting
     ● Create a Gold Standard Dataset
       ● Ask X humans to read and label a selection of records
       ● Ensure concordance between human annotations (e.g., via a Cohen’s Kappa)
     ● Apply (manual or automated) identifier detection strategy
     ● Compute performance in terms of:
       ● (R)ecall – rate at which real identifier instances were detected
       ● (P)recision – rate at which claimed identifiers are in fact real
       ● F-measure – weighted harmonic mean of R and P
     Original PHI (example note): Smith, 61 yo ... daughter, Lynn, to ... oncologist Dr. White ... 5/13/10 to consider ... SWOG protocol 1811, ... was randomized 5/10 ... to call Mr. Smith on ... PLAN: Dr White and I ...
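The three performance measures on slide 19 can be computed as in this sketch. It assumes identifier instances are compared as exact (offset, text) pairs and uses the standard F-beta formula, which reduces to the balanced F1 when `beta` is 1. The offsets and the detector output are hypothetical, made up for the example.

```python
def detection_scores(gold, predicted, beta=1.0):
    """Return (recall, precision, F-measure) for a set of detected identifiers."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    recall = true_positives / len(gold) if gold else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    if precision == 0.0 and recall == 0.0:
        return recall, precision, 0.0
    f_measure = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return recall, precision, f_measure


# Hypothetical gold-standard annotations as (character offset, text) pairs.
gold = {(0, "Smith"), (22, "Lynn"), (41, "White"), (50, "5/13/10"), (75, "1811")}
# Hypothetical detector output: three real identifiers plus one false alarm.
predicted = {(0, "Smith"), (41, "White"), (50, "5/13/10"), (90, "randomized")}

recall, precision, f1 = detection_scores(gold, predicted)
print(recall, precision, round(f1, 3))
```

Here the detector misses two real identifiers (lowering recall) and raises one false alarm (lowering precision).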

  20. A Natural Language Setting
     ● A leak is not necessarily a re-identification
     ● Need to assess the re-identification potential given the leak rates
     ● Can assess on a per research subject level (if labels are person specific): P(date of birth leak, date of death leak)
     ● Alternatively, may assume each feature leaks at random: P(date of birth leak, date of death leak) ≈ P(date of birth leak) * P(date of death leak)
     Original PHI: Smith, 61 yo ... daughter, Lynn, to ... oncologist Dr. White ... 5/13/10 to consider ... SWOG protocol 1811, ... was randomized 5/10 ... to call Mr. Smith on ... PLAN: Dr White and I ...
     **Redacted PHI & Leaked PHI: **pt_name<A>, **age<60s> yo ... daughter, Lynn, to ... oncologist Dr. **MD_name<C> ... **date<5/28/10> to consider ... SWOG protocol **other_id, ... was randomized 5/10 ... to call Mr. **pt_name<A> on ... PLAN: Dr White and I ...
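The two options on slide 20 can be compared directly: a joint leak rate measured per research subject, and the product of per-feature leak rates under the assumption that features leak independently. The leak table and function names below are hypothetical; the sketch only illustrates that the two estimates can differ when leaks are correlated.

```python
from math import prod

# Hypothetical per-subject leak labels: True means the feature leaked
# through redaction for that subject.
leaks = [
    {"dob": True, "dod": True},
    {"dob": True, "dod": True},
    {"dob": False, "dod": False},
    {"dob": False, "dod": False},
]


def joint_leak_rate(leaks, features):
    """Fraction of subjects for whom every listed feature leaked."""
    return sum(all(subject[f] for f in features) for subject in leaks) / len(leaks)


def independent_leak_rate(leaks, features):
    """Product of per-feature leak rates (assumes features leak at random)."""
    return prod(sum(subject[f] for subject in leaks) / len(leaks) for f in features)


per_subject = joint_leak_rate(leaks, ["dob", "dod"])
independent = independent_leak_rate(leaks, ["dob", "dod"])
print(per_subject, independent)
```

In this made-up table the leaks are perfectly correlated, so the independence approximation (0.25) understates the per-subject joint rate (0.5).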

  21. Hiding in Plain Sight (Carrell et al, 2013)
     ● Must be careful – this is a relatively new technique
     ● Also evidence to suggest a computer may mimic the initial detection strategy… and redact the fakes situated in a pattern (Li et al, 2016)
     ● In this case, we would need two risk measures
       ● Human recognition risk
       ● Computer-assisted recognition risk
     ● Recent approach to prevent this problem, but comes at a massive loss in precision (Li et al, 2017)
     Original PHI: Smith, 61 yo ... daughter, Lynn, to ... oncologist Dr. White ... 5/13/10 to consider ... SWOG protocol 1811, ... was randomized 5/10 ... to call Mr. Smith on ... PLAN: Dr White and I ...
     **Redacted PHI & Leaked PHI: **pt_name<A>, **age<60s> yo ... daughter, Lynn, to ... oncologist Dr. **MD_name<C> ... **date<5/28/10> to consider ... SWOG protocol **other_id, ... was randomized 5/10 ... to call Mr. **pt_name<A> on ... PLAN: Dr White and I ...
     Surrogate PHI & Hidden PHI: Jones, a 64 yo ... daughter, Lynn, for ... oncologist Dr. Howe ... 5/28/10 to consider ... SWOG protocol 1798, ... was randomized 5/10 ... to call Mr. Jones on ... PLAN: Dr White and I ...

  22. If Time Permits… If Not…

  23. Latest Development
     ● Methods presented model data sharer and adversary separately
     ● New approaches use game theory to consider their interactions (Wan et al 2015)
     ● Game theory requires robust estimates of many parameters, such as
       ● benefit the sharer gets in providing data
       ● benefit the attacker gets in re-identifying the data

  24. Stackelberg Game
     ● Publisher: Sharing Strategy 1 (Utility 1, Risk ???)
     ● Recipient: Attack Strategy A (Utility A, Risk A); Attack Strategy B (Utility B, Risk B); Attack Strategy C (Utility C, Risk C)
     ● Strategies: Generalize Age, Suppress Dates, Perturb Geography, …

  25. Stackelberg Game
     ● Publisher: Sharing Strategy 1 (Utility 1, Risk ???)
     ● Recipient: Attack Strategy A (Utility A, Risk A); Attack Strategy B (Utility B, Risk B), the Recipient’s Best Strategy; Attack Strategy C (Utility C, Risk C)

  26. Stackelberg Game
     ● Publisher: Sharing Strategy 1 (Utility 1, Risk B)
     ● Recipient: Attack Strategy A (Utility A, Risk A); Attack Strategy B (Utility B, Risk B), the Recipient’s Best Strategy; Attack Strategy C (Utility C, Risk C)

  27. Stackelberg Game
     ● Publisher: Sharing Strategy 1 (Utility 1, Risk B); Sharing Strategy 2 (Utility 2, Risk ???)
     ● Recipient: Attack Strategy A (Utility A, Risk A); Attack Strategy B (Utility B, Risk B); Attack Strategy C (Utility C, Risk C)

  28. Stackelberg Game
     ● Publisher: Sharing Strategy 1 (Utility 1, Risk B); Sharing Strategy 2 (Utility 2, Risk A)
     ● Recipient: Attack Strategy A (Utility A, Risk A), the Recipient’s Best Strategy; Attack Strategy B (Utility B, Risk B); Attack Strategy C (Utility C, Risk C)

  29. Stackelberg Game
     ● Publisher: Sharing Strategy 1 (Utility 1, Risk B); Sharing Strategy 2 (Utility 2, Risk A); Sharing Strategy Z (Utility Z, Risk Z)

  30. Stackelberg Game
     ● Publisher: Sharing Strategy 1 (Utility 1, Risk B); Sharing Strategy 2 (Utility 2, Risk A); Sharing Strategy Z (Utility Z, Risk Z)
     ● Choose strategy that maximizes overall benefit
     ● Optimizes the Risk-Utility tradeoff
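The walk-through on slides 24 to 30 can be sketched as a small leader-follower search: for each sharing strategy the recipient picks the attack that pays it best, and the publisher then picks the sharing strategy with the best resulting payoff. This is a minimal illustration, not the model from Wan et al. All payoff numbers are hypothetical, and "publisher payoff = utility minus risk under the recipient's best attack" is an assumed simplification.

```python
# Hypothetical payoffs. For each sharing strategy: the publisher's utility,
# the recipient's payoff for each attack, and the publisher's risk under
# each attack.
game = {
    "Sharing Strategy 1": {
        "utility": 10.0,
        "attacker_payoff": {"A": 1.0, "B": 3.0, "C": 2.0},
        "risk": {"A": 2.0, "B": 6.0, "C": 4.0},
    },
    "Sharing Strategy 2": {
        "utility": 8.0,
        "attacker_payoff": {"A": 2.0, "B": 1.0, "C": 0.5},
        "risk": {"A": 1.0, "B": 3.0, "C": 2.0},
    },
}


def best_response(attacker_payoff):
    """The follower (recipient) picks the attack with the highest payoff."""
    return max(attacker_payoff, key=attacker_payoff.get)


def solve_stackelberg(game):
    """The leader (publisher) anticipates the best response to each strategy."""
    outcomes = {}
    for sharing, spec in game.items():
        attack = best_response(spec["attacker_payoff"])
        # Assumed payoff model: utility minus the risk under that attack.
        outcomes[sharing] = (attack, spec["utility"] - spec["risk"][attack])
    chosen = max(outcomes, key=lambda s: outcomes[s][1])
    return chosen, outcomes


chosen, outcomes = solve_stackelberg(game)
print(chosen, outcomes)
```

With these made-up numbers, Sharing Strategy 1 draws attack B and Sharing Strategy 2 draws attack A, mirroring the slides. The publisher prefers Sharing Strategy 2 because its lower utility is more than offset by lower risk.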

  31. Demographic Case Study
     ● $1200: Benefit per record
     ● $300: Cost per violation
     ● $4: Access cost per record
     ● ~30,000 Census records
     ● Average Payoff Per Record (plot): Publisher axis from $0.00 to $1,500.00, Attacker axis from $0.00 to $3.00; labeled point: US Safe Harbor
