The De-identification of Longitudinal and Geospatial Data



  1. The De-identification of Longitudinal and Geospatial Data
     Khaled El Emam, CHEO RI & uOttawa
     Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca
     Context
     • The disclosure of health information for secondary purposes, such as research
     • Many practical challenges to obtaining express individual consent from patients for large databases
     • Even if express consent can be obtained, there is compelling evidence that consenters differ from non-consenters, which introduces bias
       – age, sex, race, marital status, educational level, socioeconomic status, health status, mortality, lifestyle factors, functioning

  2. Complex Data
     • Simplest data: cross-sectional
     • Longitudinal data: data about individuals over time
       – Patients with multiple visits
       – Patients with multiple insurance claims
     • Geospatial data: data that contains location information
       – Residence postal codes or dissemination areas
     Longitudinal Data
     • This will usually have higher re-identification risk than cross-sectional data
     • We need to make assumptions about the background information of the adversary (how many visits/claims they will have information about)
     • Multiple attributes collected per visit/claim

  3. Residence Trails
     • In previous work we examined the re-identification risk from residence trails
     • The model was based on RAMQ data
     • Only considers location information over time and does not take the specifics of a data set into account
     • Looked at uniqueness only
     • For example, …
     Representation - Tree
     [Figure: tree for one patient. Root {7/8/2000, M}; children {1/1/2009, K7G2C3}, {14/1/2009, K7G2C3}, {18/4/2009, K7G2C4}]
     • Each patient can be represented as a tree (multiple levels are possible)
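The tree representation can be sketched as a small nested data structure. This is only an illustration: the class names `Patient` and `Visit` and the helper `residence_trail` are hypothetical and do not come from the slides; the values are those shown for the example patient.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Visit:
    """Second tree level: one visit with its date and residence postal code."""
    date: str          # dd/mm/yyyy, as written on the slide
    postal_code: str


@dataclass
class Patient:
    """First tree level: demographics, with the visits as child nodes."""
    dob: str
    gender: str
    visits: List[Visit] = field(default_factory=list)


# The example patient from the slide.
patient = Patient(
    dob="7/8/2000",
    gender="M",
    visits=[
        Visit("1/1/2009", "K7G2C3"),
        Visit("14/1/2009", "K7G2C3"),
        Visit("18/4/2009", "K7G2C4"),
    ],
)


def residence_trail(p: Patient) -> List[str]:
    """Postal codes in visit order, with consecutive repeats collapsed."""
    trail: List[str] = []
    for visit in p.visits:
        if not trail or trail[-1] != visit.postal_code:
            trail.append(visit.postal_code)
    return trail
```

Deeper trees (for example, several attributes under each visit) would add further nested lists in the same way.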

  4. Representation - Tables
     PID  DoB        Gender
     10   7/8/2000   M
     11   1/1/1975   F
     12   24/6/1975  F
     13   17/8/1975  F
     14   18/9/1975  F
     15   12/2/2000  M

     PID  Visit Date  Postal Code
     10   1/1/2009    K7G2C3
     10   14/1/2009   K7G2C3
     10   18/4/2009   K7G2C4
     11   1/1/2009    K1V7E6
     11   20/1/2009   K1V7E8
     11   22/2/2009   K1V7E8
     12   15/12/2008  K1Y4L5
     12   20/1/2009   K1V7E8
     13   22/12/2008  K1Z5H9
     14   13/1/2009   K1Y4L5
     15   20/4/2009   K7G2G5

     Reduction in Information Loss
     Dataset     Reduction in Entropy
     California  71%
     Florida     71%
     New York    64%
     Washington  80%
     • Comparison of two assumptions about adversary background knowledge
     • Incomplete knowledge has significantly less information loss
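To make the effect of the background-knowledge assumption concrete, here is a minimal sketch that checks uniqueness of postal-code trails in the visits table under two assumptions. It is not the risk model or the entropy measure behind the reported numbers. The names `TRAILS`, `smallest_match_count` and `unique_patients` are hypothetical, and the matching rules (exact trail equality for complete knowledge, containment of `k` known postal codes for incomplete knowledge) are simplifying assumptions.

```python
from itertools import combinations

# Distinct postal codes per patient, taken from the visits table (PID -> set).
TRAILS = {
    10: frozenset({"K7G2C3", "K7G2C4"}),
    11: frozenset({"K1V7E6", "K1V7E8"}),
    12: frozenset({"K1Y4L5", "K1V7E8"}),
    13: frozenset({"K1Z5H9"}),
    14: frozenset({"K1Y4L5"}),
    15: frozenset({"K7G2G5"}),
}


def smallest_match_count(pid, trails, k=None):
    """Smallest number of patients consistent with what the adversary knows.

    k=None: the adversary knows the target's whole trail (exact match).
    k=int:  the adversary knows any k postal codes from the target's trail;
            the worst case over all such choices is returned.
    """
    target = trails[pid]
    if k is None:
        return sum(1 for trail in trails.values() if trail == target)
    size = min(k, len(target))
    return min(
        sum(1 for trail in trails.values() if set(known) <= trail)
        for known in combinations(sorted(target), size)
    )


def unique_patients(trails, k=None):
    """PIDs that the assumed adversary could single out."""
    return {pid for pid in trails if smallest_match_count(pid, trails, k) == 1}


complete = unique_patients(TRAILS)         # every patient is unique
incomplete = unique_patients(TRAILS, k=1)  # patients 12 and 14 no longer are
```

In this toy data, assuming the adversary knows only one postal code leaves fewer patients unique, so less generalization would be needed to protect them.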

  5. Geographic Areas
     • Postal codes are most often collected from patients
     • Some disadvantages: they change over time (number and boundaries)
     • Census geography is more stable and postal codes can be converted to these
     • For now we are focusing on postal codes
     Postal Code Population Sizes
     P/T     # PC     Min  25th  50th  75th  Max
     AB      77,348   1    5     24    50    7,084
     BC      113,222  1    6     19    40    13,537
     MB      24,015   1    6     25    49    6,298
     NB      57,389   1    3     8     17    1,971
     NL      10,376   2    7     18    39    5,506
     NS      25,332   1    5     13    29    8,983
     NU/NWT  535      2    14    33    82    5,794
     ON      270,277  1    7     21    47    17,165
     PE      3,165    2    5     12    26    8,327
     QC      203,637  1    5     17    39    12,635
     SK      21,563   1    6     22    36    6,939
     YT      935      2    2     12    33    2,107

  6. [Figure: postal code population (y-axis POPULATION, 0 to 6000), rural vs. urban, one panel per province: AB, BC, MB, NB, NL, NS, ON, PE, QC, SK]
     Cropping
     • The most common way for de-identifying areas was to crop them (i.e., remove characters/digits from the end)
     • This works for postal codes because the areas are hierarchical in structure
     • Larger areas will have more people living in them, reducing the re-identification risk
     • But the effectiveness of this will depend on the data set itself and the variables that are being disclosed
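Cropping and its effect on risk can be sketched in a few lines. The sketch assumes prosecutor risk is 1 divided by the size of a record's equivalence class, and treats a record as high risk when that value is strictly greater than a threshold (0.2 is used as an illustrative default). The function names and the `SAMPLE` records are hypothetical and are not data from the slides.

```python
from collections import Counter


def crop(postal_code, keep):
    """Keep the first `keep` characters of a postal code (spaces removed)."""
    return postal_code.replace(" ", "")[:keep]


def high_risk_fraction(records, keep, threshold=0.2):
    """Fraction of records whose assumed prosecutor risk exceeds `threshold`.

    Each record is a tuple: (postal_code, other quasi-identifiers...).
    Records sharing the cropped postal code and the same other
    quasi-identifiers form one equivalence class.
    """
    keys = [(crop(pc, keep), *rest) for pc, *rest in records]
    class_sizes = Counter(keys)
    risky = sum(1 for key in keys if 1 / class_sizes[key] > threshold)
    return risky / len(keys)


# Hypothetical records: (postal code, age).
SAMPLE = [
    ("K1V7E6", 30),
    ("K1V7E8", 30),
    ("K1V7E8", 30),
    ("K1V7A1", 30),
    ("K1V8B2", 30),
    ("K7G2C3", 30),
]

full = high_risk_fraction(SAMPLE, keep=6)     # no class reaches 5 records
cropped = high_risk_fraction(SAMPLE, keep=3)  # the five K1V records share a class
```

Adding more quasi-identifiers to each tuple splits the classes further, which is why the benefit of cropping depends on the variables being disclosed.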

  7. Cropping Example
     Cropping     Postal Code  Postal Code + Age
     6 character  69%          97%
     5 character  27.5%        93%
     4 character  1%           39%
     3 character  0.3%         9%
     • Ontario birth registry example for one year
     • Table shows the percentage of records with a high risk at a 0.2 threshold (prosecutor risk)
     Aggregation
     • We have developed an algorithm that would aggregate small postal codes into larger ones
     • Finds much smaller areas that maintain acceptable re-identification risk levels (compared to cropping)
     • Allows much higher geographic specificity
     • We have shown that disease outbreak cluster detection is affected minimally by the aggregation
     • Currently implemented for postal codes; being developed for dissemination areas and ZCTAs
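The slides do not describe how the aggregation algorithm works, so the following is only a naive greedy sketch of the general idea and not the authors' method. It assumes a symmetric adjacency map between areas and a minimum population per aggregated area; the area labels, populations, `neighbours` map and `min_pop` value are all hypothetical.

```python
def aggregate(populations, neighbours, min_pop):
    """Greedily merge areas below `min_pop` into their smallest adjacent group.

    populations: area -> population
    neighbours:  area -> iterable of adjacent areas (assumed symmetric)
    Returns (members, pop): group label -> member areas, group label -> population.
    Groups with no neighbour to merge into are left as they are.
    """
    members = {area: {area} for area in populations}
    pop = dict(populations)
    owner = {area: area for area in populations}  # area -> current group label
    stuck = set()                                 # small groups that cannot merge

    while True:
        small = sorted(
            (pop[g], g) for g in pop if pop[g] < min_pop and g not in stuck
        )
        if not small:
            return members, pop
        _, group = small[0]  # handle the smallest group first
        adjacent = {
            owner[n] for area in members[group] for n in neighbours.get(area, ())
        } - {group}
        if not adjacent:
            stuck.add(group)
            continue
        target = min(adjacent, key=lambda g: (pop[g], g))
        members[target] |= members.pop(group)
        pop[target] += pop.pop(group)
        for area in members[target]:
            owner[area] = target


# Hypothetical areas laid out in a chain: A - B - C - D.
POPULATIONS = {"A": 3, "B": 4, "C": 10, "D": 2}
NEIGHBOURS = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

groups, group_pop = aggregate(POPULATIONS, NEIGHBOURS, min_pop=5)
```

Unlike cropping, a merge of this kind only enlarges the areas that are too small, which is the intuition behind keeping geographic specificity higher elsewhere.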

  8. Contact: kelemam@uottawa.ca; www.ehealthinformation.ca; www.ehealthinformation.ca/knowledgebase

  9. References
     • El Emam K, Buckeridge D, Tamblyn R, Neisa A, Jonker E, Verma A: The Re-identification Risk of Canadians from Longitudinal Demographics. BMC Medical Informatics and Decision Making, 11:46, 2011. DOI:10.1186/1472-6947-11-46.
     • El Emam K, Brown A, AbdelMalik P, Neisa A, Walker M, Bottomley J, Roffey T: A method for managing re-identification risk from small geographic areas in Canada. BMC Medical Informatics and Decision Making, 10, 2010.
     • El Emam K, Brown A, AbdelMalik P: Evaluating Predictors of Geographic Area Population Size Cutoffs to Manage Re-identification Risk. Journal of the American Medical Informatics Association, 16:256-266, 2009.
     • El Emam K, Dankar F, Issa R, Jonker E, Amyot D, Cogo E, Corriveau J-P, Walker M, Chowdhury S, Vaillancourt R, Roffey T, Bottomley J: A Globally Optimal k-Anonymity Method for the De-identification of Health Data. Journal of the American Medical Informatics Association, 16(5):670-682, 2009.
