De-identification 101, Khaled El Emam, CHEO RI & uOttawa

  1. De-identification 101
     Khaled El Emam, CHEO RI & uOttawa
     Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca

     Trends
     • Increasing demands for health data for secondary purposes
     • Stronger enforcement of regulations with severe penalties
     • One consequence is the need for more defensible methods for the de-identification of the data

  2. Most Common Secondary ‘Uses’
     Source: PricewaterhouseCoopers Survey, 2009. Transforming healthcare through secondary use of health data.

     Variable Distinctions
     • Directly identifying
       – Can uniquely identify an individual by itself or in conjunction with other readily available information
     • Quasi-identifiers
       – Can identify an individual by itself or in conjunction with other information
     • Sensitive variables

  3. Examples of Direct Identifiers
     • Name, address, telephone number, fax number, MRN, health card number, health plan beneficiary number, license plate number, email address, photograph, biometrics, SSN, SIN, implanted device number

     Examples of Quasi-Identifiers
     • Sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality
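
To illustrate why quasi-identifiers matter even after direct identifiers are removed, here is a minimal sketch that flags records whose combination of quasi-identifier values occurs only once. The records, the choice of fields (sex, year of birth, postal prefix), and the function name are all hypothetical and are not taken from the slides.

```python
from collections import Counter

# Hypothetical records: (sex, year_of_birth, postal_prefix).
# No names or other direct identifiers are present.
records = [
    ("F", 1980, "K1H"),
    ("F", 1980, "K1H"),
    ("M", 1975, "K1H"),
    ("F", 1992, "K2P"),
    ("M", 1975, "K1H"),
    ("M", 1960, "K2P"),
]


def unique_on_quasi_identifiers(rows):
    """Return the rows whose quasi-identifier combination occurs exactly once."""
    counts = Counter(rows)
    return [row for row in rows if counts[row] == 1]


singled_out = unique_on_quasi_identifiers(records)
print(singled_out)
```

A record that is unique on these fields could be matched by anyone who knows the same values about a person from another source, which is what makes such fields identifying in conjunction with other information.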

  4. Attribute vs Identity Disclosure
     • Attribute disclosure: discover something new about an individual in the database without knowing which record belongs to that individual
     • Identity disclosure: determine which record in the database belongs to a particular individual (for example, determining that record number 7 belongs to Bob Smith is identity disclosure)

     Attribute Disclosure - I
     • For example:

                         NOT HPV Vaccinated   HPV Vaccinated
       Religion A                 5                 40
       Not Religion A            40                  5

     • Statistically significant relationship (chi-square, p<0.05)
     • High risk of attribute disclosure

  5. Attribute Disclosure - II
     • Use suppression to eliminate risk:

                         NOT HPV Vaccinated   HPV Vaccinated
       Religion A                 5                  6
       Not Religion A             6                  5

     • Not statistically significant relationship (chi-square)
     • Low risk of attribute disclosure
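
The chi-square claims for the two 2x2 tables (5/40/40/5 before suppression, 5/6/6/5 after) can be checked with a short calculation. This is a minimal sketch assuming a plain Pearson chi-square test with one degree of freedom and no continuity correction; the slides do not say which variant was used, and the function name is illustrative.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


before = ((5, 40), (40, 5))  # counts before suppression
after = ((5, 6), (6, 5))     # counts after suppression

stat_before, p_before = chi_square_2x2(before)
stat_after, p_after = chi_square_2x2(after)
print(f"before: chi2={stat_before:.2f}, significant at 0.05: {p_before < 0.05}")
print(f"after:  chi2={stat_after:.2f}, significant at 0.05: {p_after < 0.05}")
```

With every expected count equal to 22.5 in the first table and 5.5 in the second, the statistic drops from roughly 54.4 to roughly 0.18, which is why the relationship stops being detectable after suppression.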

  6. Attribute Disclosure - III
     • Attribute disclosure is an important outcome of analytics. Whether it is acceptable to ask certain questions or discover certain things about individuals with certain characteristics is arguably more of an ethics question
     • The HIPAA regulations do not require one to address risks from attribute disclosure; only identity disclosure risks need to be addressed
     • All known re-identification attacks were identity disclosure

     Tables vs Records
     • Often disclosure control guidelines are stated in terms of tables or ‘aggregate’ tables
     • Tables of counts are exactly the same thing as individual-level data and can be converted from one to the other
     • When data is released in tabular format, however, additional issues can be raised:
       – Often small cells are suppressed. Using iterative algorithms it is easy to recover suppressed cells if the totals for (some of) the marginals are known
       – Tables with overlapping dimensions can leak information that is useful for recovering small cells
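
The two points under Tables vs Records can be sketched in code: converting between a table of counts and individual-level records, and recovering suppressed cells from known marginal totals. The data, function names, and the simple fill-in loop below are illustrative assumptions; the slides do not name a specific recovery algorithm.

```python
from collections import Counter


def table_to_records(counts):
    """Expand {(row_label, col_label): n} into one record per counted individual."""
    return [cell for cell, n in counts.items() for _ in range(n)]


def records_to_table(records):
    """Collapse individual-level records back into a table of counts."""
    return dict(Counter(records))


def recover_suppressed(cells, row_totals, col_totals):
    """Fill suppressed cells (None) wherever a row or column has exactly one unknown.

    Repeats until a full pass fills nothing; cells that the marginals
    do not determine stay None.
    """
    grid = [list(row) for row in cells]
    changed = True
    while changed:
        changed = False
        for i, row in enumerate(grid):
            missing = [j for j, v in enumerate(row) if v is None]
            if len(missing) == 1:
                known = sum(v for v in row if v is not None)
                row[missing[0]] = row_totals[i] - known
                changed = True
        for j in range(len(grid[0])):
            col = [row[j] for row in grid]
            missing = [i for i, v in enumerate(col) if v is None]
            if len(missing) == 1:
                known = sum(v for v in col if v is not None)
                grid[missing[0]][j] = col_totals[j] - known
                changed = True
    return grid


# Round trip between counts and records (zero-count cells would not survive it).
counts = {("A", "yes"): 2, ("A", "no"): 3, ("B", "yes"): 1}
round_trip = records_to_table(table_to_records(counts))

# Hypothetical 3x3 table with three suppressed cells and all marginals published.
published = [[None, None, 30], [8, None, 25], [20, 15, 40]]
recovered = recover_suppressed(published, row_totals=[45, 37, 75], col_totals=[31, 31, 95])
print(round_trip == counts, recovered)
```

In this example the first row has two unknowns and cannot be solved directly, but once the second row and first column are filled in, the remaining cell follows, which is the sense in which the recovery is iterative.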

  7. Does De-identification Work?
     • Existing evidence shows that data sets that have been properly de-identified have a low probability of being re-identified
     • All publicly known examples of serious re-identification attacks were done on data sets that were not properly de-identified (i.e., it is possible to show that their risk of re-identification was quite high and did not meet HIPAA de-identification standards)
     • As far as we know, proper de-identification works effectively in managing risk

     http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028071

  8. De-identification Standards
     • The HIPAA Privacy Rule specified two de-identification standards:
       – Safe Harbor
       – Statistical method

     HIPAA Safe Harbor
     Safe Harbor Direct Identifiers and Quasi-identifiers
     1. Names
     2. ZIP Codes (except first three)
     3. All elements of dates (except year)
     4. Telephone numbers
     5. Fax numbers
     6. Electronic mail addresses
     7. Social security numbers
     8. Medical record numbers
     9. Health plan beneficiary numbers
     10. Account numbers
     11. Certificate/license numbers
     12. Vehicle identifiers and serial numbers, including license plate numbers
     13. Device identifiers and serial numbers
     14. Web Universal Resource Locators (URLs)
     15. Internet Protocol (IP) address numbers
     16. Biometric identifiers, including finger and voice prints
     17. Full face photographic images and any comparable images
     18. Any other unique identifying number, characteristic, or code
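
As a rough illustration of how the Safe Harbor list translates into record-level transformations, the sketch below drops a set of identifier fields, truncates ZIP codes to their first three digits, and reduces dates to the year. The field names and the function are hypothetical, and this is not a complete Safe Harbor implementation: the rule attaches further conditions (for example around sparsely populated ZIP areas and ages over 89), and free text, images, biometrics, and element 18 are not handled here.

```python
from datetime import date

# Hypothetical field names; map them to your own schema.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "telephone", "fax", "email", "ssn", "mrn",
    "health_plan_number", "account_number", "license_number",
    "vehicle_id", "device_id", "url", "ip_address",
}


def apply_safe_harbor_sketch(record):
    """Drop listed identifier fields, keep the first three ZIP digits, keep only years."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIER_FIELDS:
            continue  # elements 1 and 4-15: removed outright
        if field == "zip":
            out[field] = str(value)[:3]  # element 2: first three digits only
        elif isinstance(value, date):
            out[field] = value.year  # element 3: year only
        else:
            out[field] = value
    return out


# Made-up example record.
patient = {
    "name": "Jane Doe",
    "email": "jane@example.org",
    "zip": "90210",
    "admission_date": date(2011, 3, 14),
    "diagnosis_code": "J45",
}
deidentified = apply_safe_harbor_sketch(patient)
print(deidentified)
```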

  9. Names (Element 1)
     • Covers only the names of the individuals or of relatives, employers, or household members of the individual
     • Names of providers would not be considered as part of the Safe Harbor list and therefore, strictly speaking, it would not be necessary to remove them from the data set

     Element 18
     • Any other unique identifying number, characteristic, or code:
       – Number: clinical record number
       – Characteristic: rare and visible diagnosis
       – Code: hashed SSN (derived from a direct identifier without a salt)
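
The hashed-SSN example under Element 18 can be made concrete: without a salt, anyone can hash candidate values and compare, so the code is still derived from, and traceable to, the direct identifier. The sketch below assumes SHA-256 (the slides do not name a hash function), uses a made-up SSN-like value, and searches a deliberately tiny candidate range to keep the demonstration fast; the full nine-digit space is small enough for the same approach to work at scale.

```python
import hashlib


def unsalted_hash(ssn):
    """Hash an SSN string with no salt (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(ssn.encode("ascii")).hexdigest()


def brute_force(target_digest, candidates):
    """Return the first candidate whose unsalted hash matches, or None."""
    for candidate in candidates:
        if unsalted_hash(candidate) == target_digest:
            return candidate
    return None


# Made-up value standing in for a real SSN.
secret = "000-00-4821"
released_code = unsalted_hash(secret)

# Enumerate a small, hypothetical slice of the SSN space.
candidates = (f"000-00-{n:04d}" for n in range(10_000))
recovered = brute_force(released_code, candidates)
print(recovered == secret)
```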
