
A Reality Check on Health Information Privacy: How should we understand re-identification risks under HIPAA?



  1. A Reality Check on Health Information Privacy: How should we understand re-identification risks under HIPAA?
     Daniel C. Barth-Jones, M.P.H., Ph.D.
     Assistant Professor of Clinical Epidemiology, Mailman School of Public Health, Columbia University
     db2431@Columbia.edu
     The Value of De-identification
     - Properly de-identified health data is an invaluable “public good”. The broad availability of de-identified data is an essential tool for society, supporting scientific innovation and health system improvement and efficiency.
     - De-identified data does and can serve as the engine driving forward innumerable essential health systems improvements: quality improvement; health systems planning; healthcare fraud, waste and abuse detection; and medical/public health research (e.g. comparative effectiveness research, adverse drug event monitoring, patient safety improvements and reducing health disparities).
     - De-identified health data greatly benefits our society and provides strong privacy protections for individuals. As the promise of EHRs and Health IT yields richer de-identified clinical data, the progress of our nation’s healthcare reform will likely be built on a foundation of such de-identified health data.

  2. The Inconvenient Truth
     [Figure: trade-off between information quality and privacy protection. Axes: disclosure protection (from “No Protection” to “Complete Protection”) and information (from “No Information” to “Optimal Precision, Lack of Bias”). Labels: “Ideal Situation (Perfect Information & Perfect Protection)”, unfortunately not achievable due to mathematical constraints; “Bad Decisions / Bad Science”; “Poor Privacy Protection”.]
     Misconceptions about HIPAA De-identified Data: “It doesn’t work…”
     - “easy, cheap, powerful re-identification” (Ohm, 2009, “Broken Promises of Privacy”)
     - *Pre-HIPAA re-identification risks: {Zip5, Birth date, Gender} able to identify 87% - 63% of the US population (Sweeney, 2000; Golle, 2006)
     - Reality: HIPAA-compliant de-identification provides important privacy protections
       - Safe Harbor re-identification risks have been more recently estimated at 0.04% (4 in 10,000) (Sweeney, NCVHS Testimony, 2007)
       - Safe Harbor de-identification provides protections that have been estimated to be a minimum of 400 to 1000 times more protective of privacy than permitting direct PHI access (Benitez & Malin, JAMIA, 2010)
     - Reality: Under HIPAA de-identification requirements, re-identification is expensive and time-consuming to conduct, requires serious computer/mathematical skills, is rarely successful, and is uncertain as to whether it has actually succeeded

  3. Misconceptions about HIPAA De-identified Data: “It works perfectly and permanently…”
     - Reality:
       - Perfect de-identification is not possible
       - De-identifying does not free data from all possible subsequent privacy concerns
       - Data is never permanently “de-identified”… (There is no guarantee that de-identified data will remain de-identified regardless of what you do to it after it is de-identified.)
       - Simply collapsing your coding categories until the data is “k-anonymous” without considering the impact on statistical accuracy and utility can make the data unsuitable for many statistical analyses
     Myth of the “Perfect Population Register” and importance of “Data Divergence”
     - The critical part of re-identification efforts that is virtually never tested by disclosure scientists is the assumption of a perfect population register.
     - Probabilistic record linkage has some capacity to deal with errors and inconsistencies in the linking data between the sample and the population caused by “data divergence”:
       - Time dynamics in the variables (e.g. changing Zip Codes when individuals move),
       - Missing and incomplete data, and
       - Keystroke or other coding errors in either dataset.
     - But the links created by probabilistic record linkage are subject to uncertainty. The data intruder is never really certain that the correct persons have been re-identified.
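To make the point about probabilistic record linkage and data divergence concrete, here is a minimal sketch of an agreement-weight score in the style of Fellegi-Sunter linkage. It is an illustration only, not the method used in the slides: the field names and the m/u probabilities are hypothetical values chosen for the example.

```python
import math

# Hypothetical per-field probabilities (illustrative values, not from the slides):
#   m = P(field agrees | the two records belong to the same person)
#   u = P(field agrees | the two records belong to different people)
FIELD_PARAMS = {
    "zip5": (0.90, 0.001),
    "birth_year": (0.98, 0.015),
    "gender": (0.99, 0.5),
}

def match_weight(sample_rec, register_rec, params=FIELD_PARAMS):
    """Sum log2 agreement/disagreement weights over the linking fields."""
    total = 0.0
    for field, (m, u) in params.items():
        a, b = sample_rec.get(field), register_rec.get(field)
        if a is None or b is None:
            continue  # missing data contributes no evidence either way
        if a == b:
            total += math.log2(m / u)              # agreement raises the score
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total

person = {"zip5": "10027", "birth_year": 1961, "gender": "F"}
moved = dict(person, zip5="11201")   # same person after a change of address
print(match_weight(person, person))  # all fields agree
print(match_weight(person, moved))   # lower score, although it is the same person
```

In this sketch, a different person who happens to share birth year and gender with the target receives exactly the same score as the person who moved, which is one way to see why a data intruder cannot be certain that a link is correct.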

  4. Identification Spectrum
     [Figure: spectrum running from “No Information” (Totally Safe, But Useless) through “De-identified”, “Breach Safe” LDS and LDS to “Fully Identified”. Permitted uses shown: De-identified: Any Purpose; “Breach Safe”: Useful for Breach Avoidance; LDS: Research, Public Health, Healthcare Operations; Fully Identified: Treatment, Payment, Operations.]
     - Protected Health Information (PHI)
     - Limited Data Set (LDS), § 164.514(e)
       - Eliminate 16 Direct Identifiers (Name, Address, SSN, etc.)
     - LDS w/o 5-digit Zip & Date of Birth (LDS-“Breach Safe”), 8/24/09 FedReg
       - Eliminate 16 Direct Identifiers and Zip5, DoB
     - Safe Harbor De-identified Data Set (SHDDS), § 164.514(b)(2)
       - Eliminate 18 Identifiers (including Geo < 3 digit Zip, All Dates except Yr)
     - Statistically De-identified Data Sets (SDDS), § 164.514(b)(1)
       - Verified “very small” Risk of Re-identification
     HIPAA Statistical De-identification Conditions
     - “Risk is very small…”
       - “that the information could be used”…
       - “alone or in combination with other reasonably available information”…
       - “by an anticipated recipient”…
       - “to identify an individual”…
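As a rough illustration of the Safe Harbor entry in the spectrum above, the sketch below drops a few direct identifiers and coarsens geography and dates. It is not a compliant implementation: the field names are hypothetical, only a small subset of the 18 identifier classes is handled, and further Safe Harbor rules (for example for sparsely populated 3-digit ZIP areas and for ages over 89) are not covered.

```python
# Hypothetical field names; a small subset of the 18 Safe Harbor identifier classes.
DIRECT_IDENTIFIERS = {"name", "address", "ssn"}

def safe_harbor_like(record):
    """Drop direct identifiers, keep 3-digit ZIP, reduce dates to the year."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "zip" in out:
        out["zip"] = out["zip"][:3]                    # geography: 3-digit ZIP only
    if "birth_date" in out:
        out["birth_year"] = out.pop("birth_date")[:4]  # assumes ISO "YYYY-MM-DD" strings
    return out

record = {
    "name": "Alice Example", "address": "1 Main St", "ssn": "000-00-0000",
    "zip": "10027", "birth_date": "1961-04-12", "dx_code": "E11",
}
print(safe_harbor_like(record))
```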

  5. Statistically De-identified Data Sets (SDDSs)
     - Statistical de-identification often can be used to release some of the Safe Harbor “prohibited identifiers”, provided that the risk of re-identification is “very small”.
     - For example, more detailed geography, dates of service or encryption codes could possibly be used within statistically de-identified data, based on statistical disclosure analyses showing that the risks are very small.
     - However, disclosure analyses must be conducted to assess risks of re-identification (e.g., encrypted data with strong statistical associations to unencrypted data can pose important re-identification risks).
     Information Explosion - Rapid Increase in Publicly Available Data
     - Any information which is a “matter of public record” or “reasonably available” in data sets which contain actual identifiers should be considered a quasi-identifier under the HIPAA definition for statistical de-identification.
     - The amount of data that will need to be considered “reasonably available” quasi-identifiers should only be expected to increase, due to the dramatic expansion of public records which are freely available via the internet or inexpensively purchased from marketing data vendors.

  6. Successful Solutions: Balancing Disclosure Risk and Statistical Accuracy
     - When appropriately implemented, statistical de-identification seeks to protect and balance two vitally important societal interests:
       - 1) Protection of the privacy of individuals in healthcare data sets (Disclosure or Identification Risk), and
       - 2) Preserving the utility and accuracy of statistical analyses performed with de-identified data (Loss of Information).
     - Limiting disclosure inevitably reduces the quality of statistical information to some degree, but the appropriate disclosure control methods result in small information losses while substantially reducing identifiability.
     Essential Re-identification Concepts
     - Essential Re-identification and Statistical Disclosure Concepts
       - Record Linkage
       - Linkage Keys (Quasi-identifiers)
       - Sample Uniques and Population Uniques
     - Straightforward Methods for Controlling Re-identification Risk
       - Decreasing Uniques:
         - by Reducing Key Resolutions
         - by Increasing Reporting Population Sizes
       - Understanding challenges for reporting geographies
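The idea of decreasing uniques by reducing key resolution can be sketched as follows. This is a toy example under stated assumptions: the records, the field names and the particular coarsening (3-digit ZIP, decade of birth) are all hypothetical, and the count is of sample uniques only, which are not necessarily unique in the underlying population.

```python
from collections import Counter

KEY = ("zip", "birth_year", "gender")  # hypothetical quasi-identifier key

def count_uniques(records, key_fields=KEY):
    """Number of records whose key combination occurs exactly once in the sample."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    return sum(1 for c in counts.values() if c == 1)

def coarsen(record):
    """Reduce key resolution: 3-digit ZIP and decade of birth (illustrative choice)."""
    out = dict(record)
    out["zip"] = record["zip"][:3]
    out["birth_year"] = record["birth_year"] // 10 * 10
    return out

# Made-up sample records for illustration.
records = [
    {"zip": "10027", "birth_year": 1961, "gender": "F"},
    {"zip": "10025", "birth_year": 1964, "gender": "F"},
    {"zip": "10032", "birth_year": 1975, "gender": "M"},
    {"zip": "10033", "birth_year": 1978, "gender": "M"},
    {"zip": "11201", "birth_year": 1990, "gender": "F"},
]

print(count_uniques(records))                        # every record is unique on the full key
print(count_uniques([coarsen(r) for r in records]))  # fewer uniques after coarsening
```

The coarsened key also carries less information for analysis, which is the trade-off between disclosure risk and statistical accuracy that the slide describes.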

  7. Record Linkage
     Record Linkage is achieved by matching records in separate data sets that have a common “Key” or set of data fields.
     [Figure: a Population Register (w/ IDs) (e.g. Voter Registration) with fields Name, Address, Gender, Age (YoB), …, linked to a Sample Data file with fields Gender, Age (YoB), …, Dx Codes, Px Codes, …. Column groups are labeled “Identifiers”, “Quasi-Identifiers (Keys)” and “Revealed Data”.]
     Quasi-identifiers
     While individual fields may not be identifying by themselves, the contents of several fields in combination may be sufficient to result in identification. The set of fields in the Key is called the set of Quasi-identifiers.
     [Figure: fields Name, Address, Gender, Age, Ethnic Group, Marital Status, Geography, with a bracket labeled “Quasi-identifiers”.]
     Fields that should be considered part of a Quasi-identifier are those variables which would be likely to exist in “reasonably available” data sets along with actual identifiers (names, etc.). Note that this includes even fields that are not “PHI”.
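The record linkage described above can be sketched as an exact join on the quasi-identifier key. This is a minimal illustration with made-up data: the names, field names and codes are hypothetical, and a real population register would be incomplete and partly out of date, so even a unique match is not proof of a correct re-identification.

```python
from collections import defaultdict

def link_on_key(sample, register, key_fields):
    """Return (name, sample_record) pairs where the key matches exactly one register entry."""
    index = defaultdict(list)
    for person in register:
        index[tuple(person[f] for f in key_fields)].append(person)
    links = []
    for rec in sample:
        candidates = index.get(tuple(rec[f] for f in key_fields), [])
        if len(candidates) == 1:  # zero or several candidates give no claimed link
            links.append((candidates[0]["name"], rec))
    return links

# Hypothetical register with identifiers (e.g. a voter list) and a sample without them.
register = [
    {"name": "Alice Example", "gender": "F", "yob": 1961},
    {"name": "Bob Example", "gender": "M", "yob": 1975},
    {"name": "Carl Example", "gender": "M", "yob": 1975},
]
sample = [
    {"gender": "F", "yob": 1961, "dx": "E11"},
    {"gender": "M", "yob": 1975, "dx": "I10"},
    {"gender": "F", "yob": 1990, "dx": "J45"},
]

for name, rec in link_on_key(sample, register, ("gender", "yob")):
    print(name, rec["dx"])
```

Only the first sample record yields a link here: the second matches two register entries and the third matches none.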
