Has there been a failure of anonymization?
Khaled El Emam
www.ehealthinformation.ca
www.ehealthinformation.ca/knowledgebase

The Claim
• “Computer scientists have undermined our faith in the privacy-protecting power of anonymization” (Ohm 2009)
• “These advances should trigger a sea change in the law” (Ohm 2009)
• “Irrefutable empirical evidence” that anonymization is broken (Dwork 2010)
• Policy makers are concerned: is this true?

Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca

The Argument
• The evidence presented consists of actual examples of successful re-identification attacks
• Because these re-identification attacks exist then anonymization must not work

The Problem with the Argument
• The evidence often cited does not actually show that:
  – Re-identification was successful, or
  – That the data that was attacked was anonymized
• Therefore, there is actually no empirical evidence that anonymized data can be re-identified

This Presentation
• We will examine 21 empirical studies that looked at re-identification on actual data sets or in real world examples (i.e., not theoretical attacks) to see what we can learn about them
• Focus mostly on demographics that can appear in clinical data (i.e., no genetic re-identification studies)
• Some subset of these 21 studies is often (mis-)cited as evidence

Variable Distinctions
• Directly identifying
  – Can uniquely identify an individual by itself or in conjunction with other readily available information
• Quasi-identifiers
  – Can identify an individual by itself or in conjunction with other information
• Sensitive variables

Five Levels of Identifiability
[Figure: five levels of identifiability. Level 1: Readily Identifiable Data; Level 2: Masked Data; Level 3: Exposed Data; Level 4: Managed Data; Level 5: Aggregate Data. Annotations: reversibly masked data; irreversibly masked data; identifiability above threshold (personal information); identifiability below threshold (not personal information). Axes: greater risk of re-identification; greater effort, cost, time & skill to re-identify.]

Evaluation Criteria - I
• Risk or re-identification:
  – Some studies evaluate (estimate or measure) the risk of re-identification but do not actually attempt to re-identify a data set
  – Therefore, risk evaluation studies would not be considered successful re-identification attacks

Evaluation Criteria - II
• Was the data de-identified:
  – This would be a question whether the study was a risk assessment or a re-identification
  – Was the data properly de-identified in a measurable way or was it only “masked data”?
  – Masked data is not anonymized data; re-identifying masked data is trivial
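
To illustrate why masked data is not anonymized data, the following minimal Python sketch masks a tiny, entirely made-up table by dropping the directly identifying column and then counts how many records remain unique on their quasi-identifiers. The records, the column names and the helper names `mask` and `unique_on` are hypothetical and are not taken from any of the studies examined here.

```python
from collections import Counter

# Hypothetical records: "name" is directly identifying; "dob", "sex" and
# "zip" are quasi-identifiers; "diagnosis" is a sensitive variable.
RECORDS = [
    {"name": "A. Smith", "dob": "1950-03-02", "sex": "M", "zip": "02138", "diagnosis": "asthma"},
    {"name": "B. Jones", "dob": "1950-03-02", "sex": "F", "zip": "02138", "diagnosis": "diabetes"},
    {"name": "C. Brown", "dob": "1962-11-19", "sex": "F", "zip": "02139", "diagnosis": "migraine"},
    {"name": "D. Green", "dob": "1962-11-19", "sex": "F", "zip": "02139", "diagnosis": "asthma"},
]

DIRECT_IDENTIFIERS = ("name",)
QUASI_IDENTIFIERS = ("dob", "sex", "zip")


def mask(records, direct=DIRECT_IDENTIFIERS):
    """Masking only: drop directly identifying fields, keep everything else."""
    return [{k: v for k, v in r.items() if k not in direct} for r in records]


def unique_on(records, keys=QUASI_IDENTIFIERS):
    """Return the records whose quasi-identifier combination occurs exactly once."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if counts[tuple(r[k] for k in keys)] == 1]


masked = mask(RECORDS)
still_unique = unique_on(masked)
print(f"{len(still_unique)} of {len(masked)} masked records are unique on {QUASI_IDENTIFIERS}")
```

In this made-up table the names are gone after masking, yet two of the four records can still be singled out by date of birth, sex and zip code alone.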

Evaluation Criteria - III
• Was a re-identification verified:
  – Verifying a re-identification match is necessary to ensure that the match was correct (i.e., confirming that the match of the record to the identity is correct)
  – If a match is not verified then it is not possible to know if the match was a correct one or not
  – The only exception would be a match to a population registry (e.g., voter list)

Example - I
Governor William Weld of MA

GIC Case
• The Group Insurance Commission is responsible for purchasing health insurance for state employees in Massachusetts
• Insurance data on 135,000 state employees and their families was released after being “anonymized”
• Database was matched with the voter list for Cambridge, Massachusetts

William Weld
• Six people in the database have the same DoB
• Three are men
• One in his 5 digit zip code
• His insurance record was re-identified
• William Weld was the governor of Massachusetts
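
The narrowing steps above amount to filtering the released records on successive quasi-identifiers taken from the voter list. A minimal Python sketch follows; every record, date and zip code is made up for illustration (the slides do not give the actual values), and the field names and the `narrow` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: int
    dob: str   # date of birth
    sex: str
    zip5: str  # 5 digit zip code


# Hypothetical stand-in for the released insurance data: six records share a DoB.
RELEASED = [
    Record(1, "1940-01-01", "M", "00001"),
    Record(2, "1940-01-01", "M", "00002"),
    Record(3, "1940-01-01", "M", "00003"),
    Record(4, "1940-01-01", "F", "00001"),
    Record(5, "1940-01-01", "F", "00002"),
    Record(6, "1940-01-01", "F", "00003"),
    Record(7, "1955-06-15", "M", "00001"),
]

# Hypothetical voter-list entry for the person being sought.
TARGET = {"dob": "1940-01-01", "sex": "M", "zip5": "00001"}


def narrow(records, target, keys):
    """Yield (key, remaining) after filtering on each quasi-identifier in turn."""
    remaining = list(records)
    for key in keys:
        remaining = [r for r in remaining if getattr(r, key) == target[key]]
        yield key, remaining


steps = list(narrow(RELEASED, TARGET, ["dob", "sex", "zip5"]))
for key, remaining in steps:
    print(f"after matching {key}: {len(remaining)} candidate(s)")
```

With this made-up data the candidate set shrinks from six to three to one, mirroring the counts on the slide.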

Evaluation of Example I
• This was a successful re-identification attack
• The match was not verified but it was a match to a population registry
• The data that was disclosed by GIC was masked data and not properly de-identified

Example 2 - AOL Case
• In the Summer of 2006 AOL released “anonymized” data on ~20 million discrete search queries for >650,000 individuals on a public web site for researchers to use
• The records include date and time of the query and the web site clicked on, as well as a unique identifier for each user so records can be linked to get a user profile
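
Because every query carries the same unique user identifier, building a user profile is a simple group-by. The Python sketch below assumes a hypothetical record layout of (user id, timestamp, query); the user numbers, timestamps and queries are all invented, and the actual format of the released file is not described in the slides.

```python
from collections import defaultdict

# Hypothetical log rows: (user_id, timestamp, query). All values are invented.
LOG = [
    (1001, "2006-03-01 09:15:00", "weather tomorrow"),
    (1002, "2006-03-01 09:20:00", "cheap flights"),
    (1001, "2006-03-02 18:40:00", "pizza near main street"),
    (1001, "2006-03-05 07:05:00", "knee pain remedies"),
    (1002, "2006-03-07 21:30:00", "hotel reviews"),
]


def build_profiles(log):
    """Group queries by user_id, ordered by timestamp within each user."""
    profiles = defaultdict(list)
    for user_id, timestamp, query in sorted(log, key=lambda row: (row[0], row[1])):
        profiles[user_id].append((timestamp, query))
    return dict(profiles)


profiles = build_profiles(LOG)
for user_id, history in profiles.items():
    print(user_id, [query for _, query in history])
```

The sketch only shows the linking step: the persistent identifier is what turns separate queries into one history per user.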

AOL Users
• #2178: “foods to avoid when breast feeding”
• #3482401: “calorie counting”
• #3505202: “depression and medical leave”

User # 4417749
• “tea for good health”
• “numb fingers”, “hand tremors”
• “dry mouth”
• “60 single men”
• “dog that urinates on everything”
• “landscapers in Lilburn, Ga”
• “homes sold in shadow lake subdivision gwinnett county georgia”

Thelma Arnold
• 62 year old widow living in Lilburn, Ga, re-identified by the New York Times
• She has three dogs

Evaluation of Example 2
• This was a successful re-identification attack
• The match was verified (we have a picture)
• The data that was disclosed by AOL was masked data and not properly de-identified
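
The three evaluation criteria can be applied mechanically to the two examples so far. The Python sketch below encodes the assessments given on the two evaluation slides; the `Study` class, its field names and the `is_evidence_against_anonymization` rule are a hypothetical encoding of the argument, not something defined in the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    name: str
    attempted_reidentification: bool  # Criterion I: an attack, not only a risk estimate
    properly_deidentified: bool       # Criterion II: False if the data was only masked
    match_verified: bool              # Criterion III
    matched_population_registry: bool = False  # Criterion III exception (e.g., voter list)


def is_successful_attack(s: Study) -> bool:
    """An actual attempt whose match was verified or made against a population registry."""
    return s.attempted_reidentification and (s.match_verified or s.matched_population_registry)


def is_evidence_against_anonymization(s: Study) -> bool:
    """Counts only if the attack succeeded AND the data had been properly de-identified."""
    return is_successful_attack(s) and s.properly_deidentified


STUDIES = [
    Study("GIC / William Weld", True, False, match_verified=False, matched_population_registry=True),
    Study("AOL / Thelma Arnold", True, False, match_verified=True),
]

for s in STUDIES:
    print(f"{s.name}: successful attack={is_successful_attack(s)}, "
          f"evidence against anonymization={is_evidence_against_anonymization(s)}")
```

Under this encoding both examples are successful attacks, but neither counts as evidence that properly de-identified data can be re-identified, because in both cases the released data was only masked.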