Data Privacy – Anonymization Li Xiong CS573 Data Privacy and Security
Outline
• Inference control
• Anonymization problem
• Anonymization notions and approaches (and how they fail to work!)
  – k-anonymity
  – l-diversity
  – t-closeness
• Takeaways
Access Control vs. Inference Control
• Access control: protecting information from being accessed by unauthorized users
• Inference control (disclosure control): protecting private data from being inferred from sanitized data or models by authorized users
[Diagram: Original Data → Inference Control → Sanitized Data/Models]
Disclosure Risk and Information Loss
• Privacy (disclosure risk): the risk that a given form of disclosure will arise if the data is released
• Utility (information loss): the information that exists in the initial data but not in the released data, due to disclosure control methods
[Diagram: Original Data → Inference Control → Sanitized Data/Models]
What to Protect: Classical Intuition for Privacy
• Uninformative principle (Dalenius 1977)
  – Access to the published data does not reveal anything extra about any target victim, even in the presence of attacker background knowledge obtained from other sources
• Similar to semantic security of encryption
  – Knowledge of the ciphertext (and its length) of some unknown message does not reveal any additional information about the message that can be feasibly extracted
What to protect: types of disclosure
• Membership disclosure: attacker can tell that a given person is in the dataset
• Identity disclosure: attacker can tell which record corresponds to a given person
• Sensitive attribute disclosure: attacker can tell that a given person or record has a certain sensitive attribute
What's published
• Microdata: a set of records containing information on an individual unit such as a person, a firm, or an institution
• Macrodata: computed/derived statistics
• Models and patterns from machine learning and data mining
Disclosure Control for Microdata

Initial Microdata:
Name     Age  Diagnosis      Income
Wayne    44   AIDS           45,500
Gore     44   Asthma         37,900
Banks    55   AIDS           67,000
Casey    44   Asthma         21,000
Stone    55   Asthma         90,000
Kopi     45   Diabetes       48,000
Simms    25   Diabetes       49,000
Wood     35   AIDS           66,000
Aaron    55   AIDS           69,000
Pall     45   Tuberculosis   34,000

Masked Microdata:
Age  Diagnosis      Income
44   AIDS           50,000
44   Asthma         40,000
55   AIDS           70,000
44   Asthma         20,000
55   Asthma         90,000
45   Diabetes       50,000
-    Diabetes       50,000
-    AIDS           70,000
55   AIDS           70,000
45   -              30,000
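To make the masking above concrete, here is a minimal sketch in Python (pandas). The rules are inferred from the example rather than prescribed by the slides: drop the Name identifier, round Income to the nearest 10,000, and suppress any Age or Diagnosis value that appears only once.

```python
# A minimal sketch of the masking shown above (rules inferred from the
# example, not prescribed by the slides).
import pandas as pd

df = pd.DataFrame({
    "Name": ["Wayne", "Gore", "Banks", "Casey", "Stone",
             "Kopi", "Simms", "Wood", "Aaron", "Pall"],
    "Age": [44, 44, 55, 44, 55, 45, 25, 35, 55, 45],
    "Diagnosis": ["AIDS", "Asthma", "AIDS", "Asthma", "Asthma",
                  "Diabetes", "Diabetes", "AIDS", "AIDS", "Tuberculosis"],
    "Income": [45500, 37900, 67000, 21000, 90000,
               48000, 49000, 66000, 69000, 34000],
})

masked = df.drop(columns=["Name"])  # remove the direct identifier
# Perturb income by rounding to the nearest 10,000.
masked["Income"] = ((masked["Income"] / 10000).round() * 10000).astype(int)
# Suppress Age/Diagnosis values that occur only once (they single out a person).
for col in ["Age", "Diagnosis"]:
    counts = masked[col].value_counts()
    singleton = masked[col].map(counts) == 1
    masked[col] = masked[col].astype(object)  # allow the "-" marker below
    masked.loc[singleton, col] = "-"

print(masked)  # matches the Masked Microdata table above
```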
Disclosure Control for Macrodata (Statistics Tables)

Initial microdata: as in the previous slide.

Table 1 – Count by Diagnosis:
Count  Diagnosis
4      AIDS
3      Asthma
2      Diabetes
1      Tuberculosis

Masked Table 1:
Count  Diagnosis
4      AIDS
3      Asthma

Table 2 – Total Income by Age:
Count  Age     Income
1      <= 30   49,000
1      31-40   66,000
5      41-50   188,200
3      51-60   226,000
0      > 60    0

Masked Table 2:
Count  Age     Income
5      31-40   188,200
3      41-50   226,000
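For the frequency tables, here is a minimal sketch of cell suppression, assuming the common rule that cells with counts below a threshold are withheld; a threshold of 3 matches the masked Table 1 above.

```python
# A minimal sketch of cell suppression for a frequency (count) table.
# The threshold of 3 is an assumption that matches the masked table above.
import pandas as pd

diagnoses = ["AIDS", "Asthma", "AIDS", "Asthma", "Asthma",
             "Diabetes", "Diabetes", "AIDS", "AIDS", "Tuberculosis"]

table1 = pd.Series(diagnoses).value_counts()   # Table 1: count per diagnosis
masked_table1 = table1[table1 >= 3]            # withhold small cells
print(masked_table1)  # AIDS 4, Asthma 3; Diabetes and Tuberculosis suppressed
```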
Disclosure Control for Data Mining / Machine Learning Models

Initial microdata: the same table as above.
Inference Control Methods
• Microdata release (anonymization)
  – Input perturbation: attribute suppression, generalization, perturbation
• Macrodata release
  – Output perturbation: summary statistics with perturbation (see the sketch below)
• Query restriction/auditing (interactive setting)
  – An auditor decides which queries are OK and what type of noise to add
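As an illustration of output perturbation, here is a minimal sketch that releases a count with added noise; the slides do not fix a particular noise distribution, so both the Laplace distribution and the scale below are illustrative choices.

```python
# A minimal sketch of output perturbation: publish a noisy summary
# statistic instead of the exact one. Laplace noise with scale 1.0 is an
# illustrative choice, not mandated by the slides.
import numpy as np

rng = np.random.default_rng(seed=42)
true_count = 4                                   # e.g., number of AIDS records
noisy_count = true_count + rng.laplace(loc=0.0, scale=1.0)
print(f"released count: {noisy_count:.2f}")
```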
Outline
• Anonymization problem
• Anonymization notions and approaches (and how they fail to work!)
  – Basic attempt: de-identification
  – k-anonymity
  – l-diversity
  – t-closeness
• Takeaways
Basic Attempt
• Remove/replace identifier attributes
[Diagram: Original Data → De-identification → Sanitized Records]
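A minimal sketch of this basic attempt; the identifier column names below are illustrative assumptions and would have to be chosen per dataset. As the next slides show, this alone does not prevent re-identification.

```python
# A minimal sketch of de-identification by dropping identifier columns.
# The column names are illustrative assumptions.
import pandas as pd

def deidentify(df: pd.DataFrame, identifiers=("Name", "SSN")) -> pd.DataFrame:
    """Drop whichever of the named identifier columns are present."""
    return df.drop(columns=[c for c in identifiers if c in df.columns])
```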
Data "Anonymization"
• Remove "personally identifying information" (PII)
  – Name, Social Security number, phone number, email, address… what else?
• Problem: PII has no technical meaning or common definition
  – Defined in sectoral laws such as HIPAA (PHI: Protected Health Information), which lists 18 identifiers
  – Any information can be personally identifying, e.g., a rare disease condition
• Many de-anonymization examples: GIC, AOL dataset, Netflix Prize dataset
Massachusetts GIC Incident
The Massachusetts GIC released "anonymized" data on state employees' hospital visits. Then-Governor William Weld assured the public that privacy was protected.

GIC data:
Name     SSN        Age  Zip    Diagnosis
Alice    123456789  44   48202  AIDS
Bob      323232323  44   48202  AIDS
Charley  232345656  44   48201  Asthma
Dave     333333333  55   48310  Asthma
Eva      666666666  55   48310  Diabetes

Anonymized release:
Age  Zip    Diagnosis
44   48202  AIDS
44   48202  AIDS
44   48201  Asthma
55   48310  Asthma
55   48310  Diabetes
Massachusetts GIC
Graduate student Latanya Sweeney then linked the anonymized data with the voter registration data for Cambridge and identified Governor Weld's record.

Anonymized release:
Age  Zip    Diagnosis
44   48202  AIDS
44   48202  AIDS
44   48201  Asthma
55   48310  Asthma
55   48310  Diabetes

Voter roll for Cambridge:
Name     Age  Zip
Alice    44   48202
Charley  44   48201
Dave     55   48310
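The attack itself is just a join on the shared quasi-identifiers (Age, Zip). A minimal sketch using the tables above:

```python
# A minimal sketch of Sweeney's linkage attack: join the "anonymized"
# release with a public voter roll on the shared quasi-identifiers.
import pandas as pd

gic = pd.DataFrame({
    "Age": [44, 44, 44, 55, 55],
    "Zip": [48202, 48202, 48201, 48310, 48310],
    "Diagnosis": ["AIDS", "AIDS", "Asthma", "Asthma", "Diabetes"],
})
voters = pd.DataFrame({
    "Name": ["Alice", "Charley", "Dave"],
    "Age": [44, 44, 55],
    "Zip": [48202, 48201, 48310],
})

linked = voters.merge(gic, on=["Age", "Zip"])
print(linked)
# Charley matches exactly one record (Asthma). Alice matches two records
# that share the same diagnosis (AIDS), so her condition leaks as well.
```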
Re-identification
AOL Query Log Release
20 million Web search queries released by AOL.

AnonID  Query                 QueryTime            ItemRank  ClickURL
217     lottery               2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery               2006-03-27 14:10:38  1         http://www.calottery.com
1268    gall stones           2006-05-11 02:12:51
1268    gallstones            2006-05-11 02:13:02  1         http://www.niddk.nih.gov
1268    ozark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com

(Source: AOL Query Log)
User No. 4417749
• User 4417749's queries:
  – "numb fingers"
  – "60 single men"
  – "dog that urinates on everything"
  – "landscapers in Lilburn, Ga"
  – several people's names with the last name Arnold
  – "homes sold in shadow lake subdivision gwinnett county georgia"
• Identified as Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends' medical ailments, and loves her dogs
The Genome Hacker (2013)
Outline
• Anonymization problem
• Anonymization notions and approaches (and how they fail to work!)
  – Basic attempt: de-identification
  – k-anonymity
  – l-diversity
  – t-closeness
• Takeaways
K-Anonymity
• The term was introduced in 1998 by Samarati and Sweeney.
• Important papers:
  – Sweeney, L. (2002). k-Anonymity: A Model for Protecting Privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 557-570.
  – Sweeney, L. (2002). Achieving k-Anonymity Privacy Protection Using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 571-588.
  – Samarati, P. (2001). Protecting Respondents' Identities in Microdata Release. IEEE Transactions on Knowledge and Data Engineering, 13(6), 1010-1027.
• Hundreds of papers on the topic in the past decade:
  – Theoretical results
  – Many algorithms achieving k-anonymity
  – Many improved principles and algorithms
Motivating Example
[Diagram: Original Data → De-identification → Sanitized Records]

Original data (Zip, Age, Nationality are non-sensitive; Condition is sensitive):
#  Zip    Age  Nationality  Name     Condition
1  13053  28   Brazilian    Ronaldo  Heart Disease
2  13067  29   US           Bob      Heart Disease
3  13053  37   Indian       Kumar    Cancer
4  13067  36   Japanese     Umeko    Cancer

Sanitized records:
#  Zip    Age  Nationality  Condition
1  13053  28   Brazilian    Heart Disease
2  13067  29   US           Heart Disease
3  13053  37   Indian       Cancer
4  13067  36   Japanese     Cancer

Attacker's knowledge: voter registration list
#  Name   Zip    Age  Nationality
1  John   13067  45   US
2  Paul   13067  22   US
3  Bob    13067  29   US
4  Chris  13067  23   US

Linking the sanitized records with the voter list on (Zip, Age, Nationality) uniquely matches record 2 to Bob, revealing his heart disease.
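A table is k-anonymous with respect to a set of quasi-identifiers if every combination of quasi-identifier values appears in at least k records. A minimal sketch of such a check, run against the sanitized table above (which fails even for k = 2, which is exactly why Bob is re-identifiable):

```python
# A minimal sketch of a k-anonymity check: every quasi-identifier
# combination must occur in at least k records.
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_ids, k: int) -> bool:
    return bool(df.groupby(list(quasi_ids)).size().min() >= k)

released = pd.DataFrame({
    "Zip": [13053, 13067, 13053, 13067],
    "Age": [28, 29, 37, 36],
    "Nationality": ["Brazilian", "US", "Indian", "Japanese"],
    "Condition": ["Heart Disease", "Heart Disease", "Cancer", "Cancer"],
})

print(is_k_anonymous(released, ["Zip", "Age", "Nationality"], k=2))
# False: each record is unique on the quasi-identifiers, so linking with
# the voter list re-identifies Bob's record.
```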