CS573 Data Privacy and Security
Data Anonymization (cont.)
Li Xiong
Department of Mathematics and Computer Science
Emory University
Today
• Anonymization notions and approaches (cont.)
  – l-diversity
  – t-closeness
• Takeaways
Attacks on k-Anonymity
• k-Anonymity protects against identity disclosure but does not provide sufficient protection against attribute disclosure
• k-Anonymity does not provide privacy if
  – sensitive values in a quasi-identifier group (equivalence class) lack diversity (homogeneity attack)
  – the attacker has background knowledge (background knowledge attack)

A 3-anonymous patient table
Zipcode  Age   Disease
476**    2*    Heart Disease
476**    2*    Heart Disease
476**    2*    Heart Disease
4790*    ≥40   Flu
4790*    ≥40   Heart Disease
4790*    ≥40   Cancer
476**    3*    Heart Disease
476**    3*    Cancer
476**    3*    Cancer

Homogeneity attack: Bob
Zipcode  Age
47678    27
Bob falls in the (476**, 2*) class, where every record is Heart Disease.

Background knowledge attack: Carl
Zipcode  Age
47673    36
Carl falls in the (476**, 3*) class, which contains only Heart Disease and Cancer; background knowledge that rules out one of them reveals the other.
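The homogeneity attack on the table above can be checked mechanically. The following is a minimal sketch, not part of the original slides: it groups the rows by quasi-identifier and reports any equivalence class whose sensitive values are all identical. The names `TABLE`, `equivalence_classes`, and `homogeneous_classes` are illustrative assumptions.

```python
from collections import defaultdict

# Rows of the 3-anonymous patient table: (zipcode, age, disease).
TABLE = [
    ("476**", "2*", "Heart Disease"),
    ("476**", "2*", "Heart Disease"),
    ("476**", "2*", "Heart Disease"),
    ("4790*", ">=40", "Flu"),
    ("4790*", ">=40", "Heart Disease"),
    ("4790*", ">=40", "Cancer"),
    ("476**", "3*", "Heart Disease"),
    ("476**", "3*", "Cancer"),
    ("476**", "3*", "Cancer"),
]


def equivalence_classes(rows):
    """Group sensitive values by quasi-identifier (zipcode, age)."""
    classes = defaultdict(list)
    for zipcode, age, disease in rows:
        classes[(zipcode, age)].append(disease)
    return dict(classes)


def homogeneous_classes(rows):
    """Map each class with a single sensitive value to that value."""
    return {
        qi: values[0]
        for qi, values in equivalence_classes(rows).items()
        if len(set(values)) == 1
    }


# k is the size of the smallest equivalence class.
k = min(len(values) for values in equivalence_classes(TABLE).values())
leaks = homogeneous_classes(TABLE)
print(k, leaks)
```

Any attacker who can place a target in a class listed in `leaks` learns the target's disease, even though the table is 3-anonymous.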
Another Attempt: l-Diversity [Machanavajjhala et al. ICDE '06]
• Protect against attribute disclosure
• Sensitive attributes must be “diverse” within each quasi-identifier equivalence class
• l-diversity equivalence class: at least l “well-represented” values for the sensitive attribute
• l-diversity table: every equivalence class of the table has l-diversity

Caucas       787XX  Flu
Caucas       787XX  Shingles
Caucas       787XX  Acne
Caucas       787XX  Flu
Caucas       787XX  Acne
Caucas       787XX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Acne
Asian/AfrAm  78XXX  Shingles
Asian/AfrAm  78XXX  Acne
Asian/AfrAm  78XXX  Flu
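One simple way to read “well-represented” is distinct l-diversity: every equivalence class must contain at least l distinct sensitive values. The sketch below assumes that reading (other instantiations exist) and applies it to the table on this slide; `ROWS` and `distinct_l` are illustrative names, not from the slides.

```python
from collections import defaultdict

# (race, zipcode, disease) rows from the example table.
ROWS = [
    ("Caucas", "787XX", "Flu"),
    ("Caucas", "787XX", "Shingles"),
    ("Caucas", "787XX", "Acne"),
    ("Caucas", "787XX", "Flu"),
    ("Caucas", "787XX", "Acne"),
    ("Caucas", "787XX", "Flu"),
    ("Asian/AfrAm", "78XXX", "Flu"),
    ("Asian/AfrAm", "78XXX", "Flu"),
    ("Asian/AfrAm", "78XXX", "Acne"),
    ("Asian/AfrAm", "78XXX", "Shingles"),
    ("Asian/AfrAm", "78XXX", "Acne"),
    ("Asian/AfrAm", "78XXX", "Flu"),
]


def distinct_l(rows):
    """Largest l such that every class has at least l distinct sensitive values."""
    classes = defaultdict(set)
    for race, zipcode, disease in rows:
        classes[(race, zipcode)].add(disease)
    return min(len(values) for values in classes.values())


table_l = distinct_l(ROWS)
print(table_l)
```

Under this assumed definition the table is distinct 3-diverse, since both classes contain Flu, Shingles, and Acne.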
Neither Necessary, Nor Sufficient

Original dataset   Anonymization A   Anonymization B
…  HIV-            Q1  HIV+          Q1  HIV-
…  HIV-            Q1  HIV-          Q1  HIV-
…  HIV-            Q1  HIV+          Q1  HIV-
…  HIV-            Q1  HIV-          Q1  HIV+
…  HIV-            Q1  HIV+          Q1  HIV-
…  HIV+            Q1  HIV-          Q1  HIV-
…  HIV-            Q2  HIV-          Q2  HIV-
…  HIV-            Q2  HIV-          Q2  HIV-
…  HIV-            Q2  HIV-          Q2  HIV-
…  HIV-            Q2  HIV-          Q2  HIV-
…  HIV-            Q2  HIV-          Q2  HIV-
…  HIV-            Q2  HIV-          Q2  Flu

• Original dataset: 99% have HIV-
• 50% HIV-: quasi-identifier group is “diverse”, yet this leaks a ton of information
• 99% HIV-: quasi-identifier group is not “diverse”, yet the anonymized database does not leak anything
Limitations of l-Diversity
• Example: sensitive attribute is HIV+ (1%) or HIV- (99%)
  – Very different degrees of sensitivity!
• l-diversity is unnecessary
  – 2-diversity is unnecessary for an equivalence class that contains only HIV- records
• l-diversity is difficult to achieve
  – Suppose there are 10000 records in total
  – To have distinct 2-diversity, every equivalence class needs at least one HIV+ record; only 10000*1% = 100 HIV+ records exist, so there can be at most 100 equivalence classes
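The counting argument above can be written out as a short calculation. This is a sketch for a binary sensitive attribute only; the function name and parameters are illustrative assumptions.

```python
def max_distinct_2_diverse_classes(total_records: int, rare_percent: int) -> int:
    """Upper bound on equivalence classes when each must hold both values.

    With a binary sensitive attribute, distinct 2-diversity needs at least
    one rare and one common record per class, so the scarcer value caps
    the number of classes.
    """
    rare = total_records * rare_percent // 100
    common = total_records - rare
    return min(rare, common)


max_classes = max_distinct_2_diverse_classes(10_000, 1)
# With at most max_classes classes, the average class size is at least this.
min_avg_class_size = 10_000 / max_classes
print(max_classes, min_avg_class_size)
```

With 10000 records and 1% HIV+, the bound is 100 classes, so the average class holds at least 100 records, which forces heavy generalization of the quasi-identifiers.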
Skewness Attack
• Example: sensitive attribute is HIV+ (1%) or HIV- (99%)
• Consider an equivalence class that contains an equal number of HIV+ and HIV- records
  – Diverse, but potentially violates privacy!
• l-diversity does not differentiate:
  – Equivalence class 1: 49 HIV+ and 1 HIV-
  – Equivalence class 2: 1 HIV+ and 49 HIV-
l-diversity does not consider overall distribution of sensitive values!
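To see why the two classes look the same to l-diversity yet differ sharply in risk, the sketch below compares their entropy (one common way to quantify diversity, used here as an assumption) with the attacker's inferred probability of HIV+ relative to the 1% prior. All names are illustrative.

```python
import math

PRIOR_POSITIVE = 0.01  # overall share of HIV+ records from the example

CLASS_1 = {"HIV+": 49, "HIV-": 1}
CLASS_2 = {"HIV+": 1, "HIV-": 49}


def entropy(counts):
    """Shannon entropy (bits) of the sensitive values in one class."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)


def posterior_positive(counts):
    """Attacker's belief that a member of this class is HIV+."""
    return counts["HIV+"] / sum(counts.values())


for name, cls in (("class 1", CLASS_1), ("class 2", CLASS_2)):
    gain = posterior_positive(cls) / PRIOR_POSITIVE
    print(name, round(entropy(cls), 4), posterior_positive(cls), round(gain))
```

Both classes have two distinct values and identical entropy, but membership in class 1 raises the attacker's belief in HIV+ from 1% to 98%, while class 2 raises it only to 2%.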
Sensitive Attribute Disclosure
Similarity attack

A 3-diverse patient table
Zipcode  Age   Salary  Disease
476**    2*    20K     Gastric Ulcer
476**    2*    30K     Gastritis
476**    2*    40K     Stomach Cancer
4790*    ≥40   50K     Gastritis
4790*    ≥40   100K    Flu
4790*    ≥40   70K     Bronchitis
476**    3*    60K     Bronchitis
476**    3*    80K     Pneumonia
476**    3*    90K     Stomach Cancer

Bob
Zip    Age
47678  27

Conclusion
1. Bob's salary is in [20K, 40K], which is relatively low
2. Bob has some stomach-related disease

l-diversity does not consider semantics of sensitive values!
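The similarity attack can be sketched by summarizing each equivalence class of the table above. The `CATEGORY` mapping is a hypothetical grouping of diseases introduced only for illustration; the slide itself states only that Bob's class is stomach-related.

```python
from collections import defaultdict

# (zipcode, age, salary in thousands, disease) rows from the 3-diverse table.
TABLE = [
    ("476**", "2*", 20, "Gastric Ulcer"),
    ("476**", "2*", 30, "Gastritis"),
    ("476**", "2*", 40, "Stomach Cancer"),
    ("4790*", ">=40", 50, "Gastritis"),
    ("4790*", ">=40", 100, "Flu"),
    ("4790*", ">=40", 70, "Bronchitis"),
    ("476**", "3*", 60, "Bronchitis"),
    ("476**", "3*", 80, "Pneumonia"),
    ("476**", "3*", 90, "Stomach Cancer"),
]

# Hypothetical semantic grouping (an assumption, not from the slides).
CATEGORY = {
    "Gastric Ulcer": "stomach",
    "Gastritis": "stomach",
    "Stomach Cancer": "stomach",
    "Flu": "respiratory",
    "Bronchitis": "respiratory",
    "Pneumonia": "respiratory",
}


def summarize(rows):
    """Per class: distinct diseases, their categories, and the salary range."""
    classes = defaultdict(list)
    for zipcode, age, salary, disease in rows:
        classes[(zipcode, age)].append((salary, disease))
    summary = {}
    for qi, members in classes.items():
        salaries = [salary for salary, _ in members]
        diseases = {disease for _, disease in members}
        summary[qi] = {
            "distinct_diseases": len(diseases),
            "categories": {CATEGORY[d] for d in diseases},
            "salary_range": (min(salaries), max(salaries)),
        }
    return summary


bob_class = summarize(TABLE)[("476**", "2*")]
print(bob_class)
```

Bob's class has three distinct diseases and three distinct salaries, so it passes a distinct 3-diversity check, yet all diseases fall in one category and all salaries in a narrow low range.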
t-Closeness [Li et al. ICDE '07]
Distribution of sensitive attributes within each quasi-identifier group should be “close” to their distribution in the entire original database

Caucas       787XX  Flu
Caucas       787XX  Shingles
Caucas       787XX  Acne
Caucas       787XX  Flu
Caucas       787XX  Acne
Caucas       787XX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Acne
Asian/AfrAm  78XXX  Shingles
Asian/AfrAm  78XXX  Acne
Asian/AfrAm  78XXX  Flu
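A rough way to measure “closeness” is to compare each group's distribution of sensitive values with the overall distribution. The t-closeness paper uses Earth Mover's Distance; the sketch below substitutes the simpler total variation distance, which is an assumption made for brevity. All names are illustrative.

```python
from collections import Counter, defaultdict

# (race, zipcode, disease) rows from the example table.
ROWS = [
    ("Caucas", "787XX", "Flu"),
    ("Caucas", "787XX", "Shingles"),
    ("Caucas", "787XX", "Acne"),
    ("Caucas", "787XX", "Flu"),
    ("Caucas", "787XX", "Acne"),
    ("Caucas", "787XX", "Flu"),
    ("Asian/AfrAm", "78XXX", "Flu"),
    ("Asian/AfrAm", "78XXX", "Flu"),
    ("Asian/AfrAm", "78XXX", "Acne"),
    ("Asian/AfrAm", "78XXX", "Shingles"),
    ("Asian/AfrAm", "78XXX", "Acne"),
    ("Asian/AfrAm", "78XXX", "Flu"),
]


def variational_distance(group_counts, overall_counts):
    """Total variation distance between two count-based distributions."""
    group_total = sum(group_counts.values())
    overall_total = sum(overall_counts.values())
    values = set(group_counts) | set(overall_counts)
    return 0.5 * sum(
        abs(group_counts.get(v, 0) / group_total - overall_counts.get(v, 0) / overall_total)
        for v in values
    )


def closeness(rows):
    """Largest distance between any group's distribution and the overall one."""
    overall = Counter(disease for _, _, disease in rows)
    groups = defaultdict(Counter)
    for race, zipcode, disease in rows:
        groups[(race, zipcode)][disease] += 1
    return max(variational_distance(g, overall) for g in groups.values())


table_t = closeness(ROWS)
# Skewed class from the HIV example: 49 HIV+ / 1 HIV- against a 1% / 99% table.
skewed_t = variational_distance({"HIV+": 49, "HIV-": 1}, {"HIV+": 1, "HIV-": 99})
print(table_t, round(skewed_t, 2))
```

In the example table both groups match the overall distribution exactly, so the distance is 0, whereas the skewed HIV class is about 0.97 away from the overall distribution under this measure.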
k-Anonymous, “t-Close” Dataset

787XX  HIV+  Caucas       Flu
787XX  HIV-  Asian/AfrAm  Flu
787XX  HIV+  Asian/AfrAm  Shingles
787XX  HIV-  Caucas       Acne
787XX  HIV-  Caucas       Shingles
787XX  HIV-  Caucas       Acne

This is k-anonymous, l-diverse and t-close…
…so secure, right?
What Does Attacker Know?
“Bob is Caucasian and I heard he was admitted to hospital with flu…”

787XX  HIV+  Caucas       Flu
787XX  HIV-  Asian/AfrAm  Flu
787XX  HIV+  Asian/AfrAm  Shingles
787XX  HIV-  Caucas       Acne
787XX  HIV-  Caucas       Shingles
787XX  HIV-  Caucas       Acne
What Does Attacker Know?
“Bob is Caucasian and I heard he was admitted to hospital… And I know three other Caucasians admitted to hospital with Acne or Shingles…”

787XX  HIV+  Caucas       Flu
787XX  HIV-  Asian/AfrAm  Flu
787XX  HIV+  Asian/AfrAm  Shingles
787XX  HIV-  Caucas       Acne
787XX  HIV-  Caucas       Shingles
787XX  HIV-  Caucas       Acne
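Both attacker scenarios above amount to filtering the released rows with outside knowledge. The sketch below is illustrative only: the field names `zip`, `hiv`, `race`, and `disease` are assumed labels, since the slide table has no column headers.

```python
# Released rows; the field names are assumed labels for the four columns.
RELEASED = [
    {"zip": "787XX", "hiv": "HIV+", "race": "Caucas", "disease": "Flu"},
    {"zip": "787XX", "hiv": "HIV-", "race": "Asian/AfrAm", "disease": "Flu"},
    {"zip": "787XX", "hiv": "HIV+", "race": "Asian/AfrAm", "disease": "Shingles"},
    {"zip": "787XX", "hiv": "HIV-", "race": "Caucas", "disease": "Acne"},
    {"zip": "787XX", "hiv": "HIV-", "race": "Caucas", "disease": "Shingles"},
    {"zip": "787XX", "hiv": "HIV-", "race": "Caucas", "disease": "Acne"},
]


def candidates(rows, **known):
    """Rows consistent with everything the attacker knows about the target."""
    return [r for r in rows if all(r[k] == v for k, v in known.items())]


# Scenario 1: Bob is Caucasian and was admitted with flu.
scenario_1 = candidates(RELEASED, race="Caucas", disease="Flu")

# Scenario 2: Bob is Caucasian, and the attacker can account for three other
# Caucasian patients who were admitted with Acne or Shingles.
caucasian = candidates(RELEASED, race="Caucas")
others = [r for r in caucasian if r["disease"] in ("Acne", "Shingles")]
scenario_2 = [r for r in caucasian if r not in others]

print(scenario_1)
print(scenario_2)
```

In both scenarios a single row remains, and it carries HIV+, so the attacker learns Bob's status despite the table meeting the syntactic privacy notions.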
Issues with Syntactic Privacy Notions
• Syntactic
  – Focuses on data transformation, not on what can be learned from the anonymized dataset
  – “k-anonymous” dataset can leak sensitive information
• “Quasi-identifier” fallacy
  – Assumes a priori that attacker will not know certain information about his target
  – Any attribute can be a potential quasi-identifier (AOL example)
• Relies on locality
  – Destroys utility of many real-world datasets
Some Takeaways
• “Security requires a particular mindset. Security professionals - at least the good ones - see the world differently. They can't walk into a store without noticing how they might shoplift. They can't vote without trying to figure out how to vote twice. They just can't help it.” – Bruce Schneier (2008)
• Think about how things may fail instead of how they may work
The Adversarial Mindset: Four Key Questions
1. Security/privacy goal: What policy or good state is meant to be enforced?
2. Adversarial model: Who is the adversary? What is the adversary's space of possible actions?
3. Mechanisms: Are the right security mechanisms in place to achieve the security goal given the adversarial model?
4. Incentives: Will human factors and economics favor or disfavor the security goal?