CS573 Data Privacy and Security
Anonymization methods
Li Xiong
Today
• Permutation based anonymization methods (cont.)
• Other privacy principles for microdata publishing
• Statistical databases
Anonymization methods
• Non-perturbative: don't distort the data
  – Generalization
  – Suppression
• Perturbative: distort the data
  – Microaggregation/clustering
  – Additive noise
• Anatomization and permutation
  – De-associate relationship between QID and sensitive attribute
Concept of the Anatomy Algorithm
• Release 2 tables, a quasi-identifier table (QIT) and a sensitive table (ST)
• Use the same QI groups (satisfying l-diversity), and replace the sensitive attribute values with a Group-ID column
• Then produce a sensitive table with Disease statistics

QIT
tuple ID  Age  Sex  Zipcode  Group-ID
1         23   M    11000    1
2         27   M    13000    1
3         35   M    59000    1
4         59   M    12000    1
5         61   F    54000    2
6         65   F    25000    2
7         65   F    25000    2
8         70   F    30000    2

ST
Group-ID  Disease       Count
1         headache      2
1         pneumonia     2
2         bronchitis    1
2         flu           2
2         stomach ache  1
Specifications of Anatomy (cont.)
DEFINITION 3. (Anatomy) Given an l-diverse partition, anatomy creates a QIT and an ST.
The QIT is constructed with the schema: $(A^{qi}_1, A^{qi}_2, \ldots, A^{qi}_d, \text{Group-ID})$
The ST is constructed with the schema: $(\text{Group-ID}, A^s, \text{Count})$
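A minimal Python sketch of how the two tables could be derived, assuming the QI-groups are already given as an l-diverse partition. The function name `anatomize`, the row layout, and the disease values (taken from the microdata table used later in these slides) are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter

def anatomize(rows, groups):
    """Build (QIT, ST) from microdata.

    rows:   dict mapping tuple ID -> (age, sex, zipcode, disease)
    groups: list of QI-groups, each a list of tuple IDs (assumed l-diverse)
    """
    qit, st = [], []
    for group_id, members in enumerate(groups, start=1):
        # QIT: keep the exact QI values, replace the disease by the Group-ID.
        for tid in members:
            age, sex, zipcode, _ = rows[tid]
            qit.append((age, sex, zipcode, group_id))
        # ST: one record per distinct disease in the group, with its count.
        counts = Counter(rows[tid][3] for tid in members)
        for disease, count in sorted(counts.items()):
            st.append((group_id, disease, count))
    return qit, st

# Hypothetical encoding of the 8-tuple example.
rows = {
    1: (23, "M", 11000, "pneumonia"),
    2: (27, "M", 13000, "dyspepsia"),
    3: (35, "M", 59000, "dyspepsia"),
    4: (59, "M", 12000, "pneumonia"),
    5: (61, "F", 54000, "flu"),
    6: (65, "F", 25000, "stomach pain"),
    7: (65, "F", 25000, "flu"),
    8: (70, "F", 30000, "bronchitis"),
}
groups = [[1, 2, 3, 4], [5, 6, 7, 8]]

qit, st = anatomize(rows, groups)
for record in st:
    print(record)
```

The QIT keeps every QI value unchanged, which is what distinguishes anatomy from generalization; only the link between a tuple and its sensitive value is weakened.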
Privacy properties
THEOREM 1. Given a pair of QIT and ST, the probability of correctly inferring the sensitive value of any individual is at most 1/l.

Age  Sex  Zipcode  Group-ID  Disease      Count
23   M    11000    1         dyspepsia    2
23   M    11000    1         pneumonia    2
27   M    13000    1         dyspepsia    2
27   M    13000    1         pneumonia    2
35   M    59000    1         dyspepsia    2
35   M    59000    1         pneumonia    2
59   M    12000    1         dyspepsia    2
59   M    12000    1         pneumonia    2
61   F    54000    2         bronchitis   1
61   F    54000    2         flu          2
61   F    54000    2         stomachache  1
65   F    25000    2         bronchitis   1
65   F    25000    2         flu          2
65   F    25000    2         stomachache  1
65   F    25000    2         bronchitis   1
65   F    25000    2         flu          2
65   F    25000    2         stomachache  1
70   F    30000    2         bronchitis   1
70   F    30000    2         flu          2
70   F    30000    2         stomachache  1
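The bound can be checked directly on an ST. The sketch below assumes the adversary knows only the target's QI-group, so the chance of guessing value v is the count of v divided by the group size. The helper name `max_inference_probability` is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# ST records as (Group-ID, Disease, Count), matching the joined table above.
st = [
    (1, "dyspepsia", 2),
    (1, "pneumonia", 2),
    (2, "bronchitis", 1),
    (2, "flu", 2),
    (2, "stomachache", 1),
]

def max_inference_probability(st):
    """Largest count / group-size ratio over all ST records."""
    group_size = defaultdict(int)
    for group_id, _, count in st:
        group_size[group_id] += count
    return max(Fraction(count, group_size[group_id])
               for group_id, _, count in st)

worst = max_inference_probability(st)
print(worst)  # 1/2, so the tables respect the 1/l bound for l = 2
```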
Comparison with generalization
• Compare with generalization on two assumptions:
  A1: the adversary has the QI-values of the target individual
  A2: the adversary also knows that the individual is definitely in the microdata
• If A1 and A2 are true, anatomy is as good as generalization: the 1/l bound holds true
• If A1 is true and A2 is false, generalization is stronger
• If A1 and A2 are false, generalization is still stronger
Preserving Data Correlation
• Examine the correlation between Age and Disease in T using a probability density function (pdf)
• Example: tuple t1

table 1
tuple ID   Age  Sex  Zipcode  Disease
1 (Bob)    23   M    11000    pneumonia
2          27   M    13000    dyspepsia
3          35   M    59000    dyspepsia
4          59   M    12000    pneumonia
5          61   F    54000    flu
6          65   F    25000    stomach pain
7 (Alice)  65   F    25000    flu
8          70   F    30000    bronchitis
Preserving Data Correlation (cont.)
• To reconstruct an approximate pdf of $t_1$ from the generalization table:

table 2
tuple ID  Age      Sex  Zipcode         Disease
1         [21,60]  M    [10001, 60000]  pneumonia
2         [21,60]  M    [10001, 60000]  dyspepsia
3         [21,60]  M    [10001, 60000]  dyspepsia
4         [21,60]  M    [10001, 60000]  pneumonia
5         [61,70]  F    [10001, 60000]  flu
6         [61,70]  F    [10001, 60000]  stomach pain
7         [61,70]  F    [10001, 60000]  flu
8         [61,70]  F    [10001, 60000]  bronchitis
Preserving Data Correlation (cont.)
• To reconstruct an approximate pdf of $t_1$ from the QIT and ST tables:

QIT
tuple ID  Age  Sex  Zipcode  Group-ID
1         23   M    11000    1
2         27   M    13000    1
3         35   M    59000    1
4         59   M    12000    1
5         61   F    54000    2
6         65   F    25000    2
7         65   F    25000    2
8         70   F    30000    2

ST
Group-ID  Disease       Count
1         headache      2
1         pneumonia     2
2         bronchitis    1
2         flu           2
2         stomach ache  1
Preserving Data Correlation (cont.)
• For a more rigorous comparison, calculate the $L_2$ distance between the reconstructed pdf and the actual pdf
• The distance for anatomy is 0.5, while the distance for generalization is 22.5
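The anatomy figure can be reproduced with a small sketch. The distance equation itself is not reproduced here, so the code assumes the distance is the sum of squared differences between the two pdfs over discrete (Age, Disease) points; under that assumption the anatomy value for t1 (Age 23, pneumonia) comes out to 0.5. The function name `squared_error` is hypothetical.

```python
def squared_error(approx_pdf, actual_pdf):
    """Sum of squared differences between two discrete pdfs (assumed distance)."""
    points = set(approx_pdf) | set(actual_pdf)
    return sum((approx_pdf.get(p, 0.0) - actual_pdf.get(p, 0.0)) ** 2
               for p in points)

# Actual pdf of t1: all probability mass on (Age 23, pneumonia).
actual = {(23, "pneumonia"): 1.0}

# Anatomy keeps Age exact; the ST lists two diseases with count 2 each for
# group 1, so each gets probability 2/4. The label of the non-matching
# disease does not affect the error.
anatomy = {(23, "dyspepsia"): 0.5, (23, "pneumonia"): 0.5}

err = squared_error(anatomy, actual)
print(err)  # 0.5
```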
Preserving Data Correlation (cont.)
• Idea: measure the reconstruction error of each tuple t in T
• Objective: obtain a minimal re-construction error (RCE) over all tuples t in T
• Algorithm: Nearly-Optimal Anatomizing Algorithm
Experiments
• Dataset CENSUS, containing the personal information of 500k American adults, with 9 discrete attributes
• Created two sets of microdata tables
  – Set 1: 5 tables denoted as OCC-3, ..., OCC-7, so that OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute $A^s$
  – Set 2: 5 tables denoted as SAL-3, ..., SAL-7, so that SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute $A^s$
Today
• Permutation based anonymization methods (cont.)
• Other privacy principles for microdata publishing
• Statistical databases
• Differential privacy
Attacks on k-Anonymity
• k-Anonymity does not provide privacy if
  – Sensitive values in an equivalence class lack diversity
  – The attacker has background knowledge

A 3-anonymous patient table
Zipcode  Age  Disease
476**    2*   Heart Disease
476**    2*   Heart Disease
476**    2*   Heart Disease
4790*    ≥40  Flu
4790*    ≥40  Heart Disease
4790*    ≥40  Cancer
476**    3*   Heart Disease
476**    3*   Cancer
476**    3*   Cancer

Homogeneity attack: Bob (Zipcode 47678, Age 27)
Background knowledge attack: Carl (Zipcode 47673, Age 36)
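The homogeneity attack can be illustrated with a short Python sketch that scans a k-anonymous table for equivalence classes whose sensitive values are all identical. The function name `homogeneous_classes` and the tuple layout are assumptions for illustration.

```python
from collections import defaultdict

# The 3-anonymous patient table: (Zipcode, Age, Disease).
table = [
    ("476**", "2*", "Heart Disease"),
    ("476**", "2*", "Heart Disease"),
    ("476**", "2*", "Heart Disease"),
    ("4790*", ">=40", "Flu"),
    ("4790*", ">=40", "Heart Disease"),
    ("4790*", ">=40", "Cancer"),
    ("476**", "3*", "Heart Disease"),
    ("476**", "3*", "Cancer"),
    ("476**", "3*", "Cancer"),
]

def homogeneous_classes(table):
    """Return {quasi-identifier: disease} for classes with one distinct disease."""
    diseases = defaultdict(set)
    for zipcode, age, disease in table:
        diseases[(zipcode, age)].add(disease)
    return {qid: next(iter(values))
            for qid, values in diseases.items() if len(values) == 1}

leaks = homogeneous_classes(table)
print(leaks)  # {('476**', '2*'): 'Heart Disease'}
```

Bob (Zipcode 47678, Age 27) falls into the class `('476**', '2*')`, so his disease is revealed even though the table is 3-anonymous.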
l-Diversity [Machanavajjhala et al. ICDE ‘06]
Sensitive attributes must be “diverse” within each quasi-identifier equivalence class

Caucas       787XX  Flu
Caucas       787XX  Shingles
Caucas       787XX  Acne
Caucas       787XX  Flu
Caucas       787XX  Acne
Caucas       787XX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Acne
Asian/AfrAm  78XXX  Shingles
Asian/AfrAm  78XXX  Acne
Asian/AfrAm  78XXX  Flu
Distinct l-Diversity
• Each equivalence class has at least l well-represented sensitive values
• Doesn’t prevent probabilistic inference attacks
  – Example: in an equivalence class of 10 records, 8 records have HIV and 2 records have other values
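A small Python sketch of why counting distinct values is not enough. It assumes the 2 remaining records carry two different values ("Flu" and "Cancer" are made-up labels), so the class has 3 distinct values while the most frequent one still dominates.

```python
from collections import Counter

def distinct_values(values):
    """Number of distinct sensitive values in an equivalence class."""
    return len(set(values))

def max_frequency(values):
    """Relative frequency of the most common sensitive value."""
    counts = Counter(values)
    return max(counts.values()) / len(values)

# 10 records: 8 with HIV, 2 with other (hypothetical) values.
eq_class = ["HIV"] * 8 + ["Flu", "Cancer"]

print(distinct_values(eq_class))  # 3
print(max_frequency(eq_class))    # 0.8
```

The class passes a distinct 3-diversity check, yet an adversary can still guess HIV for any member with 80% confidence.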
Other Versions of l-Diversity
• Probabilistic l-diversity
  – The frequency of the most frequent value in an equivalence class is bounded by 1/l
• Entropy l-diversity
  – The entropy of the distribution of sensitive values in each equivalence class is at least log(l)
• Recursive (c,l)-diversity
  – $r_1 < c(r_l + r_{l+1} + \dots + r_m)$, where $r_i$ is the frequency of the i-th most frequent value
  – Intuition: the most frequent value does not appear too frequently
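The three definitions above can be sketched as simple checks on one equivalence class. This is a minimal illustration, not a reference implementation: the function names and the sample classes are assumptions, and the recursive check treats a class with fewer than l distinct values as failing.

```python
import math
from collections import Counter

def is_probabilistic_l_diverse(values, l):
    """Most frequent value has relative frequency at most 1/l."""
    counts = Counter(values)
    return max(counts.values()) * l <= len(values)

def is_entropy_l_diverse(values, l, tol=1e-9):
    """Entropy of the sensitive-value distribution is at least log(l)."""
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
    return entropy >= math.log(l) - tol  # tolerance guards against rounding

def is_recursive_cl_diverse(values, c, l):
    """r_1 < c * (r_l + r_{l+1} + ... + r_m), with r sorted in descending order."""
    r = sorted(Counter(values).values(), reverse=True)
    if len(r) < l:
        return False  # assumption: fewer than l distinct values fails
    return r[0] < c * sum(r[l - 1:])

skewed = ["HIV"] * 8 + ["Flu", "Cancer"]        # hypothetical skewed class
balanced = ["Flu", "Acne", "Shingles"] * 2      # hypothetical balanced class

print(is_probabilistic_l_diverse(skewed, 2))    # False
print(is_entropy_l_diverse(balanced, 3))        # True
print(is_recursive_cl_diverse(skewed, 3, 2))    # False
```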
Neither Necessary, Nor Sufficient

Original dataset  Anonymization A  Anonymization B
… Cancer          Q1 Flu           Q1 Flu
… Cancer          Q1 Flu           Q1 Cancer
… Cancer          Q1 Cancer        Q1 Cancer
… Flu             Q1 Flu           Q1 Cancer
… Cancer          Q1 Cancer        Q1 Cancer
… Cancer          Q1 Cancer        Q1 Cancer
… Cancer          Q2 Cancer        Q2 Cancer
… Cancer          Q2 Cancer        Q2 Cancer
… Cancer          Q2 Cancer        Q2 Cancer
… Cancer          Q2 Cancer        Q2 Cancer
… Flu             Q2 Cancer        Q2 Flu
… Flu             Q2 Cancer        Q2 Flu

• Original dataset: 99% have cancer
• Anonymization A: 50% cancer ⇒ quasi-identifier group is “diverse”
• Anonymization B: 99% cancer ⇒ quasi-identifier group is not “diverse”