Anonymization Algorithms - Other techniques, metrics, and extended scenarios
Li Xiong
CS573 Data Privacy and Anonymity
So far
• k-anonymity (protect identity disclosure)
• Anonymization algorithms
  - Generalization and suppression
  - Microaggregation and clustering
• Privacy principles beyond k-anonymity
  - l-diversity, t-closeness (protect attribute disclosure)
  - m-invariance (protect continuous publishing)
Agenda
• Other anonymization techniques
  - Anatomization
• Information metrics
• Extended scenarios
Anonymization methods
• Non-perturbative: don't distort the data
  - Generalization
  - Suppression
• Perturbative: distort the data
  - Microaggregation/clustering
  - Additive noise
• Anatomization and permutation
  - De-associate the relationship between QID and sensitive attribute
Problems with k-anonymity and l-diversity
table 1 (microdata)
tuple ID    Age   Sex   Zipcode   Disease
1 (Bob)     23    M     11000     pneumonia
2           27    M     13000     dyspepsia
3           35    M     59000     dyspepsia
4           59    M     12000     pneumonia
5           61    F     54000     flu
6           65    F     25000     stomach pain
7 (Alice)   65    F     25000     flu
8           70    F     30000     bronchitis
table 2 (generalized)
tuple ID    Age       Sex   Zipcode          Disease
1           [21,60]   M     [10001, 60000]   pneumonia
2           [21,60]   M     [10001, 60000]   dyspepsia
3           [21,60]   M     [10001, 60000]   dyspepsia
4           [21,60]   M     [10001, 60000]   pneumonia
5           [61,70]   F     [10001, 60000]   flu
6           [61,70]   F     [10001, 60000]   stomach pain
7           [61,70]   F     [10001, 60000]   flu
8           [61,70]   F     [10001, 60000]   bronchitis
Query A:
SELECT COUNT(*) FROM Microdata
WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode BETWEEN 10001 AND 20000
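Querying generalized table
• R1 and R2 are the anonymized QID groups
• Q is the query range
• p = Area(R1 ∩ Q) / Area(R1) = (10*10)/(50*40) = 0.05 (Age in years, Zipcode in thousands)
• R1 contains 2 pneumonia tuples, so the estimated answer for A is 2(0.05) = 0.1, while the actual answer on table 1 is 1 (Bob)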
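The arithmetic above can be checked with a short sketch. It assumes, as the slide does, that tuples are spread uniformly over the generalized ranges; the helper name `overlap_fraction` is hypothetical and ranges are treated as inclusive integer intervals.

```python
def overlap_fraction(lo, hi, q_lo, q_hi):
    """Fraction of the inclusive integer range [lo, hi] covered by [q_lo, q_hi]."""
    covered = min(hi, q_hi) - max(lo, q_lo) + 1
    return max(covered, 0) / (hi - lo + 1)

# QI-group R1 of table 2: Age in [21, 60], Zipcode in [10001, 60000].
# Query A: Age <= 30 (lower bound 0 assumed) and Zipcode in [10001, 20000].
p_age = overlap_fraction(21, 60, 0, 30)               # ages 21..30 out of 21..60
p_zip = overlap_fraction(10001, 60000, 10001, 20000)  # 10000 of 50000 zipcodes
p = p_age * p_zip

pneumonia_in_r1 = 2                 # R1 holds two pneumonia tuples
estimate = pneumonia_in_r1 * p      # uniformity assumption inside R1
print(p, estimate)
```

R2 does not contribute because it contains no pneumonia tuple and its Age range lies outside the query.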
Concept of the Anatomy Algorithm
• Release 2 tables: a quasi-identifier table (QIT) and a sensitive table (ST)
• Use the same QI groups (which satisfy l-diversity), and replace the sensitive attribute values with a Group-ID column
• Then produce a sensitive table with Disease statistics per group
QIT
tuple ID   Age   Sex   Zipcode   Group-ID
1          23    M     11000     1
2          27    M     13000     1
3          35    M     59000     1
4          59    M     12000     1
5          61    F     54000     2
6          65    F     25000     2
7          65    F     25000     2
8          70    F     30000     2
ST
Group-ID   Disease        Count
1          dyspepsia      2
1          pneumonia      2
2          bronchitis     1
2          flu            2
2          stomach pain   1
Concept of the Anatomy Algorithm cont.
Given the QIT and ST from the previous slide:
• Does it satisfy k-anonymity? l-diversity?
• Query results?
SELECT COUNT(*) FROM Microdata
WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode BETWEEN 10001 AND 20000
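One way to answer the query from QIT and ST is sketched below. It assumes each tuple in a group carries a given disease with probability Count / group size; the function name `estimate_count` is hypothetical, and the ST rows follow the diseases of table 1.

```python
from collections import Counter

# QIT rows: (tuple_id, age, sex, zipcode, group_id)
qit = [
    (1, 23, "M", 11000, 1), (2, 27, "M", 13000, 1),
    (3, 35, "M", 59000, 1), (4, 59, "M", 12000, 1),
    (5, 61, "F", 54000, 2), (6, 65, "F", 25000, 2),
    (7, 65, "F", 25000, 2), (8, 70, "F", 30000, 2),
]
# ST rows: (group_id, disease) -> count
st = {
    (1, "dyspepsia"): 2, (1, "pneumonia"): 2,
    (2, "bronchitis"): 1, (2, "flu"): 2, (2, "stomach pain"): 1,
}

def estimate_count(qit, st, disease, qi_predicate):
    """Expected number of tuples matching the QI predicate and the disease."""
    group_size = Counter(row[4] for row in qit)
    total = 0.0
    for _, age, sex, zipcode, gid in qit:
        if qi_predicate(age, sex, zipcode):      # exact QI values are available
            total += st.get((gid, disease), 0) / group_size[gid]
    return total

estimate = estimate_count(
    qit, st, "pneumonia",
    lambda age, sex, zipcode: age <= 30 and 10001 <= zipcode <= 20000,
)
print(estimate)
```

Tuples 1 and 2 satisfy the QI predicate, and each has pneumonia with probability 2/4, so the estimate is 1, which matches the actual answer on table 1.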
Specifications of Anatomy
• T is the microdata table to be published
• T has d QI attributes A^qi_1, A^qi_2, ..., A^qi_d and a sensitive attribute A^s
• Each A^qi_i (1 ≤ i ≤ d) is either numerical or categorical, but A^s can only be categorical because of l-diversity
• For a tuple t in T, t[i] (1 ≤ i ≤ d) is the A^qi_i value of t and t[d + 1] is its A^s value
• With the above, t can be regarded as a point in a (d+1)-dimensional data space, denoted DS
Specifications of Anatomy cont.
DEFINITION 1. (Partition/QI-group) A partition consists of several subsets of T such that each tuple of T belongs to exactly one subset.
The subsets are known as QI-groups and are denoted QI_1, QI_2, ..., QI_m
Specifications of Anatomy cont.
DEFINITION 2. (l-diverse partition) A partition is l-diverse if every QI-group QI_j satisfies
c_j(v) / |QI_j| ≤ 1/l
where v is the most frequent sensitive value in QI_j, c_j(v) is the number of tuples in QI_j with sensitive value v, and |QI_j| is the number of tuples in QI_j.
Example: c_1(dyspepsia) = c_1(pneumonia) = 2 and c_2(flu) = 2, with |QI_1| = |QI_2| = 4, so the condition 2/4 ≤ 1/2 holds and the partition is 2-diverse
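Definition 2 can be checked mechanically. The sketch below assumes each QI-group is given as a non-empty list of sensitive values; `is_l_diverse` is a hypothetical helper name.

```python
from collections import Counter

def is_l_diverse(partition, l):
    """True if, in every QI-group, the most frequent sensitive value
    covers at most a 1/l fraction of the group's tuples."""
    for group in partition:
        most_frequent = Counter(group).most_common(1)[0][1]   # c_j(v)
        if most_frequent * l > len(group):   # c_j(v) / |QI_j| > 1/l
            return False
    return True

# Sensitive values of the two QI-groups from the running example.
partition = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "stomach pain", "flu", "bronchitis"],
]
print(is_l_diverse(partition, 2), is_l_diverse(partition, 3))
```

The comparison is done on integers (c_j(v) * l against |QI_j|) to avoid floating-point division.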
Specifications of Anatomy cont.
DEFINITION 3. (Anatomy) Given an l-diverse partition, anatomy creates a QIT table and an ST table
• QIT has the schema (A^qi_1, A^qi_2, ..., A^qi_d, Group-ID)
• ST has the schema (Group-ID, A^s, Count)
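A minimal sketch of Definition 3 follows, assuming the partition is already l-diverse and each tuple is a pair (QI values, sensitive value). The function name `anatomy_tables` and the sorted order of ST rows are illustrative choices, not part of the definition.

```python
from collections import Counter

def anatomy_tables(partition):
    """Build QIT rows (qi_1, ..., qi_d, group_id) and ST rows
    (group_id, sensitive_value, count) from a list of QI-groups."""
    qit, st = [], []
    for group_id, group in enumerate(partition, start=1):
        for qi_values, _sensitive in group:
            qit.append((*qi_values, group_id))       # sensitive value dropped
        counts = Counter(sensitive for _qi, sensitive in group)
        for value, count in sorted(counts.items()):
            st.append((group_id, value, count))
    return qit, st

# The 2-diverse partition of table 1 used in the running example.
partition = [
    [((23, "M", 11000), "pneumonia"), ((27, "M", 13000), "dyspepsia"),
     ((35, "M", 59000), "dyspepsia"), ((59, "M", 12000), "pneumonia")],
    [((61, "F", 54000), "flu"), ((65, "F", 25000), "stomach pain"),
     ((65, "F", 25000), "flu"), ((70, "F", 30000), "bronchitis")],
]
qit, st = anatomy_tables(partition)
for row in st:
    print(row)
```

The QIT keeps exact QI values, which is what lets queries on QI attributes be answered more accurately than on a generalized table.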
Privacy properties
THEOREM 1. Given a pair of QIT and ST, the probability of correctly inferring the sensitive value of any individual is at most 1/l
Natural join of QIT and ST:
Age   Sex   Zipcode   Group-ID   Disease        Count
23    M     11000     1          dyspepsia      2
23    M     11000     1          pneumonia      2
27    M     13000     1          dyspepsia      2
27    M     13000     1          pneumonia      2
35    M     59000     1          dyspepsia      2
35    M     59000     1          pneumonia      2
59    M     12000     1          dyspepsia      2
59    M     12000     1          pneumonia      2
61    F     54000     2          bronchitis     1
61    F     54000     2          flu            2
61    F     54000     2          stomach pain   1
65    F     25000     2          bronchitis     1
65    F     25000     2          flu            2
65    F     25000     2          stomach pain   1
65    F     25000     2          bronchitis     1
65    F     25000     2          flu            2
65    F     25000     2          stomach pain   1
70    F     30000     2          bronchitis     1
70    F     30000     2          flu            2
70    F     30000     2          stomach pain   1
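A short worked instance of the bound, as a sketch that assumes the adversary has no knowledge beyond QIT and ST, so every tuple of a QI-group is equally likely to own each sensitive value listed for that group. For a tuple t in QI_j and a sensitive value v:

```latex
\Pr[\,t[d+1] = v\,] \;=\; \frac{c_j(v)}{|QI_j|} \;\le\; \frac{1}{l},
\qquad\text{e.g. for Bob in } QI_1:\quad
\Pr[\text{pneumonia}] \;=\; \frac{c_1(\text{pneumonia})}{|QI_1|} \;=\; \frac{2}{4} \;=\; \frac{1}{2} \;\le\; \frac{1}{l}\ \text{ with } l = 2 .
```

The inequality is exactly the l-diversity condition of Definition 2 applied to the group that contains t.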
Comparison with generalization
• Compare with generalization under two assumptions:
  - A1: the adversary has the QI-values of the target individual
  - A2: the adversary also knows that the individual is definitely in the microdata
• If A1 and A2 are both true, anatomy is as good as generalization: the 1/l bound holds
• If A1 is true and A2 is false, generalization is stronger
• If A1 and A2 are both false, generalization is still stronger
Preserving Data Correlation
• Examine the correlation between Age and Disease in T using the probability density function (pdf) of each tuple
• Example: t1, the tuple of Bob
table 1
tuple ID    Age   Sex   Zipcode   Disease
1 (Bob)     23    M     11000     pneumonia
2           27    M     13000     dyspepsia
3           35    M     59000     dyspepsia
4           59    M     12000     pneumonia
5           61    F     54000     flu
6           65    F     25000     stomach pain
7 (Alice)   65    F     25000     flu
8           70    F     30000     bronchitis
Preserving Data Correlation cont.
• To re-construct an approximate pdf of t1 from the generalized table:
table 2
tuple ID    Age       Sex   Zipcode          Disease
1           [21,60]   M     [10001, 60000]   pneumonia
2           [21,60]   M     [10001, 60000]   dyspepsia
3           [21,60]   M     [10001, 60000]   dyspepsia
4           [21,60]   M     [10001, 60000]   pneumonia
5           [61,70]   F     [10001, 60000]   flu
6           [61,70]   F     [10001, 60000]   stomach pain
7           [61,70]   F     [10001, 60000]   flu
8           [61,70]   F     [10001, 60000]   bronchitis
Preserving Data Correlation cont.
• To re-construct an approximate pdf of t1 from the QIT and ST tables:
QIT
tuple ID   Age   Sex   Zipcode   Group-ID
1          23    M     11000     1
2          27    M     13000     1
3          35    M     59000     1
4          59    M     12000     1
5          61    F     54000     2
6          65    F     25000     2
7          65    F     25000     2
8          70    F     30000     2
ST
Group-ID   Disease        Count
1          dyspepsia      2
1          pneumonia      2
2          bronchitis     1
2          flu            2
2          stomach pain   1
Preserving Data Correlation cont.
• For a more rigorous comparison, calculate the "L2 distance" between the re-constructed pdf and the actual pdf of t1. The distance for anatomy is 0.5, while the distance for generalization is 22.5
• Anatomy provides better re-constructions of the probability density functions of all tuples
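The anatomy figure of 0.5 can be reproduced with a small sketch. It assumes the distance is the sum of squared differences between the two pdfs over the discrete (Age, Disease) points, without a square root; the helper `l2_distance` is hypothetical. The generalization figure depends on how the generalized pdf is spread over Age in [21,60] and is not reproduced here.

```python
def l2_distance(pdf_a, pdf_b):
    """Sum of squared differences over all points where either pdf is non-zero."""
    points = set(pdf_a) | set(pdf_b)
    return sum((pdf_a.get(x, 0.0) - pdf_b.get(x, 0.0)) ** 2 for x in points)

# Actual pdf of t1 (Bob): all mass on (Age 23, pneumonia).
actual = {(23, "pneumonia"): 1.0}

# Re-constructed from QIT and ST: Age 23 is known exactly, and group 1
# lists dyspepsia and pneumonia with 2 of 4 tuples each.
anatomy = {(23, "dyspepsia"): 0.5, (23, "pneumonia"): 0.5}

distance = l2_distance(actual, anatomy)
print(distance)
```

The two squared terms are (1 - 0.5)^2 and (0 - 0.5)^2, which sum to 0.5.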
Preserving Data Correlation cont.
• Measure the error of each re-constructed pdf by its L2 distance from the actual pdf
• Objective: obtain a minimal re-construction error (RCE) over all tuples t in T
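One plausible way to write the per-tuple error and the RCE, assuming the L2 form described above and assuming the RCE aggregates the per-tuple errors by summation. The symbols G_t (actual pdf of t), \tilde{G}_t (re-constructed pdf of t) and Err_t are illustrative names, not taken from the slides:

```latex
\mathrm{Err}_t \;=\; \int_{x \in DS} \bigl(\tilde{G}_t(x) - G_t(x)\bigr)^2 \, dx ,
\qquad
\mathrm{RCE} \;=\; \sum_{t \in T} \mathrm{Err}_t .
```

For discrete attributes the integral reduces to a sum over the points of DS.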
Nearly-Optimal Anatomizing Algorithm
• The authors propose an efficient algorithm for anatomizing tables that minimizes the RCE
• The resulting QIT and ST achieve an RCE that deviates from the lower bound only by a factor < 1 + 1/n, where n is the size of T
• The algorithm has linear I/O complexity O(n/b), where b is the page size
Nearly-Optimal Anatomizing Algorithm cont.
PROPERTY 1. At the end of the group-creation phase, each non-empty bucket has only one tuple.
PROPERTY 2. The set S' always includes at least one QI-group.
PROPERTY 3. After the residue-assignment phase, each QI-group has at least l tuples with distinct sensitive attribute values.
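The slides name the two phases but do not list the steps. The sketch below is one reading of them: tuples are hashed into buckets by sensitive value, groups are formed from the l largest buckets, and leftover tuples go to a group that lacks their sensitive value (the candidate list plays the role of S'). The function name, the tie-breaking, and the choice of the first eligible group are assumptions, and the sketch ignores I/O cost.

```python
from collections import defaultdict

def anatomize(tuples, l):
    """tuples: list of (qi_values, sensitive_value); returns a list of QI-groups."""
    buckets = defaultdict(list)
    for t in tuples:
        buckets[t[1]].append(t)              # one bucket per sensitive value

    groups = []
    # Group-creation phase: while at least l buckets are non-empty,
    # remove one tuple from each of the l largest buckets.
    while sum(1 for b in buckets.values() if b) >= l:
        largest = sorted((v for v in buckets if buckets[v]),
                         key=lambda v: len(buckets[v]), reverse=True)[:l]
        groups.append([buckets[v].pop() for v in largest])

    # Residue-assignment phase: place each leftover tuple in a group
    # that does not yet contain its sensitive value.
    for value, bucket in buckets.items():
        for t in bucket:
            candidates = [g for g in groups if all(u[1] != value for u in g)]
            if not candidates:
                raise ValueError("no eligible QI-group; a sensitive value is too frequent")
            candidates[0].append(t)          # assumption: first eligible group
    return groups

table1 = [
    ((23, "M", 11000), "pneumonia"), ((27, "M", 13000), "dyspepsia"),
    ((35, "M", 59000), "dyspepsia"), ((59, "M", 12000), "pneumonia"),
    ((61, "F", 54000), "flu"), ((65, "F", 25000), "stomach pain"),
    ((65, "F", 25000), "flu"), ((70, "F", 30000), "bronchitis"),
]
groups = anatomize(table1, l=2)
for g in groups:
    print([s for _qi, s in g])
```

On table 1 with l = 2 this yields groups of two tuples each, so the partition differs from the two four-tuple groups used in the earlier slides while still meeting Property 3.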
Experiments
• Dataset CENSUS: personal information of 500k American adults, with 9 discrete attributes
• Created two sets of microdata tables
  - Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute A^s
  - Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute A^s
Conclusion
• Anatomy was designed to overcome the information loss of generalization while still preserving privacy
• Anatomy has a significantly lower error than generalization
• Several items require further research
  - Multiple sensitive attributes
  - Effective mining of patterns in microdata
Agenda
• Other anonymization techniques
  - Anatomization
• Information metrics
• Extended scenarios
Information Metrics
• General purpose metrics
• Special purpose metrics
• Trade-off metrics