Anonymization Algorithms - Other techniques, metrics, and extended - PowerPoint PPT Presentation

Anonymization Algorithms - Other techniques, metrics, and extended scenarios Li Xiong CS573 Data Privacy and Anonymity

So far
• k-anonymity (protects against identity disclosure)
• Anonymization algorithms
  - Generalization and suppression
  - Microaggregation and clustering
• Privacy principles beyond k-anonymity
  - l-diversity, t-closeness (protect against attribute disclosure)
  - m-invariance (protects continuous publishing)

Agenda
• Other anonymization techniques
  - Anatomization
• Information metrics
• Extended scenarios

Anonymization methods
• Non-perturbative: don't distort the data
  - Generalization
  - Suppression
• Perturbative: distort the data
  - Microaggregation/clustering
  - Additive noise
• Anatomization and permutation
  - De-associate the relationship between the QID and the sensitive attribute

Problems with k-anonymity and l-diversity

Table 1 (microdata)
tuple ID    Age  Sex  Zipcode  Disease
1 (Bob)     23   M    11000    pneumonia
2           27   M    13000    dyspepsia
3           35   M    59000    dyspepsia
4           59   M    12000    pneumonia
5           61   F    54000    flu
6           65   F    25000    stomach pain
7 (Alice)   65   F    25000    flu
8           70   F    30000    bronchitis

Table 2 (generalized)
tuple ID  Age      Sex  Zipcode         Disease
1         [21,60]  M    [10001, 60000]  pneumonia
2         [21,60]  M    [10001, 60000]  dyspepsia
3         [21,60]  M    [10001, 60000]  dyspepsia
4         [21,60]  M    [10001, 60000]  pneumonia
5         [61,70]  F    [10001, 60000]  flu
6         [61,70]  F    [10001, 60000]  stomach pain
7         [61,70]  F    [10001, 60000]  flu
8         [61,70]  F    [10001, 60000]  bronchitis

Query A: SELECT COUNT(*) FROM Microdata WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode BETWEEN 10001 AND 20000

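Querying generalized table
• R1 and R2 are the anonymized QI groups, viewed as rectangles over Age and Zipcode
• RQ is the rectangle covered by the query range Q
• p = Area(R1 ∩ RQ) / Area(R1) = (10 × 10) / (50 × 40) = 0.05
• Estimated answer for A: 2 × 0.05 = 0.1, where 2 is the number of pneumonia tuples in R1 (the actual answer on Table 1 is 1)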
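The estimate above can be reproduced with a short sketch. It assumes tuples are spread uniformly inside each generalized rectangle, treats Age [21,60] as an interval of width 40 and Zipcode [10001, 60000] as width 50 (in thousands), and uses hypothetical names (`overlap_fraction`, `r1`, `query_a`) that do not come from the slides.

```python
def overlap_fraction(region, query):
    """Fraction of a rectangular QI region covered by the query rectangle.

    Assumes values are uniformly distributed inside the region.
    Each argument is a list of (low, high) intervals, one per attribute.
    """
    fraction = 1.0
    for (r_lo, r_hi), (q_lo, q_hi) in zip(region, query):
        overlap = max(0.0, min(r_hi, q_hi) - max(r_lo, q_lo))
        fraction *= overlap / (r_hi - r_lo)
    return fraction


# Age in years, Zipcode in thousands; widths 40 and 50 as on the slide.
r1 = [(20, 60), (10, 60)]
# Query A: Age <= 30 (no lower bound, so 0 is used) and Zipcode in [10001, 20000].
query_a = [(0, 30), (10, 20)]

p = overlap_fraction(r1, query_a)
pneumonia_in_r1 = 2              # count taken from the generalized table
estimate = pneumonia_in_r1 * p
print(p, estimate)
```

The gap between this estimate and the true count of 1 is the information loss that motivates anatomy.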

Concept of the Anatomy Algorithm
• Release 2 tables: a quasi-identifier table (QIT) and a sensitive table (ST)
• Use the same QI groups (which satisfy l-diversity), and replace the sensitive attribute values in the QIT with a Group-ID column
• Then produce a sensitive table with per-group Disease statistics

QIT
tuple ID  Age  Sex  Zipcode  Group-ID
1         23   M    11000    1
2         27   M    13000    1
3         35   M    59000    1
4         59   M    12000    1
5         61   F    54000    2
6         65   F    25000    2
7         65   F    25000    2
8         70   F    30000    2

ST
Group-ID  Disease       Count
1         dyspepsia     2
1         pneumonia     2
2         bronchitis    1
2         flu           2
2         stomach ache  1

Concept of the Anatomy Algorithm
Given the QIT and ST from the previous slide:
• Does the release satisfy k-anonymity? l-diversity?
• What does it return for Query A? SELECT COUNT(*) FROM Microdata WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode BETWEEN 10001 AND 20000
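One way to answer Query A from the anatomized release is sketched below. Because the QIT keeps exact QI values, the QI predicates can be evaluated per tuple, and each matching tuple contributes the fraction of its group that carries the queried disease. The function name `estimate_count` and the data layout are illustrative assumptions, and group 1's diseases follow Table 1 (dyspepsia and pneumonia).

```python
from collections import Counter

# QIT rows: (tuple_id, age, sex, zipcode, group_id)
QIT = [
    (1, 23, "M", 11000, 1), (2, 27, "M", 13000, 1),
    (3, 35, "M", 59000, 1), (4, 59, "M", 12000, 1),
    (5, 61, "F", 54000, 2), (6, 65, "F", 25000, 2),
    (7, 65, "F", 25000, 2), (8, 70, "F", 30000, 2),
]
# ST: group_id -> {disease: count}
ST = {
    1: {"dyspepsia": 2, "pneumonia": 2},
    2: {"bronchitis": 1, "flu": 2, "stomach ache": 1},
}


def estimate_count(qit, st, disease, qi_predicate):
    """Expected number of tuples matching the QI predicate and the disease."""
    group_sizes = Counter(row[4] for row in qit)
    total = 0.0
    for row in qit:
        if qi_predicate(row):
            group_id = row[4]
            # Probability that this tuple carries the disease, given its group.
            total += st[group_id].get(disease, 0) / group_sizes[group_id]
    return total


answer = estimate_count(
    QIT, ST, "pneumonia",
    lambda r: r[1] <= 30 and 10001 <= r[3] <= 20000,
)
print(answer)
```

Tuples 1 and 2 satisfy the QI predicates and each has a 2/4 chance of pneumonia, so the estimate matches the true count of 1 on Table 1.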

Specifications of Anatomy
• T is the microdata table to be published
• T has d QI attributes A^qi_1, A^qi_2, ..., A^qi_d and a sensitive attribute A^s
• Each A^qi_i (1 ≤ i ≤ d) is either numerical or categorical, but A^s can only be categorical because of l-diversity
• For a tuple t in T, t[i] (1 ≤ i ≤ d) is its A^qi_i value and t[d + 1] is its A^s value
• With the above, t can be regarded as a point in a (d + 1)-dimensional data space DS

Specifications of Anatomy cont.
DEFINITION 1. (Partition/QI-group) A partition consists of several subsets of T such that each tuple belongs to exactly one subset. The subsets are known as QI-groups and are denoted QI_1, QI_2, ..., QI_m.

Specifications of Anatomy cont.
DEFINITION 2. (l-diverse partition) A partition is l-diverse if every QI-group QI_j satisfies
    c_j(v) / |QI_j| ≤ 1/l
where v is the most frequent sensitive value in QI_j, c_j(v) is the number of tuples in QI_j with value v, and |QI_j| is the number of tuples in QI_j.
Example: c_1(dyspepsia) = c_1(pneumonia) = 2, c_2(flu) = 2, and |QI_1| = |QI_2| = 4, so the condition holds with 2/4 ≤ 1/2.
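Definition 2 translates directly into a check on the sensitive values of each group. The sketch below assumes a partition is given as a list of non-empty lists of sensitive values; `is_l_diverse` is a hypothetical helper name.

```python
from collections import Counter


def is_l_diverse(partition, l):
    """True if c_j(v) / |QI_j| <= 1/l for the most frequent value v of every group."""
    for group in partition:
        most_frequent_count = Counter(group).most_common(1)[0][1]
        # Compare with integers to avoid floating-point division.
        if most_frequent_count * l > len(group):
            return False
    return True


# Sensitive values of the two QI-groups from Table 1.
partition = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "stomach pain", "flu", "bronchitis"],
]
print(is_l_diverse(partition, 2), is_l_diverse(partition, 3))
```

On the running example the partition passes for l = 2 but not for l = 3, since the most frequent value covers half of each group.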

Specifications of Anatomy cont.
DEFINITION 3. (Anatomy) Given an l-diverse partition, anatomy creates a QIT table and an ST table.
• The QIT has the schema (A^qi_1, A^qi_2, ..., A^qi_d, Group-ID)
• The ST has the schema (Group-ID, A^s, Count)
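As an illustration of Definition 3, the following sketch builds both tables from a partition that is already l-diverse. It assumes each tuple stores its QI values first and its sensitive value last; `build_qit_st` is a hypothetical name, and group IDs are simply assigned in list order.

```python
from collections import Counter


def build_qit_st(partition):
    """Return (QIT, ST) for a partition given as a list of QI-groups.

    QIT rows are (qi_1, ..., qi_d, group_id).
    ST rows are (group_id, sensitive_value, count).
    """
    qit, st = [], []
    for group_id, group in enumerate(partition, start=1):
        for *qi_values, _sensitive in group:
            qit.append((*qi_values, group_id))
        counts = Counter(t[-1] for t in group)
        for value, count in sorted(counts.items()):
            st.append((group_id, value, count))
    return qit, st


# The two QI-groups of Table 1: (Age, Sex, Zipcode, Disease).
partition = [
    [(23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
     (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia")],
    [(61, "F", 54000, "flu"), (65, "F", 25000, "stomach pain"),
     (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis")],
]
qit, st = build_qit_st(partition)
for row in st:
    print(row)
```

Note that the QIT keeps the exact QI values, which is what distinguishes anatomy from generalization.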

Privacy properties
THEOREM 1. Given a pair of QIT and ST, an adversary can infer the sensitive value of any individual with probability at most 1/l.

Joining the QIT and ST on Group-ID:
Age  Sex  Zipcode  Group-ID  Disease      Count
23   M    11000    1         dyspepsia    2
23   M    11000    1         pneumonia    2
27   M    13000    1         dyspepsia    2
27   M    13000    1         pneumonia    2
35   M    59000    1         dyspepsia    2
35   M    59000    1         pneumonia    2
59   M    12000    1         dyspepsia    2
59   M    12000    1         pneumonia    2
61   F    54000    2         bronchitis   1
61   F    54000    2         flu          2
61   F    54000    2         stomachache  1
65   F    25000    2         bronchitis   1
65   F    25000    2         flu          2
65   F    25000    2         stomachache  1
65   F    25000    2         bronchitis   1
65   F    25000    2         flu          2
65   F    25000    2         stomachache  1
70   F    30000    2         bronchitis   1
70   F    30000    2         flu          2
70   F    30000    2         stomachache  1
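The bound in Theorem 1 can be checked on the running example. In the joined table, an individual in group g is linked to a disease with probability Count / |group g|, so the worst case over all ST rows is the quantity of interest. The sketch below uses exact fractions; `max_inference_probability` is a hypothetical name.

```python
from collections import Counter
from fractions import Fraction

# Group-ID column of the QIT (tuples 1 to 8) and the ST rows.
qit_group_ids = [1, 1, 1, 1, 2, 2, 2, 2]
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "stomachache", 1),
]


def max_inference_probability(group_ids, st_rows):
    """Largest probability with which any tuple is linked to a sensitive value."""
    group_sizes = Counter(group_ids)
    return max(Fraction(count, group_sizes[g]) for g, _value, count in st_rows)


worst_case = max_inference_probability(qit_group_ids, st)
print(worst_case)
```

For this 2-diverse release the worst case equals 1/2, which meets the 1/l bound with equality.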

Comparison with generalization
• Compare with generalization under two assumptions:
  - A1: the adversary has the QI values of the target individual
  - A2: the adversary also knows that the individual is definitely in the microdata
• If A1 and A2 are both true, anatomy is as good as generalization: the 1/l bound holds
• If A1 is true and A2 is false, generalization is stronger
• If A1 and A2 are both false, generalization is still stronger

Preserving Data Correlation
• Examine the correlation between Age and Disease in T using the probability density function (pdf) of each tuple
• Example: t1, which is tuple 1 (Bob: Age 23, Sex M, Zipcode 11000, Disease pneumonia) in Table 1

Preserving Data Correlation cont.
• To reconstruct an approximate pdf of t1 from the generalized table (Table 2), use its generalized row: Age [21,60], Sex M, Zipcode [10001, 60000], Disease pneumonia

Preserving Data Correlation cont.
• To reconstruct an approximate pdf of t1 from the QIT and ST tables, use its QIT row (Age 23, Sex M, Zipcode 11000, Group-ID 1) together with the ST rows of group 1, which list two diseases with Count 2 each, one of them pneumonia

Preserving Data Correlation cont.
• For a more rigorous comparison, calculate the "L2 distance" between each reconstructed pdf of t1 and its actual pdf. The distance is 0.5 for anatomy and 22.5 for generalization.
• Anatomy provides better reconstructions of the probability density functions of all tuples.
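The anatomy figure can be reproduced under a simple assumption: treat (Age, Disease) as a discrete domain and take the distance to be the sum of squared differences between the two pdfs. This is only a sketch with hypothetical names (`squared_l2`, `actual`, `anatomy`). It does not attempt the 22.5 figure for generalization, which depends on how the Age axis is handled in the slide's formula.

```python
def squared_l2(p, q):
    """Sum of squared differences between two discrete pdfs given as dicts."""
    points = set(p) | set(q)
    return sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in points)


# Actual pdf of t1 (Bob): all mass on (Age 23, pneumonia).
actual = {(23, "pneumonia"): 1.0}

# Reconstruction from QIT and ST: Age is exact, and the Disease of a
# group-1 tuple is split 2/4 and 2/4 between the two values in the ST.
anatomy = {(23, "pneumonia"): 0.5, (23, "dyspepsia"): 0.5}

distance = squared_l2(actual, anatomy)
print(distance)
```

Under these assumptions the result is 0.25 + 0.25 = 0.5, in line with the anatomy figure quoted above.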

Preserving Data Correlation cont.
• Measure the error of each reconstructed pdf with the same L2 distance
• Objective: minimize the re-construction error (RCE), the total error over all tuples t in T

Nearly-Optimal Anatomizing Algorithm
• The authors of Anatomy propose an efficient algorithm for anatomizing tables that minimizes the RCE
• The resulting QIT and ST achieve an RCE that deviates from the lower bound by a factor of less than 1 + 1/n, where n is the size of T
• The algorithm has linear I/O complexity O(n/b), where b is the page size

Nearly-Optimal Anatomizing Algorithm cont.
PROPERTY 1. At the end of the group-creation phase, each non-empty bucket has only one tuple.
PROPERTY 2. The set S' always includes at least one QI-group.
PROPERTY 3. After the residue-assignment phase, each QI-group has at least l tuples with distinct sensitive attribute values.
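The properties above refer to a group-creation phase over buckets and a residue-assignment phase. The following in-memory sketch shows one way such a two-phase procedure could look. It rests on assumptions not spelled out in the slides: tuples are bucketed by sensitive value, each new group takes one tuple from each of the l currently largest buckets, and every leftover tuple joins a group that does not yet contain its sensitive value (the candidate set playing the role of S'). The published algorithm picks tuples and groups at random and is I/O-efficient. This sketch is deterministic and ignores I/O.

```python
from collections import defaultdict


def anatomize(tuples, l):
    """Partition tuples (sensitive value last) into groups of distinct sensitive values."""
    buckets = defaultdict(list)
    for t in tuples:
        buckets[t[-1]].append(t)

    groups = []
    # Group-creation phase: run while at least l buckets are non-empty.
    while sum(1 for b in buckets.values() if b) >= l:
        non_empty = [v for v in buckets if buckets[v]]
        largest = sorted(non_empty, key=lambda v: len(buckets[v]), reverse=True)[:l]
        groups.append([buckets[v].pop() for v in largest])

    # Residue-assignment phase: place each leftover tuple in a group
    # that does not already contain its sensitive value.
    for value, bucket in buckets.items():
        for t in bucket:
            candidates = [g for g in groups if all(u[-1] != value for u in g)]
            if not candidates:
                raise ValueError("no eligible group; input may not admit l-diversity")
            candidates[0].append(t)
    return groups


table1 = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"), (65, "F", 25000, "stomach pain"),
    (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis"),
]
groups = anatomize(table1, l=2)
residue_demo = anatomize([(1, "a"), (2, "a"), (3, "b"), (4, "b"), (5, "c")], l=2)
```

The groups produced here need not match the two QI-groups used in the earlier slides, since those follow the generalized table rather than this procedure.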

Experiments
• The CENSUS dataset contains the personal information of 500k American adults, with 9 discrete attributes
• Two sets of microdata tables were created:
  - Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI attributes and Occupation as the sensitive attribute A^s
  - Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI attributes and Salary-class as the sensitive attribute A^s

Conclusion
• Anatomy was designed to overcome the information loss of generalization while still preserving privacy
• Anatomy has a significantly lower error rate than generalization
• Several items require further research
  - Multiple sensitive attributes
  - Effective mining of patterns in microdata

Agenda
• Other anonymization techniques
  - Anatomization
• Information metrics
• Extended scenarios

Information Metrics
• General purpose metrics
• Special purpose metrics
• Trade-off metrics
