K-Anonymity & Algorithms
CompSci 590.03
Instructor: Ashwin Machanavajjhala
Lecture 3 : 590.03 Fall 12

Announcements
• Project ideas are posted on the site.
  – You are welcome to send me (or talk to me about) your own ideas.

Outline
• K-Anonymity: a metric for anonymity for data publishing [Sweeney IJUFKS 2002]
• Algorithms for K-anonymous data publishing
  – Generalization/Suppression [Lefevre et al SIGMOD 2006]
  – Curse of Dimensionality [Aggarwal VLDB 2005]

Offline Data Publishing
[Figure: Database, Researcher]
• Microdata: data at the granularity of individuals

Sample Microdata
SSN           Zip    Age  Nationality  Disease
631-35-1210   13053  28   Russian      Heart
051-34-1430   13068  29   American     Heart
120-30-1243   13068  21   Japanese     Viral
070-97-2432   13053  23   American     Viral
238-50-0890   14853  50   Indian       Cancer
265-04-1275   14853  55   Russian      Heart
574-22-0242   14850  47   American     Viral
388-32-1539   14850  59   American     Viral
005-24-3424   13053  31   American     Cancer
248-223-2956  13053  37   Indian       Cancer
221-22-9713   13068  36   Japanese     Cancer
615-84-1924   13068  32   American     Cancer

Removing SSN …
Zip    Age  Nationality  Disease
13053  28   Russian      Heart
13068  29   American     Heart
13068  21   Japanese     Viral
13053  23   American     Viral
14853  50   Indian       Cancer
14853  55   Russian      Heart
14850  47   American     Viral
14850  59   American     Viral
13053  31   American     Cancer
13053  37   Indian       Cancer
13068  36   Japanese     Cancer
13068  32   American     Cancer

The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]
• Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
• Voter List: Name, Address, Date Registered, Party affiliation, Date last voted
• Quasi Identifier (attributes shared by both): Zip, Birth date, Sex
• Governor of MA uniquely identified using ZipCode, Birth Date, and Sex.
• 87% of the US population is uniquely identifiable by the same three attributes.

Linkage Attacks
Zip    Age  Nationality  Disease
13053  28   Russian      Heart
13068  29   American     Heart
13068  21   Japanese     Viral
13053  23   American     Viral
14853  50   Indian       Cancer
14853  55   Russian      Heart
14850  47   American     Viral
14850  59   American     Viral
13053  31   American     Cancer
13053  37   Indian       Cancer
13068  36   Japanese     Cancer
13068  32   American     Cancer
• Quasi-Identifier: Zip, Age, Nationality
• Public Information

We saw examples in the last class
• Massachusetts governor attack
• AOL privacy breach
• Netflix attack
• Social Network attacks

K-Anonymity [Samarati et al, PODS 1998]
• Generalize, modify, or distort quasi-identifier values so that no individual is uniquely identifiable within a group of k.
• In SQL, table T is k-anonymous if every count returned by SELECT COUNT(*) FROM T GROUP BY Quasi-Identifier is ≥ k.
• Parameter k indicates the “degree” of anonymity.
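
The GROUP BY condition above can be checked directly in code. The following is a minimal sketch, assuming rows are plain tuples and the quasi-identifier is given as column positions; the function name `is_k_anonymous` and the hand-generalized rows are illustrative and do not come from the lecture.

```python
from collections import Counter

def is_k_anonymous(rows, qi_columns, k):
    """True if every quasi-identifier combination appears in at least k rows."""
    groups = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(count >= k for count in groups.values())

# Columns: zip, age, nationality, disease (first four rows of the sample microdata).
raw = [
    ("13053", "28", "Russian", "Heart"),
    ("13068", "29", "American", "Heart"),
    ("13068", "21", "Japanese", "Viral"),
    ("13053", "23", "American", "Viral"),
]
# The same rows after a hand-made (assumed) coarsening of the quasi-identifier.
generalized = [
    ("130**", "<30", "*", "Heart"),
    ("130**", "<30", "*", "Heart"),
    ("130**", "<30", "*", "Viral"),
    ("130**", "<30", "*", "Viral"),
]
QI = (0, 1, 2)  # positions of zip, age, nationality

print(is_k_anonymous(raw, QI, 2))          # every raw row is unique on the QI
print(is_k_anonymous(generalized, QI, 4))  # one group of four identical QI values
```

Grouping with a `Counter` mirrors `SELECT COUNT(*) ... GROUP BY`: the table passes only if the smallest group has at least k rows.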

Example 1: Generalization (Coarsening)
Original                    Generalized
Zip    Age  Nationality     Zip    Age    Nationality  Disease
13053  28   Russian         130**  <30    *            Heart
13068  29   American        130**  <30    *            Heart
13068  21   Japanese        130**  <30    *            Flu
13053  23   American        130**  <30    *            Flu
14853  50   Indian          1485*  >40    *            Cancer
14853  55   Russian         1485*  >40    *            Heart
14850  47   American        1485*  >40    *            Flu
14850  59   American        1485*  >40    *            Flu
13053  31   American        130**  30-40  *            Cancer
13053  37   Indian          130**  30-40  *            Cancer
13068  36   Japanese        130**  30-40  *            Cancer
13068  32   American        130**  30-40  *            Cancer
• Equivalence Class: group of k-anonymous records that share the same value for the quasi-identifier attributes.

Example 2: Clustering

Example 3: Microaggregation
Original
Zip    Age  Nationality  Disease
13053  28   Russian      Heart
13068  29   American     Heart
13068  21   Japanese     Flu
13053  23   American     Flu
14853  50   Indian       Cancer
14853  55   Russian      Heart
14850  47   American     Flu
14850  59   American     Flu
13053  31   American     Cancer
13053  37   Indian       Cancer
13068  36   Japanese     Cancer
13068  32   American     Cancer
Microaggregated
• 4 tuples: Zip code = 130**, 21 ≤ Age ≤ 29, Average(age) = 25; 2 Heart and 2 Flu
• 4 tuples: Zip = 1485*, 47 ≤ Age ≤ 59, Average(age) = 53; 1 Cancer, 1 Heart and 2 Flu
• 4 tuples: Zip = 130**, 31 ≤ Age ≤ 37, Average(age) = 34; all Cancer patients
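
The per-group summaries in this example can be computed mechanically. The sketch below assumes each group is a list of (age, disease) pairs that has already been formed; the function name `microaggregate` and the output field names are illustrative, and how the groups are chosen is outside its scope.

```python
from collections import Counter

def microaggregate(group):
    """Summarize one group of (age, disease) records by aggregate statistics."""
    ages = [age for age, _ in group]
    return {
        "tuples": len(group),
        "age_range": (min(ages), max(ages)),
        "age_mean": sum(ages) / len(ages),
        "diseases": dict(Counter(disease for _, disease in group)),
    }

# Third group of the example: the four patients aged 31 to 37.
group = [(31, "Cancer"), (37, "Cancer"), (36, "Cancer"), (32, "Cancer")]
summary = microaggregate(group)
print(summary)
```

Only the aggregates are published, so the individual ages and nationalities within a group are no longer visible.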

K-Anonymity
• Joining the published data to an external dataset using quasi-identifiers results in at least k records per quasi-identifier combination.
• What is a quasi-identifier?
  – Combination of attributes (that an adversary may know) that uniquely identify a large fraction of the population.
  – There can be many sets of quasi-identifiers. If Q = {B, Z, S} is a quasi-identifier, then Q + {N} is also a quasi-identifier.
  – Need to guarantee k-anonymity against the largest set of quasi-identifiers.
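
To illustrate the join described above, here is a small sketch of an adversary who knows a target's zip code and age and looks for consistent published rows. The row layout (a zip pattern, an age interval, a disease), the `target` record, and the function names are assumptions made for this example.

```python
def matches(row, target):
    """True if a published row is consistent with what the adversary knows."""
    zip_ok = all(p in ("*", v) for p, v in zip(row["zip"], target["zip"]))
    low, high = row["age"]
    return zip_ok and low <= target["age"] <= high

def candidates(published, target):
    """Published rows the target could correspond to."""
    return [row for row in published if matches(row, target)]

# Exact values are written as degenerate intervals so both tables share one format.
raw = [
    {"zip": "13053", "age": (28, 28), "disease": "Heart"},
    {"zip": "13068", "age": (29, 29), "disease": "Heart"},
    {"zip": "13068", "age": (21, 21), "disease": "Flu"},
    {"zip": "13053", "age": (23, 23), "disease": "Flu"},
]
generalized = [dict(row, zip="130**", age=(0, 29)) for row in raw]

target = {"zip": "13053", "age": 28}  # hypothetical external knowledge

print(len(candidates(raw, target)))          # a single match re-identifies the target
print(len(candidates(generalized, target)))  # four rows remain indistinguishable
```

With the raw table the join returns one row, so the disease is disclosed; with the generalized table it returns k = 4 rows.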

Outline
• K-Anonymity: a metric for anonymity for data publishing [Sweeney IJUFKS 2002]
• Algorithms for K-anonymous data publishing
  – Generalization/Suppression [Lefevre et al SIGMOD 2006]
  – Curse of Dimensionality [Aggarwal VLDB 2005]

Generalization
• Generalization step: coarsen (or suppress) an attribute to a more general value.
• Numeric Values
  – Suppress the least significant digits: 12345 -> 1234* -> 123**
  – Ranges: 23 -> [20-25]; (30.5N 20.3E) -> box(30N-31N,20E-22E)
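
Both numeric generalizations can be written as small helper functions. This is a sketch only: the names `suppress_digits` and `to_range` are illustrative, and the choice of fixed-width buckets aligned at multiples of the width is an assumption.

```python
def suppress_digits(value, n):
    """Replace the n least significant characters of a digit string with '*'."""
    n = max(0, min(n, len(value)))
    return value[:len(value) - n] + "*" * n

def to_range(x, width):
    """Map an integer to the (low, high) bounds of its fixed-width bucket."""
    low = (x // width) * width
    return (low, low + width)

print(suppress_digits("12345", 1))
print(suppress_digits("12345", 2))
print(to_range(23, 5))
```

Each call to `suppress_digits` with a larger `n` corresponds to one further generalization step on the same attribute.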

Generalization
• Generalization step: coarsen (or suppress) an attribute to a more general value.
• Categorical Values
  – Domain Generalization Hierarchies: State-gov occupation -> Government occupation -> Workclass
  – Generalizing to the top of the hierarchy is equivalent to suppressing the value.
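
A domain generalization hierarchy can be stored as a child-to-parent map. In the sketch below only the chain State-gov -> Government -> top comes from the slide; the other entries (`Federal-gov`, `Private`), the use of `*` for the top level, and the function name are assumptions for illustration.

```python
# Child -> parent map; "*" stands for the fully suppressed top level.
PARENT = {
    "State-gov": "Government",
    "Federal-gov": "Government",  # hypothetical sibling value
    "Government": "*",
    "Private": "*",               # hypothetical value
}

def generalize(value, steps=1):
    """Walk `steps` levels up the hierarchy; unknown values and '*' map to '*'."""
    for _ in range(steps):
        value = PARENT.get(value, "*")
    return value

print(generalize("State-gov"))
print(generalize("State-gov", 2))
```

Reaching `*` carries no information about the original value, which is why the top of the hierarchy behaves like suppression.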

Full Domain vs Local Generalization
• Full Domain: Generalize all values in an attribute to the same “level”
  – Every occurrence of 12345 is replaced with 1234* in the database.
  – Answering queries on such datasets is easier.
• Local Generalization: Values can be generalized to different levels.
  – 12345 in one tuple may be generalized to 1234*, and in another tuple entirely suppressed.
  – Allows k-anonymous datasets with less information loss.

Generalization Lattice
• Generalization step D -> D’: D’ is constructed from D using one generalization step.
• Bottom of the lattice (D):
  Nationality  Zip
  American     1306*
  Japanese     1305*
  Japanese     1485*
• D after "Suppress nationality":
  Nationality  Zip
  *            1306*
  *            1305*
  *            1485*
• D after "Suppress tens digit of Zip":
  Nationality  Zip
  American     130**
  Japanese     130**
  Japanese     148**
• Top of the lattice (both steps applied, in either order):
  Nationality  Zip
  *            130**
  *            130**
  *            148**
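
The four-node lattice above can be encoded by giving each attribute a generalization level. The following sketch assumes a node is a pair (nationality level, zip level) with levels 0 or 1; the function names are illustrative.

```python
def apply_node(table, node):
    """Generalize (nationality, zip) rows according to a lattice node."""
    suppress_nationality, suppress_zip_tens = node
    result = []
    for nationality, zip_code in table:
        if suppress_nationality:
            nationality = "*"
        if suppress_zip_tens:
            zip_code = zip_code[:3] + "**"
        result.append((nationality, zip_code))
    return result

def successors(node):
    """All nodes reachable from `node` by exactly one generalization step."""
    return [
        node[:i] + (1,) + node[i + 1:]
        for i, level in enumerate(node)
        if level == 0
    ]

D = [("American", "1306*"), ("Japanese", "1305*"), ("Japanese", "1485*")]

print(successors((0, 0)))     # the two single-step generalizations of D
print(apply_node(D, (1, 1)))  # top of the lattice
```

Search algorithms over the lattice only need `successors` to move upward and `apply_node` to materialize a candidate table.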

Utility: Quantifying error
• Each generalization step introduces error.
• Larger equivalence classes also may lead to more error.
Utility Metrics:
• Average size of equivalence classes
• Number of steps in generalization lattice
• Discernibility metric
  – Assign a penalty to each tuple
  – Penalty depends on how many other tuples are indistinguishable from it
These metrics do not take into account the distribution of values in each equivalence class.
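
Two of these metrics can be computed from the equivalence class sizes alone. The sketch below assumes the common formulation of the discernibility metric in which each tuple is penalized by the size of its equivalence class, giving a total of the sum of squared class sizes; the slide does not spell out the exact penalty, so treat that choice as an assumption.

```python
from collections import Counter

def class_sizes(qi_values):
    """Sizes of the equivalence classes induced by the quasi-identifier values."""
    return list(Counter(qi_values).values())

def average_class_size(sizes):
    return sum(sizes) / len(sizes)

def discernibility(sizes):
    """Each tuple is charged the size of its class, so a class of size s costs s * s."""
    return sum(s * s for s in sizes)

# (zip, age) pairs of the generalized table from Example 1.
generalized = [("130**", "<30")] * 4 + [("1485*", ">40")] * 4 + [("130**", "30-40")] * 4
sizes = class_sizes(generalized)

print(average_class_size(sizes))
print(discernibility(sizes))
```

Both numbers depend only on how many tuples share a class, which is exactly why they ignore the distribution of values inside each class.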

Utility Metrics
• Classification metric
  – Assign a penalty to each tuple t:
    • If t’s sensitive value == majority sensitive value in the group: Penalty = 0
    • Otherwise: Penalty = size of equivalence class
  – Does not take into account the distribution of the quasi-identifier attributes.
• Information Loss
  – Penalty for each tuple = 1 – 1 / (# values that can generalize to that tuple)
  – E.g., Penalty(14850, 47) = 1 – 1/1 = 0
  – Penalty(1485*, [40-50]) = 1 – 1/(10*10) = .99
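
Both penalties are short enough to state in code. This is a sketch under assumptions: the function names are illustrative, the classification penalty is summed over one equivalence class, and ties for the majority value are broken by whichever value `Counter.most_common` returns first.

```python
from collections import Counter

def classification_penalty(sensitive_values):
    """Total classification-metric penalty for one equivalence class."""
    majority, _ = Counter(sensitive_values).most_common(1)[0]
    size = len(sensitive_values)
    return sum(0 if value == majority else size for value in sensitive_values)

def information_loss(num_values):
    """Penalty 1 - 1/n, where n original values can generalize to the tuple."""
    return 1 - 1 / num_values

print(classification_penalty(["Cancer", "Cancer", "Cancer", "Cancer"]))
print(classification_penalty(["Cancer", "Heart", "Flu", "Flu"]))
print(information_loss(1))
print(information_loss(10 * 10))
```

In the second class the two non-majority tuples are each charged the class size of 4, while an ungeneralized tuple has an information loss of exactly 0.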

Empirical Distribution
• P(X=x) = fraction of tuples in the data with value x.
[Figure: histogram (x-axis 110 to 290, y-axis 0 to 0.25) of 200 weights drawn from a normal distribution with mean 200 and sd 25.]

Empirical Distribution
• P(X=x) = fraction of tuples in the data with value x.
[Figure: histogram of 2000 weights drawn from a normal distribution with mean 200 and sd 25.]
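
The empirical distribution is just a normalized count. The sketch below reproduces the setup of the two figures under assumptions: samples are rounded to the nearest 10 so that values repeat, and the seed and the function name are arbitrary choices for the example.

```python
import random
from collections import Counter

def empirical_distribution(values):
    """P(X = x) = fraction of tuples in the data with value x."""
    n = len(values)
    return {x: count / n for x, count in Counter(values).items()}

random.seed(0)  # arbitrary seed, only for repeatability
distributions = {}
for n in (200, 2000):
    # Weights drawn from a normal distribution with mean 200 and sd 25,
    # rounded to the nearest 10 (an assumed binning).
    weights = [round(random.gauss(200, 25), -1) for _ in range(n)]
    distributions[n] = empirical_distribution(weights)

for n, dist in distributions.items():
    print(n, len(dist), max(dist, key=dist.get))
```

With more samples the histogram tends to look closer to the underlying bell curve, which is the contrast the two slides draw between 200 and 2000 weights.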

Utility Metrics
KL-Divergence:
• Suppose records were sampled from some multi-dimensional distribution F
  – i.i.d. (independent and identically distributed)
• Given a table, we can estimate F with the empirical distribution F’
  F’(14850, 47, American) = fraction of tuples in the database with Zip = 14850 AND Age = 47 AND Nationality = American

Utility Metrics
KL-Divergence:
• Similarly, given a k-anonymous table, we can compute the empirical distribution F’_k-anon
  F’_k-anon(14850, 47, American) = (1/N) * Σ over equivalence classes C of ( P[(14850, 47, American) in C] * |C| )
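
The two empirical distributions can be compared with a KL-divergence. The sketch below makes several assumptions that the slides do not state: P[t in C] is taken to be uniform over the original values a class can stand for, the divergence is computed as KL(F’ || F’_k-anon) with natural logarithms, and the data is reduced to a single age attribute to keep the example short.

```python
import math
from collections import Counter

def empirical(tuples):
    """F'(t) = fraction of tuples equal to t."""
    n = len(tuples)
    return {t: count / n for t, count in Counter(tuples).items()}

def empirical_k_anon(classes, n):
    """F'_k-anon(t) = (1/N) * sum over classes C of P[t in C] * |C|.

    `classes` is a list of (cells, size) pairs, where `cells` is the set of
    original values the class may stand for. P[t in C] is assumed uniform.
    """
    dist = Counter()
    for cells, size in classes:
        for t in cells:
            dist[t] += (1 / len(cells)) * size / n
    return dict(dist)

def kl_divergence(p, q):
    """KL(p || q); q must be positive wherever p is."""
    return sum(px * math.log(px / q[t]) for t, px in p.items() if px > 0)

ages = [21, 23, 28, 29]                # original values
classes = [(set(range(20, 30)), 4)]    # one class generalized to ages 20-29
p = empirical(ages)
q = empirical_k_anon(classes, len(ages))
kl = kl_divergence(p, q)
print(kl)
```

Here each original age has probability 1/4 under F’ but 1/10 under the generalized distribution, so coarser classes spread the mass more thinly and increase the divergence.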

Example
Zip    Age  Nationality  Disease
13053  28   Russian      Heart
13068  29   American     Heart
13068  21   Japanese     Flu
13053  23   American     Flu
14853  50   Indian       Cancer
14853  55   Russian      Heart
14850  47   American     Flu
14850  59   American     Flu
13053  31   American     Cancer
13053  37   Indian       Cancer
13068  36   Japanese     Cancer
13068  32   American     Cancer
F’(13053, 37, Indian) = 1/12
