De-anonymizing Data


  1. De-anonymizing Data. CompSci 590.03. Instructor: Ashwin Machanavajjhala. (Title comic source: http://xkcd.org/834/) Lecture 2 : 590.03 Fall 12 1

  2. Announcements • Project ideas will be posted on the site by Friday. – You are welcome to send me (or talk to me about) your own ideas. Lecture 2 : 590.03 Fall 12 2

  3. Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks – Passive Attacks – Active Attacks Lecture 2 : 590.03 Fall 12 3

  4. Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks – Passive Attacks – Active Attacks Lecture 2 : 590.03 Fall 12 4

  5. Personal Big-Data [Figure: individuals 1 through N each contribute a record r_1 ... r_N to collectors such as Google, the Census, and a hospital database; these databases in turn serve information retrieval and recommendation algorithms, medical researchers, doctors, and economists.] Lecture 2 : 590.03 Fall 12 5

  6. The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002] [Figure: linkage of two tables. Medical Data: SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge. Voter List: Name, Address, Date Registered, Party Affiliation, Date Last Voted. Shared attributes: Zip, Birth Date, Sex.] • Governor of MA: his name was uniquely identified using ZipCode, Birth Date, and Sex, and was thereby linked to his diagnosis. Lecture 2 : 590.03 Fall 12 6

  7. The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002] [Figure: the same linkage of Medical Data and Voter List; the shared attributes Zip, Birth Date, and Sex form the Quasi Identifier.] • 87% of the US population is uniquely identified using ZipCode, Birth Date, and Sex. Lecture 2 : 590.03 Fall 12 7
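The linkage attack is essentially a join on the quasi-identifier. Below is a minimal sketch in Python with pandas; the file names and column names (zip, dob, sex, name, diagnosis) are hypothetical placeholders, not the actual medical or voter-list schemas.

import pandas as pd

# Hypothetical inputs: "de-identified" medical records and a public voter list.
medical = pd.read_csv("medical_data.csv")   # columns: zip, dob, sex, diagnosis, ...
voters = pd.read_csv("voter_list.csv")      # columns: name, address, zip, dob, sex, ...

# Join the two tables on the quasi-identifier (ZipCode, Birth Date, Sex).
linked = pd.merge(medical, voters, on=["zip", "dob", "sex"])

# Anyone whose quasi-identifier combination is unique is re-identified:
# their name is now attached to a diagnosis.
unique_matches = linked.groupby(["zip", "dob", "sex"]).filter(lambda g: len(g) == 1)
print(unique_matches[["name", "diagnosis"]])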

  8. Statistical Privacy (Trusted Collector) Problem [Figure: individuals 1 through N send their records r_1 ... r_N to a trusted server that stores the database DB; analysts query the server.] • Utility: the server answers the analysts' questions • Privacy: no breach about any individual Lecture 2 : 590.03 Fall 12 8

  9. Statistical Privacy (Untrusted Collector) Problem [Figure: individuals 1 through N perturb their records r_1 ... r_N with a randomizing function f(·) before they reach the untrusted server's database DB.] Lecture 2 : 590.03 Fall 12 9

  10. Randomized Response • Flip a coin – heads with probability p, and – tails with probability 1-p (p > ½) • Answer the question according to the coin: – Heads: report the true answer (Yes -> Yes, No -> No) – Tails: report the opposite answer (Yes -> No, No -> Yes) Lecture 2 : 590.03 Fall 12 10
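A minimal sketch of randomized response in Python; the concrete value of p and the helper name are illustrative assumptions, not from the slides. It also shows why the perturbed answers remain useful: the true fraction of "Yes" answers can still be estimated from the reported ones.

import random

def randomized_response(true_answer: bool, p: float = 0.75) -> bool:
    """On heads (probability p) report the true answer; on tails report the opposite."""
    heads = random.random() < p
    return true_answer if heads else not true_answer

# Suppose 30% of 10,000 respondents truly answer "Yes".
true_answers = [i < 3000 for i in range(10_000)]
reported = [randomized_response(a) for a in true_answers]

# E[reported Yes] = p*f + (1-p)*(1-f), so f = (mean - (1-p)) / (2p - 1).
p = 0.75
mean = sum(reported) / len(reported)
estimated_fraction_yes = (mean - (1 - p)) / (2 * p - 1)
print(round(estimated_fraction_yes, 3))   # close to 0.30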

  11. Statistical Privacy (Trusted Collector) Problem [Figure repeated from slide 8: individuals 1 through N send their records r_1 ... r_N to a trusted server holding the database DB.] Lecture 2 : 590.03 Fall 12 11

  12. Query Answering [Figure: individuals 1 through N send their records r_1 ... r_N to the hospital's database DB; analysts ask queries such as "How many allergy patients?" and "Correlate genome to disease".] Lecture 2 : 590.03 Fall 12 12

  13. Query Answering • Need to know the list of questions up front • Each answer leaks some information about individuals. After answering a few questions, the server will run out of privacy budget and not be able to answer any more questions. • We will see this in detail later in the course. Lecture 2 : 590.03 Fall 12 13

  14. Anonymous/Sanitized Data Publishing [Figure: individuals 1 through N send records r_1 ... r_N to the hospital's database DB; an analyst (cartoon: writingcenterunderground.wordpress.com) says "I won't tell you what questions I am interested in!"] Lecture 2 : 590.03 Fall 12 14

  15. Anonymous/Sanitized Data Publishing [Figure: the hospital sanitizes its database DB into a published version DB'.] • Analysts can answer any number of questions directly on DB', without any further modifications. Lecture 2 : 590.03 Fall 12 15

  16. Today's class • Identifying individual records and their sensitive values from published data (when the sanitization is insufficient). Lecture 2 : 590.03 Fall 12 16

  17. Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks – Passive Attacks – Active Attacks Lecture 2 : 590.03 Fall 12 17

  18. Terms • Coin tosses of an algorithm (its internal randomness) • Union Bound (Pr[A or B] ≤ Pr[A] + Pr[B]) • Heavy Tailed Distribution Lecture 2 : 590.03 Fall 12 18

  19. Terms (contd.) • Heavy Tailed Distribution [Plot] The Normal distribution is not heavy tailed. Lecture 2 : 590.03 Fall 12 19

  20. Terms (contd.) • Heavy Tailed Distribution [Plot] The Laplace distribution is heavy tailed. Lecture 2 : 590.03 Fall 12 20

  21. Terms (contd.) • Heavy Tailed Distribution [Plot] The Zipf distribution is heavy tailed. Lecture 2 : 590.03 Fall 12 21

  22. Terms (contd.) • Cosine Similarity [Figure: two vectors with angle θ between them] • Collaborative filtering – the problem of recommending new items to a user based on their ratings of previously seen items. Lecture 2 : 590.03 Fall 12 22
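For reference, the standard cosine similarity of two rating vectors x and y is

\[ \cos\theta \;=\; \frac{x \cdot y}{\|x\|\,\|y\|} \;=\; \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\;\sqrt{\sum_i y_i^2}} , \]

which is 1 when the two users rate identically (up to scale) and near 0 when their rated items barely overlap.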

  23. Netflix Dataset [Figure: a sparse Users × Movies matrix; each row is a record (r) for one user, each column/attribute is a movie, and each non-null cell holds a rating plus a timestamp.] Lecture 2 : 590.03 Fall 12 23

  24. Definitions • Support – Set (or number) of non-null attributes in a record or column • Similarity • Sparsity Lecture 2 : 590.03 Fall 12 24
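Concretely, the similarity and sparsity notions from the Narayanan-Shmatikov paper can be written as follows (a reconstruction of the paper's definitions, not necessarily the slide's exact notation): record-level similarity averages a per-attribute similarity over the supports,

\[ \mathrm{Sim}(r_1, r_2) \;=\; \frac{\sum_{i \in \mathrm{supp}(r_1) \cup \mathrm{supp}(r_2)} \mathrm{Sim}(r_{1i}, r_{2i})}{\lvert \mathrm{supp}(r_1) \cup \mathrm{supp}(r_2) \rvert} , \]

and a dataset D is (ε, δ)-sparse if, for a randomly drawn record r,

\[ \Pr\big[\exists\, r' \neq r : \mathrm{Sim}(r, r') > \epsilon \big] \;\leq\; \delta . \]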

  25. Adversary Model • Aux(r) – some subset of attributes from r Lecture 2 : 590.03 Fall 12 25

  26. Privacy Breach • Definition 1: An algorithm A outputs an r’ such that • Definition 2: (When only a sample of the dataset is input) Lecture 2 : 590.03 Fall 12 26
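In the Narayanan-Shmatikov paper these definitions read roughly as follows (a reconstruction, not the slide's exact wording): a database D is (θ, ω)-deanonymized with respect to Aux if there is an algorithm A such that, for a record r sampled from D,

\[ \Pr\big[\, \mathrm{Sim}\big(r,\; A(\mathrm{Aux}(r))\big) \;\geq\; \theta \,\big] \;\geq\; \omega , \]

and the sampled-dataset version (Definition 2) asks for the same guarantee when A is given only a subset of D, conditioned on the target record actually appearing in that subset.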

  27. Algorithm ScoreBoard • For each record r’, compute Score(r’, aux) to be the minimum similarity of an attribute in aux to the same attribute in r’. • Pick r’ with the maximum score OR • Return all records with Score > α Lecture 2 : 590.03 Fall 12 27
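A minimal sketch of the Scoreboard matching step described above. The record layout (a dict from attribute to value) and the per-attribute similarity for 1-5 star ratings are illustrative assumptions, not fixed by the slides.

from typing import Dict, List, Optional

Record = Dict[str, float]   # attribute (e.g., movie id) -> value (e.g., rating)

def attr_sim(a: float, b: float, scale: float = 4.0) -> float:
    # 1.0 when the two ratings agree, 0.0 when they are maximally far apart.
    return max(0.0, 1.0 - abs(a - b) / scale)

def score(candidate: Record, aux: Record) -> float:
    # Score(r', aux): minimum similarity over the attributes appearing in aux.
    sims = [attr_sim(candidate[i], aux[i]) if i in candidate else 0.0 for i in aux]
    return min(sims) if sims else 0.0

def scoreboard(dataset: List[Record], aux: Record, alpha: Optional[float] = None):
    # Either return the single best-scoring record, or (if alpha is given)
    # all records whose score exceeds alpha.
    scored = [(score(r, aux), r) for r in dataset]
    if alpha is not None:
        return [r for s, r in scored if s > alpha]
    return max(scored, key=lambda pair: pair[0])[1]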

  28. Analysis Theorem 1: Suppose we use Scoreboard with α = 1 – ε. If Aux contains m randomly chosen attributes s.t. m > log(N/ε) / log(1/(1 – δ)), then Scoreboard returns a record r' such that Pr[ Sim(Aux, r') > 1 – ε – δ ] > 1 – ε Lecture 2 : 590.03 Fall 12 28

  29. Proof of Theorem 1 • Call r' a false match if Sim(Aux, r') < 1 – ε – δ. • For a false match r' and a randomly chosen attribute i, Pr[ Sim(Aux_i, r_i') > 1 – ε ] < 1 – δ • Sim(Aux, r') = min_i Sim(Aux_i, r_i'), and the m attributes are chosen independently, so Pr[ Sim(Aux, r') > 1 – ε ] < (1 – δ)^m • By the union bound over the N records, Pr[ some false match has similarity > 1 – ε ] < N(1 – δ)^m • N(1 – δ)^m < ε when m > log(N/ε) / log(1/(1 – δ)) Lecture 2 : 590.03 Fall 12 29
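As a rough sense of scale (the numbers are illustrative, not from the slides): for N = 500,000 records, ε = 0.05 and δ = 0.5, the bound asks for

\[ m \;>\; \frac{\log(N/\epsilon)}{\log\!\big(1/(1-\delta)\big)} \;=\; \frac{\ln(500{,}000 / 0.05)}{\ln 2} \;\approx\; \frac{16.1}{0.69} \;\approx\; 24 , \]

so knowing a couple of dozen (approximate) attribute values already rules out all false matches with probability 0.95.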

  30. Other results • If dataset D is (1 – ε – δ, ε)-sparse, then D can be (1, 1 – ε)-deanonymized. • Analogous results hold when a list of candidate records is returned Lecture 2 : 590.03 Fall 12 30

  31. Netflix Dataset • Slightly different algorithm: in the paper, attributes (movies) are weighted so that rare movies count more, and a match is reported only when the best score stands out clearly from the second-best. Lecture 2 : 590.03 Fall 12 31

  32. Summary of Netflix Paper • An adversary can use a subset of ratings made by a user to uniquely identify the user's record in the "anonymized" dataset with high probability • The simple Scoreboard algorithm provably guarantees identification of records. • A variant of Scoreboard can de-anonymize the Netflix dataset. • The algorithms are robust to noise in the adversary's background knowledge Lecture 2 : 590.03 Fall 12 32

  33. Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks – Passive Attacks – Active Attacks Lecture 2 : 590.03 Fall 12 33

  34. Social Network Data • Social networks: graphs where each node represents a social entity, and each edge represents a certain relationship between two entities • Examples: email communication graphs, social interaction graphs such as Facebook, Yahoo! Messenger, etc. Lecture 2 : 590.03 Fall 12 34

  35. Anonymizing Social Networks [Figure: a small social graph on Alice, Bob, Cathy, Diane, Ed, Fred, and Grace.] • Naïve anonymization – remove the label of each node and publish only the structure of the network • Information Leaks – Nodes may still be re-identified based on network structure Lecture 2 : 590.03 Fall 12 35

  36. Passive Attacks on an Anonymized Network [Figure: the same graph, now an email communication graph among Alice, Bob, Cathy, Diane, Ed, Fred, and Grace.] • Consider the above email communication graph – Each node represents an individual – Each edge between two individuals indicates that they have exchanged emails Lecture 2 : 590.03 Fall 12 36

  37. Passive Attacks on an Anonymized Network [Figure: the same graph.] • Alice has sent emails to three individuals only Lecture 2 : 590.03 Fall 12 37

  38. Passive Attacks on an Anonymized Network [Figure: the same graph.] • Alice has sent emails to three individuals only • Only one node in the anonymized network has degree three • Hence, Alice can re-identify herself Lecture 2 : 590.03 Fall 12 38

  39. Passive Attacks on an Anonymized Network [Figure: the same graph.] • Cathy has sent emails to five individuals Lecture 2 : 590.03 Fall 12 39

  40. Passive Attacks on an Anonymized Network [Figure: the same graph.] • Cathy has sent emails to five individuals • Only one node has degree five • Hence, Cathy can re-identify herself Lecture 2 : 590.03 Fall 12 40
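A minimal sketch of this degree-based re-identification on an unlabeled graph. The edge list below is an illustrative seven-node example, not the graph from the slides, and the node ids are arbitrary.

from collections import Counter

# Anonymized graph published as an edge list over opaque node ids.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (2, 4), (2, 5),
         (3, 4), (3, 6), (4, 5), (4, 6)]

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

def nodes_with_degree(d):
    return [node for node, deg in degree.items() if deg == d]

# Alice knows she emailed exactly three people; Cathy knows she emailed five.
# If only one node has that degree, the attacker pins it down uniquely.
for name, d in [("Alice", 3), ("Cathy", 5)]:
    candidates = nodes_with_degree(d)
    if len(candidates) == 1:
        print(f"{name} re-identifies herself as node {candidates[0]}")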

  41. Passive Attacks on an Anonymized Network [Figure: the same graph.] • Now consider that Alice and Cathy share their knowledge about the anonymized network • What can they learn about the other individuals? Lecture 2 : 590.03 Fall 12 41
