De-anonymizing Data
CompSci 590.03, Lecture 2
Instructor: Ashwin Machanavajjhala
[Title-slide comic source: http://xkcd.org/834/]
Announcements
• Project ideas will be posted on the site by Friday.
  – You are welcome to send me (or talk to me about) your own ideas.
Outline
• Recap & Intro to Anonymization
• Algorithmically De-anonymizing Netflix Data
• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks
Personal Big-Data
[Figure: Persons 1…N contribute records r_1…r_N to databases such as Google, the Census, and hospital DBs; these databases in turn support information retrieval, recommendation algorithms, and research by doctors, economists, and medical researchers.]
The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]
[Figure: two table schemas side by side]
• Medical Data: SSN, Name, Zip, Birth date, Sex, Visit Date, Diagnosis, Procedure, Medication, Total Charge
• Voter List: Name, Address, Zip, Birth date, Sex, Date Registered, Party affiliation, Date last voted
• The Governor of MA was uniquely identified using ZipCode, Birth Date, and Sex; his Name was thereby linked to his Diagnosis.
The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]
• 87% of the US population is uniquely identified using ZipCode, Birth Date, and Sex.
• The attributes {ZipCode, Birth Date, Sex}, present in both tables, form a quasi-identifier.
[Figure: same Medical Data / Voter List schemas as the previous slide]
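To make the linkage attack concrete, here is a minimal sketch (not from the slides) of joining the two tables on the quasi-identifier {Zip, Birth date, Sex}; the sample rows, field names, and helper function are hypothetical.

```python
# Hypothetical rows; field names and values are illustrative, not from the original datasets.
medical = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "dx1"},
    {"zip": "02139", "dob": "1962-01-15", "sex": "F", "diagnosis": "dx2"},
]
voters = [
    {"name": "Person A", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
    {"name": "Person B", "zip": "02139", "dob": "1970-03-02", "sex": "F"},
]

QID = ("zip", "dob", "sex")   # the quasi-identifier shared by both tables

def link(medical_rows, voter_rows, qid=QID):
    """Join the 'anonymized' medical records to the voter list on the quasi-identifier."""
    index = {tuple(v[a] for a in qid): v for v in voter_rows}
    return [
        (index[tuple(m[a] for a in qid)]["name"], m["diagnosis"])
        for m in medical_rows
        if tuple(m[a] for a in qid) in index
    ]

print(link(medical, voters))   # names re-linked to diagnoses, e.g. [('Person A', 'dx1')]
```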
Statistical Privacy (Trusted Collector) Problem
[Figure: Individuals 1…N send their records r_1…r_N to a trusted server, which stores them in a database DB.]
• Utility: analysts can learn useful aggregate information from DB.
• Privacy: no breach about any individual.
Statistical Privacy (Untrusted Collector) Problem
[Figure: each individual perturbs his/her own record, sending f(r_i) rather than r_i to the untrusted server, which collects the perturbed records into DB.]
Randomized Response
• Flip a coin:
  – heads with probability p, and
  – tails with probability 1 – p (p > ½).
• Answer the question according to the following table:

            True Answer = Yes   True Answer = No
  Heads     Yes                 No
  Tails     No                  Yes
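A minimal sketch of randomized response and of how the collector can still estimate the population-level statistic; the choice p = 0.75 and the function names are illustrative, not from the slides.

```python
import random

def randomized_response(true_answer, p=0.75):
    """Report the true answer with probability p (heads) and the opposite with probability 1-p (tails)."""
    if random.random() < p:          # heads: tell the truth
        return true_answer
    return not true_answer           # tails: lie

def estimate_true_yes_rate(reports, p=0.75):
    """Unbiased estimate of the true fraction of 'Yes' answers from the noisy reports."""
    lam = sum(reports) / len(reports)        # observed fraction of 'Yes'
    return (lam - (1 - p)) / (2 * p - 1)     # invert E[lam] = p*pi + (1-p)*(1-pi)

# Example: 10,000 respondents, 30% of whom truly answer 'Yes'.
truths = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(t) for t in truths]
print(estimate_true_yes_rate(reports))       # close to 0.3
```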
Query Answering
[Figure: Individuals 1…N contribute records r_1…r_N to the hospital's database DB; analysts pose queries such as "How many allergy patients?" and "Correlate genome to disease", and the hospital answers them.]
Query Answering
• Need to know the list of questions up front.
• Each answer leaks some information about individuals. After answering a few questions, the server will exhaust its privacy budget and will not be able to answer any more questions.
• We will see this in detail later in the course.
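A rough sketch of the idea (an assumption on my part, since the actual mechanisms are covered later in the course): a server that adds Laplace noise to count queries and stops answering once a fixed privacy budget is spent. All class/parameter names and values are illustrative.

```python
import random

class PrivateQueryServer:
    """Toy server: answers count queries with Laplace noise until a fixed privacy budget runs out."""
    def __init__(self, records, total_budget=1.0, per_query_cost=0.25):
        self.records = records
        self.budget = total_budget
        self.cost = per_query_cost        # budget charged per answered query

    def count(self, predicate):
        if self.budget < self.cost:
            raise RuntimeError("privacy budget exhausted; no more queries answered")
        self.budget -= self.cost
        true_count = sum(1 for r in self.records if predicate(r))
        # Laplace(0, 1/cost) noise, built as the difference of two exponentials.
        noise = random.expovariate(self.cost) - random.expovariate(self.cost)
        return true_count + noise

server = PrivateQueryServer([{"allergy": True}, {"allergy": False}, {"allergy": True}])
print(server.count(lambda r: r["allergy"]))   # noisy count of allergy patients
```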
Anonymous / Sanitized Data Publishing
[Figure: the hospital holds DB; an analyst (image: writingcenterunderground.wordpress.com) says "I won't tell you what questions I am interested in!"]
Anonymous / Sanitized Data Publishing
[Figure: the hospital transforms DB into a sanitized version DB' and publishes DB'.]
• Analysts can answer any number of questions directly on DB', without any modifications.
Today's class
• Identifying individual records and their sensitive values from data publishing (with insufficient sanitization).
Outline
• Recap & Intro to Anonymization
• Algorithmically De-anonymizing Netflix Data
• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks
Terms
• Coin tosses of an algorithm – the internal randomness used by a randomized algorithm.
• Union bound – for any events A_1, …, A_n, Pr[A_1 ∪ … ∪ A_n] ≤ Pr[A_1] + … + Pr[A_n].
• Heavy tailed distribution – a distribution whose tail probability decays much more slowly than that of a normal distribution (illustrated on the next slides).
Terms (contd.)
• Heavy Tailed Distribution
  – [Plot] Normal distribution: not heavy tailed.
  – [Plot] Laplace distribution: heavy tailed.
  – [Plot] Zipf distribution: heavy tailed.
Terms (contd.)
• Cosine similarity – sim(x, y) = (x · y) / (||x|| ||y||), the cosine of the angle θ between the two vectors.
• Collaborative filtering – the problem of recommending new items to a user based on his/her ratings of previously seen items.
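A small sketch of cosine similarity between two sparse rating vectors, the kind of similarity a collaborative-filtering system (or the adversary below) might use; the rating dictionaries are made-up examples.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse rating vectors, given as
    dicts mapping movie id -> rating (missing movies count as 0)."""
    dot = sum(r * v[m] for m, r in u.items() if m in v)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Two users who rated an overlapping set of movies.
print(cosine_similarity({"m1": 5, "m2": 3, "m7": 1}, {"m1": 4, "m7": 2, "m9": 5}))
```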
Netflix Dataset
[Figure: a users × movies matrix. Each record (row) corresponds to a user; each column/attribute corresponds to a movie; each non-null cell holds a rating plus a timestamp. Most cells are empty.]
Definitions
• Support – the set (or number) of non-null attributes in a record or column.
• Similarity – (informally) a measure of how close two records are on the attributes they have in common.
• Sparsity – (informally) a dataset is sparse if, for almost every record, no other record in the dataset is similar to it.
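A sketch of these notions on sparse records stored as dicts; the overlap-based similarity and the threshold below are illustrative stand-ins, not the paper's formal definitions.

```python
def support(record):
    """Support: the set of non-null attributes of a sparse record (dict attr -> value)."""
    return set(record.keys())

def is_sparse(dataset, sim, threshold):
    """Rough empirical check: the dataset is 'sparse' if no record has another
    record within similarity `threshold` of it."""
    ids = list(dataset)
    return all(sim(dataset[a], dataset[b]) < threshold
               for a in ids for b in ids if a != b)

data = {"u1": {"m1": 5, "m2": 3}, "u2": {"m3": 4}, "u3": {"m1": 1, "m4": 2}}
# Illustrative similarity: fraction of movies two users have in common.
overlap_sim = lambda r, s: len(support(r) & support(s)) / len(support(r) | support(s))
print(support(data["u1"]))                # {'m1', 'm2'}
print(is_sparse(data, overlap_sim, 0.5))  # True: no two users share most of their movies
```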
Adversary Model
• Aux(r) – the adversary's auxiliary (background) information: a possibly noisy subset of the attributes of the target record r.
Privacy Breach
• Definition 1: An algorithm A, given Aux(r), outputs a record r' that is (informally) highly similar to the target record r with high probability.
• Definition 2: The analogous notion when only a sample of the dataset is given as input.
Algorithm Scoreboard
• For each record r', compute Score(r', aux) as the minimum similarity between an attribute value in aux and the corresponding attribute value in r'.
• Pick the r' with the maximum score, OR
• Return all records with Score > α.
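A minimal sketch of Scoreboard, assuming an exact-match per-attribute similarity; the helper names and the tiny dataset are hypothetical.

```python
def attribute_sim(a, b):
    """Similarity between two attribute values (an illustrative choice:
    1 if the values match exactly, 0 otherwise)."""
    return 1.0 if a == b else 0.0

def scoreboard(dataset, aux, alpha=None):
    """Score each candidate record by the minimum per-attribute similarity
    against the adversary's auxiliary information.
    dataset: {record_id: {attr: value}}, aux: {attr: value}."""
    scores = {
        rid: min(attribute_sim(v, rec.get(attr)) for attr, v in aux.items())
        for rid, rec in dataset.items()
    }
    if alpha is not None:                       # return every record scoring above alpha
        return [rid for rid, s in scores.items() if s > alpha]
    return max(scores, key=scores.get)          # otherwise return the best-scoring record

# Example: the adversary knows the target rated m2 = 3 and m5 = 1.
data = {"u1": {"m1": 5, "m2": 3, "m5": 1}, "u2": {"m2": 4, "m5": 1}}
print(scoreboard(data, {"m2": 3, "m5": 1}))     # -> "u1"
```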
Analysis
• Theorem 1: Suppose we use Scoreboard with α = 1 – ε. If Aux contains m randomly chosen attributes with m > log(N/ε) / log(1/(1 – δ)), where N is the number of records, then Scoreboard returns a record r' such that Pr[Sim(Aux, r') > 1 – ε – δ] > 1 – ε.
Proof of Theorem 1
• Call r' a false match if Sim(Aux, r') < 1 – ε – δ.
• For a false match and any single attribute i, Pr[Sim(Aux_i, r'_i) > 1 – ε] < 1 – δ.
• Sim(Aux, r') = min_i Sim(Aux_i, r'_i), so over the m attributes, Pr[Sim(Aux, r') > 1 – ε] < (1 – δ)^m.
• By the union bound over at most N records, Pr[some false match has similarity > 1 – ε] < N(1 – δ)^m.
• N(1 – δ)^m < ε when m > log(N/ε) / log(1/(1 – δ)).
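The bound can be evaluated numerically; a small sketch with illustrative (not Netflix) parameters:

```python
import math

def min_aux_attributes(n_records, eps, delta):
    """Smallest whole number m satisfying N * (1 - delta)^m < eps,
    i.e. the bound m > log(N / eps) / log(1 / (1 - delta)) from the proof above."""
    return math.floor(math.log(n_records / eps) / math.log(1 / (1 - delta))) + 1

# Hypothetical parameters (not from the slides): 500,000 records, eps = delta = 0.1.
print(min_aux_attributes(500_000, 0.1, 0.1))   # number of aux attributes needed
```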
Other results
• If the dataset D is (1 – ε – δ, ε)-sparse, then D can be (1, 1 – ε)-de-anonymized.
• Analogous results hold when a list of candidate records is returned.
Netflix Dataset
• A slightly different scoring algorithm is used for the actual Netflix data (roughly, in the paper, matches on rare movies are weighted more heavily, and a match is reported only when the best score clearly stands out from the second best).
Summary of Netflix Paper
• An adversary can use a subset of the ratings made by a user to uniquely identify that user's record in the "anonymized" dataset with high probability.
• The simple Scoreboard algorithm provably guarantees identification of records.
• A variant of Scoreboard can de-anonymize the Netflix dataset.
• The algorithms are robust to noise in the adversary's background knowledge.
Outline
• Recap & Intro to Anonymization
• Algorithmically De-anonymizing Netflix Data
• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks
Social Network Data
• Social networks: graphs in which each node represents a social entity and each edge represents a relationship between two entities.
• Examples: email communication graphs, social interactions on Facebook, Yahoo! Messenger, etc.
Anonymizing Social Networks
[Figure: a small social network among Alice, Bob, Cathy, Diane, Ed, Fred, and Grace]
• Naïve anonymization removes the label of each node and publishes only the structure of the network.
• Information leaks: nodes may still be re-identified based on network structure.
Passive Attacks on an Anonymized Network
[Figure: anonymized version of the email graph among Alice, Bob, Cathy, Diane, Ed, Fred, and Grace]
• Consider the above email communication graph:
  – Each node represents an individual.
  – Each edge between two individuals indicates that they have exchanged emails.
Passive Attacks on an Anonymized Network
• Alice has sent emails to three individuals only.
• Only one node in the anonymized network has degree three.
• Hence, Alice can re-identify herself.
Passive Attacks on an Anonymized Network
• Cathy has sent emails to five individuals.
• Only one node has degree five.
• Hence, Cathy can re-identify herself.
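A sketch of this degree-based passive attack on a naïvely anonymized graph; the adjacency list below is a hypothetical stand-in, since the slide's actual edges are not reproduced here.

```python
def unique_degree_nodes(adjacency):
    """Return the anonymized nodes whose degree is unique in the graph;
    such a node can be re-identified by anyone who knows its true degree."""
    degrees = {node: len(neighbors) for node, neighbors in adjacency.items()}
    counts = {}
    for d in degrees.values():
        counts[d] = counts.get(d, 0) + 1
    return {node: d for node, d in degrees.items() if counts[d] == 1}

# Hypothetical anonymized 7-node graph (node ids replace the real names).
graph = {
    1: {2, 3, 4},          # the only node of degree 3 ("Alice" re-identifies herself)
    2: {1, 3, 4, 5, 6},    # the only node of degree 5 ("Cathy" re-identifies herself)
    3: {1, 2},
    4: {1, 2},
    5: {2, 7},
    6: {2},
    7: {5},
}
print(unique_degree_nodes(graph))   # -> {1: 3, 2: 5}
```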
Passive Attacks on an Anonymized Network
• Now suppose Alice and Cathy share their knowledge about the anonymized network.
• What can they learn about the other individuals?