Learning to de-anonymize social networks A machine learning approach to social graph de-anonymization Kumar Sharad October 28, 2016 Royal Holloway, University of London ACM Workshop on Artificial Intelligence and Security, Vienna, Austria (AISec 2016)
This talk 1. True Friends Let You Down: Benchmarking Social Graph Anonymization Schemes 2. Change of Guard: The Next Generation of Social Graph De-anonymization Attacks
Overview 1. Introduction 2. Preliminaries 3. Benchmarking social graph anonymization schemes 4. The next generation of social graph de-anonymization attacks 5. Conclusions
Introduction
The art of data anonymization • Goal: process data without jeopardizing privacy. • Popular: randomize identifiers and/or perturb data. • Pros: cheap, preserves utility, provides legal immunity. • Cons: lack of privacy guarantees.
Privacy challenges in anonymized social graphs • Social graphs are notoriously difficult to anonymize. • How can we compare various anonymization schemes? • Can we measure privacy leak purely based on graph topology? • Could this lead to end-to-end graph de-anonymization? • Intuition: Train a machine learning model to learn the de-anonymization function.
Preliminaries
Node features • Graph nodes represent individuals and the edges represent relationships among them. • Feature vector purely based on topology (no edge weights or directionality). • Too generic: high false positives. • Too specific: low true positives. • Quantize neighborhood degree distribution.
[Figure: The 2-hop neighborhood of a node, showing the ego, its 1-hop neighbors, and its 2-hop neighbors.]
Node feature vector • Feature vector of a node with neighbors of degrees [1, 1, 3, 3, 5, 6, 7, 13, 16, 20, 21, 30, 65, 69, 72, 1030, 1100]. • Bin size = 15, 70 bins. • Bin counts: c_0 = 8, c_1 = 4, c_2 = 0, …, c_4 = 3, …, c_69 = 2.
Node feature vector • Bins for 1-hop nodes followed by bins for 2-hop nodes: c_0 = 8, c_1 = 4, c_2 = 0, …, c_137 = 3, c_138 = 1, c_139 = 46.
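The quantization on these slides can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the function names, the adjacency-dict graph representation, and the exact bin boundaries (bin i holding degrees 15i+1 through 15(i+1), with larger degrees collected in the last bin) are assumptions. Under this convention the first bins of the slide's example are reproduced; how the figure assigns the two largest degrees is not fully recoverable from the slide text, so the tail bins may differ.

```python
BIN_SIZE = 15   # bin width, from the slide
NUM_BINS = 70   # bins per hop, from the slide


def quantize_degrees(degrees, bin_size=BIN_SIZE, num_bins=NUM_BINS):
    """Count how many degrees fall into each bin.

    Assumed convention: bin i covers degrees i*bin_size+1 .. (i+1)*bin_size,
    and anything beyond the last boundary is placed in the final bin.
    """
    counts = [0] * num_bins
    for d in degrees:
        idx = min(max(d - 1, 0) // bin_size, num_bins - 1)
        counts[idx] += 1
    return counts


def node_feature_vector(adj, ego):
    """Concatenate the binned degrees of the 1-hop and 2-hop neighbors of ego.

    adj maps each node to an iterable of its neighbors (undirected graph).
    """
    one_hop = set(adj[ego]) - {ego}
    two_hop = set()
    for u in one_hop:
        two_hop.update(adj[u])
    two_hop -= one_hop | {ego}
    one_hop_bins = quantize_degrees(len(adj[v]) for v in one_hop)
    two_hop_bins = quantize_degrees(len(adj[v]) for v in two_hop)
    return one_hop_bins + two_hop_bins  # 2 * NUM_BINS entries


# Degrees from the slide's example.
example = [1, 1, 3, 3, 5, 6, 7, 13, 16, 20, 21, 30, 65, 69, 72, 1030, 1100]
bins = quantize_degrees(example)

# Tiny path graph a - b - c to exercise the 2-hop part.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
vec = node_feature_vector(path, "a")
```

Concatenating the two 70-bin blocks gives 140 entries, which matches the indices c_0 through c_139 shown on the slide.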
The learning task • Distinguish whether a pair of graph node feature vectors represents the same individual. • Given a node pair, classify it as identical or non-identical. • We use a random forest, which is a collection of decision trees, to classify node pairs. • Prediction: aggregate the decisions of all trees.
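The aggregation step can be illustrated with a toy example. The "trees" below are hand-written threshold rules standing in for learned decision trees; `make_stump`, `forest_score`, and the comparison rule are hypothetical and only show how per-tree votes on a node pair combine into one score.

```python
def make_stump(index, tolerance):
    """Hypothetical stand-in for a learned tree: votes 1 ("identical") when
    the two feature vectors differ by at most `tolerance` at `index`."""
    def stump(vec_a, vec_b):
        return 1 if abs(vec_a[index] - vec_b[index]) <= tolerance else 0
    return stump


def forest_score(trees, vec_a, vec_b):
    """Fraction of trees voting that the pair is the same individual."""
    votes = [tree(vec_a, vec_b) for tree in trees]
    return sum(votes) / len(votes)


def classify_pair(trees, vec_a, vec_b, threshold=0.5):
    """Label the pair identical when enough trees agree."""
    return forest_score(trees, vec_a, vec_b) >= threshold


trees = [make_stump(0, 1), make_stump(1, 0), make_stump(2, 2)]
node = [8, 4, 0]
similar = [7, 4, 1]
different = [1, 9, 5]

score_similar = forest_score(trees, node, similar)
score_different = forest_score(trees, node, different)
```

Keeping the score as a fraction rather than a hard label lets the decision threshold be varied, which is what an ROC curve requires.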
Benchmarking social graph anonymization schemes
Motivation • A large number of social graph anonymization schemes have been proposed with varied goals. • None of them provides any privacy guarantees. • Preserving privacy vs. preserving utility. • Ad-hoc development of schemes has created a skewed ecosystem. • Research gap: how to compare social graph anonymization schemes?
Approach • Compare social graph anonymization schemes based on anonymity provided vs. utility preserved. • Use a machine learning framework to benchmark perturbation-based social graph anonymization schemes. • Automates evaluation and levels the playing field.
The adversarial model [1] (1/2) • A sanitized social network is released. • Adversary obtains an auxiliary social network with some overlap. • Adversary uses graph topology to predict the true correspondences. • Sample nodes at random from original graph G to generate two graphs G_1 and G_2 with an overlap. [1] A. Narayanan and V. Shmatikov. De-anonymizing social networks. IEEE S&P 2009.
The adversarial model (2/2) • Perturb G_1 and G_2 to produce G_aux and G_san. • Overlap between G_1 and G_2 is measured using the Jaccard Coefficient. • Jaccard Coefficient: the Jaccard Coefficient between sets X and Y, at least one of which is non-empty, is defined as $JC(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$.
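A rough sketch of the sampling step and the overlap measure follows. The sampling procedure (a shared block of nodes plus an even split of the remainder) and all function names are assumptions made for illustration; the talk only states that nodes are sampled at random with an overlap, and measuring the overlap on node sets is likewise assumed.

```python
import random


def jaccard(x, y):
    """Jaccard Coefficient |X ∩ Y| / |X ∪ Y| for two collections."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        raise ValueError("at least one set must be non-empty")
    return len(x & y) / len(union)


def sample_overlapping_node_sets(nodes, overlap_fraction, rng):
    """Assumed scheme: a random `overlap_fraction` of the nodes is shared,
    and the remaining nodes are divided evenly between the two sets."""
    nodes = list(nodes)
    rng.shuffle(nodes)
    n_shared = round(overlap_fraction * len(nodes))
    shared, rest = nodes[:n_shared], nodes[n_shared:]
    half = len(rest) // 2
    return set(shared) | set(rest[:half]), set(shared) | set(rest[half:])


def induced_subgraph(adj, keep):
    """Restrict an adjacency dict (node -> set of neighbors) to `keep`."""
    return {u: {v for v in adj[u] if v in keep} for u in adj if u in keep}


rng = random.Random(0)
v1, v2 = sample_overlapping_node_sets(range(100), 0.2, rng)
overlap = jaccard(v1, v2)

toy = {1: {2}, 2: {1, 3}, 3: {2}}
g1 = induced_subgraph(toy, {1, 2})
```

Because every node ends up in at least one of the two sets under this scheme, the Jaccard Coefficient of the node sets equals the chosen overlap fraction.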
[Figure: Generating perturbed graphs. G is sampled into G_1 and G_2, which are perturbed to produce G_aux and G_san.]
Schemes analyzed 1. Random Sparsification (RSP) 2. Random Edge Perturbation (REP) 3. k-Degree Anonymous (KDA) 4. 1-hop k-Anonymous (1HKA) 5. Random Add/Delete (RAD) 6. Random Switch (RSW)
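The slide only names the schemes. As one illustration, edge switching is commonly described as picking two edges (a, b) and (c, d) and rewiring them to (a, d) and (c, b), which keeps every node's degree unchanged. The sketch below follows that common description; it is an assumption that the Random Switch scheme evaluated in the talk works exactly this way, and `random_switch` and its parameters are hypothetical names.

```python
import random
from collections import Counter


def random_switch(edges, n_switches, rng, max_tries=10000):
    """Rewire pairs of edges (a,b),(c,d) -> (a,d),(c,b) in a simple
    undirected graph, skipping switches that would create a self-loop
    or a duplicate edge."""
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done = tries = 0
    while done < n_switches and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # shared endpoint: switching could create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # would duplicate an existing edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


def degrees(edge_list):
    """Degree of every node appearing in an edge list."""
    deg = Counter()
    for u, v in edge_list:
        deg[u] += 1
        deg[v] += 1
    return deg


ring = [(i, (i + 1) % 10) for i in range(10)]
switched = random_switch(ring, 5, random.Random(1))
```

Since degrees are preserved by construction, a scheme of this kind leaves the degree distribution intact while still changing which nodes are connected.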
Measuring utility 1. Degree distribution (DD) 2. Joint degree distribution (JDD) 3. Average degree connectivity 4. Degree centrality 5. Eigenvector centrality
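The first two utility metrics can be computed directly from an adjacency structure. The sketch below is a minimal illustration under assumptions: the graph is a dict mapping each node to a set of neighbors, and the joint degree distribution is taken as the number of edges whose endpoints have degrees (j, k), with each undirected edge counted once.

```python
from collections import Counter


def degree_distribution(adj):
    """Map each degree to the number of nodes having that degree."""
    return Counter(len(nbrs) for nbrs in adj.values())


def joint_degree_distribution(adj):
    """Map each degree pair (j, k), j <= k, to the number of edges
    joining a node of degree j to a node of degree k."""
    jdd = Counter()
    seen = set()
    for u, nbrs in adj.items():
        for v in nbrs:
            edge = frozenset((u, v))
            if u == v or edge in seen:
                continue  # ignore self-loops; count each edge once
            seen.add(edge)
            du, dv = len(adj[u]), len(adj[v])
            jdd[(min(du, dv), max(du, dv))] += 1
    return jdd


# Star graph: center 0 connected to leaves 1, 2, 3.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
dd = degree_distribution(star)
jdd = joint_degree_distribution(star)
```

Comparing these counters before and after perturbation gives a simple view of how much of the graph's structure a scheme preserves.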
Measuring anonymity • Measured by the de-anonymization success achieved, as depicted by ROC curves under varying levels of perturbation. • A higher AUC implies weaker anonymity. • An increase in perturbation should produce a commensurate decrease in de-anonymization success while minimizing damage to utility.
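The AUC can be read as the probability that a randomly chosen identical pair receives a higher classifier score than a randomly chosen non-identical pair. The sketch below computes it by direct pairwise comparison; the function name and the toy scores are illustrative assumptions, and the quadratic loop is only suitable for small examples.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison: the fraction of
    (positive, negative) score pairs ranked correctly, ties counting half."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores for identical (positive) and non-identical pairs.
perfect = roc_auc([0.9, 0.8], [0.1, 0.2])
partial = roc_auc([0.9, 0.4], [0.6, 0.1])
chance = roc_auc([0.5], [0.5])
```

An AUC near 0.5 means the attacker does no better than guessing, which is the outcome a strong anonymization scheme would aim for.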
Training without ground truth • What is the best way to train a model given the adversary only has access to G_aux and G_san? • Ideal: generate G_aux and G_san from G. • Practical: split G_aux and G_san individually and merge the sampled data. • The datasets do not need to be further anonymized. • Identical and non-identical node pairs are extremely different.
[Figure: Training without ground truth by splitting the original graphs. G_aux and G_san are each split into two graphs (marked ′ and ″).]
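Once a graph has been split into two overlapping halves, the correspondence between the halves is known, so labeled training pairs come for free. The sketch below shows one way to build them, assuming the two halves are given as node sets with comparable identifiers; `labeled_pairs` and the uniform negative sampling are hypothetical choices, not the procedure from the talk.

```python
import random


def labeled_pairs(nodes_a, nodes_b, n_negative, rng):
    """Build ((u, v), label) training examples from two halves of one graph.

    Nodes present in both halves give identical pairs (label 1);
    randomly drawn pairs of distinct nodes give non-identical pairs (label 0).
    """
    nodes_a, nodes_b = sorted(nodes_a), sorted(nodes_b)
    shared = sorted(set(nodes_a) & set(nodes_b))
    positives = [((v, v), 1) for v in shared]

    # Never ask for more negatives than there are distinct candidate pairs.
    max_negative = len(nodes_a) * len(nodes_b) - len(shared)
    target = min(n_negative, max_negative)
    negatives = set()
    while len(negatives) < target:
        u, v = rng.choice(nodes_a), rng.choice(nodes_b)
        if u != v:
            negatives.add((u, v))
    return positives + [(pair, 0) for pair in sorted(negatives)]


half_a = {1, 2, 3, 4}
half_b = {3, 4, 5, 6}
pairs = labeled_pairs(half_a, half_b, 5, random.Random(0))
identical = [pair for pair, label in pairs if label == 1]
non_identical = [pair for pair, label in pairs if label == 0]
```

Running this on the split of G_aux and again on the split of G_san, then concatenating the two lists, corresponds to the "merge the sampled data" step on the slide.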
Evaluation and results • Publicly available datasets used: • Flickr (80 513 nodes, 5 899 882 edges). • Facebook New Orleans dataset (63 731 nodes, 817 090 edges).
[Figure: Degree distribution on Facebook, log-log plots of Frequency against Degree. Panels: Random Sparsification (Original, α_E = 0.75, 0.50, 0.25); Random Edge Perturbation (Original, µ = 10^-4, 10^-3, 10^-2); k-Degree Anonymity (Original, k = 10, 50, 100).]
[Figure: Joint degree distribution: RSP. Panels: FB: No Anonymization; FB: RSP (α_E = 0.75); FB: RSP (α_E = 0.50); FB: RSP (α_E = 0.25).]
[Figure: Joint degree distribution: REP. Panels: FB: No Anonymization; FB: REP (µ = 10^-4); FB: REP (µ = 10^-3); FB: REP (µ = 10^-2).]