De-anonymizing D4D Datasets

Kumar Sharad (University of Cambridge) and George Danezis (Microsoft Research)
July 12, 2013. 6th Workshop on Hot Topics in Privacy Enhancing Technologies 2013, Bloomington, Indiana, USA

Can Personally Identifiable Information be Anonymized?
Research indicates that anonymizing feature-rich data is hard. In general, it is not possible to do so while preserving the usefulness of the data. The release of real data presents an interesting opportunity to test the science. It also encourages responsible data release.

Overview
1. The D4D Challenge
2. Dataset 4
3. Re-identification
4. Results
5. Open Problem

The Data for Development (D4D) Challenge [1]
Introduced by Orange in July 2012 for research related to social development in Ivory Coast. Four datasets of anonymized call patterns were released. We were provided with a preliminary version of the datasets.
Ivory Coast facts: population 22.4 million; mobile phone users 17.3 million; Orange subscribers 5 million. A country fraught with civil war.
[1] http://www.d4d.orange.com/

Dataset 4
Contains communication sub-graphs (ego nets) of 8300 randomly selected individuals (egos). Provides all communication between egos and their neighbours up to 2 degrees of separation. All nodes have random identifiers. A node that appears in several sub-graphs has a different identifier in each sub-graph.
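The release scheme described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure Orange actually used: it assumes an ego net is the sub-graph induced on all nodes within 2 hops of the ego, and that pseudonyms are drawn independently for every ego net. The function names `build_graph`, `ego_net` and `anonymize` are hypothetical.

```python
import random
from collections import defaultdict

def build_graph(edges):
    """Build undirected adjacency sets from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def ego_net(adj, ego, hops=2):
    """Edges among the ego and all nodes within `hops` of it.

    Assumption: the released sub-graph is the induced sub-graph on
    this ball; the real dataset may keep or drop edges differently.
    """
    dist = {ego: 0}
    frontier = [ego]
    for d in range(1, hops + 1):
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = d
                    nxt.append(v)
        frontier = nxt
    nodes = set(dist)
    return {(u, v) for u in nodes for v in adj[u] if v in nodes and u < v}

def anonymize(edges, rng):
    """Replace every node id with a fresh random pseudonym.

    A new mapping is drawn on each call, so a node shared by two
    ego nets generally receives a different identifier in each.
    """
    nodes = sorted({n for e in edges for n in e})
    pseudonyms = rng.sample(range(10**6), len(nodes))
    mapping = dict(zip(nodes, pseudonyms))
    anon = {tuple(sorted((mapping[u], mapping[v]))) for u, v in edges}
    return anon, mapping

# Hypothetical toy graph: a path 0-1-2-3-4.
adj = build_graph([(0, 1), (1, 2), (2, 3), (3, 4)])
rng = random.Random(2013)
released_0, _ = anonymize(ego_net(adj, 0), rng)
released_2, _ = anonymize(ego_net(adj, 2), rng)
```

Relabelling changes the identifiers but not the structure: the edge count and the degree of every node are the same before and after `anonymize`, which is what the re-identification step relies on.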

Toy Example

[Figure: the ego net G0, with nodes 1, 2, 3, 4, 5 and 6]

[Figure: the ego net G1, with nodes 1, 3, 5, 6 and 7]

[Figure: the sub-graph common to both G0 and G1, with nodes 1, 3, 5 and 6]

Real World Example

[Figure: the ego net G0]

[Figure: the ego net G1]

[Figure: the sub-graph common to both G0 and G1]

Re-identification
1-hop nodes: The complete neighbourhood graph is available. The degree distribution of a node's neighbours is almost unique. Graph invariants are completely preserved even after anonymization! Use this to map nodes across ego nets.
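A minimal sketch of the 1-hop idea, assuming the invariant is the sorted list of neighbour degrees and that it is compared for exact equality. The slides do not specify the exact invariant or comparison, so `signature` and `match_one_hop` are hypothetical names and the graph below is made up. Nodes whose signature is not unique on both sides are left unmapped.

```python
from collections import defaultdict

def adjacency(edges):
    """Build undirected adjacency sets from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def signature(adj, node):
    """Sorted degrees of the node's neighbours (relabelling-invariant)."""
    return tuple(sorted(len(adj[n]) for n in adj[node]))

def unique_by_signature(adj, nodes):
    """Map each signature to its node, keeping only unambiguous ones."""
    buckets = defaultdict(list)
    for n in nodes:
        buckets[signature(adj, n)].append(n)
    return {sig: ns[0] for sig, ns in buckets.items() if len(ns) == 1}

def match_one_hop(edges_a, nodes_a, edges_b, nodes_b):
    """Pair candidate nodes whose signature is unique in both ego nets."""
    ua = unique_by_signature(adjacency(edges_a), nodes_a)
    ub = unique_by_signature(adjacency(edges_b), nodes_b)
    return {ua[s]: ub[s] for s in ua.keys() & ub.keys()}

# Hypothetical ego net and a relabelled copy standing in for a second one.
edges_a = [(1, 2), (1, 3), (1, 4), (2, 3), (4, 5), (4, 6), (5, 6), (6, 7)]
relabel = {1: 40, 2: 17, 3: 93, 4: 8, 5: 61, 6: 25, 7: 72}
edges_b = [(relabel[u], relabel[v]) for u, v in edges_a]

mapping = match_one_hop(edges_a, [1, 2, 4, 5], edges_b, [40, 17, 93, 8, 61])
print(mapping)
```

Here node 2 stays unmapped because pseudonyms 17 and 93 share its signature. The example is idealized: the second graph is an exact relabelling of the first, whereas in two real ego nets the visible degree of a shared neighbour can differ, so an approximate comparison may be needed.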

2-hop nodes: Parts of the neighbourhood graph are missing. Graph invariants are only partially preserved after anonymization. For each candidate pair of nodes, one from each ego net, observe the 1-hop nodes they have in common. For pairs with a significant match, compute the cosine similarity between the degree distributions of their neighbourhoods. Use bipartite matching to maximize the overall similarity score across pairs.
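The scoring and matching steps can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the degree distribution is taken to be a histogram of neighbour degrees, the pre-filtering on shared 1-hop nodes is left out, and the matching is found by exhaustive search, which is only feasible for a handful of nodes (a Hungarian-style solver would replace it in practice). All names and the example degrees are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def cosine(h1, h2):
    """Cosine similarity between two degree histograms (Counters)."""
    dot = sum(h1[k] * h2[k] for k in h1.keys() & h2.keys())
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def best_matching(nodes_a, nodes_b, score):
    """Maximum-weight bipartite matching by brute force.

    Assumes len(nodes_a) <= len(nodes_b); otherwise no assignment is
    enumerated and an empty mapping is returned.
    """
    best, best_total = {}, -1.0
    for perm in permutations(nodes_b, len(nodes_a)):
        total = sum(score(a, b) for a, b in zip(nodes_a, perm))
        if total > best_total:
            best, best_total = dict(zip(nodes_a, perm)), total
    return best

# Hypothetical neighbour degrees observed for 2-hop nodes in two ego nets.
degs_a = {"x": [1, 2, 2, 5], "y": [3, 3, 4]}
degs_b = {"p": [3, 4, 4], "q": [1, 2, 2, 6], "r": [7, 7]}

def score(a, b):
    """Similarity of the two nodes' neighbour-degree histograms."""
    return cosine(Counter(degs_a[a]), Counter(degs_b[b]))

matching = best_matching(list(degs_a), list(degs_b), score)
print(matching)
```

Because parts of the neighbourhood are missing for 2-hop nodes, the histograms rarely agree exactly; a similarity score tolerates this where the exact comparison used for 1-hop nodes would not.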

Results [2]
1-hop nodes: Almost all the common nodes were re-identified, with a success rate of over 98%. Secluded nodes are hard to identify.
2-hop nodes: Close to 15% (often over 20%) of common nodes were re-identified, with a success rate of over 75% (occasionally over 90%).
[2] Based on the EU email communication network, http://snap.stanford.edu/data/email-EuAll.html

Open Problem
How can nodes be efficiently re-identified across ego nets that have no 1-hop nodes in common?

Contact
Kumar Sharad: Kumar.Sharad@cl.cam.ac.uk, research.sharad.de
George Danezis: gdane@microsoft.com, research.microsoft.com/en-us/um/people/gdane
