On Privacy Risk of Releasing Data and Models
Ashish Dandekar
Supervised by: A/P Stéphane Bressan
July 18, 2019
Introduction

Data is the new oil! (The Economist, 6 May 2017)
AI is the new electricity!
Privacy risk: Publishing Data

Mea culpa, mea culpa, mea maxima culpa!

'Facebook's failure to compel Cambridge Analytica to delete all traces of data from its servers, including any "derivatives", enabled the company to retain predictive models derived from millions of social media profiles!' (The Guardian, 6 May 2018)
Privacy risk: Publishing Data

An arms race between anonymisation and re-identification!

- Re-identification of the governor of Massachusetts in 2000
- Re-identification of Thelma Arnold from AOL searches in 2006
- Re-identification of users in the Netflix dataset in 2006
- Re-identification of cabs in the New York City taxi dataset in 2014
Privacy risk: Publishing Models

If machine learning models learn latent patterns in the dataset, what are the odds that they learn something that they are not supposed to learn?

Attacks on machine learning models:
- Inference attack. [Homer et al., 2008] infer the presence of a particular genome in a dataset from the published statistics of a genomic mixture dataset.
- Model inversion attack. [Fredrikson et al., 2014] infer genetic markers of patients given access to a machine learning model trained on a warfarin dosage dataset.
- Membership inference attack. [Shokri et al., 2017] infer the presence of a data point in the training dataset given access to machine learning models hosted on cloud platforms (see the sketch below).
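To make the membership inference threat concrete, here is a minimal sketch of the underlying intuition, using a simple confidence-threshold heuristic rather than the shadow-model attack of [Shokri et al., 2017]; the dataset, model, and threshold are illustrative assumptions.

```python
# Minimal membership inference sketch: models are often more confident on
# training points than on unseen points, so an attacker can threshold the
# model's confidence to guess membership.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_out, y_train, y_out = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

def guess_membership(model, X, threshold=0.95):
    """Guess 'member' whenever the model's top-class confidence is high."""
    confidence = model.predict_proba(X).max(axis=1)
    return confidence >= threshold

# Fraction of true members flagged versus held-out points wrongly flagged.
tpr = guess_membership(model, X_train).mean()
fpr = guess_membership(model, X_out).mean()
print(f"member hit rate: {tpr:.2f}, non-member false alarm rate: {fpr:.2f}")
```

A gap between the two rates indicates that the model leaks membership information; overfitted models tend to show a larger gap.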
Our contributions

"Synthetic datasets put a full stop on the arms race between anonymisation and re-identification." [Bellovin et al., 2018]

Publication of data
- We illustrate partially and fully synthetic dataset generation techniques using a selection of discriminative models.
- We adapt and extend Latent Dirichlet Allocation, a generative model, to work with spatiotemporal data.
Our contributions

We use differential privacy [Dwork et al., 2014] to provide a quantifiable privacy guarantee when releasing machine learning models.

Publication of models
- We illustrate the use of the functional mechanism to provide differential privacy guarantees for releasing regularised linear regression (see the sketch below).
- We illustrate the use of perturbation of model functions to provide differential privacy guarantees for a selection of non-parametric models.
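As a rough illustration of the functional mechanism (due to Zhang et al., 2012) for regularised linear regression: the squared loss is a degree-2 polynomial in the weights, so it suffices to perturb its data-dependent coefficients, the aggregates XᵀX and Xᵀy, with Laplace noise and minimise the perturbed loss. This is a minimal sketch assuming rows normalised to [-1, 1]; the sensitivity constant below is an illustrative placeholder, not the exact bound from the paper.

```python
import numpy as np

def functional_mechanism_ridge(X, y, epsilon, lam=1.0, rng=None):
    """Sketch of the functional mechanism for ridge-regularised regression."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    delta = 2 * (1 + 2 * d + d * d)   # ASSUMED sensitivity bound for rows in [-1, 1]
    scale = delta / epsilon           # Laplace scale per polynomial coefficient

    # Degree-1 and degree-2 coefficients of the loss sum_i (y_i - x_i @ w)**2.
    phi1 = X.T @ y                    # coefficients of the linear term in w
    phi2 = X.T @ X                    # coefficients of the quadratic term in w

    phi1_noisy = phi1 + rng.laplace(0, scale, size=phi1.shape)
    noise2 = rng.laplace(0, scale, size=phi2.shape)
    phi2_noisy = phi2 + (noise2 + noise2.T) / 2   # keep the quadratic form symmetric

    # Minimiser of the perturbed loss plus lam * ||w||^2; the regulariser is
    # data-independent, so it needs no perturbation.
    return np.linalg.solve(phi2_noisy + lam * np.eye(d), phi1_noisy)
```

The regularisation term also keeps the perturbed quadratic form well-posed, which is one reason the regularised variant is the natural target for this mechanism.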
Our contributions

In the spirit of making differential privacy amenable to business entities, we propose privacy at risk, a probabilistic relaxation of differential privacy.

Privacy at risk
- We define privacy at risk, which provides probabilistic bounds on the privacy guarantee of differential privacy by accounting for various sources of randomness (see the definitions below).
- We illustrate privacy at risk for the Laplace mechanism.
- We propose a cost model that bridges the gap between the abstract guarantee and the compensation budget estimated by a GDPR-compliant business entity.
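For reference, the guarantee being relaxed and a paraphrase of the relaxation (the precise definition of privacy at risk is given in the thesis):

```latex
% \varepsilon-differential privacy [Dwork et al., 2014]: for every pair of
% neighbouring datasets D, D' and every measurable set S of outputs,
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]

% Privacy at risk (paraphrased): a mechanism \mathcal{M} satisfies
% (\varepsilon, \gamma)-privacy at risk if the inequality above holds with
% probability at least \gamma over the sources of randomness, e.g. the
% noise of the Laplace mechanism, which adds Lap(\Delta f / \varepsilon)
% noise calibrated to the sensitivity \Delta f of the query f.
```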
Summary

[Summary diagram: the privacy risk of releasing data is addressed via statistical disclosure risk and synthetic data generation (multiple imputation), using discriminative models (linear regression, decision tree, SVM) and generative models (LDA, RNN) as data synthesisers; the privacy risk of releasing models is addressed via differential privacy, using the functional mechanism with regularisation for parametric models (regularised linear regression) and perturbation of non-parametric models (histogram, KDE, Gaussian process, kernel SVDD), and via privacy at risk of the Laplace mechanism together with a cost model.]
Publication of data
(Privacy risk of re-identification)
Synthetic Data

As authentic as these "Nike" shoes!
Synthetic dataset generation techniques

With the help of a domain expert, a data scientist classifies the features of a data point into two categories.

- Identifying features. Attributes that are not specific to the dataset under study; they may be publicly available as part of other datasets.
- Sensitive features. Attributes that are specific to the dataset under study; they contain data that is deemed sensitive.

For example, DOB, Marital Status and Gender are identifying features that appear both in a census dataset (alongside the sensitive feature Income) and in a health dataset (alongside the sensitive feature HIV Status).
Synthetic dataset generation techniques

- Fully synthetic dataset generation [Rubin, 1993]: sample m datasets from the imputed population and release them publicly.
- Partially synthetic dataset generation [Reiter, 2003]: instead of imputing all values of the sensitive features, impute only those values that bear a higher cost of disclosure (see the sketch below).
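A minimal sketch of the partially synthetic approach, assuming a pandas DataFrame with illustrative column names and linear regression as the data synthesiser: fit the synthesiser to predict the sensitive feature from the identifying features, then replace only the at-risk values with draws from the fitted predictive distribution.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def partially_synthesise(df, identifying, sensitive, at_risk_mask, rng=None):
    """Replace `sensitive` values of at-risk rows with model-based draws."""
    rng = np.random.default_rng(rng)
    model = LinearRegression().fit(df[identifying], df[sensitive])
    residual_std = np.std(df[sensitive] - model.predict(df[identifying]))

    synthetic = df.copy()
    X_risk = df.loc[at_risk_mask, identifying]
    # Draw from the predictive distribution: mean prediction plus residual noise.
    synthetic.loc[at_risk_mask, sensitive] = (
        model.predict(X_risk) + rng.normal(0, residual_std, size=len(X_risk))
    )
    return synthetic

# Hypothetical usage: impute Income only for rows at higher risk of disclosure.
# df = pd.read_csv("census.csv")
# at_risk = df["Income"] > 250_000
# synthetic_df = partially_synthesise(df, ["Age", "Gender"], "Income", at_risk)
```

The fully synthetic variant would apply the same draw to every row and repeat the process m times to release m datasets.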
Experimental evaluation

We extend the comparative study of [Drechsler and Reiter, 2011] by using linear regression as well as neural networks as data synthesisers on the 2003 US Census dataset (https://usa.ipums.org/usa/).

Utility evaluation (data feature: Income)

Fully synthetic data
Data synthesiser    Original sample mean  Synthetic mean  Overlap  Norm. KL div.
Linear regression   27112.61              27074.80        0.52     0.55
Decision tree       27081.45              27091.02        0.55     0.58
Random forest       27107.04              28720.93        0.54     0.64
Neural network      27185.26              26694.54        0.54     0.99

Partially synthetic data
Data synthesiser    Original sample mean  Synthetic mean  Overlap  Norm. KL div.
Linear regression   27112.61              27117.99        0.98     0.54
Decision tree       27081.45              27078.93        0.98     0.99
Random forest       27107.04              27254.38        0.95     0.58
Neural network      27185.26              27370.99        0.81     0.99
Experimental evaluation

Disclosure risk evaluation scenario (see the linkage sketch below)
- Consider an intruder who is interested in people who were born in the US and earn more than $250,000.
- We consider a tolerance of 2 years when matching on the age of a person.
- We assume that the intruder knows that the target is present in the publicly released dataset.

Data synthesiser    True match rate  False match rate
Linear regression   0.06             0.82
Decision tree       0.18             0.68
Random forest       0.35             0.50
Neural network      0.03             0.92
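A sketch of how such a linkage attack and its match rates could be computed; the column names, key attributes, and rate definitions are illustrative assumptions, simplified from the matching procedure of [Drechsler and Reiter, 2011].

```python
import pandas as pd

def match_rates(released: pd.DataFrame, targets: pd.DataFrame, age_tol: int = 2):
    """Link each target to released records on the attack keys and score the links.

    Assumes released and targets share an index that identifies individuals,
    used here only to score whether a link is correct.
    """
    true_matches, total = 0, 0
    for _, t in targets.iterrows():
        candidates = released[
            (released["BornInUS"] == t["BornInUS"])
            & (released["Income"] > 250_000)
            & ((released["Age"] - t["Age"]).abs() <= age_tol)  # age tolerance of 2
        ]
        total += len(candidates)
        true_matches += (candidates.index == t.name).sum()
    false_matches = total - true_matches
    return true_matches / max(total, 1), false_matches / max(total, 1)
```

The more faithful the synthesiser, the more often the linkage recovers the right record, which is why the random forest, with the best utility above, also shows the highest true match rate.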
Why generative models?

- Generative models learn P(Data | pattern), unlike discriminative models, which learn P(pattern | Data).
- Generative models do not tend to overfit the training data.
- Generative models have a data-generating process at the heart of their inception.
Latent Dirichlet Allocation [Blei et al., 2003] (LDA)

Notation
- N: vocabulary size
- D: total number of documents
- K: total number of topics

Intuition
- Bag-of-words assumption
- A document is a distribution over topics: θ_m is a K-dimensional vector, m ∈ [1 ... D]
- A topic is a distribution over words: φ_k is an N-dimensional vector, k ∈ [1 ... K]
Latent Dirichlet Allocation [Blei et al., 2003] (LDA)

Generative process
1. Draw a topic distribution θ_d ∼ Dir(α) for each document d.
2. For each word in the document:
   a. Draw a topic z ∼ Mult(θ_d).
   b. Draw a word w_{d,z} ∼ DirMult(φ_z | β).
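The generative process translates almost line by line into code. A minimal numpy transcription, with the collapsed Dirichlet-multinomial word step unrolled into an explicit per-topic draw φ_k ∼ Dir(β), and all sizes and hyperparameters chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 1000, 100, 10          # vocabulary size, documents, topics
alpha, beta = 0.1, 0.01          # symmetric Dirichlet hyperparameters
words_per_doc = 50

phi = rng.dirichlet(np.full(N, beta), size=K)      # K topic-word distributions
corpus = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))     # document-topic distribution
    doc = []
    for _ in range(words_per_doc):
        z = rng.choice(K, p=theta_d)               # draw a topic for this word
        w = rng.choice(N, p=phi[z])                # draw a word from that topic
        doc.append(w)
    corpus.append(doc)
```

Running the process forward like this is exactly what makes LDA usable as a data synthesiser: once θ and φ are inferred from a real corpus, sampling from them yields synthetic documents.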