On Privacy Risk of Releasing Data and Models
Ashish Dandekar
Supervised by: A/P Stéphane Bressan
July 18, 2019
Introduction

Data is the new oil! (The Economist, 6 May 2017)
AI is the new electricity!
Privacy risk: Publishing Data

Mea culpa, mea culpa, mea maxima culpa!

'Facebook's failure to compel Cambridge Analytica to delete all traces of data from its servers, including any "derivatives", enabled the company to retain predictive models derived from millions of social media profiles!' (The Guardian, 6 May 2018)
Privacy risk: Publishing Data

An arms race between anonymisation and re-identification!

- Re-identification of the governor of Massachusetts in 2000
- Re-identification of Thelma Arnold from AOL searches in 2006
- Re-identification of users in the Netflix dataset in 2006
- Re-identification of cabs in the New York City taxi dataset in 2014
Privacy risk: Publishing Models

If machine learning models learn latent patterns in the dataset, what are the odds that they learn something that they are not supposed to learn?

Attacks on machine learning models:
- Inference attack. [Homer et al., 2008] infer the presence of a particular genome in a dataset from the published statistics of a genomic mixture dataset.
- Model inversion attack. [Fredrikson et al., 2014] infer genetic markers of patients given access to a machine learning model trained on a warfarin dosage dataset.
- Membership inference attack. [Shokri et al., 2017] infer the presence of a data point in the training dataset given access to machine learning models hosted on cloud platforms (see the sketch below).
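To make the membership inference threat concrete, here is a minimal sketch of the underlying intuition, using a simple confidence-threshold heuristic rather than the shadow-model attack of [Shokri et al., 2017]; the dataset, model, and threshold are illustrative assumptions.

```python
# Minimal membership inference sketch: models are often more confident on
# training points than on unseen points, so an attacker can threshold the
# model's confidence to guess membership.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_out, y_train, y_out = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

def guess_membership(model, X, threshold=0.95):
    """Guess 'member' whenever the model's top-class confidence is high."""
    confidence = model.predict_proba(X).max(axis=1)
    return confidence >= threshold

# Fraction of true members flagged versus held-out points wrongly flagged.
tpr = guess_membership(model, X_train).mean()
fpr = guess_membership(model, X_out).mean()
print(f"member hit rate: {tpr:.2f}, non-member false alarm rate: {fpr:.2f}")
```

A gap between the two rates indicates that the model leaks membership information; overfitted models tend to show a larger gap.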
Our contributions

"Synthetic datasets put a full stop on the arms race between anonymisation and re-identification." [Bellovin et al., 2018]

Publication of data
- We illustrate partially and fully synthetic dataset generation techniques using a selection of discriminative models.
- We adapt and extend Latent Dirichlet Allocation, a generative model, to work with spatiotemporal data.
Our contributions

We use differential privacy [Dwork et al., 2014] to provide a quantifiable privacy guarantee when releasing machine learning models.

Publication of models
- We illustrate the use of the functional mechanism to provide differential privacy guarantees for releasing regularised linear regression (see the sketch below).
- We illustrate the use of perturbation of model functions to provide differential privacy guarantees for a selection of non-parametric models.
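As a rough illustration of the functional mechanism (due to Zhang et al., 2012) for regularised linear regression: the squared loss is a degree-2 polynomial in the weights, so it suffices to perturb its data-dependent coefficients, the aggregates XᵀX and Xᵀy, with Laplace noise and minimise the perturbed loss. This is a minimal sketch assuming rows normalised to [-1, 1]; the sensitivity constant below is an illustrative placeholder, not the exact bound from the paper.

```python
import numpy as np

def functional_mechanism_ridge(X, y, epsilon, lam=1.0, rng=None):
    """Sketch of the functional mechanism for ridge-regularised regression."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    delta = 2 * (1 + 2 * d + d * d)   # ASSUMED sensitivity bound for rows in [-1, 1]
    scale = delta / epsilon           # Laplace scale per polynomial coefficient

    # Degree-1 and degree-2 coefficients of the loss sum_i (y_i - x_i @ w)**2.
    phi1 = X.T @ y                    # coefficients of the linear term in w
    phi2 = X.T @ X                    # coefficients of the quadratic term in w

    phi1_noisy = phi1 + rng.laplace(0, scale, size=phi1.shape)
    noise2 = rng.laplace(0, scale, size=phi2.shape)
    phi2_noisy = phi2 + (noise2 + noise2.T) / 2   # keep the quadratic form symmetric

    # Minimiser of the perturbed loss plus lam * ||w||^2; the regulariser is
    # data-independent, so it needs no perturbation.
    return np.linalg.solve(phi2_noisy + lam * np.eye(d), phi1_noisy)
```

The regularisation term also keeps the perturbed quadratic form well-posed, which is one reason the regularised variant is the natural target for this mechanism.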
Our contributions

In the spirit of making differential privacy amenable to business entities, we propose privacy at risk, a probabilistic relaxation of differential privacy.

Privacy at risk
- We define privacy at risk, which provides probabilistic bounds on the privacy guarantee of differential privacy by accounting for various sources of randomness (see the definitions below).
- We illustrate privacy at risk for the Laplace mechanism.
- We propose a cost model that bridges the gap between the abstract guarantee and the compensation budget estimated by a GDPR-compliant business entity.
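For reference, the guarantee being relaxed and a paraphrase of the relaxation (the precise definition of privacy at risk is given in the thesis):

```latex
% \varepsilon-differential privacy [Dwork et al., 2014]: for every pair of
% neighbouring datasets D, D' and every measurable set S of outputs,
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]

% Privacy at risk (paraphrased): a mechanism \mathcal{M} satisfies
% (\varepsilon, \gamma)-privacy at risk if the inequality above holds with
% probability at least \gamma over the sources of randomness, e.g. the
% noise of the Laplace mechanism, which adds Lap(\Delta f / \varepsilon)
% noise calibrated to the sensitivity \Delta f of the query f.
```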
Summary

[Summary diagram: the privacy risk of releasing data is addressed via statistical disclosure risk and synthetic data generation (multiple imputation), using discriminative models (linear regression, decision tree, SVM) and generative models (LDA, RNN) as data synthesisers; the privacy risk of releasing models is addressed via differential privacy, using the functional mechanism with regularisation for parametric models (regularised linear regression) and perturbation of non-parametric models (histogram, KDE, Gaussian process, kernel SVDD), and via privacy at risk of the Laplace mechanism together with a cost model.]
Publication of data
(Privacy risk of re-identification)
Synthetic Data

As authentic as these "Nike" shoes!
Synthetic dataset generation techniques

With the help of a domain expert, a data scientist classifies the features of a data point into two categories.

- Identifying features. Attributes that are not specific to the dataset under study; they may be publicly available as part of other datasets.
- Sensitive features. Attributes that are specific to the dataset under study; they contain data that is deemed sensitive.

For example, DOB, Marital Status and Gender are identifying features that appear both in a census dataset (alongside the sensitive feature Income) and in a health dataset (alongside the sensitive feature HIV Status).
Synthetic dataset generation techniques

- Fully synthetic dataset generation [Rubin, 1993]: sample m datasets from the imputed population and release them publicly.
- Partially synthetic dataset generation [Reiter, 2003]: instead of imputing all values of the sensitive features, impute only those values that bear a higher cost of disclosure (see the sketch below).
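A minimal sketch of the partially synthetic approach, assuming a pandas DataFrame with illustrative column names and linear regression as the data synthesiser: fit the synthesiser to predict the sensitive feature from the identifying features, then replace only the at-risk values with draws from the fitted predictive distribution.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def partially_synthesise(df, identifying, sensitive, at_risk_mask, rng=None):
    """Replace `sensitive` values of at-risk rows with model-based draws."""
    rng = np.random.default_rng(rng)
    model = LinearRegression().fit(df[identifying], df[sensitive])
    residual_std = np.std(df[sensitive] - model.predict(df[identifying]))

    synthetic = df.copy()
    X_risk = df.loc[at_risk_mask, identifying]
    # Draw from the predictive distribution: mean prediction plus residual noise.
    synthetic.loc[at_risk_mask, sensitive] = (
        model.predict(X_risk) + rng.normal(0, residual_std, size=len(X_risk))
    )
    return synthetic

# Hypothetical usage: impute Income only for rows at higher risk of disclosure.
# df = pd.read_csv("census.csv")
# at_risk = df["Income"] > 250_000
# synthetic_df = partially_synthesise(df, ["Age", "Gender"], "Income", at_risk)
```

The fully synthetic variant would apply the same draw to every row and repeat the process m times to release m datasets.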
Experimental evaluation

We extend the comparative study of [Drechsler and Reiter, 2011] by using linear regression as well as neural networks as data synthesisers on the 2003 US Census dataset (https://usa.ipums.org/usa/).

Utility evaluation (data feature: Income)

Fully synthetic data
Data synthesiser    Original sample mean  Synthetic mean  Overlap  Norm. KL div.
Linear regression   27112.61              27074.80        0.52     0.55
Decision tree       27081.45              27091.02        0.55     0.58
Random forest       27107.04              28720.93        0.54     0.64
Neural network      27185.26              26694.54        0.54     0.99

Partially synthetic data
Data synthesiser    Original sample mean  Synthetic mean  Overlap  Norm. KL div.
Linear regression   27112.61              27117.99        0.98     0.54
Decision tree       27081.45              27078.93        0.98     0.99
Random forest       27107.04              27254.38        0.95     0.58
Neural network      27185.26              27370.99        0.81     0.99
Experimental evaluation

Disclosure risk evaluation scenario (see the linkage sketch below)
- Consider an intruder who is interested in people who were born in the US and earn more than $250,000.
- We consider a tolerance of 2 years when matching on the age of a person.
- We assume that the intruder knows that the target is present in the publicly released dataset.

Data synthesiser    True match rate  False match rate
Linear regression   0.06             0.82
Decision tree       0.18             0.68
Random forest       0.35             0.50
Neural network      0.03             0.92
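A sketch of how such a linkage attack and its match rates could be computed; the column names, key attributes, and rate definitions are illustrative assumptions, simplified from the matching procedure of [Drechsler and Reiter, 2011].

```python
import pandas as pd

def match_rates(released: pd.DataFrame, targets: pd.DataFrame, age_tol: int = 2):
    """Link each target to released records on the attack keys and score the links.

    Assumes released and targets share an index that identifies individuals,
    used here only to score whether a link is correct.
    """
    true_matches, total = 0, 0
    for _, t in targets.iterrows():
        candidates = released[
            (released["BornInUS"] == t["BornInUS"])
            & (released["Income"] > 250_000)
            & ((released["Age"] - t["Age"]).abs() <= age_tol)  # age tolerance of 2
        ]
        total += len(candidates)
        true_matches += (candidates.index == t.name).sum()
    false_matches = total - true_matches
    return true_matches / max(total, 1), false_matches / max(total, 1)
```

The more faithful the synthesiser, the more often the linkage recovers the right record, which is why the random forest, with the best utility above, also shows the highest true match rate.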
Why generative models?

- Generative models learn P(Data | pattern), unlike discriminative models, which learn P(pattern | Data).
- Generative models do not tend to overfit the training data.
- Generative models have a data-generating process at the heart of their inception.
Latent Dirichlet Allocation [Blei et al., 2003] (LDA)

Notation
- N: vocabulary size
- D: total number of documents
- K: total number of topics

Intuition
- Bag-of-words assumption
- A document is a distribution over topics: θ_m is a K-dimensional vector, m ∈ [1 ... D]
- A topic is a distribution over words: φ_k is an N-dimensional vector, k ∈ [1 ... K]
Latent Dirichlet Allocation [Blei et al., 2003] (LDA)

Generative process
1. Draw a topic distribution θ_d ∼ Dir(α) for each document d.
2. For each word in the document:
   a. Draw a topic z ∼ Mult(θ_d).
   b. Draw a word w_{d,z} ∼ DirMult(φ_z | β).
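The generative process translates almost line by line into code. A minimal numpy transcription, with the collapsed Dirichlet-multinomial word step unrolled into an explicit per-topic draw φ_k ∼ Dir(β), and all sizes and hyperparameters chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 1000, 100, 10          # vocabulary size, documents, topics
alpha, beta = 0.1, 0.01          # symmetric Dirichlet hyperparameters
words_per_doc = 50

phi = rng.dirichlet(np.full(N, beta), size=K)      # K topic-word distributions
corpus = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))     # document-topic distribution
    doc = []
    for _ in range(words_per_doc):
        z = rng.choice(K, p=theta_d)               # draw a topic for this word
        w = rng.choice(N, p=phi[z])                # draw a word from that topic
        doc.append(w)
    corpus.append(doc)
```

Running the process forward like this is exactly what makes LDA usable as a data synthesiser: once θ and φ are inferred from a real corpus, sampling from them yields synthetic documents.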