Rigorous Foundations for Statistical Data Privacy
Adam Smith, Boston University
CWI, Amsterdam, November 15, 2018
“Privacy” is changing
• Data-driven systems guiding decisions in many areas
• Models increasingly complex
[Figure: competing concerns of privacy, transparency, and control versus the benefits of data (better diagnoses, lower recidivism, …)]
Privacy in Statistical Databases
[Figure: individuals contribute personal data to an “agency”; researchers send queries and the agency answers]
• Large collections of personal information: census data, medical/public health, online advertising, education, …
• Releases take the form of summaries, complex models, synthetic data
Two conflicting goals
• Utility: release aggregate statistics
• Privacy: individual information stays hidden
How do we define “privacy”?
• Studied since the 1960s in
Ø Statistics
Ø Databases & data mining
Ø Cryptography
• This talk: rigorous foundations and analysis
“Relax – it can only see metadata.”
This talk
• Why is privacy challenging?
Ø Anonymization often fails
Ø Example: membership attacks, in theory and in practice
• Differential privacy [DMNS’06]
Ø “Privacy” as stability to small changes
Ø Widely studied and deployed
• The “frontier” of research on statistical privacy
Ø Three topics
First attempt: remove obvious identifiers
• “AI recognizes blurred faces” [McPherson Shokri Shmatikov ’16]
• Individuals re-identified from supposedly anonymized data [Gymrek McGuire Golan Halperin Erlich ’13], [Pandurangan ’14]
• Everything is an identifier [Ganta Kasiviswanathan S. ’08]
Images: whitehouse.gov, genesandhealth.org, medium.com
Is the problem granularity? What if we only release aggregate information?
Statistics together may encode data
• Example: average salary before/after a resignation
• More generally: too many, “too accurate” statistics reveal individual information
Ø Reconstruction attacks [Dinur Nissim 2003, …, Cohen Nissim 2017]
Ø Membership attacks [next slide]
We cannot release everything everyone would want to know
A Few Membership Attacks
• [Homer et al. 2008] Exact high-dimensional summaries allow an attacker to test membership in a data set
Ø Caused the US NIH to change its data sharing practices
• [Dwork, S., Steinke, Ullman, Vadhan, FOCS ’15] Distorted high-dimensional summaries allow an attacker to test membership in a data set
• [Shokri, Stronati, Song, Shmatikov, Oakland 2017] Membership inference using ML as a service (from exact answers)
Membership Attacks
[Figure: a population of people, each described by d binary attributes; n of them form the data set]
Suppose
• We have a data set in which membership is sensitive
Ø Participants in a clinical trial
Ø Targeted ad audience
• The data have many binary attributes for each person
Ø Genome-wide association studies: d = 1,000,000 attributes (“SNPs”), n < 2000 people
Membership Attacks
[Figure: the attacker compares Alice’s attribute vector against the released column averages and outputs “in” or “out”]
• Release the exact column averages
• The attacker succeeds with high probability when there are more attributes than people (d ≫ n)
Membership Attacks
[Figure: same setup, but each released average is perturbed by ±α]
• Release distorted column averages (each entry off by up to ±α)
• No matter how the distortion is performed, the attacker succeeds with high probability when there are more attributes than people and α ≪ √d / n
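A minimal simulation of this style of attack, as a sketch rather than the exact test from the papers above: the population frequencies, the inner-product statistic, and the parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20_000, 500                     # many more attributes than people
p = rng.uniform(0.05, 0.95, d)         # population attribute frequencies (assumed known to the attacker)

data = (rng.random((n, d)) < p).astype(float)    # the n people in the sensitive data set
alice_in = data[0]                                # a member of the data set
alice_out = (rng.random(d) < p).astype(float)     # someone from the same population, not in the data set

averages = data.mean(axis=0)           # the released (exact) column averages

def score(person):
    """Inner-product test statistic: correlate a person's attributes with the released averages."""
    return float(np.dot(person - p, averages - p))

# When d >> n, members score clearly positive (on the order of d/n) while non-members
# score near 0, so a simple threshold separates "in" from "out" with high probability.
print("member score:    ", score(alice_in))
print("non-member score:", score(alice_out))
```

Adding a small perturbation to `averages` before scoring illustrates the robust version of the attack: modest distortion leaves the gap between member and non-member scores intact.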
Machine Learning as a Service
[Figure: sensitive DATA feeds a training API that produces a model; a prediction API takes inputs from users and apps and returns classifications]
• The data are sensitive: transactions, preferences, online and offline behavior
Exploiting Trained Models
[Figure: the attacker sends the prediction API both inputs from the training set and inputs not from the training set, and observes the classifications]
• Train a model to recognize the difference between the two
• … without knowing the specifics of the actual model!
This talk
• Why is privacy challenging?
Ø Anonymization often fails
Ø Example: membership attacks, in theory and in practice
• Differential privacy [DMNS’06]
Ø “Privacy” as stability to small changes
Ø Widely studied and deployed
• The “frontier” of research on statistical privacy
Ø Three topics
Differential Privacy
• Several current deployments: Apple, Google, US Census
• Burgeoning field of research: algorithms, databases, cryptography & security, statistics & machine learning, game theory & economics, law & policy, programming languages
Differential Privacy
[Figure: a randomized algorithm A takes the data set x and random bits and produces A(x)]
• Data set x
Ø The domain D can be numbers, categories, tax forms
Ø Think of x as fixed (not random)
• A = randomized procedure
Ø A(x) is a random variable
Ø Randomness might come from adding noise, resampling, etc.
Differential Privacy
[Figure: A is run on x and on a neighboring data set x′, producing A(x) and A(x′)]
• A thought experiment
Ø Change one person’s data (or add or remove them)
Ø Will the probabilities of outcomes change? (e.g., “I get denied health insurance”)
• Goal: any set of outcomes should have about the same probability in both worlds
Differential Privacy
• x′ is a neighbor of x if they differ in one data point
• Neighboring databases should induce close distributions on outputs
• Definition: A is ε-differentially private if, for all neighbors x, x′ and all subsets S of outputs,
    Pr[A(x) ∈ S] ≤ (1 + ε) · Pr[A(x′) ∈ S]
  (the standard definition uses the factor e^ε, which is ≈ 1 + ε for small ε)
• ε is a leakage measure
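To make the definition concrete, here is a tiny exact check of the inequality for a one-bit mechanism (the randomized response of the next slide, with a 2/3 chance of telling the truth). For a finite output set, bounding the ratio on single outcomes bounds it on every subset S.

```python
# Report the true bit with probability 2/3, the opposite bit with probability 1/3.
# "Neighboring" data sets here are just the two possible values of one person's bit.
def output_dist(bit, p_truth=2/3):
    """Return {outcome: probability} when the person's true bit is `bit`."""
    return {bit: p_truth, 1 - bit: 1 - p_truth}

dist0, dist1 = output_dist(0), output_dist(1)
ratio = max(max(dist0[o] / dist1[o], dist1[o] / dist0[o]) for o in (0, 1))
print("worst-case probability ratio:", ratio)   # 2.0, i.e. e^eps = 2, eps = ln 2 ≈ 0.7
```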
Randomized Response [Warner 1965]
• Say we want to release the proportion of diabetics in a data set
Ø Each person’s data is 1 bit: x_i = 0 or x_i = 1
• Randomized response: each individual rolls a die
Ø 1, 2, 3 or 4: report the true value x_i
Ø 5 or 6: report the opposite value 1 − x_i
• The output is the list of reported values y_1, …, y_n
Ø Satisfies our definition with ε ≈ 0.7
Ø Can estimate the fraction of x_i’s that are 1 when n is large
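A short sketch of randomized response and the debiasing step that recovers the population proportion (function and variable names are mine; the 2/3-vs-1/3 split matches the die roll above):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(bits, p_truth=2/3):
    """Each person reports their true bit with prob. 2/3, the opposite bit otherwise."""
    tell_truth = rng.random(len(bits)) < p_truth
    return np.where(tell_truth, bits, 1 - bits)

def estimate_fraction(reports, p_truth=2/3):
    """Debias: E[reported mean] = (1 - p_truth) + pi * (2*p_truth - 1); solve for pi."""
    return (reports.mean() - (1 - p_truth)) / (2 * p_truth - 1)

true_bits = (rng.random(10_000) < 0.3).astype(int)   # 30% of people hold a sensitive bit
reports = randomized_response(true_bits)
print("true fraction:     ", true_bits.mean())
print("estimated fraction:", estimate_fraction(reports))
```

The debiasing works because the expected reported fraction is a shifted and rescaled version of the true fraction, and the extra noise in the estimate shrinks at the usual 1/√n rate.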
Laplace Mechanism
[Figure: compute f(x), then release A(x) = f(x) + noise]
• Say we want to release a summary f(x) ∈ ℝ^d
Ø e.g., proportion of diabetics: x_i ∈ {0,1} and f(x) = (1/n) Σ_i x_i
• Simple approach: add noise to f(x)
Ø How much noise is needed?
Ø Idea: calibrate the noise to some measure of f’s volatility
Laplace Mechanism
A(x) = f(x) + noise
• Global sensitivity: GS_f = max over neighboring x, x′ of ‖f(x) − f(x′)‖₁
Ø Example: for the proportion f(x) = (1/n) Σ_i x_i, changing one person moves f by at most 1/n, so GS_f = 1/n
• The Laplace distribution Lap(λ) has density proportional to exp(−|y|/λ)
Ø Adding Lap(GS_f / ε) noise to each coordinate gives ε-differential privacy: changing one data point only translates the density curve, so output probabilities change by at most a factor of e^ε
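A minimal sketch of the mechanism for a single proportion, using the sensitivity calculation above (the function names and parameter choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(value, sensitivity, eps):
    """Release value + Lap(sensitivity / eps) noise, calibrated to f's global sensitivity."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / eps)

# Proportion of diabetics among n people: changing one person's bit moves f(x) by at most 1/n.
x = (rng.random(5_000) < 0.12).astype(int)
n = len(x)
f_x = x.mean()

private_release = laplace_mechanism(f_x, sensitivity=1.0 / n, eps=0.5)
print("exact proportion:", f_x)
print("private release: ", private_release)   # off by roughly 1/(eps*n), i.e. about 0.0004 here
```

For a d-dimensional summary, the same recipe adds independent Lap(GS_f / ε) noise to every coordinate, with GS_f measured in the ℓ₁ norm.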
Attacks “match” differential privacy
A(x) = f(x) + noise
• Can release d proportions with noise ≈ √d / (εn) per entry
Ø Requires the “approximate” variant of DP
[Figure: per-entry error scale comparing sampling error, differential privacy, reconstruction attacks, and robust membership attacks]
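A back-of-envelope comparison of these scales for the GWAS-style parameters from the earlier slide (up to d = 1,000,000 attributes, n = 2000 people). Constants and the log(1/δ) factor of approximate DP are ignored, so treat the numbers as orders of magnitude only.

```python
import math

def dp_noise_per_entry(d, n, eps=1.0):
    """Rough per-proportion noise needed by approximate DP, ignoring constants and log(1/delta)."""
    return math.sqrt(d) / (eps * n)

def sampling_error(n):
    """Statistical sampling error of a single proportion."""
    return 1.0 / math.sqrt(n)

for d in (100, 10_000, 1_000_000):
    print(f"d={d:>9}: DP noise ~ {dp_noise_per_entry(d, 2000):.4f}, "
          f"sampling error ~ {sampling_error(2000):.4f}")
```

With d = 1,000,000 the required noise (about 0.5) swamps any signal in a proportion, while for a few hundred statistics it sits below the sampling error, which is the sense in which one cannot release everything everyone would want to know.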
A rich algorithmic field
• Noise addition: A(x) = f(x) + noise
• Exponential sampling: output y ∼ A(x) with Pr[y] ∝ exp(ε ⋅ score(x, y))
• Local perturbation: each user randomizes their own data x₁, …, xₙ before sending it to an untrusted aggregator A
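A sketch of exponential sampling over a finite set of candidate outputs; the exp(ε · score / (2 · sensitivity)) normalization used below is the standard one, which the proportionality above leaves implicit, and the counting score is a toy example.

```python
import numpy as np

rng = np.random.default_rng(0)

def exponential_mechanism(scores, eps, sensitivity=1.0):
    """Sample one candidate index with Pr[y] proportional to exp(eps * score(x, y) / (2 * sensitivity))."""
    logits = eps * np.asarray(scores, dtype=float) / (2 * sensitivity)
    logits -= logits.max()            # only for numerical stability; does not change the distribution
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(scores), p=probs)

# Toy example: privately pick the most common item, scoring each candidate by its count
# (changing one person changes any count by at most 1, so sensitivity = 1).
counts = {"A": 40, "B": 38, "C": 5}
items = list(counts)
choice = exponential_mechanism(list(counts.values()), eps=1.0)
print("chosen item:", items[choice])   # usually A or B, rarely C
```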
Interpreting Differential Privacy
• A naïve hope: your beliefs about me are the same after you see the output as they were before
• Impossible:
Ø Suppose you know that I smoke
Ø Clinical study: “smoking and cancer correlated”
Ø You learn something about me, whether or not my data were used
• Differential privacy implies: no matter what you know ahead of time, you learn (almost) the same things about me whether or not my data are used
Ø Provably resists the attacks mentioned earlier
Research on (differential) privacy
• Definitions
Ø Pinning down “privacy”
• Algorithms: what can we compute privately?
Ø Fundamental techniques
Ø Specific applications
• Usable systems
• Attacks: “cryptanalysis” for data privacy
• Protocols: cryptographic tools for large-scale analysis
• Implications for other areas
Ø Adaptive data analysis
Ø Law and policy
This talk
• Why is privacy challenging?
Ø Anonymization often fails
Ø Example: membership attacks, in theory and in practice
• Differential privacy [DMNS’06]
Ø “Privacy” as stability to small changes
Ø Widely studied and deployed
• The “frontier” of research on statistical privacy
Ø Three topics
Frontier 1: Deep Learning with DP [Abadi et al. 2016, …]
[Figure: sensitive data flows into a deep learning pipeline that outputs a model and its parameters; annotations: “revealed now, but should be hidden” and “thought of as private now, but better to reason as if public”]
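A minimal sketch of the core mechanism in [Abadi et al. 2016]: clip each per-example gradient and add Gaussian noise before the update. The logistic-regression model, clipping norm, noise multiplier, and learning rate are illustrative choices, and the paper's privacy accounting (the moments accountant) is omitted, so this sketch does not by itself certify an (ε, δ) guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(w, X_batch, y_batch, lr=0.1, clip_norm=1.0, noise_mult=1.0):
    """One DP-SGD step: clip each per-example gradient, sum, add Gaussian noise, average, update."""
    clipped = []
    for x, y in zip(X_batch, y_batch):
        pred = 1.0 / (1.0 + np.exp(-x @ w))                 # logistic regression prediction
        g = (pred - y) * x                                   # per-example gradient of the logistic loss
        g = g / max(1.0, np.linalg.norm(g) / clip_norm)      # clip to L2 norm <= clip_norm
        clipped.append(g)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=w.shape)
    g_avg = (np.sum(clipped, axis=0) + noise) / len(X_batch)
    return w - lr * g_avg

# Toy data: 200 points, 5 features, labels from a random linear rule.
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(float)

w = np.zeros(5)
for _ in range(100):
    idx = rng.choice(len(X), size=32, replace=False)
    w = dp_sgd_step(w, X[idx], y[idx])
print("training accuracy:", ((X @ w > 0) == y).mean())
```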