Differential Privacy and Applications Marco Gaboardi Boston University
Data
Private Queries?
Private Queries? medical correlation? query → answer
Private Queries? Does Joe have cancer? query → answer
Private Queries? Does Joe have cancer?
Anonymization?
Anonymization? medical correlation query → answer
Anonymization? query ?!? answer
Attacks on Anonymization (Narayanan, Shmatikov: Robust De-anonymization of Large Sparse Datasets. IEEE Symposium on Security and Privacy 2008)
[Figure: Additional Data, Anonymous Data, correlations]
A Possible Solution: randomization
Adding noise Noise
Adding noise medical correlation? Noise query → answer + noise
Adding noise Noise ?!? query → answer + noise
Adding noise Noise ?!?
Data analyst
Privacy vs Utility Privacy Utility
Differential privacy: understanding the mathematical and computational meaning of this trade-off. [Dwork, McSherry, Nissim, Smith, TCC06]
Some Official Users
• US Census Bureau - OnTheMap, new releases in 2020
• Google - RAPPOR tool for Chrome
• Apple - typing statistics reports in devices
• Facebook - social science data release
• Uber / Amazon / Mozilla / Snapchat
• Many startups
The rest of the class
• Today: Fundamentals of reconstruction attacks and the definition of differential privacy.
• Tuesday: Basic mechanisms and basic properties of differential privacy, and how to support them in programming languages.
• Thursday: More advanced mechanisms and their verification.
• Friday: Other models and applications.
Today: Fundamentals of reconstruction attacks and the definition of differential privacy
Data Statistics over Data
Is this data private?
     D1  D2  D3  D4  D5  D6  D6  D7  D8  D9  D10 D11 D12 D13 D14 D15
I1   0   0   0   1   0   0   0   0   1   0   0   1   1   0   0   1
I2   1   0   1   1   1   0   1   0   1   0   1   0   0   1   0   0
I3   0   1   0   1   1   1   0   1   0   0   0   1   0   0   1   0
I4   1   0   1   0   0   1   1   0   1   1   0   0   0   0   1   1
I5   0   0   0   1   1   0   1   1   0   1   0   1   0   1   0   0
I6   0   0   1   1   0   1   1   0   1   1   0   0   1   0   1   0
I7   1   1   0   0   1   0   1   1   1   0   1   0   1   0   0   1
I8   0   0   0   0   1   1   0   0   0   0   0   0   1   0   0   0
I9   0   1   0   0   1   0   1   1   0   1   1   1   0   1   1   0
I10  1   0   1   0   0   1   1   0   0   0   0   0   0   1   0   1
I11  0   1   0   1   1   0   0   1   0   1   0   1   0   1   1   0
I12  1   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0
I13  1   1   1   0   1   1   1   1   0   0   1   0   1   0   1   0
I14  0   1   1   0   0   0   0   1   0   0   0   1   0   0   1   0
I15  0   1   0   1   0   1   1   0   1   0   1   0   1   0   0   1
How about if we also have this data?
#   ID   Disease             #   ID   Name
1   D1   AMAN                1   I1   Alice
2   D2   Behcet              2   I2   Bob
3   D3   Celiac              3   I3   Cynthia
4   D4   Dermatitis          4   I4   Dan
5   D5   Evans synd.         5   I5   Eve
6   D6   Fibrosis            6   I6   Frank
7   D7   Graves’ dis.        7   I7   Guy
8   D8   Henoch-Schonlein    8   I8   Hannah
9   D9   IGA Neph.           9   I9   Ivan
10  D10  Juv. Diabetes       10  I10  Jon
11  D11  Kawasaki dis.       11  I11  Ken
12  D12  Lichen planus       12  I12  Lou
13  D13  Myositis            13  I13  Mike
14  D14  Narcolepsy          14  I14  Noa
15  D15  Optic Neuritis      15  I15  Omer
How about this?
D1  D2  D3  D4  D5  D6  D6  D7  D8  D9  D10 D11 D12 D13 D14 D15
6   7   6   7   8   7   9   7   5   5   1   6   6   5   7   5

#   ID   Disease             #   ID   Disease
1   D1   AMAN                9   D9   IGA Neph.
2   D2   Behcet              10  D10  Juv. Diabetes
3   D3   Celiac              11  D11  Kawasaki dis.
4   D4   Dermatitis          12  D12  Lichen planus
5   D5   Evans synd.         13  D13  Myositis
6   D6   Fibrosis            14  D14  Narcolepsy
7   D7   Graves’ dis.        15  D15  Optic Neuritis
8   D8   Henoch-Schonlein
The answers to these kinds of questions depend on the additional information we have available.
Database
• We can think of a database as a list of records from some universe set: $D \in X^n = DB$
• Sometimes we will think of them as functions: $D[i] \in X$
• and sometimes we will write the elements explicitly: $(x_1, \ldots, x_n) \in DB$
(Normalized) Counting Queries
• A counting query $q : X^n \to [0,1]$ is a function counting the proportion of elements in a dataset that satisfy a predicate $q : X \to \{0,1\}$
• In symbols: $q(D) = \frac{1}{n} \sum_i q(D[i])$
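The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `counting_query` and the toy dataset are assumptions made for this example, not part of the slides.

```python
from typing import Callable, Sequence, TypeVar

X = TypeVar("X")


def counting_query(predicate: Callable[[X], bool], dataset: Sequence[X]) -> float:
    """Return the proportion of records in `dataset` that satisfy `predicate`."""
    if not dataset:
        raise ValueError("dataset must contain at least one record")
    return sum(1 for record in dataset if predicate(record)) / len(dataset)


# Hypothetical toy dataset: each record is a tuple of three binary attributes.
toy_dataset = [(0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1)]

# Predicate "the first attribute is 1"; two of the four records satisfy it.
fraction_with_first_attribute = counting_query(lambda x: x[0] == 1, toy_dataset)
```

The predicate plays the role of $q : X \to \{0,1\}$ and the returned value that of $q(D)$.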
Example 1
Let’s consider an arbitrary universe domain $X$ and, for $y \in X$, the following predicate:
$$q_y(x) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{otherwise} \end{cases}$$
We call the associated counting query $q_y : X^n \to [0,1]$ a point function.
Question: Suppose that we answer all the point function queries for $y \in X$. What well-known statistic do we obtain?
Example 1
$D \in X^{10}$, $X = \{0,1\}^3$
     D1  D2  D3
I1   0   0   0
I2   1   0   1
I3   0   1   0
I4   1   0   1
I5   0   0   0
I6   0   0   1
I7   1   1   0
I8   0   0   0
I9   0   1   0
I10  1   0   1

$q_{000}(D) = .3$   $q_{100}(D) = 0$
$q_{001}(D) = .1$   $q_{101}(D) = .3$
$q_{010}(D) = .2$   $q_{110}(D) = .1$
$q_{011}(D) = 0$    $q_{111}(D) = 0$
[Figure: bar chart of these values over 000, 001, 010, 011, 100, 101, 110, 111; vertical axis from 0 to 0.3]
Example 1
Question: Suppose that we answer all the point function queries for $y \in X$. What well-known statistic do we obtain?
Answer: Histogram of the database.
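As a rough check of this answer, the sketch below evaluates every point function on the ten-record dataset of Example 1 and collects the results into a histogram. The names `point_query` and `histogram` are illustrative assumptions, not notation from the slides.

```python
from itertools import product

# The ten records of Example 1, one tuple (D1, D2, D3) per individual I1..I10.
DATASET = [
    (0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0),
    (0, 0, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 0, 1),
]


def point_query(y, dataset):
    """Normalized counting query for the predicate 'x == y'."""
    return sum(1 for x in dataset if x == y) / len(dataset)


def histogram(dataset, d=3):
    """Answer all 2**d point queries over {0,1}**d, keyed by bit string."""
    return {
        "".join(map(str, y)): point_query(y, dataset)
        for y in product((0, 1), repeat=d)
    }


hist = histogram(DATASET)
for key, value in hist.items():
    print(key, value)
```

The answers sum to 1 because every record matches exactly one point of the universe.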
Example II
Let’s consider an arbitrary ordered universe domain $X$ and, for $y \in X$, the following predicate:
$$q_y(x) = \begin{cases} 1 & \text{if } x \le y \\ 0 & \text{otherwise} \end{cases}$$
We call the associated counting query $q_y : X^n \to [0,1]$ a threshold function.
Question: Suppose that we answer all the threshold function queries for $y \in X$. What well-known statistic do we obtain?
Example II
$D \in X^{10}$, $X = \{0,1\}^3$ with the order given by the corresponding binary encoding.
     D1  D2  D3
I1   0   0   0
I2   1   0   1
I3   0   1   0
I4   1   0   1
I5   0   0   0
I6   0   0   1
I7   1   1   0
I8   0   0   0
I9   0   1   0
I10  1   0   1

$q_{000}(D) = .3$   $q_{100}(D) = .6$
$q_{001}(D) = .4$   $q_{101}(D) = .9$
$q_{010}(D) = .6$   $q_{110}(D) = 1$
$q_{011}(D) = .6$   $q_{111}(D) = 1$
[Figure: plot of these values over 000, 001, 010, 011, 100, 101, 110, 111; vertical axis from 0 to 1]
Example II
Question: Suppose that we answer all the threshold function queries for $y \in X$. What well-known statistic do we obtain?
Answer: CDF of the database.
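The same dataset can be used to sketch the threshold queries. The code assumes that D1 is the most significant bit of the binary encoding, which is what the values on the slide suggest; Python's tuple comparison then coincides with that order. The names `threshold_query` and `empirical_cdf` are illustrative.

```python
from itertools import product

# The ten records of Example II, one tuple (D1, D2, D3) per individual I1..I10.
DATASET = [
    (0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0),
    (0, 0, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 0, 1),
]


def threshold_query(y, dataset):
    """Normalized counting query for the predicate 'x <= y'.

    Tuples of bits compare lexicographically, which matches the binary
    encoding order when the first attribute is the most significant bit
    (an assumption of this sketch).
    """
    return sum(1 for x in dataset if x <= y) / len(dataset)


def empirical_cdf(dataset, d=3):
    """Answer all threshold queries over {0,1}**d, keyed by bit string."""
    return {
        "".join(map(str, y)): threshold_query(y, dataset)
        for y in product((0, 1), repeat=d)
    }


cdf = empirical_cdf(DATASET)
```

Each value is the running total of the point-query answers up to that point, which is why the result is non-decreasing and ends at 1.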
Example III
Let’s consider the universe domain $X = \{0,1\}^d$ and, for an index $1 \le j \le d$, the following predicate:
$$q_j(x) = x[j]$$
We call the associated counting query $q_j : \{0,1\}^{n \times d} \to [0,1]$ an attribute counting function.
Question: Which statistic corresponds to releasing all the attribute counting functions?
Example III
$D \in X^{10}$, $X = \{0,1\}^3$
        D1  D2  D3
I1      0   0   0
I2      1   0   1
I3      0   1   0
I4      1   0   1
I5      0   0   0
I6      0   0   1
I7      1   1   0
I8      0   0   0
I9      0   1   0
I10     1   0   1
margin  4   3   4

$q_1(D) = .4$   $q_2(D) = .3$   $q_3(D) = .4$
Example III
Question: Which statistic corresponds to releasing all the attribute counting functions?
Answer: (1-way) Marginals of the database
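A minimal sketch of the attribute counting functions on the same ten records follows. The helper names are assumptions for illustration, and attributes are indexed from 1 to match the slide.

```python
# The ten records of Example III, one tuple (D1, D2, D3) per individual I1..I10.
DATASET = [
    (0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0),
    (0, 0, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 0, 1),
]


def attribute_query(j, dataset):
    """Normalized counting query for the predicate x[j], with j 1-indexed."""
    return sum(x[j - 1] for x in dataset) / len(dataset)


def one_way_marginals(dataset):
    """Answer the attribute counting query for every attribute."""
    d = len(dataset[0])
    return [attribute_query(j, dataset) for j in range(1, d + 1)]


marginals = one_way_marginals(DATASET)
```

Multiplying each answer by the number of records recovers the "margin" row of column counts.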
Example IV
Let’s consider the universe domain $X = \{0,1\}^d$ and let’s consider $\vec{v} \in \mathrm{List}[k]\{1, \bar{1}, 2, \bar{2}, \ldots, d, \bar{d}\}$ and
$$q_{\vec{v}}(x) = q_{v_1}(x) \wedge q_{v_2}(x) \wedge \cdots \wedge q_{v_k}(x)$$
where $q_j(x) = x_j$ and $q_{\bar{j}}(x) = \neg x_j$.
We call the associated counting query $q_{\vec{v}} : \{0,1\}^{n \times d} \to [0,1]$ a conjunction or k-way marginal.
Question: Which statistic corresponds to releasing conjunctions?
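Example IV
$D \in X^{10}$, $X = \{0,1\}^3$
     D1  D2  D3
I1   0   0   0
I2   1   0   1
I3   0   1   0
I4   1   0   1
I5   0   0   0
I6   0   0   1
I7   1   1   0
I8   0   0   0
I9   0   1   0
I10  1   0   1

k = 2
$q_{12}(D) = .1$         $q_{\bar{1}2}(D) = .2$
$q_{1\bar{2}}(D) = .3$   $q_{\bar{1}3}(D) = .1$
$q_{13}(D) = .3$         $q_{\bar{1}\bar{2}}(D) = .4$
$q_{1\bar{3}}(D) = .1$   $q_{\bar{1}\bar{3}}(D) = .5$

      D1    /D1
D2    0.1   0.2
/D2   0.3   0.4
(/Dj denotes the negation of Dj, written $\bar{j}$ in the query subscripts.)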
Example IV
Question: Which statistic corresponds to releasing conjunctions?
Answer: contingency tables
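The conjunction queries can be sketched as follows. The encoding of literals as signed integers (`j` for the attribute, `-j` for its negation) is an assumption of this example, chosen only to keep the code short; the slides use barred indices instead.

```python
# The ten records of Example IV, one tuple (D1, D2, D3) per individual I1..I10.
DATASET = [
    (0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0),
    (0, 0, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 0, 1),
]


def literal(v, x):
    """Evaluate literal v on record x: v = j means x_j, v = -j means NOT x_j."""
    bit = x[abs(v) - 1]
    return bit == 1 if v > 0 else bit == 0


def conjunction_query(literals, dataset):
    """Proportion of records that satisfy every literal in `literals`."""
    return sum(
        1 for x in dataset if all(literal(v, x) for v in literals)
    ) / len(dataset)


def contingency_table(i, j, dataset):
    """All four 2-way conjunctions over attributes i and j."""
    return {
        (a, b): conjunction_query((a, b), dataset)
        for a in (i, -i)
        for b in (j, -j)
    }


table_d1_d2 = contingency_table(1, 2, DATASET)
```

The four entries of `table_d1_d2` correspond to the cells of the D1/D2 contingency table in the example.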
Linear Queries
• A linear query $q : X^n \to [0,1]$ is a function averaging the value of a function $q : X \to [0,1]$ over the elements of the dataset.
• In symbols: $q(D) = \frac{1}{n} \sum_i q(D[i])$
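A linear query differs from a counting query only in that the per-record function may take any value in [0,1]. The sketch below reuses the ten records from the earlier examples; the per-record function `fraction_of_ones` is a hypothetical choice made for illustration.

```python
# The ten records used in the earlier examples, one tuple (D1, D2, D3) each.
DATASET = [
    (0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0),
    (0, 0, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 0, 1),
]


def linear_query(f, dataset):
    """Average a function f : X -> [0, 1] over the records of `dataset`."""
    if not dataset:
        raise ValueError("dataset must contain at least one record")
    values = [f(x) for x in dataset]
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("f must take values in [0, 1]")
    return sum(values) / len(values)


def fraction_of_ones(x):
    """Hypothetical per-record function: share of attributes set to 1."""
    return sum(x) / len(x)


average_fraction = linear_query(fraction_of_ones, DATASET)

# A counting query is the special case where f only returns 0 or 1.
share_with_d1 = linear_query(lambda x: float(x[0] == 1), DATASET)
```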
Sum queries
• Let’s denote by $I \subseteq \mathbb{N}$ a subset of the rows of the dataset.
• A sum query $q_I : \mathrm{List}(X) \to \mathbb{N}$ is defined as $q_I(D) = \sum_{i \in I} D[i]$
Example
X = List[3]{0,1}, D =
     D1  D2  D3
I1   0   0   0
I2   1   0   1
I3   0   1   0
I4   1   0   1
I5   0   0   0
I6   0   0   1
I7   1   1   0
I8   0   0   0
I9   0   1   0
I10  1   0   1

$q_{\{1,2,3\}}(D) = (1,1,1)$
$q_{\{1,2,4\}}(D) = (2,0,2)$
$q_{\{5,8\}}(D) = (0,0,0)$
$q_{\{2,4,7,10\}}(D) = (4,1,3)$
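The sum queries of this example can be reproduced with a short sketch. It assumes that rows are numbered from 1 (I1 is row 1) and that records are summed component-wise, which is what the values in the example suggest; the function name `sum_query` is illustrative.

```python
# The ten records of the example, one tuple (D1, D2, D3) per individual I1..I10.
DATASET = [
    (0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0),
    (0, 0, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 0, 1),
]


def sum_query(rows, dataset):
    """Component-wise sum of the records whose 1-indexed row is in `rows`."""
    selected = [dataset[i - 1] for i in sorted(rows)]
    width = len(dataset[0])
    return tuple(sum(record[j] for record in selected) for j in range(width))


answers = {
    "q_{1,2,3}": sum_query({1, 2, 3}, DATASET),
    "q_{1,2,4}": sum_query({1, 2, 4}, DATASET),
    "q_{5,8}": sum_query({5, 8}, DATASET),
    "q_{2,4,7,10}": sum_query({2, 4, 7, 10}, DATASET),
}
```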
Question: Is releasing the result of sum (counting or linear) queries private?
Question: How can we make statistical queries private?
Private Statistical database
[Figure: a statistical query goes to the database through Noise; the reply is answer + noise]
Question: What kind of noise?
Additive Noise Perturbation
• We say that M is a privacy mechanism obtained by adding noise if, for every query q, M creates a new randomized query: $q^*(D) = q(D) + Y$
• We say that a mechanism M adds noise within perturbation E iff for every q and every D: $|q^*(D) - q(D)| \le E$
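One way to picture a mechanism that adds noise within perturbation E is the sketch below. The slides do not fix a distribution for Y, so drawing it uniformly from [-E, E] is purely an assumption of this example, as are the names `bounded_noise_mechanism` and `fraction_with_d1`.

```python
import random

# The ten records used in the earlier examples, one tuple (D1, D2, D3) each.
DATASET = [
    (0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0),
    (0, 0, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 0, 1),
]


def bounded_noise_mechanism(query, bound, rng=None):
    """Wrap `query` into a randomized query q* with |q*(D) - q(D)| <= bound."""
    rng = rng if rng is not None else random.Random()

    def noisy_query(dataset):
        noise = rng.uniform(-bound, bound)  # Y, assumed uniform on [-E, E]
        return query(dataset) + noise

    return noisy_query


def fraction_with_d1(dataset):
    """Counting query: proportion of records whose first attribute is 1."""
    return sum(x[0] for x in dataset) / len(dataset)


# Seeded generator so that repeated runs give the same noisy answers.
noisy_d1 = bounded_noise_mechanism(fraction_with_d1, bound=0.05, rng=random.Random(0))
noisy_answers = [noisy_d1(DATASET) for _ in range(5)]
```

Every noisy answer stays within 0.05 of the true value 0.4, up to floating-point rounding.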
Question: Does this approach protect privacy?
Reconstruction attack
[Figure: the Attacker sends queries q1, q2, …, qk to the database D, protected by Noise, and produces D’]