Modeling Data Correlations in Private Data Mining with Markov Model and Markov Networks
Yang Cao, Emory University, 2017.11.15
Outline
• Data Mining with Differential Privacy (DP)
• Scenario: Spatiotemporal Data Mining using DP
• Markov Chain for temporal correlations
• Gaussian Markov Random Field for user-user correlations
• Summary and open problems
Data Mining
(Diagram: a company or institute mines a sensitive database; the results released to the public can be attacked by an adversary.)
Privacy-Preserving Data Mining (PPDM)
How? ε-Differential Privacy!
(Diagram: the institute releases noisy data instead of the sensitive data, so the adversary's attack fails.)
What is Differential Privacy
• Privacy: the right to be forgotten.
• DP: the output of an algorithm should NOT be significantly affected by any single individual's data.
• Formally, M satisfies ε-DP if, for any neighboring databases D and D' (differing in one individual's record) and any output r:
  log( Pr(M(Q(D)) = r) / Pr(M(Q(D')) = r) ) ≤ ε
  ε ⬆, privacy ⬇: e.g., satisfying only 2ε-DP means more privacy loss than ε-DP.
• e.g., Laplace mechanism: add Lap(1/ε) noise to Q(D) (for a query with sensitivity 1, such as a count).
• Sequential composition: e.g., running M twice → 2ε-DP.
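The Laplace mechanism can be sketched in a few lines (an illustrative sketch, not any particular library's API; `laplace_mechanism` and its parameters are my own naming):

```python
import numpy as np

def laplace_mechanism(true_answer, epsilon, sensitivity=1.0, rng=None):
    """Release a query answer under eps-DP by adding Lap(sensitivity/eps) noise."""
    rng = np.random.default_rng() if rng is None else rng
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A count query has sensitivity 1, so Lap(1/eps) noise suffices.
noisy_count = laplace_mechanism(42, epsilon=0.5)
```

By sequential composition, answering the same query twice with this mechanism consumes a total budget of 2ε.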
An Open Problem of DP on Correlated Data
• When data are independent: M(Q(D)) ≈ M(Q(D')) for neighboring D and D' ⇒ ε-DP.
• When data are correlated (e.g., u1 and u3 always have the same value): changing u1's record implies u3's record also changes, so M(Q(D)) and M(Q(D')) may be distinguishable ⇒ ?-DP.
• It is still controversial [*][**] what "guarantee" DP provides in this setting.
[*] Differential Privacy as a Causal Property, https://arxiv.org/abs/1710.05899
[**] https://github.com/frankmcsherry/blog/blob/master/posts/2016-08-29.md
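A numeric sketch (my own illustration, not from the slides) of why such correlation erodes the nominal guarantee: if u1 and u3 always report the same value, flipping u1 shifts a count by 2, and Lap(1/ε) noise then only bounds the privacy loss by 2ε:

```python
import numpy as np

def laplace_logpdf(x, mu, scale):
    return -np.log(2 * scale) - np.abs(x - mu) / scale

epsilon = 1.0
scale = 1.0 / epsilon  # Lap(1/eps), calibrated for sensitivity 1
xs = np.linspace(-10, 10, 100001)

# Independent records: neighboring counts differ by 1 -> loss bounded by eps.
loss_indep = np.max(np.abs(laplace_logpdf(xs, 0, scale) - laplace_logpdf(xs, 1, scale)))

# u1 and u3 move together: neighboring counts differ by 2 -> loss reaches 2*eps.
loss_corr = np.max(np.abs(laplace_logpdf(xs, 0, scale) - laplace_logpdf(xs, 2, scale)))
```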
Quantifying DP on Correlated Data
• A few recent papers [Cao17][Yang15][Song17] use a quantification approach to achieve ε-DP (protecting each user's private data value).
• Traditional approach (if the attacker knows the correlations, ε-DP may not hold):
  sensitive data → Laplace Mechanism with Lap(1/ε) → "ε-DP" data
• Quantification approach (protects against attackers with knowledge of the correlations):
  model the attacker's correlation-based inference, then calibrate the noise accordingly:
  sensitive data → Laplace Mechanism with Lap(1/ε') → ε-DP data
• Correlation models: [Cao17] Markov Chain; [Yang15] Gaussian Markov Random Field (GMRF); [Song17] Bayesian Network
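A hedged sketch of the calibration idea: shrink the budget to ε' so that even the correlation-induced shift (here, an effective sensitivity of 2) keeps the worst-case loss at ε. The divide-by-effective-sensitivity rule is my simplification for illustration, not the papers' exact quantification calculus:

```python
import numpy as np

def calibrated_scale(epsilon, effective_sensitivity):
    # Lap(1/eps') with eps' = eps / k is the same as Lap(k / eps).
    return effective_sensitivity / epsilon

def worst_case_loss(shift, scale):
    """Largest log-ratio of two Laplace densities `shift` apart (grid search)."""
    xs = np.linspace(-50, 50, 200001)
    logpdf = lambda x, mu: -np.abs(x - mu) / scale - np.log(2 * scale)
    return np.max(np.abs(logpdf(xs, 0.0) - logpdf(xs, float(shift))))
```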
Outline
• Data Mining with Differential Privacy
• Scenario: Spatiotemporal Data Mining using DP
• Markov Chain for temporal correlations
• Gaussian Markov Random Field for user-user correlations
• Summary and open problems
Spatiotemporal Data Mining with DP
At each timestamp t = 1, 2, 3, …, the curator releases location counts under ε-DP:
(a) Location Data D_1, D_2, D_3, …: each user's location per timestamp (e.g., u1: loc3, loc1, loc1; u2: loc2, loc4, loc5; u3: loc2, loc4, loc5; u4: loc4, loc5, loc3)
→ Count Query → (b) True Counts: the number of users at each of loc1, …, loc5 per timestamp
→ Laplace Noise Lap(1/ε) → (c) Private Counts r_1, r_2, r_3, …
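The per-timestamp release above can be sketched as follows (illustrative; the user and location names follow the slide, the function name is mine):

```python
import numpy as np

LOCATIONS = ["loc1", "loc2", "loc3", "loc4", "loc5"]

def private_counts(snapshot, epsilon, rng):
    """One eps-DP release: count users per location, then add Lap(1/eps) noise.
    `snapshot` maps each user to their location at the current timestamp."""
    true_counts = {loc: 0 for loc in LOCATIONS}
    for loc in snapshot.values():
        true_counts[loc] += 1
    return {loc: c + rng.laplace(scale=1.0 / epsilon)
            for loc, c in true_counts.items()}

# t = 1 from the slide: u1 at loc3, u2 and u3 at loc2, u4 at loc4.
r1 = private_counts({"u1": "loc3", "u2": "loc2", "u3": "loc2", "u4": "loc4"},
                    epsilon=1.0, rng=np.random.default_rng())
```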
What types of data correlations?
• Temporal correlation (single user): movement is constrained by the road network, so consecutive locations are correlated, e.g., u3: loc2 → loc4 → loc5.
• Spatial (user-user) correlation: social ties (colleague, couple) make users' locations correlated, e.g., u1 and u2 are both at loc1 at 8:00 and 9:00.
(a) Road Network  (b) Social Ties  (c) Location Data D_1, D_2, D_3 (7:00, 8:00, 9:00): u1: loc3, loc1, loc1; u2: loc2, loc1, loc1; u3: loc2, loc4, loc5; u4: loc4, loc5, loc3
Outline
• Data Mining with Differential Privacy
• Scenario: Spatiotemporal Data Mining using DP
• Markov Chain for temporal correlations
  - what is a MC
  - how an attacker can learn a MC from data
  - how an attacker can infer private data using a MC
• Gaussian Markov Random Field for user-user correlations
• Summary and open problems
What is a Markov Chain
• A Markov chain is a stochastic process with the Markov property.
• First-order Markov property: the state at time t depends only on the state at time t-1:
  Pr(x_t | x_{t-1}) = Pr(x_t | x_{t-1}, …, x_1)
• Time-homogeneous: the transition matrix is the same at every step:
  ∀t > 0, Pr(x_{t+1} | x_t) = Pr(x_{t+2} | x_{t+1})

Transition Matrix (row: state at t; column: state at t+1):
        loc1  loc2  loc3
  loc1  0.2   0.1   0.7
  loc2  0.1   0.2   0.7
  loc3  0.3   0.4   0.3

Raw Trajectories (7:00, 8:00, 9:00, …):
  u1: loc1, loc3, loc2   u2: loc2, loc2, loc2   u3: loc3, loc1, loc1   u4: loc1, loc2, loc2
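A minimal simulation of the slide's chain (a sketch; `simulate` is my own helper):

```python
import numpy as np

STATES = ["loc1", "loc2", "loc3"]
# Transition matrix from the slide: rows = state at t, columns = state at t+1.
P = np.array([[0.2, 0.1, 0.7],
              [0.1, 0.2, 0.7],
              [0.3, 0.4, 0.3]])

def simulate(start, steps, rng):
    """Sample a trajectory from the time-homogeneous first-order chain."""
    path = [start]
    state = STATES.index(start)
    for _ in range(steps):
        state = rng.choice(len(STATES), p=P[state])  # depends only on the current state
        path.append(STATES[state])
    return path
```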
How can an attacker learn the MC?
• If the attacker knows part of a user's trajectory, he can learn the transition matrix directly by maximum-likelihood estimation.
• If the attacker knows the road network, he may learn the MC using a Google-like model [*].
[*] E. Crisostomi, S. Kirkland, and R. Shorten, "A Google-like model of road network dynamics and its application to regulation and control," International Journal of Control, vol. 84, no. 3, pp. 633–651, Mar. 2011.
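The maximum-likelihood estimate from observed trajectories is just transition counting with row normalization (a sketch; the names are mine):

```python
import numpy as np

def mle_transition_matrix(trajectories, states):
    """Maximum-likelihood estimate of a first-order transition matrix:
    count the observed transitions, then normalize each row."""
    idx = {s: i for i, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            counts[idx[a], idx[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for states that were never left.
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)
```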
How can an attacker infer private data using the MC?  [Model Attacker → Define TPL → Find structure of TPL]
• Model temporal correlations using a Markov chain, e.g., user i: loc1 → loc3 → loc2 → …
(a) Backward transition matrix P_i^B = Pr(l_i^{t-1} | l_i^t) — backward temporal correlation; rows: time t, columns: time t-1:
  loc1: 0.2 0.3 0.5
  loc2: 0.1 0.1 0.8
  loc3: 0.6 0.2 0.2
(b) Forward transition matrix P_i^F = Pr(l_i^t | l_i^{t-1}) — forward temporal correlation; rows: time t-1, columns: time t:
  loc1: 0.1 0.2 0.7
  loc2: 0   0   1
  loc3: 0.3 0.3 0.4
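The slide shows P_i^B and P_i^F as separate tables; they are related by Bayes' rule, which an attacker can apply given a prior over the earlier location (a standard derivation, not stated on the slide):

```python
import numpy as np

def backward_matrix(P_F, prior):
    """Derive the backward matrix Pr(l^{t-1}=j | l^t=i) from the forward
    matrix P_F[j, i] = Pr(l^t=i | l^{t-1}=j) and a prior over l^{t-1}."""
    joint = prior[:, None] * P_F   # joint[j, i] = Pr(l^{t-1}=j, l^t=i)
    marginal = joint.sum(axis=0)   # Pr(l^t = i)
    return (joint / marginal).T    # row i holds Pr(l^{t-1} = . | l^t = i)
```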
How can an attacker infer private data using the MC?  [Model Attacker → Define TPL → Find structure of TPL]
• DP protects against an attacker A_i(D_K) who knows all tuples D_K except the victim's value l_i^t (e.g., at t = 1: u1 at loc3, u2 at loc2, u3 at loc2, u4 at loc4, with the victim's entry unknown).
• But what about an attacker who additionally knows the temporal correlations? Three cases:
  (i)  A_i^T(D_K, P_i^B, ∅): knows the backward correlations only
  (ii) A_i^T(D_K, ∅, P_i^F): knows the forward correlations only
  (iii) A_i^T(D_K, P_i^B, P_i^F): knows both
How can an attacker infer private data using the MC?  [Model Attacker → Define TPL → Find structure of TPL]
• Recall the definition of DP: if the privacy loss PL_0(M) ≤ ε, then M satisfies ε-DP.
• Definition of TPL (Temporal Privacy Loss): the privacy loss of M w.r.t. the correlation-aware attacker A_i^T, i.e., the worst-case log-ratio of the probabilities of all releases r_1, …, r_T under two possible values of the victim's l_i^t.
How can an attacker infer private data using the MC?  [Model Attacker → Define TPL → Find structure of TPL]
• Definition of TPL: the supremum over value pairs l_i^t, l_i^{t}′ of
  Eqn(2) = log[ Pr(r_1 | l_i^t, D_K) / Pr(r_1 | l_i^{t}′, D_K) ] + … + log[ Pr(r_t | l_i^t, D_K) / Pr(r_t | l_i^{t}′, D_K) ] + … + log[ Pr(r_T | l_i^t, D_K) / Pr(r_T | l_i^{t}′, D_K) ]
• If there is no temporal correlation, every term except the one at time t is 0, so TPL = PL_0.
How can an attacker infer private data using the MC?  [Model Attacker → Define TPL → Find structure of TPL]
• If there IS temporal correlation, the terms at the other timestamps are no longer 0, and they are hard to quantify directly:
  Eqn(2) = log[ Pr(r_1 | l_i^t, D_K) / Pr(r_1 | l_i^{t}′, D_K) ] + … + log[ Pr(r_t | l_i^t, D_K) / Pr(r_t | l_i^{t}′, D_K) ] + … + log[ Pr(r_T | l_i^t, D_K) / Pr(r_T | l_i^{t}′, D_K) ]
         =        ?                      + … +        PL_0                     + … +        ?
  ⇒ TPL = ?
How can an attacker infer private data using the MC?  [Model Attacker → Define TPL → Find structure of TPL]
• Decompose TPL by attacker type over the releases r_1, …, r_{t-1}, r_t, r_{t+1}, …, r_T:
  (i)  A_i^T(D_K, P_i^B, ∅) ⇒ Backward Privacy Loss (BPL): leakage through r_1, …, r_t
  (ii) A_i^T(D_K, ∅, P_i^F) ⇒ Forward Privacy Loss (FPL): leakage through r_t, …, r_T
  (iii) A_i^T(D_K, P_i^B, P_i^F): combines both
How can an attacker infer private data using the MC?  [Model Attacker → Define TPL → Find structure of TPL]
• Analyze BPL: substituting the backward temporal correlations P_i^B into the loss yields the backward privacy loss function (Eqn(6)) ⇒ how to calculate it?
How can an attacker infer private data using the MC?  [Model Attacker → Define TPL → Find structure of TPL]
• Analyze FPL: substituting the forward temporal correlations P_i^F into the loss yields the forward privacy loss function ⇒ how to calculate it?
Calculating BPL & FPL  [Privacy Quantification → Upper bound]
• We convert the calculation of BPL/FPL into finding the optimal solution of a linear-fractional programming problem.
• This problem can be solved by the simplex algorithm, but in O(2^n) time in the worst case.
• We designed an O(n^2) algorithm for quantifying BPL/FPL.
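This is not the paper's O(n^2) algorithm, but a small illustration of the structural fact such algorithms exploit: a linear-fractional objective attains its optimum at a vertex of the feasible polytope, and the vertices of the probability simplex are the unit vectors, so maximization reduces to scanning coordinate ratios (assumes a strictly positive denominator):

```python
import numpy as np

def linfrac_max_on_simplex(a, b):
    """Maximize (a . x) / (b . x) over the probability simplex {x >= 0, sum x = 1}.
    Requires b > 0. A linear-fractional objective peaks at a vertex of the
    feasible region; the simplex's vertices are the unit vectors, so the
    optimum equals the best coordinate ratio a_i / b_i -- an O(n) scan."""
    return np.max(np.asarray(a) / np.asarray(b))
```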
Calculating BPL & FPL  [Privacy Quantification → Upper bound]
(Figure: BPL over time t = 1, …, 10 under (i) strong, (ii) moderate, and (iii) no temporal correlation; the accumulated privacy loss grows from 0.10 toward 0.50, and grows faster the stronger the correlation.)
Calculating BPL & FPL  [Privacy Quantification → Upper bound]
(Figure: BPL over time t = 1, …, 100 for four settings of the backward correlation:
  case 1: q = 0.8, d = 0.1, ε = 0.23, P_i^B = (0.8 0.2; 0.1 0.9)
  case 2: q = 0.8, d = 0, ε = 0.15, P_i^B = (0.8 0.2; 0 1)
  case 3: q = 0.8, d = 0, ε = 0.23, P_i^B = (0.8 0.2; 0 1)
  case 4: q = 1, d = 0, ε = 0.23, P_i^B = (1 0; 0 1)
Refer to Theorem 5 in our paper.)
Outline
• Data Mining with Differential Privacy
• Scenario: Spatiotemporal Data Mining using DP
• Markov Chain for temporal correlations
• Gaussian Markov Random Field for user-user correlations
  - what is a GMRF
  - how an attacker can learn a GMRF from data
  - how an attacker can infer private data using a GMRF
• Summary and open problems