

  1. Challenges in Privacy-Preserving Analysis of Structured Data Kamalika Chaudhuri Computer Science and Engineering University of California, San Diego

  2. Sensitive Structured Data: Medical Records, Search Logs, Social Networks

  3. This Talk: Two Case Studies 1. Privacy-preserving HIV Epidemiology 2. Privacy in Time-series data

  4. HIV Epidemiology Goal: Understand how HIV spreads among people

  5. HIV Transmission Data: a transmission between patients A and B is plausible when distance(Seq-A, Seq-B) < t, where Seq-A and Seq-B are the viral sequences of A and B.

  6. From Sequences to Transmission Graphs: Node = Patient (with viral sequence), Edge = Plausible transmission.
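
As a hedged illustration of slides 5-6 (not the authors' actual pipeline), the sketch below builds a transmission graph from aligned viral sequences; the mismatch-fraction distance, the threshold t, and the plain edge-list representation are assumptions made only for this example.

    # Illustrative sketch only: real HIV epidemiology pipelines use
    # evolutionary distance measures, not raw mismatch fractions.
    def sequence_distance(seq_a, seq_b):
        """Fraction of positions where two aligned sequences differ."""
        assert len(seq_a) == len(seq_b)
        return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)

    def build_transmission_graph(sequences, t):
        """Node = patient, edge = plausible transmission (distance < t)."""
        patients = list(sequences)
        edges = []
        for i, a in enumerate(patients):
            for b in patients[i + 1:]:
                if sequence_distance(sequences[a], sequences[b]) < t:
                    edges.append((a, b))
        return patients, edges

    # Toy usage: only A and B are close enough to form an edge.
    seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTTTTTTT"}
    print(build_transmission_graph(seqs, t=0.2))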

  7.-10. …Growing over Time: 2015 → 2016 → 2017. Node = Patient, Edge = Transmission. Goal: Release properties of G with privacy across time.

  11. Problem: Continual Graph Statistics Release. Given: a growing graph G; at time t, new nodes and their adjacent edges (∂V_t, ∂E_t) arrive. Goal: at each time t, release f(G_t), where f is a graph statistic and $G_t = (\cup_{s \le t} \partial V_s, \cup_{s \le t} \partial E_s)$, while preserving patient privacy and high accuracy.

  12. What kind of Privacy? Node = Patient, Edge = Transmission. Hide: Patient A is in the graph. Release: large-scale properties.

  13. What kind of Privacy? Node = Patient, Edge = Transmission. Hide: a particular patient has HIV. Release: statistical properties (degree distribution, clusters, does therapy help, etc.). Privacy notion: Node Differential Privacy.

  14. Talk Outline • The Problem: Private HIV Epidemiology • Privacy Definition: Differential Privacy

  15. Differential Privacy [DMNS06]: running the randomized algorithm on the data with a given person vs. without that person yields "similar" output distributions; participation of a single person does not change the output.

  16. Differential Privacy: Attacker’s View. Prior knowledge + the algorithm’s output on data with Alice leads to (roughly) the same conclusion as prior knowledge + the algorithm’s output on data without Alice. Note: a. the algorithm could still draw personal conclusions about Alice; b. Alice has the agency to participate or not.

  17. Differential Privacy [DMNS06]. For all D, D' that differ in one person's value, if A is an $\epsilon$-differentially private randomized algorithm, then $\sup_t \left| \log \frac{p(A(D) = t)}{p(A(D') = t)} \right| \le \epsilon$.

  18. Differential Privacy: 1. Provably strong notion of privacy. 2. Good approximations for many functions, e.g., means, histograms, etc.
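
Slide 18 notes that means and histograms admit good private approximations. Below is a minimal sketch of the standard Laplace mechanism for a histogram; the function name and the add/remove-one-person neighboring convention (which gives L1 sensitivity 1) are assumptions of this example, not taken from the talk.

    import numpy as np

    def private_histogram(values, bins, epsilon, rng=None):
        """epsilon-DP histogram via the Laplace mechanism.

        Assumes neighboring datasets differ by adding/removing one person,
        so each bin count changes by at most 1; noise scale is 1/epsilon.
        """
        rng = rng or np.random.default_rng()
        counts, _ = np.histogram(values, bins=bins)
        return counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))

    # Toy usage: ages of 100 synthetic people, epsilon = 0.5.
    ages = np.random.default_rng(0).integers(20, 80, size=100)
    print(private_histogram(ages, bins=[20, 40, 60, 80], epsilon=0.5))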

  19. Node Differential Privacy Node = Patient Edge = Transmission

  20. Node Differential Privacy: Node = Patient, Edge = Transmission. One person's value = one node + its adjacent edges.
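
Node differential privacy (slide 20) compares outputs on graphs that differ in one node together with its adjacent edges. A minimal sketch of that neighboring relation, with the graph stored as a dict of neighbor sets (a representation chosen only for this example):

    def remove_node(graph, v):
        """Node-neighboring graph: drop v and every edge adjacent to v.

        graph: dict mapping node -> set of neighbors (undirected).
        """
        return {u: nbrs - {v} for u, nbrs in graph.items() if u != v}

    # Toy usage: removing the hub of a star deletes all of its edges,
    # which is exactly why node DP can require a lot of noise (slide 24).
    star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
    print(remove_node(star, "hub"))  # {'a': set(), 'b': set(), 'c': set()}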

  21. Talk Outline • The Problem: Private HIV Epidemiology • Privacy Definition: Node Differential Privacy • Challenges

  22. Problem: Continual Graph Statistics Release. Given: a growing graph G; at time t, new nodes and their adjacent edges (∂V_t, ∂E_t) arrive. Goal: at each time t, release f(G_t), where f is a graph statistic and $G_t = (\cup_{s \le t} \partial V_s, \cup_{s \le t} \partial E_s)$, with node differential privacy and high accuracy.

  23. Why is Continual Release of Graphs with Node Differential Privacy hard? 1. Node DP is already challenging in static graphs [KNRS13, BBDS13]. 2. Continual release of graph data brings extra challenges.

  24. Challenge 1: Node DP. Removing one node can change properties by a lot, even for static graphs: in the example, #edges drops from 6 to 0 when a single node is removed, and the change can be as large as |V|. Hiding one node therefore needs high noise, which means low accuracy.

  25. Prior Work: Node DP in Static Graphs. Approach 1 [BCS15]: assume bounded max degree. Approach 2 [KNRS13, RS15]: project to a low-degree graph G' and use node DP on G'; the projection algorithm needs to be "smooth" and computationally efficient.
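
For intuition about Approach 2, here is the most naive possible degree-capping projection, written only to make the "smoothness" issue concrete; it is not the projection of [KNRS13, RS15].

    def truncate_to_degree(graph, D):
        """Naive projection: keep only nodes whose degree is at most D.

        graph: dict mapping node -> set of neighbors (undirected).
        Note: this truncation is NOT smooth -- adding a single node can
        change which other nodes survive -- which is why the cited works
        design more careful, smooth projection algorithms.
        """
        keep = {u for u, nbrs in graph.items() if len(nbrs) <= D}
        return {u: graph[u] & keep for u in keep}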

  26. Challenge 2: Continual Release of Graphs - Methods for tabular data [DNPR10, CSS10] do not apply - Sequential composition gives poor utility - Graph projection methods are not “smooth” over time

  27. Talk Outline • The Problem: Private HIV Epidemiology • Privacy Definition: Node Differential Privacy • Challenges • Approach

  28. Algorithm: Main Ideas Strategy 1: Assume bounded max degree of G (from domain) Strategy 2: Privately release “difference sequence” of statistic (instead of the direct statistic)

  29. Difference Sequence. Graph sequence: G_1, G_2, G_3. Statistic sequence: f(G_1), f(G_2), f(G_3). Difference sequence: f(G_1), f(G_2) - f(G_1), f(G_3) - f(G_2).

  30. Key Observation: for many graph statistics, when G is degree bounded, the difference sequence has low sensitivity. Example Theorem: if max degree(G) = D, then the sensitivity of the difference sequence for #high-degree nodes is at most 2D + 1.

  31. From Observation to Algorithm. Algorithm: 1. Add noise to each item of the difference sequence (to hide the effect of a single node) and publish. 2. Reconstruct the private statistic sequence from the private difference sequence.
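
A minimal sketch of the two steps on slide 31, assuming the entire difference sequence has a known sensitivity bound (e.g., 2D + 1 for #high-degree nodes when the max degree is D, per slide 30). The function and variable names are illustrative, not the paper's code.

    import numpy as np

    def release_via_difference_sequence(stats, sensitivity, epsilon, rng=None):
        """Privately release f(G_1), ..., f(G_T) via the difference sequence.

        stats       : true statistic values f(G_1), ..., f(G_T)
        sensitivity : bound on how much one node can change the whole
                      difference sequence (e.g., 2D + 1 from slide 30)
        epsilon     : privacy budget for the entire release
        """
        rng = rng or np.random.default_rng()
        stats = np.asarray(stats, dtype=float)

        # Step 1: difference sequence f(G_1), f(G_2)-f(G_1), ... plus noise.
        diffs = np.diff(stats, prepend=0.0)
        noisy_diffs = diffs + rng.laplace(scale=sensitivity / epsilon,
                                          size=len(diffs))

        # Step 2: reconstruct the private statistic sequence by prefix sums.
        return np.cumsum(noisy_diffs)

Because the privacy cost is paid once on the low-sensitivity difference sequence rather than once per release, this avoids the utility loss of naive sequential composition, which is the comparison made against the DP Composition baselines on slides 33-34.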

  32. How does this work?

  33. Experiments - Privacy vs. Utility. [Plots of #edges and #high-degree nodes; methods compared: Our Algorithm, DP Composition 1, DP Composition 2]

  34. Experiments - #Releases vs. Utility. [Plots of #edges and #high-degree nodes; methods compared: Our Algorithm, DP Composition 1, DP Composition 2]

  35. Talk Agenda: Privacy is application-dependent! Two applications: 1. HIV Epidemiology. 2. Privacy of time-series data (activity monitoring, power consumption, etc.).

  36. Time Series Data: Physical Activity Monitoring, Location Traces

  37. Example: Activity Monitoring. Data: activity trace of a subject. Hide: the activity at each time, against an adversary with prior knowledge. Release: (approximate) aggregate activity.

  38. Why is Differential Privacy not Right for Correlated data?

  39. Example: Activity Monitoring. D = (x_1, ..., x_T), x_t = activity at time t (correlation network; data from a single subject). 1-DP: output the histogram of activities + noise with stdev T. Too much noise - no utility!

  40. Example: Activity Monitoring. D = (x_1, ..., x_T), x_t = activity at time t (correlation network). 1-entry-DP: output the activity histogram + noise with stdev 1. Not enough noise - activities across time are correlated!

  41. Example: Activity Monitoring. D = (x_1, ..., x_T), x_t = activity at time t (correlation network). 1-entry-group DP: output the activity histogram + noise with stdev T. Too much noise - no utility!
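
To make slides 39-41 concrete, here is a small numeric sketch (my own illustration, on synthetic data) comparing the two naive noise scales for a single subject's length-T activity trace:

    import numpy as np

    rng = np.random.default_rng(0)
    T, epsilon, num_activities = 10_000, 1.0, 5

    # One subject's activity trace: T correlated entries (synthetic here).
    trace = rng.integers(0, num_activities, size=T)
    hist = np.bincount(trace, minlength=num_activities)

    # Slide 39: treat the whole trace as one record -> noise scale ~ T.
    # The noise swamps the counts: no utility.
    noisy_record = hist + rng.laplace(scale=T / epsilon, size=num_activities)

    # Slide 40: protect each entry separately -> noise scale ~ 1.
    # The counts survive, but correlated entries leak about each other,
    # so the activity at a given time is not actually protected.
    noisy_entry = hist + rng.laplace(scale=1.0 / epsilon, size=num_activities)

    print(hist, noisy_record.round(), noisy_entry.round(), sep="\n")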

  42. How to define privacy for Correlated Data?

  43. Pufferfish Privacy [KM12]: Secret Set S. S: information to be protected, e.g., Alice's age is 25, Bob has a disease.

  44. Pufferfish Privacy [KM12]: Secret Set S, Secret Pairs Set Q. Q: pairs of secrets we want to be indistinguishable, e.g., (Alice's age is 25, Alice's age is 40), (Bob is in the dataset, Bob is not in the dataset).

  45. Pufferfish Privacy [KM12]: Secret Set S, Secret Pairs Set Q, Distribution Class Θ. Θ: a set of distributions that plausibly generate the data, e.g., (connection graph G, disease transmits w.p. in [0.1, 0.5]), (Markov chain with transition matrix in a set P). May be used to model correlation in the data.

  46. Pufferfish Privacy [KM12]. An algorithm A is $\epsilon$-Pufferfish private with parameters (S, Q, Θ) if for all (s_i, s_j) in Q, for all $\theta \in \Theta$ with $X \sim \theta$, and all t, $p_{\theta,A}(A(X) = t \mid s_i, \theta) \le e^{\epsilon} \cdot p_{\theta,A}(A(X) = t \mid s_j, \theta)$, whenever $P(s_i \mid \theta), P(s_j \mid \theta) > 0$.

  47. Pufferfish Interpretation of DP. Theorem: Pufferfish = Differential Privacy when S = { s_{i,a} := person i has value a, for all i, all a in domain X }, Q = { (s_{i,a}, s_{i,b}), for all i and all (a, b) pairs in X × X }, Θ = { distributions where each person i is independent }.

  48. Pufferfish Interpretation of DP. Theorem: Pufferfish = Differential Privacy when S = { s_{i,a} := person i has value a, for all i, all a in domain X }, Q = { (s_{i,a}, s_{i,b}), for all i and all (a, b) pairs in X × X }, Θ = { distributions where each person i is independent }. Theorem: no utility is possible when Θ = { all possible distributions }.

  49. How to get Pufferfish privacy? Special-case mechanisms [KM12, HMD12]. Is there a more general Pufferfish mechanism for a large class of correlated data? Our work: yes, the Markov Quilt Mechanism (also concurrent work [GK16]).

  50. Correlation Measure: Bayesian Networks. Node = variable; structure = directed acyclic graph. Joint distribution of variables: $\Pr(X_1, X_2, \ldots, X_n) = \prod_i \Pr(X_i \mid \mathrm{parents}(X_i))$.
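
As a tiny sketch of the factorization on slide 50, applied to the binary chain of the next slides (the uniform start probability for X_1 is an assumption; the slides leave it open):

    def chain_joint_probability(x, p, p_x1=0.5):
        """Pr(x_1, ..., x_n) for a binary chain, using the Bayesian-network
        factorization Pr(X) = prod_i Pr(X_i | parents(X_i)).

        x: tuple of 0/1 values; p: probability of staying in the same state;
        p_x1: assumed probability of the first value (not given in the talk).
        """
        prob = p_x1
        for prev, cur in zip(x, x[1:]):
            prob *= p if cur == prev else 1 - p
        return prob

    print(chain_joint_probability((0, 0, 1, 1), p=0.8))  # 0.5 * 0.8 * 0.2 * 0.8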

  51. A Simple Example: a chain X_1 → X_2 → X_3 → … → X_n. Model: X_i in {0, 1}; state transition probabilities: stay in the same state with probability p, switch with probability 1 - p.

  52. A Simple Example: chain X_1 → X_2 → X_3 → … → X_n, X_i in {0, 1}. State transition probabilities: Pr(X_2 = 0 | X_1 = 0) = p, Pr(X_2 = 0 | X_1 = 1) = 1 - p, and so on along the chain.

  53. A Simple Example (continued): $\Pr(X_i = 0 \mid X_1 = 0) = \frac{1}{2} + \frac{(2p-1)^{i-1}}{2}$ and $\Pr(X_i = 0 \mid X_1 = 1) = \frac{1}{2} - \frac{(2p-1)^{i-1}}{2}$. The influence of X_1 diminishes with distance.
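
The diminishing influence on slide 53 can be checked numerically by powering the transition matrix; this sketch is mine, written to verify the reconstructed formula rather than taken from the talk.

    import numpy as np

    def prob_zero_given_start(p, i, start):
        """Pr(X_i = 0 | X_1 = start) for the binary chain with stay-probability p."""
        P = np.array([[p, 1 - p],      # transitions out of state 0
                      [1 - p, p]])     # transitions out of state 1
        return np.linalg.matrix_power(P, i - 1)[start, 0]

    # The two conditionals equal 1/2 + (2p-1)^(i-1)/2 and 1/2 - (2p-1)^(i-1)/2,
    # and they converge to each other as i grows: the influence of X_1 dies out.
    for i in (2, 5, 20):
        print(i, prob_zero_given_start(0.8, i, 0), prob_zero_given_start(0.8, i, 1))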

  54. Algorithm: Main Idea. Chain X_1 → X_2 → X_3 → … → X_n. Goal: protect X_1.
