New Directions in Privacy-Preserving Machine Learning


  1. New Directions in Privacy-Preserving Machine Learning. Kamalika Chaudhuri, University of California, San Diego

  2. Sensitive Data: Medical Records, Genetic Data, Search Logs

  3. AOL Violates Privacy

  4. AOL Violates Privacy

  5. Netflix Violates Privacy [NS08]: in the movies-by-users ratings matrix, 2-8 movie ratings and dates for Alice reveal whether Alice is in the dataset or not, and Alice's other movie ratings

  6. High-dimensional Data is Unique. Example: UCSD Employee Salary Table with columns (Position, Department, Gender, Ethnicity, Salary); the row (Faculty, CSE, Female, SE Asian, -) matches only one employee (Kamalika)!

  7. Simply anonymizing data is unsafe!

  8. Disease Association Studies [WLWTZ09]: given correlation statistics (R² values) for the Cancer group and the Healthy group, Alice's DNA reveals whether Alice is in the Cancer set or the Healthy set

  9. Simply anonymizing data is unsafe! Statistics on small data sets are unsafe! (There is a trade-off among privacy, accuracy, and data size.)

  10. Correlated Data: user information in social networks, physical activity monitoring

  11. Why is Privacy Hard for Correlated Data? A neighbor's information leaks information about the user

  12. Talk Agenda: How do we learn from sensitive data while still preserving privacy? New Directions: 1. Privacy-preserving Bayesian learning 2. Privacy-preserving statistics on correlated data

  13. Talk Agenda: 1. Privacy for Uncorrelated Data - How to define privacy

  14. Differential Privacy [DMNS06]: running the same randomized algorithm on two datasets that differ in one person's record yields "similar" output distributions; the participation of a single person does not change the output

  15. Differential Privacy: Attacker's View. Combining the attacker's prior knowledge with the algorithm's output leads to essentially the same conclusion about the data whether or not any single person's record is included

  16. Differential Privacy [DMNS06]: If A is an ε-private randomized algorithm, then for all D1, D2 that differ in one person's value and for any set S, Pr[A(D1) ∈ S] ≤ e^ε Pr[A(D2) ∈ S]

  17. Differential Privacy 1. Provably strong notion of privacy 2. Good approximations for many functions, e.g., means, histograms, etc.
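
As a concrete illustration of slide 17's "good approximations for means" (not from the talk): a minimal sketch of the standard Laplace mechanism for releasing an ε-differentially private mean of values assumed to lie in a known range. The function name and parameters are my own.

```python
import numpy as np

def private_mean(data, epsilon, lower=0.0, upper=1.0):
    """Release an epsilon-DP estimate of the mean, assuming every value lies in [lower, upper]."""
    data = np.clip(np.asarray(data, dtype=float), lower, upper)
    n = len(data)
    # Changing one person's value moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return data.mean() + noise

# Example: a private mean of 1,000 values in [0, 1] with a modest privacy budget.
x = np.random.rand(1000)
print(private_mean(x, epsilon=0.5))
```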

  18. Interpretation: Attacker's Hypothesis Test [WZ10, OV13]. H0: the input to the algorithm is a dataset D; H1: the input is the neighboring dataset D′ that differs from D in one person's record. Failure events: False Alarm (FA), Missed Detection (MD)

  19. Interpretation: Attacker's Hypothesis Test [WZ10, OV13]. If the algorithm is ε-DP, then Pr(FA) + e^ε Pr(MD) ≥ 1 and e^ε Pr(FA) + Pr(MD) ≥ 1; in particular, Pr(FA) and Pr(MD) cannot both fall below 1/(1 + e^ε). (FA = False Alarm, MD = Missed Detection.)
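
As a quick numerical illustration of these bounds (not part of the slides): the smallest missed-detection probability an attacker can achieve for a given false-alarm probability under ε-DP, computed directly from the two inequalities above.

```python
import numpy as np

def min_missed_detection(p_fa, epsilon):
    """Lower bound on Pr(MD) implied by epsilon-DP:
    Pr(FA) + e^eps * Pr(MD) >= 1  and  e^eps * Pr(FA) + Pr(MD) >= 1."""
    e = np.exp(epsilon)
    return max((1.0 - p_fa) / e, 1.0 - e * p_fa, 0.0)

# At epsilon = 1, even a 10% false-alarm rate forces roughly a 73% miss rate.
print(min_missed_detection(0.1, epsilon=1.0))   # ~0.728
```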

  20. Talk Agenda: 1. Privacy for Uncorrelated Data - How to define privacy - Privacy-preserving Learning

  21. Example 1: Flu Test. Predicts flu or not based on patient symptoms; trained on sensitive patient data

  22. Example 2: Clustering Abortion Data Given data on abortion locations, cluster by location while preserving privacy of individuals

  23. Bayesian Learning

  24. Bayesian Learning: Data X = {x1, x2, …} related through a model class Θ with likelihood p(x | θ)

  25. Bayesian Learning: Data X = {x1, x2, …} related through a model class Θ with likelihood p(x | θ), plus a prior π(θ)

  26. Bayesian Learning: Data X = {x1, x2, …} related through a model class Θ with likelihood p(x | θ); prior π(θ) + data X

  27. Bayesian Learning: Data X = {x1, x2, …} related through a model class Θ with likelihood p(x | θ); prior π(θ) + data X = posterior p(θ | X)

  28. Bayesian Learning: Data X = {x1, x2, …} related through a model class Θ with likelihood p(x | θ); prior π(θ) + data X = posterior p(θ | X). Goal: output the posterior (approximately, or as samples)

  29. Example: Coin tosses. X = {H, T, H, H, …}, Θ = [0, 1], likelihood: p(x | θ) = θ^x (1 − θ)^(1 − x)

  30. Example: Coin tosses. X = {H, T, H, H, …}, Θ = [0, 1], likelihood: p(x | θ) = θ^x (1 − θ)^(1 − x), plus prior π(θ) = 1

  31. Example: Coin tosses. X = {H, T, H, H, …}, Θ = [0, 1], likelihood: p(x | θ) = θ^x (1 − θ)^(1 − x); prior π(θ) = 1 + data X (h heads, t tails)

  32. Example: Coin tosses. X = {H, T, H, H, …}, Θ = [0, 1], likelihood: p(x | θ) = θ^x (1 − θ)^(1 − x); prior π(θ) = 1 + data X (h heads, t tails) = posterior p(θ | X) ∝ θ^h (1 − θ)^t

  33. Example: Coin tosses. X = {H, T, H, H, …}, Θ = [0, 1], likelihood: p(x | θ) = θ^x (1 − θ)^(1 − x); prior π(θ) = 1 + data X (h heads, t tails) = posterior p(θ | X) ∝ θ^h (1 − θ)^t. In general, θ is more complex (classifiers, etc.)
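
A minimal non-private sketch of slides 29-33: with the uniform prior above, the posterior after h heads and t tails is Beta(h + 1, t + 1). The variable names are illustrative.

```python
import numpy as np

# Observed coin tosses (1 = heads, 0 = tails).
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])
h = int(x.sum())
t = len(x) - h

# With the uniform prior pi(theta) = 1 on [0, 1], the posterior
# p(theta | X) ∝ theta^h (1 - theta)^t is Beta(h + 1, t + 1).
posterior_samples = np.random.beta(h + 1, t + 1, size=10_000)
print(posterior_samples.mean())   # ≈ (h + 1) / (h + t + 2)
```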

  34. Private Bayesian Learning: Data X = {x1, x2, …} related through a model class Θ with likelihood p(x | θ); prior π(θ) + data X = posterior p(θ | X)

  35. Private Bayesian Learning: Data X = {x1, x2, …} related through a model class Θ with likelihood p(x | θ); prior π(θ) + data X = posterior p(θ | X). Goal: output a private approximation to the posterior

  36. How to make the posterior private? Option 1: Direct posterior sampling [Detal14]. Not private except under restrictive conditions: the posteriors p(θ | D) and p(θ | D′) on neighboring datasets can differ too much

  37. How to make the posterior private? Option 2: Sample from a truncated posterior at high temperature [WFS15]. Disadvantages: intractable (technically, privacy holds only at convergence); needs more data/subjects

  38. Our Work: Exponential Families. Exponential family distributions: p(x | θ) = h(x) exp(θᵀ T(x) − A(θ)), where T is a sufficient statistic. Includes many common distributions such as Gaussians, Binomials, Dirichlets, Betas, etc.
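
As a worked instance of this form (my own illustration, not a slide): the coin-toss likelihood from slides 29-33 written in exponential-family notation.

```latex
p(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}
                 = \exp\big( x\,\eta - \log(1 + e^{\eta}) \big),
\qquad \eta = \log\tfrac{\theta}{1-\theta},
```

so h(x) = 1, the sufficient statistic is T(x) = x, and A(η) = log(1 + e^η). The sum Σᵢ T(xᵢ) is simply the number of heads, which is exactly the quantity that gets perturbed in the private sampling recipe on slide 40.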

  39. Properties of Exponential Families. Exponential families have conjugate priors: prior π(θ) + data X = posterior p(θ | X), where p(θ | X) is in the same distribution class as π(θ), e.g., Gaussian-Gaussian, Beta-Binomial, etc.

  40. Sampling from Exponential Families. Given data x1, x2, …, the (non-private) posterior is itself in an exponential family: p(θ | x) ∝ exp(η(θ)ᵀ Σᵢ T(xᵢ) − B(θ)). Private sampling: 1. If T is bounded, add noise to Σᵢ T(xᵢ) to get a private version T′. 2. Sample from the perturbed posterior p(θ | x) ∝ exp(η(θ)ᵀ T′ − B(θ))
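
A minimal sketch of this two-step recipe for the Beta-Binomial case, my own illustration rather than the talk's code: it assumes the dataset size n is public, uses Laplace noise for the perturbation, and clamps the noisy statistic so the Beta parameters stay valid.

```python
import numpy as np

def private_beta_posterior_sample(x, epsilon, prior_a=1.0, prior_b=1.0, size=1):
    """Sample from an epsilon-DP perturbed Beta posterior for binary data x in {0, 1}.

    Step 1: the sufficient statistic sum(x) has sensitivity 1, so perturb it
            with Laplace(1/epsilon) noise.
    Step 2: sample from the posterior formed with the noisy statistic.
    """
    x = np.asarray(x)
    n = len(x)
    noisy_heads = float(np.sum(x)) + np.random.laplace(scale=1.0 / epsilon)
    noisy_heads = min(max(noisy_heads, 0.0), float(n))   # keep Beta parameters valid
    return np.random.beta(prior_a + noisy_heads, prior_b + n - noisy_heads, size=size)

tosses = np.random.binomial(1, 0.7, size=500)
print(private_beta_posterior_sample(tosses, epsilon=0.5, size=5))
```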

  41. Performance • Theoretical Guarantees • Experiments

  42. Theoretical Guarantees. Performance measure: asymptotic relative efficiency (lower = more sample-efficient for large n). Non-private: 2; our method: 2; [WFS15]: max(2, 1 + 1/ε)

  43. Experiments - Task Task: Time series clustering of events in Wikileaks war logs while preserving event-level privacy Data: War-log entries - Afghanistan (75K), Iraq (390K) Goal: Cluster entries in each region based on features (casualty counts, enemy/friendly fire, explosive hazards, etc…)

  44. Experiments - Model. A Hidden Markov Model for each region, with hidden states h_t and observed features x_t. Discrete states (h_t) and observations (x_t); transition parameters T, with T_ij = P(h_{t+1} = i | h_t = j); emission parameters O, with O_ij = P(x_t = i | h_t = j). Goal: sample from the posterior P(O | data) (in the exponential family)
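
To connect this model to the recipe on slide 40, here is a heavily simplified sketch of my own, not the talk's code: it assumes the hidden state sequence is given, so the emission counts are the sufficient statistics, and it perturbs them with Laplace noise before sampling each emission column from its Dirichlet posterior (adding or removing one event changes a single count by 1).

```python
import numpy as np

def private_emission_posterior(states, observations, n_states, n_obs, epsilon, alpha=1.0):
    """Privately sample emission parameters O[:, j] = P(x_t = . | h_t = j) for each state j."""
    counts = np.zeros((n_obs, n_states))
    for h, x in zip(states, observations):
        counts[x, h] += 1.0
    # Perturb the count statistics, then clamp so Dirichlet parameters stay valid.
    noisy = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0.0, None)
    O = np.empty_like(counts)
    for j in range(n_states):
        O[:, j] = np.random.dirichlet(alpha + noisy[:, j])
    return O

# Toy data: 2 hidden states, 5 observation symbols.
states = np.random.randint(0, 2, size=1000)
observations = np.random.randint(0, 5, size=1000)
print(private_emission_posterior(states, observations, n_states=2, n_obs=5, epsilon=1.0))
```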

  45. Experiments - Results. [Plots] Test-set log-likelihood vs. total epsilon (10^-1 to 10^1) for Afghanistan and Iraq, comparing the non-private HMM, non-private naive Bayes, the Laplace mechanism HMM, and the OPS HMM (truncation multiplier = 100).

  46. Experiments - States. [Bar charts] Inferred profiles for Iraq State 1 and State 2 over event categories (criminal event, enemy action, explosive hazard, friendly action, friendly fire, non-combat event, other, suspicious incident, threat report), event types (cache found/cleared, IED explosion, IED found/cleared, direct fire, indirect fire, detain, escalation of force, small arms threat, search and attack, raid, murder, counter mortar patrol), and casualty types (friendly and host casualties, civilian casualties, enemy casualties).

  47. Experiments - Clustering. [Chart] Inferred state (State 1 vs. State 2) over time, Jan 2004 to Jan 2008, for each region code (MND-BAGHDAD, MND-C, MND-N, MND-SE, MNF-W), with markers for the surge announcement and peak troop levels.

  48. Conclusion: A new method for private posterior sampling from exponential families. Open problems: 1. Private sampling from more complex posteriors 2. Private versions of other Bayesian posterior approximation schemes (variational Bayes, etc.) 3. Combining Bayesian inference with more relaxed forms of DP (e.g., concentrated DP, distributional DP, etc.)

  49. Talk Agenda: 1. Privacy for Uncorrelated Data - How to define privacy - Privacy-preserving Bayesian Learning 2. Privacy for Correlated Data
