DIFFERENTIAL PRIVACY and some of its relatives
BETTER DEFINITIONS or: Why should I answer your survey?
THE PROBLEM • Statistical databases provide some utility... • ...often at odds with privacy concerns. • Utility for whom? - Government, researchers, health authorities (Dwork: Tier I) - Commercial: movie recommendations, ad targeting (Tier II) • How do we protect privacy but maintain utility?
ONLINE METHODS • Bart showed us offline methods, which prepare the data before release, and some of their problems. Instead, consider the online case: • A data curator hosts the data and controls access to it. • We assume the curator is trusted and that the raw data itself is kept confidential. • If the data contains private information, how does the curator keep it private?
BLEND IN THE CROWD • Suggestion: Only answer questions about large sets of rows. • Fails: ?> How many Swedish PhD students like Barbie? => 834 ?> How many Swedish PhD students who are not Icelandic or do not study Computer Security like Barbie? => 833
Willard likes Barbie! (because, obviously, I do not!)
RANDOMIZED RESPONSE • Suggestion: Fuzz the answers so that individual differences are lost in the noise. • Not easy, but on the right track! • E.g. averaging many repetitions of a query cancels the noise, and detecting semantically equivalent queries is undecidable in general.
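A minimal sketch of the classic coin-flip randomized response protocol (Warner's survey technique, which the slide title alludes to); the 0.5 coin biases and the estimator are illustrative choices, not anything prescribed by these slides:

import random

def randomized_response(truth: bool) -> bool:
    # First coin: heads -> answer truthfully; tails -> answer with a second coin flip.
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

# Each respondent has plausible deniability, yet the analyst can still estimate
# the true population rate p, since E[yes-rate] = 0.5*p + 0.25:
answers = [randomized_response(random.random() < 0.3) for _ in range(100_000)]
p_hat = 2 * (sum(answers) / len(answers)) - 0.5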
A LITTLE REMINDER • Linkage attacks are particularly tricky to prevent: the AOL debacle, the Netflix Prize and Sweeney’s use of voting records. • Dalenius’ desideratum & Dwork’s “impossibility proof”: Known fact: Height(Peter) = average Swedish height + 10
WHAT IS PRIVACY ANYWAYS • Dalenius: Any private information that can be deduced with access to a database can just as well be deduced without any access to it at all. • A great notion of privacy preservation, but impossible (unless we sacrifice utility). • Sweeney’s k-anonymity: Each combination of quasi-identifier (QID) values appears at least k times. • Easy to check or enforce, but what does it actually mean? Ad hoc, and hurts utility (e.g. correlations).
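As a rough illustration of how mechanical the k-anonymity check is (a sketch; the table, column names and helper are made up for this example):

from collections import Counter

def is_k_anonymous(rows, qid_columns, k):
    # Every combination of quasi-identifier values must occur at least k times.
    counts = Counter(tuple(row[c] for c in qid_columns) for row in rows)
    return all(c >= k for c in counts.values())

rows = [{"zip": "412 96", "age": "30-39"},
        {"zip": "412 96", "age": "30-39"},
        {"zip": "411 29", "age": "40-49"}]
print(is_k_anonymous(rows, ["zip", "age"], k=2))   # False: the last row's QID combination is unique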
TWO INSIGHTS • Dalenius’ definition is good, but doesn’t make sense unless considering all the information in the universe. • Auxiliary information caused Peter’s height to be revealed, whether he was in the database or not.
A MORE TANGIBLE PROPERTY Controlled access to a database preserves the privacy of a person if it makes no difference whether that person’s data is included in the database or not. So: I should participate in that toy product survey, because it makes very little difference for me (privacy), but the statistical results may lead to better Barbie dolls (utility).
DIFFERENTIAL PRIVACY I’m in your database, but nobody knows! (Dwork)
BACK TO RANDOMIZING • Given a database D , we allow running a randomized query f, one that may make random coin tosses in addition to inspecting the data. • The result f(D) is a probability distribution over the random coin tosses of f , i.e. the data itself is not treated as a random variable. • We can now measure the probability of an answer being in a certain range: Pr[ f(D) ∈ S ]
For a particular query f, what happens to those probabilities when one person is removed from D ?
EXAMPLE: COUNT • D’ = D + { Arnar } • [Figure: the distributions Pr[ f(D) ] and Pr[ f(D’) ] over outcomes, both concentrated around 833; for any set S of outcomes, Pr[ f(D) ∈ S ] ≈ Pr[ f(D’) ∈ S ].]
DIFFERENTIAL PRIVACY • Let D and D’ be databases differing in one row. • A randomized query f is ε -differentially private iff for any set S ⊆ range( f ): Pr[ f(D) ∈ S ] ≤ exp( ε ) ⋅ Pr[ f(D’) ∈ S ]
DIFFERENTIAL PRIVACY • Swapping D and D’ and rearranging gives an equivalent formulation: exp(- ε ) ≤ Pr[ f(D) ∈ S ] / Pr[ f(D’) ∈ S ] ≤ exp( ε ) • For small ε , this means the ratio is very close to one.
HOW MUCH NOISE? • The sensitivity Δg of a (non-randomized) query g is the maximum effect of adding or removing one row, taken over all databases: Δg = max over D, D’ differing in one row of | g(D) - g(D’) | • Then g can be made ε -differentially private by adding noise drawn from the Laplace distribution Lap( b ), with density P(z) = 1/(2b) ⋅ exp(-|z| / b) and scale b = Δg / ε
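A minimal sketch of this Laplace mechanism (the function and parameter names are illustrative; it assumes NumPy):

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=np.random.default_rng()):
    # Calibrate the noise scale to the query's sensitivity: b = Δg / ε.
    b = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=b)

# A counting query changes by at most 1 when one row is added or removed,
# so its sensitivity is 1:
noisy_count = laplace_mechanism(true_value=833, sensitivity=1.0, epsilon=0.1)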
LAPLACIAN NOISE • High ε gives a clear difference between the noisy distributions around g(D) and g(D’). More utility, less privacy. • Low ε gives less difference. More privacy, less utility. • Why Laplace: symmetric and “memoryless”.
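A quick numerical check (a sketch, reusing the count example from earlier) that Laplace noise with b = 1/ε really keeps the density ratio within exp(ε) when the true counts differ by one:

import math

eps = 0.1
b = 1 / eps   # sensitivity 1 for a count

def lap_pdf(z, mu, b):
    return math.exp(-abs(z - mu) / b) / (2 * b)

# For any outcome z, the density ratio under g(D)=833 vs g(D')=834 equals
# exp((|z-834| - |z-833|) / b), which by the triangle inequality is at most exp(1/b) = exp(eps).
for z in (820.0, 833.5, 900.0):
    ratio = lap_pdf(z, 833, b) / lap_pdf(z, 834, b)
    assert ratio <= math.exp(eps) + 1e-12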
MULTIPLE QUERIES • Differential privacy mechanisms generally allocate a privacy budget to each user. • A user runs a query with a specified ε , which is then deducted from her budget. • Running the same query twice at ε each costs 2ε of budget, and averaging the two answers gives roughly the accuracy of a single query run with 2ε. • Benefit: No need for semantic analysis of queries. (A toy budget tracker is sketched below.)
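A toy sketch of how a curator might track such a budget (the class and its interface are invented for illustration, not any particular system's API):

class PrivacyBudget:
    def __init__(self, total_epsilon: float):
        self.remaining = total_epsilon

    def charge(self, epsilon: float) -> None:
        # Refuse the query outright if it would overspend the budget.
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.1)   # one noisy count
budget.charge(0.1)   # running the same query again still costs: no semantic analysis needed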
AUXILIARY INFORMATION • Differential privacy is not affected by auxiliary information, because the definition only compares whether one participates in a particular database or not. • Note: Differential privacy gives the same guarantees to those who do not appear in the database as to those who do!
DP MECHANISMS How do we actually use this?
BY HAND • The sensitivity of an algorithm may sometimes be approximated mechanically, but often one can do better; conservative estimates lead to high noise and reduced utility. • Often there are non-trivial ways to give a differentially private implementation of an algorithm that requires much less noise. • Generally parametrized on an ε . Not always via Laplace noise. • Many publications here, as DP originated in the algorithms community.
EXAMPLE: K-MEANS • Finding k clusters in a collection of N points: select k random “master” points; sort points into buckets by their closest master point; choose the means of the buckets as new master points and repeat. • “Bucketing” has a low sensitivity - removing one point only affects one bucket. • Calculating the means also has a low sensitivity. • Tricky: How many iterations, and how to split the ε ? (One possible split is sketched below.)
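One possible shape of such an implementation, as a hedged sketch rather than the algorithm from any particular paper: it assumes the points lie in [0, 1]^d, splits ε evenly across iterations, and within each iteration spends half of that share on noisy bucket sizes and half on noisy bucket sums (the buckets are disjoint, so each release is charged only once across buckets).

import numpy as np

def dp_kmeans(points, k, iters, epsilon, rng=np.random.default_rng(0)):
    # Assumes points is an (N, d) float array with coordinates in [0, 1].
    n, d = points.shape
    centers = points[rng.choice(n, size=k, replace=False)]
    eps_iter = epsilon / iters          # split the budget evenly over iterations
    for _ in range(iters):
        # Bucketing: assign each point to its nearest center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        new_centers = np.empty_like(centers)
        for j in range(k):
            bucket = points[labels == j]
            # One point changes a bucket's size by at most 1, and its coordinate
            # sums by at most d in L1 norm, so calibrate the Laplace noise to that.
            size = len(bucket) + rng.laplace(scale=1.0 / (eps_iter / 2))
            sums = bucket.sum(axis=0) + rng.laplace(scale=d / (eps_iter / 2), size=d)
            new_centers[j] = sums / max(size, 1.0)
        centers = new_centers
    return centers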
PINQ • LINQ (Language Integrated Query) is an embedded query language for .NET languages. • Privacy Integrated Queries (PINQ) [McSherry] adds a layer on top that automatically adds Laplacian noise to results:
var data = new PINQueryable<SearchRecord>( ... );
var users = from record in data
            where record.Query == argv[0]
            groupby record.IPAddress;
return users.NoisyCount(0.1);
PINQ • Sequential composition of differentially private computations is differentially private, with ε equal to the sum of the components’ εs. • Parallel composition (over disjoint parts of the data) is differentially private, with ε equal to the maximum of the components’ εs. • PINQ over-approximates heavily for some algorithms, e.g. those whose privacy cost depends on control flow, but works well for many others - for example k-means.
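The two rules boil down to very simple budget arithmetic (a sketch; the function names are just for illustration):

def sequential_cost(epsilons):
    # Queries run one after another over the same data: their costs add up.
    return sum(epsilons)

def parallel_cost(epsilons):
    # Queries run over disjoint partitions of the data: only the largest cost counts.
    return max(epsilons)

print(sequential_cost([0.1, 0.1, 0.3]))  # 0.5: three queries over the same records
print(parallel_cost([0.1, 0.1, 0.3]))    # 0.3: the same queries over disjoint partitions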
AIRAVAT • MapReduce (introduced by Google) decomposes computations into two phases, map and reduce, that are easy to distribute. • Airavat [Shmatikov et al.] implements differential privacy on top of MapReduce. • Mandatory access control and isolation allow untrusted mappers.
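A toy, single-machine imitation of that idea (not Airavat's actual interface; the mapper, the one-key-per-record assumption and the noise placement are mine): an untrusted mapper assigns each record to one key, and the trusted reducer only ever releases Laplace-noised counts.

from collections import defaultdict
import numpy as np

def dp_grouped_count(records, mapper, epsilon, rng=np.random.default_rng()):
    # Map phase: each record contributes to exactly one key, so removing a
    # record changes a single count by one (sensitivity 1 for the whole release).
    buckets = defaultdict(int)
    for r in records:
        buckets[mapper(r)] += 1
    # Reduce phase: only noisy aggregates leave the trusted side. (A real system
    # would also have to account for which keys appear at all.)
    return {k: v + rng.laplace(scale=1.0 / epsilon) for k, v in buckets.items()}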
LINEAR TYPES • Pierce and Reed provide a linear type system that guarantees differential privacy. • Value types form metric spaces, so the sensitivity of operations can be inferred (conservatively). • Does not deal well with redundant computations, and over-estimates sensitivity due to control flow. • Works only for specific ways of adding random noise. Recent work by Barthe et al. (POPL 2012) aims to improve on this.
NOT PERFECT, OF COURSE • Issuing and managing privacy budgets (to be spent on epsilons) is far from trivial. Not really a technical problem. • One may leak information through the use of the budget. E.g. PINQ issues an error when the budget is spent, providing a limited side channel. • Traditional covert channels, such as timing, are not addressed by PINQ/Airavat. • Proposed solutions: DP Under Fire [Haeberlen, Pierce, Narayan]
RELATIVES OF DP Wait, there’s more?