Mobile Data Collection and Analysis with Local Differential Privacy - Part 1
Ninghui Li (Purdue University)
Outline • Motivation of Differential Privacy and Local Differential Privacy (LDP) • Frequency Oracles in LDP
Tradeoff between Privacy and Utility
• Privacy: a privacy notion that gives a privacy protection guarantee
• Utility: design a mechanism under such a notion with high utility
6/13/2019
AOL Data Release [NYTimes 2006]
• In August 2006, AOL released the search keywords of 650,000 users over a 3-month period.
• User IDs were replaced by random numbers.
• Three days later, AOL pulled the data from public access.
• Queries of AOL searcher #4417749:
  • "landscapers in Lilburn, GA"
  • queries on the last name "Arnold"
  • "homes sold in shadow lake subdivision Gwinnett County, GA"
  • "num fingers"
  • "60 single men"
  • "dog that urinates on everything"
• NYT: the searcher is Thelma Arnold, a 62-year-old widow who lives in Lilburn, GA, has three dogs, and frequently searches her friends' medical ailments.
Re-identification occurs!
Differential Privacy [Dwork et al. 2006]
• Idea: any output should be about as likely regardless of whether or not I am in the dataset.
• Def. Algorithm $A$ satisfies $\epsilon$-differential privacy if for any neighboring datasets $D$ and $D'$ and any possible output $t$,
  $e^{-\epsilon} \le \frac{\Pr[A(D) = t]}{\Pr[A(D') = t]} \le e^{\epsilon}$
• Parameter $\epsilon$: strength of privacy protection, known as the privacy budget.
[Figure: neighboring datasets $D$ and $D'$ and their outputs $A(D)$ and $A(D')$.]
Key Assumption Behind DP: The Personal Data Principle
• After removing one individual's data, that individual's privacy is protected perfectly.
• Even if correlation can still reveal information about the individual, that is not considered a privacy violation.
• In other words, for each individual, the world after removing the individual's data is an ideal world of privacy for that individual. The goal is to simulate all these ideal worlds.
Differential Privacy in the Centralized Setting
[Figure: classical/centralized setting. Individuals' data is collected in a database; data mining and statistical queries are answered with added noise.]
• Differential privacy interpretation: the decision to include/exclude an individual's record has limited ($\epsilon$) influence on the outcome.
• Smaller $\epsilon$ ➔ stronger privacy
Differential Privacy in the Centralized Setting
[Figure: the same pipeline with a trust boundary marked. Individuals' data is collected in a trusted database; data mining and statistical queries are answered with added noise.]
Local Differential Privacy
[Figure: each individual sends Data+Noise to the database, which serves data mining and statistical queries; the trust boundary is marked.]
• No worry about an untrusted server
The Frequency Oracle Protocols under LDP
• $y := P(v)$ takes an input value $v$ from domain $D$ and outputs $y$.
• $c := \mathrm{Est}(\{y\})$ takes the reports $\{y\}$ from all users and outputs estimations $c(v)$ for any value $v$ in domain $D$.
• FO is $\epsilon$-LDP iff for any $v$ and $v'$ from $D$, and any valid output $y$,
  $\frac{\Pr[P(v) = y]}{\Pr[P(v') = y]} \le e^{\epsilon}$
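The $\epsilon$-LDP condition can be checked mechanically when the perturbation has finitely many inputs and outputs. The sketch below is only an illustration: the names `max_ldp_ratio` and `satisfies_ldp`, and the representation of the perturbation as a table `channel[v][y] = Pr[P(v) = y]`, are assumptions made here and do not come from the slides.

```python
import math

def max_ldp_ratio(channel):
    """Largest ratio Pr[P(v)=y] / Pr[P(v')=y] over all outputs y and inputs v, v'.

    channel[v][y] holds Pr[P(v) = y]; every row must sum to 1.
    """
    worst = 1.0
    for y in range(len(channel[0])):
        column = [row[y] for row in channel]
        lo, hi = min(column), max(column)
        if lo == 0:
            if hi > 0:
                return math.inf  # some input can never produce y, another can
            continue
        worst = max(worst, hi / lo)
    return worst

def satisfies_ldp(channel, eps, tol=1e-12):
    """True if the worst-case ratio is at most e^eps (up to rounding)."""
    return max_ldp_ratio(channel) <= math.exp(eps) * (1 + tol)

# Hypothetical binary perturbation: keep the bit w/p 0.75, flip it w/p 0.25.
keep_flip = [[0.75, 0.25],
             [0.25, 0.75]]

if __name__ == "__main__":
    print(max_ldp_ratio(keep_flip))               # worst-case ratio
    print(satisfies_ldp(keep_flip, math.log(3)))  # is it (ln 3)-LDP?
```

For this keep/flip table the worst-case ratio is 0.75/0.25 = 3, so it satisfies $\epsilon$-LDP exactly when $\epsilon \ge \ln 3$.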
Random Response (Warner '65)
• Survey technique for private questions
• Survey people: "Do you have a disease?"
• Each person:
  • Flips a secret coin
  • Answers truthfully if heads (w/p 0.5)
  • Answers randomly if tails
  • E.g., a patient will answer "yes" w/p 75%, and "no" w/p 25%
• Provides deniability: seeing the answer, one cannot be certain about the secret.
• To get an unbiased estimation of the distribution:
  • If $n_v$ out of $n$ people have the disease, we expect to see $E[I_v] = 0.75\, n_v + 0.25\,(n - n_v)$ "yes" answers
  • $c(n_v) = \frac{I_v - 0.25\, n}{0.75 - 0.25}$ is the unbiased estimation of the number of patients
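Concrete Example
A patient will answer "yes" w/p 75% and "no" w/p 25%; a non-patient will answer "yes" w/p 25% and "no" w/p 75%.

Truth      Count   Expected "yes"   Expected "no"
yes        80      60               20
no         20      5                15
observed           65               35
estimate           80               20

Estimate: $c(n_v) = \frac{I_v - 0.25\, n}{0.75 - 0.25}$, e.g. $(65 - 25)/0.5 = 80$ for "yes" and $(35 - 25)/0.5 = 20$ for "no".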
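The arithmetic of this example can be reproduced in a few lines. The following is a minimal sketch, assuming the coin-flip procedure described for Random Response (truth w/p 0.5, otherwise a uniformly random answer); the function names `rr_report`, `rr_estimate`, and `simulate` are illustrative, not from the slides.

```python
import random

def rr_report(has_disease, rng):
    """One randomized answer: truth on heads, a uniformly random answer on tails."""
    if rng.random() < 0.5:      # heads: answer truthfully
        return has_disease
    return rng.random() < 0.5   # tails: answer "yes"/"no" at random

def rr_estimate(yes_count, n, p=0.75, q=0.25):
    """Unbiased estimate of the number of true "yes" holders.

    p = Pr[report yes | truth yes], q = Pr[report yes | truth no].
    """
    return (yes_count - q * n) / (p - q)

def simulate(n_yes, n_no, seed=0):
    """Number of observed "yes" answers for a population with known truth."""
    rng = random.Random(seed)
    truths = [True] * n_yes + [False] * n_no
    return sum(rr_report(t, rng) for t in truths)

if __name__ == "__main__":
    # Exact expectation from the table: 65 expected "yes" out of 100 answers.
    print(rr_estimate(65, 100))  # 80.0
    # A simulated population, 1000 times larger than the example.
    observed = simulate(80_000, 20_000)
    print(rr_estimate(observed, 100_000))
```

The simulated estimate fluctuates around 80,000; the noise shrinks relative to the population size as $n$ grows.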
From Two to Any Categories
[Diagram: Random Response branches into Generalized Random Response, Unary Encoding, and Local Hash.]
• Unary Encoding: RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. Ú. Erlingsson, V. Pihur, A. Korolova, CCS 2014
• Local Hash: Local, Private, Efficient Protocols for Succinct Histograms. R. Bassily, A. Smith. STOC 2015
• Locally Differentially Private Protocols for Frequency Estimation. T. Wang, J. Blocki, N. Li, S. Jha. USENIX Security 2017
Generalized Random Response
• User:
  • Given $v \in D = \{1, 2, \dots, d\}$
  • Toss a coin with bias $p$
  • If it is heads, report the true value $y = v$
  • Otherwise, report any other value, each with probability $q = \frac{1-p}{d-1}$ (uniformly at random)
  • $p = \frac{e^{\epsilon}}{e^{\epsilon} + d - 1}$, $q = \frac{1}{e^{\epsilon} + d - 1}$ $\Rightarrow$ $\frac{\Pr[P(v) = v]}{\Pr[P(v') = v]} = \frac{p}{q} = e^{\epsilon}$
• Aggregator:
  • Suppose $n_v$ users possess value $v$, and $I_v$ is the number of reports on $v$.
  • $E[I_v] = n_v \cdot p + (n - n_v) \cdot q$
  • Unbiased estimation: $c(v) = \frac{I_v - n \cdot q}{p - q}$
• Intuitively, the higher $p$, the more accurate. However, when $d$ is large, $p$ becomes small (for the same $\epsilon$):

ε     p (d=2)   p (d=8)   p (d=128)   p (d=1024)
0.1   0.52      0.13      0.016       0.001
1     0.73      0.27      0.027       0.002
2     0.88      0.51      0.057       0.007
4     0.98      0.88      0.307       0.05

• To get rid of the dependency on domain size, we move to the other protocols.
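The user and aggregator steps of Generalized Random Response can be sketched as follows. This is an illustrative implementation under the slide's parameter choice ($p = e^{\epsilon}/(e^{\epsilon}+d-1)$, $q = 1/(e^{\epsilon}+d-1)$, domain $\{1,\dots,d\}$); the function names are assumptions made for this example.

```python
import math
import random
from collections import Counter

def grr_params(eps, d):
    """Return (p, q): probability of reporting the true value / one given other value."""
    e = math.exp(eps)
    return e / (e + d - 1), 1 / (e + d - 1)

def grr_perturb(v, eps, d, rng):
    """User side: report v w/p p, otherwise a uniformly chosen other value in 1..d."""
    p, _ = grr_params(eps, d)
    if rng.random() < p:
        return v
    other = rng.randrange(1, d)              # uniform over 1..d-1
    return other if other < v else other + 1  # skip v itself

def grr_estimate(reports, eps, d):
    """Aggregator side: unbiased count estimate (I_v - n*q) / (p - q) for each v."""
    p, q = grr_params(eps, d)
    n = len(reports)
    counts = Counter(reports)
    return {v: (counts[v] - n * q) / (p - q) for v in range(1, d + 1)}

if __name__ == "__main__":
    rng = random.Random(0)
    truth = [1] * 20_000 + [2] * 10_000 + [3] * 6_000 + [4] * 4_000
    reports = [grr_perturb(v, 2.0, 4, rng) for v in truth]
    print(grr_estimate(reports, 2.0, 4))
```

With this choice of $p$ and $q$ the estimates always sum to $n$, because $1 - d\,q = p - q$.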
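Unary Encoding (Basic RAPPOR)
• Encode the value $v$ into a bit string: $\mathbf{x} := \vec{0}$, $\mathbf{x}_v := 1$
  • e.g., $D = \{1,2,3,4\}$, $v = 3$, then $\mathbf{x} = [0,0,1,0]$
• Perturb each bit, preserving it with probability $p$:
  • $p_{1\to1} = p_{0\to0} = p = \frac{e^{\epsilon/2}}{e^{\epsilon/2}+1}$, $p_{1\to0} = p_{0\to1} = q = \frac{1}{e^{\epsilon/2}+1}$
  • $\Rightarrow$ for any output bit string $\mathbf{y}$: $\frac{\Pr[P(E(v)) = \mathbf{y}]}{\Pr[P(E(v')) = \mathbf{y}]} \le \frac{p_{1\to1}}{p_{0\to1}} \times \frac{p_{0\to0}}{p_{1\to0}} = e^{\epsilon}$
  • Since $\mathbf{x} = E(v)$ and $\mathbf{x}' = E(v')$ are unary encodings, they differ in exactly two locations
• Intuition:
  • By unary encoding, each location can only be 0 or 1, effectively reducing $d$ in each location to 2. (But the privacy budget is halved.)
  • When $d$ is large, UE is better than DE (Direct Encoding, i.e., Generalized Random Response).
• To estimate the frequency of each value, do it for each bit.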
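A minimal sketch of Unary Encoding with per-bit perturbation is given below. It assumes the symmetric setting on the slide (each bit kept with probability $p = e^{\epsilon/2}/(e^{\epsilon/2}+1)$ and flipped with probability $q = 1 - p$) and a 1-based domain $\{1,\dots,d\}$; the function names are illustrative only.

```python
import math
import random

def ue_params(eps):
    """Return (p, q): per-bit keep and flip probabilities, each bit using eps/2."""
    e = math.exp(eps / 2)
    return e / (e + 1), 1 / (e + 1)

def ue_encode(v, d):
    """Unary (one-hot) encoding of v in 1..d."""
    x = [0] * d
    x[v - 1] = 1
    return x

def ue_perturb(x, eps, rng):
    """Keep each bit with probability p, flip it with probability q = 1 - p."""
    p, _ = ue_params(eps)
    return [bit if rng.random() < p else 1 - bit for bit in x]

def ue_estimate(reports, eps, d):
    """Per-bit unbiased estimate (I_v - n*q) / (p - q), I_v = number of 1s at position v."""
    p, q = ue_params(eps)
    n = len(reports)
    ones = [sum(r[i] for r in reports) for i in range(d)]
    return [(ones[i] - n * q) / (p - q) for i in range(d)]

if __name__ == "__main__":
    rng = random.Random(0)
    truth = [1] * 8_000 + [2] * 6_000 + [3] * 4_000 + [4] * 2_000
    reports = [ue_perturb(ue_encode(v, 4), 2.0, rng) for v in truth]
    print(ue_estimate(reports, 2.0, 4))  # index 0 corresponds to value 1
```

Two encodings differ in two positions, and each position contributes a factor of at most $p/q = e^{\epsilon/2}$, which gives the overall bound $e^{\epsilon}$.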
Binary Local Hash
• The original protocol uses a shared random matrix; this is an equivalent description
• Each user uses a random hash function $H$ from $D$ to $\{0,1\}$
• The user then perturbs the bit $H(v)$ with probabilities
  • $p = \frac{e^{\epsilon}}{e^{\epsilon}+1}$, $q = \frac{1}{e^{\epsilon}+1}$ $\Rightarrow$ $\frac{\Pr[P(E(v)) = b]}{\Pr[P(E(v')) = b]} \le \frac{p}{q} = e^{\epsilon}$
• The user then reports the bit and the hash function
• The aggregator increments the count of every value in the reported group, i.e., every $v$ with $H(v) = b$
• $E[I_v] = n_v \cdot p + (n - n_v) \cdot (\frac{1}{2} q + \frac{1}{2} p)$
• Unbiased estimation: $c(v) = \frac{I_v - n \cdot \frac{1}{2}}{p - \frac{1}{2}}$
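The steps above can be sketched as follows. The slides do not say how the random hash function is realized; this example assumes, purely for illustration, that each user draws a random 32-bit seed and that the hash bit is the low bit of a SHA-256 digest of the seed and the value. The names `hash_bit`, `blh_perturb`, and `blh_estimate` are likewise hypothetical.

```python
import hashlib
import math
import random

def hash_bit(seed, v):
    """Assumed hash family: H_seed(v) = low bit of SHA-256("seed:v")."""
    digest = hashlib.sha256(f"{seed}:{v}".encode()).digest()
    return digest[0] & 1

def blh_perturb(v, eps, rng):
    """User side: pick a hash function, hash v to one bit, keep the bit w/p p."""
    seed = rng.getrandbits(32)
    bit = hash_bit(seed, v)
    p = math.exp(eps) / (math.exp(eps) + 1)
    if rng.random() >= p:   # flip w/p q = 1 - p
        bit = 1 - bit
    return seed, bit

def blh_estimate(reports, eps, d):
    """Aggregator side: I_v counts reports whose bit equals H(v); c(v) = (I_v - n/2) / (p - 1/2)."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    n = len(reports)
    estimates = []
    for v in range(1, d + 1):
        support = sum(1 for seed, bit in reports if hash_bit(seed, v) == bit)
        estimates.append((support - n * 0.5) / (p - 0.5))
    return estimates

if __name__ == "__main__":
    rng = random.Random(0)
    truth = [1] * 8_000 + [2] * 6_000 + [3] * 4_000 + [4] * 2_000
    reports = [blh_perturb(v, 2.0, rng) for v in truth]
    print(blh_estimate(reports, 2.0, 4))  # index 0 corresponds to value 1
```

A report from a user holding a different value supports $v$ with probability $\frac{1}{2}p + \frac{1}{2}q = \frac{1}{2}$, which is where the $n \cdot \frac{1}{2}$ correction in the estimator comes from.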
Optimization
• We measure the utility of a mechanism by its variance
  • E.g., in Random Response, $\mathrm{Var}[c(v)] = \mathrm{Var}\left[\frac{I_v - n \cdot q}{p - q}\right] = \frac{\mathrm{Var}[I_v]}{(p-q)^2} \approx \frac{n \cdot q \cdot (1-q)}{(p-q)^2}$
• We propose a framework called 'pure' and cast existing mechanisms into the framework.
  • Each output $y$ "supports" a set of inputs $v$
  • E.g., in Unary Encoding, a binary vector supports each value with a corresponding 1
  • E.g., in BLH, $\mathrm{Support}(y) = \{v \mid H(v) = y\}$
• A pure protocol is specified by $p'$ and $q'$
  • Each input is perturbed into a value "supporting it" with $p'$, and into a value not supporting it with $q'$
• Goal: $\min_{q'} \mathrm{Var}[c(v)]$, or $\min_{q'} \frac{n \cdot q' \cdot (1-q')}{(p'-q')^2}$, where $p'$, $q'$ satisfy $\epsilon$-LDP
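To make the variance comparison concrete, the approximate variance $n\,q'(1-q')/(p'-q')^2$ can be evaluated for the three protocols described in this deck. This is a sketch under the parameter choices stated on the earlier slides; taking $q' = 1/2$ for Binary Local Hash follows from its expectation formula, and the function names are illustrative.

```python
import math

def pure_variance(n, p, q):
    """Approximate variance of the count estimate: n * q * (1 - q) / (p - q)^2."""
    return n * q * (1 - q) / (p - q) ** 2

def grr_variance(n, eps, d):
    """Generalized Random Response: p = e^eps/(e^eps+d-1), q = 1/(e^eps+d-1)."""
    e = math.exp(eps)
    return pure_variance(n, e / (e + d - 1), 1 / (e + d - 1))

def ue_variance(n, eps):
    """Symmetric Unary Encoding: each bit uses half the budget."""
    e = math.exp(eps / 2)
    return pure_variance(n, e / (e + 1), 1 / (e + 1))

def blh_variance(n, eps):
    """Binary Local Hash: p' = e^eps/(e^eps+1), and other inputs support v w/p 1/2."""
    e = math.exp(eps)
    return pure_variance(n, e / (e + 1), 0.5)

if __name__ == "__main__":
    n, eps = 10_000, 1.0
    for d in (2, 8, 128, 1024):
        print(d, grr_variance(n, eps, d), ue_variance(n, eps), blh_variance(n, eps))
```

At $\epsilon = 1$ this reproduces the qualitative claim from the Unary Encoding slide: Generalized Random Response has the smaller variance for $d = 2$, while Unary Encoding, whose variance does not depend on $d$, is better for $d = 1024$.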
Frequency Estimation Protocols
• Randomised response: a survey technique for eliminating evasive answer bias. S.L. Warner, Journal of the American Statistical Association, 1965
  • Direct Encoding (Generalized Random Response)
• RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. Ú. Erlingsson, V. Pihur, A. Korolova, CCS 2014
  • Unary Encoding: encode into a bit vector
• Local, Private, Efficient Protocols for Succinct Histograms. R. Bassily, A. Smith. STOC 2015
  • Binary Local Hash: encode by hashing and then perturb
• Locally Differentially Private Protocols for Frequency Estimation. T. Wang, J. Blocki, N. Li, S. Jha. USENIX Security 2017