
Releasing a Differentially Private Password Frequency Corpus from 70 - PowerPoint PPT Presentation

Releasing a Differentially Private Password Frequency Corpus from 70 Million Yahoo! Passwords. Jeremiah Blocki, Purdue University. DIMACS/Northeast Big Data Hub Workshop on Overcoming Barriers to Data Sharing including Privacy and Fairness.


  1. Releasing a Differentially Private Password Frequency Corpus from 70 Million Yahoo! Passwords. Jeremiah Blocki, Purdue University. DIMACS/Northeast Big Data Hub Workshop on Overcoming Barriers to Data Sharing including Privacy and Fairness.

  2. What is a Password Frequency List?
     Password Dataset (N users): 12345, password, abc123, abc123
     Histogram: abc123 → 2, 12345 → 1, password → 1

  3. What is a Password Frequency List?
     Password Dataset (N users): 12345, password, abc123, abc123
     Histogram: abc123 → 2, 12345 → 1, password → 1
     Frequency List: 2, 1, 1

  4. What is a Password Frequency List?
     Password Dataset (N users): 12345, password, abc123, abc123
     Histogram: abc123 → 2, 12345 → 1, password → 1
     Frequency List: 2, 1, 1
     Formally: $f \in \wp(N)$. A Password Frequency List is just an integer partition.
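The step from dataset to histogram to frequency list can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the helper name `frequency_list` is hypothetical.

```python
from collections import Counter

def frequency_list(passwords):
    """Return the counts of each distinct password, sorted in non-increasing order."""
    histogram = Counter(passwords)            # password -> number of users
    return sorted(histogram.values(), reverse=True)

dataset = ["12345", "password", "abc123", "abc123"]
f = frequency_list(dataset)
print(f)                          # [2, 1, 1]
print(sum(f) == len(dataset))     # True: f is an integer partition of N
```

The passwords themselves are dropped; only the sorted counts remain, which is why the result is an integer partition of N.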

  5. Password Frequency List (Example Use)
     Estimate #accounts compromised by an attacker with $\beta$ guesses per user: $\lambda_\beta = \sum_{i=1}^{\beta} f_i$
     • Online Attacker ($\beta$ small)
     • Offline Attacker ($\beta$ large)
     Password Frequency Lists allow us to estimate
     • Marginal Guessing Cost (MGC)
     • Marginal Benefit (MB)
     • Rational Adversary: MGC = MB
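As a rough illustration of this use, the number of accounts cracked within a given guess budget per user is the sum of that many of the largest frequencies. The sketch below assumes the attacker tries passwords in order of popularity; the function name `lambda_beta` is made up for the example.

```python
def lambda_beta(f, beta):
    """Sum of the beta largest frequencies: accounts cracked with beta guesses per user."""
    return sum(sorted(f, reverse=True)[:beta])

f = [2, 1, 1]          # toy frequency list from the earlier slides
N = sum(f)
for beta in (1, 2, 3):
    cracked = lambda_beta(f, beta)
    print(beta, cracked, cracked / N)
# 1 2 0.5
# 2 3 0.75
# 3 4 1.0
```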

  6. Available Password Frequency Lists (2015)
     Site | #User Accounts (N) | How Released
     RockYou | 32.6 Million | Data Breach*
     LinkedIn | 6 | Data Breach*
     … | … | …
     Yahoo! [B12] | 70 Million | With Permission**
     * entire frequency list available due to improper password storage
     ** frequency list perturbed slightly to preserve differential privacy.
     Yahoo! Frequency data is now available online at: https://figshare.com/articles/Yahoo_Password_Frequency_Corpus/2057937

  7. Yahoo! Password Frequency List
     • Collected by Joseph Bonneau in 2011 (with permission from Yahoo!)
     • Store H(s|pwd)
     • Secret salt value s (same for all users)
     • Discarded after data-collection
     • ≈ 70 million Yahoo! users
     • Yahoo! Legal gave permission to publish analysis of the frequency list

  8. Project Origin Would it be possible to access the Yahoo! data? I am working on a cool new research project and the password frequency data would be very useful.

  9. Project Origin I would love to make the data public, but Yahoo! Legal has concerns about security and privacy. They won't let me release it.

  10. Project Origin I would love to make the data public, but Yahoo! Legal has concerns about security and privacy. They won't let me release it.

  11. Available Password Frequency Lists
      Site | #User Accounts (N) | How Released
      RockYou | 32.6 Million | Data Breach*
      LinkedIn | 6 | Data Breach*
      … | … | …
      Yahoo! [B12] | 70 Million | With Permission**
      * entire frequency list available due to improper password storage
      ** frequency list perturbed slightly to preserve differential privacy.
      Yahoo! Frequency data is now available online at: https://figshare.com/articles/Yahoo_Password_Frequency_Corpus/2057937

  12. Yahoo! Frequency Corpus: largest publicly available frequency corpus (that was not the result of a data breach). Source ≠ Dark Web

  13. Why not just publish the original frequency lists?
      • Heuristic approaches to data privacy often break down when the adversary has background knowledge
      • Netflix Prize Dataset [NS08] (background knowledge: IMDB)
      • Massachusetts Group Insurance Medical Encounter Database [SS98] (background knowledge: voter registration records)
      • Many other attacks [BDK07,…]
      • In the absence of provable privacy guarantees, Yahoo! was understandably reluctant to release these password frequency lists.

  14. Security Risks (Example) [Figure: dataset entries ???, abc123, abc123, 12345; the known entries are labeled "Adversary Background Knowledge"]

  15. Security Risks (Example) [Figure: dataset entries ???, abc123, abc123, 12345, shown next to histograms over abc123, 12345, and "other"; bar values as extracted: 3, 2 2 2 2, 1 1 1]

  16. Differential Privacy (Dwork et al.)
      Definition: A (randomized) algorithm A preserves $(\epsilon, \delta)$-differential privacy if for any subset $S \subseteq \mathrm{Range}(A)$ of possible outcomes we have
      $\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$
      for any pair of adjacent password frequency lists f and f', i.e., $\|f - f'\|_1 = 1$, where $\|f - f'\|_1 \stackrel{\mathrm{def}}{=} \sum_i |f_i - f'_i|$.

  17. Differential Privacy (Dwork et al.)
      Definition: A (randomized) algorithm A preserves $(\epsilon, \delta)$-differential privacy if for any subset $S \subseteq \mathrm{Range}(A)$ of possible outcomes we have
      $\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$
      for any pair of adjacent password frequency lists f and f', i.e., $\|f - f'\|_1 = 1$.
      f: original password frequency list
      f': remove Alice's password from the dataset

  18. Differential Privacy (Dwork et al.)
      Definition: A (randomized) algorithm A preserves $(\epsilon, \delta)$-differential privacy if for any subset $S \subseteq \mathrm{Range}(A)$ of possible outcomes we have
      $\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$
      for any pair of adjacent password frequency lists f and f', i.e., $\|f - f'\|_1 = 1$.
      $\epsilon$: Small Constant (e.g., $\epsilon = 0.5$)
      f: original password frequency list
      f': remove Alice's password from the dataset

  19. Differential Privacy (Dwork et al.)
      Definition: A (randomized) algorithm A preserves $(\epsilon, \delta)$-differential privacy if for any subset $S \subseteq \mathrm{Range}(A)$ of possible outcomes we have
      $\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$
      for any pair of adjacent password frequency lists f and f', i.e., $\|f - f'\|_1 = 1$.
      $\delta$: Negligibly Small Value (e.g., $\delta = 2^{-100}$)
      $\epsilon$: Small Constant (e.g., $\epsilon = 0.5$)
      f: original password frequency list
      f': remove Alice's password from the dataset
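The adjacency condition in the definition can be checked concretely. The sketch below is an illustration only: `l1_distance` and `remove_one` are hypothetical helpers, and shorter lists are padded with zeros before comparing, which is an assumption about how lists of different lengths are aligned.

```python
from itertools import zip_longest

def l1_distance(f, g):
    """L1 distance between two frequency lists, padding the shorter one with zeros."""
    return sum(abs(a - b) for a, b in zip_longest(f, g, fillvalue=0))

def remove_one(f, i):
    """Frequency list after removing one user whose password is the i-th most common."""
    g = list(f)
    g[i] -= 1
    # Drop passwords that no longer occur and restore non-increasing order.
    return sorted((x for x in g if x > 0), reverse=True)

f = [2, 1, 1]                 # original list (Alice included)
f_prime = remove_one(f, 0)    # Alice's password removed
print(f_prime)                # [1, 1, 1]
print(l1_distance(f, f_prime))  # 1, so f and f' are adjacent
```

Removing any single user changes exactly one count by one, so the two lists are always at L1 distance 1 under this alignment.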

  20. Differential Privacy (Example) [Figure: histograms of adjacent frequency lists f and f', which differ by 1; within the set of Outcomes, S is the subset of all outcomes potentially harmful to Alice]

  21. Differential Privacy (Example) [Figure: histograms of adjacent frequency lists f and f', which differ by 1; S is the subset of all outcomes potentially harmful to Alice]
      $\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$

  22. Differential Privacy (Example)
      Intuition: Alice won't be harmed because her password was included in the dataset.
      $\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$

  23. Main Technical Result
      Theorem: There is a computationally efficient algorithm $A: \wp \times \wp \to \wp$ such that A preserves $(\epsilon, \delta)$-differential privacy and, except with probability $\delta$, $A(f)$ outputs $\tilde f$ s.t.
      $\frac{\|f - \tilde f\|_1}{N} \le O\left(\frac{1}{\epsilon \sqrt{N}} + \frac{\ln(1/\delta)}{\epsilon N}\right)$,
      where $\wp = \bigcup_{n=1}^{\infty} \wp(n)$.
      $\mathrm{Time}(A) = O\left(\frac{N\sqrt{N} + N \ln(1/\delta)}{\epsilon}\right) = \mathrm{Space}(A)$

  24. Main Tool: Exponential Mechanism [MT07]
      Input: f
      Output: $\tilde f$, with $\Pr[\mathcal{E}^{\epsilon}(f) = \tilde f] \propto e^{-\epsilon \|f - \tilde f\|_1 / 2}$
      Assigns very small probability to inaccurate outcomes.

  25. Main Tool: Exponential Mechanism [MT07]
      Input: f
      Output: $\tilde f$, with $\Pr[\mathcal{E}^{\epsilon}(f) = \tilde f] \propto e^{-\epsilon \|f - \tilde f\|_1 / 2}$
      Theorem [MT07]: The exponential mechanism preserves $(\epsilon, 0)$-differential privacy.
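For very small inputs the exponential mechanism can be run by brute force, which makes the privacy guarantee easy to check numerically. The sketch below is an illustration under stated assumptions, not the algorithm from the talk: candidates are restricted to partitions of integers up to a hypothetical cutoff `n_max`, and the weight is taken as exp(-ε·‖f − f̃‖₁/2), the standard form for a score with sensitivity 1.

```python
import math
import random
from itertools import zip_longest

def partitions(n, max_part=None):
    """Yield every partition of n as a non-increasing list of positive integers."""
    if max_part is None or max_part > n:
        max_part = n
    if n == 0:
        yield []
        return
    for first in range(max_part, 0, -1):
        for rest in partitions(n - first, first):
            yield [first] + rest

def l1(f, g):
    return sum(abs(a - b) for a, b in zip_longest(f, g, fillvalue=0))

def exp_mech_distribution(f, epsilon, n_max):
    """Exact output distribution over all partitions of 0..n_max (assumed cutoff)."""
    candidates = [p for n in range(n_max + 1) for p in partitions(n)]
    weights = [math.exp(-epsilon * l1(f, p) / 2) for p in candidates]
    total = sum(weights)
    return candidates, [w / total for w in weights]

def sample(f, epsilon, n_max, rng=random):
    candidates, probs = exp_mech_distribution(f, epsilon, n_max)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Compare output probabilities on two adjacent inputs.
eps = 0.5
cands, p = exp_mech_distribution([2, 1, 1], eps, n_max=8)
_, q = exp_mech_distribution([1, 1, 1], eps, n_max=8)
worst = max(max(a / b, b / a) for a, b in zip(p, q))
print(worst <= math.exp(eps))   # True: every outcome's probability ratio is at most e^eps
print(sample([2, 1, 1], eps, n_max=8, rng=random.Random(0)))
```

Enumerating all candidates is only feasible for tiny N, since the number of partitions grows very quickly.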

  26. Analysis: Exponential Mechanism
      Input: f
      Output: $\tilde f$, with $\Pr[\mathcal{E}^{\epsilon}(f) = \tilde f] \propto e^{-\epsilon \|f - \tilde f\|_1 / 2}$
      Assigns very small probability to inaccurate outcomes.
      Theorem [HR18]: There are $e^{O(\sqrt{N})}$ partitions of the integer N.

  27. Analysis: Exponential Mechanism
      Input: f
      Output: $\tilde f$, with $\Pr[\mathcal{E}^{\epsilon}(f) = \tilde f] \propto e^{-\epsilon \|f - \tilde f\|_1 / 2}$
      Assigns very small probability to inaccurate outcomes.
      Theorem [HR18]: There are $e^{O(\sqrt{N})}$ partitions of the integer N.
      Union Bound → $\|f - \tilde f\|_1 \le O\left(\sqrt{N}/\epsilon\right)$ with high probability.
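The partition-counting bound can be checked numerically for small N. The sketch below counts partitions exactly with a standard dynamic program and compares the result with the Hardy-Ramanujan asymptotic estimate exp(π·√(2N/3)) / (4N·√3); the function names are hypothetical.

```python
import math

def partition_count(n):
    """Exact number of integer partitions of n (coin-change style dynamic program)."""
    ways = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]

def hardy_ramanujan(n):
    """Asymptotic estimate of the number of partitions of n."""
    return math.exp(math.pi * math.sqrt(2 * n / 3)) / (4 * n * math.sqrt(3))

n = 100
exact = partition_count(n)
print(exact)                               # 190569292
print(hardy_ramanujan(n) / exact)          # close to 1 (the estimate runs slightly high)
print(math.log(exact) / math.sqrt(n))      # stays below pi * sqrt(2/3), about 2.565
```

Because the logarithm of the count grows like √N, a union bound over all candidates only costs an additive O(√N) term in the exponent.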

  28. Analysis: Exponential Mechanism
      Input: f
      Output: $\tilde f$, with $\Pr[\mathcal{E}^{\epsilon}(f) = \tilde f] \propto e^{-\epsilon \|f - \tilde f\|_1 / 2}$
      Assigns very small probability to inaccurate outcomes.
      Theorem: $\frac{\|f - \tilde f\|_1}{N} \le O\left(\frac{1}{\epsilon \sqrt{N}}\right)$ with high probability.
      Theorem [MT07]: The exponential mechanism preserves $(\epsilon, 0)$-differential privacy.

  29. (e.g., [U13])

  30. But, we did run the exponential mechanism
      Theorem: There is an efficient algorithm A to sample from a distribution that is $\delta$-close to the exponential mechanism $\mathcal{E}$ over integer partitions. The algorithm uses time and space $O\left(\frac{N\sqrt{N} + N \ln(1/\delta)}{\epsilon}\right)$.
      Key Intuition: $e^{-\epsilon \sum_i |f_i - \tilde f_i|} = e^{-\epsilon \sum_{i \le t} |f_i - \tilde f_i|} \times e^{-\epsilon \sum_{i > t} |f_i - \tilde f_i|}$
      Suggests Potential Recurrence Relationships

  31. But, we did run the exponential mechanism
      Theorem: There is an efficient algorithm A to sample from a distribution that is $\delta$-close to the exponential mechanism $\mathcal{E}$ over integer partitions. The algorithm uses time and space $O\left(\frac{N\sqrt{N} + N \ln(1/\delta)}{\epsilon}\right)$.
      Key Idea 1: Novel dynamic programming algorithm to compute weights $W_{i,k}$ such that
      $\Pr[\tilde f_i = k \mid \tilde f_{i-1}] = \frac{W_{i,k}}{\sum_{t=0}^{\tilde f_{i-1}} W_{i,t}}$.

  32. But, we did run the exponential mechanism
      Theorem: There is an efficient algorithm A to sample from a distribution that is $\delta$-close to the exponential mechanism $\mathcal{E}$ over integer partitions. The algorithm uses time and space $O\left(\frac{N\sqrt{N} + N \ln(1/\delta)}{\epsilon}\right)$.
      Key Idea 1: Novel dynamic programming algorithm to compute weights $W_{i,t}$
      Key Idea 2: Allow A to ignore a partition $\tilde f$ if $\|f - \tilde f\|_1$ is very large.
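The conditional-sampling idea behind Key Idea 1 can be illustrated on a toy scale. The sketch below is a simplified stand-in, not the algorithm from the talk: it fixes a hypothetical output length `length` and value cap `cap`, uses the weight exp(-ε·|f_i − k|/2) per position, and performs none of the pruning from Key Idea 2. Here `W[i][k]` is the total weight of all non-increasing completions that put value k at position i, so each entry is drawn with probability W[i][k] divided by the sum of W[i][t] over t up to the previous entry.

```python
import math
import random
from itertools import accumulate, product

def build_weights(f, epsilon, length, cap):
    """W[i][k]: weight of all non-increasing suffixes starting with value k at position i.
    Assumes length >= len(f); f is padded with zeros up to `length`."""
    target = list(f) + [0] * (length - len(f))
    W = [None] * length
    prefix_next = [1.0] * (cap + 1)            # the empty suffix has weight 1
    for i in range(length - 1, -1, -1):
        W[i] = [math.exp(-epsilon * abs(target[i] - k) / 2) * prefix_next[k]
                for k in range(cap + 1)]
        prefix_next = list(accumulate(W[i]))   # prefix_next[k] = sum of W[i][t] for t <= k
    return W

def sample_partition(f, epsilon, length, cap, rng=random):
    """Draw entries left to right, each conditioned on the previous (larger) entry."""
    W = build_weights(f, epsilon, length, cap)
    out, prev = [], cap
    for i in range(length):
        k = rng.choices(range(prev + 1), weights=W[i][:prev + 1], k=1)[0]
        out.append(k)
        prev = k
    return [x for x in out if x > 0]           # drop trailing zeros

# Sanity check against brute-force enumeration on a tiny instance.
f, eps, length, cap = [2, 1, 1], 0.5, 4, 3
W = build_weights(f, eps, length, cap)
target = f + [0]
Z = sum(math.exp(-eps * sum(abs(a - b) for a, b in zip(target, x)) / 2)
        for x in product(range(cap + 1), repeat=length)
        if all(x[j] >= x[j + 1] for j in range(length - 1)))
print(abs(sum(W[0]) - Z) < 1e-9)   # True: the table reproduces the normalizing constant
print(sample_partition(f, eps, length, cap, rng=random.Random(1)))
```

The table has one row per position and one column per candidate value, which is the source of the space pressure at realistic N.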

  33. Practical Challenge #1
      • Space is Limiting Factor: N = 70 million, $\epsilon = 0.02$: $\frac{N\sqrt{N} + N \ln(1/\delta)}{\epsilon} \times (8 \text{ bytes}) \approx 200$ TB
      • Workaround: Initial pruning phase to identify relevant subset of DP table for sampling.
      • Running Time: ≈ 12 hours on this laptop
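The space figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the constant hidden in the O(·) is 1 and takes δ = 2^-100, the example value from the definition slides; neither is stated on this slide.

```python
import math

N = 70_000_000          # users
epsilon = 0.02
delta = 2 ** -100       # assumption: example value from the definition slides
bytes_per_cell = 8

cells = (N * math.sqrt(N) + N * math.log(1 / delta)) / epsilon
terabytes = cells * bytes_per_cell / 1e12
print(round(terabytes))   # about 236, the same order as the 200 TB quoted
```

The N·√N term dominates; the ln(1/δ) term contributes under one percent of the total at these parameters.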
