Releasing a Differentially Private Password Frequency Corpus from 70 Million Yahoo! Passwords Jeremiah Blocki Purdue University DIMACS/Northeast Big Data Hub Workshop on Overcoming Barriers to Data Sharing including Privacy and Fairness
What is a Password Frequency List?
Password Dataset (N users): 12345, password, abc123, abc123
Histogram: abc123: 2, 12345: 1, password: 1
Frequency List: 2, 1, 1
Formally: $f \in \mathcal{P}(N)$. A password frequency list is just an integer partition.
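The mapping from a password dataset to its frequency list can be sketched in a few lines of Python. The function name `frequency_list` is chosen here for illustration and does not come from the slides.

```python
from collections import Counter

def frequency_list(passwords):
    """Return the password counts sorted in non-increasing order.

    The passwords themselves are dropped; only the multiset of counts
    (an integer partition of the number of users) is kept.
    """
    histogram = Counter(passwords)              # password -> number of users
    return sorted(histogram.values(), reverse=True)

dataset = ["12345", "password", "abc123", "abc123"]
f = frequency_list(dataset)
print(f)        # [2, 1, 1]
print(sum(f))   # 4, the number of users N
```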
Password Frequency List (Example Use)
Estimate #accounts compromised by attacker with $\beta$ guesses per user
• Online Attacker ($\beta$ small)
• Offline Attacker ($\beta$ large)
$\lambda_\beta = \sum_{i=1}^{\beta} f_i$
Password Frequency Lists allow us to estimate
• Marginal Guessing Cost (MGC)
• Marginal Benefit (MB)
• Rational Adversary: MGC = MB
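As a small illustration of the estimate on this slide: an attacker who tries the most popular passwords first compromises as many accounts as the sum of the largest entries of the frequency list, one entry per guess. The sketch below uses a hypothetical helper name, `accounts_compromised`, and the toy list from the earlier example.

```python
def accounts_compromised(freq_list, guesses):
    """Sum of the `guesses` largest counts: accounts cracked when the
    attacker tries the most popular passwords first."""
    ordered = sorted(freq_list, reverse=True)
    return sum(ordered[:guesses])

f = [2, 1, 1]
print(accounts_compromised(f, 1))   # 2 (online attacker, few guesses)
print(accounts_compromised(f, 3))   # 4 (offline attacker, many guesses)
```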
Available Password Frequency Lists (2015)
Site           #User Accounts (N)   How Released
RockYou        32.6 Million         Data Breach*
LinkedIn       6 Million            Data Breach*
…              …                    …
Yahoo! [B12]   70 Million           With Permission**
* Entire frequency list available due to improper password storage.
** Frequency list perturbed slightly to preserve differential privacy.
Yahoo! frequency data is now available online at: https://figshare.com/articles/Yahoo_Password_Frequency_Corpus/2057937
Yahoo! Password Frequency List
• Collected by Joseph Bonneau in 2011 (with permission from Yahoo!)
• Store H(s|pwd)
  • Secret salt value s (same for all users)
  • Discarded after data collection
• ≈ 70 million Yahoo! users
• Yahoo! Legal gave permission to publish analysis of the frequency list
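The collection scheme on this slide (hash each password with one secret salt, count the hashes, then discard the salt) might look roughly like the sketch below. The slides do not name the hash function H, so SHA-256 is an assumption, as are the function name and the 32-byte salt length.

```python
import hashlib
import secrets
from collections import Counter

def collect_frequency_list(passwords):
    """Count salted password hashes and return only the sorted counts.

    Assumption: H is SHA-256 over salt || password; the slides do not say.
    """
    salt = secrets.token_bytes(32)   # secret salt s, the same for all users
    counts = Counter(
        hashlib.sha256(salt + pwd.encode("utf-8")).digest()
        for pwd in passwords
    )
    del salt                         # the salt is not kept after collection
    return sorted(counts.values(), reverse=True)

print(collect_frequency_list(["12345", "password", "abc123", "abc123"]))
# [2, 1, 1]
```

Because the salt is shared, equal passwords still collide and can be counted, while the stored values are not plain hashes that could be looked up directly. Note that `del` only drops the Python reference; it does not securely erase memory.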
Project Origin Would it be possible to access the Yahoo! data? I am working on a cool new research project and the password frequency data would be very useful.
Project Origin I would love to make the data public, but Yahoo! Legal has concerns about security and privacy. They won't let me release it.
Yahoo! Frequency Corpus
Largest publicly available frequency corpus (that was not the result of a data breach)
Source ≠ Dark Web
Why not just publish the original frequency lists?
• Heuristic approaches to data privacy often break down when the adversary has background knowledge
  • Netflix Prize Dataset [NS08]
    • Background knowledge: IMDB
  • Massachusetts Group Insurance Medical Encounter Database [SS98]
    • Background knowledge: voter registration records
  • Many other attacks [BDK07, …]
• In the absence of provable privacy guarantees, Yahoo! was understandably reluctant to release these password frequency lists.
Security Risks (Example)
[Figure: one user's password is unknown (???); the adversary's background knowledge covers other users' passwords (abc123, abc123, 12345, 3 other), shown alongside frequency-list counts (2 2 2 2 1 1 1).]
Differential Privacy (Dwork et al.)
Definition: A (randomized) algorithm $A$ preserves $(\epsilon, \delta)$-differential privacy if for any subset $S \subseteq \mathrm{Range}(A)$ of possible outcomes we have
$\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$
for any pair of adjacent password frequency lists $f$ and $f'$, i.e., $\|f - f'\|_1 = 1$, where $\|f - f'\|_1 = \sum_i |f_i - f'_i|$.
• $\epsilon$: small constant (e.g., $\epsilon = 0.5$)
• $\delta$: negligibly small value (e.g., $\delta = 2^{-100}$)
• $f$: original password frequency list
• $f'$: remove Alice's password from dataset
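To make the adjacency condition concrete, the sketch below removes one user's password from a frequency list and checks that the L1 distance between the two sorted lists is 1. The helper names `l1_distance` and `remove_one_user` are illustrative and not taken from the slides; shorter lists are padded with zeros before comparison.

```python
def l1_distance(f, g):
    """L1 distance between two frequency lists, padding the shorter with zeros."""
    n = max(len(f), len(g))
    f = list(f) + [0] * (n - len(f))
    g = list(g) + [0] * (n - len(g))
    return sum(abs(a - b) for a, b in zip(f, g))

def remove_one_user(f, index):
    """Frequency list after one user of the password at `index` leaves."""
    g = list(f)
    g[index] -= 1
    # Drop passwords nobody uses any more and restore non-increasing order.
    return sorted((count for count in g if count > 0), reverse=True)

f = [2, 2, 1]                         # original list
f_prime = remove_one_user(f, 0)       # Alice's password removed
print(f_prime)                        # [2, 1, 1]
print(l1_distance(f, f_prime))        # 1, so f and f' are adjacent
```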
Differential Privacy (Example)
[Figure: frequency lists $f$ and $f'$ ($f$ minus Alice's password), with $S$ the subset of all outcomes potentially harmful to Alice.]
$\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$
Differential Privacy (Example)
Intuition: Alice won't be harmed because her password was included in the dataset.
$\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$
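Main Technical Result
Theorem: There is a computationally efficient algorithm $A$ such that $A$ preserves $(\epsilon, \delta)$-differential privacy and, except with probability $\delta$, $A(f)$ outputs $\tilde{f}$ s.t.
$\frac{\|f - \tilde{f}\|_1}{N} = \frac{1}{N} \sum_{i} |f_i - \tilde{f}_i| \le O\left(\frac{1}{\epsilon \sqrt{N}} + \frac{\ln(1/\delta)}{\epsilon N}\right).$
$\mathrm{Time}(A) = \mathrm{Space}(A) = O\left(\frac{N \sqrt{N}}{\epsilon} + \frac{N \ln(1/\delta)}{\epsilon}\right)$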
Main Tool: Exponential Mechanism [MT07]
Input: $f$
Output: $\tilde{f}$ with $\Pr[\mathcal{E}(f) = \tilde{f}] \propto \exp\left(-\frac{\epsilon \|f - \tilde{f}\|_1}{2}\right)$
Assigns very small probability to inaccurate outcomes.
Theorem [MT07]: The exponential mechanism preserves $(\epsilon, 0)$-differential privacy.
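For very small inputs the exponential mechanism can be run by brute force, which may help make the definition concrete. The sketch below is a naive, exponential-time illustration and not the efficient algorithm described in this talk. It assumes a weight of exp(-ε·‖f − f̃‖₁ / 2) per outcome (the standard form for a utility with sensitivity 1) and a fixed, public output range: every partition of every integer up to `max_total`, a hypothetical parameter introduced here so that the range does not depend on the input.

```python
import math
import random

def partitions_up_to(max_total, max_part=None):
    """Yield every non-increasing list of positive integers with sum <= max_total."""
    if max_part is None:
        max_part = max_total
    yield []
    for first in range(1, min(max_total, max_part) + 1):
        for rest in partitions_up_to(max_total - first, first):
            yield [first] + rest

def l1_distance(f, g):
    """L1 distance between two lists, padding the shorter one with zeros."""
    n = max(len(f), len(g))
    f = list(f) + [0] * (n - len(f))
    g = list(g) + [0] * (n - len(g))
    return sum(abs(a - b) for a, b in zip(f, g))

def outcome_probabilities(f, eps, max_total):
    """Return (outcomes, probabilities) of the brute-force exponential mechanism."""
    outcomes = list(partitions_up_to(max_total))
    weights = [math.exp(-eps * l1_distance(f, g) / 2) for g in outcomes]
    total = sum(weights)
    return outcomes, [w / total for w in weights]

def exponential_mechanism(f, eps, max_total, rng=random):
    """Sample one perturbed frequency list."""
    outcomes, probs = outcome_probabilities(f, eps, max_total)
    return rng.choices(outcomes, weights=probs)[0]

print(exponential_mechanism([2, 1, 1], eps=1.0, max_total=6))
```

Enumerating all partitions is only feasible for toy sizes, which is why an efficient sampler is needed for N in the tens of millions.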
Analysis: Exponential Mechanism
Input: $f$
Output: $\tilde{f}$ with $\Pr[\mathcal{E}(f) = \tilde{f}] \propto \exp\left(-\frac{\epsilon \|f - \tilde{f}\|_1}{2}\right)$
Assigns very small probability to inaccurate outcomes.
Theorem [HR18]: There are $e^{O(\sqrt{N})}$ partitions of the integer $N$.
Union Bound ⇒ $\frac{\|f - \tilde{f}\|_1}{N} \le O\left(\frac{1}{\epsilon \sqrt{N}}\right)$ with high probability.
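The union-bound step can be spelled out roughly as follows. This is a sketch under two assumptions that go beyond the slide text: the output range is taken to be the partitions of $N$ (as in the counting theorem), and each outcome has weight $\exp(-\epsilon \|f - \tilde{f}\|_1 / 2)$. The constant $c$ stands for the one hidden in the $e^{O(\sqrt{N})}$ count.

```latex
% Normalizer: f itself is a valid outcome with weight e^0 = 1, so Z >= 1.
\[
  \Pr[\mathcal{E}(f) = \tilde{f}]
  = \frac{e^{-\epsilon \|f - \tilde{f}\|_1 / 2}}{Z}
  \le e^{-\epsilon \|f - \tilde{f}\|_1 / 2},
  \qquad Z \ge 1 .
\]
% Union bound over the at most e^{c \sqrt{N}} partitions at distance >= d:
\[
  \Pr\bigl[\|f - \mathcal{E}(f)\|_1 \ge d\bigr]
  \le e^{c\sqrt{N}} \cdot e^{-\epsilon d / 2} .
\]
% Choosing d = 2(c+1)\sqrt{N} / \epsilon makes the right-hand side e^{-\sqrt{N}}:
\[
  \Pr\Bigl[\tfrac{\|f - \mathcal{E}(f)\|_1}{N} \ge \tfrac{2(c+1)}{\epsilon\sqrt{N}}\Bigr]
  \le e^{-\sqrt{N}} ,
  \quad\text{i.e.}\quad
  \frac{\|f - \tilde{f}\|_1}{N} \le O\!\left(\frac{1}{\epsilon\sqrt{N}}\right)
  \text{ with high probability.}
\]
```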
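Analysis: Exponential Mechanism
Input: $f$
Output: $\tilde{f}$ with $\Pr[\mathcal{E}(f) = \tilde{f}] \propto \exp\left(-\frac{\epsilon \|f - \tilde{f}\|_1}{2}\right)$
Assigns very small probability to inaccurate outcomes.
Theorem: $\frac{\|f - \tilde{f}\|_1}{N} \le O\left(\frac{1}{\epsilon \sqrt{N}}\right)$ with high probability.
Theorem [MT07]: The exponential mechanism preserves $(\epsilon, 0)$-differential privacy.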
(e.g., [U13])
But, we did run the exponential mechanism
Theorem: There is an efficient algorithm $A$ to sample from a distribution that is $\delta$-close to the exponential mechanism $\mathcal{E}$ over integer partitions. The algorithm uses time and space $O\left(\frac{N \sqrt{N}}{\epsilon} + \frac{N \ln(1/\delta)}{\epsilon}\right)$.
Key Intuition: $e^{-\frac{\epsilon}{2} \sum_{i} |f_i - \tilde{f}_i|} = e^{-\frac{\epsilon}{2} \sum_{i \le t} |f_i - \tilde{f}_i|} \times e^{-\frac{\epsilon}{2} \sum_{i > t} |f_i - \tilde{f}_i|}$. Suggests potential recurrence relationships.
Key Idea 1: Novel dynamic programming algorithm to compute weights $W_{i,k}$ such that $\Pr[\tilde{f}_i = k \mid \tilde{f}_{i-1}] = \frac{W_{i,k}}{\sum_{t=0}^{\tilde{f}_{i-1}} W_{i,t}}$.
Key Idea 2: Allow $A$ to ignore a partition $\tilde{f}$ if $\|f - \tilde{f}\|_1$ is very large.
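The factorization and the conditional-probability formula on this slide suggest a suffix-weight dynamic program. The sketch below is one simple way to realize that idea for toy sizes. It is not the authors' algorithm: it omits the pruning of far-away partitions and the δ-approximation. It assumes per-entry weights exp(-ε·|f_i − f̃_i| / 2) and a fixed output range of non-increasing sequences with `length` entries, each at most `cap`; both parameters are hypothetical and introduced only for illustration.

```python
import math
import random

def suffix_weights(f, eps, length, cap):
    """W[i][k]: total weight of all non-increasing completions of positions
    i..length-1 when position i holds the value k (0 <= k <= cap)."""
    if len(f) > length:
        raise ValueError("f must have at most `length` entries")
    padded = list(f) + [0] * (length - len(f))
    W = [[0.0] * (cap + 1) for _ in range(length)]
    for i in range(length - 1, -1, -1):
        prefix = 0.0                      # running sum of W[i+1][0..k]
        for k in range(cap + 1):
            if i + 1 < length:
                prefix += W[i + 1][k]
                tail = prefix
            else:
                tail = 1.0                # last position: nothing follows
            W[i][k] = math.exp(-eps * abs(padded[i] - k) / 2) * tail
    return W

def sample_partition(f, eps, length, cap, rng=random):
    """Sample entry by entry: position i takes value k <= previous entry
    with probability W[i][k] / sum(W[i][0..previous])."""
    W = suffix_weights(f, eps, length, cap)
    out, bound = [], cap
    for i in range(length):
        k = rng.choices(range(bound + 1), weights=W[i][:bound + 1])[0]
        out.append(k)
        bound = k                         # later entries may not exceed k
    return out

print(sample_partition([2, 1, 1], eps=1.0, length=4, cap=3))
```

The conditional probabilities telescope, so each sequence is drawn with probability proportional to the product of its per-entry weights. The table has `length × (cap + 1)` entries, which hints at why space becomes the bottleneck at scale.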
Practical Challenge #1
• Space is the limiting factor: $N$ = 70 million, $\epsilon = 0.02$
  $\left(\frac{N \sqrt{N}}{\epsilon} + \frac{N \ln(1/\delta)}{\epsilon}\right) \times (8 \text{ bytes}) \approx 200\ \text{TB}$
• Workaround: initial pruning phase to identify the relevant subset of the DP table for sampling.
• Running time: ≈ 12 hours on this laptop
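The space estimate can be sanity-checked with a few lines of arithmetic. The slide does not state δ for this calculation, so δ = 2^-100 (the example value from the definition slide) is assumed here, and the constant hidden in the O(·) bound is taken to be 1.

```python
import math

N = 70_000_000           # number of users
eps = 0.02               # privacy parameter from the slide
delta = 2.0 ** -100      # assumption: example value, not stated on this slide
BYTES_PER_ENTRY = 8

# Table entries, taking the hidden constant in the O(.) bound to be 1.
entries = N * math.sqrt(N) / eps + N * math.log(1 / delta) / eps
terabytes = entries * BYTES_PER_ENTRY / 1e12
print(f"{terabytes:.0f} TB")   # on the order of 200 TB
```

The first term dominates: the ln(1/δ) term contributes under one percent of the total at these parameters.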