Controlling False Discovery Rate Privately
Weijie Su, University of Pennsylvania


  1. Controlling False Discovery Rate Privately Weijie Su University of Pennsylvania NIPS, Barcelona, December 9, 2016 Joint work with Cynthia Dwork and Li Zhang

  2. Living in the Big Data world

  3–4. Privacy loss
• Second Netflix challenge canceled
• AOL search data leak
• Inferring the presence of an individual from minor allele frequencies [Homer et al ’08]

  5–7. This talk: privacy-preserving multiple testing
A hypothesis H among H1, H2, . . . , Hm could be
• Is the SNP associated with diabetes?
• Does the drug affect autism?
Goal
• Preserve privacy
• Control false discovery rate (FDR)
Application
• Genome-wide association studies
• A/B testing

  8. Outline
1. Warm-ups: FDR and BHq procedure; differential privacy
2. Introducing PrivateBHq
3. Proof of FDR control

  9. Two types of errors

                  Not reject        Reject            Total
  Null is true    True negative     False positive    m0
  Null is false   False negative    True positive     m1
  Total                                               m
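The table's four outcome counts can be tallied directly from ground truth and test decisions. A minimal sketch we added (the data below are invented for illustration, not from the talk):

```python
# Tally the 2x2 table of testing outcomes for m = 5 hypotheses.
# Ground truth and decisions are made up for illustration.
null_is_true = [True, True, False, True, False]
rejected     = [False, True, True, True, True]

fp = sum(n and r for n, r in zip(null_is_true, rejected))          # false positives
tp = sum(not n and r for n, r in zip(null_is_true, rejected))      # true positives
tn = sum(n and not r for n, r in zip(null_is_true, rejected))      # true negatives
fn = sum(not n and not r for n, r in zip(null_is_true, rejected))  # false negatives

assert fp + tp + tn + fn == len(null_is_true)  # the four cells partition m
```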

  10–12. False discovery rate (FDR)
FDR := E[ # false discoveries / # discoveries ]
(figure: a true model next to an estimated model; of the 300 discoveries, 100 are true and 200 are false, so the realized false discovery proportion is 200/(100 + 200))
• Wish FDR ≤ q (often q = 0.05 or 0.1)
• Proposed by Benjamini and Hochberg ’95
• 35,490 citations as of yesterday

  13–14. Why FDR?

  15–16. FDR addresses reproducibility

  17. How to control FDR?

  18–22. p-values of hypotheses
p-value: the probability of finding the observed, or more extreme, results when the null hypothesis of a study question is true
• Uniform in [0, 1] (or stochastically larger) under a true null
H0: the drug does not lower blood pressure
• If p = 0.5, no evidence
• If p = 0.01, there is evidence... or is there?
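The uniformity claim above can be checked empirically. Here is a small simulation we added (not from the slides), using a one-sided z-test on unit-variance Gaussian data under H0: mean = 0:

```python
import math
import random

def z_test_pvalue(x):
    """One-sided p-value for H0: mean = 0 vs. mean > 0, unit-variance data."""
    z = sum(x) / math.sqrt(len(x))            # test statistic, N(0,1) under H0
    return 0.5 * math.erfc(z / math.sqrt(2))  # standard normal upper tail

random.seed(0)
pvals = [z_test_pvalue([random.gauss(0, 1) for _ in range(30)])
         for _ in range(2000)]

# Under the null, roughly a fraction q of p-values fall below any q in [0, 1].
frac_below = sum(p < 0.1 for p in pvals) / len(pvals)
mean_pval = sum(pvals) / len(pvals)
```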

  23–26. Benjamini-Hochberg procedure (BHq)
Let p1, p2, . . . , pm be the p-values of m hypotheses
◮ Sort p(1) ≤ · · · ≤ p(m)
◮ Draw the rank-dependent threshold qj/m
◮ Reject hypotheses below the cutoffs qj/m
◮ Under independence, FDR ≤ q
(plot: p-values against sorted index, with the threshold line qj/m)
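The step-up procedure above can be sketched in a few lines. This is our own minimal (non-private) implementation of BHq; the function and variable names are hypothetical:

```python
def benjamini_hochberg(pvals, q):
    """Return the indices of hypotheses rejected by BHq at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p-value
    # Find the largest rank k with p_(k) <= q*k/m, then reject the k smallest.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])

# The two smallest p-values clear the rank-dependent cutoff q*j/m; the rest do not.
rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6, 0.9], q=0.05)
```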

  27. What is privacy?
• My response had little impact on the released results
• No adversary can learn much information about me from the released results
• Anonymity may not work
• Is the Benjamini-Hochberg procedure (BHq) privacy-preserving?

  28–29. BHq is sensitive to perturbations
(plot: p-values against sorted index, before and after a perturbation)

  30–32. A concrete foundation of privacy
Let M be a (random) data-releasing mechanism
Differential privacy (Dwork, McSherry, Nissim, Smith ’06): M is called (ε, δ)-differentially private if for all databases D and D′ differing in one individual, and all S ⊆ Range(M),
P(M(D) ∈ S) ≤ e^ε · P(M(D′) ∈ S) + δ
• The probability space is over the randomness of M
• If δ = 0 (pure privacy), e^{−ε} ≤ P(M(D) ∈ S) / P(M(D′) ∈ S) ≤ e^ε

  33. Differential privacy, pictorially (Dwork, McSherry, Nissim, Smith ’06)
For all neighboring databases D and D′, and all S ⊆ Range(M), P(M(D) ∈ S) ≤ e^ε · P(M(D′) ∈ S) + δ
(figure: the response distributions of M(D) and M(D′) have pointwise ratio bounded by e^ε; “bad responses” carry total probability at most δ)

  34–35. An addition to a vast literature
• Counts, linear queries, histograms, contingency tables
• Location and spread
• Dimension reduction (PCA, SVD), clustering
• Support vector machines
• Sparse regression, Lasso, logistic regression
• Gradient descent
• Boosting, multiplicative weights
• Combinatorial optimization, mechanism design
• Kalman filtering
• Statistical query learning model, PAC learning
• FDR control

  36. Laplace noise
Lap(b) has density exp(−|x|/b)/(2b)

  37–38. Achieving (ε, 0)-differential privacy: a vignette
How many members of the House of Representatives voted for Trump?
• Sensitivity is 1
• Add symmetric noise Lap(1/ε) to the count
How many albums of Taylor Swift were bought in total by people in this room?
• Sensitivity is 5
• Add symmetric noise Lap(5/ε) to the count
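The recipe above (add Lap(sensitivity/ε) noise to a count) can be sketched as follows. The sampler and function names are ours, and the roll-call numbers are invented:

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling: for U uniform on (-1/2, 1/2),
    # -scale * sign(U) * ln(1 - 2|U|) has the Lap(scale) distribution.
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon, sensitivity=1.0):
    """Release a counting query with (epsilon, 0)-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(sensitivity / epsilon)

random.seed(0)
votes = [True] * 240 + [False] * 195           # invented roll call
releases = [private_count(votes, lambda v: v, epsilon=1.0) for _ in range(5000)]
avg_release = sum(releases) / len(releases)    # the noise averages out near 240
```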

  39. Outline
1. Warm-ups: FDR and BHq procedure; differential privacy
2. Introducing PrivateBHq
3. Proof of FDR control

  40. Sensitivity of p-values
• Additive noise can kill signals when p-values are small
• Solution: take the logarithm of p-values
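A quick numeric illustration of the two bullets (our own, with an assumed noise draw of typical magnitude for Lap(1/ε)): on the raw scale the noise swamps a tiny p-value, while on the log scale it only nudges its order of magnitude:

```python
import math

p = 1e-8        # a strongly significant p-value
noise = 0.3     # an assumed typical-size draw from Lap(1/eps), eps around 3

raw_scale = p + noise                       # about 0.3: the signal is destroyed
log_scale = math.exp(math.log(p) + noise)   # about 1.35e-8: still clearly tiny
```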
