Controlling False Discovery Rate Privately
Weijie Su, University of Pennsylvania
NIPS, Barcelona, December 9, 2016
Joint work with Cynthia Dwork and Li Zhang
Living in the Big Data world 2 / 40
Privacy loss 3 / 40
Privacy loss
• Second Netflix challenge canceled
• AOL search data leak
• Inferring the presence of an individual from minor allele frequencies [Homer et al ’08]
4 / 40
This talk: privacy-preserving multiple testing
Hypotheses H_1, H_2, ..., H_m, where a hypothesis H could be
• Is the SNP associated with diabetes?
• Does the drug affect autism?
Goal
• Preserve privacy
• Control false discovery rate (FDR)
Application
• Genome-wide association studies
• A/B testing
5 / 40
Outline
1. Warm-ups
   • FDR and BHq procedure
   • Differential privacy
2. Introducing PrivateBHq
3. Proof of FDR control
6 / 40
Two types of errors

                 Not reject        Reject            Total
Null is true     True negative     False positive    m_0
Null is false    False negative    True positive     m_1
Total                                                m

7 / 40
False discovery rate (FDR)

[Figure: diagram of the true model vs. the estimated model, with counts 300, 100, and 200]

FDR := E[ # false discoveries / # discoveries ] = 200 / (100 + 200)

• Wish FDR ≤ q (often q = 0.05, 0.1)
• Proposed by Benjamini and Hochberg ’95
• 35,490 citations as of yesterday
8 / 40
Why FDR? 9 / 40
FDR addresses reproducibility 10 / 40
How to control FDR? 11 / 40
p-values of hypotheses
p-value: the probability of finding the observed, or more extreme, results when the null hypothesis of a study question is true
• Uniform in [0, 1] (or stochastically larger) under a true null
H_0: the drug does not lower blood pressure
• If p = 0.5, no evidence
• If p = 0.01, there is evidence... or is there?
12 / 40
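For concreteness (this example is not from the slides), a one-sided p-value for a Gaussian test statistic can be computed as the upper-tail probability under the null; the z-values below are hypothetical.

```python
import math

def one_sided_p_value(z: float) -> float:
    """Upper-tail p-value P(Z >= z) for a standard normal test statistic Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical z-statistics for H_0: the drug does not lower blood pressure
print(one_sided_p_value(0.0))    # 0.5   -> no evidence against the null
print(one_sided_p_value(2.33))   # ~0.01 -> looks like evidence, pending the multiple-testing caveat
```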
Benjamini-Hochberg procedure (BHq)
Let p_1, p_2, ..., p_m be the p-values of m hypotheses
◮ Sort p_(1) ≤ ... ≤ p_(m)
◮ Draw the rank-dependent threshold qj/m
◮ Reject hypotheses below the cutoffs qj/m
◮ Under independence, FDR ≤ q

[Figure: sorted p-values against their sorted index, with the threshold line qj/m]
13 / 40
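Here is a minimal sketch of the step-up rule just described, assuming independent p-values; it is an illustration written for this transcript, not code from the talk.

```python
import numpy as np

def bhq(p_values, q=0.05):
    """Benjamini-Hochberg step-up rule: return indices of rejected hypotheses."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                         # indices sorting the p-values
    thresholds = q * np.arange(1, m + 1) / m      # rank-dependent cutoffs q*j/m
    below = np.nonzero(p[order] <= thresholds)[0]
    if below.size == 0:
        return np.array([], dtype=int)            # no discoveries
    k = int(below.max()) + 1                      # largest j with p_(j) <= q*j/m
    return order[:k]                              # reject the k smallest p-values

# Toy example: 17 roughly uniform (null) p-values plus 3 small ones
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(size=17), [0.001, 0.004, 0.012]])
print(sorted(bhq(p, q=0.1)))
```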
What is privacy?
• My response has little impact on the released results
• No adversary can learn much about me from the released results
• Anonymity may not work
• Is the Benjamini-Hochberg procedure (BHq) privacy-preserving?
14 / 40
BHq is sensitive to perturbations

[Figure: sorted p-values near the BHq threshold line; a small perturbation of the p-values changes which hypotheses are rejected]
15 / 40
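To see how fragile the cutoff can be, consider the following hypothetical construction (not from the talk): every p-value sits just above its own cutoff q*j/m, so nothing is rejected, yet nudging only the largest p-value slightly below q makes the step-up rule reject all m hypotheses.

```python
import numpy as np

def n_rejections(p_values, q):
    """Number of BHq rejections (same step-up rule as in the sketch above)."""
    p = np.sort(np.asarray(p_values, dtype=float))
    ok = np.nonzero(p <= q * np.arange(1, p.size + 1) / p.size)[0]
    return 0 if ok.size == 0 else int(ok.max()) + 1

q, m, eps = 0.1, 20, 1e-4
p = q * np.arange(1, m + 1) / m + eps    # each p-value just above its cutoff
print(n_rejections(p, q))                # 0 rejections

p_perturbed = p.copy()
p_perturbed[-1] = q - eps                # move a single p-value by roughly 2e-4
print(n_rejections(p_perturbed, q))      # 20 rejections
```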
A concrete foundation of privacy
Let M be a (random) data-releasing mechanism

Differential privacy (Dwork, McSherry, Nissim, Smith ’06)
M is called (ε, δ)-differentially private if for all databases D and D′ differing in one individual, and all S ⊆ Range(M),
    P(M(D) ∈ S) ≤ e^ε P(M(D′) ∈ S) + δ

• The probability space is over the randomness of M
• If δ = 0 (pure privacy), e^{−ε} ≤ P(M(D) ∈ S) / P(M(D′) ∈ S) ≤ e^ε
16 / 40
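As a standard illustration of the definition (not part of the talk), randomized response on a single private bit satisfies (ε, 0)-differential privacy, and the ratio bound can be verified directly from the output probabilities.

```python
import math
import random

def randomized_response(bit: int, epsilon: float) -> int:
    """Report the true bit with probability e^eps / (1 + e^eps), otherwise flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if random.random() < p_truth else 1 - bit

eps = math.log(2)                                 # with eps = ln 2 the factor e^eps is 2
p_truth = math.exp(eps) / (1.0 + math.exp(eps))
# Neighboring databases here are bit = 1 versus bit = 0:
# P(output = 1 | bit = 1) / P(output = 1 | bit = 0) = p_truth / (1 - p_truth) = e^eps
print(p_truth / (1.0 - p_truth), math.exp(eps))   # both print 2.0 (up to rounding)
```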
A concrete foundation of privacy

[Figure: pictorial restatement of (ε, δ)-differential privacy: for all neighboring databases D and D′ and all sets of bad responses S ⊆ Range(M), the output probabilities have bounded ratio, P(M(D) ∈ S) ≤ e^ε P(M(D′) ∈ S) + δ]
17 / 40
An addition to a vast literature
• Counts, linear queries, histograms, contingency tables
• Location and spread
• Dimension reduction (PCA, SVD), clustering
• Support vector machines
• Sparse regression, Lasso, logistic regression
• Gradient descent
• Boosting, multiplicative weights
• Combinatorial optimization, mechanism design
• Kalman filtering
• Statistical queries learning model, PAC learning
• FDR control
18 / 40
Laplace noise
Lap(b) has density exp(−|x|/b) / (2b)
19 / 40
Achieving (ε, 0)-differential privacy: a vignette
How many members of the House of Representatives voted for Trump?
• Sensitivity is 1
• Add symmetric noise Lap(1/ε) to the count

How many albums of Taylor Swift were bought in total by people in this room?
• Sensitivity is 5
• Add symmetric noise Lap(5/ε) to the count
20 / 40
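A minimal sketch of the Laplace mechanism used in the vignette; the sensitivities 1 and 5 are from the slide, while the true counts and the value of ε below are made up.

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float,
                      rng: np.random.Generator) -> float:
    """Release true_value + Lap(sensitivity / epsilon), which is (epsilon, 0)-differentially private."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(2016)
# Counting query: one person changes the count by at most 1, so sensitivity = 1.
print(laplace_mechanism(true_value=212.0, sensitivity=1.0, epsilon=0.5, rng=rng))
# Album total, assuming each person contributes at most 5 albums: sensitivity = 5.
print(laplace_mechanism(true_value=37.0, sensitivity=5.0, epsilon=0.5, rng=rng))
```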
Outline
1. Warm-ups
   • FDR and BHq procedure
   • Differential privacy
2. Introducing PrivateBHq
3. Proof of FDR control
21 / 40
Sensitivity of p-values
• Additive noise can kill signals when p-values are small
• Solution: take the logarithm of p-values
22 / 40
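The sketch below illustrates the slide's point with made-up noise scales; it is not the actual PrivateBHq mechanism, only the intuition that noise added on the log scale perturbs a small p-value multiplicatively rather than wiping it out.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 1e-8                                 # a strongly significant p-value

# Additive Laplace noise on the raw p-value drowns the signal entirely...
noisy_p = p + rng.laplace(scale=0.1)

# ...while the same kind of noise on log(p) only rescales it multiplicatively.
noisy_log_p = np.log(p) + rng.laplace(scale=0.5)

print(noisy_p)               # on the order of the noise: the tiny p-value is lost
print(np.exp(noisy_log_p))   # still a very small number: the signal survives
```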