Post hoc bounds on false positives using reference families
Pierre Neuvial
CNRS and Institut de Mathématiques de Toulouse (France)
Joint work with Gilles Blanchard, Guillermo Durand, Etienne Roquain, Marie Perrot-Dockès
https://arxiv.org/abs/1910.11575
Funded by ANR SansSouci
1 / 23
Case study: differential expression in genomics
Example: Leukemia data set. Chiaretti et al., Clinical Cancer Research, 11(20):7209–7219, 2005
Data: gene expression measurements (mRNA)
$m = 12625$ genes
$n = 79$ cancer patients in two subgroups:
BCR/ABL: 37 patients
NEG: 42 patients
Question: find genes whose average expression differs between the two groups
2 / 23
Leukemia data set: volcano plot 3 / 23
Notation
$H = \{1, \dots, m\}$: $m$ null hypotheses to be tested
$H_0 \subset H$: true null hypotheses, $H_1 = H \setminus H_0$
$m_0 = |H_0|$, $\pi_0 = m_0 / m$
$(p_i)_{1 \le i \le m}$: $p$-values
$R \subset H$: a set of rejected hypotheses
$|R \cap H_0|$: number of "false positives" within $R$
Goal: post hoc inference
Find a $(1 - \alpha)$-level post hoc upper bound on $|S \cap H_0|$, i.e. $V_\alpha$ such that
$P(\forall S \subset \{1, \dots, m\},\ |S \cap H_0| \le V_\alpha(S)) \ge 1 - \alpha$
Some related works
Genovese & Wasserman, Ann. Stat., 2006
Goeman & Solari, Stat. Sci., 2011
Katsevich and Ramdas, arXiv:1803.06790
Meijer, Krebs, and Goeman, SAGMB, 2015
4 / 23
Starting point: post hoc bound via Simes' inequality
Under PRDS, Simes' inequality implies
$P(\forall k,\ |R_k \cap H_0| \le k - 1) \ge 1 - \alpha$, where $R_k = \{i : p_i \le \alpha k / m\}$
Corollary: $(1 - \alpha)$ post hoc bound on $|S \cap H_0|$
$\bar{V}_\alpha(S) = \min_{1 \le k \le |S|} \left\{ \sum_{i \in S} 1\{p_i > \alpha k / m\} + k - 1 \right\}$
Recovers the bound of Goeman and Solari, Stat. Science, 2011.
Proof: for every $k$,
$|S \cap H_0| = |S \cap R_k^c \cap H_0| + |S \cap R_k \cap H_0| \le |S \cap R_k^c| + |R_k \cap H_0|$
5 / 23
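The Simes-based bound can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the sansSouci implementation: the function name `simes_posthoc_bound` and its arguments are hypothetical, and the caller is assumed to pass only the p-values of the selected set $S$ together with the total number of hypotheses $m$.

```python
import numpy as np

def simes_posthoc_bound(pvalues_S, m, alpha):
    """Upper bound on the number of false positives in S (hypothetical helper).

    pvalues_S: p-values of the hypotheses in the selected set S
    m:         total number of tested hypotheses
    alpha:     target risk, so the bound holds with probability >= 1 - alpha
    """
    p = np.asarray(pvalues_S, dtype=float)
    s = p.size
    if s == 0:
        return 0
    k = np.arange(1, s + 1)                 # k = 1, ..., |S|
    thresholds = alpha * k / m              # Simes thresholds alpha * k / m
    # For each k, count the p-values of S strictly above the k-th threshold.
    n_above = (p[None, :] > thresholds[:, None]).sum(axis=1)
    # min over k of  #{i in S : p_i > alpha k / m} + k - 1
    return int((n_above + k - 1).min())
```

For instance, `simes_posthoc_bound([0.001, 0.005, 0.5], m=10, alpha=0.1)` evaluates the three candidate values 1, 2 and 3 (for $k = 1, 2, 3$) and returns their minimum, 1.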
Leukemia data set: volcano plot (Simes-based bound) 6 / 23
Post hoc control via reference families
7 / 23
Joint Error Rate control implies post hoc bound
Definition: JER controlling family $\mathfrak{R} = (R_k, \zeta_k)_k$ such that
$P(\forall k,\ |R_k \cap H_0| \le \zeta_k) \ge 1 - \alpha$
Simes: $R_k = \{i : p_i \le \alpha k / m\}$, $\zeta_k = k - 1$
Property: interpolation yields valid $(1 - \alpha)$ post hoc bounds
$V^*_\alpha(S) = \max\{|S \cap A| : A \text{ s.t. } \forall k,\ |R_k \cap A| \le \zeta_k\}$
$\bar{V}_\alpha(S) = \min_{1 \le k \le |S|} \{ |S \cap R_k^c| + \zeta_k \}$
Simes: $V^*_\alpha(S) = \bar{V}_\alpha(S) = \min_{1 \le k \le |S|} \left\{ \sum_{i \in S} 1\{p_i > \alpha k / m\} + k - 1 \right\}$
Main question: how to obtain JER control?
8 / 23
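The interpolation bound $\bar{V}_\alpha$ only needs the reference sets and their $\zeta_k$, so it can be sketched for an arbitrary family. The sketch below is illustrative: the name `interpolation_bound` and the representation of the family as two parallel lists are assumptions. It minimizes over every pair $(R_k, \zeta_k)$ it is given and caps the result at the trivial bound $|S|$; to reproduce the range $1 \le k \le |S|$ used on the slide, pass only those members of the family.

```python
def interpolation_bound(S, regions, zetas):
    """Post hoc bound min_k { |S \\ R_k| + zeta_k }, capped at |S| (hypothetical helper).

    S:       iterable of selected hypothesis indices
    regions: list of reference sets R_k (iterables of indices)
    zetas:   list of integers zeta_k, same length as regions
    """
    S = set(S)
    best = len(S)                            # trivial bound: everything in S is a false positive
    for R_k, zeta_k in zip(regions, zetas):
        outside = len(S - set(R_k))          # |S ∩ R_k^c|: counted as potential false positives
        best = min(best, outside + zeta_k)   # at most zeta_k false positives inside R_k
    return best
```

Each term is valid on the JER event by the same two-line argument as the Simes proof: hypotheses of $S$ outside $R_k$ are counted in full, and those inside contribute at most $\zeta_k$ false positives.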
Contributions: post hoc bounds in two dual cases
$p$-value level sets
Fixed $\zeta_k$ ($= k - 1$)
$R_k = R_k(X)$
JER control = joint control of the $k$-FWER
Structured hypotheses
Fixed $R_k$, given by prior knowledge
Find $\zeta_k = \zeta_k(X)$
JER control = joint estimation of $|R_k \cap H_0|$
9 / 23
Case 1: Fixed $\zeta_k$, random $R_k$
Blanchard, N., Roquain: Post Hoc Confidence Bounds on False Positives Using Reference Families, Annals of Statistics, to appear.
R package sansSouci
10 / 23
Setup: $\zeta_k = k - 1$, $R_k = \{i : p_i \le t_k(\lambda)\}$
Properties
The $R_k$ are nested $\Rightarrow$ $V^*_\alpha(S) = \bar{V}_\alpha(S)$
For the reference family $(R_k, \zeta_k)$: JER control holds for any $\lambda$ such that
$P(\exists k,\ p_{(k:H_0)} \le t_k(\lambda)) \le \alpha$
Examples
$\lambda = \alpha$ for $t_k(\lambda) = \lambda k / m$ under PRDS
$\lambda = \alpha$ for $t_k(\lambda)$ = $\lambda$-quantile of $\mathrm{Beta}(k + 1, m - k + 1)$ under independence
Adaptivity to dependence?
11 / 23
Adaptivity to dependence
Goal: estimate the largest $\lambda$ such that
$P(\exists k,\ p_{(k:H_0)} \le t_k(\lambda)) \le \alpha$
Tool: randomization, e.g. class label permutation in multiple two-sample tests
Example: $t_k(\lambda)$ = $\lambda$-quantile of $\mathrm{Beta}(k + 1, m - k + 1)$
12 / 23
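One way to picture $\lambda$-calibration by randomization is sketched below for the linear template $t_k(\lambda) = \lambda k / m$ (not the Beta template of the example). Everything here is an assumption made for illustration: the function name `calibrate_lambda_linear`, the input format (a $B \times m$ array whose rows are p-value vectors recomputed after $B$ random class-label permutations), the use of all $m$ hypotheses in place of the unknown $H_0$, and the quantile convention. It is not the sansSouci API.

```python
import math
import numpy as np

def calibrate_lambda_linear(null_pvalues, alpha):
    """Calibrate lambda for the linear template t_k(lambda) = lambda * k / m.

    null_pvalues: array of shape (B, m); row b holds the m p-values obtained
                  after the b-th permutation of the class labels (assumed input).
    alpha:        target risk.
    """
    P = np.sort(np.asarray(null_pvalues, dtype=float), axis=1)  # order statistics per row
    B, m = P.shape
    k = np.arange(1, m + 1)
    # "exists k: p_(k) <= lambda * k / m"  is equivalent to  min_k p_(k) * m / k <= lambda
    pivotal = (P * m / k).min(axis=1)
    n_allowed = math.floor(alpha * B)       # number of permutations allowed to violate
    if n_allowed == 0:
        return 0.0                          # too few permutations for this alpha
    # Largest observed value such that at most n_allowed rows fall at or below it
    # (ties between pivotal statistics are ignored in this sketch).
    return float(np.sort(pivotal)[n_allowed - 1])
```

The design choice is to reduce the simultaneous event over $k$ to a single pivotal statistic per permutation, so that calibration becomes an empirical quantile computation.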
Leukemia data: confidence bounds on $|S \cap H_1|$
13 / 23
Leukemia data: confidence bounds on $\mathrm{FDP} = |S \cap H_0| / (|S| \vee 1)$
14 / 23
Leukemia data set: volcano plot (Simes-based bound) 15 / 23
Leukemia data set: volcano plot (after $\lambda$-calibration)
16 / 23
Case 2: Fixed $R_k$, random $\zeta_k$
Durand, Blanchard, N., Roquain: Post hoc false positive control for structured hypotheses, Scandinavian Journal of Statistics (2020). arxiv:1807.01470
R package sansSouci
17 / 23
Setup: Fixed $R_k$, random $\zeta_k$
Forest assumption: the $(R_k)_{k = 1 \dots K}$ are either nested or disjoint
Questions:
1. How to choose $\zeta_k(X)$ yielding JER control?
2. How to estimate the associated post hoc bound $V^*_\alpha$?
18 / 23
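The forest assumption is easy to check for a concrete family: every pair of regions must be disjoint or nested. A minimal sketch follows; the name `is_forest` is hypothetical, and regions are assumed to be given as collections of hypothesis indices.

```python
from itertools import combinations

def is_forest(regions):
    """Return True if every pair of regions is either disjoint or nested."""
    sets = [frozenset(r) for r in regions]
    for a, b in combinations(sets, 2):
        # a <= b tests inclusion, so identical regions count as nested.
        if not (a.isdisjoint(b) or a <= b or b <= a):
            return False
    return True
```

For example, `[{1, 2, 3, 4}, {1, 2}, {3, 4}, {5}]` satisfies the assumption, whereas `[{1, 2}, {2, 3}]` does not, because the two regions overlap without one containing the other.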
1. JER control
Device: DKWM inequality
Dvoretzky, Kiefer, and Wolfowitz (1956) Ann. Math. Stat.
Massart (1990) Ann. Prob.
Proposition: under independence, JER control is obtained for
$\zeta_k(X) = |R_k| \wedge \min_{t \in [0,1)} \left\lfloor \frac{C}{2(1 - t)} + \left( \frac{C^2}{4(1 - t)^2} + \frac{\sum_{i \in R_k} 1\{p_i(X) > t\}}{1 - t} \right)^{1/2} \right\rfloor^2$,
where $C = \sqrt{\frac{1}{2} \log\left(\frac{K}{\alpha}\right)}$
19 / 23
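The DKWM-based $\zeta_k$ can be evaluated numerically as sketched below. The function name `dkwm_zeta` and its arguments are hypothetical. The sketch relies on one observation that is not on the slide: between two consecutive p-values the count $\sum 1\{p_i > t\}$ is constant while $1/(1 - t)$ increases, so the minimum over $t \in [0, 1)$ is attained at $t = 0$ or at one of the p-values below 1, and only those candidates are examined.

```python
import math
import numpy as np

def dkwm_zeta(pvalues_Rk, K, alpha):
    """DKWM-based bound zeta_k on the number of true nulls in R_k (hypothetical helper).

    pvalues_Rk: p-values of the hypotheses in region R_k
    K:          number of regions in the reference family
    alpha:      target risk
    """
    p = np.asarray(pvalues_Rk, dtype=float)
    n = p.size
    C = math.sqrt(0.5 * math.log(K / alpha))
    # Candidate values of t: 0 and every p-value strictly below 1.
    candidates = np.unique(np.concatenate(([0.0], p[p < 1.0])))
    best = n                                 # the "|R_k| ∧ ..." part of the formula
    for t in candidates:
        n_above = int((p > t).sum())         # sum over i in R_k of 1{p_i > t}
        root = C / (2 * (1 - t)) + math.sqrt(
            C ** 2 / (4 * (1 - t) ** 2) + n_above / (1 - t)
        )
        best = min(best, math.floor(root ** 2))
    return int(best)
```

With three p-values equal to 0, $K = 1$ and $\alpha = 0.05$, only $t = 0$ is examined, the bracket reduces to $C^2 = \frac{1}{2}\log 20 \approx 1.5$, and the function returns 1.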
2. Algorithm to compute $V^*_\alpha$
Proposition: the bound $V^*_\alpha$ is obtained recursively by examining partitions at each possible depth in the forest.
20 / 23
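A recursion of this kind can be sketched as follows. This is an illustration, not the authors' algorithm: the name `forest_bound` and the input format are assumptions. At each region the sketch either keeps the region itself, paying $\zeta_k \wedge |S \cap R_k|$, or splits it into its maximal sub-regions and recurses, counting hypotheses of $S$ not covered by any sub-region in full. The result is the minimum over pairwise-disjoint subfamilies, which is a valid bound on the JER event; that it coincides with $V^*_\alpha$ under the forest assumption is taken from the proposition, not verified here.

```python
def forest_bound(S, regions, zetas):
    """Recursive post hoc bound for a forest-structured family (hypothetical helper)."""
    S = frozenset(S)
    merged = {}                              # duplicate regions: keep the smallest zeta
    for r, z in zip(regions, zetas):
        r = frozenset(r)
        merged[r] = min(merged.get(r, z), z)
    nodes = list(merged.items())

    def best(candidates, scope):
        # Roots: regions not strictly contained in another candidate.
        roots = [(r, z) for r, z in candidates
                 if not any(r < other for other, _ in candidates)]
        total, covered = 0, frozenset()
        for r, z in roots:
            inner = [(r2, z2) for r2, z2 in candidates if r2 < r]
            keep = min(z, len(S & r))        # use the region as a whole
            split = best(inner, r)           # or descend into its sub-regions
            total += min(keep, split)
            covered |= r
        # Hypotheses of S in scope but outside every root count as potential false positives.
        return total + len((S & scope) - covered)

    everything = S.union(*[r for r, _ in nodes])
    return best(nodes, everything)
```

For example, with regions `{1,2,3,4}`, `{1,2}`, `{3,4}`, `{5,6}`, bounds `3, 1, 1, 2` and `S = {1, 2, 3, 5, 7}`, splitting the first region gives 1 + 1 = 2 instead of 3, the last region contributes 1, and hypothesis 7 lies outside every region, so the function returns 4.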
Numerical experiments: Simes vs tree-based methods 21 / 23
Leukemia data set: regional association plot The selection can be done interactively: https://pneuvial.shinyapps.io/posthoc-bounds_ordered-hypotheses/ 22 / 23
Conclusions
Versatile approach to post hoc inference: JER control $\Rightarrow$ post hoc bounds
JER control can be obtained from classical probabilistic inequalities
Fixed $\zeta_k$, random $R_k$: Simes' inequality under PRDS
Fixed $R_k$, random $\zeta_k$: DKWM inequality under independence
Adaptation to dependence: sharper JER control can be obtained by randomization
Extensions
Applications to genomic data analysis, e.g. differential analysis along the genome
Fixed $R_k$, random $\zeta_k$: extension to specific dependence settings
See poster of Marie Perrot-Dockès: "Improving structured post hoc inference via a Hidden Markov Model"
23 / 23