STK-IN4300 Statistical Learning Methods in Data Science
Lecture 13
Riccardo De Bin (debin@math.uio.no)
Outline of the lecture

Feature Assessment when $p \gg N$
- Feature Assessment and Multiple Testing Problem
- The false discovery rate

Stability Selection
- Introduction
- Selection probability
- Stability path
- Choice of regularization
Feature Assessment when $p \gg N$: multiple testing problem

In the previous lecture we:
- talked about the $p \gg N$ framework;
- focused on the construction of prediction models.

A more basic goal:
- assess the significance of the $M$ variables;
  - in this lecture $M$ is the number of variables (as in the book);
- e.g., identify the genes most related to cancer.
Assessing the significance of a variable can be done:
- as a by-product of a multivariate model:
  - selection by a procedure with the variable selection property;
  - absolute value of a regression coefficient in the lasso;
  - variable importance plots (boosting, random forests, ...);
- evaluating the variables one-by-one:
  - univariate tests;
  - this leads to multiple hypothesis testing.
Consider the data from Rieger et al. (2004):
- a study on the sensitivity of cancer patients to ionizing radiation treatment;
- oligonucleotide microarray data ($M = 12625$ genes);
- $N = 58$ patients:
  - 44 patients with a normal reaction;
  - 14 patients who had a severe reaction.
The simplest way to identify significant genes: a two-sample t-statistic for each gene,
$$ t_j = \frac{\bar{x}_{2j} - \bar{x}_{1j}}{se_j}, $$
where
- $\bar{x}_{kj} = \sum_{i \in C_k} x_{ij} / N_k$;
- $C_k$ is the set of indexes of the $N_k$ observations in group $k$;
- $se_j = \hat{\sigma}_j \sqrt{\frac{1}{N_1} + \frac{1}{N_2}}$;
- $\hat{\sigma}_j^2 = \frac{1}{N_1 + N_2 - 2} \left( \sum_{i \in C_1} (x_{ij} - \bar{x}_{1j})^2 + \sum_{i \in C_2} (x_{ij} - \bar{x}_{2j})^2 \right)$.
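A minimal sketch of this computation in Python (not part of the lecture; the expression matrix and the group labels are simulated placeholders with the same dimensions as the Rieger et al. data):

```python
import numpy as np

# Vectorized two-sample t-statistics, one per gene.
# X (N x M expression matrix) and `group` are simulated placeholders.
rng = np.random.default_rng(0)
N1, N2, M = 44, 14, 12625
X = rng.normal(size=(N1 + N2, M))
group = np.array([0] * N1 + [1] * N2)

X1, X2 = X[group == 0], X[group == 1]
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# pooled variance estimate with N1 + N2 - 2 degrees of freedom
sigma2 = (((X1 - m1) ** 2).sum(axis=0) +
          ((X2 - m2) ** 2).sum(axis=0)) / (N1 + N2 - 2)
se = np.sqrt(sigma2 * (1 / N1 + 1 / N2))
t = (m2 - m1) / se          # array of M t-statistics
```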
[Figure: histogram of the 12625 two-sample t-statistics.]
From the histogram of the 12625 t-statistics:
- the values range from $-4.7$ to $5.0$;
- assuming $t_j \sim N(0, 1)$, significance at the 5% level when $|t_j| \ge 2$;
- in the example, 1189 genes have $|t_j| \ge 2$.

However:
- out of 12625 genes, many are significant by chance;
- supposing independence (which is not true):
  - expected number of falsely significant genes: $12625 \cdot 0.05 = 631.25$;
  - standard deviation: $\sqrt{12625 \cdot 0.05 \cdot (1 - 0.05)} \approx 24.5$;
- the observed 1189 is way out of this range.
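A quick check of this arithmetic (a sketch; under the independence assumption the count of false positives is Binomial($M$, $\alpha$)):

```python
import math

# Under independence, the number of falsely significant genes is
# Binomial(M, alpha); the observed 1189 is then wildly out of range.
M, alpha = 12625, 0.05
mean = M * alpha                           # 631.25
sd = math.sqrt(M * alpha * (1 - alpha))    # ~24.5
print((1189 - mean) / sd)                  # ~22.8 standard deviations
```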
Without assuming normality, use a permutation test:
- perform $K = \binom{58}{14}$ permutations of the sample labels;
- compute the statistic $t_j^{(k)}$ for each permutation $k$;
- the p-value for gene $j$ is
$$ p_j = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\left( |t_j^{(k)}| > |t_j| \right) $$
(not all $\binom{58}{14}$ permutations are needed; a random sample of $K = 1000$ suffices).
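A sketch of the permutation p-values in Python (again with simulated placeholder data; $M$ is kept small here so the $K = 1000$ permutations run quickly):

```python
import numpy as np

def t_stats(X, group):
    """Two-sample t-statistics for every column of X."""
    X1, X2 = X[group == 0], X[group == 1]
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    var = (((X1 - m1) ** 2).sum(axis=0) +
           ((X2 - m2) ** 2).sum(axis=0)) / (n1 + n2 - 2)
    return (m2 - m1) / np.sqrt(var * (1 / n1 + 1 / n2))

rng = np.random.default_rng(0)
N1, N2, M = 44, 14, 1000              # M reduced from 12625 for speed
X = rng.normal(size=(N1 + N2, M))     # placeholder data
group = np.array([0] * N1 + [1] * N2)
t_obs = t_stats(X, group)

K = 1000                              # random sample of label permutations
exceed = np.zeros(M)
for _ in range(K):
    exceed += np.abs(t_stats(X, rng.permutation(group))) > np.abs(t_obs)
p_values = exceed / K                 # permutation p-value per gene
```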
For $j = 1, \ldots, M$ test the hypotheses:
$H_{0j}$: the treatment has no effect on gene $j$;
$H_{1j}$: the treatment has an effect on gene $j$.
$H_{0j}$ is rejected at level $\alpha$ if $p_j < \alpha$:
- $\alpha$ is the type-I error rate;
- the probability of falsely rejecting $H_{0j}$ is $\alpha$.
Feature Assessment when $p \gg N$: family-wise error rate

Define $A_j = \{ H_{0j} \text{ is falsely rejected} \}$, so that $\Pr(A_j) = \alpha$.

The family-wise error rate (FWER) is the probability of at least one false rejection,
$$ \Pr(A) = \Pr\left( \bigcup_{j=1}^{M} A_j \right). $$
- for $M$ large, $\Pr(A) \gg \alpha$;
- it depends on the correlation between the tests;
- if the tests are independent, $\Pr(A) = 1 - (1 - \alpha)^M$;
- for tests with positive dependence, $\Pr(A) < 1 - (1 - \alpha)^M$;
  - positive dependence is typical in genomic studies.
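A short illustration of how quickly the FWER grows with the number of independent tests:

```python
# FWER for M independent tests, each at level alpha = 0.05
alpha = 0.05
for M in (1, 10, 100, 12625):
    print(M, 1 - (1 - alpha) ** M)
# 1 -> 0.05 | 10 -> ~0.40 | 100 -> ~0.99 | 12625 -> ~1.0
```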
The simplest approach to correct the p-values for the multiplicity of the tests is the Bonferroni method:
- reject $H_{0j}$ if $p_j < \alpha / M$;
- it makes the individual tests more stringent;
- it controls the FWER:
  - it is easy to show that FWER $\le \alpha$;
- it is very (too) conservative.

In the example:
- with $\alpha = 0.05$, $\alpha / M = 0.05 / 12625 \approx 3.9 \times 10^{-6}$;
- no gene has a p-value that small.
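A sketch of the Bonferroni correction on simulated p-values (all null hypotheses true, so the p-values are uniform; the counts will vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(0)
M, alpha = 12625, 0.05
p = rng.uniform(size=M)        # all nulls true: p-values are U(0, 1)
print((p < alpha).sum())       # ~631 naive false rejections
print((p < alpha / M).sum())   # Bonferroni: almost surely 0
```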
Feature Assessment when $p \gg N$: the false discovery rate

Instead of the FWER, we can control the false discovery rate (FDR):
- the expected proportion of genes incorrectly declared significant among those selected as significant;
- in formula, FDR $= E[V/R]$, where $V$ is the number of false rejections and $R$ is the total number of rejections;
- there is a procedure that keeps the FDR smaller than a user-defined $\alpha$.
[Figure: ordered p-values $p_{(j)}$ plotted against their rank $j$, together with the rejection line $\alpha \cdot j / M$.]
In the example:
- $\alpha = 0.15$;
- the Benjamini-Hochberg procedure sorts the p-values, $p_{(1)} \le \ldots \le p_{(M)}$, and rejects the hypotheses with the $j^*$ smallest p-values, where $j^*$ is the largest $j$ such that $p_{(j)} \le \alpha \cdot j / M$;
- the last $p_{(j)}$ under the line $\alpha \cdot j / M$ occurs at $j = 11$;
- the smallest 11 p-values are considered significant;
- in the example, $p_{(11)} = 0.00012$;
- the corresponding t-statistic is $|t_{(11)}| = 4.101$;
- a gene is declared relevant if its t-statistic is larger than $4.101$ in absolute value.
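A sketch of the Benjamini-Hochberg step-up procedure (the function name and the toy p-values are hypothetical):

```python
import numpy as np

def benjamini_hochberg(p, alpha):
    """Boolean mask of rejected hypotheses (BH step-up procedure)."""
    M = len(p)
    order = np.argsort(p)
    # largest j (1-indexed) with p_(j) <= alpha * j / M
    below = p[order] <= alpha * np.arange(1, M + 1) / M
    reject = np.zeros(M, dtype=bool)
    if below.any():
        j_star = np.nonzero(below)[0].max()     # 0-indexed rank
        reject[order[:j_star + 1]] = True
    return reject

# toy usage: 20 clearly non-null p-values among 1000 nulls
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(0, 1e-4, 20), rng.uniform(size=1000)])
print(benjamini_hochberg(p, alpha=0.15).sum())
```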
It can be proved (Benjamini & Hochberg, 1995) that
$$ \text{FDR} \le \frac{M_0}{M} \alpha \le \alpha, $$
where $M_0$ is the number of true null hypotheses:
- regardless of the number of true null hypotheses;
- regardless of the distribution of the p-values under $H_1$;
- assuming independent test statistics;
- in case of dependence, see Benjamini & Yekutieli (2001).
Stability Selection: introduction

In general:
- the $L_1$-penalty is often used to perform model selection;
- it has no oracle property (strict conditions are needed to obtain it);
- there are issues with selecting the proper amount of regularization.

Meinshausen & Bühlmann (2010) suggested a procedure that:
- is based on subsampling (it could work with bootstrapping as well);
- determines the amount of regularization so as to control the FWER;
- provides a new structure estimation or variable selection scheme;
- is presented here with the $L_1$-penalty, but works in general.
Setting:
- $\beta$ is a $p$-dimensional vector of coefficients;
- $S = \{ j : \beta_j \ne 0 \}$, with $|S| < p$;
- $S^C = \{ j : \beta_j = 0 \}$;
- $Z^{(i)} = (X^{(i)}, Y^{(i)})$, $i = 1, \ldots, N$, are the i.i.d. data:
  - univariate response $Y$;
  - $N \times p$ covariate matrix $X$;
- consider a linear model $Y = X\beta + \epsilon$, where $\epsilon = (\epsilon_1, \ldots, \epsilon_N)$ has i.i.d. components.
The goal is to infer $S$ from the data. We saw that the lasso,
$$ \hat{\beta}^{\lambda} = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \left( \| Y - X\beta \|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right), $$
provides an estimate of $S$,
$$ \hat{S}^{\lambda} = \{ j : \hat{\beta}_j^{\lambda} \ne 0 \} \subseteq \{ 1, \ldots, p \}. $$
Remember:
- $\lambda \in \mathbb{R}^+$ is the regularization parameter;
- the covariates are standardized, $\| X_j \|_2^2 = \sum_{i=1}^{N} (x_j^{(i)})^2 = 1$.
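A sketch of obtaining $\hat{S}^{\lambda}$ with scikit-learn on simulated sparse data (note two conventions that differ from the slides: sklearn's Lasso minimizes $\frac{1}{2N}\|y - X\beta\|_2^2 + \alpha \|\beta\|_1$, so its $\alpha$ corresponds to $\lambda/(2N)$, and the columns here are standardized to unit variance rather than unit norm):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 100, 200
X = rng.normal(size=(N, p))            # columns roughly standardized
beta = np.zeros(p)
beta[:5] = 1.0                         # sparse truth: S = {0, ..., 4}
y = X @ beta + rng.normal(scale=0.5, size=N)

model = Lasso(alpha=0.3).fit(X, y)     # one point on the lasso path
S_hat = np.flatnonzero(model.coef_)    # estimated active set S_hat
print(S_hat)                           # likely [0 1 2 3 4]
```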
Stability Selection: selection probability

Stability selection is built on the concept of selection probability.

Definition 1: Let $I$ be a random subsample of $\{ 1, \ldots, N \}$ of size $\lfloor N/2 \rfloor$, drawn without replacement. The selection probability of a variable $X_j$ is the probability of $j$ being in $\hat{S}^{\lambda}(I)$,
$$ \hat{\Pi}_j^{\lambda} = \Pr{}^{*} \left[ j \in \hat{S}^{\lambda}(I) \right]. $$

Note:
- $\Pr^*$ is with respect to both the random subsampling and other sources of randomness if $\hat{S}^{\lambda}$ is not deterministic;
- $\lfloor N/2 \rfloor$ is chosen for computational efficiency.
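A sketch of estimating the selection probabilities by subsampling (the function name is hypothetical, and the usage line assumes the simulated X and y from the lasso sketch above):

```python
import numpy as np
from sklearn.linear_model import Lasso

def selection_probabilities(X, y, alpha, n_subsamples=100, seed=0):
    """Estimate Pi_hat_j by refitting the lasso on random subsamples of
    size floor(N/2), drawn without replacement, and counting how often
    each variable enters the selected set."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        I = rng.choice(N, size=N // 2, replace=False)
        counts += Lasso(alpha=alpha).fit(X[I], y[I]).coef_ != 0
    return counts / n_subsamples

# With X, y from the previous sketch, the truly active variables should
# have estimated selection probabilities close to 1:
# pi_hat = selection_probabilities(X, y, alpha=0.3)
```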