Faculty of Health Sciences, University of Copenhagen
March 31st, 2020
Statistical methods in bioinformatics: brief introduction, statistical models, dimension reduction
Claus Thorn Ekstrøm, Biostatistics, University of Copenhagen
E-mail: ekstrom@sund.ku.dk
Today's programme
Introduction to statistical methods for high-dimensional data, linear models, dimension reduction and regularization methods.
1. Brief overview of molecular data
2. Big-p, small-n problems
3. Multiple testing techniques (inference correction, false discovery rates, q-values)
4. The correlation vs. causation and prediction vs. hypothesis-testing distinctions
5. Generalized linear models refresher
6. Dimension reduction I: penalized regression
7. Dimension reduction II: partial least squares, principal component regression
"Classical" statistics analysis
Outcome: obesity. Predictors: gene, gender, age. This could be analyzed with a multiple regression model:
obesity_i = α + β1 · gene_i + β2 · gender_i + β3 · age_i + ε_i
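To make the model concrete, here is a minimal sketch in Python that simulates a toy data set with the slide's variable names (gene, gender, age, obesity) and fits the regression with statsmodels' formula interface. The simulated effect sizes are made-up illustrative numbers, not results from the course.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
d = pd.DataFrame({
    "gene":   rng.binomial(2, 0.3, size=n),   # e.g. a SNP coded 0/1/2
    "gender": rng.binomial(1, 0.5, size=n),
    "age":    rng.uniform(20, 70, size=n),
})
# Assumed "true" effects, used only to generate toy data.
d["obesity"] = (25 + 1.5 * d["gene"] + 0.8 * d["gender"] + 0.05 * d["age"]
                + rng.normal(scale=2.0, size=n))

# Fit obesity_i = alpha + beta1*gene_i + beta2*gender_i + beta3*age_i + eps_i
fit = smf.ols("obesity ~ gene + gender + age", data=d).fit()
print(fit.summary())   # estimates of alpha, beta1, beta2, beta3
```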
The omics revolution
The "joy" of *omics for an analyst
Examples: Sequence data
CACACGCGTGAAGATCAACCGAAA
TCACTCATGCGGGCTTGACCATGT
CGCCTACATGTCCTTCACACGCGT
GAAGATCAACCGAAATCACTCATG
CGGGCTTGACCATGTCGCCTACAT
GTCCTTCACACGCGTGAAGATCAA
CCGAAATCACTCATGCGGGCTTGA
CCATGTCGCCTACATGTCCTTCAC
ACGCGTGAAGATCAACCGAAATCA
CTCATGCGGGCTTGACCATGTCGC
CTACATGTCC

Evaluate P(Y_i = "gene" | Y_1, ..., Y_{i−1}).
Do that for each i and identify the nucleotides that have a high probability of being inside a gene.
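One standard way to operationalise "the probability that position i lies inside a gene" is a two-state hidden Markov model evaluated with a forward-backward pass. The sketch below is only an illustration of the mechanics: the transition and emission probabilities (a GC-rich "gene" state vs. an AT-rich "intergenic" state) are made-up numbers, and the posterior it computes conditions on the whole sequence rather than just the preceding bases.

```python
import numpy as np

bases = "ACGT"
base_idx = {b: i for i, b in enumerate(bases)}

pi = np.array([0.9, 0.1])                  # initial state probabilities (toy)
A = np.array([[0.99, 0.01],                # transition matrix: intergenic/gene
              [0.02, 0.98]])
E = np.array([[0.30, 0.20, 0.20, 0.30],    # emissions: intergenic is AT-rich
              [0.20, 0.30, 0.30, 0.20]])   # "gene" state is GC-rich (assumption)

def gene_posterior(seq):
    """Posterior P(state_i = 'gene' | whole sequence) via forward-backward."""
    obs = np.array([base_idx[b] for b in seq])
    n, k = len(obs), len(pi)

    # Forward pass with per-position scaling to avoid numerical underflow.
    alpha = np.zeros((n, k))
    scale = np.zeros(n)
    alpha[0] = pi * E[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ A) * E[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass reusing the same scaling factors.
    beta = np.zeros((n, k))
    beta[-1] = 1.0
    for t in range(n - 2, -1, -1):
        beta[t] = A @ (E[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= scale[t + 1]

    post = alpha * beta
    post /= post.sum(axis=1, keepdims=True)
    return post[:, 1]                      # probability of the "gene" state

seq = "CACACGCGTGAAGATCAACCGAAATCACTCATGCGGGCTTGACCATGT"
p_gene = gene_posterior(seq)
for i in np.where(p_gene > 0.8)[0]:        # positions likely inside a gene
    print(i, seq[i], round(p_gene[i], 2))
```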
Examples: Proteomics
Examples: Gene expression data
Examples: Metabolite data
Examples: Sequence data — metabolite data
A bit of history
• 2000: one SNP
• 2003: 10 SNPs
• 2006: 500 SNPs
• 2009: 22,000 SNPs
• 2012: 2.5 million SNPs
• 2013: 25 million SNPs (~45 million imputed)
Pattern recognition
Prediction
The $1000 genome
Data sizes
The data consist of a response y and a design matrix X with predictors x1, x2, x3, ..., x99999: far more predictors (columns) than observations (rows).
We constantly need dimension reduction:
• Feature selection
• Inference?
The problem with multiple comparisons
P predictors: let's just do P standard analyses!
P(at least 1 false positive) = 1 − (1 − α)^P
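A quick way to see how fast this grows: the sketch below evaluates the formula and checks it by simulation for P independent tests where every null hypothesis is true (all numbers are toy choices).

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, P, n_sim = 0.05, 100, 20_000

# Formula from the slide, assuming independent tests.
formula = 1 - (1 - alpha) ** P

# Simulation: under the null, p-values are uniform on (0, 1).
pvals = rng.uniform(size=(n_sim, P))
simulated = np.mean((pvals < alpha).any(axis=1))

print(formula, simulated)   # both around 0.994 for P = 100
```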
Multiple comparison problems
Multiple comparison problems
Possible errors committed when testing a single null hypothesis, H0:

               H0 is true    H0 is false
Rejected       α             1 − β
Not rejected   1 − α         β
Total          1             1

α is the significance level, 1 − β is the power.
Multiple comparison problems
Number of errors committed when testing m null hypotheses:

               H0 is true    H0 is false    Total
Rejected       V             S              R
Not rejected   U             T              m − R
Total          m0            m − m0         m

Here R, the number of rejected hypotheses/discoveries, can be seen as a random variable. V, S, U and T are unobserved.
The family-wise error rate (FWER) is the probability of making at least one type I error (false positive):
FWER = P(V > 0) = 1 − P(V = 0)
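The random quantities in the table can be made concrete by simulation. The sketch below repeatedly generates m tests of which m0 nulls are true, counts V and S at a fixed per-test α, and estimates FWER = P(V > 0); the sample size and the effect size for the non-null tests are made-up illustrative values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
m, m0, alpha, n_rep = 1000, 950, 0.05, 2000
n = 50                                      # observations behind each test
effect = 0.5                                # assumed shift for the non-null tests

any_false_positive = 0
avg_S = 0.0
for _ in range(n_rep):
    means = np.r_[np.zeros(m0), np.full(m - m0, effect)]
    # z-statistics: sample means rescaled to unit variance
    z = rng.normal(means, 1.0, size=(n, m)).mean(axis=0) * np.sqrt(n)
    p = 2 * stats.norm.sf(np.abs(z))
    V = np.sum(p[:m0] < alpha)              # false positives among true nulls
    S = np.sum(p[m0:] < alpha)              # true positives among false nulls
    any_false_positive += (V > 0)
    avg_S += S / n_rep

print("estimated FWER:", any_false_positive / n_rep)  # essentially 1 for m0 = 950
print("average S:", round(avg_S, 1))
```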
Multiple comparison problems
The family-wise error rate (FWER) is the probability of making at least one type I error (false positive). For m tests we have
FWER = 1 − P(V = 0) = 1 − (1 − α)^m ≤ m · α
where the second equality only holds under independence. (The inequality, however, holds in general due to Boole's inequality.)
Bonferroni correction
The most conservative method, but it is free of dependence and distributional assumptions.
FWER = 1 − P(V = 0) = 1 − (1 − α)^m ≤ m · α
So instead set the significance level of each individual test to α/m. In other words, we reject the ith hypothesis if
m · p_i ≤ α  ⇔  p_i ≤ α/m

Šidák correction (assumes independence). Want overall significance level α*:
1 − (1 − α)^m = α*  ⇔  α = 1 − (1 − α*)^(1/m)
Slightly less conservative than Bonferroni (but not by much).
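As a sketch (not from the slides), the per-test thresholds implied by the two formulas can be computed directly, and statsmodels' multipletests applies both corrections to a vector of p-values; the p-values below are simulated toy data with ten built-in signals.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
m, alpha = 1000, 0.05
p = np.concatenate([rng.uniform(size=m - 10),        # true null hypotheses
                    rng.uniform(0, 1e-4, size=10)])  # 10 real signals (toy)

# Per-test thresholds implied by the formulas on the slide.
bonf_cut = alpha / m
sidak_cut = 1 - (1 - alpha) ** (1 / m)

rej_bonf, _, _, _ = multipletests(p, alpha=alpha, method="bonferroni")
rej_sidak, _, _, _ = multipletests(p, alpha=alpha, method="sidak")

print(bonf_cut, sidak_cut)               # Šidák is only marginally less strict
print(rej_bonf.sum(), rej_sidak.sum())   # number of rejections under each rule
```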
Holm's correction
The Holm-Bonferroni correction:
1. Compute and order the individual p-values: p(1) ≤ p(2) ≤ ... ≤ p(m).
2. Find k̂ = min{ k : p(k) > α/(m + 1 − k) }.
3. If k̂ exists, reject the hypotheses corresponding to p(1), ..., p(k̂−1); if no such k exists, reject all m hypotheses.

It controls the FWER: Assume that k is the (ordered) index of the first wrongly rejected true hypothesis. The k − 1 hypotheses rejected before it are then all false, and there are only m − m0 false hypotheses, so k ≤ m − (m0 − 1). Hypothesis k was rejected, so
p(k) ≤ α/(m + 1 − k) ≤ α/(m + 1 − (m − (m0 − 1))) = α/m0
Since there are m0 true hypotheses, the probability that at least one of them has a p-value below α/m0 is at most α (Bonferroni argument), so the FWER is controlled.
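A minimal hand-rolled version of the step-down rule in steps 1-3, checked against statsmodels' built-in Holm implementation; the p-values are arbitrary illustrative numbers.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def holm_reject(p, alpha=0.05):
    """Return a boolean array: which hypotheses Holm's procedure rejects."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)                   # p(1) <= ... <= p(m)
    reject = np.zeros(m, dtype=bool)
    for step, idx in enumerate(order):      # step = k - 1
        if p[idx] <= alpha / (m - step):    # threshold alpha / (m + 1 - k)
            reject[idx] = True
        else:
            break                           # stop at the first failure (k-hat)
    return reject

p = np.array([0.0001, 0.004, 0.019, 0.095, 0.201, 0.74])
mine = holm_reject(p)
theirs, _, _, _ = multipletests(p, alpha=0.05, method="holm")
print(mine, theirs)   # the two should agree: only the two smallest are rejected
```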
Resampling methods
Computer-intensive methods.
Permutation methods: simulate data under H0 (e.g. by permuting group labels), compute the test statistic and compare it to the test statistic from the original data.
Bootstrap: "simulate data under HA".
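A minimal sketch of the permutation idea for a two-group comparison of means, on simulated data: the group labels are shuffled to mimic H0, and the observed statistic is compared to the permutation distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=30)    # group 1
y = rng.normal(0.5, 1.0, size=30)    # group 2 (true shift of 0.5, toy choice)

observed = y.mean() - x.mean()
pooled = np.concatenate([x, y])

n_perm = 10_000
perm_stats = np.empty(n_perm)
for b in range(n_perm):
    shuffled = rng.permutation(pooled)   # relabel the observations under H0
    perm_stats[b] = shuffled[30:].mean() - shuffled[:30].mean()

# Two-sided permutation p-value (with the usual +1 correction).
p_perm = (1 + np.sum(np.abs(perm_stats) >= abs(observed))) / (n_perm + 1)
print(observed, p_perm)
```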