
Statistical significance for untangling complex genotype-phenotype connections



  1. Statistical significance for untangling complex genotype-phenotype connections. Jun Sese (sese.jun@aist.go.jp), AIST, http://seselab.org/

  2. Higher-order analyses of genome-wide data are incompatible with p-values
     • Combinatorial effects
     • Network analysis
     [Figure: epistatic effects (Carlborg O and Haley CS. 2004. Nature Reviews Genetics); transcription factors OCT3/4, SOX2, KLF4, C-MYC and iPS cells (Takahashi and Yamanaka. 2006, Cell); network with nodes s1 to s5]
     Few combinations have been found from genome-wide data. Why?
     • Is the computational cost too high? Yes. But recent supercomputers may be able to find small combinations, and still few results have been found.
     • Are the statistical models unsuitable for the problem? Probably yes. Traditional approximations are too simple to analyze them.
     • Does the statistical procedure have a problem? This work tries to solve this problem.

  3. Higher-order analyses of genome-wide data are incompatible with p-values
     Few combinations have been found from genome-wide data. Why?
     • Existing multiple testing corrections are too conservative to find the combinations.
     • We developed a multiple testing correction method to find statistically significant combinations.

  4. Contents
     • Multiple Testing and Correction
     • LAMP: multiple testing correction for combination discovery
     • Tarone’s method: modify Bonferroni correction
     • Key algorithm for LAMP
     • Application to combinatorial TF discovery
     • Derivative software
     • Summary

  5. Active motif discovery
     • Consider the association between motifs and gene expression.
     • To simplify the explanation, gene expression is categorized as high or low.
     Contingency table (8 genes: 3 High, 5 Low):
                        High   Low   Total
       With motif         3     0      3
       Without motif      0     5      5
       Total              3     5      8
     Fisher’s exact test: p = 0.018 < 0.05 → significant?

  6. Active motif discovery
     • Fisher’s exact test gives p = 0.018 < 0.05 for the contingency table (3, 0 / 0, 5) → significant?
     • No, because we need multiple testing correction.
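The p-value on this slide can be checked by hand. The following is a minimal sketch, assuming a one-sided Fisher's exact test on the 2×2 table with cells (3, 0 / 0, 5); the function name `fisher_exact_one_sided` is illustrative and not from the slides.

```python
from math import comb


def fisher_exact_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables whose top-left
    cell is >= a, with row and column totals held fixed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    largest = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, largest + 1)
    )
    return numerator / comb(n, row1)


p = fisher_exact_one_sided(3, 0, 0, 5)
print(round(p, 3))  # 0.018
```

With all margins fixed, only one table is as extreme as the observed one, so the p-value is 1 / C(8, 3) = 1/56.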

  7. Multiple testing correction
     • Single test at significance level α = 5%: false discovery ≦ 5%
     • Ten tests, each at 5%: false discovery 40%
     • Ten tests, each at 0.5%: false discovery 4.9%
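The percentages on this slide match the probability of at least one false positive among independent tests, 1 − (1 − α)^N. A small sketch under that independence assumption (the slide does not state it explicitly; `fwer_independent` is an illustrative name):

```python
def fwer_independent(alpha_per_test: float, n_tests: int) -> float:
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha_per_test) ** n_tests


print(round(fwer_independent(0.05, 1), 3))    # 0.05
print(round(fwer_independent(0.05, 10), 3))   # 0.401
print(round(fwer_independent(0.005, 10), 3))  # 0.049
```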

  8. Bonferroni Correction
     • Adjusted p-value = (number of tests) × (raw p-value)
     • Equivalently, set the corrected significance level δ to α / N
     • Controls the family-wise error rate (FWER): the probability that at least one false positive occurs
     • δ: corrected significance level, N: number of tests
     $\mathrm{FWER} = \Pr\Bigl(\bigcup_{i \in \{1,\dots,N\}} \{p_i \le \delta\}\Bigr) \le \sum_{i \in \{1,\dots,N\}} \Pr(p_i \le \delta) \le N\delta$
     [Figure: a family-wise error occurs unless p > δ for all treatments]

  9. Bonferroni Correction
     • The upper bound on the FWER should be at most α: $\mathrm{FWER} \le N\delta \le \alpha$
     • Hence $\delta \le \alpha / N$

  10. Bonferroni Correction
     • With $\delta \le \alpha / N$, taking combinations of A, B, C, D (AB, AC, AD, BC, BD, CD, ...) makes the correction factor larger
     • Detection of a functional complex of genes is extremely unlikely
     [Figure: p-values of single items A, B, C, D compared with p-values of their combinations]
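To see how quickly the correction factor grows when combinations are tested, here is a short sketch. It assumes every non-empty combination up to a given size counts as one test; `bonferroni_factor` is an illustrative helper, not a name from the slides.

```python
from itertools import combinations
from math import comb


def bonferroni_factor(n_items: int, max_size: int) -> int:
    """Number of non-empty combinations of at most max_size items."""
    return sum(comb(n_items, k) for k in range(1, max_size + 1))


items = ["A", "B", "C", "D"]
pairs = ["".join(c) for c in combinations(items, 2)]
print(pairs)  # ['AB', 'AC', 'AD', 'BC', 'BD', 'CD']

alpha = 0.05
for max_size in (1, 2, 4):
    n_tests = bonferroni_factor(len(items), max_size)
    # Corrected significance level delta = alpha / N shrinks as N grows.
    print(max_size, n_tests, alpha / n_tests)
```

Four items already give 15 tests when all combination sizes are allowed, so each test must pass a threshold of α / 15 instead of α / 4.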

  11. Two problems in discovering combinations statistically
     • Avoiding overly conservative multiple testing correction
       • But the FWER should be kept below α
       • We introduce Tarone’s method [Tarone, Biometrics, 1990]
     • Fast enumeration of all possible combinations/subgraphs
       • Counting the Bonferroni factor efficiently
       • We use a frequent pattern mining method for combinations and an efficient graph enumeration technique for subgraphs
       • Both are combined with Tarone’s method


  13. Our Proposal: Limitless Arity Multiple testing Procedure (LAMP) [PNAS 2013]
     • Can enumerate statistically significant combinations
     • Techniques
       • Count the exact number of “testable” combinations
       • Infrequent combinations do not affect FWER
       • Stepwise procedure with frequent itemset mining
       • Calibrate the correction factor to the smallest possible value
     • Discovered statistically significant motif combinations in yeast and breast cancer expression data

  14. Bonferroni inequality
     $\mathrm{FWER} = \Pr\Bigl(\bigcup_{i \in \{1,\dots,N\}} \{p_i \le \delta\}\Bigr) \le \sum_{i \in \{1,\dots,N\}} \Pr(p_i \le \delta) \le N\delta$
     • Bonferroni factor N = number of tests
     • Testable: tests that can possibly produce false positives. These must be counted in the Bonferroni factor.
     • Untestable: tests that cannot produce false positives, $\Pr(p_i \le \delta) = 0$. These can be safely removed from the Bonferroni factor.
     • Tarone’s method: count only the testable tests in the Bonferroni factor
       • Bonferroni: $\delta \le \alpha / N$, that is, $\delta N \le \alpha$
       • Tarone: $\delta \, |\{ i \mid \Pr(p_i \le \delta) > 0 \}| \le \alpha$; check all possible thresholds and select the largest δ


  16. Infrequent combinations never cause a significant result
     From this contingency table, the minimum p-value of Fisher’s exact test can be calculated as f(x).
                  With combination   Without combination   Total
       High              ?                    ?             n_u
       Low               ?                    ?             N − n_u
       Total             x                  N − x            N
     $f(x) = \binom{n_u}{x} \Big/ \binom{N}{x}$
     • f(x) depends only on x
     • f(x) decreases as x increases
     • With this f(x), the testable combinations can be described as $\{ i \mid f(x_i) \le \delta \}$
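A minimal sketch of f(x), assuming the slide's formula applies for x ≤ n_u (the function name and the numbers N = 8, n_u = 3 are illustrative; they reuse the small table from the active motif discovery slide):

```python
from math import comb


def min_p_value(x: int, n_up: int, n_total: int) -> float:
    """Smallest Fisher p-value reachable by a combination found in x genes.

    Implements f(x) = C(n_u, x) / C(N, x). The case x > n_u is outside
    the formula shown on the slide, so it is rejected here.
    """
    if not 0 <= x <= n_up:
        raise ValueError("this sketch covers only 0 <= x <= n_up")
    return comb(n_up, x) / comb(n_total, x)


# Illustrative numbers: N = 8 genes, n_u = 3 highly expressed genes.
for x in (1, 2, 3):
    print(x, round(min_p_value(x, n_up=3, n_total=8), 4))
```

For these numbers f(1) = 3/8 and f(2) = 3/28 both exceed 0.05, so a combination present in fewer than three genes can never be significant at α = 0.05, whatever the expression labels are.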

  17. Tarone correction with frequency
     $\mathrm{FWER} = \Pr\Bigl(\bigcup_{i \in \{1,\dots,N\}} \{p_i \le \delta\}\Bigr) \le \sum_{i \in \{1,\dots,N\}} \Pr(p_i \le \delta) = \sum_{\{ i \mid f(x_i) \le \delta \}} \Pr(p_i \le \delta) \le |\{ i \mid f(x_i) \le \delta \}| \cdot \delta$
     • Take the maximum δ that keeps the FWER bound below α
     • $g(x) = |\{ i \mid f(x_i) \le \delta \}| \, \delta$
     [Figure: g(x) for x_i = N−2, x_i = N−1, x_i = N, marking the appropriate corrected significance threshold]
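The search for the threshold can be sketched by brute force. This is not the LAMP implementation: it assumes the frequencies x_i of all combinations are already known, uses f(x) = C(n_u, x) / C(N, x), and picks the smallest frequency cutoff σ for which (number of combinations with x_i ≥ σ) × f(σ) ≤ α. Only combinations with x_i ≥ σ are then tested, at level α divided by their count. The names `calibrate` and `freqs` and the example frequencies are hypothetical.

```python
from math import comb


def min_p_value(x: int, n_up: int, n_total: int) -> float:
    """f(x) = C(n_u, x) / C(N, x), assumed valid for x <= n_u."""
    return comb(n_up, x) / comb(n_total, x)


def calibrate(freqs, n_up, n_total, alpha=0.05):
    """Return (sigma, correction_factor, delta), or None if no cutoff works."""
    for sigma in range(1, n_up + 1):
        # Combinations at least this frequent are the testable ones.
        testable = sum(1 for x in freqs if x >= sigma)
        if testable == 0:
            return None
        if testable * min_p_value(sigma, n_up, n_total) <= alpha:
            return sigma, testable, alpha / testable
    return None


# Hypothetical frequencies of ten combinations among N = 8 genes, n_u = 3.
freqs = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
result = calibrate(freqs, n_up=3, n_total=8)
print(result)               # (3, 2, 0.025)
print(0.05 / len(freqs))    # Bonferroni threshold for comparison
```

In this toy setting the correction factor drops from 10 to 2, so a raw p-value of 1/56 ≈ 0.018 passes the calibrated threshold 0.025 but not the Bonferroni threshold 0.005.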

  18. Frequent Pattern Mining
     [Figure: lattice of item combinations { }, annotated with m_x and f(x) at each frequency x, repeated stepwise with x = x − 1]
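Counting combinations by frequency is what frequent pattern mining provides. The following is a small depth-first sketch, not the miner used in LAMP: it relies on the fact that adding an item to a combination can never increase the number of genes containing it, so any branch that falls below the minimum support can be pruned. The motif names A to D and the gene sets are hypothetical.

```python
def frequent_combinations(genes_by_motif, min_support):
    """List (combination, support) for every combination with support >= min_support."""
    motifs = sorted(genes_by_motif)
    found = []

    def extend(prefix, shared_genes, start):
        for j in range(start, len(motifs)):
            genes = genes_by_motif[motifs[j]]
            # Genes that contain every motif of the extended combination.
            shared = shared_genes & genes if prefix else set(genes)
            if len(shared) >= min_support:
                combo = prefix + (motifs[j],)
                found.append((combo, len(shared)))
                extend(combo, shared, j + 1)  # supersets of infrequent sets are skipped

    extend((), set(), 0)
    return found


# Hypothetical data: which genes (by index) carry each motif.
genes_by_motif = {"A": {0, 1, 2, 3}, "B": {0, 1, 2}, "C": {1, 2, 5}, "D": {6}}

for min_support in (3, 2):
    combos = frequent_combinations(genes_by_motif, min_support)
    print(min_support, len(combos), [c for c, _ in combos])
```

Lowering the minimum support from 3 to 2 raises the count from 4 to 7 combinations here, which is the count that enters the correction factor at each step.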


  20. An Example of Combinatorial Gene Regulation in Yeast
     • 102 motifs (ChIP-Chip: Harbison et al.)
     • 5,935 genes, each labeled High or Low under heat shock condition (Expression: Gasch et al.)

  21. An Example of Combinatorial Gene Regulation in Yeast
     Under heat shock condition. Corrected p-values (red on the slide: significant).
       Motif combination          LAMP (≦ 102)    Bonferroni (≦ 4)
                                  K = 303         K = 4,426,528
       HSF1                       4.41E-24        6.44E-20
       MSN2                       3.73E-11        5.45E-07
       MSN4                       0.000532        >1
       SKO1                       0.00839         >1
       SNT2                       0.0192          >1
       PHD1, SUT1, SOK2, SKN7     0.0272          >1
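The two correction factors in this table can be related with a few lines of arithmetic. The sketch below assumes that the Bonferroni factor counts all combinations of at most 4 of the 102 motifs, and that a corrected p-value is the raw p-value multiplied by K, as in the Bonferroni slides; variable names are illustrative.

```python
from math import comb

n_motifs = 102
k_lamp = 303  # correction factor reported for LAMP on the slide

# Number of combinations of 1 to 4 motifs out of 102.
k_bonferroni = sum(comb(n_motifs, k) for k in range(1, 5))
print(k_bonferroni)  # 4426528

# Rescale LAMP-corrected p-values from the slide to the Bonferroni factor.
lamp_corrected = {"HSF1": 4.41e-24, "MSN2": 3.73e-11, "MSN4": 0.000532}
for motif, p_lamp in lamp_corrected.items():
    p_bonferroni = p_lamp / k_lamp * k_bonferroni
    print(motif, f"{p_bonferroni:.3g}")
```

The rescaled values for HSF1 and MSN2 agree with the Bonferroni column to three significant digits, and the value for MSN4 exceeds 1, matching the ">1" entry.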

  22. [Figure: genes ranked by expression from Up to Down (labeled genes: HAP4, GAT2, MSN4, MGA1, GID8, YNL179C, RHO5), with p-values for PHD1, SUT1, SOK2 and SKN7 individually (each > 1) and for the combination PHD1, SUT1, SOK2, SKN7 (0.0272)]
