

INTRODUCTION TO GENETIC EPIDEMIOLOGY (GBIO0015-1) Prof. Dr. Dr. K. Van Steen Introduction to Genetic Epidemiology


  1. Introduction to Genetic Epidemiology Chapter 7: A World of Interactions
     The frequency of epistasis
     • Not a new idea! (Bateson 1909)
     • Complexity of gene regulation and biochemical networks (Gibson 1996; Templeton 2000)
     • Single-gene results don't replicate (Hirschhorn et al. 2002)
     • Gene-gene interactions are commonly found when properly investigated (Templeton 2000)
     • Working hypothesis: single-gene studies don't replicate because gene-gene interactions are more important (Moore and Williams 2002) (Moore 2003)
     K Van Steen 537

  2. Slow shift from main towards epistasis effects (Motsinger et al 2007)

  3. Power of a gene-gene or gene-environment interaction analysis
     • There is a vast literature on power considerations
       - Most of this literature supports its claims with extensive simulation studies
     • There is a need for user-friendly software tools that allow the user to perform hands-on power calculations
     • The main package targeting interaction analyses is QUANTO (v1.2.1):
       - Available study designs for a disease (binary) outcome include the unmatched case-control, matched case-control, case-sibling, case-parent, and case-only designs. Study designs for a quantitative trait include independent individuals and case-parent designs.
       - Reference: Gauderman (2000a), Gauderman (2000b), Gauderman (2003) / http://hydra.usc.edu/GxE

  4. A simple example of epistasis

  5. A simple disease model
     • Penetrance: Pr(affected | genotype)
     • One-locus dominant model:

       Genotype   aa   aA   AA
       Status     0    1    1

  6. A slightly more complicated two-locus model

       Genotype   bb   bB   BB
       aa         0    0    0
       aA         0    1    1
       AA         0    1    1

     Enumeration of two-locus models
     • Although there are $2^9 = 512$ possible models, because of symmetries in the data only 50 of these are unique.
     • Enumeration allows only 0 and 1 for penetrance values ('fully penetrant'; i.e., "show" example).

  7. Enumeration of two-locus models
     • Each model represents a group of equivalent models under permutations. The representative model is the one with the smallest model number.
     • The six models studied in Neuman and Rice [67] ('RR, RD, DD, T, Mod, XOR') are marked, as are two single-locus models ('IL'), the recessive (R) and the interference (I) model (Li and Reich 2000).

  8. Different degrees of epistasis (slide: Motsinger)

  9. Pure epistasis model for dichotomous traits
     • Suppose
       - $p(A) = p(B) = p(a) = p(b) = 0.5$
       - HWE (hence, $p(AA) = 0.5^2 = 0.25$, $p(Aa) = 2 \times 0.5^2 = 0.5$) and no LD
       - penetrances P(affected | genotype) are given according to the table below

       Penetrance   bb     bB     BB     prob
       aa           0      0      1      0.25
       aA           0      0.50   0      0.50
       AA           1      0      0      0.25
       prob         0.25   0.50   0.25   1

     • Then make multiple use of Bayes' rule to retrieve the genotype distributions in cases and controls

  10. Pure epistasis model for dichotomous traits
     • Then the marginal genotype distributions for cases and controls are the same, and hence one-locus approaches will be powerless!

       P(genotypes | affected)
                    bb      bB      BB      prob
       aa           0       0       0.25    0.25
       aA           0       0.50    0       0.50
       AA           0.25    0       0       0.25
       prob         0.25    0.50    0.25    1

       P(genotypes | unaffected)
                    bb      bB      BB      prob
       aa           0.083   0.167   0       0.25
       aA           0.167   0.167   0.167   0.50
       AA           0       0.167   0.083   0.25
       prob         0.25    0.50    0.25    1

     • For example:
       $$P(aa,BB \mid D) = \frac{P(D \mid aa,BB)\,P(aa,BB)}{P(D)} = \frac{1 \times 0.5^2 \times 0.5^2}{1 \times 0.5^2 \times 0.5^2 + 0.5 \times (2 \times 0.5^2) \times (2 \times 0.5^2) + 1 \times 0.5^2 \times 0.5^2} = \frac{1}{4} = 0.25$$
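The Bayes computation on the two slides above can be checked numerically. The following is a minimal sketch, assuming the penetrance table and the HWE genotype frequencies given on the slides; the function names are illustrative and not part of any package mentioned in the course.

```python
# Penetrance table from the slide: rows aa, aA, AA; columns bb, bB, BB.
penetrance = [
    [0.0, 0.0, 1.0],
    [0.0, 0.5, 0.0],
    [1.0, 0.0, 0.0],
]
# Single-locus genotype frequencies under HWE with allele frequency 0.5.
geno_freq = [0.25, 0.50, 0.25]


def genotype_distributions(pen, freq_a, freq_b):
    """Return (prevalence, P(genotype | case), P(genotype | control))."""
    # No LD: the two-locus genotype frequency is the product of the margins.
    joint = [[fa * fb for fb in freq_b] for fa in freq_a]
    prevalence = sum(pen[i][j] * joint[i][j] for i in range(3) for j in range(3))
    cases = [[pen[i][j] * joint[i][j] / prevalence for j in range(3)]
             for i in range(3)]
    controls = [[(1 - pen[i][j]) * joint[i][j] / (1 - prevalence) for j in range(3)]
                for i in range(3)]
    return prevalence, cases, controls


def margins(table):
    """Row sums (first locus) and column sums (second locus) of a 3 x 3 table."""
    rows = [sum(r) for r in table]
    cols = [sum(table[i][j] for i in range(3)) for j in range(3)]
    return rows, cols


prevalence, cases, controls = genotype_distributions(penetrance, geno_freq, geno_freq)
print("P(D) =", prevalence)
print("case margins:   ", margins(cases))
print("control margins:", margins(controls))
```

Both sets of margins come out as (0.25, 0.50, 0.25), which is why a locus-by-locus comparison of cases and controls sees nothing under this model.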

  11. Purely epistatic 3-locus disease model for quantitative traits
     • Assume all allele frequencies are 0.5
     • Heritability is 55% and prevalence is 6.25%

                 L.3 = 0        L.3 = 1          L.3 = 2
       L.2       0   1   2      0   1      2     0   1   2
       L.1 = 0   0   0   1      0   0      0     0   0   0
       L.1 = 1   0   0   0      0   0.25   0     0   0   0
       L.1 = 2   0   0   0      0   0      0     1   0   0

     (Culverhouse et al 2002)

  12. Expected genotype patterns for 3-locus model

       L.1   L.2   L.3   p(g)     E[#affected]   E[#unaff]
       0     2     0     0.0156   25             0
       2     0     2     0.0156   25             0
       1     1     1     0.1250   50             10
       Other             0.8438   0              90
       Sum               1        100            100

     (Culverhouse et al 2002) (slide: J Ott 2004)
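The expected counts in this table follow from the 3-locus penetrances by the same Bayes argument as in the two-locus case. A minimal sketch, assuming genotypes are written as (L.1, L.2, L.3) minor-allele counts, allele frequencies of 0.5, HWE and no LD; the function name is illustrative.

```python
from itertools import product

# HWE genotype frequencies for allele frequency 0.5 (genotype = 0, 1 or 2).
geno_freq = [0.25, 0.50, 0.25]
# Non-zero penetrances of the 3-locus model, keyed by (L.1, L.2, L.3).
penetrance = {(0, 2, 0): 1.0, (1, 1, 1): 0.25, (2, 0, 2): 1.0}


def expected_counts(n_cases=100, n_controls=100):
    """Return the prevalence and, per genotype, (p(g), E[#affected], E[#unaffected])."""
    p_g = {g: geno_freq[g[0]] * geno_freq[g[1]] * geno_freq[g[2]]
           for g in product(range(3), repeat=3)}
    prevalence = sum(penetrance.get(g, 0.0) * p for g, p in p_g.items())
    table = {}
    for g, p in p_g.items():
        f = penetrance.get(g, 0.0)
        table[g] = (p,
                    n_cases * f * p / prevalence,
                    n_controls * (1 - f) * p / (1 - prevalence))
    return prevalence, table


prevalence, table = expected_counts()
print("prevalence:", prevalence)
for g in penetrance:
    print(g, table[g])
```

The prevalence evaluates to 0.0625, and the three risk genotypes account for all 100 expected cases, against 10 of the 100 expected controls.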

  13. 2 Epistasis detection: a challenging task
     Main challenges
     • Variable selection
     • Modeling
     • Interpretation
       - Making inferences about biological epistasis from statistical epistasis
     (slide: Chen 2007)

  14. 2.a Variable selection
     Introduction
     • The aim is to make clever selections of marker combinations to look at in an epistasis analysis
     • This may not only aid in the interpretation of analysis results, but also reduce the burden of multiple testing and the computational burden

  15. Variable selection and multiple testing
     • Multiple testing is a thorny issue, the bane of statistical genetics.
       - The problem is not really the number of tests that are carried out: even if a researcher only tests one SNP for one phenotype, if many other researchers do the same and the nominally significant associations are reported, there will be a problem of false positives. (Balding 2006)
     • Example
       - Given 3 disease SNPs (e.g., the Culverhouse 3-locus model before), making inferences is not at all an easy task:
         Chi-sq = 166.7 (26 df), $p = 1.76 \times 10^{-22}$
       - With 50,000 SNPs, there will be $2.1 \times 10^{13}$ subsets of size 3:
         applying Bonferroni correction, $p = 3.6 \times 10^{-9}$
       - A more manageable approach is to test all possible pairs of loci for interaction effects, different in cases and controls (Hoh and Ott 2003)
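The arithmetic behind this example is short enough to reproduce. A minimal sketch, assuming the Bonferroni-adjusted p-value is the nominal p-value multiplied by the number of 3-SNP subsets (capped at 1); the variable names are illustrative.

```python
from math import comb

n_snps = 50_000
p_nominal = 1.76e-22             # nominal p-value quoted on the slide

n_triples = comb(n_snps, 3)      # number of 3-SNP subsets
n_pairs = comb(n_snps, 2)        # number of SNP pairs, for comparison

# Bonferroni adjustment: multiply by the number of tests, cap at 1.
p_bonferroni = min(1.0, p_nominal * n_triples)

print(f"3-SNP subsets: {n_triples:.3e}")
print(f"SNP pairs:     {n_pairs:.3e}")
print(f"Bonferroni-adjusted p: {p_bonferroni:.2e}")
```

This gives about $2.1 \times 10^{13}$ subsets and an adjusted p-value of about $3.7 \times 10^{-9}$, in line with the slide up to rounding. Restricting the search to pairs cuts the number of tests to about $1.2 \times 10^{9}$.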

  16. Variable selection and multiple testing
     • Pre-screening for subsequent testing:
       - Independent screening and testing step (PBAT screening; Van Steen et al 2005)
       - Dependent screening and testing step

  17. Methods to correct for multiple testing
     • Family-wise error rates (FWER)
       - … In the presence of too many SNPs, the Bonferroni threshold will be extremely low: Bonferroni adjustments are conservative when statistical tests are not independent / Bonferroni adjustments control the error rate associated with the omnibus null hypothesis / The interpretation of a finding depends on how many statistical tests were performed
     • Permutation data sets
       - Permutation is particularly handy for rare genotypes, small studies, non-normal phenotypes, and tightly linked markers
       - In case-control data this is relatively straightforward / In family data this is not at all an easy task …

  18. Methods to correct for multiple testing
     • False discovery rate (FDR)
       - With too many SNPs it starts to break down and the power gain over Bonferroni is minimal (e.g. see Van Steen et al 2005)
     • False-positive report probability (FPRP)
       - It is the probability of no true association between a genetic variant and disease given a statistically significant finding. It depends not only on the observed p-value but also on the prior probability that the association between the genetic variant and the disease is real and on the statistical power of the test (Wacholder et al 2004)
       - In general, Bayesian approaches do not yet have a big role in genetic association analyses, possibly because of computational burden?
       - Not yet well documented / What are the priors? (Balding 2006; Lucke 2008)

  19. Variable selection and computation time
     • When SNPs do not have independent effects, however, it is impossible for most current computer technologies to analyze the resulting astronomical number of possible combinations.
     • For instance, if 300000 SNPs have been measured at a density of 1 SNP every 10 kilobases (kb), and if 10 statistical evaluations can be computed each second, then evaluation of each individual SNP would require 30000 seconds (i.e., 8.3 hours) of computer time.
     • Exhaustive evaluation of the approximately $4 \times 10^{10}$ pairwise combinations of SNPs would require 1286 years.
     • Although it might be possible for a large supercomputer to complete these computations in a reasonable amount of time, an exhaustive search of all combinations of 3 or 4 SNPs would not be possible even if every computer in the world were simultaneously working on the problem.
     (Moore and Ritchie 2004)
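The orders of magnitude on this slide can be recomputed directly. A minimal sketch, assuming one statistical evaluation per SNP combination at the stated rate of 10 evaluations per second; the function name and the year length of 365.25 days are illustrative choices.

```python
from math import comb

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_time(n_snps, order, evals_per_second):
    """Number of SNP combinations of a given order and the seconds needed to test them all."""
    n_tests = comb(n_snps, order)
    return n_tests, n_tests / evals_per_second


n_single, sec_single = exhaustive_search_time(300_000, 1, 10)
n_pairs, sec_pairs = exhaustive_search_time(300_000, 2, 10)
n_triples, sec_triples = exhaustive_search_time(300_000, 3, 10)

print(f"single SNPs: {n_single:.1e} tests, {sec_single / 3600:.1f} hours")
print(f"pairs:       {n_pairs:.1e} tests, {sec_pairs / SECONDS_PER_YEAR:.0f} years")
print(f"triples:     {n_triples:.1e} tests, {sec_triples / SECONDS_PER_YEAR:.1e} years")
```

Under these assumptions the single-SNP scan takes 8.3 hours, as quoted. The pairwise scan comes out near 143 years at 10 evaluations per second, so the quoted 1286 years presumably rests on a slower evaluation rate; either way, the step from pairs to triples multiplies the cost by roughly $10^5$.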

  20. 2.b Modeling
     Failure of traditional methods
     • A large number of SNPs are genotyped
       - "multiple comparisons" problem, very small p-values required for significance, which is even compounded in gene-environment interaction analyses.
     • Genetic loci may interact (epistasis) in their influence on the phenotype
       - loci with small marginal effects may go undetected
       - interested in the interaction itself
     • Curse of dimensionality and sparse "cells"

  21. Curse of dimensionality and sparse cells
     • For 2 SNPs, there are $9 = 3^2$ possible two-locus genotype combinations.
     • If the alleles are rare (MAF ≤ 10%), then some cells will be empty
     (slide: C Amos)

  22. Curse of dimensionality and sparse cells
     • For 4 SNPs, there are 81 possible combinations with more possible empty cells …
     (slide: C Amos)
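How quickly cells empty out can be illustrated with a small calculation. A minimal sketch, assuming independent loci in HWE that all share the same minor allele frequency (at most 0.5); the sample size of 400 is an arbitrary illustrative choice and the function names are hypothetical.

```python
def hwe_genotype_freqs(maf):
    """Genotype frequencies (0, 1, 2 minor alleles) under Hardy-Weinberg equilibrium."""
    q = maf
    p = 1.0 - q
    return [p * p, 2 * p * q, q * q]


def sparsest_cell(maf, n_loci, n_subjects):
    """Number of multi-locus genotype cells and the expected count in the rarest one.

    Assumes maf <= 0.5, so the rarest cell is the one that is
    minor-allele homozygous at every locus.
    """
    n_cells = 3 ** n_loci
    rarest_freq = hwe_genotype_freqs(maf)[2] ** n_loci
    return n_cells, rarest_freq * n_subjects


for k in (1, 2, 3, 4):
    cells, expected = sparsest_cell(maf=0.10, n_loci=k, n_subjects=400)
    print(f"{k} SNPs: {cells:3d} cells, expected count in rarest cell = {expected:.2g}")
```

With MAF = 10% and 400 subjects, the rarest two-locus cell is expected to hold 0.04 individuals, so it will almost always be empty.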

  23. Modeling: strategy 1
     Strategy 1: Set association approach
     • At each SNP, compute an association statistic T
     • Build the sum over the 1, 2, 3, etc. highest values of T
     • Evaluate the significance of each sum by a permutation test
     • The sum with the smallest p-value will point towards the markers to select
     • The smallest p is a single statistic; find its significance level
     • Is applicable to many SNPs and has also been used in microarray settings
     (Hoh et al 2001)
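The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation of Hoh et al: the per-SNP statistic T is a simple stand-in (squared difference in mean genotype between cases and controls), and the same set of label permutations is reused both for the per-sum p-values and for the significance of the smallest p. All function names are hypothetical.

```python
import random


def snp_statistic(genotypes, labels):
    """Stand-in for T: squared difference in mean genotype between cases and controls."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    diff = sum(cases) / len(cases) - sum(controls) / len(controls)
    return diff * diff


def top_sums(snp_columns, labels, max_k):
    """Sums of the 1, 2, ..., max_k largest per-SNP statistics."""
    stats = sorted((snp_statistic(col, labels) for col in snp_columns), reverse=True)
    sums, total = [], 0.0
    for t in stats[:max_k]:
        total += t
        sums.append(total)
    return sums


def p_values(sums, reference):
    """Permutation p-value of each sum against a list of permuted sums."""
    return [(1 + sum(ref[k] >= sums[k] for ref in reference)) / (1 + len(reference))
            for k in range(len(sums))]


def set_association(snp_columns, labels, max_k=3, n_perm=200, seed=1):
    rng = random.Random(seed)
    observed = top_sums(snp_columns, labels, max_k)
    shuffled = list(labels)
    permuted = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)                      # break genotype-status link
        permuted.append(top_sums(snp_columns, shuffled, max_k))
    p_by_size = p_values(observed, permuted)
    best_k = min(range(max_k), key=lambda k: p_by_size[k]) + 1
    # Significance of the smallest p: compare with the smallest p of each
    # permuted data set, evaluated against the remaining permutations.
    null_min_p = [min(p_values(s, permuted[:i] + permuted[i + 1:]))
                  for i, s in enumerate(permuted)]
    overall_p = (1 + sum(m <= min(p_by_size) for m in null_min_p)) / (1 + n_perm)
    return best_k, p_by_size, overall_p


# Toy data: 5 SNPs (one list per SNP), 20 cases followed by 20 controls.
data_rng = random.Random(0)
labels = [1] * 20 + [0] * 20
snp_columns = [[data_rng.choice([0, 1, 2]) for _ in labels] for _ in range(5)]
snp_columns[0] = [2 if y == 1 else 0 for y in labels]   # strongly associated SNP
best_k, p_by_size, overall_p = set_association(snp_columns, labels)
print(best_k, p_by_size, overall_p)
```

The second layer of permutation matters because taking the minimum over several sums is itself a selection step, so the smallest p-value cannot be read as an ordinary p-value.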

  24. Strategy 1: Set association approach (Hoh et al 2001)

  25. Modeling: strategy 2
     Strategy 2: Multi-locus approaches
     • Case-control studies far too often do not take into account the multi-locus nature of complex traits
     • When the aim is to analyze multiple SNPs or genes jointly, two classes of approaches emerge:
       - Combine (properties of) single-locus statistics over multiple SNPs to obtain a new multivariate test statistic
         (depending on whether the SNPs are in high LD or not, different measures need to be taken)
       - Look for patterns of genotypes at SNPs in different genomic locations

  26. Two frameworks for multi-locus approaches (Onkamo and Toivonen 2006)
     • Parametric methods:
       - Regression
       - Logistic or (Bagged) logic regression
     • Non-parametric methods:
     • Tree-based methods:
       - Recursive Partitioning (Helix Tree)
       - Random Forests (R, CART)
     • Pattern recognition methods:
       - Mining association rules
       - Neural networks (NN)
       - Support vector machines (SVM)
     • Data reduction methods:
       - DICE (Detection of Informative Combined Effects)
       - MDR (Multifactor Dimensionality Reduction)

  27. Non-parametric chi-square
     • The question is how to test for epistatic effects above and beyond (independent) main effects (of single-locus genotype effects)
       - Use the "usual" chi-square for interactions independent of main effects. Isolate individual df's.
       - Assess the difference in interactions between cases and controls, since then interactions may be more indicative of underlying pathways
     • Degrees of freedom in the 3 × 3 genotype table (Locus 1: AA, Aa, aa; Locus 2: BB, Bb, bb):

       Main effect locus 1   2 df
       Main effect locus 2   2 df
       Interactions          4 df
       Total                 8 df

  28. Partitioning chi-squares for one locus
     [Figure: the 2 df for one locus partitioned into two components of 1 df each]
     • Simple disease model, population frequency K = 0.10; N = 100 cases, 100 controls.
     • Predicted numbers of cases and controls in given genotype classes, and resulting odds ratios (OR)

  29. Partitioning chi-squares for two loci
     • The 3 × 3 table of genotypes (4 df) may be partitioned into 4 independent components, each with 1 df.
     • Do such partitioning for cases and controls each (Agresti 2002).
     [Figure: the four 2 × 2 subtables obtained from the 3 × 3 table of genotypes AA, Aa, aa by BB, Bb, bb]

  30. Partitioning chi-squares for two loci
     • Compare each of the four 2 × 2 subtables between cases and controls to see whether their odds ratios are the same

  31. Logistic regression
     • Logistic regression (LR) is a generalization of linear regression that models a dichotomous dependent variable as a function of continuous or discrete independent variables (Hosmer and Lemeshow 2000).
     • One of the most common procedures for variable selection in an LR analysis is step-wise logistic regression (step LR) (Hosmer and Lemeshow 2000).
       - In the step-wise procedure, each variable is tested for independent effects, and those variables with significant effects are included in the model.
       - In a second step, interaction terms of those variables with significant main effects are tested, and the significant interactions are included in the model.
     (Motsinger-Reif et al 2008)

  32. Logistic regression
     • LR is a de facto standard for traditional association studies. Using independent variables to predict a dichotomous dependent variable, LR by definition lacks the ability to characterize purely interactive effects.
     • Only variables that contain an independent main effect will be included in the final model.
     • To properly evaluate non-linear purely interactive effects, combinations of variables must be encoded as a single variable for inclusion in the analysis. Such an encoding scheme can be computationally expensive, depending on the number of variables used.
     (Motsinger-Reif et al 2008)

  33. Strategy 2: Look for patterns of genotypes using unrelated individuals
     • CPM = combinatorial partitioning method (Charlie Sing, U Michigan). Applicable to a small number (~50) of SNPs only.
     • MDR = multifactor-dimensionality reduction method (Jason Moore, Vanderbilt U)
     • LAD = logical analysis of data (P. Hammer, Rutgers U)
     • Mining association rules, Apriori algorithm (R. Agrawal)
     • Special approaches for microarray data
     (Hoh and Ott 2003)

  34. The MDR algorithm
     What is MDR?
     • A data mining approach to identify interactions among discrete variables that influence a binary outcome
     • A nonparametric alternative to traditional statistical methods such as logistic regression
     • Driven by the need to improve the power to detect gene-gene interactions
     (slide: L Mustavich)

  35. The 6 steps of MDR

  36. MDR Step 1
     • Divide the data (genotypes, discrete environmental factors, and affection status) into 10 distinct subsets

  37. MDR Step 2
     • Select a set of n genetic or environmental factors (which are suspected of epistasis together) from the set of all variables in the training set

  38. MDR Step 3
     • Create a contingency table for these multi-locus genotypes, counting the number of affected and unaffected individuals with each multi-locus genotype

  39. MDR Step 4
     • Calculate the ratio of cases to controls for each multi-locus genotype
     • Label each multi-locus genotype as "high-risk" or "low-risk", depending on whether the case-control ratio is above a certain threshold
     • This is the dimensionality reduction step: it reduces the n-dimensional space to 1 dimension with 2 levels
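Steps 3 and 4 amount to a counting exercise. A minimal sketch, assuming a threshold of 1 on the case-control ratio (natural for balanced data) and assigning ties to "low"; neither choice is stated on the slide, and `label_cells` is a hypothetical helper, not part of the MDR software.

```python
from collections import Counter


def label_cells(genotype_rows, labels, threshold=1.0):
    """Map each observed multi-locus genotype to 'high' or 'low' risk.

    genotype_rows: one tuple of genotypes per individual, restricted to the
                   selected factors; labels: 1 = case, 0 = control.
    """
    cases, controls = Counter(), Counter()
    for row, status in zip(genotype_rows, labels):
        if status == 1:
            cases[tuple(row)] += 1
        else:
            controls[tuple(row)] += 1
    risk = {}
    for cell in set(cases) | set(controls):
        # cases / controls > threshold, written without a division so that
        # cells without controls need no special handling.
        risk[cell] = "high" if cases[cell] > threshold * controls[cell] else "low"
    return risk


rows = [(0, 1), (0, 1), (0, 1), (1, 1), (1, 1), (1, 1), (0, 0)]
status = [1, 1, 0, 0, 0, 1, 0]
risk = label_cells(rows, status)
print(risk)
```

Whatever the number of factors, the output is a single binary attribute (high versus low risk), which is what makes the subsequent classification step one-dimensional.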

  40. MDR Step 5
     • To evaluate the model developed in Step 4, use the labels to classify individuals as cases or controls, and calculate the misclassification error
     • In fact, balanced accuracy is used (the arithmetic mean of sensitivity and specificity), which is mathematically equivalent to classification accuracy when the data are balanced

  41. Repeat Steps 2 to 5
     • All possible combinations of n factors are evaluated sequentially for their ability to classify affected and unaffected individuals in the training data, and the best n-factor model is selected in terms of minimal misclassification error

  42. MDR Step 6
     • The independent test data from the cross-validation are used to estimate the prediction error (testing accuracy) of the best model selected

  43. Towards MDR Final
     • Steps 1 through 6 are repeated for each possible cross-validation interval
     • The best model across all 10 training and testing sets is selected on the basis of the criterion:
       - Maximize the cross-validation consistency = the number of times a particular model was the best model across the cross-validation subsets
     • The end of a cross-validation procedure also allows one to compute the
       - average training accuracy
       - average testing accuracy
       of the best models over all cross-validation sets, and possibly over multiple runs (with different seeds, to reduce the chance of observing spurious results due to chance divisions of the data)
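The cross-validation loop described in Steps 1 to 6 can be sketched as follows. This is a simplified illustration, not the MDR software: it assumes a fixed interaction order, a high-risk threshold of 1, balanced accuracy as the fitness measure, stratified folds (so that every fold contains cases and controls), and at least `n_folds` cases and controls. All names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def high_risk_cells(rows, labels, snps):
    """Multi-locus genotypes (cells) holding more cases than controls."""
    balance = Counter()
    for row, y in zip(rows, labels):
        balance[tuple(row[s] for s in snps)] += 1 if y == 1 else -1
    return {cell for cell, diff in balance.items() if diff > 0}


def balanced_accuracy(rows, labels, snps, cells):
    """Mean of sensitivity and specificity of the high/low-risk rule."""
    tp = fn = tn = fp = 0
    for row, y in zip(rows, labels):
        predicted_case = tuple(row[s] for s in snps) in cells
        if y == 1:
            tp += predicted_case
            fn += not predicted_case
        else:
            fp += predicted_case
            tn += not predicted_case
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def mdr_cross_validate(rows, labels, order=2, n_folds=10, seed=1):
    rng = random.Random(seed)
    case_idx = [i for i, y in enumerate(labels) if y == 1]
    ctrl_idx = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(case_idx)
    rng.shuffle(ctrl_idx)
    folds = [case_idx[f::n_folds] + ctrl_idx[f::n_folds] for f in range(n_folds)]
    candidates = list(combinations(range(len(rows[0])), order))
    winners, test_scores = Counter(), []
    for test_idx in folds:
        held_out = set(test_idx)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        test_rows = [rows[i] for i in test_idx]
        test_y = [labels[i] for i in test_idx]
        # Fit every candidate model on the training part, keep the best one.
        fitted = {s: high_risk_cells(train_rows, train_y, s) for s in candidates}
        best = max(candidates,
                   key=lambda s: balanced_accuracy(train_rows, train_y, s, fitted[s]))
        winners[best] += 1
        test_scores.append(balanced_accuracy(test_rows, test_y, best, fitted[best]))
    model, consistency = winners.most_common(1)[0]
    return model, consistency, sum(test_scores) / n_folds


# Toy data: 4 SNPs, status determined by SNPs 0 and 1 only (odd genotype sum).
data_rng = random.Random(0)
rows = [[data_rng.randrange(3) for _ in range(4)] for _ in range(200)]
labels = [(r[0] + r[1]) % 2 for r in rows]
model, cvc, mean_test_accuracy = mdr_cross_validate(rows, labels)
print(model, cvc, round(mean_test_accuracy, 3))
```

The toy status rule has no marginal effect at either SNP, yet the pair (0, 1) is picked in every fold, which is the situation MDR is designed for.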

  44. MDR final
     • The entire process is repeated for each k = 1 to N loci combinations that are computationally feasible, and an optimal k-locus model is chosen for each level of k considered.
     • The final model is based on maximizing two criteria:
       - maximizing the (average) prediction accuracy
       - maximizing the (average) cross-validation consistency
     • Statistical significance is obtained by comparing the average cross-validation consistency from the observed data to the distribution of average consistencies under the null hypothesis of no associations, derived empirically from 1000 permutations
     (Ritchie et al 2001, Ritchie et al 2003, Hahn et al 2003)

  45. Several measures of fitness to compare models
     Balanced accuracy
     • Balanced accuracy (BA) weighs the classification accuracy of the two classes equally, and it is thought to be more powerful than using accuracy alone when data are imbalanced, i.e., when the counts of cases and controls are not equal (Velez et al 2007)
       - BA is calculated from a 2 × 2 table relating exposure to status by (sensitivity + specificity)/2.

                         Real case   Real control
       Model case        TP          FP
       Model control     FN          TN

     • When #cases = #controls, then TP + FN = FP + TN and
       BA = (TP + TN)/(2 × #cases) = (TP + TN)/(total sample size)
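For completeness, the equivalence with plain accuracy in balanced data can be written out. The symbols $n$ (the common number of cases and controls) and $N = 2n$ (the total sample size) are introduced here for the derivation only.

```latex
\[
\mathrm{BA}
  = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right)
  \quad\text{and, if } TP+FN = TN+FP = n,\quad
\mathrm{BA}
  = \frac{TP+TN}{2n}
  = \frac{TP+TN}{N}
  = \text{accuracy}.
\]
```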

  46. Several measures of fitness to compare models
     Model-adjusted balanced accuracy
     • Model-adjusted balanced accuracy uses in addition a different threshold in the MDR modeling, one that is based on the actual counts of case and control samples in the data.
       - When individuals have missing data, it accounts for the precise number of individuals with complete data for that particular multi-locus combination
       - This makes MDR robust to class imbalances (Velez et al 2007)

  47. Hypothesis test of best model
     • Evaluate the magnitude of the cross-validation consistency and prediction error estimates by adopting a permutation strategy
     • In particular:
       - Randomize disease labels
       - Repeat the MDR analysis several times (1000?) to get the distribution of cross-validation consistencies and prediction errors
       - Use these distributions to derive the p-values for the actual cross-validation consistencies and prediction errors
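The permutation strategy has the same shape whatever statistic is plugged in. A minimal sketch, in which a toy statistic stands in for a full MDR run (an assumption made to keep the example short) and the p-value uses the common "(exceedances + 1) / (permutations + 1)" convention; the function names are hypothetical.

```python
import random


def mean_genotype_difference(genotypes, labels):
    """Toy statistic: absolute difference in mean genotype, cases vs controls."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_p_value(statistic, genotypes, labels, n_perm=1000, seed=1):
    """Empirical p-value of the observed statistic under randomized disease labels."""
    rng = random.Random(seed)
    observed = statistic(genotypes, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                 # randomize disease labels
        if statistic(genotypes, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


genotypes = [1] * 20 + [0] * 20
labels = [1] * 20 + [0] * 20                  # genotype perfectly tracks status
observed, p_value = permutation_p_value(mean_genotype_difference, genotypes, labels)
print(observed, p_value)
```

In an actual MDR analysis the statistic would be the cross-validation consistency or the prediction error of the best model, recomputed from scratch on each permuted data set.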

  48. An Example Empirical Distribution
     Sample quantiles:

       0%       0.045754
       25%      0.168814
       50%      0.237763
       75%      0.321027
       90%      0.423336
       95%      0.489813
       99%      0.623899
       99.99%   0.872345
       100%     1

     [Figure: histogram of the empirical distribution; x-axis from 0.2 to 1.0, y-axis: frequency]

  49. The probability that we would see results as extreme as, or more extreme than, for instance 0.4500 simply by chance is between 5% and 10% (slide: L Mustavich)
     The MDR Software
     Downloads
     • Available from www.sourceforge.net
     • The MDR method is described in further detail by Ritchie et al. (2001) and reviewed by Moore and Williams (2002).
     • An MDR software package is available from the authors by request, and is described in detail by Hahn et al. (2003). More information can also be found at http://phg.mc.vanderbilt.edu/Software/MDR

  50. The authors
     • Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Hahn, Ritchie, Moore, 2003.
     Required operating system software
     • Linux (Fedora Core 3): Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_06-b03); Java HotSpot(TM) Client VM (build 1.4.2_06-b03, mixed mode)
     • Windows (XP Professional and XP Home): Java(TM) 2 Runtime Environment, Standard Edition (build v1.4.2_05)
     Minimum system requirements
     • 1 GHz processor
     • 256 MB RAM
     • 800x600 screen resolution


  52. Application to simulated data
     • To show MDR in action, we simulated 200 cases and 200 controls using different multi-locus epistasis models (Evans 2006)
       - Scenario 1: 10 SNPs, adapted epistasis model M170, minor allele frequencies of disease susceptibility pair 0.5
       - Scenario 2: 10 SNPs, epistasis model M27, minor allele frequencies of disease susceptibility pair 0.25

       M170   0     1     2          M27   0     1     2
       0      0     0.1   0          0     0     0     0
       1      0.1   0     0.1        1     0     0.1   0.1
       2      0     0.1   0          2     0     0.1   0.1

     • All markers were assumed to be in HWE. No LD between the markers.

  53. Application to simulated data
     Marginal distributions for the controls (last column and last row: margins)

       M170   0      1      2      margin       M27   0      1      2      margin
       0      0.07   0.12   0.07   0.25         0     0.15   0.29   0.15   0.58
       1      0.12   0.26   0.12   0.50         1     0.10   0.17   0.09   0.36
       2      0.07   0.12   0.07   0.25         2     0.02   0.03   0.01   0.06
       margin 0.25   0.50   0.25                margin 0.26  0.49   0.25

     Marginal distributions for the cases (last column and last row: margins)

       M170   0      1      2      margin       M27   0      1      2      margin
       0      0.00   0.25   0.00   0.25         0     0      0.00   0.00   0.00
       1      0.25   0.00   0.25   0.50         1     0      0.57   0.29   0.86
       2      0.00   0.25   0.00   0.25         2     0      0.10   0.05   0.14
       margin 0.25   0.50   0.25                margin 0.00  0.66   0.33

  54. Data format
     • The definition of the format is as follows:
       - All fields are tab-delimited.
       - The first line contains a header row. This row assigns a label to each column of data. Labels should not contain whitespace.
       - Each following line contains a data row. Data values may be any string value which does not contain whitespace.
       - The right-most column of data is the class, or status, column. The data values for this column must be 1, to represent "Affected" or "Case" status, or 0, to represent "Unaffected" or "Control" status. No other values are allowed.
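The format rules above are easy to enforce programmatically. A minimal sketch that writes and re-validates such a file; the helper names and the column label "Class" are illustrative choices, not requirements of the MDR software beyond what the slide states.

```python
import csv
import io


def write_mdr_file(handle, snp_names, genotype_rows, classes, class_label="Class"):
    """Write a tab-delimited file: header row, then one data row per individual."""
    header = list(snp_names) + [class_label]
    if any(ch.isspace() for name in header for ch in name):
        raise ValueError("labels must not contain whitespace")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for row, status in zip(genotype_rows, classes):
        if status not in (0, 1):
            raise ValueError("class must be 1 (case) or 0 (control)")
        writer.writerow(list(row) + [status])


def read_mdr_file(handle):
    """Read the file back, checking the rules listed on the slide."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    records = []
    for fields in reader:
        if len(fields) != len(header):
            raise ValueError("row length differs from header length")
        if any(f == "" or any(ch.isspace() for ch in f) for f in fields):
            raise ValueError("values must be non-empty and free of whitespace")
        if fields[-1] not in ("0", "1"):
            raise ValueError("right-most column must be 0 or 1")
        records.append(fields)
    return header, records


buffer = io.StringIO()
write_mdr_file(buffer, ["SNP1", "SNP2"], [[1, 2], [0, 0]], [1, 0])
text = buffer.getvalue()
header, records = read_mdr_file(io.StringIO(text))
print(text)
```

Validating on read as well as on write catches files produced by other tools, for instance exports that drop the header row or code the status as 1/2.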

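  55. Easy data conversion

       > M170data[1,]
            [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
       [1,]    1    2    2    2    1    1    1    1    1     1     1     1     1     2     1     1     2     1     2     1

       # ss (number of individuals) and nsnps (number of SNPs) are assumed to be defined already.
       # Each SNP occupies two allele columns coded 1/2; their sum minus 2 is the genotype 0/1/2.
       M170data <- rbind(M170.cases, M170.controls)
       M170ccdata <- matrix(NA, nrow = ss, ncol = nsnps)
       for (i in 1:nsnps) {
         M170ccdata[, i] <- apply(M170data[, c(2*i - 1, 2*i)], 1, sum) - 2
       }
       # Class column: 200 cases (1) followed by 200 controls (0).
       M170ccdata <- cbind(M170ccdata, c(rep(1, 200), rep(0, 200)))
       # The MDR format requires a header row, so name the columns and write them out.
       colnames(M170ccdata) <- c(paste0("SNP", 1:nsnps), "Class")
       write.table(M170ccdata, "M170ccdata.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)

       > M170ccdata[1,]
            [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
       [1,]    1    2    0    0    0    0    1    0    1     1     1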

  56. M170 case control data

       SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9  SNP10  Class
       1     2     0     0     0     0     1     0     1     1      1
       1     2     1     1     0     2     0     0     0     1      1
       1     2     0     0     0     0     0     0     1     1      1
       2     1     0     0     0     0     2     2     1     0      1
       2     1     0     0     1     0     0     1     1     1      1
       …
       0     0     0     1     1     1     1     1     0     1      0
       1     1     0     0     1     2     0     1     0     0      0
       1     2     0     0     0     1     0     0     1     0      0
       2     2     0     0     0     0     1     0     2     0      0
       1     0     1     0     1     1     1     0     1     2      0

  57. Loading a data file (MDR 2.0 beta 3)

  58. Configuring the analysis

  59. Reducing the number of cross-validations: CV=10 versus CV=3 (Motsinger and Ritchie 2006)

  60. Reducing the number of cross-validations (CVs)
     • In general, CV is a useful approach for limiting false positives by assessing the generalizability of models (Coffey et al 2004)
     • The number of CV intervals in an MB-MDR analysis can be reduced from 10 to 5, but not to 3
     • CV seems to be rather important in the MDR algorithm:
       - Motsinger and Ritchie (2003) showed that, without CV, selection of a final model is difficult, but that it is encouraging that the false-positive results almost always include at least one correct functional locus.
       - This indicates that perhaps, in the case of extremely large datasets, like genome-wide scans, where using any type of CV would be computationally infeasible, MDR could still be used (without CV) to identify at least one functional locus…

  61. Search method configuration

  62. Running the MDR analysis

  63. Summary of results

  64. Best MDR model

  65. MDR best model

  66. Values calculated by MDR

       Measure             Formula / Interpretation
       Balanced Accuracy   (Sensitivity + Specificity)/2; fitness measure. Accuracy is skewed in favor of the larger class, whereas balanced accuracy gives equal weight to each class
       Accuracy            (TP+TN)/(TP+TN+FP+FN); proportion of instances correctly classified (skewed in favor of the larger class)
       Sensitivity         TP/(TP+FN); proportion of true cases that are classified as cases
       Specificity         TN/(TN+FP); proportion of true controls that are classified as controls
       Odds Ratio          (TP*TN)/(FP*FN); compares whether the probability of a certain event is the same for two groups

  67. Values calculated by MDR

       Measure     Formula / Interpretation
       Precision   TP/(TP+FP); proportion of individuals classified as cases that are true cases
       Kappa       2(TP*TN - FP*FN)/[(TP+FN)(FN+TN) + (TP+FP)(FP+TN)]; a function of total accuracy and random accuracy
       $X^2$       Chi-squared score for the attribute constructed by MDR from this attribute combination
       F-Measure   2*TP/(2*TP+FP+FN); a function of sensitivity and precision

     TP: true positive; TN: true negative; FP: false positive; FN: false negative
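The measures in the two tables above can be computed from the four cell counts of the 2 × 2 classification table. A minimal sketch; kappa is computed here as Cohen's kappa, whose numerator is 2(TP·TN − FP·FN), and no guard against empty rows or columns is included. The function name is illustrative.

```python
def classification_measures(tp, fp, fn, tn):
    """Fitness measures from a 2 x 2 table of model class versus real class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "odds_ratio": (tp * tn) / (fp * fn),
        "precision": tp / (tp + fp),
        # Cohen's kappa for a 2 x 2 table.
        "kappa": 2 * (tp * tn - fp * fn)
                 / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }


measures = classification_measures(tp=160, fp=40, fn=40, tn=160)
for name, value in measures.items():
    print(f"{name:18s} {value:.3f}")
```

For this balanced example accuracy and balanced accuracy coincide at 0.8, kappa is 0.6 (observed agreement 0.8 against chance agreement 0.5), and the odds ratio is 16.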

  68. MDR CV results
     Values over the 10 cross-validation intervals: 0.8028, 0.8028, 0.8056, 0.7861, 0.7972, 0.8000, 0.8056, 0.7889, 0.7944, 0.7917
     Average = 0.79751

  69. MDR best model
     • If-then rules on whole data
     • Graphical display on whole data

  70. The fitness landscape
     • Gives the fitness landscape across all models as a line chart (the default).
       - The models produced are on the x-axis of the chart, in the order in which they were generated (e.g., 1, 2, 3, …, 12, 13, 14, …)
       - Training accuracy is shown on the y-axis.

  71. The fitness landscape

       Model        Training accuracy
       SNP1         0.5127778
       SNP2         0.5286111
       SNP3         0.52527773
       SNP4         0.51555556
       SNP5         0.5875
       SNP6         0.5127778
       SNP7         0.5158334
       SNP8         0.5141667
       SNP9         0.5144445
       SNP10        0.5233334
       SNP1,SNP2    0.7975
       SNP1,SNP4    0.5375
       SNP1,SNP5    0.5916667
       SNP1,SNP3    0.5372222
       …

  72. Locus Dendrogram
     • The "interaction dendrogram" provides a graphical representation of the interactions between attributes (and the strength of those interactions) from the MDR analysis (up to the maximum number of interactions asked for).
     • The purpose of the interaction dendrogram is to assist the user with determining the nature of the interactions (redundant, additive, or synergistic).

  73. Locus Dendrogram
     • The dendrogram is constructed using hierarchical cluster analysis with average linkage.
     • The distance matrix used by the cluster analysis is constructed by calculating the information gained by constructing two attributes using the MDR function (Moore et al 2006, Jakulin and Bratko 2003, Jakulin et al 2003)

  74. Raw entropy values
     • Entropy is a measure of randomness or disorder within a system: the lower the entropy, the more ordered (and the more predictable) the system.
     • A classic example of this principle is ice melting in a glass: the system becomes more disordered as the entropy increases.
     [Figure: a graphical illustration of the relationships between information theoretic measures on the joint distribution of attributes A and B. The surface area of a section corresponds to the labeled quantity (Jakulin 2003). I(A;B) = mutual information = the amount of information provided by A about B = information gain; H(A) = entropy of A]

  75. Raw entropy values
      • Let us assume an attribute, A, whose probability distribution P_A(a) we have observed. Shannon’s entropy, measured in bits, quantifies how unpredictable an attribute is and is defined as:
        $$H(A) = -\sum_{a} P_A(a) \log_2 P_A(a)$$
      • Phrased differently, the higher the entropy, the less reliable are our predictions about A. We can understand H(A) as the amount of uncertainty about A, as estimated from its probability distribution.
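
As a small worked illustration of the definition, the sketch below estimates H(A) in bits from a list of observed values, using the relative frequencies as P_A(a). The function name `shannon_entropy` and the toy genotype codes are assumptions made for this example only.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Estimate H(A) in bits from the observed values of attribute A."""
    n = len(values)
    counts = Counter(values)
    # sum of p * log2(1/p), which equals -sum of p * log2(p)
    return sum((c / n) * log2(n / c) for c in counts.values())


# Hypothetical genotype codes (0 = aa, 1 = aA, 2 = AA).
constant = [1, 1, 1, 1]        # fully predictable attribute
two_equal = [0, 1, 0, 1]       # two equally frequent values
four_equal = [0, 1, 2, 3]      # four equally frequent values

print(shannon_entropy(constant))    # 0.0 bits: no uncertainty
print(shannon_entropy(two_equal))   # 1.0 bit
print(shannon_entropy(four_equal))  # 2.0 bits
```

The more evenly the values are spread, the higher the entropy and the harder A is to predict.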

  76. Raw entropy values
      • Single Attribute Values:
        - H(A): This is the entropy of the given attribute (A)
        - H(A|C): This is the entropy of the given attribute (A) given the class (C)
        - I(A;C): This is the information gain of the given attribute (A) given the class (C)
      • Pairwise Values:
        - H(AB): This is the entropy of the given constructed attribute (AB)
        - H(AB|C): This is the entropy of the given constructed attribute (AB) given the class (C)
        - I(A;B): This is the information gain of attribute (A) given attribute (B)
        - I(A;B;C): This is the information gain for attribute (A) and attribute (B) given the class (C), i.e. the interaction information
        - I(AB;C): This is the information gain for the constructed attribute (AB) given the class (C)
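
The single-attribute and pairwise quantities listed above can be computed from raw data with a few standard identities, H(A|C) = H(A,C) − H(C) and I(A;C) = H(A) − H(A|C). The sketch below is illustrative only: the helper names and toy data are hypothetical, and the constructed attribute AB is taken to be the plain pair of genotypes, whereas the MDR software builds it with the MDR function.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy in bits."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())


def conditional_entropy(values, given):
    """H(A|C) = H(A,C) - H(C)."""
    return entropy(list(zip(values, given))) - entropy(given)


def information_gain(values, given):
    """I(A;C) = H(A) - H(A|C)."""
    return entropy(values) - conditional_entropy(values, given)


# Hypothetical toy data: the class C is the exclusive-or of A and B.
A = [0, 0, 1, 1, 0, 0, 1, 1]
B = [0, 1, 0, 1, 0, 1, 0, 1]
C = [0, 1, 1, 0, 0, 1, 1, 0]
AB = list(zip(A, B))  # assumption: joint genotype as the constructed attribute

print("H(A)    =", entropy(A))
print("H(A|C)  =", conditional_entropy(A, C))
print("I(A;C)  =", information_gain(A, C))
print("H(AB)   =", entropy(AB))
print("H(AB|C) =", conditional_entropy(AB, C))
print("I(A;B)  =", information_gain(A, B))
print("I(AB;C) =", information_gain(AB, C))
```

In this toy data set neither A nor B alone says anything about C (both gains are 0 bits), while the constructed attribute AB determines C completely (1 bit).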

  77. Raw entropy values
      • Mutual information I(A;B) as a function of r² (as a measure of LD between markers), for a subset of the Spanish Bladder Cancer data (SBCS) (unpublished results)

  79. Locus dendrogram
      • The colors range from red representing a high degree of synergy (positive information gain), orange a lesser degree, and gold representing the midway point between synergy and redundancy.
      • On the redundancy end of the spectrum, the highest degree is represented by the blue color (negative information gain) with a lesser degree represented by green.
      • Synergy: the interaction between two attributes provides more information than the sum of the individual attributes.
      • Redundancy: the interaction between attributes provides redundant information.

  80. Positive and negative interactions
      • Define the interaction information as I(A;B;C) = I(A,B;C) − I(A;C) − I(B;C).
      • Assume that we are uncertain about the value of C, but we have information about A and B.
        - Knowledge of A alone eliminates I(A;C) bits of uncertainty from C.
        - Knowledge of B alone eliminates I(B;C) bits of uncertainty from C.
        - However, the joint knowledge of A and B eliminates I(A,B;C) bits of uncertainty.
      • Hence, if the interaction information is positive, we benefit from an unexpected synergy. If the interaction information is negative, we suffer diminishing marginal returns by introducing attributes that partly contribute redundant information.
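
The sign of the interaction information can be checked on two small constructed cases. The sketch below evaluates I(A;B;C) = I(A,B;C) − I(A;C) − I(B;C) from empirical frequencies. The function names and both toy data sets are hypothetical and chosen only to produce a clear synergy and a clear redundancy.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy in bits."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_information(a, b, c):
    """I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C)."""
    joint = list(zip(a, b))
    return (mutual_information(joint, c)
            - mutual_information(a, c)
            - mutual_information(b, c))


# Synergy: C is the exclusive-or of A and B, so only the pair predicts C.
A = [0, 0, 1, 1]
B = [0, 1, 0, 1]
C_xor = [0, 1, 1, 0]
synergy = interaction_information(A, B, C_xor)

# Redundancy: B duplicates A, so adding B tells us nothing new about C.
A_dup = [0, 0, 1, 1]
B_dup = [0, 0, 1, 1]
C_dup = [0, 0, 1, 1]
redundancy = interaction_information(A_dup, B_dup, C_dup)

print("synergy example:   ", synergy)     # positive
print("redundancy example:", redundancy)  # negative
```

The first case yields +1 bit: jointly A and B remove all uncertainty about C, although each alone removes none. The second yields −1 bit: A and B each remove 1 bit, but it is the same bit.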

  81. Significance of the results
      • We simulated data from a two-locus epistasis model.
      • The remaining SNPs were generated at random…
      • Hence, what does it mean that SNP5 was chosen as the best single-effect model? Answer: every k-locus setting will give rise to a “best” model. MDR always returns an optimal model for each k-locus setting, whether or not a real effect exists.

  82. Significance of results
      • The best model among all one- to three-locus models is the one with maximal cross-validation consistency and maximal average balanced prediction accuracy.
      • But how significant is this result?

  83. Configuring the permutation analysis (MDR PT Module 0.4.8 alpha)
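
The idea behind a permutation analysis can be sketched in a few lines: shuffle the case/control labels many times, recompute the statistic on each shuffled data set, and report how often the shuffled statistic is at least as large as the observed one. This is a simplified stand-in, not the MDR PT module itself. A full MDR permutation analysis would typically rerun the whole model search on every permuted data set, whereas this sketch uses the mutual information between one attribute and the status as a hypothetical statistic. All names and data below are illustrative assumptions.

```python
import random
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy in bits."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def permutation_p_value(attribute, status, n_permutations=1000, seed=0):
    """Empirical p-value for the association between attribute and status."""
    rng = random.Random(seed)        # fixed seed for reproducibility
    observed = mutual_information(attribute, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)        # break any attribute/status link
        if mutual_information(attribute, shuffled) >= observed:
            hits += 1
    # Add 1 to numerator and denominator so the p-value is never exactly 0.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical data: 20 controls (0) followed by 20 cases (1).
status = [0] * 20 + [1] * 20
associated = [0] * 20 + [1] * 20     # attribute that tracks status perfectly
unrelated = [0, 1] * 20              # attribute balanced within each group

p_associated = permutation_p_value(associated, status)
p_unrelated = permutation_p_value(unrelated, status)
print("associated attribute p-value:", p_associated)
print("unrelated attribute p-value: ", p_unrelated)
```

A small p-value means the observed statistic is rarely matched under shuffled labels. A large p-value means a result of that size is what label shuffling alone produces, which is the situation a "best" model from pure noise would be in.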
