Data Mining in Bioinformatics Day 8: Feature Selection - Epistasis Detection Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institute Tübingen and Eberhard Karls Universität Tübingen Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Classic and new questions
Genetics
- How does genotypic variation lead to phenotypic variation?
- Can we predict phenotypes based on the genotype of an individual?
Recent progress
- Genotypes can be determined at an unprecedented level of detail.
- Phenotypes can be recorded in an automated manner.
Genome-wide association mapping
[Figure by courtesy of D. Weigel]
Phenotype prediction
Arabidopsis phenotypes (99-199 plants, 250k SNPs; Atwell et al., 2010)

Phenotype               AUC (SVM)
Chlorosis at 22 °C      0.629 ± 0.003
Anthocyanin at 16 °C    0.569 ± 0.003
Anthocyanin at 22 °C    0.609 ± 0.003
Leaf Roll at 10 °C      0.696 ± 0.002
Leaf Roll at 22 °C      0.587 ± 0.004

Why is there room for improvement?
- We assume additive effects of SNPs and ignore gene-gene and gene-environment interactions.
- We ignore population structure, that is, systematic ancestry differences between cases and controls.
Epistasis - what it means I (Cordell, 2002)
Bateson's masking effect model
Bateson defines epistasis as a masking effect, whereby a variant or allele at one locus prevents the variant at another locus from manifesting its effect.

Genotype at locus B / G    gg       gG      GG
bb                         White    Grey    Grey
bB                         Black    Grey    Grey
BB                         Black    Grey    Grey

Example of phenotypes (e.g. hair colour) obtained from different genotypes at two loci interacting epistatically under Bateson's (1909) definition of epistasis.
Epistasis - what it means II (Cordell, 2002)
Epistasis in a general sense

Genotype at locus A / B    bb    bB    BB
aa                         0     0     0
aA                         0     1     1
AA                         0     1     1

Example of a penetrance table for two loci interacting epistatically in a general sense.
Epistasis - what it means III (Cordell, 2002)
Genetic heterogeneity model

Genotype at locus A / B    bb    bB    BB
aa                         0     0     1
aA                         0     0     1
AA                         1     1     1

Example of a penetrance table for two loci acting together in a heterogeneity model.
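The general-epistasis and heterogeneity penetrance tables can each be written as a rule on the genotypes at the two loci. The following is a minimal sketch, assuming genotypes are coded as the number (0, 1, 2) of capital-letter alleles at each locus; the function names are illustrative and not part of the lecture.

```python
def penetrance_epistasis(a, b):
    """General epistasis table: affected only if both loci carry
    at least one capital-letter allele."""
    return int(a >= 1 and b >= 1)


def penetrance_heterogeneity(a, b):
    """Heterogeneity table: affected if either locus is homozygous
    for the capital-letter allele."""
    return int(a == 2 or b == 2)


def table(rule):
    # Rows: locus A genotype (aa, aA, AA); columns: locus B genotype (bb, bB, BB).
    return [[rule(a, b) for b in range(3)] for a in range(3)]
```

Calling `table(penetrance_epistasis)` or `table(penetrance_heterogeneity)` rebuilds the corresponding 3 x 3 table row by row.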
Epistasis - what it means IV
Regression model
Most popular statistical definition:

$$ y = \theta_i x_i + \theta_j x_j + \theta_{(i,j)} \, x_i \odot x_j + \epsilon \qquad (1) $$

Test whether $\theta_{(i,j)}$ is significantly different from zero; rank pairs by the resulting p-value.
Other common measures of association include, e.g., the F-statistic and Pearson's correlation coefficient.
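To make the test on the interaction coefficient concrete, here is a minimal standard-library sketch for a single SNP pair. It departs from equation (1) in two labeled ways: it adds an intercept term, and it uses a normal approximation to the t-distribution for the p-value (reasonable only for large sample sizes). The function names `solve` and `interaction_test` are hypothetical.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def interaction_test(xi, xj, y):
    """Fit y = b0 + th_i*xi + th_j*xj + th_ij*(xi*xj) by ordinary least
    squares. Returns (th_ij, z, p); p is two-sided and uses a normal
    approximation (an assumption of this sketch)."""
    X = [[1.0, a, b, a * b] for a, b in zip(xi, xj)]
    n, k = len(X), 4
    XtX = [[sum(r[p] * r[q] for r in X) for q in range(k)] for p in range(k)]
    Xty = [sum(r[p] * t for r, t in zip(X, y)) for p in range(k)]
    beta = solve(XtX, Xty)
    rss = sum((t - sum(c * v for c, v in zip(beta, r))) ** 2
              for r, t in zip(X, y))
    sigma2 = rss / (n - k)
    # Var(th_ij) = sigma^2 * [(X^T X)^{-1}]_{33}, obtained by solving
    # (X^T X) v = e_3 and reading off the last entry of v.
    var = sigma2 * solve(XtX, [0.0, 0.0, 0.0, 1.0])[3]
    z = beta[3] / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta[3], z, p
```

In a genome-wide scan this function would be called once per SNP pair and the pairs ranked by `p`; the sketch assumes the four design columns are linearly independent and that the residual variance is nonzero.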
Epistasis - what it means V (Marchini et al., 2005)
Model 1: Multiplicative interaction within and between loci

Locus A / B   bb                       bB                                   BB
aa            $\alpha$                 $\alpha(1+\theta_2)$                 $\alpha(1+\theta_2)^2$
aA            $\alpha(1+\theta_1)$     $\alpha(1+\theta_1)(1+\theta_2)$     $\alpha(1+\theta_1)(1+\theta_2)^2$
AA            $\alpha(1+\theta_1)^2$   $\alpha(1+\theta_1)^2(1+\theta_2)$   $\alpha(1+\theta_1)^2(1+\theta_2)^2$

The odds increase multiplicatively with genotype both within and between loci.
Epistasis - what it means VI (Marchini et al., 2005)
Model 2: Two-locus interaction with multiplicative effects

Locus A / B   bb         bB                     BB
aa            $\alpha$   $\alpha$               $\alpha$
aA            $\alpha$   $\alpha(1+\theta)$     $\alpha(1+\theta)^2$
AA            $\alpha$   $\alpha(1+\theta)^2$   $\alpha(1+\theta)^4$

In this model, the odds have a baseline value ($\alpha$) unless both loci have at least one disease-associated allele. After that, the odds increase multiplicatively within and between genotypes.
Epistasis - what it means VII (Marchini et al., 2005)
Model 3: Two-locus interaction with threshold effects

Locus A / B   bb         bB                   BB
aa            $\alpha$   $\alpha$             $\alpha$
aA            $\alpha$   $\alpha(1+\theta)$   $\alpha(1+\theta)$
AA            $\alpha$   $\alpha(1+\theta)$   $\alpha(1+\theta)$

In this model, the odds have a baseline value ($\alpha$) unless both loci have at least one disease-associated allele. In that case, the odds are $\alpha(1+\theta)$.
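The three odds tables of Models 1 to 3 can be generated from closed-form expressions in the risk-allele counts. This is a small sketch, assuming `a` and `b` count the disease-associated (capital-letter) alleles at loci A and B (0, 1 or 2); the function names are illustrative.

```python
def odds_model1(a, b, alpha, theta1, theta2):
    """Model 1: multiplicative within and between loci."""
    return alpha * (1 + theta1) ** a * (1 + theta2) ** b


def odds_model2(a, b, alpha, theta):
    """Model 2: baseline odds unless both loci carry a risk allele;
    the exponent a*b reproduces the entries 1, 2, 2, 4 of the table."""
    return alpha * (1 + theta) ** (a * b)


def odds_model3(a, b, alpha, theta):
    """Model 3: baseline odds unless both loci carry a risk allele;
    then one fixed threshold effect."""
    return alpha * (1 + theta) if a >= 1 and b >= 1 else alpha


def odds_table(model, *params):
    # Rows: locus A genotype (aa, aA, AA); columns: locus B genotype (bb, bB, BB).
    return [[model(a, b, *params) for b in range(3)] for a in range(3)]
```

For example, `odds_table(odds_model2, 1, 1)` evaluates Model 2 with baseline odds 1 and effect size 1, so every factor $(1+\theta)$ becomes a factor of 2.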
Epistasis - what it means VIII (Marchini et al., 2005)
[Figure by courtesy of J. Marchini]
Impact of Epistasis
Examples of Epistasis
- Epistasis is conjectured to be one source of missing heritability (Manolio et al., 2009).
- Genetic interactions are one indicator that epistasis is a major factor in the genotype-phenotype relationship (e.g. Boone et al., 2007).
- Pairs of genes have been reported to affect complex diseases such as breast cancer (Ashworth et al., 2011):
  - Loss of either BRCA1 or BRCA2 tumor suppressor gene function in cells triggers a cell-cycle arrest at the G2/M checkpoint that can be suppressed by the inactivation of P53 (Connor et al., 1997 and Liu et al., 2007).
  - Loss of VHL (Von Hippel-Lindau tumor suppressor) function normally causes cellular senescence, but inactivation of a second tumor suppressor, RB (Retinoblastoma), can suppress this process (Young et al., 2008).
Bottlenecks in two-locus mapping
Scale of the problem
- Typical datasets include on the order of $10^5$ to $10^7$ SNPs.
- Hence we have to consider on the order of $10^{10}$ to $10^{14}$ SNP pairs.
- Enormous multiple hypothesis testing problem.
- Enormous computational runtime problem.
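The number of pairs follows from counting unordered pairs, $m(m-1)/2$ for $m$ SNPs, which gives about $5 \times 10^9$ pairs for $10^5$ SNPs and about $5 \times 10^{13}$ for $10^7$ SNPs. The sketch below also prints a Bonferroni-corrected significance threshold to illustrate the multiple-testing burden; Bonferroni is an assumption chosen here for illustration, not a correction prescribed by the lecture, and the names are hypothetical.

```python
def n_pairs(m):
    """Number of unordered SNP pairs among m SNPs."""
    return m * (m - 1) // 2


def bonferroni_threshold(m, level=0.05):
    """Per-test significance threshold when all pairs are tested
    (Bonferroni correction; an illustrative choice)."""
    return level / n_pairs(m)


for m in (10**5, 10**6, 10**7):
    print(f"{m:.0e} SNPs -> {n_pairs(m):.2e} pairs, "
          f"per-test threshold {bonferroni_threshold(m):.1e}")
```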
Common approaches in the literature
Exhaustive enumeration
- Only with special hardware such as GPU implementations: EPIBLASTER (Kam-Thong et al., EJHG 2010)
Filtering approaches
- Statistical criterion, e.g. SNPs with large main effect (Zhang et al., 2007)
- Biological criterion, e.g. underlying PPI (Emily et al., 2009)
Index structure approaches
- fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008)
- TEAM, efficient updates of contingency tables (Zhang et al., 2010)
Family I: Exhaustive search
Exhaustive enumeration
- Concept: Run through all pairs of SNPs exhaustively.
- Setback: On standard PCs, such searches are limited to hundreds of SNPs.
- Workaround: Use special hardware, such as
  - computing clusters
  - graphics processing units (GPUs)
  - cloud computing
- Current limitation: These solutions tend to work for datasets that are currently available, but they may not cope with future increases in sample size or in the number of SNP markers.
Engineering approach to epistasis detection
- GPUs are heavily optimized for basic matrix operations.
- Computational power is much cheaper on GPUs than on CPUs.
- We exploit the power of GPUs for rapid exhaustive SNP-SNP interaction detection:
  - EPIBLASTER: Difference in correlation for binary phenotypes (Kam-Thong et al., EJHG 2010)
  - EPIGPUHSIC: Kernel-based test for arbitrary phenotypes (Kam-Thong et al., ISMB 2011)
  - CUDALIN: Regression model with main effects (Kam-Thong et al., submitted)
Available from http://agkb.is.tuebingen.mpg.de/Forschung/epistasistools/
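The "difference in correlation" idea for binary phenotypes can be illustrated on the CPU with a few lines of Python: for every SNP pair, compute Pearson's correlation among cases and among controls and take the difference. This is only a sketch of the scoring idea named above, under the assumption that the raw difference is used as the score; it is not the EPIBLASTER implementation, whose normalization and GPU details are not given here. All function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation of two equal-length lists; returns 0.0 if
    either list is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def correlation_difference_scan(snps, labels):
    """snps: one genotype list per SNP; labels: 1 = case, 0 = control.
    Returns {(i, j): r_cases - r_controls} for all SNP pairs i < j."""
    cases = [k for k, lab in enumerate(labels) if lab == 1]
    ctrls = [k for k, lab in enumerate(labels) if lab == 0]
    scores = {}
    for i, j in combinations(range(len(snps)), 2):
        r_case = pearson([snps[i][k] for k in cases],
                         [snps[j][k] for k in cases])
        r_ctrl = pearson([snps[i][k] for k in ctrls],
                         [snps[j][k] for k in ctrls])
        scores[(i, j)] = r_case - r_ctrl
    return scores
```

The double loop is what makes the exhaustive scan expensive; the correlations for all pairs can equally be expressed as products of standardized genotype matrices, which is the kind of basic matrix operation GPUs are optimized for.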
First finding
- 567 subjects
- 1,075,163 SNPs
- Phenotype: hippocampus volume
- Genome-wide significant results ($p < 10^{-12}$) near genes involved in cell-cell signaling
[Figure by P. Sämann]
Multifactor-Dimensionality Reduction
Properties of MDR (Ritchie et al., 2001)
- A model-free and non-parametric approach to epistasis detection.
- Was proposed to overcome the problem that the type of encoding of SNPs affects the results in generalized linear models; does not assume a specific genetic model.
- Measures the association between SNPs and disease risk using the prediction accuracy of selected multifactor models.
Limitations:
- Runs exhaustively through all SNP combinations and detects the best model.
- Resulting models may be difficult to interpret.
- Original variant only considers balanced case-control datasets.
Multifactor-Dimensionality Reduction
Algorithm of MDR
1. A set of n genetic and/or discrete environmental factors is selected from the pool of all factors.
2. The n factors and their possible multifactor classes or cells are represented in n-dimensional space. Then, the ratio of the number of cases to the number of controls is estimated within each multifactor class.
3. Each multifactor cell in n-dimensional space is labeled either as high-risk if the cases:controls ratio meets or exceeds some threshold, or as low-risk if that threshold is not exceeded. This reduces the n-dimensional model to a one-dimensional model.
4. The prediction error of each model is estimated by 10 repetitions of 10-fold cross-validation.
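Steps 2 to 4 can be sketched for one fixed set of factors as follows. This is a simplified illustration, not the reference MDR software: the threshold of 1.0 is an assumed default, unseen cells are treated as low-risk, and the cross-validation uses a single pass with deterministic folds rather than 10 repetitions. All function names are hypothetical.

```python
from collections import Counter


def mdr_fit(genos, labels, threshold=1.0):
    """genos: one genotype tuple per individual (one entry per factor);
    labels: 1 = case, 0 = control.
    Returns the set of multifactor cells labelled high-risk."""
    cases, ctrls = Counter(), Counter()
    for g, lab in zip(genos, labels):
        (cases if lab == 1 else ctrls)[tuple(g)] += 1
    high = set()
    for cell in set(cases) | set(ctrls):
        # cases:controls >= threshold, written without a division so that
        # cells with zero controls are handled.
        if cases[cell] >= threshold * ctrls[cell]:
            high.add(cell)
    return high


def mdr_error(high, genos, labels):
    """Fraction of individuals misclassified when high-risk cells predict
    'case' and all other cells predict 'control'."""
    wrong = sum((tuple(g) in high) != (lab == 1)
                for g, lab in zip(genos, labels))
    return wrong / len(labels)


def mdr_cv_error(genos, labels, k=10, threshold=1.0):
    """Single pass of k-fold cross-validation with folds assigned by
    index modulo k; assumes at least k individuals."""
    n = len(labels)
    errors = []
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        high = mdr_fit([genos[i] for i in train],
                       [labels[i] for i in train], threshold)
        errors.append(mdr_error(high, [genos[i] for i in test],
                                [labels[i] for i in test]))
    return sum(errors) / k
```

A full MDR run would repeat this for every candidate set of n factors (step 1) and keep the set with the lowest cross-validated prediction error.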
Family II: Filtering approaches
Two-stage procedure (popular reference: Marchini et al., 2005):
- First, reduce the set of SNPs.
- Second, compute all remaining pairs exhaustively.
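The two-stage procedure can be sketched as follows, assuming the first stage ranks SNPs by some single-SNP association score and keeps a fixed number of top-ranked SNPs. The score, the cut-off `keep`, and the SNP identifiers are hypothetical placeholders; the lecture does not fix any of them.

```python
from itertools import combinations


def two_stage_pairs(main_effect_scores, keep):
    """main_effect_scores: dict mapping SNP id -> single-SNP association
    score (higher = stronger). Stage 1 keeps the `keep` top-scoring SNPs;
    stage 2 enumerates all pairs among them."""
    ranked = sorted(main_effect_scores, key=main_effect_scores.get,
                    reverse=True)
    kept = sorted(ranked[:keep])
    return list(combinations(kept, 2))


# Illustrative use with made-up scores for four SNPs.
scores = {"rs1": 0.1, "rs2": 5.0, "rs3": 3.0, "rs4": 0.2}
candidate_pairs = two_stage_pairs(scores, keep=3)
```

The reduction in work is quadratic in the filter size: keeping 1,000 SNPs leaves 499,500 pairs to test, regardless of how many SNPs were genotyped.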
Filtering in practice
[Figure by courtesy of J. Marchini]