
Data Mining in Bioinformatics Day 8: Feature Selection in Bioinformatics



  1. Data Mining in Bioinformatics Day 8: Feature Selection in Bioinformatics
     Epistasis detection
     Karsten Borgwardt
     February 21 to March 4, 2011
     Machine Learning & Computational Biology Group, Max Planck Institutes Tübingen, Germany
     Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

  2. Classic and new questions

     Genetics:
     - How does genotypic variation lead to phenotypic variation?
     - Can we predict phenotypes based on the genotype of an individual?

     Recent progress:
     - Genotypes can be determined at an unprecedented level of detail.
     - Phenotypes can be recorded in an automated manner.

  3. Phenotype prediction

     | Phenotype | AUC SVM |
     |---|---|
     | Chlorosis at 22 °C | 0.629 ± 0.003 |
     | Anthocyanin at 16 °C | 0.569 ± 0.003 |
     | Anthocyanin at 22 °C | 0.609 ± 0.003 |
     | Leaf Roll at 10 °C | 0.696 ± 0.002 |
     | Leaf Roll at 22 °C | 0.587 ± 0.004 |

     99-199 plants, 250k SNPs (Atwell et al., Nature 2010).

     Why is there room for improvement?
     - We assume additive effects of SNPs and ignore gene-gene interactions.
     - We ignore population structure, that is, systematic ancestry differences between cases and controls.
     - We ignore gene-environment interactions.
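The AUC values in the table measure how well the classifier's decision scores rank cases above controls. As a minimal sketch (not the evaluation code behind the reported numbers), the AUC can be computed as the fraction of case/control pairs that are ranked correctly, counting ties as one half. The function name `auc` and the toy scores are illustrative assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    scores: real-valued classifier outputs, higher means "more case-like".
    labels: 1 for cases, 0 for controls.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    correct = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                correct += 1.0   # case ranked above control
            elif p == q:
                correct += 0.5   # tie
    return correct / (len(pos) * len(neg))


# Toy data (hypothetical): two controls followed by two cases.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc = auc(toy_scores, toy_labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the values between roughly 0.57 and 0.70 in the table into context.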

  4. Gene-gene interactions

     Scale of the problem:
     - Typical datasets include on the order of $10^5$ to $10^6$ SNPs.
     - Hence we have to consider on the order of $10^{10}$ to $10^{12}$ SNP pairs.
     - Enormous multiple hypothesis testing problem.
     - Enormous computational runtime problem.

     Our contribution:
     - We assume binary phenotypes (cases and controls).
     - Genotypes may be homozygous or heterozygous.
     - We assume $m$ individuals with $n$ SNPs each.
     - We define an algorithm called epiSVM that rapidly detects epistatic interactions underlying the phenotypes in $O(mn)$.
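To make the scale concrete, the following sketch counts unordered SNP pairs, $n(n-1)/2$, for the dataset sizes quoted above and shows the per-test significance threshold a Bonferroni correction would imply. The significance level `alpha = 0.05` and the use of Bonferroni are assumptions chosen for illustration; the slides do not name a specific correction.

```python
def num_snp_pairs(n):
    """Number of unordered pairs among n SNPs: n * (n - 1) / 2."""
    return n * (n - 1) // 2


alpha = 0.05  # assumed family-wise significance level (illustrative)

for n in (10**5, 10**6):
    pairs = num_snp_pairs(n)
    threshold = alpha / pairs  # Bonferroni-corrected per-test threshold
    print(f"n = {n:,d} SNPs: {pairs:,d} pairs, per-test threshold {threshold:.1e}")
```

Counting ordered pairs instead gives $n^2$, which is the $10^{10}$ to $10^{12}$ figure on the slide; either way the number of tests is far beyond what exhaustive testing on standard hardware handles comfortably.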

  5. Epistasis detection approaches

     Filtering approaches:
     - Statistical criterion, e.g. SNPs with large main effect (Zhang et al., 2007)
     - Biological criterion, e.g. underlying PPI (Emily et al., 2009)

     Index structure approaches:
     - fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008)
     - TEAM, efficient updates of contingency tables (Zhang et al., 2010)

     Exhaustive enumeration:
     - Only with special hardware such as GPU implementations: EPIBLASTER (Kam-Thong, EJHG 2010)

  6. Support Vector Machines

     Separating hyperplane classifier (Vapnik & Chervonenkis, 1974).

     [Figure: separating hyperplane in a two-dimensional feature space (axes: Feature 1, Feature 2) with normal vector $w = (1,1)$.]
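A separating hyperplane classifier assigns a label according to the side of the hyperplane on which a point falls, $f(x) = \operatorname{sgn}(\langle w, x \rangle + b)$. The sketch below uses the normal vector $w = (1,1)$ from the figure; the offset `b = -1.0` and the test points are hypothetical, since the slide does not state them.

```python
def classify(x, w=(1.0, 1.0), b=-1.0):
    """Linear decision rule sgn(<w, x> + b); a score of 0 is mapped to +1.

    w = (1, 1) follows the figure; b = -1.0 is an assumed offset.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


label_far = classify((2.0, 2.0))     # lies on the positive side
label_origin = classify((0.0, 0.0))  # lies on the negative side
```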

  7. SVM: naive approach

     Classify case/control by means of pairs of features.

     SVM classifier:

     $$f(x) = \operatorname{sgn}(\langle w, \phi(x) \rangle + b) = \operatorname{sgn}\Big(\sum_{\gamma} w_\gamma \phi_\gamma(x) + b\Big)$$

     Mapping $\phi : X \to X \otimes X$.

     We have to compute all $n^2$ entries of $w$ to detect the feature pairs with maximum weight. These may be up to $10^{12}$ entries. Can we avoid this?
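The cost of the naive approach is easiest to see by writing the pairwise map out explicitly. The sketch below builds one product feature per ordered SNP pair, so a genotype vector of length $n$ becomes a feature vector of length $n^2$, and it recovers the pair belonging to the largest weight. The names `phi_pairs` and `top_pair` are illustrative, and the product encoding is one assumed realization of $\phi : X \to X \otimes X$.

```python
def phi_pairs(x):
    """Explicit pairwise map: entry i * n + j holds x[i] * x[j]."""
    return [xi * xj for xi in x for xj in x]


def top_pair(w_flat, n):
    """Return the SNP index pair (i, j) whose weight has the largest magnitude.

    w_flat is a weight vector of length n * n laid out like phi_pairs.
    """
    best = max(range(n * n), key=lambda k: abs(w_flat[k]))
    return divmod(best, n)


genotype = [1, 0, 2]                    # toy genotype with n = 3 SNPs
pair_features = phi_pairs(genotype)     # length n * n = 9
```

Both functions touch all $n^2$ entries, which is exactly the step that becomes infeasible at $n = 10^6$.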

  8. Three types of epistasis

     Based on HJ Cordell, 2002.

     [Figure: six two-locus panels labelled D&D, D&R, R&R, R|R, R|D and D|D. In each panel one axis shows the genotype at locus A (aa = 0, Aa = 1, AA = 2) and the other the genotype at locus B (bb = 0, Bb = 1, BB = 2).]

  9. Two-state feature space: $\phi_{\text{epi}}$

     [Figure: the six panels D&D, D&R, R&R, R|R, R|D and D|D redrawn with two states per locus: aa = 0 versus A* = 1 for locus A, and bb = 0 versus B* = 1 for locus B.]

     The mapping $\phi_{\text{epi}}$ results in $2n$ features that represent the $n$ SNPs (not $n^2$).
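The slide does not spell out how $\phi_{\text{epi}}$ is built. One plausible reading, offered here purely as an assumption, is that each SNP with genotype in {0, 1, 2} is replaced by two binary indicators: a dominant one (at least one copy of the allele) and a recessive one (two copies). That yields $2n$ features for $n$ SNPs, in line with the count on the slide. The name `phi_epi` and this encoding are illustrative, not the authors' definition.

```python
def phi_epi(genotypes):
    """Map n genotypes in {0, 1, 2} to 2n binary features.

    Assumed encoding (not confirmed by the slides):
      feature 2k     = 1 if SNP k carries at least one copy (dominant state)
      feature 2k + 1 = 1 if SNP k carries two copies (recessive state)
    """
    features = []
    for g in genotypes:
        if g not in (0, 1, 2):
            raise ValueError(f"genotype must be 0, 1 or 2, got {g!r}")
        features.append(1 if g >= 1 else 0)
        features.append(1 if g == 2 else 0)
    return features


encoded = phi_epi([0, 1, 2])  # three SNPs become six binary features
```

The point of such an encoding is that the feature count grows linearly in $n$, so a linear classifier over it stays within $O(mn)$ per pass over the data.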

  10. epiSVM

      Optimization problem (Rakitsch, Li, B., 2011):

      $$\min_{w \in \mathbb{R}^n} \|w\|^2 \tag{1}$$

      subject to $y_i (w \cdot \phi_{\text{epi}}(x_i) + b) \ge 1$ and $\|w\|_0 \le 2$.

      $\ell_0$-Support Vector Machine (Weston et al., 2003):
      - Approximation of (1) via repeated application of an $\ell_2$-SVM
      - Rescale $x$ by pointwise multiplication with $w$
      - Empirically, this procedure converges within $h = 20$ iterations

      Runtime: in each iteration, one has to solve a linear SVM on $m$ individuals and $n$ SNPs, which can be done in $O(mn)$.
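The repeated-rescaling idea can be sketched in a few lines: train an $\ell_2$-SVM, multiply the current feature scaling pointwise by the learned weights, and repeat. The code below is a minimal pure-Python illustration under stated assumptions. The solver `train_linear_svm` is a plain full-batch subgradient method standing in for whatever $O(mn)$ linear SVM solver is actually used, its hyperparameters are arbitrary, and the toy data are constructed so that only feature 0 correlates with the labels.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Soft-margin linear SVM via full-batch subgradient descent (stand-in solver)."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the l2 regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:             # hinge loss is active for this individual
                for j in range(n):
                    grad_w[j] -= yi * xi[j] / m
                grad_b -= yi / m
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def l0_svm_scaling(X, y, iterations=20):
    """Approximate an l0-constrained SVM by repeated l2-SVM training.

    z starts at all ones; after each round it is multiplied pointwise by the
    learned weights, so features with small weights are driven towards zero.
    """
    z = [1.0] * len(X[0])
    for _ in range(iterations):
        X_scaled = [[xj * zj for xj, zj in zip(xi, z)] for xi in X]
        w, _ = train_linear_svm(X_scaled, y)
        z = [zj * wj for zj, wj in zip(z, w)]
    return z


# Toy data (hypothetical): feature 0 equals the label, features 1 and 2 are
# uncorrelated with it by construction.
X_toy = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y_toy = [1, 1, -1, -1]
z_toy = l0_svm_scaling(X_toy, y_toy)

# Keep the two features with the largest scaling, mirroring ||w||_0 <= 2.
top2 = sorted(range(len(z_toy)), key=lambda j: abs(z_toy[j]), reverse=True)[:2]
```

With `iterations=20` mirroring the $h = 20$ quoted above, the total cost is $h$ times the cost of one linear SVM fit.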

  11. Runtime

      [Figure: runtime plot; content not preserved.]

  12. Power

      [Figure: two panels, $r^2 = 0.7$ and $r^2 = 1.0$, plotting power in % (0 to 100) for epiSVM and TEAM across the models D&D, D&R, R&R, R|R, R|D and D|D.]

      Sample size: $m = 400$, $n = 10{,}000$.

  13. Confounder-robust SVMs

      Genotypes $x_i$, phenotypes $y_i$ and kinship matrix $\tilde{L}$.

      Confounder-robust SVM (Li and B., 2011):

      $$\min_{w \in \mathbb{R}^n} \|w\|^2 + \lambda \operatorname{tr}(K \tilde{L}) \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \qquad K_{ij} = \langle w \odot x_i, w \odot x_j \rangle$$

      Equivalent to:

      $$\min_{\tilde{w} \in \mathbb{R}^n} \|\tilde{w}\|^2 \quad \text{subject to} \quad y_i (\tilde{w} \cdot \tilde{x}_i + b) \ge 1$$

      where

      $$\tilde{x}_i(\gamma) = \frac{x_i(\gamma)}{\sqrt{1 + \lambda \sum_{i,j} x_i(\gamma)\, x_j(\gamma)\, \tilde{L}_{ij}}}$$
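The equivalence means the confounder penalty can be absorbed into a per-feature rescaling of the genotypes, after which a standard linear SVM is trained on the rescaled data. The sketch below computes that rescaling: each feature $\gamma$ is divided by the square root of one plus $\lambda$ times the sum of $x_i(\gamma)\, x_j(\gamma)\, \tilde{L}_{ij}$ over all pairs of individuals. The function name `confounder_rescale` is illustrative, the double loop is a naive $O(m^2 n)$ implementation, and the check that the term under the square root is positive is an added assumption (it holds whenever $\tilde{L}$ is positive semidefinite).

```python
import math


def confounder_rescale(X, L, lam):
    """Divide feature g by sqrt(1 + lam * sum_ij X[i][g] * X[j][g] * L[i][j]).

    X: m x n genotype matrix (list of rows), L: m x m kinship matrix,
    lam: regularization weight of the confounder penalty.
    """
    m, n = len(X), len(X[0])
    X_tilde = [[0.0] * n for _ in range(m)]
    for g in range(n):
        c = sum(X[i][g] * X[j][g] * L[i][j] for i in range(m) for j in range(m))
        denom = 1.0 + lam * c
        if denom <= 0.0:
            raise ValueError("1 + lam * c must be positive")
        scale = math.sqrt(denom)
        for i in range(m):
            X_tilde[i][g] = X[i][g] / scale
    return X_tilde


# Toy example (hypothetical): two individuals, two SNPs, identity kinship.
X_small = [[1, 0], [1, 2]]
L_small = [[1, 0], [0, 1]]
X_rescaled = confounder_rescale(X_small, L_small, lam=1.0)
```

Features whose values align strongly with the kinship structure receive a larger denominator and are shrunk more, which is how the method discourages the classifier from relying on them.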

  14. Confounder-robust SVM

      Binary Arabidopsis phenotype prediction:
      - $x_i$ = 250k SNPs of plant $i$
      - $y_i$ = phenotype of plant $i$
      - $L_{ij} = 1$ if $i$ and $j$ are in the same subpopulation, $0$ otherwise

      Prediction results (5-fold cross-validation), linear SVM versus confounder-robust linear SVM (crSVM):

      | Phenotype | AUC crSVM | AUC SVM | p |
      |---|---|---|---|
      | Chlorosis at 22 °C | 0.662 ± 0.003 | 0.629 ± 0.003 | 1.7e-16 |
      | Anthocyanin at 16 °C | 0.598 ± 0.002 | 0.569 ± 0.003 | 3.0e-16 |
      | Anthocyanin at 22 °C | 0.618 ± 0.003 | 0.609 ± 0.003 | 1.0e-02 |
      | Leaf Roll at 10 °C | 0.711 ± 0.002 | 0.696 ± 0.002 | 3.0e-06 |
      | Leaf Roll at 22 °C | 0.594 ± 0.004 | 0.587 ± 0.004 | 4.0e-03 |
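The indicator matrix $L$ used in this experiment is straightforward to build from subpopulation assignments. A minimal sketch follows; the function name and the label values are hypothetical, and how subpopulations were assigned to the plants is not stated in the slides.

```python
def same_subpopulation_matrix(labels):
    """L[i][j] = 1 if individuals i and j share a subpopulation label, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


# Hypothetical subpopulation labels for three plants.
L_example = same_subpopulation_matrix(["north", "south", "north"])
```

The resulting matrix is symmetric with ones on the diagonal and is block-structured when individuals are sorted by subpopulation.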

  15. Summary and Outlook

      Computational challenges in genetics:
      - Gene-gene interactions
      - Correction for population structure
      - Analysing structured phenotypes such as images or time series

      Further current topics in the group:
      - Confounder(-gene interaction) correction
      - Active learning for optimized phenotyping
      - Feature extraction from image phenotypes
      - Feature selection in structured spaces
      - Comparing networks efficiently
      - Soon: Functional annotation of SNPs
      - Soon: Webtool for association studies
