
Data Mining in Bioinformatics Days 6 and 7: The Need for Data Mining in Bioinformatics



  1. Data Mining in Bioinformatics
     Days 6 and 7: The Need for Data Mining in Bioinformatics
     Karsten Borgwardt
     February 10 to February 21, 2014
     Machine Learning & Computational Biology Research Group
     Max Planck Institute Tübingen and Eberhard Karls Universität Tübingen

  2. The Need for Machine Learning in Computational Biology
     High-throughput technologies:
     ◮ Genome and RNA sequencing
     ◮ Compound screening
     ◮ Genotyping chips
     ◮ Bioimaging
     [Figure: BGI Hong Kong, Tai Po Industrial Estate, Hong Kong]
     Molecular databases are growing much faster than our knowledge of biological processes.
     Karsten Borgwardt, Days 6 and 7: Data Mining in the Life Sciences

  3. The Evolution of Bioinformatics
     ◮ Classic Bioinformatics: Focus on Molecules

  4. Classic Bioinformatics: Focus on Molecules
     ◮ Large collections of molecular data
        ◮ Gene and protein sequences
        ◮ Genome sequence
        ◮ Protein structures
        ◮ Chemical compounds
     ◮ Focus: Inferring properties of molecules
        ◮ Predict the function of a gene given its sequence
        ◮ Predict the structure of a protein given its sequence
        ◮ Predict the boundaries of a gene given a genome segment
        ◮ Predict the function of a chemical compound given its molecular structure

  5. Example: Predicting Function from Structure
     ◮ Structure-Activity Relationship
     [Figure. Source: Joska TM and Anderson AC, Antimicrob. Agents Chemother. 2006;50:3435-3443]
     ◮ Fundamental idea: Similarity in structure implies similarity in function

  6. Measuring the Similarity of Graphs
     ◮ How similar are two graphs?
     ◮ How similar is their structure?
     ◮ How similar are their node labels and edge labels?

  7. Graph Comparison
     1. Graph isomorphism and subgraph isomorphism checking
        ◮ Exact match
        ◮ Exponential runtime
     2. Graph edit distances
        ◮ Involves definition of a cost function
        ◮ Typically subgraph isomorphism as intermediate step
     3. Topological descriptors
        ◮ Lose some of the structural information represented by the graph, or
        ◮ Exponential runtime effort
     4. Graph kernels (Gärtner et al., 2003; Kashima et al., 2003)
        ◮ Goal 1: Polynomial runtime in the number of nodes
        ◮ Goal 2: Applicable to large graphs
        ◮ Goal 3: Applicable to graphs with attributes

  8. Graph Kernels I
     ◮ Kernels
        ◮ Key concept: Move problem to feature space $\mathcal{H}$.
        ◮ Naive explicit approach:
           ◮ Map objects $x$ and $x'$ via mapping $\phi$ to $\mathcal{H}$.
           ◮ Measure their similarity in $\mathcal{H}$ as $\langle \phi(x), \phi(x') \rangle$.
        ◮ Kernel trick: Compute inner product in $\mathcal{H}$ as kernel in input space: $k(x, x') = \langle \phi(x), \phi(x') \rangle$.
     [Figure: mapping $\mathbb{R}^2 \Rightarrow \mathcal{H}$]
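
To make the kernel trick concrete, here is a minimal sketch that compares the naive explicit approach with a kernel evaluated directly in input space. The slide does not say which mapping it illustrates; the degree-2 polynomial map `phi` below and the function names are assumptions chosen only because they are small enough to check by hand.

```python
import math

def phi(x):
    """Explicit feature map from R^2 into a 3-dimensional space H (assumed example)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def inner(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2, computed in input space."""
    return inner(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = inner(phi(x), phi(y))  # naive approach: map both points to H first
implicit = poly_kernel(x, y)      # kernel trick: never leaves R^2
print(explicit, implicit)
```

Both routes give the same similarity value, but the second one never constructs the feature vectors, which matters once $\mathcal{H}$ is high-dimensional.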

  9. Graph Kernels II
     ◮ Graph kernels
        ◮ Kernels on pairs of graphs (not pairs of nodes)
        ◮ Instance of R-Convolution kernels (Haussler, 1999):
           ◮ Decompose objects $x$ and $x'$ into substructures.
           ◮ Pairwise comparison of substructures via kernels to compare $x$ and $x'$.
        ◮ A graph kernel makes the whole family of kernel methods applicable to graphs.
     [Figure: two graphs $G$ and $G'$]
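
The decompose-and-compare idea can be written as a single formula. The notation below is not on the slide: $R^{-1}(x)$ stands for the set of decompositions of $x$ into $D$ parts, and $k_d$ for the kernel used on the $d$-th part; both symbols are assumptions introduced here to state the usual form of an R-convolution kernel.

```latex
k(x, x') \;=\;
\sum_{(x_1,\dots,x_D) \in R^{-1}(x)} \;
\sum_{(x'_1,\dots,x'_D) \in R^{-1}(x')} \;
\prod_{d=1}^{D} k_d(x_d, x'_d)
```

For graphs, the parts are substructures such as walks, paths, or subtrees, and different choices of decomposition give different graph kernels.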

  10. Weisfeiler-Lehman Kernel (Shervashidze and Borgwardt, NIPS 2009)
     1st iteration
     [Figure, four panels]
     ◮ (a) Given labeled graphs $G$ and $G'$
        ◮ Node labels of $G$: 5, 2, 4, 3, 1, 1
        ◮ Node labels of $G'$: 2, 5, 4, 3, 1, 2
     ◮ (b) Result of steps 1 and 2: multiset-label determination and sorting
        ◮ $G$: 5,234; 2,35; 4,1135; 3,245; 1,4; 1,4
        ◮ $G'$: 2,45; 5,234; 4,1235; 3,245; 1,4; 2,3
     ◮ (c) Result of step 3: label compression
        ◮ 1,4 → 6; 2,3 → 7; 2,35 → 8; 2,45 → 9; 3,245 → 10; 4,1135 → 11; 4,1235 → 12; 5,234 → 13
     ◮ (d) Result of step 4: relabeling
        ◮ $G$: 13, 8, 11, 10, 6, 6
        ◮ $G'$: 9, 13, 12, 10, 6, 7

  11. Weisfeiler-Lehman Kernel (Shervashidze and Borgwardt, NIPS 2009)
     End of the 1st iteration
     (e) Feature vector representations of $G$ and $G'$:
     $\phi^{(1)}_{\mathrm{WLsubtree}}(G) = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1)$
     $\phi^{(1)}_{\mathrm{WLsubtree}}(G') = (1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1)$
     The first five entries are counts of original node labels; the remaining eight are counts of compressed node labels.
     $k^{(1)}_{\mathrm{WLsubtree}}(G, G') = \langle \phi^{(1)}_{\mathrm{WLsubtree}}(G), \phi^{(1)}_{\mathrm{WLsubtree}}(G') \rangle = 11$.
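
As a worked check of the value 11, the inner product can be split into the part coming from the five original-label counts and the part coming from the eight compressed-label counts. The grouping into two brackets is added here for readability; the numbers are taken directly from the two feature vectors on this slide.

```latex
\begin{aligned}
\langle \phi(G), \phi(G') \rangle
 &= \underbrace{(2\cdot1 + 1\cdot2 + 1\cdot1 + 1\cdot1 + 1\cdot1)}_{\text{original labels}}
  + \underbrace{(2\cdot1 + 0\cdot1 + 1\cdot0 + 0\cdot1 + 1\cdot1 + 1\cdot0 + 0\cdot1 + 1\cdot1)}_{\text{compressed labels}} \\
 &= 7 + 4 = 11 .
\end{aligned}
```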

  12. Subtree-like Patterns
     [Figure: illustration of subtree-like patterns in a graph with numbered nodes]

  13. Weisfeiler-Lehman Kernel: Theoretical Runtime Properties
     ◮ Fast Weisfeiler-Lehman kernel (NIPS 2009 and JMLR 2011)
     ◮ Algorithm: Repeat the following steps $h$ times
        1. Sort: Represent each node $v$ as sorted list $L_v$ of its neighbors ($O(m)$)
        2. Compress: Compress this list into a hash value $h(L_v)$ ($O(m)$)
        3. Relabel: Relabel $v$ by the hash value $h(L_v)$ ($O(n)$)
     ◮ Runtime analysis
        ◮ per graph pair: Runtime $O(mh)$
        ◮ for $N$ graphs: Runtime $O(Nmh + N^2 nh)$ (naively $O(N^2 mh)$)
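
The sort, compress, and relabel steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a tuple stands in for the hash value, Python's comparison sort is used so the sketch does not attempt to match the stated runtime bounds, and the edge lists of `G` and `G_prime` are reconstructed from the multiset labels shown in the 1st-iteration example on slide 10.

```python
from collections import Counter

def make_graph(labels, edges):
    """Build (labels, adjacency) for an undirected labeled graph."""
    adj = {v: [] for v in labels}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return labels, adj

def wl_label_counts(graph, h):
    """Count node labels over WL iterations 0..h for one graph."""
    labels, adj = graph
    current = dict(labels)
    counts = Counter(current.values())  # iteration 0: original labels
    for _ in range(h):
        new = {}
        for v in current:
            # Sort: sorted multiset of the neighbours' current labels
            neigh = tuple(sorted(current[u] for u in adj[v]))
            # Compress: the pair (own label, neighbour labels) acts as the hash
            new[v] = (current[v], neigh)
        current = new                   # Relabel all nodes at once
        counts.update(current.values())
    return counts

def wl_subtree_kernel(g1, g2, h):
    """Inner product of the two label-count vectors."""
    c1, c2 = wl_label_counts(g1, h), wl_label_counts(g2, h)
    return sum(c1[key] * c2[key] for key in c1.keys() & c2.keys())

# Edge lists inferred from the multiset labels of the slide example (assumption).
G = make_graph(
    {"a": 5, "b": 2, "c": 4, "d": 3, "e": 1, "f": 1},
    [("a", "b"), ("a", "c"), ("a", "d"), ("b", "d"),
     ("c", "d"), ("c", "e"), ("c", "f")],
)
G_prime = make_graph(
    {"a": 2, "b": 5, "c": 4, "d": 3, "e": 1, "f": 2},
    [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"),
     ("c", "d"), ("c", "e"), ("d", "f")],
)
print(wl_subtree_kernel(G, G_prime, 1))  # kernel value after one iteration
```

With `h = 1` the sketch reproduces the kernel value 11 from slide 11. Computing `wl_label_counts` once per graph and only then taking pairwise inner products corresponds to the "for $N$ graphs" strategy, as opposed to rerunning the relabeling for every pair.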

  14. Weisfeiler-Lehman Kernel: Empirical Runtime Properties
     [Figure, four panels, each showing runtime in seconds: versus number of graphs $N$ (pairwise and global computation), versus graph size $n$, versus subtree height $h$, and versus graph density $c$]

  15. Weisfeiler-Lehman Kernel: Runtime and Accuracy
     [Figure: runtime (10 sec to 1000 days) and accuracy (50 % to 85 %) of the WL, RG, 3 Graphlet, RW, and SP kernels on the datasets MUTAG, NCI1, NCI109, and D&D, ordered by graph size]

  16. The Evolution of Bioinformatics
     ◮ Modern Bioinformatics: Focus on Individuals

  17. Modern Bioinformatics: Focus on Individuals
     ◮ High-throughput technologies now enable the collection of molecular information on individuals
        ◮ Microarrays to measure gene expression levels
        ◮ Chips to determine the genotype of an individual
        ◮ Sequencing to determine the genome sequence of an individual

  18. Phenotype Prediction
     ◮ Goal: Predict breast cancer outcome from gene expression levels
     ◮ Current results are not satisfying in terms of stability and prediction performance
     [Figure. Source: Venet et al., PLoS Comp Bio 2011]

  19. Phenotype Prediction
     ◮ Nature News, March 2009: ‘Genetic test predicts eye color in Dutch men with 90% accuracy’ (Liu et al., Current Biology 2009)
     ◮ Special setting: Candidate genes were already known beforehand
     ◮ Other phenotypes: Large genetics consortia try to detect candidate genes (e.g. diabetes, autism, depression, drug response, plant growth)

  20. Genetics: Association Studies
     ◮ Genome-Wide Association Studies (GWAS)
     [Figure credit: D. Weigel]
     ◮ One considers genome positions that differ between individuals, that is, Single Nucleotide Polymorphisms (SNPs) (more general: genetic locus or genomic variant).
     ◮ Problem size: $10^5$ to $10^7$ SNPs per genome, $10^2$ to $10^5$ individuals

  21. Genetics: Manhattan Plots
     ◮ The standard statistical analysis in Genetics: Generating a Manhattan plot of association signals
     [Figure: Manhattan plot for chromosome Chr2; x-axis: chromosomal position [bp]; y-axis: -log10(p-value); Bonferroni threshold [0.05]; phenotype: flower color-related trait of Arabidopsis thaliana]
     ◮ A plot of genome positions versus p-values of association/correlation.
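
The Bonferroni line in such a plot can be computed in a few lines. The sketch below assumes the usual Bonferroni correction (significance level divided by the number of tests); the function names and the five p-values are hypothetical and serve only to show the arithmetic, they are not data from the lecture.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def significant_hits(p_values, alpha=0.05):
    """Return (index, -log10(p)) for every test that passes the Bonferroni cutoff."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [(i, -math.log10(p)) for i, p in enumerate(p_values) if p <= cutoff]

# Hypothetical p-values for five genome positions (illustration only).
p_values = [0.2, 1e-9, 0.03, 4e-4, 0.7]
hits = significant_hits(p_values)

# Height of the threshold line on the -log10 axis for 10^6 tested SNPs.
line_height = -math.log10(bonferroni_threshold(0.05, 10**6))
print(hits, round(line_height, 2))
```

Because the cutoff shrinks with the number of tests, a study with $10^6$ SNPs needs p-values below $5 \times 10^{-8}$, that is, points above roughly 7.3 on the -log10 axis.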

  22. Genetics: Missing Heritability
     ◮ More than 1200 new disease loci were detected over the last decade.
     ◮ The phenotypic variance explained by these loci is disappointingly low:
     [Figure: title page of the review "Finding the missing heritability of complex diseases", Nature, Vol 461, 8 October 2009, doi:10.1038/nature08494. Authors: Teri A. Manolio, Francis S. Collins, Nancy J. Cox, David B. Goldstein, Lucia A. Hindorff, David J. Hunter, Mark I. McCarthy, Erin M. Ramos, Lon R. Cardon, Aravinda Chakravarti, Judy H. Cho, Alan E. Guttmacher, Augustine Kong, Leonid Kruglyak, Elaine Mardis, Charles N. Rotimi, Montgomery Slatkin, David Valle, Alice S. Whittemore, Michael Boehnke, Andrew G. Clark, Evan E. Eichler, Greg Gibson, Jonathan L. Haines, Trudy F. C. Mackay, Steven A. McCarroll & Peter M. Visscher. Abstract excerpt: "Genome-wide association studies have identified hundreds of genetic variants associated with complex human diseases and traits, and have provided valuable insights into their genetic architecture. Most variants identified so far confer relatively ..."]
     Manolio et al., Nature 2009
