Data Mining in Bioinformatics Day 9: Graph Mining in Chemoinformatics Chloé-Agathe Azencott & Karsten Borgwardt February 10 to February 21, 2014 Machine Learning & Computational Biology Research Group Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Drug discovery
Modern therapeutic research: from serendipity to rationalized drug design.
Ancient Greeks treat infections with mould.
[Figure: chemical structure diagram; caption: Biapenem in PBP-1A]
Drug discovery process
1. Find a target: protein that we want to inhibit so as to interfere with a biological process
2. Identify hits: compounds likely to bind to the target
3. Hit-to-lead, characterize hits: can they be drugs? (ADME-Tox)
4. Lead optimization and synthesis: bioactivity, pharmacokinetics, synthetic pathway
5. Assay: in vitro, in vivo, clinical
Drug discovery process
[Figure: pipeline timeline annotated with durations of 52 months and 90 months]
1. Find a target → 2. Identify hits → 3. Hit-to-lead: characterize hits → 4. Lead optimization and synthesis → 5. Assay
Drug discovery process
[Figure: pipeline timeline annotated with durations of 52 months and 90 months]
1. Find a target → 2. Identify hits → 3. Hit-to-lead: characterize hits → 4. Lead optimization and synthesis → 5. Assay
$500,000,000 to $2,000,000,000
Chemoinformatics
How can computer science help? → Chemoinformatics!
“...the mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation.” – F. K. Brown
“... the application of informatics methods to solve chemical problems.” – J. Gasteiger and T. Engel
Chemoinformatics
[Figure: the drug discovery pipeline annotated with the label "Chemoinformatics"]
1. Find a target → 2. Identify hits → 3. Hit-to-lead: characterize hits → 4. Lead optimization and synthesis → 5. Assay
Chemoinformatics
The chemical space
- $10^{60}$ possible small organic molecules
- $10^{22}$ stars in the observable universe
(Slide courtesy of Matthew A. Kayala)
Drug discovery process
[Figure: the drug discovery pipeline annotated with the labels "QSAR" and "QSPR"]
1. Find a target → 2. Identify hits → 3. Hit-to-lead: characterize hits → 4. Lead optimization and synthesis → 5. Assay
QSAR: Qualitative Structure-Activity Relationship, i.e. classification
QSPR: Quantitative Structure-Property Relationship, i.e. regression
Representing chemicals in silico
Expert knowledge: molecular descriptors → hard, potentially incomplete
Molecules are...
[Figure: chemical structure diagram]
Representing chemicals in silico
Similar Property Principle: molecules having similar structures should exhibit similar activities.
→ Structure-based representations
Compare molecules by comparing substructures.
Molecular graph
[Figure: a molecule drawn as a graph, with vertices labeled by atom type (C, N, O, S) and labeled edges for bonds]
Undirected labeled graph
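A minimal sketch of how such an undirected labeled graph could be stored, assuming atoms are vertices labeled by element symbol and bonds are edges labeled "s" (single) or "d" (double). The class name `MolecularGraph`, its methods, and the acetic acid example are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict


class MolecularGraph:
    """Undirected labeled graph: atoms are vertices, bonds are edges."""

    def __init__(self):
        self.atom_labels = {}               # vertex id -> element symbol
        self.adjacency = defaultdict(dict)  # vertex id -> {neighbor id: bond label}

    def add_atom(self, idx, element):
        self.atom_labels[idx] = element

    def add_bond(self, i, j, bond):
        # Undirected edge: store the bond label in both directions.
        self.adjacency[i][j] = bond
        self.adjacency[j][i] = bond

    def neighbors(self, i):
        return self.adjacency[i]


# Heavy-atom skeleton of acetic acid (hydrogens omitted): C-C(=O)-O
acetic_acid = MolecularGraph()
for idx, element in enumerate(["C", "C", "O", "O"]):
    acetic_acid.add_atom(idx, element)
acetic_acid.add_bond(0, 1, "s")
acetic_acid.add_bond(1, 2, "d")
acetic_acid.add_bond(1, 3, "s")
```

Storing each bond in both directions keeps neighbor lookups simple, at the cost of duplicating every edge label.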
Fingerprints
Define feature vectors that record the presence/absence (or number of occurrences) of particular patterns in a given molecular graph:
$\phi(A) = (\phi_s(A))_{s \text{ substructure}}$, where $\phi_s(A) = 1$ if $s$ occurs in $A$, and $0$ otherwise.
Extension of traditional chemical fingerprints.
Fingerprints
Learning from fingerprints
Classical machine learning and data mining techniques can be applied to these vectorial feature representations. Any distance / kernel can be used.
- Classification
- Feature selection
- Clustering
Fingerprints
Fingerprint compression
Systematic enumeration → long, sparse vectors,
e.g. 50,000 random compounds from ChemDB → 300,000 paths of length up to 8 → 300 non-zeros on average.
“Naive” compression: list the positions of the 1s.
$2^{19} = 524{,}288 \geq 300{,}000$, so each position takes 19 bits.
Average encoding: 300 × 19 = 5,700 bits.
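The arithmetic behind the naive encoding can be checked with a short sketch. The function name `naive_encoding_bits` is an illustrative assumption; the inputs are the figures quoted above.

```python
def naive_encoding_bits(vector_length, nonzeros):
    """Bits needed to list the positions of the non-zero entries.

    Each position is an index in range(vector_length), so it needs
    enough bits to represent the largest index, vector_length - 1.
    """
    bits_per_position = (vector_length - 1).bit_length()
    return bits_per_position * nonzeros


# 300,000 possible paths and 300 non-zeros on average, as in the example above.
average_bits = naive_encoding_bits(300_000, 300)
print(average_bits)  # 19 bits per position * 300 positions = 5700
```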
Fingerprints
Fingerprint compression
Modulo compression (lossy)
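One common reading of modulo compression is "folding": each position of the long fingerprint is mapped to its index modulo a shorter length, and bits that land on the same position are OR-ed together. The sketch below assumes that reading; the function name `fold_fingerprint` and the example positions are illustrative, and the slide itself does not spell out the scheme.

```python
def fold_fingerprint(on_positions, folded_length):
    """Fold a sparse binary fingerprint down to folded_length bits.

    on_positions: indices of the 1s in the long fingerprint.
    Positions that collide modulo folded_length are merged (logical OR),
    which is why the compression is lossy.
    """
    folded = [0] * folded_length
    for pos in on_positions:
        folded[pos % folded_length] = 1
    return folded


# Positions 3 and 1027 collide when folding to 1024 bits (1027 % 1024 == 3).
folded = fold_fingerprint([3, 5, 1027], 1024)
```

Here three set bits in the original vector become two set bits after folding, so the original fingerprint cannot be recovered from the folded one.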
Frequent patterns fingerprints
MOLFEA [Helma et al., 2004]
- P = positive (mutagenic) compounds
- N = negative compounds
- Features: fragments (= patterns) $f$ such that both $\text{freq}(f, P) \geq t$ and $\text{freq}(f, N) \geq t$
- Limited to frequent linear patterns
- ML algorithm: SVM with linear or quadratic kernel
Frequent patterns fingerprints
MOLFEA [Helma et al., 2004]
CPDB (Carcinogenic Potency DataBase): 684 compounds classified into 341 mutagens and 343 non-mutagens according to the Ames test on Salmonella.
[Figure: Mutagenicity prediction [Helma et al., 2004]. Cross-validated sensitivity (y-axis, 50 to 100) against frequency threshold (x-axis: 1%, 3%, 5%, 10%), for the linear and quadratic kernels.]
Spectrum kernels
$\phi(A) = (\phi_s(A))_{s \in S}$
$K_{\text{spectrum}}(A, A') = k(\phi(A), \phi(A'))$
$k \in \mathbb{R}^{\mathbb{R}^{|S|} \times \mathbb{R}^{|S|}}$ can be:
- Dot product (linear kernel)
- RBF kernel
- Tanimoto kernel: $k(A, B) = \frac{|A \cap B|}{|A \cup B|}$
- MinMax kernel: $k(A, B) = \frac{\sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} \max(A_i, B_i)}$
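A minimal sketch of the Tanimoto and MinMax kernels on fingerprint vectors, assuming binary vectors for Tanimoto and non-negative count vectors for MinMax. The function names, the example vectors, and the choice to return 0.0 when both fingerprints are empty are illustrative assumptions.

```python
def tanimoto(a, b):
    """Tanimoto kernel on binary vectors: |A ∩ B| / |A ∪ B|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two all-zero fingerprints get similarity 0.0.
    return both / either if either else 0.0


def minmax(a, b):
    """MinMax kernel on count vectors: sum of mins over sum of maxes."""
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    # Assumption: two all-zero fingerprints get similarity 0.0.
    return numerator / denominator if denominator else 0.0


binary_a, binary_b = [1, 1, 0, 1], [1, 0, 1, 1]
counts_a, counts_b = [2, 0, 3], [1, 1, 3]

t = tanimoto(binary_a, binary_b)  # 2 shared bits out of 4 set bits -> 0.5
m = minmax(counts_a, counts_b)    # (1 + 0 + 3) / (2 + 1 + 3)
```

On binary vectors the two kernels coincide, since min acts as a logical AND and max as a logical OR.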
Spectrum kernels
Tanimoto and MinMax
Both Tanimoto and MinMax are kernels.
Proof for Tanimoto: J.C. Gower, A general coefficient of similarity and some of its properties. Biometrics, 1971.
Proof for MinMax:
$\text{MinMax}(x, y) = \frac{\langle \phi(x), \phi(y) \rangle}{\langle \phi(x), \phi(x) \rangle + \langle \phi(y), \phi(y) \rangle - \langle \phi(x), \phi(y) \rangle}$
with $\phi(x)$ a binary vector of length (# patterns × max count), where $q$ denotes the max count and $\phi(x)_i = 1$ iff the pattern indexed by $\lfloor i/q \rfloor$ appears more than $i \bmod q$ times in $x$.
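The construction in the MinMax proof can be checked numerically. The sketch below expands count vectors into the binary vectors described above (zero-based index i, q equal to the maximum count) and compares the dot-product formula with the direct MinMax definition. All function names and the example counts are illustrative assumptions.

```python
def unary_expand(counts, q):
    """Binary vector of length len(counts) * q.

    Entry i is 1 iff pattern i // q occurs more than i % q times,
    so a pattern with count c contributes c leading ones in its block.
    """
    return [1 if counts[i // q] > i % q else 0 for i in range(len(counts) * q)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def minmax_via_dot(x, y, q):
    """MinMax written as a Tanimoto-style ratio of dot products."""
    px, py = unary_expand(x, q), unary_expand(y, q)
    xy = dot(px, py)
    return xy / (dot(px, px) + dot(py, py) - xy)


def minmax_direct(x, y):
    return sum(map(min, x, y)) / sum(map(max, x, y))


x, y = [2, 0, 3], [1, 1, 3]
q = max(x + y)  # max count over both vectors
```

For these counts, the expanded vectors share 4 ones and have 5 ones each, giving 4 / (5 + 5 - 4), the same value as the direct definition.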
All patterns fingerprints
Paths fingerprints
Labeled sub-paths (walks)
[Figure: molecular graph with example sub-paths highlighted: CsCsCdO and NsCsCsS]
Some sub-paths of length 3
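A minimal sketch of path fingerprint extraction, assuming simple paths (no atom visited twice) rather than general walks, and label strings that alternate atom and bond labels. The function name `path_fingerprint`, the input layout, and the acetic acid example are illustrative assumptions.

```python
from collections import Counter


def path_fingerprint(atoms, bonds, max_bonds):
    """Count labeled simple paths with at most max_bonds bonds.

    atoms: {vertex id: element symbol}
    bonds: {(i, j): bond label}, each undirected bond listed once
    """
    adjacency = {i: {} for i in atoms}
    for (i, j), label in bonds.items():
        adjacency[i][j] = label
        adjacency[j][i] = label

    counts = Counter()

    def extend(path, tokens):
        # Every undirected path is reached from both ends; count it once,
        # and pick the smaller of the two reading directions as its name.
        if path[0] <= path[-1]:
            counts["".join(min(tokens, tokens[::-1]))] += 1
        if len(path) - 1 == max_bonds:
            return
        for nxt, bond in adjacency[path[-1]].items():
            if nxt not in path:
                extend(path + [nxt], tokens + [bond, atoms[nxt]])

    for start in atoms:
        extend([start], [atoms[start]])
    return counts


# Heavy-atom skeleton of acetic acid: C-C(=O)-O, "s" single and "d" double bond.
atoms = {0: "C", 1: "C", 2: "O", 3: "O"}
bonds = {(0, 1): "s", (1, 2): "d", (1, 3): "s"}
fingerprint = path_fingerprint(atoms, bonds, max_bonds=2)
```

The resulting counter is a sparse count fingerprint; replacing each count by 1 gives the presence/absence variant.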
All patterns fingerprints
Circular fingerprints
Labeled sub-trees: Extended-Connectivity (or Circular) features
[Figure: molecular graph with one circular substructure highlighted]
C{sC{sN|sC}|sN{sC}|sS{sC}}
Example of a circular substructure of depth 2
All patterns fingerprints
2D spectrum kernels [Azencott et al., 2007]
Systematically extract paths / circular fingerprints, for various maximal depths.
SVM with Tanimoto / MinMax.
All patterns fingerprints
2D spectrum kernels [Azencott et al., 2007]
- Mutagenicity (Mutag): 188 compounds
- Benzodiazepine receptor affinity (BZR): 181 + 125 compounds
- Cyclooxygenase-2 inhibitors (COX2): 178 + 125 compounds
- Estrogen receptor affinity (ER): 166 + 180 compounds

Data    SVM      Previous best
Mutag   90.4%    85.2% (gBoost)
BZR     79.8%    76.4%
COX2    70.1%    73.6%
ER      82.1%    79.8%
Weisfeiler-Lehman kernel [Shervashidze et al., 2011]
Goal: scalability
Compute a sequence that captures topological and label information of graphs in a runtime linear in the number of edges.
→ sub-tree kernel
Weisfeiler-Lehman kernel [Shervashidze et al., 2011]
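A simplified sketch of the Weisfeiler-Lehman relabeling idea behind the sub-tree kernel: in each iteration, every vertex label is replaced by a signature built from its own label and the sorted labels of its neighbors, and the kernel compares label counts across all iterations. This sketch keeps the signatures as uncompressed strings and ignores edge labels, so it does not achieve the linear runtime stated above; the function names and example graphs are illustrative assumptions.

```python
from collections import Counter


def wl_features(labels, adjacency, iterations):
    """Count vertex labels over the original labeling and each refinement.

    labels: {vertex: label string}
    adjacency: {vertex: list of neighbor vertices}
    """
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(iterations):
        refined = {}
        for v in current:
            neighbor_labels = sorted(current[u] for u in adjacency[v])
            # Signature: own label followed by the sorted neighbor labels.
            refined[v] = current[v] + "(" + ",".join(neighbor_labels) + ")"
        current = refined
        features.update(current.values())
    return features


def wl_subtree_kernel(graph_a, graph_b, iterations):
    """Dot product of the two label-count vectors."""
    fa = wl_features(*graph_a, iterations)
    fb = wl_features(*graph_b, iterations)
    return sum(fa[key] * fb[key] for key in fa.keys() & fb.keys())


# Two small hypothetical graphs: the chain C-C-O and the pair C-O.
chain = ({0: "C", 1: "C", 2: "O"}, {0: [1], 1: [0, 2], 2: [1]})
pair = ({0: "C", 1: "O"}, {0: [1], 1: [0]})

k_self = wl_subtree_kernel(chain, chain, 1)
k_cross = wl_subtree_kernel(chain, pair, 1)
```

After one iteration the chain has the refined labels C(C), C(C,O) and O(C), while the pair has C(O) and O(C); only O(C) is shared, on top of the shared original labels C and O.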
Convolution kernels
a.k.a. decomposition kernels
$(x_1, \dots, x_D)$ is a tuple of parts of $x$, with $x_d \in X_d$ for each part $d = 1, \dots, D$.
$k_d \in \mathbb{R}^{X_d \times X_d}$: a Mercer kernel
$K_{\text{decomposition}}(x, x') = \sum_{x_1 x_2 \dots x_D = x} \; \sum_{x'_1 x'_2 \dots x'_D = x'} k_1(x_1, x'_1) \, k_2(x_2, x'_2) \cdots k_D(x_D, x'_D)$
Spectrum kernels are a particular case of convolution kernels.
Convolution kernels
Weighted Decomposition Kernel [Menchetti et al., 2005]
Match atoms and weigh them according to a kernel between subgraphs that include these atoms:
$K_{\text{WDK}}(x, x') = \sum_{(a, \sigma) \in D_r(x)} \; \sum_{(a', \sigma') \in D_r(x')} \delta(a, a') \, K_c(\sigma, \sigma')$
$r \in \mathbb{N}$, $r > 0$
$D_r(x)$: decompositions of the molecular graph of $x$ into an atom $a$ and a subgraph $\sigma$ of $x$ including $a$ and of depth at most $r$
Convolution kernels
Weighted Decomposition Kernel [Menchetti et al., 2005]
$K_c$: contextual kernel, here the histogram intersection kernel:
$K_c(\sigma, \sigma') = \sum_{l \in L} \min(f_\sigma(l), f_{\sigma'}(l))$
$L$: possible labels for edges and vertices
$f_\sigma(l)$: frequency of label $l$ in subgraph $\sigma$
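A minimal sketch of the weighted decomposition kernel with a histogram intersection contextual kernel. It assumes the decompositions are already computed and given as (atom label, label histogram of the context subgraph) pairs, and that frequencies are raw label counts; both are assumptions, as are the function names and the toy histograms.

```python
from collections import Counter


def histogram_intersection(f, g):
    """Contextual kernel K_c: sum over labels of the smaller frequency."""
    return sum(min(f[label], g[label]) for label in f.keys() & g.keys())


def weighted_decomposition_kernel(parts_x, parts_y):
    """Sum K_c over all pairs of parts whose atoms match exactly.

    parts_x, parts_y: lists of (atom label, Counter of context labels).
    The `if` clause plays the role of the delta function on atoms.
    """
    return sum(
        histogram_intersection(context_x, context_y)
        for atom_x, context_x in parts_x
        for atom_y, context_y in parts_y
        if atom_x == atom_y
    )


# Hypothetical label histograms of two context subgraphs.
context_1 = Counter({"C": 2, "O": 1, "s": 1})
context_2 = Counter({"C": 1, "N": 1, "s": 2})

parts_x = [("C", context_1), ("O", context_1)]
parts_y = [("C", context_2)]
k_wdk = weighted_decomposition_kernel(parts_x, parts_y)
```

Only the pair of parts centered on matching atoms (C with C) contributes, and its weight is the overlap of the two label histograms: min(2, 1) for C plus min(1, 2) for s.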