Data Mining in Bioinformatics Day 8: Graph Mining for Chemoinformatics and Drug Discovery Chloé-Agathe Azencott Machine Learning & Computational Biology Research Group MPIs Tübingen C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 1

Drug Discovery: Modern Therapeutic Research
From serendipity to rationalized drug design
Ancient Greeks treat infections with mould
Figure: Biapenem in PBP-1A

Drug Discovery: Drug Discovery Process

Drug Discovery: Chemoinformatics
How can computer science help? → Chemoinformatics!
“...the mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation.” – F. K. Brown
“... the application of informatics methods to solve chemical problems.” – J. Gasteiger and T. Engel

Drug Discovery: The Chemical Space
◮ $10^{60}$ possible small molecules
◮ $10^{22}$ stars in the observable universe
(Slide courtesy of Matthew A. Kayala)

Structure-Based Approaches: Drug Discovery Process
QSAR: Qualitative Structure-Activity Relationship (classification)
QSPR: Quantitative Structure-Property Relationship (regression)

Structure-Based Approaches: Representing Chemicals in silico
◮ Molecular descriptors based on expert knowledge → hard, potentially incomplete
◮ How to get a “complete” enough representation?
Similar Property Principle: molecules having similar structures should exhibit similar activities.
◮ → Structure-based representations
◮ Compare molecules by comparing substructures

Structure-Based Approaches: Molecular Graph
Undirected labeled graph

Structure-Based Approaches: Feature Vectors based on Patterns
◮ Define feature vectors that record the presence/absence (or number of occurrences) of particular patterns in a given molecular graph:
  $\phi(A) = (\phi_s(A))_{s \text{ substructure}}$, where $\phi_s(A) = 1$ if $s$ occurs in $A$ and $\phi_s(A) = 0$ otherwise
◮ Extension of traditional chemical fingerprints
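The feature map above can be sketched in a few lines, assuming the patterns of each molecule have already been extracted as strings. The function name, the vocabulary, and the toy pattern lists are hypothetical and only serve the illustration.

```python
from collections import Counter

def feature_vector(patterns_in_molecule, vocabulary, counts=False):
    """Map the patterns found in one molecule onto a fixed pattern vocabulary.

    counts=False gives the presence/absence vector, counts=True the
    number of occurrences of each pattern.
    """
    occurrences = Counter(patterns_in_molecule)
    if counts:
        return [occurrences[s] for s in vocabulary]
    return [1 if occurrences[s] > 0 else 0 for s in vocabulary]

# Hypothetical vocabulary of bond-level patterns and one toy molecule.
vocabulary = ["C-C", "C-O", "C=O", "C-N"]
molecule = ["C-C", "C-O", "C-O"]

binary = feature_vector(molecule, vocabulary)
counted = feature_vector(molecule, vocabulary, counts=True)
```

Patterns outside the vocabulary are silently ignored here, which is one possible design choice among several.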

Structure-Based Approaches: Feature Vectors based on Patterns
Classical machine learning and data mining techniques can be applied to these vectorial feature representations.
◮ Any distance / kernel can be used
◮ Dot product → “How many substructures do two compounds share?”
◮ Classification
◮ Feature selection
◮ Clustering
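As a small illustration of the dot-product remark, on binary fingerprints the dot product is exactly the number of patterns present in both compounds. The two vectors below are made-up examples.

```python
def shared_substructures(a, b):
    """Dot product of two binary fingerprints of equal length."""
    return sum(x * y for x, y in zip(a, b))

# Two hypothetical binary fingerprints over the same four patterns.
compound_a = [1, 1, 0, 1]
compound_b = [1, 0, 0, 1]
n_shared = shared_substructures(compound_a, compound_b)
```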

Structure-Based Approaches: Tanimoto and MinMax
◮ Tanimoto (binary setting): $k(A, B) = \frac{|A \cap B|}{|A \cup B|}$
◮ MinMax (counts setting): $k(A, B) = \frac{\sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} \max(A_i, B_i)}$
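Both similarity measures are short to write down. The sketch below treats binary fingerprints as sets of pattern indices and count fingerprints as equal-length lists; returning 1.0 for two empty fingerprints is an assumption made here, not something the slides specify.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of pattern indices."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty fingerprints are treated as identical.
    return len(a & b) / len(union) if union else 1.0

def minmax(a, b):
    """MinMax similarity between two count vectors of equal length."""
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    # Same convention as above for two all-zero vectors.
    return numerator / denominator if denominator else 1.0
```

On 0/1 vectors MinMax reduces to Tanimoto, since the sum of minima counts the intersection and the sum of maxima counts the union.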

Structure-Based Approaches: Tanimoto and MinMax
Both Tanimoto and MinMax are kernels.
◮ Proof for Tanimoto: J.C. Gower, A general coefficient of similarity and some of its properties. Biometrics, 1971.
◮ Proof for MinMax:
  $\mathrm{MinMax}(x, y) = \frac{\langle \phi(x), \phi(y) \rangle}{\langle \phi(x), \phi(x) \rangle + \langle \phi(y), \phi(y) \rangle - \langle \phi(x), \phi(y) \rangle}$
  with $\phi(x)$ of length (# patterns) × $q$, where $q$ is the max count, and
  $\phi(x)_i = 1$ iff the pattern indexed by $\lfloor i/q \rfloor$ appears more than $i \bmod q$ times in $x$
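The unary expansion used in the MinMax proof can be checked numerically. The sketch below assumes every count is at most q and that at least one count is non-zero; the function names are illustrative only.

```python
from fractions import Fraction

def unary_expand(counts, q):
    """Binary vector of length len(counts) * q: position i is 1 iff the
    pattern indexed by i // q appears more than i % q times."""
    return [1 if counts[i // q] > i % q else 0 for i in range(len(counts) * q)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def minmax_via_expansion(x, y, q):
    """MinMax written as a Tanimoto-style ratio of dot products."""
    px, py = unary_expand(x, q), unary_expand(y, q)
    inter = dot(px, py)
    return Fraction(inter, dot(px, px) + dot(py, py) - inter)

def minmax_direct(x, y):
    """MinMax computed directly from the count vectors."""
    return Fraction(sum(min(a, b) for a, b in zip(x, y)),
                    sum(max(a, b) for a, b in zip(x, y)))

x, y, q = [2, 0, 3], [1, 1, 3], 3
```

The dot product of two expanded vectors equals the sum of the element-wise minima, and the self dot product equals the total count, which is why the denominator becomes the sum of the maxima.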

Structure-Based Approaches: Fingerprint Compression
◮ Systematic enumeration → long, sparse vectors
  e.g. 50,000 random compounds from ChemDB → 300,000 paths of length up to 8 → 300 non-zeros on average
◮ “Naive” compression
  ◮ List the positions of the 1s
  ◮ $2^{19} = 524{,}288$, so 19 bits suffice to encode one position
  ◮ average encoding: 300 × 19 = 5,700 bits
◮ Modulo compression (lossy)
◮ Elias-Gamma Monotone Encoding (lossless) [Baldi 2007]
  ◮ index $j$ → $\lfloor \log(j) \rfloor$ zero bits + binary encoding of $j$
  ◮ $j_i < j_{i+1}$: $\lfloor \log(j_{i+1}) \rfloor$ zero bits → $\lfloor \log(j_{i+1}) \rfloor - \lfloor \log(j_i) \rfloor$ zero bits
  ◮ average compressed size = 1,800 bits
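The two compression ideas can be sketched as follows. This is an illustrative reading of the bullets above, assuming 1-based, strictly increasing indices and base-2 logarithms; it is not the reference implementation from [Baldi 2007], and the function names are made up.

```python
def elias_gamma(j):
    """Plain Elias-gamma code: floor(log2 j) zeros, then j in binary."""
    if j < 1:
        raise ValueError("Elias-gamma encodes positive integers only")
    binary = bin(j)[2:]
    return "0" * (len(binary) - 1) + binary

def monotone_elias_gamma(indices):
    """Encode sorted indices; each prefix only counts the extra bits
    needed compared with the previous index."""
    out, prev_bits, prev = [], 1, 0
    for j in indices:
        if j <= prev:
            raise ValueError("indices must be positive and strictly increasing")
        binary = bin(j)[2:]
        out.append("0" * (len(binary) - prev_bits) + binary)
        prev_bits, prev = len(binary), j
    return "".join(out)

def decode_monotone(bits):
    """Inverse of monotone_elias_gamma."""
    indices, pos, prev_bits = [], 0, 1
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        width = prev_bits + zeros
        indices.append(int(bits[pos:pos + width], 2))
        pos += width
        prev_bits = width
    return indices

def fold_modulo(indices, m):
    """Lossy modulo compression: fold positions into m buckets."""
    return sorted({j % m for j in indices})

positions = [3, 5, 6, 20]
encoded = monotone_elias_gamma(positions)
```

The monotone variant stays decodable because the decoder always knows the bit width of the previous index, so it only needs the number of additional bits.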

Frequent Patterns: MOLFEA
Frequent Pattern Mining: Mutagenicity Inducing Substructures [Helma et al., 2004]
◮ P = positive (mutagenic) compounds, N = negative compounds
◮ Features: fragments (= patterns) $f$ such that both $\mathrm{freq}(f, P) \geq t$ and $\mathrm{freq}(f, N) \leq t$
◮ Limited to frequent linear patterns
◮ ML algorithm: SVM with linear or quadratic kernel
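The frequency criterion can be illustrated with a brute-force filter, assuming each compound is given as the set of linear fragments it contains. This is not the MOLFEA search itself, which explores the fragment space level by level; the fragment strings and function names are hypothetical.

```python
def frequency(fragment, compounds):
    """Fraction of compounds that contain the fragment."""
    return sum(fragment in c for c in compounds) / len(compounds)

def select_fragments(positives, negatives, t):
    """Keep fragments with freq(f, P) >= t and freq(f, N) <= t."""
    candidates = set().union(*positives, *negatives)
    return sorted(
        f for f in candidates
        if frequency(f, positives) >= t and frequency(f, negatives) <= t
    )

# Toy data: each compound is a set of fragment strings.
positives = [{"N=O", "C-C"}, {"N=O"}, {"C-C", "C-O"}]
negatives = [{"C-C"}, {"C-C", "C-O"}, {"C-O"}]
selected = select_fragments(positives, negatives, t=0.5)
```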

Frequent Patterns: MOLFEA
Frequent Pattern Mining: Mutagenicity Inducing Substructures [Helma et al., 2004]
◮ CPDB – Carcinogenic Potency DataBase
◮ 684 compounds classified into 341 mutagens and 343 non-mutagens according to the Ames test on Salmonella
Figure: Mutagenicity prediction [Helma et al., 2004]: cross-validated sensitivity (%) against frequency threshold (1%, 3%, 5%, 10%) for the linear and quadratic kernels

Frequent Patterns: gBoost
gBoost [Saigo et al., 2008]
◮ Training data: $\{(G_n, y_n)\}_{n=1 \ldots l}$
◮ Stump: $h(x; t, w) = w(2 x_t - 1)$, where $x_t = 1$ if $t \subseteq x$ (pattern $t$ occurs in graph $x$) and $0$ otherwise
◮ Decision function: $f(x) = \sum_{t \in T, w \in \{-1, +1\}} \alpha_{tw} h(x; t, w)$, with $\sum_{t,w} \alpha_{tw} = 1$ and $\alpha_{tw} \geq 0 \; \forall t, w$
◮ LPBoosting:
  $\min_{\alpha, \xi, \rho} \; -\rho + D \sum_{n=1}^{l} \xi_n$
  s.t. $y_n f(x_n) + \xi_n \geq \rho$ and $\xi_n \geq 0 \; \forall n$, $\sum_{t,w} \alpha_{tw} = 1$ and $\alpha_{tw} \geq 0 \; \forall t, w$
◮ Equivalent to solving
  $\min_{\lambda, \gamma} \; \gamma$
  s.t. $\sum_{n=1}^{l} \lambda_n y_n h(x_n; t, w) \leq \gamma \; \forall t, w$, $\sum_{n=1}^{l} \lambda_n = 1$ and $0 \leq \lambda_n \leq D \; \forall n$
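The stump and the decision function are easy to evaluate once the weights are known. The sketch below makes a strong simplification: each graph is replaced by the set of pattern names it contains, so the subgraph test becomes set membership. The weights in `alpha` are invented for the example, not learned.

```python
def stump(graph_patterns, t, w):
    """h(x; t, w) = w * (2 * x_t - 1), with x_t = 1 iff pattern t occurs."""
    x_t = 1 if t in graph_patterns else 0
    return w * (2 * x_t - 1)

def decision(graph_patterns, alpha):
    """f(x) = sum over (t, w) of alpha[t, w] * h(x; t, w)."""
    return sum(a * stump(graph_patterns, t, w) for (t, w), a in alpha.items())

# Hypothetical weights: non-negative and summing to 1, as the LP requires.
alpha = {("N=O", +1): 0.75, ("C-O", -1): 0.25}
score_with_nitroso = decision({"N=O"}, alpha)
score_with_ether = decision({"C-O"}, alpha)
```

Because the weights form a convex combination of ±1 stumps, the score always lies in [-1, 1].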

Frequent Patterns: gBoost
gBoost [Saigo et al., 2008]
◮ Solve by “column generation”
  ◮ Start with $H = \emptyset$ and $\lambda_n = 1/l \; \forall n$
  ◮ Iteratively:
    ◮ find $(t^*, w^*)$ that maximizes $g(t, w) = \sum_{n=1}^{l} \lambda_n y_n h(x_n; t, w)$
    ◮ add $(t^*, w^*)$ to $H$
    ◮ update $\lambda_n$, $\gamma$
  ◮ Stop when $\nexists \, (t^*, w^*)$ such that $g(t^*, w^*) > \gamma + \varepsilon$
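The inner step of column generation, finding the stump with the largest gain, can be sketched by exhaustive search over an explicit candidate list. The LP update of the λ and γ values is deliberately left out because it needs an LP solver, and graphs are again simplified to sets of pattern names, so this is only one piece of the loop.

```python
def gain(t, w, graphs, labels, lam):
    """g(t, w) = sum_n lam_n * y_n * h(x_n; t, w)."""
    return sum(l * y * w * (2 * (t in g) - 1)
               for g, y, l in zip(graphs, labels, lam))

def best_stump(candidates, graphs, labels, lam):
    """Exhaustive search for the (t, w) pair with maximal gain."""
    pairs = ((t, w) for t in candidates for w in (-1, 1))
    return max(pairs, key=lambda tw: gain(tw[0], tw[1], graphs, labels, lam))

# Toy data with uniform initial weights lam_n = 1 / l.
graphs = [{"a", "b"}, {"a"}, {"b"}, set()]
labels = [1, 1, -1, -1]
lam = [0.25] * 4
t_star, w_star = best_stump(["a", "b"], graphs, labels, lam)
```

In gBoost the candidate set is never enumerated explicitly; the search runs over a DFS code tree with pruning.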

Frequent Patterns: gBoost
gBoost [Saigo et al., 2008]
◮ Finding $(t^*, w^*)$: DFS code tree
◮ Pruning condition ($g^*$ optimal gain so far): if
  $\max \left\{ 2 \sum_{n : y_n = 1, \, t \subseteq G_n} \lambda_n - \sum_{n=1}^{l} y_n \lambda_n, \;\; 2 \sum_{n : y_n = -1, \, t \subseteq G_n} \lambda_n + \sum_{n=1}^{l} y_n \lambda_n \right\} < g^*$
  then $\forall t' : t \subseteq t'$, $\forall w'$, $g^* > g(t', w')$

Frequent Patterns: gBoost
gBoost [Saigo et al., 2008]
Application to CPDB:
◮ Accuracy similar to Helma et al. (79%)
◮ Most discriminative patterns

All Patterns: 2D Kernels
Paths-Based Fingerprints
◮ Labeled sub-paths (walks)
Figure: Some sub-paths of length 3
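A path-based fingerprint can be sketched by a depth-first enumeration of labeled paths. The version below only follows simple paths (no atom visited twice) and merges the two reading directions of a path, both of which are assumptions of this sketch; the graph encoding is hypothetical.

```python
def labeled_paths(atoms, bonds, max_bonds):
    """Enumerate labeled simple paths with at most max_bonds bonds.

    atoms: {index: element label}
    bonds: {(i, j): bond label}, undirected
    Returns a set of token tuples such as ("C", "-", "O").
    """
    neighbours = {i: [] for i in atoms}
    for (i, j), bond in bonds.items():
        neighbours[i].append((j, bond))
        neighbours[j].append((i, bond))

    found = set()

    def extend(visited, tokens, depth):
        # Store one canonical direction so C-O and O-C coincide.
        found.add(min(tokens, tokens[::-1]))
        if depth == max_bonds:
            return
        for nxt, bond in neighbours[visited[-1]]:
            if nxt not in visited:
                extend(visited + [nxt], tokens + (bond, atoms[nxt]), depth + 1)

    for start in atoms:
        extend([start], (atoms[start],), 0)
    return found

# Heavy atoms of ethanol: C-C-O.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = {(0, 1): "-", (1, 2): "-"}
paths = labeled_paths(atoms, bonds, max_bonds=2)
```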

All Patterns: 2D Kernels
Circular Fingerprints
◮ Labeled sub-trees: Extended-Connectivity (or Circular) features
Figure: Example of a circular substructure of depth 2
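Circular features can be approximated by iteratively refining each atom label with the labels of its neighbours. The sketch below keeps readable strings instead of hashed identifiers and is only meant to convey the idea; it is not the exact extended-connectivity procedure used in the cited work.

```python
def circular_features(atoms, bonds, depth):
    """Collect atom-centred labels for every radius from 0 to depth.

    atoms: {index: element label}
    bonds: {(i, j): bond label}, undirected
    """
    neighbours = {i: [] for i in atoms}
    for (i, j), bond in bonds.items():
        neighbours[i].append((bond, j))
        neighbours[j].append((bond, i))

    labels = dict(atoms)
    features = set(labels.values())
    for _ in range(depth):
        # Each new label combines the atom's current label with the sorted
        # current labels of its neighbours, so atom ordering does not matter.
        labels = {
            i: labels[i] + "("
               + ",".join(sorted(bond + labels[j] for bond, j in neighbours[i]))
               + ")"
            for i in atoms
        }
        features.update(labels.values())
    return features

# Heavy atoms of ethanol: C-C-O.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = {(0, 1): "-", (1, 2): "-"}
depth_one = circular_features(atoms, bonds, depth=1)
```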

All Patterns: 2D Kernels
Two-Dimensional Kernels [Azencott et al., 2007]
◮ Systematically extract paths / circular fingerprints, for various maximal depths
◮ SVM with Tanimoto / MinMax
◮ Data
  ◮ Mutagenicity (Mutag): 188 compounds
  ◮ Benzodiazepine receptor affinity (BZR): 181 + 125 compounds
  ◮ Cyclooxygenase-2 inhibitors (COX2): 178 + 125 compounds
  ◮ Estrogen receptor affinity (ER): 166 + 180 compounds

Data    SVM     Previous best
Mutag   90.4%   85.2% (gBoost)
BZR     76.4%   79.8%
COX2    70.1%   73.6%
ER      79.8%   82.1%

All Patterns: 3D Kernels
Introducing Spatial Information
◮ 3D Histograms [Azencott et al., 2007]
  ◮ Groups of k atoms
  ◮ Associated size:
    ◮ Pairwise distances (k = 2)
    ◮ Diameter of the smallest sphere that contains all k atoms

All Patterns: 3D Kernels
Introducing Spatial Information
One histogram per class of k-tuple (e.g. C-C-C, C-C-O)
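For k = 2 the histograms reduce to binned pairwise distances, one histogram per pair of element types. The sketch below assumes atoms are given as (element, coordinates) tuples and uses a fixed bin width; the coordinates and the bin width are invented for the example, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def pair_distance_histograms(atoms, bin_width):
    """One distance histogram per unordered pair of element labels.

    atoms: list of (element, (x, y, z)) tuples
    Returns {(el_a, el_b): Counter({bin index: count})}.
    """
    histograms = defaultdict(Counter)
    for (el_a, pos_a), (el_b, pos_b) in combinations(atoms, 2):
        pair_class = tuple(sorted((el_a, el_b)))
        bin_index = int(math.dist(pos_a, pos_b) // bin_width)
        histograms[pair_class][bin_index] += 1
    return histograms

# Hypothetical three-atom fragment with coordinates in angstroms.
atoms = [
    ("C", (0.0, 0.0, 0.0)),
    ("C", (1.5, 0.0, 0.0)),
    ("O", (1.5, 1.4, 0.0)),
]
hist = pair_distance_histograms(atoms, bin_width=0.5)
```

For larger k the pairwise distance would be replaced by the diameter of the smallest enclosing sphere, which this sketch does not compute.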