Data Mining in Bioinformatics Day 8: Graph Mining for Chemoinformatics and Drug Discovery


  1. Data Mining in Bioinformatics Day 8: Graph Mining for Chemoinformatics and Drug Discovery
     Chloé-Agathe Azencott, Machine Learning & Computational Biology Research Group, MPIs Tübingen
     C.-A. Azencott, Graph Mining for Chemoinformatics, February 16, 2012

  2. Drug Discovery: Modern Therapeutic Research
     From serendipity to rationalized drug design
     Ancient Greeks treat infections with mould
     [Figure: Biapenem in PBP-1A]

  3. Drug Discovery: Drug Discovery Process

  4. Drug Discovery: Drug Discovery Process

  5. Drug Discovery: Drug Discovery Process

  6. Drug Discovery: Chemoinformatics
     How can computer science help? → Chemoinformatics!
     “...the mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation.” – F. K. Brown
     “...the application of informatics methods to solve chemical problems.” – J. Gasteiger and T. Engel

  7. Drug Discovery: Chemoinformatics

  8. Drug Discovery: The Chemical Space
     ◮ $10^{60}$ possible small molecules
     ◮ $10^{22}$ stars in the observable universe
     (Slide courtesy of Matthew A. Kayala)

  9. Structure-Based Approaches: Drug Discovery Process
     QSAR: Qualitative Structure-Activity Relationship (classification)
     QSPR: Quantitative Structure-Property Relationship (regression)

  10. Structure-Based Approaches: Representing Chemicals in silico
      ◮ Expert knowledge molecular descriptors → hard, potentially incomplete
      ◮ How to get a “complete” enough representation?
      Similar Property Principle: molecules having similar structures should exhibit similar activities.
      ◮ → Structure-based representations
      ◮ Compare molecules by comparing substructures

  11. Structure-Based Approaches: Molecular Graph
      Undirected labeled graph

  12. Structure-Based Approaches: Feature Vectors based on Patterns
      ◮ Define feature vectors that record the presence/absence (or number of occurrences) of particular patterns in a given molecular graph:
        $\phi(A) = (\phi_s(A))_{s \text{ substructure}}$, where $\phi_s(A) = 1$ if $s$ occurs in $A$ and $0$ otherwise
      ◮ Extension of traditional chemical fingerprints

  13. Structure-Based Approaches: Feature Vectors based on Patterns
      Classical machine learning and data mining techniques can be applied to these vectorial feature representations.
      ◮ Any distance / kernel can be used
      ◮ Dot product → “How many substructures do two compounds share?”
      ◮ Classification
      ◮ Feature selection
      ◮ Clustering
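
A minimal sketch of the binary pattern features and the dot product described on this slide. It assumes each molecule has already been reduced to the set of patterns it contains; the pattern strings and the helper names `feature_vector` and `dot` are hypothetical and only serve the illustration.

```python
def feature_vector(patterns_in_molecule, vocabulary):
    """Binary vector: entry is 1 if the pattern occurs in the molecule, else 0."""
    return [1 if s in patterns_in_molecule else 0 for s in vocabulary]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Illustrative pattern vocabulary and two toy molecules (assumed data).
vocabulary = ["C-C", "C-N", "C=O", "C-O", "N-H"]
mol_a = {"C-C", "C-N", "C=O"}
mol_b = {"C-C", "C=O", "C-O"}

phi_a = feature_vector(mol_a, vocabulary)
phi_b = feature_vector(mol_b, vocabulary)

# For binary vectors the dot product counts the shared patterns.
shared = dot(phi_a, phi_b)
```

With binary features the dot product is simply the size of the intersection of the two pattern sets, which is what the "how many substructures do two compounds share" reading refers to.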

  14. Structure-Based Approaches: Tanimoto and MinMax
      ◮ Tanimoto (binary setting): $k(A, B) = \frac{|A \cap B|}{|A \cup B|}$
      ◮ MinMax (counts setting): $k(A, B) = \frac{\sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} \max(A_i, B_i)}$
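
The two similarity measures can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and returning 0.0 for two empty fingerprints is an assumed convention that the slide does not specify.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' patterns."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(union)


def minmax(a, b):
    """MinMax similarity of two equal-length count vectors."""
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    if denominator == 0:
        return 0.0  # assumed convention for two all-zero vectors
    return numerator / denominator


# Two shared patterns out of four distinct ones.
t = tanimoto({1, 2, 3}, {2, 3, 4})
# Sum of minima is 2, sum of maxima is 4.
m = minmax([2, 0, 1], [1, 1, 1])
```

On 0/1 count vectors MinMax reduces to Tanimoto, since the minimum acts as intersection and the maximum as union.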

  15. Structure-Based Approaches: Tanimoto and MinMax
      Both Tanimoto and MinMax are kernels.
      ◮ Proof for Tanimoto: J.C. Gower, A general coefficient of similarity and some of its properties. Biometrics, 1971.
      ◮ Proof for MinMax:
        $\mathrm{MinMax}(x, y) = \frac{\langle \phi(x), \phi(y) \rangle}{\langle \phi(x), \phi(x) \rangle + \langle \phi(y), \phi(y) \rangle - \langle \phi(x), \phi(y) \rangle}$
        with $\phi(x)$ of length (# patterns) × (max count), where $q$ denotes the max count and
        $\phi(x)_i = 1$ iff the pattern indexed by $\lfloor i/q \rfloor$ appears more than $i \bmod q$ times in $x$
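
The expansion used in the MinMax proof can be checked numerically. The sketch below builds the binary vector described on this slide (one block of `q` bits per pattern) and compares the resulting Tanimoto-style ratio with MinMax computed directly on the counts. Function names and the toy count vectors are assumptions made for the illustration.

```python
def unary_expand(counts, q):
    """Bit i is 1 iff pattern i // q appears more than i % q times."""
    return [1 if counts[i // q] > i % q else 0 for i in range(len(counts) * q)]


def inner(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def minmax_counts(x, y):
    """MinMax computed directly on count vectors."""
    return sum(map(min, x, y)) / sum(map(max, x, y))


def minmax_via_expansion(x, y, q):
    """Same quantity, written with inner products of the expanded vectors."""
    px, py = unary_expand(x, q), unary_expand(y, q)
    xy = inner(px, py)
    return xy / (inner(px, px) + inner(py, py) - xy)


x = [2, 0, 3]   # toy counts for three patterns
y = [1, 1, 3]
q = 3           # maximal count over both molecules

direct = minmax_counts(x, y)
expanded = minmax_via_expansion(x, y, q)
```

Each pattern with count c contributes c leading ones to its block, so the inner product of two blocks is the minimum of the two counts, which is why both routes agree.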

  16. Structure-Based Approaches: Fingerprint Compression
      ◮ Systematic enumeration → long, sparse vectors
        e.g. 50,000 random compounds from ChemDB → 300,000 paths of length up to 8 → 300 non-zeros on average
      ◮ “Naive” Compression
        ◮ List the positions of the 1s
        ◮ $2^{19} = 524{,}288$
        ◮ average encoding: $300 \times 19 = 5{,}700$ bits
      ◮ Modulo Compression (lossy)
      ◮ Elias-Gamma Monotone Encoding (lossless) [Baldi 2007]
        ◮ index $j$ → $\lfloor \log(j) \rfloor$ 0 bits + binary encoding of $j$
        ◮ $j_i < j_{i+1}$: $\lfloor \log(j_{i+1}) \rfloor$ → $\lfloor \log(j_i) - \log(j_{i+1}) \rfloor$
        ◮ average compressed size = 1,800 bits
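
A rough sketch of two of the schemes named on this slide: modulo folding (lossy) and the plain Elias-Gamma code for a single index (lossless). It does not implement the monotone variant of [Baldi 2007]; the function names, the folded size of 1024, and the assumption that indices start at 1 and that the logarithm is base 2 are all choices made for the illustration.

```python
def modulo_fold(positions, size):
    """Lossy: map each 1-bit position to position % size; collisions merge."""
    return sorted({p % size for p in positions})


def elias_gamma(j):
    """floor(log2 j) zero bits followed by the binary encoding of j (j >= 1)."""
    if j < 1:
        raise ValueError("Elias-Gamma codes are defined for j >= 1")
    binary = bin(j)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode a concatenation of Elias-Gamma codewords back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":      # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


# Naive encoding size from the slide: 300 positions at 19 bits each.
naive_bits = 300 * 19

folded = modulo_fold([3, 1027, 2050], 1024)
stream = "".join(elias_gamma(j) for j in [1, 5, 300])
decoded = elias_gamma_decode(stream)
```

Folding cannot be undone because positions 3 and 1027 land on the same bit, whereas the Elias-Gamma stream decodes back to the exact index list.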

  17. Frequent Patterns: MOLFEA
      Frequent Pattern Mining: Mutagenicity Inducing Substructures [Helma et al., 2004]
      ◮ $P$ = positive (mutagenic) compounds, $N$ = negative compounds
      ◮ features: fragments (= patterns) $f$ such that both $\mathrm{freq}(f, P) \geq t$ and $\mathrm{freq}(f, N) \leq t$
      ◮ Limited to frequent linear patterns
      ◮ ML algorithm: SVM with linear or quadratic kernel
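
The selection rule on this slide can be sketched as a simple filter. This is not the MOLFEA implementation: it assumes the candidate fragments are already enumerated, that each compound is given as the set of fragments it contains, and that freq is a relative frequency. All names and the toy data are hypothetical.

```python
def frequency(fragment, compounds):
    """Fraction of compounds (sets of fragments) that contain the fragment."""
    if not compounds:
        return 0.0
    return sum(fragment in c for c in compounds) / len(compounds)


def select_fragments(candidates, positives, negatives, t):
    """Keep fragments with freq(f, P) >= t and freq(f, N) <= t."""
    return [
        f for f in candidates
        if frequency(f, positives) >= t and frequency(f, negatives) <= t
    ]


# Toy data: fragment "a" is common among positives and absent from negatives.
positives = [{"a", "b"}, {"a"}, {"a", "c"}]
negatives = [{"b"}, {"c"}, {"b", "c"}]

selected = select_fragments(["a", "b", "c"], positives, negatives, t=0.5)
```

The selected fragments would then serve as binary features for the SVM mentioned on the slide.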

  18. Frequent Patterns: MOLFEA
      Frequent Pattern Mining: Mutagenicity Inducing Substructures [Helma et al., 2004]
      ◮ CPDB – Carcinogenic Potency DataBase
      ◮ 684 compounds classified in 341 mutagens and 343 non-mutagens according to Ames test on Salmonella
      [Figure: Mutagenicity prediction [Helma04]. Cross-validated sensitivity versus frequency threshold (1%, 3%, 5%, 10%) for the linear and quadratic kernels]

  19. Frequent Patterns: gBoost
      gBoost [Saigo et al., 2008]
      ◮ Train data: $\{(G_n, y_n)\}_{n=1 \dots l}$
      ◮ Stump: $h(x; t, w) = w(2x_t - 1)$, with $x_t = 1$ if $t \subseteq G$ (the pattern $t$ occurs in the graph $G$ encoded by $x$) and $0$ otherwise
      ◮ Decision function: $f(x) = \sum_{t \in T,\, w \in \{-1,+1\}} \alpha_{t,w}\, h(x; t, w)$, with $\sum_{t,w} \alpha_{t,w} = 1$ and $\alpha_{t,w} \geq 0 \ \forall t, w$
      ◮ LPBoosting: $\min_{\alpha, \xi, \rho} \; -\rho + D \sum_{n=1}^{l} \xi_n$
        s.t. $\sum_{t,w} \alpha_{t,w}\, y_n\, h(x_n; t, w) + \xi_n \geq \rho$ and $\xi_n \geq 0 \ \forall n$, $\sum_{t,w} \alpha_{t,w} = 1$ and $\alpha_{t,w} \geq 0 \ \forall t, w$
      ◮ Equivalent to solving $\min_{\lambda, \gamma} \gamma$
        s.t. $\sum_{n=1}^{l} \lambda_n y_n h(x_n; t, w) \leq \gamma \ \forall t, w$, $\sum_{n=1}^{l} \lambda_n = 1$ and $0 \leq \lambda_n \leq D \ \forall n$

  20. Frequent Patterns: gBoost
      gBoost [Saigo et al., 2008]
      ◮ Solve by “column generation”
        ◮ start with $H = \emptyset$ and $\lambda_n = 1/l \ \forall n$
        ◮ Iteratively:
          ◮ find $(t^*, w^*)$ that maximizes $g(t, w) = \sum_{n=1}^{l} \lambda_n y_n h(x_n; t, w)$
          ◮ add $(t^*, w^*)$ to $H$
          ◮ update $\lambda_n$, $\gamma$
        ◮ Stop when $\nexists\, (t^*, w^*)$ such that $g(t^*, w^*) > \gamma + \varepsilon$
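
The inner step of the column generation loop, finding the stump with the largest gain, can be sketched on toy data. This is a simplification under stated assumptions: graphs and patterns are represented as sets of items with set inclusion standing in for subgraph containment, the candidate patterns are listed explicitly instead of being searched in a DFS code tree, and the LP update of the weights is left out. All names are hypothetical.

```python
def stump(pattern, w, graph):
    """h(x; t, w) = w * (2 * x_t - 1), with x_t = 1 iff the pattern is contained."""
    x_t = 1 if pattern <= graph else 0
    return w * (2 * x_t - 1)


def gain(pattern, w, graphs, labels, lam):
    """g(t, w) = sum_n lambda_n * y_n * h(x_n; t, w)."""
    return sum(l * y * stump(pattern, w, g)
               for g, y, l in zip(graphs, labels, lam))


def best_stump(candidates, graphs, labels, lam):
    """Exhaustively pick the (pattern, w) pair with the largest gain."""
    pairs = [(t, w) for t in candidates for w in (-1, +1)]
    return max(pairs, key=lambda tw: gain(tw[0], tw[1], graphs, labels, lam))


# Toy training set: pattern {"p"} occurs exactly in the positive graphs.
graphs = [set("pq"), set("p"), set("q"), set()]
labels = [1, 1, -1, -1]
lam = [1 / len(graphs)] * len(graphs)   # uniform initial weights

candidates = [frozenset("p"), frozenset("q")]
t_star, w_star = best_stump(candidates, graphs, labels, lam)
g_star = gain(t_star, w_star, graphs, labels, lam)
```

In the full algorithm this search alternates with re-solving the linear program, which changes the weights and therefore the gains at the next iteration.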

  21. Frequent Patterns: gBoost
      gBoost [Saigo et al., 2008]
      ◮ Finding $(t^*, w^*)$: DFS code tree
      ◮ Pruning condition ($g^*$ optimal gain so far): if
        $\max\left\{ 2 \sum_{n:\, y_n = 1,\, t \subseteq G_n} \lambda_n - \sum_{n=1}^{l} y_n \lambda_n,\;\; 2 \sum_{n:\, y_n = -1,\, t \subseteq G_n} \lambda_n + \sum_{n=1}^{l} y_n \lambda_n \right\} < g^*$
        then $\forall t' : t \subseteq t'$, $\forall w'$, $g^* > g(t', w')$
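
The pruning bound can be verified on a small example. As a simplification, graphs and patterns are again modelled as sets with set inclusion in place of subgraph containment, so that superpatterns are simply supersets; the names and the toy weights are assumptions for the illustration.

```python
def gain(pattern, w, graphs, labels, lam):
    """g(t, w) = sum_n lambda_n * y_n * w * (2 * x_t - 1)."""
    return sum(l * y * w * (2 * int(pattern <= g) - 1)
               for g, y, l in zip(graphs, labels, lam))


def gain_bound(pattern, graphs, labels, lam):
    """Upper bound on g(t', w') over all superpatterns t' of the pattern."""
    signed_total = sum(y * l for y, l in zip(labels, lam))
    pos = sum(l for g, y, l in zip(graphs, labels, lam)
              if y == 1 and pattern <= g)
    neg = sum(l for g, y, l in zip(graphs, labels, lam)
              if y == -1 and pattern <= g)
    return max(2 * pos - signed_total, 2 * neg + signed_total)


def can_prune(pattern, best_gain, graphs, labels, lam):
    """True if no superpattern of the pattern can beat the best gain so far."""
    return gain_bound(pattern, graphs, labels, lam) < best_gain


graphs = [set("abc"), set("ab"), set("a"), set("c")]
labels = [1, -1, 1, -1]
lam = [0.4, 0.3, 0.2, 0.1]

bound = gain_bound(frozenset("a"), graphs, labels, lam)
supergains = [gain(frozenset(t), w, graphs, labels, lam)
              for t in ("ab", "ac", "abc") for w in (-1, +1)]
```

The bound holds because a superpattern can only occur in a subset of the graphs that contain the pattern, so the weight it collects from either class is at most what the pattern itself collects.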

  22. Frequent Patterns: gBoost
      gBoost [Saigo et al., 2008]
      Application to CPDB:
      ◮ Accuracy similar to Helma et al. (79%)
      ◮ Most discriminative patterns

  23. All Patterns: 2D Kernels
      Paths-Based Fingerprints
      ◮ Labeled sub-paths (walks)
      Figure: Some sub-paths of length 3
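
A minimal sketch of how labeled sub-paths could be enumerated from a molecular graph by depth-first search. It makes several assumptions that the slide does not state: only simple paths (no atom visited twice) are generated, path length is counted in bonds, and a path and its reverse are kept as two separate strings. The graph encoding and the function name are hypothetical.

```python
def labeled_paths(atoms, bonds, max_len):
    """Return the label strings of all simple paths with at most max_len bonds.

    atoms: {atom index: atom label}
    bonds: {(i, j): bond label}, each undirected bond listed once
    """
    adjacency = {}
    for (i, j), bond in bonds.items():
        adjacency.setdefault(i, []).append((j, bond))
        adjacency.setdefault(j, []).append((i, bond))

    found = set()

    def extend(node, visited, label, length):
        found.add(label)
        if length == max_len:
            return
        for neighbour, bond in adjacency.get(node, []):
            if neighbour not in visited:
                extend(neighbour, visited | {neighbour},
                       label + bond + atoms[neighbour], length + 1)

    for start in atoms:
        extend(start, {start}, atoms[start], 0)
    return found


# Heavy-atom skeleton of ethanol: C-C-O with single bonds.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = {(0, 1): "-", (1, 2): "-"}

paths = labeled_paths(atoms, bonds, max_len=2)
```

Each distinct path string then indexes one entry of the fingerprint, either as a presence bit or as a count.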

  24. All Patterns: 2D Kernels
      Circular Fingerprints
      ◮ Labeled sub-trees: Extended-Connectivity (or Circular) features
      Figure: Example of a circular substructure of depth 2
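
One way to illustrate circular features is an iterative relabelling in which each atom's label at depth d combines its own atom label with the sorted depth d-1 labels of its neighbours. This is a simplified, Weisfeiler-Lehman-style sketch and not the exact extended-connectivity procedure; in particular, the unfolding walks back through the centre atom. The graph encoding and the function name are hypothetical.

```python
def circular_labels(atoms, bonds, depth):
    """Return {atom index: string describing its neighbourhood up to depth}.

    atoms: {atom index: atom label}
    bonds: {(i, j): bond label}, each undirected bond listed once
    """
    adjacency = {}
    for (i, j), bond in bonds.items():
        adjacency.setdefault(i, []).append((j, bond))
        adjacency.setdefault(j, []).append((i, bond))

    labels = dict(atoms)                      # depth 0: plain atom labels
    for _ in range(depth):
        updated = {}
        for a in atoms:
            # Sorting makes the label independent of neighbour ordering.
            neighbours = sorted(bond + labels[n]
                                for n, bond in adjacency.get(a, []))
            updated[a] = atoms[a] + "(" + ",".join(neighbours) + ")"
        labels = updated
    return labels


# Heavy-atom skeleton of ethanol: C-C-O with single bonds.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = {(0, 1): "-", (1, 2): "-"}

depth1 = circular_labels(atoms, bonds, depth=1)
depth2 = circular_labels(atoms, bonds, depth=2)
```

The distinct label strings collected over all atoms and all depths up to the chosen maximum play the role of the patterns in the fingerprint.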

  25. All Patterns: 2D Kernels
      Two-Dimensional Kernels [Azencott et al., 2007]
      ◮ Systematically extract paths / circular fingerprints, for various maximal depths
      ◮ SVM with Tanimoto / MinMax
      ◮ Data
        ◮ Mutagenicity (Mutag): 188 compounds
        ◮ Benzodiazepine receptor affinity (BZR): 181 + 125 compounds
        ◮ Cyclooxygenase-2 inhibitors (COX2): 178 + 125 compounds
        ◮ Estrogen receptor affinity (ER): 166 + 180 compounds

      Data    SVM     Previous best
      Mutag   90.4%   85.2% (gBoost)
      BZR     76.4%   79.8%
      COX2    70.1%   73.6%
      ER      79.8%   82.1%

  26. All Patterns: 3D Kernels
      Introducing Spatial Information
      ◮ 3D Histograms [Azencott et al., 2007]
        ◮ Groups of $k$ atoms
        ◮ Associated size:
          ◮ Pairwise distances ($k = 2$)
          ◮ diameter of the smallest sphere that contains all $k$ atoms

  27. All Patterns: 3D Kernels
      Introducing Spatial Information
      One histogram per class of $k$-tuple (e.g. C-C-C, C-C-O)
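
For the k = 2 case, the 3D histograms can be sketched as one distance histogram per class of atom pair. The bin width, the number of bins, the clamping of large distances into the last bin, and the toy coordinates are all assumptions made for this illustration; the function name is hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def pairwise_distance_histograms(atoms, bin_width, n_bins):
    """Return {pair class: list of bin counts} over all atom pairs.

    atoms: list of (label, (x, y, z)) tuples
    """
    histograms = defaultdict(lambda: [0] * n_bins)
    for (label_a, pos_a), (label_b, pos_b) in combinations(atoms, 2):
        # Sort the labels so that C-O and O-C fall in the same class.
        pair_class = "-".join(sorted((label_a, label_b)))
        distance = math.dist(pos_a, pos_b)
        # Distances beyond the last bin edge are clamped into the last bin.
        bin_index = min(int(distance // bin_width), n_bins - 1)
        histograms[pair_class][bin_index] += 1
    return dict(histograms)


# Toy coordinates, not a real molecular geometry.
atoms = [
    ("C", (0.0, 0.0, 0.0)),
    ("C", (1.5, 0.0, 0.0)),
    ("O", (1.5, 1.4, 0.0)),
]

histograms = pairwise_distance_histograms(atoms, bin_width=1.0, n_bins=4)
```

Concatenating the histograms of all pair classes gives a count vector that can be compared with the MinMax measure in the same way as the 2D fingerprints.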
