
RECOGNITION OF PROTEIN FUNCTION



  1. RECOGNITION OF PROTEIN FUNCTION USING THE LOCAL SIMILARITY. Kirill E. Alexandrov, Dmitry A. Filimonov, Boris N. Sobolev, Vladimir V. Poroikov. Institute of Biomedical Chemistry of Russian Academy of Medical Sciences, Moscow, Russia

  2. Agenda: 1. History of Problem; 2. Sequence Local Similarity; 3. Algorithm of Similarity Calculation; 4. Local Similarity Approach Paradigm; 5. Algorithm of Protein Function Recognition; 6. Prediction Accuracy Estimation; 7. Results of Local Similarity Approach Evaluation; 8. Acknowledgements

  3. The central dogma of SAR/QSAR/QSPR: Property = Function(Structure). Continuity hypothesis: the smaller the difference between structures, the smaller the difference between their properties. $y_{pred} = x_0 + \sum_i x_i F_i(S)$, where $F_i(S) = \mathrm{LogP}, \dots, (\mathrm{LogP})^2, \dots$ in traditional QSAR and $F_i(S) = \mathrm{Sim}(S, S_i)$ in similarity-based QSAR. Methods: MLR (multiple linear regression), PLS (projections to latent structures), ANN (artificial neural network), SVM (support vector machine).
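To make the similarity-based form of the equation concrete, the following minimal sketch evaluates y_pred = x_0 + sum_i x_i Sim(S, S_i) for given coefficients. The Tanimoto coefficient on descriptor sets is an assumption chosen for illustration (the slide does not say which similarity measure is used), and the function names `tanimoto` and `predict` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two descriptor sets (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def predict(structure, references, coefficients, intercept):
    """Evaluate y_pred = x_0 + sum_i x_i * Sim(S, S_i).

    structure    -- descriptor set of the query structure S
    references   -- descriptor sets of the reference structures S_i
    coefficients -- regression coefficients x_i (would come from MLR, PLS, etc.)
    intercept    -- the constant term x_0
    """
    return intercept + sum(
        x * tanimoto(structure, ref) for x, ref in zip(coefficients, references)
    )
```

In practice the coefficients would be fitted by one of the methods named on the slide; here they are simply passed in.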

  4. The local similarity principle. QSAR with CoMFA: Tripos' patented Comparative Molecular Field Analysis (CoMFA) has been used as the method of choice in hundreds of published QSAR studies.

  5. Neighborhoods of atoms descriptors. MOLECULAR BIOLOGY, QUANTUM CHEMISTRY, QUANTUM FIELD THEORY: $M = V + VgM = V + VgV + VgVgV + VgVgVgV + \dots$; $M_i = V_i + V_i g M = V_i + V_i g (M_1 + M_2 + \dots + M_m)$. All descriptors describe each atom of a molecule together with its neighborhood: MNA (multilevel neighborhoods of atoms), RMNA (reaction multilevel neighborhoods of atoms), QNA (quantitative neighborhoods of atoms), FNA (fuzzy neighborhoods of atoms). Filimonov D.A., Poroikov V.V. (2006) Ross. Khim. Zh., L(2), 66-75.

  6. Multilevel neighborhoods of atoms descriptors (MNA). Descriptors of one atom at successive levels: MNA/0: C; MNA/1: C(CN-H); MNA/2: C(C(CC-H)N(CC)-H(C)). Filimonov D.A., Poroikov V.V. (2006) Ross. Khim. Zh., L(2), 66-75.

  7. Multilevel neighborhoods of atoms descriptors (MNA). MNA/2 descriptors of the molecule: C(C(CC-H)C(CC-C)-H(C)); C(C(CC-H)C(CN-H)-H(C)); C(C(CC-H)C(CN-H)-C(C-O-O)); C(C(CC-H)N(CC)-H(C)); C(C(CC-C)N(CC)-H(C)); N(C(CN-H)C(CN-H)); -H(C(CC-H)); -H(C(CN-H)); -H(-O(-H-C)); -C(C(CC-C)-O(-H-C)-O(-C)); -O(-H(-O)-C(C-O-O)); -O(-C(C-O-O)). Filimonov D.A., Poroikov V.V. (2006) Ross. Khim. Zh., L(2), 66-75.
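The way MNA descriptors grow from level to level can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the descriptors listed on the slide appear consistent with a pyridine ring carrying a carboxyl group (nicotinic acid), so that structure is assumed here; chain atoms are written with a leading "-" as on the slide; and the sort order (ring atoms before chain atoms, H before C before N before O) is inferred only from the ordering visible in the slide's strings. The names `mna_descriptors`, `LABELS`, and `BONDS` are hypothetical.

```python
# Character ranking inferred from the slide's strings (an assumption):
# ring atoms sort before chain atoms ("-" prefix), and H < C < N < O.
_RANK = {ch: rank for rank, ch in enumerate("()HCNO-")}


def _sort_key(descriptor):
    return [_RANK[ch] for ch in descriptor]


def mna_descriptors(labels, bonds, level):
    """Return the MNA/<level> descriptor of every atom.

    labels -- atom labels, e.g. "C" for a ring carbon or "-H" for a chain atom
    bonds  -- list of (i, j) index pairs
    """
    neighbors = {i: [] for i in range(len(labels))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    current = list(labels)  # MNA/0 is just the atom label
    for _ in range(level):
        # Level n: own label followed by the sorted level n-1 descriptors
        # of all directly bonded atoms.
        current = [
            labels[i]
            + "("
            + "".join(sorted((current[j] for j in neighbors[i]), key=_sort_key))
            + ")"
            for i in range(len(labels))
        ]
    return current


# Assumed structure: nicotinic acid. Atoms 0-5 form the ring (N, then five C),
# 6 is the carboxyl carbon, 7 and 8 are oxygens, 9-13 are hydrogens.
LABELS = ["N", "C", "C", "C", "C", "C", "-C", "-O", "-O",
          "-H", "-H", "-H", "-H", "-H"]
BONDS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
         (2, 6), (6, 7), (6, 8),
         (1, 9), (3, 10), (4, 11), (5, 12), (8, 13)]

mna2 = sorted(set(mna_descriptors(LABELS, BONDS, 2)), key=_sort_key)
```

Under these assumptions the set `mna2` contains the same twelve strings as the slide, which suggests the recursive reading of the notation is at least plausible.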

  8. Prediction of activity spectra for organic compounds. According to the Bayes formula, the probability P(A|S) that compound S has activity A is: $P(A|S) = P(S|A) \cdot P(A) / P(S)$. If the descriptors $D_1, \dots, D_m$ of an organic compound are mutually independent, then: $P(S|A) = P(D_1, \dots, D_m | A) = \prod_i P(D_i | A)$. P(A) and P(A|D_i) are calculated as sums over all organic compounds of the training set. Filimonov D.A., Poroikov V.V. (2006) Ross. Khim. Zh., L(2), 66-75.
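The independence assumption above leads to a naive Bayes estimate, which can be sketched as follows. This is a generic illustration, not the estimator from the slide (its summation formulas are not preserved in this transcript): probabilities are taken from simple counts, add-one smoothing is an added assumption to avoid zero factors, only descriptors present in the compound are multiplied, and P(S) is obtained by normalizing over "active" and "inactive". The names `train` and `posterior_active` are hypothetical.

```python
def train(compounds):
    """Count descriptor occurrences per class.

    compounds -- list of (descriptor_set, is_active) pairs (the training set)
    """
    n = {True: 0, False: 0}
    counts = {True: {}, False: {}}
    vocab = set()
    for descriptors, active in compounds:
        n[active] += 1
        vocab |= set(descriptors)
        for d in descriptors:
            counts[active][d] = counts[active].get(d, 0) + 1
    return {"n": n, "counts": counts, "vocab": vocab}


def posterior_active(model, descriptors):
    """Estimate P(A|S) = P(S|A) P(A) / P(S) under descriptor independence."""
    total = model["n"][True] + model["n"][False]
    score = {}
    for cls in (True, False):
        p = model["n"][cls] / total  # prior P(A) or P(not A)
        for d in descriptors:
            if d not in model["vocab"]:
                continue  # descriptors never seen in training carry no evidence
            # P(D_i | class) with add-one smoothing (an assumption)
            p *= (model["counts"][cls].get(d, 0) + 1) / (model["n"][cls] + 2)
        score[cls] = p
    # P(S) = P(S|A) P(A) + P(S|not A) P(not A)
    return score[True] / (score[True] + score[False])
```

With an empty descriptor set the function returns the prior, which is a quick sanity check on the normalization.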

  9. Quantitative neighborhoods of atoms descriptors (QNA). $Q_i = a_i \sum_k [g(C)]_{ik} b_k$, where $a_i$ and $b_k$ are parameters of atoms i and k, and g(C) is a function of the connectivity matrix C. $P_i = B_i^{-1/2} \sum_k (\exp(-\tfrac{1}{2} C))_{ik} B_k^{-1/2}$; $Q_i = B_i^{-1/2} \sum_k (\exp(-\tfrac{1}{2} C))_{ik} B_k^{-1/2} A_k$; $A = \tfrac{1}{2}(IP + EA)$, $B = IP - EA$, where IP is the first ionization potential and EA is the electron affinity. Feynman R. Ph. Phys. Rev., 1939, 56, 340-343. Robert G. Parr et al. J. Chem. Phys., 1978, 68(8), 3801-3807. Gasteiger J, Marsili M. Tetrahedron, 1980, 36, 3219-3228. Rappe A K and W A Goddard III. J. Ph. Ch., 1991, 95, 3358-3363.
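A small numerical sketch may help in reading the QNA formulas. It assumes the garbled expressions on this slide read P_i = B_i^(-1/2) sum_k (exp(-C/2))_ik B_k^(-1/2) and Q_i = B_i^(-1/2) sum_k (exp(-C/2))_ik B_k^(-1/2) A_k, with A = (IP + EA)/2 and B = IP - EA, and that the connectivity matrix C is the plain adjacency matrix with a zero diagonal. The matrix exponential is computed by a truncated Taylor series, which is adequate only for small matrices; the names `mat_exp` and `qna` are hypothetical.

```python
import math


def mat_exp(matrix, terms=30):
    """Matrix exponential by a truncated Taylor series: sum_k M^k / k!."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # current term, starts as the identity
    for k in range(1, terms):
        # term_k = term_(k-1) * M / k
        term = [
            [sum(term[i][l] * matrix[l][j] for l in range(n)) / k for j in range(n)]
            for i in range(n)
        ]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def qna(connectivity, ip, ea):
    """Return (P, Q) per atom under the assumed reading of the slide's formulas."""
    n = len(connectivity)
    a = [0.5 * (ip[k] + ea[k]) for k in range(n)]     # A_k = (IP + EA) / 2
    b = [(ip[k] - ea[k]) ** -0.5 for k in range(n)]   # B_k^(-1/2), B = IP - EA
    e = mat_exp([[-0.5 * c for c in row] for row in connectivity])
    p = [b[i] * sum(e[i][k] * b[k] for k in range(n)) for i in range(n)]
    q = [b[i] * sum(e[i][k] * b[k] * a[k] for k in range(n)) for i in range(n)]
    return p, q


# Two bonded hydrogen atoms; IP and EA in eV (approximate literature values).
p_h2, q_h2 = qna([[0, 1], [1, 0]], ip=[13.598, 13.598], ea=[0.754, 0.754])
```

For this two-atom case the exponential has a closed form, so each P_i should equal exp(-1/2) / (IP - EA), which gives a simple check of the series.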

  10. Quantitative neighborhoods of atoms descriptors (QNA). ChemNavigator database in QNA space: 976,545,026 QNA descriptors of 24,621,668 molecules. Plots: initial QNA space; normalized QNA space.

  11. Quantitative neighborhoods of atoms descriptors (QNA). Examples: Nicotinic Acid, Aspirin, Sulfathiazole.

  12. GUSAR: QNA-based prediction of quantitative properties of organic compounds.

  13. GUSAR: QNA-based prediction of quantitative properties of organic compounds. Data sets: CDK2 inhibitors, DHFR inhibitors, ACE inhibitors, Vibrio fischeri, Chlorella vulgaris, Tetrahymena pyriformis.

  14. GUSAR: QNA-based prediction of quantitative properties of organic compounds. Comparison chart; methods: PLS, MLR, GFA, HQSAR, CoMFA, EVA, CoMSIA, 3D Cerius2, 2D Cerius2; series: delta R2, delta Q2, delta R2 test; axis from -0.10 to 0.20.

  15. OK. But how can local similarity be used for recognition of protein function?

  16. Pairwise sequence alignment. 1996, Autumn. For a long time, homology-derived annotation based on pairwise sequence alignment was the general way to predict protein function.

  17. Sequence Local Similarity. Frame 20, shift from -8 to +8. Each row shows the compared sequence at one shift, followed by its score:
      AANRDPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVA 2
      ANRDPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVAL 1
      NRDPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALR 1
      RDPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRA 0
      DPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRAL 1
      PSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALF 2
      SQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFG 1
      QFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGR 1
      FPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRF 2
      PDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFP 0
      DPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPA 1
      PHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPAL 0
      HRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALS 9 (the best match)
      RFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSL 0
      FDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSLG 3
      DVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSLGI 1
      VTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSLGID 1
      TRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSLGIDA 2
      Query sequence:
      GTAINKPLSEKMMLFGMGKRRCIGEVLAKWEIFLFLAILLQQLEFSV 9
      R_i = 9

  18. Sequence Local Similarity. Algorithm of Similarity Calculation. Notation: i is the position number in the query sequence A; a and b are amino acid residues in sequence A and sequence B; m is the current shift between sequence A and sequence B; F is the frame size; R_i is the primary similarity value; S_i is the local similarity value for position i in the query sequence A with sequence B. Speed: about 1000 sequences per second.
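The slide's formula is not preserved in this transcript, so the following is only a sketch of one reading of the notation: for a frame of F residues starting at position i of the query A, count identical residues against sequence B at each shift m, and take the maximum over shifts as the primary similarity R_i. Plain identity counting is an assumption (a substitution matrix could equally be intended), the computation of S_i from R_i is not attempted, and the names `frame_matches` and `primary_similarity` are hypothetical. The two sequences are taken from the "Frame 20" example slide, with the compared sequence reassembled from its overlapping rows; shifts are counted from the start of that reassembled fragment, not in the slide's own -8 to +8 numbering.

```python
def frame_matches(a, b, i, m, frame):
    """Count identical residues between a[i:i+frame] and b[i+m:i+m+frame]."""
    count = 0
    for j in range(frame):
        ia, ib = i + j, i + m + j
        if 0 <= ia < len(a) and 0 <= ib < len(b) and a[ia] == b[ib]:
            count += 1
    return count


def primary_similarity(a, b, i, frame=20, shifts=range(-8, 9)):
    """Return (R_i, shift): the best frame match count over the given shifts."""
    return max((frame_matches(a, b, i, m, frame), m) for m in shifts)


# Sequences from the "Frame 20" example slide.
QUERY = "GTAINKPLSEKMMLFGMGKRRCIGEVLAKWEIFLFLAILLQQLEFSV"
TARGET = ("AANRDPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVA"
          "LRALFGRFPALSLGIDA")

# The 20-residue frame starts at 0-based position 14 of the query.
r_i, best_shift = primary_similarity(QUERY, TARGET, 14, frame=20, shifts=range(0, 18))
```

With these inputs the identity count at the best-matching row agrees with the score of 9 shown on the example slide, which supports (but does not prove) the identity-counting reading.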

  19. Sequence Local Similarity. 13.11.1996

  20. “Is there a correspondence between the similarity of substrates and of protein sequences in the cytochrome P450 superfamily?” The results of substrate-based clustering correspond to the homology-based classification for families CYP 1, 2, 3, 4, 5, 6, 7, 11. For other P450 families (CYP 8, 17, 19, 21, 24, 26, 27), substrate-based clustering contradicts the traditional classification. [Plots for CYP4 and CYP7: proportion of homologs (0 to 1) versus number of clusters (0 to 150); legend: real data, average random data, confidence interval.] Borodina Yu.V., Lisitsa A.V., Poroikov V.V., Filimonov D.A., Sobolev B.N., Archakov A.A. Nova Acta Leopoldina, 2003, 87(329), 47-55.

  21. “Quantifying the Relationships among Drug Classes”. A subset of the MDDR database containing 65 367 compounds organized in 249 sets, each associated with a specific biological target. “By multiple criteria, bioinformatics and chemoinformatics networks differed substantially, and only occasionally did a high sequence similarity correspond to a high ligand-set similarity.” Hert, J., Keiser, M. J., Irwin, J. J., Oprea, T. I., Shoichet, B. K. “Quantifying the Relationships among Drug Classes” J. Chem. Inf. Model., 2008, 48(4), 755-765.

  22. Diagram labels: Unique law of nature; Machine Learning; Fundamental theory; Ab initio principles; Learning by example; Molecular Modelling; Partial estimate; Homology.

  23. Protein function recognition based on learning by example. It is based on a data set of sequences with known properties. This data set must be subdivided into “positive” and “negative” examples: group A and its complement ¬A. [Diagram labels: A, ¬A, B, C.]

  24. Is a universal similarity measure reasonable?
