Inferring protein functions by matching binding surfaces


Inferring protein functions by matching binding surfaces through evolutionary models Jie Liang (Joint work with Jeffrey Tseng) Dept. of Bioengineering University of Illinois at Chicago Outline Methodology: Computational geometry of


  1. Inferring protein functions by matching binding surfaces through evolutionary models Jie Liang (Joint work with Jeffrey Tseng) Dept. of Bioengineering University of Illinois at Chicago

  2. Outline
     Methodology:
     • Computational geometry of surface pattern:
       – Candidate motifs.
     • Assessing surface similarity.
       – Sequence, shape, orientation, and p-values.
     • Incorporation of evolutionary information by Bayesian Markov chain Monte Carlo.
     Discovery:
     • Protein functional prediction.

  3. The Universe of Protein Structures
     • Human genome: 3 billion nucleotides
     • Number of genes: 30,000
     • Protein families: 10,000-30,000
     • Number of folds: 1,000-4,000
     • Currently in PDB: < 700 folds (from SCOP)
       – Comparative modeling: needs a structural template with sequence identity > 30-35%
         • e.g. ~50% of ORFs and ~18% of residues of the S. cerevisiae genome
     • Structural genomics: populating each fold with 4-5 structures
       – One for each superfamily at 30-35% sequence identity.
       – Fold of a novel gene can be identified.
         • Its structure can then be interpolated by comparative modeling.
     (Figure labels: All β, α/β)

  4. • Main chain folds:
       – Important for understanding evolution.
       – May not directly lead to understanding of function.
     (Figure: Tenascin (1ten), SCOP all beta proteins, Ig-like beta sandwich; Phosphotransferase (1poh), SCOP a+b proteins, HPr fold. From Jaroszewski & Godzik, ISMB 00.)

  5. Predicting protein function by matching surfaces
     • Proteins from structural genomics often are of unknown function.
       – Sequence homologs are often hypothetical proteins.
     • Strategy: matching automatically computed surfaces that may be binding sites.
     • Three tasks:
       – Geometric computation: a library of > 2 million surface patterns on > 20,000 PDBs (cast.engr.uic.edu).
       – Similarity measure: sequence patterns, coordinate RMSD, and orientational RMSD.
       – Scoring matrix.
     (Figure label: Shape library)
     (Mucke and Edelsbrunner, ACM Trans. Graphics, 1994; Edelsbrunner et al., Discrete Applied Math., 1998; Binkowski, Adamian, and Liang, J. Mol. Biol. 332:505-526, 2003)

  6. Protein Functional Surfaces
     (Figure: GDP binding pockets of Ras 21 and FtsZ)

  7. http://cast.engr.uic.edu

  8. Voids and Pockets in Soluble Proteins
     • Many voids and pockets.
       – At least 1 water molecule.
       – 15/100 residues.
     (Figure: number of voids and pockets versus number of residues)
     (Liang & Dill, 2001, Biophys. J.)

  9. Simulating Protein Packing with Off-Lattice Chain Polymers
     • 32-state off-lattice discrete model
     • Sequential Monte Carlo and resampling:
       – 1,000+ conformations of N = 2,000
     (Zhang, Chen, Tang and Liang, 2003, J. Chem. Phys.)

  10. • Proteins are not optimized by evolution to eliminate voids.
        – Protein dictated by generic compactness constraint related to $n_c$.

  11. How to identify biologically important pockets and voids from random ones? Local Sequence and Shape Similarity (Binkowski, Adamian, Liang, 2003, JMB, 332:505-526)

  12. Binding Site Pocket: Sparse Residues, Long Gaps
      • ATP binding: cAMP Dependent Protein Kinase (1cdk)
      • Tyr Protein Kinase c-src (2src)
      1cdk.A (residues 49-327):
        LGTGSFGRVMLVKHKETGNHFAMKILDKQKVVKLKQIEHTLNEKRILQAVNFPFLVKLEYSFKDNSNL
        YMVMEYVPGGEMFSHLRRIGRFSEPHARFYAAQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFG
        FAKRVKGRTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVR
        FPSHFSSDLKDLLRNLLQVDLTKRFGNLKDGVNDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSN
        F
      1cdk.A_p (pocket residues, from residue 49):
        LGTGSFGRV A K V MEYV E K EN L TD F
      2src.m (residues 273-404):
        LGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIV
        TEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVAD
      2src.m_p (pocket residues, from residue 273):
        LGQGCFGEV A K V TEYM GS D D R AN L AD
      Low overall sequence identity: 13%

  13. High Sequence Similarity of Pocket Residues
      1cdk: cAMP Dependent Protein Kinase
      2src: Tyr Protein Kinase c-src
      1cdk.A  LGTGSFGRVAKVMEYV---EKENLTDF  24
      2src.m  LGQGCFGEVAKVTEYMGSDDRANLAD-  26
              ** *.**.**** **:   :: **:*
      High sequence identity: 51%

  14. Sequence Similarity of Surface Pockets
      • Similarity detection:
        – Dynamic programming: SSEARCH (Pearson, 1998)
          • BLOSUM50 scoring matrix (Henikoff, 1994).
          • Not identity.
        – Order-dependent sequence pattern.
      • Statistical significance!
      • Statistics of null model:
        – Gapless local alignment: extreme value distribution (Altschul & Karlin, 90)
        – Alignment with gaps: (Altschul, Bundschuh, Olsen & Hwa, 01)
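The dynamic-programming step behind a local alignment score can be sketched as follows. This is a minimal illustration only: it uses a simple match/mismatch score and a linear gap penalty (all three values are assumptions chosen for brevity), not the BLOSUM50 matrix or the gap model used by SSEARCH, and the function name `smith_waterman_score` is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b.

    Toy scoring (assumption): fixed match/mismatch scores and a linear
    gap penalty, standing in for a substitution matrix such as BLOSUM50.
    """
    prev = [0] * (len(b) + 1)   # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0,
                          prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
            best = max(best, curr[j])
        prev = curr
    return best
```

For example, the two pocket sequences from slide 13 could be compared with `smith_waterman_score("LGTGSFGRVAKVMEYVEKENLTDF", "LGQGCFGEVAKVTEYMGSDDRANLAD")`; the resulting number depends entirely on the toy scoring values above.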

  15. Approximation with EVD distribution (Pearson, 1998, JMB)
      • Kolmogorov-Smirnov test:
        – Estimate K and λ parameters.
      • Estimation of E-value:
        – Estimate p-value of observed Smith-Waterman score by EVD:
          $S' = \lambda S - \ln(Kmn), \quad p(S' \ge x) = 1 - \exp(-e^{-x})$
        – Estimate E-value:
          $E = p \cdot (N_{all} - N_d) \le p \cdot N_{all}$
      (Binkowski, Adamian, Liang, 2003, JMB, 332:505-526)
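The score normalization, the EVD p-value, and the E-value on this slide translate directly into a few lines of code. The sketch below assumes K and λ have already been fitted; the function names and the default `n_d=0` are hypothetical, and the slide does not define N_d, so it is carried through only as a symbol.

```python
import math

def normalized_score(raw_score, lam, K, m, n):
    """S' = lambda * S - ln(K * m * n), with m and n the two sequence lengths."""
    return lam * raw_score - math.log(K * m * n)

def evd_p_value(s_norm):
    """p(S' >= x) = 1 - exp(-exp(-x)) for the extreme value distribution."""
    if s_norm < -700.0:          # exp(-x) would overflow; p is 1 to machine precision
        return 1.0
    # expm1 keeps precision when the p-value is very small.
    return -math.expm1(-math.exp(-s_norm))

def e_value(p, n_all, n_d=0):
    """E = p * (N_all - N_d), which is bounded above by p * N_all."""
    return p * (n_all - n_d)
```

A caller would chain the three: `e_value(evd_p_value(normalized_score(S, lam, K, m, n)), n_all)`.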

  16. Shape Similarity Measure
      • cRMSD (coordinate root mean square distance)
      • oRMSD (orientational RMSD):
        – Place a unit sphere $S^2$ at the center of mass $x_0 \in \mathbb{R}^3$.
        – Map each residue $x \in \mathbb{R}^3$ to a unit vector on $S^2$: $f: x = (x, y, z)^T \mapsto u = (x - x_0) / \|x - x_0\|$
        – Measure the RMSD between two sets of unit vectors.
      (cf. uRMSD by Kedem and Chew, 2002)
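The oRMSD construction can be sketched as below. Two simplifying assumptions are made: the residues of the two pockets are already paired in order, and no optimal rotation is searched for before the RMSD is taken (a full implementation would superimpose the two vector sets first). The names `unit_vectors` and `ormsd` are hypothetical.

```python
import math

def unit_vectors(coords):
    """Map each residue position to a unit vector from the pocket's center of mass."""
    n = len(coords)
    center = [sum(c[i] for c in coords) / n for i in range(3)]
    vectors = []
    for c in coords:
        d = [c[i] - center[i] for i in range(3)]
        norm = math.sqrt(sum(v * v for v in d))
        if norm == 0.0:
            raise ValueError("residue coincides with the center of mass")
        vectors.append([v / norm for v in d])
    return vectors

def ormsd(pocket_a, pocket_b):
    """RMSD between the two sets of unit vectors (paired in order, no rotation fit)."""
    if len(pocket_a) != len(pocket_b):
        raise ValueError("pockets must have the same number of residues")
    ua, ub = unit_vectors(pocket_a), unit_vectors(pocket_b)
    sq = sum((x - y) ** 2 for p, q in zip(ua, ub) for x, y in zip(p, q))
    return math.sqrt(sq / len(ua))
```

Because only directions are kept, uniformly scaling or translating a pocket leaves its oRMSD to another pocket unchanged, which is what distinguishes it from cRMSD.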

  17. Statistical Significance of Shape Similarity
      • Estimate the probability p of obtaining a specific cRMSD or oRMSD value for random pockets with $N_{res}$ residues.
        – EVD and other parametric distributions are not accurate.
        – Randomly select 2 pockets.
        – Calculate cRMSD for $N_{res}$ randomly selected residues.
        – Also calculate oRMSD.
      (Figure or table residue: labels $N_{res}$ and "Random surfaces"; recovered values 3, $10^{-8}$; 30, $10^{-7}$; 100, $10^{-6}$)
      (Binkowski, Adamian, Liang, 2003, JMB, 332:505-526)
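The sampling procedure for an empirical p-value can be sketched as follows. This is an illustration under stated assumptions: pockets are plain lists of (x, y, z) tuples, the distance is an RMSD after centering only (a real cRMSD would also optimize the rotation), and the p-value is taken as the plain fraction of random values at or below the observed one. All function names are hypothetical.

```python
import math
import random

def centered(coords):
    """Translate a point set so its centroid is at the origin."""
    n = len(coords)
    c = [sum(p[i] for p in coords) / n for i in range(3)]
    return [[p[i] - c[i] for i in range(3)] for p in coords]

def rmsd_no_rotation(a, b):
    """RMSD after centering only; a stand-in for cRMSD (assumption: no rotation fit)."""
    a, b = centered(a), centered(b)
    sq = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(sq / len(a))

def sample_null(pockets, n_res, n_pairs, rng):
    """Null distribution: repeatedly pick 2 random pockets and n_res random residues from each."""
    eligible = [p for p in pockets if len(p) >= n_res]
    if len(eligible) < 2:
        raise ValueError("need at least two pockets with n_res residues")
    values = []
    for _ in range(n_pairs):
        a, b = rng.sample(eligible, 2)
        values.append(rmsd_no_rotation(rng.sample(a, n_res), rng.sample(b, n_res)))
    return values

def empirical_p(observed, null_values):
    """Fraction of random values that are at least as small as the observed one."""
    return sum(1 for v in null_values if v <= observed) / len(null_values)
```

The resolution of such an estimate is limited by the number of sampled pairs: with n_pairs random pairs, no p-value below 1/n_pairs can be resolved.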

  18. Surprising Surface Similarity
      HIV-1 Protease (5hvp)
        CATH Class: All β
        Fold: Acid proteases
        Family: Retroviral protease
        Pocket: Binds poly-peptide substrate acetyl-pepstatin
      Heat Shock Protein 90 (1yes)
        CATH Class: α + β
        Fold: α/β sandwich
        Family: Hsp90
        Pocket: Binds protein segment geldanamycin
      • Conserved residues both important in polypeptide binding
      • Both pockets undergo conformational changes upon binding

  19. How to incorporate evolutionary information? What to do if related sequences all have unknown functions?

  20. Likelihood function of a given phylogeny
      • Given a set of multiple-aligned sequences $S = (x_1, x_2, \ldots, x_s)$ and a phylogenetic tree $T = (V, \mathcal{E})$, a column $x_h$ at position $h$ is represented as:
        $x_h = (x_{1,h}, x_{2,h}, \ldots, x_{s,h})^T$
      • The likelihood function of observing these sequences is:
        One column: $p(x_h \mid T, Q) = \pi_{x_k} \sum_{i \in I,\; x_i \in A} \; \prod_{(i,j) \in \mathcal{E}} p_{x_i x_j}(t_{ij})$
        Whole sequence: $P(S \mid T, Q) = P(x_1, \ldots, x_s \mid T, Q) = \prod_{h=1}^{s} p(x_h \mid T, Q)$
      (Figure: phylogenetic tree with 16 leaves, numbered 1-16; scale bar 0.1 substitution/site)
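The sum over internal-node states in the column likelihood is usually evaluated recursively from the leaves upward (Felsenstein's pruning recursion); the slide does not say which evaluation scheme was used, so the sketch below is one possible way, not necessarily the authors'. Assumptions: residues are encoded as integers 0..n-1, a leaf is an integer index into the column, an internal node is a list of (child, branch length) pairs, and P(t) = exp(Qt) is computed with a small pure-Python Taylor series. All names are hypothetical.

```python
import math

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(Q, t, terms=20, squarings=10):
    """P(t) = exp(Q t) via a scaled Taylor series followed by repeated squaring."""
    n = len(Q)
    scale = t / (2 ** squarings)
    M = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, M)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)
    return P

def partial_likelihood(node, column, Q):
    """Probability of the observed leaves below node, for each state at node."""
    n = len(Q)
    if isinstance(node, int):                      # leaf: observed residue
        return [1.0 if a == column[node] else 0.0 for a in range(n)]
    L = [1.0] * n
    for child, t in node:                          # internal node
        P = transition_matrix(Q, t)
        Lc = partial_likelihood(child, column, Q)
        for a in range(n):
            L[a] *= sum(P[a][b] * Lc[b] for b in range(n))
    return L

def column_likelihood(tree, column, Q, pi):
    """p(x_h | T, Q): root states weighted by the stationary distribution pi."""
    L = partial_likelihood(tree, column, Q)
    return sum(pi[a] * L[a] for a in range(len(pi)))

def log_likelihood(tree, columns, Q, pi):
    """log P(S | T, Q): columns are treated as independent, so logs add."""
    return sum(math.log(column_likelihood(tree, col, Q, pi)) for col in columns)
```

Working in log space for the whole alignment avoids underflow when many columns are multiplied together. A production version would cache P(t) per branch instead of recomputing it for every column.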

  21. Estimation of instantaneous rates Q
      • Posterior probability of rate matrix given the sequences and tree:
        $\pi(Q \mid S, T) \propto \int P(S \mid T, Q) \cdot \pi(Q)\, dQ$,
        where $\pi(Q)$: prior distribution, $P(S \mid T, Q)$: likelihood distribution, $\pi(Q \mid S, T)$: posterior distribution.
      • Bayesian estimation of posterior mean of rates in Q:
        $E_\pi(Q) = \int Q \cdot \pi(Q \mid S, T)\, dQ$
      • Estimated by Markov chain Monte Carlo.
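Estimating a posterior mean by Markov chain Monte Carlo can be illustrated on a deliberately tiny problem. The sketch below is not the authors' sampler: it assumes a two-state symmetric model with a single rate `alpha`, two sequences separated by a total branch length `t`, a uniform prior on (0, 10), and a Gaussian random-walk Metropolis proposal. Every name and constant is hypothetical.

```python
import math
import random

def log_likelihood(alpha, n_same, n_diff, t):
    """Two-state symmetric model: sites agree with prob 0.5 + 0.5 * exp(-2 * alpha * t)."""
    decay = math.exp(-2.0 * alpha * t)
    p_same = 0.5 * (0.5 + 0.5 * decay)   # joint probability of one identical site pair
    p_diff = 0.5 * (0.5 - 0.5 * decay)   # joint probability of one differing site pair
    if p_diff <= 0.0:
        return -math.inf
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)

def log_prior(alpha, upper=10.0):
    """Uniform prior on (0, upper); an assumption made for this toy example."""
    return 0.0 if 0.0 < alpha < upper else -math.inf

def posterior_mean(n_same, n_diff, t, steps=20000, burn_in=2000,
                   step_size=0.05, seed=0):
    """Metropolis random walk; returns the average of alpha after burn-in."""
    rng = random.Random(seed)
    alpha = 0.5
    log_post = log_prior(alpha) + log_likelihood(alpha, n_same, n_diff, t)
    total, kept = 0.0, 0
    for step in range(steps):
        proposal = alpha + rng.gauss(0.0, step_size)
        lp = log_prior(proposal)
        if lp > -math.inf:
            cand = lp + log_likelihood(proposal, n_same, n_diff, t)
            # Accept with probability min(1, posterior ratio).
            if rng.random() < math.exp(min(0.0, cand - log_post)):
                alpha, log_post = proposal, cand
        if step >= burn_in:
            total += alpha
            kept += 1
    return total / kept
```

The same loop applies to a full rate matrix by replacing the scalar `alpha` with the free entries of Q and the toy likelihood with the tree likelihood; only the proposal and the likelihood function change.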

  22. Validation by simulation
      • Generate 16 artificial sequences from a known tree and known rates (JTT model)
        – Carboxypeptidase A2 precursor as ancestor, length = 147
      • Goal: recovering the substitution rates
      (Figure: phylogenetic tree used to generate 16 sequences; scale bar 0.1 substitution/site)
      (Figure: rate versus index for the JTT model, Parameters Estimated 1, and Parameters Estimated 2. Estimations from two initial conditions are very similar to the true values of residue substitution rates.)
      (Figure: negative log likelihood versus number of steps; data points collected every 500 simulation steps. Convergence of the Markov chain.)

  23. Accurate Estimation with > 20 residues and random initial values
      • Rel. Err. = $(|Q'|_F - |Q|_F) / |Q|_F$
      • The Q' matrix estimated by Bayesian MCMC has small relative error by Frobenius norm (< 5%) to Q.
      (Figure: relative error versus sequence length. Accurate when > 20 residues in length.)
      (Figure: distribution of relative errors of estimated rates starting from 50 sets of random initial values. All relative errors < 5%.)
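The relative error on this slide compares the Frobenius norms of the estimated and true rate matrices. A minimal sketch, with hypothetical function names and matrices given as nested lists:

```python
import math

def frobenius_norm(M):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in M for v in row))

def relative_error(Q_est, Q_true):
    """(|Q'|_F - |Q|_F) / |Q|_F, as written on the slide."""
    norm_true = frobenius_norm(Q_true)
    return (frobenius_norm(Q_est) - norm_true) / norm_true
```

As written, this is a difference of norms rather than the norm of a difference, so it can be negative, and it is zero for any estimate with the same norm as Q even if individual entries differ.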
