text encoding for protein structure representation



  1. TEXT ENCODING FOR PROTEIN STRUCTURE REPRESENTATION Jun Tan Donald Adjeroh

  2. Biological background The Problem and our goal Related research General idea Implementations Summary

  3. BIOLOGICAL BACKGROUND [1]

  4. PROTEIN

  5. PROTEIN One of life's four basic building blocks. DNA -> RNA -> Protein. Peptide bond: the link between two amino acids. Polypeptide: a chain of amino acids. Once the chain of amino acids has folded into its final shape, it is called a protein. Twenty types of amino acids. Three groups: COOH, NH2 and R.

  6. PROTEIN Proteins can have very complex shapes, and the final form is essential to the intended function. Primary structure: the linear sequence of amino acids in the chain. Secondary structure describes common folding patterns. Tertiary structure describes the overall three-dimensional structure of a single folded amino acid chain. Quaternary structure, for proteins with multiple chains, describes how the subunits are arranged to form the protein.

  7. PROTEIN Primary structure determines all other structures. However...

  8. PROTEIN The shape of the 3D protein structure has a direct impact on its function. Over evolution, secondary structure is much more conserved than sequence (primary structure). [2]

  9. THE PROBLEM AND OUR GOAL

  10. THE ORIGINAL FORMAT Saved as x, y, z coordinates for each atom along the chain. Even simple tasks require complicated operations.

  11. OUR GOAL Simplify the representation while keeping the important information, so that it retains its biological meaning. Demonstrate the performance by searching for similar structures in a protein domain database: 80 query domains, database size 23500.

  12. STRUCTURAL CLASSIFICATION OF PROTEINS (SCOP) DATABASE Protein domain: part of a protein that can evolve, function and exist independently of the rest of the protein chain. Manually classified. Hierarchical structure: Class, Fold, Superfamily, Family.

  13. RELATED RESEARCH

  14. DIRECTLY ALIGN 3D SHAPES High accuracy. Involves complex operations. Time consuming. Example: the DALI (distance alignment matrix method) algorithm [3, 4].

  15. CONVERT 3D SHAPE INTO 2D TEXTURES

  16. CONVERT 3D SHAPE INTO STRING Not as accurate as 3D methods, but close. Much faster. Example: Ramachandran codes [5].

  17. FRAGMENT APPROACH [6] Library of fragments/short structure motifs (hand picked?). Represent a protein structure by the frequencies of the fragments. Bag-of-words method.

  18. GENERAL IDEA Decompose a shape into a sequence of segments. Represent the segments with basic primitives: segment type, segment length and transition angle between segments. Encode the primitives into a shape string. Answer biological questions by applying string/text algorithms to the shape strings. N-grams, TF/IDF and cosine similarity are used to compare shape strings.
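
The comparison step on this slide (n-grams, TF/IDF, cosine similarity) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the n-gram length of 3, the smoothed IDF formula, and the toy shape strings over the letters A to D are all assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(shape_string, n=3):
    """Return the overlapping character n-grams of a shape string."""
    return [shape_string[i:i + n] for i in range(len(shape_string) - n + 1)]


def tfidf_vectors(shape_strings, n=3):
    """Build one sparse TF-IDF vector (a dict) per shape string.

    The smoothed IDF used here is one common variant and is an assumption;
    the slides do not say which weighting was used.
    """
    counts = [Counter(ngrams(s, n)) for s in shape_strings]
    num_docs = len(shape_strings)
    # Document frequency: in how many strings each n-gram occurs.
    doc_freq = Counter(gram for c in counts for gram in c)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vectors.append({
            gram: (k / total) * (math.log((1 + num_docs) / (1 + doc_freq[gram])) + 1.0)
            for gram, k in c.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(gram, 0.0) for gram, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical shape strings: one letter per encoded segment.
database = ["AAABBBAAAB", "AAABBBAABB", "CCCDDDCCCD"]
vectors = tfidf_vectors(database)
query = vectors[0]
# Rank every database entry by similarity to the query, best first.
ranking = sorted(range(len(database)),
                 key=lambda i: cosine(query, vectors[i]),
                 reverse=True)
print(ranking)
```

Each string is scanned once to count its n-grams, so the per-string cost grows linearly with its length and no structural alignment is needed.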

  19. IMPLEMENTATIONS

  20. DIHEDRAL ANGLES (RAMACHANDRAN ANGLES) One of the most important local parameters that control protein folding. Three angles: 1. φ involves atoms C'-N-Cα-C' 2. ψ involves atoms N-Cα-C'-N 3. ω involves atoms Cα-C'-N-Cα (usually 0° or 180° due to the peptide bond)
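
As a rough illustration of how such an angle can be computed from the stored x, y, z coordinates, the sketch below evaluates the signed dihedral angle defined by four atom positions. The function name `dihedral` and the coordinate tuples are assumptions for this example; the slides do not show the authors' code. Passing the atoms in the orders listed above (for instance C', N, Cα, C' for φ) gives the corresponding angle.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points in 3D.

    The angle is measured between the plane through p0, p1, p2 and the
    plane through p1, p2, p3. A cis arrangement gives 0 and a trans
    arrangement gives +/-180.
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)          # normal of the first plane
    n2 = _cross(b1, b2)          # normal of the second plane
    b1_len = math.sqrt(_dot(b1, b1))
    if b1_len == 0.0:
        raise ValueError("the two central points must be distinct")
    b1_unit = (b1[0] / b1_len, b1[1] / b1_len, b1[2] / b1_len)
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b1_unit)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (not real atom positions): the first and last points
# lie on the same side of the central bond, i.e. a cis arrangement.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
```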

  21. RAMACHANDRAN PLOT

  22. CLUSTERING DIHEDRAL ANGLES

  23. PRECISION VS RECALL 4 Clusters 6 Clusters

  24. FOLD VS RECALL 4 Clusters 6 Clusters

  25. CLASS VS RECALL 4 Clusters 6 Clusters

  26. TRIPLES Dihedral angles involve three consecutive residues only. Pick three residues/points that best represent a segment of a given length. The three residues are selected as follows: 1. Select the first and last residues, A and B. 2. Select residue C such that the distance d from C to the straight line segment AB is maximized. 3. Use three distances to represent the triple: d, |AB|, max(|AC|, |BC|). Another predefined parameter determines how much two adjacent fragments overlap.
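
The selection steps on this slide can be sketched as below. This is a minimal sketch under stated assumptions: each residue is represented by a single 3D point (for example its Cα atom), and the names `triple_features` and `sliding_segments`, as well as the way the overlap parameter is turned into a step size, are hypothetical choices for illustration.

```python
import math


def point_to_segment_distance(c, a, b):
    """Distance from point c to the straight line segment ab."""
    ab = tuple(bi - ai for ai, bi in zip(a, b))
    ac = tuple(ci - ai for ai, ci in zip(a, c))
    ab_len_sq = sum(x * x for x in ab)
    if ab_len_sq == 0.0:
        return math.dist(c, a)   # a and b coincide
    # Projection of c onto the line, clamped so it stays on the segment.
    t = sum(x * y for x, y in zip(ac, ab)) / ab_len_sq
    t = max(0.0, min(1.0, t))
    closest = tuple(ai + t * x for ai, x in zip(a, ab))
    return math.dist(c, closest)


def triple_features(segment):
    """Return (d, |AB|, max(|AC|, |BC|)) for a segment of 3D points."""
    if len(segment) < 3:
        raise ValueError("a segment needs at least three points")
    a, b = segment[0], segment[-1]                 # step 1: end points
    c = max(segment[1:-1],                         # step 2: farthest point
            key=lambda p: point_to_segment_distance(p, a, b))
    d = point_to_segment_distance(c, a, b)
    return d, math.dist(a, b), max(math.dist(a, c), math.dist(b, c))  # step 3


def sliding_segments(points, size, overlap):
    """Split a chain into segments of `size` points sharing `overlap` points."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the segment size")
    return [points[i:i + size] for i in range(0, len(points) - size + 1, step)]


# Toy chain of five points; the middle point bulges away from the line AB.
chain = [(0, 0, 0), (1, 0, 0), (2, 3, 0), (3, 0, 0), (4, 0, 0)]
for segment in sliding_segments(chain, size=5, overlap=2):
    print(triple_features(segment))
```

How the resulting distance triples are then clustered into letters of the shape string is not specified on the slide and is left out here.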

  27. ILLUSTRATION

  28. DISTRIBUTION (SEGMENT SIZE = 5)

  29. DISTRIBUTION (SEGMENT SIZE = 10)

  30. PRECISION VS RECALL Triples, 6 Clusters Dihedral angles, 6 Clusters

  31. FOLD VS RECALL Triples, 6 Clusters Dihedral angles, 6 Clusters

  32. CLASS VS RECALL Triples, 6 Clusters Dihedral angles, 6 Clusters

  33. SUMMARY

  34. SIGNIFICANCE Avoids alignment. Runs fast: O(n) complexity. Automatically learns important patterns. No predefined fragment libraries are needed.

  35. WEAKNESS AND FUTURE WORK Performance is not as good as that of alignment-based methods. Possible improvement one: using multiple strings.

  36. QUESTIONS?

  37. REFERENCE 1. An Introduction to Proteins. 2. Whitford D, Proteins: Structure and Function, John Wiley & Sons, West Sussex, 2005. 3. Holm L and Sander C, “Touring protein fold space with Dali/FSSP”, Nucleic Acids Res., 26, 316-319, 1998. 4. Holm L, Kaariainen S, Rosenstrom P and Schenkel A, “Searching protein structure databases with DaliLite v.3”, Bioinformatics, 24, 2780-2781, 2008.

  38. REFERENCE (CONT.) 5. Lo WC, Huang PJ, Chang CH and Lyu PC, “Protein structural similarity search by Ramachandran codes”, BMC Bioinformatics, 8, 307, 2007. 6. Budowski-Tal I, Nov Y and Kolodny R, “FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately”, Proceedings of the National Academy of Sciences, 107(8), 3481-3486, 2010.
