Improving protein secondary structure prediction based on short subsequences with local structure similarity
Hsin-Nan Lin, Ting-Yi Sung, Shinn-Ying Ho, Wen-Lian Hsu
Bioinformatics Program, TIGP (Taiwan International Graduate Program), Academia Sinica, Taiwan
The author Hsin-Nan Lin wishes to acknowledge, with thanks, the Taiwan International Graduate Program (TIGP) of Academia Sinica for financial support towards attending this conference.
Outline
• Introduction
  ◦ Protein secondary structure predictions
  ◦ Existing PSS methods
• Methods
  ◦ Synonymous words
  ◦ Compilation of a synonymous dictionary
  ◦ Prediction model
• Results
  ◦ Experiment results
  ◦ Two factors that affect prediction performance
• Conclusions
Protein Secondary Structure Prediction
• Protein secondary structure (PSS) elements
  ◦ The local conformation of amino acids
  ◦ 3 secondary structure states: helix (H), strand (E), coil (C)
  Ref: http://bioweb.wku.edu/courses/biol22000/3AAprotein/images/F03-08C.GIF
• Protein secondary structure prediction
  ◦ Assign one of the three states to each amino acid.
  ◦ Useful for protein 3D structure prediction, function prediction, subcellular localization prediction, etc.
Existing PSS methods
• Template-based methods
  Ref: Rajkumar Bondugula, Dong Xu, 2006
• Sequence-profile-based methods
  Ref: http://bioinfo.se/kurser/swell/secstrpred.html
A dictionary-based approach: SymPred
• Treating proteomic data as a language
  ◦ A protein structure is encoded by its amino acid sequence.
  ◦ protein sequence → text
  ◦ protein structure → meaning
• Treating PSS prediction as a translation problem
  ◦ protein sequence → secondary structure state sequence
• A general approach for analyzing protein sequences
  ◦ It can be applied to PSL (protein subcellular localization) prediction, function prediction, remote homology detection, sequence alignment, etc.
Synonymous words in protein sequences
• Protein language remains a mystery
• Structure robustness
  ◦ Structures are more conserved than sequences
  ◦ Proteins with 40% or higher sequence identity are highly similar in structure
• A significant local pairwise alignment of two proteins implies two similar paragraphs.
  ◦ Define synonymous words in protein sequences
• Definition
  ◦ A synonymous word is an n-gram of a protein sequence that is aligned with an n-gram of another protein.
Synonymous words (cont.)
[Figure: example words EWQL ↔ HHHH ↔ DFDM]
Compilation of a Synonymous Dictionary
A protein sequence → PSI-BLAST → sequence alignments → synonymous words
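The step from sequence alignments to synonymous words can be sketched as below. This is a minimal illustration, not the SymPred implementation: the function name `extract_synonymous_words`, the word length `n=4`, and the choice to skip any window containing a gap are assumptions made for the example. The aligned strings stand in for one alignment reported by PSI-BLAST.

```python
def extract_synonymous_words(query_aln, hit_aln, hit_ss, n=4):
    """Collect (query word, aligned hit word, hit structure) triples.

    query_aln, hit_aln: aligned sequences of equal length, '-' marks a gap.
    hit_ss: secondary structure string (H/E/C) of the ungapped hit sequence.
    Windows containing a gap in either sequence are skipped (an assumption).
    """
    if len(query_aln) != len(hit_aln):
        raise ValueError("aligned sequences must have equal length")

    # Map every alignment column to the hit's structure state ('-' at gaps).
    ss_aln, pos = [], 0
    for residue in hit_aln:
        if residue == "-":
            ss_aln.append("-")
        else:
            ss_aln.append(hit_ss[pos])
            pos += 1

    words = []
    for i in range(len(query_aln) - n + 1):
        q_word = query_aln[i:i + n]
        h_word = hit_aln[i:i + n]
        if "-" in q_word or "-" in h_word:
            continue  # hypothetical rule: only ungapped windows become words
        words.append((q_word, h_word, "".join(ss_aln[i:i + n])))
    return words


# Toy alignment built around the EWQL / DFDM / HHHH example word pair.
for triple in extract_synonymous_words("EWQLA", "DFDMS", "HHHHC"):
    print(triple)
```

Each triple pairs a query word with the word it is aligned to and that word's known structure, which is the kind of information a dictionary entry would need to carry.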
An example of a synonymous word entry
Flexibility: PSL, functions, 3D structure, ...
Properties of synonymous words
• Protein dependency
  ◦ Synonymous words are generated from significant sequence alignments (context-sensitive).
    - Two similar protein words are not necessarily synonymous.
  ◦ The material for generating synonymous words depends on the query protein sequence.
• Sequence identity independence
  ◦ Protein A ↔ Protein B (SI = 50%)
  ◦ Protein B ↔ Protein C (SI = 40%)
  ◦ Protein A ↔ Protein C (SI = 20%)
[Figure labels: Similar proteins of A; Similar proteins of B; Similar proteins of C]
Translation model
Obtain the final structure through voting.
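A minimal sketch of per-residue voting follows. The slide states only that the final structure is obtained through voting, so everything else here is an assumption for illustration: the function name `predict_by_voting`, the dictionary layout (word mapped to a list of structure strings), unweighted votes, and coil as the fallback state for residues that receive no votes.

```python
from collections import Counter


def predict_by_voting(query, dictionary, n=4, default="C"):
    """Predict one state per residue by majority vote over matching words.

    dictionary: maps an n-gram to a list of structure strings of length n
    (a hypothetical layout chosen for this sketch).
    default: state used when no word covers a residue (an assumption).
    """
    votes = [Counter() for _ in query]
    for i in range(len(query) - n + 1):
        word = query[i:i + n]
        for structure in dictionary.get(word, []):
            # Every residue covered by the word receives one vote per state.
            for offset, state in enumerate(structure):
                votes[i + offset][state] += 1
    return "".join(v.most_common(1)[0][0] if v else default for v in votes)


toy_dictionary = {"EWQL": ["HHHH", "HHHC"], "WQLA": ["HHCC"]}
print(predict_by_voting("EWQLA", toy_dictionary))
```

Because each vote comes from a specific dictionary word, the words behind any predicted state can be listed, which fits the later claim that the prediction result is traceable.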
Datasets
• DSSP database
  ◦ A database of PSS assignments
  ◦ DsspNr-25
    - A non-redundant subset of DSSP
    - 8,297 protein chains
• EVA benchmark datasets
  ◦ A platform for analyzing PSS predictors
  ◦ EVA_Set1: 80 protein chains
  ◦ EVA_Set2: 212 protein chains
Translation Performance on DsspNr-25

DsspNr-25 (8,297 proteins)   Q3      Q3H    Q3E    Q3C    SOV
SymPred                      81.0    84.3   71.6   77.7   76.0
PROSP                        75.1    79.7   67.6   71.3   68.7
SymPred vs. PROSP            +5.9%                        +7.3%
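For readers unfamiliar with the column headers, the sketch below computes Q3 and the per-state scores under their usual definitions: Q3 is the percentage of residues whose state is predicted correctly, and Q3H, Q3E and Q3C are taken here as the percentage of observed helix, strand and coil residues predicted correctly. The slides do not spell out these definitions, so treat them as an assumption. The function name `q3_scores` is hypothetical, and SOV is not covered.

```python
def q3_scores(observed, predicted):
    """Return Q3 and per-state accuracies in percent.

    observed, predicted: equal-length, non-empty strings over H, E, C.
    A per-state score is None when that state never occurs in `observed`.
    """
    if len(observed) != len(predicted):
        raise ValueError("sequences must have equal length")

    pairs = list(zip(observed, predicted))
    scores = {"Q3": 100.0 * sum(o == p for o, p in pairs) / len(pairs)}
    for state in "HEC":
        total = sum(o == state for o, _ in pairs)
        correct = sum(o == state and p == state for o, p in pairs)
        scores["Q3" + state] = 100.0 * correct / total if total else None
    return scores


print(q3_scores("HHHHEECC", "HHHCEECC"))
```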
Two factors affect translation performance
• Word length
  ◦ A trade-off between specificity and sensitivity
    - Long words: increase specificity, lose sensitivity
    - Short words: lose specificity, increase sensitivity
  ◦ Exact matching vs. inexact matching
    - Exact matching: WGPV ↔ WGPV (exactly the same)
    - Inexact matching: WGPV ↔ WGPV, *GPV, W*PV, WG*V, WGP* (at most one mismatched character)
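The inexact matching rule can be illustrated with a short sketch. The helper names `inexact_patterns` and `matches` are hypothetical; the first enumerates the wildcard patterns listed on the slide, and the second checks the equivalent condition that two words of equal length differ in at most one position.

```python
def inexact_patterns(word, wildcard="*"):
    """Return the word itself plus every variant with one position masked."""
    patterns = [word]
    for i in range(len(word)):
        patterns.append(word[:i] + wildcard + word[i + 1:])
    return patterns


def matches(word, entry):
    """True if two equal-length words differ in at most one character."""
    if len(word) != len(entry):
        return False
    return sum(a != b for a, b in zip(word, entry)) <= 1


print(inexact_patterns("WGPV"))
print(matches("WGPV", "WAPV"), matches("WGPV", "WAPL"))
```

Masking one position lets a query word find more dictionary entries, at the cost of admitting matches that are less specific, which is the same trade-off described for word length.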
Two factors affect translation performance (cont.)
• Template pool size
SymPred has the potential to improve further as the number of proteins with known structures continues to increase.
Performance Comparison on EVA_Set1

EVA_Set1 (80 proteins)   Q3     ERRsig Q3   SOV    ERRsig SOV
SymPred                  78.8   ±1.4        76.4   ±1.9
SAM-T99sec               77.2   ±1.2        74.6   ±1.5
PSIPRED                  76.8   ±1.4        75.4   ±2.0
PROFsec                  75.5   ±1.4        74.9   ±1.9
PHDpsi                   73.4   ±1.4        69.5   ±1.9
Performance Comparison on EVA_Set2

EVA_Set2 (212 proteins)   Q3     ERRsig Q3   SOV    ERRsig SOV
SymPred                   79.2   ±0.9        76.0   ±1.2
PSIPRED                   77.8   ±0.8        75.4   ±1.1
PROFsec                   76.7   ±0.8        74.8   ±1.1
PHDpsi                    75.0   ±0.8        70.9   ±1.2
Confidence Level vs. Q3
[Figure: Q3 plotted against confidence level; PCC = 0.992]
Conclusions
• Local similarities in protein sequences exhibit conserved structures.
• As the number of protein sequences with known structures increases, SymPred can further improve prediction accuracy.
• The prediction result is traceable.
• Our dictionary-based approach is applicable to various protein-related problems.
• Synonymous words provide an alternative sequence analysis method.
Thank you!
Please visit our web server: http://bio-cluster.iis.sinica.edu.tw/~bioapp/SymPred/