
Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy


  1. Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy. Shixiang Wan, Tianjin University, China. Email: shixiangwan@tju.edu.cn

  2. Contents: 1. Introduction; 2. Methods and Materials; 3. Results and Discussion; 4. Conclusion

  3. Introduction

  4. TATA-binding Protein (TBP)
     • What is it?
       ➢ A general transcription factor that binds specifically to a DNA sequence called the TATA box
       ➢ Plays an important role in the initiation of transcription
       ➢ Especially important in DNA melting (double-strand separation)
     • Importance of TBP
       ➢ Studies have shown that TBP is involved in the molecular mechanism of neurodegenerative diseases. There are 37 known pathogenic TBP mutations. Diseases associated with these mutations include epilepsy, Parkinson's syndrome, personality disorder, and developmental malformations. TBP therefore has a key role in precision medicine and genetic testing.

  5. Increased interest in TBP (reference: https://www.ncbi.nlm.nih.gov/pubmed/)

  6. UniProt: a high-quality protein database
     • Website: http://www.uniprot.org/
     • [Screenshot of the UniProt website]

  7. Methods for TBP prediction
     • Traditional experimental methods have intrinsic limitations: they are time-consuming and expensive
     • There is an urgent demand for fast and accurate prediction
     • Computational prediction methods: input (large-scale sequence pool) → computers → output (predictions)

  8. Machine learning based methods
     • Framework of machine learning based prediction methods
     • Factors influencing performance: feature representation methods, classification algorithms, and datasets for model building

  9. Major problems and challenges
     • Imbalanced dataset
     • Redundancy (or high similarity)
     • Difficulty of finding the best dimension quickly
     • Low similarity between positive-class and negative-class samples

  10. Methods and Materials

  11. Framework of Pretata
      • Input: TBP sequences
      • Step 1. Data preparation: CD-HIT (reduce redundancy); negative samples collection (for imbalanced data)
      • Step 2. Feature representation: 188D, 473D, 611D
      • Step 3. Classifier: LibD3C, LibSVM, IBK, Random Forest, Bagging
      • Prediction results: TBP or not?
      • Other flowchart labels: "Three improvements", "Optimal Solution"

  12. Dataset construction
      • Positive dataset construction (true TBP)
        ➢ Increase the number of positive samples (about 559 were downloaded)
        ➢ Reduce the redundancy of the positive samples (sequence similarity < 90%) using CD-HIT
      • Negative dataset construction (non-TBP)
        ➢ Raw negative dataset (about 8,465 sequences)
        ➢ Purify the negative dataset
      • Purification flowchart labels: 8,465 negative sequences; extract 559 negative sequences; 559 positive sequences; take out LibSVM misclassifications; replenish the negative class
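
The purification labels on this slide (extract, take out misclassifications, replenish) suggest a draw-filter-refill loop. The following is a minimal sketch of one such loop, not the authors' procedure: the function name `purify_negatives` and the `is_misclassified` callback are hypothetical, and the callback only stands in for whatever LibSVM-based check the real pipeline uses.

```python
import random
from typing import Callable, List, Sequence


def purify_negatives(
    pool: Sequence[str],
    n_target: int,
    is_misclassified: Callable[[str], bool],
    seed: int = 0,
) -> List[str]:
    """Draw negatives, drop flagged ones, and refill until n_target are kept."""
    rng = random.Random(seed)
    remaining = list(pool)
    rng.shuffle(remaining)  # random extraction order
    kept: List[str] = []
    while remaining and len(kept) < n_target:
        needed = n_target - len(kept)
        # Extract just enough candidates to fill the gap.
        batch, remaining = remaining[:needed], remaining[needed:]
        # Take out the flagged samples; the next pass replenishes the shortfall.
        kept.extend(s for s in batch if not is_misclassified(s))
    return kept


# Toy data: 8,465 placeholder IDs, with every tenth one flagged (assumption).
pool = [f"neg{i}" for i in range(8465)]
flagged = lambda s: int(s[3:]) % 10 == 0
selected = purify_negatives(pool, 559, flagged)
```

In a real run the flag would come from a classifier trained on the current positive and negative sets, so the check would likely be recomputed on each round.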

  13. Novel features (611D)
      • 188D
        ➢ Features based on the composition and physicochemical properties of amino acids
        ➢ 20D: the proportions of the 20 kinds of amino acids in the sequence
        ➢ 21 × 8D: 21 kinds of statistical properties times 8 physicochemical properties
      • 473D
        ➢ Features from secondary structure
      • 188D + 473D = 661D (labeled "611D" on these slides)
        ➢ The composition, physicochemical, and secondary structure features are combined into high-dimensional feature vectors.
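
The 20D part of the 188D block is plain amino acid composition, which can be sketched in a few lines. The function name `amino_acid_composition` and the toy sequence are illustrative assumptions; the 21 × 8 physicochemical statistics and the secondary structure features are not reproduced here because the slide does not define them.

```python
from __future__ import annotations

from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Dimension bookkeeping from the slide: 20 composition + 21 x 8 statistics.
assert 20 + 21 * 8 == 188


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the proportion of each of the 20 standard amino acids."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No standard residues: return an all-zero vector instead of dividing by 0.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


composition = amino_acid_composition("MADEEELK")  # toy sequence, not a real TBP
```

Non-standard letters (for example `X`) are ignored in this sketch, so the proportions always sum to 1 over the standard residues that are present.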

  14. Dimensionality reduction strategy
      • [Figure: search tree over feature dimensions. Primary steps go from the initial dimension through dimension (1), dimension (2), dimension (3), ..., dimension (k-1), dimension (k). Secondary steps expand a primary dimension into dimension (k,1), dimension (k,2), ..., dimension (k,k). Labels: accuracy, the best dimension, tolerable dimension.]
      • Optimal dimension searching method: reduce the dimension of the features to find the best results, using a global multiple search followed by a local linear search.
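
One way to read "global multiple search and local linear search" is a coarse-to-fine scan: score a sparse grid of dimensions first, then scan every dimension around the best grid point. The sketch below follows that reading and is an assumption, not the authors' algorithm. The names `search_best_dimension` and `toy_score`, the step of 40, and the range 10 to 650 (taken from the axis of the search plot) are illustrative.

```python
from typing import Callable, Tuple


def search_best_dimension(
    score: Callable[[int], float],
    max_dim: int,
    coarse_step: int = 40,
    min_dim: int = 10,
) -> Tuple[int, float]:
    """Coarse grid search followed by a linear scan around the coarse winner."""
    # Global pass: evaluate a sparse grid of candidate dimensions.
    coarse = range(min_dim, max_dim + 1, coarse_step)
    best_coarse = max(coarse, key=score)
    # Local pass: scan every dimension within one coarse step of the winner.
    lo = max(min_dim, best_coarse - coarse_step)
    hi = min(max_dim, best_coarse + coarse_step)
    best_dim = max(range(lo, hi + 1), key=score)
    return best_dim, score(best_dim)


def toy_score(dim: int) -> float:
    # Stand-in for cross-validated accuracy at a given dimension;
    # it peaks at 324 purely for illustration.
    return 0.9292 - 1e-6 * (dim - 324) ** 2


best_dim, best_acc = search_best_dimension(toy_score, max_dim=650)
```

With a real classifier each call to `score` means a full train-and-evaluate cycle, so results would normally be cached. The local scan can also miss the optimum if the accuracy curve has several peaks farther apart than the coarse step.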

  15. Performance evaluation
      • Four commonly used metrics
        ➢ Sensitivity (SE), Specificity (SP), Accuracy (ACC), and Matthews Correlation Coefficient (MCC)
      • Formulation for the four metrics
      • Two metrics for comprehensive evaluation of a binary predictor
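
The slide's own formulas did not survive extraction. The sketch below uses the standard confusion-matrix definitions of the four metrics (TP, TN, FP, FN are the true/false positive/negative counts); the function name `binary_metrics` and the example counts are made up for illustration.

```python
import math
from typing import Dict


def _ratio(num: float, den: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard SE, SP, ACC, and MCC from confusion-matrix counts."""
    se = _ratio(tp, tp + fn)                    # sensitivity: TP / (TP + FN)
    sp = _ratio(tn, tn + fp)                    # specificity: TN / (TN + FP)
    acc = _ratio(tp + tn, tp + tn + fp + fn)    # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, denom)      # Matthews correlation coefficient
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}


metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)  # made-up counts
```

MCC ranges from -1 to 1 and stays informative when the classes are imbalanced, which is why it is often reported alongside ACC.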

  16. Results and Discussion

  17. Classifier based on autocorrelation comparison results
      • [Bar chart: accuracy (axis 70% to 92%) of LibD3C, LibSVM, IBK, RandomForest, and Bagging on the 188D, 473D, and 611D feature sets. Bar labels range from 76.92% to 90.46%.]
      • 611D performs far better than 188D and 473D.
      • LibSVM outperforms every other classifier on each feature extraction method.

  18. Prediction methods comparison results
      • [Bar chart: ACC, SN (sensitivity), and SP (axis 0% to 100%) for BLASTP, PSI-BLASTP, 611D, and Pretata.]
      • Best: ACC 92.92%, SN 95.50%, SP 87.30%
      • Pretata vs. a group of traditional prediction methods

  19. Pretata searching process
      • [Figure: two line plots of SN, SP, and ACC against feature dimension. The first covers dimensions 10 to 650 (accuracy axis 60% to 100%); the second zooms in on dimensions 230 to 330 (accuracy axis 75% to 99%) and marks the best point.]
      • Best dimension = 324D, with ACC 92.92%, SN 95.50%, SP 87.30%

  20. Conclusion

  21. Conclusions
      • Proposed a highly representative method for imbalanced datasets (negative-class purification)
      • Proposed 611D high-dimensional feature vectors that include composition, physicochemical, and secondary structure features (611-dimension feature model)
      • Proposed a novel and promising TBP prediction method, Pretata (Pretata learning model)
      • Available web server: Pretata Server, at http://server.malab.cn/preTata/

