Analysis of the Signal Peptide dataset
November 28, 2019

Signal Peptide
- A short peptide (typically 15-30 residues long) that targets a protein to the secretory pathway
- Cleaved during translocation across the membrane
- Exists in all 3 kingdoms of life

Our dataset
● FASTA is a text-based format for representing nucleotide or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes.
● A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
● The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.
● It is recommended that all lines of text be shorter than 80 characters.

Our dataset
● The FASTA file contains, for each protein (in order):
  ● Header (e.g. ">Q8TF40|EUKARYA|NO_SP|0")
  ● Protein sequence (first 70 residues only)
  ● Residue annotation

Our dataset
The header contains information about:
● The protein ID (e.g. "Q8TF40")
● The kingdom of life of the organism that contains the protein (e.g. "EUKARYA")
● The type of signal peptide the protein contains (e.g. "NO_SP")
● The dataset split the protein belongs to (e.g. "0")
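The three-line record layout and the pipe-separated header described above could be read along these lines. This is only a sketch: it assumes that each sequence and each annotation sits on a single line and that the annotation has one label per residue. The names `Record` and `parse_records` are illustrative, and the sequence and annotation in the example are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Record:
    protein_id: str
    kingdom: str
    sp_type: str
    split: int
    sequence: str
    annotation: str


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one Record per (header, sequence, annotation) triple."""
    # Drop blank lines and surrounding whitespace.
    cleaned: List[str] = [ln.strip() for ln in lines if ln.strip()]
    if len(cleaned) % 3 != 0:
        raise ValueError("expected header, sequence and annotation lines per protein")
    for i in range(0, len(cleaned), 3):
        header, sequence, annotation = cleaned[i:i + 3]
        if not header.startswith(">"):
            raise ValueError(f"not a header line: {header!r}")
        # Header fields: protein ID | kingdom | signal peptide type | split.
        protein_id, kingdom, sp_type, split = header[1:].split("|")
        # Assumption: one annotation label per residue.
        if len(sequence) != len(annotation):
            raise ValueError(f"{protein_id}: annotation length differs from sequence")
        yield Record(protein_id, kingdom, sp_type, int(split), sequence, annotation)


# Header taken from the slide; the 10-residue sequence and its annotation
# are made up for illustration (real entries hold up to 70 residues).
example = """>Q8TF40|EUKARYA|NO_SP|0
MAPTLFQKLF
IIIIIIIIII
"""
records = list(parse_records(example.splitlines()))
print(records[0].protein_id, records[0].kingdom, records[0].sp_type, records[0].split)
```

Keeping the split as an integer makes it easy to group proteins by fold later on.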

Our dataset
● 20,758 proteins
● 4 types of signal peptides
● 6 residue types
● 20% sequence similarity

Our dataset
● 5 splits for cross-validation with similar residue distribution
● Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
● The procedure has a single parameter, k, the number of groups the data sample is split into; hence the name k-fold cross-validation.
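Because the split is stored in each header, the folds do not need to be drawn at random. A minimal sketch of how the predefined splits might drive a 5-fold cross-validation loop is shown below. It assumes the split ids are the integers 0 to 4; the function name `folds_from_splits` and the example ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def folds_from_splits(splits: Sequence[int]) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (held_out_split, train_indices, test_indices) for each split id."""
    by_split: Dict[int, List[int]] = defaultdict(list)
    for index, split in enumerate(splits):
        by_split[split].append(index)
    for held_out in sorted(by_split):
        # The held-out split is the test set; all other splits form the training set.
        test = by_split[held_out]
        train = [i for s in sorted(by_split) if s != held_out for i in by_split[s]]
        yield held_out, train, test


# Hypothetical split ids for ten proteins (assumed to range over 0-4).
splits = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
folds = list(folds_from_splits(splits))
for held_out, train, test in folds:
    print(f"fold {held_out}: train on {len(train)} proteins, test on {len(test)}")
```

Each protein is used for testing exactly once, and training and test indices never overlap within a fold.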

Class distributions
● Strong dataset imbalance: most proteins don’t contain Signal Peptides
● SP: Signal Peptide
● LIPO: Lipoprotein Signal Peptide
● TAT: Tat Signal Peptide
● NO_SP: No Signal Peptide

Residue annotations
● S: Sec/SPI signal peptide
● T: Tat/SPI signal peptide
● L: Sec/SPII signal peptide
● I: Cytoplasm
● M: Transmembrane
● O: Extracellular
[Chart labels on the slide: 91.25% and 8.75%]

Prediction Baseline

Dealing with class imbalance
● Undersampling (majority classes)
● Oversampling (minority classes)
● Class weights
● SMOTE (synthetic samples)
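Two of the options above, class weights and random oversampling, can be sketched in a few lines. The weight formula used here, n_samples / (n_classes * class_count), is the common "balanced" heuristic and is not necessarily the one used for this dataset. The label counts are made up and the function names are illustrative.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def random_oversample(labels: Sequence[str], seed: int = 0) -> List[int]:
    """Return sample indices in which every class appears as often as the largest one."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(indices) for indices in by_class.values())
    chosen: List[int] = []
    for indices in by_class.values():
        chosen.extend(indices)
        # Draw extra copies (with replacement) until the class reaches the target size.
        chosen.extend(rng.choices(indices, k=target - len(indices)))
    return chosen


# Made-up label counts, chosen only to mimic an imbalanced dataset.
labels = ["NO_SP"] * 80 + ["SP"] * 14 + ["LIPO"] * 4 + ["TAT"] * 2
weights = balanced_class_weights(labels)
resampled = random_oversample(labels)
print(weights)
print(Counter(labels[i] for i in resampled))
```

Class weights leave the data untouched and rescale the loss, whereas oversampling duplicates minority samples. When oversampling is combined with cross-validation, it should be applied to the training folds only.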

ELMo Embeddings
● ELMo: Embeddings from Language Models
● Used in Natural Language Processing
● In our case, embeddings represent the context of each residue
● Either 64 or 1024 dimensions per residue

Learning from high-dimensional data
● Reduce the dimensions
● t-SNE
● Techniques for dimensionality reduction that preserve the relative similarity between objects -> visualization of high-dimensional datasets

PCA vs t-SNE

Results of t-SNE for the 64-dimensional embeddings

Results of t-SNE for the 64-dimensional embeddings for L signal peptides

Results of t-SNE for the 64-dimensional embeddings for S signal peptides

Results of t-SNE for the 64-dimensional embeddings for T signal peptides

Notes
● Results are based on perplexity = 30
● The plots do not reveal much information
● The 1024-dimensional embeddings may be more helpful
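For readers unfamiliar with the perplexity setting: in t-SNE, each point gets a Gaussian neighbour distribution, and the bandwidth sigma is tuned per point until the perplexity 2^H of that distribution matches the chosen value, so perplexity acts roughly as an effective number of neighbours. The sketch below only illustrates that relationship. The squared distances are hypothetical and the function names are not taken from any library.

```python
import math
from typing import List, Sequence


def conditional_probabilities(sq_distances: Sequence[float], sigma: float) -> List[float]:
    """Gaussian similarities p_{j|i} from squared distances of point i to its neighbours."""
    # Subtract the smallest distance before exponentiating for numerical stability.
    d_min = min(sq_distances)
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probabilities: Sequence[float]) -> float:
    """Perplexity 2**H(P), with H the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0.0)
    return 2.0 ** entropy


# Hypothetical squared distances from one residue embedding to five others.
sq_distances = [0.5, 1.0, 1.5, 4.0, 9.0]
for sigma in (0.5, 1.0, 5.0):
    p = conditional_probabilities(sq_distances, sigma)
    print(f"sigma={sigma}: perplexity={perplexity(p):.2f}")
```

A larger sigma spreads the probability mass over more neighbours and raises the perplexity, up to the number of neighbours available.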

References
● https://zhanglab.ccmb.med.umich.edu/FASTA/
● https://machinelearningmastery.com/k-fold-cross-validation/
● https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

Thank you very much!
