Supervised Convolutional GSN for Protein Secondary Structure Prediction
Jian Zhou, Olga Troyanskaya
Princeton University
What's in this talk
• Problem: predict protein secondary structure
• Iterative prediction with a multi-layer hierarchical representation
  – Supervised GSN
  – Convolutional architecture for GSN
  – A trick for improving convergence and performance
• Performance evaluations
Protein secondary structure prediction
Protein sequence (20 types of amino acids), e.g.:
MDLSALRVEEVQNVINAMQKILECPICLELIKEPVSTKCDHIFCKFCMLKLLNQKKGPSQCPLCKNDITKRSLQESTRFSQLVEELLKIICAFQLDTGLEYANSYNFAKKGK
Predict secondary structure (8 classes), e.g.:
CCGGGSSHHHHHHHHHHHHHHTSCSSSCCCCSSCCBCTTSCCCCSHHHHHHHHSSSSSCCCTTTSCCCCTTTCBCCCSSSHHHHHHHHHHHHHHHHTCCCCCC
Previous approaches: neural networks since 1988 (Qian & Sejnowski); bidirectional recurrent neural networks (Baldi et al., 1999); conditional neural fields (Peng et al., 2009); many more.
(3D structure image credit: Wikimedia Commons)
Protein sequence -> secondary structure
• Input: protein sequence (20 types of amino acids) plus evolutionary neighborhood
• Output: secondary structure label sequence (8 classes), which constrains the 3D structure
Motivation
• Challenge: prediction with both local and long-range dependencies
• Plan:
  – Multi-layer hierarchical representation
  – Both 'upward' and 'downward' connections
  – Supervised GSN formulation
Model: Generative Stochastic Network (GSN)
Bengio, Y., Thibodeau-Laufer, É., Alain, G., and Yosinski, J. Deep Generative Stochastic Networks Trainable by Backprop.
• Learns the transition operators of a Markov chain whose stationary distribution estimates the data distribution P(X):
    H_{t+1} ~ P_{θ1}(H | H_t, X_t)
    X_{t+1} ~ P_{θ2}(X | H_{t+1})
• Learning P(X | H) can be much easier than learning P(X), by design.
• Trainable using back-propagation.
Model: from GSN to supervised GSN
GSN — models P(X):
    H_{t+1} ~ P_{θ1}(H | H_t, X_t)
    X_{t+1} ~ P_{θ2}(X | H_{t+1})
Supervised GSN — models P(Y | X):
    H_{t+1} ~ P_{θ1}(H | H_t, Y_t, X_0)
    Y_{t+1} ~ P_{θ2}(Y | H_{t+1})
Learning P(Y | H) can be much easier than learning P(Y | X) directly, by utilizing the previous state of the chain.
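The supervised GSN chain above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's exact operators: the single-layer weight matrices (W_xh, W_yh, W_hh, W_hy), the Gaussian pre-activation noise, and the zero initialization of H and Y are all assumptions made for the sketch. The key structural points it shows are that the input X_0 stays clamped at every step while H and Y evolve, and that each step emits a label distribution Y_{t+1}.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def supervised_gsn_chain(x0, W_xh, W_yh, W_hh, W_hy, n_steps=8, noise=0.1):
    """Sketch of the supervised GSN Markov chain:
        H_{t+1} ~ P_theta1(H | H_t, Y_t, X_0)
        Y_{t+1} ~ P_theta2(Y | H_{t+1})
    Returns the list of per-step label distributions [Y_1, ..., Y_T]."""
    h = np.zeros(W_hh.shape[0])   # hidden state H_0 (assumed zero init)
    y = np.zeros(W_hy.shape[1])   # label estimate Y_0 (assumed zero init)
    ys = []
    for _ in range(n_steps):
        # Upward step: H depends on the previous H, the previous label
        # estimate Y, and the clamped input X_0; injected noise makes
        # the transition stochastic.
        h = np.tanh(x0 @ W_xh + y @ W_yh + h @ W_hh
                    + noise * rng.standard_normal(h.shape))
        # Downward step: label distribution from the new hidden state.
        y = softmax(h @ W_hy)
        ys.append(y)
    return ys
```

Each call to the chain yields a sequence of progressively refined label distributions, which is what makes iterative prediction (and later averaging over iterations) possible.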
Model: training the supervised GSN
Chain:
    H_{t+1} ~ P_{θ1}(H | H_t, Y_t, X_0)
    Y_{t+1} ~ P_{θ2}(Y | H_{t+1})
Objective: maximize the log-likelihood of the true labels under each step's reconstruction distribution, i.e. push P_θ(Y | H_1), P_θ(Y | H_2), ... toward the true P(Y | X_0).
Model: architecture for protein secondary structure prediction
• Multi-scale representation: multi-layer convolutional architecture (convolution + tanh, with mean pooling between layers)
• Local-information sensitive: output units Y attached at the bottom layer
[Diagram: chain states unrolled over time; W1/W1' connect Y and the bottom layer H_0, W2/W2' connect H_0 and the pooled upper layer H_1.]
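The building block of the convolutional layers above — a 1-D convolution along the sequence with a tanh nonlinearity, followed by mean pooling — can be sketched as follows. Kernel width, channel counts, and "valid" (no-padding) convolution are illustrative assumptions; the point is how pooling between layers produces the multi-scale representation.

```python
import numpy as np

def conv1d_tanh(x, W, b):
    """Valid 1-D convolution along the sequence, then tanh.
    x: (seq_len, in_ch); W: (kernel, in_ch, out_ch); b: (out_ch,)."""
    k = W.shape[0]
    out = np.stack([np.tensordot(x[i:i + k], W, axes=([0, 1], [0, 1]))
                    for i in range(x.shape[0] - k + 1)])
    return np.tanh(out + b)

def mean_pool(x, size):
    """Non-overlapping mean pooling along the sequence axis,
    coarsening the representation for the next layer."""
    n = (x.shape[0] // size) * size
    return x[:n].reshape(-1, size, x.shape[1]).mean(axis=1)
```

Stacking conv1d_tanh and mean_pool gives upper layers a wider effective context per position, while the output units at the bottom layer keep position-level resolution for the per-residue labels.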
Training: initialization of the chain
Experiment: initialize Y_0 at the specified test-time initialization value for a subset of training batches (0%, 20%, 50%, 80%, 100%).
Result: optimal performance at 50% test initialization.
[Figure: accuracy vs. number of iterations, with the chain initialized at Y_0 = true labels and at Y_0 = test initialization.]
Performance
Dataset: CullPDB (6133 proteins, <30% identity between any protein pair); available at www.princeton.edu/~jzthree/datasets
Performance through averaging iterative predictions.
[Figure: single-protein prediction example — predictions averaged over the last 32, 16, 8, 4, 2, 1 iterations (Y_32 ... Y_1) shown against the true labels.]

CullPDB test set, overall accuracy (8-class):
    1 layer   0.714 ± 0.006
    2 layers  0.720 ± 0.006
    3 layers  0.721 ± 0.006

CB513 dataset, overall accuracy (8-class):
    RaptorSS8/CNF  0.649 ± 0.003
    Our method     0.664 ± 0.005
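The averaging used in the figure (Y_32, Y_16, ..., Y_1) is simply a mean of the label distributions from the last k chain iterations, followed by an argmax. A minimal sketch:

```python
import numpy as np

def averaged_prediction(ys, k):
    """Average the per-iteration label distributions from the last k
    steps of the chain, then take the most probable class.
    ys: list of class-probability arrays, one per chain iteration."""
    mean_y = np.mean(ys[-k:], axis=0)
    return mean_y.argmax(axis=-1)
```

Averaging over more iterations smooths out the stochasticity of individual chain steps, which is why Y_32 is cleaner than Y_1 in the example.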
Summary
• We developed a supervised convolutional GSN model for protein secondary structure prediction.
• Supervised GSN
  – Stochastic iterative prediction through a Markov chain
  – The initialization trick empirically improves both performance and convergence rate
• Convolutional architecture for supervised GSN
  – Combines high-level representation with local prediction
  – Improved over the previous best performance
Filters (layer 1, connecting X, Y ↔ H_0; plotted as channel vs. position):
• W_{X→H_0} (amino acids)
• W_{Y→H_0} (secondary structure)
• W_{H_0→Y} (secondary structure)