Supervised Convolutional GSN for Protein Secondary Structure Prediction
Jian Zhou, Olga Troyanskaya
Princeton University
What's in this talk
• Problem: predict protein secondary structure
• Iterative prediction with a multi-layer hierarchical representation
  – Supervised GSN
  – Convolutional architecture for GSN
  – A trick for improving convergence and performance
• Performance evaluations
Protein secondary structure prediction
Protein sequence (20 types of amino acids), e.g.:
MDLSALRVEEVQNVINAMQKILECPICLELIKEPVSTKCDHIFCKFCMLKLLNQKKGPSQCPLCKNDITKRSLQESTRFSQLVEELLKIICAFQLDTGLEYANSYNFAKKGK
Predict secondary structure (8 classes), e.g.:
CCGGGSSHHHHHHHHHHHHHHTSCSSSCCCCSSCCBCTTSCCCCSHHHHHHHHSSSSSCCCTTTSCCCCTTTCBCCCSSSHHHHHHHHHHHHHHHHTCCCCCC
Previous approaches: neural networks since 1988 (Qian & Sejnowski); bidirectional recurrent neural networks (Baldi et al., 1999); conditional neural fields (Peng et al., 2009); many more.
(3D structure image credit: Wikimedia Commons)
Protein sequence -> secondary structure
• Input: protein sequence (20 types of amino acids) plus evolutionary neighborhood
• Output: secondary structure label sequence (8 classes), which constrains the 3D structure
Motivation
• Challenge: prediction with both local and long-range dependencies
• Plan:
  – Multi-layer hierarchical representation
  – Both 'upward' and 'downward' connections
  – Supervised GSN formulation
Model: Generative Stochastic Network (GSN)
Bengio, Y., Thibodeau-Laufer, É., Alain, G., and Yosinski, J. Deep Generative Stochastic Networks Trainable by Backprop.
• Learns the transition operators of a Markov chain whose stationary distribution estimates the data distribution P(X):
    H_{t+1} ~ P_{θ1}(H | H_t, X_t)
    X_{t+1} ~ P_{θ2}(X | H_{t+1})
• Learning P(X | H) can be much easier than learning P(X), by design.
• Trainable using back-propagation.
Model: from GSN to supervised GSN
GSN — models P(X):
    H_{t+1} ~ P_{θ1}(H | H_t, X_t)
    X_{t+1} ~ P_{θ2}(X | H_{t+1})
Supervised GSN — models P(Y | X):
    H_{t+1} ~ P_{θ1}(H | H_t, Y_t, X_0)
    Y_{t+1} ~ P_{θ2}(Y | H_{t+1})
Learning P(Y | H) can be much easier than learning P(Y | X) directly, by utilizing the previous state of the chain.
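The supervised GSN chain above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's exact operators: the single-layer weight matrices (W_xh, W_yh, W_hh, W_hy), the Gaussian pre-activation noise, and the zero initialization of H and Y are all assumptions made for the sketch. The key structural points it shows are that the input X_0 stays clamped at every step while H and Y evolve, and that each step emits a label distribution Y_{t+1}.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def supervised_gsn_chain(x0, W_xh, W_yh, W_hh, W_hy, n_steps=8, noise=0.1):
    """Sketch of the supervised GSN Markov chain:
        H_{t+1} ~ P_theta1(H | H_t, Y_t, X_0)
        Y_{t+1} ~ P_theta2(Y | H_{t+1})
    Returns the list of per-step label distributions [Y_1, ..., Y_T]."""
    h = np.zeros(W_hh.shape[0])   # hidden state H_0 (assumed zero init)
    y = np.zeros(W_hy.shape[1])   # label estimate Y_0 (assumed zero init)
    ys = []
    for _ in range(n_steps):
        # Upward step: H depends on the previous H, the previous label
        # estimate Y, and the clamped input X_0; injected noise makes
        # the transition stochastic.
        h = np.tanh(x0 @ W_xh + y @ W_yh + h @ W_hh
                    + noise * rng.standard_normal(h.shape))
        # Downward step: label distribution from the new hidden state.
        y = softmax(h @ W_hy)
        ys.append(y)
    return ys
```

Each call to the chain yields a sequence of progressively refined label distributions, which is what makes iterative prediction (and later averaging over iterations) possible.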
Model: training the supervised GSN
Chain:
    H_{t+1} ~ P_{θ1}(H | H_t, Y_t, X_0)
    Y_{t+1} ~ P_{θ2}(Y | H_{t+1})
Objective: maximize the log-likelihood of the true labels under each step's reconstruction distribution, i.e. push P_θ(Y | H_1), P_θ(Y | H_2), ... toward the true P(Y | X_0).
Model: architecture for protein secondary structure prediction
• Multi-scale representation: multi-layer convolutional architecture (convolution + tanh, with mean pooling between layers)
• Local-information sensitive: output units Y attached at the bottom layer
[Diagram: chain states unrolled over time; W1/W1' connect Y and the bottom layer H_0, W2/W2' connect H_0 and the pooled upper layer H_1.]
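The building block of the convolutional layers above — a 1-D convolution along the sequence with a tanh nonlinearity, followed by mean pooling — can be sketched as follows. Kernel width, channel counts, and "valid" (no-padding) convolution are illustrative assumptions; the point is how pooling between layers produces the multi-scale representation.

```python
import numpy as np

def conv1d_tanh(x, W, b):
    """Valid 1-D convolution along the sequence, then tanh.
    x: (seq_len, in_ch); W: (kernel, in_ch, out_ch); b: (out_ch,)."""
    k = W.shape[0]
    out = np.stack([np.tensordot(x[i:i + k], W, axes=([0, 1], [0, 1]))
                    for i in range(x.shape[0] - k + 1)])
    return np.tanh(out + b)

def mean_pool(x, size):
    """Non-overlapping mean pooling along the sequence axis,
    coarsening the representation for the next layer."""
    n = (x.shape[0] // size) * size
    return x[:n].reshape(-1, size, x.shape[1]).mean(axis=1)
```

Stacking conv1d_tanh and mean_pool gives upper layers a wider effective context per position, while the output units at the bottom layer keep position-level resolution for the per-residue labels.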
Training: initialization of the chain
Experiment: initialize Y_0 at the specified test-time initialization value for a subset of training batches (0%, 20%, 50%, 80%, 100%).
Result: optimal performance at 50% test initialization.
[Figure: accuracy vs. number of iterations, with the chain initialized at Y_0 = true labels and at Y_0 = test initialization.]
Performance
Dataset: CullPDB (6133 proteins, <30% identity between any protein pair); available at www.princeton.edu/~jzthree/datasets
Performance through averaging iterative predictions.
[Figure: single-protein prediction example — predictions averaged over the last 32, 16, 8, 4, 2, 1 iterations (Y_32 ... Y_1) shown against the true labels.]

CullPDB test set, overall accuracy (8-class):
    1 layer   0.714 ± 0.006
    2 layers  0.720 ± 0.006
    3 layers  0.721 ± 0.006

CB513 dataset, overall accuracy (8-class):
    RaptorSS8/CNF  0.649 ± 0.003
    Our method     0.664 ± 0.005
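The averaging used in the figure (Y_32, Y_16, ..., Y_1) is simply a mean of the label distributions from the last k chain iterations, followed by an argmax. A minimal sketch:

```python
import numpy as np

def averaged_prediction(ys, k):
    """Average the per-iteration label distributions from the last k
    steps of the chain, then take the most probable class.
    ys: list of class-probability arrays, one per chain iteration."""
    mean_y = np.mean(ys[-k:], axis=0)
    return mean_y.argmax(axis=-1)
```

Averaging over more iterations smooths out the stochasticity of individual chain steps, which is why Y_32 is cleaner than Y_1 in the example.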
Summary
• We developed a supervised convolutional GSN model for protein secondary structure prediction.
• Supervised GSN
  – Stochastic iterative prediction through a Markov chain
  – The initialization trick empirically improves both performance and convergence rate
• Convolutional architecture for supervised GSN
  – Combines high-level representation with local prediction
  – Improved over the previous best performance
Filters (layer 1, connecting X, Y ↔ H_0; plotted as channel vs. position):
• W_{X→H_0} (amino acids)
• W_{Y→H_0} (secondary structure)
• W_{H_0→Y} (secondary structure)