PREDICTION OF HETERODIMERIC PROTEIN COMPLEXES FROM PROTEIN-PROTEIN INTERACTION NETWORKS USING DEEP LEARNING Peiying (Colleen) Ruan, PhD, Deep Learning Solution Architect 3/26/2018
OUTLINE
▪ Background
▪ Method
▪ Computational Experiments and Results
▪ Conclusions
BACKGROUND
BACKGROUND
Our work
[Figure: DNA → (transcription) → mRNA → (translation) → protein → (forming) → complexes → (performing functions) → biological system (cell, human): keeping healthy or disease. Protein-protein interactions.]
BACKGROUND
What is a heterodimer, and why predict it?
A heterodimer is a protein complex formed by two different proteins. Heterodimers account for about 40% of protein complexes.
BACKGROUND
[Figure: structure, domain composition, and interaction of proteins P_1 and P_2 with domains D_1, D_2, D_3. P_i: protein, D_i: domain]
BACKGROUND
[Figure: structure, domain composition, interaction, and weighted PPI network for proteins P_1 and P_2 with domains D_1, D_2, D_3; the edge between P_1 and P_2 carries weight w_12. P_i: protein, D_i: domain]
METHOD
OVERVIEW OF THE PROBLEM
Input: weighted PPI network
Question: is the interacting pair (P_i, P_j) a heterodimer?
[Figure: input data, a protein pair P_i, P_j in the network]
MULTIPLE INFORMATION + MULTIPLE DL MODELS
▪ Input data involving biological information
Protein-protein interaction (PPI)
Domain
Phylogenetic profile
▪ Deep neural network models including
Convolutional neural network (CNN)
Recurrent neural network (RNN)
CNN + RNN
PROTEIN-PROTEIN INTERACTION (PPI)
Table 1. Feature space mapping from two interacting proteins P_i, P_j and neighbors.
▪ The weight of the interaction between the focused proteins (w_ij).
▪ The maximum weight of interactions between either of the focused proteins and a neighboring protein.
▪ The minimum weight of interactions between either of the focused proteins and a neighboring protein.
▪ The maximum smaller weight of interactions with neighboring proteins.
▪ The maximum difference of weights among the neighboring weights.
▪ …
Figure 1. Example of a subgraph with an interacting protein pair and their neighboring proteins. [Figure: P_i and P_j joined by an edge of weight w_ij; a neighboring protein P_k joined to them by edges of weights w_ik and w_jk; domains D_m, D_n, D_r]
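The Table 1 features can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the talk: the adjacency-dict layout, the function names `build_adj` and `ppi_features`, the example weights, and the default of 0.0 when a pair has no neighbors are all hypothetical. In particular, "maximum smaller weight" is read here as the largest min(w_ik, w_jk) over shared neighbors P_k, and "maximum difference" as the largest |w_ik - w_jk| over shared neighbors; the slide does not spell out either definition.

```python
def build_adj(edges):
    """Build a symmetric adjacency dict from (a, b, weight) triples."""
    adj = {}
    for a, b, w in edges:
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    return adj


def ppi_features(adj, i, j):
    """Return five illustrative features for the interacting pair (i, j)."""
    w_ij = adj[i][j]
    # Edges from each focused protein to its neighbors, excluding the pair itself.
    nbr_i = {k: w for k, w in adj[i].items() if k != j}
    nbr_j = {k: w for k, w in adj[j].items() if k != i}
    all_w = list(nbr_i.values()) + list(nbr_j.values())
    common = set(nbr_i) & set(nbr_j)
    max_w = max(all_w, default=0.0)
    min_w = min(all_w, default=0.0)
    # Assumed reading: largest min(w_ik, w_jk) over shared neighbors k.
    max_smaller = max((min(nbr_i[k], nbr_j[k]) for k in common), default=0.0)
    # Assumed reading: largest |w_ik - w_jk| over shared neighbors k.
    max_diff = max((abs(nbr_i[k] - nbr_j[k]) for k in common), default=0.0)
    return [w_ij, max_w, min_w, max_smaller, max_diff]


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
adj = build_adj([
    ("Pi", "Pj", 0.75),
    ("Pi", "Pk", 0.5),
    ("Pj", "Pk", 0.25),
    ("Pi", "Pm", 0.625),
    ("Pj", "Pm", 0.125),
])
features = ppi_features(adj, "Pi", "Pj")
print(features)
```

Restricting the last two features to shared neighbors is a design choice of this sketch; other readings of the slide text are possible.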
DOMAIN
Sample: domain pairs of protein complex C_j
[Figure: a complex of proteins P_1 and P_2 carrying domains D_3, D_8, D_9 and D_3, D_3, D_10]
(D_3, D_3), (D_3, D_3), (D_3, D_10), (D_8, D_3), (D_8, D_3), (D_8, D_10), (D_9, D_3), (D_9, D_3), (D_9, D_10)
The whole domain pair set for all complexes in the dataset (number of domain pairs: 5295):
{(D_1, D_1), (D_1, D_2), …, (D_3, D_3), …, (D_9, D_10), …, (D_n, D_n)}
Feature vector, one count per domain pair:
[C_j] = [0, 0, …, 2, …, 1, …, 0]
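A minimal sketch of the domain-pair encoding follows. The two domain lists are read off the sample pair list on the slide; the helper names, the tiny three-entry vocabulary (standing in for the 5295 pairs), and the decision to treat pairs as unordered are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def domain_pair_counts(domains_a, domains_b):
    """Count every cross-protein domain pair of a two-protein complex.

    Assumption: pairs are unordered, so (D8, D3) and (D3, D8) are the same key.
    """
    return Counter(tuple(sorted(pair)) for pair in product(domains_a, domains_b))


def to_vector(counts, vocabulary):
    """Map pair counts onto a fixed, ordered list of domain pairs."""
    return [counts.get(pair, 0) for pair in vocabulary]


# Domain lists taken from the sample pair list on the slide.
counts = domain_pair_counts(["D3", "D8", "D9"], ["D3", "D3", "D10"])

# Hypothetical miniature vocabulary; the real one has 5295 entries.
vocabulary = [
    tuple(sorted(("D1", "D1"))),
    tuple(sorted(("D3", "D3"))),
    tuple(sorted(("D9", "D10"))),
]
vector = to_vector(counts, vocabulary)
print(vector)
```

Building the vocabulary with the same `sorted` canonicalization as the counts keeps the keys consistent regardless of how the domain names happen to sort.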
PHYLOGENETIC PROFILE
Organisms: S. cerevisiae (SC), B. subtilis (BS), E. coli (EC)
       SC   BS   EC
P_1    1    0    1
P_2    0    1    1
P_3    0    1    0
P_4    1    1    0
[Figure: proteins P_1 to P_4 and their presence across the organisms SC, BS and EC]
The whole organism set for all complexes in the dataset (number of organisms: 2717):
{SC, BS, EC, …}
Q(a, b) = min(a, b)
[C_j = Q(P_1, P_2)] = [0, 0, 1, …]
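The pairwise profile Q(a, b) = min(a, b) can be checked directly against the table on this slide. The sketch below uses only the three organisms shown (SC, BS, EC); the function name `pair_profile` and the dict layout are illustrative choices, not taken from the talk.

```python
def pair_profile(profile_a, profile_b):
    """Element-wise minimum: 1 only where both proteins occur in the organism."""
    return [min(a, b) for a, b in zip(profile_a, profile_b)]


# Presence (1) or absence (0) in SC, BS, EC, copied from the slide.
profiles = {
    "P1": [1, 0, 1],
    "P2": [0, 1, 1],
    "P3": [0, 1, 0],
    "P4": [1, 1, 0],
}

c_j = pair_profile(profiles["P1"], profiles["P2"])
print(c_j)  # [0, 0, 1], matching the first three entries shown on the slide
```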
COMPUTATIONAL EXPERIMENTS
▪ Databases
CYC2008: a manually curated, comprehensive catalogue of yeast protein complexes, including 172 (42%) heterodimers.
WI-PHI: a weighted PPI database containing 49607 interacting protein pairs, excluding self-interactions.
▪ Positives and Negatives
Positives: (P_1, P_2)
Negatives: (P_1, P_3), (P_2, P_4), (P_3, P_4) and (P_1, P_4)
Number of samples: 5497
[Figure: proteins P_1 to P_4 grouped into complexes C_1 and C_2]
INPUT DATA
e.g., domain property
The whole domain pair set for all complexes in the dataset:
{(D_1, D_1), (D_1, D_2), …, (D_3, D_3), …, (D_9, D_10), …, (D_n, D_n)}
Input data                               Label
[C_1]    = [0, 0, …, 2, …, 1, …, 0]      0
[C_2]    = [0, 1, …, 0, …, 0, …, 1]      1
…                                        …
[C_5497] = [0, 0, …, 2, …, 1, …, 0]      0
INPUT DATA
e.g., domain + phylogenetic profile
The whole (domain pair set + organism set) for all complexes in the dataset (5295 + 2717 features):
{(D_1, D_1), (D_1, D_2), …, (D_n, D_n), SC, BS, EC, …}
Input data                               Label
[C_1]    = [0, 0, …, 0, 0, 0, 1, …]      0
[C_2]    = [0, 1, …, 1, 1, 0, 0, …]      1
…                                        …
[C_5497] = [0, 0, …, 0, 0, 1, 1, …]      0
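Combining the two feature blocks amounts to concatenating each sample's domain-pair vector with its phylogenetic profile, giving 5295 + 2717 = 8012 features per sample. The sketch below only illustrates that bookkeeping; the function name `combine_features` and the all-zero placeholder vectors are hypothetical.

```python
N_DOMAIN_PAIRS = 5295  # size of the domain pair set, from the slide
N_ORGANISMS = 2717     # size of the organism set, from the slide


def combine_features(domain_vec, phylo_vec):
    """Concatenate domain-pair counts and the phylogenetic profile, in that order."""
    if len(domain_vec) != N_DOMAIN_PAIRS:
        raise ValueError("unexpected domain vector length")
    if len(phylo_vec) != N_ORGANISMS:
        raise ValueError("unexpected phylogenetic profile length")
    return list(domain_vec) + list(phylo_vec)


# Placeholder vectors, only to show the resulting dimensionality.
sample = combine_features([0] * N_DOMAIN_PAIRS, [0] * N_ORGANISMS)
print(len(sample))  # 8012
```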
MODELS
▪ CNN: input data → convolutional neural network → output
▪ RNN: input data → recurrent neural network → output
▪ CNN + RNN: input data → convolutional neural network → recurrent neural network → output
D. Quang et al., DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences, Nucleic Acids Research, 2016
RESULTS
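PERFORMANCE MEASURES
\[ \mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \]
\[ \mathrm{Precision} = \frac{tp}{tp + fp} \]
\[ \mathrm{Recall} = \frac{tp}{tp + fn} \]
\[ F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \]
tp: true positive, tn: true negative, fp: false positive, fn: false negative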
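The four measures can be computed directly from the confusion counts tp, tn, fp and fn. The sketch below is illustrative: the function name, the example counts, and the convention of returning 0.0 when a denominator is zero are assumptions, not something stated on the slide.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, not results from the talk.
metrics = classification_metrics(tp=6, tn=10, fp=2, fn=2)
print(metrics)
```

F1 is the harmonic mean of precision and recall, so unlike accuracy it does not reward correct predictions on the negative class.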
COMPARISON OF MODEL + INFORMATION
Models                                       Training accuracy   Training loss   Test accuracy   Evaluation score (F1)
CNN (domain)                                 0.80                1.311           0.79            0.68
CNN (domain+PPI)                             0.84                1.124           0.81            0.69
CNN (domain+PPI+Phylogenetic profile)        0.83                0.912           0.81            0.69
RNN (domain+PPI+Phylogenetic profile)        0.71                2.334           0.72            0.66
CNN+RNN (domain+PPI+Phylogenetic profile)    0.86                0.865           0.85            0.72
Baseline method* SVM (PPI+domain)            0.65                -               0.73            0.63
*P. Ruan et al., Prediction of Heterodimeric Protein Complexes from Weighted Protein-Protein Interaction Networks Using Novel Features and Kernel Functions, PLoS One, 2013
CPU VS GPU
[Bar chart: time (sec) per epoch, CPU vs. DGX Station]
CPU: 8 min per epoch
DGX Station: 12 sec per epoch
DGX Station is 40 times faster (480 sec / 12 sec = 40).
CONCLUSIONS
▪ Applied deep learning to the prediction of heterodimeric protein complexes using multiple types of biological information
▪ The hybrid model (CNN+RNN) with multiple types of information performs better than the single models
▪ Training on a DGX Station is 40 times faster than on a CPU
Thank you for your kind attention! Email: cruan@nvidia.com