Selective Integration of Multiple Biological Data for Supervised Network Inference
Koji Tsuda
National Institute for Advanced Industrial Science and Technology (AIST), Tokyo, Japan
Joint work with Tsuyoshi Kato and Kiyoshi Asai
2005/8 DIMACS
Biological Networks
• Physical interaction networks
  – Edge ⇔ two proteins physically interact (e.g., docking)
• Metabolic networks of enzymes
  – Edge ⇔ two enzymes catalyze successive reactions
• Gene regulatory networks
• Large graphs with sparse connections
  – 1,000–10,000 nodes
  – 10,000–100,000 edges
Physical Interaction Network [figure]
Metabolic Network [figure]
Statistical Inference of Networks
• Infer the network from data about the proteins
  – Gene expression, phylogenetic profiles, etc.
• We propose a kernel-based inference method
  – 1. Supervised inference
    • Learning from data and a training network
  – 2. Weighted combination of multiple data
    • Identifies unnecessary data that do not contribute to network inference
Unsupervised vs. Supervised Inference
• Unsupervised network inference
  – Bayesian networks (Friedman et al., 2000)
  – Every edge is inferred from scratch (no known edges)
• Supervised network inference
  – A part of the network is known (the training network)
  – The rest of the network is inferred from the data and the training network
  – Kernel CCA (Yamanishi et al., ISMB, 2004)
Supervised Network Inference [figure: a training network plus extra nodes whose edges are to be inferred]
Single Data vs. Multiple Data
• Multiple data for inferring networks
  – Gene expression profiles
  – Subcellular locations
  – Phylogenetic profiles
• Identify the data relevant for inference
• Weighted integration of multiple data!
  – From feature selection to data selection
• Kernel CCA has no mechanism for data selection
Inferring a Network from Multiple Data [figure: data sources (gene expression, phylogenetic profile, subcellular localization, functional category, gene interaction, 3D structure) combined to infer networks (physical interaction, metabolic, gene regulatory)]
Outline
• Network inference from a kernel matrix
  – Unsupervised, single data
  – Thresholding: nearest-neighbor connection
• Incorporating the training network
  – Supervised, single data
  – Kernel matrix completion (Tsuda et al., 2003)
• Weighted integration of multiple data
  – Supervised, multiple data
  – Weights determined by the EM algorithm
Unsupervised, Single Data
• Convert the data to a kernel matrix
  – Similarity among proteins
  – Gene expression: Pearson correlation
  – Phylogenetic profile: tree kernel (Vert, 2002)
  – 3D structure: graph kernel (Borgwardt et al., 2005)
Construct the Network by Thresholding
• Establish an edge wherever the kernel value exceeds a threshold t
[Figure: networks obtained at thresholds t = 0.1, 0.2, 0.4]
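A minimal sketch of this thresholding step, assuming the kernel is given as a NumPy array; the function name and interface are illustrative, not from the talk:

```python
import numpy as np

def threshold_network(K, t):
    """Predict an edge (i, j) wherever the kernel value K[i, j] exceeds the threshold t.

    K : (l, l) symmetric kernel matrix (e.g., Pearson correlations of expression profiles)
    t : scalar threshold; a larger t gives a sparser network
    Returns a binary adjacency matrix with an empty diagonal.
    """
    A = (K > t).astype(int)
    np.fill_diagonal(A, 0)   # no self-loops
    return A
```

Raising t from 0.1 to 0.4, as in the figure, prunes the weaker similarities and leaves only the strongest candidate edges.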
Supervised, Single Data
• Known training network (only for the first n nodes)
• Data about all proteins → kernel matrix
Incomplete Kernel Matrix from the Training Network
• Convert the training graph to a kernel matrix
• Synchronizing the representation (graph → kernel matrix)
• Diffusion kernel (Kondor and Lafferty, 2002)
  – Measures closeness of nodes via random walks
* Thresholding approximately recovers the original network
Computation of the Diffusion Kernel
• A: adjacency matrix
• D: diagonal matrix of degrees
• L = D − A: graph Laplacian matrix
• Diffusion kernel matrix: $K = \exp(-\beta L)$, where $\beta$ is the diffusion parameter
• Characterizes closeness among nodes
• Often used with SVMs (Lanckriet et al., PSB 2004)
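Given the Laplacian, the diffusion kernel is a single matrix exponential; a sketch assuming NumPy/SciPy (the helper name is ours):

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(A, beta):
    """Diffusion kernel K = exp(-beta * L) of a graph (Kondor and Lafferty, 2002).

    A    : (n, n) binary adjacency matrix of the training network
    beta : diffusion parameter; beta = 0 gives the identity, and larger beta
           spreads similarity further along the graph (cf. the following slides)
    """
    D = np.diag(A.sum(axis=1))   # degree matrix
    L = D - A                    # graph Laplacian
    return expm(-beta * L)
```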
Adjacency Matrix and Degree Matrix [figure]
Graph Laplacian Matrix L [figure]
Actual Values of Diffusion Kernels [figure: kernel values measuring closeness from the central node for β = 0, 0.15, and 0.3]
Kernel Matrix Completion
• P: kernel matrix of the data
• Q: incomplete kernel matrix
  $Q = \begin{bmatrix} K_I & Q_{vh} \\ Q_{vh}^T & Q_{hh} \end{bmatrix}$
• Missing values are estimated by minimizing the KL divergence with respect to the missing blocks $Q_{vh}$, $Q_{hh}$:
  $KL(Q, P) = \tfrac{1}{2}\,\mathrm{tr}(P^{-1}Q) - \tfrac{1}{2}\log\det(P^{-1}Q) - \tfrac{l}{2}$
• Closed-form solution Q*
• Threshold Q* to obtain the network
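A sketch of this single-data completion, assuming the closed form is the conditional-Gaussian fill that reappears as the E-step on the later slides (NumPy; the names and interface are ours, not the authors' code):

```python
import numpy as np

def complete_kernel(K_I, P):
    """Fill the missing blocks of Q = [[K_I, Q_vh], [Q_vh^T, Q_hh]] against the data kernel P.

    K_I : (n, n) diffusion kernel of the training network (visible block)
    P   : (l, l) data kernel matrix over all l proteins (first n are the training proteins)
    """
    n = K_I.shape[0]
    Pvv, Pvh, Phh = P[:n, :n], P[:n, n:], P[n:, n:]
    A = np.linalg.solve(Pvv, Pvh)              # P_vv^{-1} P_vh
    Q_vh = K_I @ A
    Q_hh = Phh - Pvh.T @ A + A.T @ K_I @ A
    return np.block([[K_I, Q_vh], [Q_vh.T, Q_hh]])
```

The completed Q* is then thresholded with the same rule as in the unsupervised case to obtain the predicted network.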
Supervised, Multiple Data
• Known training network → diffusion kernel matrix
• Multiple data about all proteins → kernel matrices
Overview of Our Approach [figure: the adjacency matrix of the training network is turned into the incomplete matrix Q via the diffusion kernel; the data kernel matrices are merged into the weighted combination P(b); completion of Q against P(b) followed by thresholding gives the result]
Notations
$Q = \begin{bmatrix} K_I & Q_{vh} \\ Q_{vh}^T & Q_{hh} \end{bmatrix}$
$P(b) = b_1 K_1 + b_2 K_2 + b_3 K_3 + b_4 K_4$
$P(b) = \sum_{i=1}^{n_k} b_i K_i + \sigma^2 I$
Unknowns: submatrices $Q_{vh}$, $Q_{hh}$ and weights $b$
Objective Function
• KL divergence
  $KL(Q, P(b)) = \tfrac{1}{2}\,\mathrm{tr}(P(b)^{-1}Q) - \tfrac{1}{2}\log\det(P(b)^{-1}Q) - \tfrac{l}{2}$
• Minimize with respect to the submatrices $Q_{vh}$, $Q_{hh}$ of $Q$ and the weights $b$ of $P(b)$
• Solved by the EM algorithm
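For reference, this objective can be evaluated directly; a small sketch assuming NumPy (the helper name is ours):

```python
import numpy as np

def kl_divergence(Q, P):
    """KL(Q, P) = 1/2 tr(P^{-1} Q) - 1/2 logdet(P^{-1} Q) - l/2,
    the KL divergence between zero-mean Gaussians with covariances Q and P."""
    l = Q.shape[0]
    PinvQ = np.linalg.solve(P, Q)
    return 0.5 * np.trace(PinvQ) - 0.5 * np.linalg.slogdet(PinvQ)[1] - 0.5 * l
```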
EM Algorithm
• Repeat the following two steps:
  1. E-step: minimize $KL(Q, P(b))$ w.r.t. $Q_{vh}$, $Q_{hh}$
  2. M-step: minimize $KL(Q, P(b))$ w.r.t. $b$
• E-step: same as the single-kernel case
• M-step: cannot be solved in closed form
EM Algorithm for Extended Matrices
• Extended kernel matrices
  $\tilde{Q} = \begin{bmatrix} Q & Q_{xz} \\ Q_{xz}^T & Q_{zz} \end{bmatrix}$, with the corresponding extended model matrix $R(b)$ built from $P(b)$, $\Lambda$, and $\sigma^2$,
  where $Q_{xz} \in \mathbb{R}^{l \times n_k l}$, $Q_{zz} \in \mathbb{R}^{n_k l \times n_k l}$, $\Lambda = [\Lambda_1, \ldots, \Lambda_{n_k}]$, and $K_i = \Lambda_i \Lambda_i^T$
• The solution of the following problem is also optimal in the original problem:
  $\min_{Q_{vh}, Q_{hh}, Q_{xz}, Q_{zz}, b} KL(\tilde{Q}, R(b))$
Solutions of the Steps
• E-step:
  $Q_{vh} = K_I P_{vv}^{-1} P_{vh}$
  $Q_{hh} = P_{hh} - P_{vh}^T P_{vv}^{-1} P_{vh} + P_{vh}^T P_{vv}^{-1} K_I P_{vv}^{-1} P_{vh}$
  $Q_{zz} = V_z + \sigma^{-4} V_z \Lambda^T Q \Lambda V_z$, where $V_z$ is the posterior covariance of the latent variables, computed from $\sigma^{-2}\Lambda^T\Lambda$ and the current weights $b$
• M-step:
  $b_k = \frac{1}{N} \sum_{j=(k-1)N+1}^{kN} [Q_{zz}]_{jj}$
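Putting the steps together, a hedged sketch of the EM loop. It assumes the standard factor-analysis latent model ($z_i \sim N(0, b_i I)$, $x = \sum_i \Lambda_i z_i + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I)$), which is consistent with the E-step and M-step formulas above but is our reading, not the authors' released code; all names, defaults, and the factorization choice are illustrative:

```python
import numpy as np

def em_completion(K_I, kernels, sigma2=0.1, n_iter=50):
    """EM-style completion with multiple kernels, as sketched on the slides.

    K_I     : (n, n) diffusion kernel of the training network (visible block)
    kernels : list of (l, l) data kernel matrices K_1, ..., K_nk over all l proteins
    sigma2  : noise variance in P(b) = sum_i b_i K_i + sigma2 * I (illustrative default)
    """
    l, n, nk = kernels[0].shape[0], K_I.shape[0], len(kernels)
    b = np.ones(nk) / nk                      # initial kernel weights

    # Factor each kernel as K_i = Lambda_i Lambda_i^T (eigendecomposition; Cholesky also works)
    Lambdas = []
    for K in kernels:
        w, U = np.linalg.eigh(K)
        Lambdas.append(U * np.sqrt(np.clip(w, 0.0, None)))
    Lam = np.hstack(Lambdas)                  # Lambda = [Lambda_1, ..., Lambda_nk], shape (l, nk*l)

    for _ in range(n_iter):
        # Weighted combination P(b) = sum_i b_i K_i + sigma2 * I
        P = sum(bi * Ki for bi, Ki in zip(b, kernels)) + sigma2 * np.eye(l)

        # E-step (visible/hidden blocks): fill the missing parts of Q
        Pvv, Pvh, Phh = P[:n, :n], P[:n, n:], P[n:, n:]
        A = np.linalg.solve(Pvv, Pvh)
        Q_vh = K_I @ A
        Q_hh = Phh - Pvh.T @ A + A.T @ K_I @ A
        Q = np.block([[K_I, Q_vh], [Q_vh.T, Q_hh]])

        # E-step (latent blocks), under the assumed factor-analysis model:
        # V_z = (D_b^{-1} + sigma^{-2} Lambda^T Lambda)^{-1},
        # Q_zz = V_z + sigma^{-4} V_z Lambda^T Q Lambda V_z
        D_inv = np.diag(np.repeat(1.0 / b, l))
        V_z = np.linalg.inv(D_inv + Lam.T @ Lam / sigma2)
        M = V_z @ Lam.T / sigma2
        Q_zz = V_z + M @ Q @ M.T

        # M-step: b_k = average of the k-th diagonal block of Q_zz
        d = np.diag(Q_zz)
        b = np.array([d[k * l:(k + 1) * l].mean() for k in range(nk)])

    return Q, b
```

Irrelevant data sources should receive weights driven toward zero, which is how the method identifies and removes the noise kernels in the experiments below.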
Edge Prediction Experiments
• Networks:
  – Metabolic network (KEGG)
  – Protein interaction network (von Mering, 2002)
• Data:
  – exp: gene expression
  – y2h: interaction net from yeast two-hybrid
  – loc: subcellular location
  – phy: phylogenetic profile
  – rnd1, …, rnd4: random noise
• Methods:
  – Q: proposed method
  – P: simple combination of kernel matrices
  – cca: kernel CCA (without the noise kernels)
• Evaluation: ROC score of edge prediction accuracy (10-fold cross-validation)
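As a rough illustration of the evaluation protocol (ROC score over held-out candidate edges), assuming scikit-learn is available; the exact cross-validation splits and candidate-edge definition follow the paper, and the function below is only a sketch with names of our own:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def edge_roc_score(Q_completed, A_true, test_idx):
    """Completed kernel values serve as scores for candidate edges among test proteins;
    the true adjacency matrix provides the labels (both edge classes must appear)."""
    scores, labels = [], []
    for i in test_idx:
        for j in test_idx:
            if i < j:
                scores.append(Q_completed[i, j])
                labels.append(A_true[i, j])
    return roc_auc_score(labels, scores)
```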
Metabolic Network
• Made from the LIGAND database (KEGG) (Vert and Kanehisa, NIPS, 2003)
• Enzymes of two successive reactions are connected
• 769 nodes, 3,702 edges
Interaction Network (von Mering et al., Nature, 2002)
• Medium confidence
• Interactions validated by multiple experiments
  – High-throughput yeast two-hybrid
  – Correlated mRNA expression
  – Genetic interaction
  – Tandem affinity purification
  – High-throughput mass-spectrometric protein complex identification
• 984 nodes, 2,438 edges
Dataset Details
• Metabolic net: http://www.genome.jp/kegg/
• Interaction: von Mering et al., Nature, 417, 399–403, 2002
• Expression: Spellman et al., MBC, 9, 3273–3297, 1998; Eisen et al., PNAS, 95, 14863–8, 1998
• Y2H: Ito et al., PNAS, 98, 4569–74, 2001; Uetz et al., Nature, 10, 601–3, 2000
• Subcellular location: Huh et al., Nature, 425, 686–91, 2003
• Phylogenetic profile: http://www.genome.jp/kegg/
Metabolic Network [results figure]
Physical Interaction Network [results figure]
Introduce More Random Matrices (metabolic network) [figure: sensitivity at 95% specificity]
Summary of Experiments
• Simple combination (P) < completed matrix (Q)
  – The training network is essential
• Accuracy comparable to kernel CCA; selection did not improve accuracy
• Automatic selection of datasets
  – The 4 noise kernel matrices were removed
Conclusion
• Supervised inference of networks
  – Part of the network is known
  – Selection from multiple data
  – Formulated as a kernel matrix completion problem
  – Validation experiments on metabolic and interaction networks
• Future work
  – Biological interpretation of the selection results
  – Applications to non-biological data
Reference: T. Kato, K. Tsuda, and K. Asai. Selective integration of multiple biological data for supervised network inference. Bioinformatics, 21(10):2488–2495, 2005.
Experiments