Reconstructing networks of pathways via significance analysis of their intersections
Embedding biological knowledge in genomic statistical analysis
Mirko Francesconi, Daniel Remondini, Nicola Neretti, John Sedivy, Ettore Verondini, Luciano Milanesi, Leon N Cooper, Gastone Castellani
Collaboration Bologna-Brown Brown University Bologna University Brain Research Center CIG-BBB Biophysics Genomic Proteomic Center BioComplexity Bioinformatics Theoretical Physics Systems Biophysics Molecular Biology Unilever Research Center ITB CNR Milano Atlantic ocean
Gene expression • Regulation of transcription
We have generated and analyzed (or are analyzing) several datasets:
1) c-myc dataset (engineered rat fibroblasts)
2) TAC dataset (mouse)
3) Ewing sarcoma dataset (human)
4) Aging dataset (human time series and monozygotic twins)
5) c-myc exon array dataset (engineered rat fibroblasts)
Probe selection
• Time series (myc on and myc off datasets, cardiac hypertrophy dataset)
• Linear model with empirical Bayes shrinkage of variance (limma, Bioconductor)
• Contrasts of each time point with respect to the zero time point
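The contrast step can be illustrated with a minimal Python sketch. It computes, for one probe, the difference between the mean log-expression at each time point and the mean at time zero. It is only the plain, unmoderated contrast, not limma's empirical Bayes statistic; the function name and the toy values are hypothetical.

```python
from statistics import mean

def contrasts_vs_time_zero(expr_by_time, baseline=0):
    """Mean log-expression at each time point minus the mean at the baseline time.

    expr_by_time maps a time point to the list of replicate values for one probe
    (hypothetical layout, chosen for illustration only).
    """
    base = mean(expr_by_time[baseline])
    return {t: mean(values) - base
            for t, values in expr_by_time.items() if t != baseline}

# Toy probe with three replicates at 0, 2 and 4 weeks (made-up numbers).
probe = {0: [1.0, 1.2, 0.8], 2: [2.0, 2.2, 1.8], 4: [1.0, 1.0, 1.0]}
print(contrasts_vs_time_zero(probe))
```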
Significance analysis: ANOVA and multiple test comparison
• Preprocessing for “dimensionality reduction” of the number of probesets
• Identify genes with significant differences in expression level between the two conditions (perturbed and unperturbed)
• Differences are analyzed over all time points
• Significance analysis is applied to all probesets, with possible FDR correction
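The slides do not say which FDR procedure is used. As a sketch, assuming the widely used Benjamini-Hochberg step-up procedure, adjusted p-values can be computed as follows (the function name and the example p-values are illustrative only):

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Made-up p-values for four probesets.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

A probeset would then be kept when its adjusted p-value falls below the chosen FDR level.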
c-Myc-triggered gene expression
• c-Myc encodes a transcriptional regulator whose inappropriate expression is correlated with a wide array of human malignancies.
• Up-regulation of Myc enforces growth, antagonizes cell cycle withdrawal and differentiation, and in some situations promotes apoptosis.
• c-myc -/- cells reconstituted with the conditionally active, tamoxifen-specific c-Myc-estrogen receptor fusion protein (MycER) allow fine and selective modulation of c-Myc activity by tamoxifen.
Time series experiment with 5 time points in triplicate and 9000 probes. From the J.M. Sedivy lab (O’Connel et al, JBC 2004).
Evaluation of global gene expression in left ventricular tissue in an animal model of left ventricular hypertrophy (LVH) induced by transverse aortic constriction (TAC).
• Time series experimental design
• Measurements were made with 15 Affymetrix chips at T1 = 0, T2 = 2, or T3 = 4 weeks after TAC
• Each time point was measured in 5 replicates
Genomic analysis drawbacks
• Single-gene analysis is not sufficient to understand how cell mechanisms respond to experimental conditions
• Cell behaviour is a complex phenomenon: several elements (e.g. genes) act together to generate it
Perturbation approach
• These experiments can be conceptualized as a “perturbation” of a “basal state” (cell growth, metabolism, young phenotype, cancer phenotype, etc.)
• “External perturbations”, like temperature in physical systems, are realized by gene activation via transcription factor triggering (c-myc, dfoxo-nutrition, aging)
• Emergent properties arising in the context of perturbation theory are the so-called “phase transitions” (superconductivity, superfluidity, etc.) and “condensation” phenomena
[Figure labels: stimulus; increased order and cooperation]
Multiscale correlation for co-regulation detection
• Captures correlation profile changes at several scales (whole array, gene family and pathways) and is informative of significant activity
• Synthesis of pathways into single functional forms (fluxes) or indices such as subgraph conductance
• Assessment of co-regulation between and within several pathways
• When the perturbation is conditionally switched on, the correlation between genes with a significant change in their expression level is altered on a genomic scale
We have strong indications that a similar transition is conserved on different scales and is indicative of co-regulation changes.
To reduce the dimensionality of the problem and introduce “a-priori biological knowledge”, we will extend this method by mapping the gene array data onto gene pathways and ontologies.
Castellani et al, PNAS 2001; Castellani et al, Learning and Memory 2005, BMC Bioinformatics 2007, IJCB 2007
Multiscale Correlation Model: c-Myc results
Multiscale Correlation Model: human aging results
[Figure panels: Protein Binding; Plasma Membrane]
Castellani et al International Journal of Chaos and Bifurcation 2007
HUMAN AGING
 #  Pathway                        #  Pathway
 1  PPAR SigPath                  26  Apoptosis
 2  Adipocytokine SigPath         27  Carbon fixation
 3  Inositol phosphate Met        28  Colorectal cancer
 4  Jak-STAT SigPath              29  Glutathione metabolism
 5  Phosphatidylinositol SigSyst  30  γ-ExaCloCE Degr
 6  Purine metabolism             31  Antigen ProcAndPres
 7  Glyo and Dicarboxylate Met    32  Cyanoamino Ac Met
 8  Cysteine metabolism           33  Gap junction
 9  B cell receptor SigPath       34  Taur HypoTaur Met
10  Glycolysis-Gluconeogenesis    35  ALA-ASP Met
11  Styrene degradation           36  Leuk tr-e migration
12  Long-term depression          37  Atrazine Deg
13  Alkaloid Bios I               38  Nitrogen metabolism
14  Tyrosine Met                  39  Hematopoietic cell lineage
15  mTOR SigPath                  40  Glycan STR-Bios 1
16  FcεRI SigPath                 41  VEGF SigPath
17  Bisphenol A Degr              42  Focal adhesion
18  Val Leu ILeu Bios             43  Nicotinate and nicotinamide metabolism
19  Complement and Coag           44  Ribosome
20  Pyrimidine metabolism         45  Insulin SigPath
21  Pyruvate metabolism           46  Cell cycle
22  Benzoate degradation          47  Cytk-Citk RecInt
23  Type II Diab Mell             48  Glutamate Met
24  PhenylAla Met                 49  Propanoate Met
25  T cell Rec SigPath            50  Toll-like Rec SigPath
“Databases” like KEGG also have an interesting network structure: biologically relevant information may be retrievable from the topological structure of nodes (pathways) and edges (genes shared between two pathways). The most relevant edges can be focal areas from which biological messages spread throughout the network (like hubs among the nodes).
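As an illustration of this node-and-edge definition, the following Python sketch builds the edge list from a mapping of pathways to gene sets. The pathway and gene identifiers are invented placeholders, not KEGG data, and the function name is hypothetical.

```python
from itertools import combinations

def build_pathway_network(pathway_genes):
    """Map each pair of pathways that share genes to the set of shared genes.

    pathway_genes: dict of pathway name -> set of gene identifiers.
    """
    edges = {}
    for a, b in combinations(sorted(pathway_genes), 2):
        shared = pathway_genes[a] & pathway_genes[b]
        if shared:  # an edge exists only when the intersection is non-empty
            edges[(a, b)] = shared
    return edges

# Toy input with placeholder identifiers.
toy = {
    "pathway_A": {"g1", "g2", "g3"},
    "pathway_B": {"g3", "g4"},
    "pathway_C": {"g5"},
}
print(build_pathway_network(toy))
```

Each edge keeps its gene intersection, so the same structure can later be tested for significance edge by edge.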
Pathway network analysis
Given significant nodes and edges, the pathway network can be reconstructed. Edges and nodes can be ranked based on their centrality in the network (connectivity degree and betweenness).
Betweenness centrality
Betweenness centrality is a very interesting parameter because:
- it can be calculated for both nodes and edges
- it measures the possible information flow through an element, so if that element is affected by the experimental conditions, the perturbation is likely to spread more easily to the whole system
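For reference, node betweenness on an unweighted, undirected network can be computed with Brandes' algorithm. The sketch below is a generic pure-Python version, not necessarily the implementation used for these slides. The normalisation by (n-1)(n-2) is an assumption, and the toy graph is hypothetical. Edge betweenness follows the same idea but accumulates the dependency on each edge instead of each node.

```python
from collections import deque

def betweenness_centrality(adjacency):
    """Normalised node betweenness (Brandes' algorithm, unweighted undirected graph).

    adjacency maps each node to the set of its neighbours.
    """
    nodes = list(adjacency)
    score = {v: 0.0 for v in nodes}
    for source in nodes:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        preds = {v: [] for v in nodes}
        n_paths = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        n_paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:               # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:    # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        dependency = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    n = len(nodes)
    # Every unordered pair is visited from both endpoints, so dividing the raw
    # sum by (n-1)(n-2) gives the fraction of pairs routed through each node.
    scale = (n - 1) * (n - 2) if n > 2 else 1
    return {v: s / scale for v, s in score.items()}

# Toy star network: every shortest path between leaves crosses "hub".
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
print(betweenness_centrality(star))
```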
[Figure: Hsa KEGG complete; both axes run from 1 to 197]
[Figure: rno KEGG complete; both axes run from 1 to 183]
[Figure: Histogram of betweenness centrality of pathways extracted from KEGG hsa]
[Figure: Plot of betweenness centrality of pathways extracted from KEGG hsa]
Top 20 pathways extracted from the KEGG database, ranked by their betweenness centrality

Betweenness  Index  Pathway
0.06869          7  Galactose metabolism
0.053446       169  Insulin signaling pathway
0.049498        20  Purine metabolism
0.043014        39  Tryptophan metabolism
0.039993        33  Tyrosine metabolism
0.039176        62  Glycerolipid metabolism
0.032689       176  Alzheimer's disease
0.031585        17  Androgen and estrogen metabolis
0.031433       173  Type II diabetes mellitus
0.02946          1  Glycolysis / Gluconeogenesis
0.029339       191  Prostate cancer
0.022151        24  Glycine, serine and threonine me
0.021969       172  Adipocytokine signaling pathway
0.020961       126  PPAR signaling pathway
0.020138        22  Glutamate metabolism
0.019782        30  Lysine degradation
0.018842        87  Butanoate metabolism
0.01853         96  Nicotinate and nicotinamide meta
0.018316        50  Starch and sucrose metabolism
0.018112       115  Glycan structures - biosynthesis
Pathway significance analysis
Node (pathway) or edge (intersection) significance analysis can be performed by comparing the number of significant genes found in a node or edge, and the total number of genes it contains, with the total number of statistically significant genes and the total number of genes represented in KEGG (e.g. with a test based on the hypergeometric distribution).
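         0     1     Totals
1        a     b     a+b
0        c     d     c+d
Totals   a+c   b+d   n

The null table is constructed from the multinomial distribution and then tested with a χ² test, with expected counts

\[ \mu_{ij} = \frac{T_{R_i} \times T_{C_j}}{T_G} \]

where T_{R_i} is the total of row i, T_{C_j} the total of column j, and T_G the grand total.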
Fisher exact test for a 2x2 contingency table

         0     1     Totals
1        a     b     a+b
0        c     d     c+d
Totals   a+c   b+d   n

The probability is given by the hypergeometric distribution.
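Written out for the table above, the hypergeometric probability of observing exactly these cell counts with all margins fixed takes the standard form below. The slide does not show the formula itself, so this is the textbook expression rather than a transcription.

```latex
p = \frac{\binom{a+b}{a}\,\binom{c+d}{c}}{\binom{n}{a+c}}
  = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}
```

The Fisher exact p-value is then obtained by summing this probability over all tables with the same margins that are at least as extreme as the observed one.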
Pathways and their intersections significance analysis
• Calculated considering the hypergeometric distribution: p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k)
• where
  – p = probability
  – x = number of significant probes in a pathway (or intersection)
  – m = total number of significant probes
  – n = total number of non-significant probes
  – k = number of probes in a pathway
• P < 0.05 was considered significant
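The formula above translates directly into Python with `math.comb`. The sketch below uses the same symbols x, m, n and k. The upper-tail sum P(X >= x) is an assumption about how the p-value is accumulated, since the slide gives only the point probability, and the numbers in the example are made up.

```python
from math import comb

def hypergeom_pmf(x, m, n, k):
    """p(x) = choose(m, x) * choose(n, k - x) / choose(m + n, k)."""
    return comb(m, x) * comb(n, k - x) / comb(m + n, k)

def enrichment_p_value(x, m, n, k):
    """Upper-tail probability P(X >= x) of seeing x or more significant probes."""
    upper = min(k, m)  # x cannot exceed the pathway size or the significant total
    return sum(hypergeom_pmf(i, m, n, k) for i in range(x, upper + 1))

# Toy numbers: 5 significant and 15 non-significant probes overall,
# and a pathway of 4 probes of which 3 are significant.
print(hypergeom_pmf(3, 5, 15, 4))
print(enrichment_p_value(3, 5, 15, 4))
```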
Network representation • Significantly underrepresented: (-1) • Significantly overrepresented: 1 • Not significant: 0
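One way to obtain this three-valued coding is to test both tails of the hypergeometric distribution. The sketch below assumes that over-representation is tested with P(X >= x) and under-representation with P(X <= x), each against the same threshold. The slides do not spell out this rule, and the function name and toy numbers are hypothetical.

```python
from math import comb

def representation_label(x, m, n, k, alpha=0.05):
    """Return 1 (over-represented), -1 (under-represented) or 0 (not significant).

    x: significant probes in the pathway or intersection
    m: total significant probes, n: total non-significant probes
    k: probes in the pathway or intersection
    """
    total = comb(m + n, k)
    # Point probabilities for every possible count of significant probes.
    pmf = [comb(m, i) * comb(n, k - i) / total for i in range(min(k, m) + 1)]
    p_over = sum(pmf[x:])        # P(X >= x)
    p_under = sum(pmf[:x + 1])   # P(X <= x)
    if p_over < alpha:
        return 1
    if p_under < alpha:
        return -1
    return 0

# Toy examples with made-up counts.
print(representation_label(3, 5, 15, 4))    # many significant probes in a small pathway
print(representation_label(3, 50, 50, 20))  # few significant probes in a larger pathway
```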
c-Myc off
c-Myc on