Ontology, Network, and Pathway Analysis of Large Datasets Willard Freeman wfreeman@psu.edu
Flow of genetic information A buffet of ‘omes
Discovery approaches
Knowledge versus Data Discovery approaches
But we can’t see everything • “Measure what is measurable, and make measurable what is not so.” - Galileo
Technologies Available for Gene Expression Studies The choice of appropriate technology is a balance between the number of genes to be analyzed, the number of samples to be analyzed, and cost.
Analytical Flow of Discovery Studies • I can generate lots of data. • How do I make use of it and what are the steps?
Application Specific Software • Proteomics – DeCyder w/ EDA: 2DIGE analysis, stats, some clustering – Progenesis: 2DIGE analysis, stats, some clustering – Protein Pilot: iTRAQ analysis • Transcriptomics – Genome Studio: Illumina data analysis, QC – SDS: qPCR analysis • Statistical Analyses – GeneSpring, R, etc: data analysis (stats, ontology, classification statistics, PCA, heatmaps) • Biological Interpretation – Ingenuity: pathways, networks, effect on function, localization – GeneOntology: biological processes – Other visualization, process, and pathway programs
Gene Ontologies Are certain categories of genes/proteins over-represented in your population of changes as compared to the entire genome/proteome? The Gene Ontology (GO) project (http://www.geneontology.org/) provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the annotation of genes, gene products and sequences.
Gene Ontologies • Prime categories – Molecular function – what specific biochemical action(s) • Kinase, iron binding, etc. – Biological process – what process is this part of • Proteolytic degradation, neurotransmitter release, etc. – Cellular component – where is it • Nucleus, ER, etc. • Can assess over-representation of categories in your gene/protein list by Fisher’s Exact Test
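The over-representation test named above can be sketched with the standard library alone. This is a minimal illustration of a one-sided Fisher's exact test, not the procedure of any particular GO tool; the function name `fisher_overrep_p` and all counts in the example are hypothetical.

```python
from math import comb


def fisher_overrep_p(list_in_cat, list_size, genome_in_cat, genome_size):
    """One-sided Fisher's exact test p-value for over-representation.

    Probability of seeing at least `list_in_cat` category members in a list
    of `list_size` genes drawn from a genome of `genome_size` genes, of
    which `genome_in_cat` belong to the category (hypergeometric upper tail).
    """
    max_overlap = min(list_size, genome_in_cat)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(genome_in_cat, k) * comb(genome_size - genome_in_cat, list_size - k)
        for k in range(list_in_cat, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 8 of the 100 changed genes fall in a category that
# holds 200 of the 20000 genes in the genome (1 expected by chance).
p = fisher_overrep_p(list_in_cat=8, list_size=100,
                     genome_in_cat=200, genome_size=20000)
print(f"p = {p:.3g}")
```

Because many categories are tested at once, the raw p-value from each test still needs a multiple testing correction before it is interpreted.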
Example output from GO analysis of the example dataset • Columns shown: GO category; multiple testing corrected p-value; number of genes in your list that are in a category; number of genes in the genome that are in a category
Gene Ontologies • Important facts – Genes can belong to many categories – Classifications are artificial, human generated – Unknown functions not included – Just a list function – does not take direction of change into account • Strengths – Fairly comprehensive – Easy analysis • Weaknesses – Sometimes only vaguely informative – Redundant classifications
Other Ontology Tools • Kyoto Encyclopedia of Genes and Genomes (KEGG) – http://www.genome.jp/kegg/pathway.html • Database for Annotation, Visualization and Integrated Discovery (DAVID) – http://david.abcc.ncifcrf.gov/ • PANTHER (Protein ANalysis THrough Evolutionary Relationships) – http://www.pantherdb.org/
Panther • Are specific processes over/under represented? • Very similar to GO • Uncertainty as to continued curation
Pathways and Networks • Pathway – A well characterized chain of molecular events leading to some functional outcome – Human created • Network – A set of inter-related genes/proteins • Relationships can be human generated or computer predicted (e.g. protein-protein binding) – May or may not have a known ‘functional’ outcome
Pathways and Network Programs • Ingenuity – www.ingenuity.com – Web accessed – Institutional license (will be re-activated soon) – May be small charge • Ariadne – http://www.ariadnegenomics.com/ – Individual license at UP • KEGG, PathCase, Gephi, GenMAPP
Ingenuity Example • Database is a combination of hand entered literature, natural language processing, and retrieval from public databases ranging from gene expression atlases to ClinicalTrials.gov. • Very good for a wide variety of input types and identifiers – SNPs (dbSNP) – mRNA (Unigene, RefSeq, specific array IDs, Entrez Gene) – miRNA (miRBase) – Proteins (UniProt, GI, HUGO gene symbol) – Metabolites (HMDB) – Small molecule chemicals (PubChem, CAS registry)
Single Gene Analysis • I have some gene and I don’t know what it is and PubMed isn’t helping
Importing a dataset • I have a set (10s to 1000s) of genes/proteins/etc and want to see how changes may work additively or synergistically • Importing data – Format • A recognizable identifier • A ratio or fold change value • Can also include p values, and multiple different comparisons
Most initial challenges with network/pathway analysis are in formatting the primary data so that the appropriate software can parse it. Remember that software is not smart; you have to be explicit in how the data are organized.
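A quick check before upload can catch the most common formatting problems. The sketch below assumes a tab-delimited file with one identifier column, one fold change column, and one p-value column; the column names, the fold change and p-value numbers, and the `parse_dataset` helper are all hypothetical, and real import requirements depend on the software being used.

```python
import csv
import io

# Hypothetical tab-delimited export: identifier, fold change, p-value.
RAW = """gene_symbol\tfold_change\tp_value
Lilrb3\t2.4\t0.003
\t1.8\t0.020
Gfap\tNA\t0.010
Actb\t-1.6\t0.040
"""


def parse_dataset(text):
    """Split rows into usable records and rejected rows with a reason."""
    kept, rejected = [], []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        identifier = (row["gene_symbol"] or "").strip()
        if not identifier:
            rejected.append((row, "missing identifier"))
            continue
        try:
            fold = float(row["fold_change"])
            p_value = float(row["p_value"])
        except (TypeError, ValueError):
            rejected.append((row, "non-numeric value"))
            continue
        kept.append({"id": identifier, "fold_change": fold, "p_value": p_value})
    return kept, rejected


kept, rejected = parse_dataset(RAW)
print(f"{len(kept)} rows parsed, {len(rejected)} rejected")
for row, reason in rejected:
    print(reason, dict(row))
```

Listing the rejected rows with a reason, rather than silently dropping them, makes it easier to answer "did the parsing work?" after import.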
Importing data • Identifiers • Quantitative information • Did the parsing work? • Any additional filtering? • Starting an analysis
Summary of analysis
Functions – analogous to ontologies Threshold – Fisher’s exact test p-value, plotted as –log(p). A Benjamini-Hochberg multiple testing correction (BHMTC) can be applied to the statistics
Ratio – fraction of the genes in the pathway that are altered in the dataset
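The threshold and ratio values can be reproduced by hand. The sketch below applies a Benjamini-Hochberg correction to a set of raw p-values, converts the result to a –log score, and computes the ratio. The pathway names and numbers are hypothetical, a base-10 log and a 0.05 cutoff are assumed, and this is not Ingenuity's own code.

```python
from math import log10


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical pathways: (raw p-value, altered genes, genes in pathway).
PATHWAYS = {
    "Pathway A": (0.0004, 12, 80),
    "Pathway B": (0.0100, 5, 40),
    "Pathway C": (0.0300, 9, 150),
    "Pathway D": (0.2000, 2, 60),
}

names = list(PATHWAYS)
adj_p = benjamini_hochberg([PATHWAYS[n][0] for n in names])
THRESHOLD = -log10(0.05)  # assumed cutoff of p = 0.05 on a base-10 log scale

for name, p_adj in zip(names, adj_p):
    _, altered, size = PATHWAYS[name]
    score = -log10(p_adj)
    flag = "above" if score > THRESHOLD else "below"
    print(f"{name}: -log10(adj p) = {score:.2f} ({flag} threshold), "
          f"ratio = {altered / size:.2f}")
```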
Quantitative data – by shading, can mouse over for numeric data Genes – each gene (or group of genes) can be clicked for additional data Relationships – each line can be clicked for specific information Modifying – the pathway can be modified in Ingenuity, to add/subtract information and create presentation figures
Pathways • Are a conceptual artifice. • Significance is dependent on the size (# of molecules in pathway). • Can create your own pathways which are focused on the topic of interest. • Pathways do not always pop out as blatantly obvious, particularly with heterogeneous or subtle treatments. • If you are an expert in the XYZ pathway you will know more than a database.
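The size dependence noted above is easy to demonstrate. In this sketch two pathways have the same 20% of their molecules altered, but the larger pathway yields a far smaller p-value. The counts and the `overrep_p` helper are hypothetical, and a one-sided Fisher's exact test is assumed.

```python
from math import comb


def overrep_p(hits, pathway_size, dataset_size, background_size):
    """One-sided Fisher's exact p-value for at least `hits` altered genes
    in a pathway, given `dataset_size` altered genes in the background."""
    total = comb(background_size, pathway_size)
    tail = sum(
        comb(dataset_size, k)
        * comb(background_size - dataset_size, pathway_size - k)
        for k in range(hits, min(pathway_size, dataset_size) + 1)
    )
    return tail / total


BACKGROUND = 20000  # hypothetical number of genes measured
DATASET = 500       # hypothetical number of altered genes

# Both pathways have 20% of their molecules altered.
p_small = overrep_p(2, 10, DATASET, BACKGROUND)
p_large = overrep_p(20, 100, DATASET, BACKGROUND)

print(f"10-gene pathway, 2 altered:   p = {p_small:.3g}")
print(f"100-gene pathway, 20 altered: p = {p_large:.3g}")
```

A small, focused pathway can therefore be biologically important and still fail to reach significance.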
Networks Are there sets of inter-related genes that are regulated in your condition? Associative relationship and not necessarily deterministic. Calvano SE, et al Nature 2005 PMID: 16136080.
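One simple way to look for sets of inter-related altered genes is to keep only the relationships that link two altered genes and then collect the connected groups. The sketch below does this with a breadth-first search; the gene labels and edges are hypothetical, and commercial tools such as Ingenuity use their own network-building algorithms.

```python
from collections import deque

# Hypothetical relationships (e.g. protein-protein binding) and altered genes.
EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("G", "H")]
ALTERED = {"A", "B", "C", "E", "F", "G"}


def altered_subnetworks(edges, altered):
    """Group altered genes that are linked through other altered genes."""
    neighbors = {gene: set() for gene in altered}
    for a, b in edges:
        # Keep a relationship only if both partners are altered.
        if a in altered and b in altered:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, groups = set(), []
    for start in sorted(altered):
        if start in seen:
            continue
        seen.add(start)
        group, queue = set(), deque([start])
        while queue:
            gene = queue.popleft()
            group.add(gene)
            for nxt in neighbors[gene]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(sorted(group))
    return groups


groups = altered_subnetworks(EDGES, ALTERED)
for group in groups:
    print(group)
```

The groups found this way are associative only: a link between two altered genes does not show that one change caused the other.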
Can add function and process callouts Toggle between pair-wise comparisons to examine commonalities and differences
Other Ingenuity Functions • Toxicology, figure generation, data set combination, list comparisons, biomarker development. • Data sharing – Analyses can be shared across users – Excellent for sharing in collaborative projects or just between lab members in a manner more intuitive than an Excel sheet.
Ingenuity • Strengths – Multiple sources of data compressed into one place – Extensive use of synonyms and multiple identifiers (e.g. same gene known as Lilrb3 and PirB, one field uses one name and another the other name) – Finding unexpected results – Providing statistical and visual representations of findings to prioritize 2nd generation experiments – Sharing across investigators – Combining multiple types of data
Ingenuity • Weaknesses – Not omniscient – For niche pathways, poor representation • Can create your own – Will not write your paper for you – Networks are sometimes not informative of biological outcomes due to limitations in the existing knowledge
Combining Proteomics and Functional Genomics • How well do message and protein match up? • Diagram labels: Gene; mRNA (three species); Protein, Protein-P, Protein-Gly, Protein-P-Gly
Appropriate use of nomenclature is critical for comparing caterpillars and butterflies
Datasets from multiple sources can be merged for subsequent analyses with accurate translation to a common identifier.
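The merge described above can be sketched as a two-step process: translate every name to a common identifier, then join the datasets on that identifier. The Lilrb3/PirB synonym pair is the example used earlier in this deck; the synonym table, the fold change values, and the `translate` helper are hypothetical.

```python
# Hypothetical synonym table: any known name -> common identifier.
SYNONYMS = {"PirB": "Lilrb3", "Lilrb3": "Lilrb3", "Gfap": "Gfap"}

# Hypothetical fold changes from two platforms that use different names.
MRNA_FOLD = {"PirB": 2.1, "Gfap": 1.7}
PROTEIN_FOLD = {"Lilrb3": 1.6, "Gfap": 1.2, "Unk1": 3.0}


def translate(dataset, synonyms):
    """Re-key a dataset by common identifier; report names with no mapping."""
    translated, unmapped = {}, []
    for name, value in dataset.items():
        common = synonyms.get(name)
        if common is None:
            unmapped.append(name)
        else:
            translated[common] = value
    return translated, unmapped


mrna, mrna_unmapped = translate(MRNA_FOLD, SYNONYMS)
protein, protein_unmapped = translate(PROTEIN_FOLD, SYNONYMS)

# Join on the common identifier; None marks a gene missing from one dataset.
merged = {
    gene: (mrna.get(gene), protein.get(gene))
    for gene in sorted(set(mrna) | set(protein))
}

for gene, (mrna_fc, protein_fc) in merged.items():
    print(f"{gene}: mRNA {mrna_fc}, protein {protein_fc}")
print("unmapped:", mrna_unmapped + protein_unmapped)
```

Reporting the unmapped names matters as much as the merge itself, since an identifier that fails to translate silently drops out of every downstream analysis.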
Good hunting