
Ontology, Network, and Pathway Analysis of Large Datasets



  1. Ontology, Network, and Pathway Analysis of Large Datasets Willard Freeman wfreeman@psu.edu

  2. Flow of genetic information A buffet of ‘omes

  3. Discovery approaches

  4. Knowledge versus Data Discovery approaches

  5. But we can’t see everything • “Measure what is measurable, and make measurable what is not so.” - Galileo

  6. Technologies Available for Gene Expression Studies The choice of appropriate technology is a balance between the number of genes to be analyzed, the number of samples to be analyzed, and cost.

  7. Analytical Flow of Discovery Studies • I can generate lots of data. • How do I make use of it and what are the steps?

  8. Application Specific Software • Proteomics – DeCyder w/ EDA: 2DIGE analysis, stats, some clustering – Progenesis: 2DIGE analysis, stats, some clustering – Protein Pilot: iTRAQ analysis • Transcriptomics – Genome Studio: Illumina data analysis, QC – SDS: qPCR analysis • Statistical Analyses – GeneSpring, R, etc.: data analysis (stats, classification statistics, PCA, heatmaps) • Ontology, Biological Interpretation – Ingenuity: pathways, networks, effect on function, localization – GeneOntology: biological processes – Other visualization, process, and pathway programs

  9. Gene Ontologies Are certain categories of genes/proteins over-represented in your population of changes as compared to the entire genome/proteome? The Gene Ontology (GO) project (http://www.geneontology.org/) provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the annotation of genes, gene products and sequences.

  10. Gene Ontologies • Prime categories – Molecular function – what specific biochemical action(s) • Kinase, iron binding, etc. – Biological process – what process is this part of • Proteolytic degradation, neurotransmitter release, etc. – Cellular component – where is it • Nucleus, ER, etc. • Can assess over-representation of categories in your gene/protein list by Fisher’s Exact Test
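The over-representation test named on this slide can be sketched in a few lines. This is a minimal illustration, not the calculation any particular GO tool performs: it computes the one-sided Fisher's exact test as a hypergeometric upper tail using only the standard library. The function name and all of the counts are hypothetical.

```python
from math import comb

def enrichment_p_value(hits_in_list, list_size, hits_in_genome, genome_size):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Probability of seeing at least `hits_in_list` category members in a
    list of `list_size` genes drawn from a genome of `genome_size` genes,
    of which `hits_in_genome` belong to the category.
    """
    max_hits = min(list_size, hits_in_genome)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(hits_in_genome, k) * comb(genome_size - hits_in_genome, list_size - k)
        for k in range(hits_in_list, max_hits + 1)
    )
    return tail / total

# Hypothetical numbers: 8 of 100 changed genes fall in a category that
# holds 500 of the 20,000 genes in the genome (about 2.5 expected by chance).
p = enrichment_p_value(8, 100, 500, 20000)
print(f"p = {p:.4g}")
```

A small p value here says only that the category is over-represented in the list; as the later slides note, it says nothing about the direction of change.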

  11. Example output from GO analysis of the example dataset • Output columns: GO category; multiple testing corrected p-value; number of genes in your list that are in a category; number of genes in the genome that are in a category

  12. Gene Ontologies • Important facts – Genes can belong to many categories – Classifications are artificial, human generated – Unknown functions not included – Just a list function – does not take direction of change into account • Strengths – Fairly comprehensive – Easy analysis • Weaknesses – Sometimes only vaguely informative – Redundant classifications

  13. Other Ontology Tools • Kyoto Encyclopedia of Genes and Genomes (KEGG) – http://www.genome.jp/kegg/pathway.html • Database for Annotation, Visualization and Integrated Discovery (DAVID) – http://david.abcc.ncifcrf.gov/ • PANTHER (Protein ANalysis THrough Evolutionary Relationships) – http://www.pantherdb.org/

  14. Panther • Are specific processes over/under represented? • Very similar to GO • Uncertainty as to continued curation

  15. Panther

  16. Pathways and Networks • Pathway – A well characterized chain of molecular events leading to some functional outcome – Human created • Network – A set of inter-related genes/proteins • Relationships can be human generated or computer predicted (e.g. protein-protein binding) – May or may not have a known ‘functional’ outcome

  17. Pathways and Network Programs • Ingenuity – www.ingenuity.com – Web accessed – Institutional license (will be re-activated soon) – May be small charge • Ariadne – http://www.ariadnegenomics.com/ – Individual license at UP • KEGG, PathCase, Gephi, GenMAPP

  18. Ingenuity Example • Database is a combination of hand-entered literature, natural language processing, and retrieval of public databases ranging from gene expression atlases to ClinicalTrials.gov. • Very good for a wide variety of input types and identifiers – SNPs (dbSNP) – mRNA (Unigene, RefSeq, specific array IDs, Entrez Gene) – miRNA (miRBase) – Proteins (UniProt, GI, HUGO gene symbol) – Metabolites (HMDB) – Small molecule chemicals (PubChem, CAS registry)

  19. Single Gene Analysis • I have some gene and I don’t know what it is and PubMed isn’t helping

  20. Importing a dataset • I have a set (10s to 1000s) of genes/proteins/etc. and want to see how changes may work additively or synergistically • Importing data – Format • A recognizable identifier • A ratio or fold change value • Can also include p values, and multiple different comparisons

  21. Most initial challenges with network/pathway analysis come from formatting the primary data so that the appropriate software can parse it. Remember that software is not smart, so you have to be explicit in how the data are organized.
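A quick pre-flight check before upload can catch most parsing problems. The sketch below assumes a tab-delimited file with one identifier column, one fold change column, and one p value column. The column names, the sample rows, and the `parse_dataset` helper are all hypothetical and are not tied to any specific program's import format.

```python
import csv
import io

# Hypothetical tab-delimited export: identifier, fold change, p value.
RAW = """gene_symbol\tfold_change\tp_value
Lilrb3\t2.4\t0.003
Gfap\t1.8\t0.01
\t1.5\t0.02
Actb\tNA\t0.5
"""

def parse_dataset(text):
    """Split rows into usable records and rows that would fail to parse."""
    good, rejected = [], []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        identifier = (row["gene_symbol"] or "").strip()
        try:
            record = {
                "id": identifier,
                "fold_change": float(row["fold_change"]),
                "p_value": float(row["p_value"]),
            }
        except (TypeError, ValueError):
            # Non-numeric or missing values, such as "NA", end up here.
            rejected.append(row)
            continue
        if identifier:
            good.append(record)
        else:
            # A numeric row without an identifier cannot be mapped to a gene.
            rejected.append(row)
    return good, rejected

good, rejected = parse_dataset(RAW)
print(len(good), "usable rows;", len(rejected), "rejected rows")
```

Reviewing the rejected rows by hand before import is usually faster than working out afterwards why a gene is missing from the analysis.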

  22. Importing data • Identifiers • Quantitative information • Did the parsing work? • Any additional filtering? • Starting an analysis

  23. Summary of analysis

  24. Functions – analogous to ontologies • Threshold – Fisher’s exact test p value, shown as –log(p value). A Benjamini-Hochberg multiple testing correction (BHMTC) can be applied to the statistics

  25. Threshold – Fisher’s exact test p value, shown as –log(p value). A Benjamini-Hochberg multiple testing correction (BHMTC) can be applied to the statistics • Ratio – fraction of the genes in the pathway that are altered in the dataset
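The two quantities on this slide can be reproduced by hand as a sanity check. The following is a minimal sketch of the Benjamini-Hochberg adjustment followed by the –log transform and the ratio. The pathway names and counts are hypothetical, and the use of log base 10 is an assumption, since the slide only says "–log".

```python
from math import log10

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical pathways: (name, raw p value, altered genes, genes in pathway)
pathways = [("A", 0.001, 12, 60), ("B", 0.02, 5, 40),
            ("C", 0.04, 3, 150), ("D", 0.30, 2, 80)]
adj = benjamini_hochberg([p for _, p, _, _ in pathways])
for (name, _, altered, size), q in zip(pathways, adj):
    # Threshold axis: -log10 of the adjusted p value. Ratio: altered / size.
    print(name, round(-log10(q), 2), round(altered / size, 3))
```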

  26. • Quantitative data – by shading; can mouse over for numeric data • Genes – each gene (or group of genes) can be clicked for additional data • Relationships – each line can be clicked for specific information • Modifying – the pathway can be modified in Ingenuity to add/subtract information and create presentation figures

  27. Pathways • Are a conceptual artifice. • Significance is dependent on the size (# of molecules in pathway). • Can create your own pathways which are focused on the topic of interest. • Pathways do not always pop out as blatantly obvious, particularly with heterogeneous and subtle treatments. • If you are an expert in the XYZ pathway you will know more than a database.

  28. Networks • Are there sets of inter-related genes that are regulated in your condition? • Relationships are associative and not necessarily deterministic. • Calvano SE, et al. Nature 2005. PMID: 16136080.
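The question on this slide can be posed as a small graph problem. The sketch below is one simple way to do it, assuming a list of pairwise relationships and a set of regulated genes. It is not how Ingenuity builds its networks, and the relationship list and regulated set are illustrative placeholders rather than curated data.

```python
from collections import defaultdict, deque

# Hypothetical pairwise relationships (curated or computer predicted).
INTERACTIONS = [("TNF", "NFKB1"), ("NFKB1", "IL6"), ("IL6", "STAT3"),
                ("TP53", "MDM2"), ("ACTB", "MYH9")]
# Hypothetical set of genes that changed in the condition of interest.
REGULATED = {"TNF", "NFKB1", "IL6", "TP53", "MDM2", "GAPDH"}

def regulated_subnetworks(interactions, regulated):
    """Group regulated genes linked directly or through other regulated genes."""
    neighbors = defaultdict(set)
    for a, b in interactions:
        if a in regulated and b in regulated:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Breadth-first search collects one connected group.
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            gene = queue.popleft()
            group.add(gene)
            for nxt in neighbors[gene] - seen:
                seen.add(nxt)
                queue.append(nxt)
        groups.append(sorted(group))
    return groups

groups = regulated_subnetworks(INTERACTIONS, REGULATED)
print(groups)
```

Grouping genes this way shows association only. Whether a group has a functional outcome still has to come from the biology.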

  29. • Can add function and process callouts • Toggle between pair-wise comparisons to examine commonalities and differences

  30. Other Ingenuity Functions • Toxicology, figure generation, data set combination, list comparisons, biomarker development. • Data sharing – Analyses can be shared across users – Excellent for sharing in collaborative projects or just between lab members in a manner more intuitive than an Excel sheet.

  31. Ingenuity • Strengths – Multiple sources of data brought together in one place – Extensive use of synonyms and multiple identifiers (e.g. the same gene is known as Lilrb3 and PirB; one field uses one name and another field uses the other) – Finding unexpected results – Providing statistical and visual representations of findings to prioritize second-generation experiments – Sharing across investigators – Combining multiple types of data

  32. Ingenuity • Weaknesses – Not omniscient – For niche pathways, poor representation • Can create your own – Will not write your paper for you – Networks are sometimes not informative of biological outcomes due to limitations in the existing knowledge

  33. Combining Proteomics and Functional Genomics • How well do message (transcript) and protein match up? • Diagram labels: Gene; mRNA (three copies); Protein, Protein-P, Protein-Gly, Protein-P-Gly

  34. Appropriate use of nomenclature is critical for comparing caterpillars and butterflies

  35. Datasets from multiple sources can be merged for subsequent analyses with accurate translation to a common identifier.
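The merge step described here can be sketched with a synonym table, using the Lilrb3/PirB example from the Ingenuity strengths slide. The synonym table, the fold change values, and the `merge_by_common_id` helper are hypothetical. In practice the mapping would come from an identifier database rather than a hand-written dictionary.

```python
# Hypothetical synonym table mapping each known alias to one common identifier.
SYNONYMS = {"PirB": "Lilrb3", "Lilrb3": "Lilrb3", "Gfap": "Gfap"}

# Hypothetical fold changes keyed by whatever name each platform reported.
transcript_changes = {"PirB": 2.1, "Gfap": 1.7}
protein_changes = {"Lilrb3": 1.6, "Vim": 1.3}

def merge_by_common_id(mrna, protein, synonyms):
    """Merge two datasets on a common identifier and report unmapped names."""
    merged, unmapped = {}, []
    for source, data in (("mrna", mrna), ("protein", protein)):
        for name, value in data.items():
            common = synonyms.get(name)
            if common is None:
                # Keep track of names that could not be translated.
                unmapped.append((source, name))
                continue
            merged.setdefault(common, {})[source] = value
    return merged, unmapped

merged, unmapped = merge_by_common_id(transcript_changes, protein_changes, SYNONYMS)
print(merged)
print("unmapped:", unmapped)
```

Reporting the unmapped names explicitly matters, because a silent drop at this step looks the same as "no change" in every later analysis.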

  36. Good hunting
