Ontology, Network, and Pathway Analysis of Large Datasets

Ontology, Network, and Pathway Analysis of Large Datasets Willard Freeman wfreeman@psu.edu

Flow of genetic information A buffet of ‘omes

Discovery approaches

Knowledge versus Data Discovery approaches

But we can’t see everything • “Measure what is measurable, and make measurable what is not so.” - Galileo

Technologies Available for Gene Expression Studies The choice of appropriate technology is a balance between the number of genes to be analyzed, the number of samples to be analyzed, and cost.

Analytical Flow of Discovery Studies • I can generate lots of data. • How do I make use of it and what are the steps?

Application Specific Software, Statistical Analyses, Biological Interpretation • Proteomics – DeCyder w/ EDA: 2DIGE analysis, stats, some clustering – Progenesis: 2DIGE analysis, stats, some clustering – Protein Pilot: iTRAQ analysis • Transcriptomics – Genome Studio: Illumina data analysis, QC – SDS: qPCR analysis • Statistical analyses – GeneSpring, R, etc.: data analysis (stats, ontology, classification statistics, PCA, heatmaps) • Biological interpretation – Ingenuity: pathways, networks, effect on function, localization – GeneOntology: biological processes – Other visualization, process, and pathway programs

Gene Ontologies Are certain categories of genes/proteins over-represented in your population of changes as compared to the entire genome/proteome? The Gene Ontology (GO) project (http://www.geneontology.org/) provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the annotation of genes, gene products and sequences.

Gene Ontologies • Prime categories – Molecular function – what specific biochemical action(s) • Kinase, iron binding, etc. – Biological process – what process is this part of • Proteolytic degradation, neurotransmitter release, etc. – Cellular component – where is it • Nucleus, ER, etc. • Can assess over-representation of categories in your gene/protein list by Fisher’s Exact Test
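The over-representation test named above can be sketched in a few lines. This is a minimal illustration, assuming a one-sided Fisher's exact test (equivalent to the upper tail of the hypergeometric distribution); the function name and the counts in the usage lines are hypothetical and not taken from the slides or from any particular GO tool.

```python
from math import comb


def enrichment_p_value(list_hits, list_size, genome_hits, genome_size):
    """One-sided Fisher's exact test for over-representation.

    list_hits   : genes in your list that fall in the category
    list_size   : total genes in your list
    genome_hits : genes in the genome annotated to the category
    genome_size : total annotated genes in the genome

    Returns P(X >= list_hits) when list_size genes are drawn at random
    from the genome without replacement.
    """
    max_hits = min(list_size, genome_hits)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(genome_hits, k) * comb(genome_size - genome_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 12 of 200 changed genes are kinases,
# versus 500 kinases among 20000 annotated genes.
p = enrichment_p_value(12, 200, 500, 20000)
print(f"p = {p:.3g}")
```

Real GO tools run this test once per category, which is why a multiple testing correction is normally applied to the resulting p-values.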

Example output from GO analysis of the example dataset • Columns – GO category – Multiple testing corrected p-value – Number of genes in your list that are in a category – Number of genes in the genome that are in a category

Gene Ontologies • Important facts – Genes can belong to many categories – Classifications are artificial, human generated – Unknown functions not included – Just a list function – does not take direction of change into account • Strengths – Fairly comprehensive – Easy analysis • Weaknesses – Sometimes only vaguely informative – Redundant classifications

Other Ontology Tools • Kyoto Encyclopedia of Genes and Genomes (KEGG) – http://www.genome.jp/kegg/pathway.html • Database for Annotation, Visualization and Integrated Discovery (DAVID) – http://david.abcc.ncifcrf.gov/ • PANTHER (Protein ANalysis THrough Evolutionary Relationships) – http://www.pantherdb.org/

Panther • Are specific processes over/under represented? • Very similar to GO • Uncertainty as to continued curation

Panther

Pathways and Networks • Pathway – A well characterized chain of molecular events leading to some functional outcome – Human created • Network – A set of inter-related genes/proteins • Relationships can be human generated or computer predicted (e.g. protein-protein binding) – May or may not have a known ‘functional’ outcome

Pathways and Network Programs • Ingenuity – www.ingenuity.com – Web accessed – Institutional license (will be re-activated soon) – May be small charge • Ariadne – http://www.ariadnegenomics.com/ – Individual license at UP • KEGG, PathCase, Gephi, GenMAPP

Ingenuity Example • Database is a combination of hand-entered literature, natural language processing, and retrieval of public databases ranging from gene expression atlases to ClinicalTrials.gov. • Very good for a wide variety of input types and identifiers – SNPs (dbSNP) – mRNA (Unigene, RefSeq, specific array IDs, Entrez Gene) – miRNA (miRBase) – Proteins (UniProt, GI, HUGO gene symbol) – Metabolites (HMDB) – Small molecule chemicals (PubChem, CAS registry)

Single Gene Analysis • I have some gene and I don’t know what it is and PubMed isn’t helping

Importing a dataset • I have a set (10s to 1000s) of genes/proteins/etc. and want to see how changes may work additively or synergistically • Importing data – Format • A recognizable identifier • A ratio or fold change value • Can also include p values and multiple different comparisons

Most initial challenges with network/pathway analysis come from formatting the primary data so that the appropriate software can parse it. Remember that software is not smart: you have to be explicit in how the data are organized.
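A quick pre-import check can catch most formatting problems before the file reaches the analysis software. The sketch below assumes a tab-delimited file with a header row; the column names `identifier`, `fold_change`, and `p_value` are hypothetical choices for illustration, not a format required by Ingenuity or any other program.

```python
import csv
import io

REQUIRED_COLUMNS = ("identifier", "fold_change")  # assumed column names


def parse_dataset(text):
    """Parse tab-delimited text into clean rows plus a list of problems.

    Returns (rows, problems): rows are dicts with a string identifier,
    a float fold_change, and a float-or-None p_value; problems are
    (line_number, message) tuples for lines that could not be used.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")

    rows, problems = [], []
    for record in reader:
        line_no = reader.line_num
        identifier = (record["identifier"] or "").strip()
        if not identifier:
            problems.append((line_no, "empty identifier"))
            continue
        try:
            fold_change = float(record["fold_change"])
        except (TypeError, ValueError):
            problems.append((line_no, "fold_change is not numeric"))
            continue
        p_text = (record.get("p_value") or "").strip()
        try:
            p_value = float(p_text) if p_text else None
        except ValueError:
            problems.append((line_no, "p_value is not numeric"))
            continue
        rows.append(
            {"identifier": identifier, "fold_change": fold_change, "p_value": p_value}
        )
    return rows, problems


# Made-up example input with two deliberately bad lines.
example = "identifier\tfold_change\tp_value\nLilrb3\t2.5\t0.01\n\t1.2\t0.5\nGapdh\tNA\t0.9\n"
rows, problems = parse_dataset(example)
print(len(rows), "usable rows;", len(problems), "problem lines")
```

Reporting the line number of each rejected row makes it easier to go back to the spreadsheet and fix the source rather than silently dropping data.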

Importing data • Identifiers • Quantitative information • Did the parsing work? • Any additional filtering? • Starting an analysis

Summary of analysis

Functions – analogous to ontologies Threshold – Fisher’s exact test p value, shown as –log(p value). A Benjamini-Hochberg multiple testing correction (BHMTC) can be applied to the statistics

Threshold – Fisher’s exact test p value, shown as –log(p value). A Benjamini-Hochberg multiple testing correction (BHMTC) can be applied to the statistics Ratio – fraction (‘%’) of genes in the pathway that are altered in the dataset
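The three quantities on these slides (the –log threshold, the multiple testing correction, and the ratio) are simple to compute by hand. The sketch below is a generic illustration, assuming the standard Benjamini-Hochberg step-up adjustment and a base-10 logarithm; it is not Ingenuity's own code, and the function names and p-values are hypothetical.

```python
from math import log10


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is
    # no larger than the one ranked above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def neg_log10(p_value):
    """Convert a p-value to the -log10 scale used on the bar charts."""
    return -log10(p_value)


def pathway_ratio(altered_in_dataset, pathway_size):
    """Fraction of the pathway's molecules that are altered in the dataset."""
    return altered_in_dataset / pathway_size


# Made-up p-values for four pathways.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"p={p:<6} adjusted={q:.3f} -log10(adjusted)={neg_log10(q):.2f}")
```

On the –log10 scale, a cutoff of p = 0.05 corresponds to a bar height of about 1.3, so taller bars are more significant.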

• Quantitative data – shown by shading; mouse over for numeric data • Genes – each gene (or group of genes) can be clicked for additional data • Relationships – each line can be clicked for specific information • Modifying – the pathway can be modified in Ingenuity to add/subtract information and create presentation figures

Pathways • Are a conceptual artifice. • Significance is dependent on the size (# of molecules in pathway). • Can create your own pathways which are focused on the topic of interest. • Pathways do not always pop out as blatantly obvious, particularly with heterogeneous or subtle treatments. • If you are an expert in the XYZ pathway you will know more than a database.

Networks Are there sets of inter-related genes that are regulated in your condition? Relationships are associative and not necessarily deterministic. Calvano SE, et al. Nature 2005 PMID: 16136080.
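One simple way to ask whether regulated genes form inter-related sets is to keep only the known relationships between regulated genes and then group the genes that remain connected. The sketch below is an illustrative assumption about how such a grouping could be done, not the network algorithm used by Ingenuity; the gene names and relationships are placeholders.

```python
from collections import defaultdict


def regulated_subnetworks(relationships, regulated):
    """Group regulated genes into connected sets.

    relationships : iterable of (gene_a, gene_b) pairs from a knowledge base
    regulated     : iterable of genes that changed in your condition

    Returns a list of sets, largest first. Genes with no regulated
    partner come back as single-member sets.
    """
    regulated = set(regulated)
    neighbors = defaultdict(set)
    for a, b in relationships:
        # Keep only relationships where both ends are regulated.
        if a in regulated and b in regulated:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, components = set(), []
    for gene in sorted(regulated):
        if gene in seen:
            continue
        stack, component = [gene], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            component.add(current)
            stack.extend(neighbors[current] - seen)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Placeholder relationships and a placeholder list of regulated genes.
known = [("A", "B"), ("B", "C"), ("C", "X"), ("D", "E")]
changed = ["A", "B", "C", "D", "F"]
for group in regulated_subnetworks(known, changed):
    print(sorted(group))
```

Consistent with the caveat on the slide, a connected set only says the genes are associated in the knowledge base; it does not establish that one change causes another.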

Can add function and process callouts Toggle between pair-wise comparisons to examine commonalities and differences

Other Ingenuity Functions • Toxicology, figure generation, data set combination, list comparisons, biomarker development. • Data sharing – Analyses can be shared across users – Excellent for sharing in collaborative projects or just between lab members in a manner more intuitive than an Excel sheet.

Ingenuity • Strengths – Multiple sources of data compressed into one place – Extensive use of synonyms and multiple identifiers (e.g. the same gene is known as Lilrb3 and PirB; one field uses one name and another field uses the other) – Finding unexpected results – Providing statistical and visual representations of findings to prioritize second-generation experiments – Sharing across investigators – Combining multiple types of data

Ingenuity • Weaknesses – Not omniscient – For niche pathways, poor representation • Can create your own – Will not write your paper for you – Networks are sometimes not informative of biological outcomes due to limitations in the existing knowledge

Combining Proteomics and Functional Genomics • How well do message and transcript match up? [Figure: one gene giving rise to several mRNAs and several protein forms (Protein, Protein-P, Protein-Gly, Protein-P-Gly)]

Appropriate use of nomenclature is critical for comparing caterpillars and butterflies

Datasets from multiple sources can be merged for subsequent analyses with accurate translation to a common identifier.
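Translating to a common identifier before merging can be sketched as a synonym lookup followed by a join. The example below is a minimal illustration: the synonym table, the choice of `Lilrb3` as the common name, and all fold change values are made up, and a real analysis would draw the mapping from a curated resource rather than a hand-written dictionary.

```python
def to_common_id(name, synonyms):
    """Look up a name (case-insensitive) in a synonym table; None if unknown."""
    return synonyms.get(name.strip().lower())


def merge_datasets(transcripts, proteins, synonyms):
    """Merge mRNA and protein fold changes under a common identifier.

    Returns (merged, unmapped): merged maps each common identifier to a
    dict with optional 'mRNA' and 'protein' values; unmapped lists the
    (source, name) pairs that could not be translated.
    """
    merged, unmapped = {}, []
    for source, dataset in (("mRNA", transcripts), ("protein", proteins)):
        for name, fold_change in dataset.items():
            common = to_common_id(name, synonyms)
            if common is None:
                unmapped.append((source, name))
                continue
            merged.setdefault(common, {})[source] = fold_change
    return merged, unmapped


# Hypothetical synonym table and made-up fold changes.
synonyms = {"lilrb3": "Lilrb3", "pirb": "Lilrb3", "gapdh": "Gapdh"}
transcripts = {"Lilrb3": 2.0, "Gapdh": 1.0}
proteins = {"PirB": 1.8, "Mystery1": 3.0}

merged, unmapped = merge_datasets(transcripts, proteins, synonyms)
print(merged)
print("could not translate:", unmapped)
```

Keeping the untranslated names in a separate list, rather than discarding them, shows how much of each dataset was lost in translation.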

Good hunting
