Gene Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein
A quick review
Gene expression profiling
  Which molecular processes/functions are involved in a certain phenotype (e.g., disease, stress response, etc.)?
The Gene Ontology (GO) Project
  Provides shared vocabulary/annotation
  Terms are linked in a complex structure
Enrichment analysis – the wrong way

Functional category    # of genes in the study set    %
Signaling              82                             27.6
Metabolism             40                             13.5
Others                 31                             10.4
Trans factors          28                              9.4
Transporters           26                              8.8
Proteases              20                              6.7
Protein synthesis      19                              6.4
Adhesion               16                              5.4
Oxidation              13                              4.4
Cell structure         10                              3.4
Secretion               6                              2.0
Detoxification          6                              2.0

The signaling category contains 27.6% of all genes in the study set, by far the largest category. It seems reasonable to conclude that signaling may be important in the condition under study.
Enrichment analysis – the wrong way
What if ~27% of the genes on the array are involved in signaling? Then the number of signaling genes in the set is just what is expected by chance. We need to consider not only the number of genes in the set for each category, but also the total number on the array.

Functional category    # of genes in the study set    %        % on array
Signaling              82                             27.6%    26%
Metabolism             40                             13.5%    15%
Others                 31                             10.4%    11%
Trans factors          28                              9.4%    10%
Transporters           26                              8.8%     2%
Proteases              20                              6.7%     7%
Protein synthesis      19                              6.4%     7%
Adhesion               16                              5.4%     6%
Oxidation              13                              4.4%     4%
Cell structure         10                              3.4%     8%
Secretion               6                              2.0%     2%
Detoxification          6                              2.0%     2%

We want to know which category is over-represented (occurs more times than expected by chance).
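To make the comparison concrete, here is a minimal Python sketch that divides each category's share of the study set by its share of the array. The percentages are copied from the table; the names `rows` and `fold_enrichment` are made up for this illustration, and a ratio alone says nothing about statistical significance.

```python
# (category, % of study set, % on array), copied from the table above
rows = [
    ("Signaling", 27.6, 26), ("Metabolism", 13.5, 15), ("Others", 10.4, 11),
    ("Trans factors", 9.4, 10), ("Transporters", 8.8, 2), ("Proteases", 6.7, 7),
    ("Protein synthesis", 6.4, 7), ("Adhesion", 5.4, 6), ("Oxidation", 4.4, 4),
    ("Cell structure", 3.4, 8), ("Secretion", 2.0, 2), ("Detoxification", 2.0, 2),
]

def fold_enrichment(study_pct, array_pct):
    """Ratio of the observed share to the share expected from the array."""
    return study_pct / array_pct

# Sort categories from most over-represented to most under-represented
ranked = sorted(rows, key=lambda r: fold_enrichment(r[1], r[2]), reverse=True)

for name, study_pct, array_pct in ranked:
    print(f"{name:18s} {fold_enrichment(study_pct, array_pct):.2f}")
```

Signaling, the largest category, comes out close to 1 (about what the array composition predicts), while Transporters stands out with a ratio above 4.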
Enrichment analysis
Enrichment analysis – the right way
Say the microarray contains 50 genes, 10 of which are annotated as ‘signaling’. Your expression analysis reveals 8 differentially expressed genes, 4 of which are annotated as ‘signaling’. Is this significant?
A statistical test, based on a null model:
  Assume the study set has nothing to do with the specific function at hand and was selected randomly. Would we be surprised to see this number of genes annotated with this function in the study set?
The “urn” version:
  You pick a random set of 8 balls from an urn that contains 50 balls: 40 white and 10 blue. How surprised would you be to find that 4 of the balls you picked are blue?
A quick review: Modified Fisher's exact test
Genes/balls: 10 out of 50 are blue
Differentially expressed (DE) genes/balls: 4 out of 8 are blue
Do I have a surprisingly high number of blue genes?
Null model: the 8 genes/balls are selected randomly
[Figure: repeated random draws of 8 balls under the null model, with blue counts such as 2, 1, 2, 5, 3, 4 and 2 out of 8]
So, if you have 50 balls, 10 of them are blue, and you pick 8 balls randomly, what is the probability that k of them are blue?
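One way to build intuition for this null model is to simulate it. The sketch below repeatedly draws 8 balls without replacement from an urn of 10 blue and 40 white balls and tallies how often each blue count k occurs. The function name, the number of trials and the fixed seed are choices made for this illustration only.

```python
import random
from collections import Counter

def simulate_blue_counts(total=50, blue=10, draws=8, trials=20_000, seed=0):
    """Estimate P(k blue balls) by repeated random draws without replacement."""
    rng = random.Random(seed)                  # fixed seed so the run is repeatable
    urn = [1] * blue + [0] * (total - blue)    # 1 = blue ball, 0 = white ball
    counts = Counter()
    for _ in range(trials):
        counts[sum(rng.sample(urn, draws))] += 1
    return {k: counts[k] / trials for k in range(min(blue, draws) + 1)}

freqs = simulate_blue_counts()
for k, f in freqs.items():
    print(k, round(f, 4))

# Estimated probability of drawing at least 4 blue balls
tail = sum(f for k, f in freqs.items() if k >= 4)
print("P(k >= 4) ~", round(tail, 3))
```

The simulated tail probability should land close to the exact hypergeometric value of roughly 0.04, up to sampling noise.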
A quick review: Modified Fisher's exact test
[Figure: hypergeometric distribution for m=50, m_t=10, n=8; x-axis: k (0 to 8), y-axis: probability]
So … do I have a surprisingly high number of blue genes?
What is the probability of getting at least 4 blue genes in the null model? P(σ_t >= 4)
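The same probabilities can be computed exactly. The following sketch evaluates the hypergeometric probabilities for the example (m=50, m_t=10, n=8) using only the Python standard library; `hypergeom_pmf` and `upper_tail` are names invented for this illustration.

```python
from math import comb

def hypergeom_pmf(k, m, m_t, n):
    """P(exactly k annotated genes) when n genes are drawn at random
    from m genes, m_t of which carry the annotation."""
    return comb(m_t, k) * comb(m - m_t, n - k) / comb(m, n)

def upper_tail(k_min, m, m_t, n):
    """P(at least k_min annotated genes) under the same null model."""
    return sum(hypergeom_pmf(k, m, m_t, n) for k in range(k_min, min(m_t, n) + 1))

# The urn example: 50 balls, 10 blue, 8 drawn
pmf = [hypergeom_pmf(k, 50, 10, 8) for k in range(9)]
p_value = upper_tail(4, 50, 10, 8)

for k, p in enumerate(pmf):
    print(k, round(p, 4))
print("P(at least 4 blue) =", round(p_value, 4))
```

For this example the tail probability is about 0.041, so seeing 4 or more blue balls among 8 is fairly unlikely under random selection.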
Modified Fisher's Exact Test
Let m denote the total number of genes in the array and n the number of genes in the study set. Let m_t denote the total number of genes annotated with function t and n_t the number of genes in the study set annotated with this function.
We are interested in knowing the probability of seeing n_t or more annotated genes!
(This is equivalent to a one-sided Fisher exact test)
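Written out, this probability is the upper tail of the hypergeometric distribution. The expression below is the standard hypergeometric form in the notation defined above, with σ_t denoting the random number of annotated genes in a randomly chosen study set (the symbol used on the previous slide); it is given here as a reference, not as a transcription of the slide's own equation.

```latex
P(\sigma_t \ge n_t) \;=\; \sum_{k=n_t}^{\min(m_t,\,n)}
  \frac{\binom{m_t}{k}\,\binom{m-m_t}{\,n-k\,}}{\binom{m}{n}}
```

For the running example (m = 50, m_t = 10, n = 8, n_t = 4) the sum evaluates to roughly 0.041.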
So … what do we have so far?
  A shared functional vocabulary
  Systematic linkage between genes and functions
  A way to identify genes relevant to the condition under study
  Statistical analysis (combining all of the above to identify cellular functions that contributed to the disease or condition under study)
  A way to identify “related” genes
Still far from being perfect!
  A shared functional vocabulary
  Systematic linkage between genes and functions
  A way to identify genes relevant to the condition under study
  Statistical analysis (combining all of the above to identify cellular functions that contributed to the disease or condition under study)
  A way to identify “related” genes
Limitations: Considers only a few genes. Arbitrary! Limited hypotheses. Simplistic null model!
Gene Set Enrichment Analysis
Enrichment Analysis
[Figure: expression heat map of Class A vs. Class B samples, with genes ranked by expression correlation to Class A and a cutoff line. Which biological function?]
Enrichment Analysis
[Figure: the same ranked gene list with a cutoff. Genes above the cutoff, per function:]
  Function 1 (e.g., metabolism): 2 / 10
  Function 2 (e.g., signaling): 5 / 11
  Function 3 (e.g., regulation): 3 / 10
Problems with cutoff-based analysis
  After correcting for multiple hypothesis testing, no individual gene may meet the threshold due to noise.
  Alternatively, one may be left with a long list of significant genes without any unifying biological theme.
  The cutoff value is often arbitrary!
  We are really examining only a handful of genes, totally ignoring much of the data.
Gene Set Enrichment Analysis MIT, Broad Institute V 2.0 available since Jan 2007 (Subramanian et al. PNAS. 2005.)
GSEA key features
  Does not require setting a cutoff!
  Identifies the set of relevant genes as part of the analysis!
  Calculates a score for the enrichment of an entire set of genes rather than single genes!
  Provides a more robust statistical framework!
Gene Set Enrichment Analysis
[Figure: Class A vs. Class B samples, genes ranked by expression correlation to Class A, with a cutoff. Genes above the cutoff, per function:]
  Function 1 (e.g., metabolism): 2 / 10
  Function 2 (e.g., signaling): 5 / 11
  Function 3 (e.g., regulation): 3 / 10
Gene Set Enrichment Analysis
[Figure: the same ranked gene list without a cutoff, with the positions of genes from Function 1 (e.g., metabolism), Function 2 (e.g., signaling) and Function 3 (e.g., regulation) marked]
Gene Set Enrichment Analysis
[Figure: the ranked gene list with the positions of genes from Function 1 (e.g., metabolism), Function 2 (e.g., signaling) and Function 3 (e.g., regulation), each with its running-sum plot]
Running sum:
  Increase when the gene is annotated with the function under study
  Decrease otherwise
Gene Set Enrichment Analysis What would you expect if ALL genes annotated with this function cluster at the top of the list? What would you expect if genes annotated with this function are randomly distributed? What would you expect if most of the genes annotated with this function cluster at the top of the list?
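Gene Set Enrichment Analysis
[Figure: three example running-sum plots: ES = 0.69; low ES (hits evenly distributed); ES = -0.59]
Gene Set Enrichment Analysis
Enrichment score (ES) = maximum deviation of the running sum from 0
[Figure: running-sum plot marking the ES, the leading-edge genes, and the positions of genes within the functional set (hits)]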
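The running sum and the enrichment score can be sketched in a few lines of Python. This is a simplified, unweighted version: each hit adds 1/(number of hits) and each miss subtracts 1/(number of misses), so the sum starts and ends at 0. The published GSEA method additionally weights hits by their correlation with the phenotype, which is omitted here. Function names and the toy gene list are hypothetical.

```python
def running_sum(ranked_genes, gene_set):
    """Running-sum trace over a ranked gene list (unweighted sketch)."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("need at least one hit and one miss")
    total, trace = 0.0, []
    for is_hit in hits:
        # Increase on a gene annotated with the function, decrease otherwise
        total += 1.0 / n_hits if is_hit else -1.0 / n_miss
        trace.append(total)
    return trace

def enrichment_score(ranked_genes, gene_set):
    """ES = the value of the running sum with maximum deviation from 0."""
    return max(running_sum(ranked_genes, gene_set), key=abs)

# Toy ranked list: g1 is most correlated with Class A, g10 the least
ranked = [f"g{i}" for i in range(1, 11)]
es_top = enrichment_score(ranked, {"g1", "g2", "g3"})       # set clusters at the top
es_bottom = enrichment_score(ranked, {"g8", "g9", "g10"})   # set clusters at the bottom
es_spread = enrichment_score(ranked, {"g2", "g5", "g8"})    # set spread across the list
print(es_top, es_bottom, es_spread)
```

A set concentrated at the top of the list gives an ES near +1, a set concentrated at the bottom gives an ES near -1, and a set spread evenly stays much closer to 0.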
Estimating Significance of ES
Estimating Significance of ES
An empirical permutation test:
  Phenotype labels are shuffled and the ES for this functional set is recomputed.
  Repeat 1000 times, generating a null distribution.
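The permutation procedure can be sketched as follows, under several simplifying assumptions that are not from the slides: genes are ranked by the difference in mean expression between Class A and Class B (GSEA itself offers other ranking metrics), the ES is the unweighted running-sum version, the p-value counts null scores at least as extreme as the observed one in the same direction, and a +1 correction keeps it above zero. The toy data and all names are hypothetical.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum ES (maximum deviation from 0)."""
    n_hits = sum(g in gene_set for g in ranked_genes)
    n_miss = len(ranked_genes) - n_hits
    total, best = 0.0, 0.0
    for g in ranked_genes:
        total += 1.0 / n_hits if g in gene_set else -1.0 / n_miss
        if abs(total) > abs(best):
            best = total
    return best

def rank_genes(expression, labels):
    """Rank genes by mean(Class A) - mean(Class B), highest first (assumed metric)."""
    def mean_diff(values):
        a = [v for v, lab in zip(values, labels) if lab == "A"]
        b = [v for v, lab in zip(values, labels) if lab == "B"]
        return sum(a) / len(a) - sum(b) / len(b)
    return sorted(expression, key=lambda g: mean_diff(expression[g]), reverse=True)

def permutation_p_value(expression, labels, gene_set, n_perm=1000, seed=0):
    """Shuffle phenotype labels, recompute the ES, and compare with the observed ES."""
    rng = random.Random(seed)
    observed = enrichment_score(rank_genes(expression, labels), gene_set)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)   # class sizes are preserved by shuffling
        null.append(enrichment_score(rank_genes(expression, shuffled), gene_set))
    if observed >= 0:
        extreme = sum(es >= observed for es in null)
    else:
        extreme = sum(es <= observed for es in null)
    return observed, (extreme + 1) / (n_perm + 1)

# Toy data: 20 genes, 5 Class A and 5 Class B samples; g0..g4 are higher in Class A
labels = ["A"] * 5 + ["B"] * 5
expression = {}
for i in range(20):
    if i < 5:
        expression[f"g{i}"] = [3.0 + 0.1 * j for j in range(5)] + [0.1 * j for j in range(5)]
    else:
        expression[f"g{i}"] = [((i * (j + 1)) % 7) / 7 for j in range(10)]

observed_es, p_value = permutation_p_value(expression, labels, {f"g{i}" for i in range(5)})
print(observed_es, p_value)
```

Shuffling the phenotype labels, rather than the gene labels, keeps the correlations between genes intact in every permuted data set.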
GSEA Steps
1. Calculation of an enrichment score (ES) for each functional category
2. Estimation of significance level of the ES
     Shuffling-based null distribution
3. Adjustment for multiple hypothesis testing
     Necessary if comparing multiple gene sets (i.e., functions)
     Computes FDR (false discovery rate)
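To illustrate what step 3 is for, the sketch below applies the Benjamini-Hochberg procedure to a list of per-gene-set p-values. This is a generic FDR adjustment swapped in for illustration; GSEA estimates its FDR differently, from normalized enrichment scores and the permutation null. The function name and the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four gene sets
p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)
print([round(q, 4) for q in q_values])
```

Gene sets whose adjusted value falls below a chosen FDR level (for example 0.05) would be reported as enriched.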