Using non-parametric methods in the context of multiple testing to determine differentially expressed genes
Greg Grant, Elisabetta Manduchi, Chris Stoeckert
Penn Center for Bioinformatics
CAMDA 2000
Outline
• Differential Expression
• Biological Variability and Replicates
• Gene Intensity Distributions
  – necessitate nonparametric methods
• Applications of
  – PaGE
  – t-statistic combined with a permutation algorithm
The Dataset
• Golub et al. (1999), Science, 286:531-537
• ALL-AML: heterogeneous groups: source (B-cells, T-cells, 4 AML types), sex, success, etc.
• Focus on B-cells (37 replicates) vs T-cells (9 replicates): combined the training and the test sets
• Affymetrix
  – single sample hybridization
  – each signal is a composite of hybridizations to probes in a set
  – absent calls
Distribution Heterogeneity
“Deterministic” differential expression
[Figure: panels “B and T”, “B”, “T”; log scale]
Identifier: U23852, T-lymphocyte specific protein tyrosine kinase p56lck (lck) aberrant mRNA
“Non-deterministic” differential expression
[Figure: panels “B and T”, “B”, “T”; log scale]
Identifier: M23323, T-cell surface glycoprotein CD3 epsilon chain precursor
Absent calls
Only by including the absent calls do we see the difference in genes we expect to be differentially expressed, such as the following T-cell antigen CD7 precursor (Id: D007499).
[Figure: panels “B and T”, “B-cell”, “T-cell”; log scale]
• Consequence of including the absent calls: the introduction of bimodal distributions and non-deterministic differential expression, which complicates the problem of assigning confidence to predictions of differential expression.
t-statistics and adjusted p-values
Use the method described in Dudoit et al. (2000):
• Assigns a t-statistic to each gene.
• p-values are obtained by permuting the columns instead of assuming t-distributions.
• Corrects for multiple testing with the Westfall and Young step-down approach.
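The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation referenced in the slides: the choice of the Welch form of the t-statistic, the function names `welch_t` and `maxt_adjusted_pvalues`, and the toy data are all assumptions made for the example. Columns are permuted by shuffling the group labels, and the adjusted p-values follow the step-down maxT scheme of Westfall and Young.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample Welch t-statistic for b minus a (assumed form of the statistic)."""
    diff = mean(b) - mean(a)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def maxt_adjusted_pvalues(data, labels, n_perm=1000, seed=0):
    """data: one list of intensities per gene (one value per column).
    labels: 0/1 group label per column.
    Returns step-down maxT adjusted p-values, one per gene."""
    rng = random.Random(seed)

    def abs_stats(lab):
        out = []
        for row in data:
            g0 = [x for x, l in zip(row, lab) if l == 0]
            g1 = [x for x, l in zip(row, lab) if l == 1]
            out.append(abs(welch_t(g0, g1)))
        return out

    observed = abs_stats(labels)
    # Genes ordered from most to least extreme observed statistic.
    order = sorted(range(len(data)), key=lambda i: -observed[i])
    counts = [0] * len(data)
    for _ in range(n_perm):
        perm = list(labels)
        rng.shuffle(perm)  # permute the columns by shuffling their labels
        t_perm = abs_stats(perm)
        running_max = 0.0
        # Successive maxima, starting from the least extreme gene.
        for rank in range(len(order) - 1, -1, -1):
            gene = order[rank]
            running_max = max(running_max, t_perm[gene])
            if running_max >= observed[gene]:
                counts[rank] += 1
    adjusted = [0.0] * len(data)
    previous = 0.0
    for rank, gene in enumerate(order):
        previous = max(previous, counts[rank] / n_perm)  # enforce monotonicity
        adjusted[gene] = previous
    return adjusted


# Toy data: gene 0 differs strongly between groups, gene 1 does not.
toy = [
    [1, 2, 1, 2, 1, 2, 9, 10, 9, 10, 9, 10],
    [5, 3, 6, 4, 5, 4, 4, 6, 3, 5, 4, 5],
]
toy_labels = [0] * 6 + [1] * 6
print(maxt_adjusted_pvalues(toy, toy_labels, n_perm=500, seed=1))
```

Taking the maximum over the remaining genes at each step is what makes the correction account for all genes tested at once, without assuming the statistics follow a t-distribution.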
B-cell vs. T-cell using t-statistic
Column n,9 = fraction of times the gene is called up-regulated in T-cells out of 100 comparisons between n randomly chosen B-cell experiments and all 9 T-cell experiments.
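The "Column n,9" quantity can be computed with a simple subsampling loop. The sketch below is illustrative only: `up_call_fraction` and `mean_ratio_rule` are hypothetical names, and the decision rule is a placeholder standing in for whichever method (permutation t-statistic or PaGE) makes the up-regulation call.

```python
import random
from statistics import mean


def up_call_fraction(b_values, t_values, n, is_up, trials=100, seed=0):
    """Fraction of trials in which the gene is called up-regulated in T-cells
    when n B-cell experiments are drawn at random and all T-cell experiments
    are used."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        subset = rng.sample(b_values, n)  # n randomly chosen B-cell experiments
        if is_up(subset, t_values):
            hits += 1
    return hits / trials


def mean_ratio_rule(b_subset, t_all, cutoff=2.0):
    """Placeholder rule (an assumption): call 'up in T' when the ratio of
    group means exceeds a fixed cutoff."""
    return mean(t_all) / mean(b_subset) > cutoff


# One gene, 37 B-cell and 9 T-cell intensities (made-up numbers).
b_cells = [10.0] * 18 + [100.0] * 19
t_cells = [120.0] * 9
print(up_call_fraction(b_cells, t_cells, n=5, is_up=mean_ratio_rule))
```

A fraction near 1 for small n indicates a call that is stable even with few B-cell replicates; a fraction that rises only for large n indicates a call that depends on having many replicates.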
PaGE: Patterns from Gene Expression
• PaGE assigns confidence measures to predictions of differential expression. It handles multiple testing in a nonparametric (and non-standard) way.
  – Does not use the t-statistic.
• Patterns are generated by comparing groups of replicates to a reference group.
• See Manduchi et al. (2000), Bioinformatics, 16:685-698.
PaGE: outline
• Find C (the upper cutratio) such that, if a gene i is chosen at random from the set of genes which are true negatives, then the probability that $X_{2,i} / X_{1,i} > C$ is small.
• This C gives a cutoff for making predictions about up-regulation.
• Similarly for down-regulation (find an appropriate c [lower cutratio], reverse the above inequality).
PaGE: approximations
The false positive rate is approximated by
$$\mathrm{Prob}\left( \frac{X_{2,i}/\mu_{2,i}}{X_{1,i}/\mu_{1,i}} > C \right)$$
After having shifted all intensities by an appropriate numerical constant, we approximate the unknown distribution of $X_{1,i}/\mu_{1,i}$ by that of a quantity computed from the replicate intensities $X_{1,i,j}$ and the number of replicates $n_1$, where i varies over the gene tags and j varies over the replicates for group 1 (see Manduchi et al. (2000) for the exact expression). Similarly for group 2.
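To make the role of the cutratio concrete, here is a simplified sketch of how an upper cutoff C could be read off once surrogate values for $X_{1,i}/\mu_{1,i}$ and $X_{2,i}/\mu_{2,i}$ are available. It is not the PaGE algorithm: the function name `upper_cutratio`, the all-pairs combination of surrogate values (which treats the two groups as independent), and the toy numbers are assumptions for illustration.

```python
import math


def upper_cutratio(norm1, norm2, alpha):
    """Smallest observed ratio C such that at most a fraction alpha of all
    pairwise ratios norm2/norm1 is strictly greater than C.

    norm1, norm2: surrogate samples of X/mu for group 1 and group 2
    (assumed given; all values must be positive)."""
    ratios = sorted(b / a for a in norm1 for b in norm2)
    k = math.ceil((1 - alpha) * len(ratios)) - 1
    return ratios[max(k, 0)]


# Made-up surrogate values scattered around 1 for both groups.
surrogate1 = [0.8, 0.9, 1.0, 1.1, 1.2]
surrogate2 = [0.8, 0.9, 1.0, 1.1, 1.2]
C = upper_cutratio(surrogate1, surrogate2, alpha=0.2)
print(C)
```

Under these assumptions, a gene whose observed ratio exceeds C would be predicted up-regulated, with alpha playing the role of the approximate false positive rate; the lower cutratio c would be obtained the same way from the other tail.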
The effect of shifting
hypothetical data:
• assuming variance proportional to magnitude. No shift necessary.
real data:
• variance greater for low intensities.
• absent calls increase this effect.
• a moderate shift compensates and reduces false positives at low and high intensity.
The effect of shifting (cont.)
[Figure: 37 B-cell replicates vs 9 T-cell replicates]
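A small numeric example shows why a shift damps spurious ratios at low intensity. This is a sketch under the assumption that the shift is a constant added to both intensities before the ratio is taken; the function name `shifted_ratio`, the shift value of 100, and the intensities are made up for illustration.

```python
def shifted_ratio(x2, x1, shift):
    """Ratio of two intensities after adding the same constant to both."""
    return (x2 + shift) / (x1 + shift)


# Low-intensity pair, where a 4-fold ratio is plausibly just noise.
print(shifted_ratio(20, 5, 0))      # 4.0
print(shifted_ratio(20, 5, 100))    # close to 1 after shifting

# High-intensity pair, where the same 4-fold ratio is largely preserved.
print(shifted_ratio(4000, 1000, 0))
print(shifted_ratio(4000, 1000, 100))
```

The same constant barely changes ratios between large intensities but pulls ratios between small, noisy intensities toward 1, which is the compensation for the greater variance seen at low intensity in real data.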
B-cell vs. T-cell using PaGE
Column n,9 = fraction of times the gene is called up-regulated in T-cells out of 100 comparisons between n randomly chosen B-cell experiments and all 9 T-cell experiments.
Effect of Number of Replicates on False Positives Due to Biological Subclassing
• Comparisons of B-cell to B-cell.
  – Any predictions are false positives.
• Table entries are empirical likelihoods of observing any false positives.
• False positives are due to noise and/or biological subclassing, with the latter effect diminishing as the number of replicates increases.
• Confidence was 90%. If PaGE were exact instead of conservative, the numbers in each column would converge to 0.1.
• Tripling the number of independent genes does not dramatically worsen the multiple testing problem of subclassing.

        No. of Indep. Genes
Reps    1000    3000
5       0.39    0.44
10      0.15    0.19
20      0.10    0.11
30      0.06    0.07
40      0.06    0.02
50      0.03    0.04
Summary
• How many replicates are needed?
  – Gene intensity distributions can be very irregular
  – Noise and multiple testing (false negatives)
    • t-statistic: continues to reduce false negatives even with 25 replicates
    • PaGE: much less conservative
  – Biological variability and multiple testing (false positives)
    • PaGE: confidence measures assume that the variability of each class is fully represented in the replicates. If a class is very heterogeneous (e.g. B-cells), then many replicates might be needed to avoid over-representing a subclass by chance and thereby introducing false positives.
    • The more homogeneous the group, the fewer replicates are needed.
• How do findings generalize to other platforms?
Acknowledgements
PCBI: Brian Brunk, Joan Mazzarelli, Eugen Buehler, Shannon McWeeney, Jonathan Crabtree, Colleen Petrelli, Sue Davidson, Debbie Pinney, Sharon Diskin, Angel Pizarro, Georgi Kostov, Jonathan Schug, Phillip Le, Jim Wolff
URLs
• http://www.cbil.upenn.edu/
• http://www.cbil.upenn.edu/PaGE
• http://www.stat.berkeley.edu/users/terry/zarray/html/matt.html (Dudoit et al.)
• http://www.cbil.upenn.edu/tpWY (implementation of Dudoit et al.)