Gene expression analysis is a Bayesian modelling of multi-step process differential gene expression data Low-level Model Alex Lewin, with Sylvia Richardson, Clare (how the measured expression is related to the Marshall, Anne Glazier and Tim Aitman signal) We aim to integrate all (Imperial College) the steps in a common Normalisation statistical framework (to make samples comparable) In collaboration with Helen Causton (Imperial Microarray Centre) Anne-Mette Hein (Imperial) Clustering Differential Peter Green and Graeme Ambler (Bristol) Partition Model Expression Bayesian hierarchical model framework Data Set and Biological question Previous Work (Tim Aitman, Anne Marie Glazier) • Model different sources of variability simultaneously, within array, between array, estimation of gene Deficiency in gene Cd36 found to be associated with specific variability … insulin resistance in SHR (spontaneously hypertensive rat) • Uncertainty is propagated from data to parameter estimates Microarray Study •3 SHR compared with 3 transgenic rats • Share information in appropriate ways to get better •3 wildtype mice compared with 3 knockout mice estimates •Two tissues: fat and heart •Affymetrix chips U34A-C and U74A-C ( ≅ 12000 genes) Bayesian hierarchical model for genes Bayesian hierarchical model for genes under one condition (I) under one condition (II) Data: y gr = log gene expression for gene g, replicate r (can • 2nd level be any estimate of signal: Affymetrix, Li and Wong etc.) Priors for α g , coefficients { a } and { b } α g = gene effect σ g2 ∼ lognormal ( µ , τ ) β r(g) = array effect (expression-level dependent) Hyper-parameters µ and τ can be influential. σ g 2 = gene variance In a full Bayesian analysis, these are not fixed • 1st level • 3rd level y gr ∼ N( α g + β r(g) , σ g2 ), Σ r β r(g) = 0 µ ∼ N( c, d) τ ∼ lognormal (e, f) β r(g) = function of α g , parameters { a } and { b }
Details of array effects We will discuss: Exploratory work shows need for expression-level dependent normalisation • Array effects (normalisation) Piecewise polynomial with unknown break points: β r(g) = quadratic in α g for a rk-1 ≤ α g ≤ a rk • Bayesian model checks on gene variances with coeff (b rk(1) , b rk(2) ), k =1, … #breakpoints • Locations of break points not fixed • Confounding of differential and array effects • Must do sensitivity checks on # break points • Rank statistics • Cubic fits well for this data Non linear fit of array effect as a function Effect of normalisation on density of gene effect cubic Wildtype Knockout loess Before (y gr ) ^ After (y gr - β r(g) ) Bayesian Model Checking Smoothing of the gene specific variances • Check our assumptions on gene variances •Variances are estimated using 2 new from the model for • Predict sample variance S g information from all each gene G x R measurements • Compare predicted S g 2 new with observed S g 2 obs (~12000 x 3) rather than just 3 2 obs ) 2 new > S g Bayesian p-value Prob( S g •Variances are stabilised and shrunk towards • Distribution of p-values Uniform if model is adequate average variance • Easily implemented in MCMC algorithm
Bayesian predictive p-values Differential expression model Control for method: equal Exchangeable variance variance model has too little model is supported by the d g = differential effect for gene g between 2 variability for the data data conditions Joint model for the 2 conditions : y g1r ∼ N( α g - ½ d g + β r(g)1 , σ g12 ), (condition 1) y g2r ∼ N( α g + ½ d g + β r(g)2 , σ g22 ), (condition 2) So E(y g2 – y g1 ) = d g Prior can be put on d g directly Possible Statistics for Differential Credibility intervals for ranks Expression Ranks of modelled d g ≈ log fold change log fold change d g* = d g / ( σ 2 g1 / 3 + σ 2 g2 / 3 ) ½ (standardised difference) 150 genes with lowest rank Even genes with •We obtain the joint distribution of all {d g } and/or {d g* } median rank less than •Distributions of ranks 100 can have large uncertainty Summary Probability statements about ranks Under-expression: probability that gene is • Model different sources of variability in a single model ranked in bottom 100 genes • Borrow information from all genes to stabilise estimates of gene specific variances Have to choose rank cutoff (here 100) • Use joint distribution of ranks for inference Have to choose how confident we want to be in • Future work: mixture prior on log fold changes, with saying the rank is less than uncertainty propagated to mixture parameters the cutoff (eg prob=80%)
Recommend
More recommend