Bayesian Two-way Clustering for Gene Expression Data
Graeme Ambler and Peter Green
University of Bristol
12 July 2003

Motivation
Obvious potential for Bayesian and EB methods in gene expression analysis: can they be made to work?
BGX project, BBSRC funded, with Sylvia Richardson, Clare Marshall, Alex Lewin and Anne-Mette Hein (Imperial), in collaboration with Helen Causton and Tim Aitman and colleagues (CSC/IC Microarray Centre)
Model-based, flexible approach to gene expression analysis

Gene expression using Affymetrix chips
[Figure: image of a hybridised array (1.28cm) with a zoom onto a hybridised spot (oligonucleotide element, 20µm), showing the single stranded, labeled RNA sample and expressed and non-expressed genes. Each element holds millions of copies of a specific oligonucleotide sequence; the array carries approx. ½ million different complementary oligonucleotides. Slide courtesy of Affymetrix.]

Plan
• Variation and uncertainty in gene expression
• Hierarchical models
• Simultaneous inference
• Common framework, including clustering
• Initial experiments with layer models

Variation and uncertainty
Gene expression data (e.g. Affymetrix) is the result of multiple sources of variability
• condition/treatment
• biological
• array manufacture
• imaging
• technical
• within/between array variation
• gene-specific variability

Hierarchical models
Variables at several levels: allows modelling of complex systems
Bayesian hierarchical models
One of the most important benefits of the Bayesian approach has nothing much to do with having real quantitative prior information
• it has more to do with the structures connecting variables
• especially when there is uncertainty at more than one level

The Bayes orthodoxy
• Should avoid a plug-in approach: all sources of variation should be assimilated
• Propagates uncertainty
• ‘Borrows strength’: shares out information according to principle
• Avoids over-optimistic inference

Gene expression is a hierarchical process
• Substantive question
• Experimental design
• Sample preparation
• Array design & manufacture
• Gene expression matrix
• Probe level data
• Image level data

Bayes in hierarchical models
• The arrows represent (top down) model specification, not the order in which operations are performed
• Once specified, model unknowns should be estimated simultaneously
• (We cannot yet claim all of this is practical in gene expression)

Hierarchical clustering of samples
[Figure: a subset of 1161 gene expression profiles, obtained in 60 different samples. Red: more mRNA, Green: less mRNA in the sample compared to a reference.]
The gene expression profiles cluster according to tissue of origin of the samples
Ross et al, Nature Genetics, 2000

Additive models for (log-) gene expression
The simplest model: gene + sample
    $y_{gs} = \alpha_g + \beta_s + \varepsilon_{gs}$    (g = gene, s = sample/condition)
Under standard conditions, the (least-squares) estimates of gene effects are
    $\hat{\alpha}_g = \bar{y}_{g\cdot} - \bar{y}_{\cdot\cdot}$
The model generates the method, and in this case performs a simple form of normalisation
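The least-squares estimate on the additive-model slide can be checked numerically. The following is a minimal sketch, not part of the original talk: the matrix sizes, noise levels and the sum-to-zero constraint on the gene effects are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
G, S = 6, 4  # hypothetical numbers of genes and samples

# Simulate y_gs = alpha_g + beta_s + eps_gs (all values are made up).
alpha = rng.normal(0.0, 1.0, size=G)
alpha -= alpha.mean()                    # assumed constraint: gene effects sum to zero
beta = rng.normal(8.0, 0.5, size=S)      # sample effects carry the overall level
eps = rng.normal(0.0, 0.1, size=(G, S))
y = alpha[:, None] + beta[None, :] + eps

# Least-squares gene effects: row mean minus grand mean.
alpha_hat = y.mean(axis=1) - y.mean()

# Subtracting each sample's mean removes the sample effects,
# which is the simple normalisation the slide refers to.
y_normalised = y - y.mean(axis=0)
```

Here `alpha_hat` differs from the simulated `alpha` only by averaged noise, and every column of `y_normalised` has mean zero.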
Non-model-based clustering
• Many clustering algorithms have been developed and used for exploratory purposes
• They rely on a measure of ‘distance’ (dissimilarity) between gene or sample profiles, e.g. Euclidean
• Hierarchical clustering proceeds in an agglomerative manner: single profiles are joined to form groups using the distance metric, recursively
• Good visual tool, but many arbitrary choices: care in interpretation!

Model-based clustering
• Build the cluster structure into the model, rather than estimating gene effects (say) first, and post-processing to seek clusters
• Bayesian setting allows use of real prior information where it exists (biological understanding of pathways, etc, previous experiments, …)

A common framework for specifying gene expression models
For ease of exposition, consider only the gene expression matrix $y_{gs}$ (g = gene, s = sample/condition) with no structure to samples (although incorporating experimental structure is a key goal for later)

Clustering via additive model (single sample first!)
    $y_g = \alpha + \varepsilon_g$    (g = gene)
    $y_g = \alpha + \gamma_{T_g} + \varepsilon_g$
$T_g$ = unknown cluster to which gene g belongs
This is a mixture model

Clustering via additive model (multiple samples)
    $y_{gs} = \alpha_g + \beta_s + \varepsilon_{gs}$    (g = gene, s = sample/condition)
    $y_{gs} = \alpha_g + \beta_s + \gamma_{T_g s} + \varepsilon_{gs}$
$T_g$ = unknown cluster to which gene g belongs
Clustering of gene profiles

Clustering via additive model
    $y_{gs} = \alpha_g + \beta_s + \gamma_{T_g s} + \varepsilon_{gs}$
$T_g$ = cluster to which gene g belongs
    $y_{gs} = \alpha_g + \beta_s + \delta_{g U_s} + \varepsilon_{gs}$
$U_s$ = cluster to which sample s belongs
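To make the index notation of the clustered additive models concrete, the sketch below simulates from them. It is an illustration only: the cluster labels $T_g$ and $U_s$ are drawn at random and treated as known, whereas in the talk they are unknowns to be inferred, and all sizes and scales are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
G, S = 8, 5   # hypothetical numbers of genes and samples
H, K = 2, 2   # hypothetical numbers of gene clusters and sample clusters

alpha = rng.normal(size=G)          # gene effects alpha_g
beta = rng.normal(size=S)           # sample effects beta_s
T = rng.integers(0, H, size=G)      # T[g]: cluster of gene g (known here, unknown in practice)
U = rng.integers(0, K, size=S)      # U[s]: cluster of sample s

# Gene clustering: gamma[h, s] is the profile shared by all genes in cluster h.
gamma = rng.normal(scale=2.0, size=(H, S))
mean_gene = alpha[:, None] + beta[None, :] + gamma[T, :]

# Sample clustering: delta[g, k] is shared by all samples in cluster k.
delta = rng.normal(scale=2.0, size=(G, K))
mean_sample = alpha[:, None] + beta[None, :] + delta[:, U]

# Observed matrix under the gene-clustering model, with noise eps_gs.
y = mean_gene + rng.normal(scale=0.1, size=(G, S))
```

Indexing `gamma[T, :]` copies one cluster profile per gene, so two genes with the same label differ only through their own `alpha` values and noise.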
Two-way clustering via additive model
    $y_{gs} = \alpha_g + \beta_s + \gamma_{T_g s} + \varepsilon_{gs}$
    $y_{gs} = \alpha_g + \beta_s + \delta_{g U_s} + \varepsilon_{gs}$
    $y_{gs} = \alpha_g + \beta_s + \gamma_{T_g s} + \delta_{g U_s} + \varepsilon_{gs}$
or
    $y_{gs} = \alpha_g + \beta_s + \gamma_{T_g U_s} + \varepsilon_{gs}$

Lazzeroni and Owen ‘Plaid’ model
    $y_{gs} = \alpha_g + \beta_s + \gamma_{T_g s} + \varepsilon_{gs}$
Now write $\rho_{gh} = 1$ if and only if $T_g = h$, 0 otherwise
    $y_{gs} = \alpha_g + \beta_s + \sum_h \rho_{gh} \gamma_s^{(h)} + \varepsilon_{gs}$
h denotes a ‘cluster’, ‘block’ or ‘layer’, and now we allow them to overlap
... continued over ...

‘Plaid’ model
    $y_{gs} = \alpha_g + \beta_s + \sum_h \rho_{gh} \gamma_s^{(h)} + \varepsilon_{gs}$
    $y_{gs} = \sum_h \rho_{gh} \kappa_{sh} \gamma_{gs}^{(h)} + \varepsilon_{gs}$
h denotes a ‘cluster’, ‘block’ or ‘layer’ (a pathway?)
$\rho_{gh}$ = 0 or 1 and $\kappa_{sh}$ = 0 or 1
    $\gamma_{gs}^{(h)} = \mu^{(h)} + \alpha_g^{(h)} + \beta_s^{(h)}$

[Figure: genes by samples; layers overlap (after re-ordering genes and samples).]

MacKay and Miskin model
Instead of
    $y_{gs} = \sum_h \rho_{gh} \kappa_{sh} \gamma_{gs}^{(h)} + \varepsilon_{gs}$
where h denotes a ‘cluster’, ‘block’ or ‘layer’; $\rho_{gh}$ = 0 or 1 and $\kappa_{sh}$ = 0 or 1,
MacKay and Miskin take simply
    $y_{gs} = \sum_h a_s^{(h)} b_g^{(h)} + \varepsilon_{gs}$

[Figure: genes by samples.]
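The layer structure of the plaid model, and its contrast with the MacKay and Miskin form, can be sketched by building the noise-free mean matrices directly. This is a hedged illustration: the membership matrices, sizes and random values below are invented, and the names `gene_factors` and `sample_factors` are hypothetical stand-ins for the per-layer gene and sample terms.

```python
import numpy as np

rng = np.random.default_rng(2)
G, S, H = 6, 5, 2  # hypothetical numbers of genes, samples and layers

# Binary memberships; gene 2 and samples 1, 2 belong to both layers (overlap).
rho = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]])   # rho[g, h]
kappa = np.array([[1, 0], [1, 1], [1, 1], [0, 1], [0, 0]])         # kappa[s, h]

# Layer effects gamma[g, s, h] = mu[h] + alpha_h[g, h] + beta_h[s, h].
mu = rng.normal(size=H)
alpha_h = rng.normal(size=(G, H))
beta_h = rng.normal(size=(S, H))
gamma = mu[None, None, :] + alpha_h[:, None, :] + beta_h[None, :, :]

# Plaid mean: sum over layers of rho[g, h] * kappa[s, h] * gamma[g, s, h].
plaid_mean = np.einsum("gh,sh,gsh->gs", rho, kappa, gamma)

# MacKay and Miskin form: a sum of H products of a gene term and a sample term.
gene_factors = rng.normal(size=(G, H))
sample_factors = rng.normal(size=(S, H))
factor_mean = gene_factors @ sample_factors.T
```

A cell contributes to the plaid mean only for layers that contain both its gene and its sample, so cells in the overlap receive the sum of two layer effects, while `factor_mean` has rank at most `H` with no binary memberships at all.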
Markov chain Monte Carlo (MCMC) computation
• Fitting of Bayesian models hugely facilitated by advent of these simulation methods
• Produce a large sample of values of all unknowns, ≈ from posterior given data
• Easy to set up for hierarchical models
• BUT can be slow to run (for many variables!)
• and can fail to converge reliably

Simultaneous inference
• An important example of the flexibility of MCMC computation in a Bayesian model: inference about several unknowns at once.
• e.g. not only ‘which gene has the biggest estimated differential effect?’, but also ‘how probable is it that this gene has the biggest differential effect?’

Contact details
http://www.stats.bris.ac.uk/BGX
Graeme.Ambler@bristol.ac.uk
P.J.Green@bristol.ac.uk
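Supplementary sketch for the ‘Simultaneous inference’ slide: given a sample of all unknowns from the posterior, the probability that each gene has the biggest differential effect is just the fraction of draws in which it is the largest. The draws below are not real MCMC output; they are independent normal draws with made-up means and standard deviations, standing in for a sampler's output.

```python
import numpy as np

rng = np.random.default_rng(3)
n_draws, G = 4000, 5  # hypothetical number of posterior draws and genes

# Stand-in for MCMC output: draws[i, g] is draw i of gene g's differential effect.
post_mean = np.array([0.2, 1.0, 1.1, 0.1, -0.3])   # assumed values
post_sd = np.array([0.3, 0.2, 0.5, 0.3, 0.3])      # assumed values
draws = rng.normal(post_mean, post_sd, size=(n_draws, G))

# 'Which gene has the biggest estimated differential effect?'
best_by_estimate = int(draws.mean(axis=0).argmax())

# 'How probable is it that each gene has the biggest differential effect?'
winner_per_draw = draws.argmax(axis=1)
prob_biggest = np.bincount(winner_per_draw, minlength=G) / n_draws
```

The gene with the largest posterior mean need not have an overwhelming probability of being the largest; `prob_biggest` reports that uncertainty directly.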