Systems genetics with graphical Markov models Robert Castelo robert.castelo@upf.edu @robertclab Dept. of Experimental and Health Sciences (DCEXS) Universitat Pompeu Fabra (UPF) Barcelona Machine Learning for Personalized Medicine Satellite Symposium of the ESHG Conference Barcelona, May 19th, 2016 Robert Castelo - robert.castelo@upf.edu - @robertclab Systems genetics with GMMs 1 / 63
DCEXS/UPF is located at the Barcelona Biomedical Research Park (PRBB)
Joint work with Inma Tur (Kernel Analytics, Barcelona) and Alberto Roverato (University of Bologna).
I. Tur, A. Roverato and R. Castelo. Mapping eQTL networks with mixed graphical Markov models. Genetics, 198(4):1377-1383, 2014. http://arxiv.org/abs/1402.4547
Motivation - Quantitative genetics
Primary goal: finding the genetic basis of complex (quantitative) higher-order phenotypes (traits).
[Figure: intercross design with parental strains P1 and P2, the F1 generation and the F2 generation. Fig. by Karl Broman in "Introduction to QTL mapping in model organisms".]
[Figure: density plot of HDL values, ranging from about 60 to 180.]
Leduc et al. Using bioinformatics and systems genetics to dissect HDL-cholesterol genetics in an MRL/MpJ x SM/J intercross. Journal of Lipid Research, 53:1163-1175, 2012.
Motivation - Quantitative genetics
Find DNA sites along the genome associated with the phenotype, known as quantitative trait loci (QTLs).
Simplest approach: regress the phenotype on each marker (Soller, 1976), calculating the so-called logarithm of odds (LOD) score:
$$H_0: y_i \sim N(\mu_0, \sigma_0^2), \qquad H_1: y_i \mid g_i \sim N(\mu_{g_i}, \sigma_1^2).$$
$$\mathrm{LOD} = \log_{10}\frac{L_1}{L_0} = \frac{n}{2}\,\log_{10}\frac{\mathrm{RSS}_0}{\mathrm{RSS}_1}.$$
[Figure: LOD score (0 to 12) along chromosomes 1 to 19 and X.]
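The single-marker LOD score described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the null model fits one common mean, the alternative fits one mean per genotype class, and the function names (`rss`, `lod_score`) and the toy HDL values are made up for this example, not taken from the talk or from any package.

```python
import math

def rss(values, fitted):
    """Residual sum of squares of the values around their fitted means."""
    return sum((y - f) ** 2 for y, f in zip(values, fitted))

def lod_score(phenotype, genotype):
    """LOD = (n / 2) * log10(RSS0 / RSS1) for a single marker."""
    n = len(phenotype)
    # Null model H0: one common mean for all individuals.
    grand_mean = sum(phenotype) / n
    rss0 = rss(phenotype, [grand_mean] * n)
    # Alternative model H1: one mean per genotype class.
    groups = {}
    for y, g in zip(phenotype, genotype):
        groups.setdefault(g, []).append(y)
    group_mean = {g: sum(ys) / len(ys) for g, ys in groups.items()}
    rss1 = rss(phenotype, [group_mean[g] for g in genotype])
    # Note: rss1 must be positive, i.e. some within-genotype variation.
    return (n / 2) * math.log10(rss0 / rss1)

# Toy data, invented for illustration only.
hdl = [80.0, 85.0, 90.0, 100.0, 105.0, 110.0, 120.0, 125.0, 130.0]
geno = ["MM", "MM", "MM", "MS", "MS", "MS", "SS", "SS", "SS"]
lod = lod_score(hdl, geno)
```

In a genome scan, the same function would be applied once per marker and the resulting scores plotted along the chromosomes.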
Motivation - Quantitative genetics
Estimate the effect size of the QTLs found using, for instance, the percentage of variance explained by the QTL:
$$\eta^2 = \frac{\mathrm{RSS}_0 - \mathrm{RSS}_1}{(n-1)\, s_Y^2} = 0.346.$$
About 35% of the variability in HDL levels is explained by this QTL.
[Figure: HDL (80 to 160) by genotype (MM, MS, SS).]
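As a short worked note, the variance explained and the LOD score are two views of the same ratio of residual sums of squares. The derivation below assumes, as in the single-marker model, that $\mathrm{RSS}_0$ is the residual sum of squares around the overall mean, so that it equals $(n-1)\, s_Y^2$:

```latex
\begin{align*}
\eta^2 &= \frac{\mathrm{RSS}_0 - \mathrm{RSS}_1}{(n-1)\, s_Y^2}
        = \frac{\mathrm{RSS}_0 - \mathrm{RSS}_1}{\mathrm{RSS}_0}
        = 1 - \frac{\mathrm{RSS}_1}{\mathrm{RSS}_0}, \\
\mathrm{LOD} &= \frac{n}{2}\,\log_{10}\frac{\mathrm{RSS}_0}{\mathrm{RSS}_1}
        = -\frac{n}{2}\,\log_{10}\left(1 - \eta^2\right).
\end{align*}
```

For a fixed sample size $n$, a larger LOD score therefore always corresponds to a larger fraction of variance explained.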
Motivation - Quantitative genetics on genomics data
Yeast BY x RM cross (Fig. by Rockman and Kruglyak, 2006).
The resulting data published by Brem and Kruglyak (2005) consist of ~6,000 genes and ~3,000 genotype markers.
DNA sites along the genome associated with gene expression are called expression QTLs (eQTLs).
Motivation - Quantitative genetics on genomics data
Straightforward approach: apply classical QTL analysis methods independently on each gene expression profile (Soller, 1976):
$$H_0: y \sim N(\mu_0, \sigma_0^2), \qquad H_1: y \mid g \sim N(\mu_g, \sigma_1^2).$$
$$\mathrm{LOD} = \log_{10}\frac{L_1}{L_0} = \frac{n}{2}\,\log_{10}\frac{\mathrm{RSS}_0}{\mathrm{RSS}_1}.$$
Plot the location of genome-wide significant eQTLs with respect to both eQTL and gene genomic positions (dot plot).
Motivation - Quantitative genetics on genomics data
Let $\Gamma$ denote an index set for all genes, with $p_\Gamma = |\Gamma|$ (thousands).
Let $n$ denote the number of profiled individuals (tens, hundreds).
Let $Y = \{y_{ij}\}_{p_\Gamma \times n}$ denote the matrix of gene expression values, with $p_\Gamma \gg n$:
$$Y = \begin{array}{c|cccc}
 & 1 & 2 & \cdots & n \\ \hline
g_1 & y_{11} & y_{12} & \cdots & y_{1n} \\
g_2 & y_{21} & y_{22} & \cdots & y_{2n} \\
g_3 & y_{31} & y_{32} & \cdots & y_{3n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
g_{p_\Gamma} & y_{p_\Gamma 1} & y_{p_\Gamma 2} & \cdots & y_{p_\Gamma n}
\end{array}$$
Gene expression is a high-dimensional multivariate trait.
Motivation - Quantitative genetics on genomics data
Gene expression measurements by high-throughput instruments are the result of multiple types of effects:
Genetic: DNA polymorphisms affecting transcription initiation and RNA processing.
Molecular: RNA-binding events affecting post-transcriptional regulation (e.g., RNA degradation).
Environmental: response of the cell to external stimuli.
Technical: sample preparation protocols or laboratory conditions create sample-specific biases affecting most of the genes.
All these effects render the expression measurements in $Y$ highly correlated, thereby complicating the distinction between direct and indirect effects.
Motivation - Quantitative genetics on genomics data
Think of genes and eQTLs as forming a network, which we shall call an eQTL network.
[Figure: eQTL network with vertices g5, g15, g22 and QTL2, next to the LOD score profiles of g5, g15 and g22 along the map position (cM).]
Assume that gene expression forms a $p_\Gamma$-multivariate sample following a conditional Gaussian distribution given the joint probability of all eQTLs $\Rightarrow$ mixed graphical Markov model (Lauritzen and Wermuth, 1989).
Software availability: the R/Bioconductor package qpgraph
Available at http://bioconductor.org/packages/qpgraph
Outline
1. Overview of GMMs
2. Propagation of eQTL (genetic) additive effects
3. Conditional independence in mixed GMMs
4. q-Order correlation graphs
5. A three-step estimation strategy
6. Visualization of eQTL networks
7. Analysis of a yeast cross
8. Concluding remarks
Overview of GMMs - undirected Gaussian GMMs
Let $X_V$ be continuous r.v.'s and $G = (V, E)$ an undirected labeled graph:
$V = \{1, \ldots, p\}$ are the vertices of $G$
$X_V \sim P(X_V) \equiv N(\mu, \Sigma)$
$\mu$ is the $p$-dimensional mean vector
$\Sigma = \{\sigma_{ij}\}_{p \times p}$ is the covariance matrix
$\Sigma^{-1} = \{\kappa_{ij}\}_{p \times p}$ is the concentration matrix
Note that Pearson and partial correlation coefficients follow from scaling the covariance ($\Sigma$) and concentration ($\Sigma^{-1}$) matrices, respectively:
$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\,\sigma_{jj}}}, \qquad \rho_{ij.R} = \frac{-\kappa_{ij}}{\sqrt{\kappa_{ii}\,\kappa_{jj}}}, \qquad R = V \setminus \{i, j\}.$$
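The two scaling formulas above can be checked numerically. The sketch below is plain Python and purely illustrative: the helper names (`invert`, `pearson`, `partial`) are hypothetical, and the 3 x 3 covariance matrix is a made-up chain example (X1 - X2 - X3) chosen so that the concentration matrix has a zero in cell (1, 3).

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    p = len(matrix)
    # Augment the matrix with the identity.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]

def pearson(sigma, i, j):
    """Pearson correlation: scaled covariance."""
    return sigma[i][j] / math.sqrt(sigma[i][i] * sigma[j][j])

def partial(kappa, i, j):
    """Partial correlation given all remaining variables: scaled, negated concentration."""
    return -kappa[i][j] / math.sqrt(kappa[i][i] * kappa[j][j])

# Made-up covariance of a chain X1 - X2 - X3 (indices are 0-based in the code).
sigma = [[1.0, 0.5, 0.25],
         [0.5, 1.0, 0.5],
         [0.25, 0.5, 1.0]]
kappa = invert(sigma)

rho_13 = pearson(sigma, 0, 2)           # marginal correlation of X1 and X3
rho_13_given_2 = partial(kappa, 0, 2)   # partial correlation of X1 and X3 given X2
rho_12_given_3 = partial(kappa, 0, 1)   # partial correlation of X1 and X2 given X3
```

In this example X1 and X3 are marginally correlated, yet their partial correlation given X2 vanishes (up to rounding), which is the kind of indirect effect the concentration matrix is meant to expose.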
Overview of GMMs - undirected Gaussian GMMs
Let $G = (V, E)$ be an undirected graph with $V = \{1, \ldots, p\}$. A Gaussian graphical model can be described as follows:
[Figure: undirected graph on vertices 1 to 5 with edges 1-2, 2-3, 2-4, 3-5 and 4-5.]
$$\Sigma^{-1} = \begin{pmatrix}
\kappa_{11} & \kappa_{12} & 0 & 0 & 0 \\
\kappa_{21} & \kappa_{22} & \kappa_{23} & \kappa_{24} & 0 \\
0 & \kappa_{32} & \kappa_{33} & 0 & \kappa_{35} \\
0 & \kappa_{42} & 0 & \kappa_{44} & \kappa_{45} \\
0 & 0 & \kappa_{53} & \kappa_{54} & \kappa_{55}
\end{pmatrix}$$
A probability distribution $P(X_V)$ is undirected Markov w.r.t. $G$ if
$$(i, j) \notin E \;\Rightarrow\; \kappa_{ij} = 0 \;\Leftrightarrow\; X_i \perp\!\!\!\perp X_j \mid X_V \setminus \{X_i, X_j\}.$$
These models are also known as covariance selection models (Dempster, 1972) or concentration graph models (Cox and Wermuth, 1996).
Two vertices $i$ and $j$ are separated in $G$ by a subset $S \subset V \setminus \{i, j\}$ iff every path between $i$ and $j$ intersects $S$, denoted hereafter by $i \perp_G j \mid S$.
Global Markov property (Hammersley and Clifford, 1971):
$$i \perp_G j \mid S \;\Rightarrow\; X_i \perp\!\!\!\perp X_j \mid X_S.$$
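Graph separation, as defined above, can be tested with a breadth-first search that is not allowed to walk through vertices in S. The sketch below is a minimal illustration: the function name `separated` is hypothetical, it assumes that i and j are not in S, and the edge list is the five-vertex graph implied by the zero pattern of the concentration matrix on this slide.

```python
from collections import deque

def separated(edges, i, j, s):
    """Return True if every path between i and j passes through a vertex in s.

    Assumes that i and j themselves are not members of s.
    """
    blocked = set(s)
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {i}
    queue = deque([i])
    while queue:
        v = queue.popleft()
        for w in adjacency.get(v, ()):
            if w == j:
                return False  # reached j without crossing s
            if w not in seen and w not in blocked:
                seen.add(w)
                queue.append(w)
    return True

# Edges matching the nonzero off-diagonal cells of the concentration matrix.
edges = [(1, 2), (2, 3), (2, 4), (3, 5), (4, 5)]

one_five_given_two = separated(edges, 1, 5, {2})        # vertex 2 cuts 1 from the rest
three_four_given_two = separated(edges, 3, 4, {2})      # path 3-5-4 remains open
three_four_given_both = separated(edges, 3, 4, {2, 5})  # both paths are blocked
```

By the global Markov property, each separation found this way translates into a conditional independence statement among the corresponding variables.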
Overview of GMMs - undirected Gaussian GMMs
Consider simulating an undirected Gaussian GMM by simulating a covariance matrix $\Sigma$ such that
1. $\Sigma$ is positive definite ($\Sigma \in S^+$),
2. the off-diagonal cells of the scaled $\Sigma$ corresponding to the present edges in $G$ match a given marginal correlation $\rho$,
3. the zero pattern of $\Sigma^{-1}$ matches the missing edges in $G$.
This is not straightforward, since directly setting off-diagonal cells to zero in some initial $\Gamma \in S^+$ will not typically lead to a positive definite matrix.
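A small numerical example illustrates why zeroing cells directly is problematic. The sketch below is plain Python with a hypothetical helper `is_positive_definite`, which attempts a Cholesky factorization; the 3 x 3 matrices and the value 0.9 are made up for illustration.

```python
import math

def is_positive_definite(matrix):
    """Attempt a Cholesky factorization of a symmetric matrix.

    The factorization succeeds exactly when the matrix is positive definite.
    """
    p = len(matrix)
    lower = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                d = matrix[i][i] - s
                if d <= 0.0:
                    return False  # non-positive pivot: not positive definite
                lower[i][j] = math.sqrt(d)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True

rho = 0.9
# Equicorrelation matrix: positive definite for this value of rho.
full = [[1.0, rho, rho],
        [rho, 1.0, rho],
        [rho, rho, 1.0]]
# Same matrix after setting cells (1, 3) and (3, 1) to zero.
zeroed = [[1.0, rho, 0.0],
          [rho, 1.0, rho],
          [0.0, rho, 1.0]]

full_ok = is_positive_definite(full)
zeroed_ok = is_positive_definite(zeroed)
```

The determinant of the zeroed matrix is 1 - 2 * 0.81 = -0.62, so it cannot be a covariance matrix, even though the matrix it was derived from is a valid one.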
Overview of GMMs - undirected Gaussian GMMs
Let $\Gamma_G$ be an incomplete matrix with elements $\{\gamma_{ij}\}$ for $i = j$ or $(i, j) \in G$.
[Figure: undirected graph on vertices 1 to 4 with edges 1-2, 1-3, 2-4 and 3-4.]
$$\Gamma_G = \begin{pmatrix}
\gamma_{11} & \gamma_{12} & \gamma_{13} & * \\
\gamma_{21} & \gamma_{22} & * & \gamma_{24} \\
\gamma_{31} & * & \gamma_{33} & \gamma_{34} \\
* & \gamma_{42} & \gamma_{43} & \gamma_{44}
\end{pmatrix}$$
$\Gamma$ is a positive completion of $\Gamma_G$ if $\Gamma \in S^+$ and $\{\Gamma^{-1}\}_{ij} = 0$ for $i \neq j$, $(i, j) \notin G$.
Draw $\Gamma_G$ from a Wishart distribution $W_p(\Lambda, p)$; $\Lambda = \Delta R \Delta$, $\Delta = \mathrm{diag}(\{\sqrt{1/p}\}_p)$ and $R = \{R_{ij}\}_{p \times p}$, where $R_{ij} = 1$ for $i = j$ and $R_{ij} = \rho$ for $i \neq j$.
It is required that $\Lambda \in S^+$, and this happens if and only if $-1/(p-1) < \rho < 1$.
Finally, to obtain $\Sigma \equiv \Gamma$ from $\Gamma_G$, qpgraph uses the regression algorithm by Hastie, Tibshirani and Friedman (2009, pg. 634) as the matrix completion algorithm.
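The admissible range for $\rho$ stated above follows from the eigenvalues of the equicorrelation matrix $R$. The short derivation below is a supporting note rather than part of the original slides; it writes $\mathbf{1}$ for the all-ones vector and $I$ for the identity matrix, and only uses the fact that $\Delta$ is diagonal with positive entries.

```latex
\begin{align*}
R &= (1-\rho)\, I + \rho\, \mathbf{1}\mathbf{1}^{\top}, \\
R\,\mathbf{1} &= \bigl(1 + (p-1)\rho\bigr)\,\mathbf{1}, \\
R\,v &= (1-\rho)\, v \quad \text{for every } v \text{ with } \mathbf{1}^{\top} v = 0 .
\end{align*}
% Eigenvalues: 1 + (p-1)\rho (once) and 1 - \rho (with multiplicity p-1).
\[
R \in S^+ \iff 1 + (p-1)\rho > 0 \ \text{and}\ 1 - \rho > 0
\iff -\frac{1}{p-1} < \rho < 1 .
\]
```

Since $\Lambda = \Delta R \Delta$ with $\Delta$ diagonal and positive, $x^{\top} \Lambda x = (\Delta x)^{\top} R\, (\Delta x)$, so $\Lambda$ is positive definite exactly when $R$ is.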