Markov models in molecular phylogeny and evolution




  1. Markov models in molecular phylogeny and evolution Nicolas Galtier CNRS UMR 5554 – Institut des Sciences de l’Evolution Université Montpellier 2 galtier@univ-montp2.fr

  2. Markov models in molecular phylogeny
     Generalities about Markov processes
     - Definition: Markov chains (= Markov processes) are mathematical objects that describe how a system varies in time under the (very weak) hypothesis of lack of memory: the future of the system depends only on its current state, not on the pathway that was followed to reach it.
     - A few examples:
       - discrete time, discrete states: branching process
       - discrete time, continuous states: random walks
       - continuous time, discrete states: Poisson process
       - continuous time, continuous states: Brownian motion
     - In molecular phylogeny, the states are the 4 nucleotides / 20 amino-acids / 61 codons, and the process is typically represented by a rate matrix in continuous time.

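  3. Markov models in molecular phylogeny
     Example rate matrices
     Kimura model (nucleotides), rows and columns ordered A, C, G, T, with X marking the diagonal entries:
     $$
     \begin{pmatrix}
     X & \alpha & \kappa\alpha & \alpha \\
     \alpha & X & \alpha & \kappa\alpha \\
     \kappa\alpha & \alpha & X & \alpha \\
     \alpha & \kappa\alpha & \alpha & X
     \end{pmatrix}
     $$
     [Figure: WAG model (amino-acids)]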

  4. Markov models in molecular phylogeny
     Markov models are the fundamental tool of molecular phylogeny
     Why?
     - because evolution is very generally memoryless
     - because the theory of Markov chains is well developed
     What for?
     - simulating data
     - building phylogenies that account for the evolutionary process
     - inferring the processes and learning about the forces underlying molecular evolution
     How?
     - thanks to the statistical approach in molecular phylogeny

  5. Markov models in molecular phylogeny
     The statistical approach in molecular phylogeny
     1- modelling: sequence evolution is represented by a Markov process running along a tree.
     2- computing expectations: calculate the likelihood function, i.e. the probability of the data given the model.
     3- fitting model to data: maximise the likelihood over the parameter space, and thus obtain maximum likelihood estimates for parameters;
        or calculate the posterior probability of parameters given the data and the priors (Bayesian approach).

  6. Markov models in molecular phylogeny
     Likelihood calculation in molecular phylogeny
     [Figure: tree topology T with internal nodes X0, X1, X2, X3, five leaves and branch lengths l1 to l8; rate matrix M over A, C, G, T with parameters α and β]
     Data Y, one line per site, giving the states observed in the five sequences:
       y1: A A C A G
       y2: T T C T T
       y3: A A A A A

  7. Markov models in molecular phylogeny
     Likelihood calculation in molecular phylogeny
     [Figure: tree topology T with internal nodes X0, X1, X2, X3, five leaves and branch lengths l1 to l8; rate matrix M over A, C, G, T with parameters α and β]
     Data Y, one line per site, giving the states observed in the five sequences:
       y1: A A C A G
       y2: T T C T T
       y3: A A A A A
     The likelihood is the product of the site probabilities:
     $$
     L(l_i, M, T) = \Pr(Y \mid l_i, M, T) = \prod_i \Pr(y_i \mid l_i, M, T)
     $$
     Each site probability sums over the unknown states of the internal nodes:
     $$
     \begin{aligned}
     \Pr(y_1 \mid l_i, M, T) = \sum_{x_0}\sum_{x_1}\sum_{x_2}\sum_{x_3}
       & \Pr(X_0 = x_0)\,\Pr(X_1 = x_1 \mid X_0 = x_0)\,\Pr(X_2 = x_2 \mid X_1 = x_1) \\
       & \cdot \Pr(y_{11} = A \mid X_2 = x_2)\,\Pr(y_{12} = A \mid X_2 = x_2)\,\Pr(y_{13} = C \mid X_1 = x_1) \\
       & \cdot \Pr(X_3 = x_3 \mid X_0 = x_0)\,\Pr(y_{14} = A \mid X_3 = x_3)\,\Pr(y_{15} = G \mid X_3 = x_3)
     \end{aligned}
     $$
     Felsenstein 1981 J Mol Evol 17:368
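The sum on this slide can be evaluated directly or, far more cheaply, by passing conditional likelihoods up the tree (the pruning idea of the cited Felsenstein 1981 paper). The sketch below does both for the five-leaf tree implied by the formula. It is an illustration under assumptions, not the slide's own computation: the branch lengths are made-up values, the substitution model is Jukes-Cantor with uniform root frequencies rather than the slide's α/β matrix, and all function names are hypothetical.

```python
import math
from itertools import product

STATES = "ACGT"

def jc69_prob(i, j, t):
    """Jukes-Cantor probability of going from state i to state j over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

# Tree read off the formula: X0 -> (X1, X3); X1 -> (X2, leaf 3);
# X2 -> (leaf 1, leaf 2); X3 -> (leaf 4, leaf 5).
# An internal node is a list of (branch_length, child); a leaf is an index into the site.
# Branch lengths are hypothetical.
TREE = [
    (0.10, [(0.20, [(0.10, 0), (0.10, 1)]), (0.30, 2)]),  # X1
    (0.20, [(0.10, 3), (0.40, 4)]),                       # X3
]

def conditional(node, site):
    """Return {state: Pr(leaves below node | node is in that state)}."""
    if isinstance(node, int):  # leaf: probability 1 for the observed state, 0 otherwise
        return {s: 1.0 if s == site[node] else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for length, child in node:
        below = conditional(child, site)
        for s in STATES:
            result[s] *= sum(jc69_prob(s, c, length) * below[c] for c in STATES)
    return result

def site_likelihood(site):
    """Pruning: combine the root conditionals with uniform root frequencies."""
    root = conditional(TREE, site)
    return sum(0.25 * root[s] for s in STATES)

def brute_force_likelihood(site):
    """Direct sum over x0, x1, x2, x3, term by term as in the slide formula."""
    p = jc69_prob
    total = 0.0
    for x0, x1, x2, x3 in product(STATES, repeat=4):
        total += (0.25 * p(x0, x1, 0.10) * p(x1, x2, 0.20)
                  * p(x2, site[0], 0.10) * p(x2, site[1], 0.10)
                  * p(x1, site[2], 0.30)
                  * p(x0, x3, 0.20)
                  * p(x3, site[3], 0.10) * p(x3, site[4], 0.40))
    return total

SITES = ["AACAG", "TTCTT", "AAAAA"]  # y1, y2, y3 from the slide
log_likelihood = sum(math.log(site_likelihood(s)) for s in SITES)
print(log_likelihood)
```

Both functions return the same value for any site; the pruning version scales linearly with the number of nodes, whereas the direct sum grows as 4 to the power of the number of internal nodes.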

  8. Markov models in molecular phylogeny
     Calculating transition probabilities
     $$
     P(t) = e^{Mt}
     $$
     - t is time (branch length).
     - M is the rate matrix: 1/m_ij is the average waiting time before state i changes to state j.
     - P(t) is the substitution probability matrix: p_ij(t) is the probability of observing state j after evolution during time t, starting from state i.
     Deriving this formula starts by writing equations such as the following, where A(t) is the probability of being in state A at time t:
     $$
     A(t + dt) = A(t)\,\bigl[1 - (m_{AC} + m_{AG} + m_{AT})\,dt\bigr] + C(t)\,m_{CA}\,dt + G(t)\,m_{GA}\,dt + T(t)\,m_{TA}\,dt
     $$
     Calculating the exponential of a matrix is easy when the matrix is diagonalisable.
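As a rough illustration of the diagonalisation route to P(t) = e^{Mt}, the sketch below builds a Kimura-type rate matrix and exponentiates it through its eigendecomposition with NumPy. The function names, the parameter values and the A, C, G, T ordering are assumptions made for the example.

```python
import numpy as np

def k80_rate_matrix(alpha, kappa):
    """Kimura-type rate matrix, states ordered A, C, G, T (hypothetical helper)."""
    M = np.full((4, 4), alpha, dtype=float)
    for i, j in [(0, 2), (2, 0), (1, 3), (3, 1)]:  # transitions A<->G and C<->T
        M[i, j] = kappa * alpha
    np.fill_diagonal(M, 0.0)
    np.fill_diagonal(M, -M.sum(axis=1))  # each row of a rate matrix sums to zero
    return M

def transition_probabilities(M, t):
    """P(t) = exp(M t), using M = V diag(w) V^-1 so that exp(M t) = V diag(exp(w t)) V^-1."""
    w, V = np.linalg.eig(M)
    P = V @ np.diag(np.exp(w * t)) @ np.linalg.inv(V)
    return P.real  # imaginary parts are numerical noise for a real rate matrix

M = k80_rate_matrix(alpha=1.0, kappa=2.0)
P = transition_probabilities(M, t=0.1)
print(P.round(4))
print(P.sum(axis=1))  # each row is a probability distribution
```

The eigendecomposition depends only on M, so in a likelihood calculation it can be computed once and reused for every branch length t.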

  9. Markov models in molecular phylogeny
     Using the likelihood function
     Knowing how to calculate the likelihood, we can:
     - estimate parameters by maximising it (ML = Maximum Likelihood)
     - recover details of the process using conditional likelihoods (EB = Empirical Bayesian)
     - test hypotheses by comparing models (LRT = Likelihood Ratio Test)
     The Bayesian approach can fulfil the same purposes with more complex models, if we are willing to specify prior distributions for the parameters.

  10. Markov models in molecular phylogeny
      Example biological questions requiring good usage of Markov models:
      - has my favourite protein evolved under positive selection? (codon models)
      - has it undergone any functional change? (covarion = heterotachous models)
      - has it undergone any compositional change? (non-stationary models)
      - can we detect coevolution between sites? (models of departure from independence)
      - what did the ancestral sequence look like? (empirical Bayesian)
      - which changes occurred? In which branches? (substitution mapping)
      - when did speciations occur? (relaxed-clock models)

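  11. A non-stationary model
      Tamura 1992 model, with θ = equilibrium GC-content. Rows and columns are ordered A, C, G, T, each entry is the rate from the row state to the column state, and X marks the diagonal entries:
      $$
      \begin{pmatrix}
      X & \theta\alpha & \theta\kappa\alpha & (1-\theta)\alpha \\
      (1-\theta)\alpha & X & \theta\alpha & (1-\theta)\kappa\alpha \\
      (1-\theta)\kappa\alpha & \theta\alpha & X & (1-\theta)\alpha \\
      (1-\theta)\alpha & \theta\kappa\alpha & \theta\alpha & X
      \end{pmatrix}
      $$
      [Figure: two five-taxon trees (leaves 1 to 5). Stationary, homogeneous: the same θ on every branch. Non-stationary, non-homogeneous: ω at the root and a separate θ1 to θ8 on each branch.]
      Galtier and Gouy 1998 Mol Biol Evol 15:871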
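To make the role of θ concrete, here is a small sketch that builds a Tamura 1992 style rate matrix and checks that its equilibrium GC-content is θ. It assumes the usual convention that each entry is the rate from the row state to the column state; the function names and parameter values are illustrative only.

```python
STATES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def t92_rate_matrix(theta, kappa, alpha=1.0):
    """Rates Q[i, j] from state i to state j (hypothetical helper).

    A change towards C or G is weighted by theta, a change towards A or T
    by (1 - theta); transitions carry the extra factor kappa.
    """
    Q = {}
    for i in STATES:
        for j in STATES:
            if i == j:
                continue
            target = theta if j in "CG" else 1.0 - theta
            Q[i, j] = alpha * target * (kappa if (i, j) in TRANSITIONS else 1.0)
        Q[i, i] = -sum(Q[i, j] for j in STATES if j != i)  # rows sum to zero
    return Q

def equilibrium_frequencies(theta):
    """Stationary distribution implied by an equilibrium GC-content theta."""
    return {"A": (1 - theta) / 2, "C": theta / 2, "G": theta / 2, "T": (1 - theta) / 2}

theta = 0.6
Q = t92_rate_matrix(theta, kappa=2.0)
pi = equilibrium_frequencies(theta)

# Stationarity: the net flux into every state is zero (pi Q = 0).
flux = {j: sum(pi[i] * Q[i, j] for i in STATES) for j in STATES}
print(flux)
print(pi["C"] + pi["G"])  # equilibrium GC-content
```

In the non-homogeneous setting sketched on the slide, one such matrix would be built per branch, each with its own θ.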

  12. A non-stationary model
      Accuracy of ancestral GC% estimation (simulations)
      Conditions: low GCanc (10-25%), medium sequence GC (~40%), high eqGC (90%)

      actual | MP  | NHML
      18%    | 32% | 19%
      10%    | 27% | 11%
      22%    | 40% | 21%
      14%    | 30% | 16%
      14%    | 28% | 15%

  13. A non-stationary model
      Optimal growth temperature versus rRNA GC% in prokaryotes
      [Figure: two panels, LSU and SSU; y axis: Topt (0 to 80); x axis: rRNA G+C-content (50 to 70)]

  14. A non-stationary model
      The rRNA universal tree of life
      [Figure: rRNA tree labelled with the GC% of each lineage; estimated ancestral GC%: 56.1%]
      - EUCARYA: Giardia 70.4%, Entamoeba 43.7%, Euglena 51.7%, FUNGI 48.6%, PLANTA 50.4%, METAZOA 52.4%
      - CRENARCHAE: Desulfurococcus 64.2%, Thermoproteus 63.5%
      - EURYARCHAE: M. vannielii 57.7%, M. jannaschii 62.3%, Halococcus 58.9%, Halobacterium 58.7%
      - BACTERIA: LOW GC GRAM+ 54.2%, CHLOROPLASTS 52.5%, PROTEOBACTERIA 54.1%, HIGH GC GRAM+ 57.0%, Thermus 61.3%, Thermotoga 60.9%

  15. A non-stationary model
      A non-hyperthermophilic ancestor?
      [Figure: two panels, LSU and SSU; y axis: Topt (0 to 80); x axis: rRNA G+C-content (50 to 70)]

  16. A non-stationary model
      Controlling for species sampling
      [Figure: two rRNA trees labelled with the GC% of each lineage]
      First tree, estimated ancestral GC%: 56.1%
      - EUCARYA: Giardia 70.4%, Entamoeba 43.7%, Euglena 51.7%, FUNGI 48.6%, PLANTA 50.4%, METAZOA 52.4%
      - CRENARCHAE: Desulfurococcus 64.2%, Thermoproteus 63.5%
      - EURYARCHAE: M. vannielii 57.7%, M. jannaschii 62.3%, Halococcus 58.9%, Halobacterium 58.7%
      - BACTERIA: LOW GC GRAM+ 54.2%, CHLOROPLASTS 52.5%, PROTEOBACTERIA 54.1%, HIGH GC GRAM+ 57.0%, Thermus 61.3%, Thermotoga 60.9%
      Second tree (two species per group), estimated ancestral GC%: 57.3%
      - Eukaryote 1 70.9%, Eukaryote 2 70.9%
      - Crenarchae 1 65.4%, Crenarchae 2 65.1%
      - Euryarchae 1 65.2%, Euryarchae 2 65.0%
      - Bacteria 1 63.2%, Bacteria 2 62.3%

  17. A non-stationary model
      A non-hyperthermophilic ancestor?
      [Figure: two panels, LSU and SSU; y axis: Topt (0 to 80); x axis: rRNA G+C-content (50 to 70)]
      Galtier et al. 1999 Science 283:221

  18. Codon models, positive selection
      The standard genetic code (first base in the left column, second base across the top)

           T            C            A             G
      T    TTT → Phe    TCT → Ser    TAT → Tyr     TGT → Cys
           TTC → Phe    TCC → Ser    TAC → Tyr     TGC → Cys
           TTA → Leu    TCA → Ser    TAA → Stop    TGA → Stop
           TTG → Leu    TCG → Ser    TAG → Stop    TGG → Trp
      C    CTT → Leu    CCT → Pro    CAT → His     CGT → Arg
           CTC → Leu    CCC → Pro    CAC → His     CGC → Arg
           CTA → Leu    CCA → Pro    CAA → Gln     CGA → Arg
           CTG → Leu    CCG → Pro    CAG → Gln     CGG → Arg
      A    ATT → Ile    ACT → Thr    AAT → Asn     AGT → Ser
           ATC → Ile    ACC → Thr    AAC → Asn     AGC → Ser
           ATA → Ile    ACA → Thr    AAA → Lys     AGA → Arg
           ATG → Met    ACG → Thr    AAG → Lys     AGG → Arg
      G    GTT → Val    GCT → Ala    GAT → Asp     GGT → Gly
           GTC → Val    GCC → Ala    GAC → Asp     GGC → Gly
           GTA → Val    GCA → Ala    GAA → Glu     GGA → Gly
           GTG → Val    GCG → Ala    GAG → Glu     GGG → Gly

  19. Codon models, positive selection
      The Goldman-Yang codon model
      $$
      m_{XY} = \begin{cases}
      0 & \text{if codon } X \text{ and codon } Y \text{ differ by more than one base} \\
      \beta\,\pi_Y & \text{if codon } X \text{ and codon } Y \text{ differ by one synonymous transversion} \\
      \beta\,\omega\,\pi_Y & \text{if codon } X \text{ and codon } Y \text{ differ by one nonsynonymous transversion} \\
      \alpha\,\pi_Y & \text{if codon } X \text{ and codon } Y \text{ differ by one synonymous transition} \\
      \alpha\,\omega\,\pi_Y & \text{if codon } X \text{ and codon } Y \text{ differ by one nonsynonymous transition}
      \end{cases}
      $$
      ω is the parameter of interest:
      - ω = 1 in case of neutral evolution
      - ω < 1 in case of negative selection (constraint)
      - ω > 1 in case of positive selection (adaptation)
      Goldman & Yang 1994 Mol Biol Evol 11:725
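The case analysis on this slide maps naturally onto a small function. The sketch below is one possible reading, with some assumptions stated up front: π_Y is taken to be the frequency of the target codon (set to uniform over the 61 sense codons here), stop codons are given rate 0 because they are not states of the model, and the names `gy_rate`, `CODE` and `SENSE` are invented for the example.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
SENSE = [codon for codon, aa in CODE.items() if aa != "*"]  # the 61 states
PURINES = set("AG")

def gy_rate(x, y, alpha, beta, omega, pi):
    """Off-diagonal rate m_XY from codon x to codon y (hypothetical helper).

    alpha: transition rate, beta: transversion rate,
    omega: nonsynonymous/synonymous rate ratio, pi: target codon frequencies.
    """
    if CODE[x] == "*" or CODE[y] == "*":
        return 0.0  # assumption: stop codons are excluded from the state space
    diffs = [(a, b) for a, b in zip(x, y) if a != b]
    if len(diffs) != 1:
        return 0.0  # codons differ at more than one position (or are identical)
    a, b = diffs[0]
    is_transition = (a in PURINES) == (b in PURINES)  # purine<->purine or pyrimidine<->pyrimidine
    rate = (alpha if is_transition else beta) * pi[y]
    if CODE[x] != CODE[y]:
        rate *= omega  # nonsynonymous change
    return rate

pi = {codon: 1.0 / len(SENSE) for codon in SENSE}  # uniform frequencies, for illustration
print(gy_rate("TTT", "TTC", alpha=2.0, beta=1.0, omega=0.5, pi=pi))  # synonymous transition
print(gy_rate("TTT", "TTA", alpha=2.0, beta=1.0, omega=0.5, pi=pi))  # nonsynonymous transversion
```

A full 61 x 61 rate matrix would fill the off-diagonal entries with this function and set each diagonal entry to minus the sum of its row.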

  20. Codon models, positive selection
      Primate lysozyme evolution
      - Model 0: ω0 = ωC; ln(L) = -1043.84; ω0 = ωC = 0.574
      - Model 1: ω0 ≠ ωC; ln(L) = -1041.70; ω0 = 0.489; ωC = 3.383
      Yang 1998 Mol Biol Evol 15:568

  21. Codon models, positive selection
      The likelihood ratio test (LRT)
      LRTs are used to decide whether the increase in likelihood obtained by adding parameters (= degrees of freedom) to a model is significant.
      Let M0 and M1 be two nested models: M0 (p parameters) is a special instance of M1 (p + n parameters).
      Let L0 and L1 be the maximum likelihoods under M0 and M1, respectively.
      Twice the log-likelihood ratio is asymptotically χ2 distributed (n degrees of freedom) under M0:
      $$
      2 \log(L_1 / L_0) \sim \chi^2 \ (n \text{ df})
      $$
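As a worked example, the test can be applied to the two lysozyme log-likelihoods quoted on slide 20. The sketch assumes n = 1, reading Model 1 as adding a single parameter (a separate ωC) to Model 0, and uses the closed form of the χ2 tail probability that holds for one degree of freedom only; the function names are illustrative.

```python
import math

def lrt_statistic(lnL0, lnL1):
    """Twice the log-likelihood ratio, 2 * log(L1 / L0), from the two log-likelihoods."""
    return 2.0 * (lnL1 - lnL0)

def chi2_tail_1df(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom.

    Valid for 1 df only: if Z is standard normal, Z**2 is chi-square(1),
    so P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))

lnL0 = -1043.84  # Model 0: omega_0 = omega_C
lnL1 = -1041.70  # Model 1: omega_0 != omega_C

stat = lrt_statistic(lnL0, lnL1)
p_value = chi2_tail_1df(stat)  # assumption: n = 1 extra parameter
print(stat, p_value)
```

The statistic is 2 x 2.14 = 4.28, which exceeds the usual 5% critical value of 3.84 for one degree of freedom, so under these assumptions Model 0 would be rejected at the 5% level.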
