Likelihood in Molecular Phylogenetics

Peter G. Foster
The Natural History Museum, London
Lausanne, September 2003

"The likelihood supplies a natural order of preferences among the possibilities under consideration."
- R.A. Fisher, 1956

Likelihood in molecular phylogenetics

• Why use likelihood?
• Simple likelihood calculations
• Choosing a model
• Practicals using PAUP

Reference 0

Swofford, Olsen, Waddell, and Hillis, 1996, in Hillis et al., Molecular Systematics.

Why use likelihood?

• Models take into account branch lengths
  – Accurate branch lengths even if there are superimposed hits (i.e. more than one mutation at the same site)
• Models are explicit
  – assumptions are stated, not hidden
• You can make the model fit the data
• Likelihood is efficient and powerful
  – it uses all the data

You can make the model fit the data

If the data . . .
• have an unusual composition
• have a transition/transversion ratio different from 1
• have both quickly and slowly evolving sites
. . . you can use that information in your model.

Likelihood uses all the data

A acgcaa
B acataa
C atgtca
D gcgtta

• There are no parsimony informative sites in these data
• Under unweighted parsimony, all three possible trees have zero length

[Figure: the three possible unrooted trees for A, B, C, and D: (A,B | C,D), (A,C | B,D), and (A,D | B,C)]

• Although there are no parsimony informative sites, there appear to have been several evolutionary events, which should provide useful phylogenetic information if we could use it.
• It appears that transitions are more common than transversions
• The constant site provides useful information regarding the tendency of the a to stay the same.
• If we use this information, then one tree is more optimal than the other two.

What is the ancestral state?

Consider one site on this tree, with these character states:

[Figure: a tree with tip states a, a, c, g, t and an unknown ancestral state "?"]

• Ancestral state "a" is most parsimonious
• Wide character state variation together with short branches tells us that this is a fast site, so we should expect a large amount of change over the long branches.
• Likelihood is equivocal about the ancestral state (it could be anything)
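The claim that the four-sequence alignment above has no parsimony informative sites can be checked by hand, or with a short script. The following is a minimal sketch, not part of the original slides: the function name `classify_site` and the class labels are made up for illustration. It uses the usual definition that a site is parsimony informative when at least two character states each occur in at least two taxa.

```python
from collections import Counter

# The four-taxon alignment from the "Likelihood uses all the data" slide.
alignment = {
    "A": "acgcaa",
    "B": "acataa",
    "C": "atgtca",
    "D": "gcgtta",
}


def classify_site(column):
    """Classify one alignment column (hypothetical helper for illustration)."""
    counts = Counter(column)
    if len(counts) == 1:
        return "constant"
    # Informative: at least two states, each present in at least two taxa.
    shared_states = [state for state, n in counts.items() if n >= 2]
    if len(shared_states) >= 2:
        return "informative"
    return "variable-uninformative"


# Transpose the sequences into columns (sites) and classify each one.
columns = list(zip(*alignment.values()))
site_classes = [classify_site(col) for col in columns]

for number, (col, site_class) in enumerate(zip(columns, site_classes), start=1):
    print(number, "".join(col), site_class)
```

Every variable site here is a singleton or has only one shared state, so parsimony cannot use any of them, while a likelihood model can still draw on all six sites.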
Branch lengths under parsimony and likelihood

• Parsimony considers that you would have the same expectation that a character would change along both long and short branches.
• Likelihood and distance methods, using models, consider that change is more probable along long branches than along short branches.

Molecules do not evolve like morphological characters

• Molecular sequences appear to evolve mostly by random change, with a small amount of selection.
• This behavior can be described well by stochastic models which incorporate among-site rate variation.
• This allows us to use probabilistic methods in our analyses
  – and puts our analyses on a sound statistical footing.

Likelihood is appropriate for data generated by a random process

These data were probably not generated by a random process

000000000000
010301001000
222022100100
131130010011

• There are no constant sites.
• There is an obvious ancestral taxon.
• Some characters are binary, some are multi-state

Simple likelihood calculations

Likelihood

In general...
The likelihood is the probability of the data given the model.

In phylogenetics, we can say (loosely) that the tree is part of the model.
The likelihood is the probability of the data given the tree and the model.

Flip a coin, get a "head"

What is the likelihood of that data?
• The likelihood depends on the model
• If you think it's a fair coin, the likelihood of the data is 0.5
• If you think it is a two-headed coin, the likelihood of the data is 1.0
• ...So the model that you use can have a big effect on the likelihood

Likelihood calculations

• In molecular phylogenetics, the data are an alignment of sequences
• We optimize parameters and branch lengths to get the maximum likelihood
• Each site has a likelihood
  – this differs depending on the model and tree
• The total likelihood is the product of the site likelihoods
  – or the sum of the log of the site likelihoods
• The maximum likelihood tree is the tree topology that gives the highest (optimized) likelihood under the given model.
• We use reversible models, so the position of the root does not matter.

Reference 1

P. G. Foster 2001. "The Idiot's Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed"
• Read this only if you want to know where the numbers come from
• Elementary likelihood calculations and definitions
• Probability and rate matrices
• Finding the maximum likelihood branch length
• Calculating likelihood values on a tree
• Checking that PAUP* gets the correct likelihood values

Choosing a model

• Don't "assume" a model
• Rather, find a model that fits your data.

Models are described in terms of...
• the tendency of one base to change to another
• the composition
• site-to-site rate variation

Models often have "free" parameters. These can be fixed to a reasonable value, or estimated by ML.

Tendency of one base to change to another

• This can be described by a rate matrix
• The most complex in PAUP is the GTR, general time-reversible
• Other models are simplifications of this
  – HKY, F81, K2P, JC, etc ...
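The coin example and the rule that the total likelihood is the product of the site likelihoods (or the sum of their logs) can be sketched in a few lines. This is an illustration only: `coin_likelihood` is a made-up helper, and the per-site likelihood values are invented numbers, not results from any real alignment or tree.

```python
import math


def coin_likelihood(p_head, outcome):
    """Probability of one coin-flip outcome, given a coin model (hypothetical helper)."""
    return p_head if outcome == "head" else 1.0 - p_head


# Same data (one "head"), two different models.
fair = coin_likelihood(0.5, "head")        # fair coin
two_headed = coin_likelihood(1.0, "head")  # two-headed coin
print(fair, two_headed)

# HYPOTHETICAL per-site likelihoods, made up purely for illustration.
site_likelihoods = [0.02, 0.5, 0.001, 0.3]

# Total likelihood: product of the site likelihoods ...
total_likelihood = math.prod(site_likelihoods)
# ... or, equivalently, the sum of the logs of the site likelihoods.
total_log_likelihood = sum(math.log(x) for x in site_likelihoods)

print(total_likelihood, total_log_likelihood)
```

Working with the sum of logs is the usual choice in practice, because the product of many small site likelihoods quickly becomes too small to represent as a floating-point number.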
GTR: General time-reversible model

$$
R = \begin{pmatrix}
- & a & b & c \\
a & - & d & e \\
b & d & - & f \\
c & e & f & -
\end{pmatrix}
$$

• Symmetrical, so time-reversible
• There are 6 substitution types (lset nst=6), so 5 free parameters
• You can restrict these using the rclass subcommand in lset
  – e.g. lset rclass=(a b c c b a) to make a subset with only 3 substitution types
  – The program modeltest uses rclass a lot, see the file modelblock3

Base frequencies (composition)

• equal
• specified
• empirical
  – often a good approximation to ML-estimated, and much faster
• estimated by ML
• For DNA, there are 4 compositions, so 3 free parameters

Among-site rate heterogeneity

• pInvar
• gamma-distributed variable sites
  – has an average rate of 1.0
  – shape can change greatly with only one parameter (α, shape in PAUP)
  – approximated with a discrete gamma distribution with nCat divisions
• pInvar + gamma
• site-specific
  – good for codons

Parameters

• Models differ in their free, i.e. adjustable, parameters
• More parameters are often necessary to better approximate the reality of evolution
• The more free parameters, the better the fit (higher the likelihood) of the model to the data. (Good!)
• The more free parameters, the higher the variance, and the less power to discriminate among competing hypotheses. (Bad!)
• We do not want to "over-fit" the model to the data

What is the best way to fit a line (a model) through these points?

[Figure: scatter plot of points]

How to tell if adding (or removing) a certain parameter is a good idea?

• Use statistics
• The null hypothesis is that the presence or absence of the parameter makes no difference
• In order to assess significance you need a null distribution

Is it worth adding a parameter? An example

We have some DNA data, and a tree. Evaluate the data with 3 different models.

model   ln likelihood   ∆
JC      -2348.68
K2P     -2256.73        91.95
GTR     -2254.94        1.79

• Evaluations with more complex models have higher likelihoods
• The K2P model has 1 more parameter than the JC model, the tRatio.
• The GTR model has 4 more parameters than the K2P model
• Are the extra parameters worth adding?

Is the K2P model better than the JC model for these data?

• Null hypothesis (generally): the extra parameter does not make any difference
• Null hypothesis (specifically): the tree and the JC model
• We need to know how much of an improvement in likelihood we can expect from true null hypothesis data when we add the tratio parameter
  – The increase in likelihood will be due to noise, only
• We need a null distribution, which we can get by simulating data many times under the null hypothesis
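The comparison against a simulated null distribution can be sketched as follows. The log likelihoods are the ones from the table in the example above; everything else is an assumption made for illustration. In particular, the list `null_improvements` is invented (in a real analysis each value would come from simulating a data set under the tree and the JC model, then evaluating it under both JC and K2P), and the helper names and the add-one correction in the p-value are choices of this sketch, not something the slides specify.

```python
# Log likelihoods from the table in the example above.
ln_likelihood = {"JC": -2348.68, "K2P": -2256.73, "GTR": -2254.94}


def improvement(simpler, richer):
    """Gain in ln likelihood from moving to the richer model (hypothetical helper)."""
    return ln_likelihood[richer] - ln_likelihood[simpler]


def empirical_p_value(observed, null_values):
    """Fraction of null improvements at least as large as the observed one.

    Adds 1 to numerator and denominator (an assumption of this sketch) so that
    a finite number of simulations never yields a p-value of exactly zero.
    """
    exceed = sum(1 for value in null_values if value >= observed)
    return (exceed + 1) / (len(null_values) + 1)


observed = improvement("JC", "K2P")

# HYPOTHETICAL improvements from data simulated under the null hypothesis.
# These numbers are made up; they are not results from any real simulation.
null_improvements = [0.1, 0.4, 0.0, 1.2, 0.7, 0.3, 2.1, 0.2, 0.9]

p_value = empirical_p_value(observed, null_improvements)
print(round(observed, 2), p_value)
```

If the observed improvement lies far out in the tail of the simulated null improvements, the extra parameter is doing more than fitting noise. With only nine made-up simulations the smallest attainable p-value is 0.1, which is why real analyses simulate many more data sets.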