Phylogenetic trees IV: Maximum Likelihood
Gerhard Jäger
ESSLLI 2016
Theory
Recap: Continuous time Markov model

P(t) = \begin{pmatrix} s + re^{-t} & r - re^{-t} \\ s - se^{-t} & r + se^{-t} \end{pmatrix}, \qquad \pi = (s, r)

(figure: a tree with branches l_1, …, l_8)
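A quick numerical check of this matrix; a minimal Python sketch, following the slide's parametrization with equilibrium π = (s, r) and s + r = 1:

```python
import math

def p_matrix(s, r, t):
    """Transition matrix P(t) of the two-state model above,
    with equilibrium distribution pi = (s, r), s + r = 1."""
    e = math.exp(-t)
    return [[s + r * e, r - r * e],
            [s - s * e, r + s * e]]

# each row of P(t) is a probability distribution ...
P = p_matrix(0.3, 0.7, 1.5)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)

# ... and for large t both rows converge to the equilibrium (s, r)
P_inf = p_matrix(0.3, 0.7, 50.0)
assert abs(P_inf[0][0] - 0.3) < 1e-9 and abs(P_inf[1][1] - 0.7) < 1e-9
```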
Likelihood of a tree
- background reading: Ewens and Grant (2005), 15.7
- simplifying assumption: evolution at different branches is independent
- suppose we know probability distributions v_t and v_b over states at top and bottom of branch l_k:

L(l_k) = v_t^T P(l_k) v_b
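The branch likelihood L(l_k) = v_t^T P(l_k) v_b can be spelled out directly; a minimal sketch (the uniform equilibrium s = r = 0.5 and the example values are illustrative choices, not from the slide):

```python
import math

def p_matrix(s, r, t):
    # two-state transition matrix with equilibrium (s, r), s + r = 1
    e = math.exp(-t)
    return [[s + r * e, r - r * e],
            [s - s * e, r + s * e]]

def branch_lik(v_top, v_bot, t, s=0.5, r=0.5):
    """L(l_k) = v_t^T P(l_k) v_b for a branch of length t."""
    P = p_matrix(s, r, t)
    return sum(v_top[i] * P[i][j] * v_bot[j]
               for i in (0, 1) for j in (0, 1))

# state 0 certain at the top, state 1 certain at the bottom:
# the branch likelihood reduces to the transition probability P(t)[0][1]
lik = branch_lik([1.0, 0.0], [0.0, 1.0], 1.0)
assert abs(lik - (0.5 - 0.5 * math.exp(-1.0))) < 1e-12
```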
Likelihood of a tree
- likelihoods of the states (0, 1) at the root are v_1^T P(l_1) · v_2^T P(l_2) (componentwise product over the two daughter branches)
- log-likelihoods: log(v_1^T P(l_1)) + log(v_2^T P(l_2))
- log-likelihood of a larger tree: recursively apply this method from the tips to the root
(Log-)Likelihood of a tree

\log L(\text{tips below} \mid \text{mother} = s) = \sum_{d \in \text{daughters}} \log \sum_{s' \in \text{states}} P(s \to s' \mid \text{branch length}) \cdot L(\text{tips below } d \mid d = s')
(Log-)Likelihood of a tree
- this is essentially identical to the Sankoff algorithm for parsimony: weight(i, j) = log P(l_k)_{ij}
- the weight matrix depends on the branch length → needs to be recomputed for each branch
- the overall likelihood of the entire tree depends on the probability distribution at the root
- if we assume that the root node is in equilibrium: L(tree) = (s, r)^T L(root)
- this does not depend on the location of the root (→ time reversibility)
- this is for one character; the likelihood for all data is the product of the likelihoods for each character
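The recursion above can be sketched as a small pruning routine (Felsenstein's algorithm); the tree encoding and the example values are made up for illustration:

```python
import math

def p_matrix(s, r, t):
    # two-state transition matrix with equilibrium (s, r), as in the recap
    e = math.exp(-t)
    return [[s + r * e, r - r * e],
            [s - s * e, r + s * e]]

def cond_lik(node, s, r):
    """L(tips below node | node = state) for both states, computed
    from the tips upward.  A node is either an observed tip state
    (0 or 1) or a list of (daughter, branch_length) pairs; this
    encoding is ad hoc, chosen only for the sketch."""
    if node in (0, 1):                  # tip: indicator vector
        return [1.0 if st == node else 0.0 for st in (0, 1)]
    lik = [1.0, 1.0]                    # product over daughters ...
    for daughter, bl in node:
        P = p_matrix(s, r, bl)
        d = cond_lik(daughter, s, r)
        for st in (0, 1):               # ... of sums over daughter states
            lik[st] *= sum(P[st][s2] * d[s2] for s2 in (0, 1))
    return lik

# tree ((A:0.3, B:0.3):0.2, C:0.5) with observed states A=1, B=1, C=0
tree = [([(1, 0.3), (1, 0.3)], 0.2), (0, 0.5)]
s, r = 0.5, 0.5
root = cond_lik(tree, s, r)
L = s * root[0] + r * root[1]           # root in equilibrium: (s, r)^T L(root)
assert 0.0 < L < 1.0
```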
(Log-)Likelihood of a tree
- the likelihood of a tree depends on the branch lengths and on the rates for each character
- likelihood for a tree topology:

L(\text{topology}) = \max_{l_k \,:\, k \text{ is a branch}} L(\text{tree} \mid \vec{l})
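Maximizing over branch lengths can be illustrated on a single branch; a toy sketch with a crude grid search standing in for a proper optimizer (the data pairs and s = r = 0.5 are invented for illustration):

```python
import math

def p_matrix(s, r, t):
    e = math.exp(-t)
    return [[s + r * e, r - r * e],
            [s - s * e, r + s * e]]

def branch_loglik(t, pairs, s=0.5, r=0.5):
    """Log-likelihood of a single branch of length t, given observed
    (top, bottom) state pairs for several characters."""
    P = p_matrix(s, r, t)
    pi = (s, r)
    return sum(math.log(pi[a] * P[a][b]) for a, b in pairs)

# invented data: three characters keep their state, one flips
pairs = [(0, 0), (0, 0), (1, 1), (0, 1)]

# grid search over branch lengths
grid = [i / 100 for i in range(1, 500)]
best = max(grid, key=lambda t: branch_loglik(t, pairs))

# for these counts the analytic optimum is t = ln 2
assert abs(best - math.log(2)) < 0.02
```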
(Log-)Likelihood of a tree

Where do we get the rates from? Different options, in increasing order of complexity:
1. s = r = 0.5 for all characters
2. r = empirical relative frequency of state 1 in the data (identical for all characters)
3. a certain proportion p_inv (value to be estimated) of characters are invariant
4. rates are Gamma distributed
Gamma-distributed rates
- we want to allow rates to vary, but not too much
- the rate matrix is multiplied with a coefficient λ_i for character i
- λ_i is a random variable drawn from a Gamma distribution
- common method (no real justification except mathematical convenience)
- the equilibrium distribution is identical for all characters

L(r_i = x) = \frac{\beta^{\beta} x^{\beta-1} e^{-\beta x}}{\Gamma(\beta)}
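The density above is a Gamma distribution with shape and rate both equal to β, so the mean rate is 1; a quick numerical sanity check (the value β = 2 is an arbitrary choice):

```python
import math

def gamma_density(x, beta):
    """L(r_i = x) = beta**beta * x**(beta-1) * exp(-beta*x) / Gamma(beta),
    i.e. a Gamma distribution with shape = rate = beta."""
    return beta**beta * x**(beta - 1) * math.exp(-beta * x) / math.gamma(beta)

# midpoint-rule check that the density integrates to 1 and has mean 1
beta, dx = 2.0, 0.001
xs = [dx * (i + 0.5) for i in range(40000)]   # grid up to x = 40
total = sum(gamma_density(x, beta) * dx for x in xs)
mean = sum(x * gamma_density(x, beta) * dx for x in xs)
assert abs(total - 1.0) < 1e-3
assert abs(mean - 1.0) < 1e-3
```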
Gamma-distributed rates
- overall likelihood of a tree topology: integrate over all λ_i, weighted by the Gamma likelihood
- computationally impractical
- in practice: split the Gamma distribution into n discrete bins (usually n = 4) and approximate integration via a Hidden Markov Model
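The binning can be pictured as follows; a sketch that approximates the equal-probability bins by Monte Carlo sampling rather than exact Gamma quantiles (the sample size and seed are arbitrary):

```python
import random

def discrete_gamma_rates(beta, n=4, samples=200_000, seed=1):
    """Approximate n-category discrete Gamma rates: draw variates from
    Gamma(shape=beta, rate=beta) (mean 1), split them into n bins of
    equal probability, and use each bin's mean as its representative
    rate."""
    rng = random.Random(seed)
    xs = sorted(rng.gammavariate(beta, 1 / beta) for _ in range(samples))
    size = samples // n
    return [sum(xs[i * size:(i + 1) * size]) / size for i in range(n)]

rates = discrete_gamma_rates(0.5)
# the bin rates increase and average to ~1, the overall mean rate
assert rates == sorted(rates)
assert abs(sum(rates) / 4 - 1.0) < 0.02
```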
Modeling decisions to make

aspect of model            possible choices     number of parameters to estimate
branch lengths             unconstrained        2n − 3 (n is number of taxa)
                           ultrametric          n − 1
equilibrium probabilities  uniform              0
                           empirical            0
                           ML estimate          1
rate variation             none                 0
                           Gamma distributed    1
invariant characters       none                 0
                           p_inv                1

This could be continued: you can build in rate variation across branches, you can fit the number of Gamma categories, …
Model selection tradeoff
- rich models are better at detecting patterns in the data, but are prone to over-fitting
- parsimonious models are less vulnerable to overfitting but may miss important information
- standard issue in statistical inference
- one possible heuristic: Akaike Information Criterion (AIC)

AIC = −2 × log-likelihood + 2 × number of free parameters

- the model minimizing AIC is to be preferred
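The criterion is trivial to compute; the log-likelihood values below are invented purely for illustration:

```python
def aic(log_likelihood, k):
    """AIC = -2 * log-likelihood + 2 * number of free parameters k."""
    return -2.0 * log_likelihood + 2.0 * k

# a richer model must gain enough likelihood to justify its parameters
simple = aic(-8000.0, k=2)   # hypothetical 2-parameter model
rich = aic(-7999.5, k=4)     # +2 parameters, tiny likelihood gain
assert simple < rich         # the simpler model is preferred here
```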
Example: Model selection for cognacy data

(table: AIC on the UPGMA tree for all 24 models, i.e. every combination of branch lengths {ultrametric, unconstrained} × eq. probs. {uniform, empirical, ML} × rate variation {none, Gamma} × inv. char. {none, p_inv}; the models with uniform equilibrium probabilities fare worst, with AIC values around 17,500, while all other models lie between roughly 15,982 and 16,115)
Tree search
- ML computation gives us the likelihood of a tree topology, given data and a model
- ML tree:
  - heuristic search to find the topology maximizing the likelihood
  - optimize branch lengths to maximize the likelihood for that topology
- computationally very demanding!
- ideally, one would want to do 24 heuristic tree searches, one for each model specification, and pick the tree+model with the lowest AIC
- in practice one has to make compromises
- for the 25 taxa in our running example, an ML tree search for the full model requires several hours on a single processor; parallelization helps
Running example
Running example: cognacy data
- ultrametric: AIC = 7972
- unconstrained branch lengths: AIC = 7929

(figure: the ultrametric and the unconstrained ML tree over the 25 Indo-European languages)
Running example: WALS data
- ultrametric: AIC = 2828
- unconstrained branch lengths: AIC = 2752

(figure: the ultrametric and the unconstrained ML tree over the 25 Indo-European languages)
Running example: phonetic data
- ultrametric: AIC = 90575
- unconstrained branch lengths: AIC = 89871

(figure: the ultrametric and the unconstrained ML tree over the 25 Indo-European languages)
Wrapping up
- ML is conceptually superior to MP (let alone distance methods):
  - the possibility of multiple mutations is taken into account, depending on branch lengths
  - different mutation rates for different characters are inferred from the data
  - side effect of the likelihood computation: a probability distribution over character states at each internal node can be read off
- disadvantages:
  - computationally demanding
  - many parameter settings make model selection difficult (note that the ultrametric trees in our example are sometimes better even though they have a higher AIC)
  - the ultrametric constraint makes branch-length optimization even more computationally demanding
  - computationally more expensive ⇒ not feasible for larger data sets (more than 100–200 taxa)
Cleaning up from yesterday