CS 581: Algorithmic Genomic Biology. L.S. Kubatko and J.H. Degnan: "Inconsistency of Phylogenetic Estimates from Concatenated Data under Coalescence." Systematic Biology 2007, 56 (1): 17-24. Presented by Ben Kurtovic, March 9, 2017.


  1. CS 581: Algorithmic Genomic Biology. L.S. Kubatko and J.H. Degnan: "Inconsistency of Phylogenetic Estimates from Concatenated Data under Coalescence." Systematic Biology 2007, 56 (1): 17-24. Ben Kurtovic, March 9, 2017

  2. Overview
     - Concatenation is a popular, simple method for analyzing multiple gene sequences from a set of species
     - Incomplete lineage sorting can cause incorrect species trees to be inferred, because of the higher probability of anomalous gene trees
     - Simulation shows that maximum likelihood on concatenated sequences appears to be statistically inconsistent as the number of genes grows, for a variety of parameter values
     - Bootstrap can strongly support an incorrect phylogeny

  3. Model tree
     - Branch lengths are given in coalescent units, t/(2N), where t is the number of generations and N is the effective population size
     - The most probable gene tree can differ from the species tree when branch lengths are sufficiently short: anomalous gene trees (AGTs)
     - Asymmetric topologies constrain the order of coalescent events more than symmetric ones, so symmetric topologies can appear more likely
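
The last point on the model-tree slide can be checked by enumeration in a limiting case. The sketch below assumes four lineages A, B, C, D in a single population (internal branch lengths close to zero), where every ordered sequence of pairwise coalescences is equally likely. The function names are illustrative and do not come from the paper.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def labeled_histories(lineages):
    """Yield the final tree for every ordered sequence of pairwise coalescences."""
    if len(lineages) == 1:
        yield lineages[0]
        return
    for i, j in combinations(range(len(lineages)), 2):
        # Trees are nested frozensets, so the same topology reached by
        # different coalescence orders compares equal.
        merged = frozenset([lineages[i], lineages[j]])
        rest = [x for k, x in enumerate(lineages) if k not in (i, j)]
        yield from labeled_histories(rest + [merged])


def is_symmetric(tree):
    """A 4-taxon tree is symmetric when both children of the root are cherries."""
    return all(isinstance(child, frozenset) for child in tree)


histories = Counter(labeled_histories(["A", "B", "C", "D"]))
total = sum(histories.values())  # 6 * 3 * 1 equally likely histories
probability = {tree: Fraction(n, total) for tree, n in histories.items()}

symmetric = [p for t, p in probability.items() if is_symmetric(t)]
asymmetric = [p for t, p in probability.items() if not is_symmetric(t)]
print(len(symmetric), set(symmetric))
print(len(asymmetric), set(asymmetric))
```

Each of the 3 symmetric topologies is produced by two coalescence orders (probability 1/9), while each of the 12 asymmetric topologies is produced by exactly one (probability 1/18). This is why very short internal branches can make a symmetric gene tree more probable than the gene tree matching an asymmetric species tree.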

  4. Anomaly zone
     - When x or y is small, an anomaly zone occurs in which tree S1 is more probable than the correct tree, MT
     - When x and y are both small, all three symmetric trees are more probable than MT

  5. Simulation method
     - For each pair (x, y), independent gene trees were sampled under the coalescent for varying numbers of genes
     - Gene tree branch lengths were determined by the species tree length in coalescent units times θ/2, where θ = 4Nµ, N is the effective population size, and µ is the mutation rate
     - Considered θ = 0.001 and θ = 0.01
     - DNA sequences for each gene tree were generated using Jukes-Cantor, the alignments were concatenated, and maximum likelihood was performed using PAUP*
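
The branch-length scaling and sequence generation steps of the simulation method can be sketched as follows. This is a toy illustration only, not the authors' pipeline. It evolves two sequences independently from a shared ancestor rather than along a sampled gene tree, and it uses the standard Jukes-Cantor probability that a site differs after a branch of d expected substitutions per site, 3/4 · (1 − exp(−4d/3)). All function names, taxon labels, and sequence lengths are assumptions made for the example.

```python
import math
import random


def coalescent_to_substitutions(length_coalescent, theta):
    """Convert a branch length in coalescent units to expected substitutions per site."""
    return length_coalescent * theta / 2.0


def jc69_difference_probability(d):
    """Probability that a site ends in a different base after a branch of length d (JC69)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def evolve(sequence, d, rng):
    """Evolve a sequence along one branch; a changed site picks uniformly among the other bases."""
    p = jc69_difference_probability(d)
    evolved = []
    for base in sequence:
        if rng.random() < p:
            base = rng.choice([b for b in "ACGT" if b != base])
        evolved.append(base)
    return "".join(evolved)


def concatenate(gene_alignments):
    """Join per-gene alignments (dicts of taxon -> sequence) into one supermatrix."""
    taxa = gene_alignments[0].keys()
    return {t: "".join(aln[t] for aln in gene_alignments) for t in taxa}


rng = random.Random(0)
theta = 0.01
d = coalescent_to_substitutions(1.0, theta)  # one coalescent unit

genes = []
for _ in range(3):  # three toy genes of 20 sites each
    ancestor = "".join(rng.choice("ACGT") for _ in range(20))
    genes.append({"A": evolve(ancestor, d, rng), "B": evolve(ancestor, d, rng)})

supermatrix = concatenate(genes)
print({taxon: len(seq) for taxon, seq in supermatrix.items()})
```

With θ = 0.01, a branch of one coalescent unit corresponds to 0.005 expected substitutions per site, so most sites are unchanged. This illustrates how little signal short branches leave in each gene.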

  6. Results (θ = 0.001)

  7. Results (θ = 0.01)

  8. Results summary
     - For points in the anomaly zone, the incorrect tree S1 is inferred with increasing probability as the number of genes increases
     - Even at points outside the anomaly zone (A, B), an incorrect phylogeny can be inferred, despite S1 not being the most frequent topology
     - Therefore, the existence of an AGT is neither necessary nor sufficient for inconsistency, but can be a good indicator
     - Other variables (z and θ) influence the amount of evolution, but not the gene tree probabilities, and do not greatly affect the results

  9. Results (varying branch length x)
     - The frequency with which the correct tree (MT) is inferred correlates with the probability of MT given the species tree
     - However, a large branch length or a large number of genes is required for this probability to be high (convergence is slow)

  10. Bootstrap
      - Bootstrap is used to determine confidence in phylogenetic estimates
      - Resample the concatenated sequences with replacement and compare how well the resampled data support the various trees
      - Per the results, bootstrap generally supports the inferred tree, even if it is anomalous
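
The resampling step of the bootstrap can be sketched as below, assuming the usual nonparametric form in which alignment columns (sites) are drawn with replacement. `infer_tree` is a placeholder for whatever estimator is being assessed. The `toy_infer_tree` function here is a hypothetical stand-in that pairs the two most similar sequences, not the maximum likelihood search used in the paper, and the three-taxon alignment is made up for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping each column intact across taxa."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in columns) for taxon, seq in alignment.items()}


def bootstrap_support(alignment, infer_tree, n_replicates, rng):
    """Fraction of bootstrap replicates in which each inferred tree appears."""
    counts = Counter(
        infer_tree(bootstrap_replicate(alignment, rng)) for _ in range(n_replicates)
    )
    return {tree: n / n_replicates for tree, n in counts.items()}


def toy_infer_tree(alignment):
    """Hypothetical estimator: return the pair of taxa with the fewest mismatches."""
    def mismatches(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return min(combinations(sorted(alignment), 2), key=mismatches)


alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "TCGAACTTAC",
}
support = bootstrap_support(alignment, toy_infer_tree, 200, random.Random(0))
print(support)
```

Because every replicate is drawn from the same concatenated data, the support values measure how stable the estimate is under resampling, not whether it is correct. This is consistent with the slide's point that bootstrap can back an anomalous tree.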

  11. Conclusions
      - Conditions exist in which concatenation results in very poor performance
      - Short branches are due to a small number of generations relative to population size
      - In practice, this is more common near the tips of the tree, due to extinction and other reasons
      - As a result, in larger trees, the inaccuracies in the inferred tree may be relatively small (even if they are likely to happen)
      - Sampling more individuals per species may be helpful (only one individual per species was used here)
      - As this is a simulation study, no true claims about statistical consistency can be made (note figure 3B)
