
Methodological Challenges in the Pursuit of the Tree of Life!

Reviews in Computational Biology. Methodological Challenges in the Pursuit of the Tree of Life! Christophe Dessimoz, February 13th, 2013. Outline: Introduction; Mature methods: supermatrix, supertree; Emerging methods: species-tree


  1. Reviews in Computational Biology: Methodological Challenges in the Pursuit of the Tree of Life! Christophe Dessimoz, February 13th, 2013

  2. Outline • Introduction • Mature methods: supermatrix, supertree • Emerging methods: species-tree • Outlook

  3. Augustin Augier, Arbre Botanique (1801)

  4. Lamarck, Philosophie Zoologique, 1809

  5. Darwin, Notebook B, 1837 (image: Wikipedia)

  6. 16S rRNA was used by Woese (1987) to group early life forms into three kingdoms

  7. Genomic Era Snel et al. Genome trees and the nature of genome evolution. Annu Rev Microbiol (2005) vol. 59 pp. 191-209

  8. PART I Established Methods: Supermatrix and Supertree

  9. Gene trees, homology, orthology & paralogy [Figure: gene tree annotated with speciation, duplication and gene loss events, marking example pairs of orthologs and paralogs] Altenhoff and Dessimoz, Methods in Molecular Biology 2012

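  10. [Figure: # of genomes (axis tick at 1000) vs. fraction of marker genes used. Labels: 100; 30 genes]

  11. [Figure: # of genomes (axis tick at 1,000) vs. fraction of marker genes used (up to full genome). Labels: 100; 13 genes]

  12. [Figure: # of genomes (axis tick at 1000) vs. fraction of marker genes used (up to full genome). Labels: 578; 31 genes]

  13. [Figure: # of genomes (axis tick at 1000) vs. fraction of marker genes used (up to full genome). Labels: 2684; 8 genes]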

  14. i.e. 50% bootstrap support! i.e. 95% bootstrap support!

  15. But!

  16. Actually, use only small fraction of data.

  17. Since then... [Figure: # of species vs. # of marker genes. Studies plotted: Ciccarelli 2006; Pisani 2007; Dunn et al. 2008; Wu & Eisen 2008; Goloboff et al. 2009; Hejnol et al. 2009; Edwards et al. 2010; Smith et al. 2011. Values shown (pairing with studies not recoverable): 73,060; ?; 2684; 1000; 578; 191; 77; 150; 31; >1000; 8]

  18. Gene tree ≠ Species tree

  19. Gene tree ≠ Species tree • Gene duplication (paralogs) • Lateral gene transfer (xenologs) • Endosymbiosis (e.g. Delsuc et al. 2005) • Hybridization (Hallström & Janke 2008) • Incomplete lineage sorting (aka deep coalescence). References: Jeffroy et al. 2006; McInerney et al. 2008; Edwards 2009; Philippe et al. 2011

  20. Systematic Errors • Branch-length heterogeneity (Matsen & Steel 2007, Edwards 2009) • Nucleotide composition heterogeneity across species (Hasegawa & Hashimoto 1993, Jeffroy et al. 2006) • Missing data (Hartmann & Vision 2008) • In general: model violations

  21. Systematic error can result in overconfidence (e.g. same argument in Philippe et al. 2011) [Figure: three conflicting placements of Bilateria, Cnidaria (corals, jellyfish), sponges (Porifera) and comb jellies (Ctenophora), from Philippe et al. Current Biol 2009, Dunn et al. Nature 2008 and Schierwater et al. PLoS Biol 2009. Support values shown: 80-90%, 62-96%, 0-70%, 53%, 78-99%, 27%. All photos from Wikipedia]

  22. PART II Emerging Methods: Species-Tree Inference Methods Most relevant review: Anderson et al. Methods in Molecular Biology 2012

  23. Two main classes • Methods modelling specific processes (“mechanistic”) a) Rate variation within/among markers b) Gene duplication c) Deep coalescence d) Lateral gene transfer • Process agnostic (“empirical”)

  24. a) Rate heterogeneity within/between genes • Within genes: • Among-site rate heterogeneity (Gamma-rate): Yang 1993, Yang 1994 • Among-site model heterogeneity (CAT model): Lartillot & Philippe 2004 • Heterotachy (change over time, i.e. branches): Galtier 2001, Penny 2001 • Among genes: • Proportional model: Pupko 2002, Dessimoz et al. 2008
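As a rough illustration of the Gamma-rate idea on this slide, the sketch below draws one relative rate per alignment site from a Gamma distribution with mean 1. The function name `sample_site_rates` and the parameter `alpha` (the shape) are illustrative choices, not taken from the slides; real phylogenetic software typically uses a discretised Gamma rather than raw samples.

```python
import random


def sample_site_rates(n_sites, alpha, rng=None):
    """Draw one relative substitution rate per site.

    Rates follow Gamma(shape=alpha, scale=1/alpha), so the mean rate is 1
    and the variance is 1/alpha. Small alpha means strong among-site rate
    heterogeneity; large alpha means nearly equal rates.
    """
    rng = rng or random.Random()
    # random.gammavariate takes (shape, scale)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


if __name__ == "__main__":
    rng = random.Random(1)
    for alpha in (0.5, 5.0):
        rates = sample_site_rates(10000, alpha, rng)
        mean = sum(rates) / len(rates)
        print(f"alpha={alpha}: mean rate {mean:.2f}, max rate {max(rates):.2f}")
```

With a small shape value most sites evolve slowly while a few evolve very fast, which is the pattern the Gamma-rate model is meant to capture.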

  25. b) Duplication events. Intro: gene/species tree reconciliation [Figure: a gene tree G (genes G1 to G4) and a species tree S (Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus) combined into a reconciled tree R, with a duplication node and gene losses marked] Dufayard et al., Bioinformatics, 2005. Reviewed in Altenhoff & Dessimoz, Methods in Molecular Biology 2012

  26. Reconciliation: Parsimony & Likelihood • Parsimony: minimise # duplications & losses • Likelihood: pick the reconciliation(s) that maximise the probability of observing the data (i.e. gene/species trees) under a particular model [Figure: reconciled tree R with a duplication node and losses] Reviewed in Altenhoff & Dessimoz, Methods in Molecular Biology 2012
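The parsimony side of reconciliation can be sketched with the classic last-common-ancestor (LCA) mapping: each gene-tree node is mapped to the smallest species-tree clade containing all its species, and a node is called a duplication when it maps to the same clade as one of its children. This is a minimal sketch under assumptions: trees are nested Python tuples, gene-tree leaves are labelled by species name, only duplications are counted (not losses), and the function names are illustrative rather than from any cited tool.

```python
def clades(tree):
    """Return (leaf set of tree, list of leaf sets of all its subtrees)."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    leaves, all_clades = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        all_clades.extend(child_clades)
    all_clades.append(leaves)
    return leaves, all_clades


def lca(species_clades, species_set):
    """Smallest species-tree clade that contains every species in the set."""
    return min((c for c in species_clades if species_set <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Count gene-tree nodes inferred as duplications under LCA mapping."""
    _, species_clades = clades(species_tree)

    def visit(node):
        # Returns (species under this node, duplications at or below it).
        if isinstance(node, str):
            return frozenset([node]), 0
        results = [visit(child) for child in node]
        species = frozenset().union(*(s for s, _ in results))
        here = lca(species_clades, species)
        dups = sum(d for _, d in results)
        # Duplication: the node maps to the same clade as one of its children.
        if any(lca(species_clades, s) == here for s, _ in results):
            dups += 1
        return species, dups

    return visit(gene_tree)[1]


if __name__ == "__main__":
    S = (("Homo sapiens", "Pan troglodytes"),
         ("Mus musculus", "Rattus norvegicus"))
    G = (("Homo sapiens", "Mus musculus"),
         ("Pan troglodytes", "Rattus norvegicus"))
    print(count_duplications(S, S))  # congruent gene tree
    print(count_duplications(G, S))  # incongruent gene tree
```

A gene tree congruent with the species tree needs no duplication, whereas the incongruent tree `G` can only be explained, under this duplication/loss model, by a duplication at the root followed by losses.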

  27. IDEA: treat species tree as unknown (or at least somewhat uncertain) quantity

  28. c) Modelling Coalescent [Figure: model parameters, gene trees and sequence alignments (per locus); labels: time to most recent common ancestor, time of speciation] IDEA: instead of fixing species tree, treat as parameter! Rannala & Yang, Annu Rev Genomics Hum Genet 2008
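To make the link between speciation times and gene-tree discordance concrete, the sketch below uses the standard three-species result under the multispecies coalescent: with an internal branch of length T in coalescent units, the gene tree matches the species tree with probability 1 - (2/3)·exp(-T). The assumptions are one sampled lineage per species, a diploid population of constant effective size, and no gene flow; the function and parameter names are illustrative.

```python
import math


def prob_gene_tree_matches(t_generations, n_e):
    """Probability that a 3-species gene tree has the species-tree topology.

    t_generations: length of the internal species-tree branch, in generations.
    n_e: effective (diploid) population size along that branch.
    """
    # Branch length in coalescent units (2 * Ne generations for diploids).
    T = t_generations / (2.0 * n_e)
    # If the two sister lineages fail to coalesce on the internal branch
    # (probability exp(-T)), all three topologies are equally likely.
    return 1.0 - (2.0 / 3.0) * math.exp(-T)


if __name__ == "__main__":
    for t in (0, 10_000, 100_000, 1_000_000):
        p = prob_gene_tree_matches(t, n_e=50_000)
        print(f"{t:>9} generations: P(match) = {p:.3f}")
```

Short internal branches or large populations push the probability towards 1/3, which is why deep coalescence alone can make a majority of loci disagree with the species tree.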

  29. Methods [Table of methods not recovered; surviving annotations: (parsimony), (parsimony), (summary statistics)] Also see review of Liu et al. 2009

  30. d) Lateral gene transfer

  31. Process agnostic • Independent tree inference for each gene (relatively efficient!) • Number of different trees modeled as Dirichlet process [Figure labels: assumption of independence among genes; Gene-to-tree map; All; Sequence alignments; Tree of gene i]

  32. Dirichlet Process a.k.a. Chinese Restaurant Process e.g. http://www.cs.princeton.edu/courses/archive/fall07/cos597C/scribe/20070921.pdf
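The Chinese Restaurant Process can be sketched in a few lines. Here each "customer" stands for a gene and each "table" for a distinct tree topology: gene number i joins an existing topology with probability proportional to the number of genes already assigned to it, or opens a new one with probability proportional to a concentration parameter. This shows only the prior over groupings, not any particular program's inference procedure; the function name and the use of `alpha` for the concentration parameter are illustrative assumptions.

```python
import random


def chinese_restaurant_process(n_customers, alpha, rng=None):
    """Sample a partition of customers into tables under a CRP prior.

    Returns a list giving the table index (0-based) of each customer.
    Larger alpha favours more tables, i.e. more distinct gene trees.
    """
    rng = rng or random.Random()
    table_sizes = []
    assignment = []
    for i in range(n_customers):
        # i customers are already seated; total weight is i + alpha.
        r = rng.uniform(0, i + alpha)
        cumulative = 0.0
        for k, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                # Join existing table k (probability size / (i + alpha)).
                table_sizes[k] += 1
                assignment.append(k)
                break
        else:
            # Open a new table (probability alpha / (i + alpha)).
            table_sizes.append(1)
            assignment.append(len(table_sizes) - 1)
    return assignment


if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (0.1, 1.0, 10.0):
        seats = chinese_restaurant_process(100, alpha, rng)
        print(f"alpha={alpha}: {max(seats) + 1} distinct tables for 100 genes")
```

A small concentration parameter corresponds to a prior belief that most genes share one tree, while a large one allows nearly every gene its own tree.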

  33. Evaluation with simulated data

  34. Leaché & Rannala, Syst Biol 2010 [Figure labels: tree length; population size * mutation rate; difference between gene and species tree (baseline)]

  35. Chung & Ané 2011 [Figure panels: Horizontal Gene Transfers + ILS; Incomplete Lineage Sorting only. Methods compared: mechanistic (ILS), empirical. Axis annotation: Better]

  36. Better

  37. Evaluation with empirical data

  38. • “Note that the concordance factors in the [BUCKy] tree are much more conservative than the posterior probabilities in the topology estimated from the concatenated alignment” • “Taking into account the incongruence between gene trees does not drastically change our overall view of rice phylogeny, but it does give a more varied picture of the support across the tree.” • “[BUCKy] is robust to the prior probability on gene tree incongruence (the α parameter)” • “[The 6-species, 162 genes Bayesian analysis] had not yet reached stationarity after 1.6 billion iterations.” (2 months on 96 CPU cores)

  39. Outlook • Bottleneck is methods, not data • Need methods able to deal with different gene histories • Very difficult to say which approach yields better results solely from first principles -> need for sound simulation/empirical tests • Efficiency needs to be improved (“The largest data set yet tested with these species tree methods is yeast, with 106 loci in 8 species”, Cranston 2009)
