Reviews in Computational Biology: Methodological Challenges in the Pursuit of the Tree of Life. Christophe Dessimoz, February 13th, 2013
Outline • Introduction • Mature methods: supermatrix, supertree • Emerging methods: species-tree • Outlook
Augustin Augier, Arbre Botanique (1801)
Lamarck, Philosophie Zoologique , 1809
Darwin, Notebook B, 1837 Wikipedia
16S rRNA was used by Woese (1987) to group early life forms into three kingdoms
Genomic Era Snel et al. Genome trees and the nature of genome evolution. Annu Rev Microbiol (2005) vol. 59 pp. 191-209
PART I Established Methods: Supermatrix and Supertree
Gene trees, Homology, Orthology & Paralogy [Figure: gene tree with speciation, duplication and gene loss events, highlighting one example pair of orthologs and one example pair of paralogs] Altenhoff and Dessimoz, Methods in Molecular Biology 2012
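The orthology/paralogy distinction on this slide can be sketched in a few lines: two homologous genes are orthologs if their last common ancestor in the gene tree is a speciation node, and paralogs if it is a duplication node. The tree below, its gene names and the assumption that event labels are already known are all hypothetical, chosen only for illustration.

```python
# Hypothetical labelled gene tree: an internal node is (event, left, right),
# a leaf is a gene name. Event labels are assumed to be known in advance.
GENE_TREE = (
    "duplication",
    ("speciation", "human_1", "mouse_1"),
    ("speciation", "human_2", "mouse_2"),
)


def leaves(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relation(node, a, b):
    """Classify two distinct genes a and b (both assumed present in the tree)."""
    event, left, right = node
    # Descend while a single child subtree still contains both genes.
    for child in (left, right):
        if not isinstance(child, str) and {a, b} <= leaves(child):
            return relation(child, a, b)
    # This node is the last common ancestor of a and b.
    return "orthologs" if event == "speciation" else "paralogs"


print(relation(GENE_TREE, "human_1", "mouse_1"))
print(relation(GENE_TREE, "human_1", "mouse_2"))
```

In this toy tree the duplication predates the human/mouse split, so genes from different species can still be paralogs.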
[Figure, built up over four slides: # of genomes (axis ticks at 100 and 1000) against fraction of marker genes used (up to full genome). Labels shown in successive builds: 30 genes; 13 genes; 578 genomes, 31 genes; 2684 genomes, 8 genes.]
i.e. 50% bootstrap support! i.e. 95% bootstrap support!
But!
Actually, these studies use only a small fraction of the data.
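Since then... [Figure: # of species against # of marker genes for Ciccarelli 2006, Pisani 2007, Wu & Eisen 2008, Dunn et al. 2008, Hejnol et al. 2009, Goloboff et al. 2009, Edwards et al. 2010 and Smith et al. 2011. Species counts shown: 77, 191, 578, 1000 (axis tick), 2684, 73,060. Marker-gene counts shown: 8, 31, 150, >1000. One point is marked "?".]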
Gene tree ≠ Species tree
Gene tree ≠ Species tree • Gene duplication (paralogs) • Lateral gene transfer (xenologs) • Endosymbiosis (e.g. Delsuc et al. 2005) • Hybridization (Hallström & Janke 2008) • Incomplete lineage sorting (aka deep coalescence) Jeffroy et al. 2006 McInerney et al. 2008 Edwards 2009 Philippe et al. 2011
Systematic Errors • Branch-length heterogeneity (Matsen & Steel 2007, Edwards 2009) • Nucleotide composition heterogeneity across species (Hasegawa & Hashimoto 1993, Jeffroy et al. 2006) • Missing data (Hartmann & Vision 2008) • In general: model violations
Systematic error can result in overconfidence, e.g. [Figure: three conflicting trees relating Bilateria, Cnidaria (corals, jellyfish), sponges (Porifera) and comb jellies (Ctenophora), from Philippe et al. Current Biol 2009, Dunn et al. Nature 2008 and Schierwater et al. PLoS Biol 2009. Branch support values shown: 80-90%, 62-96%, 0-70%, 53%, 78-99%, 27%. All photos from Wikipedia.] Same argument in Philippe et al. 2011
PART II Emerging Methods: Species-Tree Inference Methods Most relevant review: Anderson et al. Methods in Molecular Biology 2012
Two main classes • Methods modelling specific processes (“mechanistic”) a) Rate variation within/among markers b) Gene duplication c) Deep coalescence d) Lateral gene transfer • Process agnostic (“empirical”)
a) Rate heterogeneity within/between genes • Within genes: • Among-site rate heterogeneity (Gamma-rate): Yang 1993, Yang 1994 • Among-site model heterogeneity (CAT model): Lartillot & Philippe 2004 • Heterotachy (change over time, i.e. branches): Galtier 2001, Penny 2001 • Among genes: • Proportional model: Pupko 2002, Dessimoz et al. 2008
b) Duplication events. Intro: gene/species tree reconciliation [Figure: a gene tree G (G1 Homo sapiens, G2 Mus musculus, G3 Rattus norvegicus, G4 Pan troglodytes) is reconciled with the species tree S (Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus), giving a reconciled tree R with one duplication node and four inferred losses] Dufayard et al., Bioinformatics, 2005. Reviewed in Altenhoff & Dessimoz, Methods in Molecular Biology 2012
Reconciliation: Parsimony & Likelihood • Parsimony: Minimise # duplication & losses • Likelihood: Pick the reconciliation(s) that maximise the probability of observing the data (i.e. gene/species trees) under a particular model [Figure: reconciled tree R with one duplication node and losses in Homo sapiens, Pan troglodytes, Mus musculus and Rattus norvegicus] Reviewed in Altenhoff & Dessimoz, Methods in Molecular Biology 2012
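A minimal sketch of parsimony reconciliation, assuming the common LCA-mapping approach (each gene-tree node is mapped to the last common ancestor of the species below it; a node is called a duplication when it maps to the same species-tree node as one of its children). This is an illustration rather than the method of any specific tool cited here. The internal node names "primates", "rodents" and "root" are hypothetical, and the gene tree is written directly with the species name of each gene.

```python
# Species tree ((Homo, Pan), (Mus, Rattus)) as a child -> parent map.
# Internal node names are hypothetical labels for this sketch.
PARENT = {
    "Homo": "primates", "Pan": "primates",
    "Mus": "rodents", "Rattus": "rodents",
    "primates": "root", "rodents": "root",
}


def ancestors(s):
    """Path from species-tree node s up to the root, s included."""
    path = [s]
    while s in PARENT:
        s = PARENT[s]
        path.append(s)
    return path


def depth(s):
    return len(ancestors(s)) - 1


def lca(a, b):
    """Last common ancestor of two species-tree nodes."""
    above_a = set(ancestors(a))
    for node in ancestors(b):
        if node in above_a:
            return node
    raise ValueError("nodes are not in the same tree")


def reconcile(gene_node):
    """Return (mapped species node, # duplications, # losses) for a gene subtree.

    A gene subtree is either a species name (leaf) or a pair (left, right).
    """
    if isinstance(gene_node, str):
        return gene_node, 0, 0
    m_left, d_left, l_left = reconcile(gene_node[0])
    m_right, d_right, l_right = reconcile(gene_node[1])
    m = lca(m_left, m_right)
    is_dup = m in (m_left, m_right)
    losses = 0
    for child_map in (m_left, m_right):
        gap = depth(child_map) - depth(m)
        # Each species-tree node skipped between parent and child is one loss.
        losses += gap if is_dup else gap - 1
    return m, d_left + d_right + int(is_dup), l_left + l_right + losses


# Gene tree grouping the Homo gene with the Mus gene, and Pan with Rattus.
print(reconcile((("Homo", "Mus"), ("Pan", "Rattus"))))
# Gene tree congruent with the species tree.
print(reconcile((("Homo", "Pan"), ("Mus", "Rattus"))))
```

For the incongruent gene tree the sketch infers one duplication at the root and four losses, while the congruent tree needs neither.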
IDEA: treat species tree as unknown (or at least somewhat uncertain) quantity
c) Modelling Coalescent [Figure: genealogy of one locus inside a species tree, marking the time of speciation and the time to most recent common ancestor; diagram linking model parameters, gene trees and sequence alignments] IDEA: instead of fixing species tree, treat as parameter! Rannala & Yang, Annu Rev Genomics Hum Genet 2008
Methods [Table of methods; method names not recovered. Surviving annotations: (parsimony), (parsimony), (summary statistics)] Also see the review of Liu et al. 2009
d) Lateral gene transfer
Process agnostic • Independent tree inference for each gene (relatively efficient!) • Number of different trees modeled as Dirichlet process [Diagram: all sequence alignments, tree of gene i, gene-to-tree map; annotation: assumption of independence among genes]
Dirichlet Process a.k.a. Chinese Restaurant Process e.g. http://www.cs.princeton.edu/courses/archive/fall07/cos597C/scribe/20070921.pdf
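The Chinese Restaurant Process view can be made concrete with a short simulation. In this sketch, each "customer" plays the role of a gene and each "table" the role of a distinct tree shared by the genes seated at it; this mapping, the function name and the parameter names are assumptions made for illustration, not the implementation of any particular program. Customer n+1 joins an existing table with probability proportional to its current occupancy and opens a new table with probability proportional to the concentration parameter alpha.

```python
import random


def chinese_restaurant_process(n_customers, alpha, rng):
    """Seat customers one by one; return the table index of each customer.

    alpha > 0 is the concentration parameter: larger values open more tables.
    """
    tables = []      # tables[k] = number of customers currently at table k
    assignment = []
    for n in range(n_customers):
        # n customers are already seated. Existing table k is chosen with
        # probability tables[k] / (n + alpha), a new one with alpha / (n + alpha).
        r = rng.uniform(0, n + alpha)
        cumulative = 0.0
        for k, count in enumerate(tables):
            cumulative += count
            if r < cumulative:
                tables[k] += 1
                assignment.append(k)
                break
        else:
            # No existing table was chosen: open a new one.
            tables.append(1)
            assignment.append(len(tables) - 1)
    return assignment


rng = random.Random(0)
for alpha in (0.1, 1.0, 10.0):
    seating = chinese_restaurant_process(100, alpha, rng)
    print(alpha, "->", len(set(seating)), "distinct tables")
```

Read this way, a small alpha favours most genes sharing one tree, while a large alpha favours many distinct gene trees.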
Evaluation with simulated data
Leaché & Rannala, Syst Biol 2010 [Figure: axes labelled tree length and population size * mutation rate; baseline: difference between gene and species tree]
Chung & Ané 2011 [Figure: panels for Incomplete Lineage Sorting (ILS) only and for Horizontal Gene Transfers + ILS, comparing mechanistic (ILS) and empirical methods; an arrow marks the "Better" direction]
Evaluation with empirical data
• “Note that the concordance factors in the [BUCKy] tree are much more conservative than the posterior probabilities in the topology estimated from the concatenated alignment” • “Taking into account the incongruence between gene trees does not drastically change our overall view of rice phylogeny, but it does give a more varied picture of the support across the tree.” • “[BUCKy] is robust to the prior probability on gene tree incongruence (the α parameter)” • “[The 6-species, 162 genes Bayesian analysis] had not yet reached stationarity after 1.6 billion iterations.” (2 months on 96 CPU cores)
Outlook • Bottleneck is methods, not data • Need methods able to deal with different gene histories • Very difficult to say which approach yields better results solely from first principles -> need for sound simulation/empirical tests • Efficiency needs to be improved (“The largest data set yet tested with these species tree methods is yeast, with 106 loci in 8 species” Cranston 2009)