The Use of Molecular Data to Infer the History of Species and Genes

Aims of this course:
• To introduce the theory and practice of phylogenetic inference from molecular data
• To introduce some of the most useful methods and programs

Some basic concepts
[Portraits: Richard Owen, Charles Darwin]

Owen’s definition of homology
• Homologue: the same organ under every variety of form and function (true or essential correspondence - homology)
• Analogy: superficial or misleading similarity
Richard Owen 1843
Darwin and homology
• “The natural system is based upon descent with modification .. the characters that naturalists consider as showing true affinity (i.e. homologies) are those which have been inherited from a common parent, and, in so far as all true classification is genealogical; that community of descent is the common bond that naturalists have been seeking”
Charles Darwin, Origin of species 1859 p. 413

Homology is...
• Homology: similarity that is the result of inheritance from a common ancestor
• The identification and analysis of homologies is central to phylogenetics (the study of the evolutionary history of genes and species)
• Similarity and homology are not the same thing, although the terms are often and wrongly used interchangeably

Phylogenetic systematics
• Uses tree diagrams to portray relationships based upon recency of common ancestry
• There are two types of trees commonly displayed in publications:
– Cladograms
– Phylograms

Cladograms and phylograms
• Cladograms show branching order - branch lengths are meaningless
• Phylograms show branch order and branch lengths
[Figure: the same seven taxa (Bacterium 1-3, Eukaryote 1-4) drawn as a cladogram and as a phylogram]

Groups on trees
• A monophyletic group (a clade) contains species derived from a unique common ancestor with respect to the rest of the tree
• A paraphyletic group is one which includes only some of the descendants of a common ancestor (e.g. a group comprising animals without humans would be paraphyletic)
• A polyphyletic group is not a group at all! (e.g. if we put all things with wings in a single group)

Rooting trees using an outgroup
[Figure: an unrooted tree of archaea and eukaryotes, and the same tree rooted by a bacterial outgroup. Labels: Unrooted tree; Rooted by outgroup; bacteria outgroup; root; Monophyletic group (twice)]
Baldauf (2003). Phylogeny for the faint of heart: a tutorial. Trends in Genetics 19:345-351.
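The group definitions above can be checked mechanically on a rooted tree. The following is a minimal sketch, assuming the tree is written as nested tuples of taxon names; the topology of TREE and the helper names `clades` and `is_monophyletic` are hypothetical illustrations and are not taken from the slides.

```python
# Hypothetical rooted tree as nested tuples (illustrative topology only,
# not the tree drawn on the slide).
TREE = (
    ("Bacterium 1", ("Bacterium 2", "Bacterium 3")),
    (("Eukaryote 1", "Eukaryote 2"), ("Eukaryote 3", "Eukaryote 4")),
)


def clades(tree):
    """Return (leaves of this subtree, set of every clade inside it)."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, {leaves}
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves = leaves | child_leaves
        found |= child_clades
    found.add(leaves)  # the subtree itself is a clade
    return leaves, found


def is_monophyletic(tree, group):
    """True if `group` is exactly the leaf set of some subtree."""
    _, found = clades(tree)
    return frozenset(group) in found


eukaryotes = {"Eukaryote 1", "Eukaryote 2", "Eukaryote 3", "Eukaryote 4"}
print(is_monophyletic(TREE, eukaryotes))
print(is_monophyletic(TREE, eukaryotes - {"Eukaryote 4"}))
```

A group that fails this check is either paraphyletic or polyphyletic; telling those two apart needs the additional step of finding the group's most recent common ancestor, which this sketch does not attempt.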
The use of molecules to reconstruct the past
DNA sequences can be used to make ‘family trees’ of species or genes
[Figure: gene sequences descending from a common ancestral sequence]
Gene Sequence
GAACTCGACG
GATCTCGACG
GATCTGGGCG
GCTCTGGGCA
GCTCTGCGTA

Molecules as documents of evolutionary history
• “We may ask the question where in the now living systems the greatest amount of information of their past history has survived and how it can be extracted”
• “Best fit are the different types of macromolecules (sequences) which carry the genetic information”
[Portrait: Linus Pauling]

An alignment involves hypotheses of positional homology between bases or amino acids
                 <---------------(--------------------HELIX 19---------------------)
                 <---------------(22222222-000000-111111-00000-111111-0000-22222222
Thermus ruber    UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA
Th. thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA
E.coli           UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA
Ancyst.nidulans  UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA
B.subtilis       UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA
Chl.aurantiacus  UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA
match            ** *** * ** ** * **
Alignment of 16S rRNA sequences from different bacteria

Exploring patterns in sequence data 1:
• Which sequences should we use?
• Do the sequences contain phylogenetic signal for the relationships of interest? (might be too conserved or too variable)
• Are there features of the data which might mislead us about evolutionary relationships?
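The "match" row under the alignment marks columns where every sequence carries the same base. A minimal sketch of how such a row could be computed is given here, assuming the six sequences exactly as printed on the slide; the function name `match_line` is hypothetical, and treating "-" and "=" as helix delimiters rather than bases is an assumption about the slide's notation.

```python
# Aligned 16S rRNA fragments copied from the slide.
ALIGNMENT = {
    "Thermus ruber":    "UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA",
    "Th. thermophilus": "UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA",
    "E.coli":           "UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA",
    "Ancyst.nidulans":  "UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA",
    "B.subtilis":       "UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA",
    "Chl.aurantiacus":  "UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA",
}

DELIMITERS = "-="  # assumed to be structural markers, not bases


def match_line(sequences):
    """Return a string with '*' under every column that is identical in
    all sequences and is a base rather than a delimiter."""
    sequences = list(sequences)
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    marks = []
    for i in range(length):
        column = {s[i] for s in sequences}
        only = next(iter(column))
        marks.append("*" if len(column) == 1 and only not in DELIMITERS else " ")
    return "".join(marks)


for name, seq in ALIGNMENT.items():
    print(f"{name:17}{seq}")
print(f"{'match':17}{match_line(ALIGNMENT.values())}")
```

Columns with few or no asterisks are the variable sites; a region with none at all may be too variable to align reliably, while a region that is all asterisks carries no signal for these taxa.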
Is there a molecular clock?
• The idea of a molecular clock was initially suggested by Zuckerkandl and Pauling in 1962
• They noted that rates of amino acid replacements in animal haemoglobins were roughly proportional to time - as judged against the fossil record

The molecular clock for alpha-globin:
Each point represents the number of substitutions separating each animal from humans
[Figure: number of substitutions (0-100) plotted against time to common ancestor (0-500 millions of years); points for cow, chicken, platypus, carp and shark]

Small subunit ribosomal RNA
[Figure: 18S or 16S rRNA]

Rates of amino acid replacement in different proteins
Protein            Rate (mean replacements per site per 10^9 years)
Fibrinopeptides    8.3
Insulin C          2.4
Ribonuclease       2.1
Haemoglobins       1.0
Cytochrome C       0.3
Histone H4         0.01

There is no universal molecular clock
• The initial proposal saw the clock as a Poisson process with a constant rate
• Now known to be more complex - differences in rates occur for:
– different sites in a molecule
– different genes
– different regions of genomes
– different genomes in the same cell
– different taxonomic groups for the same gene
• There is no universal molecular clock affecting all genes
• There might be ‘local’ clocks but they need to be carefully tested and calibrated

Clock literature
• Benton and Ayala (2003) Dating the tree of life. Science 300: 1698-1700.
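To make the constant-rate Poisson picture concrete, the tabulated rates can be turned into expected numbers of replacements. This is a minimal sketch under two stated assumptions that are not in the slides: each rate is taken to apply along a single lineage, and the rate is held constant over the whole interval (the very assumption the slides go on to question). The function names are hypothetical.

```python
import math

# Rates from the table above: mean replacements per site per 10^9 years.
RATES = {
    "Fibrinopeptides": 8.3,
    "Insulin C": 2.4,
    "Ribonuclease": 2.1,
    "Haemoglobins": 1.0,
    "Cytochrome C": 0.3,
    "Histone H4": 0.01,
}


def expected_replacements(rate_per_1e9_years, years):
    """Expected replacements per site along one lineage (assumed
    constant rate)."""
    return rate_per_1e9_years * years / 1e9


def prob_site_unchanged(rate_per_1e9_years, years):
    """Poisson probability that a site has had no replacement at all."""
    return math.exp(-expected_replacements(rate_per_1e9_years, years))


for protein, rate in RATES.items():
    mean = expected_replacements(rate, 100e6)  # 100 million years
    print(f"{protein:16} {mean:.4f} replacements/site, "
          f"P(unchanged) = {prob_site_unchanged(rate, 100e6):.3f}")
```

The spread between fibrinopeptides and histone H4 shows why a protein that is informative at one timescale can be nearly invariant, or close to randomised, at another.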
Rate heterogeneity is a common problem in phylogenetic analyses
• Differences in rates occur between:
– different sites in a molecule (e.g. at different codon positions)
– different genes on genomes
– different regions of genomes
– different genomes in the same cell
– different taxonomic groups for the same gene
• We need to consider these issues when we make trees - otherwise we can get the wrong tree

Unequal rates in different lineages may cause us to recover the wrong tree
• Felsenstein (1978) made a simple model phylogeny including four taxa and a mixture of short and long branches
[Figure: TRUE TREE and WRONG TREE for taxa A, B, C and D, with branch lengths p and q, p > q]
• All methods are susceptible to “long branch” problems
• Methods which do not assume that all sites change at the same rate are generally better at recovering the true tree

Chaperonin 60 Protein Maximum Likelihood Tree (PROTML, Roger et al. 1998, PNAS 95: 229)
[Figure: tree with the longest branches marked]

Saturation in sequence data:
• Saturation is due to multiple changes at the same site in a sequence
• Most data will contain some fast evolving sites which are potentially saturated (e.g. in proteins often position 3)
• In severe cases the data becomes essentially random and all information about relationships can be lost

Multiple changes at a single site - hidden changes
[Figure: Seq 1 AGCGAG and Seq 2 GCGGAC. At a single site, Seq 1 is shown passing through C, G, T, A and Seq 2 through C, A, with the changes along each lineage numbered]

Convergence can also mislead our methods:
• Thermophilic convergence or biased codon usage patterns may obscure phylogenetic signal
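One standard way to see the effect of hidden changes is to compare the observed proportion of differing sites with a distance corrected for multiple hits. The sketch below uses the Jukes-Cantor correction, which is not named in the slides and is brought in here only as the simplest illustration; it assumes equal base frequencies and equal rates for all changes. The function names are hypothetical.

```python
import math


def p_distance(seq1, seq2):
    """Observed proportion of sites that differ between two aligned
    sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def jukes_cantor_distance(p):
    """Estimated substitutions per site under the Jukes-Cantor model.
    Returns infinity once p reaches 0.75, the level expected for
    unrelated random sequences (full saturation)."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The two short sequences shown on the hidden-changes slide.
p = p_distance("AGCGAG", "GCGGAC")
print(f"observed p-distance: {p:.3f}")
print(f"Jukes-Cantor distance: {jukes_cantor_distance(p):.3f}")
```

The corrected distance exceeds the observed one because some sites have changed more than once; as p approaches 0.75 the correction grows without bound, which is the numerical face of saturation.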
% Guanine + Cytosine in 16S rRNA genes from mesophiles and thermophiles
                           %GC all sites   %GC variable sites
Thermophiles:
Thermotoga maritima        62              72
Thermus thermophilus       64              72
Aquifex pyrophilus         65              73
Mesophiles:
Deinococcus radiodurans    55              52
Bacillus subtilis          55              50

External data suggests that Deinococcus and Thermus share a recent common ancestor
• Most gene trees e.g. RecA, GroEL place them together
• Both have the same very unusual cell wall based upon ornithine
• Both have the same menaquinones (Mk 9)
• Both have the same unusual polar lipids
• Congruence between these complex characters supports a phylogenetic relationship between Deinococcus and Thermus

Shared nucleotide or amino acid composition biases can cause the wrong tree to be recovered
[Figure: the true tree, in which Thermus groups with Deinococcus, beside the wrong 16S rRNA tree, in which Aquifex (73% G+C) groups with Thermus (72%) and Bacillus (50%) groups with Deinococcus (52%)]
Most phylogenetic methods will give the wrong tree

Gene trees and species trees - why might they differ?
• Gene duplication
• Horizontal gene transfer between species
• Can be difficult to distinguish from each other
• Both can produce trees that conflict with accepted ideas of species relationships based upon external data

Gene duplication, orthologues and paralogues
[Figure: an ancestral gene undergoes duplication to give 2 copies = paralogues on the same genome; in the resulting tree, copies related by speciation are labelled orthologous and copies related by the duplication are labelled paralogous]
Sampling a mixture of orthologues and paralogues can mislead us about species relationships

Gene trees and species trees
[Figure: a gene tree and a species tree drawn side by side]
We often assume that gene trees give us species trees
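Composition figures like those in the %GC table are straightforward to compute, and checking them before building a tree is one way to spot the bias described above. A minimal sketch follows, assuming sequences are plain strings of A, C, G and T or U with any other character (gaps, delimiters) ignored; the function name `gc_percent` and the example fragments are hypothetical and are not data from the slides.

```python
def gc_percent(sequence):
    """Percentage of G + C among the recognised bases in a sequence.
    Characters other than A, C, G, T and U are skipped."""
    bases = [b for b in sequence.upper() if b in "ACGTU"]
    if not bases:
        raise ValueError("sequence contains no recognised bases")
    gc_count = sum(b in "GC" for b in bases)
    return 100.0 * gc_count / len(bases)


# Hypothetical fragments, for illustration only.
examples = {
    "GC-rich fragment": "GCGGCCGCGA",
    "AT-rich fragment": "AUUAGAUAUC",
}
for name, seq in examples.items():
    print(f"{name}: {gc_percent(seq):.1f}% G+C")
```

Running the same function on the variable sites alone, as the table does, gives a sharper picture, because conserved sites dilute the differences between taxa.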
The malic enzyme gene tree contains a mixture of orthologues and paralogues
[Figure: malic enzyme gene tree with a gene duplication marked and support values of 75-100 on several branches. Tips, in the order shown, with cellular compartment where given: Homo sapiens 1 (Cyt), Anas platyrhynchos (Cyt; Anas = a duck!), Homo sapiens 2 (Mit), Ascaris suum (Mit), Zea mays (Ch), Flaveria trinervia (Ch), Populus trichocarpa (Ch), Solanum tuberosum (Mit), Amaranthus (Mit), Neocallimastix (Hyd), Trichomonas vaginalis (Hyd), Giardia lamblia (Cyt), Schizosaccharomyces, Saccharomyces, Lactococcus lactis. Group labels: Plant chloroplast; Plant mitochondrion]

Horizontal gene transfer does occur between species
[Figure: Chaperonin 60 Protein Maximum Likelihood Tree (PROTML, Roger et al. 1998, PNAS 95: 229)]