1
play

1 Molecular characters Nucleotide sequences structural genes - PDF document

Phylogenetics 3: Methods to reconstruct phylogenies A generalized protocol for molecular phylogenetics, and the associated concerns with each step Concerns: Collect homolgous sequences gene tree-species tree / paralogyorthology / trees


  1. Phylogenetics 3: Methods to reconstruct phylogenies A generalized protocol for molecular phylogenetics, and the associated concerns with each step Concerns: Collect homolgous sequences gene tree-species tree / paralogy–orthology / trees within trees Multiple sequence alignment positional homology / gaps / subjectivity-objectivity / methods philosophy / methods / consistency / power and accuracy Phylogeny estimation Test reliability or fit of phylogenetic branch support / tree comparison / statistic issues with trees estimates independent contrasts / impact of error on conclusions Interpretation and application 1

  2. Molecular characters • Nucleotide sequences structural genes (protein, RNA, regulatory) non-structural genes (introns, intergenic sequences, pseudogenes) • Protein sequences translate DNA to protein thousands to choose from • INDELS nucleotides, amino acids, genes, segments of DNA, etc. • DNA-DNA hybridization • Restriction fragment length polymorphisms (RFLPs) • Large scale genomic rearrangements A generalized protocol for molecular phylogenetics, and the associated concerns with each step Concerns: Collect homolgous sequences gene tree-species tree / paralogy–orthology / trees within trees Multiple sequence alignment positional homology / gaps / subjectivity-objectivity / methods philosophy / methods / consistency / power and accuracy Phylogeny estimation Test reliability or fit of phylogenetic branch support / tree comparison / statistic issues with trees estimates independent contrasts / impact of error on conclusions Interpretation and application 2

  3. Molecular characters: multiple sequence alignment Some methods: 1. sum-all-pairs method: count cost of aligning all pairs of sequences and select alignment that minimizes the total cost 2. star alignment: alignment based on tree that assumes all seqeunces are equally related 3. tree alignment: uses “ known ” information about relationships of sequences (lineages) to guide the alignment Take course called “ Bioinformatics ” (BIOC 4010 / BIOL 4041) to learn more of alignments. Species 1 … T A G … Species 2 … T A G … Species 3 … T A A … Species 4 … T C A … Species 5 … C C A … 3

  4. Molecular characters: multiple sequence alignment Molecular characters: DNA alignment β Alignment of the nucleotide character states of the -globin gene from five species of mammals human cow rabbit rat opossum GTG CTG TCT CCT GCC GAC AAG ACC AAC GTC AAG GCC GCC TGG GGC AAG GTT GGC GCG CAC ... ... ... G.C ... ... ... T.. ..T ... ... ... ... ... ... ... ... ... .GC A.. ... ... ... ..C ..T ... ... ... ... A.. ... A.T ... ... .AA ... A.C ... AGC ... ... ..C ... G.A .AT ... ..A ... ... A.. ... AA. TG. ... ..G ... A.. ..T .GC ..T ... ..C ..G GA. ..T ... ... ..T C.. ..G ..A ... AT. ... ..T ... ..G ..A .GC ... GCT GGC GAG TAT GGT GCG GAG GCC CTG GAG AGG ATG TTC CTG TCC TTC CCC ACC ACC AAG ... ..A .CT ... ..C ..A ... ..T ... ... ... ... ... ... AG. ... ... ... ... ... .G. ... ... ... ..C ..C ... ... G.. ... ... ... ... T.. GG. ... ... ... ... ... .G. ..T ..A ... ..C .A. ... ... ..A C.. ... ... ... GCT G.. ... ... ... ... ... ..C ..T .CC ..C .CA ..T ..A ..T ..T .CC ..A .CC ... ..C ... ... ... ..T ... ..A ACC TAC TTC CCG CAC TTC GAC CTG AGC CAC GGC TCT GCC CAG GTT AAG GGC CAC GGC AAG ... ... ... ..C ... ... ... ... ... ... ... ..G ... ... ..C ... ... ... ... G.. ... ... ... ..C ... ... ... T.C .C. ... ... ... .AG ... A.C ..A .C. ... ... ... ... ... ... T.T ... A.T ..T G.A ... .C. ... ... ... ... ..C ... .CT ... ... ... ..T ... ... ..C ... ... ... ... TC. .C. ... ..C ... ... A.C C.. ..T ..T ..T ... The order of DNA sequences in the alignment is specified by the order of the taxa in the list. To fit it on the page, the alignment is broken into three parts; such alignments are called INTERLEAVED . The complete DNA sequence is shown for the fist taxon (human). All the other sequences are shown relative to human, with the dot, “.”, signifying a match in the character state with the human sequences. Differences are indicated by using the single-letter nucleotide code (A,C,T or G). Note that this alignment could also be analyzed by using distance, likelihood, and Bayesian methods. Positional homology is always assumed when constructing alignments 4

  5. Molecular characters: presence-absence data matrix (easy) Hypothetical presence-absence data matrix for a diversity of molecular characters Species 1 1 0 0 1 1 0 1 0 1 0 1 0 0 0 0 1 0 1 1 1 0 1 1 1 0 0 Species 2 1 0 1 0 1 1 1 0 0 0 1 0 0 1 0 0 1 1 1 1 0 1 0 1 0 1 Species 3 1 1 0 0 1 0 1 1 1 0 0 0 0 1 1 0 0 1 1 1 0 1 0 1 0 1 Species 4 1 1 0 0 1 0 1 1 1 0 0 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1 Species 5 1 0 1 0 1 1 1 0 0 0 1 0 0 1 0 0 0 1 1 1 0 1 0 1 1 1 Amino acid INDELS Pseudogene / Presence / absence Tandem gene in 8 different genes functional gene of transposon duplication elements at 8 events different genomic locations Molecular characters: positional homology of gaps are a real pain in the ass 10 20 30 40 50 60 ....|....| ....|....| ....|....| ....|....| ....|....| ....|....| Mus2.FAS MTTPALLPLS -----GRRIP PLNL--GPP- ----SFPHHR ATLRLSEKFI LLLILSAFIT Human_GIA ---------- ---------- ---------- -----MNSNF ITFDLKMSLL PSNLFSAFIT Human_GIB MTTPALLPLS -----GRRIP PLNL--GPP- ----SFPHHR ATLRLSEKFI LLLILSAFIT Mus_GIA MPVGGLLPLF SSPGGGGLGS GLGGGLGGG- ----RKGSGP AAFRLTEKFV LLLVFSAFIT Rabbit_GIA ---------- ---------- ---------- ---------- ---------- ---------- Sus_GIA MPVGGLLPLF SSPAGGGLGG GLGGGLGGGG GGGGRKGSGP SAFRLTEKFV LLLVFSAFIT 70 80 90 100 110 120 ....|....| ....|....| ....|....| ....|....| ....|....| ....|....| Mus2.FAS LCFGAFFFLP DSSKHKRFDL G-LEDVLIPH VDAGKG---- AKNPGVFLIH GPDEHRHREE Human_GIA LCFGAIFFLP DSSKLLSGVL FHSSPALQPA ADHKPGPGAR AEDAAEGRAR RREEGAPGDP Human_GIB LCFGAFFFLP DSSKHKRFDL G-LEDVLIPH VDAGKG---- AKNPGVFLIH GPDEHRHREE Mus_GIA LCFGAIFFLP DSSKLLSGVL FHSNPALQPP AEHKPGLGAR AEDAAEGRVR HREEGAPGDP Rabbit_GIA ---------- ---------- ---------- ---------- AEDAADGRAR PGEEGAPGDP Sus_GIA LCFGAIFFLP DSSKLLSGVL FHSSPALQPA ADHKPGPGAR AEDAADGRAR PGEEGAPGDP An alignment of real amino acid sequences of the mannosidase protein 5

  6. Molecular characters: multiple sequence alignment • software is far from flawless (many different methods) • all alignments must be inspected “ by eye ” • any manual adjustments (by eye) introduces subjectivity • one solution is to publish alignments: • public database • in scientific paper • supplementary online materials of a scientific journal A generalized protocol for molecular phylogenetics, and the associated concerns with each step Concerns: Collect homolgous sequences gene tree-species tree / paralogy–orthology / trees within trees Multiple sequence alignment positional homology / gaps / subjectivity-objectivity / methods philosophy / methods / consistency / power and accuracy Phylogeny estimation Test reliability or fit of phylogenetic branch support / tree comparison / statistic issues with trees estimates independent contrasts / impact of error on conclusions Interpretation and application 6

  7. Molecular phylogenetics: methods We divide methods up by two criteria (data and method): Type of data: 1. characters: discrete character states at positionally homologous sites in a multiple sequence alignment (hence, discrete character methods) 2. distances: evolutionary distance, measured in average numbers of substitutions per positional homologous sites, between all pairs of taxa (hence, distance methods) Species 1 ! T A G ! Species 2 ! T A G ! Species 3 ! T A A ! Species 4 ! T C A ! Species 5 ! C C A ! Molecular phylogenetics: methods We divide methods up by two criteria (data and method): Type of data: 1. characters: discrete character states at positionally homologous sites in a multiple sequence alignment (hence, discrete character methods) 2. distances: evolutionary distance, measured in average numbers of substitutions per positional homologous sites, between all pairs of taxa (hence, distance methods) 7

  8. Molecular phylogenetics: methods We divide methods up by two criteria (data and method): Type of data: 1. characters: discrete character states at positionally homologous sites in a multiple sequence alignment (hence, discrete character methods) 2. distances: evolutionary distance, measured in average numbers of substitutions per positional homologous sites, between all pairs of taxa (hence, distance methods) Type of tree-building: 1. clustering algorithm: computationally “ build a tree ” according to a specific set of “ steps ” . 2. optimality criterion: a criterion for scoring a tree and comparing different trees with the goal of finding the tree with the best (optimal) score. [also called objective function] Molecular phylogenetics: most common methods Type of data Character Distance Tree-building method UPGMA Clustering algorithm Neighbor-joining (NJ) Maximum parsimony Least squares (MP) Optimality Minimum evolution criterion Maximum likelihood (ME) (ML) 8

Recommend


More recommend