DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens

Katherine St. John, City University of New York

Thanks to the DIMACS staff: Linda Casals, Walter Morris, Nicole Clark.


  2. Distance-Based Methods
     Popular distance-based methods include:
     • Neighbor Joining (Saitou & Nei ‘87), which repeatedly joins the “nearest neighbors” to build a tree;
     • UPGMA (“Unweighted Pair Group Method with Arithmetic Mean”) (Sneath & Sokal ‘73), which similarly clusters close taxa, assuming the rate of evolution is the same across lineages; and
     • Quartet-based methods, which decide the topology for every set of four taxa and then assemble these to form a tree (Berry et al. 1999, 2000, 2001).

  5. Other Distance-Based Methods
     • Weighbor (Bruno et al. ‘00) is a weighted version of Neighbor Joining that chooses which taxa to combine using a likelihood function of the distances.
     • Disk Covering Method (Warnow et al. ‘98, ‘99, ‘04): a divide-and-conquer approach of theoretical interest that has been combined with many other methods.

  10. Neighbor Joining (NJ)
      • [Saitou & Nei 1987]: very popular and fast: O(n^3).
        – Based on the distances between nodes, join neighboring leaves, replace them by their parent, calculate the distances to this new node, and repeat.
        – This process eventually returns a binary (fully resolved) tree.
        – Joining the leaves with the minimal distance does not suffice, so subtract the averaged distances to compensate for long edges.
        – Experimental work shows that NJ trees are reasonably accurate, provided the rate of evolution is neither too low nor too high.
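The join, replace, and repeat loop on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the tutorial's own code: the function name `neighbor_joining`, the nested-tuple tree representation, and the toy distance matrix are all assumptions made for the example, and branch lengths are not computed. The selection rule subtracts the averaged distances from each pairwise distance, as the slide describes.

```python
from itertools import combinations

def neighbor_joining(taxa, dist):
    """Return an unrooted tree as a tuple of three nested-tuple subtrees.

    taxa: list of leaf labels (assumed to be strings).
    dist: dict mapping frozenset({x, y}) to the distance between x and y.
    """
    d = dict(dist)
    nodes = list(taxa)

    def get(x, y):
        return d[frozenset((x, y))]

    while len(nodes) > 3:
        n = len(nodes)
        # Sum of distances from each node to all the others.
        total = {x: sum(get(x, y) for y in nodes if y != x) for x in nodes}
        # Pick the pair minimising distance minus averaged distances.
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: get(*p) - (total[p[0]] + total[p[1]]) / (n - 2),
        )
        parent = (a, b)
        rest = [k for k in nodes if k != a and k != b]
        # Distances from the new parent node to every remaining node.
        for k in rest:
            d[frozenset((parent, k))] = 0.5 * (get(a, k) + get(b, k) - get(a, b))
        nodes = rest + [parent]
    return tuple(nodes)

# Toy additive distance matrix on five taxa (illustrative data only).
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
toy_dist = {
    frozenset((names[i], names[j])): matrix[i][j]
    for i, j in combinations(range(len(names)), 2)
}
nj_tree = neighbor_joining(names, toy_dist)
print(nj_tree)
```

Each pass scans all pairs of the remaining nodes, which is where the O(n^3) total comes from: O(n^2) work per join and about n joins.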

  12. Quartet Methods
      • A quartet is an unrooted binary tree on four taxa. The taxa a, b, c, d admit three possible quartets:
        {ab|cd}   {ac|bd}   {ad|bc}
        [Figure: the three quartet trees, one per topology.]
      • Let Q(T) = all quartets that agree with T. [Erdős et al. 1997]: T can be reconstructed from Q(T) in polynomial time.

  19. Quartet Methods
      • Quartet-based methods operate in two phases:
        – Construct a quartet on every set of four taxa.
        – Combine these quartets into a tree.
      • Running time:
        – For most optimizations, determining a quartet is fast.
        – There are Θ(n^4) quartets, giving Ω(n^4) running time.
        – In practice, the input quality is insufficient to ensure that all quartets are accurately inferred.
        – Quartet methods have to handle incorrect quartets.
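As an illustration of the first phase, the sketch below enumerates every four-taxon subset and picks a topology for each one. The slides do not say how a quartet is determined; this example assumes a distance-based rule (choose the pairing with the smallest sum of within-pair distances, which is correct when the distances are additive). The function name `quartet_from_distances` and the toy matrix are hypothetical.

```python
from itertools import combinations
from math import comb

def quartet_from_distances(d, w, x, y, z):
    """Pick the quartet on {w, x, y, z} whose paired distances sum to the least.

    d is a nested dict: d[p][q] is the distance between taxa p and q.
    Returns a frozenset holding the two sides of the split.
    """
    options = [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]
    best = min(options, key=lambda o: d[o[0][0]][o[0][1]] + d[o[1][0]][o[1][1]])
    return frozenset(frozenset(side) for side in best)

# Toy additive distances on five taxa (illustrative data only).
taxa = "abcde"
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
d = {p: dict(zip(taxa, row)) for p, row in zip(taxa, matrix)}

# Phase one: one quartet per four-taxon subset, C(n, 4) in total.
quartets = [quartet_from_distances(d, *four) for four in combinations(taxa, 4)]
print(len(quartets), "quartets for", len(taxa), "taxa")
```

The number of subsets is C(n, 4) = n(n-1)(n-2)(n-3)/24, which is the Θ(n^4) count on the slide; even a constant-time rule per quartet therefore costs Ω(n^4) overall.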

  22. Popular Quartet Methods
      • Q* or Naive Method [Berry & Gascuel ‘97, Buneman ‘71]: Only add edges that agree with all input quartets. Doesn’t tolerate errors; outputs a conservative but unresolved tree.
      • Quartet Cleaning (QC) [Berry et al. 1999]: Add edges with a small number of errors proportional to q_e. Many variants: all handle a small number of errors.
      • Quartet Puzzling [Strimmer & von Haeseler 1996]: “Order taxa randomly, greedily add edges, repeat 1000 times.” Output majority tree. Most popular with biologists.

  26. Constructing Networks
      • What if evolution isn’t tree-like? For example:
        [Figure from W.P. Maddison, Systematic Biology ‘97.]

  29. Network Methods
      • Split Decomposition (Bandelt & Dress ‘92) decomposes the distance matrix into a sum of “split” metrics plus a small residue, yielding a set of splits (bipartitions of taxa).
      • NeighborNet (Bryant & Moulton ‘02) is an agglomerative clustering algorithm that uses splits to produce networks.
      • TCS (Posada & Crandall ‘01) estimates gene phylogenies using the statistical parsimony method.

  30. Input to Reconstruction Algorithms
      • Almost all assume that the data is aligned.
        [Figure: alignment of bacterial genes by Geneious (Drummond ‘06).]
      • Many assume corrections have been made for the underlying model of evolution.

  35. Models of Evolution
      • The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution.
      • A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T.
        [Figure: a rooted binary tree with AACGT at the root. Its two children are AACGT and AACGA (edges labeled 0 and 1); the next level shows ACCCT, GACGT, AACGA, GGCGT (edges labeled 2, 1, 1, 3); the bottom level shows GACGT, AACGT, GACGT, GGCGA (edges labeled 0, 1, 0, 1).]

  37. Models of Evolution
      • The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution.
      • A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T, leaving the sequences at the leaves:
        {ACCCT, GACGT, AACGT, GACGT, GGCGA}

  38. Models of Evolution
      • The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution.
      • A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T.
      • The assumptions of the model are:
        1. the sites (i.e., the positions within the sequences) evolve independently and identically,
        2. if a site changes state, it changes with equal probability to each of the remaining states, and
        3. the number of changes of each site on an edge e is a Poisson random variable with expectation λ(e) (this is also called the “length” of the edge e).
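The three assumptions determine how likely a site is to differ between the two ends of an edge. The formulas below are the standard consequence of those assumptions rather than something stated on the slides; p denotes the probability that a site differs across the edge e, a symbol introduced here for illustration.

```latex
% Probability that a site shows the same state at both ends of edge e,
% and the complementary probability p that it differs:
\Pr[\text{same}] = \frac{1}{4} + \frac{3}{4}\, e^{-4\lambda(e)/3},
\qquad
p = \Pr[\text{different}] = \frac{3}{4}\left(1 - e^{-4\lambda(e)/3}\right).

% Inverting the second formula recovers the edge length from p:
\lambda(e) = -\frac{3}{4}\,\ln\!\left(1 - \frac{4}{3}\,p\right).
```

Because repeated changes at one site can hide each other, p saturates at 3/4 as λ(e) grows. The inverse formula is the kind of correction that distance-based methods assume has been applied to the observed fraction of differing sites.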

  40. How Methods Use Models of Evolution
      • As an explicit part of the algorithm: for example, maximum likelihood, Weighbor.
      • Indirectly, via assumptions on the data or by inputting data that has been corrected under a certain model.

  44. Testing Methods Empirically
      • How accurate are the methods at reconstructing trees?
      • In biological applications, the true, historical tree is almost never known, which makes assessing the quality of phylogenetic reconstruction methods problematic.
      • Simulation is used instead to evaluate methods, given a model of evolution.

  48. Simulation Studies
      1. Construct a “model” tree.
      2. “Evolve” sequences down the tree.
           A  GTTAGAAGGCGGCCA...
           B  CATTTGTCCTAACTT...
           C  CAAGAGGCCACTGCA...
           D  CCGACTTCCAACCTC...
           E  ATGGGGCACGATGGA...
           F  TACAAATACGCGCAA...
      3. Reconstruct the tree using the method.
      4. Evaluate the accuracy of the constructed tree.

  53. Simulating Data: Choosing Trees
      • Usually chosen from a random distribution on trees: Uniform, or Yule-Harding (birth-death trees).
        [Figure: an example tree shape.]
      • Can view this as two different random processes:
        – generate the tree shape, and then
        – assign weights or branch lengths to the shape.
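The two-stage view can be sketched as follows for the Yule-Harding case, where the shape grows by repeatedly splitting a leaf chosen uniformly at random. The function names, the integer node ids, and the uniform branch-length range are assumptions made for this example; the slides do not specify a branch-length distribution.

```python
import random

def yule_shape(n_leaves, rng):
    """Grow a rooted binary tree shape by splitting uniformly chosen leaves.

    Returns (children, leaves): children maps an internal node id to its
    two child ids, and leaves lists the ids of the current leaves.
    """
    children = {}
    leaves = [0]          # start from a single node with id 0 (the root)
    next_id = 1
    while len(leaves) < n_leaves:
        split = leaves.pop(rng.randrange(len(leaves)))
        children[split] = (next_id, next_id + 1)
        leaves += [next_id, next_id + 1]
        next_id += 2
    return children, leaves

def assign_lengths(children, rng, low=0.05, high=0.5):
    """Give every edge (keyed by its child node) a random length.

    The uniform range [low, high] is an arbitrary illustrative choice.
    """
    return {
        child: rng.uniform(low, high)
        for kids in children.values()
        for child in kids
    }

rng = random.Random(0)
shape, shape_leaves = yule_shape(6, rng)       # stage 1: the tree shape
edge_lengths = assign_lengths(shape, rng)      # stage 2: the branch lengths
print(len(shape_leaves), "leaves,", len(edge_lengths), "edges")
```

Keeping the two stages separate makes it easy to swap in a different shape distribution (for example, uniform over tree shapes) without touching the branch-length step.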

  54. Simulating Data: Evolving Sequences
      • The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution.
      • A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T.
        [Figure: a rooted binary tree with AACGT at the root. Its two children are AACGT and AACGA (edges labeled 0 and 1); the next level shows ACCCT, GACGT, AACGA, GGCGT (edges labeled 2, 1, 1, 3); the bottom level shows GACGT, AACGT, GACGT, GGCGA (edges labeled 0, 1, 0, 1).]

  55. Simulating Data: Evolving Sequences
      • The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution.
      • A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T, leaving the sequences at the leaves:
        {ACCCT, GACGT, AACGT, GACGT, GGCGA}
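A minimal sketch of evolving a root sequence down a tree under the Jukes-Cantor assumptions might look like the following. The tree encoding (a dict from node name to a list of child and edge-length pairs), the function names, and the example edge lengths are all hypothetical; the Poisson draw counts arrivals of a rate-1 process because the Python standard library has no Poisson sampler.

```python
import random

BASES = "ACGT"

def poisson(lam, rng):
    """Draw a Poisson(lam) count by counting rate-1 arrivals in [0, lam)."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def evolve_edge(seq, lam, rng):
    """Evolve each site independently along one edge of length lam."""
    out = []
    for base in seq:
        for _ in range(poisson(lam, rng)):
            # Each change moves to one of the three other bases, equally likely.
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)

def evolve(tree, root, root_seq, rng):
    """Return a dict mapping each leaf to its evolved sequence."""
    seqs = {root: root_seq}
    leaf_seqs = {}
    stack = [root]
    while stack:
        node = stack.pop()
        kids = tree.get(node, [])
        if not kids:
            leaf_seqs[node] = seqs[node]
        for child, lam in kids:
            seqs[child] = evolve_edge(seqs[node], lam, rng)
            stack.append(child)
    return leaf_seqs

# Hypothetical four-leaf model tree; edge lengths are illustrative only.
model_tree = {
    "root": [("u", 0.1), ("v", 0.2)],
    "u": [("A", 0.3), ("B", 0.1)],
    "v": [("C", 0.1), ("D", 0.4)],
}
leaf_seqs = evolve(model_tree, "root", "AACGT", random.Random(1))
print(leaf_seqs)
```

Only the leaf sequences are returned, mirroring the slide: a reconstruction method sees the sequences at the leaves, not the internal ones.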

  59. Evaluating Accuracy
      • To compare the reconstructed tree to the model tree, the Robinson-Foulds score is often used:
        (False Positives + False Negatives) / (total edges)
        [Figure: two example trees on the leaves a, b, c, d, e, f.]
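One way to compute this score is to compare the sets of nontrivial splits (internal edges) of the two trees. The sketch below assumes trees are given as nested tuples of leaf names and takes "total edges" to mean the number of internal edges in both trees combined; that normalization is an assumption, since the slide does not spell it out. The two example trees are made up for illustration.

```python
def leaf_set(tree):
    """Return the frozenset of leaf names below a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(sub) for sub in tree))

def splits(tree):
    """Return the nontrivial splits, each stored as the side without a fixed leaf."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)          # reference leaf used to pick a canonical side
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        if ref in side:
            side = all_leaves - side
        # Skip trivial splits (a single leaf against everything else).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        for sub in node:
            walk(sub)

    walk(tree)
    return found

def robinson_foulds(model, inferred):
    """Return (false positives, false negatives, normalized score)."""
    m, i = splits(model), splits(inferred)
    fp = len(i - m)                # edges in the inferred tree but not the model
    fn = len(m - i)                # edges in the model tree that were missed
    total = len(m) + len(i)
    return fp, fn, ((fp + fn) / total if total else 0.0)

# Two hypothetical trees on the same six leaves.
model = ((("a", "b"), "c"), (("d", "e"), "f"))
inferred = ((("a", "c"), "b"), (("d", "e"), "f"))
fp, fn, score = robinson_foulds(model, inferred)
print(fp, fn, score)
```

A score of 0 means the two trees have identical internal edges, and 1 means they share none.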
