
Evolutionary Analysis: From trees to networks
Dr. Taoyang Wu, School of Computing Sciences, University of East Anglia
Shanghai Jiao Tong University, August 2016
T. Wu Evolutionary Analysis

Research interests: Discrete Mathematics


  2. Degrees
     ◮ $G_{NNI}(n)$ is regular with degree $2(n-3)$ (Robinson 1971).
     ◮ $G_{SPR}(n)$ is regular with degree $2(n-3)(2n-7)$ (Allen & Steel 2001).
     ◮ $G_{TBR}(n)$ is not regular; the maximal degree is attained by caterpillar trees (Humphries, 2008).
     [Figure: A caterpillar tree]

  5. Our result
     Theorem (Humphries-W, TCBB 2013). For each vertex $T \in \mathcal{T}^*_n$ with $n \geq 3$, its degree in $G_{TBR}(n)$ is
     $$4\Gamma(T) - (8n^2 - 18n + 6),$$
     where
     $$\Gamma(T) := \sum_{\{u,v\} \subseteq L(T)} \mathrm{dist}_T(u,v)$$
     denotes the sum of the distances between all pairs of leaves of $T$.
     For the vertices in $G_{TBR}(n)$:
     ◮ Maximal degree: caterpillar trees
     ◮ Minimal degree: semi-regular trees (see also [Szekely-Wang-W, DM 2011])
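
As a minimal sketch of how the degree formula can be evaluated, the Python below computes $\Gamma(T)$ by breadth-first search from every leaf and then applies $4\Gamma(T) - (8n^2 - 18n + 6)$. The edge-list input format and the function names (`from_edges`, `leaf_distance_sum`, `tbr_degree`) are illustrative assumptions, not taken from the talk; distances are assumed to be counted in edges.

```python
from collections import deque


def from_edges(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def leaf_distance_sum(adj):
    """Gamma(T): sum of path lengths (in edges) over all unordered leaf pairs."""
    leaves = [v for v, nbrs in adj.items() if len(nbrs) == 1]
    total = 0
    for start in leaves:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        total += sum(dist[v] for v in leaves)
    return total // 2  # every unordered pair was counted twice


def tbr_degree(adj):
    """Degree in the TBR graph according to the formula quoted on the slide."""
    n = sum(1 for nbrs in adj.values() if len(nbrs) == 1)
    return 4 * leaf_distance_sum(adj) - (8 * n * n - 18 * n + 6)


# Unrooted caterpillar on five leaves: cherries {a, b} and {d, e}, middle leaf c.
caterpillar5 = from_edges([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
                           ("y", "z"), ("d", "z"), ("e", "z")])
print(leaf_distance_sum(caterpillar5), tbr_degree(caterpillar5))
```

For the quartet tree the formula gives degree 2, which matches the fact that the three binary trees on four leaves are pairwise TBR neighbours.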

  9. A key lemma
     Lemma. For two “distinct” TBR operations $\theta$ and $\theta'$, $\theta(T) = \theta'(T)$ implies that both $\theta$ and $\theta'$ are NNI operations.
     Note: Here two TBR operations are distinct if
     ◮ they delete different edges in the bisection step, or
     ◮ they use different edges in the reconnection step.

  11. The PDA model
      ◮ The number of trees in $\mathcal{T}_n$ is $\varphi(n) := (2n-3)!! = 1 \cdot 3 \cdots (2n-3)$.
      ◮ Under the proportional to distinguishable arrangements (PDA) model, each tree is generated with the same probability, that is,
        $$P_u(T) = \frac{1}{\varphi(n)} \qquad (1)$$
        for every $T$ in $\mathcal{T}_n$.
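
The count $\varphi(n)$ and the uniform probability in (1) can be sketched in a few lines of Python. This assumes that $\mathcal{T}_n$ denotes rooted binary phylogenetic trees on $n$ labelled leaves, which is the set that $(2n-3)!!$ counts; the function names are illustrative.

```python
from fractions import Fraction


def num_trees(n):
    """phi(n) = (2n - 3)!! = 1 * 3 * ... * (2n - 3), for n >= 2."""
    if n < 2:
        raise ValueError("n must be at least 2")
    result = 1
    for odd in range(1, 2 * n - 2, 2):  # odd numbers 1, 3, ..., 2n - 3
        result *= odd
    return result


def pda_probability(n):
    """Probability of any single tree with n leaves under the PDA model."""
    return Fraction(1, num_trees(n))


print(num_trees(4), pda_probability(4))
```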

  15. The YHK model
      Under the Yule–Harding model [Yule 1925, Harding 1971]:
      ◮ Beginning with a two-leaf tree, we “grow” it by repeatedly splitting a leaf into two new leaves.
      ◮ The leaf to split is chosen uniformly at random among all leaves of the current tree.
      ◮ After obtaining an unlabeled tree with $n$ leaves, we label its leaves with labels sampled uniformly at random (without replacement) from $\{1, \dots, n\}$.
      When branch lengths are ignored, the Yule–Harding model was shown [Aldous, 1996] to be equivalent to the trees generated by Kingman’s coalescent process, and so we call it the YHK model.
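
The three steps above can be sketched as a small sampler. The nested-tuple tree representation and the names `sample_yhk_tree` and `leaf_labels` are assumptions made for this illustration only.

```python
import random


def sample_yhk_tree(n, rng=None):
    """Sample a rooted binary tree with n >= 2 labelled leaves (Yule-Harding growth)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    rng = rng or random.Random()
    children = {0: (1, 2)}  # start from the two-leaf tree rooted at vertex 0
    leaves = [1, 2]
    next_id = 3
    while len(leaves) < n:
        i = rng.randrange(len(leaves))  # uniform choice among current leaves
        v = leaves[i]
        children[v] = (next_id, next_id + 1)  # split v into two new leaves
        leaves[i] = next_id
        leaves.append(next_id + 1)
        next_id += 2
    labels = list(range(1, n + 1))
    rng.shuffle(labels)  # uniform labelling without replacement
    label_of = dict(zip(leaves, labels))

    def build(v):
        if v in children:
            left, right = children[v]
            return (build(left), build(right))
        return label_of[v]

    return build(0)


def leaf_labels(tree):
    """List the leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaf_labels(tree[0]) + leaf_labels(tree[1])


print(sample_yhk_tree(6, random.Random(2016)))
```

Passing a seeded `random.Random` makes a run reproducible; the recursive `build` is adequate for small `n` but would need an explicit stack for very deep trees.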

  17. Subtree Pattern
      ◮ Cherry: a subtree with two leaves
      ◮ Pitchfork: a subtree with three leaves
      [Figure: A tree with three cherries and one pitchfork.]

  20. Subtree Pattern II
      Given a phylogenetic tree $T$, let
      ◮ $A(T)$: the number of pitchforks;
      ◮ $C(T)$: the number of cherries.
      For $n \geq 2$, consider the random variables
      ◮ $A_n$: the number of pitchforks in a random tree;
      ◮ $C_n$: the number of cherries in a random tree.
      What is the joint distribution of $A_n$ and $C_n$?
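
A short sketch of how $A(T)$ and $C(T)$ could be computed for a rooted binary tree stored as nested 2-tuples. The representation and the name `count_patterns` are illustrative, and the sketch assumes that the cherry inside a pitchfork is also counted as a cherry.

```python
def count_patterns(tree):
    """Return (A(T), C(T)): the numbers of pitchforks and cherries of a rooted binary tree."""
    pitchforks = 0
    cherries = 0

    def size(node):
        nonlocal pitchforks, cherries
        if not isinstance(node, tuple):
            return 1  # a leaf
        leaves_below = size(node[0]) + size(node[1])
        if leaves_below == 2:
            cherries += 1
        elif leaves_below == 3:
            pitchforks += 1  # in a binary tree, a 3-leaf subtree is a pitchfork
        return leaves_below

    size(tree)
    return pitchforks, cherries


# A seven-leaf tree with three cherries and one pitchfork.
example = (((1, 2), 3), ((4, 5), (6, 7)))
print(count_patterns(example))
```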

  22. Joint distributions: formulae
      Theorem (W-Choi, 2016). For $n > 3$ and $1 < b < n$, we have
      $$\begin{aligned}
      P_y(A_{n+1} = a, C_{n+1} = b) ={} & \frac{2a}{n} P_y(A_n = a, C_n = b) + \frac{a+1}{n} P_y(A_n = a+1, C_n = b-1) \\
      & + \frac{2(b-a+1)}{n} P_y(A_n = a-1, C_n = b) + \frac{n-a-2b+2}{n} P_y(A_n = a, C_n = b-1).
      \end{aligned}$$
      Note: A similar formula holds for the PDA model.
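
The recursion lends itself to a simple dynamic program with exact rational arithmetic. The sketch below is an illustration under stated assumptions: it starts from the four-leaf case, where the caterpillar shape (one pitchfork, one cherry) has probability 2/3 and the balanced shape (no pitchfork, two cherries) has probability 1/3 under the YHK model, and it applies the recursion to every pair $(a, b)$, treating probabilities outside the support as zero. The function name is hypothetical.

```python
from fractions import Fraction


def joint_distribution(n_max):
    """Return {(a, b): P_y(A_n = a, C_n = b)} for n = n_max >= 4 under the YHK model."""
    if n_max < 4:
        raise ValueError("this sketch starts from n = 4")
    # Base case n = 4 (assumption of this sketch, see the lead-in).
    dist = {(1, 1): Fraction(2, 3), (0, 2): Fraction(1, 3)}
    for n in range(4, n_max):
        nxt = {}
        for a in range(0, n + 2):
            for b in range(1, n + 2):
                # The four terms of the recursion; missing states count as 0.
                p = (Fraction(2 * a, n) * dist.get((a, b), 0)
                     + Fraction(a + 1, n) * dist.get((a + 1, b - 1), 0)
                     + Fraction(2 * (b - a + 1), n) * dist.get((a - 1, b), 0)
                     + Fraction(n - a - 2 * b + 2, n) * dist.get((a, b - 1), 0))
                if p:
                    nxt[(a, b)] = p
        dist = nxt
    return dist


dist10 = joint_distribution(10)
mean_cherries = sum(b * p for (a, b), p in dist10.items())
print(sum(dist10.values()), mean_cherries)
```

Using `Fraction` avoids floating-point drift, so the probabilities can be checked to sum to exactly 1.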

  27. Statistical properties
      ◮ A dynamic approach to computing the joint distributions.
      ◮ A unified approach to calculating the moments of the joint (and the marginal) distributions.
      ◮ The cherry distributions are log-concave. That is, for $n > 2$ and $1 < k < n$, we have
        $$P_y(C_n = k)^2 \geq P_y(C_n = k+1)\, P_y(C_n = k-1).$$
      ◮ There exists a unique change point for the cherry distributions between the YHK and the PDA models.
      ◮ Similar results hold for clade sizes and clan sizes [Zhu-Than-W, 2015].
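
The log-concavity statement can be checked numerically for small $n$. The sketch below does not use a formula from the slides; it assumes the one-step recursion $P_y(C_{n+1} = k) = \frac{2k}{n} P_y(C_n = k) + \frac{n-2k+2}{n} P_y(C_n = k-1)$, which follows from the growth rule (splitting one of the $2k$ cherry leaves keeps the cherry count, splitting any other leaf adds one cherry). Function names are illustrative.

```python
from fractions import Fraction


def cherry_distribution(n):
    """Return [P_y(C_n = k) for k = 0..n] under the YHK model, for n >= 2."""
    if n < 2:
        raise ValueError("n must be at least 2")
    probs = [Fraction(0), Fraction(1), Fraction(0)]  # n = 2: exactly one cherry
    for m in range(2, n):
        nxt = [Fraction(0)] * (m + 2)
        for k in range(1, m + 1):
            nxt[k] = (Fraction(2 * k, m) * probs[k]
                      + Fraction(m - 2 * k + 2, m) * probs[k - 1])
        probs = nxt
    return probs


def is_log_concave(probs):
    """Check P(k)^2 >= P(k+1) * P(k-1) for 1 < k < n, where n = len(probs) - 1."""
    n = len(probs) - 1
    return all(probs[k] ** 2 >= probs[k + 1] * probs[k - 1] for k in range(2, n))


print(cherry_distribution(6)[:4], is_log_concave(cherry_distribution(6)))
```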

  28. Part III: Phylogenetic Networks

  29. The tangled tree of life

  30. From trees to networks
      Phylogenetic trees are useful, but networks provide a better tool for studying
      ◮ conflicting signals
      ◮ recombination
      ◮ gene flow
      ◮ hybridization
      ◮ horizontal gene transfer
      ◮ · · ·

  31. Phylogenetic Networks: Unrooted
      [Figure: A phylogenetic tree and network relating 15 plant species from the genus Solanum; from [Bastkowski-Moulton-Spillner-Wu, 2015, Bull. Math. Biol.]]

  32. Network thinking: pedigree
      [Figure: A partial pedigree of Prince Charles; from [Gusfield, 2014].]

  33. Recombination
      [Figure: A history with recombination; from [Gusfield, 2014].]

  34. Phylogenetic Networks
      A (rooted) phylogenetic network:
      ◮ a directed acyclic graph
      ◮ a unique root
      ◮ leaves are labelled by taxa
      ◮ no vertex with one parent and one child
      ◮ binary
      A central problem: How to reconstruct phylogenetic networks?
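
The structural conditions listed above can be sketched as a checker on a directed edge list. The interpretation of "binary" used here is an assumption of this sketch: the root has two children, tree vertices have one parent and two children, reticulations have two parents and one child, and leaves have one parent and no child. Leaf labelling is not checked, and the function name is illustrative.

```python
from collections import defaultdict, deque


def is_binary_phylogenetic_network(edges):
    """Check the structural conditions for a graph given as (parent, child) pairs."""
    if len(set(edges)) != len(edges):
        return False  # assumption: parallel edges are not allowed
    children = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        children[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    roots = [v for v in nodes if indeg[v] == 0]
    if len(roots) != 1:
        return False  # the root must be unique
    allowed = {(0, 2), (1, 2), (2, 1), (1, 0)}  # root, tree vertex, reticulation, leaf
    for v in nodes:
        if (indeg[v], len(children[v])) not in allowed:
            return False  # this also rules out one parent and one child
    # Kahn's algorithm: the graph is acyclic iff every vertex can be removed.
    remaining = {v: indeg[v] for v in nodes}
    queue = deque(roots)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for w in children[u]:
            remaining[w] -= 1
            if remaining[w] == 0:
                queue.append(w)
    return removed == len(nodes)


# A network on leaves a, b, c with a single reticulation h above b.
network = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
           ("v", "h"), ("v", "c"), ("h", "b")]
print(is_binary_phylogenetic_network(network))
```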

  38. Assembling trees: Supertree
      [Figure: Input trees]
      ◮ A tree is encoded by its subtrees on three leaves.
      ◮ A polynomial-time algorithm to assemble trees [Aho et al. 1981].
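
The algorithm of Aho et al. is commonly presented as the recursive BUILD procedure on rooted triples; the following is a minimal sketch of that idea, not code from the talk. A triple `(a, b, c)` is read as "a and b are closer to each other than either is to c"; this encoding, the nested-tuple output and the helper names are assumptions of the illustration.

```python
def build(leaves, triples):
    """Return a tree (nested tuples) displaying all triples, or None if none exists."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    parent = {x: x for x in leaves}  # union-find over the current leaf set

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    relevant = [t for t in triples if set(t) <= leaves]
    for a, b, _ in relevant:
        parent[find(a)] = find(b)  # a and b must lie below the same child of the root
    blocks = {}
    for x in leaves:
        blocks.setdefault(find(x), set()).add(x)
    if len(blocks) == 1:
        return None  # the leaves cannot be split: the triples are incompatible
    subtrees = []
    for block in blocks.values():
        subtree = build(block, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)


def clusters(tree):
    """Set of leaf sets below the internal vertices (an order-independent summary)."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    walk(tree)
    return found


tree = build("abcd", [("a", "b", "c"), ("a", "b", "d"), ("c", "d", "a")])
print(sorted(sorted(c) for c in clusters(tree)))
```

The returned tree may contain vertices with more than two children when the triples do not force a binary resolution.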

  40. A Quiz!
      Question: Are networks encoded by their trees?
      [Figure: a network $N$ and trees $T_1$ and $T_2$, each with root $\rho$ and leaves a, b, c]

  41. Answer
      Question: Are networks encoded by their trees?
      [Figure: networks $N$ and $N'$ and trees $T_1$ and $T_2$, each with root $\rho$ and leaves a, b, c]
      Answer: No.

  43. Another quiz!
      Question: Are networks encoded by their subnetworks?
      [Figure: An example of a subnetwork.]

  45. A nontrivial answer
      Theorem (Huber-Iersel-Moulton-Wu, 2015, Syst. Biol.). For every $n \geq 3$, there exist two non-isomorphic phylogenetic networks $N_1$ and $N_2$ with $n$ leaves that display the same set of subnetworks (and the same set of trees).
      [Figure: two networks on leaves a, b, c, d]

  46. Level-1 networks
      In [Huber-Moulton, 2013, Algorithmica], it is shown that level-1 networks are encoded by their subnetworks.
      [Figure: a network $N$ on leaves a to j; level-1 = all undirected cycles are disjoint]

  47. Trinets
      [Figure: Eight types of level-1 networks on three leaves: $T_1(x,y;z)$, $N_1(x,y;z)$, $N_2(x,y;z)$, $S_1(x,y;z)$, $N_3(x;y;z)$, $N_4(x;y;z)$, $N_5(x;y;z)$, $S_2(x;y;z)$.]

  49. Assembling Trinets
      Input: A collection of trinets.
      Task:
      (1) Decide whether there exists a binary level-1 phylogenetic network displaying the collection of trinets.
      (2) Construct such a network if it exists.
      [Figure: Input trinets]

  52. Incomplete data
      In [Huber-Iersel-Moulton-Scornavacca-Wu, in revision for Algorithmica], we show that when some trinets are missing,
      ◮ the trinet assembling problem is NP-hard;
      ◮ it can be solved by an $O(3^n \, \mathrm{poly}(n))$ algorithm.
      Question: How about ‘real data’ (often noisy and containing conflicting signals)?

  53. Trilonet
      [Figure: A schematic view of the Trinet-based Level One Network reconstructor (Trilonet): an alignment on $X = \{a, \dots, j\}$; a dense set of trinets; identifying a suitable subset of taxa; the resulting network $N$. From [Oldman∗-Wu∗-Iersel-Moulton, in revision for MBE].]

  54. Trilonet: a case study
      [Figure: The inferred phylogeny of 7 Giardia strains by Trilonet; data from [Cooper et al., Curr. Biol., 2007]. Taxa: Giardia_lamblia_ATCC_50803_WB, Giardia_intestinalis_isolate_246, Giardia_intestinalis_isolate_303, Giardia_intestinalis_isolate_305, Giardia_intestinalis_isolate_55, Giardia_intestinalis_isolate_JH, Giardia_intestinalis_isolate_335; one reticulation labelled #H1.]

  57. Trilonet
      Trilonet is an algorithm for inferring level-1 networks:
      ◮ Constructs a network directly from sequence data (without using breakpoints or gene trees).
      ◮ Efficient, and robust to noisy data.
      ◮ Implemented in Java, and will be available at https://www.uea.ac.uk/computing/trilonet
      ◮ Consistent.
      Future improvements include
      ◮ level-k networks
      ◮ statistical consistency

  58. Part IV: Future Directions

  62. Network models and inference
      More realistic models:
      ◮ Superimposing molecular evolutionary models on edges
      ◮ Quantifying the contribution made by reticulate processes
      Reconstructing networks:
      ◮ Rigorous statistical frameworks (Maximum Likelihood or Bayesian)
      ◮ Accounting for non-tree-like patterns resulting from
        ◮ Sequencing errors (e.g. SNP calling)
        ◮ Incomplete Lineage Sorting (see, e.g., Yu et al. 2014 PNAS)
      ◮ Efficient algorithms for searching the network space

  63. Space of phylogenetic networks
      [Figure: Space of level-1 networks with four taxa; from [Huber-Linz-Moulton-Wu, J. Math. Biol., 2016]]

  64. Network operation
      [Figure: A generalisation of the NNI operation on networks; panels (i) and (ii) show $T$, $T'$, $N$ and $N'$.]
