Computing Triplet and Quartet Distances Between Trees


  1. Computing Triplet and Quartet Distances Between Trees
     Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen (Aarhus University); Rolf Fagerberg (University of Southern Denmark); Thomas Mailund, Christian N. S. Pedersen, Andreas Sand (Aarhus University, Bioinformatics Research Center).
     Work presented at SODA 2013 and ALENEX 2014.
     Department of Computer Science, University of Copenhagen, 20 January 2014.

  2. Outline
     - Evolutionary trees: rooted vs. unrooted, binary vs. arbitrary degree
     - Tree distances: Robinson-Foulds, triplet, quartet
     - Results and previous work: triplet and quartet distances
     - Algorithms: triplet (quartet)
     - Experimental results (ALENEX 2014)

  3. Rooted Evolutionary Tree
     Figure: a rooted tree drawn along a time axis, with leaves Bonobo, Chimpanzee, Human, Neanderthal, Gorilla and Orangutan.

  4. Unrooted Evolutionary Tree
     The dominant modern approach to studying evolution is DNA analysis.

  5. Constructing Evolutionary Trees – Binary or Arbitrary Degrees?
     Input: sequence data, summarized as an n × n distance matrix over the taxa 1, 2, 3, ..., n.
     - Neighbor Joining (Saitou, Nei 1987): binary trees (despite no evidence in distance data). [O($n^3$), Saitou, Nei 1987]
     - Refined Buneman Trees (Moulton, Steel 1999): arbitrary degree (compromise; good support for all edges). [O($n^3$), Brodal et al. 2003]
     - Buneman Trees (Buneman 1971): arbitrary degrees (strong support for all edges; few branches). [O($n^3$), Berry, Bryant 1999]

  6. Data Analysis vs Expert Trees – Binary vs Arbitrary Degrees?
     Figure: linguistic expert classification (Aryon Rodrigues) next to Neighbor Joining on linguistic data.
     Source: R. S. Walker, S. Wichmann, T. Mailund, C. J. Atkisson. Cultural Phylogenetics of the Tupi Language Family in Lowland South America. PLoS One, 7(4), 2012.

  7. Evolutionary Tree Comparison
     Figure: two unrooted trees T1 and T2 on the leaves 1..8. Removing an edge induces a split of the leaves, e.g. 1357|2468.

     Common        Only T1       Only T2
     1357|2468     35|124678     57|123468
     13567|248     48|123567

     Robinson-Foulds distance = # non-common splits = 2 + 1 = 3
     D. F. Robinson and L. R. Foulds. Comparison of weighted labeled trees. In Combinatorial Mathematics VI, Lecture Notes in Mathematics, pages 119–126. Springer, 1979.
     [Day 1985] O(n) time algorithm using 2 × DFS + radix sort.

  8. Robinson-Foulds Distance (unrooted trees)
     D. F. Robinson and L. R. Foulds. Comparison of weighted labeled trees. In Combinatorial Mathematics VI, Lecture Notes in Mathematics, pages 119–126. Springer, 1979.
     Figure: two unrooted trees T1 and T2 on the leaves 1..8 that differ only in where leaf 8 is attached.

     Common        Only T1       Only T2
     (none)        12567|348     125678|34
                   1257|3468     12578|346
                   157|23468     1578|2346
                   57|123468     578|12346
                                 78|123456

     RF-dist(T1, T2) = 4 + 5 = 9
     RF-dist(T1 \ {8}, T2 \ {8}) = 0
     Robinson-Foulds is very sensitive to outliers.
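
The outlier effect on this slide can be checked with a direct set comparison of the listed splits. This is a minimal sketch, not Day's O(n) algorithm: it assumes single-digit leaf labels, and the names `parse_split`, `canonical_splits` and `rf_distance` are hypothetical helpers introduced only for illustration.

```python
def parse_split(text):
    """Turn '1257|3468' into a pair of frozensets of single-digit leaf labels."""
    left, right = text.split("|")
    return frozenset(map(int, left)), frozenset(map(int, right))


def canonical_splits(split_texts, drop=frozenset()):
    """Return the set of non-trivial splits after deleting the leaves in `drop`."""
    result = set()
    for text in split_texts:
        a, b = (side - drop for side in parse_split(text))
        if len(a) >= 2 and len(b) >= 2:    # a side with fewer than 2 leaves is trivial
            result.add(frozenset((a, b)))  # unordered pair of sides
    return result


def rf_distance(splits1, splits2, drop=frozenset()):
    """Robinson-Foulds distance: number of splits present in exactly one tree."""
    return len(canonical_splits(splits1, drop) ^ canonical_splits(splits2, drop))


# Non-trivial splits of T1 and T2 as listed on the slide.
T1_SPLITS = ["12567|348", "1257|3468", "157|23468", "57|123468"]
T2_SPLITS = ["125678|34", "12578|346", "1578|2346", "578|12346", "78|123456"]

print(rf_distance(T1_SPLITS, T2_SPLITS))                       # 9
print(rf_distance(T1_SPLITS, T2_SPLITS, drop=frozenset({8})))  # 0
```

Deleting the single leaf 8 makes every remaining non-trivial split common to both trees, which is why the distance collapses from 9 to 0.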

  9. Quartet Distance (unrooted trees)
     G. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34:193-200, 1985.
     Consider all $\binom{n}{4}$ quartets, i.e. topologies of subsets of 4 leaves {i, j, k, l}:
     - resolved: ij|kl
     - unresolved: ijkl (only non-binary trees)

     Quartet      T1       T2
     {1,2,3,4}    14|23    14|23
     {1,2,3,5}    13|25    15|23
     {1,2,4,5}    14|25    1245
     {1,3,4,5}    14|35    1345
     {2,3,4,5}    25|34    23|45

     Quartet-dist(T1, T2) = $\binom{n}{4}$ − # common quartets = 5 − 1 = 4
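
The quartet table can be reproduced by brute force over all 4-subsets. This is a sketch under assumptions: the trees are given by one side of each non-trivial split, and the splits used here ({1,4} and {2,5} for T1, {2,3} for T2) are inferred from the quartet table rather than stated on the slide. The function names are hypothetical.

```python
from itertools import combinations
from math import comb


def quartet_topology(quartet, split_sides):
    """Return the resolved topology {{i, j}, {k, l}} of a quartet, or None if unresolved.

    A quartet is resolved as ij|kl exactly when some split of the tree
    separates two of its leaves from the other two.
    """
    quartet = frozenset(quartet)
    for side in split_sides:
        inside = quartet & side
        if len(inside) == 2:
            return frozenset((inside, quartet - inside))
    return None


def quartet_distance(leaves, sides1, sides2):
    """Number of quartets whose topologies differ, by O(n^4) enumeration."""
    common = sum(
        1
        for q in combinations(leaves, 4)
        if quartet_topology(q, sides1) == quartet_topology(q, sides2)
    )
    return comb(len(leaves), 4) - common


LEAVES = [1, 2, 3, 4, 5]
T1_SIDES = [frozenset({1, 4}), frozenset({2, 5})]  # assumed: cherries {1,4} and {2,5}
T2_SIDES = [frozenset({2, 3})]                     # assumed: cherry {2,3}, rest unresolved

print(quartet_distance(LEAVES, T1_SIDES, T2_SIDES))  # 4
```

The enumeration is only meant to make the definition concrete; the algorithms in the later slides avoid looking at individual quartets.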

  10. Triplet Distance (rooted trees)
      D. E. Critchlow, D. K. Pearl, C. L. Qian. The triples distance for rooted bifurcating phylogenetic trees. Systematic Biology, 45(3):323-334, 1996.
      Consider all $\binom{n}{3}$ triplets, i.e. topologies of subsets of 3 leaves {i, j, k}:
      - resolved: k|ij
      - unresolved: ijk (only non-binary trees)

      Triplet    T1      T2
      {1,2,3}    2|13    2|13
      {1,2,4}    1|24    4|12
      {1,2,5}    1|25    5|12
      {1,3,4}    4|13    4|13
      {1,3,5}    5|13    5|13
      {1,4,5}    1|45    1|45
      {2,3,4}    3|24    4|23
      {2,3,5}    3|25    5|23
      {2,4,5}    5|24    2|45
      {3,4,5}    3|45    3|45

      Triplet-dist(T1, T2) = $\binom{n}{3}$ − # common triplets = 10 − 5 = 5
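
The triplet table can likewise be checked by enumeration. In this sketch, rooted trees are nested tuples of integer leaf labels, and the two trees `T1 = ((1,3),((2,4),5))` and `T2 = (((1,3),2),(4,5))` are reconstructed from the triplet table (an assumption, since the drawings did not survive). Function names are hypothetical.

```python
from itertools import combinations
from math import comb


def clusters(tree):
    """Leaf sets below every internal node of a nested-tuple rooted tree."""
    found = []

    def visit(node):
        if isinstance(node, int):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.append(leaves)
        return leaves

    visit(tree)
    return found


def triplet_topology(triplet, tree_clusters):
    """Return the pair {i, j} if the triplet is resolved as k|ij, else None."""
    triplet = frozenset(triplet)
    for cluster in tree_clusters:
        inside = triplet & cluster
        if len(inside) == 2:  # some subtree holds i and j but not k
            return inside
    return None


def triplet_distance(leaves, tree1, tree2):
    """Number of triplets with different topologies, by O(n^3) enumeration."""
    c1, c2 = clusters(tree1), clusters(tree2)
    common = sum(
        1
        for t in combinations(leaves, 3)
        if triplet_topology(t, c1) == triplet_topology(t, c2)
    )
    return comb(len(leaves), 3) - common


T1 = ((1, 3), ((2, 4), 5))  # assumed shape, consistent with the T1 column
T2 = (((1, 3), 2), (4, 5))  # assumed shape, consistent with the T2 column

print(triplet_distance([1, 2, 3, 4, 5], T1, T2))  # 5
```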

  11. Computational Results

                         Rooted: triplet distance          Unrooted: quartet distance
      Binary             O($n^2$)          CPQ 1996        O($n^3$)           D 1985
                         O($n \log^2 n$)   SBFPM 2013      O($n^2$)           BTKL 2000
                         O($n \log n$)     [SODA 2013]     O($n \log^2 n$)    BFP 2001
                                                           O($n \log n$)      BFP 2003
      Arbitrary degrees  O($n^2$)          BDF 2011        O($d^9 n \log n$)  SPMBF 2007
                         O($n \log n$)     [SODA 2013]     O($n^{2.688}$)     NKMP 2011
                                                           O($d\, n \log n$)  [SODA 2013] [ALENEX 2014]

  12. Distance Computation
      Triplet-dist(T1, T2) = B + C + D = $\binom{n}{3}$ − A − E

                         T2 resolved                T2 unresolved
      T1 resolved        A: agree / B: disagree     C
      T1 unresolved      D                          E

      It is sufficient to compute A and E: D + E and C + E are the numbers of triplets unresolved in one tree (T1 and T2, respectively).
      (For binary trees C, D and E are all zero.)
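
Why A and E suffice can be spelled out in a few lines. The symbols $U_1$ and $U_2$ below are shorthand introduced here (not on the slide) for the number of triplets unresolved in T1 and in T2, each of which depends on a single tree only.

```latex
\begin{align*}
A + B + C + D + E &= \binom{n}{3}, \qquad U_1 = D + E, \qquad U_2 = C + E,\\
C &= U_2 - E, \qquad D = U_1 - E,\\
B &= \binom{n}{3} - A - U_1 - U_2 + E,\\
\text{Triplet-dist}(T_1, T_2) &= B + C + D = \binom{n}{3} - A - E .
\end{align*}
```

So once $A$ and $E$ are known, together with the single-tree counts $U_1$ and $U_2$, every cell of the table follows.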

  13. Parameterized Triplet & Quartet Distances
      B + α·(C + D), 0 ≤ α ≤ 1

                         T2 resolved                T2 unresolved
      T1 resolved        A: agree / B: disagree     C
      T1 unresolved      D                          E

      BDF 2011: O($n^2$) for triplet; NKMP 2011: O($n^{2.688}$) for quartet.
      [SODA 2013/ALENEX 2014]: O($n \log n$) and O($d \cdot n \log n$), respectively.
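
A brute-force classification of every triplet into the cells A to E makes the parameterized distance concrete. This is an illustrative sketch only: trees are nested tuples of integer leaves, the two example trees are made up for the demonstration, and `classify_triplets` and `parameterized_distance` are hypothetical names.

```python
from itertools import combinations


def clusters(tree):
    """Leaf sets below every internal node of a nested-tuple rooted tree."""
    found = []

    def visit(node):
        if isinstance(node, int):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.append(leaves)
        return leaves

    visit(tree)
    return found


def topology(triplet, tree_clusters):
    """Pair {i, j} for a resolved triplet k|ij, or None if unresolved."""
    triplet = frozenset(triplet)
    for cluster in tree_clusters:
        inside = triplet & cluster
        if len(inside) == 2:
            return inside
    return None


def classify_triplets(leaves, tree1, tree2):
    """Count triplets per cell: A agree, B disagree, C/D unresolved in T2/T1 only, E both."""
    c1, c2 = clusters(tree1), clusters(tree2)
    counts = dict.fromkeys("ABCDE", 0)
    for t in combinations(leaves, 3):
        x, y = topology(t, c1), topology(t, c2)
        if x is not None and y is not None:
            counts["A" if x == y else "B"] += 1
        elif x is not None:
            counts["C"] += 1  # resolved in T1, unresolved in T2
        elif y is not None:
            counts["D"] += 1  # unresolved in T1, resolved in T2
        else:
            counts["E"] += 1
    return counts


def parameterized_distance(counts, alpha):
    """B + alpha * (C + D) for 0 <= alpha <= 1."""
    return counts["B"] + alpha * (counts["C"] + counts["D"])


# Hypothetical example trees on four leaves, each with one unresolved node.
T1 = ((1, 2, 3), 4)
T2 = ((1, 2), 3, 4)
COUNTS = classify_triplets([1, 2, 3, 4], T1, T2)
print(COUNTS)                               # {'A': 1, 'B': 0, 'C': 2, 'D': 1, 'E': 0}
print(parameterized_distance(COUNTS, 0.5))  # 1.5
```

With α = 1 the value equals the plain triplet distance B + C + D; with α = 0 only genuine disagreements between resolved triplets are counted.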

  14. Counting Unresolved Triplets in One Tree
      Let node v have children with subtree sizes $n_1, n_2, n_3, \ldots, n_d$.
      Triplets anchored at v: $\sum_{v} \sum_{i<j<k} n_i \cdot n_j \cdot n_k$
      Computable in O(n) time using DFS + dynamic programming.
      Quartets (root the tree arbitrarily), anchored at v: $\sum_{v} \Big( \sum_{i<j<k<l} n_i \cdot n_j \cdot n_k \cdot n_l + \big(n - \sum_{l} n_l\big) \sum_{i<j<k} n_i \cdot n_j \cdot n_k \Big)$
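
One way to realize the "DFS + dynamic programming" step is to accumulate elementary symmetric sums of the child sizes while visiting the children of each node. The sketch below assumes nested-tuple trees with integer leaves and reads the quartet term as "four leaves in four different directions from v", where the parent direction holds n minus the leaves below v; the function names are hypothetical.

```python
def count_leaves(node):
    """Number of leaves of a nested-tuple tree with integer leaves."""
    return 1 if isinstance(node, int) else sum(count_leaves(c) for c in node)


def unresolved_counts(tree):
    """Return (#unresolved triplets of the rooted tree, #unresolved quartets of the unrooted tree)."""
    n = count_leaves(tree)
    triplets = quartets = 0

    def visit(node):
        nonlocal triplets, quartets
        if isinstance(node, int):
            return 1
        # e_k = sum over k-subsets of the children seen so far of the product of their sizes
        e1 = e2 = e3 = e4 = 0
        for child in node:
            size = visit(child)
            e4 += e3 * size
            e3 += e2 * size
            e2 += e1 * size
            e1 += size
        triplets += e3                    # three leaves below three different children
        quartets += e4 + (n - e1) * e3    # fourth leaf below a child, or outside the subtree
        return e1

    visit(tree)
    return triplets, quartets


# The tree with cherry {2,3} and an unresolved centre, rooted in two different ways.
print(unresolved_counts(((2, 3), 1, 4, 5)))  # (7, 2)
print(unresolved_counts((2, 3, (1, 4, 5))))  # (4, 2)
```

The triplet count depends on the rooting, while the quartet count (2 in both runs) does not, which is why the slide allows an arbitrary root for quartets. Each node costs time proportional to its degree, so the whole pass is linear apart from the initial leaf count.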

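  15. Counting Agreeing Triplets (Basic Idea)
      For a node v in T1 with d children, color the leaves below the i-th child with color i (1 ≤ i ≤ d) and all leaves outside the subtree of v with color 0. For a node w in T2 with child c, let $n_i^c$ and $n_i^w$ be the numbers of leaves of color i below c and w, and let $n^w = \sum_{1 \le i \le d} n_i^w$ (likewise $n^c$).
      Number of agreeing triplets: $\sum_{v \in T_1} \sum_{w \in T_2} \sum_{c} \sum_{1 \le i \le d} \binom{n_i^c}{2} \cdot \left(n^w - n^c - n_i^w + n_i^c\right)$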
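
The coloring idea can be tried out directly, without any of the later speed-ups. The sketch below recolors from scratch for every pair (v, w), so it is far from the O(n log n) algorithm; it assumes nested-tuple trees with integer leaves, and the example trees are the ones inferred from the triplet table on slide 10. All function names are hypothetical.

```python
from math import comb


def leaves_of(node):
    """List of leaf labels below a node of a nested-tuple tree."""
    if isinstance(node, int):
        return [node]
    return [leaf for child in node for leaf in leaves_of(child)]


def internal_nodes(node):
    """Yield every internal node (tuple) of the tree."""
    if isinstance(node, int):
        return
    yield node
    for child in node:
        yield from internal_nodes(child)


def agreeing_triplets(tree1, tree2):
    """Count resolved triplets with the same topology in both trees."""
    total = 0
    for v in internal_nodes(tree1):
        d = len(v)
        # Colour i for leaves below the i-th child of v; every other leaf has colour 0.
        colour = {leaf: i for i, child in enumerate(v, start=1) for leaf in leaves_of(child)}
        for w in internal_nodes(tree2):
            # rows[c][i] = number of leaves of colour i below child c of w
            rows = []
            for child in w:
                row = [0] * (d + 1)
                for leaf in leaves_of(child):
                    row[colour.get(leaf, 0)] += 1
                rows.append(row)
            n_w = sum(sum(row[1:]) for row in rows)  # non-zero colours below w
            for row in rows:
                n_c = sum(row[1:])
                for i in range(1, d + 1):
                    n_iw = sum(r[i] for r in rows)
                    # pair of colour i below c, third leaf below another child of w
                    # with a colour that is neither 0 nor i
                    total += comb(row[i], 2) * (n_w - n_c - n_iw + row[i])
    return total


T1 = ((1, 3), ((2, 4), 5))  # assumed shapes, as inferred from the slide-10 table
T2 = (((1, 3), 2), (4, 5))
print(agreeing_triplets(T1, T2))  # 5
```

The result matches the 5 common triplets in the slide-10 table.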

  16. Efficient Computation
      - Limit recolorings in T1 (and T2) to O($n \log n$).
        Figure (recursion at a node v of T1 with children $v_1, \ldots, v_d$): precondition, count the T2 contribution, recolor, recurse, recolor & recurse.
      - Reduce the recoloring cost in T2 to O($n \log^2 n$).
        Figure: T2 (arbitrary height and degree) is replaced by H(T2) (binary, height O(log n)).
      - Reduce the recoloring cost in T2 from O($n \log^2 n$) to O($n \log n$): contract T2 and reconstruct H(T2) during the recursion.
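
The O(n log n) bound on recolorings in T1 follows the usual "smaller children pay" argument: if only the leaves outside the largest child subtree are recolored at each node, a leaf is touched at most about log2 n times. The sketch below only measures that quantity on nested-tuple trees; it is an illustration of the counting argument under this assumption, not the recoloring procedure from the papers, and `recolor_cost` is a hypothetical name.

```python
from math import log2


def recolor_cost(tree):
    """Return (leaf count, total number of leaves lying outside the largest child, summed over nodes)."""
    if isinstance(tree, int):
        return 1, 0
    sizes, cost = [], 0
    for child in tree:
        size, child_cost = recolor_cost(child)
        sizes.append(size)
        cost += child_cost
    # At this node, every leaf not below the largest child is charged once.
    return sum(sizes), cost + sum(sizes) - max(sizes)


BALANCED = (((1, 2), (3, 4)), ((5, 6), (7, 8)))
CATERPILLAR = (((((((1, 2), 3), 4), 5), 6), 7), 8)

for tree in (BALANCED, CATERPILLAR):
    n, cost = recolor_cost(tree)
    print(n, cost, cost <= n * log2(n))
```

Each time a leaf is charged, the subtree containing it at least doubles in size on the way up, which gives the n·log2 n ceiling checked in the last column.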

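  17. Counting Agreeing Triplets (II)
      A node in H(T2) corresponds to a component composition in T2: two components $C_1$ and $C_2$ are combined. T1 is colored as before (color 0 outside v, colors 1..d below the children of v).
      Contribution to agreeing triplets at a node in H(T2): a sum of three terms, each taken over 1 ≤ i ≤ d, built from the counters $\binom{n_i^{C_1}}{2}$, $n_i^{C_1} \cdot n_{i\uparrow *}^{C_2}$, $n_{(ii)}^{C_2}$, $n_*^{C_1} - n_i^{C_1}$ and $n_*^{C_2} - n_i^{C_2}$.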

  18. From O($n \log^2 n$) to O($n \log n$)
      - For a node v of T1, work on a compressed version of T2 of size O($n_v$).
      - Update O(1) counters for all colors through a node w of H(T2).
      - Colored path lengths: $\sum_{2 \le i \le d} n_i \cdot \log\frac{|T_2|}{n_i} = \sum_{2 \le i \le d} n_i \cdot \log\frac{n_v}{n_i}$
      - Total cost for updating counters: $\sum_{\text{leaf } l \in T_1} \; \sum_{\text{ancestor } a(j) \text{ not heavy child}} \log\frac{n_{a(j+1)}}{n_{a(j)}} \le n \cdot \log n$, where $l = a(0), a(1), a(2), \ldots$ are the ancestors of $l$ in T1.
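
The bound on the total counter-update cost is a telescoping sum. As a sketch, assume a leaf $l$ has ancestors $l = a(0), a(1), \ldots, a(h)$ with $a(h)$ the root of $T_1$ (the index $h$ is introduced here for illustration), so that $n_{a(0)} = 1$ and $n_{a(h)} = n$:

```latex
\begin{align*}
\sum_{\substack{0 \le j < h \\ a(j)\ \text{not heavy child}}} \log\frac{n_{a(j+1)}}{n_{a(j)}}
  \;\le\; \sum_{j=0}^{h-1} \log\frac{n_{a(j+1)}}{n_{a(j)}}
  \;=\; \log\frac{n_{a(h)}}{n_{a(0)}}
  \;=\; \log n .
\end{align*}
```

The inequality holds because every term is non-negative ($n_{a(j+1)} \ge n_{a(j)}$). Summing over the $n$ leaves of $T_1$ gives at most $n \log n$.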

  19. Counting Quartets...
      - Root T1 and T2 arbitrarily.
      - Keep up to $7d^2 + 97d + 29$ different counters per node in H(T2)...
      - Bottleneck: computing disagreeing resolved-resolved quartets requires $\sum_{1 \le i < d} \; \sum_{i < j \le d} n_{(ij)}^{G_1} \cdot n_{(ij)}^{G_2}$; the double sum costs a factor d in time.

  20. Distance Computation
      Triplet-dist(T1, T2) = B + C + D = $\binom{n}{3}$ − A − E

                         T2 resolved                T2 unresolved
      T1 resolved        A: agree / B: disagree     C
      T1 unresolved      D                          E

      It is sufficient to compute A and E.

  21. ALENEX 2014: Implementation (M.Sc. thesis, Morten Kragelund Holt and Jens Johansen)
      Worst-case number of counters per node in HDT(T2):
      - Triplet, binary: O($n \log n$) time, 6 counters
      - Triplet, arbitrary degree: O($n \log n$) time, $4d + 2$ counters
      - Quartet, binary: O($n \log n$) time, 40 counters
      - Quartet, arbitrary degree:
        O($\max(d_1, d_2)\, n \log n$) time, $2d^2 + 79d + 22$ counters (B, with T1 ↔ T2)
        O($\min(d_1, d_2)\, n \log n$) time, $7d^2 + 97d + 29$ counters (B, no swap) or $d^2 + 12d + 12$ counters (E, no swap)
      Notes:
      - First implementation for triplets for arbitrary degree.
      - Space usage ≈ 10 KB per node for quartet (binary trees); can handle ≈ 1,000,000 leaves.
      - 64 bit integers, except 128 bit integers for values > $n^3$; quartet distance of up to ≈ 2,000,000 leaves.
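
The integer-width remark can be sanity-checked with a little arithmetic: for about 2,000,000 leaves, a count of order $n^3$ still fits in 64 bits, while the number of quartets $\binom{n}{4}$ does not. The sketch uses Python's arbitrary-precision integers; `fits_signed` is a hypothetical helper, and signed two's-complement integers are an assumption, since the slide does not say whether the implementation uses signed or unsigned types.

```python
from math import comb


def fits_signed(value, bits):
    """True if a non-negative value fits in a signed integer of the given width."""
    return value < 2 ** (bits - 1)


n = 2_000_000
triplet_total = comb(n, 3)  # number of triplets, roughly n^3 / 6
quartet_total = comb(n, 4)  # number of quartets, roughly n^4 / 24

print(fits_signed(triplet_total, 64))   # True
print(fits_signed(quartet_total, 64))   # False
print(fits_signed(quartet_total, 128))  # True
```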

  22. Experimental Results: Quartet Distance – Binary Trees
      Plot legend: [SODA 2013], MP 2004, NKMP 2011.
      - [ALENEX 2014] are the first O($n \log n$) implementations.
      - MP 2004: overhead from working with polynomials.

  23. Experimental Results: Quartet Distance – High Degree Trees
      Plot labels: max, [SODA 2013], NKMP 2011, d = 1024, d = 256.
      - [ALENEX 2014] is the first $n \cdot \mathrm{poly}(\log n, d)$ implementation.

  24. Experimental Results: Triplet Distance – Binary Trees
      Plot legend: [SODA 2013], SBFPM 2013.
      - [ALENEX 2014] is the first O($n \log n$) implementation.
      - SBFPM 2013: only binary trees, no contractions.

  25. Experimental Results: Triplet Distance – High Degree Trees
      Plot legend: [SODA 2013] with d = 2, d = 256 and d = 1024; [SBFPM 2013].
      - [ALENEX 2014]: first implementation.
      - Triplet distance appears hardest for binary trees.
