Triplet and Quartet Distances Between Trees of Arbitrary Degree
Gerth Stølting Brodal (Aarhus University), Rolf Fagerberg (University of Southern Denmark), Thomas Mailund, Christian N. S. Pedersen, Andreas Sand (Aarhus University, Bioinformatics Research Center)
ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana, USA, 8 January 2013
Rooted Evolutionary Tree
[Figure: rooted tree drawn against a time axis, with leaves Bonobo, Chimpanzee, Human, Neanderthal, Gorilla, Orangutan]
Unrooted Evolutionary Tree
The dominant modern approach to studying evolution is based on DNA analysis.
Constructing Evolutionary Trees: Binary or Arbitrary Degrees?
Input: sequence data, distance matrix ($n \times n$)
Neighbor Joining (Saitou, Nei 1987): binary trees (despite no evidence in distance data). $O(n^3)$ [Saitou, Nei 1987]
Refined Buneman Trees (Moulton, Steel 1999): arbitrary degrees (compromise; good support for all edges). $O(n^3)$ [Brodal et al. 2003]
Buneman Trees (Buneman 1971): arbitrary degrees (strong support for all edges; few branches). $O(n^3)$ [Berry, Bryant 1999]
Data Analysis vs Expert Trees: Binary vs Arbitrary Degrees?
R. S. Walker, S. Wichmann, T. Mailund, C. J. Atkisson. Cultural Phylogenetics of the Tupi Language Family in Lowland South America. PLoS One 7(4), 2012.
[Figure: two trees of the Tupi language family: linguistic expert classification (Aryon Rodrigues) and Neighbor Joining on linguistic data]
Evolutionary Tree Comparison
[Figure: unrooted trees T1 and T2 on leaves 1-8; one edge is labeled with its split, 1357|2468]
Common         Only T1        Only T2
1357|2468      35|124678      57|123468
13567|248      48|123567
Robinson-Foulds distance = number of non-common splits = 2 + 1 = 3
D. F. Robinson and L. R. Foulds. Comparison of weighted labeled trees. In Combinatorial Mathematics VI, Lecture Notes in Mathematics, pages 119-126. Springer, 1979.
[Day 1985] $O(n)$ time algorithm using 2 × DFS + radix sort
Robinson-Foulds Distance (unrooted trees) (Robinson and Foulds 1979)
[Figure: unrooted trees T1 and T2 on leaves 1-8]
Common: (none)
Only T1        Only T2
12567|348      125678|34
1257|3468      12578|346
157|23468      1578|2346
57|123468      578|12346
               78|123456
RF-dist(T1, T2) = 4 + 5 = 9
RF-dist(T1 \ {8}, T2 \ {8}) = 0
Robinson-Foulds is very sensitive to outliers.
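The outlier effect can be checked directly on the split lists from this slide. The following is a minimal sketch, not Day's linear-time algorithm: it stores each split as a set of two leaf sets and takes the symmetric difference. The helper names `parse_split`, `rf_distance` and `restrict` are hypothetical and chosen only for illustration.

```python
def parse_split(text):
    """Turn '57|123468' into an unordered pair of leaf sets (leaves are single digits)."""
    left, right = text.split("|")
    return frozenset((frozenset(left), frozenset(right)))

def rf_distance(splits1, splits2):
    """Robinson-Foulds distance: number of splits found in exactly one tree."""
    return len(splits1 ^ splits2)

def restrict(splits, keep):
    """Remove leaves outside `keep`; drop splits that become trivial."""
    result = set()
    for side_a, side_b in splits:
        a, b = side_a & keep, side_b & keep
        if len(a) >= 2 and len(b) >= 2:
            result.add(frozenset((a, b)))
    return result

# Non-trivial splits of T1 and T2, copied from the slide.
T1_SPLITS = {parse_split(s) for s in
             ["12567|348", "1257|3468", "157|23468", "57|123468"]}
T2_SPLITS = {parse_split(s) for s in
             ["125678|34", "12578|346", "1578|2346", "578|12346", "78|123456"]}

WITHOUT_8 = frozenset("1234567")
print(rf_distance(T1_SPLITS, T2_SPLITS))
print(rf_distance(restrict(T1_SPLITS, WITHOUT_8), restrict(T2_SPLITS, WITHOUT_8)))
```

With the slide's data the two printed values are 9 and 0, matching RF-dist(T1, T2) and RF-dist(T1 \ {8}, T2 \ {8}).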
Quartet Distance (unrooted trees)
G. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34:193-200, 1985.
Consider all $\binom{n}{4}$ quartets, i.e. topologies of subsets of 4 leaves {i, j, k, l}:
resolved: ij|kl
unresolved: ijkl (only non-binary trees)
[Figure: unrooted trees T1 and T2 on leaves 1-5]
Quartet       T1       T2
{1,2,3,4}     14|23    14|23
{1,2,3,5}     13|25    15|23
{1,2,4,5}     14|25    1245
{1,3,4,5}     14|35    1345
{2,3,4,5}     25|34    23|45
Quartet-dist(T1, T2) = $\binom{n}{4}$ - # common quartets = 5 - 1 = 4
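The definition can be sketched as a brute-force check over all quartets. This is an illustration only, far slower than the algorithms discussed in the talk. It decides each quartet with the four-point condition on unit-length path distances. The two edge lists are an assumption: they were reconstructed from the quartet table, since the figure itself did not survive, and the internal node names `a`, `b`, `c`, `x`, `y` are made up.

```python
from collections import deque
from itertools import combinations

def adjacency(edges):
    """Build an adjacency map from an undirected edge list."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, []).append(w)
        adj.setdefault(w, []).append(u)
    return adj

def distances(adj, source):
    """Breadth-first search: number of edges from `source` to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def quartet_topology(dist, a, b, c, d):
    """Return the resolved pairing of {a, b, c, d}, or None if unresolved."""
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    sums = [dist[p][q] + dist[r][s] for (p, q), (r, s) in pairings]
    best = min(sums)
    if sums.count(best) > 1:  # in a tree a tie means all three sums are equal
        return None
    first, second = pairings[sums.index(best)]
    return frozenset((frozenset(first), frozenset(second)))

def quartet_distance(edges1, edges2, leaves):
    """Number of quartets whose topology differs between the two trees."""
    adj1, adj2 = adjacency(edges1), adjacency(edges2)
    dist1 = {x: distances(adj1, x) for x in leaves}
    dist2 = {x: distances(adj2, x) for x in leaves}
    return sum(quartet_topology(dist1, *q) != quartet_topology(dist2, *q)
               for q in combinations(leaves, 4))

# Assumed reconstruction of the slide's trees (leaves 1-5).
T1_EDGES = [(1, "a"), (4, "a"), ("a", "b"), (3, "b"), ("b", "c"), (2, "c"), (5, "c")]
T2_EDGES = [(1, "x"), (4, "x"), (5, "x"), ("x", "y"), (2, "y"), (3, "y")]

print(quartet_distance(T1_EDGES, T2_EDGES, [1, 2, 3, 4, 5]))
```

For these reconstructed trees the result is 4, in agreement with the table.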
Triplet Distance (rooted trees)
D. E. Critchlow, D. K. Pearl, C. L. Qian. The triples distance for rooted bifurcating phylogenetic trees. Systematic Biology, 45(3):323-334, 1996.
Consider all $\binom{n}{3}$ triplets, i.e. topologies of subsets of 3 leaves {i, j, k}:
resolved: k|ij
unresolved: ijk (only non-binary trees)
[Figure: rooted trees T1 and T2 on leaves 1-5]
Triplet     T1      T2
{1,2,3}     2|13    2|13
{1,2,4}     1|24    4|12
{1,2,5}     1|25    5|12
{1,3,4}     4|13    4|13
{1,3,5}     5|13    5|13
{1,4,5}     1|45    1|45
{2,3,4}     3|24    4|23
{2,3,5}     3|25    5|23
{2,4,5}     5|24    2|45
{3,4,5}     3|45    3|45
Triplet-dist(T1, T2) = $\binom{n}{3}$ - # common triplets = 10 - 5 = 5
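A brute-force sketch of this definition, useful only as a reference for small inputs. Trees are written as nested tuples, and a triplet {i, j, k} is resolved as k|ij when some internal node has i and j below it but not k. The two example trees are an assumption: they were reconstructed from the triplet table because the figure is lost. The function names are hypothetical.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf sets of all internal nodes, sorted list of all leaves)."""
    found = []

    def leaves_below(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*map(leaves_below, node))
        found.append(below)
        return below

    all_leaves = leaves_below(tree)
    return found, sorted(all_leaves)

def triplet_topology(found, i, j, k):
    """Return the leaf that is split off (x in x|yz), or None if unresolved."""
    for outlier, a, b in ((k, i, j), (j, i, k), (i, j, k)):
        if any(a in c and b in c and outlier not in c for c in found):
            return outlier
    return None

def triplet_distance(tree1, tree2):
    """Number of triplets whose topology differs between the two trees."""
    found1, leaves = clades(tree1)
    found2, _ = clades(tree2)
    return sum(triplet_topology(found1, *t) != triplet_topology(found2, *t)
               for t in combinations(leaves, 3))

# Assumed reconstruction of the slide's trees (leaves 1-5).
T1_TREE = ((1, 3), ((2, 4), 5))
T2_TREE = (((1, 3), 2), (4, 5))

print(triplet_distance(T1_TREE, T2_TREE))
```

For these reconstructed trees the result is 5, matching the table. Each triplet costs time linear in the number of internal nodes, so this is nowhere near the $O(n \log n)$ bound presented in the talk.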
Computational Results
[Figure: small example trees, rooted and unrooted, binary and of arbitrary degree]
Rooted trees, triplet distance
  Binary: $O(n^2)$ CPQ 1996; $O(n \log n)$ [SODA 2013]
  Degrees d: $O(n^2)$ BDF 2011; $O(n \log n)$ [SODA 2013]
Unrooted trees, quartet distance
  Binary: $O(n^3)$ D 1985; $O(n^2)$ BTKL 2000; $O(n \log^2 n)$ BFP 2001; $O(n \log n)$ BFP 2003
  Degrees d: $O(d^9 n \log n)$ SPMBF 2007; $O(n^{2.688})$ NKMP 2011; $O(d\,n \log n)$ [SODA 2013]
Distance Computation
Triplet-dist(T1, T2) = $\binom{n}{3} - A - E = B + C + D$
                   T2 resolved               T2 unresolved
T1 resolved        A: agree, B: disagree     C
T1 unresolved      D                         E
$A + B + C + D + E = \binom{n}{3}$
$D + E$ and $C + E$ each count the triplets unresolved in one tree (T1 and T2, respectively).
It is sufficient to compute A and E, or A and B.
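Why A and B also suffice can be spelled out in a few lines. This is a sketch of the arithmetic only. The symbols $N$, $U_1$ and $U_2$ are introduced here for illustration and are not on the slide; $U_1$ and $U_2$ are the numbers of triplets unresolved in T1 and T2, each computable from a single tree.

```latex
\begin{align*}
N &= \binom{n}{3}, \qquad U_1 = D + E, \qquad U_2 = C + E \\
C + D + E &= N - A - B \\
C + D + 2E &= U_1 + U_2 \\
E &= U_1 + U_2 - (N - A - B) \\
\text{Triplet-dist}(T_1, T_2) &= N - A - E
\end{align*}
```

In the binary example with 5 leaves, $U_1 = U_2 = 0$, $A = 5$ and $B = 5$, so $E = 0$ and the distance is $10 - 5 - 0 = 5$.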
Parameterized Triplet & Quartet Distances
$B + \alpha \cdot (C + D)$, $0 \le \alpha \le 1$
                   T2 resolved               T2 unresolved
T1 resolved        A: agree, B: disagree     C
T1 unresolved      D                         E
BDF 2011: $O(n^2)$ for triplet; NKMP 2011: $O(n^{2.688})$ for quartet
[SODA 13]: $O(n \log n)$ and $O(d\,n \log n)$, respectively
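As a small illustration, the five categories and the parameterized distance can be computed from a table of topologies. The sketch below uses the five-leaf quartet table from the Quartet Distance slide as input; the function names and the string encoding (`"14|23"` resolved, `"1245"` unresolved) are assumptions made for this example.

```python
def normalize(topology):
    """'14|23' becomes an order-independent value; unresolved ('1245') becomes None."""
    if "|" not in topology:
        return None
    return frozenset(frozenset(side) for side in topology.split("|"))

def classify(table):
    """Count the categories A (agree), B (disagree), C, D and E."""
    counts = dict.fromkeys("ABCDE", 0)
    for top1, top2 in table.values():
        t1, t2 = normalize(top1), normalize(top2)
        if t1 is not None and t2 is not None:
            counts["A" if t1 == t2 else "B"] += 1
        elif t1 is not None:
            counts["C"] += 1  # resolved in T1 only
        elif t2 is not None:
            counts["D"] += 1  # resolved in T2 only
        else:
            counts["E"] += 1  # unresolved in both
    return counts

def parameterized_distance(counts, alpha):
    """B + alpha * (C + D), for 0 <= alpha <= 1."""
    return counts["B"] + alpha * (counts["C"] + counts["D"])

# Quartet table from the five-leaf example: quartet -> (topology in T1, topology in T2).
QUARTET_TABLE = {
    "1234": ("14|23", "14|23"),
    "1235": ("13|25", "15|23"),
    "1245": ("14|25", "1245"),
    "1345": ("14|35", "1345"),
    "2345": ("25|34", "23|45"),
}

COUNTS = classify(QUARTET_TABLE)
print(COUNTS, parameterized_distance(COUNTS, 1), parameterized_distance(COUNTS, 0.5))
```

With $\alpha = 1$ the value is 4, the plain quartet distance of that example; with $\alpha = 0$ only the two disagreeing resolved quartets count.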
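Counting Unresolved Triplets in One Tree
$\sum_{v} \sum_{i<j<k} n_i \cdot n_j \cdot n_k$, where $n_1, n_2, n_3, \ldots, n_d$ are the numbers of leaves below the $d$ children of $v$ (triplets anchored at $v$)
Computable in $O(n)$ time using DFS + dynamic programming
Quartets (root the tree arbitrarily):
$\sum_{v} \Big( \sum_{i<j<k<l} n_i \cdot n_j \cdot n_k \cdot n_l + (n - n_v) \sum_{i<j<k} n_i \cdot n_j \cdot n_k \Big)$ (quartets anchored at $v$; the fourth leaf $l$ is either below a fourth child of $v$ or outside the subtree of $v$)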
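One way to realize the "DFS + dynamic programming" step is to accumulate elementary symmetric sums of the child sizes, one child at a time, so that each node costs time proportional to its degree. The sketch below assumes trees given as nested lists with non-list leaves; this representation and the function name are illustrative choices, not taken from the talk.

```python
def count_unresolved(tree):
    """Return (unresolved triplets, unresolved quartets) of a nested-list tree.

    Triplets treat the tree as rooted; quartets treat it as an unrooted
    tree that has been rooted arbitrarily, as on the slide.
    """
    def leaf_count(node):
        return sum(map(leaf_count, node)) if isinstance(node, list) else 1

    n = leaf_count(tree)
    triplets = quartets = 0

    def visit(node):
        nonlocal triplets, quartets
        if not isinstance(node, list):
            return 1
        # e_k = sum of products of k distinct child sizes seen so far.
        e1 = e2 = e3 = e4 = 0
        for child in node:
            size = visit(child)
            e4 += e3 * size
            e3 += e2 * size
            e2 += e1 * size
            e1 += size
        triplets += e3                   # three leaves below three different children
        quartets += e4 + (n - e1) * e3   # fourth leaf below a child, or outside v
        return e1                        # e1 is n_v, the number of leaves below v

    visit(tree)
    return triplets, quartets

print(count_unresolved([1, 2, 3, 4]))        # star on four leaves
print(count_unresolved([1, 4, 5, [2, 3]]))   # a tree with one node of degree 4
```

The recursion is kept for brevity; for very deep trees an explicit stack would be needed in Python.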
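Counting Agreeing Triplets (Basic Idea)
[Figure: node $v$ in T1 with the leaves below its children colored $1, \ldots, d$ and all other leaves colored 0; node $w$ in T2 with a child $c$]
$\sum_{v \in T_1} \sum_{w \in T_2} \sum_{c} \sum_{1 \le i \le d} \binom{n_i^c}{2} \cdot \left( n^w - n^c - n_i^w + n_i^c \right)$
Here $c$ ranges over the children of $w$, $n_i^c$ and $n_i^w$ are the numbers of leaves colored $i$ below $c$ and $w$ in T2, and $n^c$ and $n^w$ are the corresponding totals over the colors $1, \ldots, d$.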
Efficient Computation
Limit recolorings in T1 (and T2) to $O(n \log n)$
[Figure: processing a node $v$ of T1 with children $v_1, \ldots, v_d$, starting from a stated precondition on the coloring; steps: count the T2 contribution, recolor, recurse, recolor & recurse]
Reduce recoloring cost in T2 to $O(n \log^2 n)$
[Figure: T2 (arbitrary height, degree 2) and H(T2) (binary, height $O(\log n)$)]
Reduce recoloring cost in T2 from $O(n \log^2 n)$ to $O(n \log n)$: contract T2 and reconstruct H(T2) during recursion
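The $O(n \log n)$ bound on recolorings in T1 rests on a standard argument: if, at every node, only the leaves outside the largest child subtree are recolored, then each leaf is recolored at most $\log_2 n$ times, because the subtree containing it at least doubles each time. The sketch below only measures that cost on nested-list trees. It is a simplified cost model under that assumption, not the talk's actual coloring procedure, and the function name is hypothetical.

```python
import math

def recoloring_cost(tree):
    """Return (leaf count, total leaves recolored) for a nested-list tree.

    Cost model: at every internal node, every leaf outside the largest
    child subtree is recolored once.
    """
    if not isinstance(tree, list):
        return 1, 0
    sizes, cost = [], 0
    for child in tree:
        size, child_cost = recoloring_cost(child)
        sizes.append(size)
        cost += child_cost
    cost += sum(sizes) - max(sizes)  # skip the heavy (largest) child
    return sum(sizes), cost

BALANCED = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
CATERPILLAR = [1, [2, [3, [4, 5]]]]

for example in (BALANCED, CATERPILLAR):
    n, cost = recoloring_cost(example)
    print(n, cost, n * math.log2(n))
```

The balanced tree is the expensive case here: every level recolors half of the leaves below it, while the caterpillar recolors a single leaf per node.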
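Counting Agreeing Triplets (II)
A node in H(T2) = a composition of two components $C_1$ and $C_2$ in T2
[Figure: node $v$ in T1 with colors $0, 1, \ldots, d$; components $C_1$ and $C_2$ in T2 containing leaves colored $i$ and $j$]
Contribution to agreeing triplets at a node in H(T2):
[Formula: three sums over $1 \le i \le d$ combining the counters $n_i^{C_1}$, $n_*^{C_1}$, $n_i^{C_2}$, $n_*^{C_2}$, $n_{(ii)}^{C_2}$ and $n_{i \uparrow *}^{C_2}$, including a term with $\binom{n_i^{C_1}}{2}$]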
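From $O(n \log^2 n)$ to $O(n \log n)$
Compressed version of T2 of size $O(n_v)$
Update $O(1)$ counters for all colors through node $w$ of H(T2)
[Figure: node $v$ in T1 with colors $0, 1, \ldots, d$; node $w$ in H(T2)]
Colored path lengths: $\sum_{2 \le i \le d} n_i \cdot \log \frac{|T_2|}{n_i} = \sum_{2 \le i \le d} n_i \cdot \log \frac{n_v}{n_i}$
Total cost for updating counters:
$\sum_{\text{leaf } l \in T_1} \; \sum_{\text{ancestor } a(j) \text{ not heavy child}} \log \frac{n_{a(j+1)}}{n_{a(j)}} \le \sum_{\text{leaf } l \in T_1} \log n = n \cdot \log n$
[Figure: leaf $l = a(0)$ in T1 with its ancestors $a(1), a(2), \ldots, a(5)$]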
Counting Quartets...
Root T1 and T2 arbitrarily
Keep up to $15 + 38d$ different counters per node in H(T2)...
Bottleneck in computing disagreeing resolved-resolved quartets:
[Figure: node $v$ in T1 with colors $0, 1, \ldots, d$; components $G_1$ and $G_2$ in T2 with leaves colored $i$ and $j$]
$\sum_{1 \le i < d} \; \sum_{i < j \le d} n_{(ij)}^{G_1} \cdot n_{(ij)}^{G_2}$
The double sum costs a factor $d$ in time.
Summary
[Figure: small example trees, rooted and unrooted, binary and of arbitrary degree]
Rooted trees, triplet distance
  Binary: $O(n^2)$ CPQ 1996; $O(n \log n)$ [SODA 2013]
  Degrees d: $O(n^2)$ BDF 2011; $O(n \log n)$ [SODA 2013]
Unrooted trees, quartet distance
  Binary: $O(n^3)$ D 1985; $O(n^2)$ BTKL 2000; $O(n \log^2 n)$ BFP 2001; $O(n \log n)$ BFP 2003
  Degrees d: $O(d^9 n \log n)$ SPMBF 2007; $O(n^{2.688})$ NKMP 2011; $O(d\,n \log n)$ [SODA 2013]
d = maximal degree of any node in T1 and T2
Open questions: $o(n \log n)$ for binary trees? $O(n \log n)$ for the quartet distance between trees of arbitrary degree?