
Algorithms in Bioinformatics: A Practical Introduction. Phylogenetic Tree Comparison and Consensus Trees



  1. Algorithms in Bioinformatics: A Practical Introduction. Phylogenetic Tree Comparison and Consensus Trees

  2. Phylogenetic Tree comparison

  3. Why tree comparison?
     - Different phylogenies result from using different:
       - kinds of data (different segments of the genomes)
       - kinds of model (CF model, Jukes-Cantor model)
       - kinds of reconstruction algorithm
     - Tree comparison helps us gain information from multiple trees.

  4. Two types of comparisons
     - Similarity measurement: find the common structure among the given trees.
       - Maximum agreement subtree
     - Dissimilarity measurement: determine the differences among the given trees.
       - Robinson-Foulds distance
       - Nearest neighbor interchange
       - Subtree transfer distance
       - Quartet distance

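  5. Restricted subtree
     - Consider a tree T that describes the evolution of x1, x2, x3, x4, x5.
     - Restricting T to x1, x3, x5 keeps only the part of T that connects those leaves.
     - Simplifying the result (contracting every node left with a single child) gives a tree that describes the evolution of x1, x3, x5 only.
     - [Figure: T on leaves x1 to x5; T restricted on x1, x3, x5; the simplified tree on x1, x3, x5.]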
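The restrict-then-simplify operation can be sketched in a few lines. This is a minimal illustration, not the slides' own code: the nested-tuple tree encoding, the function name `restrict`, and the example tree `T` are all assumptions made here (the exact shape of the tree in the slide's figure is not recoverable).

```python
def restrict(tree, keep):
    """Restrict a rooted tree to the leaves in `keep` and simplify it.

    Assumed encoding: a leaf is a string, an internal node is a tuple of
    children. Returns None when no leaf of `tree` is in `keep`.
    """
    if isinstance(tree, str):                      # leaf
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree)
            if r is not None]
    if not kids:
        return None                                # whole subtree dropped
    if len(kids) == 1:
        return kids[0]                             # simplify: contract unary node
    return tuple(kids)


# Hypothetical five-leaf tree, chosen only for illustration.
T = ((("x1", "x2"), "x3"), ("x4", "x5"))
print(restrict(T, {"x1", "x3", "x5"}))
```

Restriction and simplification happen in one pass here: a node that keeps only one surviving child is replaced by that child on the way back up.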

  6. Agreement subtree
     - [Figure: a tree T on leaves x1 to x5; T restricted on x1, x2, x4, x5 and simplified; a second tree T'; and the resulting agreement subtree of T and T' on x1, x2, x4, x5.]

  7. Maximum agreement subtree (MAST)
     - Given two trees T1 and T2, an agreement subtree of T1 and T2 is the common information agreed on by both trees.
     - Because both trees agree on it, the evolution described by the agreement subtree is more reliable.
     - Maximum agreement subtree problem: find the agreement subtree with the largest possible number of leaves.
     - Such an agreement subtree is called the maximum agreement subtree.

  8. MAST for rooted trees
     - The MAST of two degree-d rooted trees T1 and T2 with n leaves can be computed in $O(\sqrt{d}\, n \log(n/d))$ time (Journal of Algorithms 2001).
     - This lecture considers an $O(n^2)$-time algorithm that computes the maximum agreement subtree of two binary trees with n leaves.

  9. Computing MAST by dynamic programming
     - For any two binary rooted trees T1 and T2, let MAST(T1, T2) denote the number of leaves in their maximum agreement subtree.
     - Definition: for a tree T and a node u, $T^u$ is the subtree of T rooted at u.

  10. Not complete!
     - For any node pair (u, v) ∈ T1 × T2:
       - let a and b be the two children of u;
       - let c and d be the two children of v.
     - Let R be the maximum agreement subtree of $T_1^u$ and $T_2^v$.
     - We have the following cases:
       - R is an agreement subtree of $T_1^a$ and $T_2^v$
       - R is an agreement subtree of $T_1^b$ and $T_2^v$

  11. Recurrence

      $$
      \mathrm{MAST}(T_1^u, T_2^v) = \max
      \begin{cases}
      \mathrm{MAST}(T_1^a, T_2^c) + \mathrm{MAST}(T_1^b, T_2^d) \\
      \mathrm{MAST}(T_1^a, T_2^d) + \mathrm{MAST}(T_1^b, T_2^c) \\
      \mathrm{MAST}(T_1^a, T_2^v) \\
      \mathrm{MAST}(T_1^b, T_2^v) \\
      \mathrm{MAST}(T_1^u, T_2^c) \\
      \mathrm{MAST}(T_1^u, T_2^d)
      \end{cases}
      $$

      [Figure: node u of T1 with children a and b; node v of T2 with children c and d.]

  12-17. Recurrence (II) to (VII): the same recurrence repeated as animation steps, each highlighting a different one of the six cases.

  18. Time complexity
     - Suppose T1 and T2 are rooted phylogenies for n species.
     - We have to compute $\mathrm{MAST}(T_1^u, T_2^v)$ for every node u in T1 and every node v in T2.
     - Each binary tree has O(n) nodes, so we need to fill in $O(n^2)$ entries. Each entry can be computed in O(1) time from previously computed entries.
     - In total, the time complexity is $O(n^2)$.
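The dynamic program can be sketched as a memoized recursion over pairs of subtrees. This is a hedged illustration under assumptions of our own: trees are nested tuples with string leaves, the names `leaves` and `mast` are hypothetical, and the leaf base case (a leaf agrees with a subtree exactly when that subtree contains the same label) is not spelled out on the slides. Because Python rehashes tuples on every cache lookup, this sketch does not itself achieve the $O(n^2)$ bound; indexing nodes by integer ids would be needed for that.

```python
from functools import lru_cache


def leaves(t):
    """Set of leaf labels below node t (leaf = str, internal = pair)."""
    if isinstance(t, str):
        return frozenset([t])
    return leaves(t[0]) | leaves(t[1])


@lru_cache(maxsize=None)
def mast(t1, t2):
    """Number of leaves in a maximum agreement subtree of binary rooted trees."""
    # Assumed base cases: one side is a single leaf.
    if isinstance(t1, str):
        return 1 if t1 in leaves(t2) else 0
    if isinstance(t2, str):
        return 1 if t2 in leaves(t1) else 0
    a, b = t1            # children of u
    c, d = t2            # children of v
    return max(
        mast(a, c) + mast(b, d),   # u maps to v, children matched straight
        mast(a, d) + mast(b, c),   # u maps to v, children matched crossed
        mast(a, t2), mast(b, t2),  # agreement lies below one child of u
        mast(t1, c), mast(t1, d),  # agreement lies below one child of v
    )


# Hypothetical example trees on five species.
T1 = ((("x1", "x2"), "x3"), ("x4", "x5"))
T2 = ((("x1", "x2"), "x4"), ("x3", "x5"))
print(mast(T1, T2))
```

The six arguments of `max` correspond one-to-one to the six cases of the recurrence; the cache plays the role of the table of entries.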

  19. MAST for unrooted trees
     - In practice, we normally want to compute the MAST of unrooted trees.
     - For unrooted degree-3 trees U1 and U2, MAST(U1, U2) can be computed in $O(n \log n)$ time (STOC 97).
     - For general unrooted trees U1 and U2, MAST(U1, U2) can be computed in $O(n^{1.5} \log n)$ time (SIAM J. of Comp 2000).
     - This lecture shows the relationship between unrooted MAST and rooted MAST.

  20. Relating rooted and unrooted trees (I)
     - Definition: for an unrooted tree U and any edge e in U, $U^e$ is the rooted tree obtained by rooting U at the edge e.
     - [Figure: an unrooted tree on leaves x1 to x5 with a marked edge e, and the same tree rooted at edge e.]

  21. Relating rooted and unrooted trees (II)
     - Consider two unrooted trees U1 and U2.
     - Lemma: for any edge e of U1,
       $$\mathrm{MAST}(U_1, U_2) = \max\{\, \mathrm{MAST}(U_1^e, U_2^f) \mid f \text{ is an edge of } U_2 \,\}$$
     - Proof: exercise.
     - Based on this lemma, we can relate rooted MAST and unrooted MAST.

  22. Robinson-Foulds distance
     - Given two phylogenies T1 and T2, this method intuitively counts the number of edges on which T1 and T2 disagree.
     - First, we need some definitions.

  23. Partitioning of a tree
     - Each edge partitions the set of species into two parts.
     - In the following tree, the red edge partitions the species into {a, b, c} and {d, e}.
     - [Figure: an unrooted tree on leaves a, b, c, d, e with one edge marked in red.]

  24. Good and bad edges
     - Consider two unrooted trees T and T'. An edge x in T is called a good edge if there exists an edge x' in T' such that both edges form the same partition. In that case x' is also called a good edge.
     - Otherwise, the edge is called a bad edge.
     - [Figure: trees T and T' on leaves a, b, c, d, e, with a good edge x in T and the matching edge x' in T'.]

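  25. Leaf edges are always good
     - A leaf edge separates a single species from all the others, and both trees contain such an edge for every species.
     - [Figure: a leaf edge x in T and the matching leaf edge x' in T', on leaves a, b, c, d, e.]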

  26. Robinson-Foulds (RF) distance
     - Robinson-Foulds distance = (number of bad edges in T w.r.t. T' + number of bad edges in T' w.r.t. T) / 2
     - T and T' look similar if RF-dist(T, T') is small.
     - For example, the Robinson-Foulds distance of T and T' below is (1 + 1) / 2 = 1.
     - [Figure: trees T and T' on leaves a, b, c, d, e, each with one bad edge marked.]

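  27. Degree-3 trees T and T'
     - Lemma: when both T and T' are of degree 3, the number of bad edges in T w.r.t. T' equals the number of bad edges in T' w.r.t. T.
     - Proof:
       - Since both T and T' are degree-3 trees on the same n species, they have the same number of edges (2n - 3).
       - Good edges come in matching pairs that form the same partition, so the number of good edges in T w.r.t. T' equals the number of good edges in T' w.r.t. T.
       - Subtracting the good edges from the equal edge totals gives the lemma.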

  28. How to find the set of good edges in T w.r.t. T'?
     - Brute-force algorithm:
       - For every edge e in T, if the partition formed by e is the same as the partition formed by some edge e' in T', then e is a good edge.
     - Time analysis:
       - For every edge e in T, the check takes O(n) time.
       - T has O(n) edges, so the total time complexity is $O(n^2)$.
     - Can we do better?
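The brute-force idea, together with the Robinson-Foulds formula, can be sketched as follows. Everything here is an assumption made for illustration: trees are adjacency dictionaries, internal nodes carry the hypothetical names `u`, `w`, `v`, each partition is stored as the side that does not contain a fixed reference leaf, and the membership check uses a Python set instead of an explicit edge-by-edge comparison. The two example trees are invented so that each has exactly one bad edge.

```python
def leaf_splits(adj):
    """Map each edge to the leaf set on the side without the reference leaf."""
    leaf_set = {x for x, nbrs in adj.items() if len(nbrs) == 1}
    ref = min(leaf_set)                  # same reference leaf in both trees
    splits = {}
    for u in adj:
        for v in adj[u]:
            # Collect the leaves reachable from v without crossing edge (u, v).
            side, stack, seen = set(), [v], {u, v}
            while stack:
                x = stack.pop()
                if x in leaf_set:
                    side.add(x)
                for y in adj[x]:
                    if y not in seen:
                        seen.add(y)
                        stack.append(y)
            if ref not in side:          # keep one orientation per edge
                splits[(u, v)] = frozenset(side)
    return splits


def good_edges(adj1, adj2):
    """Edges of the first tree whose partition also occurs in the second."""
    other = set(leaf_splits(adj2).values())
    return {e for e, s in leaf_splits(adj1).items() if s in other}


def rf_distance(adj1, adj2):
    """(bad edges in tree 1 + bad edges in tree 2) / 2."""
    s1 = set(leaf_splits(adj1).values())
    s2 = set(leaf_splits(adj2).values())
    return (len(s1 - s2) + len(s2 - s1)) / 2


# Hypothetical trees on species a..e.
T = {"a": ["u"], "b": ["u"], "c": ["w"], "d": ["v"], "e": ["v"],
     "u": ["a", "b", "w"], "w": ["u", "c", "v"], "v": ["w", "d", "e"]}
Tp = {"a": ["w"], "b": ["u"], "c": ["u"], "d": ["v"], "e": ["v"],
      "u": ["b", "c", "w"], "w": ["u", "a", "v"], "v": ["w", "d", "e"]}
print(rf_distance(T, Tp), len(good_edges(T, Tp)))
```

In this example the edge separating {a, b} in `T` and the edge separating {b, c} in `Tp` are the bad ones; all leaf edges and the edge separating {d, e} are good.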

  29. Day's algorithm
     - Yes! The problem can be solved in O(n) time using Day's algorithm.
     - Input: two unrooted phylogenies T1 and T2 for the same set of species.
     - Output: the set of good edges in T1 w.r.t. T2.
     - Idea: build a data structure that enables constant-time checking of whether a particular partition of the leaves exists in T1.

  30. Step 1
     - Root T1 and T2 at the leaf labeled n.
     - This step takes O(n) time.
     - [Figure: T1 and T2, each drawn with leaf n as its root.]

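  31. Example for step 1
     - [Figure: unrooted trees T1 and T2 on leaves 1 to 5 are both re-rooted at leaf 5. After rooting, the remaining leaves of T1 appear in the order 3, 1, 2, 4 and those of T2 in the order 1, 2, 3, 4.]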

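  32. Step 2
     - Relabel the leaves of T1 in increasing order, and apply the same relabeling to T2.
     - Note: for every internal node x of T1, the set of leaf labels in the subtree of x forms an interval [i..j].
     - This step takes O(n) time.
     - [Figure: T1 rooted at n with leaves 1 to n-1 in increasing order; an internal node x covers the leaves i to j.]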
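Step 2 can be sketched as a single depth-first traversal that hands out new labels in visiting order and records the interval covered by each internal node. The encoding is an assumption of this sketch: `tree` is the part of T1 hanging below the root leaf n, written as nested tuples with integer leaves, and the function name and the example shape are hypothetical. Only the relabeling and the intervals are shown, not the rest of Day's algorithm.

```python
def relabel_with_intervals(tree):
    """Relabel leaves 1, 2, ... in DFS order; return (mapping, intervals).

    `intervals` holds one (i, j) pair per internal node: the range of new
    leaf labels found in that node's subtree.
    """
    new_label = {}
    intervals = []

    def visit(node):
        if not isinstance(node, tuple):          # leaf: assign next label
            new_label[node] = len(new_label) + 1
            return new_label[node], new_label[node]
        spans = [visit(child) for child in node]
        lo, hi = spans[0][0], spans[-1][1]       # children are consecutive
        intervals.append((lo, hi))
        return lo, hi

    visit(tree)
    return new_label, intervals


# Hypothetical T1 below root leaf 5, with leaves in the order 3, 1, 2, 4.
T1_below_root = ((3, 1), (2, 4))
mapping, intervals = relabel_with_intervals(T1_below_root)
print(mapping, intervals)
```

Applying `mapping` to the leaves of T2 gives both trees the same label set, and any partition of T1 can then be represented by a single interval.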

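  33. Example for step 2
     - [Figure: the leaves of T1, in the order 3, 1, 2, 4, are relabeled 1, 2, 3, 4. Applying the same relabeling to T2 turns its leaf order 1, 2, 3, 4 into 2, 3, 1, 4. One internal node is annotated with the interval [2..3].]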
