algorithm summary



  1. 2/4/09
     CSCI1950-Z Computational Methods for Biology, Lecture 4
     Ben Raphael, February 2, 2009
     http://cs.brown.edu/courses/csci1950-z/

     Algorithm Summary

     Method                                 Input              Output
     Parsimony (Sankoff's & Fitch's Alg.)   Characters, T      A, B
     Perfect Phylogeny                      Characters         A, B, T
     Probabilistic (Felsenstein)            Characters, T, B   A

     T = tree topology, B = branch lengths, A = ancestral states

  2. Pairwise Compatibility Test (Wilson 1965)

     Binary characters i and j are pairwise compatible if and only if j is homogeneous w.r.t. $i_0$ or $i_1$, where $i_0$ and $i_1$ are the sets of species with state 0 and state 1 for character i.
     Equivalently: $i_1$ and $j_1$ are disjoint, or one contains the other.
     Equivalently: the four rows (0,0), (0,1), (1,0), (1,1) do not all exist in columns i and j.

          i  j  k
     A    0  0  0
     B    0  0  0
     C    0  1  1
     D    1  0  1
     E    1  0  0

     Here $i_0$ = {A, B, C} and $i_1$ = {D, E}.

     Pairwise Compatibility Theorem (Estabrook et al. 1976)

     A set S of binary characters is mutually compatible if and only if all pairs c and c' of characters in S are pairwise compatible.
     For binary characters, pairwise compatibility is equivalent to mutual compatibility.
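
The four-row form of the test is easy to check mechanically. The following is a minimal sketch, not part of the original slides; the function name `pairwise_compatible` is an assumption made for illustration, and the columns are the characters i, j, k from the table on this slide.

```python
from itertools import combinations

ALL_PAIRS = {(0, 0), (0, 1), (1, 0), (1, 1)}

def pairwise_compatible(col_i, col_j):
    """Four-row test: two binary characters are compatible unless
    all four value pairs occur among the species."""
    observed = set(zip(col_i, col_j))
    return not ALL_PAIRS <= observed

# Columns of the slide's example table; rows are species A, B, C, D, E.
characters = {
    "i": [0, 0, 0, 1, 1],
    "j": [0, 0, 1, 0, 0],
    "k": [0, 0, 1, 1, 0],
}

# Test every pair of characters.
results = {
    (a, b): pairwise_compatible(characters[a], characters[b])
    for a, b in combinations(characters, 2)
}
for pair, ok in results.items():
    print(pair, "compatible" if ok else "incompatible")
```

In this table i and k show all four pairs (rows B, C, D, E), so they are the only incompatible pair.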

  3. Perfect Phylogeny

     A set of mutually compatible binary characters gives a perfect phylogeny:

                  traits
     species   1  2  3  4  5
        A      1  1  0  0  0
        B      0  0  1  0  0
        C      1  1  0  1  0
        D      0  0  1  0  1
        E      1  0  0  0  0

     1. Evolutionary model
        – Binary characters {0,1}
        – Each character changes state only once in evolutionary history (no homoplasy!).
     2. Tree in which every mutation is on an edge of the tree.
        – All the species in one sub-tree contain a 0, and all species in the other contain a 1.
        – For simplicity, assume root = (0, 0, 0, 0, 0)

     Last time: algorithm to reconstruct a tree.

     Trees and Splits

     • Given a set X, a split is a partition of X into two non-empty subsets A and B. X = A | B.
     • For a phylogenetic tree T with leaves L, each edge e defines a split $L_e = A | B$, where A and B are the leaves in the subtrees obtained by removing e.

     In perfect phylogeny, edges where binary character i changes state gave the split $i_0$ | $i_1$.
     We will return to splits in a future lecture.

  4. Splits Equivalence Theorem

     A phylogenetic tree T defines a collection of splits $\Sigma(T) = \{ L_e \mid e \text{ is an edge in } T \}$.

     Splits $A_1 | B_1$ and $A_2 | B_2$ are pairwise compatible if at least one of $A_1 \cap A_2$, $A_1 \cap B_2$, $B_1 \cap A_2$, and $B_1 \cap B_2$ is the empty set.

     Splits Equivalence Theorem: Let $\Sigma$ be a collection of splits. There is a phylogenetic tree T such that $\Sigma(T) = \Sigma$ if and only if the splits in $\Sigma$ are pairwise compatible.

     The Pairwise Compatibility Theorem (for binary characters) follows from this theorem.

     Outline

     Distance-based methods for phylogenetic tree reconstruction.
     • Review of distances/metrics.
     • Tree distances and additive distances
       – Small and large phylogeny problems.
     • Non-additive distances and clustering
       – UPGMA and ultrametric distances.
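
The compatibility condition for splits translates directly into set intersections. The sketch below is illustrative only; `splits_compatible`, `all_pairwise_compatible`, and `split_from` are hypothetical helper names, and the example splits are modeled on traits 1, 2, 4, and 5 of the Perfect Phylogeny traits table (species A to E).

```python
from itertools import combinations

def splits_compatible(split1, split2):
    """Splits A1|B1 and A2|B2 are compatible if at least one of the
    four intersections A1&A2, A1&B2, B1&A2, B1&B2 is empty."""
    a1, b1 = split1
    a2, b2 = split2
    return any(not (p & q) for p in (a1, b1) for q in (a2, b2))

def all_pairwise_compatible(splits):
    """Hypothesis of the Splits Equivalence Theorem: every pair is compatible."""
    return all(splits_compatible(s, t) for s, t in combinations(splits, 2))

leaves = frozenset("ABCDE")

def split_from(side):
    """Build the split side | (leaves - side) from one side."""
    side = frozenset(side)
    return (side, leaves - side)

# Splits induced by traits 1, 2, 4, 5 (the species carrying state 1).
tree_like = [split_from("ACE"), split_from("AC"), split_from("C"), split_from("D")]

# Two splits for which all four intersections are non-empty.
conflicting = [split_from("AB"), split_from("AC")]

print(all_pairwise_compatible(tree_like))    # every pair has an empty intersection
print(all_pairwise_compatible(conflicting))  # no intersection is empty
```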

  5. Distances

     A distance on a set X is a function $d: X \times X \to \mathbb{R}$ satisfying:
     • For all $x, y \in X$, $d(x, y) \ge 0$, with equality iff $x = y$.
     • For all $x, y \in X$, $d(x, y) = d(y, x)$ [symmetry]
     • For all $x, y, z \in X$, $d(x, z) \le d(x, y) + d(y, z)$ [triangle inequality]

     Examples:
     • X = real numbers, $d(x, y) = |x - y|$ is a distance.
     • X = strings over some alphabet. $d_H(s, t)$ = number of positions where s and t differ is called the Hamming distance.

     Distances in Biological Data

     • String distances (e.g. Hamming distance, edit distance) on DNA/protein sequence data
     • Substitution model (Jukes-Cantor, Kimura, etc.): scores for particular changes A → T, C → G, etc.

     Rat:        ACAGTCACGCCCCACACGT
     Mouse:      ACAGTGACGCCACACACGT
     Gorilla:    CCTGTGACGTAACAAACGA
     Chimpanzee: CCTGTGAGGTAGCAAACGA
     Human:      CCTGTGAGGTAGCACACGA
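
The three axioms can be checked by brute force on a finite sample of points. This is a minimal sketch under that assumption (passing on a sample does not prove the axioms in general); `hamming` and `is_distance` are names chosen here for illustration, and the sequences are the first three from this slide.

```python
from itertools import product

def hamming(s, t):
    """Number of positions where equal-length strings s and t differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(s, t))

def is_distance(points, d):
    """Check non-negativity, identity, symmetry and the triangle
    inequality for d on every pair and triple drawn from points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0 or (d(x, y) == 0) != (x == y):
            return False
        if d(x, y) != d(y, x):
            return False
    return all(
        d(x, z) <= d(x, y) + d(y, z)
        for x, y, z in product(points, repeat=3)
    )

reals = [0.0, 2.0, 5.0]
seqs = [
    "ACAGTCACGCCCCACACGT",  # Rat
    "ACAGTGACGCCACACACGT",  # Mouse
    "CCTGTGACGTAACAAACGA",  # Gorilla
]

print(is_distance(reals, lambda x, y: abs(x - y)))   # |x - y| on the sample
print(is_distance(seqs, hamming))                    # Hamming distance on the sample
print(is_distance(reals, lambda x, y: (x - y) ** 2)) # squared difference breaks the triangle inequality
```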

  6. Distance Matrix

     • For n species, form the n x n distance matrix $D_{ij}$
     • Example: $D_{ij}$ = edit distance between a gene in species i and species j.

     Mouse:      ACAGTGACGCCACACACGT
     Gorilla:    CCTGCGACGTAACAAACGC
     Chimpanzee: CCTGCCAGTTAGCAAACGC
     Human:      CCTGCCAGTTAGCACACGA

                  Mouse  Gorilla  Chimpanzee  Human
     Mouse          0       7        11        10
     Gorilla        7       0         4         6
     Chimpanzee    11       4         0         2
     Human         10       6         2         0

     Alignment vs. Distance Matrix

     Sequence a gene of length m in n species → n x m alignment matrix.
     Transform the n x m alignment matrix into an n x n distance matrix.
     The reverse transformation is not possible due to loss of information.
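
As a sketch of the alignment-to-matrix transformation, the code below builds a distance matrix from the four aligned sequences on this slide. One assumption to note: the slide names edit distance, while this sketch counts substitutions only (Hamming distance), which is simpler and happens to reproduce the slide's matrix for these equal-length sequences. The function names are illustrative.

```python
def hamming(s, t):
    """Number of mismatching positions between equal-length sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))

# The n x m alignment from the slide (n = 4 species, m = 19 positions).
sequences = {
    "Mouse":      "ACAGTGACGCCACACACGT",
    "Gorilla":    "CCTGCGACGTAACAAACGC",
    "Chimpanzee": "CCTGCCAGTTAGCAAACGC",
    "Human":      "CCTGCCAGTTAGCACACGA",
}

def distance_matrix(seqs):
    """Return the n x n matrix of pairwise distances, in dict order."""
    names = list(seqs)
    return [[hamming(seqs[a], seqs[b]) for b in names] for a in names]

D = distance_matrix(sequences)
for name, row in zip(sequences, D):
    print(f"{name:<11}", row)
```

The matrix keeps only n(n-1)/2 numbers, so the original n x m alignment cannot be recovered from it.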

  7. Distances in Trees

     Given a tree T with a positive weight w(e) on each edge, we define the tree distance $d_T$ on the set L of leaves by:

     $d_T(i, j)$ = sum of weights of edges on the unique path from i to j.

     In evolutionary biology, weights are sometimes called branch lengths.

     Distance in Trees: an Example

     [Figure: weighted tree with leaves i and j marked]

     $d_T(1, 4) = 12 + 13 + 14 + 17 + 13 = 69$
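
The tree distance can be computed by walking the unique path between two leaves. The slide's figure is not recoverable here, so the tree below is a hypothetical one: only the five weights on the path from leaf 1 to leaf 4 (12, 13, 14, 17, 13) come from the slide, while the internal node names and the remaining edges are assumptions for illustration.

```python
def build_adjacency(edges):
    """Turn (u, v, weight) triples into an undirected adjacency map."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def tree_distance(adj, source, target):
    """Sum the edge weights on the unique path from source to target
    using an iterative depth-first search."""
    stack = [(source, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == target:
            return dist
        for nbr, w in adj[node].items():
            if nbr != parent:  # never walk back along the edge we came from
                stack.append((nbr, node, dist + w))
    raise ValueError("target is not reachable from source")

# Hypothetical tree: leaves 1..6, internal vertices a, b, c, d.
edges = [
    (1, "a", 12), ("a", "b", 13), ("b", "c", 14), ("c", "d", 17), ("d", 4, 13),
    (2, "a", 10), (3, "b", 16), (5, "c", 11), (6, "d", 15),  # assumed weights
]
adj = build_adjacency(edges)

print(tree_distance(adj, 1, 4))  # 12 + 13 + 14 + 17 + 13
```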

  8. Distance vs. Tree Distance

     • n x n distance matrix for n species
     • Note that $d_T(i, j)$, the tree distance between i and j, is not necessarily equal to $D_{ij}$ as given by the distance matrix.

     Rat:        ACAGTGACGCCCCAAACGT
     Mouse:      ACAGTGACGCTACAAACGT
     Gorilla:    CCTGTGACGTAACAAACGA
     Chimpanzee: CCTGTGACGTAGCAAACGA
     Human:      CCTGTGACGTAGCAAACGA

     Fitting a Distance Matrix

     • Given n species, we can compute the n x n distance matrix $D_{ij}$
     • Evolution of these species is described by a tree that we don't know.
     • We need an algorithm to construct a tree that best fits the distance matrix $D_{ij}$

     Find a tree T such that:

     $D_{ij} = d_T(i, j)$

     where $D_{ij}$ is the distance between species (known) and $d_T(i, j)$ is the length of the path in an (unknown) tree T.

  9. Distance Based Phylogeny Problem

     Goal: Reconstruct an evolutionary tree from a distance matrix
     Input: n x n distance matrix $D_{ij}$
     Output: weighted tree T with n leaves fitting D

     Unknown topology of the tree makes evolutionary tree reconstruction hard!

     Number of unrooted binary trees with n leaves: $T(n) = \dfrac{(2n-5)!}{(n-3)! \, 2^{n-3}}$
     n = 24: $T(n) \approx 5.64 \times 10^{26}$

     If D is additive, this problem has a solution and there is a simple algorithm to solve it.

     Distance-based vs. character-based

     Key difference: Distance-based methods do not reconstruct ancestral states.

          A  B  C  D
     A    0  1  2  2
     B    1  0  1  1
     C    2  1  0  0
     D    2  1  0  0

     Note that C and D are identical.
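
To get a feel for how fast the number of topologies grows, the sketch below assumes the standard count of unrooted binary trees on n labeled leaves, the double factorial (2n-5)!! = 1·3·5···(2n-5). The function name is hypothetical.

```python
def num_unrooted_binary_trees(n):
    """Count unrooted binary trees on n >= 3 labeled leaves as the
    product of the odd numbers 3, 5, ..., 2n-5 (assumed standard formula)."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count

for n in (3, 4, 5, 10, 24):
    print(n, f"{num_unrooted_binary_trees(n):.3g}")
```

Exhaustive search over topologies is already out of reach for a few dozen species, which is why an algorithm that avoids enumerating trees matters.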

  10. Reconstructing a 3 Leaved Tree

     • Tree reconstruction for a 3x3 matrix is straightforward
     • We have 3 leaves i, j, k and a center vertex c

     Observe:
     $d_{ic} + d_{jc} = D_{ij}$
     $d_{ic} + d_{kc} = D_{ik}$
     $d_{jc} + d_{kc} = D_{jk}$

     Reconstructing a 3 Leaved Tree (cont'd)

     Adding the first two equations:
     $2 d_{ic} + d_{jc} + d_{kc} = D_{ij} + D_{ik}$
     $2 d_{ic} + D_{jk} = D_{ij} + D_{ik}$
     $d_{ic} = (D_{ij} + D_{ik} - D_{jk})/2$

     Similarly,
     $d_{jc} = (D_{ij} + D_{jk} - D_{ik})/2$
     $d_{kc} = (D_{ki} + D_{kj} - D_{ij})/2$
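
The three closed-form expressions can be evaluated directly. This is a minimal sketch; the function name and the input values (6, 10, 10) are chosen for illustration.

```python
def three_leaf_tree(D_ij, D_ik, D_jk):
    """Edge lengths from leaves i, j, k to the center vertex c."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc

# Illustrative pairwise distances.
D_ij, D_ik, D_jk = 6, 10, 10
d_ic, d_jc, d_kc = three_leaf_tree(D_ij, D_ik, D_jk)
print(d_ic, d_jc, d_kc)

# The edge lengths reproduce the three input distances.
print(d_ic + d_jc == D_ij, d_ic + d_kc == D_ik, d_jc + d_kc == D_jk)
```

For the edge lengths to be positive, the inputs must satisfy the triangle inequality strictly.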

  11. Trees with > 3 Leaves

     • A binary tree with n leaves has 2n-3 edges
     • Fitting a given tree to a distance matrix D requires solving a system with n(n-1)/2 equations and 2n-3 variables
     • Solution not always possible for n > 3.

     Additive Distance Matrices

     Matrix D is ADDITIVE if there exists a tree T with $d_T(i, j) = D_{ij}$ for all leaves i, j, and NON-ADDITIVE otherwise.

  12. Additive Distance Phylogeny

     Small Additive Distance Phylogeny: Given phylogenetic tree T and distance matrix D, determine branch lengths such that $d_T(i, j) = D_{ij}$.

     Large Additive Distance Phylogeny: Given distance matrix D, find T and branch lengths such that $d_T(i, j) = D_{ij}$.

     Both of these problems can be solved efficiently.

     Reconstructing Additive Distances Given T

     [Figure: tree T with leaves v, w, x, y, z and edge lengths 5, 4, 3, 3, 4, 7, 6]

     D    v   w   x   y   z
     v    0  10  17  16  16
     w        0  15  14  14
     x            0   9  15
     y                0  14
     z                    0

     If we know T and D, but do not know the length of each edge, we can reconstruct those lengths.

  13. Reconstructing Additive Distances Given T

     [Figure: tree T with leaves v, w, x, y, z, edge lengths unknown]

     D    v   w   x   y   z
     v    0  10  17  16  16
     w        0  15  14  14
     x            0   9  15
     y                0  14
     z                    0

     Reconstructing Additive Distances Given T

     Find neighbors v, w (common parent a) and replace them by a:

     $d_{ax} = \frac{1}{2}(d_{vx} + d_{wx} - d_{vw})$
     $d_{ay} = \frac{1}{2}(d_{vy} + d_{wy} - d_{vw})$
     $d_{az} = \frac{1}{2}(d_{vz} + d_{wz} - d_{vw})$

     D1   a   x   y   z
     a    0  11  10  10
     x        0   9  15
     y            0  14
     z                0
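
The compression of two neighbors into their parent can be sketched as a small matrix operation. The helper names `symmetric` and `merge_neighbors` are assumptions for illustration; the data is the matrix D from this slide, and the result should match D1.

```python
def symmetric(labels, upper):
    """Expand an upper-triangular {(a, b): d} map into a full
    dict-of-dicts matrix with a zero diagonal."""
    D = {a: {a: 0} for a in labels}
    for (a, b), d in upper.items():
        D[a][b] = d
        D[b][a] = d
    return D

def merge_neighbors(D, i, j, k):
    """Replace neighboring leaves i and j by their parent k, using
    D[k][m] = (D[i][m] + D[j][m] - D[i][j]) / 2 for every other leaf m."""
    others = [m for m in D if m not in (i, j)]
    new = {m: {n: D[m][n] for n in others} for m in others}
    new[k] = {k: 0}
    for m in others:
        d = (D[i][m] + D[j][m] - D[i][j]) / 2
        new[k][m] = d
        new[m][k] = d
    return new

D = symmetric("vwxyz", {
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9,  ("x", "z"): 15,
    ("y", "z"): 14,
})

D1 = merge_neighbors(D, "v", "w", "a")
print(D1["a"])  # row of the new vertex a
```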

  14. Reconstructing Additive Distances Given T

     Neighbors x, y (common parent b)

     [Figure: tree T with internal vertices a, b, c and the reconstructed edge lengths 5, 4, 3, 3, 4, 7, 6]

     D1   a   x   y   z
     a    0  11  10  10
     x        0   9  15
     y            0  14
     z                0

     D2   a   b   z
     a    0   6  10
     b        0  10
     z            0

     D3   a   c
     a    0   3
     c        0

     d(a, c) = 3
     d(b, c) = d(a, b) – d(a, c) = 3
     d(c, z) = d(a, z) – d(a, c) = 7
     d(b, x) = d(a, x) – d(a, b) = 5
     d(b, y) = d(a, y) – d(a, b) = 4
     d(a, w) = d(z, w) – d(a, z) = 4
     d(a, v) = d(z, v) – d(a, z) = 6

     Correct!!!

     Trees and Neighbors

     Previous algorithm relied only on finding neighboring leaves:
     1. Find neighboring leaves i and j with parent k
     2. Remove the rows and columns of i and j
     3. Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as:
        $D_{km} = (D_{im} + D_{jm} - D_{ij})/2$

     Compress i and j into k, iterate algorithm for rest of tree
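
The three steps can be iterated until only two vertices remain. The sketch below assumes the neighbor pairs are supplied in advance (since T is given), and it computes each leaf-to-parent edge with the formula d(i,k) = (D_ij + D_im - D_jm)/2 for an arbitrary other leaf m. That formula is an assumption of this sketch rather than the back-substitution used on the slide, though for an additive matrix both give the same lengths. All function names are illustrative.

```python
def symmetric(labels, upper):
    """Expand an upper-triangular {(a, b): d} map into a full matrix."""
    D = {a: {a: 0} for a in labels}
    for (a, b), d in upper.items():
        D[a][b] = d
        D[b][a] = d
    return D

def reconstruct(D, merges):
    """merges: ordered (i, j, k) triples, neighbors i, j with new parent k.
    Returns {(child, parent): edge length}; expects two vertices to remain."""
    D = {a: dict(row) for a, row in D.items()}  # work on a copy
    edges = {}
    for i, j, k in merges:
        others = [m for m in D if m not in (i, j)]
        m0 = others[0]  # any other leaf works for an additive matrix
        d_ik = (D[i][j] + D[i][m0] - D[j][m0]) / 2
        edges[(i, k)] = d_ik
        edges[(j, k)] = D[i][j] - d_ik
        # Step 3: distances from the new vertex k to every remaining leaf.
        row = {m: (D[i][m] + D[j][m] - D[i][j]) / 2 for m in others}
        # Step 2: drop the rows and columns of i and j.
        D = {m: {n: D[m][n] for n in others} for m in others}
        for m, d in row.items():
            D[m][k] = d
        D[k] = dict(row, **{k: 0})
    a, b = D  # exactly two vertices left: the final edge
    edges[(a, b)] = D[a][b]
    return edges

D = symmetric("vwxyz", {
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9,  ("x", "z"): 15,
    ("y", "z"): 14,
})

edges = reconstruct(D, [("v", "w", "a"), ("x", "y", "b"), ("a", "b", "c")])
for edge, length in edges.items():
    print(edge, length)
```

On the slide's matrix this yields the same seven edge lengths as the worked example.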

  15. Finding Neighboring Leaves

     To find neighboring leaves we simply select a pair of closest leaves. WRONG!

          i   j   k   l
     i    0  13  21  22
     j        0  12  13
     k            0  13
     l                0

     i and j are neighbors, but ($d_{ij}$ = 13) > ($d_{jk}$ = 12).

     Finding a pair of neighboring leaves is a nontrivial problem!

     Degenerate Triples

     • A degenerate triple is a set of three distinct elements 1 ≤ i, j, k ≤ n where $D_{ij} + D_{jk} = D_{ik}$
     • Element j in a degenerate triple i, j, k lies on the evolutionary path from i to k (or is attached to this path by an edge of length 0).
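
Both points on this page can be checked with a few lines. The sketch below finds the closest pair in the slide's 4 x 4 matrix (which is not the neighboring pair i, j) and searches for degenerate triples. The helper names and the second matrix, three points p, q, r on a line, are hypothetical and used only to show a degenerate triple.

```python
from itertools import combinations, permutations

def make_dist(upper):
    """Build a symmetric lookup function from an upper-triangular map."""
    def dist(a, b):
        if a == b:
            return 0
        return upper[(a, b)] if (a, b) in upper else upper[(b, a)]
    return dist

def closest_pair(labels, dist):
    """Pair of distinct labels with the smallest distance."""
    return min(combinations(labels, 2), key=lambda p: dist(*p))

def degenerate_triples(labels, dist):
    """Triples (i, j, k) with D_ij + D_jk = D_ik; each reported once (i < k)."""
    return [
        (i, j, k)
        for i, j, k in permutations(labels, 3)
        if i < k and dist(i, j) + dist(j, k) == dist(i, k)
    ]

# Matrix from the slide: i and j are neighbors in the tree.
labels = ["i", "j", "k", "l"]
dist = make_dist({
    ("i", "j"): 13, ("i", "k"): 21, ("i", "l"): 22,
    ("j", "k"): 12, ("j", "l"): 13,
    ("k", "l"): 13,
})
print(closest_pair(labels, dist))        # the closest pair is j, k, not i, j
print(degenerate_triples(labels, dist))  # this matrix has none

# Hypothetical example: q lies on the path from p to r.
line_dist = make_dist({("p", "q"): 3, ("q", "r"): 4, ("p", "r"): 7})
print(degenerate_triples(["p", "q", "r"], line_dist))
```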
