outline
play

Outline Probabilis3c Models of Phylogeny 1. Models of nucleo3de - PDF document

2/15/09 CSCI1950Z Computa3onal Methods for Biology Lecture 7 Ben Raphael February 9, 2009 hHp://cs.brown.edu/courses/csci1950z/ Outline Probabilis3c Models of Phylogeny 1. Models of nucleo3de change 2. Compu3ng likelihood of trees 1


  1. 2/15/09 CSCI1950‐Z Computa3onal Methods for Biology Lecture 7 Ben Raphael February 9, 2009 hHp://cs.brown.edu/courses/csci1950‐z/ Outline Probabilis3c Models of Phylogeny 1. Models of nucleo3de change 2. Compu3ng likelihood of trees 1

  2. 2/15/09 Distances from Sequences convergent mul3ple Chimpanzee: CCTGCCAGTTAGCAAACGG G T Ancestor: CCCGCGACTTAACAAACGC G CCTGCGAGTTAACAAACGA Human: parallel back coincident Hamming distance = 3. D S = differences per site = 3/20 12 total muta3ons. Jukes‐Cantor Model A C G T A ‐3α α α α C α ‐3α α α Q = G α α ‐3α α T α α α ‐3α P(t) = e Qt p xy (t) = Pr[X t = x | X 0 = y] = ¼ + ¾ e ‐4αt if x = y ¼ ‐ ¼ e ‐4αt if x ≠ y 2

  3. 2/15/09 Observed Differences vs. Jukes‐Cantor = α t Other Models In biology, not all subs3tu3ons are equally likely. {A, G} purines Kimura 2 parameter model transversion A C G T A ‐2β‐ α β α β {C, T} pyrimidines C β ‐2β‐ α β α transi3on G α β ‐2β‐ α β T β α β ‐2β‐ α Other models • HKY model for DNA • Other models for protein sequences (20 x 20 matrices) 3

  4. 2/15/09 Probabilis3c Model y Pr[ x | y , t ] = probability that y mutates to x in 3me t t x Given a character matrix M, what is the “most likely” tree T that generated M? Rat: ACAGTGACGCCCCAAACGT Mouse: ACAGTGACGCTACAAACGT Gorilla: CCTGTGACGTAACAAACGA Chimpanzee: CCTGTGACGTAGCAAACGA CCTGTGACGTAGCAAACGA Human: Likelihood Given data D and a model (hypothesis) H for genera3on of D, the likelihood of H is the quan3ty L [ H ; D ] = Pr[ D | H ]. Note: likelihood of H is not the probability of H. 4

  5. 2/15/09 Probabilis3c Model y Pr[ x | y , t ] = probability that y mutates to x in 3me t t x Given a character matrix M, what is the “most likely” tree T that generated M? Assume: 1. Characters evolve independently. 2. Constant rate of muta3on on each branch. 3. State of a node depends only on parent and branch length: i.e. Pr[ x | y , t ] depends only on y and t . (Markov process) Maximum Likelihood Given a character matrix M , a tree T with branch lengths t * = t 1 , …, t 2 n ‐2 : L( T , t * ) = Pr[ M | T , t * ] is called the likelihood . Maximum likelihood : Find argmax T , t L( T , t * ) First: How to compute L(T, t * )? 5

  6. 2/15/09 Probabilis3c Model y Pr[ x | y , t ] = probability that y mutates to x in 3me t t x Given a tree (T, t * ) with leaves labeled by characters in M , Pr[ M | T , t * ] is the probability of a labeling of ancestral nodes. Assume: 1. Characters evolve independently: Pr[ M | T , t * ] = Π i Pr[ M i | T , t * ] so consider each character separately 2. Constant rate of muta3on on each branch. 3. State of a vertex depends only on parent and branch length: i.e. Pr[ x | y , t ] depends only on y and t . (Markov process) Example x t 5 t 6 y z t 2 t 3 G t 1 t 4 C A T 6

  7. 2/15/09 Probabilis3c Model y Pr[ x | y , t ] = probability that y t mutates to x in 3me t x Two species a T = tree topology t 1 t 2 x 1 , x 2 : characters for each species a : character for ancestor x 1 x 2 Pr[ x 1 , x 2 , a | T , t 1 , t 2 ] = q a Pr[ x 1 | a , t 1 ] Pr[ x 2 | a , t 2 ] q a = Pr[ ancestor has character a] Probabilis3c Model Two species a T = tree topology t 1 t 2 x 1 , x 2 : characters for each species a : character for ancestor x 1 x 2 Pr[ x 1 , x 2 , a | T , t 1 , t 2 ] = q a Pr[ x 1 | a , t 1 ] Pr[ x 2 | a , t 2 ] q a = Pr[ ancestor has character a] Pr[ x 1 , x 2 | T , t 1 , t 2 ] = Σ a q a Pr[ x 1 | a , t 1 ] Pr[ x 2 | a , t 2 ] Follows from Law of Total Probability: P( X ) = Σ P( X | Y i ) P( Y i ). 7

  8. 2/15/09 Probabilis3c Model n species: x 1 , x 2 , …, x n Let α( i ) = ancestor of node i . Let a n +1 , a n +2 , …, a 2 n ‐1 = characters on internal nodes, where nodes are number from internal ver3ces up to root. Pr [ x 1 , ..., x n | T, t 1 , ..., t 2 n − 2 ] = 2 n − 2 n � � Pr [ a i | a α ( i ) , t i ] � Pr [ x i | a α ( i ) , t i ] q a 2 n − 1 i = n +1 i =1 a n +1 ,a n +2 ,..,a 2 n − 1 Follows from Law of Total Probability: P( X ) = Σ P( X | Y i ) P( Y i ). Felsenstein’s Algorithm Let Pr[ T k | a ] = probability of leaf nodes “below” node k , given a k = a. a Compute via dynamic programming b c � � Pr [ T k | a ] = Pr [ b | a, t i ] Pr [ T i | b ] Pr [ c | a, t j ] Pr [ T j | c ] c b Ini3al condi3ons. For k = 1, …, n (leaf nodes) Pr[ T k | a ] = 1, if a = x k 0, otherwise. 8

  9. 2/15/09 Compu3ng the Likelihood Let Pr[ T k | a ] = probability of leaf nodes “below” node k , given a k = a. � Pr [ x 1 , . . . , x n | T, t ∗ ] = Pr [ T 2 n − 1 | a ] q a a Note: Root is node 2 n ‐1 Maximum Likelihood when T unknown Find T, t* that maximize: � Pr [ x 1 , . . . , x n | T, t ∗ ] = Pr [ T 2 n − 1 | a ] q a a Must search over all trees T. Complexity unknown un3l recently: – Felsenstein book (2004): “There has also been no proof that the problem is NP‐hard (as there has been for many other methods” – Shamir notes (2000): “[Maximum likelihood] not proven to be NP‐complete.” • ML is NP‐hard (B. Chor and T. Tuller, RECOMB 2005). 9

  10. 2/15/09 Unknown branch lengths • T fixed, branch lengths t * are unknown. • Use local op3miza3on rou3ne: e.g. Newton’s method or Expecta3on Maximiza3on Finding Ancestral States Let Pr[ T k | a ] = probability of best assignment of ancestral states to nodes “below” node k , given a k = a. � � � � Pr [ T k | a ] = max Pr [ b | a, t i ] Pr [ T i | b ] max Pr [ c | a, t j ] Pr [ T j | c ] c b Traceback as before with Sankoff’s algorithm. 10

  11. 2/15/09 Max. Parsimony vs. Max. Likelihood • Set δ ij = ‐log P( j | i ) in weighted parsimony (Sankoff algorithm) • Weighted parsimony produces “maximum probability” assignments, ignoring branch lengths Searching over Tree Space • How to find T with maximum likelihood? • How to find T with maximum parsimony? 11

Recommend


More recommend