CS481: Bioinformatics Algorithms
Can Alkan
EA224
calkan@cs.bilkent.edu.tr
http://www.cs.bilkent.edu.tr/~calkan/teaching/cs481/
HMM for Fair Bet Casino (cont’d)
HMM model for the Fair Bet Casino Problem
Hidden Paths
A path π = π_1…π_n in the HMM is defined as a sequence of states.
Consider path π = FFFBBBBBFFF and sequence x = 01011101001.

x:                 0    1    0    1    1    1    0    1    0    0    1
π:                 F    F    F    B    B    B    B    B    F    F    F
P(x_i|π_i):       1/2  1/2  1/2  3/4  3/4  3/4  1/4  3/4  1/2  1/2  1/2
P(π_{i-1}→π_i):   1/2 9/10 9/10 1/10 9/10 9/10 9/10 9/10 1/10 9/10 9/10

P(x_i|π_i): probability that x_i was emitted from state π_i.
P(π_{i-1}→π_i): transition probability from state π_{i-1} to state π_i.
P(x|π) Calculation
P(x|π): probability that sequence x was generated by the path π:

P(x|π) = P(π_0 → π_1) · Π_{i=1}^{n} P(x_i|π_i) · P(π_i → π_{i+1})
       = a_{π_0,π_1} · Π_{i=1}^{n} e_{π_i}(x_i) · a_{π_i,π_{i+1}}
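As a sanity check, here is a minimal Python sketch that evaluates this product for the example path and sequence from the previous slide. The dictionary and function names are illustrative, and the transition into an end state is omitted since the Fair Bet Casino HMM has no explicit end state.

```python
# Fair Bet Casino parameters from the example above:
# fair coin F emits heads (1) and tails (0) with probability 1/2 each;
# biased coin B emits heads with probability 3/4; switch probability 1/10.
emission = {
    'F': {0: 0.5,  1: 0.5},
    'B': {0: 0.25, 1: 0.75},
}
transition = {
    'F': {'F': 0.9, 'B': 0.1},
    'B': {'F': 0.1, 'B': 0.9},
}
initial = {'F': 0.5, 'B': 0.5}   # P(pi_0 -> pi_1)

def path_probability(x, pi):
    """P(x|pi) = a_{pi_0,pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i,pi_{i+1}}
    (no end-state transition, matching the worked example)."""
    p = initial[pi[0]]
    for i in range(len(x)):
        p *= emission[pi[i]][x[i]]
        if i + 1 < len(pi):
            p *= transition[pi[i]][pi[i + 1]]
    return p

x  = [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1]
pi = list("FFFBBBBBFFF")
print(path_probability(x, pi))   # approximately 2.66e-6
```

Multiplying the eleven emission terms and eleven transition terms from the table above by hand gives the same value, roughly 2.66 × 10⁻⁶.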
Decoding Problem
Goal: Find an optimal hidden path of states given observations.
Input: Sequence of observations x = x_1…x_n generated by an HMM M(Σ, Q, A, E).
Output: A path that maximizes P(x|π) over all possible paths π.
Building Manhattan for the Decoding Problem
Andrew Viterbi used the Manhattan grid model to solve the Decoding Problem.
Every choice of π = π_1…π_n corresponds to a path in the graph.
The only valid direction in the graph is eastward.
This graph has |Q|²(n-1) edges.
Edit Graph for Decoding Problem
Decoding Problem vs. Alignment Problem
(Figure: valid directions in the alignment problem vs. valid directions in the decoding problem.)
Decoding Problem as Finding a Longest Path in a DAG
• The Decoding Problem is reduced to finding a longest path in the directed acyclic graph (DAG) above.
• Note: the length of a path is defined as the product of its edges’ weights, not the sum.
Decoding Problem (cont’d)
• Every path in the graph has probability P(x|π).
• The Viterbi algorithm finds the path that maximizes P(x|π) among all possible paths.
• The Viterbi algorithm runs in O(n|Q|²) time.
Decoding Problem: weights of edges
What is the weight w of the edge (k, i) → (l, i+1)?

P(x|π) = Π_{i=0}^{n-1} e_{π_{i+1}}(x_{i+1}) · a_{π_i,π_{i+1}}

The i-th term of this product is e_{π_{i+1}}(x_{i+1}) · a_{π_i,π_{i+1}} = e_l(x_{i+1}) · a_{kl} for π_i = k, π_{i+1} = l.
Therefore, the weight of the edge (k, i) → (l, i+1) is w = e_l(x_{i+1}) · a_{kl}.
Decoding Problem and Dynamic Programming
s_{l,i+1} = max_{k ∈ Q} { s_{k,i} · (weight of edge between (k,i) and (l,i+1)) }
          = max_{k ∈ Q} { s_{k,i} · a_{kl} · e_l(x_{i+1}) }
          = e_l(x_{i+1}) · max_{k ∈ Q} { s_{k,i} · a_{kl} }
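As a concrete instance, for the two-state Fair Bet Casino HMM (Q = {F, B}) the recurrence for the fair state reads:

s_{F,i+1} = e_F(x_{i+1}) · max{ s_{F,i} · a_{FF}, s_{B,i} · a_{BF} }

and symmetrically for s_{B,i+1}.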
Decoding Problem (cont’d)
Initialization:
• s_{begin,0} = 1
• s_{k,0} = 0 for k ≠ begin.
Let π* be the optimal path. Then
P(x|π*) = max_{k ∈ Q} { s_{k,n} · a_{k,end} }
Is there a problem here?
Viterbi Algorithm
The value of the product can become extremely small, which leads to underflow.
To avoid underflow, use log values instead:

s_{l,i+1} = log e_l(x_{i+1}) + max_{k ∈ Q} { s_{k,i} + log(a_{kl}) }
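A minimal log-space Viterbi sketch in Python, using the Fair Bet Casino parameters from the earlier example. The names are illustrative; instead of explicit begin/end states it uses a uniform initial distribution and simply takes the best final score at termination.

```python
import math

states = ['F', 'B']
emission = {'F': {0: 0.5, 1: 0.5}, 'B': {0: 0.25, 1: 0.75}}
transition = {'F': {'F': 0.9, 'B': 0.1}, 'B': {'F': 0.1, 'B': 0.9}}
initial = {'F': 0.5, 'B': 0.5}

def viterbi(x):
    """Most probable state path for observations x, computed with log scores."""
    n = len(x)
    s = [{} for _ in range(n)]       # s[i][k]: best log-score of a path ending in state k at position i
    back = [{} for _ in range(n)]    # back[i][k]: best predecessor of state k at position i
    for k in states:                 # initialization
        s[0][k] = math.log(initial[k]) + math.log(emission[k][x[0]])
    for i in range(n - 1):           # recurrence: s_{l,i+1} = log e_l(x_{i+1}) + max_k { s_{k,i} + log a_{kl} }
        for l in states:
            best_k = max(states, key=lambda k: s[i][k] + math.log(transition[k][l]))
            back[i + 1][l] = best_k
            s[i + 1][l] = (math.log(emission[l][x[i + 1]])
                           + s[i][best_k] + math.log(transition[best_k][l]))
    last = max(states, key=lambda k: s[n - 1][k])   # termination
    path = [last]
    for i in range(n - 1, 0, -1):                   # traceback via backpointers
        path.append(back[i][path[-1]])
    return ''.join(reversed(path))

print(viterbi([0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1]))
```

Each position is processed once per pair of states, so the sketch runs in O(n|Q|²) time, matching the bound stated earlier.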
FORWARD/BACKWARD
Forward-Backward Problem
Given: a sequence of coin tosses generated by an HMM.
Goal: find the probability that the dealer was using a biased coin at a particular time.
Forward Algorithm
Define f_{k,i} (forward probability) as the probability of emitting the prefix x_1…x_i and reaching the state π_i = k.
The recurrence for the forward algorithm:

f_{k,i} = e_k(x_i) · Σ_{l ∈ Q} f_{l,i-1} · a_{lk}
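A direct Python sketch of this recurrence (an illustrative helper; f[i][k] plays the role of f_{k,i}, with 0-based indexing of the sequence):

```python
def forward(x, states, emission, transition, initial):
    """f[i][k]: probability of emitting the prefix x[0..i] and ending in state k."""
    f = [{} for _ in x]
    for k in states:
        f[0][k] = initial[k] * emission[k][x[0]]
    for i in range(1, len(x)):
        for k in states:
            f[i][k] = emission[k][x[i]] * sum(f[i - 1][l] * transition[l][k] for l in states)
    return f
```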
Backward Algorithm
However, the forward probability is not the only factor affecting P(π_i = k|x).
The sequence of transitions and emissions that the HMM undergoes between π_{i+1} and π_n also affects P(π_i = k|x).
(Figure: the observation sequence split at x_i into a forward part and a backward part.)
Backward Algorithm (cont’d)
Define the backward probability b_{k,i} as the probability of being in state π_i = k and emitting the suffix x_{i+1}…x_n.
The recurrence for the backward algorithm:

b_{k,i} = Σ_{l ∈ Q} e_l(x_{i+1}) · b_{l,i+1} · a_{kl}
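And a matching sketch of the backward recurrence (again an illustrative helper; b[i][k] plays the role of b_{k,i}):

```python
def backward(x, states, emission, transition):
    """b[i][k]: probability of emitting the suffix x[i+1..n-1] given state k at position i."""
    n = len(x)
    b = [{} for _ in x]
    for k in states:
        b[n - 1][k] = 1.0                  # empty suffix
    for i in range(n - 2, -1, -1):
        for k in states:
            b[i][k] = sum(emission[l][x[i + 1]] * b[i + 1][l] * transition[k][l] for l in states)
    return b
```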
Forward-Backward Algorithm
The probability that the dealer used a biased coin at any moment i:

P(π_i = k|x) = P(x, π_i = k) / P(x) = f_k(i) · b_k(i) / P(x)

P(x) is the sum of P(x, π_i = k) over all k.
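Putting the two together, a sketch of the posterior computation (it assumes the forward and backward helpers sketched above):

```python
def posterior(x, states, emission, transition, initial):
    """P(pi_i = k | x) = f_k(i) * b_k(i) / P(x) for every position i and state k."""
    f = forward(x, states, emission, transition, initial)
    b = backward(x, states, emission, transition)
    px = sum(f[len(x) - 1][k] for k in states)          # P(x) = sum over k of f_{k,n}
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]

# posterior(x, states, emission, transition, initial)[i]['B'] is then the
# probability that the dealer was using the biased coin at position i.
```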
PROFILE HMM
Finding Distant Members of a Protein Family
A distant cousin of functionally related sequences in a protein family may have weak pairwise similarities with each member of the family and thus fail the significance test.
However, it may have weak similarities with many members of the family.
The goal is to align a sequence to all members of the family at once.
A family of related proteins can be represented by their multiple alignment and the corresponding profile.
Profile Representation of Protein Families
Aligned DNA sequences can be represented by a 4×n profile matrix reflecting the frequencies of nucleotides in every aligned position.
A protein family can be represented by a 20×n profile representing frequencies of amino acids.
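A minimal sketch of building such a profile matrix from a multiple alignment (the function name and the choice not to count gap characters are illustrative assumptions):

```python
from collections import Counter

def profile_matrix(alignment, alphabet="ACGT"):
    """Column-wise symbol frequencies of an alignment: a |alphabet| x n profile."""
    n = len(alignment[0])
    profile = {a: [0.0] * n for a in alphabet}
    for j in range(n):
        counts = Counter(seq[j] for seq in alignment)   # symbol counts in column j
        for a in alphabet:
            profile[a][j] = counts[a] / len(alignment)
    return profile

# Example: profile_matrix(["ACGT", "ACGA", "ATGT"]) yields per-column frequencies;
# pass a 20-letter amino acid alphabet for a protein family profile.
```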