0. Hidden Markov Models
Based on:
• "Foundations of Statistical NLP" by C. Manning & H. Schütze, ch. 9, MIT Press, 2002
• "Biological Sequence Analysis", R. Durbin et al., ch. 3 and 11.6, Cambridge University Press, 1998
1. PLAN
1 Markov Models
  - Markov assumptions
2 Hidden Markov Models
3 Fundamental questions for HMMs
  3.1 Probability of an observation sequence: the Forward algorithm, the Backward algorithm
  3.2 Finding the "best" sequence: the Viterbi algorithm
  3.3 HMM parameter estimation: the Forward-Backward (EM) algorithm
4 HMM extensions
5 Applications
2. 1 Markov Models (generally)
Markov Models are used to model a sequence of random variables in which each element depends on previous elements.
X = \langle X_1 ... X_T \rangle, with X_t \in S = {s_1, ..., s_N}
X is also called a Markov Process or Markov Chain.
S = set of states
\Pi = initial state probabilities: \pi_i = P(X_1 = s_i); \sum_{i=1}^{N} \pi_i = 1
A = transition probabilities: a_{ij} = P(X_{t+1} = s_j | X_t = s_i); \sum_{j=1}^{N} a_{ij} = 1 for all i
3. Markov assumptions
• Limited Horizon: P(X_{t+1} = s_i | X_1 ... X_t) = P(X_{t+1} = s_i | X_t) (first-order Markov model)
• Time Invariance: P(X_{t+1} = s_j | X_t = s_i) = p_{ij} for all t
Probability of a Markov Chain:
P(X_1 ... X_T) = P(X_1) P(X_2 | X_1) P(X_3 | X_1 X_2) ... P(X_T | X_1 X_2 ... X_{T-1})
               = P(X_1) P(X_2 | X_1) P(X_3 | X_2) ... P(X_T | X_{T-1})
               = \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}
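As a minimal sketch of this chain-probability formula (the two-state chain and its numbers are illustrative, not from the slides):

```python
import numpy as np

# Illustrative two-state chain: pi = initial probabilities, a = transition matrix.
pi = np.array([0.6, 0.4])
a = np.array([[0.7, 0.3],
              [0.2, 0.8]])

def chain_probability(states, pi, a):
    """P(X_1 .. X_T) = pi_{X_1} * prod_{t=1}^{T-1} a_{X_t, X_{t+1}}."""
    p = pi[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= a[s, s_next]
    return p

print(chain_probability([0, 0, 1, 1], pi, a))  # 0.6 * 0.7 * 0.3 * 0.8 = 0.1008
```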
4. A 1st Markov chain example: DNA (from [Durbin et al., 1998])
[Figure: a four-state Markov chain over the states A, C, G, T, with transitions between all pairs of states.]
Note: Here we leave transition probabilities unspecified.
5. A 2nd Markov chain example: CpG islands in DNA sequences
Maximum Likelihood estimation of parameters using real data (+ and -), where c^+_{st} counts how often letter t follows letter s in the '+' (CpG island) training sequences:

a^+_{st} = c^+_{st} / \sum_{t'} c^+_{st'}        a^-_{st} = c^-_{st} / \sum_{t'} c^-_{st'}

  +      A       C       G       T
  A    0.180   0.274   0.426   0.120
  C    0.171   0.368   0.274   0.188
  G    0.161   0.339   0.375   0.125
  T    0.079   0.355   0.384   0.182

  -      A       C       G       T
  A    0.300   0.205   0.285   0.210
  C    0.322   0.298   0.078   0.302
  G    0.248   0.246   0.298   0.208
  T    0.177   0.239   0.292   0.292
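The estimator above is just row-normalised dinucleotide counts. A minimal sketch of that computation, with a made-up two-sequence training set standing in for the real CpG data (the pseudocount is an assumption to keep unseen rows well-defined, not part of the pure ML estimator):

```python
import numpy as np

ALPHABET = "ACGT"
IDX = {c: i for i, c in enumerate(ALPHABET)}

def estimate_transitions(sequences, pseudocount=1.0):
    """Estimate a_st = c_st / sum_{t'} c_st' from dinucleotide counts."""
    counts = np.full((4, 4), pseudocount)  # pseudocounts avoid division by zero
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[IDX[s], IDX[t]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Toy stand-in for the '+' training set:
print(np.round(estimate_transitions(["ACGCGCGT", "CGCGGCGC"]), 3))
```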
6. Using log likelihood (log-odds) ratios for discrimination
S(x) = \log_2 \frac{P(x | model+)}{P(x | model-)} = \sum_{i=1}^{L} \log_2 \frac{a^+_{x_{i-1} x_i}}{a^-_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}

  \beta     A        C        G        T
  A      -0.740    0.419    0.580   -0.803
  C      -0.913    0.302    1.812   -0.685
  G      -0.624    0.461    0.331   -0.730
  T      -1.169    0.573    0.393   -0.679
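A sketch of the resulting discrimination rule, hard-coding the \beta table from this slide; S(x) > 0 favours the '+' (CpG island) model:

```python
import numpy as np

ALPHABET = "ACGT"
IDX = {c: i for i, c in enumerate(ALPHABET)}

# The beta (log-odds) table from the slide, rows and columns in ACGT order.
beta = np.array([
    [-0.740,  0.419,  0.580, -0.803],
    [-0.913,  0.302,  1.812, -0.685],
    [-0.624,  0.461,  0.331, -0.730],
    [-1.169,  0.573,  0.393, -0.679],
])

def log_odds(x):
    """S(x) = sum_i beta_{x_{i-1} x_i}; positive scores favour model+."""
    return sum(beta[IDX[s], IDX[t]] for s, t in zip(x, x[1:]))

print(log_odds("CGCGCG"))  # long CG runs score highly positive (6.358)
```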
7. 2 Hidden Markov Models
K = output alphabet = {k_1, ..., k_M}
B = output emission probabilities: b_{ijk} = P(O_t = k | X_t = s_i, X_{t+1} = s_j)
Notice that b_{ijk} does not depend on t.
In HMMs we only observe a probabilistic function of the state sequence: \langle O_1 ... O_T \rangle.
When the state sequence \langle X_1 ... X_T \rangle is also observable: Visible Markov Model (VMM).
Remark: In all our subsequent examples, b_{ijk} is independent of j.
8. A program for a HMM
t = 1;
start in state s_i with probability \pi_i (i.e., X_1 = s_i);
forever do
    move from state s_i to state s_j with probability a_{ij} (i.e., X_{t+1} = s_j);
    emit observation symbol O_t = k with probability b_{ijk};
    t = t + 1;
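The same program as a short Python sketch, in the general arc-emission form b_{ijk} defined above (the model arrays are placeholders the caller supplies):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(pi, a, b, T):
    """Run the generative program for T steps and return O_1 .. O_T.

    pi: (N,) initial probabilities; a: (N, N) transitions;
    b: (N, N, M) arc-emission probabilities b[i, j, k].
    """
    obs = []
    i = rng.choice(len(pi), p=pi)        # start: X_1 = s_i with probability pi_i
    for _ in range(T):
        j = rng.choice(len(pi), p=a[i])  # move: X_{t+1} = s_j with probability a_ij
        obs.append(rng.choice(b.shape[2], p=b[i, j]))  # emit O_t = k w.p. b_ijk
        i = j
    return obs
```

When b_{ijk} is independent of j (as in all the examples here), b[i, j] is the same distribution for every j.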
9. A 1st HMM example: CpG islands (from [Durbin et al., 1998])
[Figure: an eight-state HMM with states A+, C+, G+, T+ and A-, C-, G-, T-.]
Notes:
1. In addition to the transitions shown, there is also a complete set of transitions within each set (+ respectively -).
2. Transition probabilities in this model are set so that within each group they are close to the transition probabilities of the original model, but there is also a small chance of switching into the other component. Overall, there is more chance of switching from '+' to '-' than vice versa.
10. A 2nd HMM example: The occasionally dishonest casino (from [Durbin et al., 1998])
[Figure: a two-state HMM. The fair state F emits each of 1-6 with probability 1/6; the loaded state L emits 1-5 with probability 1/10 each and 6 with probability 1/2. The arcs are labelled with the switching probabilities (0.05 from F to L, 0.1 from L to F) and the corresponding self-loop probabilities (0.95 and 0.9).]
11. A 3rd HMM example: The crazy soft drink machine (from [Manning & Schütze, 2000])
[Figure: a two-state HMM with states CP (Cola Preference) and IP (Iced Tea Preference); \pi_CP = 1. Transitions: CP -> CP 0.7, CP -> IP 0.3, IP -> CP 0.5, IP -> IP 0.5. Emissions from CP: P(Coke) = 0.6, P(Ice tea) = 0.1, P(Lemon) = 0.3; from IP: P(Coke) = 0.1, P(Ice tea) = 0.7, P(Lemon) = 0.2.]
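Written out as a concrete model \mu = (A, B, \Pi), with emissions depending only on the current state (a sketch; the state and output orderings are my own choice):

```python
import numpy as np

# States: 0 = CP (Cola Preference), 1 = IP (Iced Tea Preference).
# Outputs: 0 = Coke, 1 = Ice tea, 2 = Lemon.
pi = np.array([1.0, 0.0])        # the machine starts in CP (pi_CP = 1)
a = np.array([[0.7, 0.3],        # CP -> CP, CP -> IP
              [0.5, 0.5]])       # IP -> CP, IP -> IP
b = np.array([[0.6, 0.1, 0.3],   # emission probabilities in state CP
              [0.1, 0.7, 0.2]])  # emission probabilities in state IP
```

This b[i, k] (state-emission) form is what the algorithm sketches below assume.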
12. A 4th example: A tiny HMM for 5' splice site recognition (from [Eddy, 2004])
[Figure: Eddy's toy HMM with exon (E), 5' splice site (5), and intron (I) states emitting DNA symbols.]
13. 3 Three fundamental questions for HMMs
1. Probability of an Observation Sequence: Given a model \mu = (A, B, \Pi) over S and K, how do we (efficiently) compute the likelihood of a particular sequence, P(O | \mu)?
2. Finding the "Best" State Sequence: Given an observation sequence and a model, how do we choose a state sequence (X_1, ..., X_{T+1}) to best explain the observation sequence?
3. HMM Parameter Estimation: Given an observation sequence (or corpus thereof), how do we acquire a model \mu = (A, B, \Pi) that best explains the data?
14. 3.1 Probability of an observation sequence
P(O | X, \mu) = \prod_{t=1}^{T} P(O_t | X_t, X_{t+1}, \mu) = b_{X_1 X_2 O_1} b_{X_2 X_3 O_2} ... b_{X_T X_{T+1} O_T}
P(O | \mu) = \sum_X P(O | X, \mu) P(X | \mu) = \sum_{X_1 ... X_{T+1}} \pi_{X_1} \prod_{t=1}^{T} a_{X_t X_{t+1}} b_{X_t X_{t+1} O_t}
Complexity: (2T + 1) N^{T+1} multiplications -- too inefficient.
Better: use dynamic programming to store partial results:
\alpha_i(t) = P(O_1 O_2 ... O_{t-1}, X_t = s_i | \mu)
15. 3.1.1 Probability of an observation sequence: The Forward algorithm
1. Initialization: \alpha_i(1) = \pi_i, for 1 <= i <= N
2. Induction: \alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t) a_{ij} b_{ijO_t}, for 1 <= t <= T, 1 <= j <= N
3. Total: P(O | \mu) = \sum_{i=1}^{N} \alpha_i(T+1)
Complexity: 2 N^2 T multiplications
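A direct NumPy transcription of the three steps (a sketch, assuming as in the examples that b_{ijk} does not depend on j, so emissions are an N x M matrix b[i, k]):

```python
import numpy as np

def forward(pi, a, b, obs):
    """Forward algorithm; row t of alpha holds alpha_.(t+1) in the slide's notation.

    pi: (N,) initial probs; a: (N, N) transitions; b: (N, M) state emissions;
    obs: list of output symbol indices. Returns (alpha, P(O | mu)).
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T + 1, N))
    alpha[0] = pi                                # initialization: alpha_i(1) = pi_i
    for t, o in enumerate(obs):                  # induction over t = 1 .. T
        alpha[t + 1] = (alpha[t] * b[:, o]) @ a  # sum_i alpha_i(t) a_ij b_{i, O_t}
    return alpha, alpha[T].sum()                 # total: sum_i alpha_i(T+1)
```

With the soft drink machine arrays above, forward(pi, a, b, [0, 1, 0]) gives P(Coke, Ice tea, Coke | \mu) = 0.063.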
16. Proof of the induction step:
\alpha_j(t+1) = P(O_1 O_2 ... O_{t-1} O_t, X_{t+1} = j | \mu)
= \sum_{i=1}^{N} P(O_1 ... O_{t-1} O_t, X_t = i, X_{t+1} = j | \mu)
= \sum_{i=1}^{N} P(O_1 ... O_{t-1}, X_t = i | \mu) P(O_t, X_{t+1} = j | O_1 ... O_{t-1}, X_t = i, \mu)
= \sum_{i=1}^{N} \alpha_i(t) P(O_t, X_{t+1} = j | X_t = i, \mu)   (by the Markov assumptions)
= \sum_{i=1}^{N} \alpha_i(t) P(O_t | X_t = i, X_{t+1} = j, \mu) P(X_{t+1} = j | X_t = i, \mu)
= \sum_{i=1}^{N} \alpha_i(t) b_{ijO_t} a_{ij}
17. Closeup of the Forward update step
[Figure: each state s_1, ..., s_N at time t, holding \alpha_i(t) = P(O_1 ... O_{t-1}, X_t = s_i | \mu), feeds into state s_j at time t+1 through an arc weighted a_{ij} b_{ijO_t}, yielding \alpha_j(t+1) = P(O_1 ... O_t, X_{t+1} = s_j | \mu).]
18. Trellis
[Figure: a trellis with states s_1, ..., s_N on the vertical axis and time 1, ..., T+1 on the horizontal axis.]
Each node (s_i, t) stores information about paths through s_i at time t.
19. 3.1.2 Probability of an observation sequence: The Backward algorithm
\beta_i(t) = P(O_t ... O_T | X_t = i, \mu)
1. Initialization: \beta_i(T+1) = 1, for 1 <= i <= N
2. Induction: \beta_i(t) = \sum_{j=1}^{N} a_{ij} b_{ijO_t} \beta_j(t+1), for 1 <= t <= T, 1 <= i <= N
3. Total: P(O | \mu) = \sum_{i=1}^{N} \pi_i \beta_i(1)
Complexity: 2 N^2 T multiplications
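The mirror-image sketch, under the same N x M emission convention as the forward sketch above:

```python
import numpy as np

def backward(pi, a, b, obs):
    """Backward algorithm; row t of beta holds beta_.(t+1) in the slide's notation.

    Returns (beta, P(O | mu)); arguments as in the forward sketch.
    """
    T, N = len(obs), len(pi)
    beta = np.zeros((T + 1, N))
    beta[T] = 1.0                                   # initialization: beta_i(T+1) = 1
    for t in range(T - 1, -1, -1):                  # induction, t = T .. 1
        beta[t] = b[:, obs[t]] * (a @ beta[t + 1])  # sum_j a_ij b_{i, O_t} beta_j(t+1)
    return beta, (pi * beta[0]).sum()               # total: sum_i pi_i beta_i(1)
```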
20. The Backward algorithm: Proofs
Induction:
\beta_i(t) = P(O_t O_{t+1} ... O_T | X_t = i, \mu)
= \sum_{j=1}^{N} P(O_t O_{t+1} ... O_T, X_{t+1} = j | X_t = i, \mu)
= \sum_{j=1}^{N} P(O_t ... O_T | X_t = i, X_{t+1} = j, \mu) P(X_{t+1} = j | X_t = i, \mu)
= \sum_{j=1}^{N} P(O_{t+1} ... O_T | O_t, X_t = i, X_{t+1} = j, \mu) P(O_t | X_t = i, X_{t+1} = j, \mu) a_{ij}
= \sum_{j=1}^{N} P(O_{t+1} ... O_T | X_{t+1} = j, \mu) b_{ijO_t} a_{ij}   (by the Markov assumptions)
= \sum_{j=1}^{N} \beta_j(t+1) b_{ijO_t} a_{ij}
Total:
P(O | \mu) = \sum_{i=1}^{N} P(O_1 O_2 ... O_T | X_1 = i, \mu) P(X_1 = i | \mu) = \sum_{i=1}^{N} \beta_i(1) \pi_i
21. Combining Forward and Backward probabilities
P(O, X_t = i | \mu) = \alpha_i(t) \beta_i(t)
P(O | \mu) = \sum_{i=1}^{N} \alpha_i(t) \beta_i(t), for 1 <= t <= T+1
Proofs:
P(O, X_t = i | \mu) = P(O_1 ... O_T, X_t = i | \mu)
= P(O_1 ... O_{t-1}, X_t = i, O_t ... O_T | \mu)
= P(O_1 ... O_{t-1}, X_t = i | \mu) P(O_t ... O_T | O_1 ... O_{t-1}, X_t = i, \mu)
= \alpha_i(t) P(O_t ... O_T | X_t = i, \mu) = \alpha_i(t) \beta_i(t)
P(O | \mu) = \sum_{i=1}^{N} P(O, X_t = i | \mu) = \sum_{i=1}^{N} \alpha_i(t) \beta_i(t)
Note: The "total" forward and backward formulae are special cases of the above one (for t = T+1 and t = 1, respectively).
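A quick numerical check of the identity, reusing the soft drink machine arrays and the forward/backward sketches from the previous slides:

```python
obs = [0, 1, 0]                        # Coke, Ice tea, Coke
alpha, p_fwd = forward(pi, a, b, obs)
beta, p_bwd = backward(pi, a, b, obs)
p_any_t = (alpha * beta).sum(axis=1)   # sum_i alpha_i(t) beta_i(t), one value per t
assert np.allclose(p_any_t, p_fwd) and np.isclose(p_fwd, p_bwd)
```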
22. 3.2 Finding the "best" state sequence
3.2.1 Posterior decoding
One way to find the most likely state sequence underlying the observation sequence: choose the states individually,
\gamma_i(t) = P(X_t = i | O, \mu)
\hat{X}_t = argmax_{1 <= i <= N} \gamma_i(t), for 1 <= t <= T+1
Computing \gamma_i(t):
\gamma_i(t) = P(X_t = i | O, \mu) = P(X_t = i, O | \mu) / P(O | \mu) = \alpha_i(t) \beta_i(t) / \sum_{j=1}^{N} \alpha_j(t) \beta_j(t)
Remark: \hat{X} maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely/unnatural state sequence.
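Posterior decoding is then one division and one argmax on top of the two sketches above:

```python
import numpy as np

def posterior_decode(pi, a, b, obs):
    """Return gamma and the per-position argmax decoding X_hat."""
    alpha, p = forward(pi, a, b, obs)
    beta, _ = backward(pi, a, b, obs)
    gamma = alpha * beta / p            # gamma[t, i] = P(X_{t+1} = s_i | O, mu)
    return gamma, gamma.argmax(axis=1)  # X_hat_t = argmax_i gamma_i(t)
```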
23. Note
Sometimes it is not the state itself that is of interest, but some other property derived from it.
For instance, in the CpG islands example, let g be a function defined on the set of states: g takes the value 1 for A+, C+, G+, T+ and 0 for A-, C-, G-, T-. Then
G_t = \sum_j P(X_t = s_j | O, \mu) g(s_j) = \sum_j \gamma_j(t) g(s_j)
is the posterior probability that the symbol O_t comes from a state in the '+' set. Thus it is possible to find the most probable label of the state at each position in the output sequence O.
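As a sketch, the posterior label probability G_t is a single weighted sum over the gamma matrix from the previous sketch (g is the 0/1 indicator of the '+' states):

```python
import numpy as np

def posterior_label_probability(gamma, g):
    """G_t = sum_j gamma_j(t) g(s_j): posterior probability of label '+' at each t."""
    return gamma @ g

# For the CpG model with states ordered A+, C+, G+, T+, A-, C-, G-, T-:
# g = np.array([1, 1, 1, 1, 0, 0, 0, 0])
```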