Hidden Markov Models

Hidden Markov Models. Bhiksha Raj, 10 Nov 2016. 11755/18797.

Machine Learning for Signal Processing: Hidden Markov Models

Prediction: a holy grail
• Physical trajectories: automobiles, rockets, heavenly bodies
• Natural phenomena: weather
• Financial


1. Probability that the HMM will follow a particular state sequence

  P(s1, s2, s3, …) = P(s1) P(s2|s1) P(s3|s2) …

• P(s1) is the probability that the process will initially be in state s1
• P(sj|si) is the transition probability of moving to state sj at the next time instant when the system is currently in si
  – Also denoted by Tij earlier

2. Generating Observations from States

HMM assumed to be generating data: state sequence → state distributions → observation sequence
• At each time it generates an observation from the state it is in at that time

3. Probability that the HMM will generate a particular observation sequence given a state sequence (state sequence known)

  P(o1, o2, o3, … | s1, s2, s3, …) = P(o1|s1) P(o2|s2) P(o3|s3) …

• P(oi|si) is the probability of generating observation oi when the system is in state si
  – Each term is computed from the Gaussian or Gaussian mixture for the corresponding state

4. Proceeding through States and Producing Observations

HMM assumed to be generating data: state sequence → state distributions → observation sequence
• At each time it produces an observation and makes a transition

5. Probability that the HMM will generate a particular state sequence and, from it, a particular observation sequence

  P(o1, o2, o3, …, s1, s2, s3, …)
    = P(o1, o2, o3, … | s1, s2, s3, …) P(s1, s2, s3, …)
    = P(o1|s1) P(o2|s2) P(o3|s3) … P(s1) P(s2|s1) P(s3|s2) …
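The joint probability above is just a running product of one emission term and one transition term per time step. A minimal plain-Python sketch, using a made-up 2-state HMM with discrete observations (all numbers below are illustrative assumptions, not parameters from these slides):

```python
# Hypothetical 2-state HMM with 2 discrete observation symbols.
pi = [1.0, 0.0]                       # P(s): process starts in state 0
A = [[0.6, 0.4],                      # A[i][j] = P(next state j | state i)
     [0.3, 0.7]]
B = [[0.9, 0.1],                      # B[i][k] = P(observation k | state i)
     [0.2, 0.8]]

def joint_prob(states, obs):
    """P(o1..oT, s1..sT) = P(s1)P(o1|s1) * prod_t P(st|st-1)P(ot|st)."""
    p = pi[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p
```

For example, joint_prob([0, 0, 1], [0, 0, 1]) multiplies the start term 1.0·0.9, then the step terms 0.6·0.9 and 0.4·0.8.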

6. Probability of Generating an Observation Sequence

• The precise state sequence is not known
• All possible state sequences must be considered

  P(o1, o2, o3, …) = Σ over all possible state sequences of P(o1, o2, o3, …, s1, s2, s3, …)
    = Σ over all possible state sequences of P(o1|s1) P(o2|s2) P(o3|s3) … P(s1) P(s2|s1) P(s3|s2) …

7. Computing it Efficiently

• Explicit summing over all state sequences is not tractable: the number of possible state sequences grows exponentially with the length of the observation
• Instead we use the forward algorithm, a dynamic programming technique

8. Illustrative Example

• Example: a generic HMM with 5 states and a "terminating state"
  – Left-to-right topology
• P(si) = 1 for state 1 and 0 for the others
  – The arrows represent transitions whose probability is not 0
• Notation:
  – P(sj|si) = Tij
  – We represent P(ot|si) = bi(t) for brevity

9. Diversion: The Trellis

(Figure: trellis with state index on the Y axis and feature vectors over time on the X axis.)
• The trellis is a graphical representation of all possible paths through the HMM to produce a given observation
• The Y axis represents HMM states, the X axis represents observations
• Every edge in the graph represents a valid transition in the HMM over a single time step
• Every node represents the event of a particular observation being generated from a particular state

10. The Forward Algorithm

  α(s, t) = P(x1, x2, …, xt, state(t) = s)

• α(s, t) is the total probability of ALL state sequences that end at state s at time t, and all observations until xt

11. The Forward Algorithm

  α(s, t) = P(x1, x2, …, xt, state(t) = s)

  α(s, t) = Σ over s' of α(s', t-1) P(s|s') P(xt|s)

• α(s, t) can be recursively computed in terms of α(s', t-1), the forward probabilities at time t-1
• The recursion is estimated starting from the first time instant (forward recursion)
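The forward recursion above can be sketched in a few lines of plain Python (discrete observations; the 2-state HMM numbers are illustrative assumptions, not from the slides):

```python
pi = [0.6, 0.4]                       # initial state probabilities P(s)
A = [[0.7, 0.3],                      # A[i][j] = P(next state j | state i)
     [0.4, 0.6]]
B = [[0.9, 0.1],                      # B[i][k] = P(observation k | state i)
     [0.2, 0.8]]

def forward(pi, A, B, obs):
    """alpha[t][s] = P(x1..xt, state(t) = s), by the forward recursion."""
    S = len(pi)
    alpha = [[pi[s] * B[s][obs[0]] for s in range(S)]]
    for t in range(1, len(obs)):
        # alpha(s,t) = sum_{s'} alpha(s', t-1) P(s|s') P(x_t|s)
        alpha.append([sum(alpha[t - 1][q] * A[q][s] for q in range(S))
                      * B[s][obs[t]] for s in range(S)])
    return alpha

# Total observation probability: sum the final alphas over all states.
total = sum(forward(pi, A, B, [0, 1])[-1])
```

Summing the final row over states gives the total probability of the observation, as described on the next slide.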

12. The Forward Algorithm

  TotalProb = Σ over s of α(s, T)

• After the final observation, the alpha at each state gives the probability of all state sequences ending at that state
• General model: the total probability of the observation is the sum of the alpha values at all states

13. The absorbing state

• Observation sequences are assumed to end only when the process arrives at an absorbing state
  – No observations are produced from the absorbing state

14. The Forward Algorithm

  TotalProb = α(s_absorbing, T+1)

  α(s_absorbing, T+1) = Σ over s' of α(s', T) P(s_absorbing|s')

• Absorbing-state model: the total probability is the alpha computed at the absorbing state after the final observation

15. Problem 2: State segmentation

• Given only a sequence of observations, how do we determine which sequence of states was followed in producing it?

16. The HMM as a generator

HMM assumed to be generating data: state sequence → state distributions → observation sequence
• The process goes through a series of states and produces observations from them

17. States are hidden

HMM assumed to be generating data: state sequence → state distributions → observation sequence
• The observations do not reveal the underlying state

18. The state segmentation problem

HMM assumed to be generating data: state sequence → state distributions → observation sequence
• State segmentation: estimate the state sequence given the observations

19. Estimating the State Sequence

• Many different state sequences are capable of producing the observation
• Solution: identify the most probable state sequence
  – The state sequence for which the probability of progressing through that sequence and generating the observation sequence is maximum
  – i.e. the sequence for which P(o1, o2, o3, …, s1, s2, s3, …) is maximum

20. Estimating the state sequence

• Once again, exhaustive evaluation is impossibly expensive
• But once again a simple dynamic-programming solution is available

  P(o1, o2, o3, …, s1, s2, s3, …) = P(o1|s1) P(o2|s2) P(o3|s3) … P(s1) P(s2|s1) P(s3|s2) …

• Needed: argmax over s1, s2, s3, … of P(o1|s1) P(s1) · P(o2|s2) P(s2|s1) · P(o3|s3) P(s3|s2) …


22. The HMM as a generator

HMM assumed to be generating data: state sequence → state distributions → observation sequence
• Each enclosed term represents one forward transition and a subsequent emission

23. The state sequence

• The probability of a state sequence ?,?,?,?,sx,sy ending at time t and producing all observations until ot:
  – P(o1..t-1, ?,?,?,?, sx, ot, sy) = P(o1..t-1, ?,?,?,?, sx) P(ot|sy) P(sy|sx)
• The best state sequence that ends with sx, sy at t will have a probability equal to the probability of the best state sequence ending at t-1 at sx, times P(ot|sy) P(sy|sx)

24. Extending the state sequence

(Figure: state sequence ending in sx, sy, with state distributions and observations up to time t.)
• The probability of a state sequence ?,?,?,?,sx,sy ending at time t and producing observations until ot:
  – P(o1..t-1, ot, ?,?,?,?, sx, sy) = P(o1..t-1, ?,?,?,?, sx) P(ot|sy) P(sy|sx)

25. Trellis

• The graph below shows the set of all possible state sequences through this HMM in five time instants

26. The cost of extending a state sequence

• The cost of extending a state sequence ending at sx depends only on the transition from sx to sy and the observation probability at sy:

  P(ot|sy) P(sy|sx)

27. The cost of extending a state sequence

• The best path to sy through sx is simply an extension of the best path to sx:

  BestP(o1..t-1, ?,?,?,?, sx) P(ot|sy) P(sy|sx)

28. The Recursion

• The overall best path to sy is an extension of the best path to one of the states at the previous time

29. The Recursion

  Prob. of best path to sy = max over sx of BestP(o1..t-1, ?,?,?,?, sx) P(ot|sy) P(sy|sx)

30. Finding the best state sequence

• The simple algorithm just presented is called the VITERBI algorithm in the literature
  – After A. J. Viterbi, who invented this dynamic programming algorithm for a completely different purpose: decoding error correction codes!
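The recursion and backtrace can be sketched in plain Python (discrete observations; the toy 2-state parameters are illustrative assumptions, not from the slides):

```python
pi = [0.6, 0.4]                       # initial state probabilities P(s)
A = [[0.7, 0.3], [0.4, 0.6]]          # A[i][j] = P(next state j | state i)
B = [[0.9, 0.1], [0.2, 0.8]]          # B[i][k] = P(observation k | state i)

def viterbi(pi, A, B, obs):
    """Most probable state sequence for obs, by dynamic programming."""
    S = len(pi)
    # delta[t][s]: probability of the best path ending in state s at time t
    delta = [[pi[s] * B[s][obs[0]] for s in range(S)]]
    back = []                         # back[t-1][s]: best predecessor of s at t
    for t in range(1, len(obs)):
        scores, ptrs = [], []
        for s in range(S):
            best = max(range(S), key=lambda q: delta[t - 1][q] * A[q][s])
            scores.append(delta[t - 1][best] * A[best][s] * B[s][obs[t]])
            ptrs.append(best)
        delta.append(scores)
        back.append(ptrs)
    # Backtrace from the best final state.
    path = [max(range(S), key=lambda s: delta[-1][s])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]
```

Each step keeps, per state, only the best incoming path, which is exactly why the search is tractable.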

31. Viterbi Search (contd.)

• The initial state is initialized with path score P1(1) = P(s1) b1(1)
• In this example all other states have score 0, since P(si) = 0 for them

32. Viterbi Search (contd.)

  Pj(t) = max over i of [Pi(t-1) tij bj(t)]

• Pj(t): total path score ending up at state j at time t
• tij: state transition probability from i to j
• bj(t): score for state j, given the input at time t

33. Viterbi Search (contd.)

  Pj(t) = max over i of [Pi(t-1) tij bj(t)]

• Pj(t): total path score ending up at state j at time t
• tij: state transition probability from i to j
• bj(t): score for state j, given the input at time t

34.–40. Viterbi Search (contd.)

(Figures: the search advances through the trellis one time step at a time, keeping the best path score at every state.)

41. Viterbi Search (contd.)

• The best state sequence is the estimate of the state sequence followed in generating the observation

42. Problem 3: Training HMM parameters

• We can compute the probability of an observation, and the best state sequence given an observation, using the HMM's parameters
• But where do the HMM parameters come from?
• They must be learned from a collection of observation sequences

43. Learning HMM parameters: Simple procedure – counting

• Given a set of training instances
• Iteratively:
  1. Initialize HMM parameters
  2. Segment all training instances
  3. Estimate transition probabilities and state output probability parameters by counting

44. Learning by counting example

• Explanation by example in the next few slides
• 2-state HMM, Gaussian PDF at states, 3 observation sequences
• Example shows ONE iteration: how to count after state sequences are obtained

45. Example: Learning HMM Parameters

• We have an HMM with two states S1 and S2
• Observations are vectors xij (i-th sequence, j-th vector)
• We are given the following three observation sequences, and have already estimated state sequences:

Observation 1:
  Time:  1    2    3    4    5    6    7    8    9    10
  State: S1   S1   S2   S2   S2   S1   S1   S2   S1   S1
  Obs:   Xa1  Xa2  Xa3  Xa4  Xa5  Xa6  Xa7  Xa8  Xa9  Xa10

Observation 2:
  Time:  1    2    3    4    5    6    7    8    9
  State: S2   S2   S1   S1   S2   S2   S2   S2   S1
  Obs:   Xb1  Xb2  Xb3  Xb4  Xb5  Xb6  Xb7  Xb8  Xb9

Observation 3:
  Time:  1    2    3    4    5    6    7    8
  State: S1   S2   S1   S1   S1   S2   S2   S2
  Obs:   Xc1  Xc2  Xc3  Xc4  Xc5  Xc6  Xc7  Xc8

46. Example: Learning HMM Parameters

• Initial state probabilities (usually denoted as π):
  – We have 3 observation sequences
  – 2 of these begin with S1, and one with S2
  – π(S1) = 2/3, π(S2) = 1/3
(State-segmented sequences as in slide 45.)

47.–50. Example: Learning HMM Parameters

• Transition probabilities out of S1:
  – State S1 occurs 11 times in non-terminal locations
  – Of these, it is followed immediately by S1 6 times
  – It is followed immediately by S2 5 times
  – P(S1|S1) = 6/11; P(S2|S1) = 5/11
(State-segmented sequences as in slide 45.)

51.–54. Example: Learning HMM Parameters

• Transition probabilities out of S2:
  – State S2 occurs 13 times in non-terminal locations
  – Of these, it is followed immediately by S1 5 times
  – It is followed immediately by S2 8 times
  – P(S1|S2) = 5/13; P(S2|S2) = 8/13
(State-segmented sequences as in slide 45.)

55. Parameters learnt so far

• State initial probabilities, often denoted as π
  – π(S1) = 2/3 = 0.66
  – π(S2) = 1/3 = 0.33
• State transition probabilities
  – P(S1|S1) = 6/11 = 0.545; P(S2|S1) = 5/11 = 0.455
  – P(S1|S2) = 5/13 = 0.385; P(S2|S2) = 8/13 = 0.615
  – Represented as a transition matrix:

    A = [ P(S1|S1)  P(S2|S1) ]  =  [ 0.545  0.455 ]
        [ P(S1|S2)  P(S2|S2) ]     [ 0.385  0.615 ]

• Each row of this matrix must sum to 1.0
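The counts in the preceding slides can be reproduced mechanically. A small sketch that tallies the three state-segmented sequences above, written as strings of state labels ("1" for S1, "2" for S2):

```python
from collections import Counter

# The three state sequences from the example ("1" = S1, "2" = S2).
seqs = ["1122211211", "221122221", "12111222"]

# Initial probabilities: fraction of sequences starting in each state.
pi_S1 = sum(s[0] == "1" for s in seqs) / len(seqs)          # 2/3

# Transition counts over all consecutive pairs of states.
trans = Counter((s[t], s[t + 1]) for s in seqs for t in range(len(s) - 1))
from_S1 = trans[("1", "1")] + trans[("1", "2")]             # 11
from_S2 = trans[("2", "1")] + trans[("2", "2")]             # 13
P_S1_S1 = trans[("1", "1")] / from_S1                       # 6/11
P_S2_S1 = trans[("1", "2")] / from_S1                       # 5/11
P_S1_S2 = trans[("2", "1")] / from_S2                       # 5/13
P_S2_S2 = trans[("2", "2")] / from_S2                       # 8/13
```

This reproduces exactly the 6/11, 5/11, 5/13, 8/13 values derived by hand on the slides.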

56. Example: Learning HMM Parameters

• State output probability for S1
  – There are 13 observations in S1
(State-segmented sequences as in slide 45.)

57. Example: Learning HMM Parameters

• State output probability for S1
  – There are 13 observations in S1: Obs 1 at times 1, 2, 6, 7, 9, 10; Obs 2 at times 3, 4, 9; Obs 3 at times 1, 3, 4, 5
  – Segregate them out and count
• Compute parameters (mean and variance) of the Gaussian output density for state S1:

  P(X|S1) = (1 / sqrt((2π)^d |Θ1|)) exp(-0.5 (X - μ1)^T Θ1^{-1} (X - μ1))

  μ1 = (1/13)(Xa1 + Xa2 + Xa6 + Xa7 + Xa9 + Xa10 + Xb3 + Xb4 + Xb9 + Xc1 + Xc3 + Xc4 + Xc5)

  Θ1 = (1/13)[(Xa1 - μ1)(Xa1 - μ1)^T + (Xa2 - μ1)(Xa2 - μ1)^T + … + (Xc5 - μ1)(Xc5 - μ1)^T]

58. Example: Learning HMM Parameters

• State output probability for S2
  – There are 14 observations in S2
(State-segmented sequences as in slide 45.)

59. Example: Learning HMM Parameters

• State output probability for S2
  – There are 14 observations in S2: Obs 1 at times 3, 4, 5, 8; Obs 2 at times 1, 2, 5, 6, 7, 8; Obs 3 at times 2, 6, 7, 8
  – Segregate them out and count
• Compute parameters (mean and variance) of the Gaussian output density for state S2:

  P(X|S2) = (1 / sqrt((2π)^d |Θ2|)) exp(-0.5 (X - μ2)^T Θ2^{-1} (X - μ2))

  μ2 = (1/14)(Xa3 + Xa4 + Xa5 + Xa8 + Xb1 + Xb2 + Xb5 + Xb6 + Xb7 + Xb8 + Xc2 + Xc6 + Xc7 + Xc8)

  Θ2 = (1/14)[(Xa3 - μ2)(Xa3 - μ2)^T + … + (Xc8 - μ2)(Xc8 - μ2)^T]

60. We have learnt all the HMM parameters

• State initial probabilities, often denoted as π
  – π(S1) = 2/3 = 0.66; π(S2) = 1/3 = 0.33
• State transition probabilities:

  A = [ 0.545  0.455 ]
      [ 0.385  0.615 ]

• State output probabilities:

  P(X|S1) = (1 / sqrt((2π)^d |Θ1|)) exp(-0.5 (X - μ1)^T Θ1^{-1} (X - μ1))
  P(X|S2) = (1 / sqrt((2π)^d |Θ2|)) exp(-0.5 (X - μ2)^T Θ2^{-1} (X - μ2))

61. Update rules at each iteration

  π(s) = (no. of observation sequences that start at state s) / (total no. of observation sequences)

  P(sj|si) = (no. of time instants, over all sequences, where state(t) = si and state(t+1) = sj) / (no. of time instants where state(t) = si)

  μi = [Σ over obs, t with state(t) = si of X_{obs,t}] / (no. of time instants where state(t) = si)

  Θi = [Σ over obs, t with state(t) = si of (X_{obs,t} - μi)(X_{obs,t} - μi)^T] / (no. of time instants where state(t) = si)

• Assumes state output PDF = Gaussian
  – For GMMs, estimate GMM parameters from the collection of observations at any state
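For the state output densities, the update is just the sample mean and (co)variance of the observations assigned to each state. A 1-D sketch with made-up data (the (state, value) pairs below are illustrative assumptions, not the slides' vectors):

```python
# Hard (Viterbi-style) state assignments: (state, observation) pairs.
segmented = [(0, 1.0), (0, 1.2), (1, 3.0), (0, 0.8), (1, 3.4)]

def gaussian_update(segmented, s):
    """Mean and variance of all observations assigned to state s."""
    xs = [x for state, x in segmented if state == s]
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var
```

In the vector case, the variance line becomes an outer-product sum giving the covariance matrix Θi.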

62. Training by segmentation: Viterbi training

• Initialize all HMM parameters
• Segment all training observation sequences into states using the Viterbi algorithm with the current models
• Using the estimated state sequences and the training observation sequences, re-estimate the HMM parameters
• Repeat until the models converge
• This method is also called a "segmental k-means" learning procedure

63. Alternative to counting: SOFT counting

• Expectation maximization
• Every observation contributes to every state

64. Update rules at each iteration

  π(s) = [Σ over Obs of P(state(1) = s | Obs)] / (total no. of observation sequences)

  P(sj|si) = [Σ over Obs, t of P(state(t) = si, state(t+1) = sj | Obs)] / [Σ over Obs, t of P(state(t) = si | Obs)]

  μi = [Σ over Obs, t of P(state(t) = si | Obs) X_{Obs,t}] / [Σ over Obs, t of P(state(t) = si | Obs)]

  Θi = [Σ over Obs, t of P(state(t) = si | Obs) (X_{Obs,t} - μi)(X_{Obs,t} - μi)^T] / [Σ over Obs, t of P(state(t) = si | Obs)]

• Every observation contributes to every state

65. Update rules at each iteration

(The same update equations, repeated.)

  π(s) = [Σ over Obs of P(state(1) = s | Obs)] / (total no. of observation sequences)

  P(sj|si) = [Σ over Obs, t of P(state(t) = si, state(t+1) = sj | Obs)] / [Σ over Obs, t of P(state(t) = si | Obs)]

  μi = [Σ over Obs, t of P(state(t) = si | Obs) X_{Obs,t}] / [Σ over Obs, t of P(state(t) = si | Obs)]

  Θi = [Σ over Obs, t of P(state(t) = si | Obs) (X_{Obs,t} - μi)(X_{Obs,t} - μi)^T] / [Σ over Obs, t of P(state(t) = si | Obs)]

• Where did these terms come from?

66. P(state(t) = s | Obs)

• The probability that the process was at s when it generated xt, given the entire observation
• Dropping the "Obs" subscript for brevity:

  P(state(t) = s | x1, x2, …, xT) ∝ P(state(t) = s, x1, x2, …, xT)

• We will compute P(state(t) = s, x1, x2, …, xT) first
  – This is the probability that the process visited s at time t while producing the entire observation

67. P(state(t) = s, x1, x2, …, xT)

• The probability that the HMM was in a particular state s when generating the observation sequence is the probability that it followed a state sequence that passed through s at time t

68. P(state(t) = s, x1, x2, …, xT)

• This can be decomposed into two multiplicative sections
  – The section of the lattice leading into state s at time t, and the section leading out of it

69. The Forward Paths

• The probability of the red section is the total probability of all state sequences ending at state s at time t
  – This is simply α(s, t)
  – Can be computed using the forward algorithm

70. The Backward Paths

• The blue portion represents the probability of all state sequences that began at state s at time t
  – Like the red portion, it can be computed using a backward recursion

71. The Backward Recursion

  β(s, t) = P(x_{t+1}, x_{t+2}, …, x_T | state(t) = s)

  β(s, t) = Σ over s' of β(s', t+1) P(s'|s) P(x_{t+1}|s')

• β(s, t) is the total probability of ALL state sequences that depart from s at time t, and of all observations after xt
• Can be recursively estimated starting from the final time instant (backward recursion)
  – β(s, T) = 1 at the final time instant for all valid final states
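The backward recursion mirrors the forward one, running from the last time step to the first. A plain-Python sketch (the 2-state toy parameters are illustrative assumptions, not from the slides):

```python
pi = [0.6, 0.4]                       # initial state probabilities P(s)
A = [[0.7, 0.3], [0.4, 0.6]]          # A[i][j] = P(next state j | state i)
B = [[0.9, 0.1], [0.2, 0.8]]          # B[i][k] = P(observation k | state i)

def backward(A, B, obs):
    """beta[t][s] = P(x_{t+1}, ..., x_T | state(t) = s)."""
    S, T = len(A), len(obs)
    beta = [[0.0] * S for _ in range(T)]
    beta[T - 1] = [1.0] * S           # beta(s, T) = 1 for all states
    for t in range(T - 2, -1, -1):
        # beta(s,t) = sum_{s'} beta(s', t+1) P(s'|s) P(x_{t+1}|s')
        for s in range(S):
            beta[t][s] = sum(beta[t + 1][q] * A[s][q] * B[q][obs[t + 1]]
                             for q in range(S))
    return beta

# Sanity check: sum_s pi(s) P(x1|s) beta(s, 1) is the total observation
# probability, the same quantity the forward algorithm computes.
obs = [0, 1]
total = sum(pi[s] * B[s][obs[0]] * backward(A, B, obs)[0][s] for s in range(2))
```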

72. The complete probability

  α(s, t) β(s, t) = P(x1, x2, …, xT, state(t) = s)

73. Posterior probability of a state

• The probability that the process was in state s at time t, given that we have observed the data, is obtained by simple normalization:

  P(state(t) = s | Obs) = P(state(t) = s, x1, …, xT) / Σ over s' of P(state(t) = s', x1, …, xT)
    = α(s, t) β(s, t) / Σ over s' of α(s', t) β(s', t)

• This term is often referred to as the gamma term and denoted by γ_{s,t}
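Putting the pieces together: a self-contained sketch that computes alpha, beta, and the gamma posteriors for a toy 2-state HMM (all numbers are illustrative assumptions, not from the slides):

```python
pi = [0.6, 0.4]                       # initial state probabilities P(s)
A = [[0.7, 0.3], [0.4, 0.6]]          # A[i][j] = P(next state j | state i)
B = [[0.9, 0.1], [0.2, 0.8]]          # B[i][k] = P(observation k | state i)
obs = [0, 1, 1, 0]
S, T = len(pi), len(obs)

# Forward pass: alpha[t][s] = P(x1..xt, state(t) = s)
alpha = [[pi[s] * B[s][obs[0]] for s in range(S)]]
for t in range(1, T):
    alpha.append([sum(alpha[t - 1][q] * A[q][s] for q in range(S))
                  * B[s][obs[t]] for s in range(S)])

# Backward pass: beta[t][s] = P(x_{t+1}..xT | state(t) = s)
beta = [[1.0] * S for _ in range(T)]
for t in range(T - 2, -1, -1):
    beta[t] = [sum(beta[t + 1][q] * A[s][q] * B[q][obs[t + 1]]
                   for q in range(S)) for s in range(S)]

# Posterior: gamma[t][s] = alpha(s,t)beta(s,t) / sum_s' alpha(s',t)beta(s',t)
gamma = []
for t in range(T):
    num = [alpha[t][s] * beta[t][s] for s in range(S)]
    z = sum(num)
    gamma.append([v / z for v in num])
```

At every t, alpha(s,t)·beta(s,t) summed over states equals the total observation probability, and each gamma row sums to 1, which is what the normalization on this slide guarantees.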
