Probability that the HMM will follow a particular state sequence

P(s1, s2, s3, ...) = P(s1) P(s2|s1) P(s3|s2) ...

• P(s1) is the probability that the process will initially be in state s1
• P(sj|si) is the transition probability of moving to state sj at the next time instant when the system is currently in si
  – Also denoted by Tij earlier
Generating Observations from States

HMM assumed to be generating data: state sequence → state distributions → observation sequence

• At each time it generates an observation from the state it is in at that time
Probability that the HMM will generate a particular observation sequence given a state sequence (state sequence known)

P(o1, o2, o3, ... | s1, s2, s3, ...) = P(o1|s1) P(o2|s2) P(o3|s3) ...

• P(oi|si) is the probability of generating observation oi when the system is in state si
  – Each term, e.g. P(o1|s1), is computed from the Gaussian or Gaussian mixture for that state
Proceeding through States and Producing Observations

HMM assumed to be generating data: state sequence → state distributions → observation sequence

• At each time it produces an observation and makes a transition
Probability that the HMM will generate a particular state sequence and, from it, a particular observation sequence

P(o1, o2, o3, ..., s1, s2, s3, ...)
  = P(o1, o2, o3, ... | s1, s2, s3, ...) P(s1, s2, s3, ...)
  = P(o1|s1) P(o2|s2) P(o3|s3) ... P(s1) P(s2|s1) P(s3|s2) ...
Probability of Generating an Observation Sequence

• The precise state sequence is not known
• All possible state sequences must be considered

P(o1, o2, o3, ...) = Σ over all possible state sequences of P(o1, o2, o3, ..., s1, s2, s3, ...)
                   = Σ over all possible state sequences of P(o1|s1) P(o2|s2) P(o3|s3) ... P(s1) P(s2|s1) P(s3|s2) ...
Computing it Efficiently

• Explicit summing over all state sequences is not tractable
  – A very large number of possible state sequences
• Instead we use the forward algorithm
  – A dynamic programming technique
Illustrative Example

• Example: a generic HMM with 5 states and a "terminating state"
  – Left-to-right topology
• P(si) = 1 for state 1 and 0 for the others
  – The arrows represent transitions for which the probability is not 0
• Notation:
  – P(sj|si) = Tij
  – We represent P(ot|si) = bi(t) for brevity
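As a concrete illustration of this notation, here is a minimal sketch (Python/NumPy) of how such a left-to-right HMM might be represented. The variable names (n_states, T, means, covs), the 0.5/0.5 transition values, and the use of Gaussian output densities are assumptions for illustration only, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

# A minimal left-to-right HMM with 5 emitting states plus a terminating state.
n_states = 5
pi = np.zeros(n_states)
pi[0] = 1.0                              # P(s_i): start in state 1 with probability 1

# Transition matrix T[i, j] = P(s_j | s_i); only self-loops and forward moves allowed.
T = np.zeros((n_states, n_states + 1))   # extra column for the terminating state
for i in range(n_states):
    T[i, i] = 0.5                        # stay in the same state
    T[i, i + 1] = 0.5                    # move to the next state (or terminate from the last)

# Gaussian state output densities; b_i(t) = P(o_t | s_i).
dim = 2
means = np.zeros((n_states, dim))        # placeholder parameters
covs = np.stack([np.eye(dim)] * n_states)

def b(i, o_t):
    """Emission probability of observation o_t from state i."""
    return multivariate_normal.pdf(o_t, mean=means[i], cov=covs[i])
```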
Diversion: The Trellis

[Figure: the trellis — state index on the Y-axis, time (feature vectors) on the X-axis, with a node α(s, t) marked at state s and time t]

• The trellis is a graphical representation of all possible paths through the HMM to produce a given observation
• The Y-axis represents HMM states, the X-axis represents observations
• Every edge in the graph represents a valid transition in the HMM over a single time step
• Every node represents the event of a particular observation being generated from a particular state
The Forward Algorithm

α(s, t) = P(x1, x2, ..., xt, state(t) = s)

• α(s, t) is the total probability of ALL state sequences that end at state s at time t, and all observations until xt
The Forward Algorithm

α(s, t) = P(x1, x2, ..., xt, state(t) = s)

α(s, t) = [ Σ over s' of α(s', t-1) P(s|s') ] P(xt|s)

• α(s, t) can be recursively computed in terms of α(s', t-1), the forward probabilities at time t-1
• The recursion starts from the first time instant (forward recursion)
The Forward Algorithm

Totalprob = Σ over s of α(s, T)

• At the final observation, the alpha at each state gives the probability of all state sequences ending at that state
• General model: the total probability of the observation is the sum of the alpha values at all states
The absorbing state

• Observation sequences are assumed to end only when the process arrives at an absorbing state
  – No observations are produced from the absorbing state
The Forward Algorithm

Totalprob = α(s_absorbing, T+1)

α(s_absorbing, T+1) = Σ over s' of α(s', T) P(s_absorbing|s')

• Absorbing-state model: the total probability is the alpha computed at the absorbing state after the final observation
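Putting the recursion and the two termination options together, here is a minimal sketch of the forward algorithm. It assumes an HMM interface like the earlier sketch (pi, T, b are illustrative names, not from the slides).

```python
import numpy as np

def forward(pi, T, b, X):
    """Forward algorithm: returns alpha[s, t] = P(x_1..x_t, state(t) = s).

    pi : initial state probabilities, shape (N,)
    T  : transition matrix, T[i, j] = P(s_j | s_i), shape (N, N)
    b  : b(s, x) -> emission probability P(x | s)
    X  : observation sequence
    """
    N, T_len = len(pi), len(X)
    alpha = np.zeros((N, T_len))

    # Initialization: alpha(s, 1) = P(s) P(x_1 | s)
    for s in range(N):
        alpha[s, 0] = pi[s] * b(s, X[0])

    # Recursion: alpha(s, t) = [sum_s' alpha(s', t-1) P(s | s')] P(x_t | s)
    for t in range(1, T_len):
        for s in range(N):
            alpha[s, t] = alpha[:, t - 1].dot(T[:, s]) * b(s, X[t])

    return alpha

# Termination, general model: total probability is the sum of alphas at the final time:
#   totalprob = alpha[:, -1].sum()
# Absorbing-state model: totalprob = sum_s' alpha(s', T) P(s_absorbing | s'), i.e.
#   totalprob = alpha[:, -1].dot(T_to_absorbing)
# for a vector of transition probabilities into the absorbing state.
```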
Problem 2: State segmentation

• Given only a sequence of observations, how do we determine which sequence of states was followed in producing it?
The HMM as a generator

HMM assumed to be generating data: state sequence → state distributions → observation sequence

• The process goes through a series of states and produces observations from them
States are hidden

HMM assumed to be generating data: state sequence → state distributions → observation sequence

• The observations do not reveal the underlying state
The state segmentation problem

HMM assumed to be generating data: state sequence → state distributions → observation sequence

• State segmentation: estimate the state sequence given the observations
Estimating the State Sequence

• Many different state sequences are capable of producing the observation
• Solution: identify the most probable state sequence
  – The state sequence for which the probability of progressing through that sequence and generating the observation sequence is maximum
  – i.e. the sequence for which P(o1, o2, o3, ..., s1, s2, s3, ...) is maximum
Estimating the state sequence

• Once again, exhaustive evaluation is impossibly expensive
• But once again a simple dynamic-programming solution is available

P(o1, o2, o3, ..., s1, s2, s3, ...) = P(o1|s1) P(o2|s2) P(o3|s3) ... P(s1) P(s2|s1) P(s3|s2) ...

• Needed:  argmax over s1, s2, s3, ... of  P(o1|s1) P(s1) · P(o2|s2) P(s2|s1) · P(o3|s3) P(s3|s2) ...
The HMM as a generator

HMM assumed to be generating data: state sequence → state distributions → observation sequence

• Each enclosed term (a transition followed by an emission) in the product above represents one forward transition and a subsequent emission
The state sequence

• The probability of a state sequence ?,?,?,?,sx,sy ending at time t and producing all observations until ot
  – P(o1..t-1, ot, ?,?,?,?, sx, sy) = P(o1..t-1, ?,?,?,?, sx) P(ot|sy) P(sy|sx)
• The best state sequence that ends with sx, sy at t will have a probability equal to the probability of the best state sequence ending at t-1 at sx, times P(ot|sy) P(sy|sx)
Extending the state sequence

[Figure: the generator diagram with states sx and sy marked at time t — state sequence, state distributions, observation sequence]

• The probability of a state sequence ?,?,?,?,sx,sy ending at time t and producing observations until ot
  – P(o1..t-1, ot, ?,?,?,?, sx, sy) = P(o1..t-1, ?,?,?,?, sx) P(ot|sy) P(sy|sx)
Trellis

• The graph below shows the set of all possible state sequences through this HMM in five time instants
The cost of extending a state sequence

• The cost of extending a state sequence ending at sx is only dependent on the transition from sx to sy, and the observation probability at sy:

  P(ot|sy) P(sy|sx)
The cost of extending a state sequence

• The best path to sy through sx is simply an extension of the best path to sx:

  BestP(o1..t-1, ?,?,?,?, sx) P(ot|sy) P(sy|sx)
The Recursion

• The overall best path to sy is an extension of the best path to one of the states at the previous time
The Recursion

Prob. of best path to sy = max over sx of [ BestP(o1..t-1, ?,?,?,?, sx) P(ot|sy) P(sy|sx) ]
Finding the best state sequence

• The simple algorithm just presented is called the VITERBI algorithm in the literature
  – After A. J. Viterbi, who invented this dynamic programming algorithm for a completely different purpose: decoding error-correction codes!
Viterbi Search (contd.)

• The initial state is initialized with path-score P1(1) = P(s1) b1(1)
• In this example all other states have score 0, since P(si) = 0 for them
Viterbi Search (contd.)

[Figure legend: state with best path-score; state with path-score < best; state without a valid path-score]

Pj(t) = max over i of [ Pi(t-1) tij bj(t) ]

• tij: state transition probability, i to j
• bj(t): score for state j, given the input at time t
• Pj(t): total path-score ending up at state j at time t
Viterbi Search (contd.)

Pj(t) = max over i of [ Pi(t-1) tij bj(t) ]

• tij: state transition probability, i to j
• bj(t): score for state j, given the input at time t
• Pj(t): total path-score ending up at state j at time t
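A minimal sketch of this recursion in code, with backpointers so the best state sequence can be read out at the end. The pi/T/b interface and the probability-space (rather than log-space) formulation are illustrative assumptions; a practical implementation would usually work with log probabilities.

```python
import numpy as np

def viterbi(pi, T, b, X):
    """Viterbi algorithm: best state sequence for observations X.

    pi : initial state probabilities, shape (N,)
    T  : transition matrix, T[i, j] = P(s_j | s_i)
    b  : b(j, x) -> emission probability P(x | s_j)
    """
    N, T_len = len(pi), len(X)
    P = np.zeros((N, T_len))            # P[j, t] = best path-score ending at state j at time t
    back = np.zeros((N, T_len), dtype=int)

    # Initialization: P_j(1) = P(s_j) b_j(1)
    for j in range(N):
        P[j, 0] = pi[j] * b(j, X[0])

    # Recursion: P_j(t) = max_i [ P_i(t-1) t_ij ] b_j(t), remembering the best predecessor
    for t in range(1, T_len):
        for j in range(N):
            scores = P[:, t - 1] * T[:, j]
            back[j, t] = np.argmax(scores)
            P[j, t] = scores[back[j, t]] * b(j, X[t])

    # Traceback: best final state, then follow the backpointers
    path = [int(np.argmax(P[:, -1]))]
    for t in range(T_len - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return list(reversed(path))
```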
Viterbi Search (contd.)

[Figure-only slides: the trellis is filled in column by column over time, extending the best partial path into each state at each time step]
Viterbi Search (contd.)

• The best state sequence is the estimate of the state sequence followed in generating the observation
Problem 3: Training HMM parameters

• We can compute the probability of an observation, and the best state sequence given an observation, using the HMM's parameters
• But where do the HMM parameters come from?
• They must be learned from a collection of observation sequences
Learning HMM parameters: Simple procedure – counting

• Given a set of training instances
• Iteratively:
  1. Initialize HMM parameters
  2. Segment all training instances
  3. Estimate transition probabilities and state output probability parameters by counting
Learning by counting example

• Explanation by example in the next few slides
• 2-state HMM, Gaussian PDF at each state, 3 observation sequences
• Example shows ONE iteration
  – How to count after state sequences are obtained
Example: Learning HMM Parameters

• We have an HMM with two states S1 and S2
• Observations are vectors xij
  – i-th sequence, j-th vector
• We are given the following three observation sequences
  – And have already estimated state sequences

Observation 1:
  Time:  1    2    3    4    5    6    7    8    9    10
  State: S1   S1   S2   S2   S2   S1   S1   S2   S1   S1
  Obs:   Xa1  Xa2  Xa3  Xa4  Xa5  Xa6  Xa7  Xa8  Xa9  Xa10

Observation 2:
  Time:  1    2    3    4    5    6    7    8    9
  State: S2   S2   S1   S1   S2   S2   S2   S2   S1
  Obs:   Xb1  Xb2  Xb3  Xb4  Xb5  Xb6  Xb7  Xb8  Xb9

Observation 3:
  Time:  1    2    3    4    5    6    7    8
  State: S1   S2   S1   S1   S1   S2   S2   S2
  Obs:   Xc1  Xc2  Xc3  Xc4  Xc5  Xc6  Xc7  Xc8
Example: Learning HMM Parameters

• Initial state probabilities (usually denoted as π):
  – We have 3 observation sequences
  – 2 of these begin with S1, and one with S2
  – π(S1) = 2/3, π(S2) = 1/3
Example: Learning HMM Parameters

• Transition probabilities out of S1 (counted from the three state sequences above):
  – State S1 occurs 11 times in non-terminal locations
  – Of these, it is followed immediately by S1 6 times
  – It is followed immediately by S2 5 times
  – P(S1|S1) = 6/11; P(S2|S1) = 5/11
Example: Learning HMM Parameters

• Transition probabilities out of S2:
  – State S2 occurs 13 times in non-terminal locations
  – Of these, it is followed immediately by S1 5 times
  – It is followed immediately by S2 8 times
  – P(S1|S2) = 5/13; P(S2|S2) = 8/13
Parameters learnt so far

• State initial probabilities, often denoted as π
  – π(S1) = 2/3 = 0.66
  – π(S2) = 1/3 = 0.33
• State transition probabilities
  – P(S1|S1) = 6/11 = 0.545; P(S2|S1) = 5/11 = 0.455
  – P(S1|S2) = 5/13 = 0.385; P(S2|S2) = 8/13 = 0.615
  – Represented as a transition matrix:

      A = [ P(S1|S1)  P(S2|S1) ]  =  [ 0.545  0.455 ]
          [ P(S1|S2)  P(S2|S2) ]     [ 0.385  0.615 ]

  – Each row of this matrix must sum to 1.0
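The counting above is mechanical enough to express directly in code. The sketch below (with made-up variable names) recomputes π and A from the three labeled state sequences of the example and should reproduce 2/3, 1/3 and the 6/11, 5/11, 5/13, 8/13 entries.

```python
import numpy as np

# The three estimated state sequences from the example (0 = S1, 1 = S2).
seqs = [
    [0, 0, 1, 1, 1, 0, 0, 1, 0, 0],   # observation 1
    [1, 1, 0, 0, 1, 1, 1, 1, 0],      # observation 2
    [0, 1, 0, 0, 0, 1, 1, 1],         # observation 3
]

# Initial state probabilities: fraction of sequences starting in each state.
pi = np.zeros(2)
for seq in seqs:
    pi[seq[0]] += 1
pi /= len(seqs)                       # -> [2/3, 1/3]

# Transition probabilities: count (s_i -> s_j) pairs over non-terminal positions.
counts = np.zeros((2, 2))
for seq in seqs:
    for s_from, s_to in zip(seq[:-1], seq[1:]):
        counts[s_from, s_to] += 1
A = counts / counts.sum(axis=1, keepdims=True)   # -> [[6/11, 5/11], [5/13, 8/13]]

print(pi)   # [0.667 0.333]
print(A)    # [[0.545 0.455] [0.385 0.615]]
```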
Example: Learning HMM Parameters

• State output probability for S1
  – There are 13 observations in S1
Example: Learning HMM Parameters

• State output probability for S1
  – There are 13 observations in S1
  – Segregate them out and count
• Compute parameters (mean and variance) of the Gaussian output density for state S1:

  P(X|S1) = 1 / sqrt((2π)^d |Θ1|) · exp( -0.5 (X - μ1)^T Θ1^(-1) (X - μ1) )

  μ1 = (1/13) (Xa1 + Xa2 + Xa6 + Xa7 + Xa9 + Xa10 + Xb3 + Xb4 + Xb9 + Xc1 + Xc3 + Xc4 + Xc5)

  Θ1 = (1/13) [ (Xa1 - μ1)(Xa1 - μ1)^T + (Xa2 - μ1)(Xa2 - μ1)^T + ... + (Xc5 - μ1)(Xc5 - μ1)^T ]
Example: Learning HMM Parameters

• State output probability for S2
  – There are 14 observations in S2
Example: Learning HMM Parameters

• State output probability for S2
  – There are 14 observations in S2
  – Segregate them out and count
• Compute parameters (mean and variance) of the Gaussian output density for state S2:

  P(X|S2) = 1 / sqrt((2π)^d |Θ2|) · exp( -0.5 (X - μ2)^T Θ2^(-1) (X - μ2) )

  μ2 = (1/14) (Xa3 + Xa4 + Xa5 + Xa8 + Xb1 + Xb2 + Xb5 + Xb6 + Xb7 + Xb8 + Xc2 + Xc6 + Xc7 + Xc8)

  Θ2 = (1/14) [ (Xa3 - μ2)(Xa3 - μ2)^T + ... + (Xc8 - μ2)(Xc8 - μ2)^T ]
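A minimal sketch of this per-state Gaussian estimation, assuming the observation vectors are stacked in a NumPy array and the state labels come from the segmentation; the function and variable names are illustrative.

```python
import numpy as np

def estimate_gaussian(X_all, labels, state):
    """Mean and covariance of the Gaussian output density for one state.

    X_all  : all observation vectors from all sequences, shape (num_vectors, d)
    labels : state label for each vector (e.g. 0 for S1, 1 for S2)
    state  : which state to estimate
    """
    X_s = X_all[labels == state]          # segregate the vectors assigned to this state
    mu = X_s.mean(axis=0)                 # mu_s = (1/N_s) * sum of vectors in state s
    diff = X_s - mu
    theta = diff.T @ diff / len(X_s)      # Theta_s = (1/N_s) * sum (X - mu)(X - mu)^T
    return mu, theta
```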
We have learnt all the HMM parameters

• State initial probabilities, often denoted as π
  – π(S1) = 2/3 = 0.66, π(S2) = 1/3 = 0.33
• State transition probabilities:

  A = [ 0.545  0.455 ]
      [ 0.385  0.615 ]

• State output probabilities:
  – For S1: P(X|S1) = 1 / sqrt((2π)^d |Θ1|) · exp( -0.5 (X - μ1)^T Θ1^(-1) (X - μ1) )
  – For S2: P(X|S2) = 1 / sqrt((2π)^d |Θ2|) · exp( -0.5 (X - μ2)^T Θ2^(-1) (X - μ2) )
Update rules at each iteration

  π(si) = (no. of observation sequences that start at state si) / (total no. of observation sequences)

  P(sj|si) = Σ_{obs, t : state(t)=si & state(t+1)=sj} 1  /  Σ_{obs, t : state(t)=si} 1

  μi = Σ_{obs, t : state(t)=si} X_obs,t  /  Σ_{obs, t : state(t)=si} 1

  Θi = Σ_{obs, t : state(t)=si} (X_obs,t - μi)(X_obs,t - μi)^T  /  Σ_{obs, t : state(t)=si} 1

• Assumes state output PDF = Gaussian
  – For GMMs, estimate GMM parameters from the collection of observations at any state
Training by segmentation: Viterbi training

• Initialize all HMM parameters
• Segment all training observation sequences into states using the Viterbi algorithm with the current models
• Using the estimated state sequences and the training observation sequences, re-estimate the HMM parameters
• Repeat until the models have converged

This method is also called a "segmental k-means" learning procedure
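A sketch of this segmentation-and-counting loop, reusing the viterbi and estimate_gaussian sketches above; the fixed iteration count, data layout, and variable names are illustrative assumptions (a real implementation would test convergence of the total path score).

```python
import numpy as np
from scipy.stats import multivariate_normal

def viterbi_train(sequences, pi, T, means, covs, n_iters=10):
    """Segmental k-means / Viterbi training: alternate segmentation and counting.

    sequences   : list of observation sequences, each an array of shape (T_len, d)
    pi, T       : current initial and transition probabilities
    means, covs : current Gaussian parameters, one per state
    """
    n_states = len(pi)
    for _ in range(n_iters):
        # Emission function built from the current Gaussian parameters
        def b(j, x):
            return multivariate_normal.pdf(x, mean=means[j], cov=covs[j])

        # 1. Segment every training sequence with the current models
        paths = [viterbi(pi, T, b, X) for X in sequences]

        # 2. Re-estimate parameters by counting over the segmentations
        starts = np.zeros(n_states)
        trans = np.zeros((n_states, n_states))
        for path in paths:
            starts[path[0]] += 1
            for a, c in zip(path[:-1], path[1:]):
                trans[a, c] += 1
        pi = starts / starts.sum()
        T = trans / trans.sum(axis=1, keepdims=True)

        X_all = np.concatenate(sequences)
        labels = np.concatenate([np.asarray(p) for p in paths])
        for s in range(n_states):
            means[s], covs[s] = estimate_gaussian(X_all, labels, s)
    return pi, T, means, covs
```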
Alternative to counting: SOFT counting

• Expectation maximization
• Every observation contributes to every state
Update rules at each iteration

  π(si) = Σ_Obs P(state(1) = si | Obs)  /  (total no. of observation sequences)

  P(sj|si) = Σ_Obs Σ_t P(state(t) = si, state(t+1) = sj | Obs)  /  Σ_Obs Σ_t P(state(t) = si | Obs)

  μi = Σ_Obs Σ_t P(state(t) = si | Obs) X_Obs,t  /  Σ_Obs Σ_t P(state(t) = si | Obs)

  Θi = Σ_Obs Σ_t P(state(t) = si | Obs) (X_Obs,t - μi)(X_Obs,t - μi)^T  /  Σ_Obs Σ_t P(state(t) = si | Obs)

• Every observation contributes to every state
Update rules at each iteration

(Same soft-count update rules as on the previous slide.)

• Where did these terms come from?
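As a sketch of how these soft-count updates look in code, assuming the per-sequence posteriors gamma[t, i] = P(state(t) = si | Obs) have already been computed (e.g. from the forward-backward quantities described next); all names here are illustrative, and the transition update is omitted since it needs the pairwise posteriors P(state(t) = si, state(t+1) = sj | Obs).

```python
import numpy as np

def soft_count_updates(sequences, gammas, n_states):
    """EM-style parameter updates from state posteriors.

    sequences : list of observation arrays, each of shape (T_len, d)
    gammas    : list of posterior arrays, each of shape (T_len, n_states)
    """
    d = sequences[0].shape[1]

    # Initial state probabilities: average posterior of being in each state at t = 1
    pi = sum(g[0] for g in gammas) / len(sequences)

    # Weighted means and covariances: every observation contributes to every state
    means = np.zeros((n_states, d))
    covs = np.zeros((n_states, d, d))
    for i in range(n_states):
        denom = sum(g[:, i].sum() for g in gammas)
        num_mu = sum((g[:, i:i+1] * X).sum(axis=0) for X, g in zip(sequences, gammas))
        means[i] = num_mu / denom
        num_cov = sum(((X - means[i]).T * g[:, i]) @ (X - means[i])
                      for X, g in zip(sequences, gammas))
        covs[i] = num_cov / denom
    return pi, means, covs
```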
P(state(t) = s | Obs)

• The probability that the process was at state s when it generated xt, given the entire observation
• Dropping the "Obs" subscript for brevity:

  P(state(t) = s | x1, x2, ..., xT) = P(state(t) = s, x1, x2, ..., xT) / P(x1, x2, ..., xT)

• We will compute P(state(t) = s, x1, x2, ..., xT) first
  – This is the probability that the process visited s at time t while producing the entire observation
P(state(t) = s, x1, x2, ..., xT)

• The probability that the HMM was in a particular state s when generating the observation sequence is the probability that it followed a state sequence that passed through s at time t
P(state(t) = s, x1, x2, ..., xT)

• This can be decomposed into two multiplicative sections
  – The section of the lattice leading into state s at time t, and the section leading out of it
The Forward Paths

• The probability of the red section is the total probability of all state sequences ending at state s at time t
  – This is simply α(s, t)
  – Can be computed using the forward algorithm
The Backward Paths

• The blue portion represents the probability of all state sequences that began at state s at time t
  – Like the red portion, it can be computed using a backward recursion
The Backward Recursion

  β(s, t) = P(x_{t+1}, x_{t+2}, ..., x_T | state(t) = s)

  β(s, t) = Σ over s' of β(s', t+1) P(s'|s) P(x_{t+1}|s')

• Can be recursively estimated starting from the final time instant (backward recursion)
• β(s, t) is the total probability of ALL state sequences that depart from s at time t, and all observations after xt
  – β(s, T) = 1 at the final time instant for all valid final states
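A minimal sketch of the backward recursion, mirroring the forward sketch above (same assumed T/b interface; names are illustrative).

```python
import numpy as np

def backward(T, b, X, n_states):
    """Backward algorithm: beta[s, t] = P(x_{t+1}..x_T | state(t) = s)."""
    T_len = len(X)
    beta = np.zeros((n_states, T_len))

    # Initialization: beta(s, T) = 1 for all (valid final) states
    beta[:, -1] = 1.0

    # Recursion, run backwards in time:
    # beta(s, t) = sum_s' beta(s', t+1) P(s' | s) P(x_{t+1} | s')
    for t in range(T_len - 2, -1, -1):
        emit = np.array([b(sp, X[t + 1]) for sp in range(n_states)])
        for s in range(n_states):
            beta[s, t] = np.sum(beta[:, t + 1] * T[s, :] * emit)
    return beta
```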
The complete probability

  α(s, t) β(s, t) = P(x1, x2, ..., xT, state(t) = s)
Posterior probability of a state

• The probability that the process was in state s at time t, given that we have observed the data, is obtained by simple normalization:

  P(state(t) = s | Obs) = P(state(t) = s, x1, x2, ..., xT) / Σ_s' P(state(t) = s', x1, x2, ..., xT)
                        = α(s, t) β(s, t) / Σ_s' α(s', t) β(s', t)

• This term is often referred to as the gamma term and denoted by γ_{s,t}
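Combining the forward and backward sketches above, the gamma term can be computed per time step as below (a sketch; alpha and beta are assumed to come from the forward and backward functions sketched earlier).

```python
import numpy as np

def state_posteriors(alpha, beta):
    """gamma[s, t] = P(state(t) = s | Obs) = alpha(s,t) beta(s,t) / sum_s' alpha(s',t) beta(s',t)."""
    joint = alpha * beta                          # P(x_1..x_T, state(t) = s), per state and time
    return joint / joint.sum(axis=0, keepdims=True)
```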