COMS 4721: Machine Learning for Data Science. Lecture 21, 4/13/2017. Prof. John Paisley, Department of Electrical Engineering & Data Science Institute, Columbia University.
HIDDEN MARKOV MODELS
OVERVIEW

Motivation
We have seen how Markov models can model sequential data.
◮ We assumed the observation was the sequence of states.
◮ Instead, each state may define a distribution on observations.

Hidden Markov model
A hidden Markov model treats a sequence of data slightly differently.
◮ Assume a hidden (i.e., unobserved, latent) sequence of states.
◮ An observation is drawn from the distribution associated with its state.

[Figure: (left) a Markov model — the observed sequence s_1, s_2, ..., s_n is itself the state sequence; (right) a hidden Markov model — each observation x_i is generated from its hidden state s_i.]
MARKOV TO HIDDEN MARKOV MODELS

Markov models
Imagine we have three possible states in R^2. The data is a sequence of these positions. Since there are only three unique positions, we can give an index in place of coordinates. For example, the sequence (1, 2, 1, 3, 2, ...) would map to a sequence of 2-D vectors.

Using the notation of the figure, A is a 3 × 3 transition matrix. A_{ij} is the probability of transitioning from state i to state j.

[Figure: three states k = 1, 2, 3 with arrows labeled by the transition probabilities A_{11}, ..., A_{33}.]
MARKOV TO HIDDEN MARKOV MODELS

Hidden Markov models
Now imagine the same three states, but each time the coordinates are randomly perturbed. The state sequence is still a set of indexes, e.g., (1, 2, 1, 3, 2, ...) of positions in R^2. However, if μ_1 is the position of state #1, then we observe x_i = μ_1 + ε_i if s_i = 1.

Exactly as before, we have a state transition matrix A (in this case 3 × 3). However, the observed data is a sequence (x_1, x_2, x_3, ...) where each x ∈ R^2 is a random perturbation of the state it's assigned to, {μ_1, μ_2, μ_3}.

[Figure: the three state positions k = 1, 2, 3 in the unit square, with observations scattered around each.]
MARKOV TO HIDDEN MARKOV MODELS

A continuous hidden Markov model
This HMM is continuous because each x in the sequence (x_1, ..., x_T) lies in R^2.

[Figure: (left) A Markov state transition distribution for an unobserved sequence. (middle) The state-dependent distributions used to generate observations. (right) The data sequence; colors indicate the distribution (state) used.]
HIDDEN MARKOV MODELS

Definition
A hidden Markov model (HMM) consists of:
◮ An S × S Markov transition matrix A for transitioning between S states.
◮ An initial state distribution π for selecting the first state.
◮ A state-dependent emission distribution, Prob(x_i | s_i = k) = p(x_i | θ_k).

The model generates a sequence (x_1, x_2, x_3, ...) by:
1. Sampling the first state s_1 ∼ Discrete(π) and x_1 ∼ p(x | θ_{s_1}).
2. Sampling the Markov chain of states, s_i | {s_{i-1} = k} ∼ Discrete(A_{k,:}), followed by the observation x_i | s_i ∼ p(x | θ_{s_i}).

Continuous HMM: p(x | θ_s) is a continuous distribution, often Gaussian.
Discrete HMM: p(x | θ_s) is a discrete distribution, θ_s a vector of probabilities.

We focus on the discrete case. Let B be a matrix, where B_{k,:} = θ_k (from above).
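The two-step generative procedure above can be sketched directly in code. This is a minimal NumPy sketch (the function name `sample_hmm` is our own, not from the lecture); it draws the first state from π, then alternates transition and emission draws:

```python
import numpy as np

def sample_hmm(pi, A, B, T, seed=None):
    """Generate (states, observations) of length T from a discrete HMM.

    pi : (S,)   initial state distribution
    A  : (S, S) transition matrix, A[j, k] = P(s_i = k | s_{i-1} = j)
    B  : (S, V) emission matrix,   B[k, v] = P(x_i = v | s_i = k)
    """
    rng = np.random.default_rng(seed)
    S, V = B.shape
    states = np.empty(T, dtype=int)
    obs = np.empty(T, dtype=int)
    # Step 1: first state from pi, first observation from its emission row.
    states[0] = rng.choice(S, p=pi)
    obs[0] = rng.choice(V, p=B[states[0]])
    # Step 2: run the Markov chain, emitting one observation per state.
    for i in range(1, T):
        states[i] = rng.choice(S, p=A[states[i - 1]])
        obs[i] = rng.choice(V, p=B[states[i]])
    return states, obs
```

Note the states themselves are returned here only for illustration; in the HMM setting only `obs` would be visible to the learner.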
EXAMPLE: DISHONEST CASINO

Problem
Here is an example of a discrete hidden Markov model.
◮ Consider two dice, one fair and one loaded.
◮ At each roll, we either keep the current die or switch to the other one.
◮ The observation is the sequence of numbers rolled.

The transition matrix is

    A = [ 0.95  0.05
          0.10  0.90 ]

The emission matrix is

    B = [ 1/6   1/6   1/6   1/6   1/6   1/6
          1/10  1/10  1/10  1/10  1/10  1/2  ]

Let π = [1/2, 1/2].
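The casino's parameters can be written down as arrays and sanity-checked; this small sketch (variable names are ours) also computes the expected value of one roll from each die, which makes the loaded die's bias toward sixes concrete:

```python
import numpy as np

# Dishonest casino parameters from the slide (state 0 = fair, state 1 = loaded).
A = np.array([[0.95, 0.05],
              [0.10, 0.90]])            # transition matrix
B = np.array([[1/6] * 6,                # fair die: uniform over faces 1..6
              [1/10] * 5 + [1/2]])      # loaded die: rolls a six half the time
pi = np.array([0.5, 0.5])               # initial state distribution

# Sanity checks: every row must be a probability distribution.
assert np.allclose(A.sum(axis=1), 1)
assert np.allclose(B.sum(axis=1), 1)
assert np.isclose(pi.sum(), 1)

# Expected value of a single roll under each die (faces are 1..6).
faces = np.arange(1, 7)
print(B @ faces)   # fair die: 3.5, loaded die: 4.5
```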
SOME ESTIMATION PROBLEMS

State estimation
◮ Given: An HMM {π, A, B} and observation sequence (x_1, ..., x_T)
◮ Estimate: State probability for x_i using the "forward-backward algorithm,"

    p(s_i = k | x_1, ..., x_T, π, A, B).

State sequence
◮ Given: An HMM {π, A, B} and observation sequence (x_1, ..., x_T)
◮ Estimate: Most probable state sequence using the "Viterbi algorithm,"

    s_1, ..., s_T = arg max_s p(s_1, ..., s_T | x_1, ..., x_T, π, A, B).

Learn an HMM
◮ Given: An observation sequence (x_1, ..., x_T)
◮ Estimate: HMM parameters π, A, B using maximum likelihood,

    π_ML, A_ML, B_ML = arg max_{π,A,B} p(x_1, ..., x_T | π, A, B).
EXAMPLES

Before we look at the details, here are examples for the dishonest casino.
◮ Not shown is that π, A, B were learned first in order to calculate this.
◮ Notice that the right plot isn't just a rounding of the left plot.

[Figure: two plots over 300 rolls. (left) State estimation ("filtered") result — blue curve shows p(s_i = loaded | x_{1:T}, π, A, B); gray bars mark where the loaded die was used. (right) State sequence (Viterbi) result — blue curve shows the most probable state sequence (0 = fair, 1 = loaded); gray bars mark where the loaded die was used.]
LEARNING THE HMM
LEARNING THE HMM: THE LIKELIHOOD

We focus on the discrete HMM. To learn the HMM parameters, maximize

    p(x | π, A, B) = Σ_{s_1=1}^{S} ··· Σ_{s_T=1}^{S} p(x, s_1, ..., s_T | π, A, B)
                   = Σ_{s_1=1}^{S} ··· Σ_{s_T=1}^{S} Π_{i=1}^{T} p(x_i | s_i, B) p(s_i | s_{i-1}, π, A)

◮ p(x_i | s_i, B) = B_{s_i, x_i} ← s_i indexes the distribution, x_i is the observation
◮ p(s_i | s_{i-1}, π, A) = A_{s_{i-1}, s_i} (or π_{s_1}) ← since s_1, ..., s_T is a Markov chain
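To see why this sum is problematic, note it runs over all S^T state paths. A brute-force sketch (our own illustration, feasible only for tiny T) makes the structure of the sum explicit:

```python
import numpy as np
from itertools import product

def likelihood_brute_force(pi, A, B, x):
    """p(x | pi, A, B) by summing over all S^T state paths. Tiny T only!"""
    S = len(pi)
    total = 0.0
    for s in product(range(S), repeat=len(x)):   # every possible state path
        p = pi[s[0]] * B[s[0], x[0]]             # initial state and emission
        for i in range(1, len(x)):
            p *= A[s[i - 1], s[i]] * B[s[i], x[i]]
        total += p
    return total
```

For T = 300 rolls of the casino this loop would visit 2^300 paths; the forward-backward algorithm computes the same quantity in O(S^2 T).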
LEARNING THE HMM: THE LOG LIKELIHOOD

◮ Maximizing p(x | π, A, B) is hard since the objective has log-sum form,

    ln p(x | π, A, B) = ln Σ_{s_1=1}^{S} ··· Σ_{s_T=1}^{S} Π_{i=1}^{T} p(x_i | s_i, B) p(s_i | s_{i-1}, π, A).

◮ However, if we had or learned s, it would be easy (remove the sums).
◮ In addition, we can calculate p(s | x, π, A, B), though it's much more complicated than in previous models.
◮ Therefore, we can use the EM algorithm! The following is high-level.
LEARNING THE HMM: THE LOG LIKELIHOOD

E-step: Using q(s) = p(s | x, π, A, B), calculate

    L(x, π, A, B) = E_q[ln p(x, s | π, A, B)].

M-step: Maximize L with respect to π, A, B.

This part is tricky since we need to take the expectation using q(s) of

    ln p(x, s | π, A, B) = Σ_{i=1}^{T} Σ_{k=1}^{S} 1(s_i = k) ln B_{k, x_i}        (observations)
                         + Σ_{k=1}^{S} 1(s_1 = k) ln π_k                           (initial state)
                         + Σ_{i=2}^{T} Σ_{j=1}^{S} Σ_{k=1}^{S} 1(s_{i-1} = j, s_i = k) ln A_{j,k}   (Markov chain)

The following is an overview to help you better navigate the books/tutorials.¹

¹ See the classic tutorial: Rabiner, L.R. (1989). "A tutorial on hidden Markov models and selected applications in speech recognition." Proceedings of the IEEE 77(2), 257–285.
LEARNING THE HMM WITH EM

E-step
Let's define the following conditional posterior quantities:

    γ_i(k)    = the posterior probability that s_i = k
    ξ_i(j, k) = the posterior probability that s_{i-1} = j and s_i = k

Therefore, γ_i is a vector and ξ_i is a matrix, both varying over i. We can calculate both of these using the "forward-backward" algorithm. (We won't cover it in this class, but Rabiner's tutorial is good.)

Given these values the E-step is:

    L = Σ_{k=1}^{S} γ_1(k) ln π_k + Σ_{i=2}^{T} Σ_{j=1}^{S} Σ_{k=1}^{S} ξ_i(j, k) ln A_{j,k} + Σ_{i=1}^{T} Σ_{k=1}^{S} γ_i(k) ln B_{k, x_i}

This gives us everything we need to update π, A, B.
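Although the lecture defers the forward-backward algorithm to Rabiner's tutorial, a standard scaled implementation of it (a sketch, not the lecture's own code) computes exactly the γ and ξ quantities defined above:

```python
import numpy as np

def forward_backward(pi, A, B, x):
    """Posteriors gamma[i, k] = p(s_i = k | x) and
    xi[i, j, k] = p(s_{i-1} = j, s_i = k | x) for i >= 1, via scaled passes."""
    T, S = len(x), len(pi)
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    c = np.zeros(T)                      # per-step scaling constants
    # Forward pass: alpha[i] = p(s_i | x_1, ..., x_i) after normalization.
    alpha[0] = pi * B[:, x[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for i in range(1, T):
        alpha[i] = (alpha[i - 1] @ A) * B[:, x[i]]
        c[i] = alpha[i].sum(); alpha[i] /= c[i]
    # Backward pass, scaled with the same constants.
    beta[-1] = 1.0
    for i in range(T - 2, -1, -1):
        beta[i] = (A @ (B[:, x[i + 1]] * beta[i + 1])) / c[i + 1]
    gamma = alpha * beta
    xi = np.zeros((T, S, S))             # xi[0] is unused by convention
    for i in range(1, T):
        xi[i] = (alpha[i - 1][:, None] * A
                 * (B[:, x[i]] * beta[i])[None, :]) / c[i]
    return gamma, xi
```

The scaling constants c also give the observation log-likelihood for free, as Σ_i ln c_i.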
LEARNING THE HMM WITH EM

M-step
The updates for the HMM parameters are:

    π_k     = γ_1(k) / Σ_j γ_1(j),
    A_{j,k} = Σ_{i=2}^{T} ξ_i(j, k) / Σ_{i=2}^{T} Σ_{l=1}^{S} ξ_i(j, l),
    B_{k,v} = Σ_{i=1}^{T} γ_i(k) 1{x_i = v} / Σ_{i=1}^{T} γ_i(k).

The updates can be understood as follows:
◮ A_{j,k} is the expected fraction of transitions j → k when we start at j
  ◮ Numerator: Expected count of transitions j → k
  ◮ Denominator: Expected total number of transitions from j
◮ B_{k,v} is the expected fraction of data coming from state k and equal to v
  ◮ Numerator: Expected number of observations = v from state k
  ◮ Denominator: Expected total number of observations from state k
◮ π has an interpretation similar to A
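These three updates translate almost line for line into array operations. A single-sequence sketch (our own helper, assuming `gamma` has shape (T, S) and `xi` has shape (T, S, S) with `xi[0]` unused):

```python
import numpy as np

def m_step(gamma, xi, x, V):
    """Re-estimate (pi, A, B) from the posteriors of one sequence.

    gamma : (T, S)    gamma[i, k] = p(s_i = k | x)
    xi    : (T, S, S) xi[i, j, k] = p(s_{i-1} = j, s_i = k | x), i >= 1
    x     : length-T observation sequence with values in {0, ..., V-1}
    """
    pi = gamma[0] / gamma[0].sum()
    # Expected transition counts j -> k, normalized over k.
    A = xi[1:].sum(axis=0)
    A /= A.sum(axis=1, keepdims=True)
    # Expected emission counts: mass of state k at times where x_i = v.
    S = gamma.shape[1]
    B = np.zeros((S, V))
    for v in range(V):
        B[:, v] = gamma[np.asarray(x) == v].sum(axis=0)
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```

Alternating this with the E-step (forward-backward) is exactly the Baum-Welch form of EM for HMMs.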
LEARNING THE HMM WITH EM

M-step: N sequences
Usually we'll have multiple sequences that are modeled by an HMM. In this case, the updates for the HMM parameters with N sequences are:

    π_k     = Σ_{n=1}^{N} γ_1^n(k) / Σ_{n=1}^{N} Σ_j γ_1^n(j),
    A_{j,k} = Σ_{n=1}^{N} Σ_{i=2}^{T_n} ξ_i^n(j, k) / Σ_{n=1}^{N} Σ_{i=2}^{T_n} Σ_{l=1}^{S} ξ_i^n(j, l),
    B_{k,v} = Σ_{n=1}^{N} Σ_{i=1}^{T_n} γ_i^n(k) 1{x_i^n = v} / Σ_{n=1}^{N} Σ_{i=1}^{T_n} γ_i^n(k).

The modifications are:
◮ Each sequence can be of a different length, T_n
◮ Each sequence has its own set of γ and ξ values
◮ Using these, we sum over the sequences, with the interpretation the same.
APPLICATION: SPEECH RECOGNITION