Learning for Hidden Markov Models & Course Recap

Michael Gutmann
Probabilistic Modelling and Reasoning (INFR11134)
School of Informatics, University of Edinburgh
Spring semester 2018
Recap

◮ We can decompose the log marginal of any joint distribution into a sum of two terms:
  ◮ the free energy, and
  ◮ the KL divergence between the variational and the conditional distribution.
◮ Variational principle: maximising the free energy with respect to the variational distribution allows us to (approximately) compute the (log) marginal and the conditional from the joint.
◮ We applied the variational principle to inference and learning problems.
◮ For parameter estimation in the presence of unobserved variables: coordinate ascent on the free energy leads to the (variational) EM algorithm.
Program

1. EM algorithm to learn the parameters of HMMs
2. Course recap
Program

1. EM algorithm to learn the parameters of HMMs
   ◮ Problem statement
   ◮ Learning by gradient ascent on the log-likelihood or by EM
   ◮ EM update equations
2. Course recap
Hidden Markov model

Specified by
◮ DAG (representing the independence assumptions):

  [Figure: chain of hidden states $h_1 \to h_2 \to h_3 \to h_4$, with an emission edge $h_i \to v_i$ at each step]

◮ Transition distribution $p(h_i \mid h_{i-1})$
◮ Emission distribution $p(v_i \mid h_i)$
◮ Initial state distribution $p(h_1)$
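As a concrete illustration, here is a minimal NumPy sketch of the three distributions represented as tables. The state-space sizes (K hidden states, M visible symbols) and the random initialisation are assumptions for illustration only; the column convention anticipates the parametrisation made explicit on the learning-problem slide below.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 3, 5  # assumed numbers of hidden states and visible symbols

# Transition table: column k' holds the distribution p(h_i = . | h_{i-1} = k')
A = rng.random((K, K))
A /= A.sum(axis=0, keepdims=True)

# Emission table: column k holds the distribution p(v_i = . | h_i = k)
B = rng.random((M, K))
B /= B.sum(axis=0, keepdims=True)

# Initial state distribution p(h_1 = k)
a = rng.random(K)
a /= a.sum()
```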
The classical inference problems

◮ Classical inference problems:
  ◮ Filtering: $p(h_t \mid v_{1:t})$
  ◮ Smoothing: $p(h_t \mid v_{1:u})$ where $t < u$
  ◮ Prediction: $p(h_t \mid v_{1:u})$ and/or $p(v_t \mid v_{1:u})$ where $t > u$
  ◮ Most likely hidden path (Viterbi alignment): $\operatorname{argmax}_{h_{1:t}} p(h_{1:t} \mid v_{1:t})$
◮ Inference problems can be solved by message passing.
◮ Requires that the transition, emission, and initial state distributions are known.
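For instance, filtering has the familiar forward (alpha) recursion. Below is a sketch assuming the table representation from the previous snippet (A[k, k'] = p(h_i = k | h_{i-1} = k'), B[m, k] = p(v_i = m | h_i = k)); normalising at every step both yields the filtered posterior directly and avoids numerical underflow.

```python
import numpy as np

def filtering(v, A, B, a):
    """Filtered posteriors p(h_t | v_{1:t}) for a discrete HMM.

    v : sequence of observed symbol indices.
    Returns `post` with post[t] = p(h_{t+1} = . | v_{1:t+1}).
    """
    d, K = len(v), len(a)
    post = np.zeros((d, K))
    post[0] = B[v[0]] * a                      # p(v_1 | h_1) p(h_1)
    post[0] /= post[0].sum()
    for t in range(1, d):
        post[t] = B[v[t]] * (A @ post[t - 1])  # predict with A, correct with B
        post[t] /= post[t].sum()               # normalise -> filtered posterior
    return post
```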
Learning problem

◮ Data: $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_n\}$, where each $\mathcal{D}_j$ is a sequence of visibles of length $d$, i.e. $\mathcal{D}_j = (v_1^{(j)}, \dots, v_d^{(j)})$
◮ Assumptions:
  ◮ All variables are discrete: $h_i \in \{1, \dots, K\}$, $v_i \in \{1, \dots, M\}$.
  ◮ Stationarity
◮ Parametrisation:
  ◮ Transition distribution is parametrised by the matrix $\mathbf{A}$: $p(h_i = k \mid h_{i-1} = k'; \mathbf{A}) = A_{k,k'}$
  ◮ Emission distribution is parametrised by the matrix $\mathbf{B}$: $p(v_i = m \mid h_i = k; \mathbf{B}) = B_{m,k}$
  ◮ Initial state distribution is parametrised by the vector $\mathbf{a}$: $p(h_1 = k; \mathbf{a}) = a_k$
◮ Task: use the data $\mathcal{D}$ to learn $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{a}$
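For intuition, data of the form assumed above can be generated by ancestral sampling from the model. This is a sketch, not part of the learning algorithm itself; the sequence length d and the tables A, B, a from the earlier snippet are assumed.

```python
import numpy as np

def sample_sequence(d, A, B, a, rng):
    """Ancestral sampling of one visible sequence (v_1, ..., v_d) from the HMM."""
    K, M = A.shape[0], B.shape[0]
    v = np.zeros(d, dtype=int)
    h = rng.choice(K, p=a)            # h_1 ~ p(h_1; a)
    v[0] = rng.choice(M, p=B[:, h])   # v_1 ~ p(v_1 | h_1; B)
    for i in range(1, d):
        h = rng.choice(K, p=A[:, h])  # h_i ~ p(h_i | h_{i-1}; A)
        v[i] = rng.choice(M, p=B[:, h])
    return v

# Hypothetical usage, with A, B, a from the earlier sketch:
# rng = np.random.default_rng(1)
# data = [sample_sequence(10, A, B, a, rng) for _ in range(100)]  # n = 100 sequences
```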
Learning problem

◮ Since $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{a}$ represent (conditional) distributions, the parameters are constrained to be non-negative and to satisfy
$$\sum_{k=1}^K p(h_i = k \mid h_{i-1} = k') = \sum_{k=1}^K A_{k,k'} = 1$$
$$\sum_{m=1}^M p(v_i = m \mid h_i = k) = \sum_{m=1}^M B_{m,k} = 1$$
$$\sum_{k=1}^K p(h_1 = k) = \sum_{k=1}^K a_k = 1$$
◮ Note: much of what follows holds more generally for HMMs and does not use the stationarity assumption or that the $h_i$ and $v_i$ are discrete random variables.
◮ The parameters together will be denoted by $\theta$.
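These simplex constraints are easy to check numerically for the table representation used in the sketches above; a small helper of this kind is useful when debugging the update equations later.

```python
import numpy as np

def check_simplex_constraints(A, B, a, tol=1e-10):
    """Verify non-negativity and the column-wise sum-to-one constraints."""
    assert (A >= 0).all() and (B >= 0).all() and (a >= 0).all()
    assert np.allclose(A.sum(axis=0), 1.0, atol=tol)  # sum_k A_{k,k'} = 1
    assert np.allclose(B.sum(axis=0), 1.0, atol=tol)  # sum_m B_{m,k} = 1
    assert np.isclose(a.sum(), 1.0, atol=tol)         # sum_k a_k = 1
```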
Options for learning the parameters

◮ The model $p(h, v; \theta)$ is normalised but we have unobserved variables.
◮ Option 1: simple gradient ascent on the log-likelihood
$$\theta_{\text{new}} = \theta_{\text{old}} + \epsilon \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} \left[ \nabla_\theta \log p(h, \mathcal{D}_j; \theta) \right] \Big|_{\theta_{\text{old}}}$$
(see slides Intractable Likelihood Functions)
◮ Option 2: EM algorithm
$$\theta_{\text{new}} = \operatorname{argmax}_\theta \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} \left[ \log p(h, \mathcal{D}_j; \theta) \right]$$
(see slides Variational Inference and Learning)
◮ For HMMs, both are possible thanks to sum-product message passing.
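The shared structure of the two options can be sketched as follows. Note that `e_step`, `m_step`, and `expected_grad` are hypothetical placeholders for routines developed on the remaining slides; the snippet only illustrates the control flow.

```python
def em(data, theta, n_iter):
    """Option 2: alternate posterior computation and the argmax ('M') step."""
    for _ in range(n_iter):
        posteriors = [e_step(Dj, theta) for Dj in data]  # p(h | D_j; theta_old)
        theta = m_step(data, posteriors)                 # argmax_theta J(theta, theta_old)
    return theta

def gradient_ascent(data, theta, n_iter, eps):
    """Option 1: one gradient step per posterior computation."""
    for _ in range(n_iter):
        posteriors = [e_step(Dj, theta) for Dj in data]  # recomputed after every step
        grad = sum(expected_grad(Dj, post, theta)
                   for Dj, post in zip(data, posteriors))
        theta = theta + eps * grad
    return theta
```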
Options for learning the parameters

Option 1: $\theta_{\text{new}} = \theta_{\text{old}} + \epsilon \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} \left[ \nabla_\theta \log p(h, \mathcal{D}_j; \theta) \right] \big|_{\theta_{\text{old}}}$
Option 2: $\theta_{\text{new}} = \operatorname{argmax}_\theta \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h, \mathcal{D}_j; \theta)]$

◮ Similarities:
  ◮ Both require computation of the posterior expectation.
  ◮ Assume the "M" step is performed by gradient ascent,
$$\theta' = \theta + \epsilon \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} \left[ \nabla_\theta \log p(h, \mathcal{D}_j; \theta) \right] \Big|_{\theta}$$
  where $\theta$ is initialised with $\theta_{\text{old}}$, and the final $\theta'$ gives $\theta_{\text{new}}$. If only one gradient step is taken, option 2 becomes option 1.
◮ Differences:
  ◮ Unlike option 2, option 1 requires re-computation of the posterior after each $\epsilon$-update of $\theta$, which may be costly.
  ◮ In some cases (including HMMs), the "M"/argmax step can be performed analytically in closed form.
Expected complete data log-likelihood

◮ Denote the objective in the EM algorithm by $J(\theta, \theta_{\text{old}})$,
$$J(\theta, \theta_{\text{old}}) = \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h, \mathcal{D}_j; \theta)]$$
◮ We show on the next slide that, for the HMM model, the full posteriors $p(h \mid \mathcal{D}_j; \theta_{\text{old}})$ are not needed, but only the pairwise and singleton marginals
$$p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta_{\text{old}}) \qquad p(h_i \mid \mathcal{D}_j; \theta_{\text{old}}).$$
They can be obtained by the alpha-beta recursion (sum-product algorithm).
◮ Posteriors need to be computed for each observed sequence $\mathcal{D}_j$, and need to be re-computed after updating $\theta$.
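Below is a sketch of the alpha-beta recursion producing exactly these two posteriors, in the scaled form that normalises the forward messages for numerical stability; the table conventions are those of the earlier snippets.

```python
import numpy as np

def alpha_beta(v, A, B, a):
    """Singleton and pairwise posteriors for one observed sequence v = D_j.

    Returns gamma[t, k]  = p(h_{t+1} = k | v; theta) and
            xi[t, k, kp] = p(h_{t+2} = k, h_{t+1} = kp | v; theta).
    """
    d, K = len(v), len(a)
    alpha = np.zeros((d, K))
    c = np.zeros(d)                          # per-step normalisers
    alpha[0] = B[v[0]] * a
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, d):                    # forward (alpha) pass
        alpha[t] = B[v[t]] * (A @ alpha[t - 1])
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    beta = np.ones((d, K))
    for t in range(d - 2, -1, -1):           # backward (beta) pass
        beta[t] = A.T @ (B[v[t + 1]] * beta[t + 1]) / c[t + 1]

    gamma = alpha * beta                     # singleton marginal posteriors
    xi = np.zeros((d - 1, K, K))
    for t in range(d - 1):                   # pairwise posteriors of neighbours
        xi[t] = np.outer(B[v[t + 1]] * beta[t + 1], alpha[t]) * A / c[t + 1]
    return gamma, xi
```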
Expected complete data log-likelihood

◮ The HMM model factorises as
$$p(h, v; \theta) = p(h_1; \mathbf{a})\, p(v_1 \mid h_1; \mathbf{B}) \prod_{i=2}^d p(h_i \mid h_{i-1}; \mathbf{A})\, p(v_i \mid h_i; \mathbf{B})$$
◮ For sequence $\mathcal{D}_j$, we have
$$\log p(h, \mathcal{D}_j; \theta) = \log p(h_1; \mathbf{a}) + \log p(v_1^{(j)} \mid h_1; \mathbf{B}) + \sum_{i=2}^d \left[ \log p(h_i \mid h_{i-1}; \mathbf{A}) + \log p(v_i^{(j)} \mid h_i; \mathbf{B}) \right]$$
◮ Since
$$\mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_1; \mathbf{a})] = \mathbb{E}_{p(h_1 \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_1; \mathbf{a})]$$
$$\mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_i \mid h_{i-1}; \mathbf{A})] = \mathbb{E}_{p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_i \mid h_{i-1}; \mathbf{A})]$$
$$\mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} \left[\log p(v_i^{(j)} \mid h_i; \mathbf{B})\right] = \mathbb{E}_{p(h_i \mid \mathcal{D}_j; \theta_{\text{old}})} \left[\log p(v_i^{(j)} \mid h_i; \mathbf{B})\right]$$
we do not need the full posterior but only the marginal posteriors and the joint of the neighbouring variables.
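The factorised log of the joint, i.e. the "complete data" log-likelihood before taking expectations, translates directly into code; a small sketch under the same table conventions as before.

```python
import numpy as np

def complete_data_ll(h, v, a, A, B):
    """log p(h, v; theta) for a fully observed hidden/visible sequence pair."""
    ll = np.log(a[h[0]]) + np.log(B[v[0], h[0]])       # initial state and first emission
    for i in range(1, len(v)):
        ll += np.log(A[h[i], h[i - 1]])                # transition h_{i-1} -> h_i
        ll += np.log(B[v[i], h[i]])                    # emission h_i -> v_i
    return ll
```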
Expected complete data log-likelihood

With the factorisation (independencies) in the HMM model, the objective function thus becomes
$$J(\theta, \theta_{\text{old}}) = \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h, \mathcal{D}_j; \theta)]$$
$$= \sum_{j=1}^n \mathbb{E}_{p(h_1 \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_1; \mathbf{a})] + \sum_{j=1}^n \sum_{i=2}^d \mathbb{E}_{p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_i \mid h_{i-1}; \mathbf{A})] + \sum_{j=1}^n \sum_{i=1}^d \mathbb{E}_{p(h_i \mid \mathcal{D}_j; \theta_{\text{old}})} \left[\log p(v_i^{(j)} \mid h_i; \mathbf{B})\right]$$
In the derivation so far we have not yet used the assumed parametrisation of the model. We insert these assumptions next.
The term for the initial state distribution

◮ We have assumed that
$$p(h_1 = k; \mathbf{a}) = a_k, \qquad k = 1, \dots, K$$
which we can write as
$$p(h_1; \mathbf{a}) = \prod_k a_k^{\mathbb{1}(h_1 = k)}$$
(like for the Bernoulli model, see slides Basics of Model-Based Learning and Tutorial 7)
◮ The log pmf is thus
$$\log p(h_1; \mathbf{a}) = \sum_k \mathbb{1}(h_1 = k) \log a_k$$
◮ Hence
$$\mathbb{E}_{p(h_1 \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_1; \mathbf{a})] = \sum_k \mathbb{E}_{p(h_1 \mid \mathcal{D}_j; \theta_{\text{old}})} [\mathbb{1}(h_1 = k)] \log a_k = \sum_k p(h_1 = k \mid \mathcal{D}_j; \theta_{\text{old}}) \log a_k$$
The term for the transition distribution

◮ We have assumed that
$$p(h_i = k \mid h_{i-1} = k'; \mathbf{A}) = A_{k,k'}, \qquad k, k' = 1, \dots, K$$
which we can write as
$$p(h_i \mid h_{i-1}; \mathbf{A}) = \prod_{k,k'} A_{k,k'}^{\mathbb{1}(h_i = k,\, h_{i-1} = k')}$$
(see slides Basics of Model-Based Learning and Tutorial 7)
◮ Further:
$$\log p(h_i \mid h_{i-1}; \mathbf{A}) = \sum_{k,k'} \mathbb{1}(h_i = k, h_{i-1} = k') \log A_{k,k'}$$
◮ Hence $\mathbb{E}_{p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_i \mid h_{i-1}; \mathbf{A})]$ equals
$$\sum_{k,k'} \mathbb{E}_{p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta_{\text{old}})} \left[\mathbb{1}(h_i = k, h_{i-1} = k')\right] \log A_{k,k'} = \sum_{k,k'} p(h_i = k, h_{i-1} = k' \mid \mathcal{D}_j; \theta_{\text{old}}) \log A_{k,k'}$$
The term for the emission distribution

We can do the same for the emission distribution. With
$$p(v_i \mid h_i; \mathbf{B}) = \prod_{m,k} B_{m,k}^{\mathbb{1}(v_i = m,\, h_i = k)} = \prod_{m,k} B_{m,k}^{\mathbb{1}(v_i = m)\, \mathbb{1}(h_i = k)}$$
we have
$$\mathbb{E}_{p(h_i \mid \mathcal{D}_j; \theta_{\text{old}})} \left[\log p(v_i^{(j)} \mid h_i; \mathbf{B})\right] = \sum_{m,k} \mathbb{1}(v_i^{(j)} = m)\, p(h_i = k \mid \mathcal{D}_j; \theta_{\text{old}}) \log B_{m,k}$$
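With the three terms now explicit, the per-sequence contribution to J can be computed directly from the gamma and xi arrays returned by the alpha_beta sketch earlier. Note how the emission term only ever looks up the observed row $v_i$ of B, which is exactly the effect of the indicator $\mathbb{1}(v_i = m)$.

```python
import numpy as np

def per_sequence_J(v, gamma, xi, a, A, B):
    """Contribution of one sequence D_j to J(theta, theta_old).

    gamma, xi: posteriors under theta_old, as returned by alpha_beta above.
    """
    initial = gamma[0] @ np.log(a)           # sum_k p(h_1 = k | D_j) log a_k
    transition = np.sum(xi * np.log(A))      # sums over i = 2..d and k, k'
    emission = np.sum(gamma * np.log(B[v]))  # B[v][i, k] = B_{v_i, k}
    return initial + transition + emission
```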
E-step for discrete-valued HMM

◮ Putting it all together, we obtain the expected complete data log-likelihood for the HMM with discrete visibles and hiddens:
$$J(\theta, \theta_{\text{old}}) = \sum_{j=1}^n \sum_k p(h_1 = k \mid \mathcal{D}_j; \theta_{\text{old}}) \log a_k + \sum_{j=1}^n \sum_{i=2}^d \sum_{k,k'} p(h_i = k, h_{i-1} = k' \mid \mathcal{D}_j; \theta_{\text{old}}) \log A_{k,k'} + \sum_{j=1}^n \sum_{i=1}^d \sum_{m,k} \mathbb{1}(v_i^{(j)} = m)\, p(h_i = k \mid \mathcal{D}_j; \theta_{\text{old}}) \log B_{m,k}$$
◮ The objectives for $\mathbf{a}$, and for the columns of $\mathbf{A}$ and $\mathbf{B}$, decouple.
◮ The optimisation does not completely decouple because of the constraint that the elements of $\mathbf{a}$ have to sum to one, and that the columns of $\mathbf{A}$ and $\mathbf{B}$ have to sum to one.
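For completeness, a hedged sketch of the closed-form argmax that the earlier slide alluded to: maximising each decoupled term under its sum-to-one constraint (e.g. via Lagrange multipliers) gives normalised expected counts, the familiar Baum-Welch M-step. These update formulas are standard but not derived on these slides, so treat the snippet as an assumption-labelled outline rather than the course's derivation.

```python
import numpy as np

def m_step(data, posteriors, K, M):
    """Closed-form maximiser of J under the simplex constraints (standard
    Baum-Welch updates, stated here without derivation).

    data:       list of observed sequences v (symbol indices)
    posteriors: list of (gamma, xi) pairs from alpha_beta under theta_old
    """
    a_new = np.zeros(K)
    A_new = np.zeros((K, K))
    B_new = np.zeros((M, K))
    for v, (gamma, xi) in zip(data, posteriors):
        a_new += gamma[0]              # expected initial-state counts
        A_new += xi.sum(axis=0)        # expected transition counts
        for i, m in enumerate(v):      # expected emission counts
            B_new[m] += gamma[i]
    # Normalise: a sums to one, columns of A and B sum to one
    a_new /= a_new.sum()
    A_new /= A_new.sum(axis=0, keepdims=True)
    B_new /= B_new.sum(axis=0, keepdims=True)
    return a_new, A_new, B_new
```

One EM iteration then consists of running alpha_beta on every sequence (E-step) followed by m_step (M-step), repeated until J or the log-likelihood stops improving.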