Learning for Hidden Markov Models & Course Recap


  1. Learning for Hidden Markov Models & Course Recap

     Michael Gutmann
     Probabilistic Modelling and Reasoning (INFR11134)
     School of Informatics, University of Edinburgh
     Spring semester 2018

  2. Recap

     ◮ We can decompose the log marginal of any joint distribution into a sum of two terms:
       ◮ the free energy, and
       ◮ the KL divergence between the variational and the conditional distribution.
     ◮ Variational principle: maximising the free energy with respect to the variational distribution allows us to (approximately) compute the (log) marginal and the conditional from the joint.
     ◮ We applied the variational principle to inference and learning problems.
     ◮ For parameter estimation in the presence of unobserved variables: coordinate ascent on the free energy leads to the (variational) EM algorithm.

  3. Program

     1. EM algorithm to learn the parameters of HMMs
     2. Course recap

  4. Program

     1. EM algorithm to learn the parameters of HMMs
        Problem statement
        Learning by gradient ascent on the log-likelihood or by EM
        EM update equations
     2. Course recap

  5. Hidden Markov model

     Specified by
     ◮ DAG (representing the independence assumptions)

       [Figure: DAG of the HMM, a chain $h_1 \to h_2 \to h_3 \to h_4$ with an emission edge $h_i \to v_i$ for each $i$]

     ◮ Transition distribution $p(h_i \mid h_{i-1})$
     ◮ Emission distribution $p(v_i \mid h_i)$
     ◮ Initial state distribution $p(h_1)$
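
As a concrete illustration (not part of the slides), here is a minimal NumPy sketch of such a stationary discrete HMM, using the parametrisation introduced on slide 7 ($A_{k,k'}$, $B_{m,k}$, $a_k$); the names a, A, B, sample_sequence and the chosen sizes K, M are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, M = 3, 4                               # number of hidden states, number of visible symbols
a = np.full(K, 1.0 / K)                   # initial state distribution, p(h_1 = k; a) = a[k]
A = rng.dirichlet(np.ones(K), size=K).T   # A[k, k'] = p(h_i = k | h_{i-1} = k'); columns sum to 1
B = rng.dirichlet(np.ones(M), size=K).T   # B[m, k] = p(v_i = m | h_i = k); columns sum to 1

def sample_sequence(a, A, B, d):
    """Draw a hidden path h_1..h_d and visibles v_1..v_d from the HMM."""
    M, K = B.shape
    h = np.empty(d, dtype=int)
    v = np.empty(d, dtype=int)
    h[0] = rng.choice(K, p=a)
    v[0] = rng.choice(M, p=B[:, h[0]])
    for i in range(1, d):
        h[i] = rng.choice(K, p=A[:, h[i - 1]])   # transition
        v[i] = rng.choice(M, p=B[:, h[i]])       # emission
    return h, v

h, v = sample_sequence(a, A, B, d=10)
```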

  6. The classical inference problems

     ◮ Classical inference problems:
       ◮ Filtering: $p(h_t \mid v_{1:t})$
       ◮ Smoothing: $p(h_t \mid v_{1:u})$ where $t < u$
       ◮ Prediction: $p(h_t \mid v_{1:u})$ and/or $p(v_t \mid v_{1:u})$ where $t > u$
       ◮ Most likely hidden path (Viterbi alignment): $\mathrm{argmax}_{h_{1:t}}\, p(h_{1:t} \mid v_{1:t})$
     ◮ Inference problems can be solved by message passing.
     ◮ Requires that the transition, emission, and initial state distributions are known.
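
The smoothing posteriors used later in the lecture can be computed with the alpha-beta (forward-backward) recursion. The following is a minimal sketch, not the course's reference implementation, assuming the matrix convention of slide 7 (A[k, k'] = p(h_i = k | h_{i-1} = k'), B[m, k] = p(v_i = m | h_i = k)); messages are normalised at each step for numerical stability.

```python
import numpy as np

def forward_backward(v, a, A, B):
    """Alpha-beta recursion for a discrete HMM on one sequence of visibles v.

    Returns gamma[i, k] = p(h_{i+1} = k | v_{1:d}) (0-based index i) and
    xi[i, k, k'] = p(h_{i+2} = k, h_{i+1} = k' | v_{1:d}), i.e. the smoothing
    marginals and the pairwise posteriors of neighbouring hidden variables.
    """
    d, K = len(v), len(a)
    alpha = np.zeros((d, K))
    beta = np.ones((d, K))
    c = np.zeros(d)                       # per-step normalisers p(v_i | v_{1:i-1})

    alpha[0] = a * B[v[0]]                # proportional to p(h_1 | v_1)
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for i in range(1, d):                 # forward (alpha) pass
        alpha[i] = B[v[i]] * (A @ alpha[i - 1])
        c[i] = alpha[i].sum()
        alpha[i] /= c[i]

    for i in range(d - 2, -1, -1):        # backward (beta) pass
        beta[i] = A.T @ (B[v[i + 1]] * beta[i + 1]) / c[i + 1]

    gamma = alpha * beta                  # smoothing marginals
    xi = np.zeros((d - 1, K, K))
    for i in range(d - 1):                # pairwise posteriors of neighbours
        xi[i] = (B[v[i + 1], :, None] * beta[i + 1][:, None]) * A * alpha[i][None, :] / c[i + 1]
    return gamma, xi
```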

  7. Learning problem

     ◮ Data: $\mathcal{D} = \{\mathcal{D}_1, \ldots, \mathcal{D}_n\}$, where each $\mathcal{D}_j$ is a sequence of visibles of length $d$, i.e. $\mathcal{D}_j = (v_1^{(j)}, \ldots, v_d^{(j)})$
     ◮ Assumptions:
       ◮ All variables are discrete: $h_i \in \{1, \ldots, K\}$, $v_i \in \{1, \ldots, M\}$.
       ◮ Stationarity
     ◮ Parametrisation:
       ◮ Transition distribution is parametrised by the matrix $A$: $p(h_i = k \mid h_{i-1} = k'; A) = A_{k,k'}$
       ◮ Emission distribution is parametrised by the matrix $B$: $p(v_i = m \mid h_i = k; B) = B_{m,k}$
       ◮ Initial state distribution is parametrised by the vector $a$: $p(h_1 = k; a) = a_k$
     ◮ Task: Use the data $\mathcal{D}$ to learn $A$, $B$, and $a$
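
For a runnable toy setup, data of this form can be generated from the sampler sketched after slide 5; n, d and the name data are illustrative choices, and in practice data would instead hold the n observed sequences.

```python
# Hypothetical toy data set: n sequences of visibles, each of length d,
# drawn from the HMM (a, A, B) sketched earlier; only the visibles are kept.
n, d = 100, 20
data = [sample_sequence(a, A, B, d)[1] for _ in range(n)]
```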

  8. Learning problem

     ◮ Since $A$, $B$, and $a$ represent (conditional) distributions, the parameters are constrained to be non-negative and to satisfy

       $\sum_{k=1}^K p(h_i = k \mid h_{i-1} = k') = \sum_{k=1}^K A_{k,k'} = 1$
       $\sum_{m=1}^M p(v_i = m \mid h_i = k) = \sum_{m=1}^M B_{m,k} = 1$
       $\sum_{k=1}^K p(h_1 = k) = \sum_{k=1}^K a_k = 1$

     ◮ Note: Much of what follows holds more generally for HMMs and does not use the stationarity assumption or that the $h_i$ and $v_i$ are discrete random variables.
     ◮ The parameters together will be denoted by $\theta$.
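
In the column convention used here (the conditioning state indexes the column), these constraints say that A and B are column-stochastic and a is a probability vector; a quick sanity check against the sketch after slide 5:

```python
assert np.all(A >= 0) and np.allclose(A.sum(axis=0), 1.0)   # each column of A sums to 1 over k
assert np.all(B >= 0) and np.allclose(B.sum(axis=0), 1.0)   # each column of B sums to 1 over m
assert np.all(a >= 0) and np.isclose(a.sum(), 1.0)          # a sums to 1 over k
```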

  9. Options for learning the parameters

     ◮ The model $p(h, v; \theta)$ is normalised but we have unobserved variables.
     ◮ Option 1: Simple gradient ascent on the log-likelihood

       $\theta^{\text{new}} = \theta^{\text{old}} + \epsilon \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta^{\text{old}})}\!\left[ \nabla_\theta \log p(h, \mathcal{D}_j; \theta) \right] \Big|_{\theta^{\text{old}}}$

       (see slides Intractable Likelihood Functions)

     ◮ Option 2: EM algorithm

       $\theta^{\text{new}} = \mathrm{argmax}_\theta \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta^{\text{old}})}\!\left[ \log p(h, \mathcal{D}_j; \theta) \right]$

       (see slides Variational Inference and Learning)

     ◮ For HMMs, both are possible thanks to sum-product message passing.

  10. Options for learning the parameters

     Option 1: $\theta^{\text{new}} = \theta^{\text{old}} + \epsilon \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta^{\text{old}})}\!\left[ \nabla_\theta \log p(h, \mathcal{D}_j; \theta) \right] \Big|_{\theta^{\text{old}}}$
     Option 2: $\theta^{\text{new}} = \mathrm{argmax}_\theta \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta^{\text{old}})}\!\left[ \log p(h, \mathcal{D}_j; \theta) \right]$

     ◮ Similarities:
       ◮ Both require computation of the posterior expectation.
       ◮ Assume the "M" step is performed by gradient ascent,

         $\theta' = \theta + \epsilon \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta^{\text{old}})}\!\left[ \nabla_\theta \log p(h, \mathcal{D}_j; \theta) \right] \Big|_{\theta}$

         where $\theta$ is initialised with $\theta^{\text{old}}$, and the final $\theta'$ gives $\theta^{\text{new}}$. If only one gradient step is taken, option 2 becomes option 1.
     ◮ Differences:
       ◮ Unlike option 2, option 1 requires re-computation of the posterior after each $\epsilon$-update of $\theta$, which may be costly.
       ◮ In some cases (including HMMs), the "M"/argmax step can be performed analytically in closed form.
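
In code, option 2 is just an alternation of these two computations; a schematic sketch (the function names e_step and m_step are placeholders to be filled in for the HMM below):

```python
def em(data, theta_init, e_step, m_step, n_iter=50):
    """Generic EM skeleton (option 2): alternate between computing the required
    posteriors under the current parameters (E-step) and maximising the
    expected complete-data log-likelihood with respect to theta (M-step)."""
    theta = theta_init
    for _ in range(n_iter):
        posteriors = [e_step(v, theta) for v in data]   # posteriors p(h | D_j; theta_old)
        theta = m_step(data, posteriors)                # argmax_theta of J(theta, theta_old)
    return theta
```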

  11. Expected complete data log-likelihood

     ◮ Denote the objective in the EM algorithm by $J(\theta, \theta^{\text{old}})$,

       $J(\theta, \theta^{\text{old}}) = \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta^{\text{old}})}\!\left[ \log p(h, \mathcal{D}_j; \theta) \right]$

     ◮ We show on the next slide that, for the HMM model, the full posteriors $p(h \mid \mathcal{D}_j; \theta^{\text{old}})$ are not needed but only the pairwise posteriors $p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta^{\text{old}})$ and the marginal posteriors $p(h_i \mid \mathcal{D}_j; \theta^{\text{old}})$. They can be obtained by the alpha-beta recursion (sum-product algorithm).
     ◮ Posteriors need to be computed for each observed sequence $\mathcal{D}_j$, and need to be re-computed after updating $\theta$.
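
These are exactly the quantities returned by the forward_backward sketch given after slide 6; an illustrative call, where v_j, a_old, A_old, B_old are placeholder names for one observed sequence and the current parameter values:

```python
# gamma[i - 1, k]  corresponds to p(h_i = k | D_j; theta_old)                 (marginal posterior)
# xi[i - 2, k, k'] corresponds to p(h_i = k, h_{i-1} = k' | D_j; theta_old)   (pairwise, i >= 2)
gamma, xi = forward_backward(v_j, a_old, A_old, B_old)
```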

  12. Expected complete data log-likelihood

     ◮ The HMM model factorises as

       $p(h, v; \theta) = p(h_1; a)\, p(v_1 \mid h_1; B) \prod_{i=2}^d p(h_i \mid h_{i-1}; A)\, p(v_i \mid h_i; B)$

     ◮ For sequence $\mathcal{D}_j$, we have

       $\log p(h, \mathcal{D}_j; \theta) = \log p(h_1; a) + \log p(v_1^{(j)} \mid h_1; B) + \sum_{i=2}^d \left[ \log p(h_i \mid h_{i-1}; A) + \log p(v_i^{(j)} \mid h_i; B) \right]$

     ◮ Since

       $\mathbb{E}_{p(h \mid \mathcal{D}_j; \theta^{\text{old}})} [\log p(h_1; a)] = \mathbb{E}_{p(h_1 \mid \mathcal{D}_j; \theta^{\text{old}})} [\log p(h_1; a)]$
       $\mathbb{E}_{p(h \mid \mathcal{D}_j; \theta^{\text{old}})} [\log p(h_i \mid h_{i-1}; A)] = \mathbb{E}_{p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta^{\text{old}})} [\log p(h_i \mid h_{i-1}; A)]$
       $\mathbb{E}_{p(h \mid \mathcal{D}_j; \theta^{\text{old}})} [\log p(v_i^{(j)} \mid h_i; B)] = \mathbb{E}_{p(h_i \mid \mathcal{D}_j; \theta^{\text{old}})} [\log p(v_i^{(j)} \mid h_i; B)]$

       we do not need the full posterior but only the marginal posteriors and the joint of the neighbouring variables.
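
For a single sequence whose hidden path is also observed, this factorised log probability is straightforward to evaluate; a small sketch in the notation above (function name is illustrative):

```python
import numpy as np

def complete_data_loglik(h, v, a, A, B):
    """log p(h, v; theta) for one sequence, following the HMM factorisation:
    log p(h_1) + log p(v_1 | h_1) + sum_{i>=2} [log p(h_i | h_{i-1}) + log p(v_i | h_i)]."""
    ll = np.log(a[h[0]]) + np.log(B[v[0], h[0]])
    for i in range(1, len(v)):
        ll += np.log(A[h[i], h[i - 1]]) + np.log(B[v[i], h[i]])
    return ll
```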

  13. Expected complete data log-likelihood

     With the factorisation (independencies) in the HMM model, the objective function thus becomes

       $J(\theta, \theta^{\text{old}}) = \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta^{\text{old}})} [\log p(h, \mathcal{D}_j; \theta)]$
       $\quad = \sum_{j=1}^n \mathbb{E}_{p(h_1 \mid \mathcal{D}_j; \theta^{\text{old}})} [\log p(h_1; a)] + \sum_{j=1}^n \sum_{i=2}^d \mathbb{E}_{p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta^{\text{old}})} [\log p(h_i \mid h_{i-1}; A)] + \sum_{j=1}^n \sum_{i=1}^d \mathbb{E}_{p(h_i \mid \mathcal{D}_j; \theta^{\text{old}})} [\log p(v_i^{(j)} \mid h_i; B)]$

     In the derivation so far we have not yet used the assumed parametrisation of the model. We insert these assumptions next.
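
Using the smoothing marginals gamma and the pairwise posteriors xi from the forward_backward sketch above, this objective can be evaluated directly; a minimal sketch (theta and theta_old are passed here as (a, A, B) tuples, an illustrative convention):

```python
import numpy as np

def expected_complete_data_loglik(data, theta, theta_old):
    """J(theta, theta_old) for the discrete HMM, with posteriors computed under theta_old."""
    a, A, B = theta
    a_old, A_old, B_old = theta_old
    J = 0.0
    for v in data:
        gamma, xi = forward_backward(v, a_old, A_old, B_old)        # E-step posteriors
        J += gamma[0] @ np.log(a)                                    # initial-state term
        J += np.sum(xi * np.log(A)[None, :, :])                      # transition terms, i = 2..d
        J += sum(gamma[i] @ np.log(B[v[i]]) for i in range(len(v)))  # emission terms, i = 1..d
    return J
```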

  14. The term for the initial state distribution

     ◮ We have assumed that $p(h_1 = k; a) = a_k$ for $k = 1, \ldots, K$, which we can write as

       $p(h_1; a) = \prod_k a_k^{\mathbb{1}(h_1 = k)}$

       (like for the Bernoulli model, see slides Basics of Model-Based Learning and Tutorial 7)

     ◮ The log pmf is thus

       $\log p(h_1; a) = \sum_k \mathbb{1}(h_1 = k) \log a_k$

     ◮ Hence

       $\mathbb{E}_{p(h_1 \mid \mathcal{D}_j; \theta^{\text{old}})} [\log p(h_1; a)] = \sum_k \mathbb{E}_{p(h_1 \mid \mathcal{D}_j; \theta^{\text{old}})} [\mathbb{1}(h_1 = k)] \log a_k = \sum_k p(h_1 = k \mid \mathcal{D}_j; \theta^{\text{old}}) \log a_k$
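
Although the M-step itself is not spelled out on these slides, this term can be maximised in closed form under the constraint $\sum_k a_k = 1$ (the closed-form argmax mentioned on slide 10); a short worked sketch with a Lagrange multiplier $\lambda$:

  $\mathcal{L}(a, \lambda) = \sum_{j=1}^n \sum_k p(h_1 = k \mid \mathcal{D}_j; \theta^{\text{old}}) \log a_k + \lambda \Big( 1 - \sum_k a_k \Big)$
  $\frac{\partial \mathcal{L}}{\partial a_k} = \frac{1}{a_k} \sum_{j=1}^n p(h_1 = k \mid \mathcal{D}_j; \theta^{\text{old}}) - \lambda = 0 \;\Rightarrow\; a_k \propto \sum_{j=1}^n p(h_1 = k \mid \mathcal{D}_j; \theta^{\text{old}})$
  $\Rightarrow\; a_k^{\text{new}} = \frac{1}{n} \sum_{j=1}^n p(h_1 = k \mid \mathcal{D}_j; \theta^{\text{old}})$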

  15. The term for the transition distribution

     ◮ We have assumed that $p(h_i = k \mid h_{i-1} = k'; A) = A_{k,k'}$ for $k, k' = 1, \ldots, K$, which we can write as

       $p(h_i \mid h_{i-1}; A) = \prod_{k,k'} A_{k,k'}^{\mathbb{1}(h_i = k,\, h_{i-1} = k')}$

       (see slides Basics of Model-Based Learning and Tutorial 7)

     ◮ Further:

       $\log p(h_i \mid h_{i-1}; A) = \sum_{k,k'} \mathbb{1}(h_i = k,\, h_{i-1} = k') \log A_{k,k'}$

     ◮ Hence $\mathbb{E}_{p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta^{\text{old}})} [\log p(h_i \mid h_{i-1}; A)]$ equals

       $\sum_{k,k'} \mathbb{E}_{p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta^{\text{old}})} \left[ \mathbb{1}(h_i = k,\, h_{i-1} = k') \right] \log A_{k,k'} = \sum_{k,k'} p(h_i = k, h_{i-1} = k' \mid \mathcal{D}_j; \theta^{\text{old}}) \log A_{k,k'}$

  16. The term for the emission distribution

     We can do the same for the emission distribution. With

       $p(v_i \mid h_i; B) = \prod_{m,k} B_{m,k}^{\mathbb{1}(v_i = m,\, h_i = k)} = \prod_{m,k} B_{m,k}^{\mathbb{1}(v_i = m)\, \mathbb{1}(h_i = k)}$

     we have

       $\mathbb{E}_{p(h_i \mid \mathcal{D}_j; \theta^{\text{old}})} \left[ \log p(v_i^{(j)} \mid h_i; B) \right] = \sum_{m,k} \mathbb{1}(v_i^{(j)} = m)\, p(h_i = k \mid \mathcal{D}_j; \theta^{\text{old}}) \log B_{m,k}$
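
The same per-column Lagrange argument sketched after slide 14 applies to the transition and emission terms (again anticipating the closed-form M-step mentioned on slide 10, which is not derived on these slides): maximising under the column sum-to-one constraints gives

  $A_{k,k'}^{\text{new}} \propto \sum_{j=1}^n \sum_{i=2}^d p(h_i = k, h_{i-1} = k' \mid \mathcal{D}_j; \theta^{\text{old}}), \qquad B_{m,k}^{\text{new}} \propto \sum_{j=1}^n \sum_{i=1}^d \mathbb{1}(v_i^{(j)} = m)\, p(h_i = k \mid \mathcal{D}_j; \theta^{\text{old}})$

with each column normalised so that $\sum_k A_{k,k'}^{\text{new}} = 1$ and $\sum_m B_{m,k}^{\text{new}} = 1$.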

  17. E-step for discrete-valued HMM

     ◮ Putting it all together, we obtain the expected complete data log-likelihood for the HMM with discrete visibles and hiddens:

       $J(\theta, \theta^{\text{old}}) = \sum_{j=1}^n \sum_k p(h_1 = k \mid \mathcal{D}_j; \theta^{\text{old}}) \log a_k + \sum_{j=1}^n \sum_{i=2}^d \sum_{k,k'} p(h_i = k, h_{i-1} = k' \mid \mathcal{D}_j; \theta^{\text{old}}) \log A_{k,k'} + \sum_{j=1}^n \sum_{i=1}^d \sum_{m,k} \mathbb{1}(v_i^{(j)} = m)\, p(h_i = k \mid \mathcal{D}_j; \theta^{\text{old}}) \log B_{m,k}$

     ◮ The objective decouples into separate terms for $a$ and for the columns of $A$ and $B$.
     ◮ Within each term, the parameters do not completely decouple because the elements of $a$ and the columns of $A$ and $B$ are each constrained to sum to one.
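
Putting the pieces together in code: a hedged sketch of one full EM (Baum-Welch) iteration for the discrete HMM, reusing forward_backward from the sketch after slide 6. The closed-form updates are the standard maximisers of $J(\theta, \theta^{\text{old}})$ under the sum-to-one constraints (normalised expected counts); they follow from the terms above but are not spelled out on these slides.

```python
import numpy as np

def em_step(data, a_old, A_old, B_old, M):
    """One EM iteration for the discrete HMM.

    E-step: smoothing and pairwise posteriors under theta_old (alpha-beta recursion).
    M-step: closed-form maximisation of J(theta, theta_old) via normalised expected counts.
    """
    K = len(a_old)
    a_num = np.zeros(K)          # expected counts for the initial state
    A_num = np.zeros((K, K))     # expected transition counts, indexed [k, k']
    B_num = np.zeros((M, K))     # expected emission counts, indexed [m, k]

    for v in data:
        gamma, xi = forward_backward(v, a_old, A_old, B_old)   # E-step for sequence D_j
        a_num += gamma[0]
        A_num += xi.sum(axis=0)
        for i, m in enumerate(v):
            B_num[m] += gamma[i]

    a_new = a_num / a_num.sum()
    A_new = A_num / A_num.sum(axis=0, keepdims=True)           # normalise each column over k
    B_new = B_num / B_num.sum(axis=0, keepdims=True)           # normalise each column over m
    return a_new, A_new, B_new
```

Iterating the update, e.g. `a, A, B = em_step(data, a, A, B, M)` inside a loop, never decreases the log-likelihood of the observed sequences, which is the standard monotonicity property of EM discussed in the variational inference and learning slides.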
