Learning for Hidden Markov Models & Course Recap

Michael Gutmann
Probabilistic Modelling and Reasoning (INFR11134)
School of Informatics, University of Edinburgh
Spring semester 2018
Recap

◮ We can decompose the log marginal of any joint distribution into a sum of two terms:
  ◮ the free energy, and
  ◮ the KL divergence between the variational and the conditional distribution.
◮ Variational principle: maximising the free energy with respect to the variational distribution allows us to (approximately) compute the (log) marginal and the conditional from the joint.
◮ We applied the variational principle to inference and learning problems.
◮ For parameter estimation in the presence of unobserved variables: coordinate ascent on the free energy leads to the (variational) EM algorithm.
Program

1. EM algorithm to learn the parameters of HMMs
2. Course recap
Program

1. EM algorithm to learn the parameters of HMMs
   ◮ Problem statement
   ◮ Learning by gradient ascent on the log-likelihood or by EM
   ◮ EM update equations
2. Course recap
Hidden Markov model

Specified by
◮ DAG (representing the independence assumptions):

  [Figure: chain of hidden states $h_1 \to h_2 \to h_3 \to h_4$, with an emission edge $h_i \to v_i$ at each step]

◮ Transition distribution $p(h_i \mid h_{i-1})$
◮ Emission distribution $p(v_i \mid h_i)$
◮ Initial state distribution $p(h_1)$
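As a concrete illustration, here is a minimal NumPy sketch of the three distributions represented as tables. The state-space sizes (K hidden states, M visible symbols) and the random initialisation are assumptions for illustration only; the column convention anticipates the parametrisation made explicit on the learning-problem slide below.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 3, 5  # assumed numbers of hidden states and visible symbols

# Transition table: column k' holds the distribution p(h_i = . | h_{i-1} = k')
A = rng.random((K, K))
A /= A.sum(axis=0, keepdims=True)

# Emission table: column k holds the distribution p(v_i = . | h_i = k)
B = rng.random((M, K))
B /= B.sum(axis=0, keepdims=True)

# Initial state distribution p(h_1 = k)
a = rng.random(K)
a /= a.sum()
```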
The classical inference problems

◮ Classical inference problems:
  ◮ Filtering: $p(h_t \mid v_{1:t})$
  ◮ Smoothing: $p(h_t \mid v_{1:u})$ where $t < u$
  ◮ Prediction: $p(h_t \mid v_{1:u})$ and/or $p(v_t \mid v_{1:u})$ where $t > u$
  ◮ Most likely hidden path (Viterbi alignment): $\operatorname{argmax}_{h_{1:t}} p(h_{1:t} \mid v_{1:t})$
◮ Inference problems can be solved by message passing.
◮ Requires that the transition, emission, and initial state distributions are known.
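For instance, filtering has the familiar forward (alpha) recursion. Below is a sketch assuming the table representation from the previous snippet (A[k, k'] = p(h_i = k | h_{i-1} = k'), B[m, k] = p(v_i = m | h_i = k)); normalising at every step both yields the filtered posterior directly and avoids numerical underflow.

```python
import numpy as np

def filtering(v, A, B, a):
    """Filtered posteriors p(h_t | v_{1:t}) for a discrete HMM.

    v : sequence of observed symbol indices.
    Returns `post` with post[t] = p(h_{t+1} = . | v_{1:t+1}).
    """
    d, K = len(v), len(a)
    post = np.zeros((d, K))
    post[0] = B[v[0]] * a                      # p(v_1 | h_1) p(h_1)
    post[0] /= post[0].sum()
    for t in range(1, d):
        post[t] = B[v[t]] * (A @ post[t - 1])  # predict with A, correct with B
        post[t] /= post[t].sum()               # normalise -> filtered posterior
    return post
```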
Learning problem

◮ Data: $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_n\}$, where each $\mathcal{D}_j$ is a sequence of visibles of length $d$, i.e. $\mathcal{D}_j = (v_1^{(j)}, \dots, v_d^{(j)})$
◮ Assumptions:
  ◮ All variables are discrete: $h_i \in \{1, \dots, K\}$, $v_i \in \{1, \dots, M\}$.
  ◮ Stationarity
◮ Parametrisation:
  ◮ Transition distribution is parametrised by the matrix $\mathbf{A}$: $p(h_i = k \mid h_{i-1} = k'; \mathbf{A}) = A_{k,k'}$
  ◮ Emission distribution is parametrised by the matrix $\mathbf{B}$: $p(v_i = m \mid h_i = k; \mathbf{B}) = B_{m,k}$
  ◮ Initial state distribution is parametrised by the vector $\mathbf{a}$: $p(h_1 = k; \mathbf{a}) = a_k$
◮ Task: use the data $\mathcal{D}$ to learn $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{a}$
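For intuition, data of the form assumed above can be generated by ancestral sampling from the model. This is a sketch, not part of the learning algorithm itself; the sequence length d and the tables A, B, a from the earlier snippet are assumed.

```python
import numpy as np

def sample_sequence(d, A, B, a, rng):
    """Ancestral sampling of one visible sequence (v_1, ..., v_d) from the HMM."""
    K, M = A.shape[0], B.shape[0]
    v = np.zeros(d, dtype=int)
    h = rng.choice(K, p=a)            # h_1 ~ p(h_1; a)
    v[0] = rng.choice(M, p=B[:, h])   # v_1 ~ p(v_1 | h_1; B)
    for i in range(1, d):
        h = rng.choice(K, p=A[:, h])  # h_i ~ p(h_i | h_{i-1}; A)
        v[i] = rng.choice(M, p=B[:, h])
    return v

# Hypothetical usage, with A, B, a from the earlier sketch:
# rng = np.random.default_rng(1)
# data = [sample_sequence(10, A, B, a, rng) for _ in range(100)]  # n = 100 sequences
```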
Learning problem

◮ Since $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{a}$ represent (conditional) distributions, the parameters are constrained to be non-negative and to satisfy
$$\sum_{k=1}^K p(h_i = k \mid h_{i-1} = k') = \sum_{k=1}^K A_{k,k'} = 1$$
$$\sum_{m=1}^M p(v_i = m \mid h_i = k) = \sum_{m=1}^M B_{m,k} = 1$$
$$\sum_{k=1}^K p(h_1 = k) = \sum_{k=1}^K a_k = 1$$
◮ Note: much of what follows holds more generally for HMMs and does not use the stationarity assumption or that the $h_i$ and $v_i$ are discrete random variables.
◮ The parameters together will be denoted by $\theta$.
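These simplex constraints are easy to check numerically for the table representation used in the sketches above; a small helper of this kind is useful when debugging the update equations later.

```python
import numpy as np

def check_simplex_constraints(A, B, a, tol=1e-10):
    """Verify non-negativity and the column-wise sum-to-one constraints."""
    assert (A >= 0).all() and (B >= 0).all() and (a >= 0).all()
    assert np.allclose(A.sum(axis=0), 1.0, atol=tol)  # sum_k A_{k,k'} = 1
    assert np.allclose(B.sum(axis=0), 1.0, atol=tol)  # sum_m B_{m,k} = 1
    assert np.isclose(a.sum(), 1.0, atol=tol)         # sum_k a_k = 1
```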
Options for learning the parameters

◮ The model $p(h, v; \theta)$ is normalised but we have unobserved variables.
◮ Option 1: simple gradient ascent on the log-likelihood
$$\theta_{\text{new}} = \theta_{\text{old}} + \epsilon \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} \left[ \nabla_\theta \log p(h, \mathcal{D}_j; \theta) \right] \Big|_{\theta_{\text{old}}}$$
(see slides Intractable Likelihood Functions)
◮ Option 2: EM algorithm
$$\theta_{\text{new}} = \operatorname{argmax}_\theta \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} \left[ \log p(h, \mathcal{D}_j; \theta) \right]$$
(see slides Variational Inference and Learning)
◮ For HMMs, both are possible thanks to sum-product message passing.
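The shared structure of the two options can be sketched as follows. Note that `e_step`, `m_step`, and `expected_grad` are hypothetical placeholders for routines developed on the remaining slides; the snippet only illustrates the control flow.

```python
def em(data, theta, n_iter):
    """Option 2: alternate posterior computation and the argmax ('M') step."""
    for _ in range(n_iter):
        posteriors = [e_step(Dj, theta) for Dj in data]  # p(h | D_j; theta_old)
        theta = m_step(data, posteriors)                 # argmax_theta J(theta, theta_old)
    return theta

def gradient_ascent(data, theta, n_iter, eps):
    """Option 1: one gradient step per posterior computation."""
    for _ in range(n_iter):
        posteriors = [e_step(Dj, theta) for Dj in data]  # recomputed after every step
        grad = sum(expected_grad(Dj, post, theta)
                   for Dj, post in zip(data, posteriors))
        theta = theta + eps * grad
    return theta
```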
Options for learning the parameters

Option 1: $\theta_{\text{new}} = \theta_{\text{old}} + \epsilon \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} \left[ \nabla_\theta \log p(h, \mathcal{D}_j; \theta) \right] \big|_{\theta_{\text{old}}}$
Option 2: $\theta_{\text{new}} = \operatorname{argmax}_\theta \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h, \mathcal{D}_j; \theta)]$

◮ Similarities:
  ◮ Both require computation of the posterior expectation.
  ◮ Assume the "M" step is performed by gradient ascent,
$$\theta' = \theta + \epsilon \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} \left[ \nabla_\theta \log p(h, \mathcal{D}_j; \theta) \right] \Big|_{\theta}$$
  where $\theta$ is initialised with $\theta_{\text{old}}$, and the final $\theta'$ gives $\theta_{\text{new}}$. If only one gradient step is taken, option 2 becomes option 1.
◮ Differences:
  ◮ Unlike option 2, option 1 requires re-computation of the posterior after each $\epsilon$-update of $\theta$, which may be costly.
  ◮ In some cases (including HMMs), the "M"/argmax step can be performed analytically in closed form.
Expected complete data log-likelihood

◮ Denote the objective in the EM algorithm by $J(\theta, \theta_{\text{old}})$,
$$J(\theta, \theta_{\text{old}}) = \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h, \mathcal{D}_j; \theta)]$$
◮ We show on the next slide that, for the HMM model, the full posteriors $p(h \mid \mathcal{D}_j; \theta_{\text{old}})$ are not needed, but only the pairwise and singleton marginals
$$p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta_{\text{old}}) \qquad p(h_i \mid \mathcal{D}_j; \theta_{\text{old}}).$$
They can be obtained by the alpha-beta recursion (sum-product algorithm).
◮ Posteriors need to be computed for each observed sequence $\mathcal{D}_j$, and need to be re-computed after updating $\theta$.
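Below is a sketch of the alpha-beta recursion producing exactly these two posteriors, in the scaled form that normalises the forward messages for numerical stability; the table conventions are those of the earlier snippets.

```python
import numpy as np

def alpha_beta(v, A, B, a):
    """Singleton and pairwise posteriors for one observed sequence v = D_j.

    Returns gamma[t, k]  = p(h_{t+1} = k | v; theta) and
            xi[t, k, kp] = p(h_{t+2} = k, h_{t+1} = kp | v; theta).
    """
    d, K = len(v), len(a)
    alpha = np.zeros((d, K))
    c = np.zeros(d)                          # per-step normalisers
    alpha[0] = B[v[0]] * a
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, d):                    # forward (alpha) pass
        alpha[t] = B[v[t]] * (A @ alpha[t - 1])
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    beta = np.ones((d, K))
    for t in range(d - 2, -1, -1):           # backward (beta) pass
        beta[t] = A.T @ (B[v[t + 1]] * beta[t + 1]) / c[t + 1]

    gamma = alpha * beta                     # singleton marginal posteriors
    xi = np.zeros((d - 1, K, K))
    for t in range(d - 1):                   # pairwise posteriors of neighbours
        xi[t] = np.outer(B[v[t + 1]] * beta[t + 1], alpha[t]) * A / c[t + 1]
    return gamma, xi
```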
Expected complete data log-likelihood

◮ The HMM model factorises as
$$p(h, v; \theta) = p(h_1; \mathbf{a})\, p(v_1 \mid h_1; \mathbf{B}) \prod_{i=2}^d p(h_i \mid h_{i-1}; \mathbf{A})\, p(v_i \mid h_i; \mathbf{B})$$
◮ For sequence $\mathcal{D}_j$, we have
$$\log p(h, \mathcal{D}_j; \theta) = \log p(h_1; \mathbf{a}) + \log p(v_1^{(j)} \mid h_1; \mathbf{B}) + \sum_{i=2}^d \left[ \log p(h_i \mid h_{i-1}; \mathbf{A}) + \log p(v_i^{(j)} \mid h_i; \mathbf{B}) \right]$$
◮ Since
$$\mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_1; \mathbf{a})] = \mathbb{E}_{p(h_1 \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_1; \mathbf{a})]$$
$$\mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_i \mid h_{i-1}; \mathbf{A})] = \mathbb{E}_{p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_i \mid h_{i-1}; \mathbf{A})]$$
$$\mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} \left[\log p(v_i^{(j)} \mid h_i; \mathbf{B})\right] = \mathbb{E}_{p(h_i \mid \mathcal{D}_j; \theta_{\text{old}})} \left[\log p(v_i^{(j)} \mid h_i; \mathbf{B})\right]$$
we do not need the full posterior but only the marginal posteriors and the joint of the neighbouring variables.
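The factorised log of the joint, i.e. the "complete data" log-likelihood before taking expectations, translates directly into code; a small sketch under the same table conventions as before.

```python
import numpy as np

def complete_data_ll(h, v, a, A, B):
    """log p(h, v; theta) for a fully observed hidden/visible sequence pair."""
    ll = np.log(a[h[0]]) + np.log(B[v[0], h[0]])       # initial state and first emission
    for i in range(1, len(v)):
        ll += np.log(A[h[i], h[i - 1]])                # transition h_{i-1} -> h_i
        ll += np.log(B[v[i], h[i]])                    # emission h_i -> v_i
    return ll
```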
Expected complete data log-likelihood

With the factorisation (independencies) in the HMM model, the objective function thus becomes
$$J(\theta, \theta_{\text{old}}) = \sum_{j=1}^n \mathbb{E}_{p(h \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h, \mathcal{D}_j; \theta)]$$
$$= \sum_{j=1}^n \mathbb{E}_{p(h_1 \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_1; \mathbf{a})] + \sum_{j=1}^n \sum_{i=2}^d \mathbb{E}_{p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_i \mid h_{i-1}; \mathbf{A})] + \sum_{j=1}^n \sum_{i=1}^d \mathbb{E}_{p(h_i \mid \mathcal{D}_j; \theta_{\text{old}})} \left[\log p(v_i^{(j)} \mid h_i; \mathbf{B})\right]$$
In the derivation so far we have not yet used the assumed parametrisation of the model. We insert these assumptions next.
The term for the initial state distribution

◮ We have assumed that
$$p(h_1 = k; \mathbf{a}) = a_k, \qquad k = 1, \dots, K$$
which we can write as
$$p(h_1; \mathbf{a}) = \prod_k a_k^{\mathbb{1}(h_1 = k)}$$
(like for the Bernoulli model, see slides Basics of Model-Based Learning and Tutorial 7)
◮ The log pmf is thus
$$\log p(h_1; \mathbf{a}) = \sum_k \mathbb{1}(h_1 = k) \log a_k$$
◮ Hence
$$\mathbb{E}_{p(h_1 \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_1; \mathbf{a})] = \sum_k \mathbb{E}_{p(h_1 \mid \mathcal{D}_j; \theta_{\text{old}})} [\mathbb{1}(h_1 = k)] \log a_k = \sum_k p(h_1 = k \mid \mathcal{D}_j; \theta_{\text{old}}) \log a_k$$
The term for the transition distribution

◮ We have assumed that
$$p(h_i = k \mid h_{i-1} = k'; \mathbf{A}) = A_{k,k'}, \qquad k, k' = 1, \dots, K$$
which we can write as
$$p(h_i \mid h_{i-1}; \mathbf{A}) = \prod_{k,k'} A_{k,k'}^{\mathbb{1}(h_i = k,\, h_{i-1} = k')}$$
(see slides Basics of Model-Based Learning and Tutorial 7)
◮ Further:
$$\log p(h_i \mid h_{i-1}; \mathbf{A}) = \sum_{k,k'} \mathbb{1}(h_i = k, h_{i-1} = k') \log A_{k,k'}$$
◮ Hence $\mathbb{E}_{p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta_{\text{old}})} [\log p(h_i \mid h_{i-1}; \mathbf{A})]$ equals
$$\sum_{k,k'} \mathbb{E}_{p(h_i, h_{i-1} \mid \mathcal{D}_j; \theta_{\text{old}})} \left[\mathbb{1}(h_i = k, h_{i-1} = k')\right] \log A_{k,k'} = \sum_{k,k'} p(h_i = k, h_{i-1} = k' \mid \mathcal{D}_j; \theta_{\text{old}}) \log A_{k,k'}$$
The term for the emission distribution

We can do the same for the emission distribution. With
$$p(v_i \mid h_i; \mathbf{B}) = \prod_{m,k} B_{m,k}^{\mathbb{1}(v_i = m,\, h_i = k)} = \prod_{m,k} B_{m,k}^{\mathbb{1}(v_i = m)\, \mathbb{1}(h_i = k)}$$
we have
$$\mathbb{E}_{p(h_i \mid \mathcal{D}_j; \theta_{\text{old}})} \left[\log p(v_i^{(j)} \mid h_i; \mathbf{B})\right] = \sum_{m,k} \mathbb{1}(v_i^{(j)} = m)\, p(h_i = k \mid \mathcal{D}_j; \theta_{\text{old}}) \log B_{m,k}$$
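With the three terms now explicit, the per-sequence contribution to J can be computed directly from the gamma and xi arrays returned by the alpha_beta sketch earlier. Note how the emission term only ever looks up the observed row $v_i$ of B, which is exactly the effect of the indicator $\mathbb{1}(v_i = m)$.

```python
import numpy as np

def per_sequence_J(v, gamma, xi, a, A, B):
    """Contribution of one sequence D_j to J(theta, theta_old).

    gamma, xi: posteriors under theta_old, as returned by alpha_beta above.
    """
    initial = gamma[0] @ np.log(a)           # sum_k p(h_1 = k | D_j) log a_k
    transition = np.sum(xi * np.log(A))      # sums over i = 2..d and k, k'
    emission = np.sum(gamma * np.log(B[v]))  # B[v][i, k] = B_{v_i, k}
    return initial + transition + emission
```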
E-step for discrete-valued HMM

◮ Putting it all together, we obtain the expected complete data log-likelihood for the HMM with discrete visibles and hiddens:
$$J(\theta, \theta_{\text{old}}) = \sum_{j=1}^n \sum_k p(h_1 = k \mid \mathcal{D}_j; \theta_{\text{old}}) \log a_k + \sum_{j=1}^n \sum_{i=2}^d \sum_{k,k'} p(h_i = k, h_{i-1} = k' \mid \mathcal{D}_j; \theta_{\text{old}}) \log A_{k,k'} + \sum_{j=1}^n \sum_{i=1}^d \sum_{m,k} \mathbb{1}(v_i^{(j)} = m)\, p(h_i = k \mid \mathcal{D}_j; \theta_{\text{old}}) \log B_{m,k}$$
◮ The objectives for $\mathbf{a}$, and for the columns of $\mathbf{A}$ and $\mathbf{B}$, decouple.
◮ The optimisation does not completely decouple because of the constraint that the elements of $\mathbf{a}$ have to sum to one, and that the columns of $\mathbf{A}$ and $\mathbf{B}$ have to sum to one.
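For completeness, a hedged sketch of the closed-form argmax that the earlier slide alluded to: maximising each decoupled term under its sum-to-one constraint (e.g. via Lagrange multipliers) gives normalised expected counts, the familiar Baum-Welch M-step. These update formulas are standard but not derived on these slides, so treat the snippet as an assumption-labelled outline rather than the course's derivation.

```python
import numpy as np

def m_step(data, posteriors, K, M):
    """Closed-form maximiser of J under the simplex constraints (standard
    Baum-Welch updates, stated here without derivation).

    data:       list of observed sequences v (symbol indices)
    posteriors: list of (gamma, xi) pairs from alpha_beta under theta_old
    """
    a_new = np.zeros(K)
    A_new = np.zeros((K, K))
    B_new = np.zeros((M, K))
    for v, (gamma, xi) in zip(data, posteriors):
        a_new += gamma[0]              # expected initial-state counts
        A_new += xi.sum(axis=0)        # expected transition counts
        for i, m in enumerate(v):      # expected emission counts
            B_new[m] += gamma[i]
    # Normalise: a sums to one, columns of A and B sum to one
    a_new /= a_new.sum()
    A_new /= A_new.sum(axis=0, keepdims=True)
    B_new /= B_new.sum(axis=0, keepdims=True)
    return a_new, A_new, B_new
```

One EM iteration then consists of running alpha_beta on every sequence (E-step) followed by m_step (M-step), repeated until J or the log-likelihood stops improving.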