
Probabilistic & Unsupervised Learning
Latent Variable Models for Time Series

Maneesh Sahani
maneesh@gatsby.ucl.ac.uk
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science
University College London
Term 1, Autumn


1. HMMs and SSMs (Linear Gaussian)

State space models are the continuous-state analogue of hidden Markov models.

[Figure: the HMM chain $s_1 \to s_2 \to \dots \to s_T$ and the SSM chain $z_1 \to z_2 \to \dots \to z_T$, each emitting observations $x_1, x_2, \dots, x_T$.]

◮ A continuous vector state is a very powerful representation. For an HMM to communicate $N$ bits of information about the past, it needs $2^N$ states! But a real-valued state vector can store an arbitrary number of bits in principle.

◮ Linear-Gaussian outputs/dynamics are very weak. The types of dynamics linear SSMs can capture are very limited. HMMs can in principle represent arbitrary stochastic dynamics and output mappings.

2. Many Extensions

◮ Constrained HMMs
◮ Continuous-state models with discrete outputs, for time series and static data
◮ Hierarchical models
◮ Hybrid systems ⇔ mixed continuous & discrete states, switching state-space models

3. Richer state representations

[Figure: a factorial HMM with parallel state chains $A_t, B_t, C_t, D_t$ evolving over time.]

Factorial HMMs; Dynamic Bayesian Networks

◮ These are hidden Markov models with many state variables (i.e. a distributed representation of the state).
◮ The state can capture many more bits of information about the sequence (linear in the number of state variables).


4. Chain models: ML Learning with EM

[Figure: the SSM chain $z_1 \to z_2 \to \dots \to z_T$ and the HMM chain $s_1 \to s_2 \to \dots \to s_T$, each emitting $x_1, x_2, \dots, x_T$.]

SSM: $z_1 \sim \mathcal{N}(\mu_0, Q_0)$;  $z_t \mid z_{t-1} \sim \mathcal{N}(A z_{t-1}, Q)$;  $x_t \mid z_t \sim \mathcal{N}(C z_t, R)$
HMM: $s_1 \sim \pi$;  $s_t \mid s_{t-1} \sim \Phi_{s_{t-1},\cdot}$;  $x_t \mid s_t \sim A_{s_t}$

The structure of learning and inference for both models is dictated by the factored structure:

$$P(x_1, \dots, x_T, z_1, \dots, z_T) = P(z_1) \prod_{t=2}^{T} P(z_t \mid z_{t-1}) \prod_{t=1}^{T} P(x_t \mid z_t)$$

Learning (M-step):

$$\operatorname*{argmax} \big\langle \log P(x_1, \dots, x_T, z_1, \dots, z_T) \big\rangle_{q(z_1, \dots, z_T)} = \operatorname*{argmax} \Big[ \langle \log P(z_1) \rangle_{q(z_1)} + \sum_{t=2}^{T} \langle \log P(z_t \mid z_{t-1}) \rangle_{q(z_t, z_{t-1})} + \sum_{t=1}^{T} \langle \log P(x_t \mid z_t) \rangle_{q(z_t)} \Big]$$

So the expectations needed in the E-step are derived from singleton and pairwise marginals.
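Because of this factorisation, the complete-data log-likelihood is just a sum of one initial term, $T-1$ transition terms, and $T$ emission terms. A minimal numpy sketch (not from the slides) for the discrete-output HMM case, assuming `pi` is the initial distribution, `Phi[j, i]` the transition probability $j \to i$, and `A[i, k]` the probability of emitting symbol $k$ from state $i$:

```python
import numpy as np

# Factored log-joint log P(x_{1:T}, s_{1:T}) for a discrete-output HMM.
def hmm_log_joint(pi, Phi, A, s, x):
    lp = np.log(pi[s[0]]) + np.log(A[s[0], x[0]])
    for t in range(1, len(x)):
        lp += np.log(Phi[s[t - 1], s[t]])   # transition term
        lp += np.log(A[s[t], x[t]])         # emission term
    return lp
```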


5. Chain models: Inference

Three general inference problems:

Filtering:   $P(z_t \mid x_1, \dots, x_t)$
Smoothing:   $P(z_t \mid x_1, \dots, x_T)$ (also $P(z_t, z_{t-1} \mid x_1, \dots, x_T)$ for learning)
Prediction:  $P(z_t \mid x_1, \dots, x_{t - \Delta t})$

Naively, these marginal posteriors seem to require very large integrals (or sums):

$$P(z_t \mid x_1, \dots, x_t) = \int \cdots \int dz_1 \dots dz_{t-1}\, P(z_1, \dots, z_t \mid x_1, \dots, x_t)$$

but again the factored structure of the distributions will help us. The algorithms rely on a form of temporal updating or message passing.


6. Crawling the HMM state-lattice

[Figure: a lattice of $K = 4$ states by times $s_1 \dots s_6$.]

Consider an HMM, where we want to find

$$P(s_t = k \mid x_1 \dots x_t) = \sum_{k_1, \dots, k_{t-1}} P(s_1 = k_1, \dots, s_t = k \mid x_1 \dots x_t) \propto \sum_{k_1, \dots, k_{t-1}} \pi_{k_1} A_{k_1}(x_1) \Phi_{k_1, k_2} A_{k_2}(x_2) \cdots \Phi_{k_{t-1}, k} A_k(x_t)$$

Naïve algorithm:

◮ start a "bug" at each of the $k_1 = 1 \dots K$ states at $t = 1$, holding value $\pi_{k_1} A_{k_1}(x_1)$
◮ move each bug forward in time: make copies of each bug at each subsequent state and multiply the value of each copy by transition prob. × output emission prob.
◮ repeat until all bugs have reached time $t$
◮ sum up the values of all $K^{t-1}$ bugs that reach state $s_t = k$ (one bug per state path)

Clever recursion:

◮ at every step, replace the bugs at each node with a single bug carrying the sum of their values


7. Probability updating: "Bayesian filtering"

[Figure: the SSM chain $z_1 \to z_2 \to \dots \to z_T$ emitting $x_1, x_2, \dots, x_T$.]

$$P(z_t \mid x_{1:t}) = \int P(z_t, z_{t-1} \mid x_t, x_{1:t-1})\, dz_{t-1}$$
$$= \int \frac{P(x_t, z_t, z_{t-1} \mid x_{1:t-1})}{P(x_t \mid x_{1:t-1})}\, dz_{t-1}$$
$$\propto \int P(x_t \mid z_t, z_{t-1}, x_{1:t-1})\, P(z_t \mid z_{t-1}, x_{1:t-1})\, P(z_{t-1} \mid x_{1:t-1})\, dz_{t-1}$$
$$= \int P(x_t \mid z_t)\, P(z_t \mid z_{t-1})\, P(z_{t-1} \mid x_{1:t-1})\, dz_{t-1} \qquad \text{(Markov property)}$$

This is a forward recursion based on Bayes rule.


8. The HMM: Forward pass

The forward recursion for the HMM is a form of dynamic programming.

Define: $\alpha_t(i) = P(x_1, \dots, x_t, s_t = i \mid \theta)$

Then, much like the Bayesian filtering updates, we have:

$$\alpha_1(i) = \pi_i A_i(x_1) \qquad \alpha_{t+1}(i) = \Big[ \sum_{j=1}^{K} \alpha_t(j)\, \Phi_{ji} \Big] A_i(x_{t+1})$$

We've defined $\alpha_t(i)$ to be a joint rather than a posterior. It's easy to obtain the posterior by normalisation:

$$P(s_t = i \mid x_1, \dots, x_t, \theta) = \frac{\alpha_t(i)}{\sum_k \alpha_t(k)}$$

This form enables us to compute the likelihood for $\theta = \{A, \Phi, \pi\}$ efficiently in $O(TK^2)$ time:

$$P(x_1 \dots x_T \mid \theta) = \sum_{s_1, \dots, s_T} P(x_1, \dots, x_T, s_1, \dots, s_T \mid \theta) = \sum_{k=1}^{K} \alpha_T(k)$$

avoiding the exponential number of paths in the naïve sum (number of paths = $K^T$).
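As a concrete illustration, a minimal numpy sketch of this forward pass (assumed conventions, mine: `Phi[j, i]` $= \Phi_{ji}$, `A[i, k]` $= A_i(k)$ for discrete outputs; unscaled, as on this slide — the rescaled version appears under "HMM practicalities"):

```python
import numpy as np

# Unscaled forward pass: alpha[t, i] = P(x_{1:t+1}, s_{t+1} = i | theta).
def forward(pi, Phi, A, x):
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * A[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ Phi) * A[:, x[t]]   # O(K^2) per step
    likelihood = alpha[-1].sum()                       # P(x_{1:T} | theta)
    posterior = alpha / alpha.sum(1, keepdims=True)    # filtering posteriors
    return alpha, posterior, likelihood
```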


9. The LGSSM: Kalman Filtering

[Figure: the SSM chain.]  $z_1 \sim \mathcal{N}(\mu_0, Q_0)$;  $z_t \mid z_{t-1} \sim \mathcal{N}(A z_{t-1}, Q)$;  $x_t \mid z_t \sim \mathcal{N}(C z_t, R)$

For the SSM, the sums become integrals. Let $\hat{z}_1^0 = \mu_0$ and $\hat{V}_1^0 = Q_0$; then (cf. FA)

$$P(z_1 \mid x_1) = \mathcal{N}\big( \underbrace{\hat{z}_1^0 + K_1 (x_1 - C \hat{z}_1^0)}_{\hat{z}_1^1},\ \underbrace{\hat{V}_1^0 - K_1 C \hat{V}_1^0}_{\hat{V}_1^1} \big), \qquad K_1 = \hat{V}_1^0 C^\mathsf{T} (C \hat{V}_1^0 C^\mathsf{T} + R)^{-1}$$

In general, we define $\hat{z}_t^\tau \equiv \mathbb{E}[z_t \mid x_1, \dots, x_\tau]$ and $\hat{V}_t^\tau \equiv \mathbb{V}[z_t \mid x_1, \dots, x_\tau]$. Then,

$$P(z_t \mid x_{1:t-1}) = \int dz_{t-1}\, P(z_t \mid z_{t-1})\, P(z_{t-1} \mid x_{1:t-1}) = \mathcal{N}\big( \underbrace{A \hat{z}_{t-1}^{t-1}}_{\hat{z}_t^{t-1}},\ \underbrace{A \hat{V}_{t-1}^{t-1} A^\mathsf{T} + Q}_{\hat{V}_t^{t-1}} \big)$$

$$P(z_t \mid x_{1:t}) = \mathcal{N}\big( \underbrace{\hat{z}_t^{t-1} + K_t (x_t - C \hat{z}_t^{t-1})}_{\hat{z}_t^t},\ \underbrace{\hat{V}_t^{t-1} - K_t C \hat{V}_t^{t-1}}_{\hat{V}_t^t} \big), \qquad \underbrace{K_t = \hat{V}_t^{t-1} C^\mathsf{T} (C \hat{V}_t^{t-1} C^\mathsf{T} + R)^{-1}}_{\text{Kalman gain, of the form } \langle zx^\mathsf{T} \rangle \langle xx^\mathsf{T} \rangle^{-1}}$$

FA: $\beta = (I + \Lambda^\mathsf{T} \Psi^{-1} \Lambda)^{-1} \Lambda^\mathsf{T} \Psi^{-1} = \Lambda^\mathsf{T} (\Lambda \Lambda^\mathsf{T} + \Psi)^{-1}$ (matrix inversion lemma); $\mu = \beta x_n$; $\Sigma = I - \beta \Lambda$.
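A hedged numpy sketch of one predict/update step in the slide's notation (function and variable names are mine):

```python
import numpy as np

# One Kalman filter step: (z, V) = filtered moments at t-1.
def kalman_filter_step(z, V, x, A, Q, C, R):
    # predict: P(z_t | x_{1:t-1})
    z_pred = A @ z
    V_pred = A @ V @ A.T + Q
    # update: P(z_t | x_{1:t})
    S = C @ V_pred @ C.T + R              # innovation covariance <xx^T>
    K = V_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
    z_new = z_pred + K @ (x - C @ z_pred)
    V_new = V_pred - K @ C @ V_pred
    return z_new, V_new
```

For $t = 1$ the predict step is skipped: $(\hat{z}_1^0, \hat{V}_1^0) = (\mu_0, Q_0)$ feed directly into the update, as on the slide.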


10. The marginal posterior: "Bayesian smoothing"

[Figure: the SSM chain.]

$$P(z_t \mid x_{1:T}) = \frac{P(z_t, x_{t+1:T} \mid x_{1:t})}{P(x_{t+1:T} \mid x_{1:t})} = \frac{P(x_{t+1:T} \mid z_t)\, P(z_t \mid x_{1:t})}{P(x_{t+1:T} \mid x_{1:t})}$$

The marginal combines a backward message with the forward message found by filtering.

11. The HMM: Forward–Backward Algorithm

State estimation: compute the marginal posterior distribution over the state at time $t$:

$$\gamma_t(i) \equiv P(s_t = i \mid x_{1:T}) = \frac{P(s_t = i, x_{1:t})\, P(x_{t+1:T} \mid s_t = i)}{P(x_{1:T})} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_j \alpha_t(j)\, \beta_t(j)}$$

where there is a simple backward recursion for

$$\beta_t(i) \equiv P(x_{t+1:T} \mid s_t = i) = \sum_{j=1}^{K} P(s_{t+1} = j, x_{t+1}, x_{t+2:T} \mid s_t = i) = \sum_{j=1}^{K} P(s_{t+1} = j \mid s_t = i)\, P(x_{t+1} \mid s_{t+1} = j)\, P(x_{t+2:T} \mid s_{t+1} = j) = \sum_{j=1}^{K} \Phi_{ij}\, A_j(x_{t+1})\, \beta_{t+1}(j)$$

$\alpha_t(i)$ gives the total inflow of probability to node $(t, i)$; $\beta_t(i)$ gives the total outflow of probability.

[Figure: the state lattice again.]

Bugs again: the bugs run forward from time 0 to $t$ and backward from time $T$ to $t$.
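A companion sketch to the forward pass above (same assumed conventions; again unscaled, so only suitable for short sequences — see the scaling trick under "HMM practicalities"):

```python
import numpy as np

# Backward recursion: beta[t, i] = P(x_{t+2:T} ... | s = i) as on the slide.
def backward(Phi, A, x):
    T, K = len(x), Phi.shape[0]
    beta = np.ones((T, K))                 # beta_T = 1
    for t in range(T - 2, -1, -1):
        beta[t] = Phi @ (A[:, x[t + 1]] * beta[t + 1])
    return beta

# Smoothed marginals gamma_t(i) from forward and backward messages.
def smoothed_marginals(alpha, beta):
    gamma = alpha * beta
    return gamma / gamma.sum(1, keepdims=True)
```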


12. Viterbi decoding

◮ The numbers $\gamma_t(i)$ computed by forward–backward give the marginal posterior distribution over states at each time.
◮ By choosing the state $i_t^*$ with the largest $\gamma_t(i)$ at each time, we can make a "best" state path. This is the path with the maximum expected number of correct states.
◮ But it is not the single path with the highest probability of generating the data. In fact it may be a path of probability zero!
◮ To find the single best path, we use the Viterbi decoding algorithm, which is just Bellman's dynamic programming algorithm applied to this problem. This is an inference algorithm which computes the most probable state sequence (a sketch follows below): $\operatorname*{argmax}_{s_{1:T}} P(s_{1:T} \mid x_{1:T}, \theta)$
◮ The recursions look the same as forward–backward, except with max instead of $\sum$.
◮ Bugs once more: same trick, except at each step kill all bugs but the one with the highest value at the node.
◮ There is also a modified EM training based on the Viterbi decoder (assignment).
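A hedged sketch of Viterbi decoding in log space (same assumed `pi`/`Phi`/`A` conventions as the forward-pass sketch):

```python
import numpy as np

# Max-product recursion with backpointers; returns the MAP state path.
def viterbi(pi, Phi, A, x):
    T, K = len(x), len(pi)
    logdelta = np.log(pi) + np.log(A[:, x[0]])
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = logdelta[:, None] + np.log(Phi)   # scores[j, i]: j -> i
        backptr[t] = scores.argmax(0)              # best predecessor of i
        logdelta = scores.max(0) + np.log(A[:, x[t]])
    path = np.zeros(T, dtype=int)
    path[-1] = logdelta.argmax()
    for t in range(T - 2, -1, -1):                 # trace back
        path[t] = backptr[t + 1, path[t + 1]]
    return path, logdelta.max()                    # path and its log joint prob.
```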

13. The LGSSM: Kalman smoothing

[Figure: the SSM chain.]

We use a slightly different decomposition:

$$P(z_t \mid x_{1:T}) = \int P(z_t, z_{t+1} \mid x_{1:T})\, dz_{t+1} = \int P(z_t \mid z_{t+1}, x_{1:T})\, P(z_{t+1} \mid x_{1:T})\, dz_{t+1} = \int P(z_t \mid z_{t+1}, x_{1:t})\, P(z_{t+1} \mid x_{1:T})\, dz_{t+1} \qquad \text{(Markov property)}$$

This gives the additional backward recursion:

$$J_t = \hat{V}_t^t A^\mathsf{T} (\hat{V}_{t+1}^t)^{-1} \qquad \hat{z}_t^T = \hat{z}_t^t + J_t (\hat{z}_{t+1}^T - A \hat{z}_t^t) \qquad \hat{V}_t^T = \hat{V}_t^t + J_t (\hat{V}_{t+1}^T - \hat{V}_{t+1}^t) J_t^\mathsf{T}$$
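A sketch of one backward (Rauch–Tung–Striebel) step, assuming the filtered moments $(\hat{z}_t^t, \hat{V}_t^t)$ and the already-smoothed moments $(\hat{z}_{t+1}^T, \hat{V}_{t+1}^T)$ of the next time step:

```python
import numpy as np

# One backward smoothing step of the recursion above.
def rts_step(z_f, V_f, z_s, V_s, A, Q):
    z_pred = A @ z_f                        # z_{t+1}^t
    V_pred = A @ V_f @ A.T + Q              # V_{t+1}^t
    J = V_f @ A.T @ np.linalg.inv(V_pred)   # smoother gain J_t
    z_new = z_f + J @ (z_s - z_pred)        # z_t^T
    V_new = V_f + J @ (V_s - V_pred) @ J.T  # V_t^T
    return z_new, V_new
```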

14. ML Learning for SSMs using batch EM

[Figure: the SSM chain with $A$ labelling the transitions and $C$ the emissions.]

Parameters: $\theta = \{\mu_0, Q_0, A, Q, C, R\}$

Free energy:

$$\mathcal{F}(q, \theta) = \int dz_{1:T}\, q(z_{1:T}) \big( \log P(x_{1:T}, z_{1:T} \mid \theta) - \log q(z_{1:T}) \big)$$

E-step: Maximise $\mathcal{F}$ w.r.t. $q$ with $\theta$ fixed: $q^*(z) = p(z \mid x, \theta)$. This can be achieved with a two-state extension of the Kalman smoother.

M-step: Maximise $\mathcal{F}$ w.r.t. $\theta$ with $q$ fixed. This boils down to solving a few weighted least-squares problems, since all the variables in

$$p(z, x \mid \theta) = p(z_1)\, p(x_1 \mid z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1})\, p(x_t \mid z_t)$$

form a multivariate Gaussian.


15. The M step for C

$$p(x_t \mid z_t) \propto \exp\Big\{ -\tfrac{1}{2} (x_t - C z_t)^\mathsf{T} R^{-1} (x_t - C z_t) \Big\} \;\Rightarrow$$

$$C^{\text{new}} = \operatorname*{argmax}_C \Big\langle \sum_t \ln p(x_t \mid z_t) \Big\rangle_q$$
$$= \operatorname*{argmax}_C \Big\langle -\tfrac{1}{2} \sum_t (x_t - C z_t)^\mathsf{T} R^{-1} (x_t - C z_t) \Big\rangle_q + \text{const}$$
$$= \operatorname*{argmax}_C -\tfrac{1}{2} \sum_t \Big[ x_t^\mathsf{T} R^{-1} x_t - 2 x_t^\mathsf{T} R^{-1} C \langle z_t \rangle + \langle z_t^\mathsf{T} C^\mathsf{T} R^{-1} C z_t \rangle \Big]$$
$$= \operatorname*{argmax}_C -\tfrac{1}{2} \Big[ \operatorname{Tr}\Big( C \sum_t \langle z_t z_t^\mathsf{T} \rangle C^\mathsf{T} R^{-1} \Big) - 2 \operatorname{Tr}\Big( C \sum_t \langle z_t \rangle x_t^\mathsf{T} R^{-1} \Big) \Big]$$

Using $\frac{\partial \operatorname{Tr}[AB]}{\partial A} = B^\mathsf{T}$, we have

$$\frac{\partial \{\cdot\}}{\partial C} = R^{-1} \sum_t x_t \langle z_t \rangle^\mathsf{T} - R^{-1} C \sum_t \langle z_t z_t^\mathsf{T} \rangle$$

$$\Rightarrow\; C^{\text{new}} = \Big( \sum_t x_t \langle z_t \rangle^\mathsf{T} \Big) \Big( \sum_t \langle z_t z_t^\mathsf{T} \rangle \Big)^{-1}$$

Note the connection to linear regression (and factor analysis).
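A short sketch of this update from E-step statistics (assumed inputs, mine: `Ez[t]` $= \langle z_t \rangle$ and `Ezz[t]` $= \langle z_t z_t^\mathsf{T} \rangle$ from the Kalman smoother, observations `X` of shape $(T, p)$):

```python
import numpy as np

# C update; note <z_t z_t^T> = V_t^T + z_t^T (z_t^T)^T from the smoother.
def m_step_C(X, Ez, Ezz):
    S_xz = X.T @ Ez            # sum_t x_t <z_t>^T
    S_zz = Ezz.sum(axis=0)     # sum_t <z_t z_t^T>
    return S_xz @ np.linalg.inv(S_zz)
```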


16. The M step for A

$$p(z_{t+1} \mid z_t) \propto \exp\Big\{ -\tfrac{1}{2} (z_{t+1} - A z_t)^\mathsf{T} Q^{-1} (z_{t+1} - A z_t) \Big\} \;\Rightarrow$$

$$A^{\text{new}} = \operatorname*{argmax}_A \Big\langle \sum_t \ln p(z_{t+1} \mid z_t) \Big\rangle_q$$
$$= \operatorname*{argmax}_A \Big\langle -\tfrac{1}{2} \sum_t (z_{t+1} - A z_t)^\mathsf{T} Q^{-1} (z_{t+1} - A z_t) \Big\rangle_q + \text{const}$$
$$= \operatorname*{argmax}_A -\tfrac{1}{2} \sum_t \Big[ \langle z_{t+1}^\mathsf{T} Q^{-1} z_{t+1} \rangle - 2 \langle z_{t+1}^\mathsf{T} Q^{-1} A z_t \rangle + \langle z_t^\mathsf{T} A^\mathsf{T} Q^{-1} A z_t \rangle \Big]$$
$$= \operatorname*{argmax}_A -\tfrac{1}{2} \Big[ \operatorname{Tr}\Big( A \sum_t \langle z_t z_t^\mathsf{T} \rangle A^\mathsf{T} Q^{-1} \Big) - 2 \operatorname{Tr}\Big( A \sum_t \langle z_t z_{t+1}^\mathsf{T} \rangle Q^{-1} \Big) \Big]$$

Using $\frac{\partial \operatorname{Tr}[AB]}{\partial A} = B^\mathsf{T}$, we have

$$\frac{\partial \{\cdot\}}{\partial A} = Q^{-1} \sum_t \langle z_{t+1} z_t^\mathsf{T} \rangle - Q^{-1} A \sum_t \langle z_t z_t^\mathsf{T} \rangle$$

$$\Rightarrow\; A^{\text{new}} = \Big( \sum_t \langle z_{t+1} z_t^\mathsf{T} \rangle \Big) \Big( \sum_t \langle z_t z_t^\mathsf{T} \rangle \Big)^{-1}$$

This is still analogous to factor analysis and linear regression, but with an extra expectation.
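A companion sketch for the A update (assumed input, mine: `Ezz1[t]` $= \langle z_{t+1} z_t^\mathsf{T} \rangle$, the pairwise statistic produced by the two-slice extension of the Kalman smoother):

```python
import numpy as np

# A update from singleton and pairwise smoothed statistics.
def m_step_A(Ezz, Ezz1):
    S_10 = Ezz1.sum(axis=0)        # sum_t <z_{t+1} z_t^T>
    S_00 = Ezz[:-1].sum(axis=0)    # sum over t = 1..T-1 of <z_t z_t^T>
    return S_10 @ np.linalg.inv(S_00)
```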

17. Learning (online gradient)

Time-series data must often be processed in real time, and we may want to update parameters online as observations arrive. We can do so by updating a local version of the likelihood based on the Kalman filter estimates. Consider the log likelihood contributed by each data point ($\ell_t$):

$$\ell = \sum_{t=1}^{T} \ln p(x_t \mid x_1, \dots, x_{t-1}) = \sum_{t=1}^{T} \ell_t$$

Then,

$$\ell_t = -\frac{D}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_t| - \frac{1}{2} (x_t - C \hat{z}_t^{t-1})^\mathsf{T} \Sigma_t^{-1} (x_t - C \hat{z}_t^{t-1})$$

where $D$ is the dimension of $x$, and:

$$\hat{z}_t^{t-1} = A \hat{z}_{t-1}^{t-1} \qquad \hat{V}_t^{t-1} = A \hat{V}_{t-1}^{t-1} A^\mathsf{T} + Q \qquad \Sigma_t = C \hat{V}_t^{t-1} C^\mathsf{T} + R$$

We differentiate $\ell_t$ to obtain gradient rules for $A$, $C$, $Q$, $R$. The size of the gradient step (learning rate) reflects our expectation about nonstationarity.
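A sketch of the per-step predictive log-likelihood $\ell_t$, reusing the filter's predicted moments (names are mine):

```python
import numpy as np

# l_t = log N(x_t; C z_t^{t-1}, C V_t^{t-1} C^T + R)
def step_log_likelihood(x, z_pred, V_pred, C, R):
    D = len(x)
    Sigma = C @ V_pred @ C.T + R        # predictive covariance of x_t
    resid = x - C @ z_pred              # innovation
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (D * np.log(2 * np.pi) + logdet
                   + resid @ np.linalg.solve(Sigma, resid))
```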

18. Learning HMMs using EM

[Figure: the HMM chain $s_1 \to s_2 \to \dots \to s_T$ emitting $x_1, x_2, \dots, x_T$.]

Parameters: $\theta = \{\pi, \Phi, A\}$

Free energy:

$$\mathcal{F}(q, \theta) = \sum_{s_{1:T}} q(s_{1:T}) \big( \log P(x_{1:T}, s_{1:T} \mid \theta) - \log q(s_{1:T}) \big)$$

E-step: Maximise $\mathcal{F}$ w.r.t. $q$ with $\theta$ fixed: $q^*(s_{1:T}) = P(s_{1:T} \mid x_{1:T}, \theta)$. We will only need the marginal probabilities $q(s_t, s_{t+1})$, which can also be obtained from the forward–backward algorithm.

M-step: Maximise $\mathcal{F}$ w.r.t. $\theta$ with $q$ fixed. We can re-estimate the parameters by computing the expected number of times the HMM was in state $i$, emitted symbol $k$, and transitioned to state $j$.

This is the Baum–Welch algorithm, and it predates the (more general) EM algorithm.


19. M step: Parameter updates are given by ratios of expected counts

We can derive the following updates by taking derivatives of $\mathcal{F}$ w.r.t. $\theta$.

◮ The initial state distribution is the expected number of times in state $i$ at $t = 1$:

$$\hat{\pi}_i = \gamma_1(i)$$

◮ The expected number of transitions from state $i$ to $j$ which begin at time $t$ is:

$$\xi_t(i \to j) \equiv P(s_t = i, s_{t+1} = j \mid x_{1:T}) = \alpha_t(i)\, \Phi_{ij}\, A_j(x_{t+1})\, \beta_{t+1}(j) \,/\, P(x_{1:T})$$

so the estimated transition probabilities are:

$$\hat{\Phi}_{ij} = \sum_{t=1}^{T-1} \xi_t(i \to j) \Big/ \sum_{t=1}^{T-1} \gamma_t(i)$$

◮ The output distributions are the expected number of times we observe a particular symbol in a particular state:

$$\hat{A}_{ik} = \sum_{t: x_t = k} \gamma_t(i) \Big/ \sum_{t=1}^{T} \gamma_t(i)$$

(or the state-probability-weighted mean and variance for a Gaussian output model).


20. HMM practicalities

◮ Numerical scaling: the conventional message definition is in terms of a large joint: $\alpha_t(i) = P(x_{1:t}, s_t = i) \to 0$ as $t$ grows, and so can easily underflow. Rescale:

$$\alpha_t(i) = A_i(x_t) \sum_{j=1}^{K} \tilde{\alpha}_{t-1}(j)\, \Phi_{ji} \qquad \rho_t = \sum_{i=1}^{K} \alpha_t(i) \qquad \tilde{\alpha}_t(i) = \alpha_t(i) / \rho_t$$

Exercise: show that

$$\rho_t = P(x_t \mid x_{1:t-1}, \theta) \qquad \prod_{t=1}^{T} \rho_t = P(x_{1:T} \mid \theta)$$

What does this make $\tilde{\alpha}_t(i)$?

◮ Multiple observed sequences: average numerators and denominators in the ratios of updates.

◮ Local optima (random restarts, annealing; see discussion later).

21. HMM pseudocode: inference (E step)

Forward–backward including scaling tricks. [$\circ$ is the element-by-element (Hadamard/Schur) product: ‘.*’ in matlab.]

for t = 1:T, i = 1:K:  p_t(i) = A_i(x_t)
α_1 = π ◦ p_1;  ρ_1 = Σ_{i=1}^K α_1(i);  α_1 = α_1 / ρ_1
for t = 2:T:  α_t = (Φ^T α_{t−1}) ◦ p_t;  ρ_t = Σ_{i=1}^K α_t(i);  α_t = α_t / ρ_t
β_T = 1
for t = T−1:1:  β_t = Φ (β_{t+1} ◦ p_{t+1}) / ρ_{t+1}
log P(x_{1:T}) = Σ_{t=1}^T log ρ_t
for t = 1:T:  γ_t = α_t ◦ β_t
for t = 1:T−1:  ξ_t = Φ ◦ (α_t (β_{t+1} ◦ p_{t+1})^T) / ρ_{t+1}

22. HMM pseudocode: parameter re-estimation (M step)

Baum–Welch parameter updates: for each sequence $l = 1:L$, run forward–backward to get $\gamma^{(l)}$ and $\xi^{(l)}$, then

$$\hat{\pi}_i = \frac{1}{L} \sum_{l=1}^{L} \gamma_1^{(l)}(i) \qquad \hat{\Phi}_{ij} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T^{(l)}-1} \xi_t^{(l)}(ij)}{\sum_{l=1}^{L} \sum_{t=1}^{T^{(l)}-1} \gamma_t^{(l)}(i)} \qquad \hat{A}_{ik} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T^{(l)}} \delta(x_t = k)\, \gamma_t^{(l)}(i)}{\sum_{l=1}^{L} \sum_{t=1}^{T^{(l)}} \gamma_t^{(l)}(i)}$$


23. Degeneracies

Recall that the FA likelihood is conserved with respect to orthogonal transformations of $z$:

$$P(z) = \mathcal{N}(0, I), \quad P(x \mid z) = \mathcal{N}(\Lambda z, \Psi); \qquad \tilde{z} = U z \ \ \&\ \ \tilde{\Lambda} = \Lambda U^\mathsf{T} \;\Rightarrow$$
$$P(\tilde{z}) = \mathcal{N}(U 0, U I U^\mathsf{T}) = \mathcal{N}(0, I), \qquad P(x \mid \tilde{z}) = \mathcal{N}(\tilde{\Lambda} \tilde{z}, \Psi) = \mathcal{N}(\Lambda U^\mathsf{T} U z, \Psi) = \mathcal{N}(\Lambda z, \Psi)$$

Similarly, a mixture model is invariant to permutations of the latent.

The LGSSM likelihood is conserved with respect to any invertible transform of the latent:

$$P(z_{t+1} \mid z_t) = \mathcal{N}(A z_t, Q) \qquad P(x_t \mid z_t) = \mathcal{N}(C z_t, R)$$
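The deck breaks off here; completing the statement in the same way as the FA case (this step is not spelled out on the slide): for any invertible matrix $S$, set

$$\tilde{z}_t = S z_t, \qquad \tilde{A} = S A S^{-1}, \qquad \tilde{Q} = S Q S^\mathsf{T}, \qquad \tilde{C} = C S^{-1}$$

(and $\tilde{\mu}_0 = S \mu_0$, $\tilde{Q}_0 = S Q_0 S^\mathsf{T}$). Then $P(\tilde{z}_{t+1} \mid \tilde{z}_t) = \mathcal{N}(\tilde{A} \tilde{z}_t, \tilde{Q})$ and $P(x_t \mid \tilde{z}_t) = \mathcal{N}(\tilde{C} \tilde{z}_t, R) = \mathcal{N}(C z_t, R)$, so the likelihood of $x_{1:T}$ is unchanged.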
