HMMs and SSMs (Linear Gaussian)

State space models are the continuous state analogue of hidden Markov models.

[Figure: chain graphical models for the HMM (discrete states s_1, …, s_T) and the linear-Gaussian SSM (continuous states y_1, …, y_T), each emitting observations x_1, …, x_T]

◮ A continuous vector state is a very powerful representation. For an HMM to communicate N bits of information about the past, it needs 2^N states! But a real-valued state vector can store an arbitrary number of bits in principle.

◮ Linear-Gaussian output/dynamics are very weak. The types of dynamics linear SSMs can capture are very limited. HMMs can in principle represent arbitrary stochastic dynamics and output mappings.
Many Extensions

◮ Constrained HMMs
◮ Continuous state models with discrete outputs for time series and static data
◮ Hierarchical models
◮ Hybrid systems ⇔ mixed continuous & discrete states, switching state-space models
Richer state representations

[Figure: factorial HMM / dynamic Bayesian network with several parallel state chains A_t, B_t, C_t, D_t evolving over time]

Factorial HMMs, Dynamic Bayesian Networks

◮ These are hidden Markov models with many state variables (i.e. a distributed representation of the state).
◮ The state can capture many more bits of information about the sequence (linear in the number of state variables).
Chain models: ML Learning with EM

[Figure: the SSM chain y_1 → y_2 → ⋯ → y_T with emissions x_t, alongside the HMM chain s_1 → s_2 → ⋯ → s_T with emissions x_t]

SSM:  y_1 ∼ N(μ_0, Q_0);   y_t | y_{t−1} ∼ N(A y_{t−1}, Q);   x_t | y_t ∼ N(C y_t, R)
HMM:  s_1 ∼ π;   s_t | s_{t−1} ∼ Φ_{s_{t−1}, ·};   x_t | s_t ∼ A_{s_t}

The structure of learning and inference for both models is dictated by the factored structure:

P(x_1, …, x_T, y_1, …, y_T) = P(y_1) ∏_{t=2}^T P(y_t | y_{t−1}) ∏_{t=1}^T P(x_t | y_t)

Learning (M-step):

argmax ⟨log P(x_1, …, x_T, y_1, …, y_T)⟩_{q(y_1, …, y_T)}
  = argmax [ ⟨log P(y_1)⟩_{q(y_1)} + ∑_{t=2}^T ⟨log P(y_t | y_{t−1})⟩_{q(y_t, y_{t−1})} + ∑_{t=1}^T ⟨log P(x_t | y_t)⟩_{q(y_t)} ]

So the expectations needed (in the E-step) are derived from singleton and pairwise marginals.
Chain models: Inference

Three general inference problems:

Filtering:   P(y_t | x_1, …, x_t)
Smoothing:   P(y_t | x_1, …, x_T)   (also P(y_t, y_{t−1} | x_1, …, x_T) for learning)
Prediction:  P(y_t | x_1, …, x_{t−Δt})

Naively, these marginal posteriors seem to require very large integrals (or sums):

P(y_t | x_1, …, x_t) = ∫ ⋯ ∫ dy_1 … dy_{t−1} P(y_1, …, y_t | x_1, …, x_t)

but again the factored structure of the distributions will help us. The algorithms rely on a form of temporal updating or message passing.
Crawling the HMM state-lattice

[Figure: the K × T lattice of HMM states (here K = 4, times s_1, …, s_6), with paths threading through it]

Consider an HMM, where we want to find P(s_t = k | x_1 … x_t):

P(s_t = k | x_1 … x_t) = ∑_{k_1, …, k_{t−1}} P(s_1 = k_1, …, s_t = k | x_1 … x_t)
  ∝ ∑_{k_1, …, k_{t−1}} π_{k_1} A_{k_1}(x_1) Φ_{k_1 k_2} A_{k_2}(x_2) ⋯ Φ_{k_{t−1} k} A_k(x_t)

Naïve algorithm:
◮ start a “bug” at each of the K states at t = 1 holding value 1
◮ move each bug forward in time: make copies of each bug to each subsequent state and multiply the value of each copy by transition prob. × output emission prob.
◮ repeat until all bugs have reached time t
◮ sum up values on all K^{t−1} bugs that reach state s_t = k (one bug per state path)

Clever recursion:
◮ at every step, replace the bugs at each node with a single bug carrying the sum of their values
Probability updating: “Bayesian filtering”

[Figure: the SSM chain y_1 → ⋯ → y_T with emissions x_1, …, x_T]

P(y_t | x_{1:t}) = ∫ P(y_t, y_{t−1} | x_t, x_{1:t−1}) dy_{t−1}
  = ∫ [ P(x_t, y_t, y_{t−1} | x_{1:t−1}) / P(x_t | x_{1:t−1}) ] dy_{t−1}
  ∝ ∫ P(x_t | y_t, y_{t−1}, x_{1:t−1}) P(y_t | y_{t−1}, x_{1:t−1}) P(y_{t−1} | x_{1:t−1}) dy_{t−1}
  = ∫ P(x_t | y_t) P(y_t | y_{t−1}) P(y_{t−1} | x_{1:t−1}) dy_{t−1}      [Markov property]

This is a forward recursion based on Bayes rule.
The HMM: Forward pass

The forward recursion for the HMM is a form of dynamic programming. Define:

α_t(i) = P(x_1, …, x_t, s_t = i | θ)

Then, much like the Bayesian filtering updates, we have:

α_1(i) = π_i A_i(x_1)        α_{t+1}(i) = [ ∑_{j=1}^K α_t(j) Φ_{ji} ] A_i(x_{t+1})

We’ve defined α_t(i) to be a joint rather than a posterior. It’s easy to obtain the posterior by normalisation:

P(s_t = i | x_1, …, x_t, θ) = α_t(i) / ∑_k α_t(k)

This form enables us to compute the likelihood for θ = {A, Φ, π} efficiently in O(TK²) time:

P(x_1 … x_T | θ) = ∑_{s_1, …, s_T} P(x_1, …, x_T, s_1, …, s_T | θ) = ∑_{k=1}^K α_T(k)

avoiding the exponential number of paths in the naïve sum (number of paths = K^T).
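For concreteness, here is a minimal NumPy sketch of this forward recursion, incorporating the rescaling trick described under “HMM practicalities” below. The function and array names (hmm_forward, px, rho) are ours, for illustration only.

import numpy as np

def hmm_forward(pi, Phi, px):
    """Scaled forward pass for a discrete-state HMM.

    pi  : (K,)   initial state distribution
    Phi : (K, K) transition matrix, Phi[j, i] = P(s_{t+1} = i | s_t = j)
    px  : (T, K) emission likelihoods, px[t, i] = A_i(x_t)

    Returns the normalised messages alpha~ (T, K), the scaling factors
    rho (T,), and the log likelihood log P(x_{1:T} | theta).
    """
    T, K = px.shape
    alpha = np.zeros((T, K))
    rho = np.zeros(T)
    a = pi * px[0]                        # alpha_1(i) = pi_i A_i(x_1)
    rho[0] = a.sum()
    alpha[0] = a / rho[0]
    for t in range(1, T):
        # alpha_t(i) = [sum_j alpha~_{t-1}(j) Phi_{ji}] A_i(x_t), then rescale
        a = (alpha[t - 1] @ Phi) * px[t]
        rho[t] = a.sum()
        alpha[t] = a / rho[t]             # = P(s_t = i | x_{1:t})
    return alpha, rho, np.log(rho).sum()

The cost is O(TK²) as claimed: one K × K matrix-vector product per time step, rather than a sum over K^T paths.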
The LGSSM: Kalman Filtering

y_1 ∼ N(μ_0, Q_0);   y_t | y_{t−1} ∼ N(A y_{t−1}, Q);   x_t | y_t ∼ N(C y_t, R)

For the SSM, the sums become integrals. Let ŷ_1^0 = μ_0 and V̂_1^0 = Q_0; then (cf. FA):

P(y_1 | x_1) = N( ŷ_1^0 + K_1 (x_1 − C ŷ_1^0), V̂_1^0 − K_1 C V̂_1^0 ) ≡ N( ŷ_1^1, V̂_1^1 )
with K_1 = V̂_1^0 Cᵀ (C V̂_1^0 Cᵀ + R)^{−1}

In general, we define ŷ_t^T ≡ E[y_t | x_1, …, x_T] and V̂_t^T ≡ V[y_t | x_1, …, x_T]. Then,

P(y_t | x_{1:t−1}) = ∫ dy_{t−1} P(y_t | y_{t−1}) P(y_{t−1} | x_{1:t−1}) = N( A ŷ_{t−1}^{t−1}, A V̂_{t−1}^{t−1} Aᵀ + Q ) ≡ N( ŷ_t^{t−1}, V̂_t^{t−1} )

P(y_t | x_{1:t}) = N( ŷ_t^{t−1} + K_t (x_t − C ŷ_t^{t−1}), V̂_t^{t−1} − K_t C V̂_t^{t−1} ) ≡ N( ŷ_t^t, V̂_t^t )

with the Kalman gain K_t = V̂_t^{t−1} Cᵀ (C V̂_t^{t−1} Cᵀ + R)^{−1}.
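A compact NumPy sketch of these filter recursions, assuming the model parameters are given as dense arrays; the name kalman_filter and the variable names are illustrative, not a reference implementation.

import numpy as np

def kalman_filter(x, A, C, Q, R, mu0, Q0):
    """Kalman filter sketch: returns filtered means y_filt[t] = yhat_t^t
    and covariances V_filt[t] = Vhat_t^t for t = 1..T."""
    T = x.shape[0]
    K = A.shape[0]
    y_filt = np.zeros((T, K))
    V_filt = np.zeros((T, K, K))
    y_pred, V_pred = mu0, Q0              # yhat_1^0 = mu_0, Vhat_1^0 = Q_0
    for t in range(T):
        S = C @ V_pred @ C.T + R                    # innovation covariance
        gain = V_pred @ C.T @ np.linalg.inv(S)      # Kalman gain K_t
        y_filt[t] = y_pred + gain @ (x[t] - C @ y_pred)
        V_filt[t] = V_pred - gain @ C @ V_pred
        y_pred = A @ y_filt[t]            # one-step prediction yhat_{t+1}^t
        V_pred = A @ V_filt[t] @ A.T + Q  # Vhat_{t+1}^t
    return y_filt, V_filt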
The marginal posterior: “Bayesian smoothing”

[Figure: the SSM chain y_1 → ⋯ → y_T with emissions x_1, …, x_T]

P(y_t | x_{1:T}) = P(y_t, x_{t+1:T} | x_{1:t}) / P(x_{t+1:T} | x_{1:t})
  = P(x_{t+1:T} | y_t) P(y_t | x_{1:t}) / P(x_{t+1:T} | x_{1:t})

The marginal combines a backward message with the forward message found by filtering.
The HMM: Forward–Backward Algorithm

State estimation: compute the marginal posterior distribution over the state at time t:

γ_t(i) ≡ P(s_t = i | x_{1:T}) = P(s_t = i, x_{1:t}) P(x_{t+1:T} | s_t = i) / P(x_{1:T}) = α_t(i) β_t(i) / ∑_j α_t(j) β_t(j)

where there is a simple backward recursion for

β_t(i) ≡ P(x_{t+1:T} | s_t = i) = ∑_{j=1}^K P(s_{t+1} = j, x_{t+1}, x_{t+2:T} | s_t = i)
  = ∑_{j=1}^K P(s_{t+1} = j | s_t = i) P(x_{t+1} | s_{t+1} = j) P(x_{t+2:T} | s_{t+1} = j) = ∑_{j=1}^K Φ_{ij} A_j(x_{t+1}) β_{t+1}(j)

α_t(i) gives the total inflow of probability to node (t, i); β_t(i) gives the total outflow of probability.

[Figure: the K × T state lattice, with forward and backward paths meeting at time t]

Bugs again: the bugs run forward from time 0 to t and backward from time T to t.
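The corresponding backward pass, sketched in NumPy to pair with the hmm_forward sketch above; it reuses the scaling factors ρ from the forward pass so that the scaled α and β multiply directly to give γ (names again illustrative).

import numpy as np

def hmm_backward(Phi, px, rho):
    """Scaled backward pass; rho are the scaling factors from the forward pass."""
    T, K = px.shape
    beta = np.zeros((T, K))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j Phi_{ij} A_j(x_{t+1}) beta_{t+1}(j), rescaled
        beta[t] = Phi @ (px[t + 1] * beta[t + 1]) / rho[t + 1]
    return beta

# With both passes scaled this way, gamma_t(i) = alpha~_t(i) * beta~_t(i)
# is exactly P(s_t = i | x_{1:T}), with no further normalisation needed.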
Viterbi decoding

◮ The numbers γ_t(i) computed by forward–backward give the marginal posterior distribution over states at each time.
◮ By choosing the state i*_t with the largest γ_t(i) at each time, we can make a “best” state path. This is the path with the maximum expected number of correct states.
◮ But it is not the single path with the highest probability of generating the data. In fact it may be a path of probability zero!
◮ To find the single best path, we use the Viterbi decoding algorithm, which is just Bellman’s dynamic programming algorithm applied to this problem. This is an inference algorithm which computes the most probable state sequence: argmax_{s_{1:T}} P(s_{1:T} | x_{1:T}, θ)
◮ The recursions look the same as forward–backward, except with max instead of ∑.
◮ Bugs once more: same trick, except at each step kill all bugs but the one with the highest value at the node.
◮ There is also a modified EM training based on the Viterbi decoder (assignment).
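A short NumPy sketch of Viterbi decoding, working in log space to avoid the underflow discussed under practicalities below; the names (viterbi, px, back) are ours.

import numpy as np

def viterbi(pi, Phi, px):
    """Most probable state path: argmax_{s_{1:T}} P(s_{1:T} | x_{1:T}).
    px[t, i] = A_i(x_t); all work is done in log space."""
    T, K = px.shape
    logPhi = np.log(Phi)
    delta = np.log(pi) + np.log(px[0])     # best log prob of paths ending in i
    back = np.zeros((T, K), dtype=int)     # argmax backpointers

    for t in range(1, T):
        scores = delta[:, None] + logPhi   # scores[j, i]: best path ...j -> i
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(px[t])   # max replaces the sum

    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):         # trace the surviving "bug" back
        path[t] = back[t + 1, path[t + 1]]
    return path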
The LGSSM: Kalman smoothing

[Figure: the SSM chain y_1 → ⋯ → y_T with emissions x_1, …, x_T]

We use a slightly different decomposition:

P(y_t | x_{1:T}) = ∫ P(y_t, y_{t+1} | x_{1:T}) dy_{t+1}
  = ∫ P(y_t | y_{t+1}, x_{1:T}) P(y_{t+1} | x_{1:T}) dy_{t+1}
  = ∫ P(y_t | y_{t+1}, x_{1:t}) P(y_{t+1} | x_{1:T}) dy_{t+1}      [Markov property]

This gives the additional backward recursion:

J_t = V̂_t^t Aᵀ (V̂_{t+1}^t)^{−1}
ŷ_t^T = ŷ_t^t + J_t ( ŷ_{t+1}^T − A ŷ_t^t )
V̂_t^T = V̂_t^t + J_t ( V̂_{t+1}^T − V̂_{t+1}^t ) J_tᵀ
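A NumPy sketch of this backward (Rauch–Tung–Striebel) recursion, assuming the filtered moments come from a forward pass such as the kalman_filter sketch above; names are illustrative.

import numpy as np

def kalman_smooth(A, Q, y_filt, V_filt):
    """RTS backward recursion: turns filtered moments (yhat_t^t, Vhat_t^t)
    into smoothed moments (yhat_t^T, Vhat_t^T)."""
    T, K = y_filt.shape
    y_sm = y_filt.copy()            # yhat_t^T (initialised at t = T)
    V_sm = V_filt.copy()            # Vhat_t^T
    for t in range(T - 2, -1, -1):
        V_pred = A @ V_filt[t] @ A.T + Q                 # Vhat_{t+1}^t
        J = V_filt[t] @ A.T @ np.linalg.inv(V_pred)      # smoother gain J_t
        y_sm[t] = y_filt[t] + J @ (y_sm[t + 1] - A @ y_filt[t])
        V_sm[t] = V_filt[t] + J @ (V_sm[t + 1] - V_pred) @ J.T
    return y_sm, V_sm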
ML Learning for SSMs using batch EM

[Figure: the SSM chain with transition matrices A between successive latents and emission matrices C to the observations]

Parameters: θ = {μ_0, Q_0, A, Q, C, R}

Free energy:

F(q, θ) = ∫ dy_{1:T} q(y_{1:T}) ( log P(x_{1:T}, y_{1:T} | θ) − log q(y_{1:T}) )

E-step: Maximise F w.r.t. q with θ fixed: q*(y) = p(y | x, θ). This can be achieved with a two-state extension of the Kalman smoother.

M-step: Maximise F w.r.t. θ with q fixed. This boils down to solving a few weighted least squares problems, since all the variables in

p(y, x | θ) = p(y_1) p(x_1 | y_1) ∏_{t=2}^T p(y_t | y_{t−1}) p(x_t | y_t)

form a multivariate Gaussian.
The M step for C

p(x_t | y_t) ∝ exp{ −½ (x_t − C y_t)ᵀ R^{−1} (x_t − C y_t) }  ⇒

C_new = argmax_C ∑_t ⟨ ln p(x_t | y_t) ⟩_q
  = argmax_C ∑_t ⟨ −½ (x_t − C y_t)ᵀ R^{−1} (x_t − C y_t) ⟩_q + const
  = argmax_C −½ ∑_t [ x_tᵀ R^{−1} x_t − 2 x_tᵀ R^{−1} C ⟨y_t⟩ + ⟨ y_tᵀ Cᵀ R^{−1} C y_t ⟩ ]
  = argmax_C −½ [ −2 Tr( C ∑_t ⟨y_t⟩ x_tᵀ R^{−1} ) + Tr( Cᵀ R^{−1} C ∑_t ⟨y_t y_tᵀ⟩ ) ]

Using ∂Tr[AB]/∂A = Bᵀ, we have

∂{·}/∂C = R^{−1} ∑_t x_t ⟨y_t⟩ᵀ − R^{−1} C ∑_t ⟨y_t y_tᵀ⟩

⇒ C_new = ( ∑_t x_t ⟨y_t⟩ᵀ ) ( ∑_t ⟨y_t y_tᵀ⟩ )^{−1}

Notice that this is exactly the same equation as in factor analysis and linear regression!
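As a sketch, this update is one line of NumPy given the E-step moments. The argument names are ours; the second moments ⟨y_t y_tᵀ⟩ = V̂_t^T + ŷ_t^T (ŷ_t^T)ᵀ are assumed to come from the smoother.

import numpy as np

def m_step_C(x, Ey, Eyy_sum):
    """M step for C: C_new = (sum_t x_t <y_t>^T)(sum_t <y_t y_t^T>)^{-1}.

    x       : (T, D) observations
    Ey      : (T, K) posterior means <y_t> from the smoother
    Eyy_sum : (K, K) sum over t of the second moments <y_t y_t^T>
    """
    return (x.T @ Ey) @ np.linalg.inv(Eyy_sum)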
The M step for A

p(y_{t+1} | y_t) ∝ exp{ −½ (y_{t+1} − A y_t)ᵀ Q^{−1} (y_{t+1} − A y_t) }  ⇒

A_new = argmax_A ∑_t ⟨ ln p(y_{t+1} | y_t) ⟩_q
  = argmax_A ∑_t ⟨ −½ (y_{t+1} − A y_t)ᵀ Q^{−1} (y_{t+1} − A y_t) ⟩_q + const
  = argmax_A −½ ∑_t [ ⟨ y_{t+1}ᵀ Q^{−1} y_{t+1} ⟩ − 2 ⟨ y_{t+1}ᵀ Q^{−1} A y_t ⟩ + ⟨ y_tᵀ Aᵀ Q^{−1} A y_t ⟩ ]
  = argmax_A −½ [ −2 Tr( A ∑_t ⟨y_t y_{t+1}ᵀ⟩ Q^{−1} ) + Tr( Aᵀ Q^{−1} A ∑_t ⟨y_t y_tᵀ⟩ ) ]

Using ∂Tr[AB]/∂A = Bᵀ, we have

∂{·}/∂A = Q^{−1} ∑_t ⟨y_{t+1} y_tᵀ⟩ − Q^{−1} A ∑_t ⟨y_t y_tᵀ⟩

⇒ A_new = ( ∑_t ⟨y_{t+1} y_tᵀ⟩ ) ( ∑_t ⟨y_t y_tᵀ⟩ )^{−1}

This is still analogous to factor analysis and linear regression, but with expected correlations.
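The analogous one-liner for A, assuming the pairwise cross-moments ⟨y_{t+1} y_tᵀ⟩ are available from the two-slice marginals of the smoother (names again illustrative).

import numpy as np

def m_step_A(Ecross_sum, Eyy_sum):
    """M step for A: A_new = (sum_t <y_{t+1} y_t^T>)(sum_t <y_t y_t^T>)^{-1}.

    Ecross_sum : (K, K) sum over t = 1..T-1 of pairwise moments <y_{t+1} y_t^T>
    Eyy_sum    : (K, K) sum over t = 1..T-1 of <y_t y_t^T>
    """
    return Ecross_sum @ np.linalg.inv(Eyy_sum)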
Learning (online gradient)

Time series data must often be processed in real time, and we may want to update parameters online as observations arrive. We can do so by updating a local version of the likelihood based on the Kalman filter estimates. Consider the log likelihood contributed by each data point:

ℓ = ∑_{t=1}^T ln p(x_t | x_1, …, x_{t−1}) = ∑_{t=1}^T ℓ_t

Then,

ℓ_t = −(D/2) ln 2π − ½ ln |Σ| − ½ (x_t − C ŷ_t^{t−1})ᵀ Σ^{−1} (x_t − C ŷ_t^{t−1})

where D is the dimension of x, and:

ŷ_t^{t−1} = A ŷ_{t−1}^{t−1}
V̂_t^{t−1} = A V̂_{t−1}^{t−1} Aᵀ + Q
Σ = C V̂_t^{t−1} Cᵀ + R

We differentiate ℓ_t to obtain gradient rules for A, C, Q, R. The size of the gradient step (learning rate) reflects our expectation about nonstationarity.
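A sketch of the per-step log likelihood ℓ_t computed from the one-step-ahead Kalman prediction; in an online scheme this is the quantity whose gradient with respect to A, C, Q, R drives the parameter updates. The function name and signature are ours.

import numpy as np

def innovation_loglik(x_t, y_pred, V_pred, C, R):
    """l_t = log N(x_t; C yhat_t^{t-1}, C Vhat_t^{t-1} C^T + R)."""
    D = x_t.shape[0]
    S = C @ V_pred @ C.T + R                 # innovation covariance Sigma
    e = x_t - C @ y_pred                     # innovation (prediction error)
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + e @ np.linalg.solve(S, e))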
Learning HMMs using EM

[Figure: the HMM chain s_1 → ⋯ → s_T with transition matrix T on the edges and output matrices A to the observations x_t]

Parameters: θ = {π, Φ, A}

Free energy:

F(q, θ) = ∑_{s_{1:T}} q(s_{1:T}) ( log P(x_{1:T}, s_{1:T} | θ) − log q(s_{1:T}) )

E-step: Maximise F w.r.t. q with θ fixed: q*(s_{1:T}) = P(s_{1:T} | x_{1:T}, θ). We will only need the marginal probabilities q(s_t, s_{t+1}), which can also be obtained from the forward–backward algorithm.

M-step: Maximise F w.r.t. θ with q fixed. We can re-estimate the parameters by computing the expected number of times the HMM was in state i, emitted symbol k and transitioned to state j. This is the Baum–Welch algorithm and it predates the (more general) EM algorithm.
M step: Parameter updates are given by ratios of expected counts

We can derive the following updates by taking derivatives of F w.r.t. θ.

◮ The initial state distribution is the expected number of times in state i at t = 1:

π̂_i = γ_1(i)

◮ The expected number of transitions from state i to j which begin at time t is:

ξ_t(i → j) ≡ P(s_t = i, s_{t+1} = j | x_{1:T}) = α_t(i) Φ_{ij} A_j(x_{t+1}) β_{t+1}(j) / P(x_{1:T})

so the estimated transition probabilities are:

Φ̂_{ij} = ∑_{t=1}^{T−1} ξ_t(i → j) / ∑_{t=1}^{T−1} γ_t(i)

◮ The output distributions are the expected number of times we observe a particular symbol in a particular state:

Â_{ik} = ∑_{t: x_t = k} γ_t(i) / ∑_{t=1}^T γ_t(i)

(or the state-probability-weighted mean and variance for a Gaussian output model).
HMM practicalities

◮ Numerical scaling: the conventional message definition is in terms of a large joint: α_t(i) = P(x_{1:t}, s_t = i) → 0 as t grows, and so can easily underflow. Rescale:

α_t(i) = A_i(x_t) ∑_j α̃_{t−1}(j) Φ_{ji}        ρ_t = ∑_{i=1}^K α_t(i)        α̃_t(i) = α_t(i) / ρ_t

Exercise: show that

ρ_t = P(x_t | x_{1:t−1}, θ)        ∏_{t=1}^T ρ_t = P(x_{1:T} | θ)

What does this make α̃_t(i)?

◮ Multiple observed sequences: average numerators and denominators in the ratios of updates.

◮ Local optima (random restarts, annealing; see discussion later).
HMM pseudocode: inference (E step)

Forward–backward including scaling tricks. [◦ is the element-by-element (Hadamard/Schur) product: ‘.*’ in matlab.]

for t = 1:T, i = 1:K
    p_t(i) = A_i(x_t)
α_1 = π ◦ p_1;   ρ_1 = ∑_{i=1}^K α_1(i);   α_1 = α_1 / ρ_1
for t = 2:T
    α_t = (Φᵀ α_{t−1}) ◦ p_t;   ρ_t = ∑_{i=1}^K α_t(i);   α_t = α_t / ρ_t
β_T = 1
for t = T−1:1
    β_t = Φ (β_{t+1} ◦ p_{t+1}) / ρ_{t+1}
log P(x_{1:T}) = ∑_{t=1}^T log ρ_t
for t = 1:T
    γ_t = α_t ◦ β_t
for t = 1:T−1
    ξ_t = Φ ◦ (α_t (β_{t+1} ◦ p_{t+1})ᵀ) / ρ_{t+1}
HMM pseudocode: parameter re-estimation (M step)

Baum–Welch parameter updates: for each sequence l = 1:L, run forward–backward to get γ^(l) and ξ^(l), then

π_i = (1/L) ∑_{l=1}^L γ_1^(l)(i)

Φ_ij = ∑_{l=1}^L ∑_{t=1}^{T^(l)−1} ξ_t^(l)(ij)  /  ∑_{l=1}^L ∑_{t=1}^{T^(l)−1} γ_t^(l)(i)

A_ik = ∑_{l=1}^L ∑_{t=1}^{T^(l)} δ(x_t = k) γ_t^(l)(i)  /  ∑_{l=1}^L ∑_{t=1}^{T^(l)} γ_t^(l)(i)
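A NumPy sketch of these updates, assuming the per-sequence posteriors γ^(l) and ξ^(l) have already been computed by a forward–backward pass; the function signature is ours.

import numpy as np

def baum_welch_m_step(gammas, xis, xs, K, M):
    """Baum-Welch M step from per-sequence posteriors.

    gammas : list of (T_l, K) arrays,      gammas[l][t, i] = gamma_t^(l)(i)
    xis    : list of (T_l-1, K, K) arrays, xis[l][t, i, j] = xi_t^(l)(i -> j)
    xs     : list of (T_l,) int arrays of observed symbols in 0..M-1
    """
    # initial distribution: average of gamma_1 over sequences
    pi = np.mean([g[0] for g in gammas], axis=0)
    # transitions: expected counts i -> j over expected occupancy of i
    Phi = sum(xi.sum(axis=0) for xi in xis)
    Phi = Phi / sum(g[:-1].sum(axis=0) for g in gammas)[:, None]
    # outputs: expected symbol counts per state, then normalise rows
    A = np.zeros((K, M))
    for g, xseq in zip(gammas, xs):
        for k in range(M):
            A[:, k] += g[xseq == k].sum(axis=0)
    A = A / A.sum(axis=1, keepdims=True)   # row sums equal sum_t gamma_t(i)
    return pi, Phi, A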
Degeneracies

Recall that the FA likelihood is conserved with respect to orthogonal transformations of y:

P(y) = N(0, I)
P(x | y) = N(Λ y, Ψ)

With ỹ = U y and Λ̃ = Λ Uᵀ:

P(ỹ) = N(U 0, U I Uᵀ) = N(0, I)
P(x | ỹ) = N(Λ̃ ỹ, Ψ) = N(Λ Uᵀ U y, Ψ) = N(Λ y, Ψ)

Similarly, a mixture model is invariant to permutations of the latent.

The LGSSM likelihood is conserved with respect to any invertible transform of the latent:

P(y_{t+1} | y_t) = N(A y_t, Q)
P(x_t | y_t) = N(C y_t, R)

With ỹ = G y, Ã = G A G^{−1}, Q̃ = G Q Gᵀ, and C̃ = C G^{−1}:

P(ỹ_{t+1} | ỹ_t) = N(G A G^{−1} G y_t, G Q Gᵀ) = N(Ã ỹ_t, Q̃)
P(x_t | ỹ_t) = N(C G^{−1} G y_t, R) = N(C̃ ỹ_t, R)
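A quick numerical check of this LGSSM degeneracy (our own illustration, not from the notes): for exact invariance the initial moments must also be transformed (μ̃_0 = G μ_0, Q̃_0 = G Q_0 Gᵀ), which the argument above leaves implicit. With μ_0 = 0, the marginal covariances of x_t match under the two parameterisations; all parameter values here are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
K, D, T = 3, 2, 5
A = 0.9 * np.eye(K)
Q = np.eye(K)
C = rng.standard_normal((D, K))
R = np.eye(D)
Q0 = np.eye(K)
G = rng.standard_normal((K, K))            # any invertible transform

def obs_covariances(A, C, Q, R, Q0):
    """Marginal covariance of x_t for t = 1..T under the LGSSM."""
    P, out = Q0, []
    for _ in range(T):
        out.append(C @ P @ C.T + R)        # Cov(x_t) = C Vhat_t C^T + R
        P = A @ P @ A.T + Q                # propagate the state covariance
    return out

Gi = np.linalg.inv(G)
orig = obs_covariances(A, C, Q, R, Q0)
tran = obs_covariances(G @ A @ Gi, C @ Gi, G @ Q @ G.T, R, G @ Q0 @ G.T)
assert all(np.allclose(a, b) for a, b in zip(orig, tran))   # same law for x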