COMS 4721: Machine Learning for Data Science
Lecture 22, 4/18/2017

Prof. John Paisley
Department of Electrical Engineering & Data Science Institute
Columbia University
MARKOV MODELS

(figure: a chain s_1 → s_2 → s_3 → s_4)

The sequence (s_1, s_2, s_3, ...) has the Markov property if, for all t,

$$ p(s_t \mid s_{t-1}, \ldots, s_1) = p(s_t \mid s_{t-1}). $$

Our first encounter with Markov models assumed a finite state space, meaning we can define an indexing such that s ∈ {1, ..., S}. This allowed us to represent the transition probabilities in a matrix,

$$ A_{ij} \;\Leftrightarrow\; p(s_t = j \mid s_{t-1} = i). $$
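As a quick refresher on the finite-state case, here is a minimal sketch of sampling from a discrete Markov chain. The 3-state matrix A below is a made-up example, not one from the lecture:

```python
import numpy as np

# Hypothetical 3-state transition matrix: row i holds p(s_t = j | s_{t-1} = i),
# so each row sums to 1.
A = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.3, 0.3, 0.4]])

rng = np.random.default_rng(0)
s = 0                          # arbitrary initial state
chain = [s]
for _ in range(10):
    s = rng.choice(3, p=A[s])  # draw the next state from row s of A
    chain.append(int(s))
```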
HIDDEN MARKOV MODELS

(figure: a latent chain s_1 → s_2 → ... → s_{n-1} → s_n → s_{n+1}, with each s_t emitting an observation x_t)

The hidden Markov model modified this by assuming the sequence of states is a latent process (i.e., unobserved). An observation x_t is associated with each s_t, where x_t | s_t ~ p(x | θ_{s_t}).

Like a mixture model, this allowed a few distributions to generate the data. It adds an extra transition rule between distributions.
DISCRETE STATE SPACES

(figures: a three-state transition diagram with probabilities A_ij between states k = 1, 2, 3, and a plot of a noisy trajectory over those three states in the unit square)

In both cases, the state space was discrete and relatively small in number.

◮ For the Markov chain, we gave an example where states correspond to positions in R^d.
◮ A continuous hidden Markov model might perturb the latent state of the Markov chain.
◮ For example, each s_i can be modified by continuous-valued noise, x_i = s_i + ε_i.
◮ But s_{1:T} is still a discrete Markov chain.
DISCRETE VS CONTINUOUS STATE SPACES

Markov and hidden Markov models both assume a discrete state space.

For Markov models:
◮ The state could be a data point x_i (Markov chain classifier)
◮ The state could be an object (object ranking)
◮ The state could be the destination of a link (internet search engines)

For hidden Markov models we can simplify complex data:
◮ Sequences of discrete data may come from a few discrete distributions.
◮ Sequences of continuous data may come from a few continuous distributions.

What if we model the states as continuous too?
CONTINUOUS-STATE MARKOV MODEL

Continuous-state Markov models extend the state space to a continuous domain. Instead of s ∈ {1, ..., S}, s can take any value in R^d.

Again compare:
◮ Discrete-state Markov models: the states live in a discrete space.
◮ Continuous-state Markov models: the states live in a continuous space.

The simplest example is the process

$$ s_t = s_{t-1} + \epsilon_t, \qquad \epsilon_t \sim N(0, aI). $$

Each successive state is a perturbed version of the current state.
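A minimal simulation of this random-walk process (the dimension d, noise variance a, and length T below are arbitrary choices for illustration):

```python
import numpy as np

a, d, T = 0.1, 2, 100          # noise variance, state dimension, chain length
rng = np.random.default_rng(0)
s = np.zeros((T, d))
for t in range(1, T):
    # s_t = s_{t-1} + eps_t, with eps_t ~ N(0, aI)
    s[t] = s[t - 1] + rng.normal(0.0, np.sqrt(a), size=d)
```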
LINEAR GAUSSIAN MARKOV MODEL

The most basic continuous-state version of the hidden Markov model is called a linear Gaussian Markov model (also called the Kalman filter).

$$ \underbrace{s_t = C s_{t-1} + \epsilon_{t-1}}_{\text{latent process}}, \qquad \underbrace{x_t = D s_t + \varepsilon_t}_{\text{observed process}} $$

◮ s_t ∈ R^p is a continuous-state latent (unobserved) Markov process
◮ x_t ∈ R^d is a continuous-valued observation
◮ The process noise ε_t ~ N(0, Q)
◮ The measurement noise ε_t ~ N(0, V)
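To make the generative story concrete, here is a sketch that simulates a short sequence from this model. The particular C, D, Q, and V are placeholder values chosen for illustration, not from the lecture:

```python
import numpy as np

p, d, T = 2, 2, 50
C = np.array([[1.0, 0.5],
              [0.0, 1.0]])   # latent dynamics (a toy position + velocity model)
D = np.eye(d)                # measurement matrix
Q = 0.01 * np.eye(p)         # process noise covariance
V = 0.10 * np.eye(d)         # measurement noise covariance

rng = np.random.default_rng(1)
s = np.zeros(p)
states, observations = [], []
for t in range(T):
    s = C @ s + rng.multivariate_normal(np.zeros(p), Q)  # s_t = C s_{t-1} + eps_{t-1}
    x = D @ s + rng.multivariate_normal(np.zeros(d), V)  # x_t = D s_t + eps_t
    states.append(s)
    observations.append(x)
```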
EXAMPLE APPLICATIONS

(figure: the same graphical model, a latent chain s_1, ..., s_{n+1} emitting observations x_1, ..., x_{n+1})

Difference from the HMM: s_t and x_t are both drawn from continuous distributions.

The linear Gaussian Markov model (and its variants) has many applications:
◮ Tracking moving objects
◮ Automatic control systems
◮ Economics and finance (e.g., stock modeling)
◮ etc.
EXAMPLE: TRACKING

We get (very) noisy measurements of an object's position in time, x_t ∈ R^2. The time-varying state vector is s = [pos_1 vel_1 accel_1 pos_2 vel_2 accel_2]^T.

Motivated by the underlying physics, we model this as:

$$
s_{t+1} = \underbrace{\begin{bmatrix}
1 & \Delta t & \frac{1}{2}(\Delta t)^2 & 0 & 0 & 0 \\
0 & 1 & \Delta t & 0 & 0 & 0 \\
0 & 0 & e^{-\alpha \Delta t} & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & \Delta t & \frac{1}{2}(\Delta t)^2 \\
0 & 0 & 0 & 0 & 1 & \Delta t \\
0 & 0 & 0 & 0 & 0 & e^{-\alpha \Delta t}
\end{bmatrix}}_{\equiv\, C} s_t + \epsilon_t
$$

$$
x_{t+1} = \underbrace{\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0
\end{bmatrix}}_{\equiv\, D} s_{t+1} + \varepsilon_{t+1}
$$

Therefore, s_t not only approximates where the target is, but where it's going.
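These two matrices are easy to assemble in code. A sketch (Δt and α are left as user-supplied constants; the function name is ours):

```python
import numpy as np

def tracking_matrices(dt, alpha):
    """Build C and D for the constant-acceleration tracking model above.

    State ordering: [pos1, vel1, accel1, pos2, vel2, accel2]."""
    block = np.array([[1.0, dt,  0.5 * dt**2],
                      [0.0, 1.0, dt],
                      [0.0, 0.0, np.exp(-alpha * dt)]])
    C = np.zeros((6, 6))
    C[:3, :3] = block   # dynamics of coordinate 1
    C[3:, 3:] = block   # dynamics of coordinate 2
    D = np.zeros((2, 6))
    D[0, 0] = 1.0       # observe position in coordinate 1
    D[1, 3] = 1.0       # observe position in coordinate 2
    return C, D
```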
EXAMPLE: TRACKING

(figure: tracking results for this model)
THE LEARNING PROBLEM

As with the hidden Markov model, we're given the sequence (x_1, x_2, x_3, ...), where each x ∈ R^d. The goal is to learn the state sequence (s_1, s_2, s_3, ...).

All distributions are Gaussian,

$$ p(s_{t+1} = s \mid s_t) = N(C s_t, Q), \qquad p(x_t = x \mid s_t) = N(D s_t, V). $$

Notice that with the discrete HMM we wanted to learn π, A and B, where
◮ π is the initial state distribution
◮ A is the transition matrix among the discrete set of states
◮ B contains the state-dependent distributions on discrete-valued data

The situation here is very different.
THE LEARNING PROBLEM

No "B" to learn: In the linear Gaussian Markov model, each state is unique, and so the distribution on x_t is different for each t.

No "A" to learn: In addition, each state transition is to a brand new state, so each s_t has its own unique probability distribution.

What we can learn are the two posterior distributions:

1. p(s_t | x_1, ..., x_t): a distribution on the current state given the past.
2. p(s_t | x_1, ..., x_T): a distribution on each latent state in the sequence.

◮ #1: the Kalman filtering problem. We'll focus on this one today.
◮ #2: the Kalman smoothing problem. It requires an extra step (not discussed).
THE KALMAN FILTER

Goal: Learn the sequence of distributions p(s_t | x_1, ..., x_t) given a sequence of data (x_1, x_2, x_3, ...) and the model

$$ s_{t+1} \mid s_t \sim N(C s_t, Q), \qquad x_t \mid s_t \sim N(D s_t, V). $$

This is the (linear) Kalman filtering problem and is often used for tracking.

Setup: We can use Bayes rule to write

$$ p(s_t \mid x_1, \ldots, x_t) \propto p(x_t \mid s_t)\, p(s_t \mid x_1, \ldots, x_{t-1}) $$

and represent the prior as a marginal distribution

$$ p(s_t \mid x_1, \ldots, x_{t-1}) = \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid x_1, \ldots, x_{t-1})\, ds_{t-1}. $$
THE KALMAN FILTER

We've decomposed the problem into parts that we do and don't know (yet):

$$ p(s_t \mid x_1, \ldots, x_t) \propto \underbrace{p(x_t \mid s_t)}_{N(D s_t,\, V)} \int \underbrace{p(s_t \mid s_{t-1})}_{N(C s_{t-1},\, Q)}\, \underbrace{p(s_{t-1} \mid x_1, \ldots, x_{t-1})}_{?}\, ds_{t-1} $$

Observations and considerations:
1. The left side is the posterior on s_t and the right side has the posterior on s_{t-1}.
2. We want the integral to be in closed form and a known distribution.
3. We want the prior and likelihood terms to lead to a known posterior.
4. We want future calculations, e.g. for s_{t+1}, to be easy.

We will see how choosing the Gaussian distribution makes this all work.
THE KALMAN FILTER: STEP 1

Calculate the marginal for the prior distribution

Hypothesize (temporarily) that the unknown distribution is Gaussian,

$$ p(s_t \mid x_1, \ldots, x_t) \propto \underbrace{p(x_t \mid s_t)}_{N(D s_t,\, V)} \int \underbrace{p(s_t \mid s_{t-1})}_{N(C s_{t-1},\, Q)}\, \underbrace{p(s_{t-1} \mid x_1, \ldots, x_{t-1})}_{N(\mu, \Sigma) \text{ by hypothesis}}\, ds_{t-1}. $$

A property of the Gaussian is that marginals are still Gaussian,

$$ \int N(s_t \mid C s_{t-1}, Q)\, N(s_{t-1} \mid \mu, \Sigma)\, ds_{t-1} = N(s_t \mid C\mu,\; Q + C \Sigma C^T). $$

We know C and Q (by design) and μ and Σ (by hypothesis).
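In code, this prediction step is one line of linear algebra. A sketch, assuming mu and Sigma hold the previous state's posterior parameters:

```python
import numpy as np

def predict(mu, Sigma, C, Q):
    """Marginalize out s_{t-1}: returns the parameters of
    p(s_t | x_1, ..., x_{t-1}) = N(C mu, Q + C Sigma C^T)."""
    return C @ mu, Q + C @ Sigma @ C.T
```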
THE KALMAN FILTER: STEP 2

Calculate the posterior

We plug in the marginal distribution for the prior and see that

$$ p(s_t \mid x_1, \ldots, x_t) \propto N(x_t \mid D s_t, V)\, N(s_t \mid C\mu,\; Q + C \Sigma C^T). $$

Though the parameters look complicated, the posterior is just a Gaussian,

$$ p(s_t \mid x_1, \ldots, x_t) = N(s_t \mid \mu', \Sigma') $$
$$ \Sigma' = \left( (Q + C \Sigma C^T)^{-1} + D^T V^{-1} D \right)^{-1} $$
$$ \mu' = \Sigma' \left( D^T V^{-1} x_t + (Q + C \Sigma C^T)^{-1} C \mu \right) $$

We can plug the relevant values into these two equations.
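A direct transcription of these two equations (a sketch; it takes the prior parameters from the predict step above, i.e., mu_prior = Cμ and Sigma_prior = Q + CΣC^T):

```python
import numpy as np

def update(mu_prior, Sigma_prior, x, D, V):
    """Condition on x_t: returns mu', Sigma' of p(s_t | x_1, ..., x_t)."""
    P_inv = np.linalg.inv(Sigma_prior)   # (Q + C Sigma C^T)^{-1}
    V_inv = np.linalg.inv(V)
    Sigma_new = np.linalg.inv(P_inv + D.T @ V_inv @ D)
    mu_new = Sigma_new @ (D.T @ V_inv @ x + P_inv @ mu_prior)
    return mu_new, Sigma_new
```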
ADDRESSING THE GAUSSIAN ASSUMPTION

By making the assumption of a Gaussian in the prior,

$$ p(s_t \mid x_1, \ldots, x_t) \propto \underbrace{p(x_t \mid s_t)}_{N(x_t \mid D s_t, V)} \int \underbrace{p(s_t \mid s_{t-1})}_{N(s_t \mid C s_{t-1}, Q)}\, \underbrace{p(s_{t-1} \mid x_1, \ldots, x_{t-1})}_{N(\mu, \Sigma) \text{ by hypothesis}}\, ds_{t-1}, $$

we found that the posterior is also Gaussian with a new mean and covariance.

◮ We therefore only need to define a Gaussian prior on the first state to keep things moving forward. For example, s_0 ∼ N(0, I). Once this is done, all future calculations are in closed form.
KALMAN FILTER: ONE FINAL QUANTITY

Making predictions

We know how to update the sequence of state posterior distributions p(s_t | x_1, ..., x_t). What about predicting x_{t+1}?

$$ p(x_{t+1} \mid x_1, \ldots, x_t) = \int p(x_{t+1} \mid s_{t+1})\, p(s_{t+1} \mid x_1, \ldots, x_t)\, ds_{t+1} $$
$$ = \int \underbrace{p(x_{t+1} \mid s_{t+1})}_{N(x_{t+1} \mid D s_{t+1}, V)} \int \underbrace{p(s_{t+1} \mid s_t)}_{N(s_{t+1} \mid C s_t, Q)}\, \underbrace{p(s_t \mid x_1, \ldots, x_t)}_{N(s_t \mid \mu', \Sigma')}\, ds_t\, ds_{t+1} $$

Again, Gaussians are nice because these operations stay Gaussian. The result is a multivariate Gaussian whose parameters look even more complicated than the previous ones (omitted); simply perform the previous integral twice.
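The slide omits the algebra, but carrying out the two Gaussian marginalizations in order gives mean DCμ' and covariance V + D(Q + CΣ'C^T)D^T, which a sketch can compute directly:

```python
import numpy as np

def predict_observation(mu, Sigma, C, D, Q, V):
    """Predictive p(x_{t+1} | x_1, ..., x_t) given the current
    posterior N(mu, Sigma) on s_t; performs the integral twice."""
    m = C @ mu                      # mean of p(s_{t+1} | x_1, ..., x_t)
    P = Q + C @ Sigma @ C.T         # covariance of p(s_{t+1} | x_1, ..., x_t)
    return D @ m, V + D @ P @ D.T   # mean and covariance of x_{t+1}
```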
ALGORITHM: KALMAN FILTERING

The Kalman filtering algorithm can be run in real time.

0. Set the initial state distribution p(s_0) = N(0, I).
1. Prior to observing each new x_t ∈ R^d, predict x_t ∼ N(μ_{x_t}, Σ_{x_t}) (using the previously discussed marginalization).
2. After observing each new x_t ∈ R^d, update p(s_t | x_1, ..., x_t) = N(μ_{s_t}, Σ_{s_t}) (using the equations on the previous slide).
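Putting the pieces together, a minimal filtering loop might look like the following. It reuses the predict and update sketches from the earlier slides and defaults to the N(0, I) initialization of step 0:

```python
import numpy as np

def kalman_filter(observations, C, D, Q, V, mu0=None, Sigma0=None):
    """Run the real-time filtering recursion over a sequence of observations,
    returning the posterior (mu, Sigma) of p(s_t | x_1, ..., x_t) at each t."""
    p = C.shape[0]
    mu = np.zeros(p) if mu0 is None else mu0
    Sigma = np.eye(p) if Sigma0 is None else Sigma0
    posteriors = []
    for x in observations:
        mu_prior, Sigma_prior = predict(mu, Sigma, C, Q)    # step 1 (marginalize)
        mu, Sigma = update(mu_prior, Sigma_prior, x, D, V)  # step 2 (condition)
        posteriors.append((mu, Sigma))
    return posteriors
```

For the tracking example, C and D would come from the tracking_matrices sketch above, with Q and V chosen to reflect the process and sensor noise.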