Particle Filters: Convergence Results and High Dimensions
Mark Coates (mark.coates@mcgill.ca)
McGill University, Department of Electrical and Computer Engineering
Montreal, Quebec, Canada
Bellairs 2012
Outline
1. Introduction to Sequential Monte Carlo Methods
2. Convergence Results
3. High Dimensionality
References
Crisan, D. and Doucet, A. (2002). A survey of convergence results on particle filtering methods for practitioners. IEEE Trans. Signal Processing, 50(3):736-746, Mar. 2002.
Beskos, A., Crisan, D., and Jasra, A. (2011). On the stability of sequential Monte Carlo methods in high dimensions. Technical Report, Imperial College London.
Snyder, C., Bengtsson, T., Bickel, P., and Anderson, J. (2008). Obstacles to high-dimensional particle filtering. Month. Weather Rev., 136:4629-4640.
Bengtsson, T., Bickel, P., and Li, B. (2008). Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. In Essays in Honor of David A. Freedman, D. Nolan and T. Speed, Eds., 316-334, IMS.
Quang, P.B., Musso, C., and Le Gland, F. (2011). An insight into the issue of dimensionality in particle filtering. Proc. ISIF Int. Conf. Information Fusion, Edinburgh, Scotland.
Discrete-time Filtering
Fixed observations y_1, ..., y_n with y_k ∈ R^{d_y}.
Hidden Markov chain X_0, ..., X_n with X_k ∈ R^d.
Initial distribution X_0 ~ µ(dx_0).
Probability transition kernel K(dx_t | x_{t-1}) such that:
    Pr(X_t ∈ A | X_{t-1} = x_{t-1}) = ∫_A K(dx_t | x_{t-1})    (1)
Observations are conditionally independent given the signal X and have marginal distribution:
    Pr(Y_t ∈ B | X_t = x_t) = ∫_B g(dy_t | x_t)    (2)
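To make the setup concrete, here is a minimal sketch of the two model ingredients, the transition kernel K and the likelihood g, in Python. The 1-D linear-Gaussian model and the parameter names (a, sigma_x, sigma_y) are hypothetical choices for illustration only; the slides do not fix a particular model.

```python
import numpy as np

# Hypothetical 1-D linear-Gaussian state-space model, used only for illustration:
#   X_t = a * X_{t-1} + sigma_x * V_t,  V_t ~ N(0, 1)   (transition kernel K)
#   Y_t = X_t + sigma_y * W_t,          W_t ~ N(0, 1)   (likelihood g)
a, sigma_x, sigma_y = 0.9, 1.0, 0.5
rng = np.random.default_rng(0)

def sample_K(x_prev):
    """Draw X_t ~ K(. | x_prev); works for a scalar or an array of particles."""
    return a * x_prev + sigma_x * rng.standard_normal(np.shape(x_prev))

def g(y, x):
    """Likelihood g(y | x), evaluated pointwise."""
    return np.exp(-0.5 * ((y - x) / sigma_y) ** 2) / (sigma_y * np.sqrt(2.0 * np.pi))

# Example: propagate a small particle cloud one step and score an observation y_t = 0.4
particles = sample_K(rng.standard_normal(5))
print(particles, g(0.4, particles))
```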
Bayes’ Recursion
Paths of the signal and observation processes from time k to l:
    X_{k:l} = (X_k, X_{k+1}, ..., X_l);    Y_{k:l} = (Y_k, Y_{k+1}, ..., Y_l).
Define the probability distribution:
    π_{k:l|m}(dx_{k:l}) = P(X_{k:l} ∈ dx_{k:l} | Y_{1:m} = y_{1:m})
Bayes’ theorem leads to the following relationship:
    π_{0:t|t}(dx_{0:t}) ∝ µ(dx_0) ∏_{k=1}^{t} K(dx_k | x_{k-1}) g(y_k | x_k)    (3)
Bayes’ Recursion
Prediction:
    π_{0:t|t-1}(dx_{0:t}) = π_{0:t-1|t-1}(dx_{0:t-1}) K(dx_t | x_{t-1})
Update:
    π_{0:t|t}(dx_{0:t}) = π_{0:t|t-1}(dx_{0:t}) g(y_t | x_t) [ ∫ g(y_t | x_t) π_{0:t|t-1}(dx_{0:t}) ]^{-1}
Particle Filtering
Recursive algorithm. Produce a particle cloud whose empirical measure is close to π_{t|t}.
N particle paths {x_t^{(i)}}_{i=1}^{N}. Associated empirical measure:
    π^N_{t|t}(dx_t) = (1/N) ∑_{i=1}^{N} δ_{x_t^{(i)}}(dx_t)    (4)
Particle Filtering (Bootstrap Filter)
Initialization: Sample x_0^{(i)} ~ π_{0|0}(dx_0).
For t ≥ 1:
    Importance sampling: Sample x̃_t^{(i)} ~ π^N_{t-1|t-1} K(dx_t).
    Weight evaluation:
        w_t^{(i)} ∝ g(y_t | x̃_t^{(i)});    ∑_{i=1}^{N} w_t^{(i)} = 1    (5)
    Resample: Sample x_t^{(i)} ~ π̃^N_{t|t}(dx_t), where π̃^N_{t|t}(dx_t) = ∑_{i=1}^{N} w_t^{(i)} δ_{x̃_t^{(i)}}(dx_t).
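A minimal, self-contained sketch of the bootstrap filter above, again using a hypothetical 1-D linear-Gaussian model (the slides do not fix a model). The initial distribution π_{0|0} = N(0, 1), the function name bootstrap_filter and all parameters are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical model (for illustration only):
#   X_t = 0.9 X_{t-1} + V_t, V_t ~ N(0,1);  Y_t = X_t + 0.5 W_t, W_t ~ N(0,1).
a, sigma_x, sigma_y = 0.9, 1.0, 0.5

def bootstrap_filter(y, N=1000):
    """One pass of the bootstrap particle filter on observations y_1, ..., y_T."""
    T = len(y)
    x = rng.standard_normal(N)                       # x_0^(i) ~ pi_{0|0}, here assumed N(0,1)
    means = np.empty(T)
    for t in range(T):
        # Importance sampling: propagate particles through the transition kernel K
        x_tilde = a * x + sigma_x * rng.standard_normal(N)
        # Weight evaluation: w_t^(i) proportional to g(y_t | x_tilde^(i)), normalized to 1
        logw = -0.5 * ((y[t] - x_tilde) / sigma_y) ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means[t] = np.sum(w * x_tilde)               # estimate of E[X_t | y_{1:t}] before resampling
        # Resample: draw N particles from the weighted empirical measure
        x = rng.choice(x_tilde, size=N, p=w)
    return means

# Simulate a short trajectory and run the filter
T = 50
x_true = np.zeros(T)
for t in range(1, T):
    x_true[t] = a * x_true[t - 1] + sigma_x * rng.standard_normal()
y = x_true + sigma_y * rng.standard_normal(T)
print(bootstrap_filter(y)[:5])
```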
Drawbacks: Variation of Importance Weights
The distribution of the particles {x̃_t^{(i)}}_{i=1}^{N} is approximately π_{t|t-1} = π_{t-1|t-1} K.
The algorithm can be inefficient if this is “far” from π_{t|t}. Then the ratio
    π_{t|t}(dx_t) / π_{t|t-1}(dx_t) ∝ g(y_t | x_t)
can generate weights with high variance.
Drawbacks: Variation Induced by Resampling
The proposed resampling scheme generates N_t^{(i)} copies of the i-th particle.
These are drawn from a multinomial distribution, so:
    E(N_t^{(i)}) = N w_t^{(i)}
    var(N_t^{(i)}) = N w_t^{(i)} (1 - w_t^{(i)})
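A quick empirical check of these two identities for multinomial resampling; the Dirichlet-generated weights and the trial count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Check E[N_t^(i)] = N w_t^(i) and var(N_t^(i)) = N w_t^(i)(1 - w_t^(i))
# for multinomial resampling, using arbitrary normalized weights.
N = 1000
w = rng.dirichlet(np.ones(N))                  # weights summing to 1

trials = 2000
counts = rng.multinomial(N, w, size=trials)    # N_t^(i) for each independent trial

i = int(np.argmax(w))                          # inspect the largest-weight particle
print("E[N^(i)]   :", counts[:, i].mean(), "vs", N * w[i])
print("var(N^(i)) :", counts[:, i].var(), "vs", N * w[i] * (1 - w[i]))
```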
Sequential Importance Sampling/Resampling
Initialization: Sample x_0^{(i)} ~ π_{0|0}(dx_0).
For t ≥ 1:
    Importance sampling: Sample x̃_t^{(i)} ~ π^N_{t-1|t-1} K̃(dx_t).
    Weight evaluation:
        w_t^{(i)} ∝ K(dx̃_t^{(i)} | x_{t-1}^{(i)}) g(y_t | x̃_t^{(i)}) / K̃(dx̃_t^{(i)} | x_{t-1}^{(i)});    ∑_{i=1}^{N} w_t^{(i)} = 1    (6)
    Resample: Sample x_t^{(i)} ~ π̃^N_{t|t}(dx_t).
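A sketch of one SIS/R step computing the weights of equation (6) with a y_t-aware proposal kernel K̃. The model and the specific choice of K̃ (the locally optimal Gaussian proposal for a hypothetical 1-D linear-Gaussian model) are assumptions made for illustration; sisr_step and all parameter names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
a, sigma_x, sigma_y = 0.9, 1.0, 0.5            # hypothetical linear-Gaussian model

def norm_logpdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))

def sisr_step(x_prev, y_t, N):
    """One SIS/R step with a proposal K~ that looks at y_t (locally optimal for this model)."""
    # Proposal K~(dx_t | x_{t-1}, y_t): Gaussian combining prior dynamics and likelihood
    prop_var = 1.0 / (1.0 / sigma_x**2 + 1.0 / sigma_y**2)
    prop_mean = prop_var * (a * x_prev / sigma_x**2 + y_t / sigma_y**2)
    x_tilde = prop_mean + np.sqrt(prop_var) * rng.standard_normal(N)

    # Weights from equation (6): w proportional to K * g / K~
    logw = (norm_logpdf(x_tilde, a * x_prev, sigma_x)                # K(x_t | x_{t-1})
            + norm_logpdf(y_t, x_tilde, sigma_y)                     # g(y_t | x_t)
            - norm_logpdf(x_tilde, prop_mean, np.sqrt(prop_var)))    # K~(x_t | x_{t-1}, y_t)
    w = np.exp(logw - logw.max())
    w /= w.sum()

    # Resample from the weighted empirical measure
    return rng.choice(x_tilde, size=N, p=w), w

x_prev = rng.standard_normal(500)
x_new, w = sisr_step(x_prev, y_t=1.3, N=500)
print("effective sample size:", 1.0 / np.sum(w**2))
```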
Sequential Importance Sampling/Resampling
The algorithm is the same as the bootstrap filter with a new dynamic model:
    Pr(X_t ∈ A | X_{t-1} = x_{t-1}, Y_t = y_t) = ∫_A K̃(dx_t | x_{t-1}, y_t)
    Pr(Y_t ∈ B | X_{t-1} = x_{t-1}, X_t = x_t) = ∫_B w(x_{t-1}, x_t, dy_t)
This is only true if we assume the observations are fixed!
With this model, ρ_{0:t|t-1} ≠ π_{0:t|t-1} but ρ_{0:t|t} = π_{0:t|t}.
If K̃ has better mixing properties, or w(x_{t-1}, x_t, y_t) is a flatter likelihood, then the algorithm will perform better.
Almost Sure Convergence
Theorem. Assume that the transition kernel K is Feller and that the likelihood function g is bounded, continuous and strictly positive. Then lim_{N→∞} π^N_{t|t} = π_{t|t} almost surely.
Feller: for ϕ a continuous bounded function, Kϕ is also a continuous bounded function.
Intuition: we want two realizations of the signal that start from “close” positions to remain “close” at subsequent times.
Define (µ, ϕ) = ∫ ϕ dµ.
We write lim_{N→∞} µ^N = µ if lim_{N→∞} (µ^N, ϕ) = (µ, ϕ) for any continuous bounded function ϕ.
Proof Discussion
Let (E, d) be a metric space.
Let (a_t)_{t=1}^{∞} and (b_t)_{t=1}^{∞} be two sequences of continuous functions a_t, b_t : E → E.
Let k_t and k_{1:t} be defined:
    k_t = a_t ∘ b_t,    k_{1:t} = k_t ∘ k_{t-1} ∘ ··· ∘ k_1.    (7)
Perturb k_t and k_{1:t} using a function c^N:
    k^N_t = c^N ∘ a_t ∘ c^N ∘ b_t,    k^N_{1:t} = k^N_t ∘ k^N_{t-1} ∘ ··· ∘ k^N_1.    (8)
Assume that as N becomes larger, the perturbations become smaller; c^N converges to the identity function on E.
Does this mean that k^N_t and k^N_{1:t} converge?
Counterexample
Let E = [0, 1] and d(α, β) = |α − β|. Let a_t and b_t be equal to the identity i on E, so k_t is also the identity.
    c^N(α) = α + α/N,                            if α ∈ [0, 1/2]
             1 − (N − 1) |1/2 + 1/(2N) − α|,     if α ∈ (1/2, 1/2 + 1/N)
             α + (α − 1)/(N − 2),                if α ∈ [1/2 + 1/N, 1]
Now lim_{N→∞} c^N(α) = α for all α ∈ [0, 1].
But lim_{N→∞} k^N_1(1/2) = lim_{N→∞} c^N(1/2 + 1/(2N)) = 1 ≠ 1/2.
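A numerical check of this counterexample, using the piecewise form of c^N as reconstructed above: c^N(1/2) tends to 1/2 (pointwise convergence to the identity), but the composition k^N_1(1/2) = c^N(c^N(1/2)) stays at 1 for every N.

```python
import numpy as np

def c(alpha, N):
    """The perturbation c^N of the counterexample (piecewise form as reconstructed above)."""
    if alpha <= 0.5:
        return alpha + alpha / N
    if alpha < 0.5 + 1.0 / N:
        return 1.0 - (N - 1) * abs(0.5 + 1.0 / (2 * N) - alpha)
    return alpha + (alpha - 1.0) / (N - 2)

for N in (10, 100, 1000, 10000):
    pointwise = c(0.5, N)            # c^N(1/2) -> 1/2: pointwise convergence holds
    composed = c(c(0.5, N), N)       # k^N_1(1/2) = c^N(c^N(1/2)) stays at 1, not 1/2
    print(N, round(pointwise, 4), round(composed, 4))
```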
Proof Discussion
So successive small perturbations may not always lead to a small perturbation overall.
We need a stronger type of convergence for c^N: convergence in a uniform manner.
For all ε > 0 there exists N(ε) such that d(c^N(e), i(e)) < ε for all N ≥ N(ε) and all e ∈ E.
This implies that lim_{N→∞} e^N = e ⇒ lim_{N→∞} c^N(e^N) = e.
Then lim_{N→∞} k^N_t = k_t and lim_{N→∞} k^N_{1:t} = k_{1:t}.
Filtering Application
E = P(R^d): the set of probability measures over R^d, endowed with the topology of weak convergence.
µ^N converges weakly to µ if lim_{N→∞} (µ^N, ϕ) = (µ, ϕ) for all continuous bounded functions ϕ. Here (µ, ϕ) = ∫ ϕ dµ.
Define b_t(ν)(dx_t) = ∫_{R^d} K(dx_t | x_{t-1}) ν(dx_{t-1}). So π_{t|t-1} = b_t(π_{t-1|t-1}).
Let a_t(ν) be the probability measure defined by (a_t(ν), ϕ) = (ν, g)^{-1} (ν, ϕ g) for any continuous bounded function ϕ.
Then π_{t|t} = a_t(π_{t|t-1}) = a_t ∘ b_t(π_{t-1|t-1}).
Filtering Application
Assume a_t is continuous: a slight variation in the conditional distribution of X_t will not result in a big variation in the conditional distribution after y_t is taken into account.
One way: assume g(y_t | ·) is a continuous, bounded, strictly positive function. Positivity ensures the normalizing denominator is never 0.
In particle filtering, the perturbation c^N is random, but with probability 1 it has the properties outlined above.
Convergence of the Mean Square Error
A different notion of convergence: lim_{N→∞} E[((µ^N, ϕ) − (µ, ϕ))^2] = 0, where the expectation is over all realizations of the random particle method.
Assumption: g(y_t | ·) is a bounded function of the argument x_t.
Theorem. There exists c_{t|t} independent of N such that for any continuous bounded function ϕ:
    E[((π^N_{t|t}, ϕ) − (π_{t|t}, ϕ))^2] ≤ c_{t|t} ||ϕ||^2 / N    (9)
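A small numerical illustration of the 1/N rate in (9), for a single Bayes update with a known Gaussian answer. The one-step model and the test function ϕ(x) = x are illustrative choices (ϕ(x) = x is unbounded, so it falls slightly outside the theorem's assumptions, but it makes the exact value easy to compute).

```python
import numpy as np

rng = np.random.default_rng(4)
sigma_y, y1 = 0.5, 0.7
# One-step example: X_1 ~ N(0, 1), Y_1 | X_1 ~ N(X_1, sigma_y^2).
# Exact value of (pi_{1|1}, phi) with phi(x) = x, from conjugate Gaussian formulas:
post_var = 1.0 / (1.0 + 1.0 / sigma_y**2)
exact = post_var * y1 / sigma_y**2

def pf_estimate(N):
    """Importance-sampling (bootstrap-style) estimate of (pi_{1|1}, phi) with N particles."""
    x = rng.standard_normal(N)                   # sample the prior / predictive distribution
    logw = -0.5 * ((y1 - x) / sigma_y) ** 2      # g(y_1 | x)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return np.sum(w * x)

for N in (100, 1000, 10000):
    err2 = np.mean([(pf_estimate(N) - exact) ** 2 for _ in range(500)])
    print(N, err2, err2 * N)    # err2 * N should stay roughly constant, as in (9)
```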
Convergence of the Mean Square Error
If one uses a kernel K̃ instead of K, we need ||w|| < ∞.
“In other words, particle filtering methods beat the curse of dimensionality as the rate of convergence is independent of the state dimension d.”
“However to ensure a given precision on the mean square error...the number of particles N also depends on c_{t|t}, which can depend on d.” [Crisan and Doucet, 2002]
Uniform Convergence
We have shown that (π^N_{t|t}, ϕ) converges to (π_{t|t}, ϕ) in the mean-square sense, with a rate of convergence of 1/N.
But how does c_{t|t} behave over time? If the true optimal filter does not forget its initial conditions, then errors accumulate over time.
We need mixing assumptions on the dynamic model (and thus on the true optimal filter).
Uniform (in time) convergence results can then be obtained [Del Moral 2004].
Curse of Dimensionality
Consider the batch setting: observe Y_{1:n} and try to estimate the hidden state X_n.
Let g(y_{1:n} | x) be the likelihood and f(x) the prior density. Suppose f(x) is chosen as the importance density.
The RMSE can be bounded [Le Gland and Oudjane, 2002] as:
    E[((π^N_{n|n}, ϕ) − (π_{n|n}, ϕ))^2]^{1/2} ≤ (c_0 / √N) I(f, g) ||ϕ||    (10)
where
    I(f, g) = sup_x g(y_{1:n} | x) / ∫_{R^d} g(y_{1:n} | x) f(x) dx    (11)
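A short simulation of the weight collapse that drives I(f, g) up with dimension: with a Gaussian prior as importance density and a Gaussian likelihood (a hypothetical setup in the spirit of Snyder et al., 2008), the maximum normalized weight approaches 1 and the effective sample size approaches 1 as d grows.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustration of weight collapse as the dimension d grows:
# prior f = N(0, I_d) used as importance density, Gaussian likelihood with unit noise.
N = 1000    # number of particles

for d in (1, 10, 50, 100):
    x = rng.standard_normal((N, d))              # particles drawn from the prior f
    y = rng.standard_normal(d)                   # an observation of a true state at the origin
    logw = -0.5 * np.sum((y - x) ** 2, axis=1)   # log g(y | x)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    print(d, "max weight:", round(w.max(), 3),
          "ESS:", round(1.0 / np.sum(w**2), 1))
```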