Advanced Simulation - Lecture 16
Patrick Rebeschini
March 7th, 2018
Outline
- Particle methods for static problems.
- Evidence estimation.
- The End!
Sequence of posterior distributions
Consider the problem of estimating ∫ ϕ(x) π(x) dx for a test function ϕ: X → R and a target distribution π on X.
Introduce intermediate distributions π_0, π_1, ..., π_T such that:
- π_0 is easy to sample from,
- π_t and π_{t+1} are not too different,
- π_T = π.
Sequence of posterior distributions
Case where π(θ) ∝ p(θ) p(y_{1:T} | θ).
Introduce the partial posterior π_t(θ) ∝ p(θ) p(y_{1:t} | θ).
We have π_0(θ) = p(θ), the prior distribution.
We have π_T(θ) = π(θ) = p(θ | y_{1:T}), the full posterior distribution.
When the data are i.i.d., we can expect π_t(θ) to be similar to π_{t+1}(θ), at least when t is large.
Sequence of posterior distributions
General target distribution π(x).
Introduce a simple parametric distribution q.
We can introduce π_t(x) ∝ π(x)^{γ_t} q(x)^{1−γ_t} where 0 = γ_0 < γ_1 < ... < γ_T = 1.
Then π_0 = q and π_T = π.
If γ_t is close to γ_{t+1}, then π_t is close to π_{t+1}.
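A minimal NumPy sketch of such a tempered sequence. The two-component mixture target log_pi, the reference q = N(0, 1), and the linear schedule of γ_t are illustrative assumptions, not choices made in the lecture.

```python
import numpy as np

def log_pi(x):
    # Hypothetical target: an equal-weight mixture of N(-3, 1) and N(3, 1),
    # up to an additive constant.
    return np.logaddexp(-0.5 * (x - 3.0) ** 2, -0.5 * (x + 3.0) ** 2)

def log_q(x):
    # Easy-to-sample reference distribution q = N(0, 1), up to a constant.
    return -0.5 * x ** 2

def log_pi_t(x, gamma_t):
    # log pi_t(x) = gamma_t * log pi(x) + (1 - gamma_t) * log q(x) + const,
    # so pi_0 = q and pi_T = pi.
    return gamma_t * log_pi(x) + (1.0 - gamma_t) * log_q(x)

gammas = np.linspace(0.0, 1.0, 51)  # 0 = gamma_0 < gamma_1 < ... < gamma_T = 1
```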
Sequential Importance Sampling: algorithm
At time t = 0:
- Sample X^i ∼ π_0(·) for i ∈ {1, ..., N}.
- Compute the weights w_0^i = 1/N for all i ∈ {1, ..., N}.
At time t ≥ 1:
- Compute the weights
  w_t^i = w_{t−1}^i × ω_t^i = w_{t−1}^i × π_t(X^i) / π_{t−1}(X^i).
At all times, (w_t^i, X^i)_{i=1}^N approximates π_t.
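A sketch of this algorithm in NumPy, assuming the log_pi_t and gammas from the previous sketch are in scope; keeping weights in log space is an implementation choice, not something prescribed by the slides. Note the particles never move in plain SIS.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024
x = rng.standard_normal(N)       # X^i ~ pi_0 = q = N(0, 1); fixed throughout
logw = np.full(N, -np.log(N))    # w_0^i = 1/N

for t in range(1, len(gammas)):
    # Incremental weight: omega_t^i = pi_t(X^i) / pi_{t-1}(X^i).
    logw += log_pi_t(x, gammas[t]) - log_pi_t(x, gammas[t - 1])

W = np.exp(logw - logw.max())
W /= W.sum()                     # normalised weights W_T^i
```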
Sequential Importance Sampling: diagnostics
As already seen for IS, we can compute the effective sample size
  ESS_t = (Σ_{i=1}^N w_t^i)^2 / Σ_{i=1}^N (w_t^i)^2 = 1 / Σ_{i=1}^N (W_t^i)^2,
where W_t^i denotes the normalised weights.
- ESS_t = N if W_t^i = N^{−1} for all i.
- If there exists i such that W_t^i ≈ 1, and W_t^j ≈ 0 for j ≠ i, then ESS_t ≈ 1.
As a rule of thumb, the higher the ESS, the better our approximation.
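A small helper for the ESS from unnormalised log-weights, computed stably in log space; a sketch, not code from the lecture.

```python
import numpy as np

def ess(logw):
    # ESS = (sum_i w_i)^2 / sum_i w_i^2; subtracting max(logw) rescales all
    # weights by a constant, which leaves the ratio unchanged but avoids
    # overflow and underflow.
    w = np.exp(logw - np.max(logw))
    return w.sum() ** 2 / np.sum(w ** 2)
```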
Toy model
Data y_t ∼ N(µ, 1); 1000 points generated with µ* = 2.
Prior µ ∼ N(0, 1).
Sequence of partial posteriors π_t(µ) ∝ p(µ) Π_{s=1}^t p(y_s | µ).
Incremental weights: π_t(µ) / π_{t−1}(µ) ∝ p(y_t | µ).
We can look at the evolution of the effective sample size with t.
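A minimal sketch of this toy model with plain SIS, tracking the ESS over time; the seed and the omission of normalising constants (which cancel in the ESS) are implementation choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, mu_star = 1024, 1000, 2.0
y = mu_star + rng.standard_normal(T)   # y_t ~ N(mu*, 1)

mu = rng.standard_normal(N)            # mu^i ~ N(0, 1), i.e. the prior pi_0
logw = np.zeros(N)
ess_path = []
for t in range(T):
    logw += -0.5 * (y[t] - mu) ** 2    # log p(y_t | mu), up to a constant
    w = np.exp(logw - logw.max())
    ess_path.append(w.sum() ** 2 / np.sum(w ** 2))
# ess_path collapses towards 1 as t grows, as in the figure below.
```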
Toy model
[Figure: Effective sample size against "time", using sequential importance sampling. The ESS (log scale, 2 to 1024) decays rapidly towards 1.]
Sequential Importance Sampling with Resampling
At time t = 0:
- Sample X_0^i ∼ π_0(·) for i ∈ {1, ..., N}.
- Compute the weights w_0^i = 1/N for all i ∈ {1, ..., N}.
At time t ≥ 1:
- Resample (w_{t−1}^i, X_{t−1}^i) → (N^{−1}, X̄_{t−1}^i).
- Define X_t^i = X̄_{t−1}^i.
- Compute the weights
  w_t^i = ω_t^i = π_t(X_t^i) / π_{t−1}(X_t^i).
Problem: there are fewer and fewer unique values in (X_t^i)_{i=1}^N.
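A sketch of the resampling step, using multinomial resampling for simplicity (other schemes, e.g. systematic resampling, could be substituted):

```python
import numpy as np

def resample(x, logw, rng):
    # Multinomial resampling: draw N ancestor indices with probabilities
    # W^i, then reset to equal weights, i.e. (w^i, X^i) -> (1/N, Xbar^i).
    W = np.exp(logw - np.max(logw))
    W /= W.sum()
    idx = rng.choice(len(x), size=len(x), p=W)
    return x[idx], np.zeros(len(x))   # zeros = equal unnormalised log-weights
```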
Toy model
[Figure: Effective sample size against "time", using sequential importance sampling with resampling. The ESS now stays high.]
Toy model
[Figure: Number of unique values against "time", using sequential importance sampling with resampling. The number of unique particles collapses.]
Move steps
Consider particles (N^{−1}, X̄_t^i)_{i=1}^N, approximating π_t.
The ESS is maximal (= N), but multiple values within (X̄_t^i)_{i=1}^N are identical.
How to diversify the particles while still approximating π_t?
Apply a Markov kernel K_t to each X̄_t^i:
  X_t^i ∼ K_t(X̄_t^i, dx) for all i ∈ {1, ..., N}.
Can we find K_t such that (N^{−1}, X_t^i) still approximates π_t? E.g.
  (1/N) Σ_{i=1}^N ϕ(X_t^i) → ∫ ϕ(x) π_t(x) dx almost surely as N → ∞.
Move steps
Assume that K_t leaves π_t invariant:
  ∫ π_t(x) K_t(x, y) dx = π_t(y).
If X̄_t^i ∼ π_t, then X_t^i ∼ K_t(X̄_t^i, dy) also follows π_t.
Also, if X̄ ∼ q and w(x) = π(x)/q(x), so that E_q[w(X̄) ϕ(X̄)] = E_π[ϕ(X)], then, for X ∼ K(X̄, dx),
  E_{qK}[w(X̄) ϕ(X)] = E_π[ϕ(X)].
Move steps
Result: if (w^i, X̄^i) approximates π, then we can apply a π-invariant kernel to each X̄^i and not change the weights.
- Draw X^i ∼ K(X̄^i, dx) for all i ∈ {1, ..., N}.
- Keep the weights unchanged.
- (w^i, X^i) still approximates π.
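A sketch of such a kernel: one random-walk Metropolis–Hastings step per particle. The slides do not prescribe a specific kernel; the proposal and the step size here are assumptions.

```python
import numpy as np

def mh_move(x, log_target, rng, step=0.5):
    # One random-walk MH step per particle, applied independently. The
    # kernel leaves exp(log_target) invariant, so the weights are unchanged.
    prop = x + step * rng.standard_normal(x.shape)   # symmetric proposal
    log_alpha = log_target(prop) - log_target(x)     # log acceptance ratio
    accept = np.log(rng.uniform(size=x.shape)) < log_alpha
    return np.where(accept, prop, x)
```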
Sequential Monte Carlo Sampler: algorithm
At time t = 0:
- Sample X_0^i ∼ π_0(·) for i ∈ {1, ..., N}.
- Compute the weights w_0^i = 1/N for all i ∈ {1, ..., N}.
At time t ≥ 1:
- Resample (w_{t−1}^i, X_{t−1}^i) → (N^{−1}, X̄_{t−1}^i).
- Draw X_t^i ∼ K_{t−1}(X̄_{t−1}^i, dx).
- Compute the weights
  w_t^i = ω_t^i = π_t(X_t^i) / π_{t−1}(X_t^i).
At all times, (w_t^i, X_t^i)_{i=1}^N approximates π_t.
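One step of this sampler, composing the pieces above; a sketch assuming the resample and mh_move helpers from the earlier sketches are in scope.

```python
import numpy as np

def smc_step(x, logw, log_pi_prev, log_pi_curr, rng):
    x, logw = resample(x, logw, rng)      # (w^i, X^i) -> (1/N, Xbar^i)
    x = mh_move(x, log_pi_prev, rng)      # X_t^i ~ K_{t-1}(Xbar^i, dx),
                                          # with K_{t-1} leaving pi_{t-1} invariant
    logw = logw + log_pi_curr(x) - log_pi_prev(x)   # omega_t^i
    return x, logw
```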
Adaptive resampling
Problem: if K_t is a Metropolis–Hastings kernel with invariant distribution π_t, its computational cost is typically linear in t (each evaluation of π_t involves t likelihood terms).
Thus applying K_t at each step t ∈ {1, ..., T} yields an overall cost of
  Σ_{t=1}^T O(t) = O(T²).
We can save time by performing the resample-move step only when necessary.
Use the effective sample size to decide whether to trigger a resample-move step.
Sequential Monte Carlo Sampler: algorithm
At time t = 0:
- Sample X_0^i ∼ π_0(·) for i ∈ {1, ..., N}.
- Compute the weights w_0^i = 1/N for all i ∈ {1, ..., N}.
At time t ≥ 1:
- If ESS < Ñ, then:
  - Resample (w_{t−1}^i, X_{t−1}^i) → (N^{−1}, X̄_{t−1}^i).
  - Draw X_t^i ∼ K_{t−1}(X̄_{t−1}^i, dx).
- Otherwise, set X_t^i = X_{t−1}^i.
- Compute the weights
  w_t^i = w_{t−1}^i × π_t(X_t^i) / π_{t−1}(X_t^i).
At all times, (w_t^i, X_t^i)_{i=1}^N approximates π_t.
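A sketch of the adaptive loop, reusing the ess, resample, and mh_move helpers from the earlier sketches; the threshold Ñ = N/2 is a common default, not a value fixed by the slides.

```python
import numpy as np

def smc_adaptive(log_pis, N, rng, n_tilde=None):
    # log_pis[t] is a vectorised log-density for pi_t, t = 0, ..., T.
    n_tilde = N / 2 if n_tilde is None else n_tilde   # threshold N~
    x = rng.standard_normal(N)                        # assumes pi_0 = N(0, 1)
    logw = np.full(N, -np.log(N))
    for t in range(1, len(log_pis)):
        if ess(logw) < n_tilde:                       # trigger resample-move
            x, logw = resample(x, logw, rng)
            x = mh_move(x, log_pis[t - 1], rng)       # K_{t-1} targets pi_{t-1}
        logw += log_pis[t](x) - log_pis[t - 1](x)     # reweight
    return x, logw
```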
Toy model
[Figure: Effective sample size against "time", using sequential Monte Carlo sampling. Dashed lines indicate resampling times.]
Toy model
[Figure: Number of unique values against "time", using sequential Monte Carlo sampling. Dashed lines indicate resampling times.]
Toy model
[Figure: Posterior approximation of π_T(µ) (in black), and true posterior (in red), after 1000 observations; µ ranges over roughly 1.90 to 2.15.]
Evidence estimation
Assume (w_t^i, X_t^i)_{i=1}^N approximates the posterior distribution π_t(θ) ∝ p(θ) p(y_{1:t} | θ) at time t.
Then the estimator
  Σ_{i=1}^N w_t^i p(y_{t+1} | X_t^i) / Σ_{i=1}^N w_t^i
converges to
  ∫ p(y_{t+1} | θ) π_t(θ) dθ = ∫ p(y_{t+1} | θ) p(θ) p(y_{1:t} | θ) / p(y_{1:t}) dθ
    = (1/p(y_{1:t})) ∫ p(y_{1:t+1} | θ) p(θ) dθ = p(y_{t+1} | y_{1:t}).
Similar to the likelihood estimator in particle filters.
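A sketch of this predictive estimator for the Gaussian toy model, where p(y | θ) = N(y; θ, 1); the function name and arguments are illustrative.

```python
import numpy as np

def predictive_estimate(theta, logw, y_next):
    # sum_i w_t^i p(y_{t+1} | theta^i) / sum_i w_t^i, estimating
    # p(y_{t+1} | y_{1:t}) in the toy model.
    w = np.exp(logw - np.max(logw))
    lik = np.exp(-0.5 * (y_next - theta) ** 2) / np.sqrt(2.0 * np.pi)
    return np.sum(w * lik) / np.sum(w)
```

Summing the logarithms of these estimates over t = 0, ..., T − 1 then yields an estimate of log p(y_{1:T}), by the chain rule of probability.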
Toy model
We compare the SMC estimator with the estimator obtained by importance sampling from the prior distribution:
  X^i ∼ p(θ) for all i ∈ {1, ..., N},
  L_t^i = p(y_{1:t} | X^i) for all i ∈ {1, ..., N},
  p̂_N(y_{1:t}) = (1/N) Σ_{i=1}^N L_t^i.
We plot the relative error
  r_t^N = |p̂_N(y_{1:t}) − p(y_{1:t})| / p(y_{1:t}).
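A sketch of this prior-IS baseline for the Gaussian toy model, computed in log space to avoid the underflow of L_t^i for large t; names are illustrative.

```python
import numpy as np

def log_evidence_is(y, N, rng):
    theta = rng.standard_normal(N)   # theta^i ~ p(theta) = N(0, 1)
    # log L_t^i = log p(y_{1:t} | theta^i) for the toy model y_s ~ N(theta, 1).
    loglik = np.array([np.sum(-0.5 * (y - th) ** 2 - 0.5 * np.log(2 * np.pi))
                       for th in theta])
    m = loglik.max()                 # log-sum-exp for (1/N) sum_i L_t^i
    return m + np.log(np.mean(np.exp(loglik - m)))
```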
Toy model
[Figure: Relative error r_t^N against "time" (log scale, 1e−07 to 1e−01), using importance sampling from the prior (left panel, "IS") and SMC samplers (right panel, "SMC"); 10 independent experiments.]
The End
Inversion, Transformation, Composition, Accept-reject, Importance Sampling, Metropolis–Hastings, Gibbs sampling, Adaptive Multiple Importance Sampling, Reversible Jump, Slice Sampling, Sequential Importance Sampling, Particle filter, Pseudo-marginal Metropolis–Hastings, Sequential Monte Carlo Sampler...
What about Nested Sampling, Path Sampling, Hamiltonian MCMC, Pinball Sampler, Sequential Quasi-Monte Carlo, Multiple-Try Metropolis–Hastings, Coupling From the Past, Multilevel Splitting, Wang–Landau algorithm, Free Energy MCMC, Stochastic Gradient Langevin Dynamics, Firefly MCMC, Configurational Bias Monte Carlo...?
Good luck for the exams!