Tuning of pseudo-marginal MCMC Alex Thiéry 1 1 National University of Singapore Joint work with C. Sherlock (Lancaster), G. Roberts (Warwick), J. Rosenthal (Toronto).
Pseudo-marginal MCMC Target density π ( x ) , positive unbiased estimate � π ( x ) Theorem (Beaumont 2003 - Andrieu-Roberts 2009) Usual MCMC with � π ( x ) instead of π ( x ) is correct! No free lunch: poor estimator � π , poor mixing.
Wrong Pseudo-marginal MCMC start from x 0 ∈ X 1 Proposal: y D ∼ q ( x k , · ) 2 � π ( y ) q ( y , x k ) Noisy Metropolis-Hastings: α = 1 ∧ 3 � π ( x k ) q ( x y , y ) y �→ x k + 1 with probability α 4 Not correct. Not too bad, after correction. � � � π ( y ) / � Notice that E π ( x ) � = π ( y ) /π ( x )
Pseudo-marginal MCMC start from x 0 ∈ X , estimator z 0 = � π ( x 0 ) 1 Proposal: y D ∼ q ( x k , · ) , estimator z y = � π ( y ) 2 z y q ( y , x k ) Noisy Metropolis-Hastings: α = 1 ∧ 3 z k q ( x y , y ) ( y , z y ) �→ ( x k + 1 , z k + 1 ) with probability α 4
Latent variables Observation y , latent variable u , parameter x ∈ X � P ( y | u , x ) P ( du | x ) π ( x ) ∝ π 0 ( x ) × U data augmentation: MCMC on X × U high correlation between latent variable and parameter? High-dimensional latent variable u ? Pseudo-marginal: imitate MCMC on X only If one can simulate u | x and compute P ( y | u , x ) � n k = 1 P ( y | u k , x ) � π ( x ) = π 0 ( x ) × n
Unbiased estimators? importance sampling sequential importance sampling particle filter unbiased estimator of f ( µ ) from unbiased estimator of µ . Random truncation, Russian’s roulette Probabilistic interpretation of PDEs (Feynman-Kac): replace solving a PDE by simulating a few particles
Stickyness better estimator, better mixing [talk C. Andrieu] � n k = 1 P ( y | u k , x ) � π ( x ) = π 0 ( x ) × n behaves like marginal chain as Var ( � π ( x )) → 0 if � π ( x ) ≫ π ( x ) , parameter x is over-estimated: stuck! Trade-off: MCMC mixing, estimator computational time good estimators take a lot of time to compute. good estimators yield good mixing.
How to measure efficiency? Integrated Autocorrelation Time: test function? � ∞ IACT = 1 + 2 Cor ( ϕ ( X t ) , ϕ ( X t + k )) k = 1 Spectral Gap, Log-Sobolev, mixing time: complicated! Target distribution of ( X k , W k ) depends on distribution of the noise: difficult to compare chain with different target distributions.
a toy example One dimensional Random Walk metropolis y = x + λ ξ with ξ D ∼ N ( 0 , 1 ) Gaussian target π ( x ) ∝ exp ( − x 2 / 2 ) Estimator of the form � π ( x ) = π ( x ) × e W with W = σ Z − σ 2 / 2 and Z D ∼ N ( 0 , 1 ) so that Var ( W ) = σ 2 Process ( X k , W k ) is a Markov chain
Influence of the noise σ noise sd=0 2 1 X 0 −1 −2 −3 0 100 200 300 400 500 iteration number noise sd=1 2 1 0 X −1 −2 0 100 200 300 400 500 iteration number noise sd=5 2 X 1 0 0 100 200 300 400 500 iteration number noise sd=10 0.0 −0.2 X −0.4 −0.6 −0.8 0 100 200 300 400 500 iteration number
Influence of the jump size λ 0.35 30 0.30 Spectral Gap 0.25 IACT 20 0.20 10 0.15 0.25 0.50 0.75 0.2 0.4 0.6 mean acceptance Mean Acceptance Probability
Diffusion limit High-dimensional state space, local move MCMC: Looks like a diffusion from far away all above approaches are identical if diffusion limit intuitive understanding + simple conclusions process W k is averaged out relevant for moderate dimension robust results (eg. 0.234 rule)
Goal Tuning of Pseudo Marginal RWM small jump size: explore slowly large jump size: most proposals are rejected cheap estimator: fast computation time, bad mixing good estimator: expensive, good mixing
High Dimension + simple target distributions product form target distribution in R d π ( x ) = f ( x 1 ) × f ( x 2 ) × . . . × f ( x d ) Random Walk Metropolis y = x + µ ξ D √ ξ ∼ N ( 0 , I d ) . with d Unbiased estimator � π ( x ) = π ( x ) × e W . Assumption: W is independent from x .
Accelerated first coordinate process. V ( d ) ( t ) = X ( d ) ( d × t ) 1 V ( d ) is not Markovian. In the limit d → ∞ it is!
Theorem (Sherlock-T-Roberts-Rosenthal) Weak convergence of V ( d ) on [ 0 , T ] towards Langevin diffusion � dV = 1 2 J ( µ ) ( log f ) ′ ( V ) dt + J ( µ ) dW Sketch of proof: look at Markov chain ( X k , W k ) on augmented space compute drift-variance (generator of diffusion) along X only two time scales: homogenization arguments X -process mixes after O ( d ) steps W -process mixes after O ( 1 ) steps average out the W process
Theorem (Sherlock-T-Roberts-Rosenthal) Weak convergence of V ( d ) on [ 0 , T ] towards Langevin diffusion � dV = 1 2 J ( µ ) ( log f ) ′ ( V ) dt + J ( µ ) dW Speed function: the larger J ( µ ) , the better the mixing Expected Squared Jump Distance: J ( µ ) = E � X k + 1 − X k � 2 Limiting mean acceptance probability: 0 < α ( µ ) < 1 α ( µ ) → 0 as µ increases Almost closed form expression for α ( µ ) and J ( µ ) if noise W is fixed, optimal µ independent from f ( x ) dx
Gaussian case: joint optimisation ( µ, σ ) Speed function 8 level 6 0.6 estimator variance 0.4 4 0.2 2 0 1 2 3 4 5 Jump Size
Gaussian case Speed function VS noise 10 -0.5 speed function in log-scale 10 -1 10 -1.5 10 -2 10 -2.5 10 -3 0 1 2 3 4 5 sigma
Discussion Gaussian assumption: is it relevant? What do we want to optimise? How robust the conclusions are?
Thank You! Questions?
Recommend
More recommend