Australian Centre of Excellence for Mathematical and Statistical Frontiers
Big Data, Big Models, New Insights

The past, present and future of Monte Carlo methods (a personal experience)

Scott A. Sisson
UNSW Sydney & ACEMS

8th–12th July 2019, MCM2019
Talk Outline

1. The Past
   • RJMCMC convergence diagnostics*
   • Adaptive MCMC
2. The Present
   • Approximate Bayesian computing
   • Likelihood-free approximate Gibbs sampling
   • MCMC for approximate data
3. The Future
   • of Monte Carlo methods research

* Sisson & Fan (2007), Statistics & Computing.
RJMCMC convergence diagnostics

Active research in the 1990s–2000s.

Gelman–Rubin (Potential Scale Reduction Factor) diagnostic (1992):
\[
\hat{R} = \left( \frac{\frac{N-1}{N}\,\hat{W} + \frac{1}{N}\,\hat{B}}{\hat{W}} \right)^{1/2}
\]
• \hat{B} = between-chain variance.
• \hat{W} = within-chain variance.
• \hat{R} → 1 when all chains have converged.
• Based on some quantity constantly observed across all chains.
• Not clear how to use for RJMCMC:
  – Monitor the deviance?
  – Monitor the model indicator?
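To make the formula above concrete, here is a minimal sketch (not from the talk; the function name and interface are my own) of computing the PSRF for a scalar quantity monitored across several chains:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential Scale Reduction Factor for a scalar quantity.

    chains: array-like of shape (C, N) -- C chains, N iterations each.
    Returns R-hat; values close to 1 suggest the chains have converged.
    """
    chains = np.asarray(chains, dtype=float)
    C, N = chains.shape
    chain_means = chains.mean(axis=1)
    B = N * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    V_hat = (N - 1) / N * W + B / N          # pooled variance estimate
    return np.sqrt(V_hat / W)

# Example: three well-mixed chains should give R-hat near 1.
rng = np.random.default_rng(0)
print(gelman_rubin(rng.normal(size=(3, 5000))))
```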
RJMCMC convergence diagnostics

After reading some books on point processes (. . . )
• Distance X_v from a point v to the nearest event ξ_1, . . . , ξ_n in a space A.
• For some models, the parameters can be represented as point-process realisations.
  E.g. a mixture of Normals f(x) = \sum_{i=1}^{n} w_i \phi(x; \mu_i, \sigma_i):
  θ = (μ_1, . . . , μ_n, σ_1, . . . , σ_n, w_1, . . . , w_n, n)
  θ = { ξ_i = (μ_i, σ_i, w_i) : i = 1, . . . , n }
• Now X_v is constantly observed across all chains.
• So we can estimate \hat{F}(x_v | v) for a chain, from some location v = (μ_v, σ_v, w_v).

Use for a diagnostic:
• Comparing two chains,
\[
u_{jk} = \int_V \int_0^{\infty} \left| F_j(x_v \mid v) - F_k(x_v \mid v) \right|^p \, dx_v \, dv
\]
• or compute \hat{R}_v on (x_v^1, . . . , x_v^C) over all chains (etc.)
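A rough sketch of the idea (my own simplification, not the talk's implementation): treat each RJMCMC iteration as a point-process realisation, record the distance X_v from a fixed location v to its nearest component, and compare the empirical distributions of X_v across chains. The function names, the Euclidean distance, and the single-location version of u_jk are all assumptions.

```python
import numpy as np

def nearest_event_distance(v, components):
    """Distance from a fixed location v (e.g. (mu_v, sigma_v, w_v)) to the
    nearest mixture component in one RJMCMC iteration.
    components: (n, d) array -- the n components of that iteration."""
    v = np.asarray(v, dtype=float)
    comps = np.asarray(components, dtype=float)
    return np.min(np.linalg.norm(comps - v, axis=1))

def distance_trace(v, chain):
    """X_v observed at every iteration of one chain.
    chain: list over iterations; each element is an (n_t, d) array of
    components, where n_t may change under RJMCMC moves."""
    return np.array([nearest_event_distance(v, comps) for comps in chain])

def u_jk(xv_j, xv_k, p=1):
    """Single-location analogue of u_jk: the integrated |F_j - F_k|^p
    distance between two chains' empirical CDFs of X_v."""
    grid = np.sort(np.concatenate([xv_j, xv_k]))
    F_j = np.searchsorted(np.sort(xv_j), grid, side="right") / len(xv_j)
    F_k = np.searchsorted(np.sort(xv_k), grid, side="right") / len(xv_k)
    # left Riemann sum over the pooled sample grid
    return np.sum(np.abs(F_j - F_k)[:-1] ** p * np.diff(grid))
```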
RJMCMC convergence diagnostics

Specific model¹: corbelled domes (unknown number of changepoints)
\[
\log(r_i) =
\begin{cases}
\log(\alpha_j) + \beta_j \log(d_i + \delta_j) + \epsilon_i & \gamma_{j-1} \le d_i < \gamma_j \\
\log(c) + \epsilon_i & \gamma_J \le d_i
\end{cases}
\]
with changepoints 0 = γ_0 ≤ . . . ≤ γ_J and ε_i ∼ N(0, σ²), subject to some continuity constraints (Fan and Brooks).
• ξ_j = (α_j, β_j, δ_j, γ_j, σ)
• Run 5 RJMCMC samplers for 15 million iterations.
• Did they converge?

¹ Fan and Brooks (2000), The Statistician.
RJMCMC convergence diagnostics

• (a, b) Kolmogorov–Smirnov and χ² tests on the model indicators.
• (c) PSRF / \hat{R} using the deviance.
RJMCMC convergence diagnostics

• (a) All u_{j3} traces suggest chain #3 is different (grey lines).
• (b) \hat{R}_v shows different performance for 3 specific v ∼ π(ξ | x).
• (c) The density of γ_2 under chain 3 is very different to the other chains.
• The difference is identified by the 3 closest v (points).
Talk Outline

1. The Past
   • RJMCMC convergence diagnostics
   • Adaptive MCMC*
2. The Present
   • Approximate Bayesian computing
   • Likelihood-free approximate Gibbs sampling
   • MCMC for approximate data
3. The Future
   • of Monte Carlo methods research

* Garthwaite, Fan & Sisson (2010/16), Communications in Statistics.
Adaptive MCMC

Adaptive Random Walk Metropolis–Hastings (ARWMH):
• MCMC with proposal distribution q(θ, θ*) = N(θ, σ² A_t)
• where A_t = Cov(θ) is estimated from θ_{t−1}, . . . , θ_1
• Satisfies the "diminishing adaptation" condition, so achieves the correct limiting distribution.

Q: How to choose σ?
(One) A: Realise a particular acceptance probability.

Which acceptance probability?
• 0.234 for multivariate proposal distributions (Roberts et al., 1997)
• 0.44 for univariate proposal distributions (Roberts and Rosenthal, 2001)

Appears to work well in general.
Typically obtained by manually searching for σ² (laborious).
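A minimal sketch (my own, with hypothetical names; not the talk's implementation) of the kind of recursive mean/covariance update that could supply A_t without re-scanning the whole chain history:

```python
import numpy as np

def update_mean_cov(mean, cov, theta_new, t):
    """Welford-style recursive update of the running mean and (population)
    covariance of theta_1, ..., theta_t after observing theta_{t+1}.
    The updated covariance can serve as A_t in the N(theta, sigma^2 A_t)
    proposal; the size of the change shrinks as t grows (diminishing
    adaptation)."""
    delta = theta_new - mean
    new_mean = mean + delta / (t + 1)
    new_cov = (t * cov + np.outer(delta, theta_new - new_mean)) / (t + 1)
    return new_mean, new_cov
```

In practice a small regularisation term (e.g. εI) is often added to keep the estimated covariance positive definite early in the run.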
The Robbins–Monro Process

• Stochastic search algorithm for a binary response with probability of success p(σ); we can control σ.
• The aim is to find the value σ* that gives a specified value of p, say p* = p(σ*).

Robbins–Monro process (Robbins & Monro, 1951): conduct a sequence of trials.
• At each trial, σ is set equal to the current estimate of σ*.
• If the trial is a success, σ is increased.
• If it is a failure, σ is reduced.
• The adjustment size decreases with each trial.

[Figures: p(σ) versus σ, illustrating the search for σ* over a sequence of trials.]
Incorporation into RWMH algorithms

Within a RWMH algorithm, given the current value σ_t, update σ_{t+1} according to
\[
\sigma_{t+1} =
\begin{cases}
\sigma_t + c\,(1 - p^*)/t & \text{if } \theta^* \text{ is accepted} \\
\sigma_t - c\,p^*/t & \text{if } \theta^* \text{ is rejected}
\end{cases}
\]
where θ* ∼ N(θ_t, σ²), and c is a chosen steplength constant which controls algorithm convergence.
• Assumes that p(σ) is a monotonic function of σ.
• The optimum choice of c is (Hodges and Lehmann, 1955)
  \[
  c^* = -1 / \left[ dp(\sigma)/d\sigma \right]_{\sigma = \sigma^*}.
  \]
  In general this must be estimated.
• Previous attempts to estimate c* used 3 additional Markov chains (Andrieu and Robert, 2001).
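A toy illustration (mine, not from the paper) of the update above on a synthetic monotone acceptance curve p(σ) = 1/(1 + σ), for which the target p* = 0.44 corresponds to σ* = 1/p* − 1 ≈ 1.27; the steplength c is held fixed here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
p_star = 0.44
sigma = 5.0      # deliberately poor starting value
c = 3.0          # fixed steplength (the paper estimates this adaptively)

for t in range(1, 20001):
    # synthetic binary "trial" with success probability p(sigma) = 1/(1+sigma)
    accepted = rng.random() < 1.0 / (1.0 + sigma)
    if accepted:
        sigma += c * (1 - p_star) / t
    else:
        sigma -= c * p_star / t
    sigma = max(sigma, 1e-3)   # keep the scale positive

print(sigma)   # should settle near sigma* = 1/p_star - 1 ~= 1.27
```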
Estimating c*

• Pr(accepting a MH move from θ):
\[
p(\theta, \sigma) = \int \alpha(\theta, \theta^*)\, q(\theta, \theta^*)\, d\theta^*
\]
• Hence, the overall MH acceptance probability (OAP) is
\[
p(\sigma) = \int p(\theta, \sigma)\, \pi(\theta \mid y)\, d\theta.
\]
For Gaussian RWMH (symmetric proposal) we then have
\[
\frac{dp(\sigma)}{d\sigma} = \int \int \min\left\{ \frac{\pi(\theta^* \mid y)}{\pi(\theta \mid y)}, 1 \right\} \frac{dq(\theta, \theta^*; \sigma)}{d\sigma}\, \pi(\theta \mid y)\, d\theta^*\, d\theta,
\]
which is difficult to work with in general. However . . .
Some results

Suppose q is an m-dimensional MVN(θ_t, σ²A) proposal, where A is independent of σ.

Proposition 1: A lower bound on c*/σ* is 1/(m p*).

Proposition 2: If π(θ | y) has finite variance, then for m = 1:
  c*/σ* → 1/p*  as σ → ∞.

Proposition 3: If π(θ | y) is continuous over Θ, and if σ → 0 as p(σ) → 1, then
  c*/σ* ≈ 1/(1 − p*)  as p* → 1.
A simple estimate of c* (univariate)

Hence for univariate proposals (m = 1), one possibility is
\[
\frac{c^*}{\sigma^*} \approx \frac{1}{p^*(1 - p^*)},
\]
which satisfies Propositions 1–3.

In practice: we have an estimate of c* for every estimate of σ* in the Robbins–Monro search process.

There are analogous results for multivariate posteriors too.
How good in practice?

[Figure: c*/σ* versus p* for nine univariate target distributions.]

• Given p*, numerically solve p(σ) for σ* and dp(σ)/dσ for c*.
• Univariate distributions: N(0,1); t_5; Cauchy; logistic; double exponential; Gamma(5,1); Beta(3,7); U(0,1); ½N(0,1) + ½N(5,5).
• Note: it is useful to slightly overestimate c* in practice.
How good in practice?

[Figure: c*/σ* versus p* for multivariate targets of dimension m = 2, 4, 8 and 20.]

• π(θ | y) = ∏_i h(θ_i), where the h_i are the previous nine densities.
• Consistent results as before.
Algorithm comments

Univariate target distributions, with θ* ∼ N(θ_t, σ²): set p* = 0.44 and update σ_t according to
\[
\sigma_{t+1} =
\begin{cases}
\sigma_t + c\,(1 - p^*)/t & \text{if } \theta^* \text{ is accepted} \\
\sigma_t - c\,p^*/t & \text{if } \theta^* \text{ is rejected,}
\end{cases}
\]
where c = σ_t / \{p^*(1 − p^*)\}.

Various other implementational details omitted.
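Putting the pieces together, here is a minimal sketch (my own, with hypothetical names and a standard normal target for illustration; the offset t0 is one way of handling the omitted implementation details, not necessarily the paper's) of a univariate RWMH sampler with the Robbins–Monro scale update and c = σ_t / {p*(1 − p*)}:

```python
import numpy as np

def adaptive_rwmh(log_target, theta0, n_iter=20000, p_star=0.44,
                  sigma0=1.0, t0=10, seed=0):
    """Univariate RWMH whose proposal scale sigma is tuned by the
    Robbins-Monro update with steplength c = sigma / {p*(1 - p*)}.
    t0 is a small offset on the iteration counter that damps the very
    first adjustments (and keeps sigma positive)."""
    rng = np.random.default_rng(seed)
    theta, sigma = float(theta0), float(sigma0)
    log_post = log_target(theta)
    draws, scales = np.empty(n_iter), np.empty(n_iter)
    for t in range(1, n_iter + 1):
        proposal = rng.normal(theta, sigma)
        log_post_prop = log_target(proposal)
        accepted = np.log(rng.random()) < log_post_prop - log_post
        if accepted:
            theta, log_post = proposal, log_post_prop
        # Robbins-Monro scale update with the simple steplength estimate.
        step = sigma / (p_star * (1 - p_star)) / (t + t0)
        sigma += step * (1 - p_star) if accepted else -step * p_star
        draws[t - 1], scales[t - 1] = theta, sigma
    return draws, scales

# Standard normal target: sigma_t should settle close to the value of
# about 2.4 reported in the simulation study on the next slide.
draws, scales = adaptive_rwmh(lambda x: -0.5 * x * x, theta0=0.0)
print(round(scales[-1], 2), round(draws.mean(), 2), round(draws.std(), 2))
```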
Univariate simulation study

Target                  Optimal    σ* quantile                 OAP quantile
distribution            σ*         0.05    median   0.95       0.05    median   0.95
N(0,1)                  2.42       2.31    2.43     2.56       0.417   0.443    0.468
t-dist. (5 d.f.)        2.71       2.54    2.73     2.89       0.413   0.441    0.470
Cauchy                  4.39       3.69    4.25     5.03       0.389   0.443    0.501
Logistic                4.05       3.82    4.05     4.33       0.417   0.442    0.467
Double exponential      2.70       2.52    2.70     2.93       0.413   0.439    0.465
Gamma(5,1)              4.98       4.62    4.96     5.28       0.414   0.443    0.467
Beta(3,7)               0.335      0.311   0.335    0.355      0.417   0.440    0.466
Uniform                 0.806      0.764   0.807    0.849      0.418   0.442    0.464
½N(0,1) + ½N(5,5)       6.07       5.59    6.10     6.50       0.415   0.442    0.468

[Figure: traces of σ_t and the acceptance rate for the Normal, double exponential and Beta(3,7) targets.]
The Past: Summary

1. The Past
   • RJMCMC convergence diagnostics
   • Adaptive MCMC

Active research in the 1990s–2000s.

Who worries about convergence in 2020? It should be more important than ever (bigger, more complex models).

MCMC adaptation is still important (e.g. HMC), but there is less focus on going beyond the (1-step) Markov chain.

Maybe the problems have been adequately resolved (for now). Research trends come and go, and this is ok.
Talk Outline

1. The Past
   • RJMCMC convergence diagnostics
   • Adaptive MCMC
2. The Present
   • Approximate Bayesian computing*
   • Likelihood-free approximate Gibbs sampling
   • MCMC for approximate data
3. The Future
   • of Monte Carlo methods research

* Rodrigues, Nott & Sisson (2016), Computational Statistics & Data Analysis.