SLIDE 1 Quasi-Reliable Estimates of Effective Sample Size
Robert Skeel
Purdue University / Arizona State University
July 20, 2016 with Youhan Fang and Yudong Cao
SLIDE 2
Evaluating Performance of MCMC Methods
Statistician Charles Geyer (1992): It would enforce a salutary discipline if the gold standard for comparison of Markov chain Monte Carlo schemes were asymptotic variance . . . it is easier to invent methods than to understand exactly what their strengths and weaknesses are . . . .

Physicist Alan Sokal (1997): the better algorithm is the one that has the smaller autocorrelation time, when time is measured in units of CPU time.
SLIDE 3
The Cost of an Independent Sample
τ × cost per step = (N/ESS) × cost per step,

where τ is the integrated autocorrelation time, N is the sample size, and ESS = N/τ is the effective sample size.
SLIDE 4
The Cost of an Independent Sample
τ × cost per step = (N/ESS) × cost per step,

where τ is the integrated autocorrelation time, N is the sample size, and ESS = N/τ is the effective sample size. However, not generally accepted for molecular simulation, because very often . . . ESS = 0.
SLIDE 5 The Cost of an Independent Sample
τ × cost per step = (N/ESS) × cost per step,

where τ is the integrated autocorrelation time, N is the sample size, and ESS = N/τ is the effective sample size. However, not generally accepted for molecular simulation, because very often . . . ESS = 0. Nonetheless, for comparing methods,
- one can simply use small representative problems.
SLIDE 6
Evaluating Accuracy of MCMC Estimates
Alan Sokal again: It is important to emphasize that unless further information is given—namely, the autocorrelation time of the algorithm—. . . statements [of sample size] have no value whatsoever. What about problems for which even ESS = 1 is infeasible?
SLIDE 7 What to Do With Intractable Problems
- 1. Do “exploration” instead of sampling.
- 2. Change the problem:
2.1 Use a simpler model (Daniel Zuckerman). A coarsened model with error bars is better than a refined model without error bars (because then the limitations are more explicit).
2.2 Artificially limit the extent of state space, and adjust conclusions accordingly.
SLIDE 8 Reliability and Quasi-reliability
Reliability is impractical in general due to metastability. Here quasi-reliability means ensuring thorough sampling
- of that part of state space that has already been explored,
to minimize the risk of missing an opening to an unexplored part of state space. More concretely, it means thorough sampling of those modes that have already been visited, to minimize the risk of missing an opening to an unvisited mode.
SLIDE 9
Outline
Preliminaries
Thorough Sampling
Coping with Statistical Error
Reversible Samplers
SLIDE 10
Problem Statement
Given probability density ρq(q), q ∈ Rd, known only up to a multiplicative factor, compute observables E[u(Q)], which are expectations for specified functions u(q). Note use of upper case for random variables.
SLIDE 11 Markov Chains
Q0 → Q1 → · · · → QN−1

Assume the chain is ergodic with stationary density ρq(q). Also, assume Q0 ∼ ρq(q). To estimate an observable, use

E[u(Q)] ≈ U = (1/N) Σ_{n=0}^{N−1} u(Qn),

but use just one realization q0 → q1 → · · · → qN−1.
SLIDE 12 Variance of the Estimated Mean
Var[U] = (1/N) Var[u(Q)] Σ_{k=−(N−1)}^{N−1} (1 − |k|/N) C(k)/C(0),

where C(k) = E[(u(Q0) − µ)(u(Qk) − µ)], with µ = E[u(Q)]. In the limit N → ∞,

Var[U] = (1/N) Var[u(Q)] τ + o(1/N),

where τ is the integrated autocorrelation time

τ = 1 + 2 Σ_{k=1}^{+∞} C(k)/C(0).
SLIDE 13 Estimates of Covariances

CN(k) = (1/N) Σ_{n=0}^{N−k−1} (u(qn) − ū)(u(qn+k) − ū),

where ū is the sample mean.
Priestley (1981, pp.323–324)
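As a minimal sketch (array and function names are illustrative), this estimator is a few lines of NumPy; note the 1/N normalization from the formula above, the conventional biased choice for lag-window methods:

```python
import numpy as np

def autocovariance(u, max_lag):
    """Estimate C_N(k) = (1/N) * sum_{n=0}^{N-k-1} (u_n - ubar)(u_{n+k} - ubar)."""
    u = np.asarray(u, dtype=float)
    N = len(u)
    du = u - u.mean()  # subtract the sample mean ubar
    # 1/N normalization (not 1/(N-k)), matching the formula on this slide
    return np.array([np.dot(du[:N - k], du[k:]) / N for k in range(max_lag + 1)])
```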
SLIDE 14 MCMC in Extended State Space
Some samplers augment state variables q with auxiliary variables p, extend the p.d.f. to ρ(q, p) so that

∫ ρ(q, p) dp = ρq(q),

and make moves in extended state space z0 → z1 → · · · → zN−1 where z = (q, p).
SLIDE 15 Forward Transfer Operator
Associated with the MCMC propagator is an operator F, which maps a relative density un−1 = ρn−1/ρ for Zn−1 to a relative density un = ρn/ρ for Zn: un = F un−1, where

F un−1(z) = (1/ρ(z)) ∫ ρ(z|z′) un−1(z′) ρ(z′) dz′

and ρ(z|z′) is the transition probability for the chain. Note that u ≡ 1 is an eigenfunction of F for eigenvalue λ1 = 1.
SLIDE 16
Error in Probability Density
The number of steps needed to damp the error in the probability density, starting from ρ0(q) = δ(q − q0), is proportional to 1/|1 − λ2|, where λ2 is the nonunit eigenvalue of F nearest 1. However, the spectral gap is not the relevant quantity, in general.
SLIDE 17 Error in Estimate of τ
An estimate of τ that truncates the sum at lag M has a standard deviation roughly proportional to √(M/N), where M is the number of terms taken in the formula for τ. Therefore, use instead

τ ≈ 1 + 2 Σ_{k=1}^{N−1} w(k) C(k)/C(0),

where w(k) is a decreasing function known as a lag window. The tiny program acor of Jonathan Goodman (2009) uses a lag window that is 1 from 0 to roughly 1.4τ and decreases linearly from 1.4τ to 2.4τ. It requires the number of samples N to exceed 35τ, roughly.
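As a sketch of such a lag-window estimate (assuming the cutoff M is supplied by the user, whereas acor chooses it self-consistently so that M ≈ τ), with the window shape described above:

```python
import numpy as np

def tau_windowed(u, M):
    """Lag-window estimate tau ~= 1 + 2 * sum_k w(k) * C(k)/C(0),
    with w(k) = 1 for k <= 1.4*M, decreasing linearly to 0 at 2.4*M."""
    u = np.asarray(u, dtype=float)
    N = len(u)
    du = u - u.mean()
    K = min(int(2.4 * M), N - 1)          # last lag with nonzero weight
    C = np.array([np.dot(du[:N - k], du[k:]) / N for k in range(K + 1)])
    w = np.clip((2.4 * M - np.arange(K + 1)) / M, 0.0, 1.0)
    return 1.0 + 2.0 * np.sum(w[1:] * C[1:]) / C[0]
```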
SLIDE 18
Outline
Preliminaries
Thorough Sampling
Coping with Statistical Error
Reversible Samplers
SLIDE 19 Weakness of Existing Approach
Mixture of Gaussians:

−log ρ(q1, q2) = −log[ exp(−½(16(q1 + 2)² + (q2 − 4)²)) + exp(−½((q1 − 2)² + 16 q2²)) ] + const
Observable function: u(q1, q2) = q1.
SLIDE 20
Estimate of τ as N Increases
Sampler: Euler-Maruyama Brownian dynamics (w/o rejections). Need to ensure thorough sampling.
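For concreteness, a minimal sketch of this sampler on the mixture density of Slide 19 (the step size, chain length, and starting point are illustrative choices, and the gradient is taken by finite differences for brevity):

```python
import numpy as np

def neg_log_rho(q):
    q1, q2 = q
    e1 = np.exp(-0.5 * (16 * (q1 + 2) ** 2 + (q2 - 4) ** 2))
    e2 = np.exp(-0.5 * ((q1 - 2) ** 2 + 16 * q2 ** 2))
    return -np.log(e1 + e2)

def grad(q, h=1e-6):
    # central-difference gradient of -log rho (an analytic gradient works too)
    g = np.zeros(2)
    for i in range(2):
        dq = np.zeros(2); dq[i] = h
        g[i] = (neg_log_rho(q + dq) - neg_log_rho(q - dq)) / (2 * h)
    return g

def euler_maruyama(q0, dt, N, rng=np.random.default_rng(0)):
    """Unadjusted Brownian dynamics: q <- q - dt*grad + sqrt(2*dt)*xi."""
    q, out = np.array(q0, dtype=float), np.empty((N, 2))
    for n in range(N):
        q = q - dt * grad(q) + np.sqrt(2 * dt) * rng.standard_normal(2)
        out[n] = q
    return out

samples = euler_maruyama([2.0, 0.0], dt=0.01, N=100_000)
```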
SLIDE 21
Definition of Thorough Sampling:
for any subset A of state space, an estimate 1̄A of E[1A(Q)] = Pr(Q ∈ A) should satisfy Var[1̄A − E[1A(Q)]] ≤ ¼ tol².
SLIDE 22
Definition of Thorough Sampling:
for any subset A of state space, an estimate 1̄A of E[1A(Q)] = Pr(Q ∈ A) should satisfy Var[1̄A − E[1A(Q)]] ≤ ¼ tol². Since

Var[1̄A − E[1A(Q)]] ≈ τA (1/N) Var[1A(Q)] ≤ τA/(4N)

(because Var[1A(Q)] = p(1 − p) ≤ ¼ for p = Pr(Q ∈ A)), it is enough to have τA/(4N) ≤ ¼ tol² for all A.
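For example, with tol = 0.05 this amounts to requiring N ≥ τA/tol² = 400 τA samples for every A.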
SLIDE 23
Symmetries
There may be permutations P such that ρq(Pq) = ρq(q) and u(Pq) = u(q) for all interesting u. Then, consider only A for which 1A(Pq) = 1A(q).
SLIDE 24
Symmetries
There may be permutations P such that ρq(Pq) = ρq(q) and u(Pq) = u(q) for all interesting u. Then, consider only A for which 1A(Pq) = 1A(q). For simplicity, instead of only indicator functions, consider all functions in W = {u = u(q) | E[u(Q)] = 0, u(Pq) = u(q) for symmetries P}.
SLIDE 25 For More Explicit Notation
Introduce the inner product

⟨v, u⟩ = ∫ v(z) u(z) ρ(z) dz.

Note that ⟨1, u⟩ = E[u(Z)]. One can show by induction that E[v(Z0) u(Zk)] = ⟨F^k v, u⟩ and, in particular, C(k) = ⟨F^k u, u⟩.
SLIDE 26 Maximum Autocorrelation Time
Define thorough sampling using

τmax = max_{u ∈ W} (1 + 2 Σ_{k=1}^{+∞} ⟨F^k u, u⟩ / ⟨u, u⟩).
SLIDE 27 Spatial Discretization
Consider u(q) = aᵀu(q) where the ui ∈ W are given and the ai are unknown. Then

C(k) = ⟨F^k u, u⟩ = aᵀ Ck a, where Ck = ⟨F^k u, uᵀ⟩ = E[u(Q0) u(Qk)ᵀ],

and

τmax ≈ max_a (aᵀ K a) / (aᵀ C0 a), where K = C0 + 2 Σ_{k=1}^{+∞} Ck,

which is a generalized eigenvalue problem. For u, suggest ui(q) = qi.
SLIDE 28
Cross Covariance Matrices
C0 is symmetric positive definite. The eigenvalue problem is well conditioned if Ck is symmetric. This depends on the sampler.
SLIDE 29 Reversible MCMC Samplers
satisfy detailed balance: ρ(z′|z) ρ(z) = ρ(z|z′) ρ(z′), e.g.,
- a Brownian sampler: one step of the Euler-Maruyama integrator for Brownian dynamics, with or without an MRRTT (Metropolis-Rosenbluth-Rosenbluth-Teller-Teller) acceptance test (sketched below),
- hybrid Monte Carlo,
- a generalized hybrid Monte Carlo step, followed by a momenta flip,
- several steps with a reversible Langevin integrator, followed by a momenta flip.
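As a sketch of the first item, one Euler-Maruyama step with the MRRTT acceptance test (this combination is often called MALA; the function arguments are illustrative):

```python
import numpy as np

def mala_step(q, neg_log_rho, grad_neg_log_rho, dt, rng=np.random.default_rng()):
    """One Euler-Maruyama proposal plus MRRTT accept/reject."""
    def log_trans(a, b):
        # log density (up to a constant) of proposing b from a
        mean = a - dt * grad_neg_log_rho(a)
        return -np.sum((b - mean) ** 2) / (4 * dt)
    prop = q - dt * grad_neg_log_rho(q) + np.sqrt(2 * dt) * rng.standard_normal(q.shape)
    log_alpha = (neg_log_rho(q) - neg_log_rho(prop)
                 + log_trans(prop, q) - log_trans(q, prop))
    return prop if np.log(rng.random()) < log_alpha else q
```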
SLIDE 30
Modified Reversible MCMC Samplers
The momenta flip is counterproductive: reversing the momenta flip seems to improve mixing. A modified reversible propagator couples the reversible substep zn−1 → z̄n with a substep zn = R(z̄n), where R satisfies R ◦ R = id and ρ ◦ R = ρ. For a momenta flip, R(q, p) = (q, −p).
SLIDE 31
An Empirical Comparison
Cancès, Legoll & Stoltz (2007)

Discrepancy for a pair of torsion angles in alkanes:

dimension             36      36      87      114
Langevin dynamics     0.034   0.016   0.014   0.017
HMC                   0.039   0.012   0.026   0.021
Brownian dynamics     0.079   0.023   0.040   0.031
Nosé-Hoover chains    0.103   0.046   0.029   0.035
Brownian sampler      0.104   0.034   0.048   0.061

1 million samples for column 1, 10 million for the others.
SLIDE 32 Discretized Langevin Dynamics
Let F(q) = ∇q log ρ(q, p), i.e., the force −∇qU(q) for the potential U(q) = −log ρq(q).

O: p := √(1 − γΔt) p + √(γΔt) N(0, I)
B: p := p + ½ Δt F(q)
A: q := q + Δt p
B: p := p + ½ Δt F(q)
O: p := √(1 − γΔt) p + √(γΔt) N(0, I)

The dynamics has a precise stationary density. It differs from the desired density by O(Δt²).

Leimkuhler, Matthews & Stoltz (2015)

Such error is much smaller than statistical error. (The Euler-Maruyama integrator is a special case of this.)
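A sketch of one step of this scheme, with the force F supplied by the caller (γΔt ≤ 1 is assumed so the square roots are real; argument names are illustrative):

```python
import numpy as np

def obabo_step(q, p, F, dt, gamma, rng=np.random.default_rng()):
    """One O-B-A-B-O step of the discretized Langevin dynamics above."""
    c1, c2 = np.sqrt(1 - gamma * dt), np.sqrt(gamma * dt)
    p = c1 * p + c2 * rng.standard_normal(p.shape)  # O: partial momentum refresh
    p = p + 0.5 * dt * F(q)                         # B: half kick
    q = q + dt * p                                  # A: drift
    p = p + 0.5 * dt * F(q)                         # B: half kick
    p = c1 * p + c2 * rng.standard_normal(p.shape)  # O: partial momentum refresh
    return q, p
```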
SLIDE 33
Symmetry of Cross Covariance Matrices
For a modified reversible propagator, its forward transfer operator is F = R F̄, where F̄ is the operator for the reversible substep and R is the operator for the substep zn = R(z̄n). It can be shown that ⟨F̄g, f⟩ = ⟨g, F̄f⟩ and ⟨Rg, f⟩ = ⟨g, Rf⟩. Additionally, assume Ru(q) = u(q). Then the matrices Ck can be shown to be symmetric modulo sampling error.
SLIDE 34
Outline
Preliminaries
Thorough Sampling
Coping with Statistical Error
Reversible Samplers
SLIDE 35 Simplest Neural Network
Fit model u(q; x) = q3 tanh(q1 x + q2) + q4 to data yi ≈ u(q; xi):

−log ρ(q) = ½ β Σ_{i=1}^{100} (yi − u(q; xi))² + ½ α |q|² + const,

where β = 1.5 and α = 0.01. A Bayesian approach might use the mean E[u(Q; x)] as the approximation. Sampler: Euler-Maruyama Brownian dynamics w/o rejections.
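A sketch of −log ρ and its (analytic) gradient for this model, usable with the Euler-Maruyama sampler sketched earlier; the data arrays x, y (100 points) are assumed given:

```python
import numpy as np

beta, alpha = 1.5, 0.01

def model(q, x):
    q1, q2, q3, q4 = q
    return q3 * np.tanh(q1 * x + q2) + q4

def neg_log_rho(q, x, y):
    """-log rho(q) = (beta/2) sum_i (y_i - u(q; x_i))^2 + (alpha/2) |q|^2."""
    r = y - model(q, x)
    return 0.5 * beta * np.sum(r ** 2) + 0.5 * alpha * np.dot(q, q)

def grad_neg_log_rho(q, x, y):
    q1, q2, q3, q4 = q
    t = np.tanh(q1 * x + q2)
    s = 1 - t ** 2                 # sech^2(q1*x + q2)
    r = y - model(q, x)            # residuals
    # chain rule through u(q; x) = q3*tanh(q1*x + q2) + q4
    g = -beta * np.array([np.sum(r * q3 * s * x),
                          np.sum(r * q3 * s),
                          np.sum(r * t),
                          np.sum(r)])
    return g + alpha * q
```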
SLIDE 36
Data and Predicted Function
The given data are symmetric about x = 0; the lack of symmetry in the predicted function is a finite-sampling artifact (N = 1 000 000 only).
SLIDE 37
Time series for u(q; x1)
SLIDE 38
Autocorrelation Function (ACF) for u(q; x1)
SLIDE 39
Integrated ACF for N = 10 000
SLIDE 40
Integrated ACF for N = 100 000
SLIDE 41
Integrated ACF for N = 1 000 000
SLIDE 42
ACF for N = 1 000 000 integrated 100 times further
SLIDE 43 Estimated τ as N Increases
acor
SLIDE 44
Outline
Preliminaries
Thorough Sampling
Coping with Statistical Error
Reversible Samplers
SLIDE 45
Reversible Samplers
If τmax were the maximum over all u(z), then

τmax = (1 + λ2)/(1 − λ2),

since for a reversible sampler 1 + 2 Σ_{k=1}^{+∞} λ^k = (1 + λ)/(1 − λ), which is largest for the eigenvalue λ2 nearest 1.
SLIDE 46
Mixture of Gaussians, Again
Estimated τ and τmax as N increases:
SLIDE 47
Simplest Neural Network, Again
Fitting model u(q; x) = q3 tanh(q1x + q2) + q4 to data yi ≈ u(q; xi).
SLIDE 48
Time Series for Switch Location −q2/q1
SLIDE 49
Histogram for Switch Location −q2/q1
SLIDE 50
Autocorrelation Function (ACF) for u(q; x1)
SLIDE 51
ACF for Switch Location −q2/q1
SLIDE 52
Estimated τ and τmax as N Increases
SLIDE 53 Summarizing
Let τ denote the integrated autocorrelation time.
◮ Estimating τ is considered essential.
◮ Greater reliability is possible by trying to estimate the worst-case τ.
◮ Better estimates of τ are needed.
SLIDE 54
Thank you