Quasi-Reliable Estimates of Effective Sample Size
SLIDE 1

Quasi-Reliable Estimates of Effective Sample Size

Robert Skeel

Purdue University / Arizona State University

July 20, 2016 with Youhan Fang and Yudong Cao

SLIDE 2

Evaluating Performance of MCMC Methods

Statistician Charles Geyer (1992): It would enforce a salutary discipline if the gold standard for comparison of Markov chain Monte Carlo schemes were asymptotic variance . . . it is easier to invent methods than to understand exactly what their strengths and weaknesses are . . . .

Physicist Alan Sokal (1997): the better algorithm is the one that has the smaller autocorrelation time, when time is measured in units of CPU time

SLIDE 3

The Cost of an Independent Sample

τ × (cost per step) = (N/ESS) × (cost per step), where τ is the integrated autocorrelation time, N is the sample size, and ESS is the effective sample size (so τ = N/ESS).
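As a concrete illustration of this identity (hypothetical numbers, not from the talk):

```python
# Hypothetical numbers illustrating the identity above.
N = 1_000_000         # total number of MCMC steps (sample size)
tau = 500.0           # integrated autocorrelation time, in steps
cost_per_step = 2e-3  # CPU seconds per step (assumed)

ESS = N / tau                         # effective sample size: 2000
cost_per_indep = tau * cost_per_step  # = (N / ESS) * cost_per_step
print(f"ESS = {ESS:.0f}, cost per independent sample = {cost_per_indep:.3f} s")
```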

SLIDE 4

The Cost of an Independent Sample

τ × (cost per step) = (N/ESS) × (cost per step), where τ is the integrated autocorrelation time, N is the sample size, and ESS is the effective sample size. However, not generally accepted for molecular simulation, because very often . . . ESS = 0.

SLIDE 5

The Cost of an Independent Sample

τ × (cost per step) = (N/ESS) × (cost per step), where τ is the integrated autocorrelation time, N is the sample size, and ESS is the effective sample size. However, not generally accepted for molecular simulation, because very often . . . ESS = 0. Nonetheless, for comparing methods,

  • one can simply use small representative problems.
SLIDE 6

Evaluating Accuracy of MCMC Estimates

Alan Sokal again: It is important to emphasize that unless further information is given—namely, the autocorrelation time of the algorithm—. . . statements [of sample size] have no value whatsoever. What about problems for which even ESS = 1 is infeasible?

SLIDE 7

What to Do With Intractable Problems

  • 1. Do “exploration” instead of sampling.
  • 2. Change the problem:
      2.1 Use a simpler model (Daniel Zuckerman). A coarsened model with error bars is better than a refined model without error bars (because then the limitations are more explicit).
      2.2 Artificially limit the extent of state space, and adjust conclusions accordingly.

SLIDE 8

Reliability and Quasi-reliability

Reliability is impractical in general due to metastability. Here quasi-reliability means ensuring thorough sampling of that part of state space that has already been explored, to minimize the risk of missing an opening to an unexplored part of state space. More concretely, it means thorough sampling of those modes that have already been visited, to minimize the risk of missing an opening to an unvisited mode.

SLIDE 9

Outline

  • Preliminaries
  • Thorough Sampling
  • Coping with Statistical Error
  • Reversible Samplers

SLIDE 10

Problem Statement

Given a probability density ρq(q), q ∈ R^d, known only up to a multiplicative factor, compute observables E[u(Q)], which are expectations for specified functions u(q). Note the use of upper case for random variables.

SLIDE 11

Markov Chains

Q0 → Q1 → · · · → QN−1. Assume the chain is ergodic with stationary density ρq(q); also assume Q0 ∼ ρq(q). To estimate an observable, use

E[u(Q)] ≈ U = (1/N) Σ_{n=0}^{N−1} u(Qn),

but use just one realization q0 → q1 → · · · → qN−1.

SLIDE 12

Variance of the Estimated Mean

Var[U] = (1/N) Var[u(Q)] [ 1 + 2 Σ_{k=1}^{N−1} (1 − k/N) C(k)/C(0) ],

where the covariances are

C(k) = E[(u(Q0) − µ)(u(Qk) − µ)], with µ = E[u(Q)].

In the limit N → ∞,

Var[U] = (1/N) Var[u(Q)] τ + o(1/N),

where τ is the integrated autocorrelation time

τ = 1 + 2 Σ_{k=1}^{+∞} C(k)/C(0).
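A minimal numerical check of the asymptotic formula, on an AR(1) chain where τ is known exactly (a toy example of mine, not from the talk): for X_{n+1} = φ X_n + √(1 − φ²) ξ_n, one has C(k)/C(0) = φ^k, hence τ = (1 + φ)/(1 − φ).

```python
import numpy as np

# Check Var[U] ≈ (1/N) Var[u(Q)] τ on AR(1) chains with known τ.
rng = np.random.default_rng(0)
phi, N, reps = 0.9, 10_000, 1000
tau_exact = (1 + phi) / (1 - phi)  # = 19

x = rng.normal(size=reps)          # all chains start in stationarity
s = x.copy()                       # running sum of each chain
for n in range(1, N):
    x = phi * x + np.sqrt(1 - phi**2) * rng.normal(size=reps)
    s += x
means = s / N                      # one estimate U per chain

# Var[u(Q)] = 1 here, so N * Var[U] should be close to tau_exact.
print(N * means.var(), "vs", tau_exact)
```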

SLIDE 13

Estimates of Covariances

C_N(k) = (1/N) Σ_{n=0}^{N−k−1} (u(qn) − ū)(u(qn+k) − ū).

Priestley (1981, pp. 323–324)
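A direct transcription of this estimator (a sketch):

```python
import numpy as np

def autocov(u, kmax):
    """Empirical autocovariances C_N(k) = (1/N) * sum_{n=0}^{N-k-1}
    (u_n - ubar) * (u_{n+k} - ubar), for k = 0..kmax."""
    u = np.asarray(u, dtype=float)
    N = len(u)
    d = u - u.mean()
    return np.array([np.dot(d[:N - k], d[k:]) / N for k in range(kmax + 1)])
```

Note the 1/N (rather than 1/(N − k)) normalization, which keeps the estimated autocovariance sequence positive semidefinite.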

SLIDE 14

MCMC in Extended State Space

Some samplers augment the state variables q with auxiliary variables p, extend the p.d.f. to ρ(q, p) so that

∫ ρ(q, p) dp = ρq(q),

and make moves in extended state space z0 → z1 → · · · → zN−1, where z = (q, p).

SLIDE 15

Forward Transfer Operator

Associated with the MCMC propagator is an operator F, which maps a relative density un−1 = ρn−1/ρ for Zn−1 to a relative density un = ρn/ρ for Zn: un = F un−1, where

F un−1(z) = (1/ρ(z)) ∫ ρ(z|z′) un−1(z′) ρ(z′) dz′

and ρ(z|z′) is the transition probability for the chain. Note that u ≡ 1 is an eigenfunction of F for the eigenvalue λ1 = 1.
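For intuition, a finite-state analogue (my own construction, not from the talk): for a chain with transition matrix P and stationary distribution π, F acts on relative densities u = ρ/π as the matrix F[j, i] = π_i P_ij / π_j, and u ≡ 1 is indeed an eigenvector with eigenvalue 1.

```python
import numpy as np

# Finite-state forward transfer operator: F[j, i] = pi[i] * P[i, j] / pi[j].
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])            # P[i, j] = Pr(next = j | now = i)

evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()                             # stationary distribution

F = (pi[:, None] * P).T / pi[:, None]      # maps u_{n-1} = rho_{n-1}/pi to u_n
print(np.sort(np.linalg.eigvals(F).real))  # eigenvalues include lambda_1 = 1
print(F @ np.ones(3))                      # u ≡ 1 is fixed: [1, 1, 1]
```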

SLIDE 16

Error in Probability Density

The error in the probability density, starting from ρ0(q) = δ(q − q0), decays like |λ2|^n, so the number of steps needed to reduce it is proportional to 1/|1 − λ2|, where λ2 is the nonunit eigenvalue of F nearest 1. However, the spectral gap is not, in general, the relevant quantity.

SLIDE 17

Error in Estimate of τ

An estimate of τ has a standard deviation ≈ √(M/N) τ, where M is the number of terms taken in the formula for τ. Therefore, use instead

τ ≈ 1 + 2 Σ_{k=1}^{N−1} w(k) C(k)/C(0),

where w(k) is a decreasing function known as a lag window. The tiny program acor of Jonathan Goodman (2009) uses a lag window that is 1 from 0 to roughly 1.4τ and decreases linearly from 1.4τ to 2.4τ. It requires the number of samples N to exceed 35τ, roughly.
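A self-consistent sketch of such an estimator, in the spirit of acor but not Goodman's actual code: w(k) = 1 out to c1·τ, decaying linearly to zero at c2·τ, with τ and the window determined together by fixed-point iteration.

```python
import numpy as np

def tau_lag_window(u, c1=1.4, c2=2.4, kmax=5000, iters=100):
    """Lag-window estimate of the integrated autocorrelation time."""
    u = np.asarray(u, dtype=float)
    N = len(u)
    d = u - u.mean()
    kmax = min(kmax, N // 2)
    C = np.array([np.dot(d[:N - k], d[k:]) / N for k in range(kmax)])
    rho = C / C[0]                    # autocorrelation function
    k = np.arange(1, kmax)
    tau = 1.0
    for _ in range(iters):
        w = np.clip((c2 * tau - k) / ((c2 - c1) * tau), 0.0, 1.0)
        tau_new = max(1.0 + 2.0 * np.sum(w * rho[1:]), 1.0)
        if abs(tau_new - tau) < 1e-3 * tau:
            return tau_new
        tau = tau_new
    return tau
```

On the AR(1) example above (φ = 0.9, τ = 19), this should return values near 19 once N exceeds roughly 35τ.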

SLIDE 18

Outline

  • Preliminaries
  • Thorough Sampling
  • Coping with Statistical Error
  • Reversible Samplers

SLIDE 19

Weakness of Existing Approach

Mixture of Gaussians:

ρ(q1, q2) ∝ exp(−½(16(q1 + 2)² + (q2 − 4)²)) + exp(−½((q1 − 2)² + 16 q2²)).

Observable function: u(q1, q2) = q1.

SLIDE 20

Estimate of τ as N Increases

Sampler: Euler-Maruyama Brownian dynamics (w/o rejections). Need to ensure thorough sampling.
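A sketch of this experiment, under the mixture reconstruction of slide 19 (the slide gives ρ only up to a constant): unadjusted Euler–Maruyama Brownian dynamics, q′ = q + ∆t F(q) + √(2∆t) ξ with F = −∇(−log ρ).

```python
import numpy as np

def force(q):
    """F = -grad(-log rho) for the two-component Gaussian mixture."""
    q1, q2 = q
    e1 = 0.5 * (16 * (q1 + 2) ** 2 + (q2 - 4) ** 2)
    e2 = 0.5 * ((q1 - 2) ** 2 + 16 * q2 ** 2)
    w1 = 1.0 / (1.0 + np.exp(np.clip(e1 - e2, -700, 700)))  # weight of well 1
    g1 = np.array([16 * (q1 + 2), q2 - 4])                  # grad of e1
    g2 = np.array([q1 - 2, 16 * q2])                        # grad of e2
    return -(w1 * g1 + (1 - w1) * g2)

rng = np.random.default_rng(1)
dt, N = 0.01, 100_000
q = np.array([2.0, 0.0])                  # start in one mode
us = np.empty(N)
for n in range(N):
    q = q + dt * force(q) + np.sqrt(2 * dt) * rng.normal(size=2)
    us[n] = q[0]                          # observable u(q1, q2) = q1
```

Feeding us into a τ estimator such as tau_lag_window above should reproduce the qualitative point of this slide: the estimated τ keeps growing as N increases and the second mode is discovered.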

SLIDE 21

Definition of Thorough Sampling:

for any subset A of state space, an estimate 1̄A of E[1A(Q)] = Pr(Q ∈ A) should satisfy

Var[1̄A − E[1A(Q)]] ≤ (1/4) tol².

SLIDE 22

Definition of Thorough Sampling:

for any subset A of state space, an estimate 1̄A of E[1A(Q)] = Pr(Q ∈ A) should satisfy Var[1̄A − E[1A(Q)]] ≤ (1/4) tol². Since

Var[1̄A − E[1A(Q)]] ≈ τA (1/N) Var[1A(Q)] ≤ (1/(4N)) τA

(using Var[1A(Q)] = p(1 − p) ≤ 1/4, where p = Pr(Q ∈ A)), it is enough to have (1/(4N)) τA ≤ (1/4) tol², i.e. N ≥ τA/tol², for all A.

SLIDE 23

Symmetries

There may be permutations P such that ρq(Pq) = ρq(q) and u(Pq) = u(q) for all interesting u. Then, consider only A for which 1A(Pq) = 1A(q).

SLIDE 24

Symmetries

There may be permutations P such that ρq(Pq) = ρq(q) and u(Pq) = u(q) for all interesting u. Then, consider only A for which 1A(Pq) = 1A(q). For simplicity, instead of only indicator functions, consider all functions in W = {u = u(q) | E[u(Q)] = 0, u(Pq) = u(q) for symmetries P}.

SLIDE 25

For More Explicit Notation

Introduce the inner product

⟨v, u⟩ = ∫ v(z) u(z) ρ(z) dz.

Note that ⟨1, u⟩ = E[u(Z)]. One can show by induction that E[v(Z0) u(Zk)] = ⟨F^k v, u⟩ and, in particular, C(k) = ⟨F^k u, u⟩.

SLIDE 26

Maximum Autocorrelation Time

Define thorough sampling using

τmax = max_{u ∈ W} ( 1 + 2 Σ_{k=1}^{+∞} ⟨F^k u, u⟩ / ⟨u, u⟩ ).
SLIDE 27

Spatial Discretization

Consider u(q) = aᵀu(q), where the ui ∈ W are given and the ai are unknown. Then

C(k) = ⟨F^k u, u⟩ = aᵀ Ck a, where Ck = ⟨F^k u, uᵀ⟩ = E[u(Q0) u(Qk)ᵀ],

and

τmax ≈ max_a (aᵀ K a) / (aᵀ C0 a), where K = C0 + 2 Σ_{k=1}^{+∞} Ck,

which is a generalized eigenvalue problem. For u, suggest ui(q) = qi.
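A sketch of the resulting computation with ui(q) = qi: estimate the matrices Ck from one trajectory, symmetrize, and solve the symmetric generalized eigenvalue problem with scipy. The rectangular lag window of width kmax is a simplification; a lag window as on slide 17 would be used in practice.

```python
import numpy as np
from scipy.linalg import eigh

def tau_max_estimate(traj, kmax):
    """traj: (N, d) array whose rows are u(q_n) = q_n. Returns the largest
    generalized eigenvalue of K a = tau C0 a and the maximizing vector a."""
    X = np.asarray(traj, dtype=float)
    X = X - X.mean(axis=0)
    N = len(X)
    C0 = X.T @ X / N
    K = C0.copy()
    for k in range(1, kmax + 1):
        Ck = X[:-k].T @ X[k:] / N     # estimated cross covariance C_k
        K += Ck + Ck.T                # Ck should be symmetric; symmetrize
    evals, evecs = eigh(K, C0)        # generalized symmetric eigenproblem
    return evals[-1], evecs[:, -1]
```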

SLIDE 28

Cross Covariance Matrices

C0 is symmetric positive definite. The eigenvalue problem is well conditioned if Ck is symmetric. This depends on the sampler.

SLIDE 29

Reversible MCMC Samplers

satisfy detailed balance:

ρ(z′|z) ρ(z) = ρ(z|z′) ρ(z′),

e.g.,

  • the Brownian sampler: one step of the Euler–Maruyama integrator for Brownian dynamics, with or without an MRRTT (Metropolis–Rosenbluth–Rosenbluth–Teller–Teller) acceptance test (a sketch of this one follows below),
  • hybrid Monte Carlo,
  • a generalized hybrid Monte Carlo step, followed by a momenta flip,
  • several steps with a reversible Langevin integrator, followed by a momenta flip.
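A sketch of the first item, the Brownian sampler with the MRRTT test (what statisticians call the Metropolis-adjusted Langevin algorithm), targeting the Gaussian mixture of slide 19:

```python
import numpy as np
from scipy.special import logsumexp

def neg_log_rho(q):
    """-log rho(q) + const for the Gaussian mixture of slide 19."""
    q1, q2 = q
    e1 = 0.5 * (16 * (q1 + 2) ** 2 + (q2 - 4) ** 2)
    e2 = 0.5 * ((q1 - 2) ** 2 + 16 * q2 ** 2)
    return -logsumexp([-e1, -e2])

def force(q, h=1e-5):
    """F = -grad(neg_log_rho), by central differences (a sketch; an
    analytic gradient would be preferable)."""
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = h
        g[i] = (neg_log_rho(q + e) - neg_log_rho(q - e)) / (2 * h)
    return -g

def brownian_sampler_step(q, dt, rng):
    """Euler-Maruyama proposal plus MRRTT accept/reject (reversible)."""
    mean_fwd = q + dt * force(q)
    qp = mean_fwd + np.sqrt(2 * dt) * rng.normal(size=q.shape)
    mean_bwd = qp + dt * force(qp)
    # log acceptance ratio rho(q') T(q|q') / (rho(q) T(q'|q)), where T is
    # Gaussian with variance 2*dt per coordinate
    log_a = (neg_log_rho(q) - neg_log_rho(qp)
             - np.sum((q - mean_bwd) ** 2) / (4 * dt)
             + np.sum((qp - mean_fwd) ** 2) / (4 * dt))
    return qp if np.log(rng.uniform()) < log_a else q
```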

SLIDE 30

Modified Reversible MCMC Samplers

The momenta flip is counterproductive: undoing the momenta flip seems to improve mixing. A modified reversible propagator couples the reversible substep zn−1 → z̄n with a substep zn = R(z̄n), where R satisfies R ◦ R = id and ρ ◦ R = ρ. For a momenta flip, R(q, p) = (q, −p).

SLIDE 31

An Empirical Comparison

Cancès, Legoll & Stoltz (2007)

Discrepancy for a pair of torsion angles in alkanes:

dimension             36      36      87      114
Langevin dynamics     0.034   0.016   0.014   0.017
HMC                   0.039   0.012   0.026   0.021
Brownian dynamics     0.079   0.023   0.040   0.031
Nosé–Hoover chains    0.103   0.046   0.029   0.035
Brownian sampler      0.104   0.034   0.048   0.061

(1 million samples for column 1, 10 million for the others.)

SLIDE 32

Discretized Langevin Dynamics

Let F(q) = ∇q log ρq(q) (the force). One step:

O: p := √(1 − γ∆t) p + √(γ∆t) R,  R ∼ N(0, I)
B: p := p + ½ ∆t F(q)
A: q := q + ∆t p
B: p := p + ½ ∆t F(q)
O: p := √(1 − γ∆t) p + √(γ∆t) R,  R ∼ N(0, I)

The dynamics has a precise stationary density; it differs from the desired density by O(∆t²).

Leimkuhler, Matthews & Stoltz (2015)

Such error is much smaller than statistical error. (The Euler–Maruyama integrator is a special case of this.)
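A direct transcription of one O-B-A-B-O step (a sketch; force computes F, e.g. as in the mixture example, and γ∆t ≤ 1 is assumed so the damping factor is real):

```python
import numpy as np

def obabo_step(q, p, dt, gamma, force, rng):
    """One step of the O-B-A-B-O splitting written above."""
    c, s = np.sqrt(1 - gamma * dt), np.sqrt(gamma * dt)
    p = c * p + s * rng.normal(size=p.shape)  # O: partial momentum refresh
    p = p + 0.5 * dt * force(q)               # B: half kick
    q = q + dt * p                            # A: drift
    p = p + 0.5 * dt * force(q)               # B: half kick
    p = c * p + s * rng.normal(size=p.shape)  # O: partial momentum refresh
    return q, p
```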

SLIDE 33

Symmetry of Cross Covariance Matrices

For a modified reversible propagator, the forward transfer operator is F = R F̄, where F̄ is the operator for the reversible substep and R is the operator for the substep zn = R(z̄n). It can be shown that ⟨F̄ g, f⟩ = ⟨g, F̄ f⟩ and ⟨R g, f⟩ = ⟨g, R f⟩. Additionally, assume R u(q) = u(q). Then the matrices Ck can be shown to be symmetric, modulo sampling error.

SLIDE 34

Outline

  • Preliminaries
  • Thorough Sampling
  • Coping with Statistical Error
  • Reversible Samplers

SLIDE 35

Simplest Neural Network

Fit the model u(q; x) = q3 tanh(q1 x + q2) + q4 to data yi ≈ u(q; xi):

− log ρ(q) = (β/2) Σ_{i=1}^{100} (yi − u(q; xi))² + (α/2) |q|² + const,

where β = 1.5 and α = 0.01. A Bayesian approach might use the mean E[u(Q; x)] as the approximation. Sampler: Euler–Maruyama Brownian dynamics w/o rejections.
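A sketch of this target density; the data set is not given in the slides, so xs and ys below are synthetic stand-ins, and reading the flattened formula as (β/2) and (α/2) is my reconstruction.

```python
import numpy as np

rng = np.random.default_rng(2)
xs = np.linspace(-3, 3, 100)                       # assumed inputs
ys = np.tanh(2 * xs) + 0.2 * rng.normal(size=100)  # assumed noisy targets
beta, alpha = 1.5, 0.01

def model(q, x):
    """u(q; x) = q3 * tanh(q1 * x + q2) + q4 (q is 0-indexed here)."""
    return q[2] * np.tanh(q[0] * x + q[1]) + q[3]

def neg_log_rho(q):
    """-log rho(q) + const: misfit term plus Gaussian prior."""
    resid = ys - model(q, xs)
    return 0.5 * beta * np.sum(resid ** 2) + 0.5 * alpha * np.dot(q, q)
```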

SLIDE 36

Data and Predicted Function

The given data are symmetric about x = 0; the predicted function lacks this symmetry because N is only 1 000 000.

SLIDE 37

Time series for u(q; x1)

SLIDE 38

Autocorrelation Function (ACF) for u(q; x1)

SLIDE 39

Integrated ACF for N = 10 000

SLIDE 40

Integrated ACF for N = 100 000

SLIDE 41

Integrated ACF for N = 1 000 000

SLIDE 42

ACF for N = 1 000 000 integrated 100 times further

SLIDE 43

Estimated τ as N Increases

(Figure legend: acor vs. our version.)
SLIDE 44

Outline

  • Preliminaries
  • Thorough Sampling
  • Coping with Statistical Error
  • Reversible Samplers

SLIDE 45

Reversible Samplers

If τmax were the maximum over all u(z), then

τmax = (1 + λ2)/(1 − λ2),

since for the corresponding eigenfunction C(k)/C(0) = λ2^k, so τ = 1 + 2 Σ_{k=1}^{∞} λ2^k = (1 + λ2)/(1 − λ2).

SLIDE 46

Mixture of Gaussians, Again

Estimated τ and τmax as N increases:

SLIDE 47

Simplest Neural Network, Again

Fitting model u(q; x) = q3 tanh(q1x + q2) + q4 to data yi ≈ u(q; xi).

SLIDE 48

Time Series for Switch Location −q2/q1

SLIDE 49

Histogram for Switch Location −q2/q1

SLIDE 50

Autocorrelation Function (ACF) for u(q; x1)

SLIDE 51

ACF for Switch Location −q2/q1

SLIDE 52

Estimated τ and τmax as N Increases

SLIDE 53

Summarizing

Let τ denote the integrated autocorrelation time.

  • Estimating τ is considered essential.
  • Greater reliability is possible by trying to estimate the worst-case τ.
  • Better estimates of τ are needed.

SLIDE 54

Thank you