Monte Carlo simulation inspired by computational optimization

  1. Monte Carlo simulation inspired by computational optimization
     Colin Fox (fox@physics.otago.ac.nz), with Al Parker and John Bardsley
     MCQMC, Feb 2012, Sydney

  2. Sampling from π(x)
     Consider: x is high-dimensional (10^4 – 10^8) and π is expensive to compute.
     MH-MCMC repeatedly simulates a fixed transition kernel P with πP = π via detailed balance (self-adjoint in π).
     n-step distribution: π^(n) = π^(n−1) P = π^(0) P^n
     n-step error in distribution: π^(n) − π = (π^(0) − π) P^n
     error polynomial: P_n = (I − (I − P))^n, with eigenvalues (1 − λ)^n
     Hence convergence is geometric. This was the state of stationary linear solvers in the 1950s; it is now considered very slow.
     Polynomial acceleration: π^(0) Σ_{i=1}^n α_i P^i (choose x^(i) with probability α_i), with error polynomial Q_n(λ) = Σ_{i=1}^n α_i (1 − λ)^i
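
     A small numerical illustration (not from the talk; the 3-state kernel below is hypothetical): the n-step error ‖π^(0) P^n − π‖ decays geometrically, at a rate set by the second-largest eigenvalue modulus of P.

```python
# A toy 3-state reversible kernel (hypothetical, for illustration only).
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])     # symmetric => detailed balance w.r.t. uniform pi

evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()                       # stationary distribution (uniform here)

rate = sorted(np.abs(evals))[-2]     # second-largest eigenvalue modulus
dist = np.array([1.0, 0.0, 0.0])     # initial distribution pi^(0)
for n in range(1, 11):
    dist = dist @ P                  # pi^(n) = pi^(n-1) P
    print(f"n={n:2d}  error={np.linalg.norm(dist - pi, 1):.3e}  rate^n={rate**n:.3e}")
```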

  3. Gibbs sampling
     Gibbs sampling^a repeatedly samples from (block) conditional distributions.
     A forward and reverse sweep simulates a reversible kernel, as in MH-MCMC.
     Normal distributions:
       π(x) = (det(A) / (2π)^n)^{1/2} exp(−(1/2) x^T A x + b^T x)
     with precision matrix A and covariance matrix Σ = A^{-1} (both SPD); wlog treat zero mean.
     Particularly interested in the case where A is sparse (GMRF) and n is large.
     When π^(0) is also normal, then so is the n-step distribution: A^(n) → A, Σ^(n) → Σ.
     In what sense is "stochastic relaxation" related to "relaxation"? What decomposition of A is this performing?
     ^a Glauber 1963 (heat-bath algorithm), Turcin 1971, Geman and Geman 1984
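
     A minimal sketch (hypothetical 3x3 precision matrix, dense NumPy, zero mean) of component-wise Gibbs sampling from N(0, A^{-1}): each coordinate is drawn from its normal full conditional, whose precision is A_ii and whose mean is −(1/A_ii) Σ_{j≠i} A_ij y_j.

```python
# A minimal component-wise Gibbs sampler for N(0, A^{-1});
# the 3x3 precision matrix A below is a hypothetical example.
import numpy as np

def gibbs_sweep(A, y, rng):
    """One forward sweep: y_i | y_{-i} ~ N(-sum_{j!=i} A_ij y_j / A_ii, 1/A_ii)."""
    for i in range(len(y)):
        cond_mean = -(A[i] @ y - A[i, i] * y[i]) / A[i, i]
        y[i] = cond_mean + rng.standard_normal() / np.sqrt(A[i, i])
    return y

rng = np.random.default_rng(0)
A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])          # sparse SPD precision (GMRF-like)
y = np.zeros(3)
samples = np.array([gibbs_sweep(A, y, rng).copy() for _ in range(20000)])
print("empirical covariance:\n", np.cov(samples[1000:].T))
print("target covariance A^{-1}:\n", np.linalg.inv(A))
```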

  4. Gibbs samplers and equivalent linear solvers
     Optimization:  Gauss-Seidel | Cheby-GS    | CG/Lanczos
     Sampling:      Gibbs        | Cheby-Gibbs | Lanczos

  5. Matrix formulation of Gibbs sampling from N(0, A^{-1})
     Let y = (y_1, y_2, ..., y_n)^T. Component-wise Gibbs updates each component in sequence from the (normal) conditional distributions.
     One 'sweep' over all n components can be written
       y^(k+1) = −D^{-1} L y^(k+1) − D^{-1} L^T y^(k) + D^{-1/2} z^(k)
     where D = diag(A), L is the strictly lower triangular part of A, and z^(k) ∼ N(0, I). Equivalently,
       y^(k+1) = G y^(k) + c^(k)
     where c^(k) is iid 'noise' with zero mean and finite covariance.
     Stochastic AR(1) process = first-order stationary iteration plus noise.
     Goodman & Sokal, 1989
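
     The same sweep in its matrix form, as a sketch using the same hypothetical A as above: rearranging the update gives the lower-triangular solve (D + L) y^(k+1) = −L^T y^(k) + D^{1/2} z^(k).

```python
# Matrix form of one Gibbs sweep: (D + L) y_new = -L^T y + D^{1/2} z.
# Same hypothetical A as above; SciPy is used for the triangular solve.
import numpy as np
from scipy.linalg import solve_triangular

def gibbs_sweep_matrix(A, y, rng):
    d = np.diag(A)                                # D as a vector
    L = np.tril(A, k=-1)                          # strictly lower triangular part
    rhs = -L.T @ y + np.sqrt(d) * rng.standard_normal(len(y))
    return solve_triangular(L + np.diag(d), rhs, lower=True)

rng = np.random.default_rng(1)
A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
y, ys = np.zeros(3), []
for _ in range(20000):
    y = gibbs_sweep_matrix(A, y, rng)
    ys.append(y)
print("empirical covariance:\n", np.cov(np.array(ys)[1000:].T))
print("target covariance A^{-1}:\n", np.linalg.inv(A))
```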

  6. Matrix splitting form of stationary iterative methods
     The splitting A = M − N converts the linear system Ax = b to Mx = Nx + b. If M is nonsingular,
       x = M^{-1} N x + M^{-1} b.
     Iterative methods compute successively better approximations by
       x^(k+1) = M^{-1} N x^(k) + M^{-1} b = G x^(k) + g
     Many splittings use the terms in A = L + D + U. Gauss-Seidel sets M = L + D:
       x^(k+1) = −D^{-1} L x^(k+1) − D^{-1} L^T x^(k) + D^{-1} b
     Spot the similarity to Gibbs:
       y^(k+1) = −D^{-1} L y^(k+1) − D^{-1} L^T y^(k) + D^{-1/2} z^(k)
     Goodman & Sokal 1989; Amit & Grenander 1991
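
     For comparison, a sketch of the Gauss-Seidel solver written with the same splitting M = D + L (hypothetical A and b): the only difference from the Gibbs sweep above is the deterministic right-hand side b in place of the noise D^{1/2} z.

```python
# Gauss-Seidel solve of Ax = b with the splitting M = D + L (hypothetical A, b).
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
b = np.array([1.0, 0.0, 1.0])
L, M = np.tril(A, k=-1), np.tril(A)     # strictly lower part, and M = D + L
x = np.zeros(3)
for _ in range(50):
    x = solve_triangular(M, -L.T @ x + b, lower=True)   # (D+L) x_new = -L^T x + b
print("Gauss-Seidel:", x, "  exact:", np.linalg.solve(A, b))
```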

  7. Gibbs converges ⇐⇒ solver converges
     Theorem 1. Let A = M − N with M invertible. The stationary linear solver
       x^(k+1) = M^{-1} N x^(k) + M^{-1} b = G x^(k) + M^{-1} b
     converges, if and only if the random iteration
       y^(k+1) = M^{-1} N y^(k) + M^{-1} c^(k) = G y^(k) + M^{-1} c^(k)
     converges in distribution. Here c^(k) iid ∼ π_n has zero mean and finite variance.
     Proof: both converge iff ̺(G) < 1. □
     Convergent splittings generate convergent (generalized) Gibbs samplers.
     The mean converges with asymptotic convergence factor ̺(G), the covariance with ̺(G)^2.
     Young 1971 Thm 3-5.1; Duflo 1997 Thm 2.3.18-4; Goodman & Sokal 1989; Galli & Gao 2001; Fox & Parker 2012
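
     A quick numerical check of Theorem 1 for the Gauss-Seidel/Gibbs splitting of the hypothetical A used above: both iterations share the operator G = M^{-1} N, so a single spectral radius governs the solver and the sampler.

```python
# Theorem 1, numerically: both iterations share G = M^{-1} N (hypothetical A).
import numpy as np

A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
M = np.tril(A)                        # Gauss-Seidel / Gibbs splitting, M = D + L
N = M - A                             # so that A = M - N
G = np.linalg.solve(M, N)             # iteration operator G = M^{-1} N
rho = max(abs(np.linalg.eigvals(G)))
print("rho(G) =", rho)                # < 1: solver and sampler both converge
print("asymptotic factor for the mean:      ", rho)
print("asymptotic factor for the covariance:", rho**2)
```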

  8. Some not so common Gibbs samplers for N(0, A^{-1})
     splitting/sampler   M                                  converges if       Var(c^(k)) = M^T + N
     Richardson          (1/ω) I                            0 < ω < 2/̺(A)     (2/ω) I − A
     Jacobi              D                                  A SDD              2D − A
     GS/Gibbs            D + L                              always             D
     SOR/B&F             (1/ω) D + L                        0 < ω < 2          ((2 − ω)/ω) D
     SSOR/REGS           (ω/(2 − ω)) M_SOR D^{-1} M_SOR^T   0 < ω < 2          (ω/(2 − ω)) (M_SOR D^{-1} M_SOR^T + N_SOR^T D^{-1} N_SOR)
     Want: convenient to solve Mu = r and to sample from N(0, M^T + N).
     Relaxation parameter ω can accelerate Gibbs. SSOR is a forwards and backwards sweep of SOR, to give a symmetric splitting.
     SOR: Adler 1981; Barone & Frigessi 1990; Amit & Grenander 1991. SSOR: Roberts & Sahu 1997
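
     A sketch of one row of the table, the SOR/B&F sampler (hypothetical A, ω = 1.2): M = (1/ω) D + L, and the injected noise must have covariance M^T + N = ((2 − ω)/ω) D for the sampler to target N(0, A^{-1}).

```python
# SOR/B&F sampler: M = (1/omega) D + L, noise covariance ((2 - omega)/omega) D.
import numpy as np
from scipy.linalg import solve_triangular

def sor_sampler_sweep(A, y, omega, rng):
    d = np.diag(A)
    M = np.diag(d) / omega + np.tril(A, k=-1)
    N = M - A
    c = np.sqrt((2.0 - omega) / omega * d) * rng.standard_normal(len(y))
    return solve_triangular(M, N @ y + c, lower=True)

rng = np.random.default_rng(2)
A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
y, omega, ys = np.zeros(3), 1.2, []          # 0 < omega < 2
for _ in range(30000):
    y = sor_sampler_sweep(A, y, omega, rng)
    ys.append(y)
print("empirical covariance:\n", np.cov(np.array(ys)[2000:].T))
print("target covariance A^{-1}:\n", np.linalg.inv(A))
```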

  9. A closer look at convergence
     To sample from N(μ, A^{-1}) where Aμ = b:
     Split A = M − N, M invertible. G = M^{-1} N, and c^(k) iid ∼ N(0, M^T + N).
     The iteration
       y^(k+1) = G y^(k) + M^{-1} (c^(k) + b)
     implies
       E[y^(m)] − μ = G^m (E[y^(0)] − μ)
     and
       Var(y^(m)) − A^{-1} = G^m (Var(y^(0)) − A^{-1}) (G^m)^T
     (hence asymptotic average convergence factors ̺(G) and ̺(G)^2).
     Errors go down as the polynomial
       P_m(I − G) = (I − (I − G))^m = (I − M^{-1} A)^m,   P_m(λ) = (1 − λ)^m
     The solver and the sampler have exactly the same error polynomial.
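
     A one-step check (not on the slide) of why the noise covariance M^T + N makes A^{-1} the fixed point of the covariance recursion Var(y^(k+1)) = G Var(y^(k)) G^T + M^{-1}(M^T + N) M^{-T}, using N = M − A and A = A^T:

```latex
\begin{align*}
G A^{-1} G^T + M^{-1}(M^T + N) M^{-T}
  &= M^{-1}\big[(M - A) A^{-1} (M - A)^T + M^T + M - A\big] M^{-T} \\
  &= M^{-1}\big[M A^{-1} M^T - M - M^T + A + M^T + M - A\big] M^{-T} \\
  &= M^{-1}\, M A^{-1} M^T\, M^{-T} \;=\; A^{-1}.
\end{align*}
```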

  10. Controlling the error polynomial
      The splitting
        A = (1/τ) M − ((1/τ − 1) M + N)
      gives the iteration operator
        G_τ = I − τ M^{-1} A
      and error polynomial P_m(λ) = (1 − τλ)^m.
      A sequence of parameters τ_1, τ_2, ..., τ_m gives the error polynomial
        P_m(λ) = ∏_{l=1}^m (1 − τ_l λ)
      ... so we can choose the zeros of P_m!
      This gives a non-stationary solver ≡ a non-homogeneous Markov chain.
      (Equivalently, one can post-process the chain by taking linear combinations of states.)
      Golub & Varga 1961, Golub & van Loan 1989, Axelsson 1996, Saad 2003, Fox & Parker 2012
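
      A sketch of choosing the zeros (hypothetical 3x3 A; Richardson splitting M = I for simplicity, so M^{-1}A = A): taking τ_l = 1/λ_l at the eigenvalues of A makes P_m(λ) = ∏_l (1 − τ_l λ) vanish on the whole spectrum, so the non-stationary solver reaches the exact solution in n steps.

```python
# Non-stationary iteration x_new = x + tau_l (b - A x) with chosen tau_l.
import numpy as np

A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
b = np.array([1.0, 0.0, 1.0])
taus = 1.0 / np.linalg.eigvalsh(A)           # zeros of the error polynomial

x = np.zeros(3)
for tau in taus:
    x = x + tau * (b - A @ x)
print("error after", len(taus), "steps:", np.linalg.norm(x - np.linalg.solve(A, b)))
```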

  11. The best (Chebyshev) polynomial
      [Plot: 10 iterations give a factor of roughly 300 improvement.]
      Choose
        1/τ_l = (λ_n + λ_1)/2 + ((λ_n − λ_1)/2) cos(π (2l + 1)/(2p)),   l = 0, 1, 2, ..., p − 1
      where λ_1, λ_n are the extreme eigenvalues of M^{-1} A.
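
      The slide's formula as a small helper, returning the τ_l as reciprocals of the Chebyshev points on [λ_1, λ_n] (the eigenvalue bounds passed in below are hypothetical).

```python
import numpy as np

def chebyshev_taus(lam_min, lam_max, p):
    # 1/tau_l = (lam_max + lam_min)/2 + (lam_max - lam_min)/2 * cos(pi (2l+1)/(2p))
    l = np.arange(p)
    inv_tau = (lam_max + lam_min) / 2 + (lam_max - lam_min) / 2 * np.cos(np.pi * (2 * l + 1) / (2 * p))
    return 1.0 / inv_tau

print(chebyshev_taus(0.1, 2.0, 10))   # hypothetical extreme eigenvalues of M^{-1}A
```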

  12. Second-order accelerated sampler
      The first-order accelerated iteration turns out to be unstable.
      Numerical stability, and optimality at each step, is given by the second-order iteration
        y^(k+1) = (1 − α_k) y^(k−1) + α_k y^(k) + α_k τ_k M^{-1} (c^(k) − A y^(k))
      with α_k and τ_k chosen so that the error polynomial satisfies the Chebyshev recursion.
      Theorem 2. The 2nd-order solver converges ⇒ the 2nd-order sampler converges (given the correct noise distribution).
      The error polynomial is optimal, at each step, for both the mean and the covariance.
      The asymptotic average reduction factor (Axelsson 1996) is
        σ = (1 − √(λ_1/λ_n)) / (1 + √(λ_1/λ_n))
      Axelsson 1996; Fox & Parker 2012
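
      For a sense of scale (hypothetical numbers, not from the talk): suppose λ_1/λ_n = 10^{-4}. Then

```latex
\sigma = \frac{1 - \sqrt{10^{-4}}}{1 + \sqrt{10^{-4}}} = \frac{0.99}{1.01} \approx 0.980
```

      so roughly ln(10^{-6})/ln(0.980) ≈ 700 accelerated steps reduce the error by a factor of 10^6, whereas an unaccelerated reduction factor on the order of 1 − 10^{-4} would need roughly 1.4 × 10^5 steps.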

  13. Algorithm 1: Chebyshev accelerated SSOR sampling from N(0, A^{-1})
      input:  the SSOR splitting M, N of A; smallest eigenvalue λ_min of M^{-1}A;
              largest eigenvalue λ_max of M^{-1}A; relaxation parameter ω;
              initial state y^(0); k_max
      output: y^(k) approximately distributed as N(0, A^{-1})
      set γ = (2/ω − 1)^{1/2}, δ = ((λ_max − λ_min)/4)^2, θ = (λ_max + λ_min)/2;
      set α = 1, β = 2/θ, τ = 1/θ, τ_c = 2τ − 1;
      for k = 0, 1, ..., k_max do
          if k = 0 then
              b = 2α − 1, a = τ_c b, κ = τ;
          else
              b = 2(1 − α)/β + 1, a = τ_c + (b − 1)(1/τ + 1/κ − 1), κ = β + (1 − α)κ;
          end
          sample z ∼ N(0, I);  c = γ b^{1/2} D^{1/2} z;
          x = y^(k) + M^{-1}(c − A y^(k));
          sample z ∼ N(0, I);  c = γ a^{1/2} D^{1/2} z;
          w = x − y^(k) + M^{-T}(c − A x);
          y^(k+1) = α (y^(k) − y^(k−1) + τ w) + y^(k−1);
          β = (θ − βδ)^{-1};  α = θβ;
      end
