Stochastic Gradient Annealed Importance Sampling, by Scott Cameron


  1. Stochastic Gradient Annealed Importance Sampling. Scott Cameron, Hans Eggers, Steve Kroon. Stellenbosch University, NITheP.

  2. Motivation: stochastic optimization

  3. Motivation. Goal: efficient large-scale marginal likelihood estimation using mini-batches.

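  4. Marginal Likelihood (Evidence)
     Consider a Bayesian model with data $\mathcal{D} = \{y_n\}_{n=1}^N$: $p(\mathcal{D}, \theta) = p(\theta) \prod_n p(y_n \mid \theta)$
     Posterior given by Bayes' theorem: $p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$
     Marginal likelihood: $Z := p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta$
     Posterior predictive: $p(y' \mid \mathcal{D}) = \int p(y' \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$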

  5. Model Comparison/Combination
     Posterior over models $M_1, M_2, \dots$: $\frac{P(M_1 \mid \mathcal{D})}{P(M_2 \mid \mathcal{D})} = \frac{Z_1\, p(M_1)}{Z_2\, p(M_2)}$
     $M_1$ is a 'better' model than $M_2$ if $Z_1 \gg Z_2$
     Combined predictions: $p(y' \mid \mathcal{D}) = \frac{\sum_i p(y' \mid \mathcal{D}, M_i)\, Z_i\, p(M_i)}{\sum_i Z_i\, p(M_i)}$
     Weights models in proportion to how well they describe the data
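
The model-averaging formula on this slide can be sketched numerically. This is a minimal illustration, assuming the evidences are available as log values; the function names and the numbers are hypothetical and not from the talk.

```python
import numpy as np

def model_posteriors(log_Z, log_prior):
    """Posterior model probabilities P(M_i | D), proportional to Z_i p(M_i)."""
    log_joint = np.asarray(log_Z, dtype=float) + np.asarray(log_prior, dtype=float)
    log_joint = log_joint - log_joint.max()   # subtract the max so exp() cannot overflow
    weights = np.exp(log_joint)
    return weights / weights.sum()

def combined_prediction(pred, log_Z, log_prior):
    """Model-averaged predictive: sum_i p(y' | D, M_i) P(M_i | D)."""
    return float(np.dot(model_posteriors(log_Z, log_prior), pred))

# Hypothetical numbers, for illustration only.
log_Z = [-1050.0, -1053.0]                    # log evidences of two models
log_prior = np.log([0.5, 0.5])                # equal prior model probabilities
post = model_posteriors(log_Z, log_prior)
mix = combined_prediction([0.2, 0.4], log_Z, log_prior)
```

Working with log evidences matters in practice because $Z$ itself usually underflows double precision for even moderately sized data sets.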

  6. Why is this difficult?
     Example model: $\mu \sim \mathcal{N}(0, 1)$, $y_n \sim \mathcal{N}(\mu, 1)$
     Naive estimator: $\hat{Z} = \frac{1}{M} \sum_{i=1}^{M} p(\mathcal{D} \mid \mu_i)$, with $\mu_i \sim p(\mu)$

  7. Why is this difficult? Simple estimators consistently underestimate/overestimate. [Figure panels: prior sampling; harmonic mean]
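
The example model from the preceding slide has a closed-form evidence, which makes it a convenient check for the naive prior-sampling estimator. The sketch below assumes synthetic data with true mean 1; the function names, data, and sample sizes are illustrative choices, not taken from the talk.

```python
import numpy as np

def log_lik(mu, y):
    """log p(D | mu) for y_n ~ N(mu, 1); mu may be an array of samples."""
    mu = np.atleast_1d(mu)
    sq = ((y[None, :] - mu[:, None]) ** 2).sum(axis=1)
    return -0.5 * sq - 0.5 * y.size * np.log(2 * np.pi)

def exact_log_Z(y):
    """Closed-form log evidence: y ~ N(0, I + 11^T) once mu is integrated out."""
    N, s = y.size, y.sum()
    quad = (y ** 2).sum() - s ** 2 / (N + 1)
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * np.log(N + 1) - 0.5 * quad

def prior_sampling_log_Z(y, M, rng):
    """Naive estimator: log of (1/M) sum_i p(D | mu_i), with mu_i ~ p(mu)."""
    ll = log_lik(rng.standard_normal(M), y)
    return ll.max() + np.log(np.mean(np.exp(ll - ll.max())))

rng = np.random.default_rng(0)
y = 1.0 + rng.standard_normal(100)            # synthetic data (assumed)
print(exact_log_Z(y), prior_sampling_log_Z(y, 1000, rng))
```

As the data set grows, the posterior narrows relative to the prior, so fewer and fewer prior samples land where the likelihood is appreciable and the estimate becomes dominated by a handful of terms.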

  8. Annealed Importance Sampling
     Adiabatically decrease temperature: $0 = \lambda_0 < \dots < \lambda_T = 1$, $f_t(\theta) = p(\mathcal{D} \mid \theta)^{\lambda_t}\, p(\theta)$
     Update particles with HMC (Hamiltonian Monte Carlo): $U_t(\theta) = -\lambda_t \log p(\mathcal{D} \mid \theta) - \log p(\theta)$
     Iterated importance sampling: $w_i^{(t)} \leftarrow w_i^{(t-1)}\, p(\mathcal{D} \mid \theta_i^{(t-1)})^{\lambda_t - \lambda_{t-1}}$
     Estimator: $\hat{Z} = \frac{1}{M} \sum_{i=1}^{M} w_i^{(T)}$
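
A minimal sketch of the annealing loop, applied to the Gaussian example model ($\mu \sim \mathcal{N}(0,1)$, $y_n \sim \mathcal{N}(\mu,1)$). Two simplifications are assumptions of this sketch rather than the talk's method: random-walk Metropolis moves stand in for HMC, and the schedule is a fixed linear grid. All names and tuning values are illustrative.

```python
import numpy as np

def log_lik(mu, y):
    """log p(D | mu) for each particle mu_i, with y_n ~ N(mu, 1)."""
    sq = ((y[None, :] - mu[:, None]) ** 2).sum(axis=1)
    return -0.5 * sq - 0.5 * y.size * np.log(2 * np.pi)

def log_prior(mu):
    """log p(mu) for the standard normal prior."""
    return -0.5 * mu ** 2 - 0.5 * np.log(2 * np.pi)

def ais_log_Z(y, M=500, T=200, mh_steps=2, step=0.3, seed=0):
    rng = np.random.default_rng(seed)
    lambdas = np.linspace(0.0, 1.0, T + 1)    # 0 = lambda_0 < ... < lambda_T = 1
    mu = rng.standard_normal(M)               # particles drawn from the prior
    log_w = np.zeros(M)
    for t in range(1, T + 1):
        # Weight update uses the particles from step t - 1.
        log_w += (lambdas[t] - lambdas[t - 1]) * log_lik(mu, y)
        # Metropolis moves that leave f_t invariant (stand-in for HMC).
        for _ in range(mh_steps):
            prop = mu + step * rng.standard_normal(M)
            log_ratio = (lambdas[t] * (log_lik(prop, y) - log_lik(mu, y))
                         + log_prior(prop) - log_prior(mu))
            accept = np.log(rng.random(M)) < log_ratio
            mu = np.where(accept, prop, mu)
    # log of (1/M) sum_i w_i, computed stably in log space.
    return log_w.max() + np.log(np.mean(np.exp(log_w - log_w.max())))

rng = np.random.default_rng(1)
y = 1.0 + rng.standard_normal(20)             # synthetic data (assumed)
print(ais_log_Z(y))
```

Every temperature step evaluates the likelihood on the full data set, both for the weights and for the particle moves, which is the cost the scalability discussion targets.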

  10. Problems with Scalability
      Accurate estimates require $T \propto |\mathcal{D}|$
      (1) HMC needs likelihood gradients, $O(|\mathcal{D}|)$
      (2) Importance weights need the likelihood, $O(|\mathcal{D}|)$
      More or less $O(|\mathcal{D}|^2)$ complexity

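  11. Stochastic Gradient HMC
      Simulate Langevin dynamics: $\dot{\theta} = v$, $\dot{v} = -\nabla U(\theta) - \gamma v + \sqrt{2\gamma}\, \xi$, with $\langle \xi(t)\, \xi(t') \rangle = \delta(t - t')$
      Fokker–Planck equation (with $H(\theta, v) = U(\theta) + \tfrac{1}{2} v^2$): $\frac{\partial p}{\partial t} = \partial^T A \{ p\, \partial H + \partial p \}$, $A = \begin{pmatrix} 0 & -I \\ I & \gamma \end{pmatrix}$
      Canonical ensemble: $p_\infty(\theta, v) = \frac{1}{Z} e^{-H(\theta, v)}$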

  12. Stochastic Gradient HMC
      Euler–Maruyama discretization: $\Delta\theta = v$, $\Delta v = -\eta \nabla \hat{U}(\theta) - \alpha v + \mathcal{N}(0, 2(\alpha - \hat{\beta})\eta)$
      Mini-batch energy estimate: $\hat{U}(\theta) = -\frac{|\mathcal{D}|}{|B|} \sum_{y \in B} \log p(y \mid \theta) - \log p(\theta)$
      Time complexity: $O(|B|) \ll O(|\mathcal{D}|)$
      Solves (1)
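
The discretized update can be sketched for the Gaussian example model ($\theta \sim \mathcal{N}(0,1)$, $y_n \sim \mathcal{N}(\theta,1)$). This is an illustration under stated assumptions: $\hat{\beta}$ is set to 0 (so mini-batch gradient noise is not compensated and the sampled variance is inflated), and the step size, friction, batch size, and function names are hypothetical choices.

```python
import numpy as np

def grad_U_hat(theta, batch, n_data):
    """Mini-batch gradient of U for y_n ~ N(theta, 1) with prior theta ~ N(0, 1)."""
    return -(n_data / batch.size) * np.sum(batch - theta) + theta

def sghmc(y, n_steps=6000, batch_size=50, eta=1e-4, alpha=0.1, beta_hat=0.0, seed=0):
    rng = np.random.default_rng(seed)
    theta, v = 0.0, 0.0
    noise_std = np.sqrt(2.0 * (alpha - beta_hat) * eta)
    samples = np.empty(n_steps)
    for k in range(n_steps):
        batch = rng.choice(y, size=batch_size, replace=False)
        theta = theta + v                                   # Delta theta = v
        v = (v - eta * grad_U_hat(theta, batch, y.size)     # stochastic gradient force
             - alpha * v                                    # friction
             + noise_std * rng.standard_normal())           # injected noise
        samples[k] = theta
    return samples

rng = np.random.default_rng(2)
y = 1.0 + rng.standard_normal(1000)           # synthetic data (assumed)
samples = sghmc(y)
print(samples[1000:].mean(), y.sum() / (y.size + 1))   # sampled vs exact posterior mean
```

Each step touches only `batch_size` data points, so the per-step cost is independent of the size of the data set.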

  14. Comparison of MCMC Trajectories [Figure panels: RWMH, HMC, SGLD, SGHMC]

  15. Bayesian Updating/Online Estimation
      Predictive distributions: $Z = \prod_n p(y_n \mid y_{<n}) = \prod_n \int p(y_n \mid \theta)\, p(\theta \mid y_{<n})\, d\theta$
      Estimate $p(y_n \mid y_{<n})$ with AIS: $\theta_i^{(n)}, \tilde{w}_i^{(n)} \leftarrow \mathrm{AIS}(y_n, \theta_i^{(n-1)})$
      Marginal likelihood: $\hat{Z} = \frac{1}{M} \sum_{i=1}^{M} \prod_n \tilde{w}_i^{(n)}$
      Solves (2)
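
The factorization of $Z$ into one-step-ahead predictive densities can be checked exactly on the Gaussian example model ($\mu \sim \mathcal{N}(0,1)$, $y_n \sim \mathcal{N}(\mu,1)$), where every predictive is available in closed form. The sketch below uses conjugate updates in place of AIS, so it illustrates only the identity, not the estimator; the function names are illustrative.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def sequential_log_Z(y):
    """Sum over n of log p(y_n | y_<n), updating the posterior one point at a time."""
    mean, var, total = 0.0, 1.0, 0.0          # posterior over mu, starting at the prior
    for y_n in y:
        total += log_normal_pdf(y_n, mean, var + 1.0)   # predictive is N(mean, var + 1)
        new_var = 1.0 / (1.0 / var + 1.0)                # conjugate Gaussian update
        mean = new_var * (mean / var + y_n)
        var = new_var
    return total

def batch_log_Z(y):
    """Closed-form log evidence computed from all data at once."""
    N, s = y.size, y.sum()
    quad = (y ** 2).sum() - s ** 2 / (N + 1)
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * np.log(N + 1) - 0.5 * quad

rng = np.random.default_rng(0)
y = 1.0 + rng.standard_normal(50)             # synthetic data (assumed)
print(sequential_log_Z(y), batch_log_Z(y))
```

The two values agree up to floating-point error, which is why the evidence can be accumulated one observation at a time.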

  17. Stochastic Gradient Annealed Importance Sampling
      Intermediate distributions: $f_n^{(\lambda)}(\theta) = p(y_n \mid \theta)^{\lambda} \left[ \prod_{k<n} p(y_k \mid \theta) \right] p(\theta)$
      Update particles with SGHMC: $\hat{U}_n^{(\lambda)}(\theta) = -\lambda \log p(y_n \mid \theta) - \frac{n-1}{|B|} \sum_{y \in B} \log p(y \mid \theta) - \log p(\theta)$
      Importance weights: $w_i^{(t)} \leftarrow w_i^{(t-1)}\, p(y_n \mid \theta_i^{(t-1)})^{\lambda_t - \lambda_{t-1}}$
      ML estimator: $\hat{Z} = \frac{1}{M} \sum_{i=1}^{M} w_i^{(T)}$
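
The mini-batch energy $\hat{U}_n^{(\lambda)}$ can be written out for the Gaussian example model. This sketch assumes the batch $B$ is drawn from the $n-1$ previously seen points and is passed in as an index array; the function names and that calling convention are hypothetical.

```python
import numpy as np

def log_lik_point(y, theta):
    """log p(y | theta) for y ~ N(theta, 1); y may be an array."""
    return -0.5 * (y - theta) ** 2 - 0.5 * np.log(2 * np.pi)

def log_prior(theta):
    """log p(theta) for the standard normal prior."""
    return -0.5 * theta ** 2 - 0.5 * np.log(2 * np.pi)

def energy_estimate(theta, lam, y, n, batch_idx):
    """U_hat_n^(lambda)(theta); n is 1-based, batch_idx indexes into y[:n-1]."""
    current = -lam * log_lik_point(y[n - 1], theta)       # tempered term for the new point
    if n > 1:
        batch = y[batch_idx]                               # non-empty subset of past data
        past = -((n - 1) / batch.size) * np.sum(log_lik_point(batch, theta))
    else:
        past = 0.0                                         # no past data for the first point
    return current + past - log_prior(theta)

rng = np.random.default_rng(0)
y = 1.0 + rng.standard_normal(200)            # synthetic data (assumed)
n = 101                                       # currently annealing in the 101st point
batch_idx = rng.choice(n - 1, size=10, replace=False)
print(energy_estimate(0.8, 0.5, y, n, batch_idx))
```

When the batch contains all $n-1$ past points the scaling factor is 1 and the estimate coincides with the exact energy; for smaller batches the rescaled sum is an unbiased estimate of it.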

  18. Results: Gaussian mixture model
      • vs nested sampling
      • vs annealed importance sampling

  19. Parameter sensitivity: adaptive annealing schedule
      • Blue ≈ no annealing steps

  20. Distribution Shift: data may change over time [Figure panels: $1 \le n \le 10^3$; $10^3 < n \le 10^4$; $10^4 < n \le 10^5$; total]

  21. Distribution Shift: dashed lines = shuffled data

  22. Thank You!
      [1] Cameron, S.A.; Eggers, H.C.; Kroon, S. Stochastic Gradient Annealed Importance Sampling for Efficient Online Marginal Likelihood Estimation. Entropy 21.11 (2019).
      [2] Chen, T.; Fox, E.; Guestrin, C. Stochastic Gradient Hamiltonian Monte Carlo. ICML Proceedings vol. 5. (2014).
      Funded by NITheP (National Institute of Theoretical Physics). Paper sponsored by MaxEnt 2019. Big thanks to Hans and Steve!

  23. Extra Slides

  24. SGAIS
      Algorithm 1: Stochastic Gradient Annealed Importance Sampling
        ∀i: sample $\theta_i \sim p(\theta)$
        ∀i: $w_i \leftarrow 1$
        for $n = 1, \dots, N$ do
          $\lambda \leftarrow 0$
          while $\lambda < 1$ do
            $\Delta \leftarrow \arg\min_\Delta [\mathrm{ESS}(\Delta) - \mathrm{ESS}^*]$
            $\lambda \leftarrow \lambda + \Delta$
            ∀i: $w_i \leftarrow w_i\, p(y_n \mid \theta_i)^{\Delta}$   ▷ optionally resample particles
            ∀i: $\theta_i \leftarrow \mathrm{SGHMC}(\theta_i, \hat{U}_n^{(\lambda)})$
          end while
        end for
        return $\hat{Z} = \frac{1}{M} \sum_i w_i$
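
The adaptive step-size line of the algorithm can be sketched as a bisection search. Several choices here are assumptions of this sketch, not details confirmed by the slides: ESS is taken to be the normalised effective sample size of the incremental weights $p(y_n \mid \theta_i)^{\Delta}$ alone, the target is matched by bisection, and the names `ess_fraction` and `next_step` are hypothetical.

```python
import numpy as np

def ess_fraction(log_w):
    """Normalised effective sample size (sum w)^2 / (M sum w^2), in (0, 1]."""
    w = np.exp(log_w - np.max(log_w))
    return w.sum() ** 2 / (w.size * np.sum(w ** 2))

def next_step(log_lik, lam, ess_target=0.9, tol=1e-6):
    """Step delta in (0, 1 - lam] whose incremental ESS is close to ess_target.

    log_lik holds log p(y_n | theta_i) for each particle.
    """
    hi = 1.0 - lam
    if ess_fraction(hi * log_lik) >= ess_target:
        return hi                      # weights stay balanced: jump straight to lambda = 1
    lo = 0.0
    while hi - lo > tol:               # ESS decreases with delta, so bisection applies
        mid = 0.5 * (lo + hi)
        if ess_fraction(mid * log_lik) >= ess_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)             # strictly positive, so lambda always advances

rng = np.random.default_rng(0)
log_lik = 50.0 * rng.standard_normal(200)     # hypothetical per-particle log-likelihoods
delta = next_step(log_lik, lam=0.0)
print(delta, ess_fraction(delta * log_lik))
```

Starting from equal weights, the incremental ESS decreases monotonically as $\Delta$ grows, which is what makes a simple bracketing search sufficient in this setting.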

  25. Number of Particles

  26. Effective Sample Size

  27. Learning Rate

  28. Burn-in

  29. Learning Rate × Burn-in

  30. Batch Size
