

  1. Bayesian Inference and Markov Chain Monte Carlo Algorithms on GPUs
     Alexander Terenin and David Draper, University of California, Santa Cruz
     Joint work with Shawfeng Dong
     Talk for the Nvidia GPU Technology Conference, May 11, 2017
     arXiv:1608.04329
     Special thanks to Nvidia and Akitio for providing hardware

  2. What are we trying to do?
     Statistical machine learning and artificial intelligence at scale:
     arg min_θ L(x, θ) + ||θ||
     • L(x, θ): loss function
     • ||θ||: regularization term
     Goal: minimize the loss
     • Typical approach: stochastic gradient descent
     • Alternative approach: rewrite the loss as an instance of Bayes' Rule

  3. Bayesian Representation of Statistical Machine Learning
     Consider the exponential of the loss:
     • f(x | θ) ∝ exp{−c L(x, θ)}: likelihood
     • π(θ) ∝ exp{−||θ||}: prior
     loss function ⟺ posterior distribution (a short derivation follows below)
     New goal: draw samples from the posterior f(θ | x)
     • A lot like non-convex optimization
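
To make the loss ⟺ posterior correspondence concrete, here is a short derivation written as a sketch; the constant c > 0 and the minus-sign convention are assumptions filled in so that the formulas agree with the minimization problem on the previous slide (c = 1 recovers that objective exactly).

```latex
% Bayes' Rule: posterior proportional to likelihood times prior
f(\theta \mid x) \propto f(x \mid \theta)\,\pi(\theta)
                 \propto \exp\{-c\,L(x,\theta)\}\,\exp\{-\lVert\theta\rVert\}
                 = \exp\!\bigl\{-\,[\,c\,L(x,\theta) + \lVert\theta\rVert\,]\bigr\}

% Hence maximizing the posterior is the same as minimizing the regularized
% loss, while sampling from the posterior explores the same surface and also
% quantifies uncertainty:
\arg\max_{\theta} f(\theta \mid x) \;=\; \arg\min_{\theta}\; c\,L(x,\theta) + \lVert\theta\rVert
```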

  4. What are we trying to do?
     Goal: draw samples from f(θ | x)
     Some difficulties at scale
     • Big data: large x, so computations that touch all of the data are slow
     • Complex models: large θ (its dimension can exceed that of x): curse of dimensionality
     Why not just find the maximum?
     • Understand, quantify, and propagate uncertainty
     • Sampling algorithms are essentially global optimizers
     • The loss may have no analytic form, making SGD impractical

  5. Hardware
     Bayesian inference is inherently expensive: let's parallelize it
     • "Parallelizable" only has meaning in context
     • Different types of parallel hardware have different requirements
     GPUs: main challenges
     • Memory bottleneck: limited RAM, may need to stream data
     • Warp divergence: fine-grained if/else → if, wait, else
     GPUs: design goals
     • Expose fine-grained parallelism
     • Minimize branching to control warp divergence
     • Ideally: run out-of-core (i.e. on minibatches streaming off disk)

  6. Gibbs Sampling
     The canonical Bayesian sampling algorithm
     Draws samples from a target with density f(x, y, z) sequentially
     • Full conditionals: f(x | y, z), f(y | x, z), f(z | x, y)
     Algorithm: Gibbs sampling
     • Step 1: draw x_1 | y_0, z_0
     • Step 2: draw y_1 | x_1, z_0
     • Step 3: draw z_1 | x_1, y_1
     Repeat until convergence to f(x, y, z)
     How do we parallelize this? (A minimal sequential sketch follows below.)
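
As a concrete illustration of the sequential structure above, here is a minimal CPU-side Gibbs sampler for a toy example that is not taken from the talk: a standard bivariate normal with correlation ρ, whose full conditionals are x | y ~ N(ρy, 1 − ρ²) and y | x ~ N(ρx, 1 − ρ²). Function and variable names are mine.

```python
import numpy as np

def gibbs_bivariate_normal(n_iter=5000, rho=0.8, seed=0):
    """Toy Gibbs sampler for (x, y) ~ standard bivariate normal with correlation rho.

    Each step draws one coordinate from its full conditional given the current
    value of the other -- the same sequential pattern as on the slide.
    """
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0                       # arbitrary starting point
    samples = np.empty((n_iter, 2))
    cond_sd = np.sqrt(1.0 - rho**2)       # sd of each full conditional
    for t in range(n_iter):
        x = rng.normal(rho * y, cond_sd)  # draw x_t | y_{t-1}
        y = rng.normal(rho * x, cond_sd)  # draw y_t | x_t
        samples[t] = (x, y)
    return samples

samples = gibbs_bivariate_normal()
print(np.corrcoef(samples[1000:].T)[0, 1])  # close to rho after burn-in
```

Each draw depends on the most recent value of the other coordinate; that sequential dependence is exactly what the following slides work around.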

  7. GPU-accelerated Gibbs Sampling
     Start with an exchangeable model:
     f(x | θ) = ∏_{i=1}^N f(x_i | θ)
     Example: probit regression
     • β ∼ N(μ, λ²)
     • y_i | z_i = round[Φ(z_i)]
     • z_i | x_i, β ∼ N(x_i β, 1)
     Data augmentation Gibbs sampler (sketched in code below):
     • z_i | β ∼ TN(x_i β, 1, y_i)
     • β | z ∼ N((XᵀX)⁻¹ Xᵀz, (XᵀX)⁻¹)
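
A minimal NumPy/SciPy sketch of this data-augmentation sampler, written as CPU reference code rather than the authors' GPU implementation; it uses the flat-prior form of the β update shown on the slide, and the function and variable names are mine.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_da_gibbs(X, y, n_iter=1000, seed=0):
    """Data-augmentation Gibbs sampler for probit regression (flat prior on beta).

    Alternates the two full-conditional draws from the slide:
      z_i | beta ~ TN(x_i beta, 1, y_i)   (z_i > 0 if y_i = 1, z_i < 0 if y_i = 0)
      beta | z   ~ N((X^T X)^{-1} X^T z, (X^T X)^{-1})
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    L = np.linalg.cholesky(XtX_inv)           # Cholesky factor for the beta draw
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    # truncation bounds: (0, inf) when y = 1, (-inf, 0) when y = 0
    lower = np.where(y == 1, 0.0, -np.inf)
    upper = np.where(y == 1, np.inf, 0.0)
    for t in range(n_iter):
        mean = X @ beta
        # every z_i is conditionally independent given beta, so this single
        # vectorized call is what becomes one parallel kernel on the GPU
        z = truncnorm.rvs(lower - mean, upper - mean, loc=mean, scale=1.0,
                          size=n, random_state=rng)
        beta_hat = XtX_inv @ (X.T @ z)
        beta = beta_hat + L @ rng.standard_normal(p)   # N(beta_hat, (X^T X)^{-1})
        draws[t] = beta
    return draws
```

The matrix products Xβ and Xᵀz and the batched truncated-normal draws are exactly the fine-grained parallel work described on the next slide.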

  8. GPU-accelerated Gibbs Sampling
     Data augmentation Gibbs sampler:
     • z_i | β ∼ TN(x_i β, 1, y_i)
     • β | z ∼ N((XᵀX)⁻¹ Xᵀz, (XᵀX)⁻¹)
     Both steps are amenable to GPU-based parallelism
     • Draw β | z in parallel: use the Cholesky decomposition
     • Draw z | β in parallel: z_i ⊥⊥ z_{−i} for all i by exchangeability
     • Sufficient fine-grained parallelism in Xβ, Xᵀz, Chol(XᵀX)
     • Some tricks are used to control warp divergence in the TN kernel (one branch-minimizing option is sketched below)
     • Overlap computation and output: write β to disk while updating z
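
The talk does not spell out which tricks the TN kernel uses; one standard branch-minimizing approach (an assumption on my part, not necessarily the authors') is to sample the truncated normal by inverting its CDF, so that every element follows the same sequence of arithmetic operations regardless of whether y_i is 0 or 1.

```python
import numpy as np
from scipy.special import ndtr, ndtri   # standard normal CDF and its inverse

def truncated_normal_branch_free(mean, y, u):
    """Draw z_i ~ TN(mean_i, 1, y_i) by inverse-CDF sampling, with no per-element branching.

    mean : array of x_i beta values
    y    : array of 0/1 responses (z_i < 0 if y_i = 0, z_i > 0 if y_i = 1)
    u    : array of Uniform(0, 1) draws, one per observation
    """
    # Phi(-mean_i) is the probability mass below 0 for a N(mean_i, 1) variable
    p0 = ndtr(-mean)
    # map u into the region of the CDF allowed by y_i using a select (where),
    # not a per-element if/else
    v = np.where(y == 1, p0 + u * (1.0 - p0), u * p0)
    return mean + ndtri(v)

# usage sketch
rng = np.random.default_rng(0)
mean = rng.normal(size=5)
y = np.array([1, 0, 1, 1, 0])
z = truncated_normal_branch_free(mean, y, rng.uniform(size=5))
print(np.sign(z), y)   # z is positive exactly where y = 1
```

Inverse-CDF sampling trades the rejection loop of a standard truncated-normal sampler for a fixed sequence of arithmetic, which is the property that matters for warp divergence; a production kernel would need additional care with floating-point precision in the tails.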

  9. GPU-accelerated Gibbs Sampling
     Data augmentation Gibbs sampler:
     • z_i | β ∼ TN(x_i β, 1, y_i)
     • β | z ∼ N((XᵀX)⁻¹ Xᵀz, (XᵀX)⁻¹)
     What if we add a hierarchical prior such as the Horseshoe?
     • β | λ ∼ N(0, λ²)
     • λ | ν ∼ π(ν)
     • ν ∼ π(η)
     Hierarchical priors factorize: update λ | − and ν | − in parallel (see the sketch below)
     • If the GPU is not saturated, this computation is essentially free
     • A more complicated model exposes more available parallelism
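
The slide leaves the densities π(ν) and π(η) unspecified. As an illustration of why the λ and ν updates parallelize, here is a sketch under one common concrete choice, the inverse-gamma auxiliary-variable representation of the half-Cauchy Horseshoe prior (Makalic and Schmidt, 2016); this parameterization is my assumption, a global scale is omitted for brevity, and the names are mine.

```python
import numpy as np

def update_local_scales(beta, nu, rng):
    """One parallel update of all Horseshoe local scales (lambda^2) and auxiliaries (nu).

    Under the inverse-gamma auxiliary representation of the half-Cauchy prior,
    both full conditionals are inverse-gamma with shape 1, so every coordinate
    can be drawn simultaneously in a single elementwise (GPU-friendly) step.
    """
    # lambda_j^2 | beta_j, nu_j ~ InvGamma(1, 1/nu_j + beta_j^2 / 2)
    lam2 = 1.0 / rng.gamma(1.0, 1.0 / (1.0 / nu + beta**2 / 2.0))
    # nu_j | lambda_j ~ InvGamma(1, 1 + 1/lambda_j^2)
    nu = 1.0 / rng.gamma(1.0, 1.0 / (1.0 + 1.0 / lam2))
    return lam2, nu

# usage sketch: p coefficients updated in one vectorized step
rng = np.random.default_rng(0)
p = 1000
beta, nu = rng.normal(size=p), np.ones(p)
lam2, nu = update_local_scales(beta, nu, rng)
```

Each coordinate's update depends only on its own β_j, λ_j, and ν_j, so the whole block is a single elementwise kernel, which is why the slide calls the extra computation essentially free when the GPU is not saturated.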

  10. GPU-accelerated Performance
      [Bar chart: Horseshoe probit regression, CPU vs. GPU run time for 10,000 Monte Carlo iterations, across data sizes N of 10,000 to 1,000,000 and dimensions p of 100 to 10,000, on a laptop, a workstation, and a GPU; time axis 0 to 90 minutes.]
      It's lightning fast, and requires no new theory
      • N = 10,000, p = 1,000: 90 minutes → 41 seconds

  11. Conclusions
      Bayesian problems can benefit immensely from hardware acceleration
      • External GPUs, like the Akitio Node, are making this accessible
      MCMC is both inherently sequential and massively parallelizable
      • Not well studied, with lots of potential for new results
      • Stay tuned: minibatch-based MCMC is possible in continuous time
      Reference: A. Terenin, S. Dong, and D. Draper. GPU-accelerated Gibbs Sampling. arXiv:1608.04329, 2016.
