Gibbs Sampling Bayesian Networks: A First Attempt with Cilk++
Alexander Dubbs
May 13, 2010
Introduction

- I Gibbs sample large Bayesian Networks using C and Cilk++, with MATLAB for pre- and post-processing.
- I chose a class of Bayes Nets with special structure, so that their mean, covariance, marginals, etc. are analytically knowable, making algorithm performance easy to analyze.
- I achieved no parallel speedups this time, but made substantial headway towards better implementations in the future.
Marginal Distributions

Let p(x_1, ..., x_n) be a probability distribution that is not necessarily normalized (its integral over R^n is positive but possibly not 1). Define the i-th marginal of p as follows:

p(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) = p(x_1, ..., x_n) / ∫_R p(x_1, ..., x_n) dx_i

Shorthand: p(x_i | ¬x_i)

This answers the question, "What is the probability that x_i is some value if we know the rest of the x_j's?"
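The definition can be checked numerically. A minimal sketch in Python/NumPy (standing in for the talk's MATLAB), with a made-up unnormalized 2-D density: fixing x_2 and dividing p(x_1, x_2) by its integral over x_1 yields a proper 1-D distribution in x_1.

```python
import numpy as np

def p(x1, x2):
    # a made-up example: a Gaussian without its normalizing constant
    return np.exp(-0.5 * (x1**2 + x1 * x2 + x2**2))

def trapezoid(y, x):
    # plain trapezoid rule, to approximate the integral over R
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

xs = np.linspace(-8.0, 8.0, 2001)     # wide enough that the tails are ~0
vals = p(xs, 1.0)                     # slice at x2 = 1
Z = trapezoid(vals, xs)               # approximates \int_R p(x1, x2) dx1
cond = vals / Z                       # p(x1 | x2 = 1) on the grid
print(trapezoid(cond, xs))            # integrates to 1: a proper distribution
print(trapezoid(xs * cond, xs))       # the conditional mean (-0.5 here)
```

Note that the unnormalized density never has to be normalized jointly; only the 1-D slice is divided by its own integral.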
Gibbs Sampling

The Fundamental Assumption of Gibbs Sampling: holding x_1, ..., x_{i-1}, x_{i+1}, ..., x_n constant, we have a sampling rule for the one-dimensional distribution p(x_i | ¬x_i). (When that isn't true, JAGS uses "Slice Sampling," a very good adaptive Monte-Carlo method for 1-D distributions.)

X ~ p(x_j | ¬x_j) will here mean "X is sampled from the distribution p(x_j | ¬x_j)." It need not be normalized for you to sample from it.
Gibbs Sampling

The algorithm uses two for loops:

1. Initialize x_1^0, ..., x_n^0 to anything.
2. For i = 1, 2, ...
     For j = 1, ..., n
       Sample x_j^i ~ p(x_j | x_1^i, ..., x_{j-1}^i, x_{j+1}^{i-1}, ..., x_n^{i-1}).

The "Gauss-Seidel" property of this algorithm - that some superscripts in the above marginal are i, instead of all (i - 1) - is crucial to its success.
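The two-loop algorithm can be sketched for the simplest interesting case: a bivariate normal with zero means, unit variances, and correlation rho, where each conditional is p(x_1 | x_2) = N(rho·x_2, 1 - rho^2) and symmetrically. All names and parameters below are illustrative, not from the talk's code.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0                  # step 1: initialize to anything
    samples = np.empty((n_iter, 2))
    s = np.sqrt(1.0 - rho**2)
    for i in range(n_iter):            # step 2: sweep the coordinates
        x1 = rho * x2 + s * rng.standard_normal()  # given current x2
        x2 = rho * x1 + s * rng.standard_normal()  # Gauss-Seidel: uses the new x1
        samples[i] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_iter=50000)
print(samples.mean(axis=0))            # ergodic mean, near (0, 0)
print(np.corrcoef(samples.T)[0, 1])    # near 0.8
```

The second update reading the freshly drawn x1 is exactly the Gauss-Seidel property the slide describes.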
Exploiting Independence for Parallelism

Say you want to Gibbs sample p(x_1, x_2). The algorithm dictates that you generate samples in the following order given x_1^0 and x_2^0:

x_1^1, x_2^1, x_1^2, x_2^2, x_1^3, x_2^3, x_1^4, x_2^4, ...

However, if p(x_1, x_2) = p(x_1) p(x_2), you can do a "Jacobi Iteration" instead. I place two variables in parentheses if they can be generated only as functions of variables on their left, and one does not need to know the other in order to be generated. The order becomes:

(x_1^1, x_2^1), (x_1^2, x_2^2), (x_1^3, x_2^3), (x_1^4, x_2^4), ...
Exploiting Independence for Parallelism

In general, take a distribution p(x_1, ..., x_n). Define the following partition of {x_1, ..., x_n} into nonempty sets C_k, called "colors":

{x_1, ..., x_n} = ∪_{k=1}^m C_k,

where if x_i and x_j are in the same C_k, or "have the same color," then they are independent. The converse need not hold (and usually doesn't).
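One standard way to build such a partition is to greedily color the dependency graph: variables are nodes, edges join dependent pairs, and every pair inside one color class is independent. A minimal sketch, on a made-up three-variable example:

```python
def greedy_coloring(n, edges):
    # build the adjacency structure of the dependency graph
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    color = {}
    for v in range(n):                           # visit nodes in a fixed order
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:                         # smallest color not used
            c += 1
        color[v] = c
    return color

# chain x0 - x1 - x2: x0 and x2 are independent, so they share a color
coloring = greedy_coloring(3, [(0, 1), (1, 2)])
print(coloring)  # {0: 0, 1: 1, 2: 0}
```

Greedy coloring gives a valid partition but not necessarily one with the fewest colors; finding an optimal coloring is NP-hard.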
A Parallel Algorithm

Modified from R. Bradford, A. Thomas, "Markov chain Monte Carlo methods for family trees using a parallel processor," Statistics and Computing (1996), 6, 67-75. Chad Scherrer had the idea independently.

1. Partition {x_1, ..., x_n} = ∪_{k=1}^m C_k as previously described.
2. Initialize x_1^0, ..., x_n^0 to anything.
3. For i = 1, 2, ...
     For k = 1, ..., m
       Sample all x_j's in C_k in parallel, to make x_j^i's. Specifically, for x_j ∈ C_k, sample

       x_j^i ~ p(x_j | x_1^{α_1}, ..., x_{j-1}^{α_{j-1}}, x_{j+1}^{α_{j+1}}, ..., x_n^{α_n}),

       where α_p = i if p ∈ C_{k'} for k' < k and α_p = i - 1 if p ∈ C_{k'} for k' ≥ k.

The "≥" is where the parallelism is.
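A minimal sketch of this colored sweep, for a zero-mean Gaussian specified by a precision matrix Q, where p(x_j | rest) = N(-Σ_{k≠j} Q[j,k] x[k] / Q[j,j], 1 / Q[j,j]). The serial inner loop over a color class stands in for the parallel update: same-colored variables are independent given the rest (Q is zero between them), so updating them in place is equivalent to updating them simultaneously. Q and the coloring below are made-up examples, not the talk's.

```python
import numpy as np

def chromatic_gibbs(Q, colors, n_iter, seed=0):
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    x = np.zeros(n)                       # initialize to anything
    samples = np.empty((n_iter, n))
    for i in range(n_iter):
        for C in colors:                  # colors in a fixed order
            # this loop is the parallel opportunity: a Cilk++ version
            # would run it as a cilk_for over the color class
            for j in C:
                mu = -(Q[j] @ x - Q[j, j] * x[j]) / Q[j, j]
                x[j] = mu + rng.standard_normal() / np.sqrt(Q[j, j])
        samples[i] = x
    return samples

# chain x0 - x1 - x2: Q[0, 2] = 0, so {x0, x2} form one color class
Q = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
samples = chromatic_gibbs(Q, colors=[[0, 2], [1]], n_iter=40000)
print(samples.mean(axis=0))               # near the true mean, 0
```

The sweep is still a valid Gibbs scan (it visits 0, 2, then 1), so the chain targets the correct distribution with covariance inv(Q).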
Theoretical Pros and Cons

- Successfully implemented in (Bradford, Thomas, 1996).
- Opportunity for data locality.
- Performance highly dependent on the coloring (optimal coloring is NP-Complete).
- Can create cache-line contention: if two independent variables are both dependent on the same third variable, they may reach for its value at the same time to update from their own marginals. Cilk++ hyperobjects, which I have not implemented (yet), will probably help.
The Test Problem

The following method creates a multivariate normal p(x_1, ..., x_n) with sparse dependency structure among the variables:

1. Pick a subset of maybe four of the variables, say x_1, x_2, x_3, x_4. Assign them a pos. def. sym. "covariance" matrix.
2. Invert it, and then extend it to an n × n matrix by zero-padding.
3. Do the above two steps a number of times, and then add the inverted "covariance" matrices together. Also add in I_{n×n}. Then invert that quantity to get C, the covariance matrix of a multivariate normal over (x_1, ..., x_n). The mean is 0.
4. x_i and x_j will only be dependent if they appear in a set of four in step 1.
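The construction can be sketched with made-up sizes (n = 10, ⌊n/3⌋ blocks of four): random SPD "covariance" blocks are inverted, zero-padded to n × n, summed along with the identity, and the total is inverted to give the covariance C.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
precision = np.eye(n)                        # the "add in I" term
for _ in range(n // 3):
    idx = rng.choice(n, size=4, replace=False)  # pick four variables
    A = rng.standard_normal((4, 4))
    block_cov = A @ A.T + 4 * np.eye(4)      # pos. def. sym. "covariance"
    block_prec = np.linalg.inv(block_cov)    # invert it ...
    P = np.zeros((n, n))
    P[np.ix_(idx, idx)] = block_prec         # ... and zero-pad to n x n
    precision += P
C = np.linalg.inv(precision)                 # covariance of the MVN
# x_i and x_j are dependent only if they shared a block of four,
# which shows up as sparsity in the summed precision matrix:
print(np.count_nonzero(np.abs(precision) > 1e-12), "nonzeros")
```

The sparsity lives in the precision (inverse covariance) matrix; C itself is generally dense, which is why the dependency graph is read off before the final inversion.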
The Test Problem

There is an analytic formula for p(x_j | ¬x_j) for normals, so we can plug that into the Gibbs Sampler. It involves inverting some of the minors of the covariance matrix C. I do that in MATLAB and dump the data to text to be read by the Cilk++ program.

Part of this process is determining the dependency relations among the x_j's, forming a "dependency graph," which can be colored (the x_j's are nodes, connected by edges if they are dependent). I have MATLAB do a greedy coloring and output the data to text to be read by Cilk++.
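For a zero-mean normal with covariance C and r = all indices but j, the analytic conditional is p(x_j | ¬x_j) = N(C[j,r] inv(C[r,r]) x_r, C[j,j] - C[j,r] inv(C[r,r]) C[r,j]), which is where inverting minors of C comes in. A minimal sketch checking it against the equivalent (and cheaper) form via the precision matrix Q = inv(C); the matrix and point below are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
C = A @ A.T + 5 * np.eye(5)                # covariance matrix
Q = np.linalg.inv(C)                       # precision matrix
x = rng.standard_normal(5)                 # current values of the rest
j = 2
r = [i for i in range(5) if i != j]

Crr_inv = np.linalg.inv(C[np.ix_(r, r)])   # inverting a minor of C
mean_cov = C[j, r] @ Crr_inv @ x[r]
var_cov = C[j, j] - C[j, r] @ Crr_inv @ C[r, j]

mean_prec = -Q[j, r] @ x[r] / Q[j, j]      # the same numbers via Q
var_prec = 1.0 / Q[j, j]
print(mean_cov - mean_prec, var_cov - var_prec)  # both ~ 0
```

The agreement is a block-matrix-inversion identity; in the sparse test problem the Q-based form only touches the neighbors of x_j.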
The Test Problem

We know the mean and covariance of p(x_1, ..., x_n), so we can assess the Gibbs Sampler's convergence by seeing if its sample ("ergodic") mean and covariance match the true values. The Cilk++ program outputs text which is read by MATLAB, where postprocessing can be done smoothly.

Remark: This is a bad way to sample from a multivariate normal; I only Gibbs sample it to test the Gibbs Sampler. The right way to sample from it is to rotate to a basis where the x_j's are independent, sample them in parallel there, and then rotate back. The target basis is the eigenbasis of C.
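The "right way" in the remark can be sketched directly: eigendecompose C, draw independent coordinates in the eigenbasis (variance λ_k on coordinate k), and rotate back. C below is a made-up 4 × 4 covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
C = A @ A.T + 4 * np.eye(4)                 # target covariance
lam, V = np.linalg.eigh(C)                  # C = V diag(lam) V^T
z = rng.standard_normal((100000, 4)) * np.sqrt(lam)  # independent coords
samples = z @ V.T                           # rotate back: each x = V z
print(np.abs(np.cov(samples.T) - C).max())  # small
```

Every coordinate of z can be drawn in parallel with no communication, which is why this beats Gibbs sampling for normals; the Gibbs run is purely a test of the sampler.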
Results

No parallel speedup this time. I used the test problem described previously, with ⌊N/3⌋ small 4 × 4 randomly chosen "covariance" matrices.

Figure: N from 10 to 100, five different distributions p(x_1, ..., x_N) per N. The Cilk++ implementation took longer. 10,000 samples were drawn for all N; errors in the ergodic mean grew from roughly 3% to 9%.
Comments

I was unable to go to larger N, since my data-transfer system from MATLAB to Cilk++ doesn't exploit sparsity. This will be fixed.

I also tried quite a few embarrassingly parallel implementations that created massive slowdowns. Initially the problem was probably cache-line contention, but even after recopying all of the relevant data, things still aren't working. I think that may be because multiple processors are trying to access nearby bytes, but I'm not sure.
The Future

- Rewrite all code to be object-oriented and to better take advantage of sparsity.
- Explore parallelism with Cilk++ hyperobjects, such as Reducers.
- Use the same data structures to explore Quadrature and Langevin-Equation Sampling approaches, as well as Gibbs Sampling.