Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling
Christopher De Sa, Kunle Olukotun, Christopher Ré
{cdesa,kunle,chrismre}@stanford.edu, Stanford
Overview
Asynchronous Gibbs sampling is a popular algorithm used in practical ML systems (e.g. Zhang et al., PVLDB 2014; Smola et al., PVLDB 2010).

Question: when and why does it work?

"Folklore" says that asynchronous Gibbs sampling basically works whenever standard (sequential) Gibbs sampling does, but there is no theoretical guarantee.

Our contributions:
1. The "folklore" is not necessarily true.
2. ...but asynchronous Gibbs does work under reasonable conditions.
Problem: given a probability distribution π, produce samples from it.
• e.g. to do inference in a graphical model

Algorithm: Gibbs sampling
• the de facto Markov chain Monte Carlo (MCMC) method for inference
• produces a series of approximate samples that approach the target distribution
What is Gibbs Sampling?

Algorithm 1 Gibbs sampling
Require: Variables x_i for 1 ≤ i ≤ n, and distribution π.
loop
    Choose s by sampling uniformly from {1, ..., n}.
    Re-sample x_s from P_π(x_s | x_{{1,...,n}\{s}}).
    output x
end loop

Each pass of the loop: (1) choose a variable to update at random; (2) compute its conditional distribution given the other variables; (3) update the variable by sampling from that conditional distribution; (4) output the current state as a sample. A runnable sketch of this loop is given below.

[Figure: a graphical model over variables x_1, ..., x_7; in the walkthrough, x_5 is selected and re-sampled from its conditional distribution, which puts probability 0.7 on one value and 0.3 on the other.]
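To make the loop concrete, here is a minimal, self-contained sketch of sequential Gibbs sampling. This is our illustration, not code from the talk: it assumes a small Ising-style model on binary variables x_i ∈ {−1, +1} with π(x) ∝ exp(θ Σ_{(i,j)∈E} x_i x_j), and the edge set and θ are hypothetical choices made only for the example.

```python
import math
import random

def gibbs_step(x, edges, theta):
    """One sequential Gibbs update on an Ising-style model (illustrative)."""
    n = len(x)
    s = random.randrange(n)  # choose a variable to update at random
    # The conditional of x_s given the rest depends only on its neighbors.
    field = sum(theta * x[j] for (i, j) in edges if i == s)
    field += sum(theta * x[i] for (i, j) in edges if j == s)
    # P(x_s = +1 | rest) = e^field / (e^field + e^-field)
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
    x[s] = 1 if random.random() < p_plus else -1  # re-sample x_s
    return x  # output the current state as a sample

# usage: a 4-cycle of binary variables
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
x = [random.choice([-1, 1]) for _ in range(4)]
for _ in range(1000):
    x = gibbs_step(x, edges, theta=0.5)
print(x)
```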
Gibbs Sampling: A Practical Perspective

• Pros of Gibbs sampling
  – easy to implement
  – updates are sparse → fast on modern CPUs
• Cons of Gibbs sampling
  – sequential algorithm → can't naively parallelize

With no parallelism, on e.g. a 64-core machine, we leave up to 98% of the performance on the table!
Asynchronous Gibbs Sampling

• Run multiple threads in parallel without locks
  – also known as HOGWILD!
  – adapted from a popular technique for stochastic gradient descent (SGD)
• When we read a variable, it could be stale
  – while we re-sample a variable, its adjacent variables can be overwritten by other threads
  – semantics not equivalent to standard (sequential) Gibbs sampling

(A minimal sketch of the lock-free scheme follows below.)
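Below is a minimal sketch of the lock-free scheme, again our illustration on the same toy Ising model as above, not the authors' implementation. Each thread runs ordinary Gibbs updates against the shared state with no synchronization, so reads can be stale. (CPython's GIL limits true parallelism here; the sketch shows the semantics of stale reads and lock-free writes, not the speedup.)

```python
import math
import random
import threading

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # illustrative model, as before
theta = 0.5
x = [random.choice([-1, 1]) for _ in range(4)]  # shared state, no locks

def worker(steps):
    for _ in range(steps):
        s = random.randrange(len(x))
        # Read neighbors without locking: another thread may overwrite
        # them while we compute the conditional (stale reads).
        field = sum(theta * x[j] for (i, j) in edges if i == s)
        field += sum(theta * x[i] for (i, j) in edges if j == s)
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
        x[s] = 1 if random.random() < p_plus else -1  # lock-free write

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(x)
```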
Question
Does asynchronous Gibbs sampling work? …and what does it mean for it to work?

Two desiderata:
• want to get accurate estimates → bound the bias
• want to become independent of initial conditions quickly → bound the mixing time
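For reference, the standard definition of mixing time (our addition, not from the slides; the notation matches the total variation distance defined on the Bias slide below):

```latex
% Mixing time: the first time t at which, from the worst-case initial
% distribution mu_0, the chain's distribution is within epsilon of the
% target pi in total variation distance.
t_{\mathrm{mix}}(\epsilon)
  = \min\Big\{ t : \max_{\mu_0} \big\| P^{(t)} \mu_0 - \pi \big\|_{\mathrm{TV}} \le \epsilon \Big\}
```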
Previous Work

• "Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent" — Niu et al., NIPS 2011.
  – follow-up work: Liu and Wright, SIOPT 2015; Liu et al., JMLR 2015; De Sa et al., NIPS 2015; Mania et al., arXiv 2015
• "Analyzing Hogwild Parallel Gaussian Gibbs Sampling" — Johnson et al., NIPS 2013.
Bias

• How close are the samples to the target distribution?
  – standard measurement: total variation distance
      ‖μ − ν‖_TV = max_{A ⊆ Ω} |μ(A) − ν(A)|
• For sequential Gibbs, there is no asymptotic bias:
      ∀ μ_0,  lim_{t→∞} ‖P^(t) μ_0 − π‖_TV = 0

"Folklore": asynchronous Gibbs is also unbiased. …but this is not necessarily true!
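For a finite state space Ω, the maximum over subsets A equals half the L1 distance between the distributions, which gives a one-line empirical check. This helper is our own illustration, not from the slides:

```python
def tv_distance(mu, nu):
    """Total variation distance between two distributions (dicts mapping
    state -> probability) on a finite state space; equals half the L1 norm."""
    states = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(s, 0.0) - nu.get(s, 0.0)) for s in states)

# usage: uniform vs. the example distribution from the next slide
mu = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
pi = {(0, 0): 0.0, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3}
print(tv_distance(mu, pi))  # 0.25
```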
Simple Bias Example

Target distribution over two binary variables:
    p(0,0) = 0,   p(0,1) = p(1,0) = p(1,1) = 1/3

[Figure: the sequential Gibbs transition diagram over the states (0,0), (0,1), (1,0), (1,1), with edge probabilities 1/4, 1/2, and 3/4; e.g. the chain holds at (1,1) with probability 1/2 and moves to (0,1) or (1,0) with probability 1/4 each, while (0,1) and (1,0) each hold with probability 3/4.]
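To see the bias concretely, here is a toy simulation. It is our construction under a simplified model of asynchrony: with probability `collision`, both variables are re-sampled in the same step from conditionals computed at the stale, pre-update state, mimicking two racing lock-free threads. Sequential Gibbs (collision = 0) can never enter (0,0), which has zero probability under π, but the asynchronous version visits it with positive frequency, i.e. it is biased.

```python
import random

def conditional_one(other):
    """P(x_i = 1 | x_other), derived from p(0,0)=0, p(0,1)=p(1,0)=p(1,1)=1/3."""
    return 0.5 if other == 1 else 1.0

def run(steps, collision, seed=0):
    rng = random.Random(seed)
    x = [1, 1]
    hits = 0  # times we observe the "impossible" state (0,0)
    for _ in range(steps):
        if rng.random() < collision:
            # both "threads" read the stale state, then both write
            old = list(x)
            x[0] = 1 if rng.random() < conditional_one(old[1]) else 0
            x[1] = 1 if rng.random() < conditional_one(old[0]) else 0
        else:
            # ordinary sequential Gibbs update
            s = rng.randrange(2)
            x[s] = 1 if rng.random() < conditional_one(x[1 - s]) else 0
        hits += (x == [0, 0])
    return hits / steps

print(run(100_000, collision=0.0))  # sequential: exactly 0.0
print(run(100_000, collision=0.5))  # asynchronous: clearly > 0
```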