
Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling



  1. Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling. Christopher De Sa, Kunle Olukotun, Christopher Ré. {cdesa,kunle,chrismre}@stanford.edu, Stanford.

  2. Overview

  3. Asynchronous Gibbs sampling is a popular algorithm that's used in practical ML systems, e.g. Zhang et al., PVLDB 2014; Smola et al., PVLDB 2010.

  4. Asynchronous Gibbs sampling is a popular algorithm that’s used in practical ML systems. Question: when and why does it work?

  5. Asynchronous Gibbs sampling is a popular algorithm that's used in practical ML systems. Question: when and why does it work? "Folklore" says that asynchronous Gibbs sampling basically works whenever standard (sequential) Gibbs sampling does… but there's no theoretical guarantee.


  7. Asynchronous Gibbs sampling is a popular algorithm that's used in practical ML systems. Question: when and why does it work? "Folklore" says that asynchronous Gibbs sampling basically works whenever standard (sequential) Gibbs sampling does… but there's no theoretical guarantee. Our contributions: 1. The "folklore" is not necessarily true. 2. …but asynchronous Gibbs does work under reasonable conditions.


  10. Problem: given a probability distribution π, produce samples from it. • e.g. to do inference in a graphical model

  11. Problem: given a probability distribution π, produce samples from it. • e.g. to do inference in a graphical model. Algorithm: Gibbs sampling • de facto Markov chain Monte Carlo (MCMC) method for inference • produces a series of approximate samples that approach the target distribution

  12. What is Gibbs Sampling?

  13. What is Gibbs Sampling?
  Algorithm 1 Gibbs sampling
  Require: Variables x_i for 1 ≤ i ≤ n, and distribution π.
  loop
      Choose s by sampling uniformly from {1, ..., n}.
      Re-sample x_s from P_π(x_s | x_{1,...,n}\{s}).
      output x
  end loop
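  As a concrete sketch (my illustration, not code from the talk; the chain model and the name `chain_conditional` are assumptions for the example), here is the loop from Algorithm 1 for binary variables, where `conditional(x, s)` returns P_π(x_s = 1 | the other variables):

```python
import math
import random

def gibbs(conditional, n, steps, seed=0):
    """Sequential Gibbs sampling over n binary variables.
    conditional(x, s) must return P_pi(x_s = 1 | x_{-s})."""
    rng = random.Random(seed)
    x = [rng.randrange(2) for _ in range(n)]  # arbitrary initial state
    for _ in range(steps):
        s = rng.randrange(n)                  # choose s uniformly from {1, ..., n}
        x[s] = 1 if rng.random() < conditional(x, s) else 0  # re-sample x_s
        yield tuple(x)                        # output x

def chain_conditional(x, s, coupling=1.0):
    # Hypothetical Ising-style chain: adjacent variables prefer to agree.
    field = sum(2 * x[t] - 1 for t in (s - 1, s + 1) if 0 <= t < len(x))
    return 1.0 / (1.0 + math.exp(-2.0 * coupling * field))

samples = list(gibbs(chain_conditional, n=7, steps=1000))
```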

  14. What is Gibbs Sampling? (Algorithm 1 as above.) [Figure: a model with seven variables x_1, …, x_7.]

  15. What is Gibbs Sampling? Choose a variable to update at random. [Figure: variable x_5 is selected.]

  16. What is Gibbs Sampling? Compute its conditional distribution given the other variables. [Figure: the conditional distribution of x_5 given its neighbors, with probabilities 0.7 and 0.3 for its two values.]

  17. What is Gibbs Sampling? Update the variable by sampling from its conditional distribution. [Figure: x_5 is re-sampled from this distribution.]

  18. What is Gibbs Sampling? Output the current state as a sample. [Figure: the full state x_1, …, x_7 is emitted.]

  19. Gibbs Sampling: A Practical Perspective

  20. Gibbs Sampling: A Practical Perspective
  • Pros of Gibbs sampling
    – Easy to implement
    – Updates are sparse → fast on modern CPUs
  • Cons of Gibbs sampling
    – Sequential algorithm → can't naively parallelize

  21. Gibbs Sampling: A Practical Perspective (as above, plus:) [Figure: with no parallelism, e.g. on a 64-core machine, we leave up to 98% of performance on the table.]

  22. Asynchronous Gibbs Sampling

  23. Asynchronous Gibbs Sampling
  • Run multiple threads in parallel without locks
    – also known as HOGWILD!
    – adapted from a popular technique for stochastic gradient descent (SGD)
  • When we read a variable, it could be stale
    – while we re-sample a variable, its adjacent variables can be overwritten by other threads
    – semantics not equivalent to standard (sequential) Gibbs sampling
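  A minimal HOGWILD!-style sketch (my illustration, not the authors' code; the placeholder conditional is an assumption): each worker runs the same update loop as sequential Gibbs on shared state, with no locking, so reads of neighboring variables can be stale:

```python
import random
import threading

def hogwild_worker(conditional, x, steps, seed):
    """One lock-free worker: the shared state x is read and written
    without synchronization, so values read here may be stale."""
    rng = random.Random(seed)
    for _ in range(steps):
        s = rng.randrange(len(x))
        # Another thread can overwrite x's other entries mid-update.
        x[s] = 1 if rng.random() < conditional(x, s) else 0

shared_x = [0] * 7                      # shared, unprotected state
fair_coin = lambda x, s: 0.5            # placeholder conditional for the demo
workers = [threading.Thread(target=hogwild_worker,
                            args=(fair_coin, shared_x, 10_000, seed))
           for seed in range(4)]
for t in workers: t.start()
for t in workers: t.join()
```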



  26. Question: Does asynchronous Gibbs sampling work? …and what does it mean for it to work?

  27. Question: Does asynchronous Gibbs sampling work? …and what does it mean for it to work? Two desiderata:

  28. Question: Does asynchronous Gibbs sampling work? …and what does it mean for it to work? Two desiderata: • want to get accurate estimates ⇒ bound the bias

  29. Question: Does asynchronous Gibbs sampling work? …and what does it mean for it to work? Two desiderata: • want to get accurate estimates ⇒ bound the bias • want to become independent of initial conditions quickly ⇒ bound the mixing time
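  For reference, the standard definition of mixing time from the MCMC literature (not stated on the slide; it uses the total variation distance ‖·‖_TV defined on slide 35):

```latex
% Mixing time: the first step t at which the chain is within epsilon
% of the target pi in total variation, from the worst starting distribution.
t_{\mathrm{mix}}(\epsilon) = \min\left\{ t : \max_{\mu_0} \left\| P^{(t)} \mu_0 - \pi \right\|_{TV} \le \epsilon \right\}
```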

  30. Previous Work

  31. Previous Work
  • "Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent" — Niu et al., NIPS 2011. Follow-up work: Liu and Wright, SIAM J. Optim. 2015; Liu et al., JMLR 2015; De Sa et al., NIPS 2015; Mania et al., arXiv 2015.
  • "Analyzing Hogwild Parallel Gaussian Gibbs Sampling" — Johnson et al., NIPS 2013.



  34. Bias

  35. Bias
  • How close are the samples to the target distribution?
    – standard measurement: total variation distance
      \| \mu - \nu \|_{TV} = \max_{A \subseteq \Omega} | \mu(A) - \nu(A) |
  • For sequential Gibbs, there is no asymptotic bias:

  36. Bias (as above, plus:) For sequential Gibbs, no asymptotic bias:
      \forall \mu_0, \quad \lim_{t \to \infty} \| P^{(t)} \mu_0 - \pi \|_{TV} = 0

  37. Bias (as above, plus:) "Folklore": asynchronous Gibbs is also unbiased… but this is not necessarily true!
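  To make the total variation distance concrete, here is a minimal sketch (my illustration, not from the slides) that computes it for discrete distributions, using the identity that the max over events equals half the L1 distance:

```python
import itertools

def tv_distance(mu, nu):
    """Total variation distance between two discrete distributions,
    given as dicts mapping states to probabilities.
    ||mu - nu||_TV = max_A |mu(A) - nu(A)| = 0.5 * sum_x |mu(x) - nu(x)|."""
    states = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(s, 0.0) - nu.get(s, 0.0)) for s in states)

# Example: the two-variable distribution from the bias example below,
# compared against the uniform distribution on {0, 1}^2.
pi = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3, (0, 0): 0.0}
uniform = {s: 0.25 for s in itertools.product((0, 1), repeat=2)}
print(tv_distance(pi, uniform))  # 0.25
```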

  38. Simple Bias Example
  p(0, 1) = p(1, 0) = p(1, 1) = 1/3, p(0, 0) = 0
  [Figure: the transition diagram of Gibbs sampling on this model, over states (0, 0), (0, 1), (1, 0), (1, 1), with transition probabilities 1/4, 1/2, and 3/4 on the edges.]
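  The following simulation (my illustration, not the authors' code) contrasts sequential Gibbs with an extreme asynchronous schedule on this model, in which both variables are updated at once from stale reads of each other. The sequential chain never visits the zero-probability state (0, 0), while the racy schedule gives it positive mass, demonstrating the bias:

```python
import random
from collections import Counter

def p_one(other):
    # Conditional of the target p(0,1)=p(1,0)=p(1,1)=1/3, p(0,0)=0:
    # P(x_i = 1 | x_other = 0) = 1, P(x_i = 1 | x_other = 1) = 1/2.
    return 1.0 if other == 0 else 0.5

def sequential(steps, seed=0):
    rng, x, counts = random.Random(seed), [1, 1], Counter()
    for _ in range(steps):
        i = rng.randrange(2)                         # pick one variable
        x[i] = 1 if rng.random() < p_one(x[1 - i]) else 0
        counts[tuple(x)] += 1
    return counts

def racy(steps, seed=0):
    # Extreme asynchronous schedule: both variables update simultaneously
    # from stale reads of each other (a HOGWILD!-style race).
    rng, x, counts = random.Random(seed), [1, 1], Counter()
    for _ in range(steps):
        old = list(x)
        x = [1 if rng.random() < p_one(old[1 - i]) else 0 for i in range(2)]
        counts[tuple(x)] += 1
    return counts

n = 100_000
print({s: c / n for s, c in sorted(sequential(n).items())})  # no (0, 0)
print({s: c / n for s, c in sorted(racy(n).items())})        # (0, 0) has mass
```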

