Parallel Splash Belief Propagation

  1. Parallel Splash Belief Propagation Joseph E. Gonzalez Yucheng Low Carlos Guestrin David O’Hallaron Computers which worked on this project: BigBro1, BigBro2, BigBro3, BigBro4, BigBro5, BigBro6, BiggerBro, BigBroFS Tashish01, Tashi02, Tashi03, Tashi04, Tashi05, Tashi06, …, Tashi30, parallel, gs6167, koobcam (helped with writing) Carnegie Mellon

  2. Change in the Foundation of ML. Why talk about parallelism now? [Plot: Log(Speed in GHz) vs. release date, 1988–2010, with curves for future sequential performance and future parallel performance.]

  3. Why is this a Problem? [Chart: parallelism vs. sophistication. Nearest Neighbor [Google et al.], Basic Regression [Cheng et al.], Support Vector Machines [Graf et al.], and Graphical Models [Mendiburu et al.] gain sophistication but lose parallelism; "want to be here" marks the high-parallelism, high-sophistication corner.]

  4. Why is it hard? Algorithmic efficiency: eliminate wasted computation. Parallel efficiency: expose independent computation. Implementation efficiency: map computation to real hardware.

  5. The Key Insight. Statistical structure (graphical model structure, graphical model parameters) leads to computational structure (chains of computational dependences, decay of influence), which leads to parallel structure (parallel dynamic scheduling, state partitioning for distributed computation).

  6. The Result: Splash Belief Propagation. [Same parallelism vs. sophistication chart: Nearest Neighbor [Google et al.], Basic Regression [Cheng et al.], Support Vector Machines [Graf et al.], Graphical Models [Mendiburu et al.]; Splash BP (Graphical Models [Gonzalez et al.]) reaches the goal of high parallelism at high sophistication.]

  7. Outline: Overview; Graphical Models: Statistical Structure; Inference: Computational Structure; τε-Approximate Messages: Statistical Structure; Parallel Splash (Dynamic Scheduling, Partitioning); Experimental Results; Conclusions.

  8. Graphical Models and Parallelism. Graphical models provide a common language for general-purpose parallel algorithms in machine learning. A parallel inference algorithm would improve: protein structure prediction, movie recommendation, computer vision. Inference is a key step in learning graphical models.

  9. Overview of Graphical Models. Graphical representation of local statistical dependencies. In the image-denoising example: the observed random variables are the noisy picture; the latent pixel variables are the "true" pixel values; the local dependencies encode continuity assumptions. Inference: what is the probability that this pixel is black?
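
For reference, a pairwise model of this kind (notation mine, not from the slides) factorizes over node potentials tying each latent pixel to its noisy observation and edge potentials encoding the continuity assumption; inference then means computing each pixel's marginal given the image:

```latex
% notation is illustrative, not the talk's
P(x_1,\dots,x_n \mid y) \;\propto\; \prod_{i} \phi_i(x_i, y_i) \;\prod_{(i,j)\in E} \psi_{ij}(x_i, x_j)
```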

  10. Synthetic Noisy Image Problem. [Panels: noisy image, predicted image.] Overlapping Gaussian noise; used to assess convergence and accuracy.

  11. Protein Side-Chain Prediction. Model side-chain interactions as a graphical model. Inference: what is the most likely orientation?

  12. Protein Side-Chain Prediction. 276 protein networks with approximately 700 variables, 1600 factors, and 70 discrete orientations each; strong factors. [Histogram: example degree distribution over degrees roughly 6–46.]

  13. Markov Logic Networks. Represent logic as a graphical model (A: Alice, B: Bob; each variable True/False). Rules: Friends(A,B) ∧ Smokes(A) ⇒ Smokes(B); Smokes(A) ⇒ Cancer(A); Smokes(B) ⇒ Cancer(B). Variables: Friends(A,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B). Inference: Pr(Cancer(B) = True | Smokes(A) = True, Friends(A,B) = True) = ?

  14. Markov Logic Networks: the UW-Systems model. 8K binary variables, 406K factors. Irregular degree distribution: some vertices with high degree. [Same example diagram: Friends(A,B) ∧ Smokes(A) ⇒ Smokes(B); Smokes(A) ⇒ Cancer(A); Smokes(B) ⇒ Cancer(B).]

  15. Outline: Overview; Graphical Models: Statistical Structure; Inference: Computational Structure; τε-Approximate Messages: Statistical Structure; Parallel Splash (Dynamic Scheduling, Partitioning); Experimental Results; Conclusions.

  16. The Inference Problem. Examples: What is the probability that Bob smokes given that Alice smokes? What is the best configuration of the protein side-chains? What is the probability that each pixel is black? Inference is NP-hard in general, so we use approximate inference: belief propagation. [Diagram: the Friends/Smokes/Cancer example from the previous slides.]

  17. Belief Propagation (BP). An iterative message-passing algorithm; a naturally parallel algorithm.
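
The messages being passed are the standard sum-product updates; in my notation (not shown on the slide), vertex i sends to neighbor j, and beliefs are read off from the incoming messages. Each message depends only on messages arriving at its source vertex, which is what makes the algorithm naturally parallel.

```latex
% standard sum-product form; notation mine
m_{i \to j}(x_j) \;\propto\; \sum_{x_i} \psi_{ij}(x_i, x_j)\,\phi_i(x_i) \prod_{k \in N(i)\setminus\{j\}} m_{k \to i}(x_i),
\qquad
b_i(x_i) \;\propto\; \phi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i).
```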

  18. Parallel Synchronous BP. Given the old messages, all new messages can be computed in parallel (one CPU per message: CPU 1, CPU 2, …, CPU n map old messages to new messages). Map-Reduce ready!
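
A minimal Python sketch of one synchronous round, assuming node potentials `phi`, pairwise potentials `psi`, and a message dictionary `msgs` (names and data layouts are mine, not the talk's implementation). Because every new message is a function of the previous round's messages only, the dictionary comprehension below is embarrassingly parallel and could be split across CPUs or map-reduce tasks.

```python
import numpy as np

# Sketch only: helper names and data structures are illustrative, not from the paper.
def send_message(i, j, phi, psi, msgs, neighbors):
    """Sum-product message i -> j computed from the previous round's messages."""
    incoming = phi[i].copy()
    for k in neighbors[i]:
        if k != j:
            incoming = incoming * msgs[(k, i)]
    new_msg = psi[(i, j)].T @ incoming      # psi[(i, j)] is an |X_i| x |X_j| table
    return new_msg / new_msg.sum()          # normalize for numerical stability

def synchronous_iteration(phi, psi, msgs, neighbors):
    """One round of synchronous BP: every (i, j) message is an independent task."""
    return {(i, j): send_message(i, j, phi, psi, msgs, neighbors) for (i, j) in msgs}
```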

  19. Sequential Computational Structure.

  20. Hidden Sequential Structure.

  21. Hidden Sequential Structure. [Chain with evidence at both ends.] Running time = (time for a single parallel iteration) × (number of iterations).

  22. Optimal Sequential Algorithm. Running times on a chain of n vertices: naturally parallel, 2n²/p (p ≤ 2n); forward-backward, 2n (p = 1); optimal parallel, n (p = 2). There is a gap between the naturally parallel schedule and the sequential forward-backward schedule.
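
To make the 2n count concrete, here is a hedged sketch of the forward-backward schedule on a chain of n vertices, reusing the illustrative `send_message` helper from the synchronous sketch above: one sweep left-to-right and one sweep right-to-left touch each edge once in each direction, i.e. roughly 2n message updates on a single processor.

```python
# Sketch only: reuses the illustrative send_message helper defined above.
def forward_backward_chain(n, phi, psi, msgs, neighbors):
    """Sequential optimal schedule on the chain 0 - 1 - ... - (n-1)."""
    for i in range(n - 1):            # forward pass: left to right
        msgs[(i, i + 1)] = send_message(i, i + 1, phi, psi, msgs, neighbors)
    for i in range(n - 1, 0, -1):     # backward pass: right to left
        msgs[(i, i - 1)] = send_message(i, i - 1, phi, psi, msgs, neighbors)
    return msgs
```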

  23. Key Computational Structure. Running times: naturally parallel, 2n²/p (p ≤ 2n); optimal parallel, n (p = 2). The gap reflects inherent sequential structure and requires efficient scheduling.

  24. Outline: Overview; Graphical Models: Statistical Structure; Inference: Computational Structure; τε-Approximate Messages: Statistical Structure; Parallel Splash (Dynamic Scheduling, Partitioning); Experimental Results; Conclusions.

  25. Parallelism by Approximation. [Diagram: the true messages along a chain of vertices 1–10 compared with a τε-approximation that only propagates over the last τε steps.] τε represents the minimal sequential structure.
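
One way to pin the idea down (my paraphrase of the construction sketched here; the talk's formal definition may differ in details): τε is the smallest horizon d such that every belief computed from messages originating at most d hops away is within ε of the true belief.

```latex
% reconstruction, not the talk's exact definition
\tau_\epsilon \;=\; \min\Big\{ d \;:\; \max_i \big\| b_i - b_i^{(d)} \big\|_1 \le \epsilon \Big\}.
```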

  26. Tau-Epsilon Structure. Often τε decreases quickly. [Plots: message approximation error (log scale) for protein networks and Markov logic networks.]

  27. Running Time Lower Bound. Theorem: Using p processors it is not possible to obtain a τε-approximation in less time than a parallel component (on the order of n/p) plus a sequential component (on the order of τε).

  28. Proof: Running Time Lower Bound. Consider one direction using p/2 processors (p ≥ 2). [Diagram: a chain of n vertices carved into segments of length τε.] We must make n − τε vertices τε left-aware, and a single processor can only make k − τε + 1 vertices τε left-aware in k iterations.
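
Filling in the arithmetic implied by the two counting facts on this slide (the expression itself is my reconstruction of the missing formula): p/2 processors can make at most (p/2)(k − τε + 1) vertices τε left-aware in k iterations, and n − τε vertices must be made left-aware, so

```latex
% reconstructed from the counting argument on the slide
\frac{p}{2}\,(k - \tau_\epsilon + 1) \;\ge\; n - \tau_\epsilon
\quad\Longrightarrow\quad
k \;\ge\; \frac{2\,(n - \tau_\epsilon)}{p} + \tau_\epsilon - 1,
```

a parallel component that shrinks with p plus a sequential component of roughly τε.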

  29. Optimal Parallel Scheduling. [Diagram: the chain is split into p contiguous blocks, one per processor (Processor 1, Processor 2, Processor 3).] Theorem: Using p processors this algorithm achieves a τε-approximation in time on the order of n/p + τε.

  30. Proof: Optimal Parallel Scheduling. After the first iteration, all vertices are left-aware of the left-most vertex on their processor; after exchanging messages across processor boundaries and running the next iteration, awareness extends into the neighboring block. After k parallel iterations each vertex is (k−1)(n/p) left-aware.

  31. Proof: Optimal Parallel Scheduling. After k parallel iterations each vertex is (k−1)(n/p) left-aware. Since all vertices must be made τε left-aware, and each iteration takes O(n/p) time, the bound follows.
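
Making the two steps explicit (again, the expressions are my reconstruction of the formulas that were on the slide): left-awareness must reach τε, and each of the k iterations costs O(n/p), giving

```latex
% reconstructed from the statements on this slide
(k-1)\,\frac{n}{p} \;\ge\; \tau_\epsilon
\;\Longrightarrow\;
k \;\ge\; \frac{\tau_\epsilon\, p}{n} + 1,
\qquad
\text{total time} \;=\; k \cdot O\!\Big(\frac{n}{p}\Big) \;=\; O\!\Big(\frac{n}{p} + \tau_\epsilon\Big),
```

which matches the lower bound up to constant factors.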

  32. Comparing with Synchronous BP. [Diagram: per-processor schedules (Processor 1, 2, 3) for the synchronous schedule vs. the optimal schedule; a gap remains between them.]

  33. Outline: Overview; Graphical Models: Statistical Structure; Inference: Computational Structure; τε-Approximate Messages: Statistical Structure; Parallel Splash (Dynamic Scheduling, Partitioning); Experimental Results; Conclusions.

  34. The Splash Operation. Generalize the optimal chain algorithm to arbitrary cyclic graphs: 1) grow a BFS spanning tree of fixed size; 2) forward pass, computing all messages at each vertex; 3) backward pass, computing all messages at each vertex.
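
A hedged Python sketch of those three steps (helper names such as `send_all_messages` and the fixed `splash_size` cutoff are illustrative; the real system bounds the tree by accumulated work rather than vertex count):

```python
from collections import deque

# Sketch only: structure follows the three steps on the slide; names are mine.
def splash(root, graph, splash_size, send_all_messages):
    # 1) Grow a BFS spanning tree of bounded size rooted at `root`.
    order, visited, frontier = [root], {root}, deque([root])
    while frontier and len(order) < splash_size:
        u = frontier.popleft()
        for v in graph[u]:
            if v not in visited and len(order) < splash_size:
                visited.add(v)
                order.append(v)
                frontier.append(v)
    # 2) Forward pass: compute all messages at each vertex, leaves toward the root.
    for v in reversed(order):
        send_all_messages(v)
    # 3) Backward pass: compute all messages at each vertex, root back out to the leaves.
    for v in order:
        send_all_messages(v)
    return order
```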

  35. Running Parallel Splashes. [Diagram: CPU 1, CPU 2, CPU 3 each run a Splash on their local state.] Partition the graph, schedule Splashes locally, and transmit the messages along the boundary of the partition. Key challenges: 1) How do we schedule Splashes? 2) How do we partition the graph?

  36. Where do we Splash? Assign priorities and use a scheduling queue to select roots. [Diagram: a CPU pops the highest-priority vertex from its scheduling queue and runs a Splash there on its local state.] How do we assign priorities?
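
A rough picture of the per-CPU loop this diagram suggests (the `priorities` map, `run_splash` callback, and lazy handling of stale queue entries are my simplifications; the actual system maintains a priority queue over its own partition):

```python
import heapq

# Sketch only: priority map and run_splash callback are illustrative.
def splash_scheduler(priorities, run_splash, max_splashes):
    """Repeatedly pop the highest-priority root, Splash it, and requeue what changed."""
    heap = [(-p, v) for v, p in priorities.items()]
    heapq.heapify(heap)
    for _ in range(max_splashes):
        if not heap:
            break
        neg_p, root = heapq.heappop(heap)
        if -neg_p != priorities[root]:
            continue                      # stale entry; the vertex was re-prioritized
        changed = run_splash(root)        # returns vertices whose priority changed
        for v in changed:
            heapq.heappush(heap, (-priorities[v], v))
```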

  37. Message Scheduling. Residual Belief Propagation [Elidan et al., UAI 06]: assign priorities based on the change in inbound messages. A small change means an expensive no-op; a large change means an informative update.
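
In residual BP the priority attached to a message is just the size of its most recent change, e.g. with an L1 norm (my notation; the original may use a different norm):

```latex
% notation mine
r_{i \to j} \;=\; \big\| m^{\text{new}}_{i \to j} - m^{\text{old}}_{i \to j} \big\|_1 .
```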

  38. Problem with Message Scheduling. Small changes in messages do not imply small changes in belief: a small change in every inbound message can still produce a large change in the belief.

  39. Problem with Message Scheduling. Large changes in a single message do not imply large changes in belief: a large change in one message can still produce only a small change in the belief.

  40. Belief Residual Scheduling. Assign priorities based on the cumulative change in belief: r_v is the sum of the changes contributed by the vertex's inbound messages since it was last updated. A vertex whose belief has changed substantially since last being updated will likely produce informative new messages.
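
A small sketch of how that cumulative residual could be maintained (names are mine): every time a new inbound message changes a vertex's belief, add the size of the change to the vertex's residual, and clear the residual whenever the vertex itself is updated, e.g. as the root of a Splash.

```python
import numpy as np

# Sketch only: illustrates the bookkeeping, not the paper's exact implementation.
def on_message_arrival(v, old_belief, new_belief, residuals):
    """Accumulate the L1 change in v's belief caused by a new inbound message."""
    residuals[v] += float(np.abs(new_belief - old_belief).sum())

def on_vertex_update(v, residuals):
    """Reset v's priority after it has recomputed and sent its messages."""
    residuals[v] = 0.0
```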

  41. Message vs. Belief Scheduling. Belief scheduling improves accuracy and convergence. [Plots: L1 error in beliefs vs. time (seconds) for message scheduling vs. belief scheduling (lower is better), and % converged in 4 hrs for belief residuals vs. message residual.]

  42. Splash Pruning. Belief residuals can be used to dynamically reshape and resize Splashes: vertices with low belief residual are pruned from the Splash.
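
Pruning slots naturally into the tree-growing step of the Splash sketch above: before adding a neighbor to the BFS frontier, check its belief residual against a threshold (the threshold value here is purely illustrative) so the Splash only grows where beliefs are still changing.

```python
# Sketch only: the threshold value is illustrative.
def should_include(v, residuals, prune_threshold=1e-3):
    """Splash pruning test: skip vertices whose beliefs have essentially converged."""
    return residuals[v] >= prune_threshold
```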

  43. Splash Size. Using Splash Pruning, our algorithm is able to dynamically select the optimal splash size. [Plot: running time (seconds) vs. splash size (messages), with and without pruning; lower is better.]
