Parallel Splash Belief Propagation
Joseph E. Gonzalez, Yucheng Low, Carlos Guestrin, David O'Hallaron
Carnegie Mellon
Computers which worked on this project: BigBro1, BigBro2, BigBro3, BigBro4, BigBro5, BigBro6, BiggerBro, BigBroFS, Tashi01, Tashi02, Tashi03, Tashi04, Tashi05, Tashi06, …, Tashi30, parallel, gs6167, koobcam (helped with writing)
Change in the Foundation of ML
Why talk about parallelism now?
(Figure: log clock speed in GHz versus release date, 1988–2010; sequential performance has flattened while future gains come from parallel performance.)
Why is this a Problem?
(Figure: parallelism versus sophistication. Highly parallel but simple: nearest neighbor [Google et al.], basic regression [Cheng et al.]. More sophisticated but less parallel: support vector machines [Graf et al.], graphical models [Mendiburu et al.]. We want to be in the sophisticated-and-parallel corner.)
Why is it hard?
- Algorithmic Efficiency: eliminate wasted computation
- Parallel Efficiency: expose independent computation
- Implementation Efficiency: map computation to real hardware
The Key Insight
Statistical Structure
- Graphical model structure
- Graphical model parameters
Computational Structure
- Chains of computational dependences
- Decay of influence
Parallel Structure
- Parallel dynamic scheduling
- State partitioning for distributed computation
The Result: Splash Belief Propagation
(Figure: the same parallelism-versus-sophistication plot. Splash BP [Gonzalez et al.] moves graphical model inference toward the goal region occupied by nearest neighbor [Google et al.], basic regression [Cheng et al.], and support vector machines [Graf et al.]; prior graphical model inference [Mendiburu et al.] sits lower.)
Outline
- Overview
- Graphical Models: Statistical Structure
- Inference: Computational Structure
- τε-Approximate Messages: Statistical Structure
- Parallel Splash
  - Dynamic Scheduling
  - Partitioning
- Experimental Results
- Conclusions
Graphical Models and Parallelism
Graphical models provide a common language for general-purpose parallel algorithms in machine learning.
A parallel inference algorithm would improve protein structure prediction, movie recommendation, and computer vision.
Inference is a key step in learning graphical models.
Overview of Graphical Models
Graphical models represent local statistical dependencies.
Example: a noisy picture.
- Observed random variables: the noisy pixel values
- Latent pixel variables: the "true" pixel values
- Local dependencies encode continuity assumptions
Inference: what is the probability that this pixel is black?
Synthetic Noisy Image Problem
(Figure: noisy image and predicted image.)
Overlapping Gaussian noise.
Assess convergence and accuracy.
Protein Side-Chain Prediction
Model side-chain interactions as a graphical model.
Inference: what is the most likely orientation?
Protein Side-Chain Prediction
276 protein networks, each with approximately:
- 700 variables
- 1600 factors
- 70 discrete orientations
Strong factors.
(Figure: example degree distribution over degrees 6–46.)
Markov Logic Networks
Represent logic as a graphical model.
Example with two people, A: Alice and B: Bob (all variables True/False):
- Friends(A,B) ∧ Smokes(A) ⇒ Smokes(B)
- Smokes(A) ⇒ Cancer(A)
- Smokes(B) ⇒ Cancer(B)
Variables: Friends(A,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B).
Inference: Pr(Cancer(B) = True | Smokes(A) = True, Friends(A,B) = True) = ?
Markov Logic Networks
UW-Systems model (same structure as the Alice/Bob example):
- 8K binary variables
- 406K factors
- Irregular degree distribution: some vertices with very high degree
Outline
- Overview
- Graphical Models: Statistical Structure
- Inference: Computational Structure
- τε-Approximate Messages: Statistical Structure
- Parallel Splash
  - Dynamic Scheduling
  - Partitioning
- Experimental Results
- Conclusions
The Inference Problem
Examples:
- What is the probability that Bob smokes given that Alice smokes?
- What is the best configuration of the protein side-chains?
- What is the probability that each pixel is black?
Inference is NP-hard in general.
Approximate inference: Belief Propagation.
Belief Propagation (BP)
Iterative message passing algorithm.
Naturally parallel algorithm.
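To make the message-passing step concrete, below is a minimal sketch of a single sum-product message update for a pairwise model. The data layout (NumPy potential tables, a dict of inbound messages keyed by neighbor) is an illustrative assumption, not the implementation used in the paper.

import numpy as np

def send_message(unary_u, edge_potentials_u, in_messages_u, target):
    # Sum-product message from vertex u to its neighbor `target`.
    # unary_u:           unary potential of u (length |X_u| array)
    # edge_potentials_u: dict neighbor -> |X_u| x |X_neighbor| pairwise table
    # in_messages_u:     dict neighbor -> current message INTO u
    cavity = unary_u.copy()
    for neighbor, msg in in_messages_u.items():
        if neighbor != target:          # exclude the message coming from the target
            cavity *= msg
    new_msg = edge_potentials_u[target].T @ cavity   # marginalize out X_u
    return new_msg / new_msg.sum()                   # normalize for stability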
Parallel Synchronous BP
Given the old messages, all new messages can be computed in parallel:
- Each CPU (CPU 1 … CPU n) reads old messages and writes new messages.
- Map-Reduce ready!
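A sketch of one synchronous round, assuming the same dict-of-tables layout as the message-update sketch above. The thread pool merely stands in for whatever map-reduce or multicore backend is used; every update reads only the old messages, which is what makes the round embarrassingly parallel.

from concurrent.futures import ThreadPoolExecutor

def synchronous_bp_round(edges, unary, pairwise, old_msgs, n_workers=4):
    # edges:    list of directed pairs (u, v)
    # unary:    dict vertex -> unary potential
    # pairwise: dict (u, v) -> |X_u| x |X_v| potential table
    # old_msgs: dict (u, v) -> message from u to v in the previous round
    def update(edge):
        u, v = edge
        cavity = unary[u].copy()
        for (w, t), msg in old_msgs.items():
            if t == u and w != v:       # all OLD messages into u, except from v
                cavity *= msg
        m = pairwise[(u, v)].T @ cavity
        return edge, m / m.sum()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return dict(pool.map(update, edges))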
Sequential Computational Structure
Hidden Sequential Structure
Hidden Sequential Structure
Evidence enters at both ends of the chain.
Running Time = (time for a single parallel iteration) × (number of iterations)
Optimal Sequential Algorithm
Running time on a chain of length n:
- Naturally parallel (synchronous): 2n²/p with p ≤ 2n processors
- Forward-Backward: 2n with p = 1
- Optimal parallel: n with p = 2
There is a gap between the naturally parallel schedule and the sequential/optimal schedules.
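For reference, here is the forward-backward sweep that attains the 2n sequential bound, sketched for a pairwise chain model; the NumPy table layout is assumed for illustration.

import numpy as np

def chain_forward_backward(unary, pairwise):
    # unary[i]:    potential of variable i (length |X_i| array)
    # pairwise[i]: |X_i| x |X_{i+1}| table coupling variables i and i+1
    n = len(unary)
    fwd = [np.ones_like(u) for u in unary]   # message into i from the left
    bwd = [np.ones_like(u) for u in unary]   # message into i from the right
    for i in range(1, n):                    # forward sweep: n-1 messages
        m = pairwise[i - 1].T @ (unary[i - 1] * fwd[i - 1])
        fwd[i] = m / m.sum()
    for i in range(n - 2, -1, -1):           # backward sweep: n-1 messages
        m = pairwise[i] @ (unary[i + 1] * bwd[i + 1])
        bwd[i] = m / m.sum()
    beliefs = [unary[i] * fwd[i] * bwd[i] for i in range(n)]
    return [b / b.sum() for b in beliefs]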
Key Computational Structure
- Naturally parallel (synchronous): 2n²/p with p ≤ 2n processors
- Optimal parallel: n with p = 2
The gap comes from inherent sequential structure; closing it requires efficient scheduling.
Outline
- Overview
- Graphical Models: Statistical Structure
- Inference: Computational Structure
- τε-Approximate Messages: Statistical Structure
- Parallel Splash
  - Dynamic Scheduling
  - Partitioning
- Experimental Results
- Conclusions
Parallelism by Approximation
(Figure: true messages on a chain of vertices 1–10 compared with a τε-approximation.)
τε represents the minimal sequential structure.
Tau-Epsilon Structure
Often τε decreases quickly.
(Figure: message approximation error on a log scale for the protein networks and the Markov logic networks.)
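One hedged way to see this structure empirically on a chain: perturb the evidence at one end, rerun the forward sweep, and report how far the perturbation travels before its effect falls below ε. This is only an illustrative probe of τε, not the formal definition used in the analysis; it reuses the forward sweep from the chain sketch above.

import numpy as np

def probe_tau_epsilon(unary, pairwise, epsilon=1e-3, seed=0):
    def forward(un):
        msgs, m = [], np.ones_like(un[0])
        for i in range(1, len(un)):
            m = pairwise[i - 1].T @ (un[i - 1] * m)
            m = m / m.sum()
            msgs.append(m)
        return msgs
    rng = np.random.default_rng(seed)
    perturbed = [unary[0] * rng.random(len(unary[0]))] + list(unary[1:])
    base, pert = forward(list(unary)), forward(perturbed)
    for dist, (a, b) in enumerate(zip(base, pert), start=1):
        if np.abs(a - b).max() < epsilon:
            return dist                  # influence decayed below epsilon at this distance
    return len(unary)                    # no decay observed within this chain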
Running Time Lower Bound
Theorem: using p processors, it is not possible to obtain a τε-approximation in time less than the sum of a parallel component, on the order of n/p, and a sequential component, on the order of τε.
Proof: Running Time Lower Bound
Consider one direction of the chain using p/2 processors (p ≥ 2).
We must make n − τε vertices τε-left-aware.
A single processor can only make k − τε + 1 vertices left-aware in k iterations.
Optimal Parallel Scheduling
Partition the chain into contiguous blocks, one per processor (Processor 1, 2, 3, …).
Theorem: using p processors, this algorithm achieves a τε-approximation in time O(n/p + τε).
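A sequential emulation of that schedule, sketched below: the chain is cut into p contiguous blocks, each block runs a local forward and backward sweep, and only the boundary messages (snapshotted at the start of the round) cross between blocks. In a real implementation each block would run on its own processor; the chain layout matches the forward-backward sketch above.

import numpy as np

def block_parallel_sweeps(unary, pairwise, p, iterations):
    n = len(unary)
    blocks = [(b * n // p, (b + 1) * n // p) for b in range(p)]
    fwd = [np.ones_like(u) for u in unary]
    bwd = [np.ones_like(u) for u in unary]
    for _ in range(iterations):
        # Snapshot the messages that cross block boundaries this round.
        fwd_in = [fwd[lo - 1].copy() if lo > 0 else None for lo, _ in blocks]
        bwd_in = [bwd[hi].copy() if hi < n else None for _, hi in blocks]
        for b, (lo, hi) in enumerate(blocks):      # each block is independent -> parallel
            for i in range(lo, hi):                # local forward sweep
                if i == 0:
                    continue
                left = fwd_in[b] if i == lo else fwd[i - 1]
                m = pairwise[i - 1].T @ (unary[i - 1] * left)
                fwd[i] = m / m.sum()
            for i in range(hi - 1, lo - 1, -1):    # local backward sweep
                if i == n - 1:
                    continue
                right = bwd_in[b] if i == hi - 1 else bwd[i + 1]
                m = pairwise[i] @ (unary[i + 1] * right)
                bwd[i] = m / m.sum()
    beliefs = [unary[i] * fwd[i] * bwd[i] for i in range(n)]
    return [bel / bel.sum() for bel in beliefs]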
Proof: Optimal Parallel Scheduling
All vertices start left-aware of the leftmost vertex on their own processor.
Boundary messages are exchanged; after the next iteration the awareness extends by another block.
After k parallel iterations, each vertex is (k−1)(n/p) left-aware.
Proof: Optimal Parallel Scheduling
After k parallel iterations, each vertex is (k−1)(n/p) left-aware.
Since all vertices must be made τε left-aware: (k−1)(n/p) ≥ τε, so k ≥ τε p/n + 1.
Each iteration takes O(n/p) time, giving total time k · O(n/p) = O(n/p + τε).
Comparing with Synchronous BP
(Figure: synchronous schedule versus optimal schedule across Processors 1–3, and the gap between them.)
Outline
- Overview
- Graphical Models: Statistical Structure
- Inference: Computational Structure
- τε-Approximate Messages: Statistical Structure
- Parallel Splash
  - Dynamic Scheduling
  - Partitioning
- Experimental Results
- Conclusions
The Splash Operation
Generalize the optimal chain algorithm to arbitrary cyclic graphs:
1) Grow a BFS spanning tree with fixed size
2) Forward pass computing all messages at each vertex
3) Backward pass computing all messages at each vertex
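A minimal sketch of the Splash operation on a general graph. Here `graph[v]` is assumed to list v's neighbors and `send_all_messages(v)` to recompute every outbound message of v; both are placeholders for whatever graph and message code is in use.

from collections import deque

def splash(root, graph, splash_size, send_all_messages):
    # 1) Grow a BFS ordering around the root, capped at splash_size vertices.
    order, visited, frontier = [], {root}, deque([root])
    while frontier and len(order) < splash_size:
        v = frontier.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in visited:
                visited.add(w)
                frontier.append(w)
    # 2) Forward pass: from the leaves toward the root (reverse BFS order).
    for v in reversed(order):
        send_all_messages(v)
    # 3) Backward pass: from the root back out to the leaves (BFS order).
    for v in order:
        send_all_messages(v)
    return order            # the vertices this Splash updated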
Running Parallel Splashes
Each CPU (CPU 1, 2, 3) runs Splashes over its own partition and local state:
- Partition the graph
- Schedule Splashes locally
- Transmit the messages along the boundary of the partition
Key challenges:
1) How do we schedule Splashes?
2) How do we partition the graph?
Where do we Splash?
Assign priorities and use a scheduling queue (in each CPU's local state) to select Splash roots.
How do we assign priorities?
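One way to sketch the per-processor scheduler: a priority queue keyed by each vertex's residual picks the next Splash root, and the vertices a Splash touches are re-queued with refreshed priorities. The `residual` dict is assumed to be kept up to date by the message-update code, and the splash routine is the sketch above; the threshold is an illustrative stopping rule.

import heapq

def run_scheduler(graph, residual, splash_size, send_all_messages,
                  threshold=1e-4):
    heap = [(-residual[v], v) for v in graph]
    heapq.heapify(heap)
    while heap:
        neg_r, root = heapq.heappop(heap)
        if -neg_r != residual.get(root, 0.0):
            continue                      # stale entry; a fresher one is queued
        if -neg_r < threshold:
            break                         # largest residual is tiny: converged
        touched = splash(root, graph, splash_size, send_all_messages)
        for v in touched:                 # re-queue with refreshed priorities
            heapq.heappush(heap, (-residual[v], v))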
Message Scheduling
Residual Belief Propagation [Elidan et al., UAI 06]: assign priorities based on the change in inbound messages.
- Small change: expensive no-op
- Large change: informative update
Problem with Message Scheduling
Small changes in messages do not imply small changes in belief: a small change in every inbound message can still produce a large change in the belief.
Problem with Message Scheduling
Large changes in a single message do not imply large changes in belief: a large change in one inbound message can still produce only a small change in the belief.
Belief Residual Scheduling
Assign priorities based on the cumulative change in belief: the residual r_v sums the change in belief caused by each new inbound message since the vertex was last updated.
A vertex whose belief has changed substantially since last being updated will likely produce informative new messages.
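A sketch of the bookkeeping, assuming beliefs are stored per vertex: each time an inbound message to v changes, v's belief is recomputed and the size of the change is added to r_v; a Splash that updates v resets r_v to zero. The L1 norm and the helper names are illustrative assumptions.

import numpy as np

def bump_belief_residual(residual, beliefs, v, new_belief):
    # Accumulate the L1 change in v's belief caused by a new inbound message.
    residual[v] = residual.get(v, 0.0) + np.abs(new_belief - beliefs[v]).sum()
    beliefs[v] = new_belief

def clear_belief_residual(residual, v):
    # Called when v itself is updated (e.g. inside a Splash).
    residual[v] = 0.0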
Message vs. Belief Scheduling
Belief scheduling improves accuracy and convergence.
(Figures: L1 error in beliefs over time, lower for belief scheduling than for message scheduling; and percent of runs converged in 4 hours, higher for belief residuals than for message residuals.)
Splash Pruning
Belief residuals can be used to dynamically reshape and resize Splashes: vertices with low belief residual are pruned from the Splash.
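The pruning itself is a small change to the tree-growing step of the Splash sketch above: vertices whose residual is below a threshold are simply not added, so Splashes shrink automatically in converged regions. The threshold value is an illustrative assumption.

from collections import deque

def grow_pruned_splash(root, graph, residual, splash_size, min_residual=1e-4):
    order, visited, frontier = [], {root}, deque([root])
    while frontier and len(order) < splash_size:
        v = frontier.popleft()
        order.append(v)
        for w in graph[v]:
            # Prune: skip neighbors whose belief residual is already tiny.
            if w not in visited and residual.get(w, 0.0) >= min_residual:
                visited.add(w)
                frontier.append(w)
    return order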
Splash Size
Using Splash pruning, the algorithm is able to dynamically select the optimal Splash size.
(Figure: running time in seconds versus Splash size in messages, with and without pruning; lower is better.)