

  1. A Dynamic Programming-based MCMC Framework for Solving DCOPs with GPUs
  Ferdinando Fioretto (1,2), joint work with William Yeoh (2) and Enrico Pontelli (2)
  (1) University of Michigan, (2) New Mexico State University
  CP 2016, Toulouse

  2. [Introduction · GPUs · DMCMC · Results · Conclusions]
  Distributed Discrete Optimization with Preferences

  3. GPUs
  • Every new desktop/laptop is now equipped with a graphics processing unit (GPU).
  • A GPU is a massively parallel architecture.
  • For most of their life, these GPUs sit idle.
  • General-purpose GPU applications: numerical analysis, bioinformatics, deep learning, MathWorks MATLAB.

  4. Outline
  • Introduction
  • GPUs
  • D-MCMC
  • Results
  • Conclusions

  5. Multi-Agent Constraint Optimization
  • A DCOP is a tuple <X, D, F, A, α>, where:
    • X is a set of variables.
    • D is a set of finite domains, one for each variable.
    • F is a set of constraints between variables.
    • A is a set of agents, controlling the variables in X.
    • α is a mapping from variables to agents.
  [Figure: agents control variables linked by constraints; a DCOP contrasted with a centralized solver running a centralized algorithm.]
  Example constraint utility table:
    x_a  x_b  U
    0    0    3
    0    1    20
    1    0    2
    1    1    5

  6. Multi-Agent Constraint Optimization
  • A DCOP is a tuple <X, D, F, A, α>, where:
    • X is a set of variables.
    • D is a set of finite domains, one for each variable.
    • F is a set of constraints between variables.
    • A is a set of agents, controlling the variables in X.
    • α is a mapping from variables to agents.
  [Figure: agent a_i controls boundary variables B_i (here x_1, x_3), shared with other agents, and local variables L_i (here x_2, x_4, x_5).]

  7. Multi-Agent Constraint Optimization
  • A DCOP is a tuple <X, D, F, A, α>, where:
    • X is a set of variables.
    • D is a set of finite domains, one for each variable.
    • F is a set of constraints between variables.
    • A is a set of agents, controlling the variables in X.
    • α is a mapping from variables to agents.
  • GOAL: find a utility-maximal assignment:
      x* = argmax_x F(x) = argmax_x Σ_{f ∈ F} f(x|_{scope(f)})
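As a concrete (toy, centralized) illustration of the objective above, a brute-force solver can enumerate every assignment and keep the utility-maximal one. The variable names and the single constraint table below are made up for the sketch, reusing the utilities shown on the earlier example slide; real DCOP algorithms are distributed and far more efficient than this enumeration.

```python
from itertools import product

# Toy DCOP: two binary variables x_a, x_b and one constraint (utility table).
domains = {"x_a": [0, 1], "x_b": [0, 1]}
constraints = [
    (("x_a", "x_b"), {(0, 0): 3, (0, 1): 20, (1, 0): 2, (1, 1): 5}),
]

def utility(assignment):
    # F(x) = sum over constraints f of f(x restricted to scope(f))
    return sum(table[tuple(assignment[v] for v in scope)]
               for scope, table in constraints)

def solve_brute_force():
    # x* = argmax_x F(x), enumerated over the full Cartesian product.
    names = list(domains)
    best = max(product(*(domains[n] for n in names)),
               key=lambda vals: utility(dict(zip(names, vals))))
    return dict(zip(names, best))

best = solve_brute_force()
print(best, utility(best))  # x_a=0, x_b=1 attains the maximal utility 20
```

The enumeration is exponential in the number of variables, which is exactly why the deck turns to sampling-based approximations next.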

  8. MCMC Sampling
  • MCMC algorithms approximate probability distributions.
  • They use a proposal distribution to generate a sequence of samples z(1), z(2), …, which forms a Markov chain.
  • The quality of the samples improves as a function of the number of steps.
  Source: http://xr0038.hatenadiary.jp/
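As a minimal illustration of these bullets (not the algorithm of this paper), a Metropolis–Hastings sampler draws candidates from a proposal distribution and accepts or rejects them, producing a Markov chain whose empirical distribution approaches the target as the number of steps grows. The target density and proposal below are arbitrary choices for the sketch.

```python
import math
import random

def metropolis_hastings(log_p, proposal, x0, steps, rng):
    """Generate a Markov chain z(1), z(2), ... targeting exp(log_p),
    using a symmetric proposal distribution."""
    x, samples = x0, []
    for _ in range(steps):
        cand = proposal(x, rng)            # draw from the proposal distribution
        accept = math.exp(min(0.0, log_p(cand) - log_p(x)))
        if rng.random() < accept:          # accept/reject step
            x = cand
        samples.append(x)
    return samples

# Toy target: standard normal; symmetric random-walk proposal.
rng = random.Random(0)
chain = metropolis_hastings(
    log_p=lambda z: -0.5 * z * z,
    proposal=lambda z, r: z + r.uniform(-1.0, 1.0),
    x0=5.0, steps=20000, rng=rng)
mean = sum(chain[5000:]) / len(chain[5000:])  # discard burn-in
```

Even though the chain starts far from the mode (at 5.0), the post-burn-in sample mean lands near the target's mean of 0, illustrating how sample quality improves with more steps.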

  9. MCMC Sampling
  • MCMC sampling algorithms can be used to solve DCOPs [Nguyen et al., AAMAS 2013].
  • MCMC sampling algorithms can be used to solve the maximum a posteriori (MAP) estimation problem.
  • The authors provide a mapping from solving a DCOP to solving a MAP problem.

  10. Graphics Processing Units (GPUs)
  • A GPU is a massively parallel architecture:
    • thousands of multi-threaded computing cores;
    • very high memory bandwidth;
    • ~80% of transistors devoted to data processing rather than caching.
  • However:
    • GPU cores are slower than CPU cores;
    • GPU memories have different sizes and access times;
    • GPU programming is more challenging and time-consuming.

  11. Execution Model
  • A thread is the basic parallel unit, identified by a thread ID.
  • Threads are organized into blocks.
  • Blocks are scheduled in parallel over several streaming multiprocessors (SMs).
  • Single Instruction Multiple Thread (SIMT) parallel model.
  [Figure: the CPU launches kernels on the GPU; each kernel is a grid of blocks, and each block is a grid of threads.]

  12. Memory Hierarchy
  • The GPU memory architecture is rather involved.
  • Registers: fastest; accessible only by a single thread; lifetime of the thread.
  • Shared memory: fast; accessible by all threads in a block.
  • Global memory: high access latency; potential for traffic congestion.
  [Figure: a grid of blocks, each with per-thread registers and per-block shared memory, above the device-wide global and constant memories accessed by the host.]

  13. CUDA: Compute Unified Device Architecture
  [Figure: the host (CPU) and the device (GPU).]

  14. CUDA: Compute Unified Device Architecture
  • Allocate device memory and copy the input data from the host into the device's global memory:
      cudaMalloc(&deviceV, sizeV);
      cudaMemcpy(deviceV, hostV, sizeV, ...);

  15. CUDA: Compute Unified Device Architecture
  • Kernel invocation; the launch configuration gives the number of blocks and the number of threads per block:
      cudaKernel<<<nBlocks, nThreads>>>(...);

  16. CUDA: Compute Unified Device Architecture
  • Copy the results back from the device's global memory to the host:
      cudaMemcpy(hostV, deviceV, sizeV, ...);

  17. D-MCMC: Related Work — D-Gibbs [Nguyen et al., AAMAS 2013]
  Algorithm 1: Gibbs(z_1, …, z_n)
    1  for i = 1 to n do
    2      z_i^0 ← Initialize(z_i)
    3  end
    4  for t = 1 to T do
    5      for i = 1 to n do
    6          z_i^t ← Sample(P(z_i | z_1^t, …, z_{i-1}^t, z_{i+1}^{t-1}, …, z_n^{t-1}))
    7      end
    8  end
  • Computing the normalizing constant can be expensive.
  • Many samples are needed to converge.
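The pseudocode above can be sketched in plain Python. The factor tables and variable names here are invented for illustration, and each conditional is normalized by exhaustive enumeration over the (small) variable domain, which is exactly the normalizing-constant cost the slide warns about.

```python
import math
import random

def gibbs(domains, factors, T, rng):
    """Algorithm 1: initialize z, then for T sweeps resample each z_i
    conditioned on the current values of all the other variables."""
    z = {v: rng.choice(dom) for v, dom in domains.items()}   # lines 1-3
    for _ in range(T):                                       # line 4
        for v, dom in domains.items():                       # line 5
            # Unnormalized P(z_v = d | rest), proportional to exp(total utility).
            weights = []
            for d in dom:
                z[v] = d
                score = sum(t[tuple(z[u] for u in scope)] for scope, t in factors)
                weights.append(math.exp(score))
            total = sum(weights)                             # normalizing constant Z
            z[v] = rng.choices(dom, weights=[w / total for w in weights])[0]  # line 6
    return z

# Toy model strongly favoring the assignment (x1=0, x2=1).
domains = {"x1": [0, 1], "x2": [0, 1]}
factors = [(("x1", "x2"), {(0, 0): 0.0, (0, 1): 3.0, (1, 0): 0.0, (1, 1): 0.0})]
sample = gibbs(domains, factors, T=50, rng=random.Random(42))
```

Any single returned sample is random, but across many chains the high-utility assignment (0, 1) dominates, which is how sampling approximates the DCOP maximization.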

  18. DMCMC
  • Each agent controls several variables.
  • Given values for its boundary variables, each agent can solve its local sub-problem independently of the other agents.
  [Figure: agents a_i and a_j, each with local variables L_i, L_j and boundary variables B_i, B_j connecting the two sub-problems.]

  19. DMCMC
  • Each agent controls several variables.
  • Given values for its boundary variables, it finds a solution for its local sub-problem using MCMC algorithms: Gibbs sampling and Metropolis–Hastings.
  [Figure: agent a_i with boundary variables B_i and local variables L_i.]
  Joint utility table:
    x_1  x_2  x_3  x_4  x_5  U
    0    0    3    2    5    21
    0    1    2    1    4    20
    0    2    3    5    1    32

  20. DMCMC: Local Sampling Process — 3 Levels of Parallelism
  • Level 1: each row of the joint utility table is computed in parallel, using several blocks [Fioretto et al., CP-15].
    x_1  x_2  x_3  x_4  x_5  U
    0    0    3    2    5    21
    0    1    2    1    4    20
    0    2    3    5    1    32

  21. DMCMC: Local Sampling Process — 3 Levels of Parallelism
  • Level 2: R multiple samples are drawn in parallel.
    x_1  x_2  x_3  x_4  x_5  U
    0    0    3    2    5    21
    0    0    2    1    4    20
    0    0    3    5    1    32
    0    1    2    1    4    20
    0    1    3    5    1    32
    0    1    2    1    4    20
    0    2    3    5    1    32
    …

  22. DMCMC: Local Sampling Process — 3 Levels of Parallelism
  • Level 3: the Gibbs sampling process itself; each thread evaluates the conditional for one domain value d_{i0}, d_{i1}, d_{i2}, …:
      q(x_k = d | x_l ∈ L_i \ {x_k}) = (1/Z) exp( Σ_{f_j ∈ F_i} f_j(z | x_{f_j}) )
  [Figure: within a block, one thread per domain value evaluates its copy of the conditional.]
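At this third level, one GPU thread is assigned to each domain value d: every thread evaluates the unnormalized conditional for its own value, and a normalization over the per-thread results yields the sampling distribution. A sequential Python sketch of what the threads compute jointly (the factor table and names are invented for illustration):

```python
import math

def gibbs_conditional(x_k, domain, assignment, factors):
    """Compute q(x_k = d | rest) for every d in domain.
    On the GPU, each iteration of this loop would be one thread."""
    scores = []
    for d in domain:                       # thread id <-> domain value d
        z = dict(assignment, **{x_k: d})
        s = sum(t[tuple(z[u] for u in scope)] for scope, t in factors)
        scores.append(math.exp(s))         # exp(sum of local utilities)
    Z = sum(scores)                        # normalizing constant
    return [s / Z for s in scores]

factors = [(("x1", "x2"), {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 0.0})]
q = gibbs_conditional("x1", [0, 1], {"x2": 1}, factors)
```

Because each thread touches only its own domain value, the per-value evaluations are independent; only the final normalization by Z requires combining results across threads.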

  23. Algorithm Design and Data Structures
  • Ensure data accesses are coalesced. [Figure: good vs. bad access patterns.]
  • Minimize the accesses to global memory.
  • Pad the utility tables' rows; use perfect hashing.
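A hedged sketch of the row-padding idea: pad each utility-table row out to a multiple of a fixed alignment (32 entries here, chosen to evoke a warp; the constant and layout are illustrative, not the paper's exact scheme) so that every row starts at an aligned offset and consecutive threads reading a row can coalesce their accesses.

```python
def pad_rows(table, align=32, fill=0):
    """Flatten a list-of-rows table, padding each row to a multiple
    of `align` entries so row r starts at offset r * padded_width."""
    width = max(len(row) for row in table)
    padded_width = ((width + align - 1) // align) * align  # round up
    flat = []
    for row in table:
        flat.extend(row)
        flat.extend([fill] * (padded_width - len(row)))    # pad to alignment
    return flat, padded_width

# Joint utility table from the earlier slide: rows of [x1..x5, U].
table = [[0, 0, 3, 2, 5, 21], [0, 1, 2, 1, 4, 20], [0, 2, 3, 5, 1, 32]]
flat, stride = pad_rows(table)  # each row now starts at a 32-aligned offset
```

The trade-off is extra memory for the fill entries in exchange for predictable, aligned row offsets, which is what coalesced global-memory access patterns need.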
