Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming
Ashwin M. Aji (Ph.D. Candidate), Wu-chun Feng (Virginia Tech)
Pavan Balaji, James Dinan, Rajeev Thakur (Argonne National Laboratory)
synergy.cs.vt.edu
Data Movement in CPU-GPU Clusters

[Figure: two nodes (MPI Rank 0 and MPI Rank 1) connected by a network; each node has a CPU with main memory and a GPU with its own device memory.]

Manual data movement: the GPU buffer is first staged in host memory, then moved with MPI.

if (rank == 0) {
    GPUMemcpy(host_buf, dev_buf, D2H);
    MPI_Send(host_buf, ...);
}
if (rank == 1) {
    MPI_Recv(host_buf, ...);
    GPUMemcpy(dev_buf, host_buf, H2D);
}
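A runnable CUDA version of this manual staging pattern, as a sketch: the element count N, the tag, and the exchange() wrapper are illustrative, and GPUMemcpy in the slide stands in for cudaMemcpy.

#include <mpi.h>
#include <cuda_runtime.h>

#define N   (1 << 20)   /* illustrative element count */
#define TAG 0           /* illustrative message tag */

void exchange(int rank, float *dev_buf, float *host_buf)
{
    if (rank == 0) {
        /* Stage the device buffer in host memory, then send it. */
        cudaMemcpy(host_buf, dev_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, N, MPI_FLOAT, 1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive into host memory, then copy the data up to the device. */
        MPI_Recv(host_buf, N, MPI_FLOAT, 0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(dev_buf, host_buf, N * sizeof(float), cudaMemcpyHostToDevice);
    }
}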
Data Movement in CPU-GPU Clusters (Pipelined)

int processed[chunks] = {0};
for (j = 0; j < chunks; j++) {                        /* pipeline: enqueue all D2H copies */
    cudaMemcpyAsync(host_buf+offset, gpu_buf+offset, D2H, streams[j], ...);
}
numProcessed = 0; j = 0; flag = 1;
while (numProcessed < chunks) {
    if (cudaStreamQuery(streams[j]) == cudaSuccess) { /* chunk j copied: start MPI */
        MPI_Isend(host_buf+offset, ...);
        numProcessed++; processed[j] = 1;
    }
    MPI_Testany(...);                                 /* check MPI progress */
    if (numProcessed < chunks)                        /* advance to the next chunk */
        while (flag) { j = (j+1) % chunks; flag = processed[j]; }
}
MPI_Waitall(...);
Data Movement in CPU-GPU Clusters (Pipelined)

• Performance vs. productivity tradeoff
• Multiple optimizations for different...
  – ...GPUs (AMD/Intel/NVIDIA)
  – ...programming models (CUDA/OpenCL)
  – ...library versions (CUDA v3/CUDA v4)
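For reference, a self-contained sketch of the pipelining loop above with the elided arguments filled in; CHUNKS, CHUNK_ELEMS, the per-chunk tag, and the pipelined_send() wrapper are assumptions, and the slide's MPI_Testany() progress call is omitted for brevity.

#include <mpi.h>
#include <cuda_runtime.h>

#define CHUNKS      16          /* illustrative pipeline depth */
#define CHUNK_ELEMS (1 << 18)   /* illustrative chunk size (floats) */

/* As soon as a chunk's asynchronous D2H copy finishes, its MPI_Isend is
 * posted while the later copies are still in flight. */
void pipelined_send(float *gpu_buf, float *host_buf, int dest,
                    cudaStream_t streams[CHUNKS])
{
    MPI_Request reqs[CHUNKS];
    int processed[CHUNKS] = {0};
    int numProcessed = 0, j;

    for (j = 0; j < CHUNKS; j++) {
        size_t offset = (size_t)j * CHUNK_ELEMS;
        cudaMemcpyAsync(host_buf + offset, gpu_buf + offset,
                        CHUNK_ELEMS * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[j]);
    }

    j = 0;
    while (numProcessed < CHUNKS) {
        if (!processed[j] && cudaStreamQuery(streams[j]) == cudaSuccess) {
            size_t offset = (size_t)j * CHUNK_ELEMS;
            MPI_Isend(host_buf + offset, CHUNK_ELEMS, MPI_FLOAT,
                      dest, j /* per-chunk tag */, MPI_COMM_WORLD, &reqs[j]);
            processed[j] = 1;
            numProcessed++;
        }
        j = (j + 1) % CHUNKS;   /* round-robin over the chunks */
    }
    MPI_Waitall(CHUNKS, reqs, MPI_STATUSES_IGNORE);
}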
GPU-integrated MPI

[Figure: as before, two nodes (MPI Rank 0 and MPI Rank 1) connected by a network, each with a CPU (main memory) and a GPU (device memory).]

• Examples: MPI-ACC, MVAPICH, Open MPI
• Programmability/Productivity: multiple accelerators and programming models (CUDA, OpenCL)
• Performance: system-specific and vendor-specific optimizations (pipelining, GPUDirect, pinned host memory, IOH affinity)

if (rank == 0) {
    MPI_Send(any_buf, ...);
}
if (rank == 1) {
    MPI_Recv(any_buf, ...);
}
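The same exchange written against a GPU-integrated (CUDA-aware) MPI, as a sketch: it assumes the library accepts device pointers directly, and the count and tag are illustrative.

#include <mpi.h>

#define N   (1 << 20)   /* illustrative element count */
#define TAG 0

void exchange(int rank, float *any_buf /* host or device pointer */)
{
    /* The MPI library detects where the buffer lives and handles staging,
     * pipelining, or GPUDirect internally. */
    if (rank == 0)
        MPI_Send(any_buf, N, MPI_FLOAT, 1, TAG, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(any_buf, N, MPI_FLOAT, 0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}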
Need for Synchronization

foo_blocking(buf_1);       /* No sync needed */
bar(buf_1);

foo_nonblocking(buf_1);    /* Sync needed: bar reads buf_1 before foo may have finished */
bar(buf_1);

foo_nonblocking(buf_1);    /* No sync needed: bar uses a different buffer */
bar(buf_2);
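A concrete GPU analogue of the pattern above, sketched with the CUDA runtime; use_on_host() is a hypothetical consumer of the data, and the stream and buffers are assumed to be set up by the caller.

#include <cuda_runtime.h>
#include <stddef.h>

void use_on_host(float *buf);   /* hypothetical consumer of the copied data */

void sync_examples(float *host_buf, float *dev_buf, size_t bytes, cudaStream_t s)
{
    /* Blocking: cudaMemcpy returns only after the copy has completed. */
    cudaMemcpy(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost);
    use_on_host(host_buf);            /* no sync needed */

    /* Non-blocking: the copy may still be in flight when the call returns. */
    cudaMemcpyAsync(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);         /* sync needed before the host reads the data */
    use_on_host(host_buf);
}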
Need for Synchronization in GPU-integrated MPI

• Interleaved MPI and GPU operations
• Dependent vs. independent operations
• Blocking vs. non-blocking operations

if (rank == 0) {
    GPUExec(any_buf, ...);
    MPI_Isend(other_buf, ...);
    MPI_Send(any_buf, ...);
    GPUMemcpy(any_buf, H2D);
    MPI_Isend(other_buf, ...);
    MPI_Waitall();
}
if (rank == 1) {
    GPUMemcpy(other_buf, H2D);
    GPUExec(other_buf, ...);
    MPI_Recv(any_buf, ...);
    GPUMemcpy(any_buf, H2D);
}
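A CUDA-flavored sketch of the dependent case under explicit synchronization semantics, where the application (not the MPI library) orders the GPU work; produce_kernel() and the two streams are illustrative stand-ins.

#include <mpi.h>
#include <cuda_runtime.h>

/* Illustrative kernel standing in for GPUExec(any_buf, ...). */
__global__ void produce_kernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (float)i;
}

void send_with_explicit_sync(float *any_buf_dev, float *other_buf_dev,
                             float *other_host, int n, int dest,
                             cudaStream_t s_any, cudaStream_t s_other)
{
    /* Dependent: any_buf is produced on the GPU and then sent over MPI. */
    produce_kernel<<<(n + 255) / 256, 256, 0, s_any>>>(any_buf_dev, n);

    /* Independent: other_buf never touches any_buf, so this copy may
     * overlap with the send and needs no synchronization here. */
    cudaMemcpyAsync(other_buf_dev, other_host, n * sizeof(float),
                    cudaMemcpyHostToDevice, s_other);

    /* Explicit semantics: the application must ensure the kernel has
     * finished before the (GPU-aware) MPI_Send reads any_buf. */
    cudaStreamSynchronize(s_any);
    MPI_Send(any_buf_dev, n, MPI_FLOAT, dest, 0, MPI_COMM_WORLD);
}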
Rest of this talk…

• With interleaved MPI and GPU operations, what are (or should be) the synchronization semantics?
  – Explicit
  – Implicit
• How can the synchronization semantics affect performance and productivity?

[Diagram: GPU-integrated MPI designs, split into UVA-based and Attribute-based.]
GPU-integrated MPI: UVA-based Design

• What is UVA? – Unified Virtual Addressing

[Figure: default case shows multiple memory spaces (separate CPU and GPU address spaces); with UVA, a single virtual address space spans CPU and GPU memory.]

Source: Peer-to-Peer & Unified Virtual Addressing CUDA Webinar
http://on-demand.gputechconf.com/gtc-express/2011/presentations/cuda_webinars_GPUDirect_uva.pdf
GPU-integrated MPI: UVA-based Design

• A void * can point to CPU memory or to any GPU's memory (GPU 0, GPU 1, ..., GPU n)
• cuPointerGetAttribute queries the buffer type (CPU or GPU i)
• Exclusive to CUDA v4.0+

[Figure: UVA provides a single address space.]

Source: Peer-to-Peer & Unified Virtual Addressing CUDA Webinar
http://on-demand.gputechconf.com/gtc-express/2011/presentations/cuda_webinars_GPUDirect_uva.pdf
GPU-integrated MPI: UVA-based Design

MPI_Send((void *)buf, count, MPI_CHAR, dest, tag, MPI_COMM_WORLD);

– buf: CPU or GPU buffer
– The MPI implementation can pipeline the transfer if buf is a GPU buffer
– No change to the MPI standard

if (my_rank == sender) {        /* send from GPU (CUDA) */
    MPI_Send(dev_buf, ...);
} else {                        /* receive into host memory */
    MPI_Recv(host_buf, ...);
}
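A sketch of how a UVA-based implementation might classify the buffer with the CUDA driver API's cuPointerGetAttribute; it assumes the driver API is already initialized, and the branch bodies are placeholders.

#include <cuda.h>
#include <stdint.h>

/* Returns 1 if buf is device memory, 0 otherwise. cuPointerGetAttribute
 * fails for ordinary host pointers CUDA does not know about, which is
 * treated as "host buffer" here. */
static int is_device_buffer(const void *buf)
{
    CUmemorytype mem_type;
    CUresult res = cuPointerGetAttribute(&mem_type,
                                         CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                         (CUdeviceptr)(uintptr_t)buf);
    return (res == CUDA_SUCCESS && mem_type == CU_MEMORYTYPE_DEVICE);
}

/* Sketch of the branch inside the MPI library's send path. */
void send_path(const void *buf)
{
    if (is_device_buffer(buf)) {
        /* GPU buffer: stage through host memory, pipelining D2H copies
         * with network sends. */
    } else {
        /* CPU buffer: take the ordinary host send path. */
    }
}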
GPU-integrated MPI: Attribute-based Design

MPI_Send((void *)buf, count, new_type, dest, tag, MPI_COMM_WORLD);

– new_type: add attributes to built-in datatypes (MPI_INT, MPI_CHAR)

double *cuda_dev_buf; cl_mem ocl_dev_buf;

/* initialize a custom type */
MPI_Type_dup(MPI_CHAR, &type);

if (my_rank == sender) {        /* send from GPU (CUDA) */
    MPI_Type_set_attr(type, BUF_TYPE, BUF_TYPE_CUDA);
    MPI_Send(cuda_dev_buf, type, ...);
} else {                        /* receive into GPU (OpenCL) */
    MPI_Type_set_attr(type, BUF_TYPE, BUF_TYPE_OPENCL);
    MPI_Recv(ocl_dev_buf, type, ...);
}

MPI_Type_free(&type);
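BUF_TYPE, BUF_TYPE_CUDA, and BUF_TYPE_OPENCL are this design's proposed names rather than standard MPI; a sketch of how they could be modeled with standard datatype attributes, and how the library would read them back, follows.

#include <mpi.h>
#include <stdint.h>

/* Hypothetical modeling of the slide's BUF_TYPE attribute as an MPI
 * datatype keyval; the value names mirror the slide. */
static int BUF_TYPE;
enum { BUF_TYPE_CPU = 0, BUF_TYPE_CUDA = 1, BUF_TYPE_OPENCL = 2 };

void buf_type_init(void)
{
    /* Create the keyval once, e.g. when the library is initialized. */
    MPI_Type_create_keyval(MPI_TYPE_NULL_COPY_FN, MPI_TYPE_NULL_DELETE_FN,
                           &BUF_TYPE, NULL);
}

/* Inside the library's send/recv path: recover the buffer type that the
 * application attached to the datatype. */
int lookup_buf_type(MPI_Datatype type)
{
    void *attr_val;
    int flag;
    MPI_Type_get_attr(type, BUF_TYPE, &attr_val, &flag);
    return flag ? (int)(intptr_t)attr_val : BUF_TYPE_CPU;   /* default: host buffer */
}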