COOPERATIVE GROUPS
Kyrylo Perelygin, Yuan Lin
GTC 2017
Cooperative Groups: a flexible model for synchronization and communication within groups of threads.

AT A GLANCE
• Scalable cooperation among groups of threads
• Flexible parallel decompositions
• Composition across software boundaries
• Deploy everywhere

Benefits all applications. Examples include: persistent RNNs, physics, search algorithms, sorting.
LEVELS OF COOPERATION: TODAY

__syncthreads(): block-level synchronization barrier in CUDA.

[Diagram: the cooperation hierarchy — warps within an SM, the GPU, multi-GPU — with only the block level covered today.]
LEVELS OF COOPERATION: CUDA 9.0

For the current coalesced set of threads:
    auto g = coalesced_threads();

For a warp-sized group of threads:
    auto block = this_thread_block();
    auto g = tiled_partition<32>(block);

For CUDA thread blocks:
    auto g = this_thread_block();

For the device-spanning grid:
    auto g = this_grid();

For multiple grids spanning GPUs:
    auto g = this_multi_grid();

All Cooperative Groups functionality is within the cooperative_groups:: namespace.
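As a minimal sketch (the kernel name is illustrative, not from the slides), obtaining and using two of these handles might look like:

#include <cooperative_groups.h>
using namespace cooperative_groups;

__global__ void example_kernel() {
    thread_block block = this_thread_block();                 // all threads of this block
    thread_block_tile<32> warp = tiled_partition<32>(block);  // warp-sized tile
    block.sync();   // equivalent to __syncthreads()
    warp.sync();    // synchronizes only the 32 threads of this tile
}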
THREAD GROUP

Base type; the implementation depends on its construction. Unifies the various group types into one general, collective, thread group.

We need to extend the CUDA programming model with handles that can represent the groups of threads that can communicate/synchronize.

[Diagram: thread_group unifies thread_block, thread_block_tile, coalesced_group, grid_group, and multi_grid_group.]
THREAD BLOCK

Implicit group of all the threads in the launched thread block.

Implements the same interface as thread_group:

    void sync();            // Synchronize the threads in the group
    unsigned size();        // Total number of threads in the group
    unsigned thread_rank(); // Rank of the calling thread within [0, size)
    bool is_valid();        // Whether the group is valid (has not violated any API constraints)

And additional thread_block specific functions:

    dim3 group_index();     // 3-dimensional block index within the grid
    dim3 thread_index();    // 3-dimensional thread index within the block
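A hedged sketch of this interface in use, assuming a 1D launch (the kernel and buffer names are illustrative):

__global__ void write_ranks(unsigned *out) {
    thread_block block = this_thread_block();
    // group_index()/thread_index() mirror blockIdx/threadIdx
    unsigned gid = block.group_index().x * block.size() + block.thread_rank();
    out[gid] = block.thread_rank();
    block.sync();   // same semantics as __syncthreads()
}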
PROGRAM DEFINED DECOMPOSITION

CUDA kernel — all threads launched:
    thread_block g = this_thread_block();

All threads in the thread block enter foobar(thread_block g), which can partition further:
    thread_group tile32 = tiled_partition(g, 32);
    thread_group tile4 = tiled_partition(tile32, 4);

Restricted to powers of two, and <= 32 in the initial release.
GENERIC PARALLEL ALGORITHMS

Per-Block:
    g = this_thread_block();
    reduce(g, ptr, myVal);

Per-Warp:
    g = tiled_partition(this_thread_block(), 32);
    reduce(g, ptr, myVal);

__device__ int reduce(thread_group g, int *x, int val) {
    int lane = g.thread_rank();
    for (int i = g.size() / 2; i > 0; i /= 2) {
        x[lane] = val;
        g.sync();
        if (lane < i) val += x[lane + i];   // guard the upper half so no lane reads past the group
        g.sync();
    }
    return val;   // the full sum ends up in the thread with rank 0
}
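One possible way to drive this from a kernel; the shared scratch array (one slot per thread) and the block size are assumptions, not from the slides:

__global__ void block_sum(const int *in, int *out) {
    __shared__ int scratch[256];   // assumes blockDim.x == 256
    thread_block g = this_thread_block();
    int val = in[blockIdx.x * blockDim.x + threadIdx.x];
    val = reduce(g, scratch, val);                    // block-wide sum, valid in rank 0
    if (g.thread_rank() == 0) out[blockIdx.x] = val;  // one partial result per block
}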
THREAD BLOCK TILE

A subset of the threads of a thread block, divided into tiles in row-major order:

    thread_block_tile<32> tile32 = tiled_partition<32>(this_thread_block());
    thread_block_tile<4>  tile4  = tiled_partition<4>(this_thread_block());

Exposes additional functionality:
    .any()   .all()   .ballot()   .match_any()   .match_all()
    .shfl()  .shfl_down()  .shfl_up()  .shfl_xor()

Size known at compile time = fast!
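A short sketch of the vote operations on a warp-sized tile (the predicate and names are illustrative; match_any()/match_all() additionally require compute capability 7.0):

__global__ void vote_demo(const int *data, unsigned *ballots) {
    thread_block_tile<32> tile = tiled_partition<32>(this_thread_block());
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool pred = data[i] > 0;
    if (tile.any(pred)) {                    // at least one lane saw a positive value
        unsigned mask = tile.ballot(pred);   // one bit per lane where pred holds
        if (tile.thread_rank() == 0)
            ballots[i / 32] = mask;          // one ballot per 32-thread tile
    }
}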
STATIC TILE REDUCE

Per-tile of 16 threads:
    g = tiled_partition<16>(this_thread_block());
    tile_reduce(g, myVal);

template <unsigned size>
__device__ int tile_reduce(thread_block_tile<size> g, int val) {
    for (int i = g.size() / 2; i > 0; i /= 2) {
        val += g.shfl_down(val, i);   // the tile's sum accumulates in lane 0
    }
    return val;
}
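A possible kernel around tile_reduce; the output layout (one result per 16-thread tile) is an assumption:

__global__ void tile_sums(const int *in, int *out) {
    thread_block_tile<16> g = tiled_partition<16>(this_thread_block());
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int val = tile_reduce(g, in[i]);            // shfl_down leaves the tile's sum in lane 0
    if (g.thread_rank() == 0) out[i / 16] = val;
}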
GRID GROUP

A set of threads within the same grid, guaranteed to be resident on the device.

New CUDA launch API to opt in: cudaLaunchCooperativeKernel(…)

__global__ void kernel() {
    grid_group grid = this_grid();
    // load data
    // loop - compute, share data
    grid.sync();   // devices are now synced
}

The device needs to support the cooperativeLaunch property. Use the occupancy calculator to size the grid so that every block stays resident:

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, kernel, numThreads, 0);
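A hedged host-side sketch of this opt-in path (error checking omitted; numThreads and the device index are assumptions):

int dev = 0, numThreads = 256, numBlocksPerSm = 0;
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, dev);
if (prop.cooperativeLaunch) {   // device supports cooperative launch
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, kernel, numThreads, 0);
    dim3 grid(numBlocksPerSm * prop.multiProcessorCount);   // every block stays resident
    dim3 block(numThreads);
    cudaLaunchCooperativeKernel((void *)kernel, grid, block,
                                nullptr /* kernel takes no arguments */);
}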
GRID GROUP

The goal: keep as much state as possible resident.

• Genetic algorithms / particle simulations: synchronization between the update and collision-simulation phases.
• Shortest path / search: the weight array is perfect for persistence. Iteration over vertices? Fuse!
• Master-driven algorithms: synchronization between a master block and slave blocks.
MULTI GRID GROUP

A set of threads guaranteed to be resident on the same system, on multiple devices.

__global__ void kernel() {
    multi_grid_group multi_grid = this_multi_grid();
    // load data
    // loop - compute, share data
    multi_grid.sync();   // devices are now synced, keep on computing
}

[Diagram: blocks 0 and 1 on GPU A and GPU B synchronize across both devices.]
MULTI GRID GROUP

Launch on multiple devices at once.

New CUDA launch API to opt in: cudaLaunchCooperativeKernelMultiDevice(…)

Devices need to support the cooperativeMultiDeviceLaunch property.

struct cudaLaunchParams params[numDevices];
for (int i = 0; i < numDevices; i++) {
    params[i].func      = (void *)kernel;
    params[i].gridDim   = dim3(…);   // Use the occupancy calculator
    params[i].blockDim  = dim3(…);
    params[i].sharedMem = …;
    params[i].stream    = …;         // Cannot use the NULL stream
    params[i].args      = …;
}
cudaLaunchCooperativeKernelMultiDevice(params, numDevices);
COALESCED GROUP

Discover the set of coalesced threads, i.e. a group of converged threads executing in SIMD:

    coalesced_group active = coalesced_threads();   // e.g. size: 8

Inside a divergent branch, only the converged subset is discovered:

    if (…) { // start block
        coalesced_group g1 = coalesced_threads();   // e.g. size: 3, internal lane mask {1, 3, 7}

        g1.thread_rank();   // ranks 0, 1, 2 — automatic translation to rank-in-group!
        g1.shfl(value, 0);  // automatic translation from rank-in-group to SIMD lane!

        coalesced_group g2 = tiled_partition(g1, 2);   // subgroups of size 2 and 1; ranks 0, 1, 0
    } // end block

    active.sync();
ATOMIC AGGREGATION

Opportunistic cooperation within a warp:

inline __device__ int atomicAggInc(int *p) {
    coalesced_group g = coalesced_threads();
    int prev;
    if (g.thread_rank() == 0) {
        prev = atomicAdd(p, g.size());         // one atomic for the whole group
    }
    prev = g.thread_rank() + g.shfl(prev, 0);  // broadcast the base offset from rank 0
    return prev;
}
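A typical use is stream compaction; this filter kernel is an illustrative sketch (the names and the counter parameter are assumptions, not from the slides):

__global__ void filter_positive(const int *in, int *out, int *counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) {
        int dst = atomicAggInc(counter);   // one atomicAdd per coalesced group
        out[dst] = in[i];                  // compact the positive elements
    }
}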
ARCHITECTURE

[Diagram: software stack.]

• CUDA Application
• Cooperative Launch APIs (<cuda_runtime.h>) and Cooperative Group APIs (<cooperative_groups.h>)
• CUDA Runtime and Device Runtime
• Cooperative Launch APIs (<cuda.h>) in the CUDA Driver
• CUDA *_sync builtins lower to PTX *.sync instructions
• GPU
WARP SYNCHRONOUS PROGRAMMING IN CUDA 9.0
CUDA WARP THREADING MODEL

NVIDIA GPU multiprocessors create, manage, schedule and execute threads in warps (32 parallel threads).

Threads in a warp may diverge and re-converge during execution.

[Diagram: a warp's threads diverge into two paths, then converge again over time.]

Full efficiency may be realized when all 32 threads of a warp are converged.
WARP SYNCHRONOUS PROGRAMMING

Warp synchronous programming is a CUDA programming technique that leverages warp execution for efficient inter-thread communication, e.g. reduction, scan, aggregated atomic operations, etc.

• CUDA C++ supports warp synchronous programming by providing warp synchronous built-in functions and cooperative group collectives.
EXAMPLE: SUM ACROSS A WARP

Compute val = Σ_{i=0}^{31} input[i] across the warp:

val = input[lane_id];
val += __shfl_xor_sync(0xffffffff, val, 1);
val += __shfl_xor_sync(0xffffffff, val, 2);
val += __shfl_xor_sync(0xffffffff, val, 4);
val += __shfl_xor_sync(0xffffffff, val, 8);
val += __shfl_xor_sync(0xffffffff, val, 16);
HOW TO WRITE WARP SYNCHRONOUS PROGRAMMING

Make sync explicit.

Thread re-convergence:
• Use built-in functions to converge threads explicitly.
• Do not rely on implicit thread re-convergence.

[Diagram: diverged threads converging over time.]

Data exchange between threads:
• Reading and writing the same memory location by different threads may cause data races.
• Use built-in functions to sync threads and exchange data in one step.
• When using shared memory, avoid data races between convergence points.
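A small sketch of both rules, assuming a single-warp block (names are illustrative): the shuffle builtin converges the named lanes and exchanges data in one step, while the shared-memory path needs explicit __syncwarp() on both sides of the communication.

__global__ void exchange_demo(int *out) {
    int lane = threadIdx.x % 32;
    int val = threadIdx.x;

    // One-step exchange: no separate barrier needed.
    int fromNext = __shfl_sync(0xffffffff, val, (lane + 1) % 32);

    // Staging through shared memory: sync explicitly on both sides
    // instead of relying on implicit re-convergence.
    __shared__ int buf[32];   // assumes blockDim.x == 32
    buf[lane] = val;
    __syncwarp();
    int other = buf[(lane + 1) % 32];
    __syncwarp();

    out[threadIdx.x] = fromNext + other;
}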