GPU-Centric Thinking: Use Case Acceleration of a DNA Sequencer Pipeline
Chuck Seberino, Principal Software Engineer
Chuck’s Three Guiding Principles
• Hardware Architecture - Understand the limitations (and strengths) of SIMD and compute-heavy ASICs.
• Memory - Shipping data back and forth between CPU and GPU doesn’t have to be a bad thing.
• Multi-Level Parallelism - Look at optimizing overlapping kernels, not just a single kernel.
Hardware
Hardware Architecture
• There are definite areas where you can fall off the fast track.
• GPU hardware is intentionally simplistic, so that silicon real estate can be geared towards compute!
ASIC Comparison
• Intel Haswell i7-5960X, 8 cores: ~2B transistors
• NVIDIA GM200 Maxwell, 3072 cores: ~8B transistors
Understand the Hardware

// Scale integer value and store as floating point.
__global__ void kernel(const int* input, float scale, float* output)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    output[index] = scale * input[index];
}

kernel<<<1000 /*gridsize*/, 1000 /*blocksize*/>>>(input, scale, output);

• A naive implementation would set grid and block size to launch exactly 1M threads (1000 x 1000).
Understand the Hardware

// Scale integer value and store as floating point.
__global__ void kernel(const int* input, int length, float scale, float* output)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= length) return;
    output[index] = scale * input[index];
}

kernel<<<977 /*gridsize*/, 1024 /*blocksize*/>>>(input, 1000000, scale, output);

• Pass in the input length.
• Perform a range check before touching memory.
• Adjust block size and grid size to align with the warp size: making block size a multiple of the warp size (32) makes the code more efficient.
• The 1000x1000 launch fills each of its 1000 blocks with 31 full warps plus one warp at 25% occupancy (8 of 32 threads), "wasting" 24,000 thread slots in total: (31 full warps + 1 warp at 25%) x 1000.
• The 977x1024 launch starts 1,000,448 total threads, with 448 returning early: 32 full warps x 976 blocks, plus 18 full warps and 14 early-return warps in the last block. 15625x64 would also have worked; the idiom below generalizes the calculation.
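The 977-block figure comes from rounding the element count up to a whole number of blocks. A common ceiling-division idiom (not spelled out on the slide; the launch() wrapper is illustrative):

// Forward declaration of the kernel defined on the slide above.
__global__ void kernel(const int* input, int length, float scale, float* output);

void launch(const int* input, int length, float scale, float* output)
{
    const int blocksize = 1024;                                  // multiple of the 32-thread warp
    const int gridsize  = (length + blocksize - 1) / blocksize;  // rounds up: 1,000,000 -> 977
    kernel<<<gridsize, blocksize>>>(input, length, scale, output);
}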
Details of Comparison
[Slide figure: side-by-side comparison details; not captured in this export]
Memory
Unified Memory Model
• Unified Memory - right now might be a good time to start taking advantage of it, especially if porting new code (see the sketch below).
  – Relieves the user from managing both host and device memory and handling the data transfers between them.
  – When NVLink comes on the scene, UM can make immediate use of it.
• CON: Doesn’t support explicit data movement.
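As a concrete illustration, a minimal Unified Memory sketch (names and sizes are illustrative, not from the talk): a single cudaMallocManaged pointer is used by both host and device, with no explicit cudaMemcpy.

#include <cuda_runtime.h>

__global__ void scale(float* data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

int main()
{
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));  // one pointer, valid on host and device
    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // host writes directly, no cudaMemcpy
    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);
    cudaDeviceSynchronize();                      // wait before the host touches results
    cudaFree(data);
    return 0;
}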
Memory Copy Ambiguity
• From the CUDA C Best Practices Guide, Chapter 9.1:
  – "Also, because of the overhead associated with each transfer, batching many small transfers into one larger transfer performs significantly better than making each transfer separately, even if doing so requires packing non-contiguous regions of memory into a contiguous buffer and then unpacking after the transfer."
  – "In contrast with cudaMemcpy(), the asynchronous transfer version requires pinned host memory ..."
• What is considered a small transfer?
• What happens if I try to use cudaMemcpyAsync() with pageable memory?
Small Transfers For a Fixed Price
[Charts: "Host to Device Transfer Speed" (MB/s) and "Host to Device Transfer Time" (us) vs. transfer size (KB), measured on a MBP (2013) GeForce GT 650M PCIe 2.0, a GeForce Titan X PCIe 3.0, and a Quadro M6000. A zoomed chart, "Host to Device Transfer Time for Small Sizes", shows that the cost in time for < 200 KB transfers is fixed.]
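A sketch of the batching advice quoted above (buffer names and sizes are hypothetical): pack many small, non-contiguous host regions into one pinned staging buffer and pay the fixed per-transfer cost once instead of once per piece.

#include <cuda_runtime.h>
#include <cstring>

int main()
{
    const int numPieces = 64, pieceBytes = 4096;        // 64 x 4 KB pieces
    char pieces[numPieces][pieceBytes];                 // non-contiguous source regions
    memset(pieces, 0, sizeof(pieces));

    char *staging = nullptr, *dev = nullptr;
    cudaMallocHost(&staging, numPieces * pieceBytes);   // pinned staging buffer
    cudaMalloc(&dev, numPieces * pieceBytes);

    // Host-side memcpy per piece is cheap; one PCIe transfer instead of 64
    // pays the fixed transfer overhead only once.
    for (int i = 0; i < numPieces; ++i)
        memcpy(staging + i * pieceBytes, pieces[i], pieceBytes);
    cudaMemcpy(dev, staging, numPieces * pieceBytes, cudaMemcpyHostToDevice);

    cudaFreeHost(staging);
    cudaFree(dev);
    return 0;
}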
cudaMemcpyAsync() with Pageable Memory
• Calling cudaMemcpyAsync() with pageable memory works, but ...
  – The copy operation gets serialized on the GPU along with kernel launches - no copy-engine overlap with kernels.
  – The host doesn’t block on the call, though (the runtime silently pins).
  – You can examine this in the Visual Profiler (the sketch below shows both cases).
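A minimal sketch contrasting the two cases (sizes and stream setup are illustrative): the pinned copy can overlap with other GPU work, while the pageable copy silently falls back to staged, serialized behavior.

#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    const size_t bytes = 1 << 20;
    float *pinned = nullptr, *dev = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&dev, bytes);

    // Pinned (page-locked) host memory: this copy can overlap with kernels.
    cudaMallocHost(&pinned, bytes);
    cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice, stream);

    // Pageable host memory: the call still succeeds, but the runtime stages
    // the copy internally and it will not overlap with other GPU work.
    float* pageable = (float*)malloc(bytes);
    cudaMemcpyAsync(dev, pageable, bytes, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    cudaStreamDestroy(stream);
    return 0;
}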
cudaMemcpyAsync Pinned
[Visual Profiler timeline screenshot]

... vs. cudaMemcpyAsync Paged
[Visual Profiler timeline screenshot - callout: "Not Pinned!"]
Multi-Level Parallelism
Multi-Level Parallelism
• When there is more than a single task to be completed, it may be advantageous to parallelize the work in multiple ways, as in the sketch below.
• Use several sets of CPU threads to dispatch identical sets of work to different areas of memory.
• Use CUDA streams to split the work into independent chunks.
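A sketch of that pattern, with hypothetical names: several CPU threads, each driving its own stream and its own slice of device memory.

#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void process(float* chunk, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] *= 2.0f;
}

int main()
{
    const int numWorkers = 4, chunkSize = 1 << 20;
    float* dev = nullptr;
    cudaMalloc(&dev, numWorkers * chunkSize * sizeof(float));

    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w) {
        workers.emplace_back([=] {
            // Each CPU thread owns one stream and one region of device memory.
            cudaStream_t s;
            cudaStreamCreate(&s);
            process<<<(chunkSize + 255) / 256, 256, 0, s>>>(dev + w * chunkSize, chunkSize);
            cudaStreamSynchronize(s);
            cudaStreamDestroy(s);
        });
    }
    for (auto& t : workers) t.join();
    cudaFree(dev);
    return 0;
}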
Keeping the GPU Busy
• Increase efficiency by running non-optimal kernels at the same time, as sketched below.
  – Stalls in one kernel allow other kernels to become active, keeping the GPU busy.
  – NOTE: Does not work for branching kernels!
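A sketch of the idea (kernel bodies are illustrative placeholders): a memory-bound kernel and a compute-bound kernel on separate streams, so that while warps of one stall on loads the scheduler can issue warps of the other.

#include <cuda_runtime.h>

__global__ void memoryBound(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                    // dominated by global-memory latency
}

__global__ void computeBound(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = data[i];
    for (int k = 0; k < 256; ++k)
        x = x * 1.000001f + 0.5f;                 // arithmetic-heavy loop
    data[i] = x;
}

int main()
{
    const int n = 1 << 20;
    float *in, *out, *data;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMalloc(&data, n * sizeof(float));
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    // Independent kernels on separate streams may execute concurrently,
    // hiding each other's stalls.
    memoryBound<<<(n + 255) / 256, 256, 0, s0>>>(in, out, n);
    computeBound<<<(n + 255) / 256, 256, 0, s1>>>(data, n);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out); cudaFree(data);
    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    return 0;
}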
Multiple Streams
• Streams are a great way to parallelize, but you need to be careful to synchronize them properly.
• When using more than one stream, never use the default stream.
  – Makes it easier to debug default-stream problems.
  – Lets you verify correct behavior in NVVP - Thrust calls, accidental non-stream commands, etc.
Example Stream Sync Code

for (int n = 0; n < numIterations; ++n)
{
    // Perform initial work as separate streams.
    for (int ii = 0; ii < numStreams; ++ii)
    {
        // Wait for previous loop's main stream.
        gpuPtr->streamWait(ii, EventStream);
        // "Compute"
        gpuPtr->sleep(10 + ii*50, ii);
        // Create event record for stream ii.
        gpuPtr->timerStop(ii);
        // Break synchronization on the last stream.
        if (syncStreams || ii != 3)
        {
            // Tell main stream to wait for stream ii stop record.
            gpuPtr->streamWait(EventStream, ii);
        }
    }

    // Sync streams 0-3 => 4: main stream "Compute" and checkpoint.
    gpuPtr->sleep(100, EventStream);
    // Synchronization point for other streams.
    gpuPtr->timerStop(EventStream);

    // Sync stream 4 => 0-3: perform additional work as individual streams.
    for (int ii = 0; ii < numStreams; ++ii)
    {
        // Wait for main stream to be complete.
        gpuPtr->streamWait(ii, EventStream);
        // "Compute"
        gpuPtr->sleep(30 + 10*ii, ii);
        // Create event record for stream ii.
        gpuPtr->timerStop(ii);
        // Tell main stream to wait for stream ii stop record.
        gpuPtr->streamWait(EventStream, ii);
    }

    // Again, consolidate and run on a single stream (compute & checkpoint).
    gpuPtr->sleep(100, EventStream);
    // Synchronization point for other streams.
    gpuPtr->timerStop(EventStream);
}
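The slides use a wrapper object (gpuPtr) rather than raw CUDA calls. A plausible reconstruction of streamWait()/timerStop() in terms of CUDA events - an assumption, not the author's actual code - might look like:

#include <cuda_runtime.h>
#include <vector>

struct StreamSet
{
    std::vector<cudaStream_t> streams;
    std::vector<cudaEvent_t>  events;

    explicit StreamSet(int n) : streams(n), events(n)
    {
        for (int i = 0; i < n; ++i) {
            cudaStreamCreate(&streams[i]);  // never the default stream
            cudaEventCreateWithFlags(&events[i], cudaEventDisableTiming);
        }
    }

    // Record a "stop" marker on stream i (the slides' timerStop).
    void timerStop(int i) { cudaEventRecord(events[i], streams[i]); }

    // Make stream `waiter` wait for the last marker recorded on stream `source`,
    // without blocking the host (the slides' streamWait).
    void streamWait(int waiter, int source)
    {
        cudaStreamWaitEvent(streams[waiter], events[source], 0);
    }
};

With this shape, streamWait(a, b) orders work on stream a after the last timerStop(b) entirely on the device, which is what lets the loop above fan out and re-converge without host synchronization.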
Stream Synchronization
[Visual Profiler timeline - callout: Stream 17 not in sync with Stream 16]
Porting Recommendations
Pick a Good Wave
• Select a well-defined region. This might mean reorganizing or restructuring to provide adequate separation.
  – This lets you plug in a replacement more easily or support multiple configurations.
  – Can also allow asynchronous CPU processing during GPU sections.
• GPU hardware and software are advancing at a phenomenal rate. Take advantage of it.
  – Lots of time spent optimizing one particular section might be better spent elsewhere. Make sure you have the basics covered first.