Plan Optimizing Matrix Transpose with CUDA 1 CS4402-9535: - PowerPoint PPT Presentation

Plan Optimizing Matrix Transpose with CUDA 1 CS4402-9535: High-Performance Computing with CUDA Performance Optimization 2 Marc Moreno Maza Parallel Reduction 3 University of Western Ontario, London, Ontario (Canada) Parallel Scan 4 UWO-CS4402-CS9535 Exercises 5 (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 1 / 113 (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 2 / 113 Optimizing Matrix Transpose with CUDA Optimizing Matrix Transpose with CUDA Plan Matrix Transpose Characteristics (1/2) We optimize a transposition code for a matrix of floats. This operates out-of-place: input and output matrices address separate memory locations. Optimizing Matrix Transpose with CUDA 1 For simplicity, we consideran n × n matrix where 32 divides n . We focus on the device code: the host code performs typical tasks: data allocation and transfer Performance Optimization 2 between host and device, the launching and timing of several kernels, result validation, and the deallocation of host and device memory. Parallel Reduction 3 Benchmarks illustrate this section: we compare our matrix transpose kernels against a matrix copy kernel, Parallel Scan 4 for each kernel, we compute the effective bandwidth , calculated in GB/s as twice the size of the matrix (once for reading the matrix and once for writing) divided by the time of execution, Exercises 5 Each operation is run NUM REFS times (for normalizing the measurements ), This looping is performed once over the kernel and once within the kernel , The difference between these two timings is kernel launch and (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 3 / 113 (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 4 / 113

Optimizing Matrix Transpose with CUDA Optimizing Matrix Transpose with CUDA Matrix Transpose Characteristics (2/2) A simple copy kernel (1/2) We present hereafter different kernels called from the host code, each __global__ void copy(float *odata, float* idata, int width, addressing different performance issues. int height, int nreps) All kernels in this study launch thread blocks of dimension 32x8, { where each block transposes (or copies) a tile of dimension 32x32. int xIndex = blockIdx.x*TILE_DIM + threadIdx.x; int yIndex = blockIdx.y*TILE_DIM + threadIdx.y; As such, the parameters TILE DIM and BLOCK ROWS are set to 32 and int index = xIndex + width*yIndex; 8, respectively. Using a thread block with fewer threads than elements in a tile is for (int r=0; r < nreps; r++) { // normalization outer loop advantageous for the matrix transpose: for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) { each thread transposes several matrix elements, four in our case, and odata[index+i*width] = idata[index+i*width]; much of the cost of calculating the indices is amortized over these } elements. } This study is based on a technical report by Greg Ruetsch (NVIDIA) } and Paulius Micikevicius (NVIDIA). (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 5 / 113 (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 6 / 113 Optimizing Matrix Transpose with CUDA Optimizing Matrix Transpose with CUDA A simple copy kernel (2/2) A naive transpose kernel odata and idata are pointers to the input and output matrices, _global__ void transposeNaive(float *odata, float* idata, width and height are the matrix x and y dimensions, int width, int height, int nreps) nreps determines how many times the loop over data movement { between matrices is performed. int xIndex = blockIdx.x*TILE_DIM + threadIdx.x; In this kernel, xIndex and yIndex are global 2D matrix indices, used to calculate index , the 1D index used to access matrix elements. int yIndex = blockIdx.y*TILE_DIM + threadIdx.y; int index_in = xIndex + width * yIndex; __global__ void copy(float *odata, float* idata, int width, int index_out = yIndex + height * xIndex; int height, int nreps) for (int r=0; r < nreps; r++) { { for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) { int xIndex = blockIdx.x*TILE_DIM + threadIdx.x; int yIndex = blockIdx.y*TILE_DIM + threadIdx.y; odata[index_out+i] = idata[index_in+i*width]; int index = xIndex + width*yIndex; } } for (int r=0; r < nreps; r++) { } for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) { odata[index+i*width] = idata[index+i*width]; } } } (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 7 / 113 (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 8 / 113

Optimizing Matrix Transpose with CUDA Optimizing Matrix Transpose with CUDA Naive transpose kernel vs copy kernel Coalesced Transpose (1/11) Because device memory has a much higher latency and lower The performance of these two kernels on a 2048x2048 matrix using a bandwidth than on-chip memory, special attention must be paid to: GTX280 is given in the following table: how global memory accesses are performed? The simultaneous global memory accesses by each thread of a half-warp (16 threads on G80) during the execution of a single read or write instruction will be coalesced into a single access if: The size of the memory element accessed by each thread is either 4, 8, 1 or 16 bytes. The address of the first element is aligned to 16 times the element’s 2 size. The elements form a contiguous block of memory. 3 The i -th element is accessed by the i -th thread in the half-warp. 4 The minor differences in code between the copy and nave transpose Last two requirements are relaxed with compute capabilities of 1.2. kernels have a profound effect on performance. Coalescing happens even if some threads do not access memory ( divergent warp ) (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 9 / 113 (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 10 / 113 Optimizing Matrix Transpose with CUDA Optimizing Matrix Transpose with CUDA Coalesced Transpose (2/11) Coalesced Transpose (3/11) (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 11 / 113 (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 12 / 113

Optimizing Matrix Transpose with CUDA Optimizing Matrix Transpose with CUDA Coalesced Transpose (4/11) Coalesced Transpose (5/11) Allocating device memory through cudaMalloc() and choosing TILE DIM to be a multiple of 16 ensures alignment with a segment of memory, therefore all loads from idata are coalesced. Coalescing behavior differs between the simple copy and naive transpose kernels when writing to odata . In the case of the naive transpose, for each iteration of the i -loop a half warp writes one half of a column of floats to different segments of memory: resulting in 16 separate memory transactions, regardless of the compute capability. (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 13 / 113 (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 14 / 113 Optimizing Matrix Transpose with CUDA Optimizing Matrix Transpose with CUDA Coalesced Transpose (6/11) Coalesced Transpose (7/11) __global__ void transposeCoalesced(float *odata, float *idata, int width, int height) // no nreps param { The way to avoid uncoalesced global memory access is __shared__ float tile[TILE_DIM][TILE_DIM]; to read the data into shared memory and, 1 int xIndex = blockIdx.x*TILE_DIM + threadIdx.x; have each half warp access noncontiguous locations in shared memory 2 int yIndex = blockIdx.y*TILE_DIM + threadIdx.y; in order to write contiguous data to odata. int index_in = xIndex + (yIndex)*width; xIndex = blockIdx.y * TILE_DIM + threadIdx.x; There is no performance penalty for noncontiguous access patterns in yIndex = blockIdx.x * TILE_DIM + threadIdx.y; shared memory as there is in global memory. int index_out = xIndex + (yIndex)*height; for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) { a synchthreads() call is required to ensure that all reads from tile[threadIdx.y+i][threadIdx.x] = idata to shared memory have completed before writes from shared idata[index_in+i*width]; memory to odata commence. } __syncthreads(); for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) { odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i]; } } (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 15 / 113 (Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 16 / 113

Plan Optimizing Matrix Transpose with CUDA 1 CS4402-9535: - PowerPoint PPT Presentation

Plan Optimizing Matrix Transpose with CUDA 1 CS4402-9535: High-Performance Computing with CUDA Performance Optimization 2 Marc Moreno Maza Parallel Reduction 3 University of Western Ontario, London, Ontario (Canada) Parallel Scan 4

PHASE IA PLAN ULTIMATE PLAN 13 PHASE IB PLAN ULTIMATE PLAN 14 ULTIMATE PLAN ULTIMATE PLAN

NEW COURTHOUSE 1 ST FLOOR PLAN ANNEX 3rd FLOOR PLAN 2nd FLOOR PLAN BASEMENT FLOOR PLAN ANNEX

Medical Plan Comparison Central Care Plan Medical / Prescription Benefit Summary Advantage HDHP/HSA

Master Plan Open House #3 Preferred Alternative Master Plan Master Plan Process What is a

Site Plan May 2009 Site Plan February 2010 Site Plan May 5, 2010 Site Plan

To Plan or Not To Plan Failure to plan is a plan to fail Questions to ask yourself Who is this

Retirement Plan Changes Update Overview Retirement Plan Options at W&M No changes to

Impr Improvement ement Plan Orienta Plan Orientation tion Building Your Plan for Academic

San Pedro Community Plan San Pedro Community Plan Presentation Overview Community Plan

Local Development Plan Local Development Plan Local Development Plan Local Development Plan

Project Area Vilas Park Vilas Park Master Plan Vilas Park Master Plan Plan Maestro De Vilas Park

( ) + ( ) + ( ) + What is an Area Plan ( ) + What is an Area Plan A Comprehensive

803 SOUTH AVENUE AERIAL MAP SITE PLAN CRAWL SPACE & FOUNDATION PLAN FIRST FLOOR PLAN

Site plan Refused application Proposed Block plan Proposed site layout plan N Ground floor

DISTRICT PLANS 2 Financial Plan Instructional Plan Strategic Plan STRATEGIC PLAN GOALS 3

MESA/POWDERHORN PLAN UPDATE Meeting 3 October 23, 2012 Agenda 2 Plan Purpose Plan Area

Variational mapping particle filter combining particle filters with local optimal transport Manuel

Uniqueness and purity in multi-agent matching problems Brendan Pass (joint with Y.-H. Kim (UBC))

110401367 Transportation Engineering and Planning Dr. Randa Oqab Mujalli Introduction of

Network Flow VI - Min-Cost Flow Applications Lecture 17 October 24, 2013 Sariel (UIUC) CS573

Quiz Define the vectors a 1 and a 2 as follows: a 1 = # $ % # $ % 1 and a 2 = 2 1 1

Announcements Class is 170. Matlab Grader homework, 1 and 2 (of less than 9) homeworks Due 22

McBits Revisited ia.cr/2017/793 Tung Chou Osaka University, Japan Code-based cryptography

Ken Shiozaki RIKEN Corroborators: Hassan Shapourian University of Chicago Shinsei Ryu

Plan Optimizing Matrix Transpose with CUDA 1 CS4402-9535: - PowerPoint PPT Presentation

Plan Optimizing Matrix Transpose with CUDA 1 CS4402-9535: High-Performance Computing with CUDA Performance Optimization 2 Marc Moreno Maza Parallel Reduction 3 University of Western Ontario, London, Ontario (Canada) Parallel Scan 4

PHASE IA PLAN ULTIMATE PLAN 13 PHASE IB PLAN ULTIMATE PLAN 14 ULTIMATE PLAN ULTIMATE PLAN

NEW COURTHOUSE 1 ST FLOOR PLAN ANNEX 3rd FLOOR PLAN 2nd FLOOR PLAN BASEMENT FLOOR PLAN ANNEX

Medical Plan Comparison Central Care Plan Medical / Prescription Benefit Summary Advantage HDHP/HSA

Master Plan Open House #3 Preferred Alternative Master Plan Master Plan Process What is a

Site Plan May 2009 Site Plan February 2010 Site Plan May 5, 2010 Site Plan

To Plan or Not To Plan Failure to plan is a plan to fail Questions to ask yourself Who is this

Retirement Plan Changes Update Overview Retirement Plan Options at W&amp;M No changes to

Impr Improvement ement Plan Orienta Plan Orientation tion Building Your Plan for Academic

San Pedro Community Plan San Pedro Community Plan Presentation Overview Community Plan

Local Development Plan Local Development Plan Local Development Plan Local Development Plan

Project Area Vilas Park Vilas Park Master Plan Vilas Park Master Plan Plan Maestro De Vilas Park

( ) + ( ) + ( ) + What is an Area Plan ( ) + What is an Area Plan A Comprehensive

803 SOUTH AVENUE AERIAL MAP SITE PLAN CRAWL SPACE &amp; FOUNDATION PLAN FIRST FLOOR PLAN

Site plan Refused application Proposed Block plan Proposed site layout plan N Ground floor

DISTRICT PLANS 2 Financial Plan Instructional Plan Strategic Plan STRATEGIC PLAN GOALS 3

MESA/POWDERHORN PLAN UPDATE Meeting 3 October 23, 2012 Agenda 2 Plan Purpose Plan Area

Variational mapping particle filter combining particle filters with local optimal transport Manuel

Uniqueness and purity in multi-agent matching problems Brendan Pass (joint with Y.-H. Kim (UBC))

110401367 Transportation Engineering and Planning Dr. Randa Oqab Mujalli Introduction of

Network Flow VI - Min-Cost Flow Applications Lecture 17 October 24, 2013 Sariel (UIUC) CS573

Quiz Define the vectors a 1 and a 2 as follows: a 1 = # $ % # $ % 1 and a 2 = 2 1 1

Announcements Class is 170. Matlab Grader homework, 1 and 2 (of less than 9) homeworks Due 22

McBits Revisited ia.cr/2017/793 Tung Chou Osaka University, Japan Code-based cryptography

Ken Shiozaki RIKEN Corroborators: Hassan Shapourian University of Chicago Shinsei Ryu

Retirement Plan Changes Update Overview Retirement Plan Options at W&M No changes to

803 SOUTH AVENUE AERIAL MAP SITE PLAN CRAWL SPACE & FOUNDATION PLAN FIRST FLOOR PLAN