
Lecture 6.2 Performance Considerations – Memory Coalescing in CUDA



  1. GPU Teaching Kit – Accelerated Computing. Lecture 6.2 – Performance Considerations: Memory Coalescing in CUDA

  2. Objective
     – To learn that memory coalescing is important for effectively utilizing memory bandwidth in CUDA
     – Its origin in DRAM bursts
     – Checking whether a CUDA memory access is coalesced
     – Techniques for improving memory coalescing in CUDA code

  3. DRAM Burst – A System View
     [Figure: a 16-byte address space (locations 0–15) partitioned into four 4-byte burst sections]
     – Each address space is partitioned into burst sections
     – Whenever a location is accessed, all other locations in the same section are also delivered to the processor
     – Basic example: a 16-byte address space with 4-byte burst sections
     – In practice, address spaces are at least 4 GB, with burst sections of 128 bytes or more
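The mapping from an address to its burst section is plain integer arithmetic. A minimal host-side sketch; the 128-byte section size matches the slide's "in practice" figure, and the example address is arbitrary:

    #include <cstdio>

    int main() {
        const unsigned burstSize = 128;           // assumed section size in bytes
        unsigned addr    = 300;                   // arbitrary example byte address
        unsigned section = addr / burstSize;      // section index (here: 2)
        unsigned first   = section * burstSize;   // first byte delivered (256)
        unsigned last    = first + burstSize - 1; // last byte delivered (383)
        printf("address %u lies in burst section %u (bytes %u-%u)\n",
               addr, section, first, last);
        return 0;
    }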

  4. Memory Coalescing
     [Figure: threads T0–T3 of a warp issue coalesced loads that all fall within a single burst section]
     – When all threads of a warp execute a load instruction, if all accessed locations fall into the same burst section, only one DRAM request is made and the access is fully coalesced

  5. Un-coalesced Accesses
     [Figure: threads T0–T3 issue loads that spread across burst-section boundaries]
     – When the accessed locations spread across burst-section boundaries:
       – Coalescing fails
       – Multiple DRAM requests are made
       – The access is not fully coalesced
       – Some of the bytes accessed and transferred are not used by the threads

  6. How to Judge Whether an Access Is Coalesced
     – Accesses in a warp are to consecutive locations if the index in an array access has the form
       A[(expression with terms independent of threadIdx.x) + threadIdx.x];
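This rule can be checked directly in kernel code. A minimal sketch contrasting the two index forms; the kernel names and the stride n are illustrative, not from the lecture:

    // Consecutive threads touch consecutive addresses -> one burst
    // section per warp access, fully coalesced.
    __global__ void coalescedRead(const float *A, float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = A[i];  // (terms independent of threadIdx.x) + threadIdx.x
    }

    // threadIdx.x is multiplied by a stride: consecutive threads touch
    // addresses n elements apart, spanning many burst sections.
    __global__ void stridedRead(const float *A, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = A[threadIdx.x * n];  // not of the coalesced form
    }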

  7. A 2D C Array in Linear Memory Space
     [Figure: a 4×4 matrix M stored row by row, M0,0 M0,1 M0,2 M0,3 M1,0 … M3,3, in linearized order of increasing address]
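In this row-major layout, element (row, col) of a matrix with width columns lives at linear offset row*width + col. A one-line sketch (the macro name is just a local convention for illustration):

    #define IDX2L(row, col, width) ((row) * (width) + (col))
    // e.g. in a 4x4 matrix, element M[2][3] sits at offset 2*4 + 3 = 11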

  8. Two Access Patterns of Basic Matrix Multiplication
     [Figure: A (m × n) and B (n × k); Thread 1 and Thread 2 each traverse a row of A and a column of B]
     – The kernel reads A[Row*n+i] and B[i*k+Col], where i is the loop counter in the inner-product loop of the kernel code
     – Col = blockIdx.x*blockDim.x + threadIdx.x
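For reference, a minimal sketch of the basic kernel these slides analyze, with both access patterns marked; boundary checks are omitted for brevity (it assumes m and k are multiples of the block dimensions):

    __global__ void matrixMul(const float *A, const float *B, float *C,
                              int m, int n, int k) {
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        int Col = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) {
            sum += A[Row * n + i]    // the A pattern analyzed on slide 10
                 * B[i * k + Col];   // the B pattern analyzed on slide 9
        }
        C[Row * k + Col] = sum;
    }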

  9. B Accesses Are Coalesced
     [Figure: the access direction in the kernel code runs down a column of B, but in each load iteration the warp's threads T0–T3 read consecutive elements of one row (B0,0–B0,3 in iteration 0, B1,0–B1,3 in iteration 1), so each iteration's access falls within one burst section]

  10. A Accesses Are Not Coalesced
     [Figure: the access direction in the kernel code runs along a row of A, but in each load iteration the warp's threads T0–T3 read one element from each of four different rows (A0,0, A1,0, A2,0, A3,0 in iteration 0), which are n elements apart and span multiple burst sections]
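The difference is visible in plain address arithmetic. A host-side sketch for the 4×4 case in the figures (n = k = 4, iteration i = 0; the loop variable t stands in for threads T0–T3):

    #include <cstdio>

    int main() {
        const int n = 4, k = 4, i = 0;
        for (int t = 0; t < 4; ++t) {
            int bOffset = i * k + t;  // B pattern: 0, 1, 2, 3  (consecutive)
            int aOffset = t * n + i;  // A pattern: 0, 4, 8, 12 (stride n)
            printf("T%d: B offset %d, A offset %d\n", t, bOffset, aOffset);
        }
        return 0;
    }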

  11. Loading an Input Tile
     [Figure: A (m × n), B (n × k), and C (m × k), with the tile being loaded highlighted at position (Row, Col)]
     – Have each thread load an A element and a B element at the same relative position as its C element
     – int tx = threadIdx.x; int ty = threadIdx.y;
     – Accessing tile 0 with 2D indexing: A[Row][tx] and B[ty][Col]

  12. Corner Turning
     [Figure: d_M and d_N shown with the original access pattern vs. the tiled access pattern across WIDTH and HEIGHT]
     – Copy into shared memory
     – Perform multiplication with shared-memory values
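Putting slides 11–12 together, a minimal tiled-kernel sketch, assuming square TILE_WIDTH × TILE_WIDTH thread blocks and matrix dimensions that are multiples of TILE_WIDTH (the kernel name and tile size are illustrative). In both global loads, threadIdx.x supplies the fastest-varying part of the index, so the loads are coalesced; the strided traversal of A now happens in shared memory instead of DRAM:

    #define TILE_WIDTH 16

    __global__ void tiledMatrixMul(const float *A, const float *B, float *C,
                                   int m, int n, int k) {
        __shared__ float As[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

        int tx = threadIdx.x, ty = threadIdx.y;
        int Row = blockIdx.y * TILE_WIDTH + ty;
        int Col = blockIdx.x * TILE_WIDTH + tx;
        float sum = 0.0f;

        for (int t = 0; t < n / TILE_WIDTH; ++t) {
            // Both loads are coalesced: tx indexes consecutive addresses.
            As[ty][tx] = A[Row * n + t * TILE_WIDTH + tx];
            Bs[ty][tx] = B[(t * TILE_WIDTH + ty) * k + Col];
            __syncthreads();

            for (int i = 0; i < TILE_WIDTH; ++i)
                sum += As[ty][i] * Bs[i][tx];
            __syncthreads();
        }
        C[Row * k + Col] = sum;
    }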

  13. GPU Teaching Kit The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
