Institute of Computational Science CUDA 6.0 Unified Virtual Memory Juraj Kardoš (University of Lugano) July 9, 2014 Juraj Kardoš Efficient GPU data transfers July 9, 2014 1 / 40 Efficient CPU ↔ GPU data transfers
Motivation Impact of data transfers on overall application performance Juraj Kardoš Efficient GPU data transfers July 9, 2014 2 / 40
??? When transferring input/output arrays Where else? Loading kernel binary code (implicitly, by driver) Loading kernel arguments (transferred into GPU constant memory upon kernel launch, implicitly, by driver) Passing return scalar value, e.g. reduction result (remember __global__ functions are always void ) Initializing __device__ variables Juraj Kardoš Efficient GPU data transfers July 9, 2014 3 / 40 When GPU ↔ CPU memory transfers are performed?
??? When transferring input/output arrays Where else? Loading kernel binary code (implicitly, by driver) Loading kernel arguments (transferred into GPU constant memory upon kernel launch, implicitly, by driver) Passing return scalar value, e.g. reduction result (remember __global__ functions are always void ) Initializing __device__ variables Juraj Kardoš Efficient GPU data transfers July 9, 2014 3 / 40 When GPU ↔ CPU memory transfers are performed?
??? When transferring input/output arrays Where else? Loading kernel binary code (implicitly, by driver) Loading kernel arguments (transferred into GPU constant memory upon kernel launch, implicitly, by driver) Passing return scalar value, e.g. reduction result (remember __global__ functions are always void ) Initializing __device__ variables Juraj Kardoš Efficient GPU data transfers July 9, 2014 3 / 40 When GPU ↔ CPU memory transfers are performed?
??? When transferring input/output arrays Where else? Loading kernel binary code (implicitly, by driver) Loading kernel arguments (transferred into GPU constant memory upon kernel launch, implicitly, by driver) Passing return scalar value, e.g. reduction result (remember __global__ functions are always void ) Initializing __device__ variables Juraj Kardoš Efficient GPU data transfers July 9, 2014 3 / 40 When GPU ↔ CPU memory transfers are performed?
??? When transferring input/output arrays Where else? Loading kernel binary code (implicitly, by driver) Loading kernel arguments (transferred into GPU constant memory upon kernel launch, implicitly, by driver) Passing return scalar value, e.g. reduction result (remember __global__ functions are always void ) Initializing __device__ variables Juraj Kardoš Efficient GPU data transfers July 9, 2014 3 / 40 When GPU ↔ CPU memory transfers are performed?
??? When transferring input/output arrays Where else? Loading kernel binary code (implicitly, by driver) Loading kernel arguments (transferred into GPU constant memory upon kernel launch, implicitly, by driver) Passing return scalar value, e.g. reduction result (remember __global__ functions are always void ) Initializing __device__ variables Juraj Kardoš Efficient GPU data transfers July 9, 2014 3 / 40 When GPU ↔ CPU memory transfers are performed?
??? When transferring input/output arrays Where else? Loading kernel binary code (implicitly, by driver) Loading kernel arguments (transferred into GPU constant memory upon kernel launch, implicitly, by driver) Passing return scalar value, e.g. reduction result (remember __global__ functions are always void ) Initializing __device__ variables Juraj Kardoš Efficient GPU data transfers July 9, 2014 3 / 40 When GPU ↔ CPU memory transfers are performed?
PCIe Juraj Kardoš Efficient GPU data transfers July 9, 2014 4 / 40
PCI Express overview Computer expansion bus Point-to-point connection Lane sharing Single bus (x1) 500 MB/s per lane (PCI-e v2) Multiple lanes (x2, x4, x8, x16, x32) 8 GB/s for a 16 lane bus Juraj Kardoš Efficient GPU data transfers July 9, 2014 5 / 40
Generations of PCI-Express PCI Express July 9, 2014 Efficient GPU data transfers Juraj Kardoš 31 GB/s 1969 MB/s 15 GB/s 984 MB/s 8 GB/s 500 MB/s 4 GB/s 250 MB/s Bandwidth x16 Bandwidth Per Lane version 6 / 40 1 . 0 (2003) 2 . 0 (2007) 3 . 0 (2010) 4 . 0 (2014-15)
PCI-E Bandwidth Test Juraj Kardoš Efficient GPU data transfers July 9, 2014 7 / 40
Remember PCI-E Lanes? Juraj Kardoš Efficient GPU data transfers July 9, 2014 8 / 40
Types of data transfers in CUDA Pageable or pinned Explicit or implicit (automatic, UVM) Synchronous or asynchronous Peer to peer (between GPUs of the same host) GPUDirect (between GPU and network interface) Juraj Kardoš Efficient GPU data transfers July 9, 2014 10 / 40
Types of data transfers in CUDA Pageable or pinned Explicit or implicit (automatic, UVM) Synchronous or asynchronous Peer to peer (between GPUs of the same host) GPUDirect (between GPU and network interface) Juraj Kardoš Efficient GPU data transfers July 9, 2014 11 / 40
Pageable and pinned memory transfer Juraj Kardoš July 9, 2014 Efficient GPU data transfers 12 / 40 12 GB GDDR5 42 GB /sec 288 GB/sec CPU GPU 8 GB/sec ~670 GFLOPS ~4 TFLOPS PCI-Express (Ivy Bridge EX) (Tesla K40)
Pageable and pinned memory transfer Juraj Kardoš July 9, 2014 Efficient GPU data transfers 13 / 40 12 GB GDDR5 42 GB /sec 288 GB/sec CPU GPU 8 GB/sec ~670 GFLOPS ~4 TFLOPS PCI-Express (Ivy Bridge EX) (Tesla K40)
Pageable and pinned memory transfer Juraj Kardoš July 9, 2014 Efficient GPU data transfers 14 / 40 12 GB GDDR5 42 GB /sec 288 GB/sec CPU GPU 8 GB/sec ~670 GFLOPS ~4 TFLOPS PCI-Express (Ivy Bridge EX) (Tesla K40)
Pageable and pinned memory transfer Juraj Kardoš July 9, 2014 Efficient GPU data transfers 15 / 40 12 GB GDDR5 42 GB /sec 288 GB/sec CPU GPU 8 GB/sec ~670 GFLOPS ~4 TFLOPS PCI-Express (Ivy Bridge EX) (Tesla K40)
Pageable and pinned memory transfer Juraj Kardoš Efficient GPU data transfers July 9, 2014 16 / 40
Pageable and pinned memory transfer //allocate memory July 9, 2014 Efficient GPU data transfers Juraj Kardoš Listing 2: Pinned cudaMemcpyDeviceToHost); //memcopy wave13pt_d <<<...>>>( ..., w0_dev, ...); //kernel compute cudaMemcpyHostToDevice); //memcopy cudaMalloc(&w0_dev, szarrayb); //allocate memory cudaMallocHost(&w0, szarrayb); Listing 1: Pageable //kernel compute w0 = (real*)malloc( szarrayb); cudaMalloc(&w0_dev, szarrayb); //memcopy cudaMemcpyDeviceToHost); cudaMemcpyHostToDevice); wave13pt_d <<<...>>>( ..., w0_dev, ...); //memcopy 17 / 40 cudaMemcpy(w0_dev, w0, szarrayb, ← ֓ cudaMemcpy(w0_dev, w0, szarrayb, ← ֓ cudaMemcpy(w0, w0_dev, szarrayb, ← ֓ cudaMemcpy(w0, w0_dev, szarrayb, ← ֓
Pageable and pinned memory transfer - Summary Pageable memory - user memory space, requires extra mem-copy Pinned memory - kernel memory space Pinned memory performs better (higher bandwidth) Do not over-allocate pinned memory - reduces amount of physical memory available for OS Juraj Kardoš Efficient GPU data transfers July 9, 2014 18 / 40
Types of data transfers in CUDA Pageable or pinned Explicit or implicit (UVM) Synchronous or asynchronous Peer to peer (between GPUs of the same host) GPUDirect (between GPU and network interface) Juraj Kardoš Efficient GPU data transfers July 9, 2014 19 / 40
Unified Memory Developer view on memory model Still two distinct physical memories on HW level Juraj Kardoš Efficient GPU data transfers July 9, 2014 20 / 40 Unified Memory 12 GB GDDR5 CPU GPU ~670 GFLOPS ~4 TFLOPS (Ivy Bridge EX) (Tesla K40)
Unified Memory - Usage f(wO); July 9, 2014 Efficient GPU data transfers Juraj Kardoš Listing 4: UVM f(w0); //host function wave13pt_d <<<...>>>( ..., w0, ...); //kernel compute cudaMallocManaged(&w0, szarrayb); //allocate memory Listing 3: Explicit memory //host function //allocate memory cudaMemcpyDeviceToHost); //memcopy wave13pt_d <<<...>>>( ..., w0_dev, ...); //kernel compute cudaMemcpyHostToDevice); //memcopy cudaMalloc(&w0_dev, szarrayb); w0 = (real*)malloc( szarrayb); 21 / 40 cudaMemcpy(w0_dev, w0, szarrayb, ← ֓ cudaMemcpy(w0, w0_dev, szarrayb, ← ֓
Unified Memory - Use Case Juraj Kardoš July 9, 2014 Efficient GPU data transfers 22 / 40 32 GB 12 GB DDR3 GDDR5 42 GB/sec 288 GB/sec CPU GPU 8 GB/sec ~670 GFLOPS ~4 TFLOPS PCI-Express (Ivy Bridge EX) (Tesla K40)
Unified Memory - Use Case Juraj Kardoš July 9, 2014 Efficient GPU data transfers 23 / 40 32 GB 12 GB DDR3 GDDR5 CPU GPU 8 GB/sec ~670 GFLOPS ~4 TFLOPS PCI-Express (Ivy Bridge EX) (Tesla K40)
Unified Memory - Use Case Juraj Kardoš July 9, 2014 Efficient GPU data transfers 24 / 40 32 GB 12 GB DDR3 GDDR5 CPU GPU 8 GB/sec ~670 GFLOPS ~4 TFLOPS PCI-Express (Ivy Bridge EX) (Tesla K40)
Unified Memory - Use Case How does UVM perform when compared to explicit memory movements? July 9, 2014 Efficient GPU data transfers Juraj Kardoš 25 / 40 32 GB 12 GB DDR3 GDDR5 CPU GPU 8 GB/sec ~670 GFLOPS ~4 TFLOPS PCI-Express (Ivy Bridge EX) (Tesla K40)
Implicit memory transfers: UVM Juraj Kardoš Efficient GPU data transfers July 9, 2014 26 / 40
Implicit memory transfers: UVM How does UVM perform in case of multi-threading? Juraj Kardoš Efficient GPU data transfers July 9, 2014 27 / 40
Recommend
More recommend