EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY
Nikolay Sakharnykh, 3/27/2018
SINGLE POINTER
CPU vs GPU

CPU code:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    cpu_func2(data, N);
    cpu_func3(data, N);
    free(data);

GPU code w/ Unified Memory:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);
SINGLE POINTER
Explicit vs Unified Memory

Explicit Memory Management:

    void *data, *d_data;
    data = malloc(N);
    cudaMalloc(&d_data, N);
    cpu_func1(data, N);
    cudaMemcpy(d_data, data, N, ...);
    gpu_func2<<<...>>>(d_data, N);
    cudaMemcpy(data, d_data, N, ...);
    cudaFree(d_data);
    cpu_func3(data, N);
    free(data);

GPU code w/ Unified Memory:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);
SINGLE POINTER
Full Control with Prefetching

Explicit Memory Management:

    void *data, *d_data;
    data = malloc(N);
    cudaMalloc(&d_data, N);
    cpu_func1(data, N);
    cudaMemcpy(d_data, data, N, ...);
    gpu_func2<<<...>>>(d_data, N);
    cudaMemcpy(data, d_data, N, ...);
    cudaFree(d_data);
    cpu_func3(data, N);
    free(data);

Unified Memory + Prefetching:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    cudaMemPrefetchAsync(data, N, GPU);
    gpu_func2<<<...>>>(data, N);
    cudaMemPrefetchAsync(data, N, CPU);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);
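A minimal, compilable version of the prefetching pattern above. It is a sketch, not the slide's exact code: it uses cudaMallocManaged rather than plain malloc (malloc'd pointers are only GPU-accessible on platforms with address translation services, such as POWER9 with NVLink), and gpu_func2, the element type, and the sizes are illustrative placeholders.

    // Sketch of Unified Memory with explicit prefetching.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void gpu_func2(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;   // stand-in for real GPU work
    }

    int main() {
        const int n = 1 << 20;
        float *data;
        cudaMallocManaged(&data, n * sizeof(float));   // one pointer, visible to CPU and GPU

        for (int i = 0; i < n; i++) data[i] = 1.0f;    // cpu_func1: produce data on the CPU

        int device;
        cudaGetDevice(&device);
        cudaMemPrefetchAsync(data, n * sizeof(float), device);          // move pages to the GPU ahead of the kernel
        gpu_func2<<<(n + 255) / 256, 256>>>(data, n);
        cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId); // move them back for the CPU consumer
        cudaDeviceSynchronize();

        printf("data[0] = %f\n", data[0]);             // cpu_func3: consume on the CPU
        cudaFree(data);
        return 0;
    }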
SINGLE POINTER
Deep Copy

Explicit Memory Management:

    char **data;
    // allocate and initialize data on the CPU
    char **d_data;
    char **h_data = (char**)malloc(N*sizeof(char*));
    for (int i = 0; i < N; i++) {
        cudaMalloc(&h_data[i], N);
        cudaMemcpy(h_data[i], data[i], N, ...);
    }
    cudaMalloc(&d_data, N*sizeof(char*));
    cudaMemcpy(d_data, h_data, N*sizeof(char*), ...);
    gpu_func<<<...>>>(d_data, N);

GPU code w/ Unified Memory:

    char **data;
    // allocate and initialize data on the CPU
    gpu_func<<<...>>>(data, N);
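The Unified Memory column works because the GPU can dereference the nested pointers directly. A hedged sketch of that deep-copy case, assuming both the pointer array and the per-entry buffers are managed allocations; gpu_func and the sizes are made up for illustration.

    // Sketch: no h_data/d_data mirroring when everything is managed.
    #include <cuda_runtime.h>

    __global__ void gpu_func(char **data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i][0] = 'x';          // pointer chase works: both levels are managed
    }

    int main() {
        const int N = 64;
        char **data;
        cudaMallocManaged(&data, N * sizeof(char*));   // managed array of pointers
        for (int i = 0; i < N; i++) {
            cudaMallocManaged(&data[i], N);            // managed buffer per entry
            data[i][0] = 'a';                          // initialize on the CPU
        }

        gpu_func<<<1, N>>>(data, N);
        cudaDeviceSynchronize();

        for (int i = 0; i < N; i++) cudaFree(data[i]);
        cudaFree(data);
        return 0;
    }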
UNIFIED MEMORY BASICS
[Diagram: GPU A and GPU B share a single virtual address space containing page1 and page2.]
Single virtual memory shared between processors
UNIFIED MEMORY BASICS
[Diagram: each processor has its own page table and physical memory; page1 is resident in A's memory and page2 in B's, both within the shared virtual address space.]
Single virtual memory shared between processors
UNIFIED MEMORY BASICS
[Diagram: GPU A writes *addr1 = 1 to page1, which is resident in A's memory, so the access is local. GPU B writes *addr3 = 1 to page3, which is not mapped in B's page table, so the access triggers a page fault.]
UNIFIED MEMORY BASICS
[Diagram: page3 is populated and mapped into B's memory; B's write *addr3 = 1 is replayed and now succeeds.]
UNIFIED MEMORY BASICS
[Diagram: GPU A writes *addr2 = 1 and *addr3 = 1; both pages are resident in B's memory, so both accesses page-fault on A.]
UNIFIED MEMORY BASICS
[Diagram: to service A's faults, page2 and page3 are unmapped from B's memory.]
UNIFIED MEMORY BASICS
[Diagram: the data of page2 and page3 is migrated to A's physical memory and the pages are mapped in A's page table.]
UNIFIED MEMORY BASICS
[Diagram: A's writes *addr2 = 1 and *addr3 = 1 are replayed and complete as local accesses.]
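A small runnable sketch of the fault-driven flow in these diagrams: pages of a managed allocation are populated where they are first touched and migrate when the other processor faults on them. The kernel and array are illustrative, and on pre-Pascal GPUs the behavior differs (data is migrated in bulk at kernel launch rather than on demand).

    // Sketch: GPU first touch populates pages in GPU memory, CPU access faults them back.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void touch(int *p, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] = i;                 // first GPU touch: fault, populate/migrate, replay
    }

    int main() {
        const int n = 1 << 20;
        int *p;
        cudaMallocManaged(&p, n * sizeof(int));

        touch<<<(n + 255) / 256, 256>>>(p, n);   // pages land in GPU memory
        cudaDeviceSynchronize();

        long long sum = 0;
        for (int i = 0; i < n; i++) sum += p[i]; // CPU touches fault the pages back to system memory
        printf("sum = %lld\n", sum);

        cudaFree(p);
        return 0;
    }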
MEMORY OVERSUBSCRIPTION
[Diagram: A's physical memory is full; GPU A writes *addr5 = 1, but page5 is resident in B's memory, so the access page-faults.]
MEMORY OVERSUBSCRIPTION
[Diagram: because A's memory is full, page4 is unmapped from A's memory and its data migrated out to make room.]
MEMORY OVERSUBSCRIPTION
[Diagram: page4 is mapped in B's memory; page5 is unmapped from B and its data migrated to A.]
MEMORY OVERSUBSCRIPTION
[Diagram: A's write *addr5 = 1 is replayed and completes as a local access.]
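A hedged sketch of oversubscription: the managed allocation below is sized larger than the GPU's physical memory, and the kernel still runs because the driver evicts resident pages to system memory as new ones fault in. The 1.5x sizing and the kernel are assumptions for illustration, and the host needs enough system memory to back the allocation.

    // Sketch: a managed allocation larger than GPU memory, touched by a kernel.
    #include <cuda_runtime.h>

    __global__ void fill(char *p, size_t n) {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] = 1;    // touching new pages may evict older ones to system memory
    }

    int main() {
        size_t free_mem, total_mem;
        cudaMemGetInfo(&free_mem, &total_mem);
        size_t bytes = total_mem + total_mem / 2;     // 1.5x the GPU's physical memory

        char *p;
        if (cudaMallocManaged(&p, bytes) != cudaSuccess) return 1;

        size_t threads = 256;
        size_t blocks = (bytes + threads - 1) / threads;
        fill<<<(unsigned)blocks, (unsigned)threads>>>(p, bytes);
        cudaDeviceSynchronize();

        cudaFree(p);
        return 0;
    }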
SIMPLIFYING DL FRAMEWORK DESIGN

Eliminated 3,000 lines of repetitive and error-prone code in Caffe
Developers can add new inherited Layer classes in a much simpler manner
The final call to a CPU function or a GPU kernel (caffe_gpu_gemm) still needs to be explicit

Existing Design:

    class ConvolutionLayer {
    public:
        void cpu_data()            void gpu_data()
        void cpu_diff()            void gpu_diff()
        void mutable_cpu_data()    void mutable_gpu_data()
        void mutable_cpu_diff()    void mutable_gpu_diff()
        void Forward_cpu()         void Forward_gpu()
        void forward_cpu_gemm()    void forward_gpu_gemm()
        void forward_cpu_bias()    void forward_gpu_bias()
        void Backward_cpu()        void Backward_gpu()
        void backward_cpu_gemm()   void backward_gpu_gemm()
        void backward_cpu_bias()   void backward_gpu_bias()
    }

Unified Memory Design:

    class ConvolutionLayer {
    public:
        void data()
        void diff()
        void mutable_data()
        void mutable_diff()
        void Forward()
        void forward_gemm()
        void forward_bias()
        void Backward()
        void backward_gemm()
        void backward_bias()
    }

A. A. Awan, C.-H. Chu, H. Subramoni, X. Lu, D. K. Panda, "OC-DNN: Designing Out-of-Core Deep Neural Network Training by Exploiting Unified Memory on Pascal and Volta GPUs" (double-blind submission under review)
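Not the OC-Caffe implementation, just a sketch of the idea on this slide: back a blob with cudaMallocManaged so a single data()/mutable_data() accessor replaces the cpu_*/gpu_* pairs, and one code path can hand the same pointer to either CPU code or a GPU kernel. The class name and members are assumptions.

    // Sketch of a blob whose single pointer serves both CPU and GPU paths.
    #include <cuda_runtime.h>

    class Blob {
    public:
        explicit Blob(size_t count) : count_(count) {
            cudaMallocManaged(&data_, count_ * sizeof(float));  // one allocation for both processors
        }
        ~Blob() { cudaFree(data_); }
        float *mutable_data() { return data_; }        // valid on the CPU and in GPU kernels alike
        const float *data() const { return data_; }
        size_t count() const { return count_; }
    private:
        float *data_;
        size_t count_;
    };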
CAN THIS DESIGN OFFER GOOD PERF?
DL training with Unified Memory
[Charts: training throughput (images/sec) of Caffe (BVLC) vs OC-Caffe on one V100 (16 GB). Left: VGG19 at batch size 110 (in-memory) and 120 (out-of-core). Right: ResNet-50 at batch size 40 (in-memory) and 45 (out-of-core).]
OC-Caffe will be released by the HiDL Team @ OSU: hidl.cse.ohio-state.edu, mvapich.cse.ohio-state.edu
CONCURRENT ACCESS
[Diagram: GPU A and GPU B, each with its own page table and physical memory; page1, page2, and page3 in the shared virtual address space.]
CONCURRENT ACCESS
Exclusive Access*
[Diagram: both processors execute atomicAdd_system(addr2, 1). page2 is resident in B's memory, so B's atomic completes as a local access while A's access page-faults.]
*this is a possible implementation; to guarantee this behavior you need to use cudaMemAdvise policies
CONCURRENT ACCESS
Exclusive Access
[Diagram: to service A's fault, page2 is unmapped from B's memory and migrated to A.]
CONCURRENT ACCESS
Exclusive Access
[Diagram: A's atomicAdd_system(addr2, 1) is replayed and now completes as a local access.]
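The footnote above points at cudaMemAdvise; here is a hedged sketch of the relevant hints. They influence where managed pages live and whether remote mappings are kept; the migrate-on-atomic behavior in the diagrams is one possible implementation, not something the API guarantees. The counter variable and sizes are placeholders.

    // Sketch: pin a managed page to the GPU and ask for a CPU mapping where supported.
    #include <cuda_runtime.h>

    int main() {
        int device;
        cudaGetDevice(&device);

        int *counter;
        cudaMallocManaged(&counter, sizeof(int));
        *counter = 0;

        // Prefer keeping the page resident on the GPU, and keep a CPU mapping so CPU
        // accesses do not force the page to bounce back and forth (on platforms that
        // support direct CPU access to GPU memory).
        cudaMemAdvise(counter, sizeof(int), cudaMemAdviseSetPreferredLocation, device);
        cudaMemAdvise(counter, sizeof(int), cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

        // ... launch kernels and CPU threads that update *counter ...

        cudaFree(counter);
        return 0;
    }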
CONCURRENT ACCESS
Atomics over NVLINK*
[Diagram: both processors execute atomicAdd_system(addr2, 1) on page2, resident in B's memory: B's access is local, A's is a remote access over NVLink, and the page does not migrate.]
*both processors need to support atomic operations
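A minimal sketch of system-scope atomics on a managed counter: the GPU side uses atomicAdd_system while a CPU thread updates the same address with a compiler builtin. This assumes a platform with concurrent managed access and system-wide atomics (compute capability 6.0 or later, built with an sm_60+ architecture flag); over NVLink the updates can stay remote as shown above.

    // Sketch: CPU and GPU increment the same managed counter concurrently.
    #include <cuda_runtime.h>
    #include <thread>
    #include <cstdio>

    __global__ void gpu_increment(int *counter, int iters) {
        for (int i = 0; i < iters; i++)
            atomicAdd_system(counter, 1);   // system-scope atomic, visible to the CPU
    }

    int main() {
        int *counter;
        cudaMallocManaged(&counter, sizeof(int));
        *counter = 0;

        std::thread cpu([&] {
            for (int i = 0; i < 1000; i++)
                __atomic_fetch_add(counter, 1, __ATOMIC_RELAXED);  // CPU side of the same counter
        });

        gpu_increment<<<1, 1>>>(counter, 1000);
        cudaDeviceSynchronize();
        cpu.join();

        printf("counter = %d (expected 2000)\n", *counter);
        cudaFree(counter);
        return 0;
    }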
CONCURRENT ACCESS
Read duplication*
[Diagram: both processors read val = *addr2; page2 is duplicated so each processor has its own local copy and both reads are local accesses.]
*each processor must maintain its own page table
CONCURRENT ACCESS
Read duplication: write
[Diagram: one processor writes *addr2 = val2 as a local access.]
a write will collapse all copies into one
CONCURRENT ACCESS
Read duplication: read after write
[Diagram: the other processor reads val = *addr2; the page is duplicated again and the read becomes a local access.]
pages are duplicated again on faults
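A hedged sketch of the read-duplication hint: cudaMemAdviseSetReadMostly asks the driver to keep a read-only copy of the pages on each processor that reads them, and a write from anywhere collapses the copies back to one, as in the diagrams. The lookup-table name and size are placeholders.

    // Sketch: mark a read-mostly lookup table so each processor keeps a local copy.
    #include <cuda_runtime.h>

    int main() {
        int device;
        cudaGetDevice(&device);

        const size_t bytes = 1 << 20;
        float *table;
        cudaMallocManaged(&table, bytes);
        // ... fill the lookup table on the CPU ...

        cudaMemAdvise(table, bytes, cudaMemAdviseSetReadMostly, 0);      // duplicate pages on read
        cudaMemPrefetchAsync(table, bytes, device);                      // create the GPU copy eagerly

        // ... kernels and CPU code now read `table` from local copies;
        //     a write would invalidate the duplicates and fall back to migration ...

        cudaMemAdvise(table, bytes, cudaMemAdviseUnsetReadMostly, 0);    // drop the hint before heavy writes
        cudaFree(table);
        return 0;
    }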
ANALYTICS USE CASE
Design of a Concurrent Hybrid Hash Table
Hash table implemented via Unified Memory
Multiple CPU cores perform concurrent inserts and fetches
Non-blocking updates using atomic compare-and-swap
GPU 0 and GPU 1 perform concurrent fetches
S8172 - Evaluation of Hybrid Cache-Coherent Concurrent Hash Table on POWER9 System with NVLink 2.0, Thu 11:00, Room 210F
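Not the hash table from S8172, only a sketch of the non-blocking update idea on this slide: keys live in a managed array and a GPU thread claims a slot with a system-scope compare-and-swap, so CPU threads probing the same table observe consistent entries. Linear probing, int keys, and 0 as the empty marker are assumptions.

    // Sketch: lock-free insert into a managed open-addressing table (requires sm_60+).
    #include <cuda_runtime.h>

    #define EMPTY 0   // assumed sentinel; real keys must be nonzero

    __device__ bool insert(int *keys, int capacity, int key) {
        int slot = key % capacity;
        for (int probe = 0; probe < capacity; probe++) {
            // Atomically claim the slot if it is still empty; system scope makes the
            // update visible to CPU readers of the same managed array.
            int prev = atomicCAS_system(&keys[slot], EMPTY, key);
            if (prev == EMPTY || prev == key) return true;   // inserted, or already present
            slot = (slot + 1) % capacity;                    // linear probe to the next slot
        }
        return false;                                        // table full
    }

    __global__ void insert_batch(int *keys, int capacity, const int *batch, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) insert(keys, capacity, batch[i]);
    }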
ANALYTICS USE CASE
Concurrent Access To Hash Table
[Diagram: a GPU (SMs, L2, HBM) paired with an x86 CPU and system memory; a GPU SM touches a hash entry that resides in system memory and takes a page fault.]
ANALYTICS USE CASE
Concurrent Access To Hash Table
[Diagram: the page is migrated from system memory to GPU HBM and the faulting access is replayed.]
ANALYTICS USE CASE
Concurrent Access To Hash Table
[Diagram: on x86 the access path is page fault, migration, and replay; with a POWER9 CPU and NVLink 2.0 the CPU can directly update and cache a hash entry in GPU memory, with no page faults or migrations.]
S8172 - Evaluation of Hybrid Cache-Coherent Concurrent Hash Table on POWER9 System with NVLink 2.0, Thu 11:00, Room 210F
UNIFIED MEMORY + DGX-2
[Diagram: 16 GPUs (GPU 0 through GPU 15) sharing 512 GB of Unified Memory.]
Unified Memory provides:
Single memory view shared by all GPUs
Automatic migration of data between GPUs
User control of data locality
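A hedged sketch of spreading one managed allocation across several GPUs, in the spirit of this slide: each device gets its slice prefetched locally and then runs a kernel on it. The kernel, the even split of n across devices, and the assumption of a concurrent-managed-access platform are illustrative.

    // Sketch: one managed allocation, one prefetched slice and kernel per GPU.
    #include <cuda_runtime.h>

    __global__ void process(float *chunk, size_t n) {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) chunk[i] += 1.0f;   // stand-in for per-GPU work
    }

    int main() {
        int num_gpus;
        cudaGetDeviceCount(&num_gpus);

        const size_t n = 1 << 24;
        float *data;
        cudaMallocManaged(&data, n * sizeof(float));   // one allocation visible to every GPU

        size_t per_gpu = n / num_gpus;                 // assumes n divides evenly
        for (int d = 0; d < num_gpus; d++) {
            cudaSetDevice(d);
            float *chunk = data + d * per_gpu;
            cudaMemPrefetchAsync(chunk, per_gpu * sizeof(float), d);   // place each slice locally
            process<<<(unsigned)((per_gpu + 255) / 256), 256>>>(chunk, per_gpu);
        }

        for (int d = 0; d < num_gpus; d++) { cudaSetDevice(d); cudaDeviceSynchronize(); }
        cudaFree(data);
        return 0;
    }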