
UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 - PowerPoint PPT Presentation



  1. UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017

  2. HETEROGENEOUS ARCHITECTURES [diagram: GPU 0, GPU 1, GPU 2 each with its own MEM, plus a CPU with SYS MEM]

  3. UNIFIED MEMORY FUNDAMENTALS Single Pointer

     CPU code:
       void *data;
       data = malloc(N);
       cpu_func1(data, N);
       cpu_func2(data, N);
       cpu_func3(data, N);
       free(data);

     GPU code:
       void *data;
       data = malloc(N);
       cpu_func1(data, N);
       gpu_func2<<<...>>>(data, N);
       cudaDeviceSynchronize();
       cpu_func3(data, N);
       free(data);

  4. UNIFIED MEMORY FUNDAMENTALS Single Pointer

     Explicit Memory Management:
       void *h_data, *d_data;
       h_data = malloc(N);
       cudaMalloc(&d_data, N);
       cpu_func1(h_data, N);
       cudaMemcpy(d_data, h_data, N, ...);
       gpu_func2<<<...>>>(d_data, N);
       cudaMemcpy(h_data, d_data, N, ...);
       cpu_func3(h_data, N);
       free(h_data);
       cudaFree(d_data);

     Unified Memory:
       void *data;
       data = malloc(N);
       cpu_func1(data, N);
       gpu_func2<<<...>>>(data, N);
       cudaDeviceSynchronize();
       cpu_func3(data, N);
       free(data);

  5. UNIFIED MEMORY FUNDAMENTALS Deep Copy Nightmare

     Explicit Memory Management:
       char **data = (char**)malloc(N*sizeof(char*));
       for (int i = 0; i < N; i++)
           data[i] = (char*)malloc(N);
       char **d_data;
       char **h_data2 = (char**)malloc(N*sizeof(char*));
       for (int i = 0; i < N; i++) {
           cudaMalloc(&h_data2[i], N);
           cudaMemcpy(h_data2[i], data[i], N, ...);
       }
       cudaMalloc(&d_data, N*sizeof(char*));
       cudaMemcpy(d_data, h_data2, N*sizeof(char*), ...);
       gpu_func<<<...>>>(d_data, N);

     Unified Memory:
       char **data = (char**)malloc(N*sizeof(char*));
       for (int i = 0; i < N; i++)
           data[i] = (char*)malloc(N);
       gpu_func<<<...>>>(data, N);

  6. UNIFIED MEMORY FUNDAMENTALS On-Demand Migration [diagram: proc A with memory A and proc B with memory B; pages 1-3 shown in both address spaces]

  7. UNIFIED MEMORY FUNDAMENTALS On-Demand Migration [diagram: proc A writes *addr1 = 1, a local access to page1; proc B writes *addr3 = 1, triggering a page fault on page3]

  8. UNIFIED MEMORY FUNDAMENTALS On-Demand Migration [diagram: proc B's faulting access to page3 completes once the page is populated in memory B]

  9. UNIFIED MEMORY FUNDAMENTALS On-Demand Migration [diagram: proc A writes *addr2 = 1 and proc B writes *addr3 = 1; both accesses trigger page faults]

  10. UNIFIED MEMORY FUNDAMENTALS On-Demand Migration [diagram: the page faults trigger page migrations, page2 into memory A and page3 into memory B]

  11. UNIFIED MEMORY FUNDAMENTALS On-Demand Migration [diagram: after migration, *addr2 = 1 on proc A and *addr3 = 1 on proc B are both local accesses]

  12. UNIFIED MEMORY FUNDAMENTALS When Is This Helpful? When it doesn't matter how data moves to a processor: 1) quick-and-dirty algorithm prototyping; 2) iterative processes with lots of data reuse, where migration cost can be amortized; 3) simplifying application debugging. When it's difficult to isolate the working set: 1) irregular or dynamic data structures with unpredictable access; 2) data partitioning between multiple processors.

  13. UNIFIED MEMORY FUNDAMENTALS Memory Oversubscription [diagram: physical memory B is at full capacity; proc B writes *addr3 = 1 and triggers a page fault]

  14. UNIFIED MEMORY FUNDAMENTALS Memory Oversubscription [diagram: to handle the fault, a page is evicted from memory B to make room]

  15. UNIFIED MEMORY FUNDAMENTALS Memory Oversubscription [diagram: the faulting page migrates into memory B and the access completes]

  16. UNIFIED MEMORY FUNDAMENTALS Memory Oversubscription [diagram: physical memory B is full again; the cycle repeats as new pages are touched]

  17. UNIFIED MEMORY FUNDAMENTALS Memory Oversubscription Benefits When you have a large dataset and not enough physical memory. Moving pieces by hand is error-prone and requires tuning for memory size. Better to run slowly than to fail with an out-of-memory error. You can actually get high performance with Unified Memory!
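  As a sketch of the oversubscription scenario above: a managed allocation larger than the GPU's physical memory succeeds, and a kernel can walk all of it while the driver pages data in and evicts as needed. The 1.5x sizing and the grid-stride kernel here are illustrative, not from the slides.

  ```cuda
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void touch_all(char *data, size_t n) {
      size_t stride = (size_t)gridDim.x * blockDim.x;
      for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
          data[i] = 1;  // first touch triggers on-demand migration (and eviction when full)
  }

  int main() {
      size_t free_b, total_b;
      cudaMemGetInfo(&free_b, &total_b);
      size_t n = total_b + total_b / 2;     // deliberately 1.5x the GPU's physical memory
      char *data;
      cudaMallocManaged(&data, n);          // succeeds even though n > GPU memory (Pascal+)
      touch_all<<<256, 256>>>(data, n);     // driver pages data in and out as the kernel runs
      cudaDeviceSynchronize();
      cudaFree(data);
      return 0;
  }
  ```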

  18. UNIFIED MEMORY FUNDAMENTALS System-Wide Atomics with Exclusive Access [diagram: proc A and proc B both issue atomicAdd_system(addr2, 1); proc A's is a local access, proc B's triggers a page fault on page2]

  19. UNIFIED MEMORY FUNDAMENTALS System-Wide Atomics with Exclusive Access [diagram: the fault migrates page2 into memory B]

  20. UNIFIED MEMORY FUNDAMENTALS System-Wide Atomics with Exclusive Access [diagram: proc B's atomicAdd_system(addr2, 1) is now a local access]

  21. UNIFIED MEMORY FUNDAMENTALS System-Wide Atomics over NVLINK* [diagram: proc A and proc B both issue atomicAdd_system(addr2, 1); one is a local access and the other a remote access over NVLINK, with no page migration] *both processors need to support atomic operations

  22. UNIFIED MEMORY FUNDAMENTALS System-Wide Atomics GPUs are very good at handling atomics from thousands of threads. It makes sense to use atomics between GPUs or between the CPU and GPU. We will see this in action in a realistic example later on.
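  A minimal sketch of the system-wide atomics above: GPU threads and the CPU increment the same managed counter. atomicAdd_system is the CUDA device intrinsic for scope-system atomics; the GCC builtin on the host side and the launch shape are illustrative choices. Concurrent CPU access to managed memory while a kernel runs requires Pascal (sm_60) or newer.

  ```cuda
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void gpu_increment(int *counter, int iters) {
      for (int i = 0; i < iters; i++)
          atomicAdd_system(counter, 1);  // atomic with respect to CPU and other GPUs
  }

  int main() {
      int *counter;
      cudaMallocManaged(&counter, sizeof(int));
      *counter = 0;
      gpu_increment<<<1, 128>>>(counter, 10);
      // CPU participates with a host atomic on the same pointer:
      __sync_fetch_and_add(counter, 1);
      cudaDeviceSynchronize();
      printf("counter = %d\n", *counter);  // 128*10 + 1 on supported systems
      cudaFree(counter);
      return 0;
  }
  ```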

  23. AGENDA Unified Memory Fundamentals / Under the Hood Details / Performance Analysis and Optimizations / Applications Deep Dive

  24. UNIFIED MEMORY ALLOCATOR Available Options CUDA C: cudaMallocManaged is your most reliable way to opt in today. CUDA Fortran: managed attribute (per allocation). OpenACC: -ta=managed compiler option (all dynamic allocations). malloc support is coming on Pascal+ architectures (Linux only). Note: you can write your own malloc hook to use cudaMallocManaged.
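  For the CUDA C option above, a minimal opt-in sketch; the scale kernel and array size are illustrative, not from the slides:

  ```cuda
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void scale(float *x, int n, float a) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] *= a;
  }

  int main() {
      const int n = 1 << 20;
      float *x;
      cudaMallocManaged(&x, n * sizeof(float));  // single pointer, valid on CPU and GPU
      for (int i = 0; i < n; i++) x[i] = 1.0f;   // CPU writes directly, no cudaMemcpy
      scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);
      cudaDeviceSynchronize();                   // wait before touching x on the CPU again
      printf("x[0] = %f\n", x[0]);
      cudaFree(x);                               // managed memory is freed with cudaFree
      return 0;
  }
  ```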

  25. HETEROGENEOUS MEMORY MANAGER Work In Progress Heterogeneous Memory Manager (HMM): a set of Linux kernel patches. Allows GPUs to access all system memory (malloc, stack, file system). Page migration will be triggered the same way as for cudaMallocManaged. Ongoing testing and reviews; planning the next phase of optimizations. More details on HMM today at 4:00 in Room 211B by John Hubbard.

  26. UNIFIED MEMORY Evolution of GPU Architectures Kepler (2012): first release of the new "single-pointer" programming model. Maxwell (2014): no new features related to Unified Memory. Pascal (2016, NVLINK1): on-demand migration, oversubscription, system-wide atomics. Volta (2017, NVLINK2): access counters, copy engine faults, cache coherence, ATS support.

  27. UNIFIED MEMORY ON KEPLER Available since CUDA 6 Kepler GPU: no page fault support, limited virtual address space [diagram: GPU with memory A, CPU with memory B, pages 1-3 in both address spaces]

  28. UNIFIED MEMORY ON KEPLER Available since CUDA 6 Bulk migration of all pages attached to the current stream on kernel launch [diagram: a kernel launch triggers migration of pages 1-3 to GPU memory]

  29. UNIFIED MEMORY ON KEPLER Available since CUDA 6 No on-demand migration for the GPU, no oversubscription, no system-wide atomics [diagram: GPU and CPU each access pages locally in their own memory]

  30. UNIFIED MEMORY ON PASCAL Available since CUDA 8 Pascal GPU: page fault support, extended virtual address space (48-bit) [diagram: proc A with memory A, proc B with memory B, pages 1-3 in both address spaces]

  31. UNIFIED MEMORY ON PASCAL Available since CUDA 8 On-demand migration to the accessing processor on first touch [diagram: proc A makes a local access; a page fault on proc B triggers page migration into memory B]

  32. UNIFIED MEMORY ON PASCAL Available since CUDA 8 All features: on-demand migration, oversubscription, system-wide atomics [diagram: after migration, accesses are local]

  33. UNIFIED MEMORY ON VOLTA Default model Volta GPU: uses fault on first touch for migration, same as Pascal [diagram: CPU makes a local access; a GPU page fault triggers page migration from CPU memory to GPU memory]

  34. UNIFIED MEMORY ON VOLTA New Feature: Access Counters If memory is mapped to the GPU, migration can be triggered by access counters [diagram: GPU performs remote accesses to pages in CPU memory; CPU accesses its own memory locally]

  35. UNIFIED MEMORY ON VOLTA New Feature: Access Counters With access-counter migration, only hot pages are moved to the GPU [diagram: a frequently accessed page migrates into GPU memory and becomes a local access]

  36. UNIFIED MEMORY ON VOLTA+P9 NVLINK2: Cache Coherence CPU can directly access and cache GPU memory; native CPU-GPU atomics [diagram: CPU performs remote accesses to pages in GPU memory; GPU accesses its own memory locally]

  37. DRIVER HEURISTICS Things You Didn't Know Exist The Unified Memory driver is doing intelligent things under the hood: Prefetching: migrate pages proactively to reduce the number of faults. Thrashing mitigation: heuristics to avoid frequent migration of shared pages. Eviction: choosing which pages to evict when room is needed for new ones. You can't control these directly, but you can override most of them with hints.
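  The hints mentioned above are exposed in CUDA 8+ through cudaMemAdvise and cudaMemPrefetchAsync. A sketch of overriding the heuristics for one allocation; the array size, GPU 0, and the default stream are illustrative choices:

  ```cuda
  #include <cuda_runtime.h>

  int main() {
      const size_t bytes = (size_t)(1 << 26) * sizeof(float);
      float *x;
      cudaMallocManaged(&x, bytes);

      // Override placement heuristics with explicit advice:
      cudaMemAdvise(x, bytes, cudaMemAdviseSetPreferredLocation, 0);           // prefer GPU 0
      cudaMemAdvise(x, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);    // keep a CPU mapping

      // Override on-demand prefetching by migrating up front, avoiding faults entirely:
      cudaMemPrefetchAsync(x, bytes, 0 /* dst: GPU 0 */, 0 /* default stream */);
      cudaDeviceSynchronize();

      cudaFree(x);
      return 0;
  }
  ```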
