Unified Memory in CUDA 6
Mark Harris, NVIDIA
Unified Memory: Dramatically Lower Developer Effort

[Figure: Developer view today — separate System Memory and GPU Memory; developer view with Unified Memory — a single Unified Memory visible to both processors.]
Super Simplified Memory Management Code

CPU Code:

```cpp
void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);

    fread(data, 1, N, fp);

    qsort(data, N, 1, compare);

    use_data(data);

    free(data);
}
```

CUDA 6 Code with Unified Memory:

```cpp
void sortfile(FILE *fp, int N) {
    char *data;
    cudaMallocManaged(&data, N);

    fread(data, 1, N, fp);

    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();

    use_data(data);

    cudaFree(data);
}
```
Unified Memory Delivers

1. Simpler Programming & Memory Model
   - Single pointer to data, accessible anywhere
   - Tight language integration
   - Greatly simplifies code porting

2. Performance Through Data Locality
   - Migrate data to the accessing processor
   - Guarantee global coherency
   - Still allows cudaMemcpyAsync() hand tuning
Simpler Memory Model: Eliminate Deep Copies

[Figure: a dataElem struct resident in CPU memory, with its *text pointer referencing the string "Hello World"; GPU memory is empty.]

```cpp
struct dataElem {
    int prop1;
    int prop2;
    char *text;
};
```
Simpler Memory Model: Eliminate Deep Copies

[Figure: the same dataElem struct in CPU memory; to use it on the GPU, two copies are required — one of the struct and one of the "Hello World" string it points to — both placed in GPU memory.]
Simpler Memory Model: Eliminate Deep Copies

Without Unified Memory, two copies are required, plus a fix-up of the embedded pointer (note that cudaMemcpy requires a direction argument, added here):

```cpp
void launch(dataElem *elem) {
    dataElem *g_elem;
    char *g_text;

    int textlen = strlen(elem->text);

    // Allocate storage for struct and text
    cudaMalloc(&g_elem, sizeof(dataElem));
    cudaMalloc(&g_text, textlen);

    // Copy up each piece separately, including
    // the new "text" pointer value
    cudaMemcpy(g_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
    cudaMemcpy(g_text, elem->text, textlen, cudaMemcpyHostToDevice);
    cudaMemcpy(&(g_elem->text), &g_text, sizeof(g_text), cudaMemcpyHostToDevice);

    // Finally we can launch our kernel, but
    // CPU & GPU use different copies of "elem"
    kernel<<< ... >>>(g_elem);
}
```
Simpler Memory Model: Eliminate Deep Copies

With Unified Memory, the struct and the string it points to both live in unified memory, so the kernel can take the CPU pointer directly:

```cpp
void launch(dataElem *elem) {
    kernel<<< ... >>>(elem);
}
```
Simpler Memory Model Example: GPU & CPU Shared Linked Lists

[Figure: a linked list of key/data/next nodes in CPU memory; GPU memory is empty.]
Simpler Memory Model Example: GPU & CPU Shared Linked Lists

Without Unified Memory:
- Only practical option is zero-copy (pinned system) memory
- GPU accesses data at PCIe bandwidth
- All GPU accesses incur very high latency, since every access travels over PCIe
Simpler Memory Model Example: GPU & CPU Shared Linked Lists

With Unified Memory:
- Can pass list elements between Host & Device
- Can insert and delete elements from Host or Device*
- Single list: no complex synchronization
- Local data access on both CPU and GPU

* Program must still ensure no race conditions; data is coherent between CPU & GPU at kernel launch & sync only.
Unified Memory with C++

Host/Device C++ integration has been difficult in CUDA:
- The CPU cannot construct an object on the GPU
- References fail because there are no deep copies
- Pass-by-value uses the copy constructor, which likewise cannot deep-copy across the CPU/GPU boundary

```cpp
// Ideal C++ version of the class
class dataElem {
    int prop1;
    int prop2;
    String text;
};

// Kernel taking the class by value
void kernel(dataElem data) { ... }
```

Launching `kernel<<< ... >>>(data);` with this class does not work without deep copies.
Unified Memory with C++

C++ objects migrate easily when allocated on the managed heap. Overload the new operator* to place objects in the unified memory region (the operators must be publicly accessible):

```cpp
class Managed {
public:
    void *operator new(size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        return ptr;
    }

    void operator delete(void *ptr) {
        cudaFree(ptr);
    }
};
```

* (or use placement-new)
Unified Memory with C++

Pass-by-reference is enabled by the new overload:

```cpp
// Deriving from "Managed" allows pass-by-reference
class String : public Managed {
    int length;
    char *data;
};
```

NOTE: CPU/GPU class sharing is restricted to POD classes only (i.e. no virtual functions).
Unified Memory with C++

Pass-by-value is enabled by managed-memory copy constructors:

```cpp
// Deriving from "Managed" allows pass-by-reference
class String : public Managed {
    int length;
    char *data;

    // Unified memory copy constructor allows pass-by-value
    String(const String &s) {
        length = s.length;
        cudaMallocManaged(&data, length);
        memcpy(data, s.data, length);
    }
};
```

NOTE: CPU/GPU class sharing is restricted to POD classes only (i.e. no virtual functions).
Unified Memory with C++

The combination of C++ and Unified Memory is very powerful:
- Concise and explicit: let C++ handle the deep copies
- Pass by value or by reference without memcpy shenanigans

```cpp
// Note "Managed" on this class, too.
// C++ now handles our deep copies
class dataElem : public Managed {
    int prop1;
    int prop2;
    String text;
};
```

CPU program:

```cpp
dataElem *data = new dataElem;
```
C++ Pass By Reference

A single pointer to data makes object references just work: the reference in the kernel points to the same object in unified memory (note the dereference when launching, since `data` is a pointer):

```cpp
// CPU program
kernel_by_ref<<< ... >>>(*data);

// GPU program
__global__ void kernel_by_ref(dataElem &data) { ... }
```
C++ Pass By Value

Copy constructors invoked from the CPU create GPU-usable objects: the by-value copy lives in managed memory (again note the dereference, since `data` is a pointer):

```cpp
// CPU program
kernel_by_val<<< ... >>>(*data);

// GPU program
__global__ void kernel_by_val(dataElem data) { ... }
```
Unified Memory Roadmap

CUDA 6: Ease of Use
- Single pointer to data
- No memcopy required
- Coherence @ launch & sync
- Shared C/C++ data structures

Next: Optimizations
- Prefetching
- Migration hints
- Additional OS support

Maxwell
- System allocator unified
- Stack memory unified
- HW-accelerated coherence
CUDA 6
1. Unified Memory
2. XT and Drop-in Libraries
3. GPUDirect RDMA in MPI
4. Developer Tools
CUDA 6 Dramatically Simplifies Parallel Programming with Unified Memory

More on the Parallel Forall blog:
http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/

Sign up for the CUDA Registered Developer Program:
https://developer.nvidia.com/cuda-toolkit