Memory Management Tips, Tricks & Techniques
Stephen Jones, SpaceX, GTC 2015 (PowerPoint PPT presentation)


  1. Memory Management Tips, Tricks & Techniques Stephen Jones, SpaceX, GTC 2015

  2. Conclusion 1. Wrap malloc/cudaMalloc with your own allocator. This enables non-blocking, high-performance sub-allocation; host/device data management; and leak detection, debugging & profiling.

  3. Conclusion 2. There are three types of memory allocation:
     Long-Lived Storage: persistent allocations spanning multiple program iterations; main data storage, C++ objects & configuration data.
     Single-Iteration Lifetime: working space which does not persist outside of one iteration; per-iteration derived quantities, operation working space, double buffers, etc. (see the sketch below).
     Local Allocation: temporary, transient allocations with single-procedure lifetime; local queues, stacks & objects, function-scope working space.
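     To make the middle category concrete, here is a minimal sketch (not from the deck; the class name and 16-byte alignment are illustrative assumptions) of a bump allocator for single-iteration working space, recycled with one reset() call per iteration:

         #include <cstddef>
         #include <cstdlib>

         // Hypothetical arena for per-iteration working space: every alloc()
         // is a pointer bump, and reset() recycles the whole arena at once.
         class IterationArena {
             char  *base;
             size_t capacity;
             size_t offset;
         public:
             explicit IterationArena(size_t bytes)
                 : base((char *)malloc(bytes)), capacity(bytes), offset(0) {}
             ~IterationArena() { free(base); }

             void *alloc(size_t len) {
                 len = (len + 15) & ~(size_t)15;           // 16-byte alignment
                 if (offset + len > capacity) return nullptr;
                 void *ptr = base + offset;
                 offset += len;
                 return ptr;
             }
             void reset() { offset = 0; }                  // end of iteration
         };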

  4. Take Control Of Memory Allocation

  5. Take Control Of Memory Allocation
     Define your own allocate/free functions. Overload new & delete for all classes. Never call native malloc() or free().
     This enables debug & leak detection, lightweight allocators, non-blocking allocation, and host/device management.

  6. It's Easy!

     C malloc() & free():

         // Please don't name these "malloc" & "free"
         void *hostAlloc(size_t len) {
             return malloc(len);
         }

         void freeMem(void *ptr) {
             free(ptr);
         }

     C++ new & delete:

         // Every class should be "public AllocBase"
         class AllocBase {
         public:
             void *operator new(size_t len)  { return hostAlloc(len); }
             void operator delete(void *ptr) { freeMem(ptr); }
         };

  7. Also Control Device Allocation
     The same hostAlloc/freeMem wrappers and AllocBase class as before, plus a device-side allocator:

         void *deviceAlloc(size_t len) {
             void *ptr;
             cudaMalloc(&ptr, len);
             return ptr;
         }

  8. Allocation Tracking, Leak Detection & Profiling

  9. Memory Leak Detection
     Track each allocation with a unique identifier: allocate extra space for the tracking ID, store the ID in front of the allocation, and record which IDs are assigned & released.
     For example, char *ptr = (char *)hostAlloc(1000); requests 1000 bytes but actually allocates 1008: the ID from the allocation ID counter is recorded at the start of the actual allocation, and the offset address just past it is returned.

  10. Memory Leak Detection [diagram: allocations are handed IDs 0-9 as they are made]

  11. Memory Leak Detection [diagram: as blocks are freed, their IDs are marked released]

  12. Memory Leak Detection [diagram: IDs 2, 6 & 7 are never released; these are the memory leaks]

  13. Memory Leak Detection

          // Use a C++11 atomic to count up allocation ownership
          static std::atomic<long long> alloc_id(0);
          static std::vector<long long> allocationList;   // must be pre-sized

          void *hostAlloc(size_t len) {
              long long id = alloc_id++;        // Count up allocation ID
              allocationList[id] = 1;           // Record ID as "allocated"
              // Store allocation ID in front of returned memory
              void *ptr = malloc(len + 8);
              *(long long *)ptr = id;
              return (char *)ptr + 8;
          }

          void freeMem(void *ptr) {
              // Extract allocation ID from front of allocation
              long long id = *(long long *)((char *)ptr - 8);
              allocationList[id] = 0;           // Record ID as "released"
              free((char *)ptr - 8);
          }

  14. Displaying Unreleased Allocations
      For global-scope objects, the constructor is called before main() and the destructor is called after main() exits: set up initial data in the constructor, analyse tracking data in the destructor.

          class TrackingObject {
          public:
              // Set up initial data in constructor
              TrackingObject() { InitTrackingData(); }
              // Analyse tracking data in destructor
              virtual ~TrackingObject() { ProcessTrackingData(); }
              virtual void InitTrackingData() {}
              virtual void ProcessTrackingData() {}
          };

          // Create global-scope static object. Destructor
          // is called automatically when program exits.
          static TrackingObject dataTracker;

      WARNING: the order of static object construction & destruction is undefined, so tracking objects should not interact.

  15. Displaying Unreleased Allocations

          // Walks the allocation list looking for unreleased data
          class AllocationTracker : public TrackingObject {
          public:
              // Virtual dispatch is disabled inside the base-class destructor,
              // so the override must also be invoked from our own destructor
              ~AllocationTracker() { ProcessTrackingData(); }

              void ProcessTrackingData() {
                  for (long long i = 0; i < alloc_id; i++) {
                      if (allocationList[i] != 0) {
                          printf("Allocation %lld not freed\n", i);
                      }
                  }
              }
          };

          // Creates a tracker which will be called on program shutdown
          static AllocationTracker __allocationTracker;

  16. Complete Leak Tracking Code

          // Allocator with leak tracking
          static std::atomic<long long> alloc_id(0);
          static std::vector<long long> allocationList;

          void *hostAlloc(size_t len) {
              long long id = alloc_id++;
              allocationList[id] = 1;
              void *ptr = malloc(len + 8);
              *(long long *)ptr = id;
              return (char *)ptr + 8;
          }

          void freeMem(void *ptr) {
              long long id = *(long long *)((char *)ptr - 8);
              allocationList[id] = 0;
              free((char *)ptr - 8);
          }

          // Auto display of memory leaks
          class AllocationTracker {
          public:
              ~AllocationTracker() {
                  for (long long i = 0; i < alloc_id; i++) {
                      if (allocationList[i] != 0) {
                          printf("Allocation %lld not freed\n", i);
                      }
                  }
              }
          };
          static AllocationTracker __allocationTracker;
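      A minimal usage sketch (not from the deck; main() and the pre-sizing of allocationList are illustrative assumptions):

          #include <atomic>
          #include <vector>
          #include <cstdio>
          #include <cstdlib>
          // ... hostAlloc / freeMem / AllocationTracker from slide 16 ...

          int main() {
              allocationList.resize(1024);   // pre-size the ID table
              void *a = hostAlloc(100);      // gets ID 0
              void *b = hostAlloc(200);      // gets ID 1
              void *c = hostAlloc(300);      // gets ID 2
              freeMem(a);
              freeMem(c);
              (void)b;  // b is never freed, so at program exit the
              return 0; // tracker prints "Allocation 1 not freed"
          }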

  17. Host / Device Data Management

  18. Managing Data Movement
      There is no one-size-fits-all; the right tools depend on how the CPU & GPU work is structured:
      Large separate GPU & CPU code sections (minimise code impact): use managed memory; C++ operator & casting shenanigans; focus on memory layout.
      Interleaved CPU & GPU execution (explicit locality control): streams & copy/compute overlap; carefully managed memory.
      Concurrent CPU & GPU execution: fine-grained memory regions; signaling between host & device; consider zero-copy memory (see the sketch after this list).
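      A minimal zero-copy sketch (not from the deck; the kernel name, launch shape and size are illustrative assumptions), using pinned, mapped host memory so the GPU accesses host data in place with no cudaMemcpy:

          cudaSetDeviceFlags(cudaDeviceMapHost);  // enable mapped pinned memory

          int *hostPtr, *devPtr;
          cudaHostAlloc(&hostPtr, 10000 * sizeof(int), cudaHostAllocMapped);
          cudaHostGetDevicePointer(&devPtr, hostPtr, 0);

          kernel<<< 40, 256 >>>(devPtr);   // GPU reads/writes host memory directly
          cudaDeviceSynchronize();         // make GPU writes visible to the CPU

          cudaFreeHost(hostPtr);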

  19. Always Use Streams

  20. Always Use Streams: whenever you launch a kernel, whenever you copy data, whenever you synchronize.
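      A minimal sketch of the habit (the kernel name myKernel and the sizes are illustrative assumptions): every launch, copy and sync names an explicit stream instead of falling into the default stream.

          size_t bytes = 1 << 20;
          char *hostBuf, *devBuf;
          cudaMallocHost(&hostBuf, bytes);   // pinned host buffer
          cudaMalloc(&devBuf, bytes);

          cudaStream_t stream;
          cudaStreamCreate(&stream);

          cudaMemcpyAsync(devBuf, hostBuf, bytes,
                          cudaMemcpyHostToDevice, stream);    // copy in a stream
          myKernel<<< 256, 256, 0, stream >>>(devBuf);        // launch in a stream
          cudaMemcpyAsync(hostBuf, devBuf, bytes,
                          cudaMemcpyDeviceToHost, stream);    // copy back in a stream

          cudaStreamSynchronize(stream);   // synchronize just this stream
          cudaStreamDestroy(stream);
          cudaFree(devBuf);
          cudaFreeHost(hostBuf);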

  21. Streams & Copy/Compute Overlap [diagram: separate copy-up and copy-back engines] Tesla & Quadro GPUs support bi-directional copying.
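      One prerequisite the following diagrams rely on (my addition, not spelled out on the slide): cudaMemcpyAsync only overlaps with compute when the host buffer is pinned; copies from pageable memory are staged and will not overlap.

          // Pageable host memory: async copies are staged, no overlap
          char *pageable = (char *)malloc(bytes);
          // Pinned host memory: the copy engines can run truly asynchronously
          char *pinned;
          cudaMallocHost(&pinned, bytes);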

  22-33. Streams & Copy/Compute Overlap [animation, steps 1-9: the CPU/GPU timeline is built up step by step as each chunk of data is copied up, processed, and copied back]

  34-39. Streams & Copy/Compute Overlap [animation, steps 1-5: the pipeline fills until copy back, compute and copy up for successive chunks all run at once. Three simultaneous operations: 1 copy back, 2 compute, 3 copy up]

  40. Overlapping Copy & Compute [timeline: copy, compute, copy running in sequence from start to finish]

  41. Overlapping Copy & Compute [timeline: the overlapped run reaches finish before the non-overlapped finish; the gap is time saved]

  42. In More Detail... [timeline: copy up, compute and copy back staggered across Stream 1, Stream 2 and Stream 3 between start and finish]

  43. Compute/Copy Overlap, in Code

          // Convert cats to dogs in "N" chunks
          void catsToDogs(char *cat, char *dog, int width, int height, int N) {
              // Loop copy+compute+copy for each chunk
              for (int h = 0; h < height; h += (height / N)) {
                  // Create a stream for this iteration
                  cudaStream_t s;
                  cudaStreamCreate(&s);

                  // Allocate device data for this chunk
                  char *deviceData;
                  cudaMalloc(&deviceData, width * (height / N));

                  // Copy up, then convert, then copy back, in our stream
                  cudaMemcpyAsync(deviceData, cat + h * width, width * (height / N),
                                  cudaMemcpyHostToDevice, s);
                  convert<<< width, height / N, 0, s >>>(deviceData);
                  cudaMemcpyAsync(dog + h * width, deviceData, width * (height / N),
                                  cudaMemcpyDeviceToHost, s);

                  // Free up this iteration's resources
                  cudaStreamDestroy(s);
                  cudaFree(deviceData);
              }
          }
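      A caveat worth noting (my addition, not on the slide): cudaFree synchronizes the device, so freeing inside the loop can serialize the chunks. A hedged variant (the function name is an illustrative assumption; convert is the slide's kernel) pre-creates the streams and buffers so all N chunks can actually overlap:

          #include <vector>

          void catsToDogsOverlapped(char *cat, char *dog,
                                    int width, int height, int N) {
              int chunkBytes = width * (height / N);
              std::vector<cudaStream_t> streams(N);
              std::vector<char *> buffers(N);
              for (int i = 0; i < N; i++) {
                  cudaStreamCreate(&streams[i]);
                  cudaMalloc(&buffers[i], chunkBytes);
              }
              // Issue every chunk's copy-compute-copy before waiting on any
              for (int i = 0; i < N; i++) {
                  int h = i * (height / N);
                  cudaMemcpyAsync(buffers[i], cat + h * width, chunkBytes,
                                  cudaMemcpyHostToDevice, streams[i]);
                  convert<<< width, height / N, 0, streams[i] >>>(buffers[i]);
                  cudaMemcpyAsync(dog + h * width, buffers[i], chunkBytes,
                                  cudaMemcpyDeviceToHost, streams[i]);
              }
              // Clean up only after all chunks have been issued
              for (int i = 0; i < N; i++) {
                  cudaStreamSynchronize(streams[i]);
                  cudaStreamDestroy(streams[i]);
                  cudaFree(buffers[i]);
              }
          }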

  44. Managed Memory
      Very convenient for minimising code impact: the same pointer can be accessed from CPU & GPU directly, data moves automatically, and the GPU gets full-bandwidth access. Tricky to use because of concurrency constraints (see next slides).

          int *data;
          cudaMallocManaged(&data, 10000000);
          data[100] = 1234;           // Access on CPU first
          launch<<< 1, 1 >>>(data);   // Access on GPU second

  45. Drawback Of Managed Memory
      The CPU cannot touch managed memory while the GPU is active, where "active" means any launch or copy since the last synchronize().

          int *data;
          cudaMallocManaged(&data, 10000000);
          launch<<< 1, 1 >>>(data);   // Access on GPU first
          data[100] = 1234;           // CPU access fails
                                      // because GPU is busy

  46. Drawback Of Managed Memory
      The CPU cannot touch managed memory while the GPU is active, even if the GPU kernel is not actually using the data.

          int *data;
          cudaMallocManaged(&data, 10000000);
          launch<<< 1, 1 >>>();       // GPU does not touch data
          data[100] = 1234;           // CPU access still fails
                                      // because GPU is busy!
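      The fix follows from the slides' definition of "active": synchronize before the CPU touches managed data again (a minimal sketch continuing the example above):

          launch<<< 1, 1 >>>(data);
          cudaDeviceSynchronize();   // wait until the GPU is idle
          data[100] = 1234;          // CPU access is now safe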
