Memory Management Tips, Tricks & Techniques
Stephen Jones, SpaceX, GTC 2015
Conclusion 1. Wrap malloc/cudaMalloc with your own allocator:
- Non-blocking, high-performance sub-allocation
- Host/device data management
- Leak detection, debugging & profiling
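One payoff of wrapping cudaMalloc is sub-allocation: many small device allocations carved out of one big block, so repeated cudaMalloc calls are avoided on the hot path. A hypothetical sketch, not from the talk; the names, pool policy, and alignment are assumptions:

    // Hypothetical sketch: a trivial bump sub-allocator that carves small device
    // allocations out of one large cudaMalloc'd pool.
    static char  *devicePool     = nullptr;   // one big block from cudaMalloc
    static size_t devicePoolSize = 0;
    static size_t devicePoolUsed = 0;

    void initDevicePool(size_t bytes)
    {
        cudaMalloc((void **)&devicePool, bytes);   // single expensive allocation
        devicePoolSize = bytes;
        devicePoolUsed = 0;
    }

    void *deviceSubAlloc(size_t len)
    {
        size_t aligned = (len + 255) & ~(size_t)255;   // keep 256-byte alignment
        if( devicePoolUsed + aligned > devicePoolSize )
            return nullptr;                            // pool exhausted
        void *ptr = devicePool + devicePoolUsed;       // bump the offset
        devicePoolUsed += aligned;
        return ptr;
    }

    void resetDevicePool() { devicePoolUsed = 0; }     // release everything at once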
Conclusion 2. There are three types of memory allocation:
- Persistent: for allocations spanning multiple program iterations. Long-lived; main data storage, C++ objects & configuration data storage.
- Working space: for data which does not persist outside of one iteration. Lifetime of a single iteration; per-iteration derived quantities, operation working space, double buffers, etc.
- Temporary: for transient allocations with single-procedure lifetime. Local allocation; local queues, stacks & objects, function-scope working space.
Take Control Of Memory Allocation
Take Control Of Memory Allocation
- Define your own allocate/free functions
- Overload new & delete for all classes
- Never call native malloc() or free()
Benefits: debug & leak detection, lightweight allocators, non-blocking allocation, host/device management.
It's Easy!

C malloc() & free():

    // Please don't name these "malloc" & "free"
    void *hostAlloc(size_t len)
    {
        return malloc(len);
    }

    void freeMem(void *ptr)
    {
        free(ptr);
    }

C++ new & delete:

    // Every class should be "public AllocBase"
    class AllocBase {
    public:
        void *operator new(size_t len) {
            return hostAlloc(len);
        }
        void operator delete(void *ptr) {
            freeMem(ptr);
        }
    };
Also Control Device Allocation

C malloc() & free():

    // Please don't name these "malloc" & "free"
    void *hostAlloc(size_t len)
    {
        return malloc(len);
    }

    void freeMem(void *ptr)
    {
        free(ptr);
    }

    void *deviceAlloc(size_t len)
    {
        void *ptr;
        cudaMalloc(&ptr, len);
        return ptr;
    }

C++ new & delete:

    // Every class should be "public AllocBase"
    class AllocBase {
    public:
        void *operator new(size_t len) {
            return hostAlloc(len);
        }
        void operator delete(void *ptr) {
            freeMem(ptr);
        }
    };
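For symmetry, a matching device-side free wrapper is the natural counterpart. A minimal sketch; deviceFree is an assumed name, the slide only shows deviceAlloc:

    // Assumed counterpart to deviceAlloc(): never call cudaFree() directly either
    void deviceFree(void *ptr)
    {
        cudaFree(ptr);
    }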
Allocation Tracking, Leak Detection & Profiling
Memory Leak Detection
- Track each allocation with a unique identifier
- Allocate extra space for the tracking ID
- Store the ID in front of the allocation
- Record IDs assigned & released

[Figure: char *ptr = (char *)hostAlloc(1000); the actual allocation is 1008 bytes. An allocation ID from a counter is recorded at the start, followed by the 1000 bytes of requested space, and the returned address is offset past the ID.]
Memory Leak Detection
[Animation: allocation IDs 0-9 are recorded on allocate and cleared on free; IDs 2, 6 and 7 are never cleared, identifying the memory leaks.]
Memory Leak Detection

    // Use a C++11 atomic to count up allocation ownership
    static std::atomic<long long> alloc_id(0);
    static std::vector<long long> allocationList;

    void *hostAlloc(size_t len)
    {
        long long id = alloc_id++;          // Count up allocation ID
        allocationList.resize(id + 1);      // Grow the list as needed
        allocationList[id] = 1;             // Record ID as "allocated"

        // Store allocation ID in front of returned memory
        void *ptr = malloc(len + 8);
        *(long long *)ptr = id;
        return (char *)ptr + 8;
    }

    void freeMem(void *ptr)
    {
        // Extract allocation ID from front of allocation
        long long id = *(long long *)((char *)ptr - 8);
        allocationList[id] = 0;             // Record ID as "released"
        free((char *)ptr - 8);
    }
Displaying Unreleased Allocations

For global-scope objects:
- Constructor is called before main()
- Destructor is called after main() exits

WARNING: the order of static object construction & destruction is undefined, so tracking objects should not interact.

    class TrackingObject {
    public:
        // Set up initial data in constructor
        TrackingObject() {
            InitTrackingData();
        }

        // Analyse tracking data in destructor
        virtual ~TrackingObject() {
            ProcessTrackingData();
        }

        virtual void InitTrackingData() {}
        virtual void ProcessTrackingData() {}
    };

    // Create a global-scope static object. Its destructor
    // is called automatically when the program exits.
    static TrackingObject dataTracker;
Displaying Unreleased Allocations

    // Walks the allocation list looking for unreleased data
    class AllocationTracker : public TrackingObject {
    public:
        void ProcessTrackingData() {
            for( long long i=0; i<alloc_id; i++ ) {
                if( allocationList[i] != 0 ) {
                    printf("Allocation %lld not freed\n", i);
                }
            }
        }
    };

    // Creates a tracker which will be called on program shutdown
    static AllocationTracker __allocationTracker;
Complete Leak Tracking Code

    // Allocator with leak tracking
    static std::atomic<long long> alloc_id(0);
    static std::vector<long long> allocationList;

    void *hostAlloc(size_t len)
    {
        long long id = alloc_id++;
        allocationList.resize(id + 1);
        allocationList[id] = 1;

        void *ptr = malloc(len + 8);
        *(long long *)ptr = id;
        return (char *)ptr + 8;
    }

    void freeMem(void *ptr)
    {
        long long id = *(long long *)((char *)ptr - 8);
        allocationList[id] = 0;
        free((char *)ptr - 8);
    }

    // Auto display of memory leaks
    class AllocationTracker {
    public:
        ~AllocationTracker() {
            for( long long i=0; i<alloc_id; i++ ) {
                if( allocationList[i] != 0 ) {
                    printf("Allocation %lld not freed\n", i);
                }
            }
        }
    };

    static AllocationTracker __allocationTracker;
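A hypothetical usage sketch, not from the slides: allocate twice, free only one, and the tracker reports the other at program exit.

    int main()
    {
        char *a = (char *)hostAlloc(100);
        char *b = (char *)hostAlloc(200);   // never freed
        freeMem(a);
        return 0;
        // On exit, __allocationTracker's destructor reports
        // something like: "Allocation 1 not freed"
    }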
Host / Device Data Management
Managing Data Movement
- Large separate GPU & CPU code sections: minimise code impact. Use managed memory; C++ operator & casting shenanigans; focus on memory layout.
- Interleaved CPU & GPU execution: explicit locality control. Streams & copy/compute overlap; carefully managed memory.
- Concurrent CPU & GPU execution: no one-size-fits-all. Fine-grained memory regions; signaling between host & device; consider zero-copy memory (see the sketch after this list).
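A minimal sketch of the zero-copy option mentioned above, not from the slides; "kernel" and the buffer size are placeholders. Pinned host memory is mapped into the GPU's address space, so the kernel reads it directly instead of via an explicit copy:

    // Hypothetical sketch: pinned, mapped host memory that the GPU accesses directly
    cudaSetDeviceFlags(cudaDeviceMapHost);    // enable mapping (before any other CUDA call)

    float *hostPtr = nullptr, *devPtr = nullptr;
    cudaHostAlloc((void **)&hostPtr, 1024 * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);

    hostPtr[0] = 42.0f;                       // CPU writes through the host pointer
    kernel<<< 1, 1 >>>( devPtr );             // GPU reads the same memory over PCIe
    cudaDeviceSynchronize();                  // wait before the CPU reads results back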
Always Use Streams
Always Use Streams
- Whenever you launch a kernel
- Whenever you copy data
- Whenever you synchronize
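A minimal sketch, not from the slides, of what this looks like in practice; myKernel, the buffers, and the launch sizes are placeholders. Every copy, launch, and wait names an explicit stream instead of the default one:

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copy, launch and synchronize all name the same explicit stream
    // (hostBuf should be pinned memory for the copies to be truly asynchronous)
    cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream);
    myKernel<<< blocks, threads, 0, stream >>>(devBuf);
    cudaMemcpyAsync(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);   // wait for just this stream, not the whole GPU
    cudaStreamDestroy(stream);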
Streams & Copy/Compute Overlap
[Figure: copy up and copy back in flight at the same time.] Tesla & Quadro GPUs support bi-directional copying.
Streams & Copy/Compute Overlap
[Animation, steps 1-9: data is copied between CPU and GPU and processed one step at a time.]
Streams & Copy/Compute Overlap
[Animation, steps 1-5: copy up, compute, and copy back for successive chunks run at the same time.]
Three simultaneous operations: (1) copy back, (2) compute, (3) copy up.
Overlapping Copy & Compute
[Timeline figures: copy, compute, copy executed back to back versus overlapped; the overlapped version finishes earlier than the non-overlapped finish, saving time.]

In More Detail...
[Timeline: copy up, compute, and copy back operations distributed across Stream 1, Stream 2, and Stream 3, overlapping in time from start to finish.]
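As a rough worked example, not from the slides: if copy up, compute, and copy back each take roughly the same time T per chunk and the work is split into N chunks, the non-overlapped version takes about 3NT while the three-stream pipeline takes about (N + 2)T, approaching a 3x saving for large N.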
Compute/Copy Overlap, in Code

    // Convert cats to dogs in "N" chunks
    void catsToDogs(char *cat, char *dog, int width, int height, int N)
    {
        // Loop copy+compute+copy for each chunk
        for( int h=0; h<height; h+=(height/N) ) {
            // Create a stream for this iteration
            cudaStream_t s;
            cudaStreamCreate( &s );

            // Allocate device data for this chunk
            char *deviceData;
            cudaMalloc( &deviceData, width * (height/N) );

            // Copy up, then convert, then copy back, in our stream
            cudaMemcpyAsync( deviceData, cat+h*width, ...hostToDevice, s );
            convert<<< width, height/N, 0, s >>>( deviceData );
            cudaMemcpyAsync( dog+h*width, deviceData, ...deviceToHost, s );

            // Free up this iteration's resources
            cudaStreamDestroy( s );
            cudaFree( deviceData );
        }
    }
Managed Memory
- Very convenient for minimising code impact
- Can access the same pointer from CPU & GPU, directly
- Data moves automatically
- Allows full-bandwidth access from the GPU
- Tricky to use because of concurrency constraints (see next slides)

    int *data;
    cudaMallocManaged( &data, 10000000 );

    data[100] = 1234;               // Access on CPU first
    launch<<< 1, 1 >>>( data );     // Access on GPU second
Drawback Of Managed Memory
- The CPU cannot touch managed memory while the GPU is active
- "Active" means any launch or copy since the last synchronize()

    int *data;
    cudaMallocManaged( &data, 10000000 );

    launch<<< 1, 1 >>>( data );     // Access on GPU first
    data[100] = 1234;               // CPU access fails
                                    // because GPU is busy
Drawback Of Managed Memory
- The CPU cannot touch managed memory while the GPU is active
- "Active" means any launch or copy since the last synchronize()
- Even if the GPU kernel is not actually using the data

    int *data;
    cudaMallocManaged( &data, 10000000 );

    launch<<< 1, 1 >>>();           // GPU does not touch data
    data[100] = 1234;               // CPU access still fails
                                    // because GPU is busy!
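The standard workaround, general CUDA practice rather than something spelled out on this slide, is to synchronize before touching managed data from the CPU:

    int *data;
    cudaMallocManaged( &data, 10000000 );

    launch<<< 1, 1 >>>( data );     // GPU work in flight
    cudaDeviceSynchronize();        // wait until the GPU is idle...
    data[100] = 1234;               // ...now CPU access is safe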