April 2016
Automated Creation of Tests from CUDA Kernels
Oleg Rasskazov, Andrey Zhezherun, Antti Lamberg (JP Morgan)
GPUs in JP Morgan
JP Morgan has been using GPUs extensively since 2011 to speed up risk calculations and reduce computational costs.
- Speedup as of 2011: ~40x
- Large cross-asset quant library (C++, CUDA)
- Monte Carlo and PDEs
- GPU code: hand-written CUDA kernels, Thrust, auto-generated CUDA kernels
- Hardest part of delivering GPUs to production: bugs
Auto-generating GPU code
- Putting all of a quant library on the GPU is hard: parts of the code change frequently, so they would need to be rewritten constantly
- Domain-specific languages (DSLs) can help, either interpreted or compiled
- We auto-generate lots of GPU code; the auto-generator is simplistic, converting DSL to a .cu file (a sketch follows below)
- We rely on the CUDA compiler both for optimizations and for making sense of our horrible auto-generated .cu code
- We need a regression test harness around all of this
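To make the generator's role concrete, here is a minimal sketch of the kind of translation involved. The toy DSL fragment, the kernel shape, and the emit_kernel helper are hypothetical illustrations, not the actual generator:

    #include <fstream>
    #include <string>

    // Hypothetical sketch: turn a scalar payoff expression from a toy DSL
    // into an element-wise CUDA kernel. The real DSL and generator are far
    // richer; the point is only that the emitted .cu is naive and leans on
    // the CUDA compiler for all optimization.
    void emit_kernel(const std::string& name, const std::string& expr) {
        std::ofstream cu(name + ".cu");
        cu << "extern \"C\" __global__ void " << name
           << "(const double* in, double* out, int n)\n"
           << "{\n"
           << "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
           << "    if (i >= n) return;\n"
           << "    double x = in[i];\n"
           << "    out[i] = " << expr << ";\n"
           << "}\n";
    }

    // e.g. emit_kernel("payoff_call", "fmax(x - 100.0, 0.0)");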
(Rare) compiler issues
Sources (ways to notice them):
- Driver upgrades
- SDK upgrades
- Hardware upgrades
Mitigation:
- Hand-written code: modify the code to work around the issue
- Auto-generated code: ??? Modifying the generator is hard (complex code; what about performance? backward compatibility?)
- Share an extensive set of our regression tests with NVIDIA
Hints that we have a compiler issue
How do we verify that the issue is not a bug in our own code? Maybe it is, but:
- Different behaviour on different cards
- Different behaviour with different versions of CUDA
- CPU/GPU code match (in some cases)
- PTX inspection
Assume the issue can be reproduced by running a standalone kernel, i.e.:
- No concurrent-execution issues
- Nothing related to special objects allocated by the driver (streams, local memory, etc.)
=> Create a small reproducer
Creating standalone kernel tests/reproducers
Capture:
- The kernel code (the auto-generated .cu file), the kernel inputs, and the correct outputs
- The current GPU memory the kernel operates on; that memory state was created by a complex interaction of previous kernels and CPU calls, and we would like to stay very generic at this point
Replay:
- Restore the GPU memory. How? cudaMalloc does not let one choose the address range of a new allocation
- Compile and load the kernel
- Pass in the parameters and run
- Compare the outputs (a replay sketch follows below)
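Below is a minimal replay sketch using the CUDA driver API. The file names, the kernel name, and its parameter list are placeholder assumptions, not the actual harness; error checking is omitted, and restoring the full memory state (the hard part) is covered on the following slides:

    #include <cuda.h>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    // Hypothetical loader for a captured binary dump of doubles.
    static std::vector<double> read_doubles(const char* path) {
        FILE* f = fopen(path, "rb");
        fseek(f, 0, SEEK_END);
        long bytes = ftell(f);
        fseek(f, 0, SEEK_SET);
        std::vector<double> v(bytes / sizeof(double));
        fread(v.data(), 1, bytes, f);
        fclose(f);
        return v;
    }

    int main() {
        cuInit(0);
        CUdevice dev;  cuDeviceGet(&dev, 0);
        CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

        // Compile-and-load step: here we assume the captured kernel was
        // already compiled to PTX with a known entry-point name.
        CUmodule mod;  cuModuleLoad(&mod, "captured_kernel.ptx");
        CUfunction fn; cuModuleGetFunction(&fn, mod, "captured_kernel");

        // Restore the captured inputs and golden outputs.
        std::vector<double> in     = read_doubles("inputs.bin");
        std::vector<double> golden = read_doubles("outputs.bin");
        CUdeviceptr d_in, d_out;
        cuMemAlloc(&d_in,  in.size() * sizeof(double));
        cuMemAlloc(&d_out, golden.size() * sizeof(double));
        cuMemcpyHtoD(d_in, in.data(), in.size() * sizeof(double));

        // Launch with a placeholder configuration, then diff the outputs.
        int n = (int)golden.size();
        void* args[] = { &d_in, &d_out, &n };
        cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, 0, args, 0);
        cuCtxSynchronize();

        std::vector<double> out(golden.size());
        cuMemcpyDtoH(out.data(), d_out, out.size() * sizeof(double));
        printf(memcmp(out.data(), golden.data(), out.size() * sizeof(double))
                   ? "FAIL\n" : "PASS\n");
        return 0;
    }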
Why dump/restore of memory is hard
- Dump an array from GPU memory: the restored array can be allocated at a different address
- That is fine as long as we know every pointer to the array and can re-point it to the new allocation
- But what if we had an array of pointers to objects? Complex data structures?
- Ideally we would snapshot/restore the entire state of GPU memory, but there is no public API from NVIDIA for this, and the problem is hard because the driver keeps "private" memory that depends on the loaded kernels, local memory configuration, etc.
- We came up with a set of tricks
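The core difficulty is that device memory can itself contain device pointers: a byte-for-byte restore at a new base leaves every embedded pointer encoding the old address. The sketch below shows the rebasing a restore would need at every pointer-holding location, assuming those locations were even known (in general they are not, which motivates the tricks on the next slides):

    #include <cstdint>
    #include <cstring>

    // Patch one embedded device pointer at a known offset inside a dump
    // that is being restored at a new base address. Finding every such
    // offset in complex data structures is the hard part.
    void rebase_pointer(unsigned char* blob, size_t offset,
                        uintptr_t old_base, uintptr_t new_base) {
        uintptr_t p;
        std::memcpy(&p, blob + offset, sizeof(p));
        p = p - old_base + new_base;            // shift into the new allocation
        std::memcpy(blob + offset, &p, sizeof(p));
    }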
32-bit GPU code: dumping/restoring memory
- Assume GPU memory fits into 2GB
- Intercept GPU memory allocations in your code and serve them from a custom allocator over a preallocated blob
- Allocate a 3GB block of GPU memory, BB; on 32-bit we have a 4GB address space, so the 3GB block always covers the virtual address range 1GB-3GB
- Custom-allocate within BB starting from the 1GB address
- Dump all the custom-allocated memory starting from 1GB
- Replay simply allocates 3GB, which is again guaranteed to cover the range 1GB-3GB, and loads the dump starting at the 1GB address
- All internal data pointers are guaranteed to work, as the addresses are exactly the same (sketch below)
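A sketch of the 32-bit scheme, assuming the 3GB reservation is made before any other device allocation; the bump allocator below is a hypothetical stand-in for the interception layer:

    #include <cuda_runtime.h>
    #include <cassert>
    #include <cstdint>

    // On a 32-bit process the device virtual address space is 4GB, so a
    // successful 3GB reservation must cover the fixed range [1GB, 3GB).
    // Sub-allocating from the absolute address 1GB therefore yields the
    // same addresses in every run, and dumps restore without relocation.
    static const uintptr_t GB = 1u << 30;
    static uintptr_t next_addr = 0;

    void init_fixed_heap() {
        void* bb = nullptr;
        cudaMalloc(&bb, 3u * GB);        // reserve before anything fragments
        assert((uintptr_t)bb <= GB);     // so bb..bb+3GB covers [1GB, 3GB)
        next_addr = GB;                  // fixed base of the custom heap
    }

    // Stand-in for the intercepted allocation call: bump-allocate
    // inside the fixed range [1GB, 3GB).
    void* fixed_alloc(size_t bytes) {
        void* p = (void*)next_addr;
        next_addr += (bytes + 255) & ~(size_t)255;  // keep 256-byte alignment
        assert(next_addr <= 3u * GB);
        return p;
    }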
64-bit GPU code: dumping memory
- Assume the GPU memory used by the kernel fits into 1GB
- Intercept GPU memory allocations and serve them from a custom allocator over a preallocated blob
- Assume we do not store pointers back to CPU memory on the GPU
- Run 1: allocate a 2GB block of GPU memory, BB; BB contains at least one 1GB-sized range, M, starting on a 1GB boundary. Use the custom allocator starting from M and, just before running the kernel, dump the GPU memory: BB_M
- Run 2: repeat run 1 but with the 1GB address range starting from N, N != M, and dump the GPU memory: BB_N (a capture sketch follows below)
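A sketch of one capture run. The slide does not say how the window base is varied between the two runs; the pad parameter here is a hypothetical knob that shifts the aligned window (pad = 0 for run 1, pad = 1GB for run 2) so that M != N:

    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    static const uintptr_t GB = 1ull << 30;

    // Reserve a blob and pick a 1GB window starting on a 1GB boundary
    // inside it. A 2GB reservation always contains at least one such
    // window; the hypothetical 'pad' enlarges the reservation and shifts
    // the window so a second run gets a different base.
    uintptr_t reserve_window(uintptr_t pad) {
        void* bb = nullptr;
        cudaMalloc(&bb, 2 * GB + pad);
        uintptr_t base = ((uintptr_t)bb + pad + GB - 1) & ~(GB - 1);  // align up
        return base;   // the intercepted allocator bump-allocates from here
    }

    // Just before the kernel launch, dump the whole 1GB window to disk.
    void dump_window(uintptr_t base, const char* path) {
        std::vector<unsigned char> host(GB);
        cudaMemcpy(host.data(), (void*)base, GB, cudaMemcpyDeviceToHost);
        FILE* f = fopen(path, "wb");
        fwrite(host.data(), 1, GB, f);
        fclose(f);
    }

    // Run 1: dump_window(reserve_window(0),  "BB_M.bin");
    // Run 2: dump_window(reserve_window(GB), "BB_N.bin");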
64-bit GPU code: restoring memory
- Allocate a 2GB block of GPU memory, BB, and find a 1GB stride starting on a 1GB boundary, P
- Assume the code paths of run 1 and run 2 of the application are deterministic and identical, and that the preallocated BB was zeroed in both runs
- Relocate BB_N's addresses into P: unless non-linear address arithmetic was involved, the dumps BB_N and BB_M differ only where GPU memory stores addresses of GPU memory, the difference being exactly N - M
- The size of the difference can be used to validate our assumptions about the dumps
- Starting from the BB_N dump, rewrite the differing words (the addresses in the Nth GB) into addresses in the Pth GB
- Now we are ready to run the kernel (a relocation sketch follows below)
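A sketch of the relocation step, assuming device pointers are stored on 8-byte-aligned boundaries so the dumps can be compared word by word. Any differing word whose delta is not exactly N - M signals that the determinism assumption failed:

    #include <cstdint>
    #include <vector>

    // Patch BB_N in place so it can be loaded at the new base P. Words
    // that match across the two dumps are plain data; words that differ
    // by exactly N - M are taken to be device pointers and are rebased
    // from N to P. The return value counts unexplained differences:
    // non-zero means the determinism (or linear address arithmetic)
    // assumption was violated and the capture should be rejected.
    size_t relocate(std::vector<uint64_t>& bbN,        // dump captured at base N
                    const std::vector<uint64_t>& bbM,  // dump captured at base M
                    uint64_t M, uint64_t N, uint64_t P) {
        size_t unexplained = 0;
        for (size_t i = 0; i < bbN.size(); ++i) {
            if (bbN[i] == bbM[i]) continue;            // plain data
            if (bbN[i] - bbM[i] == N - M)
                bbN[i] += P - N;                       // pointer: rebase into P
            else
                ++unexplained;
        }
        return unexplained;
    }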
Summary
- If a number of preconditions hold, we can automatically create standalone CUDA test cases out of our auto-generated kernels
- Surprisingly, the preconditions hold for us ~99% of the time
- A snapshot of our production 64-bit GPU codes yields ~100GB (uncompressed) of standalone tests, based on hundreds of trades
- The tests can be shipped outside JP Morgan without sharing the proprietary quant library
- Refreshed tests are to be shipped to NVIDIA (pending internal clearance)