  1. Automated Creation of Tests from CUDA Kernels
     Oleg Rasskazov, Andrey Zhezherun, Antti Lamberg (JP Morgan)
     April 2016

  2. GPUs in JP Morgan
     - JP Morgan has been using GPUs extensively since 2011 to speed up risk calculations and reduce computational costs.
     - Speedup as of 2011: ~40x.
     - Large cross-asset quant library (C++, CUDA).
     - Monte Carlo and PDEs.
     - GPU code:
       - Hand-written CUDA kernels
       - Thrust
       - Auto-generated CUDA kernels
     - The hardest part of delivering GPUs to production: bugs.

  3. Auto-generating GPU code
     - Putting all of a quant library on the GPU is hard: parts of the code change frequently, so they would need to be rewritten over and over.
     - Domain-specific languages (DSLs) can help, either interpreted or compiled.
     - We auto-generate lots of GPU code. The auto-generator is simplistic, converting the DSL to a .cu file (a hypothetical example follows this list).
     - We rely on the CUDA compilers:
       - For optimizations
       - For understanding our horrible auto-generated .cu code
     - We need a regression test harness around it.
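The talk does not show the DSL itself, so the following is only a hypothetical illustration of what a simplistic DSL-to-.cu generator might emit: a payoff expression such as max(S - K, 0) lowered to a single elementwise kernel. The kernel name and parameters are invented.

```cuda
// Hypothetical output of a simplistic DSL-to-.cu generator: the DSL
// expression "max(S - K, 0)" lowered to one elementwise CUDA kernel.
__global__ void gen_payoff_0042(const double* __restrict__ S,
                                double K,
                                double* __restrict__ out,
                                int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmax(S[i] - K, 0.0);   // generated from the DSL expression
}
```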

  4. (Rare) compiler issues
     - Sources (and ways to notice them):
       - Driver upgrades
       - SDK upgrades
       - Hardware upgrades
     - Mitigation:
       - Hand-written code: modify the code to work around the issue.
       - Auto-generated code: ??? Modifying the generator is hard:
         - Complex code
         - Performance?
         - Backward compatibility?
       - Share an extensive set of our regression tests with NVIDIA.

  5. Pointers that we have a compiler issue
     - How to verify that the issue is not a bug in your own code (maybe it is, but):
       - Different behaviour on different cards
       - Different behaviour with different versions of CUDA
       - CPU/GPU code match (in some cases)
       - PTX inspection
     - Assume the issue can be reproduced by running a standalone kernel, i.e.:
       - No concurrent-execution issues
       - Nothing related to special objects allocated by the driver: stream data, local memory, etc.
     - => Create a small reproducer.

  6. Creating standalone kernel tests/reproducers
     - Capture:
       - The kernel code (the auto-generated .cu file)
       - The kernel inputs
       - The correct outputs
       - The GPU memory the kernel operates on
         - That memory state was created by a complex interaction of previous kernels and CPU calls.
         - We would like to be very generic at this point.
     - Replay (sketched after this list):
       - Restore the GPU memory. How? cudaMalloc does not let you choose the address range of a new allocation.
       - Compile and load the kernel.
       - Pass in the parameters and run.
       - Compare the outputs.
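A minimal sketch of the replay half, using the CUDA driver API. The KernelDump struct and the commented-out helpers are hypothetical stand-ins for whatever the capture step actually records; error checking is elided.

```cuda
#include <cuda.h>

// Hypothetical capture record; field names are illustrative, not the
// authors' actual format.
struct KernelDump {
    const char* module_path;   // PTX/cubin compiled from the captured .cu
    const char* kernel_name;   // entry point
    unsigned    grid[3], block[3], shared_bytes;
    void**      params;        // captured kernel arguments, already re-pointed
};

void replay(const KernelDump& d)
{
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    // 1. Restore the captured GPU memory image (slides 8-10 explain how
    //    the addresses are made to match again).
    // restore_memory(d);                        // hypothetical helper

    // 2. Compile/load the kernel and look up its entry point.
    CUmodule mod;  CUfunction fn;
    cuModuleLoad(&mod, d.module_path);
    cuModuleGetFunction(&fn, mod, d.kernel_name);

    // 3. Pass in the captured parameters and run.
    cuLaunchKernel(fn, d.grid[0], d.grid[1], d.grid[2],
                   d.block[0], d.block[1], d.block[2],
                   d.shared_bytes, /*stream=*/0, d.params, /*extra=*/0);
    cuCtxSynchronize();

    // 4. Compare against the captured correct outputs.
    // compare_outputs(d);                       // hypothetical helper
}
```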

  7. Why dumping/restoring memory is hard
     - Dump an array from GPU memory: the restored array can land at a different address.
       - That is fine as long as we know every pointer to the array and can re-point it to the new allocation.
       - But what if we had an array of pointers to objects? Complex data structures? (See the sketch after this list.)
     - Ideally we would snapshot/restore the whole current state of GPU memory.
       - There is no public API from NVIDIA for that.
       - The problem is hard because the driver keeps "private" memory that depends on the kernels loaded, local memory configuration, etc.
     - We came up with a set of tricks.
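To make the relocation problem concrete, here is a small illustration; the struct is ours, not from the talk.

```cuda
// Illustration of the relocation problem (this struct is ours, not the
// talk's). A byte-for-byte dump of a Curve preserves 'values' verbatim,
// so after restoring at a different address the pointer still refers to
// the OLD allocation.
struct Curve {
    int     n;
    double* values;   // device pointer into a separate allocation
};
// Restoring is only safe if every such embedded pointer is found and
// re-pointed, or if every allocation comes back at exactly the same
// virtual address, which is what the tricks on the next slides arrange.
```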

  8. 32-bit GPU code: restoring/dumping memory
     - Assume the GPU memory in use fits into 2GB.
     - Intercept GPU memory allocations in your code and replace them with a custom allocator that serves from a preallocated blob:
       - Allocate a 3GB block of GPU memory, BB.
       - On 32 bit we have a 4GB address space, so a 3GB block always covers the virtual address range 1GB-3GB.
       - Custom-allocate from the 1GB mark inside BB.
     - Dump all the custom-allocated memory starting from address 1GB.
     - Replay simply allocates 3GB again, is again guaranteed to cover the range 1GB-3GB, and loads the dump back starting at address 1GB.
     - All internal data pointers are guaranteed to work because the addresses are exactly the same. (A sketch of the allocator follows this list.)
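A minimal sketch of the 32-bit trick, assuming a 32-bit process (so a contiguous 3GB device block must cover virtual addresses [1GB, 3GB)); the function names are ours.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

static const size_t GB = size_t(1) << 30;
static char* g_next;                    // bump pointer of the custom allocator

void init_blob()                        // call once, in capture and in replay
{
    void* bb;
    cudaMalloc(&bb, 3 * GB);            // BB: guaranteed to contain [1GB, 3GB)
    g_next = (char*)(uintptr_t)GB;      // start serving from address 1GB
}

void* my_device_malloc(size_t bytes)    // intercepts the app's allocations
{
    char* p = g_next;
    g_next += (bytes + 255) & ~size_t(255);   // keep 256-byte alignment
    return p;
}

// Dump: copy [1GB, g_next) to the host. Restore: after init_blob(), copy
// the dump back to address 1GB; every internal pointer is still valid
// because every address is exactly the same as in the capture run.
void dump_to(void* host_buf)
{
    char* base = (char*)(uintptr_t)GB;
    cudaMemcpy(host_buf, base, g_next - base, cudaMemcpyDeviceToHost);
}
```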

  9. 64-bit GPU code: dumping memory
     - Assume the GPU memory for the kernel fits into 1GB.
     - Intercept GPU memory allocations in your code and replace them with a custom allocator from a preallocated blob.
     - Assume we do not store pointers back to CPU memory on the GPU.
     - Run 1:
       - Allocate a 2GB block of GPU memory, BB.
       - BB contains at least one range M that starts on a 1GB boundary and is 1GB in size.
       - Use the custom allocator starting from M and, just before running the kernel, dump the GPU memory as BB_M.
     - Run 2:
       - Repeat run 1, but with the 1GB address range starting at N, N != M, and dump the GPU memory as BB_N. (A sketch of one run follows this list.)
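A sketch of the setup for one capture run (names are ours). Any contiguous 2GB block contains at least one 1GB-aligned, 1GB-sized subrange, found by rounding the block start up to the next 1GB boundary.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

static const uintptr_t GB = uintptr_t(1) << 30;

char* setup_run()
{
    char* bb;
    cudaMalloc((void**)&bb, 2 * GB);
    cudaMemset(bb, 0, 2 * GB);          // both runs must start from zeroed BB
    // Round the block start up to the next 1GB boundary: this is M in
    // run 1 and, if the driver places BB elsewhere, N in run 2.
    return (char*)(((uintptr_t)bb + GB - 1) & ~(GB - 1));
}

// The application then runs deterministically, suballocating from this
// base, and dumps [base, base + 1GB) to the host just before launching
// the kernel: BB_M in run 1, BB_N in run 2.
```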

  10. 64-bit GPU code: restoring memory
      - Allocate a 2GB block of GPU memory, BB, and find a 1GB range P inside it that starts on a 1GB boundary.
      - Assume the code paths of run 1 and run 2 of the application were deterministic and identical, and that the preallocated BB was zeroed in both runs.
      - Relocate BB_N's addresses into P:
        - Unless non-linear address arithmetic was involved, the dumps BB_N and BB_M differ only where the GPU memory stores addresses of GPU memory, and the difference is exactly N - M.
        - The size of the difference can be used to validate our assumptions about the dumps.
        - Starting from the BB_N dump, rewrite the differing words (the addresses in the Nth GB) into addresses in the Pth GB.
      - Now we are ready to run the kernel. (A relocation sketch follows this list.)
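A relocation sketch under the slide's assumptions (deterministic runs, zeroed BB, linear address arithmetic only), additionally assuming that device pointers are stored as aligned 64-bit words; the function is ours.

```cuda
#include <cstdint>
#include <cstddef>

// Walk the two dumps as 64-bit words. Where they agree, the word is
// ordinary data; where they differ by exactly N - M, it is a stored
// device pointer and gets rebased into the replay range P. Any other
// difference means one of the assumptions was violated.
bool relocate(const uint64_t* bb_m, const uint64_t* bb_n, uint64_t* out,
              uint64_t M, uint64_t N, uint64_t P, size_t words)
{
    for (size_t i = 0; i < words; ++i) {
        if (bb_n[i] == bb_m[i])
            out[i] = bb_n[i];                 // plain data: keep as-is
        else if (bb_n[i] - bb_m[i] == N - M)
            out[i] = bb_n[i] - N + P;         // pointer in Nth GB -> Pth GB
        else
            return false;                     // dumps disagree unexpectedly
    }
    return true;                              // 'out' can now be copied to P
}
```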

  11. Summary
      - If a number of preconditions hold, we can automatically create standalone CUDA test cases out of our auto-generated kernels.
      - Surprisingly, the preconditions hold for us roughly 99% of the time.
      - 100GB worth of standalone tests (uncompressed) from a snapshot of our production:
        - 64-bit GPU code
        - Based on hundreds of trades
        - Can be shipped outside of JPMorgan without sharing the proprietary quant library.
      - Refreshed tests are to be shipped to NVIDIA (pending internal clearance).
