Accelerating Financial Applications on the GPU Scott Grauer-Gray - PowerPoint PPT Presentation

  1. Accelerating Financial Applications on the GPU
     Scott Grauer-Gray, William Killian, Robert Searles, John Cavazos
     Department of Computer and Information Science, University of Delaware
     Sixth Workshop on General Purpose Processing Using GPUs

  2. Outline
     1 Introduction: QuantLib and Financial Applications; Directive-Based Acceleration
     2 Experiment Setup: Source Code Modifications; Compilation; Execution Environment
     3 Application Results: NVIDIA K20 Results; Alternate Architectures
     4 Auto-Tuning: Framework; Results
     5 Conclusion: Future Work; Final Notes

  4. QuantLib and Financial Applications
     QuantLib: open-source library for quantitative finance, written in C++.
     Contains various financial models and methods.
     Models: yield curves, interest rates, volatility.
     Methods: analytic formulae, finite difference, Monte-Carlo.
     The financial applications optimized are particular code paths in QuantLib.

  5. QuantLib and Financial Applications: Financial Applications
     Four financial applications selected for parallelization:
     Application | Description | Precision
     Black-Scholes | Option pricing using Black-Scholes-Merton pricing | Single
     Monte-Carlo | Pricing of a single option using QMB (Sobol) Monte-Carlo method | Single
     Bonds | Bond pricing using a fixed-rate bond with a flat forward-curve | Double
     Repo | Repurchase agreement pricing of securities which are sold and bought back later | Double
     Each application is data-parallelized.
     The algorithm for each application is parallelized where possible.

  6. Directive-Based Acceleration: Overview on Directive-Based Acceleration
     Syntax comparable to OpenMP.
     Annotates what code should run on an accelerator.
     Focuses on highlighting parallelism of code.
     Preserves serial implementation of code.
     Simplifies interaction between scientists and programmers.
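As a small illustration of how a directive preserves the serial implementation, the sketch below annotates an ordinary C loop with a standard OpenMP directive. The function `discount_flows` and its arguments are hypothetical and are not taken from the talk's code.

```c
#include <stddef.h>

/* Multiply each cash flow by its discount factor; iterations are independent. */
void discount_flows(size_t n, const double *flow, const double *factor,
                    double *pv)
{
    /* The directive is the only change to the serial loop. A compiler
       without OpenMP enabled ignores it and runs the loop sequentially. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        pv[i] = flow[i] * factor[i];
}
```

The loop body is identical with or without the annotation, which is what lets a domain expert keep reading and maintaining the serial code.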

  7. Directive-Based Acceleration: Directive-Based Programming Languages
     OpenACC
       Joint collaboration between CAPS Entreprise, CRAY, PGI, and NVIDIA.
       Directive syntax near identical to OpenMP with added data clauses.
       Introduces a kernel directive that drives compiler-assisted parallelization.
     HMPP
       Originally developed by CAPS Entreprise.
       Fundamental execution unit is a codelet.
       Provides fine-grain control for optimizations.

  9. Source Code Modifications
     Implementations derived from sequential C code.
     Code flattening: QuantLib C++ code paths ⇒ sequential C code paths.
     Argument passing: structure of arrays.
     Verification: compared all results to the original QuantLib code.
     All results agreed with the original to within 10^-3.
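A rough sketch of the two ideas on this slide, structure-of-arrays argument passing and checking results against a reference, might look as follows. The struct fields, the function names, and the choice of an absolute (rather than relative) difference for the 10^-3 check are all assumptions made for illustration.

```c
#include <math.h>
#include <stddef.h>

/* Structure of arrays: one contiguous array per field (hypothetical fields),
   instead of one array of per-bond structs. */
typedef struct {
    size_t n;
    const double *face_value;
    const double *coupon_rate;
} BondInputs;

/* Toy per-element computation over the structure of arrays. */
void annual_coupons(const BondInputs *in, double *coupon)
{
    for (size_t i = 0; i < in->n; ++i)
        coupon[i] = in->face_value[i] * in->coupon_rate[i];
}

/* Returns 1 if every result is within tol of its reference value, else 0.
   Assumption: the comparison uses the absolute difference. */
int results_match(size_t n, const double *result, const double *reference,
                  double tol)
{
    for (size_t i = 0; i < n; ++i)
        if (fabs(result[i] - reference[i]) > tol)
            return 0;
    return 1;
}
```

With one array per field, consecutive loop iterations (or GPU threads) read consecutive memory locations. A call such as `results_match(n, gpu_out, reference_out, 1e-3)` then expresses the verification tolerance stated on the slide.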

  10. Source Code Modifications: Code Flattening
      // C++ code:
      struct C {
          int x;
          void addFour() { x += 4; }
      };
      struct B {
          virtual void foo() = 0;
      };
      struct A : public B {
          C myObj;
          virtual void foo() { myObj.addFour(); }
      };
      A inst;
      inst.foo();

      // flattened code:
      int inst_x;
      inst_x += 4;

      // Alternative flattening:
      int addFour (int x) { return x + 4; }
      int inst_x;
      inst_x = addFour (inst_x);

  11. Compilation
      Host code compiled with GCC 4.7.0:
        -O2 flag used for serial
        -O3 -march=native flags used for OpenMP
      OpenACC and HMPP compiled with HMPP Workbench 3.2.1.
      CUDA compiled with CUDA 5 Toolkit.
      OpenCL used NVIDIA driver version 304.54.

  12. Compilation: Compile Workflow Using HMPP Workbench
      HMPP Workbench used for HMPP and OpenACC code compilation.
      Target CUDA and OpenCL code generation.

  13. Execution Environment
      CPU: Dual Xeon X5530 (Quad-Core @ 2.40GHz) with 24GB DDR3-1066 ECC RAM
      GPU: NVIDIA K20c (2496 CUDA Cores @ 706MHz) with 5GB GDDR5 2.6GHz ECC RAM
      NOTE: Also ran all experiments on NVIDIA C2050
      Auto-Tuning Targets:
      NVIDIA GPU | Architecture | CUDA Cores
      NVIDIA C1060 | Tesla | 240
      NVIDIA C2050 | Fermi | 448
      NVIDIA GTX 670 | Kepler GK104 | 1344
      NVIDIA K20c | Kepler GK110 | 2496

  15. NVIDIA K20 Results: Black-Scholes, K20 Results
      [Figure: two plots, "CUDA Results" and "OpenCL Results". x-axis: Number of Options; y-axis: Speedup over Sequential (ticks 10^0 to 10^2). Series in the CUDA plot: OpenMP, CUDA, HMPP, OpenACC. Series in the OpenCL plot: OpenMP, OpenCL, HMPP, OpenACC.]

  16. NVIDIA K20 Results: Black-Scholes, K20 Results
      CUDA outperformed OpenCL on NVIDIA K20:
        461x speedup for CUDA
        446x speedup for OpenCL
      HMPP and OpenACC targeting the same language achieved near-identical speedup.
      HMPP and OpenACC targeting OpenCL were faster than targeting CUDA:
        369x speedup for CUDA
        380x speedup for OpenCL

  17. NVIDIA K20 Results: Monte-Carlo, K20 Results
      [Figure: two plots, "CUDA Results" and "OpenCL Results". x-axis: Number of Samples; y-axis: Speedup over Sequential (ticks 10^1 to 10^3). Series in the CUDA plot: OpenMP, CUDA, HMPP, OpenACC. Series in the OpenCL plot: OpenMP, OpenCL, HMPP, OpenACC.]
      Random Number Generation:
        C/OpenMP: rand
        CUDA: cuRand
        HMPP/OpenACC/OpenCL: Mersenne Twister
      Dropoff in speedup for CUDA ⇒ cache misses

  18. NVIDIA K20 Results: Monte-Carlo, K20 Results
      Manual CUDA outperformed manual OpenCL: up to 1006x vs 180x.
      HMPP and OpenACC performed similarly.
      Targeting CUDA was faster than targeting OpenCL: up to 162x vs up to 130x.

  19. NVIDIA K20 Results: Bonds and Repo, K20 Results
      [Figure: two plots, "Bonds (CUDA)" and "Repo (CUDA)". x-axis: Number of Bonds and Number of Repos, respectively; y-axis: Speedup over Sequential. Series in both plots: OpenMP, CUDA, HMPP, OpenACC.]
      Problem: Generating OpenCL code from HMPP and OpenACC

  20. NVIDIA K20 Results: Bonds and Repo, K20 Results
      Bonds: up to 87.9x speedup.
      Repo: up to 94x speedup.
      HMPP and OpenACC versions produced near-identical execution times.
      HMPP and OpenACC versions ran within 2% of the execution time of manually written CUDA.
      Speedup flattened as problem size increased beyond 100,000 bonds and 2,000,000 repos.
