GPU Accelerated Machine Learning for Bond Price Prediction
Venkat Bala, Rafael Nicolas Fermin Cota
Motivation

Primary Goals
• Demonstrate potential benefits of using GPUs over CPUs for machine learning
• Exploit inherent parallelism to improve model performance
• Real-world application using a bond trade dataset
Highlights

Ensemble
• Bagging: train independent regressors on equal-sized bags of samples (a minimal sketch follows below)
• Generally, performance is superior to that of any single individual regressor
• Scalable: each individual model can be trained independently and in parallel

Hardware Specifications
• CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
• GPU: GeForce GTX 1080 Ti
• RAM: 1 TB (DDR4 2400 MHz)
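The bagging bullet can be made concrete with a small sketch. This is illustrative only, not the deck's implementation: Regressor and MeanRegressor are hypothetical stand-ins for the real GBT/DNN base learners, and the OpenMP pragma reflects the claim that each bag can be trained independently and in parallel.

#include <memory>
#include <numeric>
#include <random>
#include <vector>

// Generic interface standing in for any base regressor (GBT, DNN, ...).
struct Regressor {
    virtual void fit(const std::vector<float>& X, const std::vector<float>& y) = 0;
    virtual float predict(float x) const = 0;
    virtual ~Regressor() = default;
};

// Trivial placeholder model, used only to keep the sketch self-contained.
struct MeanRegressor : Regressor {
    float mean = 0.0f;
    void fit(const std::vector<float>&, const std::vector<float>& y) override {
        mean = std::accumulate(y.begin(), y.end(), 0.0f) / y.size();
    }
    float predict(float) const override { return mean; }
};

// Train B regressors on equal-sized bootstrap bags, one bag per model, in parallel.
std::vector<std::unique_ptr<Regressor>>
train_bagged(const std::vector<float>& X, const std::vector<float>& y, int B) {
    std::vector<std::unique_ptr<Regressor>> models(B);
    #pragma omp parallel for
    for (int b = 0; b < B; ++b) {
        std::mt19937 rng(b);                                     // per-bag seed
        std::uniform_int_distribution<size_t> pick(0, y.size() - 1);
        std::vector<float> Xb(y.size()), yb(y.size());
        for (size_t i = 0; i < y.size(); ++i) {                  // sample with replacement
            size_t j = pick(rng);
            Xb[i] = X[j];
            yb[i] = y[j];
        }
        auto m = std::make_unique<MeanRegressor>();
        m->fit(Xb, yb);
        models[b] = std::move(m);
    }
    return models;
}

// Ensemble prediction: average the individual regressors' predictions.
float predict_ensemble(const std::vector<std::unique_ptr<Regressor>>& models, float x) {
    float sum = 0.0f;
    for (const auto& m : models) sum += m->predict(x);
    return sum / models.size();
}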
Bond Trade Dataset

Feature Set
• 100+ features per trade
• Trade Size/Historical Features
• Coupon Rate/Time to Maturity
• Bond Rating
• Trade Type: Buy/Sell
• Reporting Delays
• Current Yield/Yield to Maturity

Response
• Trade Price
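For illustration, a handful of the listed features could be carried in a plain record like the one below. The field names and types are assumptions made for the sketch, not the dataset's actual schema, which holds 100+ features per trade.

#include <string>

// Hypothetical, simplified trade record covering a few of the listed features.
struct BondTrade {
    double trade_size;          // notional traded
    double coupon_rate;         // bond coupon
    double time_to_maturity;    // years remaining to maturity
    std::string bond_rating;    // credit rating bucket
    char trade_type;            // 'B' = buy, 'S' = sell
    double reporting_delay;     // delay between execution and reporting
    double current_yield;
    double yield_to_maturity;
    double trade_price;         // response variable
};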
Modeling Approach
The Machine Learning Pipeline

Accelerate each stage in the pipeline for maximum performance.

[Pipeline diagram: data processing → model building (training set) → evaluate (CV/test set) → deploy]
Data Preprocessing

• Important stage in the pipeline (garbage in → garbage out)

Exposing Data Parallelism
• Many models rely on input data being on the same scale
• Standardization, log transformations, imputations, polynomial/non-linear feature generation, etc.
• In most cases there is no data dependence, so each operation can be executed independently
• Significant speedups can be obtained using GPUs, given sufficient data/computation
Data Preprocessing: Sequential Approach

Apply function F(·) sequentially to each element in a feature column.

[Diagram: F(·) visits a_0, a_1, a_2, ..., a_N one element at a time]
Data Preprocessing: Parallel Approach

Apply function F(·) in parallel to each element in a feature column.

[Diagram: F(·) is applied to a_0, a_1, ..., a_N simultaneously, producing b_0, b_1, ..., b_N]
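A minimal CUDA sketch of this pattern, assuming the column already fits in a single device buffer: one thread handles one element and applies F(·) (here the log transform used later in the toy example). Names and launch configuration are illustrative, not the deck's production code.

#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// b_i = F(a_i): each thread transforms exactly one element of the feature column.
__global__ void log_transform(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
    if (i < n) out[i] = logf(1.0f + in[i]);          // log1p-style transform
}

int main() {
    const int n = 1 << 20;                           // N = 2^p elements (p = 20)
    std::vector<float> h(n, 2.0f);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;                         // threads per block
    const int blocks  = (n + threads - 1) / threads; // roughly one thread per element
    log_transform<<<blocks, threads>>>(d_in, d_out, n);

    cudaMemcpy(h.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}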
Programming Details

Implementation Basics
• Task is embarrassingly parallel
• Improve CPU code performance
  • Auto-vectorization + compiler optimizations
  • Using performance libraries (Intel MKL)
  • Adopting threaded (OpenMP)/distributed computing (MPI) approaches
• Great application case for GPUs
  • Offload computations onto the GPU via CUDA kernels
  • Launch as many threads as there are data elements
  • Launch several kernels concurrently using CUDA streams (see the stream sketch below)
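The concurrent-kernels bullet might look roughly like the sketch below: two independent feature columns are preprocessed in separate CUDA streams so their kernels can overlap on the device. Kernel bodies, column names and the two-stream count are illustrative assumptions.

#include <cuda_runtime.h>
#include <cmath>

__global__ void log_col(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = logf(1.0f + x[i]);             // skew-reducing transform
}

__global__ void standardize_col(float* x, float mean, float stddev, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = (x[i] - mean) / stddev;        // zero-mean, unit-variance scaling
}

// Launch independent per-column kernels in their own streams so they can overlap.
void preprocess_async(float* d_trade_size, float* d_price,
                      float price_mean, float price_std, int n) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    log_col        <<<blocks, threads, 0, s0>>>(d_trade_size, n);
    standardize_col<<<blocks, threads, 0, s1>>>(d_price, price_mean, price_std, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}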
Toy Example: Speedup Over Sequential C++

• Log transformation of an array of floats
• N = 2^p elements, p = log2(N)

[Chart: speedup over sequential C++ for vectorized C++ and CUDA, p = 18 to 23]
Bond Dataset Preprocessing

Applied Transformations
• Log transformation of highly skewed features (trade size, time to maturity)
• Standardization (trade price & historical prices)
• Missing value imputation
• Winsorizing features to handle outliers (sketched below)
• Feature generation (price differences, yield measurements)

Implementation Details
• CPU: C++ implementation using Intel MKL/Armadillo
• GPU: CUDA
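As one example of the listed transformations, winsorization reduces to an elementwise clamp once the percentile bounds are known. The kernel below is a sketch under that assumption; the bounds (e.g. 1st/99th percentiles) would be computed beforehand on the host or with a device reduction, and the column name is hypothetical.

#include <cuda_runtime.h>

// Clamp each element of a feature column into [lo, hi] to tame outliers.
__global__ void winsorize(float* x, float lo, float hi, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fminf(fmaxf(x[i], lo), hi);
}

// Hypothetical launch, assuming d_yield is already resident on the GPU:
//   const int threads = 256;
//   const int blocks  = (n + threads - 1) / threads;
//   winsorize<<<blocks, threads>>>(d_yield, p01, p99, n);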
GPU Speedup Over CPU Implementation

• Nearly 10x speedup obtained after CUDA optimizations

[Chart: speedup over CPU for unoptimized vs optimized CUDA, p = 20 to 25]
CUDA Optimizations

Standard Tricks
• Concurrent execution of kernels using CUDA streams to maximize GPU utilization
• Use of optimized libraries such as cuBLAS/Thrust
• Coalesced memory access
• Maximizing memory bandwidth for low arithmetic intensity operations
• Caching using GPU shared memory (see the reduction sketch below)
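As an example of the shared-memory and coalesced-access bullets, the block reduction below caches a tile of a column in shared memory and collapses it to one partial sum per block, the kind of building block behind computing a column mean for standardization. It is a generic textbook pattern, not the deck's exact code.

#include <cuda_runtime.h>

// One partial sum per block; consecutive threads read consecutive elements (coalesced),
// and the per-block tile is reduced in fast shared memory rather than global memory.
__global__ void column_sum(const float* __restrict__ x, float* block_sums, int n) {
    extern __shared__ float cache[];                 // dynamic shared memory, blockDim.x floats
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    cache[tid] = (i < n) ? x[i] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = cache[0];
}

// Hypothetical launch (blockDim.x must be a power of two for this reduction):
//   column_sum<<<blocks, 256, 256 * sizeof(float)>>>(d_col, d_partials, n);
//   // finish by summing d_partials on the host or with a second kernel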
Model Building
Ensemble Model

Model Choices
• GBT: XGBoost
• DNN: TensorFlow/Keras

[Diagram: GBT and DNN models combined into an ensemble model]
Hyperparameter Tuning: Hyperopt

GBT: XGBoost
• Learning rate
• Max depth
• Minimum child weight
• Subsample, colsample-bytree
• Regularization parameters

DNN: MLPs
• Learning rate/decay rate
• Batch size
• Epochs
• Hidden layers/layer width
• Activations/dropouts
Hyperparameter Tuning: Hyperopt

[Chart: learning rate values sampled by Hyperopt over 1000 iterations]
XGBoost: Training & Hyperparameter Optimization Time

GBT speedup ≈ 3x on the GPU (GTX 1080 Ti) vs the CPU (Intel(R) Xeon(R) E5-2699, 32 cores).

[Chart: average training time (hours), CPU vs GPU]
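The deck does not show how the GPU run was configured; as an assumption, XGBoost releases of that era exposed GPU training through the gpu_hist tree method, which the C-API sketch below enables on a tiny synthetic matrix. Parameter values are placeholders, not the tuned configuration.

#include <xgboost/c_api.h>

int main() {
    // Tiny synthetic design matrix: 4 rows x 2 features, plus labels.
    float data[8]   = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f, 8.f};
    float labels[4] = {10.f, 20.f, 30.f, 40.f};

    DMatrixHandle dtrain;
    XGDMatrixCreateFromMat(data, 4, 2, -1.0f, &dtrain);
    XGDMatrixSetFloatInfo(dtrain, "label", labels, 4);

    BoosterHandle booster;
    XGBoosterCreate(&dtrain, 1, &booster);
    XGBoosterSetParam(booster, "tree_method", "gpu_hist");   // assumption: GPU histogram algorithm
    XGBoosterSetParam(booster, "max_depth", "6");             // placeholder hyperparameters
    XGBoosterSetParam(booster, "eta", "0.1");

    for (int iter = 0; iter < 100; ++iter)
        XGBoosterUpdateOneIter(booster, iter, dtrain);        // one boosting round per call

    XGBoosterFree(booster);
    XGDMatrixFree(dtrain);
    return 0;
}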
TensorFlow/Keras: Time Per Epoch

DNN speedup ≈ 3x on the GPU (GTX 1080 Ti) vs the CPU (Intel(R) Xeon(R) E5-2699, 32 cores).

[Chart: time per epoch (s) for p = 15 to 18, GPU vs CPU]
Model Test Set Performance

Test set R²: 0.9858

[Scatter plot: valid trade price vs predicted trade price]
Summary
Summary

Final Remarks
• Maximum performance when GPUs are incorporated into every stage of the pipeline
• Ensembles: bagging/boosting to improve model accuracy/throughput
• Shorter training times allow more experimentation
• Extensive support available
• Deploying this pipeline now on our in-house DGX-1
• Leveraging GPU computational power → dramatic speedups
Questions?