Burning on the GPU: Fast and Accurate Chemical Kinetics
Russell Whitesides
GPU Technology Conference, Session 6195, April 7, 2016
Funded by: U.S. Department of Energy Vehicle Technologies Program. Program Managers: Gurpreet Singh and Leo Breton.
LLNL-PRES-687782. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.
Why? To make it go faster?
Why?
• Transportation efficiency
• Chemistry is vital to predictive simulations
• Chemistry can be > 90% of simulation time
We burn a lot of gasoline.
Why?
§ Supercomputing @ DOE labs: strong investment in GPUs with an eye toward exascale
§ OEM engine designers: require fast turnaround on desktop-class hardware
National-lab compute power and industry need.
“Typical” engine simulation with detailed chemistry
[Figure: simulation snapshots of the temperature and O2 mass fraction (Y_O2) fields]
“Colorful Fluid Dynamics”
Detailed Chemistry in Reacting Flow CFD
Operator splitting technique: solve an independent set of ordinary differential equations (ODEs) in each cell to calculate the chemical source terms for the species and energy advection/diffusion equations.
[Diagram: cells advancing from t to t + ∆t]
Each cell is treated as an isolated system for chemistry.
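The chemistry sub-step of the operator split can be sketched as a loop over decoupled cells. Everything here is illustrative: `CellState`, `integrate_cell`, and the toy single-reaction kinetics are hypothetical stand-ins for the real stiff ODE solver, not the actual implementation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Per-cell thermochemical state (mass fractions Y_k plus temperature).
struct CellState {
    std::vector<double> mass_fractions;
    double temperature;
};

// Hypothetical single-cell ODE solve, standing in for a real stiff
// integrator. Toy model: first-order fuel consumption with a fixed
// heat release per unit mass burned, integrated exactly over dt.
void integrate_cell(CellState& cell, double dt) {
    const double k = 1.0;  // toy rate constant [1/s]
    double dY = cell.mass_fractions[0] * (1.0 - std::exp(-k * dt));
    cell.mass_fractions[0] -= dY;     // fuel consumed
    cell.temperature += 1000.0 * dY;  // toy heat release [K per unit Y]
}

// Chemistry sub-step of the operator-split CFD step: each cell is
// advanced independently, since cells are decoupled during chemistry.
void chemistry_substep(std::vector<CellState>& cells, double dt) {
    for (CellState& cell : cells) {
        integrate_cell(cell, dt);
    }
}
```

The key structural point is that the loop body touches only one cell's state, which is what makes both per-cell CPU solves and batched GPU solves possible.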
CPU (un-coupled) chemistry integration
[Diagram: cells advancing from t to t + ∆t]
Each cell is treated as an isolated system for chemistry.
GPU (batched) chemistry integration
[Diagram: cells advancing from t to t + ∆t in batches]
On the GPU we solve chemistry in batches of cells simultaneously.
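One common way to batch is lockstep integration, where every reactor in a batch takes the same internal time steps. A minimal sketch, with a toy exponential-decay "reactor" standing in for the real chemistry (all names are hypothetical, and the real solver would use adaptive stiff steps rather than a fixed h):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Toy reactor: one state variable and its current internal time.
struct Reactor {
    double y;  // toy state variable
    double t;  // current internal time
};

// Advance a whole batch in lockstep until all reactors reach t_end.
// Returns the number of internal steps taken; because the batch
// steps together, every reactor pays for the same step count.
int integrate_batch(std::vector<Reactor>& batch, double t_end, double h) {
    int steps = 0;
    auto any_unfinished = [&]() {
        for (const Reactor& r : batch)
            if (r.t < t_end) return true;
        return false;
    };
    while (any_unfinished()) {
        for (Reactor& r : batch) {
            if (r.t < t_end) {
                r.y *= std::exp(-h);  // toy update for one step
                r.t += h;
            }
        }
        ++steps;
    }
    return steps;
}
```

Because the whole batch shares step sizes, one stiff reactor can force small steps on everyone, which is exactly the batching penalty discussed later in the talk.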
Previously at GTC
See also Whitesides & McNenly, GTC 2015; McNenly & Whitesides, GTC 2014.
Note: most CFD simulations are done on distributed-memory systems
[Diagram: eight MPI ranks (rank0 through rank7), each running on a CPU]
n_gpu = 0;
Note: most CFD simulations are done on distributed-memory systems
[Diagram: eight MPI ranks (rank0 through rank7), each running on a CPU]
++n_gpu; //now what?
Ideal CPU-GPU Work-sharing
S_GPU = walltime(CPU) / walltime(GPU)
Here CPU is a single core.
Ideal CPU-GPU Work-sharing
S_GPU = walltime(CPU) / walltime(GPU)
S_total = (N_CPU + N_GPU (S_GPU − 1)) / N_CPU
[Figure: ideal S_total vs. N_GPU (1 to 4) for N_CPU = 4, 8, 16, and 32 at S_GPU = 8, with markers for TITAN (S_total = 1.4375) and surface (S_total = 1.8750)]
§ # CPU cores = N_CPU
§ # GPU devices = N_GPU
Let's make use of the whole machine.
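The ideal-speedup model above can be written directly in code. The model assumes one CPU core is dedicated to driving each GPU, so each GPU adds (S_GPU − 1) cores' worth of extra throughput:

```cpp
#include <cassert>

// Ideal whole-machine speedup relative to N_CPU cores:
//   S_total = (N_CPU + N_GPU * (S_GPU - 1)) / N_CPU
// where S_GPU is the single-core-to-GPU walltime ratio.
double ideal_speedup(int n_cpu, int n_gpu, double s_gpu) {
    return (n_cpu + n_gpu * (s_gpu - 1.0)) / n_cpu;
}
```

With N_CPU = 16 and S_GPU = 8, one GPU per node gives 1.4375 and two give 1.8750, consistent with the TITAN and surface markers on the slide's plot.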
Good performance in simple case with both CPU and GPU doing work
[Figure: chemistry time (seconds, log scale from 100 to 10,000) vs. number of processors (1 to 16) for CPU chemistry and GPU chemistry with standard work sharing]
Distribute based on number of cells and give more to the GPU.
Good performance in simple case with both CPU and GPU doing work
[Figure: chemistry time (seconds, log scale from 100 to 10,000) vs. number of processors (1 to 16) for CPU chemistry, GPU chemistry with standard work sharing, and GPU chemistry with custom work sharing; custom work sharing reaches S_total = 1.7, S_GPU = 7 (S_GPU = 6.6)]
Distribute based on number of cells and give more to the GPU.
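One way to realize "give more to the GPU" is to split cells in proportion to expected throughput. This sketch assumes the GPU is worth s_gpu single cores and that the cores driving GPUs do no CPU chemistry; both the function name and that policy are illustrative assumptions, not the talk's exact scheme.

```cpp
#include <cassert>

// Split n_cells so CPU cores and GPUs finish at about the same time.
// Throughput model: (n_cpu - n_gpu) pure-CPU cores at rate 1 each,
// plus n_gpu GPUs at rate s_gpu each (driver cores excluded).
int cells_for_gpu(int n_cells, int n_cpu, int n_gpu, double s_gpu) {
    double total_rate = (n_cpu - n_gpu) + n_gpu * s_gpu;
    double gpu_share = (n_gpu * s_gpu) / total_rate;
    return static_cast<int>(n_cells * gpu_share + 0.5);  // round to nearest
}
```

For example, with 16 cores, 2 GPUs, and s_gpu = 8, slightly more than half the cells go to the GPUs.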
First attempt @ engine calculation on GPU+CPU
Let's go!
First attempt @ engine calculation on GPU+CPU
§ 2x Xeon E5-2670 (16 cores) => 21.2 hours
§ 2x Xeon E5-2670 + 2 Tesla K40m => 17.6 hours (S_GPU = 2.6)
§ S_total = 21.2/17.6 = 1.20
What happened?
Integrator performance when doing batch solution vs. solving reactors individually
If the systems are not similar, how much extra work needs to be done?
Batches of dissimilar reactors will suffer from excessive extra steps
What penalty do we pay when batching?
Batches of dissimilar reactors will suffer from excessive extra steps
Possibly a lot of extra steps.
Sort reactors by how many steps they took to solve on the last CFD step
[Diagram: reactors binned by n_steps, from 1 to >100, into batch0 through batch3]
Easy as pie?
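The sorting step might look like the following sketch. `ReactorCost`, the fixed batch_size cut, and the use of last step's count as the cost estimate are illustrative choices, not the actual data structures:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Per-reactor record: which cell it is and how many internal steps
// its ODE solve took on the previous CFD step.
struct ReactorCost {
    int cell_id;
    int last_n_steps;
};

// Sort reactors by previous step count, then cut into fixed-size
// batches so each batch holds reactors of similar stiffness and
// lockstep integration wastes few steps.
std::vector<std::vector<ReactorCost>>
make_batches(std::vector<ReactorCost> reactors, std::size_t batch_size) {
    std::sort(reactors.begin(), reactors.end(),
              [](const ReactorCost& a, const ReactorCost& b) {
                  return a.last_n_steps < b.last_n_steps;
              });
    std::vector<std::vector<ReactorCost>> batches;
    for (std::size_t i = 0; i < reactors.size(); i += batch_size) {
        std::size_t end = std::min(i + batch_size, reactors.size());
        batches.emplace_back(reactors.begin() + i, reactors.begin() + end);
    }
    return batches;
}
```

The previous step count is a cheap cost proxy because stiffness usually changes slowly from one CFD step to the next for a given cell.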
Have to manage the sorting and load balancing in a distributed-memory system
[Diagram: reactors of varying cost scattered across eight MPI ranks (rank0 through rank7)]
Not so fast.
Load balance based on expected cost and expected performance.
[Diagram: reactors redistributed across the eight MPI ranks]
MPI communication to re-balance for chemistry.
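Balancing on expected cost and expected performance can be sketched as a greedy assignment (a longest-processing-time heuristic) with per-rank speeds, so GPU-backed ranks absorb proportionally more work. All names are hypothetical, and the actual MPI exchange of cell data is omitted:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Assign work items (e.g. reactor batches with predicted costs) to
// ranks so the most loaded rank is minimized. rank_speed[r] scales
// how fast rank r processes work (GPU ranks get larger speeds).
// Returns owner[i] = rank assigned to item i.
std::vector<int> assign_to_ranks(const std::vector<double>& costs,
                                 const std::vector<double>& rank_speed) {
    using Entry = std::pair<double, int>;  // (scaled load, rank)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (int r = 0; r < static_cast<int>(rank_speed.size()); ++r)
        heap.push({0.0, r});

    // Place big items first for a better greedy fit.
    std::vector<int> order(costs.size());
    for (std::size_t i = 0; i < order.size(); ++i)
        order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return costs[a] > costs[b]; });

    std::vector<int> owner(costs.size());
    for (int i : order) {
        auto [load, r] = heap.top();  // least loaded rank (time units)
        heap.pop();
        owner[i] = r;
        heap.push({load + costs[i] / rank_speed[r], r});
    }
    return owner;
}
```

After the assignment, each rank would send its off-rank cells' states via MPI, solve chemistry, and return the results, which is the re-balance communication the slide refers to.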
Second attempt @ engine calculation on GPU+CPU
Let's go again!
Total steps significantly reduced by batching appropriately
How much of a difference does it make?
Engine results with improved work-sharing and reactor sorting
[Chart: chemistry times of 13.0 hrs, 9.1 hrs (S_total = 1.7), and 7.6 hrs (S_GPU = 6.6)]
~40% reduction in chemistry time; ~36% reduction in overall time.
Future directions
§ Improve S_GPU
• Derivative kernels
• Matrix operations
§ Extrapolative integration methods
• Less "startup" cost when re-initializing
• Potentially well suited for GPU
§ Non-chemistry calculations on GPU
• Multi-species transport
• Particle spray
Possibilities for significant further improvements.
Summary
§ Much improved CFD chemistry work-sharing with GPU
§ ~40% reduction in chemistry time for real engine case (~36% total time)
§ Working on further improvement
Thank you!