GPU Accelerated Solver for the 3D Groundwater Flow Equation
GTC 2015
Robert Zigon, Sr Staff Research Engineer, Beckman Coulter
Outline
• Background
• Legacy Fortran
• The Algorithm and CUDA attempts
• Results
• Lessons Learned
Background
Hydrogeology: the study of the distribution and movement of water in the Earth’s crust.

Questions asked by Hydrogeologists
• Can an aquifer support another subdivision in a residential area?
• Will a dam dry up if irrigation doubles?
• Will waste products from a coal mine negatively impact wetlands?
A PDE to model the water flow Freeze, 1971
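Discretizing the PDE
• First order for time
\[ \frac{\partial \psi}{\partial t}(x_i, t_j) \approx \frac{\psi_i^{(j+1)} - \psi_i^{(j)}}{\Delta t} \]
• Second order for spatial
\[ \frac{\partial}{\partial x}\left[ K(\psi)\,\frac{\partial \psi}{\partial x} \right]_{i,j} \approx \frac{1}{\Delta x}\left[ K\!\left(\psi_{i+1/2}^{(j)}\right)\frac{\psi_{i+1}^{(j)} - \psi_i^{(j)}}{\Delta x} - K\!\left(\psi_{i-1/2}^{(j)}\right)\frac{\psi_i^{(j)} - \psi_{i-1}^{(j)}}{\Delta x} \right] \]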
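The two difference quotients can be tried out in a few lines of Python. This is only a sketch in one dimension: the function names, the uniform grid spacing `dx`, and the choice of evaluating the interface conductivity at the average of the two neighbouring heads are assumptions made for illustration, not details taken from the talk.

```python
def time_difference(psi_new, psi_old, dt):
    """First-order (forward) difference in time for one grid point."""
    return (psi_new - psi_old) / dt


def flux_divergence(psi, k_of_psi, dx):
    """Second-order difference of d/dx [ K(psi) d(psi)/dx ] at interior points.

    psi       list of pressure heads on a uniform 1D grid
    k_of_psi  callable returning conductivity for a given head (assumed form)
    Returns a list with one value per interior point (indices 1 .. n-2).
    """
    result = []
    for i in range(1, len(psi) - 1):
        # Interface conductivities: K evaluated at the mean of the neighbours
        # (an assumption; other averaging rules are also common).
        k_east = k_of_psi(0.5 * (psi[i] + psi[i + 1]))
        k_west = k_of_psi(0.5 * (psi[i - 1] + psi[i]))
        flux_east = k_east * (psi[i + 1] - psi[i]) / dx
        flux_west = k_west * (psi[i] - psi[i - 1]) / dx
        result.append((flux_east - flux_west) / dx)
    return result
```

With a constant conductivity the spatial operator reduces to the usual three-point second derivative, which gives a quick sanity check: for psi = x^2 it returns 2 at every interior point.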
Legacy Fortran
• About 15 pages of code (Intel compiler)
• In use for over 10 years
• 7 day simulation, 24 hr step, 1M elements: 2 hr run time
• 30 day simulation, 24 hr step, 19M elements: 8 days run time
Algorithm Overview
For each time step t
  While pressure not converged at (t)
    1. Predict Psi (t)
    2. Compute K(Psi (t) )
    3. Compute Psi (t)
    4. Update Psi (t-2), Psi (t-1)
    5. Generate discharge field Q (t)
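The five steps can be sketched on a toy 1D problem. Everything specific here is an assumption made for illustration: the exponential `conductivity` model, the linear-extrapolation predictor, the Jacobi sweep used to compute Psi (t), the fixed-head boundaries, and the reading that steps 4 and 5 run once per time step after the iteration converges. The original Fortran and CUDA codes are not shown in the slides and may differ on every one of these points.

```python
import math


def conductivity(psi, k_sat=1.0, alpha=0.5):
    """Hypothetical K(psi): saturated for psi >= 0, exponential decay below."""
    return k_sat if psi >= 0.0 else k_sat * math.exp(alpha * psi)


def face_conductivity(psi):
    """Arithmetic mean of K at neighbouring nodes (one of several common choices)."""
    k = [conductivity(p) for p in psi]
    return [0.5 * (k[i] + k[i + 1]) for i in range(len(psi) - 1)]


def solve_time_step(psi_prev2, psi_prev1, dt, dx, tol=1e-8, max_iter=500):
    n = len(psi_prev1)
    # 1. Predict Psi(t) by linear extrapolation from the two previous levels.
    psi = [2.0 * a - b for a, b in zip(psi_prev1, psi_prev2)]
    psi[0], psi[-1] = psi_prev1[0], psi_prev1[-1]  # fixed-head boundaries
    for _ in range(max_iter):
        # 2. Compute K(Psi(t)) from the current estimate.
        k_face = face_conductivity(psi)
        # 3. Compute Psi(t): one Jacobi sweep of the backward-Euler equations.
        new = psi[:]
        for i in range(1, n - 1):
            west = k_face[i - 1] / dx ** 2
            east = k_face[i] / dx ** 2
            new[i] = (psi_prev1[i] / dt + west * psi[i - 1] + east * psi[i + 1]) / (
                1.0 / dt + west + east)
        change = max(abs(a - b) for a, b in zip(new, psi))
        psi = new
        if change < tol:  # pressure converged at this time level
            break
    return psi


def discharge(psi, dx):
    """Darcy flux q = -K d(psi)/dx at each cell face."""
    k_face = face_conductivity(psi)
    return [-k_face[i] * (psi[i + 1] - psi[i]) / dx for i in range(len(psi) - 1)]


def simulate(psi0, n_steps, dt, dx):
    psi_prev2, psi_prev1 = psi0[:], psi0[:]
    q = discharge(psi_prev1, dx)
    for _ in range(n_steps):
        psi = solve_time_step(psi_prev2, psi_prev1, dt, dx)
        psi_prev2, psi_prev1 = psi_prev1, psi  # 4. Update Psi(t-2), Psi(t-1)
        q = discharge(psi, dx)                 # 5. Generate discharge field Q(t)
    return psi_prev1, q
```

For example, `simulate([0.0, -1.0, -1.0, -1.0, -1.0], 20, 0.1, 1.0)` holds the two end heads fixed and lets the interior relax toward the wetter left boundary.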
First CUDA attempt
1. Predict Psi (t)
   Compute K(Psi (t) )
   Compute Psi (t)
   Update Psi (t-2), Psi (t-1)
2. Generate discharge field Q (t)
- Launch 250,000 threads for 19M volume elements
- Advance the plane of threads across the volume
Results – Not Enough Registers!
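The "plane of threads" idea can be mimicked serially. In the sketch below each (x, y) pair stands in for one GPU thread, and that thread walks along z while keeping the values behind, at, and in front of its current position in local variables, so each z-neighbour is read from the array only once. A plain 7-point Laplacian is used as a stand-in for the real flow operator, and the flat array layout and function name are assumptions for illustration; this is Python, not the CUDA kernel from the talk.

```python
def laplacian_plane_sweep(u, nx, ny, nz):
    """7-point Laplacian (unit spacing) on interior points of a flat 3D array.

    u is indexed as u[(z * ny + y) * nx + x]. Boundary entries of the
    result are left at zero.
    """
    def idx(x, y, z):
        return (z * ny + y) * nx + x

    out = [0.0] * (nx * ny * nz)
    # Each (x, y) pair plays the role of one thread in the 2D plane.
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            behind = u[idx(x, y, 0)]
            current = u[idx(x, y, 1)]
            # The plane advances through the volume one z-slice at a time.
            for z in range(1, nz - 1):
                front = u[idx(x, y, z + 1)]
                out[idx(x, y, z)] = (
                    u[idx(x - 1, y, z)] + u[idx(x + 1, y, z)]
                    + u[idx(x, y - 1, z)] + u[idx(x, y + 1, z)]
                    + behind + front - 6.0 * current)
                # Shift the z-values held in "registers" forward by one slice.
                behind, current = current, front
    return out
```

On a GPU the in-plane neighbours would typically be staged in shared memory rather than re-read from global memory, which this serial sketch does not model.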
Second CUDA attempt
1. Predict Psi (t)
   Compute K(Psi (t) )
2. Compute Psi (t)
3. Update Psi (t-2), Psi (t-1)
4. Generate discharge field Q (t)
Results – K1 not enough registers!
Third CUDA attempt
1. Predict Psi (t)
2. Compute K(Psi (t) )
3. Compute Psi (t)
4. Update Psi (t-2), Psi (t-1)
5. Generate discharge field Q (t)
Results
- K2: nonlinear coefficients are expensive
- K3: warp divergence at boundary conditions
- Numerous matrix reads from GMEM
Results – 7 Day, 19M elements

Time step   1 cpu (mins)   4 cpu (mins)   K20c (mins)   1 cpu/K20   4 cpu/K20
24 hrs      120            72             10            12.6        7.6
12 hrs      251            165            21            12.0        7.9
6 hrs       532            352            41            13.0        8.6
4 hrs       826            510            63            13.1        8.1
2 hrs       1557           967            123           12.7        7.9

[Plot: Time (mins), log scale, for 1 CPU, 4 CPU, and Tesla K20C]

All arithmetic in double precision
CUDA 5.5, K20C, VS 2008, Win7/64
Lessons Learned
• Advance a “plane of threads” through the volume
• Matrix multi-splitting operator could reduce reads
• Simplify non-linear terms with splines
• Porting code 10x
• Re-architecting code 100x
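One way to read the spline suggestion: tabulate an expensive nonlinear coefficient once, then interpolate at run time instead of re-evaluating it in every kernel call. The sketch below uses a piecewise-linear table (a degree-1 spline) as the simplest stand-in; a cubic spline would follow the same pattern with more coefficients per interval. The exponential test function, the table range, and the function names are assumptions for illustration only.

```python
import math


def build_table(f, lo, hi, n):
    """Sample f at n + 1 equally spaced points on [lo, hi]."""
    h = (hi - lo) / n
    values = [f(lo + i * h) for i in range(n + 1)]
    return values, lo, h


def lookup(table, x):
    """Piecewise-linear interpolation; x outside the range is extrapolated
    from the nearest end interval."""
    values, lo, h = table
    s = (x - lo) / h
    i = min(max(int(s), 0), len(values) - 2)  # clamp to a valid interval
    frac = s - i
    return values[i] + frac * (values[i + 1] - values[i])


# Hypothetical nonlinear coefficient: K(psi) = exp(0.5 * psi) for psi in [-10, 0].
k_exact = lambda psi: math.exp(0.5 * psi)
k_table = build_table(k_exact, -10.0, 0.0, 1024)
```

The lookup costs one multiply-add plus two table reads regardless of how expensive the original function is, at the price of an interpolation error that shrinks with the square of the table spacing.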
Collaborators
• Prof. Sally Letsinger, Indiana University
• Prof. Raymond Chin, Indiana University-Purdue University of Indianapolis

References
• O’Leary, Multi-splitting of Matrices and Parallel Solution of Linear Systems
• Freeze, Three dimensional, transient, saturated unsaturated flow in a ground basin
• Micikevicius, 3D Finite Difference Computation on GPUs using CUDA
Questions?
robert.zigon@beckman.com