Speeding up a Finite Element Computation on GPU
Nelson Inoue
Summary
• Introduction
• Finite element implementation on GPU
• Results
• Conclusions
University and Researchers
• Pontifical Catholic University of Rio de Janeiro – PUC-Rio
• Group of Technology in Petroleum Engineering – GTEP
• Research Team
– PhD Sergio Fontoura: Leader Researcher
– PhD Nelson Inoue: Senior Researcher
– PhD Carlos Emmanuel: Researcher
– MSc Guilherme Righetto: Researcher
– MSc Rafael Albuquerque: Researcher
Introduction
• Research & Development (R&D) project with Petrobras
• The project began in 2010
• The subject of the project is Reservoir Geomechanics
• There is great interest in this subject from the oil and gas industry
• The subject has still received little research attention
Introduction
• What is Reservoir Geomechanics?
– The branch of petroleum engineering that studies the coupling between fluid flow and rock deformation (stress analysis)
• Hydromechanical Coupling
– Oil production causes rock deformation
– Rock deformation contributes to oil production
Motivation
• Geomechanical effects during reservoir production
1. Surface subsidence
2. Bedding-parallel slip
3. Fault reactivation
4. Caprock integrity
5. Reservoir compaction
Challenge
• Evaluate geomechanical effects in a real reservoir
• Overcome two major challenges
1. To use a reliable coupling scheme between fluid flow and stress analysis
2. To speed up the stress analysis (Finite Element Method)
• The finite element analysis accounts for most of the simulation time
Hydromechanical coupling
• Theoretical Approach
Figure: Coupling program flowchart
Finite Element Method
• Partial differential equations arise in the mathematical modelling of many engineering problems
• An analytical (exact) solution is often very complicated to obtain
• Alternative: numerical solution
– Finite element method, finite difference method, finite volume method, boundary element method, discrete element method, etc.
Finite Element Method
• The finite element method (FEM) is widely applied in stress analysis
• The domain is an assembly of finite elements (FEs)
Figure: Finite element domain (http://www.mscsoftware.com/product/dytran)
CHRONOS: FE Program
• Chronos has been implemented on GPU
– Motivation: to reduce the simulation time in the hydromechanical analysis
– Why use a GPU? Much more processing power
– CPU (4 - 8 cores) versus GPU (GeForce GTX Titan, 2880 cores)
Figure: CETUS computer with 4 GPUs
Motivation
• GPU features (CUDA C Programming Guide):
– Highly parallel, multithreaded, manycore processor
– Tremendous computational horsepower and very high memory bandwidth
Figures: Number of floating-point operations per second; memory bandwidth
Our Implementation
• GPUs have good performance
• We have developed and implemented an optimized, parallel finite element program on GPU
• The finite element code is written in CUDA
• We have implemented on GPU:
– Assembly of the stiffness matrix
– Solution of the system of linear equations
– Evaluation of the strain state
– Evaluation of the stress state
Global Memory Access on GPU
• Getting maximum performance on GPU: coalesced access
– Sequential/aligned access: good
– Strided access: not so good
– Random access: bad
• Memory accesses are fully coalesced as long as all threads in a warp access the same relative address
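The three access patterns can be compared with a small CPU-side model that counts how many memory segments one warp touches. This is only an illustrative sketch: the 128-byte segment with 4-byte words, the helper name `segments_touched`, the stride of 4 and the deterministic "scattered" pattern (standing in for random access) are assumptions, not values taken from the slides.

```python
WARP_SIZE = 32       # threads per warp on NVIDIA GPUs
SEGMENT_WORDS = 32   # assumption: a 128-byte segment holds 32 four-byte words


def segments_touched(indices):
    """Count the distinct memory segments hit by one warp (hypothetical helper)."""
    return len({i // SEGMENT_WORDS for i in indices})


# Thread t reads word t: sequential and aligned.
sequential = [t for t in range(WARP_SIZE)]
# Thread t reads word 4 * t: strided access (stride chosen for illustration).
strided = [4 * t for t in range(WARP_SIZE)]
# Thread t reads a far-apart word: deterministic stand-in for random access.
scattered = [(97 * t) % 4096 for t in range(WARP_SIZE)]

for name, pattern in [("sequential", sequential),
                      ("strided", strided),
                      ("scattered", scattered)]:
    print(name, segments_touched(pattern))
```

Fewer segments per warp means fewer memory transactions, which is the sense in which the sequential pattern is "good" and the scattered one is "bad".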
Development on CPU
• The assembly of the global stiffness matrix in the conventional FEM
– Simple 1D problem
– The continuous model is discretized by elements: three finite elements (1, 2, 3) connecting four nodes (1, 2, 3, 4)
– Element stiffness matrix of each element:
$$k^{e} = \begin{bmatrix} k^{e}_{11} & k^{e}_{12} \\ k^{e}_{21} & k^{e}_{22} \end{bmatrix}, \quad e = 1, 2, 3$$
Figure: Real model and model discretization
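Development on CPU
• In terms of CPU implementation: for i = 1, ..., numel (numel = 3)
– Evaluate the element stiffness matrix
$$k^{i} = \begin{bmatrix} k^{i}_{11} & k^{i}_{12} \\ k^{i}_{21} & k^{i}_{22} \end{bmatrix}$$
– Assemble it into the global stiffness matrix
i = 1:
$$k_{global} = \begin{bmatrix} k^{1}_{11} & k^{1}_{12} & 0 & 0 \\ k^{1}_{21} & k^{1}_{22} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$
i = 2:
$$k_{global} = \begin{bmatrix} k^{1}_{11} & k^{1}_{12} & 0 & 0 \\ k^{1}_{21} & k^{1}_{22} + k^{2}_{11} & k^{2}_{12} & 0 \\ 0 & k^{2}_{21} & k^{2}_{22} & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$
i = 3:
$$k_{global} = \begin{bmatrix} k^{1}_{11} & k^{1}_{12} & 0 & 0 \\ k^{1}_{21} & k^{1}_{22} + k^{2}_{11} & k^{2}_{12} & 0 \\ 0 & k^{2}_{21} & k^{2}_{22} + k^{3}_{11} & k^{3}_{12} \\ 0 & 0 & k^{3}_{21} & k^{3}_{22} \end{bmatrix}$$
• The storage in memory (row by row) after each element
– i = 1: $[\,k^{1}_{11},\; k^{1}_{12},\; 0,\; 0,\; k^{1}_{21},\; k^{1}_{22},\; 0,\; 0,\; 0,\; 0,\; 0,\; 0,\; 0,\; 0,\; 0,\; 0\,]$
– i = 2: $[\,k^{1}_{11},\; k^{1}_{12},\; 0,\; 0,\; k^{1}_{21},\; k^{1}_{22} + k^{2}_{11},\; k^{2}_{12},\; 0,\; 0,\; k^{2}_{21},\; k^{2}_{22},\; 0,\; 0,\; 0,\; 0,\; 0\,]$
– i = 3: $[\,k^{1}_{11},\; k^{1}_{12},\; 0,\; 0,\; k^{1}_{21},\; k^{1}_{22} + k^{2}_{11},\; k^{2}_{12},\; 0,\; 0,\; k^{2}_{21},\; k^{2}_{22} + k^{3}_{11},\; k^{3}_{12},\; 0,\; 0,\; k^{3}_{21},\; k^{3}_{22}\,]$
– Memory access is not coalesced: each element writes to scattered positions of the array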
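The element loop can be sketched in a few lines of Python. This is a minimal illustration, not the Chronos code: the spring-like element matrix k * [[1, -1], [-1, 1]], the function names and the stiffness values 1, 2, 3 are assumptions chosen so the result is easy to check by hand.

```python
def element_stiffness(k):
    """2x2 stiffness matrix of a 1D two-node element (assumed spring form)."""
    return [[k, -k], [-k, k]]


def assemble_global(stiffnesses):
    """Conventional assembly: loop over elements, scatter into the global matrix."""
    n_nodes = len(stiffnesses) + 1
    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    for e, k in enumerate(stiffnesses):      # i = 1, ..., numel on the slide
        ke = element_stiffness(k)
        for a in range(2):                   # local row
            for b in range(2):               # local column
                # Element e connects global nodes e and e + 1.
                K[e + a][e + b] += ke[a][b]
    return K


# Hypothetical stiffness values for the three elements.
K = assemble_global([1.0, 2.0, 3.0])
for row in K:
    print(row)
```

Each pass of the outer loop writes a 2x2 block that overlaps the previous one on the diagonal, which is why the sums k22 + k11 appear in the assembled matrix and why consecutive writes land far apart in a row-by-row array.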
Development on GPU
• The assembly of the global stiffness matrix on GPU
– Simple 1D problem
– The continuous model is discretized by nodes: four finite element nodes
– Each node builds one row of the global stiffness matrix
• Node 1, row 1: $[k_{row\,1}] = [\,k^{1}_{11} \;\; k^{1}_{12}\,]$
• Node 2, row 2: $[k_{row\,2}] = [\,k^{1}_{21} \;\; k^{1}_{22} + k^{2}_{11} \;\; k^{2}_{12}\,]$
• Node 3, row 3: $[k_{row\,3}] = [\,k^{2}_{21} \;\; k^{2}_{22} + k^{3}_{11} \;\; k^{3}_{12}\,]$
• Node 4, row 4: $[k_{row\,4}] = [\,k^{3}_{21} \;\; k^{3}_{22}\,]$
Figure: Real model and its discretization by nodes
Development on GPU
• In terms of GPU implementation: column = 1
– Each thread handles one row, and all the threads do the same calculation
• Thread 1, row 1: $[k_{row\,1}] = [\,0 \;\; k^{1}_{11} \;\; k^{1}_{12}\,]$, writes $0$
• Thread 2, row 2: $[k_{row\,2}] = [\,k^{1}_{21} \;\; k^{1}_{22} + k^{2}_{11} \;\; k^{2}_{12}\,]$, writes $k^{1}_{21}$
• Thread 3, row 3: $[k_{row\,3}] = [\,k^{2}_{21} \;\; k^{2}_{22} + k^{3}_{11} \;\; k^{3}_{12}\,]$, writes $k^{2}_{21}$
– The storage in memory after column 1: $k_{global} = [\,0,\; k^{1}_{21},\; k^{2}_{21},\; k^{3}_{21}\,]$, with consecutive threads writing consecutive entries
– The memory access is sequential and aligned
Development on GPU
• In terms of GPU implementation: column = 2
• Thread 1, row 1: $[k_{row\,1}] = [\,0 \;\; k^{1}_{11} \;\; k^{1}_{12}\,]$, writes $k^{1}_{11}$
• Thread 2, row 2: $[k_{row\,2}] = [\,k^{1}_{21} \;\; k^{1}_{22} + k^{2}_{11} \;\; k^{2}_{12}\,]$, writes $k^{1}_{22} + k^{2}_{11}$
• Thread 3, row 3: $[k_{row\,3}] = [\,k^{2}_{21} \;\; k^{2}_{22} + k^{3}_{11} \;\; k^{3}_{12}\,]$, writes $k^{2}_{22} + k^{3}_{11}$
– The storage in memory after column 2 (column 1 followed by column 2): $k_{global} = [\,0,\; k^{1}_{21},\; k^{2}_{21},\; k^{3}_{21},\; k^{1}_{11},\; k^{1}_{22} + k^{2}_{11},\; k^{2}_{22} + k^{3}_{11},\; k^{3}_{22}\,]$
– Memory access is coalesced
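The row-per-thread assembly with column-by-column storage can be mimicked on the CPU as follows. This is a sketch under stated assumptions, not the authors' kernel: the loop over `row` stands in for the GPU threads, the spring-like element matrix k * [[1, -1], [-1, 1]] and the stiffness values are hypothetical, and the three-column band layout `band[col * n + row]` is one way to realize the storage shown on the slides.

```python
def assemble_rows(stiffnesses):
    """Node-based assembly: one 'thread' per row, band stored column by column."""
    numel = len(stiffnesses)
    n = numel + 1                      # number of nodes = number of rows
    band = [0.0] * (3 * n)             # layout: band[col * n + row]
    for row in range(n):               # each iteration plays the role of one thread
        lower = diag = upper = 0.0
        if row > 0:                    # element on the left of this node
            k_left = stiffnesses[row - 1]
            lower = -k_left            # its k21 entry
            diag += k_left             # its k22 entry
        if row < numel:                # element on the right of this node
            k_right = stiffnesses[row]
            diag += k_right            # its k11 entry
            upper = -k_right           # its k12 entry
        # For a fixed column, threads row = 0, 1, 2, ... write consecutive entries.
        band[0 * n + row] = lower
        band[1 * n + row] = diag
        band[2 * n + row] = upper
    return band


# Hypothetical stiffness values for the three elements.
band = assemble_rows([1.0, 2.0, 3.0])
print(band)
```

Because no two rows write to the same entry, the threads need no synchronization, and for each column neighbouring threads touch neighbouring addresses, which is the coalesced pattern.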
Development on GPU
• Solution of the system of linear equations Ax = b
– A = stiffness matrix, x = nodal displacement vector (unknown values) and b = nodal force vector
– A is symmetric and positive-definite
– Two families of solvers: direct and iterative
• The conjugate gradient method was chosen
– Iterative algorithm
– Parallelizable algorithm on GPU
– The operations of the conjugate gradient algorithm are well suited to GPU implementation
Figure: Conjugate gradient algorithm
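For reference, the textbook conjugate gradient iteration can be written as the following minimal Python sketch. It is not the Chronos solver: there is no preconditioning or sparse storage, and the 2x2 matrix, the right-hand side and the tolerance are illustrative assumptions. The only operations per iteration are a matrix-vector product, dot products and vector updates, which is why the method maps well to a GPU.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A (dense, unpreconditioned)."""
    x = [0.0] * len(b)
    r = list(b)                       # residual b - A x for the initial guess x = 0
    p = list(r)                       # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:       # converged: residual norm below tolerance
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)   # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old        # makes the next direction A-conjugate
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small symmetric positive-definite system chosen for illustration.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
print(x)
```

The exact solution of this example is x = (1/11, 7/11), which the iteration reaches in two steps up to rounding.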