
Porting and Optimization of Search of Neighbour-particle by Using OpenACC
Takaaki Miyajima and Naoyuki Fujita, Numerical Simulation Tech. Unit, Aeronautical Tech. Directorate, Japan Aerospace Exploration Agency


  1. Porting and Optimization of Search of Neighbour-particle by Using OpenACC. Takaaki Miyajima and Naoyuki Fujita, Numerical Simulation Tech. Unit, Aeronautical Tech. Directorate, Japan Aerospace Exploration Agency

  2. Japan Aerospace Exploration Agency. JAXA is a public agency for researching and developing aerospace science and technology. Supersonic aircraft, satellites, rockets and space science are our research targets.

  3. High-Fidelity Computational Fluid Dynamics
     Apply Computational Fluid Dynamics (CFD) more broadly: from steady-state simulation to unsteady-state simulation.
     • Simulate steady-state: apply newly developed theory to models of actual aircraft. Much research has been done on cruise conditions.
     • Simulate unsteady-state: apply to actual aircraft. Understand the behaviour of an aircraft during takeoff, landing or turns.
     • Simulate the whole flight envelope.

  4. An example of unsteady-state simulation
     Landing in rainy weather: raindrops have negative effects on wings and tires.
     • Raindrops decrease the lift coefficient; a rain puddle makes the landing run longer.
     • Simulation of the interactions between raindrops and the aircraft is needed.
     • Higher computing power is necessary as well :)
     We adopt the MPS (Moving Particle Semi-implicit) method for simulating raindrops.
     Cite: http://akihito114.exblog.jp/21064589/ and http://blogs.yahoo.co.jp/qf104j/28794118.htm

  5. Agenda: 1. Moving Particle Semi-implicit (MPS) Method 2. OpenACC 3. Porting and Optimization

  6. MPS (Moving Particle Semi-implicit) Method
     【 Overview 】
     • The MPS method is attracting attention in the CFD area.
     • Particle-based simulation (not a stencil computation).
     • The target fluid is divided into thousands of particles, and each particle interacts with its neighbour particles.
     (Figure: MPS simulation of a collapse of a water column)
     【 Features from a computer science viewpoint 】
     1. The # of particles becomes over ten thousand, so a parallel computing device is necessary.
     2. All the particles can be calculated independently.
     3. Memory-bound application.
     4. The "search for neighbour-particle" is the main bottleneck.

  7. NSRU-MPS: in-house MPS program
     We are developing an in-house MPS method program.
     【 Features of the original program 】
     • Physicists wrote the program.
     • More than 7,000 lines of Fortran90.
     • Physical quantities are single-precision floating-point.
     • Structure of Arrays (SoA) style data structure (sketched below).
     • Parallelized only by MPI.
     (Figure: simulation done by NSRU-MPS)
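The SoA layout can be pictured with a minimal sketch like the following. The module and array names are hypothetical and not taken from NSRU-MPS; the point is only that each physical quantity lives in its own contiguous single-precision array.

```fortran
! Minimal SoA sketch (hypothetical names, not the actual NSRU-MPS layout).
! Each physical quantity is stored in its own contiguous real(4) array,
! indexed by particle id, which gives unit-stride access in loops that
! touch only one quantity.
module particle_data
  implicit none
  integer, parameter :: np_max = 3000000               ! assumed capacity
  real(4), dimension(np_max) :: pos_x, pos_y, pos_z    ! positions
  real(4), dimension(np_max) :: vel_x, vel_y, vel_z    ! velocities
  real(4), dimension(np_max) :: pressure, density      ! scalar quantities
end module particle_data
```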

  8. NSRU-MPS: preliminary evaluation
     Profiling of the elapsed time on a Xeon CPU (Ivy Bridge).
     Target problem: a collapse of a water column, 40[cm] x 40[cm] x 8[cm]
     # of particles: 2,247,750
     # of MPI processes: 24
     CPU: Intel Xeon E5-2697 v2 @ 2.7GHz, 12 cores x 2 CPUs
     Memory: 128GB of DDR3-12800
     Compiler: PGI Fortran 16.10 with the "-O3 -fast" options
     MPI library: OpenMPI 1.10.5 with the "-bind-to socket -npersocket 12 -n 24" options
     Measurement method: average of the first 200 steps, measured with the MPI_Wtime() function
     【 Result 】
     • One time step takes 7093.75[ms].
     • Search for neighbour-particle: 3973.5[ms], MPI related: 1475.0[ms], others: 1645.1[ms].
     • The search for neighbour-particle and MPI-related work accounted for 56% and 21% of the total processing time, respectively.
     (Figure: per-process breakdown of the elapsed time)

  9. Relationship between elapsed time and MPI
     Profile the elapsed time while changing the # of processes from 2 to 24.
     • The elapsed time decreased along with the # of processes (total elapsed time: about 1/7).
     • MPI communication increased in proportion to the # of processes (MPI related: about x15).

     Time [msec]                      2MPI     4MPI     6MPI     8MPI    12MPI    24MPI
     Search of neighbour-particles  41122.1  22086.9  14294.9  11468.3   8839.2   3973.5
     MPI related                       98.3    901.8   1063.0   1068.2   1045.7   1475.1
     Others                          9757.1   5562.7   3801.2   3139.9   2562.3   1645.2
     Total                          50977.4  28551.4  19159.2  15676.4  12447.2   7093.8

     Goal: decrease the elapsed time while keeping the # of processes small.

  10. Search for neighbour particles (with buckets)
     【 Bucket 】
     • Divide the simulation space into cubes called "buckets".
     • The volume of a bucket corresponds to about 3^3 particles.
     • The effective radius (cut-off distance) corresponds to 3 buckets.
     【 Search for neighbour particles 】
     1. Pick up a target particle (red in the figure).
     2. Traverse the adjacent 3^3 buckets. ✓ No fixed order to traverse the buckets.
     3. Search the particles in a bucket.
     4. Calculate the distance and weight between the target particle and each particle.
     5. Accumulate the weighted physical value onto the target particle. ✓ No fixed order to accumulate the physical values.
     ※ Other particle-based simulations (Molecular Dynamics or N-body simulation) have similar computations.
     A sketch of how particles are registered into buckets follows below.
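A minimal sketch of how particles could be registered into buckets, assuming a fixed cell edge length and hypothetical array names (bucket_count, bucket_list); this illustrates the bucket idea and is not the NSRU-MPS code itself.

```fortran
! Bucket registration sketch (hypothetical names). Each particle is binned
! into the cell that contains it; a later neighbour search then only has
! to visit the 3x3x3 cells around the target particle's cell.
subroutine register_buckets(np, bucket_size, nx, ny, nz, max_in_bucket, &
                            pos_x, pos_y, pos_z, bucket_count, bucket_list)
  implicit none
  integer, intent(in)  :: np, nx, ny, nz, max_in_bucket
  real(4), intent(in)  :: bucket_size                  ! edge length of one bucket
  real(4), intent(in)  :: pos_x(np), pos_y(np), pos_z(np)
  integer, intent(out) :: bucket_count(nx, ny, nz)     ! # of particles per bucket
  integer, intent(out) :: bucket_list(max_in_bucket, nx, ny, nz)  ! their particle ids
  integer :: i, ix, iy, iz

  bucket_count = 0
  do i = 1, np
     ix = max(1, min(nx, int(pos_x(i) / bucket_size) + 1))   ! cell index along x
     iy = max(1, min(ny, int(pos_y(i) / bucket_size) + 1))   ! cell index along y
     iz = max(1, min(nz, int(pos_z(i) / bucket_size) + 1))   ! cell index along z
     bucket_count(ix, iy, iz) = bucket_count(ix, iy, iz) + 1
     bucket_list(bucket_count(ix, iy, iz), ix, iy, iz) = i
  end do
end subroutine register_buckets
```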

  11. Search for neighbour particles in NSRU-MPS
     A quadruple nested loop is used:
     1. Pick up a target particle
     2. Traverse the adjacent 3x3x3 buckets
     3. Search the particles in a bucket
     4. Calculate the distance and weight
     5. Accumulate the physical value
     A Fortran sketch of this loop nest is shown below.
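A Fortran sketch of one possible shape of this loop nest. The array names are hypothetical (pbx/pby/pbz hold each particle's bucket index, dens is a generic accumulated quantity), the standard MPS weight w = r_e/r - 1 is assumed for illustration, and the 3x3x3 traversal is linearized into a single loop, so the nest may differ in detail from the original code.

```fortran
! Neighbour-particle search sketch (hypothetical names, CPU version).
subroutine neighbour_search(np, nx, ny, nz, max_in_bucket, r_e,   &
                            pos_x, pos_y, pos_z, pbx, pby, pbz,   &
                            bucket_count, bucket_list, dens)
  implicit none
  integer, intent(in)    :: np, nx, ny, nz, max_in_bucket
  real(4), intent(in)    :: r_e                        ! effective radius
  real(4), intent(in)    :: pos_x(np), pos_y(np), pos_z(np)
  integer, intent(in)    :: pbx(np), pby(np), pbz(np)  ! bucket index of each particle
  integer, intent(in)    :: bucket_count(nx, ny, nz)
  integer, intent(in)    :: bucket_list(max_in_bucket, nx, ny, nz)
  real(4), intent(inout) :: dens(np)                   ! accumulated quantity
  integer :: i, ib, k, j, ix, iy, iz
  real(4) :: r, w

  do i = 1, np                                   ! 1) pick up a target particle
     do ib = 0, 26                               ! 2) traverse the adjacent 3x3x3 buckets
        ix = pbx(i) + mod(ib, 3) - 1
        iy = pby(i) + mod(ib / 3, 3) - 1
        iz = pbz(i) + ib / 9 - 1
        if (ix < 1 .or. ix > nx .or. iy < 1 .or. iy > ny .or. &
            iz < 1 .or. iz > nz) cycle
        do k = 1, bucket_count(ix, iy, iz)       ! 3) particles in the bucket (indefinite length)
           j = bucket_list(k, ix, iy, iz)
           r = sqrt((pos_x(j) - pos_x(i))**2 +   &   ! 4) distance between i and j
                    (pos_y(j) - pos_y(i))**2 +   &
                    (pos_z(j) - pos_z(i))**2)
           if (r > 0.0 .and. r < r_e) then
              w = r_e / r - 1.0                  !    weight (standard MPS kernel, assumed)
              dens(i) = dens(i) + w              ! 5) accumulate onto the target particle
           end if
        end do
     end do
  end do
end subroutine neighbour_search
```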

  12. Analysis of the search for neighbour particles
     • Not easy to vectorize or to utilize the cache:
       - Indirect access: particles in a bucket are reached through an index list.
       - Indefinite loop length: the # of particles in a bucket is uncertain, which leads to an inefficient access pattern.
     • The computation naturally fits the SIMT model:
       ✓ Each target particle accesses different indices of buckets and particles.
       ✓ Thousands of in-flight data requests can hide the memory latency.
       ✓ There is no fixed order to traverse buckets and accumulate values.
     (Figure: target particle, bucket, particles)

  13. Agenda: 1. Moving Particle Semi-implicit (MPS) Method 2. OpenACC 3. Porting and Optimization

  14. An overview of OpenACC
     Add directives to existing C/C++ or Fortran code, and the compiler automatically generates a binary for the GPU. There is no need to write CUDA C/Fortran from scratch.
     Typical targets of offloading:
     • Loops
     • Data transfer (CPU from/to GPU)
     • User-defined functions
     • CUDA libraries (cuBLAS, cuFFT, etc.) can be integrated
     Quite a few practical applications have been ported with OpenACC:
     • Sunway TaihuLight added their own extensions.
     • Most of these applications adopt stencil computations.
     The PGI Compiler (Community Edition) is free for personal use.

  15. Three directives provided by OpenACC
     acc data directive
     • Transfers data between the host and the device memory at an arbitrary timing.
     • The data transfer happens at this position in the code.
     acc kernels directive
     • Specifies regions of code to offload from the CPU to the GPU.
     • The compiler automatically analyzes the loops and the necessary data.
     acc parallel / loop directive
     • Optimizes nested or single loops.
     • Loops can be mapped to blocks, warps and threads.
     ※ Each directive can take additional clauses to add information.
     (Sample code on the slide: Jacobi method; a minimal sketch follows below.)
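A minimal Jacobi-style sketch, written here from scratch to show the three directives working together; it is not the code shown on the slide, and the subroutine name and sizes are made up.

```fortran
! Jacobi-style sketch combining the data, kernels and loop directives.
subroutine jacobi(n, m, a, anew, iters)
  implicit none
  integer, intent(in)    :: n, m, iters
  real(4), intent(inout) :: a(n, m), anew(n, m)
  integer :: i, j, it

  !$acc data copy(a) create(anew)            ! data: transfer once, at this position
  do it = 1, iters
     !$acc kernels                           ! kernels: offload this region to the GPU
     !$acc loop independent collapse(2)      ! loop: map the nest onto GPU threads
     do j = 2, m - 1
        do i = 2, n - 1
           anew(i, j) = 0.25 * (a(i-1, j) + a(i+1, j) + a(i, j-1) + a(i, j+1))
        end do
     end do
     !$acc loop independent collapse(2)
     do j = 2, m - 1
        do i = 2, n - 1
           a(i, j) = anew(i, j)
        end do
     end do
     !$acc end kernels
  end do
  !$acc end data                             ! results copied back to the host here
end subroutine jacobi
```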

  16. OpenACC's three levels of parallelism
     Gang, Worker and Vector are provided to model SIMT; loops and functions are mapped explicitly:
     • Gang = thread block
     • Worker = warp
     • Vector = CUDA thread
     (Figure: a CUDA grid with num_gang = 3, num_worker = 3 and vector_length = 5; each Block corresponds to a Gang, each Thread to a Vector lane; shared memory is noted with the shared clause.)
     A sketch of this explicit mapping follows below.
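A sketch of explicitly mapping a triple loop nest onto gangs, workers and vector lanes, using the numbers from the figure (3 gangs, 3 workers, vector length 5). The subroutine and array names are made up, and in practice the vector length is usually chosen as a multiple of the 32-thread warp.

```fortran
! Explicit gang/worker/vector mapping sketch (illustrative numbers only).
subroutine add3d(ni, nj, nk, a, b, c)
  implicit none
  integer, intent(in)  :: ni, nj, nk
  real(4), intent(in)  :: a(ni, nj, nk), b(ni, nj, nk)
  real(4), intent(out) :: c(ni, nj, nk)
  integer :: i, j, k

  !$acc parallel num_gangs(3) num_workers(3) vector_length(5) copyin(a, b) copyout(c)
  !$acc loop gang                  ! outermost loop -> thread blocks (gangs)
  do k = 1, nk
     !$acc loop worker             ! middle loop -> warps (workers)
     do j = 1, nj
        !$acc loop vector          ! innermost loop -> CUDA threads (vector lanes)
        do i = 1, ni
           c(i, j, k) = a(i, j, k) + b(i, j, k)
        end do
     end do
  end do
  !$acc end parallel
end subroutine add3d
```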

  17. Clauses and their functions
     Clauses used in our implementation:
     clause       | function
     gang(N)      | map the loop to N thread blocks
     worker(N)    | map the loop to N warps
     vector(N)    | map the loop to N threads
     seq          | run the loop sequentially
     collapse(N)  | turn an N-deep loop nest into one large loop
     independent  | run each iteration independently
     atomic       | perform an atomic operation
     A small example combining several of these clauses follows below.
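A small made-up example that combines several of these clauses: collapse(2) fuses the two loops, independent asserts that the iterations can run in any order, and atomic protects the shared accumulation. The array names (bin, hist) are invented for the illustration.

```fortran
! Histogram-style sketch showing collapse, independent, private and atomic.
subroutine histogram(n, m, nbin, bin, hist)
  implicit none
  integer, intent(in)    :: n, m, nbin
  integer, intent(in)    :: bin(n, m)       ! bin index of each element
  real(4), intent(inout) :: hist(nbin)
  integer :: i, j, k

  !$acc parallel loop collapse(2) independent private(k) copyin(bin) copy(hist)
  do j = 1, m
     do i = 1, n
        k = bin(i, j)                       ! several (i,j) may map to the same k
        !$acc atomic update
        hist(k) = hist(k) + 1.0             ! atomic add on the shared histogram
     end do
  end do
end subroutine histogram
```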

  18. Agenda: 1. Moving Particle Semi-implicit (MPS) Method 2. OpenACC 3. Porting and Optimization

  19. Three optimizations: Naive, Atomic, 3-D Thread
     1. Naive: 1 particle = 1 CUDA thread
        • Simplest mapping.
        • No code modification is required.
     2. Atomic: 1 bucket = 1 CUDA thread
        • Uses atomic operations for the accumulation.
        • A small code modification is required.
     3. 3-D Thread: 1 bucket = 1 CUDA thread
        • Considers the physical background when mapping threads.
        • A small code modification is required.

  20. Naive: 1 particle = 1 CUDA thread
     • Each particle is mapped to one CUDA thread.
     • The inner loops (bucket traversal and in-bucket search) are performed sequentially inside each thread.
     (Figure: thread mapping; each thread walks through bucket-traversal indices 0 to 26 sequentially; 128 threads / warp.)
     An OpenACC sketch of this mapping follows below.
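An OpenACC sketch of the naive mapping, using the same hypothetical arrays as the earlier CPU sketch. The directives illustrate the idea (particle loop mapped to threads, inner loops kept sequential) and are not necessarily the exact directives used in NSRU-MPS.

```fortran
! Naive mapping sketch: 1 particle = 1 CUDA thread, inner loops sequential.
subroutine search_naive(np, nx, ny, nz, max_in_bucket, r_e,        &
                        pos_x, pos_y, pos_z, pbx, pby, pbz,        &
                        bucket_count, bucket_list, dens)
  implicit none
  integer, intent(in)    :: np, nx, ny, nz, max_in_bucket
  real(4), intent(in)    :: r_e
  real(4), intent(in)    :: pos_x(np), pos_y(np), pos_z(np)
  integer, intent(in)    :: pbx(np), pby(np), pbz(np)
  integer, intent(in)    :: bucket_count(nx, ny, nz)
  integer, intent(in)    :: bucket_list(max_in_bucket, nx, ny, nz)
  real(4), intent(inout) :: dens(np)
  integer :: i, ib, k, j, ix, iy, iz
  real(4) :: r, w

  !$acc parallel loop gang vector vector_length(128)               &
  !$acc&   private(ib, k, j, ix, iy, iz, r, w)                     &
  !$acc&   copyin(pos_x, pos_y, pos_z, pbx, pby, pbz)              &
  !$acc&   copyin(bucket_count, bucket_list) copy(dens)
  do i = 1, np                                   ! 1 particle = 1 CUDA thread
     !$acc loop seq
     do ib = 0, 26                               ! sequential 3x3x3 bucket traversal
        ix = pbx(i) + mod(ib, 3) - 1
        iy = pby(i) + mod(ib / 3, 3) - 1
        iz = pbz(i) + ib / 9 - 1
        if (ix < 1 .or. ix > nx .or. iy < 1 .or. iy > ny .or. &
            iz < 1 .or. iz > nz) cycle
        !$acc loop seq
        do k = 1, bucket_count(ix, iy, iz)       ! sequential in-bucket search
           j = bucket_list(k, ix, iy, iz)
           r = sqrt((pos_x(j) - pos_x(i))**2 + (pos_y(j) - pos_y(i))**2 + &
                    (pos_z(j) - pos_z(i))**2)
           if (r > 0.0 .and. r < r_e) then
              w = r_e / r - 1.0                  ! MPS weight (assumed form)
              dens(i) = dens(i) + w              ! only thread i writes dens(i): no atomic needed
           end if
        end do
     end do
  end do
end subroutine search_naive
```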

  21. Atomic: 1 bucket = 1 CUDA thread
     • Each bucket traversal is mapped to one CUDA thread.
     • The bucket-number calculation is moved into this loop.
     • An atomic operation (atomic add) is used for the accumulation.
     (Figure: thread mapping over bucket-traversal indices 0 to 26 with atomic adds; 128 threads / warp.)
     An OpenACC sketch of this mapping follows below.
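An OpenACC sketch of the atomic version, again with the hypothetical arrays from the earlier sketches: the 27 bucket traversals of one target particle are spread over vector lanes, so several threads may update the same dens(i) and the accumulation is protected with an atomic add. This illustrates the idea rather than reproducing the authors' exact directives.

```fortran
! Atomic mapping sketch: the bucket traversal is mapped to CUDA threads
! and the shared accumulation uses an atomic add.
subroutine search_atomic(np, nx, ny, nz, max_in_bucket, r_e,       &
                         pos_x, pos_y, pos_z, pbx, pby, pbz,       &
                         bucket_count, bucket_list, dens)
  implicit none
  integer, intent(in)    :: np, nx, ny, nz, max_in_bucket
  real(4), intent(in)    :: r_e
  real(4), intent(in)    :: pos_x(np), pos_y(np), pos_z(np)
  integer, intent(in)    :: pbx(np), pby(np), pbz(np)
  integer, intent(in)    :: bucket_count(nx, ny, nz)
  integer, intent(in)    :: bucket_list(max_in_bucket, nx, ny, nz)
  real(4), intent(inout) :: dens(np)
  integer :: i, ib, k, j, ix, iy, iz
  real(4) :: r, w

  !$acc parallel loop gang                                         &
  !$acc&   copyin(pos_x, pos_y, pos_z, pbx, pby, pbz)              &
  !$acc&   copyin(bucket_count, bucket_list) copy(dens)
  do i = 1, np                                   ! one gang per target particle
     !$acc loop vector private(k, j, ix, iy, iz, r, w)
     do ib = 0, 26                               ! 1 bucket traversal = 1 CUDA thread
        ix = pbx(i) + mod(ib, 3) - 1             ! bucket-number calculation moved here
        iy = pby(i) + mod(ib / 3, 3) - 1
        iz = pbz(i) + ib / 9 - 1
        if (ix < 1 .or. ix > nx .or. iy < 1 .or. iy > ny .or. &
            iz < 1 .or. iz > nz) cycle
        !$acc loop seq
        do k = 1, bucket_count(ix, iy, iz)
           j = bucket_list(k, ix, iy, iz)
           r = sqrt((pos_x(j) - pos_x(i))**2 + (pos_y(j) - pos_y(i))**2 + &
                    (pos_z(j) - pos_z(i))**2)
           if (r > 0.0 .and. r < r_e) then
              w = r_e / r - 1.0
              !$acc atomic update
              dens(i) = dens(i) + w              ! threads of the same gang share dens(i)
           end if
        end do
     end do
  end do
end subroutine search_atomic
```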

  22. 3-D Thread: 1 bucket = 1 CUDA thread
     • The bucket traversal is mapped to threadIdx.{x,y,z}, respectively.
     • The physical background is considered in the mapping.
     • An atomic operation is used for the accumulation.
     (Figure: 3-D thread indices over the 3x3x3 bucket traversal with atomic adds; 27 threads / warp.)
     A hedged OpenACC sketch of one way to express this mapping follows below.
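OpenACC gives no direct handle on threadIdx.{x,y,z}, so the sketch below only approximates the 3-D mapping: the bucket traversal is split into three offset loops (dz, dy, dx) that are collapsed onto the vector dimension, giving one small 3x3x3 thread group per target particle. This is an assumption about how such a mapping could look in OpenACC, not the authors' implementation.

```fortran
! 3-D thread mapping sketch (approximation): the 3x3x3 bucket offsets are
! collapsed onto the vector dimension; the accumulation still needs an atomic.
subroutine search_3d(np, nx, ny, nz, max_in_bucket, r_e,           &
                     pos_x, pos_y, pos_z, pbx, pby, pbz,           &
                     bucket_count, bucket_list, dens)
  implicit none
  integer, intent(in)    :: np, nx, ny, nz, max_in_bucket
  real(4), intent(in)    :: r_e
  real(4), intent(in)    :: pos_x(np), pos_y(np), pos_z(np)
  integer, intent(in)    :: pbx(np), pby(np), pbz(np)
  integer, intent(in)    :: bucket_count(nx, ny, nz)
  integer, intent(in)    :: bucket_list(max_in_bucket, nx, ny, nz)
  real(4), intent(inout) :: dens(np)
  integer :: i, dx, dy, dz, k, j, ix, iy, iz
  real(4) :: r, w

  !$acc parallel loop gang                                         &
  !$acc&   copyin(pos_x, pos_y, pos_z, pbx, pby, pbz)              &
  !$acc&   copyin(bucket_count, bucket_list) copy(dens)
  do i = 1, np                                   ! one gang per target particle
     !$acc loop vector collapse(3) private(k, j, ix, iy, iz, r, w)
     do dz = -1, 1                               ! the three offset loops play the role
        do dy = -1, 1                            ! of threadIdx.z / .y / .x
           do dx = -1, 1
              ix = pbx(i) + dx
              iy = pby(i) + dy
              iz = pbz(i) + dz
              if (ix >= 1 .and. ix <= nx .and. iy >= 1 .and. iy <= ny .and. &
                  iz >= 1 .and. iz <= nz) then
                 !$acc loop seq
                 do k = 1, bucket_count(ix, iy, iz)
                    j = bucket_list(k, ix, iy, iz)
                    r = sqrt((pos_x(j) - pos_x(i))**2 + (pos_y(j) - pos_y(i))**2 + &
                             (pos_z(j) - pos_z(i))**2)
                    if (r > 0.0 .and. r < r_e) then
                       w = r_e / r - 1.0
                       !$acc atomic update
                       dens(i) = dens(i) + w     ! atomic add onto the shared target
                    end if
                 end do
              end if
           end do
        end do
     end do
  end do
end subroutine search_3d
```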
