An Optimized Solver for Unsteady Transonic Aerodynamics and Aeroacoustics Around Wing Profiles


1. An Optimized Solver for Unsteady Transonic Aerodynamics and Aeroacoustics Around Wing Profiles
Jean-Marie Le Gouez, Onera CFD Department
Jean-Matthieu Etancelin, ROMEO HPC Center, Université de Reims Champagne-Ardenne
Thanks to Nikolay Markovskiy, DevTech at the NVIDIA Research Center, GB, and to Carlos Carrascal, research master intern
GTC 2016, April 7th, San José, California

2. Unsteady CFD for aerodynamic profiles
• Context
• State of the art for unsteady fluid dynamics simulations of aerodynamic profiles
• Prototypes for new-generation flow solvers
• NextFlow GPU prototype: development stages, data models, programming languages, co-processing tools
• Capacity of Tesla networks for LES simulations
• Performance measurements, tracks for further optimizations
• Outlook

3. General context: CFD at Onera
The Cassiopée system for application productivity, modularity and coupling, associated with the elsA solver, partly open source.
Expectations of the external users:
• Extended simulation domains: effects of the wake on downstream components, blade-vortex interaction on helicopters, thermal loading by the reactor jets on composite structures
• Models of full systems and not only the individual components: multi-stage turbomachinery internal flows, coupling between the combustion chamber and the turbine aerodynamics, ...
• More multi-scale effects: representation of technological effects to improve the overall flow-system efficiency: grooves in the walls, local injectors for flow / acoustics control
• Advanced usage of CFD: adjoint modes for automatic shape optimization and grid refinement, uncertainty management, input parameters defined as probability density functions

4. CFD at Onera
Expectations from the internal users:
• To develop and validate state-of-the-art physical models: transition to turbulence, wall models, sub-grid closure models, flame stability
• To propose disruptive designs for aeronautics in terms of aerodynamics, propulsion integration, noise mitigation, ...
• To tackle the CFD grand challenges:
  - New classes of numerical methods, less dependent on the grids, more robust and versatile
  - Computational efficiency near the hardware design performance, high parallel scalability
Decision to launch research projects:
• On the deployment of the DG method for complex cases: AGHORA code
• On a modular multi-solver architecture within the Cassiopée set of tools associated with the Onera elsA solver

5. Improvement of predictive capabilities over the last 5 years
RANS / zonal LES of the flow around a high-lift wing (Mach 0.18, Re 1,400,000 per chord), LEISA project, Onera FUNK software
• 2009: 2D steady RANS and 3D LES, 7.5 Mpts
• 2014: optimized on a CPU architecture (MPI / OpenMP / vectorization)
  - CPU resources for 70 ms of simulation: JADE computer (CINES), CPU time allotted by GENCI
  - N_xyz ~ 2,600 Mpts, 4096 cores / 10,688 domains, T_CPU ~ 6,200,000 h, residence time: 63 days

6. NextFlow: spatially high-order finite volume method for RANS / LES
Demonstration of the feasibility of porting these algorithms to heterogeneous architectures

7. NextFlow: spatially high-order finite volume method for RANS / LES
Demonstration of the feasibility of porting these algorithms to heterogeneous architectures

8. Multi-GPU implementation of a high-order finite volume solver
Main choices: CUDA, Thrust, MVAPICH. Reasons: resource-aware programming, productivity libraries.
The hierarchy of memories corresponds to the algorithm phases:
1/ Main memory for the field and metrics variables (40 million cells on a K40, 12 GB) and for the communication buffers (halo cells for the other partitions)
2/ Shared memory at the streaming-multiprocessor level for the stencil operations
3/ Careful use of registers for the node, cell and face algorithms
Stages of the project:
• Initial porting with the same data model organization as on the CPU
• Generic refinement of coarse triangular elements with curved faces: hierarchy of grids
• Multi-GPU implementation of a highly space-parallel model: extruded in the span direction and periodic
• Ongoing work on a 3D generalization of the preceding phases: embedded grids inside a regular distribution (octree-type)
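As an illustration of this three-tier memory usage, here is a minimal, hypothetical CUDA sketch; the kernel name, field names and tile sizes are illustrative and not taken from the NextFlow sources. The field stays in global memory, each thread block stages its stencil neighbourhood in shared memory, and the per-cell update runs out of registers.

```cuda
// Hedged sketch only: assumes nCells is a multiple of CELLS_PER_BLOCK.
#define CELLS_PER_BLOCK 128
#define HALO 2

__global__ void flux_kernel(const double* __restrict__ rho,   // field in global memory
                            double* __restrict__ res,         // residual in global memory
                            int nCells)
{
    __shared__ double s_rho[CELLS_PER_BLOCK + 2 * HALO];       // stencil tile in shared memory

    int gid = blockIdx.x * CELLS_PER_BLOCK + threadIdx.x;      // global cell index
    int lid = threadIdx.x + HALO;                              // index inside the shared tile

    s_rho[lid] = rho[gid];                                     // interior cell of the tile
    if (threadIdx.x < HALO) {                                  // two halo layers at each end,
        s_rho[threadIdx.x] = rho[max(gid - HALO, 0)];          // clamped at the domain boundaries
        s_rho[lid + CELLS_PER_BLOCK] = rho[min(gid + CELLS_PER_BLOCK, nCells - 1)];
    }
    __syncthreads();

    // Per-cell work held in registers; a centred difference stands in for the
    // high-order flux reconstruction of the real solver.
    res[gid] = 0.5 * (s_rho[lid + 1] - s_rho[lid - 1]);
}

// Usage: flux_kernel<<<nCells / CELLS_PER_BLOCK, CELLS_PER_BLOCK>>>(d_rho, d_res, nCells);
```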

9. 1st approach: block structuration of a regular linear grid
• Partition the mesh into small blocks
• Map them onto the scalable structure of the GPU, one mesh block per streaming multiprocessor (SM)
(diagram: mesh blocks mapped onto SMs)
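A possible host-side rendering of this mapping, as a sketch under the assumption of a fixed number of cells per mesh block; update_block, nMeshBlocks and cellsPerBlock are illustrative names, not part of the actual solver.

```cuda
#include <cuda_runtime.h>

// One CUDA thread block per mesh block, one thread per cell of that block,
// so the hardware scheduler assigns each mesh block to one SM at a time.
__global__ void update_block(double* field)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;   // blockIdx.x = mesh block, threadIdx.x = cell
    field[cell] += 1.0;                                 // placeholder for the per-cell update
}

int main()
{
    const int nMeshBlocks   = 4096;   // illustrative partition of the grid
    const int cellsPerBlock = 256;    // cells per mesh block = threads per CUDA block

    double* d_field = nullptr;
    cudaMalloc(&d_field, (size_t)nMeshBlocks * cellsPerBlock * sizeof(double));
    cudaMemset(d_field, 0, (size_t)nMeshBlocks * cellsPerBlock * sizeof(double));

    update_block<<<nMeshBlocks, cellsPerBlock>>>(d_field);   // grid = mesh blocks, block = cells
    cudaDeviceSynchronize();
    cudaFree(d_field);
    return 0;
}
```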

10. Relative advantage of the small-block partition
• Bigger blocks provide: better occupancy, less latency due to kernel launches, fewer transfers between blocks
• Smaller blocks provide: much more data caching
(chart: L1 hit rate, normalized fluxes time and normalized overall time for block sizes of 256, 1024, 4096 and 24097 cells)
Final speedup with respect to 2 hyperthreaded Westmere CPUs: ~2

11. 2nd approach: embedded grids, hierarchical data model (NXO-GPU)
• Imposing a sub-structuration on the grid and data model (inspired by the 'tessellation' mechanism in surface rendering)
• Unique grid connectivity for the inner algorithm
• Optimal for organizing data for coalesced memory access during the algorithm and communication phases
• Each coarse element in a block is allocated to an inner thread (threadIdx.x)
• Hierarchical model for the grid: high-order (quartic polynomial) triangles generated by gmsh are refined on the GPU; the whole fine grid as such can remain unknown to the host CPU
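A hedged sketch of this hierarchical indexing, with illustrative names and sizes rather than the actual NXO-GPU data model: threadIdx.x owns one coarse element, the fine sub-cells created by the on-GPU refinement are stored sub-cell-major so that neighbouring threads touch neighbouring addresses (coalesced access), and the fine grid never has to be mirrored on the host.

```cuda
#define ELEMS_PER_BLOCK 64
#define SUBCELLS 16      // fine cells generated inside each coarse element

__global__ void refine_elements(const double* __restrict__ coarse_state,
                                double* __restrict__ fine_state,
                                int nCoarseElems)
{
    int elem = blockIdx.x * ELEMS_PER_BLOCK + threadIdx.x;   // coarse element owned by this thread
    if (elem >= nCoarseElems) return;

    for (int s = 0; s < SUBCELLS; ++s) {
        // Sub-cell-major layout: for a fixed s, consecutive threads write to
        // consecutive addresses. A plain injection stands in for the actual
        // high-order interpolation on the curved (quartic) element.
        fine_state[s * nCoarseElems + elem] = coarse_state[elem];
    }
}

// Usage: refine_elements<<<(nCoarseElems + ELEMS_PER_BLOCK - 1) / ELEMS_PER_BLOCK,
//                          ELEMS_PER_BLOCK>>>(d_coarse, d_fine, nCoarseElems);
```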

12. Code structure
• Preprocessing: mesh generation, block generation and generic refinement
• Solver:
  - Fortran: allocation and initialization of the data structures from the modified mesh file
  - Fortran: computational routines
  - C: GPU allocation and initialization binders
  - C: computational binders
  - CUDA: CUDA kernels (time stepping)
  - C: data-fetching binder
• Postprocessing: visualization and data analysis
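A minimal sketch of what such a binder layer can look like, assuming the Fortran solver calls extern "C" entry points (here with an underscore naming convention and by-reference scalars); gpu_alloc_init_, gpu_advance_, gpu_fetch_ and advance_kernel are hypothetical names, not the NextFlow routines.

```cuda
#include <cuda_runtime.h>

static double* d_field = nullptr;   // device copy of one conservative variable
static int     n_cells = 0;

__global__ void advance_kernel(double* field, double dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] += dt * 0.0;   // placeholder for the residual update
}

// GPU allocation and initialization binder, called once from Fortran
extern "C" void gpu_alloc_init_(const double* h_field, const int* n)
{
    n_cells = *n;                                         // Fortran passes scalars by reference
    cudaMalloc(&d_field, n_cells * sizeof(double));
    cudaMemcpy(d_field, h_field, n_cells * sizeof(double), cudaMemcpyHostToDevice);
}

// Computational binder: launches the CUDA kernel for one time step
extern "C" void gpu_advance_(const double* dt)
{
    int threads = 128;
    int blocks  = (n_cells + threads - 1) / threads;
    advance_kernel<<<blocks, threads>>>(d_field, *dt, n_cells);
}

// Data-fetching binder: brings the solution back for post-processing
extern "C" void gpu_fetch_(double* h_field)
{
    cudaMemcpy(h_field, d_field, n_cells * sizeof(double), cudaMemcpyDeviceToHost);
}
```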

13. Version 2: measured efficiency on a Tesla K20c (with respect to 2 Xeon 5650 CPUs, OpenMP loop-based)
• Initial results on a K20c: max. acceleration of 38 with respect to 2 Westmere sockets
• After improving the Westmere CPU efficiency (OpenMP task-based rather than inner-loop, same block data model on the CPU as well), the K20c GPU / CPU acceleration drops to 13 (1 K20c = 150 Westmere cores)
• In fact this method is memory bound, and GPU bandwidth is critical; more CPU optimisation is needed (cache blocking, vectorisation?)
• Flop count: around 80 Gflops DP per K20c. These are valuable flops, not Ax=b, but highly nonlinear Riemann-solver flops with high-order (4th, 5th) extrapolated values and characteristic splitting to avoid interference between waves, ...: this wide-stencil method requires very high memory traffic to sustain these flops
• Thanks to the NVIDIA GB DevTech group for their support: "my flop is rich"

14. Version 3: 2.5D periodic spanwise (circular-shift vectors), multi-GPU / MPI
• High CPU vectorisation (all variables are vectors of length 256 to 512) in the 3rd, homogeneous direction
• Fully data-parallel CUDA kernels with coalesced memory access
• Objective: one billion cells on a cluster with only 64 Tesla K20s or 16 K80s (40,000 cells * 512 spanwise stations per partition: 20 million cells addressed to each Tesla K20)
• The CPU (MPI / Fortran, OpenMP inner-loop-based) and GPU (GPUDirect / C / CUDA) versions are in the same executable, for efficiency and accuracy comparisons
• Coarse partitioning: the number of partitions is equal to the number of sockets / accelerators
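A hedged sketch of this 2.5D layout, with an illustrative NSPAN and a toy difference formula in place of the real flux routine: for each 2D cell (blockIdx.x), threadIdx.x runs over the spanwise stations, so loads along the homogeneous direction are contiguous and coalesced, and the periodic spanwise neighbour is obtained by a circular shift (modulo wrap).

```cuda
#define NSPAN 512          // spanwise stations per partition (illustrative)

__global__ void spanwise_flux(const double* __restrict__ q,   // layout: 2D-cell major, span fastest
                              double* __restrict__ dq)
{
    int cell = blockIdx.x;                       // index of the 2D cell
    int k    = threadIdx.x;                      // spanwise station
    int kp   = (k + 1) % NSPAN;                  // periodic neighbour, circular shift +1
    int km   = (k + NSPAN - 1) % NSPAN;          // periodic neighbour, circular shift -1

    int base = cell * NSPAN;
    dq[base + k] = 0.5 * (q[base + kp] - q[base + km]);   // toy centred difference
}

// Usage: spanwise_flux<<<nCells2D, NSPAN>>>(d_q, d_dq);
```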

15. Version 3: 2.5D periodic spanwise (cshift vectors), multi-GPU / MPI
Initial performance measurements

16. Initial kernel optimization and analysis performed by NVIDIA DevTech
• After this first optimization: performance ratio of 14 between a K40 and an 8-core Ivy Bridge socket
• Strategy for further performance optimization: increase occupancy, reduce register use, reduce the number of operations on global memory, and use the texture cache for wide read-only arrays within a kernel
• Put stencil coefficients in shared memory, use constant memory, __launch_bounds__(128, 4)
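A sketch combining some of these measures, with illustrative values and names: __launch_bounds__(128, 4) asks the compiler to keep register use compatible with 4 resident blocks of 128 threads per SM, the stencil coefficients are placed in constant memory, and the read-only field is loaded through the texture-cache path (__ldg / const __restrict__).

```cuda
#define STENCIL 5

__constant__ double c_coef[STENCIL];   // stencil coefficients, uploaded once from the host

__global__ void __launch_bounds__(128, 4)
reconstruct(const double* __restrict__ q, double* __restrict__ qface, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < STENCIL / 2 || i >= n - STENCIL / 2) return;    // skip boundary cells in this toy version

    double acc = 0.0;
    #pragma unroll
    for (int s = 0; s < STENCIL; ++s)
        acc += c_coef[s] * __ldg(&q[i + s - STENCIL / 2]);  // read-only load via the texture cache
    qface[i] = acc;
}

// Host side, once: cudaMemcpyToSymbol(c_coef, h_coef, STENCIL * sizeof(double));
// Launch example: reconstruct<<<(n + 127) / 128, 128>>>(d_q, d_qface, n);
```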

17. Next stage of optimizations (work done by Jean-Matthieu)
• Use thread collaboration to transfer stencil data from main memory to shared memory
• Refactor the kernel where the face-stencil operations are done: split it into two phases to reduce register pressure
• Use the Thrust library to classify the face and cell indices into lists, to template the kernels according to the list number and avoid internal conditional switches
• Enable an overlap between the computations in the center of the partition and the transfer of the halo cells at the periphery (MVAPICH2), by using multiple streams and a further classification of the cell and face indices into center / periphery lists (Thrust); see the host-side sketch below
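A hedged host-side sketch of this overlap under several assumptions: the cell indices are split with Thrust into an interior list and a periphery list, the interior kernel runs on its own stream while the halo buffers (assumed already packed) are exchanged through a CUDA-aware MPI (GPUDirect-capable MVAPICH2), and the periphery kernel runs once the exchange has completed. Every name here (interior_kernel, periphery_kernel, the buffers and counts) is illustrative.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/partition.h>
#include <mpi.h>

struct IsInterior {
    const int* flag;                                  // 1 = interior cell, 0 = periphery cell
    __host__ __device__ bool operator()(int c) const { return flag[c] == 1; }
};

__global__ void interior_kernel(const int* cells, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int c = cells[i];   // cell handled by this thread
        (void)c;            // placeholder for the actual flux update
    }
}

__global__ void periphery_kernel(const int* cells, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int c = cells[i];   // cell that needs the received halo data
        (void)c;            // placeholder for the actual flux update
    }
}

void time_step_overlap(thrust::device_vector<int>& cellIds, const int* d_interiorFlag,
                       double* d_sendHalo, double* d_recvHalo, int haloSize, int neighbourRank)
{
    // Classify the cell indices: interior cells first, periphery cells last.
    auto mid = thrust::partition(cellIds.begin(), cellIds.end(), IsInterior{d_interiorFlag});
    int nInterior  = static_cast<int>(mid - cellIds.begin());
    int nPeriphery = static_cast<int>(cellIds.end() - mid);
    const int* d_cells = thrust::raw_pointer_cast(cellIds.data());

    cudaStream_t sCompute, sComm;
    cudaStreamCreate(&sCompute);
    cudaStreamCreate(&sComm);

    // Interior work overlaps with the halo exchange driven from the host.
    interior_kernel<<<(nInterior + 127) / 128, 128, 0, sCompute>>>(d_cells, nInterior);

    // Device pointers passed to MPI: requires a CUDA-aware build (e.g. MVAPICH2 with GPUDirect).
    MPI_Request reqs[2];
    MPI_Irecv(d_recvHalo, haloSize, MPI_DOUBLE, neighbourRank, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(d_sendHalo, haloSize, MPI_DOUBLE, neighbourRank, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    // Periphery cells need the received halo, so they run after the exchange.
    periphery_kernel<<<(nPeriphery + 127) / 128, 128, 0, sComm>>>(d_cells + nInterior, nPeriphery);

    cudaStreamSynchronize(sCompute);
    cudaStreamSynchronize(sComm);
    cudaStreamDestroy(sCompute);
    cudaStreamDestroy(sComm);
}
```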
