An Optimized Solver for Unsteady Transonic Aerodynamics and Aeroacoustics Around Wing Profiles


1. An Optimized Solver for Unsteady Transonic Aerodynamics and Aeroacoustics Around Wing Profiles
Jean-Marie Le Gouez, Onera CFD Department
Jean-Matthieu Etancelin, ROMEO HPC Center, Université de Reims Champagne-Ardenne
Thanks to Nikolay Markovskiy, DevTech at the NVIDIA Research Center, GB, and to Carlos Carrascal, research master intern
GTC 2016, April 7th, San José, California

2. Unsteady CFD for aerodynamic profiles
• Context
• State of the art for unsteady fluid dynamics simulations of aerodynamic profiles
• Prototypes for new-generation flow solvers
• NextFlow GPU prototype: development stages, data models, programming languages, co-processing tools
• Capacity of Tesla networks for LES simulations
• Performance measurements, tracks for further optimizations
• Outlook

3. General context: CFD at Onera
The Cassiopée system for application productivity, modularity and coupling, associated with the elsA solver, partly open source.
Expectations of the external users:
• Extended simulation domains: effects of the wake on downstream components, blade-vortex interaction on helicopters, thermal loading by the reactor jets on composite structures
• Models of full systems and not only the individual components: multi-stage turbomachinery internal flows, coupling between the combustion chamber and the turbine aerodynamics, ...
• More multi-scale effects: representation of technological effects to improve the overall flow-system efficiency: grooves in the walls, local injectors for flow / acoustics control
• Advanced usage of CFD: adjoint modes for automatic shape optimization and grid refinement, uncertainty management, input parameters defined as probability density functions

4. CFD at Onera
Expectations from the internal users:
• To develop and validate state-of-the-art physical models: transition to turbulence, wall models, sub-grid closure models, flame stability
• To propose disruptive designs for aeronautics in terms of aerodynamics, propulsion integration, noise mitigation, ...
• To tackle the CFD grand challenges:
  - New classes of numerical methods, less dependent on the grids, more robust and versatile
  - Computational efficiency near the hardware design performance, high parallel scalability
Decision to launch research projects:
• On the deployment of the DG method for complex cases: AGHORA code
• On a modular multi-solver architecture within the Cassiopée set of tools associated with the Onera elsA solver

5. Improvement of predictive capabilities over the last 5 years
RANS / zonal LES of the flow around a high-lift wing (Mach 0.18, Re 1,400,000 per chord), LEISA project, Onera FUNK software
• 2009: 2D steady RANS and 3D LES, 7.5 Mpts
• 2014: optimized on a CPU architecture (MPI / OpenMP / vectorization)
  - CPU resources for 70 ms of simulation: JADE computer (CINES), CPU time allotted by GENCI
  - N_xyz ~ 2,600 Mpts, 4096 cores / 10,688 domains, T_CPU ~ 6,200,000 h, residence time: 63 days

6. NextFlow: spatially high-order finite volume method for RANS / LES
Demonstration of the feasibility of porting these algorithms to heterogeneous architectures

7. NextFlow: spatially high-order finite volume method for RANS / LES
Demonstration of the feasibility of porting these algorithms to heterogeneous architectures

8. Multi-GPU implementation of a high-order finite volume solver
Main choices: CUDA, Thrust, MVAPICH. Reasons: resource-aware programming, productivity libraries.
The hierarchy of memories corresponds to the algorithm phases:
1/ Main memory for the field and metrics variables (40 million cells on a K40, 12 GB) and for the communication buffers (halo cells for the other partitions)
2/ Shared memory at the streaming-multiprocessor level for the stencil operations
3/ Careful use of registers for the node, cell and face algorithms
Stages of the project:
• Initial porting with the same data model organization as on the CPU
• Generic refinement of coarse triangular elements with curved faces: hierarchy of grids
• Multi-GPU implementation of a highly space-parallel model: extruded in the span direction and periodic
• Ongoing work on a 3D generalization of the preceding phases: embedded grids inside a regular distribution (octree-type)
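As an illustration of this three-tier memory usage, here is a minimal, hypothetical CUDA sketch; the kernel name, field names and tile sizes are illustrative and not taken from the NextFlow sources. The field stays in global memory, each thread block stages its stencil neighbourhood in shared memory, and the per-cell update runs out of registers.

```cuda
// Hedged sketch only: assumes nCells is a multiple of CELLS_PER_BLOCK.
#define CELLS_PER_BLOCK 128
#define HALO 2

__global__ void flux_kernel(const double* __restrict__ rho,   // field in global memory
                            double* __restrict__ res,         // residual in global memory
                            int nCells)
{
    __shared__ double s_rho[CELLS_PER_BLOCK + 2 * HALO];       // stencil tile in shared memory

    int gid = blockIdx.x * CELLS_PER_BLOCK + threadIdx.x;      // global cell index
    int lid = threadIdx.x + HALO;                              // index inside the shared tile

    s_rho[lid] = rho[gid];                                     // interior cell of the tile
    if (threadIdx.x < HALO) {                                  // two halo layers at each end,
        s_rho[threadIdx.x] = rho[max(gid - HALO, 0)];          // clamped at the domain boundaries
        s_rho[lid + CELLS_PER_BLOCK] = rho[min(gid + CELLS_PER_BLOCK, nCells - 1)];
    }
    __syncthreads();

    // Per-cell work held in registers; a centred difference stands in for the
    // high-order flux reconstruction of the real solver.
    res[gid] = 0.5 * (s_rho[lid + 1] - s_rho[lid - 1]);
}

// Usage: flux_kernel<<<nCells / CELLS_PER_BLOCK, CELLS_PER_BLOCK>>>(d_rho, d_res, nCells);
```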

9. 1st approach: block structuration of a regular linear grid
• Partition the mesh into small blocks
• Map them onto the scalable structure of the GPU, one mesh block per streaming multiprocessor (SM)
(diagram: mesh blocks mapped onto SMs)
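A possible host-side rendering of this mapping, as a sketch under the assumption of a fixed number of cells per mesh block; update_block, nMeshBlocks and cellsPerBlock are illustrative names, not part of the actual solver.

```cuda
#include <cuda_runtime.h>

// One CUDA thread block per mesh block, one thread per cell of that block,
// so the hardware scheduler assigns each mesh block to one SM at a time.
__global__ void update_block(double* field)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;   // blockIdx.x = mesh block, threadIdx.x = cell
    field[cell] += 1.0;                                 // placeholder for the per-cell update
}

int main()
{
    const int nMeshBlocks   = 4096;   // illustrative partition of the grid
    const int cellsPerBlock = 256;    // cells per mesh block = threads per CUDA block

    double* d_field = nullptr;
    cudaMalloc(&d_field, (size_t)nMeshBlocks * cellsPerBlock * sizeof(double));
    cudaMemset(d_field, 0, (size_t)nMeshBlocks * cellsPerBlock * sizeof(double));

    update_block<<<nMeshBlocks, cellsPerBlock>>>(d_field);   // grid = mesh blocks, block = cells
    cudaDeviceSynchronize();
    cudaFree(d_field);
    return 0;
}
```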

10. Relative advantage of the small-block partition
• Bigger blocks provide: better occupancy, less latency due to kernel launches, fewer transfers between blocks
• Smaller blocks provide: much more data caching
(chart: L1 hit rate, normalized fluxes time and normalized overall time for block sizes of 256, 1024, 4096 and 24097 cells)
Final speedup with respect to 2 hyperthreaded Westmere CPUs: ~2

11. 2nd approach: embedded grids, hierarchical data model (NXO-GPU)
• Imposing a sub-structuration on the grid and data model (inspired by the 'tessellation' mechanism in surface rendering)
• Unique grid connectivity for the inner algorithm
• Optimal for organizing data for coalesced memory access during the algorithm and communication phases
• Each coarse element in a block is allocated to an inner thread (threadIdx.x)
• Hierarchical model for the grid: high-order (quartic polynomial) triangles generated by gmsh are refined on the GPU; the whole fine grid as such can remain unknown to the host CPU
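A hedged sketch of this hierarchical indexing, with illustrative names and sizes rather than the actual NXO-GPU data model: threadIdx.x owns one coarse element, the fine sub-cells created by the on-GPU refinement are stored sub-cell-major so that neighbouring threads touch neighbouring addresses (coalesced access), and the fine grid never has to be mirrored on the host.

```cuda
#define ELEMS_PER_BLOCK 64
#define SUBCELLS 16      // fine cells generated inside each coarse element

__global__ void refine_elements(const double* __restrict__ coarse_state,
                                double* __restrict__ fine_state,
                                int nCoarseElems)
{
    int elem = blockIdx.x * ELEMS_PER_BLOCK + threadIdx.x;   // coarse element owned by this thread
    if (elem >= nCoarseElems) return;

    for (int s = 0; s < SUBCELLS; ++s) {
        // Sub-cell-major layout: for a fixed s, consecutive threads write to
        // consecutive addresses. A plain injection stands in for the actual
        // high-order interpolation on the curved (quartic) element.
        fine_state[s * nCoarseElems + elem] = coarse_state[elem];
    }
}

// Usage: refine_elements<<<(nCoarseElems + ELEMS_PER_BLOCK - 1) / ELEMS_PER_BLOCK,
//                          ELEMS_PER_BLOCK>>>(d_coarse, d_fine, nCoarseElems);
```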

12. Code structure
• Preprocessing: mesh generation, block generation and generic refinement
• Solver:
  - Fortran: allocation and initialization of the data structures from the modified mesh file
  - Fortran: computational routines
  - C: GPU allocation and initialization binders
  - C: computational binders
  - CUDA: CUDA kernels (time stepping)
  - C: data-fetching binder
• Postprocessing: visualization and data analysis
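A minimal sketch of what such a binder layer can look like, assuming the Fortran solver calls extern "C" entry points (here with an underscore naming convention and by-reference scalars); gpu_alloc_init_, gpu_advance_, gpu_fetch_ and advance_kernel are hypothetical names, not the NextFlow routines.

```cuda
#include <cuda_runtime.h>

static double* d_field = nullptr;   // device copy of one conservative variable
static int     n_cells = 0;

__global__ void advance_kernel(double* field, double dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] += dt * 0.0;   // placeholder for the residual update
}

// GPU allocation and initialization binder, called once from Fortran
extern "C" void gpu_alloc_init_(const double* h_field, const int* n)
{
    n_cells = *n;                                         // Fortran passes scalars by reference
    cudaMalloc(&d_field, n_cells * sizeof(double));
    cudaMemcpy(d_field, h_field, n_cells * sizeof(double), cudaMemcpyHostToDevice);
}

// Computational binder: launches the CUDA kernel for one time step
extern "C" void gpu_advance_(const double* dt)
{
    int threads = 128;
    int blocks  = (n_cells + threads - 1) / threads;
    advance_kernel<<<blocks, threads>>>(d_field, *dt, n_cells);
}

// Data-fetching binder: brings the solution back for post-processing
extern "C" void gpu_fetch_(double* h_field)
{
    cudaMemcpy(h_field, d_field, n_cells * sizeof(double), cudaMemcpyDeviceToHost);
}
```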

13. Version 2: measured efficiency on a Tesla K20c (with respect to 2 Xeon 5650 CPUs, OpenMP loop-based)
• Initial results on a K20c: max. acceleration of 38 with respect to 2 Westmere sockets
• After improving the Westmere CPU efficiency (OpenMP task-based rather than inner-loop, same block data model on the CPU as well), the K20c GPU / CPU acceleration drops to 13 (1 K20c = 150 Westmere cores)
• In fact this method is memory bound, and GPU bandwidth is critical; more CPU optimisation is needed (cache blocking, vectorisation?)
• Flop count: around 80 Gflops DP per K20c. These are valuable flops, not Ax=b, but highly nonlinear Riemann-solver flops with high-order (4th, 5th) extrapolated values and characteristic splitting to avoid interference between waves, ...: this wide-stencil method requires very high memory traffic to sustain these flops
• Thanks to the NVIDIA GB DevTech group for their support: "my flop is rich"

14. Version 3: 2.5D periodic spanwise (circular-shift vectors), multi-GPU / MPI
• High CPU vectorisation (all variables are vectors of length 256 to 512) in the 3rd, homogeneous direction
• Fully data-parallel CUDA kernels with coalesced memory access
• Objective: one billion cells on a cluster with only 64 Tesla K20s or 16 K80s (40,000 cells * 512 spanwise stations per partition: 20 million cells addressed to each Tesla K20)
• The CPU (MPI / Fortran, OpenMP inner-loop-based) and GPU (GPUDirect / C / CUDA) versions are in the same executable, for efficiency and accuracy comparisons
• Coarse partitioning: the number of partitions is equal to the number of sockets / accelerators
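A hedged sketch of this 2.5D layout, with an illustrative NSPAN and a toy difference formula in place of the real flux routine: for each 2D cell (blockIdx.x), threadIdx.x runs over the spanwise stations, so loads along the homogeneous direction are contiguous and coalesced, and the periodic spanwise neighbour is obtained by a circular shift (modulo wrap).

```cuda
#define NSPAN 512          // spanwise stations per partition (illustrative)

__global__ void spanwise_flux(const double* __restrict__ q,   // layout: 2D-cell major, span fastest
                              double* __restrict__ dq)
{
    int cell = blockIdx.x;                       // index of the 2D cell
    int k    = threadIdx.x;                      // spanwise station
    int kp   = (k + 1) % NSPAN;                  // periodic neighbour, circular shift +1
    int km   = (k + NSPAN - 1) % NSPAN;          // periodic neighbour, circular shift -1

    int base = cell * NSPAN;
    dq[base + k] = 0.5 * (q[base + kp] - q[base + km]);   // toy centred difference
}

// Usage: spanwise_flux<<<nCells2D, NSPAN>>>(d_q, d_dq);
```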

15. Version 3: 2.5D periodic spanwise (cshift vectors), multi-GPU / MPI
Initial performance measurements

16. Initial kernel optimization and analysis performed by NVIDIA DevTech
• After this first optimization: performance ratio of 14 between a K40 and an 8-core Ivy Bridge socket
• Strategy for further performance optimization: increase occupancy, reduce register use, reduce the number of operations on global memory, and use the texture cache for wide read-only arrays within a kernel
• Put stencil coefficients in shared memory, use constant memory, __launch_bounds__(128, 4)
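A sketch combining some of these measures, with illustrative values and names: __launch_bounds__(128, 4) asks the compiler to keep register use compatible with 4 resident blocks of 128 threads per SM, the stencil coefficients are placed in constant memory, and the read-only field is loaded through the texture-cache path (__ldg / const __restrict__).

```cuda
#define STENCIL 5

__constant__ double c_coef[STENCIL];   // stencil coefficients, uploaded once from the host

__global__ void __launch_bounds__(128, 4)
reconstruct(const double* __restrict__ q, double* __restrict__ qface, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < STENCIL / 2 || i >= n - STENCIL / 2) return;    // skip boundary cells in this toy version

    double acc = 0.0;
    #pragma unroll
    for (int s = 0; s < STENCIL; ++s)
        acc += c_coef[s] * __ldg(&q[i + s - STENCIL / 2]);  // read-only load via the texture cache
    qface[i] = acc;
}

// Host side, once: cudaMemcpyToSymbol(c_coef, h_coef, STENCIL * sizeof(double));
// Launch example: reconstruct<<<(n + 127) / 128, 128>>>(d_q, d_qface, n);
```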

17. Next stage of optimizations (work done by Jean-Matthieu)
• Use thread collaboration to transfer stencil data from main memory to shared memory
• Refactor the kernel where the face-stencil operations are done: split it into two phases to reduce register pressure
• Use the Thrust library to classify the face and cell indices into lists, to template the kernels according to the list number and avoid internal conditional switches
• Enable an overlap between the computations in the center of the partition and the transfer of the halo cells at the periphery (MVAPICH2), by using multiple streams and a further classification of the cell and face indices into center / periphery lists (Thrust); see the host-side sketch below
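A hedged host-side sketch of this overlap under several assumptions: the cell indices are split with Thrust into an interior list and a periphery list, the interior kernel runs on its own stream while the halo buffers (assumed already packed) are exchanged through a CUDA-aware MPI (GPUDirect-capable MVAPICH2), and the periphery kernel runs once the exchange has completed. Every name here (interior_kernel, periphery_kernel, the buffers and counts) is illustrative.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/partition.h>
#include <mpi.h>

struct IsInterior {
    const int* flag;                                  // 1 = interior cell, 0 = periphery cell
    __host__ __device__ bool operator()(int c) const { return flag[c] == 1; }
};

__global__ void interior_kernel(const int* cells, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int c = cells[i];   // cell handled by this thread
        (void)c;            // placeholder for the actual flux update
    }
}

__global__ void periphery_kernel(const int* cells, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int c = cells[i];   // cell that needs the received halo data
        (void)c;            // placeholder for the actual flux update
    }
}

void time_step_overlap(thrust::device_vector<int>& cellIds, const int* d_interiorFlag,
                       double* d_sendHalo, double* d_recvHalo, int haloSize, int neighbourRank)
{
    // Classify the cell indices: interior cells first, periphery cells last.
    auto mid = thrust::partition(cellIds.begin(), cellIds.end(), IsInterior{d_interiorFlag});
    int nInterior  = static_cast<int>(mid - cellIds.begin());
    int nPeriphery = static_cast<int>(cellIds.end() - mid);
    const int* d_cells = thrust::raw_pointer_cast(cellIds.data());

    cudaStream_t sCompute, sComm;
    cudaStreamCreate(&sCompute);
    cudaStreamCreate(&sComm);

    // Interior work overlaps with the halo exchange driven from the host.
    interior_kernel<<<(nInterior + 127) / 128, 128, 0, sCompute>>>(d_cells, nInterior);

    // Device pointers passed to MPI: requires a CUDA-aware build (e.g. MVAPICH2 with GPUDirect).
    MPI_Request reqs[2];
    MPI_Irecv(d_recvHalo, haloSize, MPI_DOUBLE, neighbourRank, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(d_sendHalo, haloSize, MPI_DOUBLE, neighbourRank, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    // Periphery cells need the received halo, so they run after the exchange.
    periphery_kernel<<<(nPeriphery + 127) / 128, 128, 0, sComm>>>(d_cells + nInterior, nPeriphery);

    cudaStreamSynchronize(sCompute);
    cudaStreamSynchronize(sComm);
    cudaStreamDestroy(sCompute);
    cudaStreamDestroy(sComm);
}
```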
