Asynchronous Distributed-Memory Task-Parallel Algorithm for Compressible Flows on 3D Unstructured Grids
J. Bakosi, M. Charest, A. Pandare, J. Waltz
Los Alamos National Laboratory, Los Alamos, NM, USA
October 20, 2020
LA-UR-20-28309
Project goals
◮ Large-scale Computational Fluid Dynamics (CFD) capability
◮ Simulation use cases
  ◮ shocked flow over surrogate reentry bodies
  ◮ blast loading on vehicles or other complex structures
  ◮ weapons effects calculations in urban environments
◮ Distinguishing characteristics
  ◮ external flows over complex 3D geometries
  ◮ high-speed compressible flow
◮ Capability requirements compared to internal flow calculations
  ◮ complex domain must be explicitly meshed (rather than modeled)
  ◮ multiple orders of magnitude larger computational meshes
  ◮ larger demand for HPC: O(10^9) cells, O(10^4) CPUs must be routine calculations
Quinoa::Inciter: Built on Charm++
◮ Compressible hydro (single or multiple materials)
◮ Unstructured 3D (tetrahedra-only) grids
◮ Continuous and discontinuous Galerkin finite elements
◮ Adaptive: mesh refinement (WIP), polynomial-degree refinement
◮ Native Charm++ code interoperating with MPI libraries
◮ Overdecomposition
◮ Parallel I/O
◮ SMP and non-SMP builds
◮ Automatic load balancing
◮ Open source: quinoacomputing.org
Quinoa::Inciter: ALECG hydro scheme, numerical method
◮ Edge-based finite element (or node-centered finite volume) method
◮ Compressible single-material (Euler, ideal gas) flow
  \frac{\partial U}{\partial t} + \frac{\partial F_j}{\partial x_j} = 0, \qquad
  U = \begin{pmatrix} \rho \\ \rho u_i \\ \rho E \end{pmatrix}, \qquad
  F_j = \begin{pmatrix} \rho u_j \\ \rho u_i u_j + p\delta_{ij} \\ u_j(\rho E + p) \end{pmatrix}
◮ Galerkin lumped-mass, locally conservative formulation
  \frac{\mathrm{d} U_v}{\mathrm{d} t} = -\frac{1}{V^v} \left[ \sum_{vw \in v} D^{vw}_j F^{vw}_j + \sum_{vw \in v} B^{vw}_j \left( F^v_j + F^w_j \right) + B^v_j F^v_j \right]
  U(\vec{x}) = \sum_{v \in \Omega_h} N^v(\vec{x})\, U_v, \qquad
  D^{vw}_j = \frac{1}{2} \sum_{\Omega_h \ni vw} \int_{\Omega_h} \left( N^v \frac{\partial N^w}{\partial x_j} - N^w \frac{\partial N^v}{\partial x_j} \right) \mathrm{d}\Omega
  B^{vw}_j = \frac{1}{2} \sum_{\Gamma_h \ni vw} \int_{\Gamma_h} N^v N^w n_j \,\mathrm{d}\Gamma, \qquad
  B^v_j = \sum_{\Gamma_h \ni v} \int_{\Gamma_h} N^v N^v n_j \,\mathrm{d}\Gamma
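To make the edge-based structure of the formulation above concrete, a minimal C++ sketch of the domain-integral assembly is given below. The names and data layout are hypothetical (not Quinoa's actual API): for each unique edge vw, the precomputed coefficient D^{vw}_j is contracted with an edge flux built from the two nodal fluxes and scattered to both end points with opposite signs, using the antisymmetry D^{wv}_j = -D^{vw}_j.

// Hedged sketch of an edge-based domain-integral loop (hypothetical names).
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t ncomp = 5;              // rho, rho*u_i (3), rho*E
using Vec5 = std::array<double, ncomp>;

// edges : unique edge list, each entry holds the two end-point node ids (v, w)
// D     : edge coefficients D^{vw}_j, one 3-vector per unique edge
// F     : nodal fluxes F^v_j, three 5-component flux vectors per mesh node
// rhs   : accumulated right-hand side per mesh node (divided by V^v elsewhere)
void domainIntegral( const std::vector<std::array<std::size_t,2>>& edges,
                     const std::vector<std::array<double,3>>& D,
                     const std::vector<std::array<Vec5,3>>& F,
                     std::vector<Vec5>& rhs )
{
  for (std::size_t e = 0; e < edges.size(); ++e) {
    const auto v = edges[e][0], w = edges[e][1];
    for (std::size_t c = 0; c < ncomp; ++c) {
      double f = 0.0;
      for (std::size_t j = 0; j < 3; ++j)     // contract over spatial index j
        f += D[e][j] * ( F[v][j][c] + F[w][j][c] );
      rhs[v][c] -= f;                          // scatter to both edge end points,
      rhs[w][c] += f;                          // exploiting D^{wv}_j = -D^{vw}_j
    }
  }
}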
Quinoa::Inciter: ALECG hydro scheme, References I
[1] J. Waltz, N.R. Morgan, T.R. Canfield, M.R.J. Charest, L.D. Risinger, and J.G. Wohlbier. A three-dimensional finite element arbitrary Lagrangian-Eulerian method for shock hydrodynamics on unstructured grids. Computers & Fluids, 92:172–187, 2014.
[2] J. Waltz, T.R. Canfield, N.R. Morgan, L.D. Risinger, and J.G. Wohlbier. Verification of a three-dimensional unstructured finite element method using analytic and manufactured solutions. Computers & Fluids, 81:57–67, 2013.
[3] J. Waltz, T.R. Canfield, N.R. Morgan, L.D. Risinger, and J.G. Wohlbier. Manufactured solutions for the three-dimensional Euler equations with relevance to Inertial Confinement Fusion. J. Comp. Phys., 267:196–209, 2014.
Solution verification: Vortical flow
Figure: Left: initial (first column) and final (second column) velocity, pressure (third column), and total energy (fourth column) distributions. Right: L2 errors of ρ, ρu1, ρu2, ρu3, and ρE as a function of mesh resolution h, compared to a 2nd-order reference slope.
Solution verification: Sedov
Figure: Left: density vs. x on four successively refined meshes (Mesh 1–4) compared to the semi-analytic solution. Right: L1 error of density as a function of mesh resolution h (slope = 0.9592), compared to a 1st-order reference slope.
Solution validation: square cavity, domain and initial conditions
Figure: Domain and initial conditions (regions labeled State 1 and State 2) for the square cavity problem. Dimensions are in cm.
Solution validation: square cavity, solution with experimental data
Figure: Solutions on successively finer meshes for the square cavity problem. Lines S1, Sr1, and Sr2 denote experimental shock positions.
Solution validation: ONERA M6 wing, mesh and numerical solution
Figure: Top – upper and lower surface mesh used for the ONERA M6 wing configuration. Bottom – computed pressure contours on the upper and lower surfaces.
Solution validation: ONERA M6 wing, simulation & experiments
Figure: Comparison between the computed (coarse and finer meshes) and experimental surface pressure coefficient (-Cp vs. x/c) for the ONERA M6 wing section at 20%, 44%, 65%, 80%, 90%, and 95% semispan.
Quinoa::Inciter: ALECG, on-node performance

Time step profile:
  phase     µs          %
  rhs       8482724     91
  bgrad     34333       0.4
  diag      48549       0.5
  solve     40355       0.4
  total     27830000    100

RHS profile:
  phase     µs          %
  grad      1109746     51
  domain    677741      30
  bnd       2565        0.1
  src       413999      19
  total     2183459     100
Quinoa::Inciter: ALECG, on-node performance improvements
1. Remove unnecessary code generating unused derived data structures: 1.6x.
2. Replace a tree-based data structure with a flat one, enabling streaming-style (contiguous) access to the normals associated with edges (see the sketch after this list): 1.3x.
3. Re-write the domain integral from a nested loop (over mesh points, and over the edges connected to each point) as a single loop over unique edges: 1.3x.
4. Optimize data access in the source term: 1.4x.
5. Re-write the loop computing primitive-variable gradients from a gather-scatter loop over elements to a nested loop over mesh points with an inner loop over the edges connected to each point: 1.5x.
Altogether: 6.2x speedup.
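Item 2, for example, amounts to the kind of change sketched below. The types and function names are hypothetical, not Quinoa's actual code: a tree-based associative container keyed on edge end points is replaced by a flat array indexed by a contiguous edge id, so the edge-normal data is streamed through contiguously in the edge loop instead of being looked up by pointer chasing.

// Hedged sketch of improvement 2 (hypothetical names and types).
#include <array>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Edge = std::pair<std::size_t, std::size_t>;   // end points (v, w), v < w

// Before: O(log n) tree lookup per edge, scattered memory accesses.
double sumNormalsTree( const std::vector<Edge>& edges,
                       const std::map<Edge, std::array<double,3>>& normal )
{
  double s = 0.0;
  for (const auto& e : edges) {
    const auto& n = normal.at(e);              // tree traversal on every access
    s += n[0] + n[1] + n[2];
  }
  return s;
}

// After: normals stored in edge-id order; the loop touches memory contiguously.
double sumNormalsFlat( const std::vector<std::array<double,3>>& normal )
{
  double s = 0.0;
  for (const auto& n : normal)                 // streaming, cache-friendly access
    s += n[0] + n[1] + n[2];
  return s;
}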
Quinoa::Inciter: 3 hydro schemes, strong scaling
Figure: Strong scaling of single-material hydro on a 794M-cell mesh (100 time steps, no I/O): wall-clock time (s) vs. number of CPUs (36/node) for the CG, DG(P1), and ALECG schemes, each in SMP and non-SMP mode, compared to ideal scaling.
Quinoa::Inciter: Parallel load imbalance triggered by physics
Figure: Spatial distributions of the extra load in cells whose fluid density exceeds 1.5, during the time evolution of the Sedov problem: (left) shortly after the onset of load imbalance, (right) at a later time in the simulation.
Quinoa::Inciter: Automatic load balancing yields 10x speedup
Figure: Grind-time (ms/time step) during time stepping of a Sedov problem with imposed load imbalance, using various built-in Charm++ load balancers (GreedyCommLB, DistributedLB, NeighborLB) and degrees of virtualization (virt = 0, 10x, 100x), compared to runs with no extra load and no load balancing. Run on 10 compute nodes with 36 CPUs/node.
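The speedups above rely on Charm++'s measurement-based load balancing. A minimal, hedged sketch of how a chare array element might opt into it is shown below; this is not Quinoa's actual source, and the class, entry-method, and file names are assumptions. It uses the standard AtSync protocol: the element declares usesAtSync, calls AtSync() at a synchronization point, and the runtime may migrate it before ResumeFromSync() is invoked. A matching worker.ci interface file (from which worker.decl.h / worker.def.h are generated by charmc) is assumed, e.g.:

//   module worker {
//     array [1D] Worker {
//       entry Worker();
//       entry void nextStep();
//     };
//   };

#include "worker.decl.h"

class Worker : public CBase_Worker {
 public:
  Worker() {
    usesAtSync = true;                   // enable AtSync-based load balancing
  }
  explicit Worker( CkMigrateMessage* m ) : CBase_Worker( m ) {}  // migration ctor

  void nextStep() {
    // ... compute one time step's worth of work on this mesh partition ...
    AtSync();                            // hand control to the load balancer;
                                         // the runtime may migrate this object
  }

  void ResumeFromSync() override {
    thisProxy[ thisIndex ].nextStep();   // continue time stepping after LB
  }

  void pup( PUP::er& p ) override {
    CBase_Worker::pup( p );
    // serialize all state needed to migrate this element
    // (mesh chunk, solution arrays, ...), elided in this sketch
  }
};

#include "worker.def.h"

The balancer itself is then selected at run time, e.g. with +balancer GreedyCommLB on the command line, which matches how the curves in the figure above are distinguished.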
Current and future work
1. Multi-material FV/DG at large scales
2. P-adaptation
3. Productization (SBIR, PI: Charmworks)
4. 3D mesh-to-mesh solution transfer toward large-scale fluid-structure interaction (see next talk by Eric Mikida)