ExaDG: High-order discontinuous Galerkin for the exa-scale Guido - PowerPoint PPT Presentation

Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen ExaDG: High-order discontinuous Galerkin for the exa-scale Guido Kanschat 1 Katharina Kormann 2 Martin Kronbichler 3 Wolfgang A. Wall 3 1 Interdisziplin¨ ares Institut f¨ ur Wissenschaftliches Rechnen, Universit¨ at Heidelberg 2 Max–Planck–Institut f¨ ur Plasmaphysik 3 Institute for Computational Mechanics Technische Universit¨ at M¨ unchen January 27, 2016 ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary

Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Outline ExaDG: Operator evaluation framework Matrix-free algorithm Parallel adaptive multigrid Applications Summary ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary

Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen ExaDG: Motivation of project ◮ Modern PDE simulators spend about 60–95 % of CPU time on matrix-vector products; unstructured/adaptive case typically done with sparse matrices ◮ Bottleneck: memory-limited, only 1–5 % of arithmetic peak ◮ Matrix-free alternative based on cellwise integration 1 ◮ (Much) lower memory transfer 10 -5 SpMV usual SpMV stat cond matrix-free general ◮ CPUs: 30–70 % of arithmetic peak matrix-free Cartesian 10 -6 Time [s] / DoF 2–5 times as fast for Q 2 elements on ◮ 10 -7 petascale machine (SuperMUC) ◮ Medium/high order 2 ≤ p ≤ 6 honors 10 -8 1 2 3 4 5 6 7 8 local computations and cache hierarchy Element degree Laplacian on continuous elements rather than data transfer CPU: Xeon E5 (SNB @ SuperMUC) Matrix-free implementation of general-purpose PDE operator evaluation promising for the exascale 1 Kronbichler, Kormann, Comput. Fluids, 2012 ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary

Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Generic operator evaluation framework ◮ Integrated in deal.II finite element library ( www.dealii.org ) ◮ Target generic weak forms on quadrilaterals, input by user: operator in one quadrature point → C++ templated function ◮ Mesh loop according to specific algorithm structure (operator evaluation on active cells, evaluation on geometric multigrid levels, multigrid smoothing on cell patches) ◮ Library should handle loops in a hardware-specific way ◮ Loop over quadrature points and cells/faces/vertices ◮ Full support for mesh adaptivity with hanging nodes ( L 2 -, H div -, H curl -, H 1 -conforming elements) ◮ Parallelization: MPI (domain decomposition), threads ◮ Vectorization ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary

Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Infrastructure for parallel meshes in deal.II ◮ deal.II implements quad/hex only meshes, forest-of-trees concept ◮ deal.II integrates functionality of the p4est library for dynamically adapted meshes based on hierarchical refinement into trees (with hanging nodes) 2 ◮ Each processor has its view of mesh (refined from coarse mesh) 3 ◮ Scaling results in deal.II to 65,536 cores ( p4est has been scaled to 1.5m MPI ranks) P 0 P 1 P 2 P 3 2 www.p4est.org 3 W. Bangerth, C. Burstedde, T. Heister, M. Kronbichler, ACM TOMS 38(2), 2011 ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary

Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Faster matrix-vector products for high order elements Matrix-vector product Matrix-free algorithm: matrix-based: ◮ v = 0  � P T A = K A K P K (assembly) ◮ loop over cells  K ∈{ cells } (i) Extract local vector v = Au  values on cell: u K = P K u (ii) Apply operation locally on cell: v K = A K u K matrix-free: ( without forming A K ; sum factorization, like � P T v = K A K ( P K u ) EXA-DUNE) K ∈{ cells } (iii) Sum results from (ii) into the global solution vector: v = v + P T K v K ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary

Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Characterization matrix-free vs. sparse matrices Operation count per degree of freedom Memory per degree of freedom ( ≈ (arithmetics) main memory transfer) 1000 SpMV SpMV 10000 900 matrix-free general matrix-free matrix-free Cartesian matrix-free GL Arithmetic operations / DoF 800 Memory [bytes] / DoF 700 1000 600 500 100 400 300 10 200 100 1 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Element degree Element degree 3D Laplacian, hexahedral elements Q p , periodic b.c. (= no boundary), pre-evaluated Jacobians on all quadrature points, “even-odd decomposition” to decrease work for 1D kernels � p from ( p + 1) 2 to ( p + 1)( � + 2) 2 Sparse matrix-vector product is memory bandwidth bound → speedup for Q 2 and Q 3 : trade memory transfer for arithmetics ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary

Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Performance continuous elements 1000 SpMV SpMV 900 matrix-free 10000 matrix-free general matrix-free GL matrix-free Cartesian Arithmetic operations / DoF 800 Memory [bytes] / DoF 700 1000 600 + = 500 100 400 300 10 200 100 1 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Element degree Element degree 10 -5 SpMV usual 3D Laplacian, Q p SpMV stat cond elements, Cartesian matrix-free general matrix-free Cartesian & curved geometry 10 -6 Time [s] / DoF Intel Xeon E5 Sandy Bridge 2.7 GHz (Su- perMUC) 10 -7 time per degree of freedom (DoF) per core (but node fully loaded) 10 -8 1 2 3 4 5 6 7 8 Element degree ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary

Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Performance discontinuous Galerkin 10 -5 SpMV usual 3D Laplacian, Q p matrix-free curved elements, symmetric matrix-free Cartesian interior penalty DG , Cartesian & curved 10 -6 Time [s] / DoF geometry Intel Xeon E5 Sandy Bridge 2.7 GHz (Su- 10 -7 perMUC) time per degree of freedom per 10 -8 core (but node fully 1 2 3 4 5 6 7 8 loaded) Element degree ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary

Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Characteristics of element kernels Operator evaluation: 30–70% of arithmetic peak ◮ Vectorization over elements . . . ◮ data layout: array-of-struct-of-array vmovapd 0x988(%rsp),%ymm4 ◮ intrinsics via C++ template class vmovapd 0xbc8(%rsp),%ymm6 vmovapd 0x9a8(%rsp),%ymm14 vsubpd %ymm4,%ymm6,%ymm11 VectorizedArray<double> vmulpd %ymm15,%ymm6,%ymm3 ◮ everything but vector read/write fully vmulpd %ymm6,%ymm13,%ymm8 vfmadd231pd %ymm13,%ymm4,%ymm3 vectorized (DG: fully vectorized) vfmadd231pd %ymm4,%ymm15,%ymm8 vmulpd %ymm6,%ymm12,%ymm10 ◮ Loop bounds known at compile time vmovapd 0xbe8(%rsp),%ymm0 vmovapd 0x268(%rsp),%ymm9 ◮ element degree and number of 1D vfmsub231pd %ymm9,%ymm4,%ymm10 vmulpd %ymm9,%ymm6,%ymm9 quadrature points as templates vmovapd 0xaa8(%rsp),%ymm2 ◮ unroll loops of length ⌊ p / 2 ⌋ + 1 or p + 1 vfmsub231pd %ymm4,%ymm12,%ymm9 vmovapd %ymm11,0x748(%rsp) ◮ Challenge: Mix of operations vmovapd 0xac8(%rsp),%ymm1 vmulpd %ymm15,%ymm0,%ymm5 ◮ memory intensive: vector read/write, vmulpd %ymm0,%ymm13,%ymm4 vfmadd231pd %ymm7,%ymm2,%ymm3 pre-evaluated Jacobians, coefficients, etc. vfmadd231pd %ymm7,%ymm2,%ymm8 vfmadd231pd %ymm13,%ymm14,%ymm5 on quadrature points vmovapd %ymm3,0x2c8(%rsp) ◮ arithmetic intensive: sum factorization vsubpd %ymm14,%ymm0,%ymm3 . . . → Develop performance model! ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary

Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Project goal: Parallel adaptive multigrid Implicit solvers for large-scale problems ◮ Many established solvers need more than just operator evaluation (mat-vec) ◮ Matrix-free evaluation requires specific preconditioners ◮ Simple example: Geometric multigrid with polynomial (Chebyshev) smoother ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary

Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Project goal: Parallel adaptive multigrid Implicit solvers for large-scale problems ◮ Many established solvers need more than just operator evaluation (mat-vec) ◮ Matrix-free evaluation requires specific preconditioners ◮ Simple example: Geometric multigrid with polynomial (Chebyshev) smoother ◮ Goal: GMG with overlapping Schwarz smoothers for high order bases ◮ Schwarz smoothers approximately invert differential operator on patches of elements by tensorial techniques ◮ Integrate with MPI and domain decomposition ◮ Combine with algebraic multigrid for coarse meshes with millions of elements, using low-order bases ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary

ExaDG: High-order discontinuous Galerkin for the exa-scale Guido - PowerPoint PPT Presentation

Institute for Computational Mechanics Martin Kronbichler Technische Universit at M unchen ExaDG: High-order discontinuous Galerkin for the exa-scale Guido Kanschat 1 Katharina Kormann 2 Martin Kronbichler 3 Wolfgang A. Wall 3 1

EDGE: Extreme Scale Fused Seismic Simulations with the Discontinuous Galerkin Method Alexander

Novel non-hydrostatic numerical schemes based on discontinuous Galerkin finite element method

The Discontinuous Galerkin Method for the Compressible Navier-Stokes Equations Miloslav

A reconstruction-enhanced discontinuous Galerkin method for hyperbolic problems V aclav Ku

Unsteady CFD Optimization using High-Order Discontinuous Galerkin Finite Element Methods Matthew

Entropy stable high order discontinuous Galerkin methods for hyperbolic conservation laws

A High-Order Discontinuous Galerkin Method for the Unsteady Incompressible Navier-Stokes

High-Order, Time-Dependent Aerodynamic Optimization using a Discontinuous Galerkin Discretization

Application of the discontinuous Galerkin method to 3D compressible RANS simulations of a high

Exa & Yotta Scale Data SC 08 Panel November 21 2008, Austin, TX Garth Gibson Carnegie

Adaptive discontinuous Galerkin method for tsunami modeling and prediction on a global scale Jan

EDGE: Extreme-scale Discontinuous Galerkin Environment Alexander Breuer, Alexander Heinecke

Space-Time Discontinuous Galerkin Discretizations for Linear First-Order Hyperbolic Evolution

The energy based discontinuous Galerkin method Daniel Appel Applied Mathematics University of

Challenges and Solutions for Peta- and Exa-Sacle Programming Tasuku Hiraishi Academic Center for

A Multiscale Discontinuous Galerkin Method P. Bochev Computational Mathematics and Algorithms

Luiz Velhos Legacy on Geometry Processing Prof. Afonso Paiva ICMC-USP, S ao Carlos, SP,

GEOMetrics Exploiting Geometric Structure for Graph-Encoded Objects Edward Smith, Scott Fujimoto,

Deformable Model with Adaptive Mesh and Automated Topology Changes Jacques-Olivier Lachaud

Implementation of Anisotropic Mesh Refinement in OpenFOAM Jonas Karlsson Hydrodynamics Group

High Order Methods with Adaptive Mesh Refinement for the Solution of the Relativistic MHD

The DWR method for numerical simulations related to nuclear waste disposals Roland Becker

Estimating Baryonic Properties of Galaxies from Their Dark Matter universetoday.com MPA/Springel

An Adaptive Pursuit Strategy for Allocating Operator Probabilities Dirk Thierens Department of

ExaDG: High-order discontinuous Galerkin for the exa-scale Guido - PowerPoint PPT Presentation

Institute for Computational Mechanics Martin Kronbichler Technische Universit at M unchen ExaDG: High-order discontinuous Galerkin for the exa-scale Guido Kanschat 1 Katharina Kormann 2 Martin Kronbichler 3 Wolfgang A. Wall 3 1

EDGE: Extreme Scale Fused Seismic Simulations with the Discontinuous Galerkin Method Alexander

Novel non-hydrostatic numerical schemes based on discontinuous Galerkin finite element method

The Discontinuous Galerkin Method for the Compressible Navier-Stokes Equations Miloslav

A reconstruction-enhanced discontinuous Galerkin method for hyperbolic problems V aclav Ku

Unsteady CFD Optimization using High-Order Discontinuous Galerkin Finite Element Methods Matthew

Entropy stable high order discontinuous Galerkin methods for hyperbolic conservation laws

A High-Order Discontinuous Galerkin Method for the Unsteady Incompressible Navier-Stokes

High-Order, Time-Dependent Aerodynamic Optimization using a Discontinuous Galerkin Discretization

Application of the discontinuous Galerkin method to 3D compressible RANS simulations of a high

Exa &amp; Yotta Scale Data SC 08 Panel November 21 2008, Austin, TX Garth Gibson Carnegie

Adaptive discontinuous Galerkin method for tsunami modeling and prediction on a global scale Jan

EDGE: Extreme-scale Discontinuous Galerkin Environment Alexander Breuer, Alexander Heinecke

Space-Time Discontinuous Galerkin Discretizations for Linear First-Order Hyperbolic Evolution

The energy based discontinuous Galerkin method Daniel Appel Applied Mathematics University of

Challenges and Solutions for Peta- and Exa-Sacle Programming Tasuku Hiraishi Academic Center for

A Multiscale Discontinuous Galerkin Method P. Bochev Computational Mathematics and Algorithms

Luiz Velhos Legacy on Geometry Processing Prof. Afonso Paiva ICMC-USP, S ao Carlos, SP,

GEOMetrics Exploiting Geometric Structure for Graph-Encoded Objects Edward Smith, Scott Fujimoto,

Deformable Model with Adaptive Mesh and Automated Topology Changes Jacques-Olivier Lachaud

Implementation of Anisotropic Mesh Refinement in OpenFOAM Jonas Karlsson Hydrodynamics Group

High Order Methods with Adaptive Mesh Refinement for the Solution of the Relativistic MHD

The DWR method for numerical simulations related to nuclear waste disposals Roland Becker

Estimating Baryonic Properties of Galaxies from Their Dark Matter universetoday.com MPA/Springel

An Adaptive Pursuit Strategy for Allocating Operator Probabilities Dirk Thierens Department of

Exa & Yotta Scale Data SC 08 Panel November 21 2008, Austin, TX Garth Gibson Carnegie