Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen ExaDG: High-order discontinuous Galerkin for the exa-scale Guido Kanschat 1 Katharina Kormann 2 Martin Kronbichler 3 Wolfgang A. Wall 3 1 Interdisziplin¨ ares Institut f¨ ur Wissenschaftliches Rechnen, Universit¨ at Heidelberg 2 Max–Planck–Institut f¨ ur Plasmaphysik 3 Institute for Computational Mechanics Technische Universit¨ at M¨ unchen January 27, 2016 ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary
Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Outline ExaDG: Operator evaluation framework Matrix-free algorithm Parallel adaptive multigrid Applications Summary ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary
Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen ExaDG: Motivation of project ◮ Modern PDE simulators spend about 60–95 % of CPU time on matrix-vector products; unstructured/adaptive case typically done with sparse matrices ◮ Bottleneck: memory-limited, only 1–5 % of arithmetic peak ◮ Matrix-free alternative based on cellwise integration 1 ◮ (Much) lower memory transfer 10 -5 SpMV usual SpMV stat cond matrix-free general ◮ CPUs: 30–70 % of arithmetic peak matrix-free Cartesian 10 -6 Time [s] / DoF 2–5 times as fast for Q 2 elements on ◮ 10 -7 petascale machine (SuperMUC) ◮ Medium/high order 2 ≤ p ≤ 6 honors 10 -8 1 2 3 4 5 6 7 8 local computations and cache hierarchy Element degree Laplacian on continuous elements rather than data transfer CPU: Xeon E5 (SNB @ SuperMUC) Matrix-free implementation of general-purpose PDE operator eval- uation promising for the exascale 1 Kronbichler, Kormann, Comput. Fluids, 2012 ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary
Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Generic operator evaluation framework ◮ Integrated in deal.II finite element library ( www.dealii.org ) ◮ Target generic weak forms on quadrilaterals, input by user: operator in one quadrature point → C++ templated function ◮ Mesh loop according to specific algorithm structure (operator evaluation on active cells, evaluation on geometric multigrid levels, multigrid smoothing on cell patches) ◮ Library should handle loops in a hardware-specific way ◮ Loop over quadrature points and cells/faces/vertices ◮ Full support for mesh adaptivity with hanging nodes ( L 2 -, H div -, H curl -, H 1 -conforming elements) ◮ Parallelization: MPI (domain decomposition), threads ◮ Vectorization ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary
Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Infrastructure for parallel meshes in deal.II ◮ deal.II implements quad/hex only meshes, forest-of-trees concept ◮ deal.II integrates functionality of the p4est library for dynamically adapted meshes based on hierarchical refinement into trees (with hanging nodes) 2 ◮ Each processor has its view of mesh (refined from coarse mesh) 3 ◮ Scaling results in deal.II to 65,536 cores ( p4est has been scaled to 1.5m MPI ranks) P 0 P 1 P 2 P 3 2 www.p4est.org 3 W. Bangerth, C. Burstedde, T. Heister, M. Kronbichler, ACM TOMS 38(2), 2011 ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary
Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Faster matrix-vector products for high order elements Matrix-vector product Matrix-free algorithm: matrix-based: ◮ v = 0 � P T A = K A K P K (assembly) ◮ loop over cells K ∈{ cells } (i) Extract local vector v = Au values on cell: u K = P K u (ii) Apply operation locally on cell: v K = A K u K matrix-free: ( without forming A K ; sum factorization, like � P T v = K A K ( P K u ) EXA-DUNE) K ∈{ cells } (iii) Sum results from (ii) into the global solution vector: v = v + P T K v K ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary
Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Characterization matrix-free vs. sparse matrices Operation count per degree of freedom Memory per degree of freedom ( ≈ (arithmetics) main memory transfer) 1000 SpMV SpMV 10000 900 matrix-free general matrix-free matrix-free Cartesian matrix-free GL Arithmetic operations / DoF 800 Memory [bytes] / DoF 700 1000 600 500 100 400 300 10 200 100 1 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Element degree Element degree 3D Laplacian, hexahedral elements Q p , periodic b.c. (= no boundary), pre-evaluated Jacobians on all quadrature points, “even-odd decomposition” to decrease work for 1D kernels � p from ( p + 1) 2 to ( p + 1)( � + 2) 2 Sparse matrix-vector product is memory bandwidth bound → speedup for Q 2 and Q 3 : trade memory transfer for arithmetics ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary
Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Performance continuous elements 1000 SpMV SpMV 900 matrix-free 10000 matrix-free general matrix-free GL matrix-free Cartesian Arithmetic operations / DoF 800 Memory [bytes] / DoF 700 1000 600 + = 500 100 400 300 10 200 100 1 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Element degree Element degree 10 -5 SpMV usual 3D Laplacian, Q p SpMV stat cond elements, Cartesian matrix-free general matrix-free Cartesian & curved geometry 10 -6 Time [s] / DoF Intel Xeon E5 Sandy Bridge 2.7 GHz (Su- perMUC) 10 -7 time per degree of freedom (DoF) per core (but node fully loaded) 10 -8 1 2 3 4 5 6 7 8 Element degree ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary
Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Performance discontinuous Galerkin 10 -5 SpMV usual 3D Laplacian, Q p matrix-free curved elements, symmetric matrix-free Cartesian interior penalty DG , Cartesian & curved 10 -6 Time [s] / DoF geometry Intel Xeon E5 Sandy Bridge 2.7 GHz (Su- 10 -7 perMUC) time per degree of freedom per 10 -8 core (but node fully 1 2 3 4 5 6 7 8 loaded) Element degree ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary
Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Characteristics of element kernels Operator evaluation: 30–70% of arithmetic peak ◮ Vectorization over elements . . . ◮ data layout: array-of-struct-of-array vmovapd 0x988(%rsp),%ymm4 ◮ intrinsics via C++ template class vmovapd 0xbc8(%rsp),%ymm6 vmovapd 0x9a8(%rsp),%ymm14 vsubpd %ymm4,%ymm6,%ymm11 VectorizedArray<double> vmulpd %ymm15,%ymm6,%ymm3 ◮ everything but vector read/write fully vmulpd %ymm6,%ymm13,%ymm8 vfmadd231pd %ymm13,%ymm4,%ymm3 vectorized (DG: fully vectorized) vfmadd231pd %ymm4,%ymm15,%ymm8 vmulpd %ymm6,%ymm12,%ymm10 ◮ Loop bounds known at compile time vmovapd 0xbe8(%rsp),%ymm0 vmovapd 0x268(%rsp),%ymm9 ◮ element degree and number of 1D vfmsub231pd %ymm9,%ymm4,%ymm10 vmulpd %ymm9,%ymm6,%ymm9 quadrature points as templates vmovapd 0xaa8(%rsp),%ymm2 ◮ unroll loops of length ⌊ p / 2 ⌋ + 1 or p + 1 vfmsub231pd %ymm4,%ymm12,%ymm9 vmovapd %ymm11,0x748(%rsp) ◮ Challenge: Mix of operations vmovapd 0xac8(%rsp),%ymm1 vmulpd %ymm15,%ymm0,%ymm5 ◮ memory intensive: vector read/write, vmulpd %ymm0,%ymm13,%ymm4 vfmadd231pd %ymm7,%ymm2,%ymm3 pre-evaluated Jacobians, coefficients, etc. vfmadd231pd %ymm7,%ymm2,%ymm8 vfmadd231pd %ymm13,%ymm14,%ymm5 on quadrature points vmovapd %ymm3,0x2c8(%rsp) ◮ arithmetic intensive: sum factorization vsubpd %ymm14,%ymm0,%ymm3 . . . → Develop performance model! ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary
Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Project goal: Parallel adaptive multigrid Implicit solvers for large-scale problems ◮ Many established solvers need more than just operator evaluation (mat-vec) ◮ Matrix-free evaluation requires specific preconditioners ◮ Simple example: Geometric multigrid with polynomial (Chebyshev) smoother ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary
Institute for Computational Mechanics Martin Kronbichler Technische Universit¨ at M¨ unchen Project goal: Parallel adaptive multigrid Implicit solvers for large-scale problems ◮ Many established solvers need more than just operator evaluation (mat-vec) ◮ Matrix-free evaluation requires specific preconditioners ◮ Simple example: Geometric multigrid with polynomial (Chebyshev) smoother ◮ Goal: GMG with overlapping Schwarz smoothers for high order bases ◮ Schwarz smoothers approximately invert differential operator on patches of elements by tensorial techniques ◮ Integrate with MPI and domain decomposition ◮ Combine with algebraic multigrid for coarse meshes with millions of elements, using low-order bases ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary
Recommend
More recommend