> Software Sustainability in the Many-Core Era > J.Thies · slides > Erlangen, July 11 2016 DLR.de · Chart 1 Software Sustainability in the Many-Core Era Jonas Thies
Chart 2: German Aerospace Center (DLR)
Aerospace center, project manager and space agency
◮ > 8 000 employees
◮ 16(?) sites in Germany
Main areas of research
◮ Aeronautics
◮ Energy
◮ Space
◮ Security
For the ESA mission ‘Rosetta’, DLR developed and operates the ‘Philae’ lander.
... so who am I to talk to you about software and HPC?

Chart 3: Institute Simulation and Software Technology
Software is developed everywhere at DLR
◮ 2005: ∼ 25 % of personnel expenses spent on software development
◮ cost: > 100 million Euro/year
◮ Examples: CFD, material science, onboard computers, data analysis...
Our mission (∼ 50 staff) is to increase the efficiency of software development in other institutes by software research, teaching and contributing to key projects.

Chart 4: Equipping Sparse Solvers for the EXa-scale (ESSEX)

Chart 5: Sparse Eigenvalue Problems
Formulation
Find some eigenpairs $(\lambda_j, v_j)$ of a large and sparse matrix (pair) in a target region of the spectrum:
$A v_j = \lambda_j B v_j$
◮ $A$ Hermitian or general, real or complex
◮ $B$ may be identity matrix (or not)
◮ ‘some’ may mean ‘quite a few’, 100-1 000 or so
Applications
◮ Quantum: graphene, Hubbard model and Anderson localization
◮ Fluid mechanics: driven cavity, Rayleigh-Bénard convection
◮ DLR applications

Chart 6: Block Jacobi-Davidson QR
◮ Aim: partial QR decomposition, $A Q = Q R$, with $R \in \mathbb{C}^{k \times k}$ upper triangular and $\tfrac{1}{2} Q^T Q - \tfrac{1}{2} I = 0$, $Q \in \mathbb{R}^{N \times k}$.
◮ Newton’s method, let $Q = \tilde{Q} + \Delta Q$
◮ $A \Delta Q - \Delta Q \tilde{R} = A \tilde{Q} - \tilde{Q} \tilde{R}$
◮ $\tilde{Q}^T \Delta Q = 0$

Chart 7: Block Jacobi-Davidson QR (2)
This leads to a set of correction equations
$(I - \tilde{Q} \tilde{Q}^T) A (I - \tilde{Q} \tilde{Q}^T) \Delta Q - \Delta Q \tilde{R} = A \tilde{Q} - \tilde{Q} \tilde{R}$
◮ Subspace acceleration: add corrections to expanding search space $V$
◮ Ritz-Galerkin: $M = V^T A V$, $M = S^H R S$
◮ Lock converged eigenpairs ⇒ growing projection space $\tilde{Q}$
◮ Solve correction eq. using (deflated) GMRES or MINRES Krylov solver

Chart 8: Projection-Based Eigensolvers
Input: interval $I_\lambda$, matrix pair $A, B \in \mathbb{C}^{N \times N}$
Output: $\hat{m}$ eigenpairs $(X, \Lambda)$ in $I_\lambda$
Estimate $\tilde{m} \approx \hat{m}$, choose random $Y \in \mathbb{C}^{N \times m}$ of rank $m > \tilde{m}$
while not $\tilde{m}$ pairs converged do
    Compute $U = P Y$ with suitable projector $P = P_{I_\lambda}(A, B)$
    Compute Rayleigh quotients $A_U = U^* A U$ and $B_U = U^* B U$
    Update estimate $\tilde{m}$ of $\hat{m}$ and adjust $m > \tilde{m}$
    Solve EVP $A_U W = B_U W \Lambda$
    $X \leftarrow U W$
    Orthogonalize $X$ against locked vectors, lock newly converged ones
    $Y \leftarrow B X$
end while

Chart 9: Two Ways of Computing the Projector $U = P Y$
BEAST-C/FEAST: contour integration
$U := \frac{1}{2 \pi i} \oint_C (z B - A)^{-1} B \, dz \; Y$
Requires solving many independent but hard linear systems
Polynomial expansion of resolvent function (aka ChebFD)
◮ Chebyshev iteration
◮ requires very large number of spMMVs (sparse matrix times multiple vector products)
◮ but no global synchronization
◮ ‘filter polynomials’ to reduce Gibbs oscillations

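Why the contour integral acts as a projector is easiest to see in the scalar case: for $B = I$ and a diagonal $A$, each eigenvalue $\lambda$ is mapped to $\frac{1}{2\pi i}\oint_C \frac{dz}{z - \lambda}$, which is 1 inside the contour and 0 outside. The sketch below approximates this with a trapezoidal rule on a circle. The function name `contour_filter`, the circular contour and the node count are illustrative assumptions, not taken from BEAST or FEAST.

```python
import cmath
import math


def contour_filter(lam, center, radius, n_nodes=16):
    """Approximate (1 / (2*pi*i)) * contour integral of dz / (z - lam).

    Scalar stand-in for U = P Y with B = I and diagonal A (an assumption
    made for illustration). In the matrix case every quadrature node z_j
    corresponds to one shifted linear system with (z_j B - A).
    """
    total = 0j
    for j in range(n_nodes):
        # Half-step offset keeps the nodes off the real axis.
        theta = 2.0 * math.pi * (j + 0.5) / n_nodes
        e = cmath.exp(1j * theta)
        z = center + radius * e
        # dz = i * radius * e * dtheta with dtheta = 2*pi / n_nodes;
        # the factors i and 2*pi cancel against 1 / (2*pi*i).
        total += radius * e / (z - lam)
    return (total / n_nodes).real


# Target interval [-1, 1]: circle with center 0 and radius 1.
inside = contour_filter(0.5, center=0.0, radius=1.0)   # eigenvalue inside
outside = contour_filter(2.0, center=0.0, radius=1.0)  # eigenvalue outside
```

Eigenvalues well inside the circle are mapped close to 1 and eigenvalues well outside close to 0. Each of the `n_nodes` terms stands for one of the independent linear solves mentioned on the slide.
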
Chart 10: Common Operations of Iterative Methods
1. Memory-bound linear operations involving
◮ sparse matrices $A \in \mathbb{R}^{N \times N}$ (sparseMat)
◮ multi-vectors $X, Y \in \mathbb{R}^{N \times m}$ (mVecs)
◮ small and dense matrices $C \in \mathbb{R}^{m \times k}$ (sdMats), node-local/in shared memory
(e.g. $Y \leftarrow \alpha A X + \beta Y$, $C \leftarrow X^T Y$, $X \leftarrow Y \cdot C$)
Developed in ESSEX/
2. Algorithms for sdMats
◮ e.g. eigendecomposition of projected matrix
◮ use LAPACK/PLASMA/MAGMA
3. Sparse matrix (I)LU factorization
◮ not available in […]
◮ allow using external libraries via Trilinos interface

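The three example operations from item 1 can be sketched in plain Python to make the data shapes concrete. This is a reference-style illustration only: the function names `spmmv`, `mvec_t_mvec` and `mvec_times_sdmat`, the CSR tuple layout and the row-major list-of-rows storage are assumptions made here, not the interface of any ESSEX library.

```python
def spmmv(alpha, csr, x, beta, y):
    """Y <- alpha * A * X + beta * Y, with A in CSR form and X, Y as mVecs."""
    row_ptr, col_idx, values = csr
    m = len(y[0])
    for i in range(len(y)):
        acc = [0.0] * m
        for p in range(row_ptr[i], row_ptr[i + 1]):
            a, xj = values[p], x[col_idx[p]]
            for c in range(m):
                acc[c] += a * xj[c]
        y[i] = [alpha * acc[c] + beta * y[i][c] for c in range(m)]
    return y


def mvec_t_mvec(x, y):
    """C <- X^T Y, a small dense matrix (sdMat).

    In a distributed run this is the operation that needs a global reduction.
    """
    m, k = len(x[0]), len(y[0])
    c = [[0.0] * k for _ in range(m)]
    for xi, yi in zip(x, y):
        for a in range(m):
            for b in range(k):
                c[a][b] += xi[a] * yi[b]
    return c


def mvec_times_sdmat(y, c):
    """X <- Y * C for an mVec Y and an sdMat C (row-wise, purely local)."""
    k = len(c[0])
    return [[sum(yi[a] * c[a][b] for a in range(len(c))) for b in range(k)]
            for yi in y]


# 1D Laplacian (3 x 3) in CSR format: (row_ptr, col_idx, values)
A = ([0, 2, 5, 7], [0, 1, 0, 1, 2, 1, 2],
     [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0])
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[0.0, 0.0] for _ in X]
spmmv(1.0, A, X, 0.0, Y)     # Y = A X
C = mvec_t_mvec(X, Y)        # C = X^T A X, a 2 x 2 sdMat
Z = mvec_times_sdmat(Y, C)   # Z = Y C
```

All three kernels stream through the long dimension $N$ once and do little arithmetic per element loaded, which is why the slide classifies them as memory-bound.
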
Chart 11: Comparing Performance Results
simple(?) operation: $C = V^T V$, $V \in \mathbb{R}^{1\mathrm{M} \times 4}$

Chart 12: Present Challenges to HPC Users
Performance increase on low to intermediate levels
◮ SIMD/SIMT
◮ increasingly non-uniform cache/memory hierarchies
◮ increasing core count
Many programming models and (semi-)standards
◮ OpenMP+OpenACC vs. OpenCL
◮ vendor-specific (e.g. CUDA)
◮ MPI vs. PGAS (GPI/GASPI, Co-Array Fortran, UPC)
◮ ca. 15 different tasking runtimes
◮ C++11, Intel TBB, Kokkos
imo: MPI is here to stay, the node-level is uncertain

Chart 13: Our Test System
[Figure: roofline-style plot of DP GFlop/s over compute intensity [Flop / DP element], marking peak Flop/s, peak bandwidth and BLAS1 (ddot)]
CPU
◮ 2 × 12 core Haswell EP @2.3 GHz
◮ Theoretical Peak: 442 GFlop/s
◮ 128 GB RAM
◮ STREAM-Triad: 42 GB/s / socket
GPU
◮ Tesla K40 GPU
◮ Theoretical Peak: 1.43 TFlop/s
◮ 12 GB RAM
◮ STREAM-Triad: 215 GB/s

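The numbers above can be plugged into a simple roofline estimate, performance = min(peak, intensity × bandwidth). The sketch below is a back-of-the-envelope illustration: the helper name `roofline_gflops` is hypothetical, and pairing the 442 GFlop/s peak with the 42 GB/s per-socket bandwidth is an assumption, since the slide does not say whether the peak is per socket or per node.

```python
def roofline_gflops(peak_gflops, bandwidth_gbytes, flop_per_element,
                    bytes_per_element=8):
    """Upper bound on DP GFlop/s for a kernel with the given compute intensity.

    flop_per_element is the intensity in Flop per double-precision element
    loaded, matching the x-axis of the plot on the slide.
    """
    giga_elements_per_s = bandwidth_gbytes / bytes_per_element
    return min(peak_gflops, flop_per_element * giga_elements_per_s)


# BLAS1 ddot: 2 Flop per 2 elements loaded, i.e. 1 Flop / DP element.
cpu_ddot = roofline_gflops(442.0, 42.0, 1.0)    # assumed: one CPU socket
gpu_ddot = roofline_gflops(1430.0, 215.0, 1.0)  # Tesla K40

# C = V^T V with m = 4 columns: 2*m*m*N Flop over m*N loaded elements,
# i.e. 2*m = 8 Flop / DP element (symmetry of C not exploited).
cpu_gram = roofline_gflops(442.0, 42.0, 8.0)
gpu_gram = roofline_gflops(1430.0, 215.0, 8.0)
```

Under these assumptions both kernels stay far below the respective peak Flop/s, so the attainable performance is set by memory bandwidth rather than by the arithmetic units.
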
Chart 14: SPMD/OK Programming Model
◮ SPMD (‘BSP’) vs. task parallelism
◮ Heterogeneous cluster: distribute problem according to limiting resource (e.g. memory bandwidth)
◮ Optimized Kernels make sure each component runs as fast as possible
◮ User sees a simple functional interface (no general-purpose looping constructs etc.)
A success story: Chebyshev methods on Piz Daint
[Figure: performance in Tflop/s and parallel efficiency over the number of heterogeneous nodes (1 to 1024); curves: Square, Weak Scaling; Bar, Weak Scaling; Square, Strong Scaling]
Only needs sparse matrix times multiple vector (spMMV) products and an occasional vector operation

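Distributing the problem "according to limiting resource" can be illustrated with a row partition weighted by memory bandwidth. This is a minimal sketch, assuming one process per CPU socket and one per GPU and using the STREAM numbers of the test system (Chart 13) as integer weights; `partition_rows` is a hypothetical helper, not part of any library mentioned in the talk.

```python
def partition_rows(n_rows, bandwidths):
    """Split n_rows among devices proportionally to integer bandwidth weights."""
    total = sum(bandwidths)
    # Integer arithmetic avoids rounding surprises for large row counts.
    counts = [n_rows * bw // total for bw in bandwidths]
    # Rows lost to rounding down go to the fastest devices first.
    leftover = n_rows - sum(counts)
    fastest_first = sorted(range(len(bandwidths)),
                           key=lambda d: bandwidths[d], reverse=True)
    for d in fastest_first[:leftover]:
        counts[d] += 1
    return counts


# Two CPU sockets (42 GB/s each) and one GPU (215 GB/s), weights in GB/s.
counts = partition_rows(1_000_000, [42, 42, 215])
```

With these weights the GPU receives roughly 72 % of the rows, so that all three components finish a bandwidth-bound kernel at about the same time.
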
Chart 15: Upcoming Challenges
(even more) heterogeneous memory
◮ Knights Landing: additional fast NUMA domain
◮ IBM Power 9 + Nvidia Volta: GPU can read from main memory (at same speed as CPU)
Algorithm developer must decide which data should be accessed fast
◮ E.g. eigensolvers often have an outer/inner (project/correct) structure; the complete outer search space may not be needed in the inner loop

Chart 16: PHIST Software Architecture
a Pipelined Hybrid-parallel Iterative Solver Toolkit
◮ facilitate algorithm development using
◮ holistic performance engineering
◮ portability and interoperability
[Diagram: layered architecture annotated with ‘vertical integration’ and ‘holistic performance engineering’. Recoverable labels: application; algorithms (eigenproblem, solver templates, BEAST, preconditioners, FT strategies); algo core (setup/apply); «abstraction» C wrapper; computational core; adapter; «interface» kernel interface with sparseMat, mVec, sdMat]

Chart 17: Useful Abstraction: Kernel Interface
Choose from several ‘backends’ at compile time, to
◮ easily use PHIST in existing applications
◮ perform the same run with different kernel libraries
◮ compare numerical accuracy and performance
◮ exploit unique features of a kernel library (e.g. preconditioners)

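The idea of running the same algorithm on interchangeable kernel backends can be sketched as follows. This is an analogy only: PHIST selects its backend at compile time, whereas Python picks the object at run time, and the classes `NaiveKernels` and `CompensatedKernels` as well as the function `dot_with` are hypothetical names that do not correspond to real PHIST backends or calls.

```python
import math


class NaiveKernels:
    """Hypothetical backend: plain left-to-right summation."""

    name = "naive"

    @staticmethod
    def dot(x, y):
        acc = 0.0
        for a, b in zip(x, y):
            acc += a * b
        return acc


class CompensatedKernels:
    """Hypothetical backend: products summed without rounding error (math.fsum)."""

    name = "fsum"

    @staticmethod
    def dot(x, y):
        return math.fsum(a * b for a, b in zip(x, y))


def dot_with(kernels, x, y):
    """Algorithm-side code: it only sees the kernel interface, not the backend."""
    return kernels.dot(x, y)


# Same run, two backends: a cancellation-prone dot product.
x = [1e16, 1.0, -1e16]
ones = [1.0, 1.0, 1.0]
results = {k.name: dot_with(k, x, ones)
           for k in (NaiveKernels, CompensatedKernels)}
```

Here the naive backend loses the small term to cancellation while the `fsum` backend keeps it, which is the kind of accuracy difference that can be exposed by repeating an identical run on a different kernel library.
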