Software Sustainability in the Many-Core Era, Jonas Thies

> Software Sustainability in the Many-Core Era > J.Thies · slides > Erlangen, July 11 2016 DLR.de · Chart 1 Software Sustainability in the Many-Core Era Jonas Thies

Chart 2: German Aerospace Center (DLR)
Aerospace center, project manager and space agency
◮ > 8 000 employees
◮ 16(?) sites in Germany
Main areas of research
◮ Aeronautics
◮ Energy
◮ Space
◮ Security
For the ESA mission ‘Rosetta’, DLR developed and operates the ‘Philae’ lander
... so who am I to talk to you about software and HPC?

Chart 3: Institute Simulation and Software Technology
Software is developed everywhere at DLR
◮ 2005: ∼ 25 % of personnel expenses spent on software development
◮ cost: > 100 million Euro/year
◮ Examples: CFD, material science, onboard computers, data analysis...
Our mission (∼ 50 staff) is to increase the efficiency of software development in other institutes by software research, teaching and contributing to key projects.

Chart 4: Equipping Sparse Solvers for the EXa-scale (ESSEX)

Chart 5: Sparse Eigenvalue Problems
Formulation
Find some eigenpairs $(\lambda_j, v_j)$ of a large and sparse matrix (pair) in a target region of the spectrum:
$A v_j = \lambda_j B v_j$
◮ $A$ Hermitian or general, real or complex
◮ $B$ may be identity matrix (or not)
◮ ‘some’ may mean ‘quite a few’, 100-1 000 or so
Applications
◮ Graphene
◮ Quantum Hubbard model and Anderson localization
◮ Fluid mechanics: driven cavity, Rayleigh-Benard convection
◮ DLR applications

Chart 6: Block Jacobi-Davidson QR
◮ Aim: partial QR decomposition, $AQ = QR$, $R \in \mathbb{C}^{k \times k}$ upper triangular, $\frac{1}{2} Q^T Q - \frac{1}{2} I = 0$, $Q \in \mathbb{R}^{N \times k}$.
◮ Newton’s method, let $Q = \tilde{Q} + \Delta Q$
◮ $A \Delta Q - \Delta Q \tilde{R} = A \tilde{Q} - \tilde{Q} \tilde{R}$
◮ $\tilde{Q}^T \Delta Q = 0$

Chart 7: Block Jacobi-Davidson QR (2)
This leads to a set of correction equations
$(I - \tilde{Q} \tilde{Q}^T) A (I - \tilde{Q} \tilde{Q}^T) \Delta Q - \Delta Q \tilde{R} = A \tilde{Q} - \tilde{Q} \tilde{R}$
◮ Subspace acceleration: add corrections to expanding search space $V$
◮ Ritz-Galerkin: $M = V^T A V$, $M = S^H R S$
◮ Lock converged eigenpairs ⇒ growing projection space $\tilde{Q}$
◮ Solve correction eq. using (deflated) GMRES or MINRES Krylov solver

Chart 8: Projection-Based Eigensolvers
Input: interval $I_\lambda$, matrix pair $A, B \in \mathbb{C}^{N \times N}$
Output: $\hat{m}$ eigenpairs $(X, \Lambda)$ in $I_\lambda$
Estimate $\tilde{m} \approx \hat{m}$, choose random $Y \in \mathbb{C}^{N \times m}$ of rank $m > \tilde{m}$
while not $\tilde{m}$ pairs converged do
    Compute $U = PY$ with suitable projector $P = P_{I_\lambda}(A, B)$
    Compute Rayleigh quotients $A_U = U^* A U$ and $B_U = U^* B U$
    Update estimate $\tilde{m}$ of $\hat{m}$ and adjust $m > \tilde{m}$
    Solve EVP $A_U W = B_U W \Lambda$
    $X \leftarrow UW$
    Orthogonalize $X$ against locked vectors, lock newly converged ones
    $Y \leftarrow BX$
end while

Chart 9: Two Ways of Computing the Projector $U = PY$
BEAST-C/FEAST: contour integration
$U := \frac{1}{2 \pi i} \oint_C (zB - A)^{-1} B \, dz \; Y$
Requires solving many independent but hard linear systems
Polynomial expansion of resolvent function
◮ Chebyshev iteration
◮ requires very large number of spMMVMs
◮ but no global synchronization
◮ ‘filter polynomials’ to reduce Gibbs oscillations
aka ChebFD
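To see what the contour integral does to the spectrum, it can help to look at the scalar analogue: take B = 1, replace A by a single eigenvalue, and approximate the integral with a quadrature rule on a circle. The sketch below is only an illustration under those assumptions. The name `contour_filter`, the circular contour, and the choice of 16 midpoint-shifted trapezoidal nodes are hypothetical and not taken from BEAST or FEAST.

```python
import cmath
import math


def contour_filter(lam, center, radius, n_nodes=16):
    """Approximate (1 / (2*pi*i)) * integral over a circle of dz / (z - lam).

    Scalar stand-in for the projector: the value is close to 1 when lam lies
    inside the circle and close to 0 when it lies outside.
    """
    total = 0.0 + 0.0j
    for j in range(n_nodes):
        theta = 2.0 * math.pi * (j + 0.5) / n_nodes  # midpoint-shifted nodes
        w = radius * cmath.exp(1j * theta)           # w = z_j - center
        z = center + w
        # dz = i * w * dtheta, so the factor i cancels against 1 / (2*pi*i)
        total += w / (z - lam)
    return total / n_nodes


if __name__ == "__main__":
    # Eigenvalues inside the unit circle are kept, those outside are damped.
    for lam in (0.0, 0.5, 0.9, 1.1, 2.0):
        print(lam, abs(contour_filter(lam, center=0.0, radius=1.0)))
```

In the matrix case each quadrature node turns into one shifted linear system with $(z_j B - A)$, which is where the "many independent but hard linear systems" come from. For this particular rule the filter value works out to $1 / (1 + ((\lambda - c)/r)^n)$, so more nodes give a sharper transition at the contour.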

Chart 10: Common Operations of Iterative Methods
1. Memory-bound linear operations involving
◮ sparse matrices $A \in \mathbb{R}^{N \times N}$ (sparseMat)
◮ multi-vectors $X, Y \in \mathbb{R}^{N \times m}$ (mVecs)
◮ small and dense matrices $C \in \mathbb{R}^{m \times k}$ (sdMats), node-local/in shared memory
(e.g. $Y \leftarrow \alpha A X + \beta Y$, $C \leftarrow X^T Y$, $X \leftarrow Y \cdot C$)
Developed in ESSEX/
2. Algorithms for sdMats
◮ e.g. eigendecomposition of projected matrix
◮ use LAPACK/PLASMA/MAGMA
3. Sparse matrix (I)LU factorization
◮ not available in
◮ allow using external libraries via Trilinos interface
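The first group of operations can be sketched in a few lines of plain Python. This is a minimal, unoptimized illustration of $Y \leftarrow \alpha A X + \beta Y$ and $C \leftarrow X^T Y$, assuming a CSR-stored sparse matrix and row-major multi-vectors held as lists of rows. The function names `spmmv` and `gram` and the storage layout are assumptions for this example and not the interface of any ESSEX library.

```python
def spmmv(alpha, row_ptr, col_idx, values, X, beta, Y):
    """Y <- alpha * A @ X + beta * Y for a CSR matrix A and row-major blocks X, Y."""
    n_rows = len(row_ptr) - 1
    n_vecs = len(X[0])
    for i in range(n_rows):
        acc = [0.0] * n_vecs
        # One pass over the nonzeros of row i updates all vectors at once.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a, x_row = values[k], X[col_idx[k]]
            for j in range(n_vecs):
                acc[j] += a * x_row[j]
        Y[i] = [alpha * acc[j] + beta * Y[i][j] for j in range(n_vecs)]
    return Y


def gram(X, Y):
    """C <- X^T Y: reduces two tall multi-vectors to a small dense matrix."""
    m, k = len(X[0]), len(Y[0])
    C = [[0.0] * k for _ in range(m)]
    for x_row, y_row in zip(X, Y):
        for p in range(m):
            for q in range(k):
                C[p][q] += x_row[p] * y_row[q]
    return C


# 3 x 3 tridiagonal matrix [[2, -1, 0], [-1, 2, -1], [0, -1, 2]] in CSR form.
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # two vectors stored side by side
Y = [[0.0, 0.0] for _ in range(3)]

spmmv(1.0, row_ptr, col_idx, values, X, 0.0, Y)
C = gram(X, Y)                              # projected matrix X^T A X
```

Storing the vectors of a block side by side means each matrix entry is loaded once and reused for every vector in the block, which is the main reason block operations help with memory-bound kernels.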

Chart 11: Comparing Performance Results
simple(?) operation: $C = V^T V$, $V \in \mathbb{R}^{1\mathrm{M} \times 4}$

Chart 12: Present Challenges to HPC Users
Performance increase on low to intermediate levels
◮ SIMD/SIMT
◮ increasingly non-uniform cache/memory hierarchies
◮ increasing core count
Many programming models and (semi-)standards
◮ OpenMP+OpenACC vs. OpenCL
◮ vendor-specific (e.g. CUDA)
◮ MPI vs. PGAS (GPI/GASPI, Co-Array Fortran, UPC)
◮ ca. 15 different tasking runtimes
◮ C++11, Intel TBB, Kokkos
imo: MPI is here to stay, the node-level is uncertain

Chart 13: Our Test System
[Figure: roofline plot, DP GFlop/s versus compute intensity [Flop / DP element], showing peak Flop/s, peak bandwidth and BLAS1 (ddot)]
CPU
◮ 2 × 12 core Haswell EP @2.3 GHz
◮ Theoretical Peak: 442 GFlop/s
◮ 128 GB RAM
◮ STREAM-Triad: 42 GB/s / socket
GPU
◮ Tesla K40 GPU
◮ Theoretical Peak: 1.43 TFlop/s
◮ 12 GB RAM
◮ STREAM-Triad: 215 GB/s
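The roofline picture behind these numbers can be reproduced with a short calculation: attainable performance is the smaller of the peak Flop rate and bandwidth times compute intensity. The sketch below plugs in the figures quoted for the test system. Treating 442 GFlop/s as a per-socket peak is an assumption (it matches 12 cores × 2.3 GHz × 16 DP Flop/cycle), and the function and constant names are made up for this example.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, flops_per_element, bytes_per_element=8):
    """Attainable GFlop/s under the roofline model: min(peak, bandwidth * intensity)."""
    intensity = flops_per_element / bytes_per_element  # Flop per byte
    return min(peak_gflops, bandwidth_gbs * intensity)


# Figures quoted for the test system (CPU values assumed to be per socket).
CPU_SOCKET = {"peak_gflops": 442.0, "bandwidth_gbs": 42.0}
GPU_K40 = {"peak_gflops": 1430.0, "bandwidth_gbs": 215.0}

# Compute intensity in Flop per loaded DP element.
DDOT = 1.0     # one multiply and one add per pair of loaded elements
GRAM_4 = 8.0   # C = V^T V with 4 columns: 2 * 4 Flop per loaded element of V

if __name__ == "__main__":
    for name, dev in (("CPU socket", CPU_SOCKET), ("Tesla K40", GPU_K40)):
        for kernel, flops in (("ddot", DDOT), ("V^T V, 4 columns", GRAM_4)):
            print(name, kernel, roofline_gflops(flops_per_element=flops, **dev))
```

Under these assumptions ddot is limited to 5.25 GFlop/s per CPU socket and 26.875 GFlop/s on the K40, and the four-column product $V^T V$ to 42 and 215 GFlop/s. Both kernels stay far below the arithmetic peak, so memory bandwidth is the limiting resource.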

Chart 14: SPMD/OK Programming Model
◮ SPMD (‘BSP’) vs. task parallelism
◮ Heterogeneous cluster: distribute problem according to limiting resource (e.g. memory bandwidth)
◮ Optimized Kernels make sure each component runs as fast as possible
◮ User sees a simple functional interface (no general-purpose looping constructs etc.)
A success story: Chebyshev methods on Piz Daint
[Figure: performance in Tflop/s and parallel efficiency versus number of heterogeneous nodes (1 to 1024); series: Square, Weak Scaling; Bar, Weak Scaling; Square, Strong Scaling]
Only needs sparse matrix times multiple vector (spMMV) products and an occasional vector operation
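Distributing a problem "according to limiting resource" can be as simple as giving each device a share of the matrix rows proportional to its memory bandwidth. The sketch below is one possible way to do that, using a largest-remainder rounding so the shares add up exactly. The function name `distribute_rows` and the example node (two CPU sockets plus one GPU, with the STREAM-Triad figures quoted for the test system) are assumptions for illustration, not the scheme used on Piz Daint.

```python
def distribute_rows(n_rows, bandwidths):
    """Split n_rows among devices proportionally to their memory bandwidth."""
    total = float(sum(bandwidths))
    ideal = [n_rows * bw / total for bw in bandwidths]
    counts = [int(x) for x in ideal]              # round every share down first
    leftover = n_rows - sum(counts)
    # Hand the remaining rows to the devices with the largest fractional part.
    by_fraction = sorted(range(len(ideal)),
                         key=lambda i: ideal[i] - counts[i],
                         reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


# Two CPU sockets at 42 GB/s each and one GPU at 215 GB/s.
shares = distribute_rows(1_000_000, [42.0, 42.0, 215.0])
```

With these numbers the GPU receives roughly 72 % of the rows. A real code would weight by nonzeros rather than rows and would also have to respect the 12 GB of GPU memory.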

Chart 15: Upcoming Challenges
(even more) heterogeneous memory
◮ Knights Landing: additional fast NUMA domain
◮ IBM Power 9 + Nvidia Volta: GPU can read from main memory (at same speed as CPU)
Algorithm developer must decide which data should be accessed fast
◮ E.g. eigensolvers often have an outer/inner (project/correct) structure, the complete outer search space may not be needed in inner loop

Chart 16: PHIST Software Architecture
a Pipelined Hybrid-parallel Iterative Solver Toolkit
◮ facilitate algorithm development using
◮ holistic performance engineering
◮ portability and interoperability
[Figure: architecture diagram. Labels: vertical integration; holistic performance engineering; application; algorithms; eigenproblem solver templates; BEAST; preconditioners; FT strategies; «abstraction»; C wrapper; setup/apply; algo core; computational core; adapter; «interface» kernel interface; sparseMat, mVec, sdMat]

Chart 17: Useful Abstraction: Kernel Interface
Choose from several ‘backends’ at compile time, to
◮ easily use PHIST in existing applications
◮ perform the same run with different kernel libraries
◮ compare numerical accuracy and performance
◮ exploit unique features of a kernel library (e.g. preconditioners)
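The idea of writing an algorithm once against a kernel interface and swapping the library underneath can be shown with a toy example. Python has no compile-time selection, so a lookup by name stands in for it here. The class names, the `dot` kernel and the `BACKENDS` registry are all hypothetical and do not reflect PHIST's actual interface.

```python
import math


class NaiveKernels:
    """Backend 1: plain left-to-right summation."""
    name = "naive"

    @staticmethod
    def dot(x, y):
        s = 0.0
        for a, b in zip(x, y):
            s += a * b
        return s


class FsumKernels:
    """Backend 2: same operation, using math.fsum to avoid losing small terms."""
    name = "fsum"

    @staticmethod
    def dot(x, y):
        return math.fsum(a * b for a, b in zip(x, y))


BACKENDS = {k.name: k for k in (NaiveKernels, FsumKernels)}


def norm2(kernels, x):
    """Algorithm code: only talks to the kernel interface, never to a backend."""
    return math.sqrt(kernels.dot(x, x))


if __name__ == "__main__":
    # One large entry followed by many tiny ones exposes the accuracy difference.
    x = [1.0] + [1e-9] * 1000
    for name, kernels in BACKENDS.items():
        print(name, repr(norm2(kernels, x)))
```

Running the same `norm2` with both backends is a small-scale version of "perform the same run with different kernel libraries": the naive backend drops the tiny contributions and returns exactly 1.0, while the `fsum` backend returns a value slightly above 1.0.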
