Is it performance portability when I’m using (small) DGEMM?
Dagstuhl Seminar: Performance Portability in Extreme Scale Computing: Metrics, Challenges, Solutions
Michael Bader (and many others!)
Technical University of Munich
Oct 23–27, 2017
Co-Authors – Current SeisSol Group
LMU Munich – Geophysics: Alice-Agnes Gabriel, Elizabeth Madden, Stephanie Wollherr, Thomas Ulrich
Technical University of Munich – HPC: Sebastian Rettenberger, Carsten Uphoff
Further/former members: Alexander Breuer (TUM → San Diego), Alexander Heinecke (Intel), Christian Pelties (LMU → MunichRe), Leonhard Rannabauer (TUM)
M. Bader et al. | Is it performance portability when I’m using DGEMM? | Dagstuhl Seminar | Oct 2017
Dynamic Rupture and Earthquake Simulation
[Figure: Landers fault system, simulated ground motion and seismic waves [2]]
SeisSol – ADER-DG for seismic simulations (www.seissol.org):
• adaptive tetrahedral meshes → complex geometries, heterogeneous media, multiphysics
• complicated fault systems with multiple branches → non-linear multiphysics dynamic rupture simulation
• ADER-DG: high-order discretisation in space and time
Part I
Simulation of the 2004 Sumatra Megathrust Earthquake
SC17 paper [5] by Sebastian Rettenberger, Carsten Uphoff, Alice Gabriel, Betsy Madden, Stephanie Wollherr, Thomas Ulrich
Sumatra Earthquake – Seismology Challenges
[Figure: domain, mesh and geometry of the Sumatra scenario (images from [5]). Labels: megathrust, forethrust, upper backthrust, lower backthrust, layered oceanic crust, layered continental crust; axes: North, East, Depth; scale bars: 50 km and 1000 km; volume continues to 500 km]
• multiscale: the rupture extends over 1500 km, but happens on the meter scale
• complex geometry: shallow angles in the subduction zone; splay faults, topography, multiple material layers
• extremely long earthquake duration: 500 s simulated time (over 3 million smallest time steps) → local time stepping imperative
Sumatra Earthquake – HPC Challenges
[Figure: Sumatra, histogram of LTS clusters (count of elements and dynamic rupture faces per cluster, cluster time steps 1 to 1024 · Δt_min) and extrapolated runtimes in hours over the number of nodes (16 to 512), series C: BL G6, C: BL L6, C: SC G6, C: SC L6, S: SC G6, S: SC L6 (plots from [5])]
• target manycore CPUs (Knights Landing → Cori supercomputer) → available cache/local memory per core → new flux computation → dynamic rupture became bottleneck → matrix-based code generation
• dynamic rupture plus local time stepping with strong(!) scalability required
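The histogram axis (1, 2, 4, ..., 1024 · Δt_min) suggests clustered local time stepping in which cluster c advances with time step 2^c · Δt_min. A minimal Python sketch of such a clustering, assuming hypothetical per-element admissible time steps; the function names `lts_cluster` and `update_counts` are illustrative and not part of SeisSol:

```python
import math
from collections import Counter

def lts_cluster(dt_element, dt_min, max_cluster=10):
    """Largest c with 2**c * dt_min <= dt_element, capped at max_cluster."""
    c = int(math.floor(math.log2(dt_element / dt_min)))
    return max(0, min(c, max_cluster))

def update_counts(dts, t_end, max_cluster=10):
    """Compare element updates for global vs. clustered local time stepping."""
    dt_min = min(dts)
    histogram = Counter(lts_cluster(dt, dt_min, max_cluster) for dt in dts)
    steps_min = math.ceil(t_end / dt_min)   # steps taken by cluster 0
    gts = len(dts) * steps_min              # every element advances with dt_min
    lts = sum(n * math.ceil(steps_min / 2 ** c) for c, n in histogram.items())
    return histogram, gts, lts

# Hypothetical admissible time steps (arbitrary units), not data from [5].
hist, gts, lts = update_counts([1.0, 1.0, 2.5, 4.0, 9.0], t_end=16.0)
print(dict(hist), gts, lts)
```

For this toy input, clustered time stepping needs 46 element updates instead of 80. With a histogram dominated by large-time-step clusters, as the plot indicates, the saving grows accordingly.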
Sumatra 2004: 220 Million Elements on SuperMUC
HPC Facts – 13.9 Hours Production Run:
• 221 million elements with order 6 accuracy
• 111 billion degrees of freedom
• 11 LTS clusters: “smallest” elements performed 3.3 million time steps
• 500 s simulated time
• 1500 km fault size; 400 m geometrical resolution
• 2.2 Hz frequency content of the seismic wave field
• 0.94 PFLOPS sustained performance (86,016 Haswell cores, 2.2 GHz)
• 13 TB checkpoint data, 2.8 TB for post-processing (asynchronous IO; costs entirely overlapped by computation)
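The figures on this slide can be cross-checked with a few lines of arithmetic. The Python sketch below assumes that order 6 means 6·7·8/6 = 56 polynomial basis functions per tetrahedron and that each element carries the 9 quantities of the elastic wave equations; both are assumptions, not statements from the slide:

```python
def basis_functions(order):
    """Number of polynomial basis functions per tetrahedron (assumed convention)."""
    return order * (order + 1) * (order + 2) // 6

elements = 221_000_000        # from the slide
quantities = 9                # assumed: 6 stresses + 3 velocities
dofs = elements * basis_functions(6) * quantities

sustained_flops = 0.94e15     # 0.94 PFLOPS, from the slide
cores = 86_016                # from the slide
gflops_per_core = sustained_flops / cores / 1e9

dt_smallest = 500.0 / 3.3e6   # simulated time / steps of the smallest elements

print(f"{dofs / 1e9:.1f} billion degrees of freedom")
print(f"{gflops_per_core:.1f} GFLOPS sustained per core")
print(f"{dt_smallest * 1e3:.2f} ms smallest time step")
```

Under these assumptions the element count reproduces the 111 billion degrees of freedom quoted above.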
Sumatra 2004 – Results
Splay Fault Activation and Ocean Floor Displacements
SeisSol – Recent Extensions
“Multiphysics” Simulations:
• viscoelastic attenuation; implementation based on new matrix-based code generator (C. Uphoff, [4])
• off-fault plasticity (current work by S. Wollherr)
Workflow and HPC:
• asynchronous parallel IO using staging nodes or writer cores (S. Rettenberger, [13])
• input of 3D velocity models from data files via parallel library ASAGI (S. Rettenberger, [14])
• simplified CAD generation and close-to-automatic meshing using SimModeler and Simulation Modeling Suite by Simmetrix
Part II
SeisSol as a Compute-Bound Code: Code Generation for Matrix Kernels
Breuer, Heinecke, Rannabauer, Bader [1]: High-Order ADER-DG Minimizes Energy- and Time-to-Solution of SeisSol (ISC’15)
Uphoff, Bader [4]: Generating high performance matrix kernels for earthquake simulations with viscoelastic attenuation (HPCS 2016)
Seismic Wave Propagation with SeisSol
Elastic Wave Equations (velocity-stress formulation):
\[
q_t + A q_x + B q_y + C q_z = 0, \qquad
q = (\sigma_{11}, \sigma_{22}, \sigma_{33}, \sigma_{12}, \sigma_{23}, \sigma_{13}, u, v, w)^T
\]
with
\[
A = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & -\lambda - 2\mu & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu \\
-\rho^{-1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0
\end{pmatrix},
\qquad
B = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda - 2\mu & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 \\
0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0
\end{pmatrix}
\]
• high order discontinuous Galerkin discretisation
• ADER-DG: high approximation order in space and time
• additional features: local time stepping, high accuracy of earthquake faulting (full frictional sliding) → Dumbser, Käser et al., e.g. [8]
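As a plausibility check of the matrices A and B, the Python sketch below builds them for given Lamé parameters and density, then verifies that the P-wave speed c_p = sqrt((λ + 2μ)/ρ) is an eigenvalue of A. The parameter values and the helper names `jacobians` and `matvec` are illustrative assumptions:

```python
import math

def jacobians(lam, mu, rho):
    """9x9 matrices A, B for q = (s11, s22, s33, s12, s23, s13, u, v, w)."""
    A = [[0.0] * 9 for _ in range(9)]
    B = [[0.0] * 9 for _ in range(9)]
    # A couples the x-derivatives: stresses to u, v, w and back.
    A[0][6] = -(lam + 2.0 * mu)
    A[1][6] = -lam
    A[2][6] = -lam
    A[3][7] = -mu
    A[5][8] = -mu
    A[6][0] = A[7][3] = A[8][5] = -1.0 / rho
    # B couples the y-derivatives.
    B[0][7] = -lam
    B[1][7] = -(lam + 2.0 * mu)
    B[2][7] = -lam
    B[3][6] = -mu
    B[4][8] = -mu
    B[6][3] = B[7][1] = B[8][4] = -1.0 / rho
    return A, B

def matvec(M, v):
    """Dense matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

lam, mu, rho = 2.0, 1.0, 1.0   # hypothetical material parameters
A, B = jacobians(lam, mu, rho)
cp = math.sqrt((lam + 2.0 * mu) / rho)

# Candidate eigenvector of A for the eigenvalue cp (P-wave travelling in x).
r = [lam + 2.0 * mu, lam, lam, 0.0, 0.0, 0.0, -cp, 0.0, 0.0]
Ar = matvec(A, r)
print(all(math.isclose(a, cp * x, abs_tol=1e-12) for a, x in zip(Ar, r)))
```

Both matrices have only a handful of nonzero entries, so products with A and B are sparse in practice.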
SeisSol in a Nutshell – ADER-DG
Update scheme:
\[
\begin{aligned}
Q_k^{n+1} = Q_k^n
 &- \frac{|S_k|}{|J_k|} M^{-1} \Bigg( \sum_{i=1}^{4} F^{-,i}\, I(t^n, t^{n+1}, Q_k^n)\, N_{k,i} A_k^{+} N_{k,i}^{-1} \\
 &\qquad\qquad + \sum_{i=1}^{4} F^{+,i,j,h}\, I(t^n, t^{n+1}, Q_{k(i)}^n)\, N_{k,i} A_{k(i)}^{-} N_{k,i}^{-1} \Bigg) \\
 &+ M^{-1} K^{\xi}\, I(t^n, t^{n+1}, Q_k^n)\, A_k^{*}
  + M^{-1} K^{\eta}\, I(t^n, t^{n+1}, Q_k^n)\, B_k^{*}
  + M^{-1} K^{\zeta}\, I(t^n, t^{n+1}, Q_k^n)\, C_k^{*}
\end{aligned}
\]
Cauchy Kovalewski:
\[
I(t^n, t^{n+1}, Q_k^n) = \sum_{j=0}^{J} \frac{(t^{n+1} - t^n)^{j+1}}{(j+1)!}\, \frac{\partial^j}{\partial t^j} Q_k(t^n)
\]
\[
(Q_k)_t = -M^{-1} \left( (K^{\xi})^T Q_k A_k^{*} + (K^{\eta})^T Q_k B_k^{*} + (K^{\zeta})^T Q_k C_k^{*} \right)
\]
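The Cauchy Kovalewski step amounts to a chain of small matrix products. The Python sketch below illustrates this under the assumption that higher time derivatives are obtained by applying the expression for (Q_k)_t recursively. The helper names are hypothetical, and plain lists stand in for the small dense or sparse GEMM kernels a production code would call:

```python
import math

def matmul(X, Y):
    """Dense product of two small matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def axpy(a, X, Y):
    """Return a * X + Y elementwise."""
    return [[a * x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def time_derivative(Q, Minv, terms):
    """(Q)_t = -M^{-1} * sum over (K, Astar) of K^T Q Astar."""
    zero = [[0.0] * len(Q[0]) for _ in Q]
    acc = zero
    for K, Astar in terms:
        acc = axpy(1.0, matmul(matmul(transpose(K), Q), Astar), acc)
    return axpy(-1.0, matmul(Minv, acc), zero)

def time_integrated(Q, Minv, terms, dt, J):
    """I = sum_{j=0}^{J} dt^(j+1) / (j+1)! * d^j Q / dt^j."""
    I = [[0.0] * len(Q[0]) for _ in Q]
    D = Q                                   # j-th time derivative, starting at j = 0
    for j in range(J + 1):
        I = axpy(dt ** (j + 1) / math.factorial(j + 1), D, I)
        D = time_derivative(D, Minv, terms)
    return I

# Scalar toy problem Q_t = -Q with Q(0) = 1 over a unit time step.
I = time_integrated([[1.0]], [[1.0]], [([[1.0]], [[1.0]])], dt=1.0, J=12)
print(I[0][0], 1.0 - math.exp(-1.0))
```

In the scalar toy case the series converges to the exact time integral 1 - e^{-1}. In the real scheme, Q is a (basis functions × quantities) matrix and `terms` would hold the three pairs (K^ξ, A*), (K^η, B*), (K^ζ, C*).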
Sparse, Dense → Block-Sparse
Consider equivalent sparsity patterns: (Uphoff, [4])
[Figure: sparsity patterns of the kernel matrices, with row and column indices as axes]
[Figure: graph representation and block-sparse memory layouts of A 1, A 2, A 3]
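One simple way to exploit such patterns is to store only a dense block that encloses all nonzeros and to run a dense product on that block. The Python sketch below assumes a single bounding block per matrix, which is a simplification; the generator in [4] may choose different or multiple blocks. The function names are illustrative:

```python
def dense_matmul(X, Y):
    """Dense product of two small matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def bounding_block(A):
    """Half-open row/column range (r0, r1, c0, c1) enclosing all nonzeros of A."""
    rows = [i for i, row in enumerate(A) if any(row)]
    cols = [j for j in range(len(A[0])) if any(row[j] for row in A)]
    if not rows:
        return None
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1

def block_matmul(A, B):
    """C = A * B, multiplying only the dense bounding block of A."""
    C = [[0.0] * len(B[0]) for _ in A]
    box = bounding_block(A)
    if box is None:                      # A is entirely zero
        return C
    r0, r1, c0, c1 = box
    block = [row[c0:c1] for row in A[r0:r1]]
    C[r0:r1] = dense_matmul(block, B[c0:c1])   # small dense GEMM on the block
    return C

A = [[0.0, 0.0, 0.0, 0.0],
     [1.0, 2.0, 0.0, 0.0],
     [3.0, 4.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(block_matmul(A, B) == dense_matmul(A, B))
```

In this toy case the block product costs 2·2·2·2 = 16 floating-point operations instead of 2·4·4·2 = 64 for the full dense product, while the inner kernel remains a plain small GEMM.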