
EDGE: Extreme-scale Discontinuous Galerkin Environment


  1. EDGE: Extreme-scale Discontinuous Galerkin Environment Alexander Breuer, Alexander Heinecke (Intel), Yifeng Cui

  2. Getting Started: Advection Equation
     $q(x,t)_t + v \cdot q(x,t)_x = 0, \quad v \in \mathbb{R}$
     • “Simplest” hyperbolic Partial Differential Equation (PDE)
     • Elastic wave equations similar: Linear system with variable coefficients
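To make the advection equation concrete, here is a minimal sketch of one first-order upwind finite-volume step on a periodic 1D mesh. This is a textbook scheme chosen for brevity, not EDGE's implementation; the function name and the periodic boundary are assumptions made for illustration.

```python
def upwind_step(q, v, dx, dt):
    """Advance the cell averages q by one first-order upwind step (periodic domain)."""
    c = v * dt / dx  # signed CFL number; |c| <= 1 is required for stability
    n = len(q)
    if c >= 0:
        # Information travels to the right: difference with the left neighbour.
        return [q[i] - c * (q[i] - q[i - 1]) for i in range(n)]
    # Information travels to the left: difference with the right neighbour.
    return [q[i] - c * (q[(i + 1) % n] - q[i]) for i in range(n)]


# A unit pulse moved with CFL number 1 is shifted by exactly one cell.
pulse = [0.0, 1.0, 0.0, 0.0]
shifted = upwind_step(pulse, v=1.0, dx=1.0, dt=1.0)
```

With a CFL number below 1 the same step smears the pulse, which is one reason higher-order schemes such as ADER-DG are of interest.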

  3. Getting Started: Fused Solver
     $q(x,t)_t + v \cdot q(x,t)_x = 0, \quad v \in \mathbb{R}$
     • Non-Fused: $o_1 = s(i_1)$, $o_2 = s(i_2)$, $o_3 = s(i_3)$, $o_4 = s(i_4)$
     • Fused: $O_4 = (o_1, o_2, o_3, o_4) = S_4(I_4) = S_4(i_1, i_2, i_3, i_4)$
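The difference between the non-fused solver s and the fused solver S can be sketched with a small matrix operator. The code below is an illustration under assumptions, not EDGE code: the names `s` and `s_fused` and the list-of-lists storage are hypothetical, and the operator stands in for any linear solver step. The fused variant reads each operator entry once and applies it to all runs in a dense inner loop, even when the operator itself is sparse.

```python
def s(op, i):
    """Non-fused step: apply the operator matrix `op` to one input vector `i`."""
    return [sum(a * x for a, x in zip(row, i)) for row in op]


def s_fused(op, fused_in):
    """Fused step: fused_in[c][k] is entry c of simulation k; `op` is shared by all runs."""
    n_runs = len(fused_in[0])
    out = [[0.0] * n_runs for _ in op]
    for r, row in enumerate(op):
        for c, a in enumerate(row):
            if a == 0.0:
                continue  # sparse operator: zero entries are skipped entirely
            for k in range(n_runs):  # dense, contiguous loop over the fused runs
                out[r][k] += a * fused_in[c][k]
    return out


op = [[1.0, 0.0], [2.0, 3.0]]
inputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # i_1 .. i_4
fused_in = [[i[c] for i in inputs] for c in range(2)]      # I_4, runs innermost
fused_out = s_fused(op, fused_in)                           # O_4
```

Column k of `fused_out` holds the same values as `s(op, inputs[k])`; only the loop order and data layout change.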

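  5. DOFs: Non-Fused vs. Fused
     [Figure: memory layout of the degrees of freedom (DOFs), shown as linear indices.]

     Non-fused (rows: modes, columns: elements):

                 element 0   element 1   element 2
       mode 0        0           3           6
       mode 1        1           4           7
       mode 2        2           5           8

     Fused (rows: modes, columns: fused runs 0-3, one block per element):

                 element 0        element 1
       mode 0     0  1  2  3      12 13 14 15
       mode 1     4  5  6  7      16 17 18 19
       mode 2     8  9 10 11      20 21 22 23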
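The index numbers on this slide follow a simple pattern: in the non-fused layout the modes of one element are contiguous, and in the fused layout the fused runs become the fastest-running dimension. The sketch below reproduces that numbering; the function names are hypothetical and the formulas are inferred from the numbers on the slide, not taken from EDGE's source.

```python
def nonfused_index(element, mode, n_modes):
    """Linear index of a DOF when each simulation is stored on its own."""
    return element * n_modes + mode


def fused_index(element, mode, run, n_modes, n_runs):
    """Linear index of a DOF when the fused runs are the innermost dimension."""
    return (element * n_modes + mode) * n_runs + run


# Three modes per element and four fused runs, as on the slide.
first_of_element_1 = fused_index(element=1, mode=0, run=0, n_modes=3, n_runs=4)
```

Because all runs of one mode sit next to each other, a vector instruction can load the same DOF of several simulations at once.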

  6. Key Advantages
     • Full vector operations, even for sparse matrix operators
     • Automatic memory alignment
     • Read-only data shared among all runs
     • Lower sensitivity to latency (memory & network)
     [Figure: bar chart; y-axis: relative arithmetic intensity (bar labels between 1.0 and 6.8); x-axis: number of fused simulations (1, 2, 4, 8, 16) for each convergence rate.] Relative arithmetic intensities. Shown are convergence rates 2-5 and fusion of 2, 4, 8, 16 simulations vs. non-fused for the elastic wave equations, using an ADER-DG solver. [ISC17]

  7. “Similar Enough”: EDGE’s Approach
     1. Identical mesh for all fused simulations
     2. Identical simulation parameters:
        1. Start and end time
        2. Convergence rate
        3. “Frequency” of wave field output, “frequency” and location of seismic receivers
     3. Identical material parameters (velocity model)
     4. “Sources”:
        1. Arbitrary initial DOFs
        2. Kinematic sources: Fused or non-fused point sources
        3. Spontaneous rupture: Identical friction law, other parameters (e.g., nucleation, initial stresses, coefficients) arbitrary
     [Figure: eight panels, numbered 1-8; caption truncated: "… simulations (SoA) with point sources at different locations"]
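The "similar enough" rules can be read as a compatibility check: some settings must match across all fused simulations, while the sources may differ. The sketch below encodes that reading. Every name in it (`Setup`, `MUST_MATCH`, `can_fuse`, the field names, and the example values) is hypothetical and chosen for illustration; it is not EDGE's configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    # Settings the slide lists as identical for all fused simulations.
    mesh: str
    start_time: float
    end_time: float
    order: int
    output_interval: float
    receivers: tuple
    velocity_model: str
    friction_law: str = ""
    # Settings the slide lists as arbitrary per simulation.
    initial_dofs: str = ""
    rupture_parameters: str = ""


MUST_MATCH = ("mesh", "start_time", "end_time", "order", "output_interval",
              "receivers", "velocity_model", "friction_law")


def can_fuse(setups):
    """Return True if every setup agrees with the first one on all shared settings."""
    first = setups[0]
    return all(getattr(s, name) == getattr(first, name)
               for s in setups[1:] for name in MUST_MATCH)


base = dict(mesh="example.msh", start_time=0.0, end_time=9.0, order=4,
            output_interval=0.1, receivers=((8647.0, 5764.0, 0.0),),
            velocity_model="example_model")
a = Setup(**base, initial_dofs="source at location A")
b = Setup(**base, initial_dofs="source at location B")
c = Setup(**{**base, "order": 5})
```

Here `can_fuse([a, b])` holds because only the initial DOFs differ, whereas `can_fuse([a, c])` fails on the convergence rate.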

  8. Performance: LOH.1
     • Orders: 2-6 (non-fused), 2-4 (fused)
     • Unstructured tetrahedral mesh: 350,264 elements
     • Single node of Cori-II (68 core Intel Xeon Phi x200, code-named Knights Landing)
     • EDGE vs. SeisSol (GTS, git-tag 201511)
     [Figure: LOH.1 Benchmark: Example mesh and material regions [ISC16_1]]
     [Figure: u (m/s) over time (s), 0-9 s; legend: reference, EDGE O4.] Synthetic seismogram of EDGE for quantity u at the ninth seismic receiver located at (8647 m, 5764 m, 0) in red. The reference solution is shown in black. Detailed setup: [ISC17]

  9. Fused Simulations: Speedup
     [Figure: bar chart; y-axis: speedup of EDGE over SeisSol; x-axis: configuration (order, #fused simulations) with O2C1, O2C8, O3C1, O3C8, O4C1, O4C8, O5C1, O6C1; bar labels, not matched to configurations here: 4.60, 2.87, 1.82, 1.24, 0.91, 0.96, 0.80, 0.74.] Speedup of EDGE over SeisSol (GTS, git-tag 201511). Convergence rates O2 − O6: single non-fused forward simulations (O2C1-O6C1). Additionally, per-simulation speedups for orders O2 − O4 when using EDGE’s full capabilities by fusing eight simulations (O2C8-O4C8). [ISC17]

  10. Weak: Setup
     • Regular cubic mesh, 5 Tets per Cube, 4th order (P3) and 6th order (P5)
     • Imitates convergence benchmark
     • 276K elements per node
     • 1-9000 nodes of Cori-II (9000 nodes = 612,000 cores)
     [Figure: log-log plot; y-axis: linf error ($10^{-8}$ to $10^{1}$); x-axis: edge length (m) with 50, 25, 20, 10, 5, 3 1/3, 2.5, 2; legend: O1-O5 for Q8 with C1, C4, C8.] Convergence of EDGE in the L∞-norm. Shown are orders O1 − O5 for v (Q8) when utilizing EDGE’s fusion capabilities with shifted initial conditions. For clarity, from the total of eight fused simulations, only errors of the first (C1), fourth (C4) and last simulation (C8) are shown. [ISC17]

  11. Weak: Results
     • O6C1 @ 9K nodes: 10.4 PFLOPS (38% of peak)
     • O4C8 vs. O4C1 @ 9K nodes: 2.0x speedup
     [Figure: scaling plot.] Weak scaling study on Cori-II. Shown are hardware and non-zero peak efficiencies in flat mode. O denotes the order and C the number of fused simulations. [ISC17]

  12. Strong: LOH.1
     • Orders: 4 & 6 (non-fused), 4 (fused)
     • Unstructured tetrahedral mesh: 172,386,915 elements
     • 32-3200 nodes of Theta (64 core Intel Xeon Phi x200, code-named Knights Landing)
     • 3200 nodes = 204,800 cores
     [Figure: LOH.1 Benchmark: Example mesh and material regions [ISC16_1]]
     [Figure: frequency (Hz) over time (s), 0-8 s.] Time-frequency misfit for quantity u at the ninth seismic receiver located at (8647 m, 5764 m, 0) and in a frequency range between 0.13 Hz and 5 Hz. Detailed setup: [ISC17], Visualization: TF-MISFIT_GOF_CRITERIA, http://nuquake.eu

  13. Strong: Results
     • O6C1 @ 3.2K nodes: 3.4 PFLOPS (40% of peak)
     • O4C8 vs. O4C1 @ 3.2K nodes: 2.0x speedup
     [Figure: scaling plot.] Strong scaling study on Theta. Shown are hardware and non-zero peak efficiencies in flat mode. O denotes the order and C the number of fused simulations. [ISC17]
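The sustained performance and peak fractions quoted for the weak and strong scaling runs imply a machine peak and a per-node peak. The sketch below only rearranges the numbers from the slides; the helper name is hypothetical, and because the percentages are rounded the results are approximate.

```python
def implied_peak(sustained_pflops, fraction_of_peak, nodes):
    """Return (machine peak in PFLOPS, per-node peak in TFLOPS) implied by the slide figures."""
    peak_pflops = sustained_pflops / fraction_of_peak
    return peak_pflops, peak_pflops * 1000.0 / nodes


# Weak scaling on Cori-II: 10.4 PFLOPS at 38% of peak on 9000 nodes.
cori_peak, cori_node = implied_peak(10.4, 0.38, 9000)
# Strong scaling on Theta: 3.4 PFLOPS at 40% of peak on 3200 nodes.
theta_peak, theta_node = implied_peak(3.4, 0.40, 3200)

print(f"Cori-II: {cori_peak:.1f} PFLOPS total, {cori_node:.2f} TFLOPS per node")
print(f"Theta:   {theta_peak:.1f} PFLOPS total, {theta_node:.2f} TFLOPS per node")
```

This gives roughly 27.4 PFLOPS for the 9000 Cori-II nodes and 8.5 PFLOPS for the 3200 Theta nodes, that is, about 3.0 and 2.7 TFLOPS per node.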

  14. EDGE: Current and Upcoming
     Current:
     • Elements: Line, rectangular quads, 3-node triangles, rectangular hexes, 4-node tets
     • Equations: Advection (FV+ADER-DG: 1D, 2D, 3D), Shallow Water (FV: 1D), Elastic Wave Equations (FV+ADER-DG: 2D, 3D)
     • Parallelization: Assembly kernels for WSM, SNB, HSW, KNC (non-fused), KNL (fused & non-fused), OpenMP (custom), MPI (overlapping)
     • Continuity: Continuous Integration (sanity checks), Continuous Delivery (automated convergence + benchmark runs), automated code coverage, automated license checks, container bootstrap
     • License: 3-clause BSD
     Upcoming:
     • Sparse, fused assembly kernels for orders 5+
     • Kinematic Sources (Standard Rupture Format): Support for fused and non-fused source descriptions
     • Spontaneous Rupture Simulations
     • Grouped Local Time Stepping
     • EDGEcut: Automated surface and volume meshing
     • Public in next few weeks: http://dial3343.org
