Harnessing the Intel Xeon Phi x200 Processor for Earthquake Simulations
2017 IXPUG US Annual Meeting
Alexander Breuer, Yifeng Cui, Alexander Heinecke (Intel), Josh Tobin, Chuck Yount (Intel)
AWP-ODC-OS: What is AWP-ODC-OS?
• AWP-ODC-OS (Anelastic Wave Propagation, Olsen, Day, Cui): simulates seismic wave propagation after a fault rupture
• Used extensively by the Southern California Earthquake Center (SCEC) community
• License: BSD 2-Clause
• https://github.com/HPGeoC/awp-odc-os
Figure: Combined hazard map of CyberShake Study 15.4 (LA, CVM-S4.26) and CyberShake Study 17.4 (Central California, CCA-06). AWP-ODC simulations are used to generate hazard maps. Colors show 2-second-period spectral acceleration (SA) for 2% exceedance probability in 50 years.
EDGE: What is EDGE?
• Extreme-scale Discontinuous Galerkin Environment (EDGE): seismic wave propagation through DG-FEM
• Focus: problem settings with high geometric complexity, e.g., mountain topography
• License: BSD 3-Clause (software), CC0 for supporting files (e.g., user guide)
• http://dial3343.org
Figure: Example of hypothetical seismic wave propagation with mountain topography using EDGE. Shown is the surface of the computational domain covering the San Jacinto fault zone between Anza and Borrego Springs in California. Colors denote the amplitude of the particle velocity, where warmer colors correspond to higher amplitudes.
Two Representative Codes
AWP-ODC-OS:
• Finite difference scheme: 4th order in space, 2nd order in time
• Staggered-grid, velocity/stress formulation of the elastodynamic equations with frequency-dependent attenuation (equations sketched after this slide)
• Memory bandwidth bound
EDGE:
• Discontinuous Galerkin Finite Element Method (DG-FEM)
• Unstructured tetrahedral meshes
• Small matrix kernels in the inner loop
• Compute bound (high orders)
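For orientation, a minimal sketch of the velocity-stress formulation that AWP-ODC-OS discretizes, written here for an isotropic elastic medium (the production code additionally carries memory variables for the frequency-dependent attenuation):

    \rho \,\partial_t v_i = \partial_j \sigma_{ij} + f_i
    \partial_t \sigma_{ij} = \lambda \,\delta_{ij}\, \partial_k v_k + \mu \left( \partial_j v_i + \partial_i v_j \right)

Here v is the particle velocity, \sigma the stress tensor, f a body force, \rho the density, \lambda and \mu the Lamé parameters; repeated indices are summed.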
AWP-ODC-OS: Boosting Single-Node Performance with Vector Folding
• Vector folding data layout (indexing sketch after this slide):
  • Stores elements in small SIMD-sized multi-dimensional blocks
  • Reduces memory bandwidth demands by increasing reuse
• YASK (Yet Another Stencil Kernel):
  • Open-source (MIT License) framework from Intel
  • Inputs scalar stencil code
  • Creates optimized kernels using vector folding and other optimizations
  • https://github.com/01org/yask
Figure: Traditional vectorization requires 9 cache loads per SIMD result; two-dimensional vector folding requires only 5 cache loads.
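To make the data layout concrete, below is a minimal indexing sketch (not YASK's actual API; the fold shape and names are illustrative). A fold of 4x4x1 single-precision elements fills one 512-bit SIMD register with a small 2D tile, so neighboring stencil results reuse loaded cache lines in two dimensions:

    // Minimal sketch (not YASK's actual API): index math for a vector-folded
    // layout with a hypothetical fold of FX x FY x FZ = 4 x 4 x 1 elements.
    // A logical grid point (x, y, z) is split into a block index and an
    // intra-block offset; the FX*FY*FZ elements of one block are contiguous,
    // so one SIMD register load fetches a small 2D tile instead of a 1D run.
    #include <cstddef>

    constexpr std::size_t FX = 4, FY = 4, FZ = 1;   // fold (vector block) shape

    struct FoldedGrid3D {
      std::size_t nx, ny, nz;   // logical dimensions, assumed multiples of the fold
      float*      data;         // nx*ny*nz floats, stored block by block

      float& at(std::size_t x, std::size_t y, std::size_t z) const {
        // Block coordinates and offsets within the block.
        const std::size_t bx = x / FX, ox = x % FX;
        const std::size_t by = y / FY, oy = y % FY;
        const std::size_t bz = z / FZ, oz = z % FZ;
        // Blocks laid out in row-major order over (bz, by, bx) ...
        const std::size_t block = (bz * (ny / FY) + by) * (nx / FX) + bx;
        // ... and elements laid out in row-major order within a block.
        const std::size_t offset = (oz * FY + oy) * FX + ox;
        return data[block * (FX * FY * FZ) + offset];
      }
    };

A traditional layout would address the same point as data[(z*ny + y)*nx + x]. With the folded layout, each 4x4x1 block of 16 floats occupies exactly one 64-byte cache line, so a stencil touching a 2D neighborhood needs fewer distinct cache lines per SIMD result, which is the 9-vs-5 load reduction in the slide's figure.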
AWP-ODC-OS: Vector Folding Performance
• Hardware: Intel Xeon Phi 7210
• Domain size: 1024x1024x64
• Single precision: vector blocks of 16 elements
• Performance measured by the YASK proxy
• Performance in Mega Lattice Updates per Second (MLUPS) out of MCDRAM (flat mode)
• Insight: vector folding achieves a speedup of up to 1.6x
Figure: MLUPS for eight vector-fold shapes (1x1x16, 1x16x1, 2x8x1, 4x1x4, 4x4x1, 8x1x2, 8x2x1, 16x1x1). Traditional one-dimensional vectorization reaches 812 MLUPS, while the best folds reach 1,311-1,313 MLUPS, a speedup of up to 1.6x [ISC_17_2].
EDGE: LIBXSMM
• LIBXSMM: library for small sparse and dense matrix-matrix multiplications, BSD 3-Clause [SC14]
• JIT code generation of matrix kernels (dispatch sketch after this slide)
• Hardware: Intel Xeon Phi 7250, flat mode
• Insight: close to peak performance out of a hot cache
• https://github.com/hfp/libxsmm
Figure: Performance comparison of dense matrix-matrix multiplications (LIBXSMM 1.8.1 vs. autovectorized code from ICC 2017.0.4 and Intel MKL 2017.0.3 direct call) on Knights Landing at 1.2 GHz out of a hot cache, for convergence orders 2-7. Shown is the stiffness or flux matrix multiplied with the DOFs; plotted are the percentage of peak performance and the speedup of LIBXSMM over MKL direct (1.5x-6.0x).
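A minimal sketch of LIBXSMM's classic dispatch interface (C API, callable from C++). The matrix shape is illustrative rather than EDGE's actual kernel: a 20x20 stiffness-like matrix applied to a 20x9 block of DOFs, roughly the footprint of a 4th-order elastic DG update; EDGE additionally generates sparse kernels through LIBXSMM.

    // Minimal sketch of LIBXSMM's dispatch interface (illustrative sizes).
    // Build e.g. with: icpc example.cpp -lxsmm (adjust paths/flags as needed).
    #include <libxsmm.h>
    #include <vector>
    #include <cstdio>

    int main() {
      const libxsmm_blasint m = 20, n = 9, k = 20;   // C(m,n) += A(m,k) * B(k,n)
      std::vector<double> a(m * k, 1.0), b(k * n, 1.0), c(m * n, 0.0);

      // JIT-generate (or fetch from LIBXSMM's code registry) a kernel for this
      // exact shape; NULL arguments select the defaults (tight leading
      // dimensions, alpha = 1, beta = 1, no prefetch).
      libxsmm_dmmfunction kernel =
          libxsmm_dmmdispatch(m, n, k, NULL, NULL, NULL, NULL, NULL, NULL, NULL);

      if (kernel) {
        kernel(a.data(), b.data(), c.data());        // column-major, like BLAS
      } else {
        // Fallback if JIT is unavailable for this shape/architecture.
        libxsmm_dgemm(NULL, NULL, &m, &n, &k, NULL, a.data(), NULL,
                      b.data(), NULL, NULL, c.data(), NULL);
      }
      std::printf("c[0] = %f\n", c[0]);
      return 0;
    }

The dispatched function pointer can be cached and reused for every element of the mesh, which is how the one-time JIT cost is amortized.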
AWP-ODC-OS: Leveraging KNL Memory Modes
• 26 three-dimensional arrays, 17 of which are read-only or read-heavy
• Heuristically identified arrays that are good candidates to be placed in DDR
• Hybrid memory placement (allocation sketch after this slide):
  • Option 1: increase available memory by 26% and improve overall performance
  • Option 2: increase available memory to 46 GB at 50% of optimal performance
Figure: Relative performance of AWP-ODC-OS (hybrid vs. pure MCDRAM) as arrays are moved from MCDRAM to DDR (0-16 grids in DDR). In each case, the best-performing combination was found via heuristics and simple search [ISC_17_2].
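One way to express such a hybrid placement in flat mode is the memkind/hbwmalloc API sketched below; this is an illustration under assumed array names (vx, lambda) and sizes, not necessarily how AWP-ODC-OS selects or allocates its grids.

    // Minimal sketch of hybrid MCDRAM/DDR placement in KNL flat mode using the
    // memkind/hbwmalloc API. Bandwidth-critical arrays go to MCDRAM via
    // hbw_malloc(); read-heavy arrays that tolerate DDR bandwidth use malloc().
    #include <hbwmalloc.h>   // from the memkind library; link with -lmemkind
    #include <cstddef>
    #include <cstdlib>
    #include <cstdio>

    int main() {
      const std::size_t n = 1024ull * 1024ull * 64ull;   // grid points per array
      const bool hbw_ok = (hbw_check_available() == 0);  // is MCDRAM exposed?

      // Frequently updated velocity/stress grid: keep in MCDRAM if possible.
      float* vx = static_cast<float*>(hbw_ok ? hbw_malloc(n * sizeof(float))
                                             : std::malloc(n * sizeof(float)));
      // Read-only material parameters: acceptable in DDR, freeing MCDRAM capacity.
      float* lambda = static_cast<float*>(std::malloc(n * sizeof(float)));

      if (!vx || !lambda) { std::fprintf(stderr, "allocation failed\n"); return 1; }

      // ... time-stepping kernels stream over vx (MCDRAM) and lambda (DDR) ...

      if (hbw_ok) hbw_free(vx); else std::free(vx);
      std::free(lambda);
      return 0;
    }

Alternatives include numactl memory policies or memkind partitions; the point is simply that bandwidth-critical grids stay in MCDRAM while read-heavy ones are demoted to DDR.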
AWP-ODC-OS: Architecture Comparison
• Xeon Phi KNL 7290: 2x speedup over NVIDIA K20X; 97% of NVIDIA Tesla P100 performance
• Memory bandwidth accurately predicts performance of architectures (as measured by STREAM and HPCG-SpMv)
Figure: Single-node performance of AWP-ODC-OS (MLUPS) and memory bandwidth (GB/s; MCDRAM and DDR where applicable) on Xeon E5-2630v3, KNL 7250 (cache mode), KNL 7250 (flat mode), KNL 7290 (flat mode), NVIDIA K20X, M40, and P100. Bandwidth is measured by STREAM and HPCG-SpMv [ISC_17_2].
EDGE: Fused Simulations
• Exploits inter-simulation parallelism (layout sketch after this slide):
  • Full vector operations, even for sparse matrix operators
  • Automatic memory alignment
  • Read-only data shared among all runs
  • Lower sensitivity to latency (memory & network)
Figure: Illustration of fused simulations in EDGE and of the corresponding memory layout for the advection equation using line elements in a third-order configuration. Single forward simulation: degrees of freedom are stored element by element, mode by mode (entries 0-8). Four fused simulations: for every element and mode, the values of the four runs are stored contiguously (entries 0-35), so the fused-run index is the fastest (unit-stride) dimension.
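A minimal sketch of the fused layout and of why sparse operators still vectorize (names and the fused count are illustrative, not EDGE's actual data structures):

    // Minimal sketch of the fused-simulations layout: DOFs are stored as
    // dofs[element][mode][fused_run], with the fused-run index unit-stride,
    // matching the layout figure above. A sparse operator over the modes then
    // becomes dense, aligned vector operations across the fused runs.
    #include <cstddef>
    #include <vector>

    constexpr std::size_t kFused = 8;   // number of fused forward simulations

    inline std::size_t idx(std::size_t el, std::size_t mode, std::size_t run,
                           std::size_t nModes) {
      return (el * nModes + mode) * kFused + run;   // fused run is fastest
    }

    // y += A * x for one element, where A is a small sparse (modes x modes)
    // operator stored as (row, col, value) triples. The inner loop over the
    // fused runs is contiguous with trip count kFused, so it maps to full SIMD
    // vectors even though A itself is sparse.
    struct Triple { std::size_t row, col; double val; };

    void applySparseOperator(const std::vector<Triple>& A,
                             const double* x, double* y,
                             std::size_t el, std::size_t nModes) {
      for (const Triple& t : A) {
        const double* xi = &x[idx(el, t.col, 0, nModes)];
        double*       yi = &y[idx(el, t.row, 0, nModes)];
        #pragma omp simd
        for (std::size_t r = 0; r < kFused; ++r) {
          yi[r] += t.val * xi[r];   // one vector operation per nonzero of A
        }
      }
    }

Because every fused run applies the same operator to the same mesh, the entries of A (and all other read-only data) are loaded once and shared among the runs, which is the source of the bandwidth and latency benefits listed above.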
EDGE: Fused Simulations Performance
• Orders: 2-6 (non-fused), 2-4 (fused)
• Unstructured tetrahedral mesh: 350,264 elements (LOH.1 benchmark)
• Single node of Cori-II (68-core Intel Xeon Phi x200, code-named Knights Landing)
• EDGE vs. SeisSol (GTS, git-tag 201511)
• Speedup: 2-5x
Figure: LOH.1 benchmark example mesh and material regions [ISC16_1]. Speedup of EDGE over SeisSol (GTS, git-tag 201511) per configuration (order, #fused simulations): single non-fused forward simulations for orders O2-O6 (O2C1-O6C1, speedups of 0.74x-1.24x) and per-simulation speedups when using EDGE's full capabilities by fusing eight simulations for orders O2-O4 (O2C8-O4C8, speedups of 1.82x-4.60x) [ISC17_1].
AWP-ODC-OS: Outperforming 20K GPUs
• Weak scaling studies on NERSC Cori Phase II and TACC Stampede Extension
• Parallel efficiency of over 91% from 1 to 9000 nodes (9000 nodes = 612,000 cores)
• Problem size of 512x512x512 per node (14 GB per node)
• Performance on 9000 nodes of Cori equivalent to the performance of over 20,000 K20X GPUs at 100% scaling
Figure: AWP-ODC-OS weak scaling (parallel efficiency vs. number of nodes, 1-9000) on Cori Phase II and TACC Stampede. We attain 91% scaling from 1 to 9000 nodes. The problem size required 14 GB on each node [ISC_17_2].
EDGE: Reaching 10+ PFLOPS
• Regular cubic mesh, 5 tets per cube, 4th order (O4) and 6th order (O6)
• Imitates convergence benchmark
• 276K elements per node
• 1-9000 nodes of Cori-II (9000 nodes = 612,000 cores)
• O6C1 @ 9K nodes: 10.4 PFLOPS double precision (38% of peak)
• O4C8 @ 9K nodes: 5.0 PFLOPS (18% of peak)
• O4C8 vs. O4C1 @ 9K nodes: 2.0x speedup
Figure: Weak scaling study on Cori-II over 1-9000 nodes. Shown are hardware and non-zero peak efficiencies in flat mode. O denotes the order and C the number of fused simulations [ISC17_1].
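A quick sanity check of the quoted peak fractions, assuming the nominal double-precision peak of Cori-II's KNL 7250 nodes (68 cores, 1.4 GHz, 32 FLOP/cycle; AVX/turbo frequencies would shift the percentages slightly):

    P_{\text{node}} \approx 68 \times 1.4\,\mathrm{GHz} \times 32\,\tfrac{\mathrm{FLOP}}{\mathrm{cycle}} \approx 3.05\,\mathrm{TFLOPS}
    P_{9000} \approx 9000 \times 3.05\,\mathrm{TFLOPS} \approx 27.4\,\mathrm{PFLOPS}
    10.4/27.4 \approx 38\%, \qquad 5.0/27.4 \approx 18\%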
EDGE: Strong at the Limit (50x and 100x)
• Unstructured tetrahedral mesh: 172,386,915 elements
• 32-3200 nodes of Theta (64-core Intel Xeon Phi x200, code-named Knights Landing)
• 3200 nodes = 204,800 cores
• O6C1 @ 3.2K nodes: 3.4 PFLOPS (40% of peak)
• O4C8 vs. O4C1 @ 3.2K nodes: 2.0x speedup
Figure: Strong scaling study on Theta over 32-3200 nodes, with 50x and 100x scaling marked. Shown are hardware and non-zero peak efficiencies in flat mode. O denotes the order and C the number of fused simulations [ISC17_1].