Benchmarking Huawei ARM Multi-Core Processors for HPC workloads Key Liao Center for HPC Shanghai Jiao Tong University Jan 9th, 2019
About Me
Key Liao (廖秋承)
• B.S. in Environmental Science and Engineering, SJTU.
• HPC Engineer at the Center for High Performance Computing, SJTU.
• Leader of the ARM Research Team at CHPC, SJTU.
• Supervisor of the SJTU Student HPC Competition Team.
• Main Research Area: Computer Architecture Theoretical Computer Performance Evaluation Performance Optimization
• Email: keymorrislane@sjtu.edu.cn
Outline
➢ Kunpeng 920
➢ Floating-point Arithmetic
➢ Memory subsystem
➢ Proxy Applications
➢ TeaLeaf
➢ SNAP
➢ CloverLeaf
➢ Real-world applications
➢ GTC-P
Chips Information
[Figure: chip topology diagram showing cores (labelled core 0, core 1, core 3, core 4) arranged in groups (grp 0 to grp 5)]
Chips Information
Model | Intel Xeon Gold 6148 | Hi1616 | Kunpeng 920 (Engineering Sample)
Arch | Skylake-SP | ARM | ARM
Lithography | 14nm | 16nm | 7nm
Main Frequency (GHz) | 2.4 | 2.4 | 2.0
Num of Cores | 20 | 32 | 48
Vectorization Ins/Width | AVX512/512bits | ASIMD/128bits | ASIMD/128bits
Theoretical DP Peak Performance (GFLOPS)* | 1536 | 307.2 | 768
L3 Cache | 1.375 MB | 32MB (shared) | 64MB (shared)
DRAM Support | 6 x DDR4-2666 | 4 x DDR4-2400 | 8 x DDR4-3200
TDP (W) | 150 | 70 | 150
Launch Time | 2017 | 2016 | 2019
* Theoretical DP peak performance is calculated from the frequency we measured while each chip was running its best vectorization instruction set.
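The theoretical DP peak row can be reproduced as cores x frequency x DP FLOPs per cycle per core. The sketch below is only a consistency check: the FLOPs-per-cycle values (32, 4, 8) are assumptions chosen because they match the table, not figures stated on the slide, and the names `CHIPS` and `dp_peak_gflops` are hypothetical.

```python
# Theoretical DP peak = cores x frequency (GHz) x DP FLOPs per cycle per core.
# The FLOPs-per-cycle values are assumptions that reproduce the table;
# they are not stated on the slide.
CHIPS = {
    # name: (cores, frequency in GHz, assumed DP FLOPs/cycle/core)
    "Xeon Gold 6148": (20, 2.4, 32),  # assumed: 2 AVX-512 FMA units x 8 lanes x 2
    "Hi1616": (32, 2.4, 4),           # assumed: 1 128-bit FMA unit x 2 lanes x 2
    "Kunpeng 920": (48, 2.0, 8),      # assumed: 2 128-bit FMA units x 2 lanes x 2
}


def dp_peak_gflops(cores, ghz, flops_per_cycle):
    """Return the theoretical double-precision peak in GFLOPS."""
    return cores * ghz * flops_per_cycle


peaks = {name: dp_peak_gflops(*spec) for name, spec in CHIPS.items()}

for name, gflops in peaks.items():
    print(f"{name}: {gflops:.1f} GFLOPS")
```

With these assumptions the three results agree with the table's 1536, 307.2 and 768 GFLOPS.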
Platform Information
Platform | 6148 | 1616 | 920
CPU | Xeon Gold 6148 | Hi1616 | Kunpeng 920
Number of Sockets | 4 | 4 | 8
DRAM Size (GB) | 2048 | 256 | 256
DRAM Frequency (MHz) | 2666 | 2400 | 2666
OS | CentOS 7.5, Linux Kernel 3.10.0 | EulerOS, Kernel 4.11.0 | EulerOS, Kernel 4.14.0
Compiler | All with Intel Parallel Studio XE Cluster Version 2019 Update 1 (Education License) | GNU/GCC-8.2.0 | GNU/GCC-8.2.0
MPI Library | (as above) | MVAPICH2-2.3 | MVAPICH2-2.3
BLAS Library | (as above) | OpenBLAS 0.3.5 | OpenBLAS 0.3.5
Floating-point Arithmetic
• Kunpeng 920 is 41.1% better than Hi1616, compared to a 165.3% increase from Haswell to Skylake in 3 years.
• HPL efficiency on Kunpeng 920 is around 40%, compared to more than 70% on other chips.
[Figure: HPL Benchmark on Four Platforms. Platforms: 2683, 6148, 1616, 920; series: Single Socket, Dual Socket. Single-socket data labels: 2683 360.2, 6148 955.32, 1616 220.2, 920 310.5. Remaining data labels (dual socket): 2252.2, 750, 670.7, 475.2.]
[Figure: HPL Efficiency on Four Platforms. Same platforms and series; vertical axis from 0.0% to 90.0%.]
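HPL efficiency here is the measured HPL result divided by the theoretical DP peak. The sketch below assumes the single-socket chart labels pair with the platforms as 6148: 955.32, 1616: 220.2, 920: 310.5 (a pairing inferred from the bullets, not stated explicitly), and takes the peaks from the chip table. Function names are hypothetical.

```python
# Single-socket HPL results (assumed pairing of chart labels to platforms)
# and theoretical DP peaks from the chip table, both in GFLOPS.
HPL_SINGLE = {"6148": 955.32, "1616": 220.2, "920": 310.5}
DP_PEAK = {"6148": 1536.0, "1616": 307.2, "920": 768.0}


def efficiency(measured, peak):
    """Fraction of the theoretical peak that HPL actually reaches."""
    return measured / peak


def percent_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


eff = {chip: efficiency(HPL_SINGLE[chip], DP_PEAK[chip]) for chip in HPL_SINGLE}
gain_920_vs_1616 = percent_gain(HPL_SINGLE["920"], HPL_SINGLE["1616"])

for chip, value in eff.items():
    print(f"{chip}: {value:.1%} of peak")
print(f"920 vs 1616: +{gain_920_vs_1616:.1f}%")
```

Under this pairing the 920 lands at roughly 40% of peak and the gain over Hi1616 at roughly 41%, in line with the two bullets.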
Floating-point Arithmetic
• Hi1616
  • 128-bit SIMD
  • SP: 614.4 Gflops
  • DP: 307.2 Gflops
• Hi1620
  • 128-bit SIMD
  • SP: 1,536 Gflops
  • DP: 384 Gflops
• Throughput of DP SIMD instructions is limited.
• Not a good chip for intense DP computation.
• DP computation is not as important as people used to think.
• Trend toward SVE (Scalable Vector Extension) and VLA (vector-length-agnostic) code.
FMA Instruction Throughput
Chip | SP Scalar | DP Scalar | SP Vector | DP Vector
Hi1616 | 2 ins/cycle, 9.596 GFlops | 2 ins/cycle, 9.596 GFlops | 1 ins/cycle, 19.194 GFlops | 1 ins/cycle, 9.596 GFlops
Kunpeng920 | 2 ins/cycle, 7.989 GFlops | 2 ins/cycle, 7.989 GFlops | 2 ins/cycle, 31.954 GFlops | 1 ins/cycle, 7.989 GFlops
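The per-core GFlops in the FMA table follow from instructions per cycle x SIMD lanes x 2 (one FMA is a multiply plus an add) x frequency. A minimal sketch, assuming the 2.4 GHz and 2.0 GHz frequencies and the 32 and 48 core counts from the chip table; `fma_gflops` and the dictionaries are hypothetical names.

```python
# Lanes per instruction for 128-bit SIMD: 4 x 32-bit SP or 2 x 64-bit DP.
LANES = {"SP Scalar": 1, "DP Scalar": 1, "SP Vector": 4, "DP Vector": 2}

# chip: (frequency in GHz, core count, FMA instructions per cycle by kind)
CHIPS = {
    "Hi1616": (2.4, 32, {"SP Scalar": 2, "DP Scalar": 2, "SP Vector": 1, "DP Vector": 1}),
    "Kunpeng920": (2.0, 48, {"SP Scalar": 2, "DP Scalar": 2, "SP Vector": 2, "DP Vector": 1}),
}


def fma_gflops(ins_per_cycle, lanes, ghz):
    """Per-core GFLOPS; each FMA counts as two floating-point operations per lane."""
    return ins_per_cycle * lanes * 2 * ghz


per_core = {
    chip: {kind: fma_gflops(ipc, LANES[kind], ghz) for kind, ipc in rates.items()}
    for chip, (ghz, _cores, rates) in CHIPS.items()
}
# Whole-chip peak = per-core vector rate x number of cores.
chip_dp = {chip: per_core[chip]["DP Vector"] * CHIPS[chip][1] for chip in CHIPS}
chip_sp = {chip: per_core[chip]["SP Vector"] * CHIPS[chip][1] for chip in CHIPS}

print(per_core)
print(chip_sp, chip_dp)
```

The ideal per-core values (9.6, 19.2, 8 and 32 GFlops) sit just above the table entries, and the whole-chip products match the SP and DP bullets.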
Memory Subsystem
Normalized Average Latency (ns)
Chip | L1 | L2 | L3 | DRAM
6148 | 1x | 1x | 1x | 1x
920 | 0.33x | 0.71x | 1.57x | 1.25x
[Figure: Normalized Bandwidth of Different Memory Layers, 6148 vs 920. Vertical axis: Relative Scale, with every 6148 bar at 1. Categories: L1 Read, L1 Write, L2 Read, L2 Write, L3 Read, L3 Write, DRAM Read, DRAM Write. 920 data labels in descending order: 3.6, 2.1, 2.0, 1.6, 1.6, 1.2, 1.2, 1.1.]
Chip Communication - Bandwidth
Platform | 2680 | 6148 | 1616 | 920
Technique | QPI | UPI | Hydra Interface | Hydra Interface
Bandwidth (GB/s) | 35.2 | 40.8 | 10.0 | 12.7
Proxy Applications
▪ SNAP
  ▪ A proxy application for a modern deterministic discrete ordinates transport code.
▪ TeaLeaf
  ▪ Proxy app for solving the linear heat conduction equation on a spatially decomposed regular grid, utilising a five-point finite difference stencil.
▪ CloverLeaf
  ▪ Solves Euler's equations of compressible fluid dynamics, under a Lagrangian-Eulerian scheme, on a two-dimensional spatial regular structured grid.
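As an illustration of the five-point finite difference stencil mentioned for TeaLeaf, here is a minimal sketch of one explicit heat-conduction update on a small regular grid. It only shows the stencil's access pattern; it is not TeaLeaf's solver, and `heat_step` and `alpha` are hypothetical names.

```python
def heat_step(u, alpha):
    """Apply one explicit five-point-stencil update to the interior of a 2D grid.

    `u` is a list of rows; boundary cells keep their values.
    `alpha` stands for dt * conductivity / dx**2 (assumed uniform) and should
    stay at or below 0.25 for this explicit scheme to remain stable.
    """
    ny, nx = len(u), len(u[0])
    new = [row[:] for row in u]
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            # Discrete Laplacian from the four neighbours and the centre cell.
            laplacian = (
                u[j][i - 1] + u[j][i + 1] + u[j - 1][i] + u[j + 1][i]
                - 4.0 * u[j][i]
            )
            new[j][i] = u[j][i] + alpha * laplacian
    return new


# A single hot cell in the middle of a cold 5 x 5 grid.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
grid = heat_step(grid, alpha=0.1)

for row in grid:
    print(row)
```

Each interior update reads five values and writes one, which is why stencil codes of this kind tend to stress memory bandwidth rather than arithmetic units.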
Proxy Applications - Results
Data labels from the four bar charts (lower is better):
SNAP Grind Time (ns)
Platform | Single | 2-Socket Strong Scaling | 2-Socket Weak Scaling
6148 | 0.77 | 0.916 | 0.91
920 | 0.766 | 0.56 | 0.462
TeaLeaf Wall Clock (s)
Platform | Single | 2-Socket Strong Scaling | 2-Socket Weak Scaling
6148 | 601.18 | 339.5 | 989.88
920 | 364.42 | 193.05 | 607.13
CloverLeaf-bm16 Wall Clock (s)
Platform | Single | 2-Socket Strong Scaling
6148 | 1342.61 | 755.5
920 | 1041.56 | 571.67
CloverLeaf-bm128_short Wall Clock (s)
Platform | Single | 2-Socket Strong Scaling
6148 | 182.58 | 120.65
920 | 208.78 | 109.8
Proxy Applications - Results
Normalized Performance of Proxy Applications (920 relative to 6148)
Application | Single Socket | Dual Socket
SNAP | 1.005 | 1.636
TeaLeaf | 1.65 | 1.759
CloverLeaf_bm16 | 1.289 | 1.322
CloverLeaf_bm128_short | 0.875 | 1.099
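The normalized figures appear to be the 6148 result divided by the 920 result for each application, since every metric is lower-is-better. The sketch below checks that reading against the grind times and wall-clock times from the results charts; the pairing of chart labels to platforms is an assumption, and dual socket is taken to mean 2-socket strong scaling.

```python
# app: (6148 value, 920 value). SNAP is grind time in ns, the others are
# wall-clock seconds. Pairing of chart labels to platforms is assumed.
SINGLE = {
    "SNAP": (0.77, 0.766),
    "TeaLeaf": (601.18, 364.42),
    "CloverLeaf_bm16": (1342.61, 1041.56),
    "CloverLeaf_bm128_short": (182.58, 208.78),
}
DUAL = {  # 2-socket strong scaling runs
    "SNAP": (0.916, 0.56),
    "TeaLeaf": (339.5, 193.05),
    "CloverLeaf_bm16": (755.5, 571.67),
    "CloverLeaf_bm128_short": (120.65, 109.8),
}


def normalized(results):
    """Performance of the 920 relative to the 6148 (values above 1 favour the 920)."""
    return {app: baseline / kunpeng for app, (baseline, kunpeng) in results.items()}


single = normalized(SINGLE)
dual = normalized(DUAL)

for app in SINGLE:
    print(f"{app}: single {single[app]:.3f}, dual {dual[app]:.3f}")
```

Under these assumptions the ratios reproduce all eight normalized values to three decimals.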
Proxy Applications - SNAP
• Loads a relatively large data set (9600 cells, nang=64, ng=332, nstep=100) and performs random access within it (dim3_sweep.f90).
• If OpenMP is enabled, threads work across the data set.
• MPI_Recv becomes a hotspot after scaling across sockets.
[Figure: SNAP Grind Time (ns) for 6148 and 920; series: Single, 2-Socket Strong Scaling, 2-Socket Weak Scaling. Annotations: "Same single node performance" (single: 6148 0.77, 920 0.766) and "26.9% Speedup" (920: 0.766 single to 0.56 with 2-socket strong scaling).]
Proxy Applications - TeaLeaf
• Memory subsystem bandwidth.
• 3840 x 3840, 10000 steps.
TeaLeaf Relative Speedup
Platform | Single | 2-Socket Strong Scaling | 2-Socket Weak Scaling
6148 | 1.0x | 1.77x | 0.607x
920 | 1.0x | 1.89x | 0.600x
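The relative speedups can be derived as single-socket wall-clock time divided by the 2-socket wall-clock time. A minimal sketch using the TeaLeaf times from the results chart; the pairing of chart labels to platforms and runs is an assumption, and the names are hypothetical.

```python
# TeaLeaf wall-clock times in seconds (assumed pairing of chart labels).
TEALEAF = {
    "6148": {"single": 601.18, "strong": 339.5, "weak": 989.88},
    "920": {"single": 364.42, "strong": 193.05, "weak": 607.13},
}


def relative_speedup(t_single, t_scaled):
    """Speedup of a scaled run relative to the single-socket run."""
    return t_single / t_scaled


speedup = {
    chip: {
        run: relative_speedup(times["single"], times[run])
        for run in ("strong", "weak")
    }
    for chip, times in TEALEAF.items()
}

for chip, runs in speedup.items():
    print(f"{chip}: strong {runs['strong']:.2f}x, weak {runs['weak']:.3f}x")
```

A strong-scaling value near 2x means the second socket is used almost fully. For the weak-scaling run, the same ratio falls to about 0.6 on both platforms, which indicates the 2-socket run took roughly 1.65 times as long as the single-socket run.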
Proxy Applications - CloverLeaf
• Memory subsystem bandwidth.
• But double-precision floating-point arithmetic intensity increases as the number of cells increases and the total number of iterations decreases.
CloverLeaf Wall Clock (s) and Strong Scaling Speedup
Benchmark | Platform | Single | 2-Socket Strong Scaling | Speedup
CloverLeaf-bm16 | 6148 | 1342.61 | 755.5 | 1.77x
CloverLeaf-bm16 | 920 | 1041.56 | 571.67 | 1.82x
CloverLeaf-bm128_short | 6148 | 182.58 | 120.65 | 1.51x
CloverLeaf-bm128_short | 920 | 208.78 | 109.8 | 1.90x
GTC-P
▪ GTC-P: Gyrokinetic Toroidal Code - Princeton
▪ GTC-P is a Particle-in-Cell code that delivers fusion simulations at extreme scales on supercomputers worldwide, including Tianhe-2, Titan, and TaihuLight, which feature CPU, GPU, and many-core processors.
Supported by NSF SAVI Project
GTC-P Kunpeng 920
GTC-P GTC-P Performance With Different Combination of Processes and Threads on Kunpeng 920
GTC-P Kunpeng 920
Conclusion
▪ Kunpeng 920 handles scientific computations with relatively low arithmetic intensity (< 4 DP FLOPs per byte) better than Intel's recent chip at a similar price.
▪ Pro
  ▪ Good topology design for threading.
  ▪ High bandwidth and low latency; does well in many memory-bound apps.
▪ Con
  ▪ Low bandwidth of the Hydra Interface.
  ▪ Low DP arithmetic capability.
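The arithmetic-intensity argument can be pictured with a roofline-style bound: attainable GFLOPS = min(peak, bandwidth x intensity). The sketch below is theoretical only. It assumes per-socket DP peaks from the chip table, the channel counts from the DRAM Support row, the 2666 MHz DRAM setting from the platform table, and 8 bytes per transfer per DDR4 channel; it uses no measured bandwidth from the talk.

```python
def dram_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical DRAM bandwidth in GB/s; 8 bytes assumes a 64-bit DDR4 channel."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: capped by compute peak or by bandwidth x FLOPs per byte."""
    return min(peak_gflops, bandwidth_gbs * intensity)


SOCKETS = {
    # name: (theoretical DP peak in GFLOPS, memory channels, DRAM MT/s as run)
    "6148": (1536.0, 6, 2666),
    "920": (768.0, 8, 2666),
}

bandwidth = {name: dram_bandwidth_gbs(ch, mts) for name, (_, ch, mts) in SOCKETS.items()}
# Machine balance: the intensity (FLOPs per byte) where the two limits meet.
balance = {name: SOCKETS[name][0] / bandwidth[name] for name in SOCKETS}

for intensity in (0.5, 1.0, 4.0, 16.0):
    row = {
        name: attainable_gflops(SOCKETS[name][0], bandwidth[name], intensity)
        for name in SOCKETS
    }
    print(f"{intensity:>5} F/B: {row}")
```

In this model the 920 has the higher bound at low intensity because of its extra memory channels, while the 6148 pulls ahead once the kernel becomes compute-bound.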