Benchmarking Huawei ARM Multi-Core Processors for HPC workloads Key Liao Center for HPC Shanghai Jiao Tong University Jan 9th, 2019

About Me
Key Liao (廖秋承)
B.S. from Environment Science and Engineering, SJTU.
HPC Engineer of Center for High Performance Computing, SJTU.
Leader of ARM Research Team at CHPC, SJTU.
Supervisor of SJTU Student HPC Competition Team.
Main Research Area: Computer Architecture Theoretical Computer Performance Evaluation Performance Optimization
Email: keymorrislane@sjtu.edu.cn

Outline
➢ Kunpeng 920
➢ Floating-point Arithmetic
➢ Memory subsystem
➢ Proxy Applications
  ➢ TeaLeaf
  ➢ SNAP
  ➢ CloverLeaf
➢ Real-world applications
  ➢ GTC-P

Chips Information
[Figure: chip topology diagram with labels core 0, core 1, core 3, core 4 and grp 0 to grp 5]

Chips Information

Model                                   Intel Xeon Gold 6148   Hi1616          Kunpeng 920 (Engineering Sample)
Arch                                    Skylake-SP             ARM             ARM
Lithography                             14nm                   16nm            7nm
Main Frequency (GHz)                    2.4                    2.4             2.0
Num of Cores                            20                     32              48
Vectorization Ins/Width                 AVX512/512bits         ASIMD/128bits   ASIMD/128bits
Theoretical DP Peak Performance
(GFLOPS)*                               1536                   307.2           768
L3 Cache                                1.375 MB               32MB (shared)   64MB (shared)
DRAM Support                            6 x DDR4-2666          4 x DDR4-2400   8 x DDR4-3200
TDP (W)                                 150                    70              150
Launch Time                             2017                   2016            2019

* Theoretical DP peak performance is calculated from the frequency we measured while each chip was running its best vectorization instruction set.
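The "DRAM Support" row can be turned into a theoretical per-socket memory bandwidth. The sketch below is only an illustration and is not from the slides: it assumes every DDR4 channel is 64 bits wide (8 bytes per transfer), and the names `CHIPS` and `dram_bandwidth_gbs` are made up for this example. Measured bandwidth will be lower than these figures.

```python
# Theoretical DRAM bandwidth per socket from the "DRAM Support" row.
# Assumption: every DDR4 channel is 64 bits wide, i.e. 8 bytes per transfer.
BYTES_PER_TRANSFER = 8

# name: (DDR4 channels, transfer rate in MT/s, cores), copied from the table
CHIPS = {
    "Xeon Gold 6148": (6, 2666, 20),
    "Hi1616": (4, 2400, 32),
    "Kunpeng 920": (8, 3200, 48),
}


def dram_bandwidth_gbs(channels, mega_transfers_per_s):
    """Peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return channels * mega_transfers_per_s * BYTES_PER_TRANSFER / 1000.0


for name, (channels, rate, cores) in CHIPS.items():
    bandwidth = dram_bandwidth_gbs(channels, rate)
    print(f"{name}: {bandwidth:.1f} GB/s total, {bandwidth / cores:.2f} GB/s per core")
```

Dividing by the core count is one rough way to see how much bandwidth each core can draw when all cores are busy.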

Platform Information

Platform               6148                               1616                          920
CPU                    Xeon Gold 6148                     Hi1616                        Kunpeng 920
Number of Sockets      4                                  4                             8
DRAM Size (GB)         2048                               256                           256
DRAM Frequency (MHz)   2666                               2400                          2666
Linux                  CentOS 7.5, Kernel 3.10.0          EulerOS, Kernel 4.11.0        EulerOS, Kernel 4.14.0
Compiler               All with Intel Parallel Studio     GNU/GCC-8.2.0                 GNU/GCC-8.2.0
MPI Library            XE Cluster Version 2019 Update 1   MVAPICH2-2.3                  MVAPICH2-2.3
BLAS Library           (Education License)                OpenBLAS 0.3.5                OpenBLAS 0.3.5

Floating-point Arithmetic
• Kunpeng 920 is 41.1% better than Hi1616, compared to a 165.3% increase from Haswell to Skylake in 3 years.
• HPL efficiency on Kunpeng 920 is around 40%, compared to more than 70% on other chips.
[Figure: HPL Benchmark on Four Platforms and HPL Efficiency on Four Platforms (2683, 6148, 1616, 920; single socket and dual socket). Bar labels: 2252.2, 955.32, 670.7, 750, 475.2, 360.2, 310.5, 220.2.]
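HPL efficiency is the measured HPL result divided by the theoretical peak. The sketch below shows that ratio. The pairing of values is an assumption and not stated on the slide: it takes the 310.5 and 220.2 bar labels to be the single-socket results for Kunpeng 920 and Hi1616, and uses the theoretical DP peaks from the Chips Information table (768 and 307.2 GFLOPS).

```python
def hpl_efficiency(rmax_gflops, rpeak_gflops):
    """Fraction of the theoretical peak that HPL actually reached."""
    return rmax_gflops / rpeak_gflops


# Assumed pairing of chart labels to chips (not confirmed by the slide).
kunpeng_920 = hpl_efficiency(310.5, 768.0)
hi1616 = hpl_efficiency(220.2, 307.2)

print(f"Kunpeng 920: {kunpeng_920:.1%}")
print(f"Hi1616:      {hi1616:.1%}")
```

Under that assumed pairing the ratios agree with the bullets above: roughly 40% for Kunpeng 920 and more than 70% for Hi1616.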

Floating-point Arithmetic
• Hi1616
  • 128-bit SIMD
  • SP: 614.4 Gflops
  • DP: 307.2 Gflops
• Hi1620
  • 128-bit SIMD
  • SP: 1,536 Gflops
  • DP: 384 Gflops
• Throughput of DP SIMD instructions is limited.
• Not a good chip for intense DP computation.
• DP computation is not as important as people used to think.
• Trend toward SVE and VLA.

FMA Instruction Throughput
             SP Scalar                   DP Scalar                   SP Vector                    DP Vector
Hi1616       2 ins/cycle, 9.596 GFlops   2 ins/cycle, 9.596 GFlops   1 ins/cycle, 19.194 GFlops   1 ins/cycle, 9.596 GFlops
Kunpeng920   2 ins/cycle, 7.989 GFlops   2 ins/cycle, 7.989 GFlops   2 ins/cycle, 31.954 GFlops   1 ins/cycle, 7.989 GFlops
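The measured FMA throughput lines up with a simple peak formula: instructions per cycle, times SIMD lanes, times 2 floating-point operations per FMA, times clock frequency. The sketch below is an illustration under that assumption. The function names are made up for this example, and the frequencies and core counts (2.4 GHz and 32 cores for Hi1616, 2.0 GHz and 48 cores for Kunpeng 920) come from the Chips Information table.

```python
def fma_gflops_per_core(ins_per_cycle, simd_bits, element_bits, ghz):
    """Peak GFLOPS of one core; a scalar op is modelled as a single lane."""
    lanes = simd_bits // element_bits
    flops_per_fma = 2  # one multiply plus one add
    return ins_per_cycle * lanes * flops_per_fma * ghz


def chip_peak_gflops(cores, ins_per_cycle, simd_bits, element_bits, ghz):
    """Peak GFLOPS of a whole chip, assuming all cores issue FMAs every cycle."""
    return cores * fma_gflops_per_core(ins_per_cycle, simd_bits, element_bits, ghz)


# Hi1616, DP vector: 1 FMA per cycle on 128-bit registers at 2.4 GHz.
print(fma_gflops_per_core(1, 128, 64, 2.4))
print(chip_peak_gflops(32, 1, 128, 64, 2.4))

# Kunpeng 920: 2 SP vector FMAs per cycle but only 1 DP vector FMA, at 2.0 GHz.
print(chip_peak_gflops(48, 2, 128, 32, 2.0))
print(chip_peak_gflops(48, 1, 128, 64, 2.0))
```

With these inputs the formula gives the 307.2, 1,536 and 384 Gflops figures listed above, which is why the single DP vector FMA per cycle caps double-precision peak at a quarter of single-precision peak on Kunpeng 920.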

Memory Subsystem

Normalized Average Latency (ns)
Chip   L1      L2      L3      DRAM
6148   1x      1x      1x      1x
920    0.33x   0.71x   1.57x   1.25x

[Figure: Normalized Bandwidth of Different Memory Layers, 6148 vs. 920 (relative scale; L1, L2, L3 and DRAM, read and write). 6148 is 1 in every category; data labels: 3.6, 2.1, 2.0, 1.6, 1.6, 1.2, 1.2, 1.1.]

Chip Communication - Bandwidth

Platform           2680 to 2680   6148 to 6148   1616 to 1616      920 to 920
Technique          QPI            UPI            Hydra Interface   Hydra Interface
Bandwidth (GB/s)   35.2           40.8           10.0              12.7

Proxy Applications
▪ SNAP
  ▪ A proxy application for a modern deterministic discrete ordinates transport code
▪ TeaLeaf
  ▪ Proxy app for solving the linear heat conduction equation on a spatially decomposed regular grid, utilising a five point finite difference stencil
▪ CloverLeaf
  ▪ Solving Euler's equations of compressible fluid dynamics, under a Lagrangian-Eulerian scheme, on a two-dimensional spatial regular structured grid.

Proxy Applications - Results

[Figure: SNAP Grind Time (ns), 6148 vs. 920; single, 2-socket strong scaling, 2-socket weak scaling. Bar labels: 0.916, 0.91, 0.77, 0.766, 0.56, 0.462.]

Wall Clock (s)
Benchmark                Configuration             6148      920
TeaLeaf                  Single                    601.18    364.42
TeaLeaf                  2-Socket Strong Scaling   339.5     193.05
TeaLeaf                  2-Socket Weak Scaling     989.88    607.13
CloverLeaf-bm16          Single                    1342.61   1041.56
CloverLeaf-bm16          2-Socket Strong Scaling   755.5     571.67
CloverLeaf-bm128_short   Single                    182.58    208.78
CloverLeaf-bm128_short   2-Socket Strong Scaling   120.65    109.8

Proxy Applications - Results

Normalized Performance of Proxy Applications (Kunpeng 920 relative to 6148 = 1)
Benchmark                Single Socket   Dual Socket
SNAP                     1.005           1.636
TeaLeaf                  1.65            1.759
CloverLeaf_bm16          1.289           1.322
CloverLeaf_bm128_short   0.875           1.099
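For the wall-clock benchmarks, normalized performance is the reference time divided by the measured time, so values above 1 mean the Kunpeng 920 finished sooner. The sketch below is a minimal illustration. The single-socket times are taken from the Results slide, and their assignment to 6148 or 920 is an inference from the chart residue, not something the slide text states outright.

```python
# Single-socket wall-clock times in seconds: (6148, 920).
# The chip assignment is inferred from the chart labels (assumption).
SINGLE_SOCKET_SECONDS = {
    "TeaLeaf": (601.18, 364.42),
    "CloverLeaf_bm16": (1342.61, 1041.56),
    "CloverLeaf_bm128_short": (182.58, 208.78),
}


def normalized_performance(reference_seconds, measured_seconds):
    """Speed relative to the reference; lower wall clock means higher value."""
    return reference_seconds / measured_seconds


for benchmark, (t_6148, t_920) in SINGLE_SOCKET_SECONDS.items():
    ratio = normalized_performance(t_6148, t_920)
    print(f"{benchmark}: {ratio:.3f}")
```

SNAP is left out because it reports grind time rather than wall clock, and the extracted chart does not make clear which grind time belongs to which chip.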

Proxy Applications - SNAP
• Generally loads a relatively big data set (9600 cells, nang=64, ng=332, nstep=100) and performs random access in the data set (dim3_sweep.f90).
• If OpenMP is enabled, threads work across the data set.
• MPI_Recv becomes a hotspot after scaling across sockets.
[Figure: SNAP Grind Time (ns), 6148 vs. 920; single, 2-socket strong scaling, 2-socket weak scaling. Annotations: "Same single node performance", "26.9% Speedup". Bar labels: 0.916, 0.91, 0.77, 0.766, 0.56, 0.462.]

Proxy Applications - TeaLeaf
• Memory subsystem bandwidth.
• 3840 x 3840, 10000 steps.

TeaLeaf Relative Speedup
Configuration             6148     920
Single                    1.0x     1.0x
2-Socket Strong Scaling   1.77x    1.89x
2-Socket Weak Scaling     0.607x   0.600x
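The relative speedups follow from the wall-clock times on the Results slide: each 2-socket time is compared with the single-socket time on the same chip. The sketch below reproduces that arithmetic. The function names are made up for this example, and the assignment of times to chips is inferred from the chart labels, so treat it as an assumption.

```python
# TeaLeaf wall-clock times in seconds: (single, 2-socket strong, 2-socket weak).
# The chip assignment is inferred from the chart labels (assumption).
TEALEAF_SECONDS = {
    "6148": (601.18, 339.5, 989.88),
    "920": (364.42, 193.05, 607.13),
}


def relative_speedup(single_seconds, scaled_seconds):
    """Single-socket time divided by the 2-socket time."""
    return single_seconds / scaled_seconds


def parallel_efficiency(speedup, sockets=2):
    """Share of ideal strong-scaling speedup that was achieved."""
    return speedup / sockets


for chip, (single, strong, weak) in TEALEAF_SECONDS.items():
    strong_speedup = relative_speedup(single, strong)
    weak_ratio = relative_speedup(single, weak)
    print(
        f"{chip}: strong {strong_speedup:.2f}x "
        f"(efficiency {parallel_efficiency(strong_speedup):.0%}), weak {weak_ratio:.3f}x"
    )
```

In the weak-scaling run the ratio falls below 1 because the 2-socket run takes longer than the single-socket run.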

Proxy Applications - CloverLeaf
• Memory subsystem bandwidth.
• But double-precision arithmetic intensity increases as the number of cells increases and the total number of iterations decreases.

Wall Clock (s) and 2-Socket Strong Scaling Speedup
Benchmark                Chip   Single    2-Socket Strong Scaling   Speedup
CloverLeaf-bm16          6148   1342.61   755.5                     1.77x
CloverLeaf-bm16          920    1041.56   571.67                    1.82x
CloverLeaf-bm128_short   6148   182.58    120.65                    1.51x
CloverLeaf-bm128_short   920    208.78    109.8                     1.90x

GTC-P
▪ GTC-P: Gyrokinetic Toroidal Code - Princeton
▪ GTC-P is a Particle-in-Cell code that delivers fusion simulations at extreme scales on supercomputers worldwide, including Tianhe-2, Titan and TaihuLight, that feature CPU, GPU and many-core processors.
Supported by NSF SAVI Project

GTC-P Kunpeng 920

GTC-P
[Figure: GTC-P Performance With Different Combinations of Processes and Threads on Kunpeng 920]
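The slide does not list which process and thread combinations were tested. As a purely hypothetical illustration, the sketch below enumerates the MPI process and OpenMP thread counts that exactly fill one 48-core Kunpeng 920 socket without oversubscription. The function name and the "fill every core" rule are assumptions made for this example.

```python
def process_thread_combinations(total_cores):
    """All (processes, threads_per_process) pairs that use every core once."""
    return [
        (processes, total_cores // processes)
        for processes in range(1, total_cores + 1)
        if total_cores % processes == 0
    ]


# 48 cores per Kunpeng 920 socket, from the Chips Information table.
for processes, threads in process_thread_combinations(48):
    print(f"{processes:2d} processes x {threads:2d} threads")
```

A sweep like this is a common starting point for hybrid MPI and OpenMP codes, since the best split depends on how the application balances communication against shared-memory work.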

GTC-P Kunpeng 920

Conclusion
▪ Kunpeng 920 handles scientific computations with relatively low arithmetic intensity (< 4 DP F/B) better than Intel's recent chip at a similar price.
▪ Pro
  ▪ Good topology design for threading.
  ▪ High bandwidth and low latency; does well in many memory-bound apps.
▪ Con
  ▪ Low bandwidth of Hydra Interface.
  ▪ Low DP arithmetic capability.
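One way to read the arithmetic-intensity threshold is through a simple roofline model, where attainable performance is the smaller of the compute peak and arithmetic intensity times memory bandwidth. The sketch below is not from the slides. It assumes the theoretical DP peaks from the Chips Information table and a theoretical DRAM bandwidth of channels x transfer rate x 8 bytes, so measured results will differ.

```python
# (theoretical DP peak in GFLOPS, theoretical DRAM bandwidth in GB/s)
# Bandwidth assumes 8 bytes per transfer on each DDR4 channel (assumption).
CHIPS = {
    "Xeon Gold 6148": (1536.0, 6 * 2666 * 8 / 1000.0),
    "Kunpeng 920": (768.0, 8 * 3200 * 8 / 1000.0),
}


def attainable_gflops(flops_per_byte, peak_gflops, bandwidth_gbs):
    """Roofline bound: limited by memory traffic or by compute peak."""
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


for name, (peak, bandwidth) in CHIPS.items():
    print(f"{name}: ridge point {ridge_point(peak, bandwidth):.2f} FLOP/byte")
    for intensity in (0.5, 1.0, 4.0, 16.0):
        bound = attainable_gflops(intensity, peak, bandwidth)
        print(f"  intensity {intensity:>4}: {bound:.1f} GFLOPS")
```

Under these assumptions the Kunpeng 920 has the higher bound while kernels stay memory-bound, and the Xeon pulls ahead once arithmetic intensity is high enough for its larger DP peak to matter.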
