Scalable Dense Matrix Multiplication on Multi-Socket Many-Core Systems with Fast Shared Memory
Ricardo Magana, Natalia Vassilieva
Acknowledgments
Ricardo Magaña (magania@gmail.com). Many thanks also to Prof. Robert van de Geijn, Field Van Zee and Tyler Smith!
Outline
– Motivation and The Machine pitch
– NUMA-aware extension of BLIS for multi-socket systems
– Experimental results
The Machine
Processor-centric computing vs. Memory-Driven Computing
[Diagram: today each processor (CPU, GPU, ASIC, RISC-V, quantum, …) has its own memory; with Memory-Driven Computing, heterogeneous compute attaches to a shared pool of memory]
The Machine in context
[Diagrams comparing three designs: shared nothing — physical servers, each SoC with local DRAM and local NVM, connected by an interconnect network; shared everything — SoCs with local DRAM and NVM on a coherent network; shared something (The Machine) — SoCs with local DRAM connected over a communications and memory fabric to a shared NVM memory pool]
Our goal: an efficient linear algebra library for The Machine
– Fast GEMM is crucial for fast machine learning (deep learning in particular)
– BLAS is essential for many problems in scientific computing, pattern recognition and optimization
– The compute-to-bandwidth ratio of The Machine enables efficient scaling of GEMM for matrices of moderate sizes, up to ~10^8 elements (see the estimate below)
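A rough back-of-the-envelope estimate (my numbers, not from the talk) of why moderate sizes already suffice: the arithmetic intensity of SGEMM grows linearly with the matrix dimension, so at these sizes GEMM is compute-bound rather than fabric-bandwidth-bound.

```latex
% Arithmetic intensity of an n-by-n SGEMM (4-byte elements),
% ignoring cache reuse of packed blocks:
\[
  I(n) \;=\; \frac{\text{flops}}{\text{bytes moved}}
       \;\gtrsim\; \frac{2n^{3}}{3 \cdot 4\,n^{2}}
       \;=\; \frac{n}{6}\ \text{flops/byte}
\]
% For n = 10^4 (10^8 elements per matrix) this is roughly 1.7e3 flops/byte,
% far above the compute/bandwidth balance of the systems discussed here.
```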
Linear algebra on The Machine: aspiration
What do we need to be true:
– High-performing single-node multi-core GEMM for small matrices (the typical matrix sizes for deep learning)
– Scalable multi-node GEMM
Existing BLAS libraries
Proprietary: Intel MKL, AMD ACML, IBM ESSL and PESSL, NVIDIA cuBLAS and NVBLAS
Open Source: ATLAS, OpenBLAS, BLIS, Armadillo, Eigen, ScaLAPACK, PLAPACK, PLASMA, DPLASMA, Elemental
Existing BLAS libraries
[Proprietary / open-source list repeated from the previous slide]
Single-node:
• Access shared coherent memory
• Threads don't share data, only synchronization messages
Multi-node:
• Distributed memory
• Different processes transfer data and synchronization messages
Multi-socket with shared memory:
• In The Machine we have different processes that can access shared memory
Existing BLAS libraries
[Proprietary / open-source list repeated from the previous slide]
Why we chose BLIS:
– Open Source
– Different ways of parallelization
– Easier to optimize for a new CPU
Multi-socket systems today: NUMA (the ones we used)
DL580:
– 4 sockets
– 15 Ivy Bridge/Haswell cores per socket (60 cores total)
– Theoretical peak: ~2.6/5.2 TFLOPS
Superdome X:
– 16 sockets
– 18 Haswell cores per socket (288 cores total)
– Theoretical peak: ~20 TFLOPS
[Diagrams: DL580 — four CPU+memory NUMA nodes connected by QPI at 32 GB/s; Superdome X — NUMA nodes connected through a crossbar fabric]
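Panel placement matters on these NUMA systems. A minimal sketch (not from the talk), assuming Linux with libnuma installed, of placing one panel's buffer on a chosen NUMA node; the panel size and target node are illustrative:

```c
// numa_panel.c: allocate a matrix panel on a chosen NUMA node.
// Build: gcc numa_panel.c -o numa_panel -lnuma
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    const size_t rows = 4096, cols = 4096;          /* illustrative panel size  */
    const size_t bytes = rows * cols * sizeof(float);
    int node = 1;                                   /* illustrative target node */
    float *panel = numa_alloc_onnode(bytes, node);  /* bind pages to that node  */
    if (!panel) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(panel, 0, bytes);                        /* touch pages so they are placed */
    printf("%zu MB panel bound to NUMA node %d\n", bytes >> 20, node);
    numa_free(panel, bytes);
    return 0;
}
```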
NUMA-aware extension of BLIS (1): Cannon-like
• Matrix A is composed of horizontal panels; matrix B is composed of vertical panels
• Panels are distributed in SoC memory: each SoC owns one panel of A and one of B
• GEMM is distributed: each SoC computes 3 blocks, and each block is obtained as a panel-times-panel product
• At every step, one read from one remote SoC
• The resulting matrix has the same format as A
[Diagram: A, B and C distributed across Node 1/2/3; SoC 1, SoC 2 and SoC 3 each compute their blocks]
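A minimal single-process sketch (not BLIS code) of the schedule implied by the bullets above, assuming a Cannon-style rotation in which SoC s computes block C(s, (s+t) mod P) at step t; P, the panel sizes and panel_gemm are all illustrative:

```c
// cannon_like.c: schedule sketch for the panel-distributed GEMM described above.
// Each "SoC" s owns row panel A[s] (MB x K) and column panel B[s] (K x NB);
// block C(s, j) = A[s] * B[j], and C ends up distributed in row panels like A.
#include <stdio.h>

#define P  3    /* number of SoCs (illustrative)          */
#define MB 4    /* rows per A panel / C block (tiny demo) */
#define NB 4    /* columns per B panel / C block          */
#define K  8    /* shared inner dimension                 */

/* naive kernel standing in for the single-node BLIS gemm call */
static void panel_gemm(const double *A, const double *B, double *C) {
    for (int i = 0; i < MB; i++)
        for (int j = 0; j < NB; j++) {
            double acc = 0.0;
            for (int p = 0; p < K; p++)
                acc += A[i * K + p] * B[p * NB + j];
            C[i * NB + j] = acc;
        }
}

int main(void) {
    static double A[P][MB * K], B[P][K * NB], C[P][P][MB * NB];
    for (int s = 0; s < P; s++) {                   /* fill the owned panels */
        for (int i = 0; i < MB * K; i++) A[s][i] = s + 1;
        for (int i = 0; i < K * NB; i++) B[s][i] = 1.0;
    }
    for (int t = 0; t < P; t++)                     /* P steps */
        for (int s = 0; s < P; s++) {               /* all SoCs run concurrently in reality */
            int src = (s + t) % P;                  /* peer whose B panel is read */
            panel_gemm(A[s], B[src], C[s][src]);
            printf("step %d: SoC %d computes C(%d,%d), reading B from SoC %d\n",
                   t, s, s, src, src);
        }
    return 0;
}
```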
NUMA-aware extension of BLIS (2): Blocks
• A and B have the same format
• As before, every SoC reads from only one other SoC
• Unlike the previous scheme, the SoC being read from is switched after each block
[Diagram: A, B and C distributed in blocks across Node 1/2/3; SoC 1, SoC 2 and SoC 3 each compute their blocks]
Other tricks
– Support for different memory pools (for different panels): the entry point (bli_gemm) receives an array of obj_t that represent the panels of the matrix
– MCS barrier instead of a linear barrier
– Support for multiple thread entry points, so that a new set of threads is not spawned at every iteration (i.e. in every bli_gemm call)
– Thread affinity: we pre-launch the threads, pin them to particular CPU cores using a #pragma omp region (outside of BLIS), and then use the multiple thread entry points (see the sketch below)
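A minimal sketch of the pre-launch-and-pin step from the last bullet, assuming Linux, GCC with OpenMP and the GNU affinity API; this is not the talk's code, and the 1:1 thread-to-core mapping is only illustrative:

```c
// pin_threads.c: pre-launch an OpenMP team and pin each thread to one core.
// Build: gcc -fopenmp pin_threads.c -o pin_threads
#define _GNU_SOURCE
#include <omp.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(tid, &set);             /* illustrative 1:1 thread-to-core mapping */
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            fprintf(stderr, "failed to pin thread %d\n", tid);
        /* The pinned, long-lived threads would then enter the library through
           its multiple thread entry points instead of being respawned per call. */
        #pragma omp critical
        printf("thread %d pinned to core %d (now on core %d)\n",
               tid, tid, sched_getcpu());
    }
    return 0;
}
```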
SGEMM performance on Superdome X, comparison with a GPU system (2 NVIDIA Tesla K80)
[Chart: distributed SGEMM performance (GFLOPS, up to ~16,000) vs. matrix dimension (M=N=K, up to ~70,000) for Intel ScaLAPACK, PLASMA+OpenBLAS, NUMA-BLIS v1, Custom+BLIS, cuBLAS (1 GPU no copy), cuBLAS (2 GPUs) and cuBLAS (4 GPUs)]
SGEMM performance on Superdome X
[Chart: distributed SGEMM performance (GFLOPS, up to ~16,000) vs. matrix dimension (M=N=K, up to ~18,000) for nvBLAS (1 GPU, 1 GPU no copy, 2 GPUs, 4 GPUs), NUMA-BLIS v1 and Custom+BLIS]
Improved usability and performance for small matrices (v2)
[Chart: distributed SGEMM on Superdome X, NUMA-BLIS v1 vs. NUMA-BLIS v2]
Conclusion
– Done (almost): extended BLIS (GEMM so far…) for multi-socket systems with shared memory
  – Matrix data is accessed directly
  – Synchronization via barriers
  – NUMA-aware
– In progress: extended BLIS for The Machine
  – Matrix data is accessed directly
  – Matrix data is in NVM
  – Synchronization via MPI/RVMA
Thank you! nvassilieva@hpe.com