NUMA-aware Matrix-Matrix Multiplication — Max Reimann, Philipp Otto
About this talk
• Objective: show how to improve the performance of algorithms on a NUMA system, using MMM as an example
• Code was written in C with numa.h and pthread.h
• Tested on FSOC lab machines:
  – ubuntu-0101: 2 nodes, 24 cores
  – dl980: 8 nodes, 128 cores
• Compiled with gcc -O3
Naïve Matrix-Matrix Multiplication
• We will examine MMM for large n × n matrices
• Runtime complexity: O(n^3)
[Figure: schoolbook matrix multiplication, http://www.mathematrix.de/wp-content/uploads/matrixmul2.png]
Naïve MMM implementation
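The slide shows the implementation as a screenshot; a minimal sketch of the naïve triple loop (our own illustration, with row-major flat arrays — not the original slide code) looks like this:

```c
#include <stdlib.h>

// Naive O(n^3) multiplication: C = A * B for n x n matrices stored
// row-major in flat arrays.
void mmm_naive(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)       // dot product of row i and column j
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```

Note the access pattern: A is read row-wise (cache-friendly), B column-wise (strided) — this is what the later transpose and tiling optimizations attack.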
Performance of Naïve vs. MKL
[Chart: runtime in seconds (log scale) for n = 512, 1024, 2048 on one dl980 core. Naïve grows from ~1.02 s over ~11.79 s to ~98.14 s; MKL stays far below, roughly 0.02 s up to ~0.38 s at n = 2048.]
Intel Math Kernel Library (MKL)
• BLAS: Basic Linear Algebra Subprograms – the standard interface for linear algebra routines
• MKL:
  – Implements BLAS for Intel hardware
  – Vectorized and threaded for highest performance
Analysis of Naïve MMM
• Test setup:
  – ubuntu-numa machine
  – No thread or memory pinning
  – Observed with numatop / PCM
• The performance tools show:
  – Unused cores (obvious)
  – The QPI links cannot be fully loaded by a single thread
Parallelization I
• How can the work be divided?
  1. Partition the computation of matrix C by rows or columns
• Problem: all threads need all of matrix A and matrix B
• Solution:
  – Accept the overhead of remote memory accesses, or
  – Copy the input/output matrices to the other nodes (preprocessing)
[Diagram: C = A · B, with C partitioned into row stripes]
Parallelization – Partition by rows
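A sketch of the row-partition scheme with pthreads (our own illustration of the approach described above, not the slide's original code): each thread computes a contiguous stripe of C's rows and reads all of A and B.

```c
#include <pthread.h>

// Arguments for one worker: its row range [row_start, row_end) of C.
typedef struct {
    const double *A, *B;
    double *C;
    int n, row_start, row_end;
} stripe_arg;

static void *stripe_worker(void *p)
{
    stripe_arg *a = p;
    for (int i = a->row_start; i < a->row_end; i++)
        for (int j = 0; j < a->n; j++) {
            double sum = 0.0;
            for (int k = 0; k < a->n; k++)
                sum += a->A[i * a->n + k] * a->B[k * a->n + j];
            a->C[i * a->n + j] = sum;
        }
    return NULL;
}

// Spawn nthreads workers, each owning a stripe of C's rows.
void mmm_rows_parallel(const double *A, const double *B, double *C,
                       int n, int nthreads)
{
    pthread_t tid[nthreads];
    stripe_arg arg[nthreads];
    int per = (n + nthreads - 1) / nthreads;   // rows per thread, rounded up
    for (int t = 0; t < nthreads; t++) {
        int lo = t * per, hi = lo + per < n ? lo + per : n;
        arg[t] = (stripe_arg){A, B, C, n, lo, hi};
        pthread_create(&tid[t], NULL, stripe_worker, &arg[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```

No locking is needed because the stripes of C are disjoint; the cost is that every thread touches all of A and B, which is exactly the remote-access problem the NUMA optimizations later address.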
Parallelization – Partition by rows
[Chart: runtime in seconds (log scale) for n = 512, 1024, 2048 on 128 dl980 cores. Naïve sequential: up to ~98.14 s at n = 2048; naïve parallel: roughly 0.26–2.54 s; MKL parallel: roughly 0.05–0.28 s.]
Parallelization II
• How can the work be divided?
  2. Partition the computation of matrix C by summands
• Benefit:
  – Computing the i-th summand needs only the i-th row of matrix A / column of matrix B
  – So only the parts that are actually needed have to be copied to the other nodes
• Disadvantage:
  – Matrix B has to be transposed so that the memory can be partitioned contiguously (preprocessing)
  – Locking or merging of matrix C is needed
Parallelization II
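One plausible reading of the "parallel sum" scheme, sketched without the transpose optimization (helper names are ours, not the slide's): the summation index k is split into ranges, each worker computes a private partial product over its range, and the partials are merged into C afterwards — which is why no locking is needed as long as each partial has its own buffer.

```c
#include <string.h>

// Partial product over the summand range [k_start, k_end):
// Cpart += A[:, k] * B[k, :] for each k in the range, so this worker
// only reads those columns of A and rows of B.
void partial_sum(const double *A, const double *B, double *Cpart,
                 int n, int k_start, int k_end)
{
    memset(Cpart, 0, (size_t)n * n * sizeof *Cpart);
    for (int k = k_start; k < k_end; k++)
        for (int i = 0; i < n; i++) {
            double aik = A[i * n + k];
            for (int j = 0; j < n; j++)
                Cpart[i * n + j] += aik * B[k * n + j];
        }
}

// Merge step: accumulate one worker's partial into the shared C.
void merge(double *C, const double *Cpart, int n)
{
    for (int i = 0; i < n * n; i++)
        C[i] += Cpart[i];
}
```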
Performance of the "Parallel Sum" Method
[Chart: runtime in seconds (log scale) for n = 512 … 8192 on 128 dl980 cores. At n = 8192: parallel sum ~218.84 s, naïve parallel ~186.39 s, MKL parallel ~14.91 s; at n = 512 all three lie between roughly 0.13 s and 0.27 s.]
Strassen
• Runtime complexity of the naïve algorithm: O(n^3)
• Can we do better?
  – Strassen's algorithm, published in 1969, was the first to improve the asymptotic complexity
  – It uses only 7 multiplications instead of 8 per recursion step
  – Runtime: O(n^(log2 7)) ≈ O(n^2.8)
• Algorithms today reach O(n^2.35), but are not practical
Matrix definition
For matrices A, B, C with dimension n = 4l, l ∈ ℕ, each can be viewed as a 2×2 block matrix:

A = [A11 A12; A21 A22],  B = [B11 B12; B21 B22],  C = [C11 C12; C21 C22]

The conventional algorithm uses 8 (expensive) multiplications:
C11 = A11 · B11 + A12 · B21
C12 = A11 · B12 + A12 · B22
C21 = A21 · B11 + A22 · B21
C22 = A21 · B12 + A22 · B22
Strassen's algorithm
Define temporary matrices — only 7 multiplications!
M1 := (A11 + A22) · (B11 + B22)
M2 := (A21 + A22) · B11
M3 := A11 · (B12 − B22)
M4 := A22 · (B21 − B11)
M5 := (A11 + A12) · B22
M6 := (A21 − A11) · (B11 + B12)
M7 := (A12 − A22) · (B21 + B22)

Compose the final matrix:
C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6
Strassen – Example
Substituting the Mj by their terms gives back the original formula:
C12 = M3 + M5
    = A11 · (B12 − B22) + (A11 + A12) · B22
    = A11 B12 − A11 B22 + A11 B22 + A12 B22
    = A11 B12 + A12 B22
which is exactly the (1,2) block of C = A · B.
Strassen – Analysis
• Cost per recursion step: 7 multiplications and 18 additions
  – versus 8 multiplications and 4 additions for the naïve scheme
• Commonly considered practical only for large matrices (n > 1000)
  – although our results indicate otherwise (later)
• Define a cutoff point for the recursion:
  – if n is sufficiently small, fall back to naïve multiplication
Strassen – Implementation
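The implementation slide is a code screenshot; the sketch below (our own reconstruction, not the slide's code) shows one way to implement the recursion with submatrix views sharing the parent's storage via a row stride, and the BREAK cutoff falling back to the naïve kernel. It assumes n is a power of two.

```c
#include <stdlib.h>

#define BREAK 64   // cutoff: at or below this size, use the naive kernel

// Z = X + sign*Y for h x h blocks with row strides ldx/ldy/ldz.
static void block_add(const double *X, const double *Y, double *Z,
                      int h, int ldx, int ldy, int ldz, double sign)
{
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            Z[i*ldz + j] = X[i*ldx + j] + sign * Y[i*ldy + j];
}

static void naive(const double *A, const double *B, double *C,
                  int n, int lda, int ldb, int ldc)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*lda + k] * B[k*ldb + j];
            C[i*ldc + j] = s;
        }
}

void strassen(const double *A, const double *B, double *C,
              int n, int lda, int ldb, int ldc)
{
    if (n <= BREAK) { naive(A, B, C, n, lda, ldb, ldc); return; }
    int h = n / 2;
    const double *A11 = A, *A12 = A + h, *A21 = A + h*lda, *A22 = A + h*lda + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h*ldb, *B22 = B + h*ldb + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h*ldc, *C22 = C + h*ldc + h;

    double *M  = malloc((size_t)7 * h * h * sizeof *M);   // M1..M7
    double *T1 = malloc((size_t)h * h * sizeof *T1);
    double *T2 = malloc((size_t)h * h * sizeof *T2);
    double *Mi[7];
    for (int i = 0; i < 7; i++) Mi[i] = M + (size_t)i * h * h;

    block_add(A11, A22, T1, h, lda, lda, h, +1);
    block_add(B11, B22, T2, h, ldb, ldb, h, +1);
    strassen(T1, T2, Mi[0], h, h, h, h);                  // M1
    block_add(A21, A22, T1, h, lda, lda, h, +1);
    strassen(T1, B11, Mi[1], h, h, ldb, h);               // M2
    block_add(B12, B22, T2, h, ldb, ldb, h, -1);
    strassen(A11, T2, Mi[2], h, lda, h, h);               // M3
    block_add(B21, B11, T2, h, ldb, ldb, h, -1);
    strassen(A22, T2, Mi[3], h, lda, h, h);               // M4
    block_add(A11, A12, T1, h, lda, lda, h, +1);
    strassen(T1, B22, Mi[4], h, h, ldb, h);               // M5
    block_add(A21, A11, T1, h, lda, lda, h, -1);
    block_add(B11, B12, T2, h, ldb, ldb, h, +1);
    strassen(T1, T2, Mi[5], h, h, h, h);                  // M6
    block_add(A12, A22, T1, h, lda, lda, h, -1);
    block_add(B21, B22, T2, h, ldb, ldb, h, +1);
    strassen(T1, T2, Mi[6], h, h, h, h);                  // M7

    for (int i = 0; i < h; i++)                           // compose C
        for (int j = 0; j < h; j++) {
            int k = i*h + j;
            C11[i*ldc + j] = Mi[0][k] + Mi[3][k] - Mi[4][k] + Mi[6][k];
            C12[i*ldc + j] = Mi[2][k] + Mi[4][k];
            C21[i*ldc + j] = Mi[1][k] + Mi[3][k];
            C22[i*ldc + j] = Mi[0][k] - Mi[1][k] + Mi[2][k] + Mi[5][k];
        }
    free(M); free(T1); free(T2);
}
```

Each recursion level allocates 7 temporary M blocks plus two scratch blocks — the exponentially rising memory requirement mentioned later.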
Execution Time: Single-threaded
[Chart: runtime in seconds (log scale) for n = 32 … 2048 on one dl980 core; Strassen with recursion cutoff BREAK = 64. At n = 2048: naïve ~98.14 s, Strassen ~6.12 s, MKL ~0.38 s; at n = 1024: naïve ~11.79 s, Strassen ~0.87 s.]
Parallelization of Strassen I
• Data dependencies:
  – The additions inside an Mj have to be done before its multiplication
    • e.g. M1 = (A11 + A22) · (B11 + B22)
  – All Mj have to be calculated before composing C
    • e.g. C12 = M3 + M5
• Easiest solution:
  – Calculate the Mj in parallel
  – Then calculate the Cj,k in parallel
Parallelization of Strassen II
• Recursion level 1 can be scheduled onto 7 threads
• Level n can be scheduled onto 7^n threads
  – but most systems have a power-of-two number of processors
• We used manual parallelization
  – 49 distinct functions for the Ms and 16 for the Cs
  – Code bloat, and not scalable, BUT:
• Automatic parallelization is hard
  – Thread load becomes very unbalanced
  – Every level needs 7 temporary matrices
    • exponentially rising memory requirements
Execution Time – 49 Threads
[Chart: runtime in seconds (log scale) for n = 512 … 8192 on 49 dl980 cores. At n = 8192: naïve ~228.57 s, Strassen ~27.61 s, MKL ~13.53 s; at smaller n the three lie between roughly 0.03 s and 2.54 s.]
NUMA Optimizations
• Keep as much memory as possible local, to avoid remote memory accesses
  – remote access is slower by a factor of ~1.4
• Partition data and work depending on the number of nodes and cores
• Pin threads to the nodes holding the memory they need
• (Consider the topology for other algorithms)
Distributing memory and threads
[Chart: parallel naïve MMM on ubuntu-numa0101 with 24 cores, n = 1024, 2048, 4096, comparing four variants: neither distributed, distributed threads, distributed memory, and distributed memory and threads. At n = 4096 the variants range from ~101.12 s down to ~204.85 s apart, with distributing both memory and threads fastest; at n = 1024 all variants take roughly 0.34–0.37 s.]
DEMO
Application of NUMA Optimizations
• Copy all data to every node:
  – preprocessing takes 11.11 s for an 8192×8192 matrix on 8 nodes
• Partition the data and move each part to its node:
  – preprocessing takes 1.03 s for an 8192×8192 matrix on 8 nodes
• Pin threads to nodes:
  – int numa_run_on_node(int node);
Parallelization – Partition by rows: copying memory to the different nodes
Strassen Memory Distribution Effects
[Chart: n = 16384 on 128 dl980 cores; stacked runtime (distributed memory copy, multiplication, result combination) for different node counts. Totals lie between roughly 14.67 s and 22.15 s.]
Other optimization techniques
• Tiling
• Vectorization
• Scalar replacement
• Precomputation of constants
• (Unrolling)
Tiling
• Divide the computational work into tiles to leverage the cache
• The tile size depends on the cache size
• gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE)
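A minimal sketch of a tiled kernel (our own illustration): the loops iterate over TILE × TILE blocks so the working set of the inner loops stays cache-resident, with the tile edge derived from the CLS macro the gcc command above defines. It assumes n is a multiple of TILE.

```c
#include <string.h>

#ifndef CLS
#define CLS 64                              /* fallback cache line size */
#endif
#define TILE (CLS / (int)sizeof(double))    /* doubles per tile edge */

// Tiled multiplication: C = A * B, accumulating block by block.
void mmm_tiled(const double *A, const double *B, double *C, int n)
{
    memset(C, 0, (size_t)n * n * sizeof *C);
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```

The i-k-j inner ordering also makes the innermost accesses to B and C contiguous, which combines well with the transpose and vectorization discussed next.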
Performance of Tiling
perf stat -e L1-dcache-misses,LLC-misses,DTLB-misses bin/matrixmult -n 2048
[Chart: cache misses (log scale) and runtime for the four variants (not tiled / tiled × not transposed / transposed) on 128 dl980 cores. L1 dcache misses drop from ~8.59e9 to ~1.07e9; runtime drops from ~97 s (not tiled, not transposed) to ~12–13 s (tiled and/or transposed).]
Vectorization
• SIMD: Single Instruction, Multiple Data
• All recent Intel and AMD processors have Streaming SIMD Extensions (SSE)
• One instruction is applied to multiple floats simultaneously
• Can only operate efficiently on aligned data (16-byte aligned)
• SSE operates on 128-bit registers
  – Newer Intel processors have Advanced Vector Extensions (AVX) with 256 bits
  – The dl980 machine only supports 128-bit operations
Auto-Vectorization
• Can this be done automatically?
  – gcc -O3 tries to auto-vectorize
    • only possible for simple statements
Assembler
Aligned Malloc
Example:
• numa_alloc returns addr 0x1232, which is not 16-byte aligned
• We add 15, so addr = 0x1241 = 0b1001001000001
• Now we clear the last 4 bits by ANDing with ~0x0f (i.e. …fff0)
• The result, 0x1240, is 16-byte aligned
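The rounding trick from the slide, packaged as a pair of helper functions (our own sketch): over-allocate, round the address up with the add-15-and-mask step, and stash the original pointer just below the aligned block so it can still be freed.

```c
#include <stdint.h>
#include <stdlib.h>

// Return a 16-byte-aligned block of `size` bytes.
void *malloc_aligned16(size_t size)
{
    void *raw = malloc(size + 15 + sizeof(void *));
    if (!raw) return NULL;
    uintptr_t addr = (uintptr_t)raw + sizeof(void *);
    addr = (addr + 15) & ~(uintptr_t)0x0f;   // add 15, clear the last 4 bits
    ((void **)addr)[-1] = raw;               // remember the original pointer
    return (void *)addr;
}

void free_aligned16(void *p)
{
    if (p) free(((void **)p)[-1]);           // free the original allocation
}
```

The extra sizeof(void *) bytes guarantee there is always room for the stashed pointer below the aligned address.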
Intrinsics Example — Source: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Use Parallelism for MMM
• We try to construct a 4×4 matrix multiplication
• How to process rows?
  – Rows lie in continuous memory and can be loaded in one instruction
  – Columns cannot be loaded in one instruction
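Because columns cannot be loaded in one instruction, the usual SSE trick is to avoid column loads entirely: broadcast one scalar of A and accumulate scaled rows of B. A sketch with SSE intrinsics (our own illustration; unaligned loads are used so the example works on any float arrays — with 16-byte-aligned data, _mm_load_ps/_mm_store_ps would be used instead):

```c
#include <xmmintrin.h>   /* SSE intrinsics */

// C = A * B for 4x4 single-precision matrices, row-major.
// Row i of C is computed as sum_k A[i][k] * (row k of B): each row of B
// loads in one 128-bit instruction, and A[i][k] is broadcast.
void mmm4x4_sse(const float A[16], const float B[16], float C[16])
{
    for (int i = 0; i < 4; i++) {
        __m128 acc = _mm_setzero_ps();
        for (int k = 0; k < 4; k++) {
            __m128 a = _mm_set1_ps(A[i * 4 + k]);   // broadcast A[i][k]
            __m128 b = _mm_loadu_ps(&B[k * 4]);     // whole row of B at once
            acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
        }
        _mm_storeu_ps(&C[i * 4], acc);              // row i of C
    }
}
```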