NUMA-aware Matrix-Matrix Multiplication — Max Reimann, Philipp Otto
About this talk
• Objective: show how to improve the performance of algorithms on a NUMA system, using MMM as an example
• Code was written in C with numa.h and pthread.h
• Tested on FSOC lab machines:
  – ubuntu-0101: 2 nodes, 24 cores
  – dl980: 8 nodes, 128 cores
• Compiled with gcc -O3
Naïve Matrix-Matrix Multiplication
• We will examine MMM for large n × n matrices
• Runtime complexity: O(n^3)
[Figure: schoolbook matrix multiplication, http://www.mathematrix.de/wp-content/uploads/matrixmul2.png]
Naïve MMM implementation
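The slide shows the implementation as a screenshot; a minimal sketch of the naïve triple loop (our own illustration, with row-major flat arrays — not the original slide code) looks like this:

```c
#include <stdlib.h>

// Naive O(n^3) multiplication: C = A * B for n x n matrices stored
// row-major in flat arrays.
void mmm_naive(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)       // dot product of row i and column j
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```

Note the access pattern: A is read row-wise (cache-friendly), B column-wise (strided) — this is what the later transpose and tiling optimizations attack.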
Performance of Naïve vs. MKL
[Chart: runtime in seconds (log scale) for n = 512, 1024, 2048 on one dl980 core. Naïve grows from ~1.02 s over ~11.79 s to ~98.14 s; MKL stays far below, roughly 0.02 s up to ~0.38 s at n = 2048.]
Intel Math Kernel Library (MKL)
• BLAS: Basic Linear Algebra Subprograms – the standard interface for linear algebra routines
• MKL:
  – Implements BLAS for Intel hardware
  – Vectorized and threaded for highest performance
Analysis of Naïve MMM
• Test setup:
  – ubuntu-numa machine
  – No thread or memory pinning
  – Observed with numatop / PCM
• The performance tools show:
  – Unused cores (obvious)
  – The QPI links cannot be fully loaded by a single thread
Parallelization I
• How can the work be divided?
  1. Partition the computation of matrix C by rows or columns
• Problem: all threads need all of matrix A and matrix B
• Solution:
  – Accept the overhead of remote memory accesses, or
  – Copy the input/output matrices to the other nodes (preprocessing)
[Diagram: C = A · B, with C partitioned into row stripes]
Parallelization – Partition by rows
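A sketch of the row-partition scheme with pthreads (our own illustration of the approach described above, not the slide's original code): each thread computes a contiguous stripe of C's rows and reads all of A and B.

```c
#include <pthread.h>

// Arguments for one worker: its row range [row_start, row_end) of C.
typedef struct {
    const double *A, *B;
    double *C;
    int n, row_start, row_end;
} stripe_arg;

static void *stripe_worker(void *p)
{
    stripe_arg *a = p;
    for (int i = a->row_start; i < a->row_end; i++)
        for (int j = 0; j < a->n; j++) {
            double sum = 0.0;
            for (int k = 0; k < a->n; k++)
                sum += a->A[i * a->n + k] * a->B[k * a->n + j];
            a->C[i * a->n + j] = sum;
        }
    return NULL;
}

// Spawn nthreads workers, each owning a stripe of C's rows.
void mmm_rows_parallel(const double *A, const double *B, double *C,
                       int n, int nthreads)
{
    pthread_t tid[nthreads];
    stripe_arg arg[nthreads];
    int per = (n + nthreads - 1) / nthreads;   // rows per thread, rounded up
    for (int t = 0; t < nthreads; t++) {
        int lo = t * per, hi = lo + per < n ? lo + per : n;
        arg[t] = (stripe_arg){A, B, C, n, lo, hi};
        pthread_create(&tid[t], NULL, stripe_worker, &arg[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```

No locking is needed because the stripes of C are disjoint; the cost is that every thread touches all of A and B, which is exactly the remote-access problem the NUMA optimizations later address.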
Parallelization – Partition by rows
[Chart: runtime in seconds (log scale) for n = 512, 1024, 2048 on 128 dl980 cores. Naïve sequential: up to ~98.14 s at n = 2048; naïve parallel: roughly 0.26–2.54 s; MKL parallel: roughly 0.05–0.28 s.]
Parallelization II
• How can the work be divided?
  2. Partition the computation of matrix C by summands
• Benefit:
  – Computing the i-th summand needs only the i-th row of matrix A / column of matrix B
  – So only the parts that are actually needed have to be copied to the other nodes
• Disadvantage:
  – Matrix B has to be transposed so that the memory can be partitioned contiguously (preprocessing)
  – Locking or merging of matrix C is needed
Parallelization II
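One plausible reading of the "parallel sum" scheme, sketched without the transpose optimization (helper names are ours, not the slide's): the summation index k is split into ranges, each worker computes a private partial product over its range, and the partials are merged into C afterwards — which is why no locking is needed as long as each partial has its own buffer.

```c
#include <string.h>

// Partial product over the summand range [k_start, k_end):
// Cpart += A[:, k] * B[k, :] for each k in the range, so this worker
// only reads those columns of A and rows of B.
void partial_sum(const double *A, const double *B, double *Cpart,
                 int n, int k_start, int k_end)
{
    memset(Cpart, 0, (size_t)n * n * sizeof *Cpart);
    for (int k = k_start; k < k_end; k++)
        for (int i = 0; i < n; i++) {
            double aik = A[i * n + k];
            for (int j = 0; j < n; j++)
                Cpart[i * n + j] += aik * B[k * n + j];
        }
}

// Merge step: accumulate one worker's partial into the shared C.
void merge(double *C, const double *Cpart, int n)
{
    for (int i = 0; i < n * n; i++)
        C[i] += Cpart[i];
}
```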
Performance of the "Parallel Sum" Method
[Chart: runtime in seconds (log scale) for n = 512 … 8192 on 128 dl980 cores. At n = 8192: parallel sum ~218.84 s, naïve parallel ~186.39 s, MKL parallel ~14.91 s; at n = 512 all three lie between roughly 0.13 s and 0.27 s.]
Strassen
• Runtime complexity of the naïve algorithm: O(n^3)
• Can we do better?
  – Strassen's algorithm, published in 1969, was the first to improve the asymptotic complexity
  – It uses only 7 multiplications instead of 8 per recursion step
  – Runtime: O(n^(log2 7)) ≈ O(n^2.8)
• Algorithms today reach O(n^2.35), but are not practical
Matrix definition
For matrices A, B, C with dimension n = 4l, l ∈ ℕ, each can be viewed as a 2×2 block matrix:

A = [A11 A12; A21 A22],  B = [B11 B12; B21 B22],  C = [C11 C12; C21 C22]

The conventional algorithm uses 8 (expensive) multiplications:
C11 = A11 · B11 + A12 · B21
C12 = A11 · B12 + A12 · B22
C21 = A21 · B11 + A22 · B21
C22 = A21 · B12 + A22 · B22
Strassen's algorithm
Define temporary matrices — only 7 multiplications!
M1 := (A11 + A22) · (B11 + B22)
M2 := (A21 + A22) · B11
M3 := A11 · (B12 − B22)
M4 := A22 · (B21 − B11)
M5 := (A11 + A12) · B22
M6 := (A21 − A11) · (B11 + B12)
M7 := (A12 − A22) · (B21 + B22)

Compose the final matrix:
C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6
Strassen – Example
Substituting the Mj by their terms gives back the original formula:
C12 = M3 + M5
    = A11 · (B12 − B22) + (A11 + A12) · B22
    = A11 B12 − A11 B22 + A11 B22 + A12 B22
    = A11 B12 + A12 B22
which is exactly the (1,2) block of C = A · B.
Strassen – Analysis
• Cost per recursion step: 7 multiplications and 18 additions
  – versus 8 multiplications and 4 additions for the naïve scheme
• Commonly considered practical only for large matrices (n > 1000)
  – although our results indicate otherwise (later)
• Define a cutoff point for the recursion:
  – if n is sufficiently small, fall back to naïve multiplication
Strassen – Implementation
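The implementation slide is a code screenshot; the sketch below (our own reconstruction, not the slide's code) shows one way to implement the recursion with submatrix views sharing the parent's storage via a row stride, and the BREAK cutoff falling back to the naïve kernel. It assumes n is a power of two.

```c
#include <stdlib.h>

#define BREAK 64   // cutoff: at or below this size, use the naive kernel

// Z = X + sign*Y for h x h blocks with row strides ldx/ldy/ldz.
static void block_add(const double *X, const double *Y, double *Z,
                      int h, int ldx, int ldy, int ldz, double sign)
{
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            Z[i*ldz + j] = X[i*ldx + j] + sign * Y[i*ldy + j];
}

static void naive(const double *A, const double *B, double *C,
                  int n, int lda, int ldb, int ldc)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*lda + k] * B[k*ldb + j];
            C[i*ldc + j] = s;
        }
}

void strassen(const double *A, const double *B, double *C,
              int n, int lda, int ldb, int ldc)
{
    if (n <= BREAK) { naive(A, B, C, n, lda, ldb, ldc); return; }
    int h = n / 2;
    const double *A11 = A, *A12 = A + h, *A21 = A + h*lda, *A22 = A + h*lda + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h*ldb, *B22 = B + h*ldb + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h*ldc, *C22 = C + h*ldc + h;

    double *M  = malloc((size_t)7 * h * h * sizeof *M);   // M1..M7
    double *T1 = malloc((size_t)h * h * sizeof *T1);
    double *T2 = malloc((size_t)h * h * sizeof *T2);
    double *Mi[7];
    for (int i = 0; i < 7; i++) Mi[i] = M + (size_t)i * h * h;

    block_add(A11, A22, T1, h, lda, lda, h, +1);
    block_add(B11, B22, T2, h, ldb, ldb, h, +1);
    strassen(T1, T2, Mi[0], h, h, h, h);                  // M1
    block_add(A21, A22, T1, h, lda, lda, h, +1);
    strassen(T1, B11, Mi[1], h, h, ldb, h);               // M2
    block_add(B12, B22, T2, h, ldb, ldb, h, -1);
    strassen(A11, T2, Mi[2], h, lda, h, h);               // M3
    block_add(B21, B11, T2, h, ldb, ldb, h, -1);
    strassen(A22, T2, Mi[3], h, lda, h, h);               // M4
    block_add(A11, A12, T1, h, lda, lda, h, +1);
    strassen(T1, B22, Mi[4], h, h, ldb, h);               // M5
    block_add(A21, A11, T1, h, lda, lda, h, -1);
    block_add(B11, B12, T2, h, ldb, ldb, h, +1);
    strassen(T1, T2, Mi[5], h, h, h, h);                  // M6
    block_add(A12, A22, T1, h, lda, lda, h, -1);
    block_add(B21, B22, T2, h, ldb, ldb, h, +1);
    strassen(T1, T2, Mi[6], h, h, h, h);                  // M7

    for (int i = 0; i < h; i++)                           // compose C
        for (int j = 0; j < h; j++) {
            int k = i*h + j;
            C11[i*ldc + j] = Mi[0][k] + Mi[3][k] - Mi[4][k] + Mi[6][k];
            C12[i*ldc + j] = Mi[2][k] + Mi[4][k];
            C21[i*ldc + j] = Mi[1][k] + Mi[3][k];
            C22[i*ldc + j] = Mi[0][k] - Mi[1][k] + Mi[2][k] + Mi[5][k];
        }
    free(M); free(T1); free(T2);
}
```

Each recursion level allocates 7 temporary M blocks plus two scratch blocks — the exponentially rising memory requirement mentioned later.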
Execution Time: Single-threaded
[Chart: runtime in seconds (log scale) for n = 32 … 2048 on one dl980 core; Strassen with recursion cutoff BREAK = 64. At n = 2048: naïve ~98.14 s, Strassen ~6.12 s, MKL ~0.38 s; at n = 1024: naïve ~11.79 s, Strassen ~0.87 s.]
Parallelization of Strassen I
• Data dependencies:
  – The additions inside an Mj have to be done before its multiplication
    • e.g. M1 = (A11 + A22) · (B11 + B22)
  – All Mj have to be calculated before composing C
    • e.g. C12 = M3 + M5
• Easiest solution:
  – Calculate the Mj in parallel
  – Then calculate the Cj,k in parallel
Parallelization of Strassen II
• Recursion level 1 can be scheduled onto 7 threads
• Level n can be scheduled onto 7^n threads
  – but most systems have a power-of-two number of processors
• We used manual parallelization
  – 49 distinct functions for the Ms and 16 for the Cs
  – Code bloat, and not scalable, BUT:
• Automatic parallelization is hard
  – Thread load becomes very unbalanced
  – Every level needs 7 temporary matrices
    • exponentially rising memory requirements
Execution Time – 49 Threads
[Chart: runtime in seconds (log scale) for n = 512 … 8192 on 49 dl980 cores. At n = 8192: naïve ~228.57 s, Strassen ~27.61 s, MKL ~13.53 s; at smaller n the three lie between roughly 0.03 s and 2.54 s.]
NUMA Optimizations
• Keep as much memory as possible local, to avoid remote memory accesses
  – remote access is slower by a factor of ~1.4
• Partition data and work depending on the number of nodes and cores
• Pin threads to the nodes holding the memory they need
• (Consider the topology for other algorithms)
Distributing memory and threads
[Chart: parallel naïve MMM on ubuntu-numa0101 with 24 cores, n = 1024, 2048, 4096, comparing four variants: neither distributed, distributed threads, distributed memory, and distributed memory and threads. At n = 4096 the variants range from ~101.12 s down to ~204.85 s apart, with distributing both memory and threads fastest; at n = 1024 all variants take roughly 0.34–0.37 s.]
DEMO
Application of NUMA Optimizations
• Copy all data to every node:
  – preprocessing takes 11.11 s for an 8192×8192 matrix on 8 nodes
• Partition the data and move each part to its node:
  – preprocessing takes 1.03 s for an 8192×8192 matrix on 8 nodes
• Pin threads to nodes:
  – int numa_run_on_node(int node);
Parallelization – Partition by rows: copying memory to the different nodes
Strassen Memory Distribution Effects
[Chart: n = 16384 on 128 dl980 cores; stacked runtime (distributed memory copy, multiplication, result combination) for different node counts. Totals lie between roughly 14.67 s and 22.15 s.]
Other optimization techniques
• Tiling
• Vectorization
• Scalar replacement
• Precomputation of constants
• (Unrolling)
Tiling
• Divide the computational work into tiles to leverage the cache
• The tile size depends on the cache size
• gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE)
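A minimal sketch of a tiled kernel (our own illustration): the loops iterate over TILE × TILE blocks so the working set of the inner loops stays cache-resident, with the tile edge derived from the CLS macro the gcc command above defines. It assumes n is a multiple of TILE.

```c
#include <string.h>

#ifndef CLS
#define CLS 64                              /* fallback cache line size */
#endif
#define TILE (CLS / (int)sizeof(double))    /* doubles per tile edge */

// Tiled multiplication: C = A * B, accumulating block by block.
void mmm_tiled(const double *A, const double *B, double *C, int n)
{
    memset(C, 0, (size_t)n * n * sizeof *C);
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```

The i-k-j inner ordering also makes the innermost accesses to B and C contiguous, which combines well with the transpose and vectorization discussed next.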
Performance of Tiling
perf stat -e L1-dcache-misses,LLC-misses,DTLB-misses bin/matrixmult -n 2048
[Chart: cache misses (log scale) and runtime for the four variants (not tiled / tiled × not transposed / transposed) on 128 dl980 cores. L1 dcache misses drop from ~8.59e9 to ~1.07e9; runtime drops from ~97 s (not tiled, not transposed) to ~12–13 s (tiled and/or transposed).]
Vectorization
• SIMD: Single Instruction, Multiple Data
• All recent Intel and AMD processors have Streaming SIMD Extensions (SSE)
• One instruction is applied to multiple floats simultaneously
• Can only operate efficiently on aligned data (16-byte aligned)
• SSE operates on 128-bit registers
  – Newer Intel processors have Advanced Vector Extensions (AVX) with 256 bits
  – The dl980 machine only supports 128-bit operations
Auto-Vectorization
• Can this be done automatically?
  – gcc -O3 tries to auto-vectorize
    • only possible for simple statements
Assembler
Aligned Malloc
Example:
• numa_alloc returns addr 0x1232, which is not 16-byte aligned
• We add 15, so addr = 0x1241 = 0b1001001000001
• Now we clear the last 4 bits by ANDing with ~0x0f (i.e. …fff0)
• The result, 0x1240, is 16-byte aligned
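The rounding trick from the slide, packaged as a pair of helper functions (our own sketch): over-allocate, round the address up with the add-15-and-mask step, and stash the original pointer just below the aligned block so it can still be freed.

```c
#include <stdint.h>
#include <stdlib.h>

// Return a 16-byte-aligned block of `size` bytes.
void *malloc_aligned16(size_t size)
{
    void *raw = malloc(size + 15 + sizeof(void *));
    if (!raw) return NULL;
    uintptr_t addr = (uintptr_t)raw + sizeof(void *);
    addr = (addr + 15) & ~(uintptr_t)0x0f;   // add 15, clear the last 4 bits
    ((void **)addr)[-1] = raw;               // remember the original pointer
    return (void *)addr;
}

void free_aligned16(void *p)
{
    if (p) free(((void **)p)[-1]);           // free the original allocation
}
```

The extra sizeof(void *) bytes guarantee there is always room for the stashed pointer below the aligned address.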
Intrinsics Example — Source: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Use Parallelism for MMM
• We try to construct a 4×4 matrix multiplication
• How to process rows?
  – Rows lie in continuous memory and can be loaded in one instruction
  – Columns cannot be loaded in one instruction
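Because columns cannot be loaded in one instruction, the usual SSE trick is to avoid column loads entirely: broadcast one scalar of A and accumulate scaled rows of B. A sketch with SSE intrinsics (our own illustration; unaligned loads are used so the example works on any float arrays — with 16-byte-aligned data, _mm_load_ps/_mm_store_ps would be used instead):

```c
#include <xmmintrin.h>   /* SSE intrinsics */

// C = A * B for 4x4 single-precision matrices, row-major.
// Row i of C is computed as sum_k A[i][k] * (row k of B): each row of B
// loads in one 128-bit instruction, and A[i][k] is broadcast.
void mmm4x4_sse(const float A[16], const float B[16], float C[16])
{
    for (int i = 0; i < 4; i++) {
        __m128 acc = _mm_setzero_ps();
        for (int k = 0; k < 4; k++) {
            __m128 a = _mm_set1_ps(A[i * 4 + k]);   // broadcast A[i][k]
            __m128 b = _mm_loadu_ps(&B[k * 4]);     // whole row of B at once
            acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
        }
        _mm_storeu_ps(&C[i * 4], acc);              // row i of C
    }
}
```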