Static versus Dynamic Memory Allocation: a Comparison for Linear Algebra Kernels

Toufik Baroudi (1), Vincent Loechner (2), Rachid Seghir (1)
(1) University of Batna, Algeria
(2) University of Strasbourg & INRIA, France
IMPACT 2020

Introduction
Two years ago [BSL17, TACO]: a compact data layout for regular sparse matrices, optimized by Pluto.
- Our preliminary benchmarks were inconsistent.
- The cause was the matrix allocation mode: a statically declared array versus an array of pointers to dynamically allocated memory.
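The two allocation modes named above can be sketched in C as follows. This is a minimal illustration, not the authors' benchmark code: the size N=4 and the names A_static, alloc_rows, free_rows, sum_static and sum_rows are assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>

#define N 4 /* illustrative size only; the slides report N=8000 */

/* Static mode: one contiguous block, dimensions known at compile time. */
double A_static[N][N];

/* Dynamic mode: an array of row pointers, each row allocated separately.
   Returns NULL if any allocation fails. */
double **alloc_rows(size_t n)
{
    assert(n > 0);
    double **rows = malloc(n * sizeof *rows);
    if (rows == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        rows[i] = calloc(n, sizeof **rows);
        if (rows[i] == NULL) {
            /* Undo the rows allocated so far. */
            while (i > 0)
                free(rows[--i]);
            free(rows);
            return NULL;
        }
    }
    return rows;
}

void free_rows(double **rows, size_t n)
{
    if (rows == NULL)
        return;
    for (size_t i = 0; i < n; i++)
        free(rows[i]);
    free(rows);
}

/* A_static[i][j] is addressed as base + (i*N + j) * sizeof(double). */
double sum_static(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += A_static[i][j];
    return s;
}

/* rows[i][j] first loads the pointer rows[i], then indexes into that row. */
double sum_rows(double **rows, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += rows[i][j];
    return s;
}
```

Both sum functions use the same source-level expression, but the array-of-pointers version needs an extra pointer load per row and gives the compiler no guarantee that rows are contiguous or non-overlapping.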

Introduction: Content of this Presentation
- We precisely analyze one code, triangular matrix multiplication, using the performance counters (#instructions, #memory accesses, #L1-L3 cache misses, #TLB misses, #vectorized instructions).
- We ran the same tests on the PolyBench linear algebra kernels.
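The slides do not show the kernel itself. As a rough sketch, a textbook product of two upper triangular matrices could look like the code below. The function name trmm_upper and the full n x n storage are assumptions for illustration; the compact data layout of [BSL17] and the Pluto-generated tiled version are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B for upper triangular n x n matrices stored as full arrays.
   Only entries with i <= k <= j contribute to C[i][j]. */
void trmm_upper(size_t n, double A[n][n], double B[n][n], double C[n][n])
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        /* The product of upper triangular matrices is upper triangular. */
        for (size_t j = 0; j < i; j++)
            C[i][j] = 0.0;
        for (size_t j = i; j < n; j++) {
            double acc = 0.0;
            for (size_t k = i; k <= j; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
    }
}
```

The inner loop bounds depend on i and j, which makes the iteration space triangular rather than rectangular.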

Introduction: Objective
Array allocation mode influences performance!
Main factors of performance variation:
- ability of the compiler to detect vectorization
- number of cache misses and memory loads
This work is not a manifesto for one type of allocation or the other. It is a warning: the declaration and allocation of arrays matter! Comparisons between versions of a code that use different array allocation modes can be biased.

Table of Contents
1. Introduction
2. Triangular Matrix Multiplication: Demonstration
3. Triangular Matrix Multiplication: Performance Analysis
4. PolyBench: Performance Analysis
5. Conclusion

Motivation: Triangular Matrix Multiplication
Demo in completely different conditions than in the paper: on this laptop (macOS 10.14, clang/llvm-9.0.0, 4-core Intel Core i7).

Triangular Matrix Multiplication: setup
- Intel platform: dual-socket Intel Xeon E5-2650v3 (Haswell-EP), 2x10 hyperthreaded cores, AVX2 (256 bits)
- AMD platform: dual-socket AMD Opteron 6172 (Magny-Cours), 2x12 cores, SSE (128 bits)
- using pluto-0.11.4 --tile --parallel
- using gcc-7.4.0 -O3 -march=native -fopenmp
- on a regular Linux 4.0.15 (Ubuntu)
- problem size: N=8000

Triangular Matrix Multiplication: execution time
[Figure: bar chart of execution time (s), log scale, for the Orig Stat, Orig Dyn, Par Stat and Par Dyn versions on Intel C1, AMD C1, Intel C2 and AMD C2. Reported times range from 9.7 s to 512 s.]

Triangular Matrix Multiplication: L1-dcache-loads
[Figure: bar chart of # L1-dc-ld (billions), linear scale, for the Orig Stat, Orig Dyn, Par Stat and Par Dyn versions on Intel C1, AMD C1, Intel C2 and AMD C2, shown next to the execution time chart.]

Triangular Matrix Multiplication: L1-dcache-misses
[Figure: bar chart of # L1-dc-misses (billions), log scale, for the Orig Stat, Orig Dyn, Par Stat and Par Dyn versions on Intel C1, AMD C1, Intel C2 and AMD C2, shown next to the execution time chart.]

Triangular Matrix Multiplication: L3-dcache-misses
[Figure: bar chart of # L3-dc-misses (billions), log scale, for the Orig Stat, Orig Dyn, Par Stat and Par Dyn versions on Intel C1, AMD C1, Intel C2 and AMD C2, shown next to the execution time chart.]

Triangular Matrix Multiplication: dTLB-misses
[Figure: bar chart of # dTLB-misses (billions), log scale, for the Orig Stat, Orig Dyn, Par Stat and Par Dyn versions on Intel C1, AMD C1, Intel C2 and AMD C2, shown next to the execution time chart.]

Triangular Matrix Multiplication: vectorized instructions
[Figure: bar chart of # vectorized inst. (millions), log scale, for the Orig Stat, Orig Dyn, Par Stat and Par Dyn versions on Intel C1 and Intel C2, shown next to the execution time chart.]
This counter is unavailable on the AMD platform, but "gcc -fopt-info-vec" seems to confirm the correlation.

Triangular Matrix Multiplication: Synthesis
- Array allocation mode has a significant impact on the performance of this code.
- It can have opposite effects on different processors!
- Factors of influence:
  - number of memory accesses
  - number of cache and TLB misses
  - number of vectorized instructions
- Other experiments (1) on the Intel platform show that the number of vectorized instructions is a major factor of influence.
(1) On other triangular matrix kernels: Cholesky, SolveMat, sspfa.

PolyBench: setup
- on the Intel platform
- using pluto-0.11.4 --tile --parallel
- using gcc-7.4.0 -O3 -march=native -fopenmp
- PolyBench macro POLYBENCH_STACK_ARRAYS:
  - static version: stack-allocated static array
  - dynamic version: multidimensional heap-allocated array (not an array of pointers as in the previous experiment)
- problem size: N=2,000 for O(N^3) algorithms, N=20,000 for O(N^2) algorithms
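The difference between a multidimensional heap-allocated array and an array of pointers can be sketched as follows. This is an illustration under stated assumptions, not PolyBench's own allocation code: the helper names alloc_contiguous and fill_and_sum are hypothetical, and the pointer-to-array type relies on C99 variable-length array types.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates a zero-initialized n x n matrix as ONE contiguous heap block.
   The caller assigns the result to a pointer of type double (*)[n]. */
void *alloc_contiguous(size_t n)
{
    assert(n > 0);
    return calloc(n, sizeof(double[n]));
}

/* Fills M[i][j] = i*n + j and returns the sum of all entries,
   or -1.0 if the allocation fails. */
double fill_and_sum(size_t n)
{
    double (*M)[n] = alloc_contiguous(n);
    if (M == NULL)
        return -1.0;
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            /* M[i][j] is addressed as base + (i*n + j) * sizeof(double):
               no intermediate row pointer is loaded. */
            M[i][j] = (double)(i * n + j);
            s += M[i][j];
        }
    }
    free(M);
    return s;
}
```

With this layout the address computation matches that of a static array, except that the row length n is a runtime value; rows are adjacent in memory and a single free releases the matrix.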

PolyBench: execution time
[Figure: bar chart of execution time (s), log scale, for the Orig Stat, Orig Dyn, Par Stat and Par Dyn versions of atax, mvt, 2mm, 3mm, bicg, trisolv, lu and cholesky. Two differences are annotated on the chart: 3.8x and +20%.]

PolyBench: vectorized instructions
[Figure: two side-by-side bar charts, # vectorized inst. (millions) and execution time (s), both log scale, for the Orig Dyn, Orig Stat, Par Dyn and Par Stat versions of 2mm, 3mm, atax, bicg, mvt, trisolv, lu and cholesky.]
