static versus dynamic memory allocation a comparison for
play

Static versus Dynamic Memory Allocation: a Comparison for Linear - PowerPoint PPT Presentation

Static versus Dynamic Memory Allocation: a Comparison for Linear Algebra Kernels Toufik Baroudi 1 Vincent Loechner 2 Rachid Seghir 1 1 University of Batna, Algeria 2 University of Strasbourg & INRIA, France IMPACT 2020 Baroudi, Loechner,


  1. Static versus Dynamic Memory Allocation: a Comparison for Linear Algebra Kernels Toufik Baroudi 1 Vincent Loechner 2 Rachid Seghir 1 1 University of Batna, Algeria 2 University of Strasbourg & INRIA, France IMPACT 2020 Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 1 / 25

  2. Introduction Two years ago [BSL17, TACO]: compact data layout for regular sparse matrices optimized by Pluto our preliminary benchmarks were inconsistent → due to matrix allocation mode: as static declared array or as array of pointers to dynamically allocated memory Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 2 / 25

  3. Introduction: Content of this Presentation We precisely analyze one code: triangular matrix multiplication using the performance counters (#instr., #mem. access, #L1-L3 cache misses, #TLB misses, #vectorized instr.) Ran the same tests on the PolyBench linear algebra kernels Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 3 / 25

  4. Introduction: Objective Array allocation mode influences performance! Main factors of performance variation: ability of the compiler to detect vectorization number of cache misses and memory loads Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 4 / 25

  5. Introduction: Objective Array allocation mode influences performance! Main factors of performance variation: ability of the compiler to detect vectorization number of cache misses and memory loads This work is not a manifest for one type of allocation or the other, it is a warning: declaration and allocation of arrays matters! Comparing various versions of codes using different array allocation modes can get biased Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 4 / 25

  6. Table of Contents Introduction 1 Triangular Matrix Multiplication: Demonstration 2 Triangular Matrix Multiplication: Performance Analysis 3 PolyBench: Performance Analysis 4 Conclusion 5 Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 5 / 25

  7. Motivation Triangular Matrix Multiplication Demo in completely different conditions than in the paper: on this laptop (MacOS 10.14, clang/llvm-9.0.0, 4-cores Intel core i7) Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 6 / 25

  8. Table of Contents Introduction 1 Triangular Matrix Multiplication: Demonstration 2 Triangular Matrix Multiplication: Performance Analysis 3 PolyBench: Performance Analysis 4 Conclusion 5 Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 7 / 25

  9. Triangular Matrix Multiplication: setup Intel platform: dual socket Intel Xeon E5-2650v3 (Haswell-EP) 2x10 hyperthreaded cores, AVX2 (256 bits) AMD platform: dual socket AMD Opteron 6172 (Magny-Cours) 2x12 cores, SSE (128 bits) using pluto-0.11.4 --tile --parallel using gcc-7.4.0 -O3 -march=native -fopenmp on a regular Linux 4.0.15 (Ubuntu) problem size: N=8000 Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 8 / 25

  10. Triangular Matrix Multiplication: execution time 512 10 3 execution time (s) 446 209 135 . 1 135 . 2 Orig Stat 166 64 . 8 64 . 1 Orig Dyn 10 2 40 . 9 34 . 2 Par Stat 19 . 8 17 . 3 25 20 10 . 2 9 . 7 Par Dyn 10 1 Intel C1 AMD C1 Intel C2 AMD C2 Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 9 / 25

  11. Triangular Matrix Multiplication: L1-dcache-loads 10 3 execution time (s) 512 446 209 135 . 1 135 . 2 Orig Stat 166 64 . 8 64 . 1 Orig Dyn 10 2 40 . 9 34 . 2 Par Stat 19 . 8 25 17 . 3 20 10 . 2 Par Dyn 9 . 7 10 1 Intel C1 AMD C1 Intel C2 AMD C2 # L1-dc-ld (billions) 1 , 000 834 727 Orig Stat Orig Dyn 471 500 366 369 Par Stat 353 347 320 319 323 279 204 205 Par Dyn 173 43 . 8 44 . 1 0 Intel C1 AMD C1 Intel C2 AMD C2 Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 10 / 25

  12. Triangular Matrix Multiplication: L1-dcache-misses 10 3 512 execution time (s) 446 209 135 . 1 135 . 2 Orig Stat 166 64 . 8 64 . 1 Orig Dyn 10 2 40 . 9 34 . 2 Par Stat 19 . 8 25 17 . 3 20 10 . 2 Par Dyn 9 . 7 10 1 Intel C1 AMD C1 Intel C2 AMD C2 # L1-dc-misses (billions) 19 . 88 19 . 87 19 . 7 19 . 8 Orig Stat 10 1 Orig Dyn 3 . 62 3 . 17 2 . 68 2 . 44 2 . 25 Par Stat 2 . 08 1 . 03 Par Dyn 1 . 01 0 . 89 0 . 91 0 . 88 0 . 85 10 0 Intel C1 AMD C1 Intel C2 AMD C2 Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 11 / 25

  13. Triangular Matrix Multiplication: L3-dcache-misses 10 3 512 execution time (s) 446 209 135 . 1 135 . 2 Orig Stat 166 64 . 8 64 . 1 Orig Dyn 10 2 40 . 9 34 . 2 Par Stat 19 . 8 25 17 . 3 20 10 . 2 Par Dyn 9 . 7 10 1 Intel C1 AMD C1 Intel C2 AMD C2 # L3-dc-misses (billions) 3 . 25 2 . 99 2 . 19 1 . 86 Orig Stat Orig Dyn 0 . 83 0 . 82 0 . 61 10 0 0 . 53 0 . 44 0 . 41 0 . 42 Par Stat 0 . 4 Par Dyn 0 . 12 0 . 12 0 . 12 0 . 12 10 − 1 Intel C1 AMD C1 Intel C2 AMD C2 Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 12 / 25

  14. Triangular Matrix Multiplication: dTLB-misses 10 3 execution time (s) 512 446 209 135 . 1 135 . 2 Orig Stat 166 64 . 8 64 . 1 Orig Dyn 10 2 40 . 9 34 . 2 Par Stat 19 . 8 25 17 . 3 20 10 . 2 Par Dyn 9 . 7 10 1 Intel C1 AMD C1 Intel C2 AMD C2 # dTLB-misses (billions) 10 3 345 300 250 204 208 207 Orig Stat 177 105 Orig Dyn 10 2 66 58 Par Stat 43 43 42 40 40 35 Par Dyn 10 1 Intel C1 AMD C1 Intel C2 AMD C2 Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 13 / 25

  15. Triangular Matrix Multiplication: vectorized instructions 512 10 3 execution time (s) 446 209 135 . 1 135 . 2 Orig Stat 166 64 . 8 64 . 1 Orig Dyn 10 2 40 . 9 34 . 2 Par Stat 19 . 8 17 . 3 25 20 10 . 2 Par Dyn 9 . 7 10 1 Intel C1 AMD C1 Intel C2 AMD C2 # vectorized inst. (millions) 69 , 360 69 , 310 10 6 2 , 006 . 5 Orig Stat 1 , 280 Orig Dyn 10 3 Par Stat 19 . 3 Par Dyn 4 . 6 3 1 10 0 Intel C1 AMD C1 Intel C2 AMD C2 unavailable on the AMD but “ gcc -fopt-info-vec ” seems to confirm the correlation Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 14 / 25

  16. Triangular Matrix Multiplication: Synthesis array allocation mode has a significant impact on the performance of this code it can have opposite effects on different processors! factors of influence: number of memory accesses number of cache and TLB misses number of vectorized instructions other experiments 1 on the Intel platform show that the number of vectorized instructions is a major factor of influence 1 on other triangular matrix kernels: Cholesky, SolveMat, sspfa. Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 15 / 25

  17. Table of Contents Introduction 1 Triangular Matrix Multiplication: Demonstration 2 Triangular Matrix Multiplication: Performance Analysis 3 PolyBench: Performance Analysis 4 Conclusion 5 Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 16 / 25

  18. PolyBench: setup on the Intel platform using pluto-0.11.4 --tile --parallel using gcc-7.4.0 -O3 -march=native -fopenmp PolyBench macro POLYBENCH STACK ARRAYS : static version: stack allocated static array dynamic version: multidimensional heap-allocated array ( not an array of pointers as in the previous experiment) problem size: N=2,000 for O ( N 3 ) algorithms N=20,000 for O ( N 2 ) algorithms Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 17 / 25

  19. PolyBench: execution time 3.8x 247 . 4 execution time (s) +20% 164 . 8 10 3 43 . 35 65 . 6 Orig Stat 12 . 72 12 . 73 9 . 55 Orig Dyn 3 . 02 3 . 04 2 . 21 2 . 21 1 . 82 3 . 1 10 1 1 . 69 1 . 48 0 . 93 0 . 93 Par Stat 0 . 75 0 . 76 0 . 76 0 . 41 0 . 41 0 . 41 0 . 43 0 . 36 0 . 34 0 . 36 0 . 4 0 . 4 0 . 4 0 . 12 0 . 12 Par Dyn 10 − 1 atax mvt 2mm 3mm bicg trisolv lu cholesky Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 18 / 25

  20. PolyBench: vectorized instructions Orig Dyn Orig Stat Orig Dyn Orig Stat Baroudi, Loechner, Seghir Par Dyn Par Stat Par Dyn Par Stat 38 . 66 164 . 8 2mm 2mm 8 , 042 43 . 35 6 , 294 1 . 69 5 , 537 1 . 82 39 . 21 247 . 4 3mm 3mm 12 , 042 65 . 6 6 , 295 3 . 02 5 , 539 3 . 1 Static vs. Dynamic Memory Allocation 12 , 016 0 . 93 atax atax 12 , 013 0 . 93 1 , 214 0 . 4 1 , 176 0 . 41 951 . 8 1 . 48 bicg bicg 952 . 1 0 . 75 1 , 264 0 . 4 1 , 264 0 . 41 901 . 7 9 . 55 mvt 1 , 102 mvt 3 . 04 901 . 6 0 . 4 901 . 3 0 . 41 300 . 8 0 . 36 trisolv trisolv 300 . 8 0 . 34 300 . 8 0 . 12 300 . 8 0 . 12 6 , 016 12 . 72 6 , 018 12 . 73 lu lu 8 , 082 0 . 36 7 , 830 0 . 43 cholesky cholesky 6 , 006 2 . 21 6 , 009 2 . 21 6 , 007 0 . 76 6 , 007 0 . 76 IMPACT 2020 10 1 10 2 10 3 10 4 10 − 1 10 1 10 3 # vectorized inst. (millions) execution time (s) 19 / 25

Recommend


More recommend