performance analysis of parallel codes on heterogeneous systems


  1. performance analysis of parallel codes on heterogeneous systems
     E. Agullo, O. Aumage, B. Bramas, A. Buttari, A. Guermouche, F. Lopez, S. Nakov, S. Thibault
     SOLHAR plenary meeting, Bordeaux, 25-01-2026

  2. a motivating example

  3. Plain speedup is not enough
     • qr_mumps + StarPU with 1D, block-column partitioning
     • Matrices from the UF (University of Florida) sparse matrix collection:
        #   Matrix       Mflops
        12  hirlam          1384160
        13  flower_8_4      2851508
        14  Rucci1          5671282
        15  ch8-8-b3       10709211
        16  GL7d24         16467844
        17  neos2          20170318
        18  spal_004       30335566
        19  n4c6-b6        62245957
        20  sls            65607341
        21  TF18          194472820
        22  lp_nug30      221644546
        23  mk13-b5       259751609
     • One node of the ADA supercomputer (IBM x3750-M4, Intel Sandy Bridge E5-4650 @ 2.7 GHz, 4 × 8 cores)

  4. Experimental results: speedups
     [Figure: speedup of the 1D algorithm on 32 cores, per matrix (# 12-23).]
     Speedup tells us something, e.g. that performance is poor on small matrices and good on bigger ones, but it says nothing about the reason: is there a problem in the implementation, in the algorithm, or in the data? And what is going on with that one badly behaving matrix?

  5. performance analysis approach, the homogeneous case

  6. Parallel efficiency (area performance upper bound)
     The parallel efficiency is defined as
        e(p) = t_min(p) / t(p) = t~(1) / (t(p) · p)
     where
     • t~(1) is the execution time of the best sequential algorithm on one core;
     • t(p) is the execution time of the best parallel algorithm on p cores.
     Note that, in general, t(1) ≥ t~(1) because:
     • parallelism requires partitioning data and operations, which reduces the efficiency of the tasks;
     • the parallel algorithm may trade some extra flops for concurrency.
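     As an illustrative aside (not part of the original slides), this efficiency is easy to compute once the two wall-clock times have been measured; the Python function and timing values below are hypothetical:

         def parallel_efficiency(t_seq_best, t_par, p):
             """Parallel efficiency e(p) = t_min(p) / t(p) = t~(1) / (p * t(p)).

             t_seq_best : t~(1), run time of the best sequential algorithm on one core
             t_par      : t(p), run time of the parallel algorithm on p cores
             p          : number of cores
             """
             return t_seq_best / (p * t_par)

         # Hypothetical timings (seconds) for one matrix on 32 cores.
         print(parallel_efficiency(t_seq_best=640.0, t_par=25.0, p=32))  # 0.8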

  7. A finer performance analysis
     The execution time t(p) can be decomposed into the following three terms:
     • t_t(p): the time spent executing tasks;
     • t_r(p): the overhead of the runtime system, with t_r(1) := 0;
     • t_i(p): the idle time, with t_i(1) := 0.
     The overall efficiency can thus be written as
        e(p) = t~_t(1) / (t_t(p) + t_r(p) + t_i(p))
             = e_g · e_t · e_r · e_p
     with:
     • e_g = t~_t(1) / t_t(1): the granularity efficiency; measures the impact of using a parallel algorithm (partitioned data, extra flops) instead of the best sequential one.
     • e_t = t_t(1) / t_t(p): the task efficiency; measures the exploitation of data locality.
     • e_r = t_t(p) / (t_t(p) + t_r(p)): the runtime efficiency; measures how the runtime overhead affects performance.
     • e_p = (t_t(p) + t_r(p)) / (t_t(p) + t_r(p) + t_i(p)): the pipeline efficiency; measures how much concurrency is available and how well it is exploited.
     (Slides 8-11 of the deck are incremental builds of this slide, each highlighting one of the four efficiencies.)
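     As another illustrative aside (not from the slides), the breakdown is mechanical once the aggregate times t~_t(1), t_t(1), t_t(p), t_r(p) and t_i(p) have been measured, e.g. by summing per-worker times from runtime traces; the Python function and sample values below are hypothetical:

         def efficiency_breakdown(tt_seq_best, tt_seq, tt_par, tr_par, ti_par):
             """Split e(p) into e_g * e_t * e_r * e_p from aggregate times.

             tt_seq_best : t~_t(1), task time of the best sequential algorithm
             tt_seq      : t_t(1), task time of the parallel algorithm on 1 core
             tt_par      : t_t(p), cumulative task time on p cores
             tr_par      : t_r(p), cumulative runtime overhead on p cores
             ti_par      : t_i(p), cumulative idle time on p cores
             """
             e_g = tt_seq_best / tt_seq                             # granularity
             e_t = tt_seq / tt_par                                  # task (locality)
             e_r = tt_par / (tt_par + tr_par)                       # runtime
             e_p = (tt_par + tr_par) / (tt_par + tr_par + ti_par)   # pipeline
             # The product telescopes back to t~_t(1) / (t_t(p) + t_r(p) + t_i(p)).
             return {"e_g": e_g, "e_t": e_t, "e_r": e_r, "e_p": e_p,
                     "e": e_g * e_t * e_r * e_p}

         # Hypothetical aggregate times (seconds) for one run on p = 32 cores.
         print(efficiency_breakdown(tt_seq_best=600.0, tt_seq=660.0,
                                    tt_par=700.0, tr_par=30.0, ti_par=70.0))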

  12. Experimental results: efficiency breakdown
      [Figure: granularity (e_g), task (e_t), pipeline (e_p) and runtime (e_r) efficiencies of the 1D algorithm, per matrix (# 12-23).]

  13. 2D partitioning + CA front factorization
      1D partitioning is not good for (strongly) overdetermined matrices:
      • most fronts are overdetermined;
      • the problem is only mitigated by factorizing several fronts concurrently.
      Proposed remedy:
      • 2D block partitioning (not necessarily square blocks);
      • communication-avoiding (CA) algorithms.
      Consequences:
      • more concurrency;
      • more complex dependencies;
      • many more tasks (higher runtime overhead);
      • finer task granularity (lower kernel efficiency).
      Thanks to the simplicity of the STF programming model, 2D methods for factorizing the frontal matrices can be plugged in with relatively moderate effort. A rough illustration of the 1D/2D trade-off is sketched below.
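      This sketch is not from the slides; it merely counts block columns versus tiles for a tall-and-skinny front, to make the concurrency vs. granularity trade-off concrete. The front dimensions and block sizes are arbitrary:

         from math import ceil

         def partition_stats(m, n, nb, mb=None):
             """Rough task counts for 1D block-column vs 2D tile partitioning of an
             m-by-n frontal matrix (illustrative only; ignores the distinction
             between panel and update tasks).

             nb : block-column (and tile) width
             mb : tile height for the 2D case (defaults to nb)
             """
             mb = mb or nb
             blocks_1d = ceil(n / nb)                # few, large tasks
             tiles_2d = ceil(m / mb) * ceil(n / nb)  # many more, much smaller tasks
             return {"1D tasks": blocks_1d,
                     "2D tiles": tiles_2d,
                     "entries per 1D block": m * nb,
                     "entries per 2D tile": mb * nb}

         # A strongly overdetermined front: 1D offers almost no concurrency.
         print(partition_stats(m=20480, n=512, nb=128, mb=256))
         # {'1D tasks': 4, '2D tiles': 320, ...}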

  14. Experimental results: speedups
      [Figure: speedup of the 1D and 2D algorithms on 32 cores, per matrix (# 12-23).]
      The scalability of the task-based multifrontal method is enhanced by the introduction of 2D CA algorithms:
      • speedups are now uniform across all tested matrices;
      • we perform a comparative performance analysis with respect to the 1D case to show where the benefits of the 2D scheme come from.

  15. Experimental results: efficiency breakdown
      [Figure: granularity (e_g), task (e_t), pipeline (e_p) and runtime (e_r) efficiencies of the 1D and 2D algorithms, per matrix (# 12-23).]

  16. case study with ScalFMM

  17. Uniform - native StarPU (with commute)
      [Figure: task-dependency efficiency on miriel with StarPU-C (uniform); panels for the six test cases from (1000000, 7) to (100000000, 8); parallel, task, runtime and pipeline efficiencies vs. number of threads (1-24).]

  18. Uniform - OpenMP-Klang-StarPU (with commute)
      [Figure: task-dependency efficiency on miriel with Klang-C (uniform); same test cases; parallel, task, runtime and pipeline efficiencies vs. number of threads (1-24).]

  19. Ellipsoid - native StarPU (with commute)
      [Figure: task-dependency efficiency on miriel with StarPU-C (non-uniform); panels for the six test cases from (1000000, 8) to (100000000, 11); parallel, task, runtime and pipeline efficiencies vs. number of threads (1-24).]

  20. Ellipsoid - OpenMP-Klang-StarPU (with commute)
      [Figure: task-dependency efficiency on miriel with Klang-C (non-uniform); same test cases; parallel, task, runtime and pipeline efficiencies vs. number of threads (1-24).]

  21. performance analysis approach, the heterogeneous case

  22. Area performance upper bound
      The parallel efficiency can be defined as
         e(p) = t_min(p) / t(p)
      where t_min(p) is a lower bound on the execution time on p resources, corresponding to the best schedule under the following assumptions:
      [Figure: schematic schedule over heterogeneous processing units PU0, PU1, PU2.]
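      The assumptions themselves are not reproduced in this transcript. Purely as a hedged illustration (not from the slides), one classical bound of this kind is an "area" bound in which each task type may be split arbitrarily across processing units with known per-task costs, ignoring dependencies and data transfers; the Python sketch below computes it with a small linear program, and the task counts and costs are made up:

         import numpy as np
         from scipy.optimize import linprog

         def area_bound(n_tasks, cost):
             """Linear-programming "area" lower bound on the makespan.

             n_tasks : number of tasks of each type, shape (T,)
             cost    : cost[i, j] = time of one task of type i on unit j, shape (T, P)

             Tasks are treated as divisible and dependency-free, so the result
             only bounds any real schedule from below.
             """
             T, P = cost.shape
             n_var = T * P + 1            # x[i, j] for each (type, unit), plus the makespan t
             c = np.zeros(n_var)
             c[-1] = 1.0                  # minimise the makespan t
             # Each unit j must finish by t: sum_i cost[i, j] * x[i, j] - t <= 0
             A_ub = np.zeros((P, n_var))
             for j in range(P):
                 A_ub[j, [i * P + j for i in range(T)]] = cost[:, j]
                 A_ub[j, -1] = -1.0
             b_ub = np.zeros(P)
             # All tasks of each type must be executed: sum_j x[i, j] = n_tasks[i]
             A_eq = np.zeros((T, n_var))
             for i in range(T):
                 A_eq[i, i * P:(i + 1) * P] = 1.0
             b_eq = np.asarray(n_tasks, dtype=float)
             res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                           bounds=[(0, None)] * n_var)
             return res.fun

         # Made-up example: 2 task types, 3 CPU-like units and 1 GPU-like unit.
         costs = np.array([[1.0, 1.0, 1.0, 0.1],   # type 0 runs 10x faster on the GPU
                           [2.0, 2.0, 2.0, 1.5]])  # type 1 barely benefits from it
         print(area_bound([400, 100], costs))      # t_min for this platform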
