performance analysis of parallel codes on heterogeneous systems


  1. performance analysis of parallel codes on heterogeneous systems
     E. Agullo, O. Aumage, B. Bramas, A. Buttari, A. Guermouche, F. Lopez, S. Nakov, S. Thibault
     SOLHAR plenary meeting, Bordeaux, 25-01-2026

  2. a motivating example

  3. Plain speedup is not enough
     • qr_mumps + StarPU with 1D, block-column partitioning
     • Matrices from the UF (University of Florida) sparse matrix collection:
        #   Matrix       Mflops
        12  hirlam          1384160
        13  flower_8_4      2851508
        14  Rucci1          5671282
        15  ch8-8-b3       10709211
        16  GL7d24         16467844
        17  neos2          20170318
        18  spal_004       30335566
        19  n4c6-b6        62245957
        20  sls            65607341
        21  TF18          194472820
        22  lp_nug30      221644546
        23  mk13-b5       259751609
     • One node of the ADA supercomputer (IBM x3750-M4, Intel Sandy Bridge E5-4650 @ 2.7 GHz, 4 × 8 cores)

  4. Experimental results: speedups
     [Figure: speedup of the 1D algorithm on 32 cores, per matrix (# 12-23).]
     Speedup tells us something, e.g. that performance is poor on small matrices and good on bigger ones, but it says nothing about the reason: is there a problem in the implementation, in the algorithm, or in the data? And what is going on with that one badly behaving matrix?

  5. performance analysis approach, the homogeneous case

  6. Parallel efficiency (area performance upper bound)
     The parallel efficiency is defined as
        e(p) = t_min(p) / t(p) = t~(1) / (t(p) · p)
     where
     • t~(1) is the execution time of the best sequential algorithm on one core;
     • t(p) is the execution time of the best parallel algorithm on p cores.
     Note that, in general, t(1) ≥ t~(1) because:
     • parallelism requires partitioning data and operations, which reduces the efficiency of the tasks;
     • the parallel algorithm may trade some extra flops for concurrency.
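     As an illustrative aside (not part of the original slides), this efficiency is easy to compute once the two wall-clock times have been measured; the Python function and timing values below are hypothetical:

         def parallel_efficiency(t_seq_best, t_par, p):
             """Parallel efficiency e(p) = t_min(p) / t(p) = t~(1) / (p * t(p)).

             t_seq_best : t~(1), run time of the best sequential algorithm on one core
             t_par      : t(p), run time of the parallel algorithm on p cores
             p          : number of cores
             """
             return t_seq_best / (p * t_par)

         # Hypothetical timings (seconds) for one matrix on 32 cores.
         print(parallel_efficiency(t_seq_best=640.0, t_par=25.0, p=32))  # 0.8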

  7. A finer performance analysis
     The execution time t(p) can be decomposed into the following three terms:
     • t_t(p): the time spent executing tasks;
     • t_r(p): the overhead of the runtime system, with t_r(1) := 0;
     • t_i(p): the idle time, with t_i(1) := 0.
     The overall efficiency can thus be written as
        e(p) = t~_t(1) / (t_t(p) + t_r(p) + t_i(p))
             = e_g · e_t · e_r · e_p
     with:
     • e_g = t~_t(1) / t_t(1): the granularity efficiency; measures the impact of using a parallel algorithm (partitioned data, extra flops) instead of the best sequential one.
     • e_t = t_t(1) / t_t(p): the task efficiency; measures the exploitation of data locality.
     • e_r = t_t(p) / (t_t(p) + t_r(p)): the runtime efficiency; measures how the runtime overhead affects performance.
     • e_p = (t_t(p) + t_r(p)) / (t_t(p) + t_r(p) + t_i(p)): the pipeline efficiency; measures how much concurrency is available and how well it is exploited.
     (Slides 8-11 of the deck are incremental builds of this slide, each highlighting one of the four efficiencies.)
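     As another illustrative aside (not from the slides), the breakdown is mechanical once the aggregate times t~_t(1), t_t(1), t_t(p), t_r(p) and t_i(p) have been measured, e.g. by summing per-worker times from runtime traces; the Python function and sample values below are hypothetical:

         def efficiency_breakdown(tt_seq_best, tt_seq, tt_par, tr_par, ti_par):
             """Split e(p) into e_g * e_t * e_r * e_p from aggregate times.

             tt_seq_best : t~_t(1), task time of the best sequential algorithm
             tt_seq      : t_t(1), task time of the parallel algorithm on 1 core
             tt_par      : t_t(p), cumulative task time on p cores
             tr_par      : t_r(p), cumulative runtime overhead on p cores
             ti_par      : t_i(p), cumulative idle time on p cores
             """
             e_g = tt_seq_best / tt_seq                             # granularity
             e_t = tt_seq / tt_par                                  # task (locality)
             e_r = tt_par / (tt_par + tr_par)                       # runtime
             e_p = (tt_par + tr_par) / (tt_par + tr_par + ti_par)   # pipeline
             # The product telescopes back to t~_t(1) / (t_t(p) + t_r(p) + t_i(p)).
             return {"e_g": e_g, "e_t": e_t, "e_r": e_r, "e_p": e_p,
                     "e": e_g * e_t * e_r * e_p}

         # Hypothetical aggregate times (seconds) for one run on p = 32 cores.
         print(efficiency_breakdown(tt_seq_best=600.0, tt_seq=660.0,
                                    tt_par=700.0, tr_par=30.0, ti_par=70.0))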

  12. Experimental results: efficiency breakdown
      [Figure: granularity (e_g), task (e_t), pipeline (e_p) and runtime (e_r) efficiencies of the 1D algorithm, per matrix (# 12-23).]

  13. 2D partitioning + CA front factorization
      1D partitioning is not good for (strongly) overdetermined matrices:
      • most fronts are overdetermined;
      • the problem is only mitigated by factorizing several fronts concurrently.
      Proposed remedy:
      • 2D block partitioning (not necessarily square blocks);
      • communication-avoiding (CA) algorithms.
      Consequences:
      • more concurrency;
      • more complex dependencies;
      • many more tasks (higher runtime overhead);
      • finer task granularity (lower kernel efficiency).
      Thanks to the simplicity of the STF programming model, 2D methods for factorizing the frontal matrices can be plugged in with relatively moderate effort. A rough illustration of the 1D/2D trade-off is sketched below.
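      This sketch is not from the slides; it merely counts block columns versus tiles for a tall-and-skinny front, to make the concurrency vs. granularity trade-off concrete. The front dimensions and block sizes are arbitrary:

         from math import ceil

         def partition_stats(m, n, nb, mb=None):
             """Rough task counts for 1D block-column vs 2D tile partitioning of an
             m-by-n frontal matrix (illustrative only; ignores the distinction
             between panel and update tasks).

             nb : block-column (and tile) width
             mb : tile height for the 2D case (defaults to nb)
             """
             mb = mb or nb
             blocks_1d = ceil(n / nb)                # few, large tasks
             tiles_2d = ceil(m / mb) * ceil(n / nb)  # many more, much smaller tasks
             return {"1D tasks": blocks_1d,
                     "2D tiles": tiles_2d,
                     "entries per 1D block": m * nb,
                     "entries per 2D tile": mb * nb}

         # A strongly overdetermined front: 1D offers almost no concurrency.
         print(partition_stats(m=20480, n=512, nb=128, mb=256))
         # {'1D tasks': 4, '2D tiles': 320, ...}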

  14. Experimental results: speedups
      [Figure: speedup of the 1D and 2D algorithms on 32 cores, per matrix (# 12-23).]
      The scalability of the task-based multifrontal method is enhanced by the introduction of 2D CA algorithms:
      • speedups are now uniform across all tested matrices;
      • we perform a comparative performance analysis with respect to the 1D case to show where the benefits of the 2D scheme come from.

  15. Experimental results: efficiency breakdown
      [Figure: granularity (e_g), task (e_t), pipeline (e_p) and runtime (e_r) efficiencies of the 1D and 2D algorithms, per matrix (# 12-23).]

  16. case study with ScalFMM

  17. Uniform - native StarPU (with commute)
      [Figure: task-dependency efficiency on miriel with StarPU-C (uniform); panels for the six test cases from (1000000, 7) to (100000000, 8); parallel, task, runtime and pipeline efficiencies vs. number of threads (1-24).]

  18. Uniform - OpenMP-Klang-StarPU (with commute)
      [Figure: task-dependency efficiency on miriel with Klang-C (uniform); same test cases; parallel, task, runtime and pipeline efficiencies vs. number of threads (1-24).]

  19. Ellipsoid - native StarPU (with commute)
      [Figure: task-dependency efficiency on miriel with StarPU-C (non-uniform); panels for the six test cases from (1000000, 8) to (100000000, 11); parallel, task, runtime and pipeline efficiencies vs. number of threads (1-24).]

  20. Ellipsoid - OpenMP-Klang-StarPU (with commute)
      [Figure: task-dependency efficiency on miriel with Klang-C (non-uniform); same test cases; parallel, task, runtime and pipeline efficiencies vs. number of threads (1-24).]

  21. performance analysis approach, the heterogeneous case

  22. Area performance upper bound
      The parallel efficiency can be defined as
         e(p) = t_min(p) / t(p)
      where t_min(p) is a lower bound on the execution time on p resources, corresponding to the best schedule under the following assumptions:
      [Figure: schematic schedule over heterogeneous processing units PU0, PU1, PU2.]
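      The assumptions themselves are not reproduced in this transcript. Purely as a hedged illustration (not from the slides), one classical bound of this kind is an "area" bound in which each task type may be split arbitrarily across processing units with known per-task costs, ignoring dependencies and data transfers; the Python sketch below computes it with a small linear program, and the task counts and costs are made up:

         import numpy as np
         from scipy.optimize import linprog

         def area_bound(n_tasks, cost):
             """Linear-programming "area" lower bound on the makespan.

             n_tasks : number of tasks of each type, shape (T,)
             cost    : cost[i, j] = time of one task of type i on unit j, shape (T, P)

             Tasks are treated as divisible and dependency-free, so the result
             only bounds any real schedule from below.
             """
             T, P = cost.shape
             n_var = T * P + 1            # x[i, j] for each (type, unit), plus the makespan t
             c = np.zeros(n_var)
             c[-1] = 1.0                  # minimise the makespan t
             # Each unit j must finish by t: sum_i cost[i, j] * x[i, j] - t <= 0
             A_ub = np.zeros((P, n_var))
             for j in range(P):
                 A_ub[j, [i * P + j for i in range(T)]] = cost[:, j]
                 A_ub[j, -1] = -1.0
             b_ub = np.zeros(P)
             # All tasks of each type must be executed: sum_j x[i, j] = n_tasks[i]
             A_eq = np.zeros((T, n_var))
             for i in range(T):
                 A_eq[i, i * P:(i + 1) * P] = 1.0
             b_eq = np.asarray(n_tasks, dtype=float)
             res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                           bounds=[(0, None)] * n_var)
             return res.fun

         # Made-up example: 2 task types, 3 CPU-like units and 1 GPU-like unit.
         costs = np.array([[1.0, 1.0, 1.0, 0.1],   # type 0 runs 10x faster on the GPU
                           [2.0, 2.0, 2.0, 1.5]])  # type 1 barely benefits from it
         print(area_bound([400, 100], costs))      # t_min for this platform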
