
Parallel programming 02, Walter Boscheri, walter.boscheri@unife.it



  1. Parallel programming 02 Walter Boscheri walter.boscheri@unife.it University of Ferrara - Department of Mathematics and Computer Science A.Y. 2018/2019 - Semester I

  2. Outline: (1) Classification of parallel systems; (2) Performance measure; (3) Optimization of parallel computational resources.

  3. Classification of parallel systems (section 1). A parallel system can be described by considering:
     - the number and type of processors (massively parallel vs. coarse-grained parallelism);
     - the presence of a global control mechanism (Flynn classification);
     - synchronism (whether or not the processors share a common clock);
     - the connections among processors (shared memory or distributed memory).

  4. Flynn classification (1966).
     - SISD (Single Instruction stream, Single Data stream): it includes the Von Neumann model, because one stream of instructions operates on one stream of data.
     - SIMD (Single Instruction stream, Multiple Data stream): it covers vector processors and pipeline processors, in which all processors follow the same instructions and execute them on different data sets.
     - MISD (Multiple Instruction stream, Single Data stream): it can be seen as an extension of SISD.
     - MIMD (Multiple Instruction stream, Multiple Data stream): the system has independent processors, each with its own local control unit. As a consequence, each processor can load different instructions and operate on different data.

  5. Flynn classification (1966): SISD. [Diagram: control unit (CU), processing unit (PU) and memory module (MM), connected by the instruction stream (IS) and the data stream (DS).]
     - scalar uniprocessor systems;
     - Von Neumann architecture.

  6. Flynn classification (1966): SIMD. [Diagram: a single control unit (CU) sends one instruction stream (IS) to the processing units PU 1, PU 2, ..., PU n; each PU i has its own data stream DS i and memory module MM i.]
     - synchronized parallelism;
     - one single control unit;
     - one single instruction operates on several data sets;
     - vector processors and parallel processing.

  7. Flynn classification (1966): MIMD. [Diagram: n control units CU 1, ..., CU n, each sending its own instruction stream IS i to a processing unit PU i, which has its own data stream DS i and memory module MM i.]
     - non-synchronized parallelism;
     - several processors execute several instructions and operate on several data sets;
     - shared or distributed memory.

  8. Shared and distributed memory: shared memory.
     - single address space;
     - all processors have access to the pool of shared memory.

  9. Shared and distributed memory: distributed memory.
     - each processor has its own local memory;
     - message passing is used to exchange data among processors.
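The two memory models can be contrasted with a small sketch. It is only an illustration under stated assumptions: Python threads stand in for processors, a lock-protected dictionary stands in for the shared address space, and a `queue.Queue` emulates message passing. All names (`shared_worker`, `message_worker`, `mailbox`, `run`) are hypothetical, and a real distributed-memory code would use separate processes or nodes.

```python
import queue
import threading

data = list(range(1, 101))               # 1..100, whose sum is 5050
chunks = [data[i::4] for i in range(4)]  # one chunk per hypothetical processor

# Shared-memory style: every worker updates the same variable,
# so concurrent access has to be protected by a lock.
shared = {"total": 0}
lock = threading.Lock()

def shared_worker(chunk):
    partial = sum(chunk)
    with lock:
        shared["total"] += partial

# Message-passing style: each worker touches only its own chunk and
# sends its partial result as a message to the collecting side.
mailbox = queue.Queue()

def message_worker(chunk):
    mailbox.put(sum(chunk))              # "send"

def run(worker):
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

run(shared_worker)
run(message_worker)
message_total = sum(mailbox.get() for _ in chunks)  # "receive" and reduce
```

In the first variant the workers communicate implicitly through a common variable; in the second, data moves only through explicit messages.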

  10. Sequential vs vector processors. Sequential processors execute all instructions serially, from the first to the last one. Vector processors make use of the pipelining technique:
      - it is based on the overlapped (parallel) execution of several instructions that belong to the sequential algorithm;
      - it is similar to an assembly line: it does not reduce the execution time of a single instruction, but it increases the rate at which instructions are completed.

  11. Sequential vs vector processors. [Timing diagram for a 2.5 GHz processor (0.4 ns clock period); time axis from 0.0 ns to 3.2 ns, instructions listed in order. Sequential processor: IS 1 and IS 2 take 1.6 ns each and run one after the other. Vector processor (pipeline): IS 2 starts 0.4 ns after IS 1.]
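The numbers in the timing diagram can be turned into a rough estimate of the total execution time. The sketch below assumes an ideal pipeline (no stalls, one new instruction issued per clock period) and uses the values from the slide as defaults: 1.6 ns per instruction and a 0.4 ns clock period. The function names are hypothetical.

```python
def sequential_time(n, latency_ns=1.6):
    """Total time when each instruction starts only after the previous one ends."""
    return n * latency_ns

def pipelined_time(n, latency_ns=1.6, clock_ns=0.4):
    """Total time for an ideal pipeline: the first instruction needs the full
    latency, every following instruction completes one clock period later."""
    return latency_ns + (n - 1) * clock_ns

two_seq = sequential_time(2)      # the two instructions of the diagram
two_pipe = pipelined_time(2)
long_ratio = sequential_time(1000) / pipelined_time(1000)
```

The latency of a single instruction is unchanged, but for long instruction sequences the ratio approaches 1.6 / 0.4 = 4 under these assumptions.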

  12. Sequential vs vector processors: example.
      Sequential processor:
          DO i = 1, N
             A(i) = B(i) + C(i)
             B(i) = 2 * A(i+1)
          ENDDO
      Vector processor:
          temp(1:N) = A(2:N+1)
          A(1:N) = B(1:N) + C(1:N)
          B(1:N) = 2 * temp(1:N)
      The copy into temp is needed because, in the loop, B(i) reads A(i+1) before iteration i+1 overwrites it.
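The equivalence of the two forms can be checked with a short sketch. It is a translation into plain Python lists with 0-based indices, not the original Fortran; the function names and the test data are hypothetical. The third function shows what goes wrong if the temporary copy is omitted.

```python
def sequential_update(A, B, C):
    """Loop form: A has N+1 entries, B and C have N entries."""
    A, B = A[:], B[:]
    for i in range(len(B)):
        A[i] = B[i] + C[i]
        B[i] = 2 * A[i + 1]      # A[i+1] still holds its old value here
    return A, B

def vector_update(A, B, C):
    """Whole-array form with a temporary copy of the old A(2:N+1)."""
    A, B = A[:], B[:]
    N = len(B)
    temp = A[1:N + 1]                        # temp(1:N) = A(2:N+1)
    A[:N] = [b + c for b, c in zip(B, C)]    # A(1:N) = B(1:N) + C(1:N)
    B[:] = [2 * t for t in temp]             # B(1:N) = 2 * temp(1:N)
    return A, B

def vector_update_without_temp(A, B, C):
    """Incorrect variant: B is computed from the already updated A."""
    A, B = A[:], B[:]
    N = len(B)
    A[:N] = [b + c for b, c in zip(B, C)]
    B[:] = [2 * a for a in A[1:N + 1]]
    return A, B

A0, B0, C0 = [1, 2, 3, 4], [10, 20, 30], [1, 1, 1]
loop_result = sequential_update(A0, B0, C0)
vector_result = vector_update(A0, B0, C0)
wrong_result = vector_update_without_temp(A0, B0, C0)
```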

  13. Performance measure (section 2). Speedup. The speedup $S(p)$ measures the reduction of the computational time $t(p)$ obtained by using a total number of $p$ processors while keeping the size of the problem fixed.
      - Absolute speedup: the speedup is measured w.r.t. the best serial code, with computational time $t_{best}$: $S(p) = \frac{t_{best}}{t(p)}$. It is also called performance measure.
      - Relative speedup: the speedup is measured w.r.t. the same code run with $p = 1$: $S(p) = \frac{t(p=1)}{t(p)}$. It is also called scalability measure.
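As a minimal sketch, both definitions can be evaluated from wall-clock timings. The timings below are made-up illustrative values, not measurements, and the names `timings` and `t_best` are hypothetical.

```python
def speedup(t_reference, t_parallel):
    """Ratio of a reference time to the time measured on p processors."""
    return t_reference / t_parallel

# Hypothetical timings in seconds for a fixed problem size.
t_best = 90.0                                     # best available serial code
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}   # parallel code run on p processors

relative = {p: speedup(timings[1], t) for p, t in timings.items()}
absolute = {p: speedup(t_best, t) for p, t in timings.items()}
```

Since the best serial code is at least as fast as the parallel code run on one processor, the absolute speedup never exceeds the relative one in this setting.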

  14. Ideal speedup. In the ideal case, in which the work load is perfectly distributed among all processors, the relative speedup should be equal to the number of processors, $S(p) = p$. This is the case of linear speedup. In practice, linear speedup is almost never achieved, because of:
      - load balancing, which is not guaranteed;
      - portions of code which cannot be parallelized;
      - synchronization and communication times.
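The effect of load imbalance alone can be sketched as follows. The example assumes that execution time is proportional to the amount of work, that there is no serial portion and no communication, and that the parallel run ends when the most loaded processor finishes. The function name and the work distributions are hypothetical.

```python
def speedup_with_imbalance(work_per_processor):
    """Speedup when the parallel time is set by the most loaded processor."""
    serial_time = sum(work_per_processor)
    parallel_time = max(work_per_processor)
    return serial_time / parallel_time

balanced = speedup_with_imbalance([25, 25, 25, 25])     # perfect distribution
unbalanced = speedup_with_imbalance([40, 20, 20, 20])   # one overloaded processor
```

With four processors the balanced case reaches the linear value 4, while the unbalanced one stops at 2.5.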

  15. Superlinear speedup. Very rarely, one has $S(p) > p$. This is the case of superlinear speedup, which can occasionally be achieved for two reasons:
      - in a distributed memory system, the total amount of memory increases with the number of processors. Intermediate data and results can therefore be stored instead of being computed again, so the number of floating point operations can be lower than in an execution on fewer processors;
      - the portion of the problem assigned to one processor may be reduced to the point that it can be stored and managed entirely in the cache.

  16. Model of Flatt and Kennedy. The model qualitatively describes the speedup $S(p)$ as a function of $p$. Definitions:
      - $T_{ser}$: execution time of the serial portion of an algorithm;
      - $T_{par}$: execution time of the parallelizable portion of an algorithm;
      - $T_0(p)$: synchronization and communication time for $p$ processors.
      It holds
      $$T(1) = T_{ser} + T_{par}, \qquad T(p) = T_{ser} + \frac{T_{par}}{p} + T_0(p), \qquad S(p) = \frac{T_{ser} + T_{par}}{T_{ser} + \frac{T_{par}}{p} + T_0(p)}.$$
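The model translates directly into a few lines of code. In this sketch the overhead $T_0(p)$ is passed in as a function so that different assumptions can be tried; the function names and the numerical values are illustrative assumptions, not data from the lecture.

```python
def parallel_time(p, t_ser, t_par, overhead):
    """T(p) = T_ser + T_par / p + T_0(p)."""
    return t_ser + t_par / p + overhead(p)

def model_speedup(p, t_ser, t_par, overhead):
    """S(p) = (T_ser + T_par) / T(p)."""
    return (t_ser + t_par) / parallel_time(p, t_ser, t_par, overhead)

# Hypothetical workload: 10 time units serial, 90 time units parallelizable.
no_overhead = model_speedup(10, 10.0, 90.0, lambda p: 0.0)
linear_overhead = model_speedup(10, 10.0, 90.0, lambda p: 0.1 * p)
```

With zero overhead the speedup on 10 processors is 100/19, about 5.26, and it can never exceed $(T_{ser} + T_{par}) / T_{ser} = 10$; any positive overhead lowers it further.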

  17. Model of Flatt and Kennedy. By considering that the communication time is a linear function of $p$, that is $T_0(p) = K p$, the speedup becomes
      $$S(p) = \frac{T_{ser} + T_{par}}{T_{ser} + \frac{T_{par}}{p} + T_0(p)} = \frac{(T_{ser} + T_{par})\, p}{T_{ser}\, p + T_{par} + K p^2}.$$
      It follows that
      $$\lim_{p \to \infty} S(p) = 0.$$
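The limit can be justified in one line, assuming $K > 0$ and non-negative $T_{ser}$ and $T_{par}$: dropping the non-negative terms $T_{ser}$ and $T_{par}/p$ from the denominator can only increase the fraction, which gives an upper bound that vanishes as $p$ grows.

```latex
\[
S(p) = \frac{T_{ser} + T_{par}}{T_{ser} + \frac{T_{par}}{p} + K p}
     \le \frac{T_{ser} + T_{par}}{K p}
     \xrightarrow[p \to \infty]{} 0 .
\]
```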

  18. Model of Flatt and Kennedy. The speedup function:
      - is initially linear;
      - exhibits a saturation point;
      - decreases as the communication cost increases.
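For the linear overhead $T_0(p) = K p$ the saturation point can be located explicitly: the derivative of $S(p)$ is proportional to $T_{par} - K p^2$, so it vanishes at $p = \sqrt{T_{par}/K}$. The sketch below checks this against a brute-force search over integer $p$. The values of `T_SER`, `T_PAR` and `K` are hypothetical.

```python
import math

T_SER, T_PAR, K = 10.0, 90.0, 0.1   # hypothetical model parameters

def speedup(p):
    """S(p) for the linear overhead T_0(p) = K * p."""
    return (T_SER + T_PAR) * p / (T_SER * p + T_PAR + K * p * p)

# Analytical saturation point: dS/dp = 0  <=>  T_par - K p^2 = 0.
p_sat = math.sqrt(T_PAR / K)

# Numerical check over integer processor counts.
best_p = max(range(1, 201), key=speedup)
```

With these parameters the maximum speedup is 6.25, reached at 30 processors; beyond that point adding processors slows the computation down.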

  19. Efficiency. Efficiency is defined as the ratio $E(p) = \frac{S(p)}{p}$.
      - If $S(p)$ is linear, then $E(p) = 1$.
      - In practice, $E(p) = 1$ if $p = 1$ and $E(p) < 1$ if $p > 1$.
      N.B.: the further the efficiency is from 1, the worse the parallel computational resources are exploited.
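A minimal sketch of the definition, applied to made-up speedup values (the dictionary `measured_speedup` is hypothetical and does not come from the lecture):

```python
def efficiency(speedup, p):
    """E(p) = S(p) / p."""
    return speedup / p

# Hypothetical relative speedups for p = 1, 2, 4, 8 processors.
measured_speedup = {1: 1.0, 2: 1.9, 4: 3.5, 8: 6.0}
measured_efficiency = {p: efficiency(s, p) for p, s in measured_speedup.items()}
```

Here the efficiency drops from 1 on a single processor to 0.75 on eight, even though the speedup keeps growing.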

  20. Optimization of parallel computational resources (section 3). Optimize the number of processors.
      - Speedup: the optimal number of processors is the one at which the saturation point is reached.
      - Efficiency: the optimal number of processors is the one with $E(p) = 1$, that is $p = 1$.
      At the saturation point the speedup is maximum but the efficiency is low.
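The conflict between the two criteria can be made concrete with the Flatt and Kennedy model and a linear overhead $T_0(p) = K p$. The parameter values are hypothetical; with them the saturation point lies at $p = \sqrt{T_{par}/K} = 30$.

```python
T_SER, T_PAR, K = 10.0, 90.0, 0.1   # hypothetical model parameters

def speedup(p):
    """S(p) for the linear overhead T_0(p) = K * p."""
    return (T_SER + T_PAR) / (T_SER + T_PAR / p + K * p)

def efficiency(p):
    """E(p) = S(p) / p."""
    return speedup(p) / p

# Compare a moderate processor count with the saturation point p = 30.
s_5, e_5 = speedup(5), efficiency(5)
s_30, e_30 = speedup(30), efficiency(30)
```

Under these assumptions, going from 5 to 30 processors raises the speedup from about 3.5 to 6.25, while the efficiency falls from about 0.70 to about 0.21.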

  21. Function of Kuck. The function of Kuck $K(p)$ is used to measure the efficiency of a parallelization in terms of the number of processors $p$:
      $$K(p) = S(p)\, E(p), \qquad p^* = \arg\max_p K(p).$$
      The optimum $p^*$ gives good speedup and good efficiency simultaneously.
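As a sketch, $p^*$ can be found by brute force over integer processor counts. The example again assumes the Flatt and Kennedy model with a linear overhead; the overhead constant is called `K_COMM` here to avoid a clash with the function of Kuck, and all parameter values are hypothetical.

```python
T_SER, T_PAR, K_COMM = 10.0, 90.0, 0.1   # hypothetical model parameters

def speedup(p):
    """S(p) with overhead T_0(p) = K_COMM * p."""
    return (T_SER + T_PAR) / (T_SER + T_PAR / p + K_COMM * p)

def kuck(p):
    """K(p) = S(p) * E(p) = S(p)^2 / p."""
    return speedup(p) ** 2 / p

p_star = max(range(1, 201), key=kuck)      # arg max of K(p)
p_sat = max(range(1, 201), key=speedup)    # saturation point of S(p)
```

With these parameters the search gives $p^* = 7$, well below the saturation point at 30 processors, which reflects the compromise between speedup and efficiency.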
