Granularity - size of parallelisable (independent) parts of the program between synchronization points

Fine-grain parallelism
● Example: ILP - Instruction-Level Parallelism
● Relatively small amounts of computational work are done between communication/synchronisation events
● Low computation to communication ratio
● Easier to load-balance
● Easier to automate

Coarse-grain parallelism
● Relatively large amounts of computational work are done between communication/synchronisation events
● High computation to communication ratio
● Implies more opportunity for performance increase
● Harder to load balance efficiently
● More difficult to automate

Embarrassingly parallel
● Coarsest granularity one can imagine
● Example: Monte-Carlo method
● Example: processing a huge set of images in parallel
Instruction Level Parallelism (ILP)
- a measure of how many of the instructions in a computer program can be executed simultaneously

Example:
    e = a + b
    f = c + d
    g = e * f
- In how many units of time can this be done?

Parallel instructions are a set of instructions that do not depend on each other when executed.
ILP is a challenge for compilers and processor designers.

Levels of parallelism:
● Bit-level parallelism
  ○ 16-bit add on an 8-bit CPU
● Instruction-level parallelism
● Loop-level parallelism
    for i in range(10000):
        x[i] = x[i] + y[i]
● Thread-level parallelism
  ○ Multi-core CPUs

ILP allows the compiler and the processor to use:
● pipelining
● superscalar execution
● out-of-order execution
● register renaming
● speculative execution
● branch prediction
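A minimal sketch (not from the slides) of the dependence structure of the three-instruction example: e and f are independent, so with two functional units they can execute in the same time step, while g must wait for both.

a, b, c, d = 1, 2, 3, 4

# time step 1: two independent instructions can issue together on two units
e = a + b      # unit 1
f = c + d      # unit 2 (no dependence on e)

# time step 2: g depends on both e and f, so it must come after them
g = e * f
print(g)       # 21 -- three instructions, but only two units of time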
Micro-architectural techniques of ILP
❖ Instruction pipelining (similar to car production lines)
  ➢ performing different sub-operations with moving objects

Basic five-stage pipeline in a RISC machine:
● IF = Instruction Fetch
● ID = Instruction Decode
● EX = Execute
● MEM = Memory access
● WB = Register write back

In the fourth clock cycle (the green column), the earliest instruction is in the MEM stage, and the latest instruction has not yet entered the pipeline.
Micro-architectural techniques of ILP
➢ Superscalar CPU architecture
  ○ Implements ILP inside a single CPU
  ○ More than one instruction per clock cycle
  ○ Dispatches multiple instructions to multiple redundant functional units inside the processor
    ■ Each separate functional unit is not a separate CPU core but an execution resource inside the CPU, like:
      ● arithmetic logic unit
      ● floating point unit
      ● a bit shifter
      ● multiplier

Example: fetching and dispatching 2 instructions at a time.
Micro-architectural techniques of ILP
➢ Out-of-Order execution
  ○ Technique used in most high-performance CPUs
  ○ The key technique - allow the processor to avoid a certain class of delays occurring due to unavailability of operation data

Processing of instructions is broken into these steps:
➔ Instruction fetch
➔ Instruction dispatch to an instruction queue (also called instruction buffer)
➔ The instruction waits in the queue until its input operands are available
➔ The instruction is issued to the appropriate functional unit and executed there
➔ The results are queued (re-order buffer)
➔ The results are written back to registers
Micro-architectural techniques of ILP
➢ Register renaming
  ○ technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations; used to enable out-of-order execution
  ○ technique that abstracts logical registers from physical registers

Before renaming:       After renaming:
r1 = m[1024]           r1 = m[1024]
r1 = r1 + 2            r1 = r1 + 2
m[1032] = r1           m[1032] = r1
r1 = m[2048]           r2 = m[2048]
r1 = r1 + 4            r2 = r2 + 4
m[2056] = r1           m[2056] = r2
Micro-architectural techniques of ILP
➢ Speculative execution
  ○ allows the execution of complete instructions or parts of instructions before being certain whether this execution should take place
    ■ control flow speculation
    ■ value prediction
    ■ memory dependence prediction
    ■ cache latency prediction
  ○ Eager execution
    → execute both (all) possible scenarios
  ○ Predictive execution
    → predict the most likely scenario!
Micro-architectural techniques of ILP
➢ Branch prediction
  ○ used to avoid stalling while waiting for control dependencies to be resolved
  ○ used together with speculative execution
● Static branch prediction
● Dynamic branch prediction
● Random branch prediction
● One-level branch prediction
● Two-level branch prediction
  ○ Pattern History Table
Memory performance

The memory system, not processor speed, often appears as the bottleneck. Two main parameters:
● latency - time that it takes from request to delivery from the memory
● bandwidth - amount of data flow per timestep (between the processor and memory)

Example: Memory latency
● Simplified architecture: 1 GHz processor
● memory fetching time 100 ns (no cache), blocksize 1 word
● 2 multiply-add units, 4 multiplication or addition operations per clock cycle
  ⇒ peak performance – 4 Gflops
● Suppose we need to perform a dot product. Each memory access takes 100 clock cycles
● ⇒ 1 flop per 100 ns
● ⇒ actual performance? - Only 10 Mflops!
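A quick back-of-the-envelope check of the numbers above (a sketch; the 100 ns latency and 4 Gflops peak are the slide's assumed values):

peak_flops = 4e9          # 2 multiply-add units, 4 ops per 1 GHz clock cycle
mem_latency = 100e-9      # 100 ns to fetch one word (no cache)

# Dot product: every pair of fetched words (x[i], y[i]) yields 2 flops,
# i.e. on average 1 flop per memory access, and each access costs 100 ns.
achieved = 1 / mem_latency                    # = 1e7 flops/s
print(f"{achieved/1e6:.0f} Mflops achieved vs {peak_flops/1e9:.0f} Gflops peak")
# -> 10 Mflops achieved vs 4 Gflops peak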
Example: BLAS (Basic Linear Algebra Subroutines)

Motivation
Consider an arbitrary algorithm. Denote
● f - # flops
● m - # memory references
and introduce q = f/m.

Why is this number important? Denote
● t_f - time spent on 1 flop
● t_m - time for a memory access.
In general t_m ⋙ t_f, and the total time is roughly f·t_f + m·t_m = f·t_f (1 + (t_m/t_f)(1/q)), so the total time reflects the processor speed only if q is large.

(Figure: memory hierarchy - registers, cache, RAM, storage, network storage/clouds; data moves up the hierarchy before operations and back down after operations.)
Example: Gauss elimination method

For each i the key operations are (1) and (2):

for i in range(n):
    A(i+1:n,i) = A(i+1:n,i)/A(i,i)                               # op. (1)
    A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n)      # op. (2)

Operation (1) is of type y = a*x + y  (saxpy):
● m = 3n + 1 memory references:
  ○ 2n + 1 reads
  ○ n writes
● Computations take f = 2n flops
● ⇒ q = 2n/(3n + 1) ≈ 2/3 for large n
● 1st order operation: O(n) flops

Operation (2) is of type A = A − v wᵀ, A ∈ ℝⁿˣⁿ, v, w ∈ ℝⁿ  (rank-one update):
● m = 2n² + 2n memory references:
  ○ n² + 2n reads
  ○ n² writes
● Computations take f = 2n² flops
● ⇒ q = 2n²/(2n² + 2n) ≈ 1 for large n
● 2nd order operation: O(n²) flops

But q = O(1) in both cases!
Example of 3rd order operation

Faster results in case of 3rd order operations (O(n³) operations with O(n²) memory references). For example, matrix multiplication:
  C = AB + C,  where A, B, C ∈ ℝⁿˣⁿ.
Here m = 4n² and f = n²(2n − 1) + n² = 2n³
⇒ q = n/2 → ∞ as n → ∞.
This operation can make the processor work near peak performance, with good algorithm scheduling!
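A rough way to observe the effect of q in practice is to compare a level-1 operation with a level-3 operation on the same amount of data (a sketch using NumPy; the absolute numbers depend entirely on the machine and its BLAS):

import time
import numpy as np

n = 2000
x = np.random.rand(n * n)          # same data volume for both tests
y = np.random.rand(n * n)
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = np.zeros((n, n))

t0 = time.perf_counter()
y = 2.0 * x + y                    # level-1 (axpy): 2*n^2 flops, q = O(1)
t1 = time.perf_counter()
C = A @ B + C                      # level-3 (gemm): ~2*n^3 flops, q = O(n)
t2 = time.perf_counter()

axpy_rate = 2 * n * n / (t1 - t0)
gemm_rate = 2 * n**3 / (t2 - t1)
print(f"axpy: {axpy_rate/1e9:.2f} Gflop/s, gemm: {gemm_rate/1e9:.2f} Gflop/s")
# gemm typically achieves a much higher flop rate, because its data is reused.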
BLAS implementations
● BLAS – standard library for simple 1st, 2nd and 3rd order operations
  ○ reference BLAS – freeware, available for example from netlib (http://www.netlib.org/blas/)
  ○ Processor vendors often supply their own implementation
  ○ ATLAS (http://math-atlas.sourceforge.net/) – self-optimising BLAS implementation
  ○ OpenBLAS – supporting x86, x86-64, MIPS and ARM processors

Examples of using BLAS (Fortran90):
● LU factorisation using BLAS3 operations: http://www.ut.ee/~eero/SC/konspekt/Naited/lu1blas3.f90.html
● main program for testing different BLAS levels: http://www.ut.ee/~eero/SC/konspekt/Naited/testblas3.f90.html
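BLAS can also be called from Python through SciPy's wrappers (a sketch, assuming SciPy is installed; scipy.linalg.blas exposes the underlying vendor BLAS routines):

import numpy as np
from scipy.linalg.blas import daxpy, dgemm

n = 4
x = np.random.rand(n)
y = np.random.rand(n)
A = np.asfortranarray(np.random.rand(n, n))   # BLAS prefers column-major storage
B = np.asfortranarray(np.random.rand(n, n))

y = daxpy(x, y, a=2.0)            # level-1: y <- 2.0*x + y
C = dgemm(alpha=1.0, a=A, b=B)    # level-3: C <- 1.0*A*B
print(C.shape)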
Memory latency hiding with the help of cache
● Cache - small but fast memory between the processor and main memory (DRAM)
● works as low latency, high throughput storage
● Cache helps to decrease real memory latency only if enough reuse of data is present
● part of data served by cache without main memory access – cache hit ratio
● often: hit ratio is a major factor for performance
● Access to the same data corresponds to temporal locality

Example: Cache effect
➢ Cache size: 32 KB with latency 1 ns
➢ operation C = AB, where A and B are 32 × 32 matrices (i.e. all matrices A, B and C fit simultaneously into cache)
➢ We observe that:
  ○ reading the 2 matrices into cache (meaning 2K words) takes ca 200 μs
  ○ multiplying two n × n matrices takes 2n³ operations ⇒ 64K operations, which can be performed in 16K cycles (4 ops/cycle)
➢ ⇒ total time (data access + calculations) = 200 μs + 16 μs, i.e. close to peak performance: 64K/216 or 303 Mflops
➢ In the given example: O(n²) data accesses and O(n³) computations. Such an asymptotic difference is very good in case of cache.

Data reuse is of critical importance for performance!
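A common way to exploit this reuse for matrices that do not fit in cache is loop blocking/tiling (a sketch; the block size 32 is just an illustrative choice):

import numpy as np

def blocked_matmul(A, B, block=32):
    """C = A @ B computed tile-by-tile so each tile is reused while it sits in cache."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Multiply one pair of tiles; the tiles stay cached while they
                # are reused across this innermost update.
                C[ii:ii+block, jj:jj+block] += (
                    A[ii:ii+block, kk:kk+block] @ B[kk:kk+block, jj:jj+block]
                )
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B), A @ B)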
Some trends in HPC concerning cache hierarchies In-class reading: Anant Vithal Nori, Jayesh Gaur, Siddharth Rai, Sreenivas Subramoney and Hong Wang, Criticality Aware Tiered Cache Hierarchy: A Fundamental Relook at Multi-level Cache Hierarchies , 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), DOI: 10.1109/ISCA.2018.00019 https://ieeexplore.ieee.org/abstract/document/8416821 Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin Laure and Stefano Markidis, Exploring the Performance Benefit of Hybrid Memory System on HPC Environments , 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), DOI: 10.1109/IPDPSW.2017.115 https://ieeexplore.ieee.org/abstract/document/7965110
Parallel Architectures

Flynn's taxonomy of parallel computers:

                               Single data stream    Multiple data streams
Single instruction stream      SISD                  SIMD
Multiple instruction streams   MISD                  MIMD

SISD - Single Instruction Single Data
● A serial (non-parallel) computer
● Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle
● Single Data: Only one data stream is being used as input during any one clock cycle
● Deterministic execution
● Examples: older generation mainframes, minicomputers and workstations, early laptops etc.
(https://computing.llnl.gov/tutorials/parallel_comp/#Flynn)
Flynn's taxonomy of parallel computers

SIMD - Single Instruction Multiple Data
● A type of parallel computer
● Single Instruction: All processing units execute the same instruction at any given clock cycle
● Multiple Data: Each processing unit can operate on a different data element
● Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing
● Synchronous (lockstep) and deterministic execution
● Two varieties: Processor Arrays and Vector Pipelines
  ○ Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
  ○ Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
● GPUs
(https://computing.llnl.gov/tutorials/parallel_comp/#Flynn)
Flynn's taxonomy of parallel computers

MISD - Multiple Instruction Single Data
● A type of parallel computer
● Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams
● Single Data: A single data stream is fed into multiple processing units
● Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
● Some conceivable uses might be:
  ○ multiple frequency filters operating on a single signal stream
  ○ multiple cryptography algorithms attempting to crack a single coded message
(Carnegie-Mellon C.mmp: https://en.wikipedia.org/wiki/C.mmp)
Flynn's taxonomy of parallel computers

MIMD - Multiple Instruction Multiple Data
● Parallel computer
● Multiple Instruction: Every processor may be executing a different instruction stream
● Multiple Data: Every processor may be working with a different data stream
● Execution can be synchronous or asynchronous, deterministic or nondeterministic
● Currently, the most common type of parallel computer
● Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs
● Note: many MIMD architectures also include SIMD execution sub-components
(https://computing.llnl.gov/tutorials/parallel_comp/#Flynn)
Flynn-Johnson classification

The MIMD class is further divided by memory organisation and communication model:

                       Shared variables                        Message passing
Global memory          GMSV: shared-memory multiprocessors     GMMP: rarely used
Distributed memory     DMSV: distributed shared memory         DMMP: distributed-memory multicomputers
Flynn-Johnson classification: Shared-memory multiprocessors (GMSV)
● Common shared memory
● Global address space
● Multiple processors operate independently, sharing the same memory resource
● Changes in memory state affect all processors
● Historically classified as UMA and NUMA (according to memory access times)

(Diagram: several (multicore) CPUs connected to a single shared memory.)
Flynn-Johnson classification: Shared-memory multiprocessors

Uniform Memory Access (UMA):
● Most commonly represented today by Symmetric Multiprocessor (SMP) machines
● Identical processors
● Equal access and access times to memory
● Sometimes called CC-UMA - Cache Coherent UMA
● Cache coherence:
  ○ if one processor updates a location in shared memory, all the other processors know about the update
  ○ implemented at the hardware level

(Diagram: CPUs symmetrically attached to one shared memory.)
Flynn-Johnson classification: Distributed shared memory (DMSV)

Non-Uniform Memory Access (NUMA):
● Often made by physically linking two or more SMPs
● One SMP can directly access memory of another SMP
● Not all processors have equal access time to all memories
● Memory access across the link is slower
● If cache coherency is maintained, then it may also be called CC-NUMA - Cache Coherent NUMA

(Diagram: SMP nodes, each with its own memory, connected by a bus interconnect.)

Advantages:
● Global address space provides a user-friendly programming perspective to memory
● Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs

Disadvantages:
● lack of scalability between memory and CPUs
● for cache coherent systems, geometrically increasing traffic associated with cache/memory management
● programmer responsibility for synchronization constructs that ensure "correct" access of global memory
Flynn-Johnson classification: Distributed memory architectures (DMMP)

Networked processors with their private memories:
● No global memory address space
● All processors independent
● Communication
  ○ via message-passing
  ○ explicit communication operations

Advantages:
● Memory is scalable with the number of processors
● No overhead accessing local memory
● Easy to use off-the-shelf components

Disadvantages:
● Communication has to be orchestrated by the programmer
● Mapping of data structures to the system can be complicated
● Access times vary a lot due to non-uniform memory

(Diagram: nodes, each with its own CPU and memory, connected by a network.)
Hybrid Distributed-Shared Memory
● Each node is a shared memory machine:
  ○ Memory components shared between CPUs and GPUs
● Nodes are connected into a distributed memory machine:
  ○ Network communication required to move data between the nodes

(Diagram: nodes containing CPUs and GPUs with shared memory, connected by a network.)

● Used in most high end computing systems nowadays
● All signs show that this will be the prevailing architecture type for the foreseeable future
In-class exercise 5
1. Look around the Top500 list and choose your favorite computer system there for a short review about the highlights of the system! In particular, also address the architectural aspects we have discussed so far during the course!
2. Post the short review to the course Piazza: piazza.com/ut.ee/spring2019/mtat08020
Analytical modeling of Parallel Systems
• Sources of Overhead in Parallel Programs
• Performance Metrics for Parallel Systems
• Amdahl's law
• Gustafson-Barsis law
• Effect of Granularity on Performance
• Scalability of Parallel Systems
• Minimum Execution Time and Minimum Cost-Optimal Execution Time
• Asymptotic Analysis of Parallel Programs
• Other Scalability Metrics

Based on: Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to Parallel Computing, Second Edition. Addison-Wesley, 2003
Analytical Modeling - Basics
• A sequential algorithm is evaluated by its runtime (in general, asymptotic runtime T(n) as a function of input size n)
  – T(n) = O(f(n)) if there exist positive constants c and n₀ such that T(n) ≤ c f(n) for all n ≥ n₀
  – T(n) = Ω(f(n)) if there exist positive constants c and n₀ such that T(n) ≥ c f(n) for all n ≥ n₀
  – T(n) = Θ(f(n)) if T(n) = O(f(n)) and T(n) = Ω(f(n))
• The asymptotic runtime is independent of the platform. The analysis is "up to a constant factor".
• A parallel algorithm is evaluated by its runtime as a function of
  – the input size
  – the number of processors
  – the communication parameters of the machine
• An algorithm must therefore be analyzed in the context of the underlying platform
Analytical Modeling - Basics
● A parallel system is a combination of a parallel algorithm and an underlying platform
● Wall clock time - the time from the start of the first processor to the stopping time of the last processor in a parallel ensemble
  ○ But how does this scale when the number of processors is changed or the program is ported to another machine altogether?
● How much faster is the parallel version?
  ○ This begs the obvious follow-up question - what is the baseline serial version with which we compare? Can we use a suboptimal serial program to make our parallel program look more attractive?
● Raw FLOP count - what good are FLOP counts when they don't solve a problem?
Sources of Overhead in Parallel Programs
• If I use two processors, shouldn't my program run twice as fast?
• No - a number of overheads, including wasted computation, communication, idling, and contention cause degradation in performance.

● Interprocess interactions: Processors working on any non-trivial parallel problem will need to talk to each other.
● Dependencies: Computation depends on results from other processes.
● Idling: Processes may idle because of load imbalance, synchronization, or serial components.
● Load imbalance: Due to algorithm or system.
● Excess Computation: This is computation not performed by the serial version. This might be because the serial algorithm is difficult to parallelize, or that some computations are repeated across processors to minimize communication.
Performance Metrics for Parallel Systems: Execution Time, Speedup
• Execution time = time elapsed between the beginning and the end of execution:
  ➢ on a sequential computer - beginning and end on that computer
  ➢ on a parallel computer - beginning on the first processor and end on the last processor
• We denote the serial runtime by T_S and the parallel runtime by T_P. Let T_all be the total time collectively spent by all the processing elements:
  T_all = p T_P   (p is the number of processors)
• Observe that T_all − T_S is then the total time spent by all processors combined in non-useful work.
  ➢ This is called the total overhead.
• The overhead function T_o is therefore given by T_o = p T_P − T_S
• Speedup (S) is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements, i.e. S = T_S / T_P.
Performance Metrics: Example
• Consider the problem of adding n numbers by using n processing elements.
• If n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processors.

(Figure: computing the global sum of 16 partial sums using 16 processing elements; Σ_i^j denotes the sum of the numbers with consecutive labels from i to j.)
Performance Metrics: Example (continued)
• If an addition takes constant time, say t_c, and communication of a single word takes time t_s + t_w, we have the parallel time T_P = Θ(log n)
• We know that T_S = Θ(n)
• Speedup S is given by S = Θ(n / log n)
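A compact way to see the log n structure (a sketch that simulates the tree reduction sequentially; each outer iteration corresponds to one parallel step):

import math

def tree_sum(values):
    """Simulate the binary-tree reduction; len(values) must be a power of two."""
    vals = list(values)
    steps = 0
    while len(vals) > 1:
        # In one parallel step every pair of neighbouring partial sums is added,
        # so the number of active partial sums halves.
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
        steps += 1
    return vals[0], steps

total, steps = tree_sum(range(16))
print(total, steps)                      # 120, 4  (= log2(16) parallel steps)
assert steps == int(math.log2(16))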
Performance Metrics: Speedup • For a given problem, there might be many serial algorithms available. These algorithms may have different asymptotic runtimes and may be parallelizable to different degrees. • For the purpose of computing speedup, we always consider the best sequential program as the baseline.
Performance Metrics: Speedup Example • Consider the problem of parallel bubble sort. • Suppose serial time for bubblesort is 52 seconds. • The parallel time for odd-even sort (efficient parallelization of bubble sort) is 14 seconds. • The speedup would appear to be 52/14 = 3.71 • But is this really a fair assessment of the system? • What if serial quicksort only took 12 seconds? ○ In this case, the speedup is 12/14 = 0.86. This is a more realistic assessment of the system.
Performance Metrics: Speedup Bounds
• Speedup can be as low as 0 (the parallel program never terminates).
• Speedup, in theory, should be upper bounded by p - after all, we can only expect a p-fold speedup if we use p times as many resources.
• A speedup greater than p is possible only if each processing element spends less than time T_S / p solving the problem.
• In this case, a single processor could be timesliced to achieve a faster serial program, which contradicts our assumption of the fastest serial program as the basis for speedup.
Performance Metrics: Superlinear Speedups One reason for superlinearity is that the parallel version does less work than corresponding serial algorithm. Searching an unstructured tree for a node with a given label, `S', on two processing elements using depth-first traversal. The two-processor version with processor 0 searching the left subtree and processor 1 searching the right subtree expands only the shaded nodes before the solution is found. The corresponding serial formulation expands the entire tree. It is clear that the serial algorithm does more work than the parallel algorithm.
Performance Metrics: Superlinear Speedups Resource-based superlinearity: The higher aggregate cache/memory bandwidth can result in better cache-hit ratios, and therefore superlinearity. Example: A processor with 64KB of cache yields an 80% hit ratio. If two processors are used, since the problem size/processor is smaller, the hit ratio goes up to 90%. Of the remaining 10% access, 8% come from local memory and 2% from remote memory. If DRAM access time is 100 ns, cache access time is 2 ns, and remote memory access time is 400ns, this corresponds to a speedup of 2.43!
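The 2.43 figure can be verified with a short calculation (a sketch; it assumes, as the slide does, that run time is proportional to the average memory-access time):

cache, dram, remote = 2.0, 100.0, 400.0      # access times in ns (slide's assumptions)

# One processor: 80% of accesses hit the 64KB cache, 20% go to DRAM.
t1 = 0.80 * cache + 0.20 * dram              # = 21.6 ns average access time

# Two processors: 90% hit, 8% local DRAM, 2% remote memory.
t2 = 0.90 * cache + 0.08 * dram + 0.02 * remote   # = 17.8 ns

speedup = 2 * t1 / t2                        # two processors, each faster per access
print(round(speedup, 2))                     # -> 2.43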
Performance Metrics: Efficiency
• Efficiency is a measure of the fraction of time for which a processing element is usefully employed.
• Mathematically, it is given by
  E = S / p = T_S / (p T_P)
• Following the bounds on speedup, efficiency can be as low as 0 and as high as 1.
Performance Metrics: Efficiency Example
• The speedup of adding n numbers on n processing elements is given by
  S = Θ(n / log n)
• Efficiency is given by
  E = S / n = Θ(1 / log n)
Parallel Time, Speedup, and Efficiency Example
Consider the problem of edge-detection in images. The problem requires us to apply a 3 x 3 template to each pixel. If each multiply-add operation takes time t_c, the serial time for an n x n image is given by T_S = 9 t_c n².

(Figure: example of edge detection: (a) an 8 x 8 image; (b) typical templates for detecting edges; and (c) partitioning of the image across four processors, with shaded regions indicating image data that must be communicated from neighboring processors to processor 1.)
Parallel Time, Speedup, and Efficiency Example (continued)
• One possible parallelization partitions the image equally into vertical segments, each with n²/p pixels.
• The boundary of each segment is 2n pixels. This is also the number of pixel values that will have to be communicated. This takes time 2(t_s + t_w n).
• Templates may now be applied to all n²/p pixels in time 9 t_c n²/p.
Parallel Time, Speedup, and Efficiency Example (continued)
• The total time for the algorithm is therefore given by:
  T_P = 9 t_c n²/p + 2(t_s + t_w n)
• The corresponding values of speedup and efficiency are given by:
  S = T_S / T_P = 9 t_c n² / (9 t_c n²/p + 2(t_s + t_w n))
and
  E = S / p = 1 / (1 + 2p(t_s + t_w n) / (9 t_c n²))
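A small evaluation of this performance model (a sketch; the parameter values t_c, t_s, t_w below are made up purely for illustration):

def edge_detect_model(n, p, tc=1e-9, ts=1e-6, tw=1e-8):
    """Predicted serial time, parallel time, speedup and efficiency for the
    vertical-strip partitioning of the 3x3 edge-detection stencil."""
    Ts = 9 * tc * n**2
    Tp = 9 * tc * n**2 / p + 2 * (ts + tw * n)
    S = Ts / Tp
    E = S / p
    return Ts, Tp, S, E

for p in (4, 16, 64, 256):
    _, _, S, E = edge_detect_model(n=4096, p=p)
    print(f"p={p:4d}  S={S:7.1f}  E={E:.2f}")
# Speedup grows with p but efficiency drops as communication starts to dominate.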
Cost of a Parallel System • Cost is the product of parallel runtime and the number of processing elements used ( p x T P ). • Cost reflects the sum of the time that each processing element spends solving the problem. • A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer is asymptotically identical to serial cost. • Since E = T S / p T P , for cost optimal systems, E = O (1). • Cost is sometimes referred to as work or processor-time product
Cost of a Parallel System: Example
Consider the problem of adding n numbers on n processors.
• We have T_P = log n (for p = n).
• The cost of this system is given by p T_P = n log n.
• Since the serial runtime of this operation is Θ(n), the algorithm is not cost optimal.
Impact of Non-Cost Optimality
• Consider a sorting algorithm that uses n processing elements to sort the list in time (log n)².
• Since the serial runtime of a (comparison-based) sort is n log n, the speedup and efficiency of this algorithm are given by n / log n and 1 / log n, respectively.
• The p T_P product of this algorithm is n (log n)².
• This algorithm is not cost optimal, but only by a factor of log n.
• If p < n, assigning n tasks to p processors gives T_P = n (log n)² / p.
• The corresponding speedup of this formulation is p / log n.
• This speedup goes down as the problem size n is increased for a given p!
Effect of Granularity on Performance
• Often, using fewer processors improves performance of parallel systems.
• Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system.
• A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to the scaled-down processors.
• Since the number of processing elements decreases by a factor of n/p, the computation at each processing element increases by a factor of n/p.
• The communication cost should not increase by this factor, since some of the virtual processors assigned to a physical processor might talk to each other. This is the basic reason for the improvement from building granularity.
Building Granularity: Example • Consider the problem of adding n numbers on p processing elements such that p < n and both n and p are powers of 2 • Use the parallel algorithm for n processors, except, in this case, we think of them as virtual processors • Each of the p processors is now assigned n / p virtual processors; ○ virtual processing element i is simulated by the physical processing element labeled i mod p • The first log p of the log n steps of the original algorithm are simulated in ( n / p ) log p steps on p processing elements. • Subsequent log n - log p steps do not require any communication.
Building Granularity: Example (continued) • The overall parallel execution time of this parallel system is Θ ( ( n / p ) log p ). • The cost is Θ ( n log p ), which is asymptotically higher than the Θ ( n ) cost of adding n numbers sequentially. Therefore, the parallel system is not cost-optimal.
Building Granularity: Example (continued)
Can we build granularity in the example in a cost-optimal fashion?
• Each processing element locally adds its n/p numbers in time Θ(n/p).
• The p partial sums on p processing elements can then be added in time Θ(log p).

(Figure: a cost-optimal way of computing the sum of 16 numbers using four processing elements.)
Building Granularity: Example (continued) • The parallel runtime of this algorithm is T P = Θ ( n / p + log p ), • The cost is Θ ( n + p log p ) • This is cost-optimal, so long as n = Ω( p log p ) !
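A sketch of the cost-optimal scheme (simulated sequentially; the two phases correspond to the Θ(n/p) local work and the Θ(log p) tree steps):

def cost_optimal_sum(numbers, p):
    """Phase 1: each of the p 'processors' sums its own n/p chunk locally.
    Phase 2: the p partial sums are combined by a binary-tree reduction."""
    n = len(numbers)
    chunk = n // p
    partial = [sum(numbers[i * chunk:(i + 1) * chunk]) for i in range(p)]  # Θ(n/p) each

    while len(partial) > 1:                       # Θ(log p) parallel steps
        partial = [partial[i] + partial[i + 1] for i in range(0, len(partial), 2)]
    return partial[0]

print(cost_optimal_sum(list(range(16)), p=4))     # -> 120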
Scaling Characteristics of Parallel Programs
• The efficiency of a parallel program can be written as
  E = S/p = T_S/(p T_P)
or
  E = 1/(1 + T_o/T_S).
• The total overhead function T_o is an increasing function of p.
• For a given problem size (i.e., the value of T_S remains constant), as we increase the number of processing elements, T_o increases.
• The overall efficiency of the parallel program goes down. This is the case for all parallel programs.
Scaling Characteristics of Parallel Programs: Example
• Consider the problem of adding n numbers on p processing elements.
• We have seen that (counting one time unit per addition and per communication step):
  T_P = n/p + 2 log p
  S = n / (n/p + 2 log p)
  E = 1 / (1 + (2p log p)/n)
Scaling Characteristics of Parallel Programs: Example (continued)
Plotting the speedup for various input sizes gives us:

(Figure: speedup versus the number of processing elements for adding a list of numbers.)

Speedup tends to saturate and efficiency drops as a consequence of Amdahl's law.
In-class exercise 6: ● Search for best parallel programming languages ● Choose one of them (your favorite!) and post a brief review to Course Piazza! One possible starting point can be: https://www.slant.co/topics/6024/~programming-languages-for-concurrent-programming
Amdahl's law

In each algorithm there exist parts that cannot be parallelised.
● Let σ (0 < σ ≤ 1) denote the sequential part
● Assume that the rest, 1 − σ, is parallelised optimally
● Then, in the best case:
    S(N, P) ≤ 1 / (σ + (1 − σ)/P)

Example 1. Assume 5% of the algorithm is not parallelisable (i.e. σ = 0.05) =>

    P      max S(N, P)
    2      1.9
    4      3.5
    10     6.9
    20     10.3
    100    16.8
    ∞      20

Example 2. σ = 0.67 (33% parallelisable), P = 10:
    S(N, P) ≤ 1 / (0.67 + 0.33/10) ≈ 1.42
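The table above can be reproduced with a few lines (a sketch):

def amdahl(sigma, P):
    """Upper bound on speedup with sequential fraction sigma on P processors."""
    return 1.0 / (sigma + (1.0 - sigma) / P)

for P in (2, 4, 10, 20, 100):
    print(P, round(amdahl(0.05, P), 1))       # 1.9, 3.5, 6.9, 10.3, 16.8
print("P -> inf:", 1 / 0.05)                  # 20.0
print("Example 2:", round(amdahl(0.67, 10), 2))   # 1.42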
Gustafson-Barsis' law

John Gustafson & Ed Barsis (Sandia Laboratory), 1988:
• 1024-processor nCube/10; claimed they beat Amdahl's law!
• Their σ ≈ 0.004...0.008
• but got S ≈ 1000
• (According to Amdahl's law, S could be at most 125...250)
How was it possible? Does Amdahl's law hold?

● Mathematically – yes. But in practice it is not a very good idea to solve a problem with fixed size N on whatever number of processors!
● In general, σ = σ(N) ≠ const
● Usually, σ decreases with N growing!
● An algorithm is said to be effectively parallel if σ → 0 as N → ∞

Scaled efficiency (to avoid misunderstandings:)
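Gustafson's scaled speedup S = P − σ(P − 1) explains the observation (a sketch; σ here is the serial fraction measured on the parallel run):

def gustafson(sigma, P):
    """Scaled speedup: the parallel part grows with P while the serial part stays fixed."""
    return P - sigma * (P - 1)

P = 1024
for sigma in (0.004, 0.008):
    print(sigma, round(gustafson(sigma, P), 1))
# -> 0.004 1019.9  and  0.008 1015.8, i.e. S ≈ 1000 as reported,
# while Amdahl's fixed-size bound 1/sigma would only allow 250 or 125.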
Scaled efficiency
Problem size is increased proportionally when adding new processors – does the time remain the same?
• 0 < E_S(N, P) ≤ 1
• If E_S(N, P) = 1 – linear speedup
Methods to increase efficiency

Factors influencing efficiency:
● communication time
● waiting time
● additional computations
● changing/improving the algorithm

Overlapping communication and computations
● Example: parallel Ax-operation for sparse matrices in Domain Decomposition setups
  ○ Matrix partitioned and divided between processors
  ○ Start communication (non-blocking); do calculations at the inside parts of the region => economy in waiting times

Extra computations instead of communication
● Computations in place instead of importing the results over the network
● Sometimes it pays off!
  Example: random number generation. Broadcast only the seed and generate in parallel (deterministic algorithm)

Profiling parallel programs
● MPE - jumpshot, LMPI, MpiP
● Valgrind, Totalview, Vampir, Allinea OPT
● Linux - gprof (compiler switch -pg)
● SUN - prof, gprof, prism
● Many other commercial applications
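A minimal sketch of overlapping communication with computation using mpi4py (assuming mpi4py is installed; run with e.g. mpirun -np 2 python overlap.py):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

boundary = np.full(100, rank, dtype='d')       # values to exchange with the neighbour
halo = np.empty(100, dtype='d')
interior = np.random.rand(100000)

other = 1 - rank                               # assumes exactly 2 processes
req_send = comm.Isend(boundary, dest=other)    # start non-blocking communication
req_recv = comm.Irecv(halo, source=other)

interior_sum = interior.sum()                  # compute on interior data meanwhile

MPI.Request.Waitall([req_send, req_recv])      # wait before touching halo data
print(rank, interior_sum + halo.sum())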
Computer Benchmarks Some HPC (High Performance Computing) benchmarks: ● HPL (High Performance Linpack) ● NPB (NAS Parallel Benchmarks) ● HINT (Hierarchical INTegration) ● Perf ● IOzone ● Graph 500
Linpack
Jack Dongarra. HPL - High Performance Linpack, using MPI and BLAS. Solving systems of linear equations with dense matrices. The aim is to fit a problem of maximal size (advisably, utilising 80% of memory).
http://www.netlib.org/benchmark/hpl/
Used for http://www.top500.org
● R_max - maximal achieved performance in Gflops
● R_peak - peak performance in Gflops
● N - size of the matrix giving the maximal performance (usually < 80% of memory size)
● NB - blocksize. In general, the smaller the better, but usually in the range 32...256.
Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) [1] https://www.nas.nasa.gov/publications/npb.html (1992 v1, 1996 v2.1, 2002 v2.2) The original eight benchmarks specified in NPB 1 mimic the computation and data movement in CFD applications: ● five kernels ○ IS - Integer Sort, random memory access ○ EP - Embarrassingly Parallel ○ CG - Conjugate Gradient, irregular memory access and communication ○ MG - Multi-Grid on a sequence of meshes, long- and short-distance communication, memory intensive ○ FT - discrete 3D fast Fourier Transform, all-to-all communication ● three pseudo applications ○ BT - Block Tri-diagonal solver ○ SP - Scalar Penta-diagonal solver ○ LU - Lower-Upper Gauss-Seidel solver
Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) [2] Multi-zone versions of NPB (NPB-MZ) are designed to exploit multiple levels of parallelism in applications and to test the effectiveness of multi-level and hybrid parallelization paradigms and tools. ● Three types of benchmark problems derived from single-zone pseudo applications of NPB: ○ BT-MZ - uneven-size zones within a problem class, increased number of zones as problem class grows ○ SP-MZ - even-size zones within a problem class, increased number of zones as problem class grows ○ LU-MZ - even-size zones within a problem class, a fixed number of zones for all problem classes
Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) [3] ● NPB 3: Benchmarks for unstructured computation, parallel I/O, and data movement ○ UA - Unstructured Adaptive mesh, dynamic and irregular memory access ○ BT-IO - test of different parallel I/O techniques ○ DC - Data Cube ○ DT - Data Traffic ● GridNPB is designed specifically to rate the performance of computational grids. Each of the four benchmarks in the set consists of a collection of communicating tasks derived from the NPB. They symbolize distributed applications typically run on grids. ○ ED - Embarrassingly Distributed ○ HC - Helical Chain ○ VP - Visualization Pipeline ○ MB - Mixed Bag
HINT benchmark The HINT (Hierarchical INTegration). Graphical view of: ● floating point performance ● integer operation performance ● performances with different memory hierarchies https://web.archive.org/web/20130724124556/http://hint.byu.edu/ Some other benchmarks ● Perf ● IOzone ● etc.
In-class exercise 7 A) Find out at OpenBenchmarking.org a) What are currently the most popular benchmarks related to parallel computing? b) Find a benchmark that you like the most and describe it! c) Any other interesting aspect you find fascinating? B) Find out about top500 vs graph500 a) What is their difference? b) What inspired the creation of graph500? c) How different are these lists? d) Some other interesting aspect you notice when comparing the two benchmarks?
Applications with the need for parallel computing
● Embarrassingly parallel applications
  ○ Data mining
  ○ Molecular dynamics
  ○ Cryptographic algorithms
  ○ etc.
● Applications depending on good ways of communication and synchronisation
  ○ Integral equations
    ■ Numerical solution usually implies dense matrices
    ■ Parallelisation of LU-factorisation
  ○ Numerical solution of Partial Differential Equations (PDEs)
    ■ Sparse matrix structures
    ■ → Methods for iterative solution of systems with sparse matrices
EXAMPLE: Finite element method for solving the Poisson equation

2D Finite Element Method for the Poisson equation on a domain Ω with boundary Γ:
  −Δu = f  in Ω,    u = 0  on Γ,
where the Laplacian Δ is defined by
  Δu = ∂²u/∂x² + ∂²u/∂y²

u can be, for example,
● temperature
● displacement of an elastic membrane fixed at the boundary under a transversal load of intensity f
● electro-magnetic potential
Finite element method for solving Poisson equation

Divergence theorem:
  ∫_Ω ∇·F dx = ∫_Γ F·n ds,
where F = (F₁, F₂) is a vector-valued function defined on Ω, and the divergence of a vector function F = (F₁, F₂) is defined by
  ∇·F = ∂F₁/∂x + ∂F₂/∂y.
n = (n₁, n₂) – outward unit normal to Γ. Here dx – element of area in ℝ²; ds – arc length along Γ.
Finite element method for solving Poisson equation

Applying the divergence theorem to F = (vw, 0) and F = (0, vw) (details not shown here) gives Green's first identity:
  ∫_Ω ∇u·∇v dx = ∫_Γ (∂u/∂n) v ds − ∫_Ω (Δu) v dx.
The gradient ∇ of a scalar function f(x, y) is defined by
  ∇f = (∂f/∂x, ∂f/∂y).
We come to a variational formulation of the Poisson problem on V.

Poisson problem in variational formulation: find u ∈ V such that
  a(u, v) = (f, v)  for all v ∈ V,
where in the case of the Poisson equation
  a(u, v) = ∫_Ω ∇u·∇v dx  (a(u, v) is also called the bilinear form)  and  (f, v) = ∫_Ω f v dx.
Finite element method for solving Poisson equation

Again, the variational formulation of the equation is: find u ∈ V such that a(u, v) = (f, v) for all v ∈ V.

With discretisation, we replace the space V with V_h – the space of piecewise linear functions. Each function v ∈ V_h can be written as
  v(x, y) = Σ_{i=1}^{M} η_i φ_i(x, y),
where φ_i(x, y) – basis functions (or 'hat' functions).
Finite element method for solving Poisson equation

With 2D FEM we demand that the equation in the variational formulation is satisfied for the M basis functions φ_i ∈ V_h, i.e.
  a(u_h, φ_i) = (f, φ_i),  i = 1, ..., M.
Writing u_h = Σ_{j=1}^{M} ξ_j φ_j, we thus have M linear equations with respect to the unknowns ξ_j:
  Σ_{j=1}^{M} a(φ_j, φ_i) ξ_j = (f, φ_i),  i = 1, ..., M.
Finite element method for solving Poisson equation

The stiffness matrix A = (a_ij) elements and the right-hand side b = (b_i) are calculated as
  a_ij = ∫_Ω ∇φ_i · ∇φ_j dx,    b_i = ∫_Ω f φ_i dx.
The integrals are computed only where the pairs ∇φ_i · ∇φ_j get in touch (have mutual support).
(The support of a function f = f(x) is defined as the region of values x for which f(x) ≠ 0.)
Finite element method for solving Poisson equation

Example: two basis functions φ_i and φ_j for nodes N_i and N_j. Their common support is τ ∪ τ', so that
  a_ij = ∫_{τ ∪ τ'} ∇φ_i · ∇φ_j dx = ∫_τ ∇φ_i · ∇φ_j dx + ∫_{τ'} ∇φ_i · ∇φ_j dx.
Finite element method for solving Poisson equation

Element matrices
Consider a single element τ and pick two basis functions φ_i and φ_j (out of the three that are nonzero on τ).
φ_k is piecewise linear ⇒ denoting its restriction to τ by p_k(x, y) = φ_k|_τ, p_k is a plane
  p_k(x, y) = α_k + β_k · (x, y)ᵀ,  β_k ∈ ℝ²,
and the dot product of the (constant) gradients on τ is
  ∇p_i · ∇p_j = β_i · β_j.
Finite element method for solving Poisson equation

Finding the coefficients α and β: put the three points (x_i, y_i, 1), (x_j, y_j, 0), (x_k, y_k, 0) into the plane equation and solve the resulting 3 × 3 system (and analogously for the other two local basis functions).

The integral over τ is computed by multiplying the (constant) integrand with the triangle area 0.5 × |det D|.

The element matrix for τ collects the values ∫_τ ∇p_i · ∇p_j dx for all pairs of the three local basis functions.
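A sketch of the element-matrix computation for one triangle (a standard P1 construction; the function and variable names are illustrative and not from the slides):

import numpy as np

def p1_element_matrix(xy):
    """Element stiffness matrix for a linear triangle with vertex coordinates xy (3x2).
    Each local basis function is the plane p_k = alpha_k + beta_k . (x, y)
    with value 1 at vertex k and 0 at the other two vertices."""
    D = np.hstack([np.ones((3, 1)), xy])     # rows: (1, x_k, y_k)
    area = 0.5 * abs(np.linalg.det(D))
    # Solve D @ [alpha_k, beta_k1, beta_k2]^T = e_k for all three k at once.
    coeffs = np.linalg.solve(D, np.eye(3))
    grads = coeffs[1:3, :]                   # column k holds beta_k = grad p_k
    return area * grads.T @ grads            # entries: area * (beta_i . beta_j)

tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(p1_element_matrix(tri))
# Rows sum to zero, as expected for a pure stiffness (Laplace) element matrix.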
Finite element method for solving Poisson equation

Assembled stiffness matrix
● created by adding appropriately all the element matrices together
● Different types of boundary values used in the problem setup result in slightly different stiffness matrices
● Most typical boundary conditions:
  ○ Dirichlet
  ○ Neumann
● but also:
  ○ free boundary condition
  ○ Robin boundary condition
  ○ special boundary conditions for special PDEs (like impedance boundary conditions for the Helmholtz equation)