Parallel Numerical Algorithms
Chapter 2 – Parallel Thinking
Section 2.3 – Parallel Performance

Michael T. Heath and Edgar Solomonik
Department of Computer Science
University of Illinois at Urbana-Champaign

CS 554 / CSE 512
Outline

1. Parallel Efficiency
   - Basic Definitions
   - Execution Time and Cost
   - Efficiency and Speedup
2. Scalability
   - Definition
   - Problem Scaling
   - Isoefficiency
3. Example
   - Atmospheric Flow Model
   - 1-D Agglomeration
   - 2-D Agglomeration
Parallel Efficiency

Efficiency: effectiveness of a parallel algorithm relative to its serial counterpart (a more precise definition follows later).

Factors determining the efficiency of a parallel algorithm:
- Load balance: distribution of work among processors
- Concurrency: processors working simultaneously
- Overhead: additional work not present in the corresponding serial computation

Efficiency is maximized when load imbalance is minimized, concurrency is maximized, and overhead is minimized.
Parallel Efficiency

[Figure: four processor-time diagrams]
(a) perfect load balance and concurrency
(b) good initial concurrency but poor load balance
(c) good load balance but poor concurrency
(d) good load balance and concurrency but additional overhead
Algorithm Attributes

- Memory (M): overall memory footprint of the algorithm, in words
- Work (Q): total number of operations (e.g., flops) computed by the algorithm, including loads and stores
- Depth (D): longest sequence (chain) of dependent operations
- Time (T): elapsed wall-clock time (e.g., seconds) from beginning to end of the computation, expressed using
  - α: time to transfer a 0-byte message
  - β: bandwidth cost (time per word transferred)
  - γ: time to perform one local operation (unit work)

Note that the effective γ generally lies between the time to compute a floating-point operation and the time to load/store a word, depending on the local computation performed.
Scaling of Algorithm Attributes

- A subscript indicates the number of processors used (e.g., T_1 is serial execution time, Q_p is work using p processors, etc.)
- We assume the input size, an attribute of the problem rather than the algorithm, is M_1
- Most algorithms we study will be memory efficient, meaning M_p = M_1, in which case we drop the subscript and write just M
- If the serial algorithm is optimal, then Q_p ≥ Q_1
- Parallel work overhead: O_p := Q_p − Q_1
Basic Definitions

- The amount of data often determines the amount of computation, in which case we may write Q(M) to indicate the dependence of computational complexity on input size
- For example, when multiplying two full matrices of order n, M = Θ(n²) and Q = Θ(n³), so Q(M) = Θ(M^(3/2))
- In numerical algorithms, every data item is typically used in at least one operation, so we generally assume that work Q grows at least linearly with the input size M
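A quick numerical check can make the matrix-multiplication exponent concrete. The sketch below is illustrative only; the constants (M = 2n² input words, Q = 2n³ flops for a multiply-add inner loop) are a modeling choice, not part of the slides.

```python
# Check that Q(M) = Theta(M^(3/2)) for dense matrix multiplication.
# Modeling assumptions: M = 2n^2 words for the two input matrices,
# Q = 2n^3 flops (one multiply and one add per inner-product term).
for n in [100, 200, 400, 800]:
    M = 2 * n * n
    Q = 2 * n**3
    print(n, Q / M**1.5)  # ratio is constant (2 / 2^(3/2) ~ 0.707)
```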
Execution Time and Cost

- Execution time ≥ (total work) / (overall processor speed)
- Serial execution time: T_1 = γ Q_1
- Parallel execution time: T_p ≥ γ Q_p / p

[Figure: execution time T_p versus number of processors p, falling from T_1 at p = 1]

We can quantify T_p in terms of the critical path cost (the sum of the costs of the longest chain of dependent subtasks):

Cost := (L, W, F) := (#messages, #words, #flops)

max(αL, βW, γF) ≤ T_p ≤ αL + βW + γF
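The two bounds correspond to perfect overlap (only the largest cost term matters) and no overlap (the terms add). A minimal sketch of this model, with placeholder machine parameters rather than measured values:

```python
# alpha-beta-gamma cost model: bounds on T_p from a critical-path
# cost triple (L, W, F). Parameter values below are illustrative
# placeholders, not measurements of a real machine.

def time_bounds(L, W, F, alpha=1e-6, beta=1e-9, gamma=1e-10):
    lower = max(alpha * L, beta * W, gamma * F)  # costs fully overlapped
    upper = alpha * L + beta * W + gamma * F     # costs fully serialized
    return lower, upper

# Example critical path: 10 messages, 1e6 words, 1e9 flops
lo, hi = time_bounds(L=10, W=1e6, F=1e9)
print(f"{lo:.2e} s <= T_p <= {hi:.2e} s")
```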
Efficiency and Speedup

Speedup: S_p := (serial time) / (parallel time) = T_1 / T_p

Efficiency: E_p := (speedup) / (number of processors) = S_p / p
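These definitions translate directly into code; a trivial sketch for turning measured (or modeled) times into S_p and E_p, with made-up example timings:

```python
# Speedup and efficiency from serial time t1 and parallel time tp.

def speedup(t1, tp):
    return t1 / tp                 # S_p = T_1 / T_p

def efficiency(t1, tp, p):
    return speedup(t1, tp) / p     # E_p = S_p / p

# Example (hypothetical timings): 100 s serially, 8 s on 16 processors
print(speedup(100.0, 8.0))         # 12.5
print(efficiency(100.0, 8.0, 16))  # ~0.78
```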
Example: Summation

- Problem: compute the sum of n numbers
- Using p processors, each processor first sums n/p numbers
- The p subtotals are then summed in tree-like fashion to obtain the grand total

[Figure: p processors each sum n/p numbers locally; the subtotals are combined in a binary tree of depth log p]
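A sequential simulation of this two-phase algorithm, sketched under the assumption that each "processor" owns a contiguous chunk of the input; a real implementation would use a collective such as MPI_Reduce instead of the explicit loop.

```python
# Two-phase summation, simulated sequentially: p local partial sums,
# then pairwise combination in a binary tree of depth ceil(log2 p).

def tree_sum(x, p):
    n = len(x)
    chunk = (n + p - 1) // p
    # Phase 1: each "processor" sums its own chunk of ~n/p numbers
    subtotals = [sum(x[i*chunk:(i+1)*chunk]) for i in range(p)]
    # Phase 2: combine subtotals pairwise, one tree level per pass
    while len(subtotals) > 1:
        merged = [a + b for a, b in zip(subtotals[0::2], subtotals[1::2])]
        if len(subtotals) % 2:           # odd subtotal passes through
            merged.append(subtotals[-1])
        subtotals = merged
    return subtotals[0]

print(tree_sum(list(range(1000)), p=8))  # 499500
```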
Example: Summation

Generally α ≫ β ≫ γ, which we use to simplify the analysis.

Serial:   M_1 = n,  Q_1 ≈ n,  T_1 ≈ γn
Parallel: M_p = n,  Q_p ≈ n,  T_p ≈ α log(p) + γn/p

S_p = T_1 / T_p ≈ γn / (α log p + γn/p) = p / (1 + (α/γ)(p/n) log p)

E_p = S_p / p ≈ 1 / (1 + (α/γ)(p/n) log p)

To achieve good speedup, we want α/γ to be small and n ≫ p.
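Plugging representative numbers into the efficiency model shows how it decays as p grows at fixed n. The ratio α/γ = 10⁴ used below is an assumed value within the 10³ to 10⁶ range quoted later in these notes.

```python
# Efficiency model for tree summation: E_p = 1/(1 + (a/g)(p/n) log2 p).
# alpha/gamma = 1e4 is an assumed machine ratio, not a measurement.
import math

def model_efficiency(n, p, a_over_g=1e4):
    return 1.0 / (1.0 + a_over_g * (p / n) * math.log2(p))

for p in [16, 256, 4096]:
    print(p, model_efficiency(n=10**8, p=p))  # ~0.99, ~0.83, ~0.17
```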
Parallel Scalability

- Scalability: the relative effectiveness with which a parallel algorithm can utilize additional processors
- One criterion: an algorithm is scalable if its efficiency is bounded away from zero as the number of processors grows without bound, or equivalently, E_p = Θ(1) as p → ∞
- Scalability in this sense is unattainable in practice unless we permit the input size to grow or bound the number of processors used
Parallel Scalability

Why use more processors?
- to solve a given problem in less time
- to solve a larger problem in the same time
- to obtain sufficient memory to solve a given (or larger) problem
- to solve ever larger problems regardless of execution time

Larger problems require more memory M_1 and work Q_1, e.g.:
- finer resolution or larger domain in atmospheric simulation
- more particles in molecular or galactic simulations
- additional physical effects or greater detail in modeling
Problem Scaling

The relative parallel scaling of different algorithms for a problem can be studied by fixing:
- the input size: constant M_1
- the input size per processor: constant M_1/p

The relative parallel scaling of different parallelizations of an algorithm can be studied by fixing:
- the amount of work per processor: constant Q_1/p
- the efficiency: constant E_p
- the time: constant T_p

In all cases, we seek to quantify the relationship between the parameters of the problem/algorithm and the resulting performance (time/efficiency).
Strong Scaling

- Strong scaling: solving the same problem with a growing number of processors (constant input size)
- Ideal strong scaling to p processors requires T_p = T_1/p
- When the problem is not embarrassingly parallel, the best we can hope for is T_p ≈ T_1/p (i.e., E_p ≈ 1) up to some p
- We say an algorithm is strongly scalable to p_s processors if E_{p_s} = Θ(1); i.e., we seek to asymptotically characterize the function p_s(Q_1) such that E_{p_s(Q_1)}(Q_1) = const for any Q_1
Example: Summation

For the summation example,

E_p = 1 / (1 + (α/γ)(p/n) log p)

The binary tree summation algorithm is therefore strongly scalable to p_s = Θ((γ/α) n / log((γ/α) n)) processors.

The ratio α/γ is constant for a given architecture, but can range from 10³ to 10⁶ across machines.

Ignoring the dependence on this constant, the algorithm is strongly scalable to p_s = Θ(n / log n) processors.
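A numerical sanity check of this limit: find the largest power-of-two p at which the modeled efficiency stays above a fixed constant (1/2 here, an arbitrary choice) and compare it with the asymptotic expression. The machine ratio α/γ = 10⁴ is again an assumption.

```python
# Largest p with E_p >= 1/2 under the summation model, compared with
# the asymptotic limit p_s ~ (g/a) n / log((g/a) n). The threshold
# 1/2 and the ratio a/g = 1e4 are illustrative assumptions.
import math

def model_efficiency(n, p, a_over_g=1e4):
    return 1.0 / (1.0 + a_over_g * (p / n) * math.log2(p))

def max_scalable_p(n, threshold=0.5):
    p = 1
    while model_efficiency(n, 2 * p) >= threshold:
        p *= 2
    return p

n = 10**9
m = n / 1e4                      # (gamma/alpha) * n
print(max_scalable_p(n))         # 4096: largest power of two with E_p >= 1/2
print(round(m / math.log2(m)))   # ~6021: asymptotic estimate, same order
```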
Basic Bounds on Strong Scaling

All processors have work to do only if Q_p/p ≥ 1, so for any p the speedup is bounded by

S_p ≤ Q_1 / (Q_p/p) ≤ Q_1

It is possible but rare to achieve S_p > M_1 by using additional memory M_p > M_1, as otherwise some processors have no data to work on.