Parallel Programming and Heterogeneous Computing A3 - Performance Metrics Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze Operating Systems and Middleware Group
Performance
Which car is faster?
■ … for transporting several large boxes
■ … for winning a race
Performance depends not only on an execution environment but also on the workload it executes!
ParProg 2020 A3 Performance Metrics Lukas Wenzel Chart 2
Recap Optimization Goals
■ Decrease Latency: process a single workload faster (= speedup)
■ Increase Throughput: process more workloads in the same time
→ Both are performance metrics
■ Scalability: make best use of additional resources
  □ Scale Up: utilize additional resources on a machine
  □ Scale Out: utilize resources on additional machines
■ Cost/Energy Efficiency:
  □ minimize cost/energy requirements for given performance objectives
  □ alternatively: maximize performance for given cost/energy budget
■ Utilization: minimize idle time (= waste) of available resources
■ Precision-Tradeoffs: trade performance for precision of results
ParProg20 A1 Terminology Lukas Wenzel Chart 3
Scaling Behavior
Different responses of performance metrics to scaling (additional resources):
■ Speedup: more resources ~ less time executing the same workload › strong scaling
■ Scaled Speedup: more resources ~ same time executing a larger workload › weak scaling
■ Linear speedup = resources and workload execution scale by the same factor
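A minimal Python sketch of the two scaling views above. The function names, the 5% tolerance, and the example numbers are illustrative assumptions, not measurements from the lecture.

```python
def speedup(t_base, t_scaled):
    """Strong scaling: same workload, less execution time."""
    return t_base / t_scaled


def scaled_speedup(w_base, w_scaled):
    """Weak scaling: same execution time, larger workload."""
    return w_scaled / w_base


def is_linear(resource_factor, achieved, tolerance=0.05):
    """True if the achieved factor is within `tolerance` of the resource factor."""
    return abs(achieved - resource_factor) <= tolerance * resource_factor


# Hypothetical numbers for a machine with 4x the resources:
strong = speedup(100.0, 25.0)      # same workload, 100 s down to 25 s
weak = scaled_speedup(1000, 3600)  # same time, 1000 items up to 3600 items

print("strong scaling:", strong, "linear:", is_linear(4, strong))
print("weak scaling:  ", weak, "linear:", is_linear(4, weak))
```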
Anatomy of a Workload
A workload consists of multiple tasks, each containing a different number of operations.
[Figure: schedules of tasks T1 to T8 on 1, 3, 5, and 8 execution units]
× 1: execution time 44 (idle 0)
× 3: execution time 15 (idle 2), speedup 2.93
× 5: execution time 9 (idle 3), speedup 4.89
× 8: execution time 9 (idle 29), speedup 4.89 (!)
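The effect shown in the chart can be reproduced with a small scheduling sketch. The task durations below are hypothetical (the chart does not list them), and the greedy longest-task-first heuristic is an assumption that does not always find the optimal schedule. The sketch shows that speedup stops improving once the longest task dominates, while idle time keeps growing.

```python
import heapq


def schedule(durations, units):
    """Greedy longest-task-first schedule.

    Returns (execution_time, idle_time) summed over all execution units.
    """
    loads = [0] * units
    heapq.heapify(loads)
    for d in sorted(durations, reverse=True):
        # Assign the next-longest task to the least-loaded unit.
        heapq.heappush(loads, heapq.heappop(loads) + d)
    execution_time = max(loads)
    idle_time = units * execution_time - sum(durations)
    return execution_time, idle_time


# Hypothetical task durations (NOT the values from the chart); they sum to 44.
TASKS = [9, 7, 6, 6, 5, 5, 4, 2]

for units in (1, 3, 5, 8, 16):
    time, idle = schedule(TASKS, units)
    print(f"x{units}: time {time}, idle {idle}, speedup {sum(TASKS) / time:.2f}")
```

Beyond 8 units the execution time stays at the duration of the longest task, so additional units only add idle time.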
Anatomy of a Workload
The longest task puts a lower bound on the shortest execution time.
[Figure: tasks T1 to T8; $T_1$ split into $T_{par}$ and $T_{seq}$; $T(N)$ split into $T_{par}/N$ and $T_{seq}$]
Modeling discrete tasks is impractical → simplified continuous model:
$T(N) = \frac{T_{par}}{N} + T_{seq}$
Replace absolute times by parallelizable fraction $P$:
$T_{par} = T_1 \cdot P$
$T_{seq} = T_1 \cdot (1 - P)$
$T(N) = T_1 \cdot \left( \frac{P}{N} + (1 - P) \right)$
[Amdahl1967] Amdahl's Law
Amdahl's Law derives the speedup $s_{Amdahl}(N)$ for a parallelization degree $N$:
$s_{Amdahl}(N) = \frac{T_1}{T(N)} = \frac{T_1}{T_1 \cdot \left( \frac{P}{N} + (1 - P) \right)} = \frac{1}{\frac{P}{N} + (1 - P)}$
Even for arbitrarily large $N$, the speedup converges to a fixed limit:
$\lim_{N \to \infty} s_{Amdahl}(N) = \frac{1}{1 - P}$
For getting reasonable speedup out of 1000 processors, the sequential part must be substantially below 0.1%.
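A short Python sketch of Amdahl's Law as stated above. The function names are illustrative assumptions; `p` is the parallelizable fraction $P$ and `n` the parallelization degree $N$.

```python
def amdahl_speedup(p, n):
    """Speedup for parallelizable fraction p on n execution units."""
    return 1.0 / (p / n + (1.0 - p))


def amdahl_limit(p):
    """Limit of the speedup for n -> infinity."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2), round(amdahl_speedup(0.999, n), 2))
print("limit for P = 0.9:", round(amdahl_limit(0.9), 2))
```

With a sequential part of exactly 0.1% (`p = 0.999`), 1000 processors reach a speedup of roughly 500, only about half of the linear speedup, which illustrates why the sequential part has to be substantially smaller.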
[Amdahl1967] Amdahl's Law
[Figure] By Daniels220 at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6678551
[Amdahl1967] Amdahl's Law
Regardless of processor count, 90% parallelizable code allows not more than a speedup by factor 10.
→ Parallelism requires highly parallelizable workloads to achieve a speedup
■ What is the sense in large parallel machines?
Amdahl's law assumes a simple speedup scenario!
→ isolated execution of a single workload
→ fixed workload size
[Gustafson1988] Gustafson-Barsis' Law
Consider a scaled speedup scenario, allowing a variable workload size $w$.
Amdahl ~ What is the shortest execution time for a given workload?
Gustafson-Barsis ~ What is the largest workload for a given execution time?
[Figure: execution time $T$ split into $T_{par}$ and $T_{seq}$]
$w(1) \sim T_{par} + T_{seq}$
$w(N) \sim N \cdot T_{par} + T_{seq}$
Assumption: The parallelizable part of a workload contributes useful work when replicated.
[Gustafson1988] Gustafson-Barsis' Law
Determine the scaled speedup $s_{Gustafson}(N)$ through the increase in workload size $w(N)$ over the fixed execution time $T$:
$s_{Gustafson}(N) = \frac{w(N)}{w(1)} = \frac{T \cdot (P \cdot N + (1 - P))}{T \cdot (P + (1 - P))} = P \cdot N + (1 - P)$
[Figure: execution time $T$ split into $T_{par}$ and $T_{seq}$]
$w(1) \sim T_{par} + T_{seq}$
$w(N) \sim N \cdot T_{par} + T_{seq}$
Assumption: The parallelizable part of a workload contributes useful work when replicated.
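A short Python sketch comparing the scaled speedup with Amdahl's fixed-workload speedup for the same parallel fraction. The function names are illustrative assumptions.

```python
def gustafson_speedup(p, n):
    """Scaled speedup: workload grows with n, execution time stays fixed."""
    return p * n + (1.0 - p)


def amdahl_speedup(p, n):
    """Fixed-workload speedup, for comparison."""
    return 1.0 / (p / n + (1.0 - p))


for p in (0.9, 0.5):
    for n in (10, 100):
        print(f"P={p}, N={n}: "
              f"Gustafson {gustafson_speedup(p, n):.1f}, "
              f"Amdahl {amdahl_speedup(p, n):.2f}")
```

The scaled speedup grows linearly in `n` with slope `p`, whereas the fixed-workload speedup saturates at `1 / (1 - p)`.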
[Gustafson1988] Gustafson-Barsis' Law
[Figure; labels: P = 90%, P = 50%] By Peahihawaii - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=12630392
[Karp1990] Karp-Flatt-Metric
Parallel fraction $P$ is a hypothetical parameter and not easily deduced from a given workload.
→ Karp-Flatt-Metric determines sequential fraction $Q = 1 - P$ empirically
1. Measure baseline execution time $T_1$ by executing workload on a single execution unit
2. Measure parallelized execution time $T(N)$ by executing workload on $N$ execution units
3. Determine speedup $s(N) = \frac{T_1}{T(N)}$
4. Calculate Karp-Flatt-Metric
$Q(N) = \frac{\frac{1}{s(N)} - \frac{1}{N}}{1 - \frac{1}{N}}$
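The four steps above can be sketched in Python as follows. The function name and the measured times are hypothetical; in practice `t1` and `tn` would come from timing the real workload.

```python
def karp_flatt(t1, tn, n):
    """Empirical sequential fraction Q(N) from two measured execution times.

    t1: execution time on a single execution unit (step 1)
    tn: execution time on n execution units (step 2)
    """
    if n < 2:
        raise ValueError("the metric is only defined for n >= 2")
    s = t1 / tn                                   # step 3: speedup
    return (1.0 / s - 1.0 / n) / (1.0 - 1.0 / n)  # step 4: Karp-Flatt metric


# Hypothetical measurements: 100 s on one unit, 40 s on four units.
q = karp_flatt(100.0, 40.0, 4)
print(f"sequential fraction Q(4) = {q:.2f}")
```

For these numbers the speedup is 2.5 and the metric yields a sequential fraction of about 0.2, consistent with the continuous model: $100 \cdot (0.8 / 4 + 0.2) = 40$.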
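[Karp1990] Karp-Flatt-Metric
The Karp-Flatt-Metric is derived by rearranging Amdahl's Law.
$s(N) = \frac{T_1}{T(N)}$ ; $T(N) = \left( \frac{1 - Q}{N} + Q \right) \cdot T_1$
$\frac{1}{s(N)} = \frac{\left( \frac{1 - Q}{N} + Q \right) \cdot T_1}{T_1}$
$\frac{1}{s(N)} = \frac{1 - Q}{N} + Q = \frac{1}{N} + \left( 1 - \frac{1}{N} \right) \cdot Q$
$\frac{1}{s(N)} - \frac{1}{N} = \left( 1 - \frac{1}{N} \right) \cdot Q$
$\frac{\frac{1}{s(N)} - \frac{1}{N}}{1 - \frac{1}{N}} = Q$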
[Karp1990] Karp-Flatt-Metric
Observing $Q(N)$ for different $N$ gives an indication of how the workload reacts to different degrees of parallelism:
■ $Q(N)$ close to 0 ~ high parallel fraction, workload benefits from parallelization
■ $Q(N)$ close to 1 ~ low parallel fraction, workload cannot use parallel resources
■ $Q(N)$ increases with $N$ ~ workload suffers from parallelization overhead
■ $Q(N)$ decreases with $N$ ~ workload scales well
Observing $Q(N)$ for different implementation variants of the workload can reveal bottlenecks.
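A sketch of how such a series of $Q(N)$ values might be inspected. All timings are made up for illustration: the first set follows the continuous model with a sequential fraction of 0.1, the second adds a hypothetical overhead that grows with $N$. The function names and the tolerance are assumptions.

```python
def karp_flatt(t1, tn, n):
    """Empirical sequential fraction Q(N)."""
    speedup = t1 / tn
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)


def serial_fractions(t1, timings):
    """Map each measured (n -> T(n)) to (n, Q(n)), ordered by n."""
    return [(n, karp_flatt(t1, tn, n)) for n, tn in sorted(timings.items())]


def trend(fractions, tolerance=0.01):
    """Classify how Q(N) changes from the smallest to the largest N."""
    first, last = fractions[0][1], fractions[-1][1]
    if last > first + tolerance:
        return "increasing"  # hints at parallelization overhead
    if last < first - tolerance:
        return "decreasing"  # workload scales well
    return "constant"


T1 = 100.0
IDEAL = {2: 55.0, 4: 32.5, 8: 21.25}          # hypothetical, no overhead
WITH_OVERHEAD = {2: 56.0, 4: 34.5, 8: 25.25}  # hypothetical, overhead grows with N

for name, timings in (("ideal", IDEAL), ("overhead", WITH_OVERHEAD)):
    series = serial_fractions(T1, timings)
    print(name, [(n, round(q, 3)) for n, q in series], trend(series))
```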
[Leiserson2008] A More Detailed View
Directed Acyclic Graph to model a workload:
■ Nodes represent operations
■ Edges express dependencies between operations
[Figure: DAG with nodes 1 to 18; $T_1 = 18$, $T_\infty = 9$]
Work $T$: total workload execution time
$T_1$: execution time with a single processor ~ number of nodes
$T_P$: execution time with $P$ processors
$T_\infty$: execution time with an arbitrary number of processors ~ graph diameter (length of the longest dependency chain)
Work Law: $T_P \geq \frac{T_1}{P}$ (processors cannot process multiple operations at once)
Span Law: $T_P \geq T_\infty$ (execution order cannot break dependencies)
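A sketch of how work and span could be computed for a small DAG, assuming unit-cost operations. The graph below is a hypothetical example, not the 18-node graph from the chart, and the function names are assumptions. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def work_and_span(deps):
    """Return (T_1, T_inf) for a DAG of unit-cost operations.

    deps maps each operation to the set of operations it depends on.
    """
    depth = {}
    for node in TopologicalSorter(deps).static_order():
        # Length of the longest dependency chain ending at this node.
        depth[node] = 1 + max((depth[d] for d in deps.get(node, ())), default=0)
    return len(depth), max(depth.values())


def lower_bound(work, span, processors):
    """Combine Work Law and Span Law into a lower bound on T_P."""
    return max(work / processors, span)


# Hypothetical DAG: a -> {b, c, d} -> e -> f
DEPS = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"a"},
        "e": {"b", "c", "d"}, "f": {"e"}}

work, span = work_and_span(DEPS)
for p in (1, 2, 4):
    print(f"P={p}: T_P >= {lower_bound(work, span, p)}")
```

Applied to the values from the chart, `lower_bound(18, 9, 4)` evaluates to 9: more than two processors cannot push the execution time below the span.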
Literature
[Amdahl1967] Amdahl, Gene M. "Validity of the single processor approach to achieving large scale computing capabilities." Proceedings of the AFIPS Spring Joint Computer Conference. 483-485. 1967.
[Gustafson1988] Gustafson, John L. "Reevaluating Amdahl's law." Communications of the ACM 31.5 (1988): 532-533.
[Karp1990] Karp, Alan H. and Flatt, Horace P. "Measuring parallel processor performance." Communications of the ACM 33.5 (1990): 539-543.
[Leiserson2008] Leiserson, Charles E. and Mirman, Ilya B. "How to survive the multicore software revolution (or at least survive the hype)." Cilk Arts 1 (2008): 11.
And now for a break and a cup of Oolong. *or beverage of your choice