Parallel Programming and Heterogeneous Computing A3 - Performance Metrics

  1. Parallel Programming and Heterogeneous Computing
     A3 - Performance Metrics
     Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze
     Operating Systems and Middleware Group

  2. Performance
     Which car is faster?
     ■ … for transporting several large boxes
     ■ … for winning a race
     Performance depends not only on an execution environment but also on the workload it executes!
     ParProg 2020 A3 Performance Metrics Lukas Wenzel Chart 2

  3. Recap Optimization Goals (ParProg20 A1 Terminology)
     ■ Decrease Latency: process a single workload faster (= speedup)
     ■ Increase Throughput: process more workloads in the same time
     → Both are performance metrics
     ■ Scalability: make best use of additional resources
       □ Scale Up: utilize additional resources on a machine
       □ Scale Out: utilize resources on additional machines
     ■ Cost/Energy Efficiency:
       □ minimize cost/energy requirements for given performance objectives
       □ alternatively: maximize performance for given cost/energy budget
     ■ Utilization: minimize idle time (= waste) of available resources
     ■ Precision Tradeoffs: trade performance for precision of results

  4. Scaling Behavior
     Performance metrics respond differently to scaling (additional resources):
     ■ Speedup: more resources ~ less time executing the same workload → strong scaling
     ■ Scaled Speedup: more resources ~ same time executing a larger workload → weak scaling
     ■ Linear speedup = resources and workload execution scale by the same factor

  5. Anatomy of a Workload
     A workload consists of multiple tasks, each containing a different number of operations.
     [Figure: tasks T1 to T8 scheduled on 1, 3, 5 and 8 execution units]

     Execution units   Execution time (idle)   Speedup
     × 1               44 (0)                  1 (baseline)
     × 3               15 (2)                  2.93
     × 5               9 (3)                   4.89
     × 8               9 (29)                  4.89 (!)
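
The effect on this slide can be reproduced with a small scheduling sketch. The task durations below are hypothetical (the transcript does not preserve the durations of T1 to T8); they are only chosen so that the total is 44 and the longest task is 9. The greedy longest-task-first heuristic is also an assumption and does not always find the optimal schedule.

```python
import heapq


def schedule(durations, units):
    """Greedy list scheduling: place the longest remaining task on the
    least-loaded execution unit. Returns (execution time, total idle time)."""
    loads = [0] * units
    heapq.heapify(loads)
    for duration in sorted(durations, reverse=True):
        least_loaded = heapq.heappop(loads)
        heapq.heappush(loads, least_loaded + duration)
    makespan = max(loads)
    idle = units * makespan - sum(durations)
    return makespan, idle


# Hypothetical task durations (not taken from the slide); they sum to 44.
TASKS = [9, 8, 7, 6, 5, 4, 3, 2]

baseline, _ = schedule(TASKS, 1)
for units in (1, 3, 5, 8):
    elapsed, idle = schedule(TASKS, units)
    print(f"x{units}: time {elapsed}, idle {idle}, speedup {baseline / elapsed:.2f}")
```

With these assumed durations, going from 5 to 8 units leaves the execution time unchanged and only adds idle time, because the longest task already determines the execution time.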

  6. Anatomy of a Workload
     The longest task puts a lower bound on the shortest execution time.
     Modeling discrete tasks is impractical → simplified continuous model.
     [Figure: $T_1$ split into $T_{par}$ and $T_{seq}$; $T(N)$ split into $T_{par}/N$ and $T_{seq}$]
     $T(N) = \frac{T_{par}}{N} + T_{seq}$
     Replace absolute times by the parallelizable fraction $P$:
     $T_{par} = T_1 \cdot P$
     $T_{seq} = T_1 \cdot (1 - P)$
     $T(N) = T_1 \cdot \left( \frac{P}{N} + (1 - P) \right)$

  7. [Amdahl1967] Amdahl's Law
     Amdahl's Law derives the speedup $s_{Amdahl}(N)$ for a parallelization degree $N$:
     $s_{Amdahl}(N) = \frac{T_1}{T(N)} = \frac{T_1}{T_1 \cdot \left( \frac{P}{N} + (1 - P) \right)} = \frac{1}{\frac{P}{N} + (1 - P)}$
     Even for arbitrarily large $N$, the speedup converges to a fixed limit:
     $\lim_{N \to \infty} s_{Amdahl}(N) = \frac{1}{1 - P}$
     For getting reasonable speedup out of 1000 processors, the sequential part must be substantially below 0.1%.
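
A minimal sketch of the continuous model and Amdahl's Law, to make the 1000-processor remark concrete. The function names are illustrative choices, not part of the lecture material.

```python
def execution_time(t1, p, n):
    """Continuous model: T(N) = T_1 * (P / N + (1 - P))."""
    return t1 * (p / n + (1.0 - p))


def amdahl_speedup(p, n):
    """Amdahl's Law: s(N) = 1 / (P / N + (1 - P))."""
    return 1.0 / (p / n + (1.0 - p))


def amdahl_limit(p):
    """Limit of the speedup for N -> infinity; requires P < 1."""
    return 1.0 / (1.0 - p)


# A sequential part of exactly 0.1% (P = 0.999) on 1000 processors
print(f"{amdahl_speedup(0.999, 1000):.2f}")  # about 500, half of the ideal 1000
# 90% parallelizable code never exceeds a speedup of about 10
print(f"{amdahl_limit(0.9):.2f}")
```

Even with a sequential part of only 0.1%, 1000 processors reach roughly half of the linear speedup, which is why the sequential part has to be substantially smaller than that.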

  8. [Amdahl1967] Amdahl's Law
     [Figure by Daniels220 at English Wikipedia, CC BY-SA 3.0]

  9. [Amdahl1967] Amdahl's Law
     Regardless of processor count, 90% parallelizable code allows a speedup of no more than factor 10.
     ■ Parallelism requires highly parallelizable workloads to achieve a speedup
       → What is the sense in large parallel machines?
     ■ Amdahl's law assumes a simple speedup scenario!
       → isolated execution of a single workload
       → fixed workload size

  10. [Gustafson1988] Gustafson-Barsis' Law
      Consider a scaled speedup scenario, allowing a variable workload size $w$.
      Amdahl ~ What is the shortest execution time for a given workload?
      Gustafson-Barsis ~ What is the largest workload for a given execution time?
      [Figure: fixed execution time $T$ split into $T_{par}$ and $T_{seq}$, for one and for $N$ execution units]
      $w_1 \sim T_{par} + T_{seq}$
      $w(N) \sim N \cdot T_{par} + T_{seq}$
      Assumption: The parallelizable part of a workload contributes useful work when replicated.

  11. [Gustafson1988] Gustafson-Barsis' Law
      Determine the scaled speedup $s_{Gustafson}(N)$ through the increase in workload size $w(N)$ over the fixed execution time $T$:
      $s_{Gustafson}(N) = \frac{w(N)}{w_1} = \frac{T \cdot (P \cdot N + (1 - P))}{T \cdot (P + (1 - P))} = P \cdot N + (1 - P)$
      with
      $w_1 \sim T_{par} + T_{seq}$
      $w(N) \sim N \cdot T_{par} + T_{seq}$
      Assumption: The parallelizable part of a workload contributes useful work when replicated.
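
The scaled speedup can be checked numerically by computing the workload sizes directly. This is a sketch under the slide's assumption that the parallel part can be replicated usefully; the function names and the sample values are illustrative.

```python
import math


def workload(t, p, n):
    """Work completed in the fixed time t on n units: N * T_par + T_seq."""
    t_par = t * p
    t_seq = t * (1.0 - p)
    return n * t_par + t_seq


def gustafson_speedup(p, n):
    """Gustafson-Barsis' Law: s(N) = P * N + (1 - P)."""
    return p * n + (1.0 - p)


# The ratio w(N) / w_1 matches the closed form for any fixed time t.
ratio = workload(10.0, 0.9, 16) / workload(10.0, 0.9, 1)
print(math.isclose(ratio, gustafson_speedup(0.9, 16)))
print(f"{gustafson_speedup(0.9, 1000):.1f}")
```

Unlike the fixed-workload case, the scaled speedup keeps growing linearly with the number of execution units.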

  12. [Gustafson1988] Gustafson-Barsis' Law
      [Figure with curves labeled P = 90% and P = 50%. By Peahihawaii - Own work, CC BY-SA 3.0]

  13. [Karp1990] Karp-Flatt Metric
      The parallel fraction $P$ is a hypothetical parameter and not easily deduced from a given workload.
      → The Karp-Flatt metric determines the sequential fraction $Q = 1 - P$ empirically:
      1. Measure the baseline execution time $T_1$ by executing the workload on a single execution unit
      2. Measure the parallelized execution time $T(N)$ by executing the workload on $N$ execution units
      3. Determine the speedup $s(N) = \frac{T_1}{T(N)}$
      4. Calculate the Karp-Flatt metric
         $Q(N) = \frac{\frac{1}{s(N)} - \frac{1}{N}}{1 - \frac{1}{N}}$
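
The four steps translate into a few lines of code. The timings used below are made-up example measurements, not results from the lecture.

```python
def karp_flatt(t1, tn, n):
    """Empirical sequential fraction Q(N) from two measured execution times.

    t1: baseline time on a single execution unit
    tn: time on n execution units (n must be at least 2)
    """
    if n < 2:
        raise ValueError("the Karp-Flatt metric is undefined for n < 2")
    speedup = t1 / tn
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)


# Hypothetical measurements: 100 s on one unit, 20 s on eight units
print(f"{karp_flatt(100.0, 20.0, 8):.4f}")
```

A speedup of 5 on 8 units corresponds to a sequential fraction of roughly 8.6% under these assumed timings.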

  14. [Karp1990] Karp-Flatt Metric
      The Karp-Flatt metric is derived by rearranging Amdahl's Law.
      $s(N) = \frac{T_1}{T(N)}$ ; $T(N) = \left( \frac{1 - Q}{N} + Q \right) \cdot T_1$
      $s(N) = \frac{T_1}{\left( \frac{1 - Q}{N} + Q \right) \cdot T_1}$
      $\frac{1}{s(N)} = \frac{1 - Q}{N} + Q = \frac{1}{N} + \left( 1 - \frac{1}{N} \right) \cdot Q$
      $\frac{1}{s(N)} - \frac{1}{N} = \left( 1 - \frac{1}{N} \right) \cdot Q$
      $\frac{\frac{1}{s(N)} - \frac{1}{N}}{1 - \frac{1}{N}} = Q$

  15. [Karp1990] Karp-Flatt Metric
      Observing $Q(N)$ for different $N$ gives an indication of how the workload reacts to different degrees of parallelism:
      ■ $Q(N)$ close to 0 ~ high parallel fraction, workload benefits from parallelization
      ■ $Q(N)$ close to 1 ~ low parallel fraction, workload cannot use parallel resources
      ■ $Q(N)$ increases with $N$ ~ workload suffers from parallelization overhead
      ■ $Q(N)$ decreases with $N$ ~ workload scales well
      Observing $Q(N)$ for different implementation variants of the workload can reveal bottlenecks.
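
To illustrate how the trend of the metric is read, the sketch below evaluates it on two synthetic timing models. Both models, the linear overhead term, and all constants are assumptions made for illustration; real measurements would take their place.

```python
def serial_fraction(speedup, n):
    """Karp-Flatt metric Q(N) computed from a speedup on n >= 2 units."""
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)


T1 = 100.0       # assumed baseline time on one unit
Q_TRUE = 0.1     # assumed sequential fraction of the synthetic workload
OVERHEAD = 0.5   # assumed coordination cost per execution unit


def ideal_time(n):
    """Timing that follows Amdahl's model exactly."""
    return T1 * ((1.0 - Q_TRUE) / n + Q_TRUE)


def overhead_time(n):
    """Same model plus a hypothetical overhead that grows with n."""
    return ideal_time(n) + OVERHEAD * n


UNITS = (2, 4, 8, 16)
ideal_q = [serial_fraction(T1 / ideal_time(n), n) for n in UNITS]
overhead_q = [serial_fraction(T1 / overhead_time(n), n) for n in UNITS]

for n, qi, qo in zip(UNITS, ideal_q, overhead_q):
    print(f"N={n:2d}  ideal Q={qi:.3f}  with overhead Q={qo:.3f}")
```

In the ideal model the metric stays constant at the assumed sequential fraction, while the overhead term makes it grow with the number of units, matching the third bullet above.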

  16. [Leiserson2008] A More Detailed View
      Directed Acyclic Graph to model a workload:
      ■ Nodes represent operations
      ■ Edges express dependencies between operations
      [Figure: example graph with nodes 1 to 18; $T_1 = 18$, $T_\infty = 9$]
      Work $T$ - total workload execution time
      $T_1$ - execution time with a single processor ~ number of nodes
      $T_P$ - execution time with $P$ processors
      $T_\infty$ - execution time with arbitrary number of processors ~ graph diameter
      Work Law: $T_P \geq \frac{T_1}{P}$ (processors cannot process multiple operations at once)
      Span Law: $T_P \geq T_\infty$ (execution order cannot break dependencies)
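
Both laws can be evaluated on a small dependency graph. The graph below is a hypothetical six-node example (the edges of the slide's 18-node graph are not preserved in this transcript), and every operation is assumed to take one time unit. The sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or newer).

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each node maps to the set of nodes it depends on.
DEPS = {
    "a": set(),
    "b": {"a"},
    "c": {"a"},
    "d": {"b"},
    "e": {"b", "c"},
    "f": {"d", "e"},
}


def work_and_span(deps):
    """Return (T_1, T_inf) assuming unit-time operations.

    T_1 is the number of nodes, T_inf the number of nodes on the
    longest dependency chain."""
    depth = {}
    for node in TopologicalSorter(deps).static_order():
        predecessors = deps.get(node, ())
        depth[node] = 1 + max((depth[p] for p in predecessors), default=0)
    return len(depth), max(depth.values())


def lower_bound(work, span, processors):
    """Combine Work Law and Span Law: T_P >= max(T_1 / P, T_inf)."""
    return max(work / processors, span)


work, span = work_and_span(DEPS)
print(work, span, lower_bound(work, span, 2))
# Using the values stated on the slide (T_1 = 18, T_inf = 9):
print(lower_bound(18, 9, 2), lower_bound(18, 9, 4))
```

For the slide's values, two processors already reach the bound set by the Span Law, so additional processors cannot reduce the execution time below 9.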

  17. Literature
      [Amdahl1967] Amdahl, Gene M. "Validity of the single processor approach to achieving large scale computing capabilities." Proceedings of the AFIPS Spring Joint Computer Conference. 483-485. 1967.
      [Gustafson1988] Gustafson, John L. "Reevaluating Amdahl's law." Communications of the ACM 31.5 (1988): 532-533.
      [Karp1990] Karp, Alan H. and Flatt, Horace P. "Measuring parallel processor performance." Communications of the ACM 33.5 (1990): 539-543.
      [Leiserson2008] Leiserson, Charles E. and Mirman, Ilya B. "How to survive the multicore software revolution (or at least survive the hype)." Cilk Arts 1 (2008): 11.

  18. And now for a break and a cup of Oolong. *or beverage of your choice
