Parallel Programming and Heterogeneous Computing
A3 - Performance Metrics
Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze
Operating Systems and Middleware Group

Performance
■ Which car is faster?
  … for transporting several large boxes
  … for winning a race
Performance depends not only on an execution environment but also on the workload it executes!
ParProg 2020 A3 Performance Metrics Lukas Wenzel Chart 2

Recap Optimization Goals (ParProg20 A1 Terminology)
■ Decrease Latency: process a single workload faster (= speedup)
■ Increase Throughput: process more workloads in the same time
→ Both are performance metrics
■ Scalability: make best use of additional resources
  □ Scale Up: utilize additional resources on a machine
  □ Scale Out: utilize resources on additional machines
■ Cost/Energy Efficiency:
  □ minimize cost/energy requirements for given performance objectives
  □ alternatively: maximize performance for given cost/energy budget
■ Utilization: minimize idle time (= waste) of available resources
■ Precision-Tradeoffs: trade performance for precision of results

Scaling Behavior
Different responses of performance metrics to scaling (additional resources):
■ Speedup: more resources ~ less time executing the same workload → strong scaling
■ Scaled Speedup: more resources ~ same time executing a larger workload → weak scaling
■ Linear speedup = resources and workload execution scale by same factor

Anatomy of a Workload
A workload consists of multiple tasks, each containing a different number of operations.
[Figure: tasks T1 to T8 scheduled on 1, 3, 5 and 8 execution units]
■ ×1: execution time 44 (idle 0)
■ ×3: execution time 15 (idle 2), speedup 2.93
■ ×5: execution time 9 (idle 3), speedup 4.89
■ ×8: execution time 9 (idle 29), speedup 4.89 (!)
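
The effect on this slide can be reproduced with a small scheduling sketch. The task durations below are hypothetical (the slide does not list them); they are only chosen so that the total is 44 and the longest task takes 9. The scheduler is a simple greedy longest-task-first heuristic, which is an assumption of this sketch and not necessarily the schedule shown in the figure.

```python
import heapq

def makespan(durations, units):
    """Greedy list scheduling: hand each task (longest first) to the unit that becomes free earliest."""
    loads = [0] * units
    heapq.heapify(loads)
    for d in sorted(durations, reverse=True):
        earliest = heapq.heappop(loads)        # least loaded unit so far
        heapq.heappush(loads, earliest + d)
    return max(loads)

# Hypothetical task durations (not taken from the slide); they sum to 44.
tasks = [9, 8, 7, 6, 5, 4, 3, 2]
t1 = makespan(tasks, 1)

for units in (1, 3, 5, 8):
    t = makespan(tasks, units)
    idle = units * t - sum(tasks)              # unit-time slots in which no task runs
    print(f"x{units}: time={t} idle={idle} speedup={t1 / t:.2f}")
```

With these durations the execution time stops improving at 9, the length of the longest task: going from 5 to 8 units only adds idle time. The greedy heuristic is not always optimal (for 3 units it yields 16, while a schedule of length 15 exists), so it should be read as an illustration only.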

Anatomy of a Workload
The longest task puts a lower bound on the shortest execution time.
[Figure: tasks T1 to T8; $T_1$ split into $T_{par}$ and $T_{seq}$; $T(N)$ split into $T_{par}/N$ and $T_{seq}$]
Modeling discrete tasks is impractical → simplified continuous model.
$$T(N) = \frac{T_{par}}{N} + T_{seq}$$
Replace absolute times by parallelizable fraction $P$:
$$T_{par} = T_1 \cdot P$$
$$T_{seq} = T_1 \cdot (1 - P)$$
$$T(N) = T_1 \cdot \left(\frac{P}{N} + (1 - P)\right)$$

[Amdahl1967] Amdahl's Law
Amdahl's Law derives the speedup $s_{Amdahl}(N)$ for a parallelization degree $N$:
$$s_{Amdahl}(N) = \frac{T_1}{T(N)} = \frac{T_1}{T_1 \cdot \left(\frac{P}{N} + (1 - P)\right)} = \frac{1}{\frac{P}{N} + (1 - P)}$$
Even for arbitrarily large $N$, the speedup converges to a fixed limit:
$$\lim_{N \to \infty} s_{Amdahl}(N) = \frac{1}{1 - P}$$
For getting reasonable speedup out of 1000 processors, the sequential part must be substantially below 0.1%.
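
The claim about 1000 processors can be checked numerically. The following is a minimal sketch; the function names and the chosen sequential fractions are illustrative assumptions, not part of the slides.

```python
def amdahl_speedup(p, n):
    """Speedup for parallel fraction p on n execution units, following Amdahl's Law."""
    return 1.0 / (p / n + (1.0 - p))

def amdahl_limit(p):
    """Upper bound on the speedup for parallel fraction p as n grows without bound."""
    return 1.0 / (1.0 - p)

# Illustrative sequential fractions: 1%, 0.1% and 0.01%
for seq in (0.01, 0.001, 0.0001):
    p = 1.0 - seq
    print(f"sequential {seq:.2%}: s(1000) = {amdahl_speedup(p, 1000):.1f}, "
          f"limit = {amdahl_limit(p):.0f}")
```

With a sequential part of exactly 0.1%, 1000 processors reach a speedup of only about 500, half of the linear speedup, which is why the sequential part has to be well below that value.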

[Amdahl1967] Amdahl's Law
[Figure: Amdahl's Law plot. By Daniels220 at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6678551]

[Amdahl1967] Amdahl's Law
Regardless of processor count, 90% parallelizable code allows not more than a speedup by factor 10.
→ Parallelism requires highly parallelizable workloads to achieve a speedup
■ What is the sense in large parallel machines?
Amdahl's law assumes a simple speedup scenario!
→ isolated execution of a single workload
→ fixed workload size
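
As a worked check of the factor-10 bound, the values below plug $P = 0.9$ into Amdahl's Law. The processor counts 10 and 1000 are illustrative choices, not taken from the slide.

```latex
\begin{align*}
\lim_{N \to \infty} s_{Amdahl}(N) &= \frac{1}{1 - 0.9} = 10 \\
s_{Amdahl}(10) &= \frac{1}{\frac{0.9}{10} + 0.1} = \frac{1}{0.19} \approx 5.26 \\
s_{Amdahl}(1000) &= \frac{1}{\frac{0.9}{1000} + 0.1} = \frac{1}{0.1009} \approx 9.91
\end{align*}
```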

[Gustafson1988] Gustafson-Barsis' Law
Consider a scaled speedup scenario, allowing a variable workload size $w$.
Amdahl ~ What is the shortest execution time for a given workload?
Gustafson-Barsis ~ What is the largest workload for a given execution time?
[Figure: fixed execution time $T$ split into $T_{par}$ and $T_{seq}$, for 1 and for $N$ execution units]
$$w(1) \sim T_{par} + T_{seq}$$
$$w(N) \sim N \cdot T_{par} + T_{seq}$$
Assumption: The parallelizable part of a workload contributes useful work when replicated.

[Gustafson1988] Gustafson-Barsis' Law
Determine the scaled speedup $s_{Gustafson}(N)$ through the increase in workload size $w(N)$ over the fixed execution time $T$:
$$s_{Gustafson}(N) = \frac{w(N)}{w(1)} = \frac{T \cdot (P \cdot N + (1 - P))}{T \cdot (P + (1 - P))} = P \cdot N + (1 - P)$$
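
To see how differently the two laws behave for the same parallel fraction, the following sketch evaluates both side by side. The function names and the value $P = 0.9$ are illustrative assumptions.

```python
def gustafson_speedup(p, n):
    """Scaled speedup: workload grows with n while the execution time stays fixed."""
    return p * n + (1.0 - p)

def amdahl_speedup(p, n):
    """Speedup for a fixed workload, shown here only for comparison."""
    return 1.0 / (p / n + (1.0 - p))

P = 0.9  # illustrative parallel fraction
for n in (1, 10, 100, 1000):
    print(f"N={n:5d}  scaled speedup={gustafson_speedup(P, n):7.1f}  "
          f"fixed-size speedup={amdahl_speedup(P, n):5.2f}")
```

The scaled speedup grows linearly in $N$ with slope $P$, while the fixed-size speedup saturates near 10. The two numbers answer different questions and are not directly comparable measurements of the same run.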

[Gustafson1988] Gustafson-Barsis' Law
[Figure: Gustafson-Barsis' Law, labeled P = 90% and P = 50%. By Peahihawaii - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=12630392]

[Karp1990] Karp-Flatt-Metric
Parallel fraction $P$ is a hypothetical parameter and not easily deduced from a given workload.
→ Karp-Flatt-Metric determines sequential fraction $Q = 1 - P$ empirically
1. Measure baseline execution time $T_1$ by executing workload on a single execution unit
2. Measure parallelized execution time $T(N)$ by executing workload on $N$ execution units
3. Determine speedup $s(N) = \frac{T_1}{T(N)}$
4. Calculate Karp-Flatt-Metric
$$Q(N) = \frac{\frac{1}{s(N)} - \frac{1}{N}}{1 - \frac{1}{N}}$$
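
The four steps above can be sketched as a small helper. The measured times in the example are made-up numbers for illustration; in practice they would come from timing the workload.

```python
def karp_flatt(t1, tn, n):
    """Empirical sequential fraction Q(N) from two measured execution times.

    t1: execution time on a single execution unit
    tn: execution time on n execution units
    """
    if n < 2:
        raise ValueError("the metric is only defined for n >= 2")
    speedup = t1 / tn                                        # step 3
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)       # step 4

# Made-up measurements: 100 s on one unit, 20 s on eight units.
q = karp_flatt(100.0, 20.0, 8)
print(f"speedup = {100.0 / 20.0:.1f}, Q(8) = {q:.4f}")
```

For these numbers the speedup is 5 and the estimated sequential fraction is roughly 8.6%.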

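[Karp1990] Karp-Flatt-Metric
The Karp-Flatt-Metric is derived by rearranging Amdahl's Law.
$$s(N) = \frac{T_1}{T(N)}; \quad T(N) = \left(\frac{1 - Q}{N} + Q\right) \cdot T_1$$
$$\frac{1}{s(N)} = \frac{\left(\frac{1 - Q}{N} + Q\right) \cdot T_1}{T_1}$$
$$\frac{1}{s(N)} = \frac{1 - Q}{N} + Q = \frac{1}{N} + \left(1 - \frac{1}{N}\right) \cdot Q$$
$$\frac{1}{s(N)} - \frac{1}{N} = \left(1 - \frac{1}{N}\right) \cdot Q$$
$$\frac{\frac{1}{s(N)} - \frac{1}{N}}{1 - \frac{1}{N}} = Q$$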

[Karp1990] Karp-Flatt-Metric
Observing $Q(N)$ for different $N$ indicates how the workload reacts to different degrees of parallelism:
■ $Q(N)$ close to 0 ~ high parallel fraction, workload benefits from parallelization
■ $Q(N)$ close to 1 ~ low parallel fraction, workload cannot use parallel resources
■ $Q(N)$ increases with $N$ ~ workload suffers from parallelization overhead
■ $Q(N)$ decreases with $N$ ~ workload scales well
Observing $Q(N)$ for different implementation variants of the workload can reveal bottlenecks.
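
The following sketch shows how a growing $Q(N)$ exposes overhead. The timing model is a hypothetical assumption made for this illustration: an Amdahl-style execution time plus an overhead term that grows linearly with the number of units. Real measurements would replace `synthetic_time`.

```python
def karp_flatt(t1, tn, n):
    """Empirical sequential fraction Q(N); tn / t1 equals 1 / s(N)."""
    return (tn / t1 - 1.0 / n) / (1.0 - 1.0 / n)

def synthetic_time(t1, q, n, overhead_per_unit=0.0):
    """Hypothetical timing model: Amdahl-style time plus overhead growing with n."""
    return t1 * ((1.0 - q) / n + q) + overhead_per_unit * n

T1 = 100.0                 # assumed baseline time
UNITS = (2, 4, 8, 16)

# Same sequential fraction (5%), once without and once with per-unit overhead.
ideal = [karp_flatt(T1, synthetic_time(T1, 0.05, n), n) for n in UNITS]
with_overhead = [karp_flatt(T1, synthetic_time(T1, 0.05, n, 0.2), n) for n in UNITS]

for n, a, b in zip(UNITS, ideal, with_overhead):
    print(f"N={n:2d}  Q without overhead={a:.4f}  Q with overhead={b:.4f}")
```

Without overhead the metric stays constant at the true sequential fraction. With the overhead term it rises steadily with $N$, which is the pattern the slide associates with parallelization overhead.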

[Leiserson2008] A More Detailed View
Directed Acyclic Graph to model a workload:
■ Nodes represent operations
■ Edges express dependencies between operations
[Figure: example DAG with nodes 1 to 18; $T_1 = 18$, $T_\infty = 9$]
Work $T$: total workload execution time
$T_1$: execution time with a single processor ~ number of nodes
$T_P$: execution time with $P$ processors (here $P$ denotes the processor count, not the parallel fraction)
$T_\infty$: execution time with arbitrary number of processors ~ graph diameter
Work Law: $T_P \geq \frac{T_1}{P}$ (processors cannot process multiple operations at once)
Span Law: $T_P \geq T_\infty$ (execution order cannot break dependencies)
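
Work and span can be computed directly from a dependency graph. The sketch below uses a small hypothetical DAG (not the one drawn on the slide) and assumes every operation takes one time unit; it relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or newer).

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each node maps to the set of nodes it depends on.
deps = {
    "a": set(),
    "b": {"a"},
    "c": {"a"},
    "d": {"b"},
    "e": {"b", "c"},
    "f": {"d", "e"},
}

def work_and_span(deps):
    """Return (T_1, T_inf) for unit-cost operations: node count and longest dependency chain."""
    depth = {}
    for node in TopologicalSorter(deps).static_order():
        # A node can start only after its deepest predecessor has finished.
        depth[node] = 1 + max((depth[p] for p in deps.get(node, ())), default=0)
    return len(depth), max(depth.values())

def lower_bound(work, span, processors):
    """Combine Work Law and Span Law into one lower bound on T_P."""
    return max(work / processors, span)

work, span = work_and_span(deps)
print(f"T_1 = {work}, T_inf = {span}")
for p in (1, 2, 4):
    print(f"P={p}: T_P >= {lower_bound(work, span, p)}")
```

Applied to the values given on the slide ($T_1 = 18$, $T_\infty = 9$), `lower_bound(18, 9, 2)` already equals the span of 9, so more than two processors cannot lower this bound any further.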

Literature
[Amdahl1967] Amdahl, Gene M. "Validity of the single processor approach to achieving large scale computing capabilities." Proceedings of the AFIPS Spring Joint Computer Conference. 483-485. 1967.
[Gustafson1988] Gustafson, John L. "Reevaluating Amdahl's law." Communications of the ACM 31.5 (1988): 532-533.
[Karp1990] Karp, Alan H. and Flatt, Horace P. "Measuring parallel processor performance." Communications of the ACM 33.5 (1990): 539-543.
[Leiserson2008] Leiserson, Charles E. and Mirman, Ilya B. "How to survive the multicore software revolution (or at least survive the hype)." Cilk Arts 1 (2008): 11.

And now for a break and a cup of Oolong*.
*or beverage of your choice
