A Case for Parallelism Profilers and Advisers with What-If Analyses


  1. A Case for Parallelism Profilers and Advisers with What-If Analyses. Santosh Nagarakatte, Rutgers University, USA. Workshop on Dependable and Secure Software Systems @ ETH Zurich, October 2019.

  2. Is Parallel Programming Hard, And, If So, What Can You Do About It? “Parallel programming has earned a reputation as one of the most difficult areas a hacker can tackle. Papers and textbooks warn of the perils of deadlock, livelock, race conditions, non-determinism, Amdahl’s-Law limits to scaling, and excessive real-time latencies. And these perils are quite real; we authors have accumulated uncounted years of experience dealing with them, and all of the emotional scars, grey hairs, and hair loss that go with such experiences.” [McKenney:arXiv17] Main reasons: use of the wrong abstraction, and a lack of performance analysis and debugging tools.

  3. Illustrative Example. A student in my class is asked to write a parallel program: given a range of integers (0 to n), find all the prime numbers in the range, perform a computation on the primes, and output the result.

  4. Illustrative Example – Writing a Parallel Program. OpenMP is feature rich (work-sharing, tasking, SIMD, offload) and supports incremental parallelization:

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        compute(i);
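To make the example concrete, here is a minimal sketch of such a work-sharing primes program. is_prime is a hypothetical stand-in for the primality test and the per-prime computation from the slides, not code from the talk:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical trial-division primality test. */
    static bool is_prime(long x) {
      if (x < 2) return false;
      for (long d = 2; d * d <= x; ++d)
        if (x % d == 0) return false;
      return true;
    }

    int main(void) {
      long n = 10000000, count = 0;
      /* Work-sharing: OpenMP splits the iteration range across
         the threads of the parallel region. */
      #pragma omp parallel for reduction(+:count)
      for (long i = 0; i <= n; ++i)
        if (is_prime(i)) count++;   /* stand-in for compute(i) */
      printf("%ld primes in [0, %ld]\n", count, n);
      return 0;
    }

With the common default static schedule, this divides the range into equal-sized chunks per thread, which is exactly the division the next slide examines.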

  5. Illustrative Example. [Figure: the range 1..n divided into four contiguous chunks.] The student identifies the number of processors on the machine (4), divides the range into 4 parts, and performs the computation. Run: ./primes. Speedup: 1.8X over serial execution. Why? Load imbalance: the chunks are equal in size but not in work, since testing larger numbers for primality costs more.

  6. Need to Write Performance-Portable Code – Advocacy for Task Parallelism. [Figure: the range 1..n split into many tasks T1, T2, T3, ..., Tm, fed to a runtime that schedules them on processors P1, P2, ..., Pk.] Express all the parallelism as tasks; the runtime dynamically balances load by assigning tasks to idle threads.

  7. Illustrative Example. The student expresses the parallel work in terms of tasks (see the sketch below). Run: ./primes_tasks. Speedup: 3.8X over serial execution on 4 cores. But is it performance portable?
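A sketch of one plausible task-based rewrite, reusing the hypothetical is_prime from the earlier sketch; the chunk size and structure are illustrative, not the student's actual code:

    #include <stdbool.h>

    bool is_prime(long x);          /* as in the earlier sketch */

    #define CHUNK 4096

    long count_primes_tasks(long n) {
      long count = 0;
      #pragma omp parallel
      #pragma omp single
      for (long lo = 0; lo <= n; lo += CHUNK) {
        long hi = (lo + CHUNK - 1 < n) ? lo + CHUNK - 1 : n;
        /* One task per chunk: idle threads pick up pending tasks,
           so uneven chunk costs no longer cause load imbalance. */
        #pragma omp task firstprivate(lo, hi) shared(count)
        {
          long local = 0;
          for (long i = lo; i <= hi; ++i)
            if (is_prime(i)) local++;
          #pragma omp atomic
          count += local;
        }
      }   /* implicit barriers at the end of single/parallel wait for all tasks */
      return count;
    }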

  8. Performance Debugging Tools: Gprof, OProfile, ARM MAP, Coz, NVProf, Intel VTune, Intel Advisor. • Most of them report the frequently executed regions of the program. • Critical-path information is useful. • Coz [SOSP 2015] identifies whether a line of code matters in increasing speedup on a given machine.

  9. Our Parallelism Profilers and Advisers: TaskProf & OMP-WHIP [FSE 2017, SC 2018, PLDI 2019] • Making a case for measuring logical parallelism: series-parallel relations plus fine-grained measurements serve as the performance model. • Where should the programmer focus? Regions with low parallelism imply serialization: the critical path! (Profiler) • Does it matter? Automatically identify regions to increase parallelism to a threshold: what-if analyses mimic the effect of parallelization, and differential analyses identify regions with secondary effects. (Adviser) • The approach is general across multiple parallelism models; this talk focuses on OpenMP.

  10. Performance Model for Logical Parallelism and What-If Analyses

  11. Performance Model for Computing Parallelism. Goal: profile on a machine with a low core count and identify scalability bottlenecks. • OSPG: logical series-parallel relations between parts of an OpenMP program, inspired by prior work (DPST [PLDI 2012], SP parse tree [SPAA 2015]). • OSPG + fine-grained measurements.

  12. OpenMP Series-Parallel Graph (OSPG) • A data structure that captures series-parallel relations • Inspired by the Dynamic Program Structure Tree [PLDI 2012] • The OSPG is an ordered tree in the absence of task dependencies in OpenMP • Handles the combination of work-sharing (fork-join programs with threads) and tasking • Precisely captures the semantics of OpenMP • Three kinds of nodes: W-, S-, and P-nodes, analogous to the step, finish, and async nodes of the DPST

  13. Code Fragments in OpenMP Programs. OpenMP code snippet:

    a();
    #pragma omp parallel
    b();
    c();

Execution structure: a runs first, the parallel instances of b run concurrently, then c runs. A code fragment is the longest sequence of instructions in the dynamic execution before encountering an OpenMP construct.

  14. Capturing the Series-Parallel Relation with the OSPG. [Figure: OSPG for the example, with W-nodes W1 (fragment a), W2 and W3 (the two parallel instances of b), and W4 (fragment c), S-nodes S1 and S2, and P-nodes P1 and P2.] W-nodes capture computation: a maximal sequence of dynamic instructions between two OpenMP directives. P-nodes capture the parallel relation: nodes in the subtree of a P-node logically execute in parallel with the right siblings of the P-node. S-nodes capture the series relation: nodes in the subtree of an S-node logically execute in series with the right siblings of the S-node.

  15. Capturing the Series-Parallel Relation with the OSPG. Determine the series-parallel relation between any pair of W-nodes with an LCA query: check the type of the LCA's child on the path to the left W-node. If it is a P-node, they execute in parallel; otherwise, they execute in series. Example: S2 = LCA(W2, W3) and P1 = Left-Child(S2, W2, W3); P1 is a P-node, so W2 and W3 logically execute in parallel.

  16. Capturing the Series-Parallel Relation with the OSPG. The same query on W2 and W4: S1 = LCA(W2, W4) and S2 = Left-Child(S1, W2, W4); S2 is an S-node, so W2 and W4 logically execute in series.
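The query can be sketched in a few lines of C. This is a minimal illustration, assuming each OSPG node stores its parent and depth; the names (Node, logically_parallel) are illustrative, not TaskProf/OMP-WHIP's actual API:

    #include <stdbool.h>

    typedef enum { W_NODE, S_NODE, P_NODE } NodeKind;

    typedef struct Node {
      NodeKind kind;
      struct Node *parent;
      int depth;                       /* root S0 has depth 0 */
    } Node;

    /* Series-parallel query for two distinct W-nodes, with `left`
       occurring before `right` in the tree's left-to-right order. */
    static bool logically_parallel(Node *left, Node *right) {
      /* Lift both nodes to the same depth. */
      while (left->depth > right->depth)  left  = left->parent;
      while (right->depth > left->depth)  right = right->parent;
      /* Lift in lockstep until just below the LCA; `left` is then
         the LCA's child on the path to the left W-node. */
      while (left->parent != right->parent) {
        left  = left->parent;
        right = right->parent;
      }
      return left->kind == P_NODE;     /* P-node child => parallel */
    }

On the example OSPG, logically_parallel(W2, W3) ends with left = P1 and reports parallel, while logically_parallel(W2, W4) ends with left = S2 and reports series, matching the two slides.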

  17. Profiling an OpenMP Merge Sort Program. A merge sort program parallelized with OpenMP:

    void main() {
      int* arr = init(&n);
      #pragma omp parallel
      #pragma omp single
      mergeSort(arr, 0, n);
    }

    void mergeSort(int* arr, int s, int e) {
      if (e - s <= CUT_OFF) {
        serialSort(arr, s, e);    /* sort small ranges serially */
        return;
      }
      int mid = s + (e - s) / 2;
      #pragma omp task
      mergeSort(arr, s, mid);
      #pragma omp task
      mergeSort(arr, mid + 1, e);
      #pragma omp taskwait        /* wait for both halves */
      merge(arr, s, e);
    }

  18. OSPG Construction.

    void main() {
      int* arr = init(&n);
      #pragma omp parallel
      #pragma omp single
      mergeSort(arr, 0, n);
    }

[Figure: partial OSPG with root S0, W-node W0 for the code before the parallel region, S-node S1 for the parallel region with P-nodes P0 and P1, and W1 for the single region.]

  19. OSPG Construction (continued).

    void mergeSort(int* arr, int s, int e) {
      if (e - s <= CUT_OFF) {
        serialSort(arr, s, e);
        return;
      }
      int mid = s + (e - s) / 2;
      #pragma omp task
      mergeSort(arr, s, mid);
      #pragma omp task
      mergeSort(arr, mid + 1, e);
      #pragma omp taskwait
      merge(arr, s, e);
    }

[Figure: the OSPG grows under W1's region: the two omp tasks create P-nodes P2 and P3 with W-nodes W2 and W3, and W-nodes W4 and W5 capture the fragments around the tasks, under S-node S2.]

  20. Parallelism Computation Using the OSPG

  21. Compute Parallelism. Measure the work in each W-node with fine-grained measurements, then compute the work of each internal node from its children. [Figure: the OSPG annotated with work values: W0 = 6, W1 = 52, W4 = 2, W2 = W3 = 100; internal nodes S2 = 200, S1 = 254, S0 = 260.]

  22. Compute Serial Work. Three steps: measure the work in each W-node; compute the work of each internal node; identify the serial work, i.e., the work on the critical path.

  23. Compute Serial Work (continued). Compute the serial work of each internal node bottom-up: children that execute in series contribute their sum, while among children that execute in parallel only the longest path (the critical path) contributes. [Figure: the OSPG annotated with work (W) and serial work (SW): W2 and W3 each have W = SW = 100, so S2 has W = 200 but SW = 100; S1 has W = 254, SW = 52 + 100 + 2 = 154; S0 has W = 260, SW = 6 + 154 = 160.]
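Both quantities can be computed in one bottom-up pass over the OSPG. The sketch below is one way to realize the rule from the slides; the right-to-left loop handles the fact that a P-node's subtree runs in parallel with its right siblings. Field and function names are illustrative:

    typedef enum { W_NODE, S_NODE, P_NODE } NodeKind;

    typedef struct GNode {
      NodeKind kind;
      long work;              /* measured for W-nodes, else computed */
      long serial_work;       /* work on the critical path */
      int nchildren;
      struct GNode **children;
    } GNode;

    static long max_l(long a, long b) { return a > b ? a : b; }

    static void compute_work(GNode *n) {
      if (n->kind == W_NODE) {          /* leaf: fully serial */
        n->serial_work = n->work;
        return;
      }
      n->work = 0;
      long cp = 0;                      /* critical path so far */
      /* Right to left: a P-node's subtree overlaps everything to
         its right, so it contributes via max(); series children add. */
      for (int i = n->nchildren - 1; i >= 0; --i) {
        GNode *c = n->children[i];
        compute_work(c);
        n->work += c->work;
        cp = (c->kind == P_NODE) ? max_l(c->serial_work, cp)
                                 : cp + c->serial_work;
      }
      n->serial_work = cp;
    }

On the example this reproduces the annotations: S2 gets max(100, 100) = 100, S1 gets 52 + 100 + 2 = 154, and S0 gets 6 + 154 = 160, so the root's parallelism is 260/160 = 1.625.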

  24. Source Code Attribution. Map OSPG nodes back to the source: main at line 1, omp parallel at line 3, omp task at line 11, omp task at line 13. Aggregate parallelism at OpenMP constructs. [Figure: the annotated OSPG with each subtree labeled by its OpenMP construct and source line.]

  25. Parallelism Profile.

    Line number       Work   Serial work   Parallelism   Critical path %
    program:1         260    160           1.625          3.75
    omp parallel:3    254    154           1.65          33.75
    omp task:11       100    100           1.00          62.5
    omp task:13       100    100           1.00           0
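Reading the first row: the whole program performs 260 units of work, of which 160 lie on the critical path, so its parallelism is 260/160 = 1.625; no schedule on any number of cores can beat that bound. The profile also shows where the critical path lives: the omp task at line 11 alone accounts for 62.5% of it and has parallelism 1.00, i.e., it is entirely serial and is the natural target for optimization.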

  26. Identify what parts of the code matter in increasing parallelism

  27. Adviser Mode with What-If Analyses. Identify the code regions that must be optimized to increase parallelism. Which region to select? Select the step node (W-node) performing the highest work on the critical path. [Figure: on the example OSPG (leaf work 6, 52, 2, 100, 100), the 100-unit W-nodes on the critical path are selected.]

  28. Adviser Mode with What-If Analyses. Identify all W-nodes corresponding to the selected region and perform what-if analyses, i.e., recompute the profile as if that region were parallelized (see the sketch below). What-If Profile:

    Line   Work   CWork   Parallelism   Critical path %
    1      260    85      3.05           7.05
    3      254    79      1.65          63.5
    11     100    25      4.00          29.45
    13     100    25      4.00           0

Repeat until the threshold parallelism is reached.
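A sketch of the idea behind the what-if pass, reusing the GNode shape from the earlier sketch with an added selected flag. Scaling the selected W-nodes' critical-path contribution by a hypothetical anticipated-parallelism factor P is an illustration consistent with the numbers on the slide, not the tool's exact mechanism:

    typedef enum { W_NODE, S_NODE, P_NODE } NodeKind;

    typedef struct GNode {
      NodeKind kind;
      int selected;           /* 1 if this W-node is in the chosen region */
      long work, serial_work;
      int nchildren;
      struct GNode **children;
    } GNode;

    /* Recompute serial work as if the selected region were
       parallelized: a selected W-node contributes only work/P to
       the critical path; total work is unchanged. */
    static void what_if_pass(GNode *n, long P) {
      if (n->kind == W_NODE) {
        n->serial_work = n->selected ? n->work / P : n->work;
        return;
      }
      long cp = 0;
      for (int i = n->nchildren - 1; i >= 0; --i) {
        GNode *c = n->children[i];
        what_if_pass(c, P);
        cp = (c->kind == P_NODE)
               ? (c->serial_work > cp ? c->serial_work : cp)
               : cp + c->serial_work;
      }
      n->serial_work = cp;
    }

With P = 4 and the two 100-unit task bodies selected, their contribution drops to 25 each; S2's critical path becomes 25, S1's 52 + 25 + 2 = 79, and the root's 6 + 79 = 85, which matches the CWork column above (parallelism 260/85 ≈ 3.05).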

  29. Tasking and Scheduling Overhead. [Figure: the OSPG annotated with runtime overhead; relates measured parallelism, tasking/scheduling overhead, and achievable speedup.]

  30. Adviser Mode with What-If Analyses. Identify the code regions that must be optimized to increase parallelism: select the step node with the highest work on the critical path, perform the what-if analysis, and repeat until the threshold parallelism is reached OR the work of the highest step node falls below K times the average tasking overhead.

    Line   Work   CWork   Parallelism   Critical path %
    1      260    85      3.05           7.05
    3      254    79      1.65          63.5
    11     100    25      4.00          29.45
    13     100    25      4.00           0
