

DiscoPoP: A Profiling, Analysis, and Visualization Tool for Parallelism Discovery
10th International Parallel Tools Workshop
Rohit Atre, Zia Ul-Huda, Mohammad Norouzi, Arya Mazaheri, Zhen Li, Dr. Ali Jannesari, Prof. Felix Wolf
10/12/2016


  1. DiscoPoP: A Profiling, Analysis, and Visualization Tool for Parallelism Discovery
     10th International Parallel Tools Workshop
     Rohit Atre, Zia Ul-Huda, Mohammad Norouzi, Arya Mazaheri, Zhen Li, Dr. Ali Jannesari, Prof. Felix Wolf
     10/12/2016 | Department of Computer Science | Laboratory for Parallel Programming | Rohit Atre

  2. Introduction
     • A large number of legacy programs need to be parallelized
     • Transforming an existing sequential program into a parallel one is not easy
     • DiscoPoP (Discovery of Potential Parallelism) is a tool to detect parallelism in sequential applications
     • Detects hotspots in the sequential applications
     • Gives programmers hints that make the parallelization process easier

  3. DiscoPoP Workflow
     [Figure: workflow diagram in three phases (Phase 1, Phase 2, Phase 3). Box labels: Source Code, Conversion to LLVM IR, Control-flow Analysis, Control Region Info, Data Dependency Analysis, Computational Unit (CU) Analysis, CU Graph, Parallelism Discovery, Ranking, Ranked Parallel Opportunities.]

  4. Outline
     • Profiler
     • Computational Units & Program graph
     • Applications
     • Evaluation
     • Future Work

  5. Profiling with signatures
     • A signature is usually implemented as a Bloom filter:
       − A fixed-size bit array
       − k different hash functions that together map an element to a number of array indices
     • Two signatures: one for recording read operations and one for recording write operations

  6. Profiling with signatures
     [Figure: contents of the write signature and the read signature after each access in the sequence a. write 3, b. read 2, c. read 1, d. write 2. Each slot records the source line of the last access to that address. At step d, the write to address 2 finds the previous read at 2 in line b, so the dependence d WAR b|2 is reported.]
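
The signature scheme on slides 5 and 6 can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from the slides: a single hash function (address modulo the table size) stands in for the k hash functions, each slot stores the source line of the last access instead of a single bit, and a write reports only one dependence (WAR if an earlier read is recorded, otherwise WAW). The names `Signature`, `sig_slot`, `on_read`, and `on_write` are hypothetical and are not DiscoPoP's API. As with any fixed-size signature, two addresses that hash to the same slot can produce a false dependence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIG_SLOTS 8   /* assumed fixed signature size */
#define NO_ACCESS 0   /* slot value meaning "no access recorded" */

/* One signature: slot i holds the source line of the last access whose
   address hashed to i, or NO_ACCESS. */
typedef struct {
    char last_line[SIG_SLOTS];
} Signature;

typedef enum { DEP_NONE = 0, DEP_RAW, DEP_WAR, DEP_WAW } DepKind;

typedef struct {
    DepKind kind;
    char source_line;   /* earlier access that the current one depends on */
} Dependence;

static void sig_clear(Signature *s)
{
    memset(s->last_line, NO_ACCESS, sizeof s->last_line);
}

/* Assumed single hash function: address modulo table size. */
static size_t sig_slot(size_t address)
{
    return address % SIG_SLOTS;
}

/* Record a read. A previous write to the same slot yields a RAW dependence. */
static Dependence on_read(Signature *reads, const Signature *writes,
                          size_t address, char line)
{
    Dependence d = { DEP_NONE, NO_ACCESS };
    size_t slot = sig_slot(address);
    assert(reads && writes);
    if (writes->last_line[slot] != NO_ACCESS) {
        d.kind = DEP_RAW;
        d.source_line = writes->last_line[slot];
    }
    reads->last_line[slot] = line;
    return d;
}

/* Record a write. A previous read yields WAR; otherwise a previous write
   yields WAW (simplification: only one dependence is reported). */
static Dependence on_write(const Signature *reads, Signature *writes,
                           size_t address, char line)
{
    Dependence d = { DEP_NONE, NO_ACCESS };
    size_t slot = sig_slot(address);
    assert(reads && writes);
    if (reads->last_line[slot] != NO_ACCESS) {
        d.kind = DEP_WAR;
        d.source_line = reads->last_line[slot];
    } else if (writes->last_line[slot] != NO_ACCESS) {
        d.kind = DEP_WAW;
        d.source_line = writes->last_line[slot];
    }
    writes->last_line[slot] = line;
    return d;
}
```

Replaying the access sequence from slide 6 (a. write 3, b. read 2, c. read 1, d. write 2) against two cleared signatures reports no dependence for the first three accesses and a WAR dependence on line b for the final write.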

  7. Parallel data-dependence profiling
     [Figure: the main thread (“producer”) distributes memory accesses; worker threads (“consumers”) fetch them. Each worker keeps its own read signature, write signature, and local dependence storage; the local results are merged into the global dependence storage.]

  8. Profiling multithreaded programs
     • Allows program analyses for multithreaded applications
       − Communication pattern detection
       − Scheduling
       − Performance tuning

  9. Outline
     • Profiler
     • Computational Units & Program graph
     • Applications
     • Evaluation
     • Future Work

  10. Computational Unit (CU)
      • A collection of program statements
      • Follows the read-compute-write pattern
      • Logical units to make larger tasks
        • Could be merged together
        • Assigned to threads
      • Building blocks for various patterns
        • Tasks in a taskpool
        • Stages of a pipeline

      Example code:
          x = 3
          y = 4
          a = x + rand() / x
          b = x - rand() / x
          x = a + b
          a = y + rand() / y
          b = y - rand() / y
          y = a + b

      Resulting CUs:
          CUx: x = 3; a = x + rand() / x; b = x - rand() / x; x = a + b
          CUy: y = 4; a = y + rand() / y; b = y - rand() / y; y = a + b

  11. Computational Unit (CU) – Example
          int x = 3;
          for (int i = 0; i < MAX_ITER; ++i) {
              int a = x + rand() / x;
              int b = x - rand() / x;
              x = a + b;
          }
      The loop is labeled Region 1. Annotations on the code:
      • Identify variables global to each region
      • Add the global variable x to the read-set if it is read
      • Add the global variable x to the write-set when it is written
      Rules:
      • Every read on a global variable should happen before a corresponding write on it
      • For every read instruction that violates this rule, a new CU is created
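
The read-before-write rule on slide 11 can be sketched as a single pass over the accesses to region-global variables. This is a simplified illustration, not DiscoPoP's implementation: it assumes the accesses of one region are already available as a flat list, and the names `Access`, `form_cus`, and `MAX_GLOBALS` are hypothetical. Whenever a read hits a variable that the current CU has already written, the rule is violated and a new CU starts at that read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_GLOBALS 16   /* assumed upper bound on region-global variables */

typedef struct {
    int  var;        /* index of a region-global variable, 0..MAX_GLOBALS-1 */
    bool is_write;   /* true for a write, false for a read */
} Access;

/* Assigns a CU id to every access (cu_of[i]) and returns the number of CUs.
   A read of a variable already written in the current CU violates the
   read-before-write rule, so that read opens a new CU. */
static int form_cus(const Access *accesses, size_t n, int *cu_of)
{
    bool written[MAX_GLOBALS] = { false };   /* write-set of the current CU */
    int cu = 0;

    if (n == 0) {
        return 0;
    }
    for (size_t i = 0; i < n; ++i) {
        int v = accesses[i].var;
        assert(v >= 0 && v < MAX_GLOBALS);
        if (accesses[i].is_write) {
            written[v] = true;
        } else if (written[v]) {
            /* Violation: start a new CU with an empty write-set. */
            ++cu;
            for (int k = 0; k < MAX_GLOBALS; ++k) {
                written[k] = false;
            }
        }
        cu_of[i] = cu;
    }
    return cu + 1;
}
```

For one iteration of the loop on slide 11, the accesses to x are read, read, write, which form a single CU. Feeding two unrolled iterations (read, read, write, read, read, write) yields two CUs, because the first read of the second iteration follows the write from the first.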

  12. Program Graph (Output of phase-1)
      [Figure: program graph. Legend: Dependencies, Control-Flow, Children.]

  13. Program Graph Visualization

  14. Outline
      • Profiler
      • Computational Units & Program graph
      • Applications
      • Evaluation
      • Future Work

  15. Applications
      • Detection of parallel design patterns
      • Output of phase-1 is used for detecting:
        • Pipeline
        • Multiloop pipeline
        • Task parallelism
        • Geometric decomposition
        • Do-all loops
        • Reduction
      • Different approaches are used to detect these patterns

  16. Parallel Patterns

  17. Pipeline
      • Exists only if stages are executed many times
      • Loops, recursions and functions with multiple loops
      • Graph Matrix is computed from the CU Graph of the hotspot
      • Pipeline Matrix is created based on the number of CUs in the Graph Matrix
      • Pipeline Matrix has specific properties such as chain dependences and forward dependence weights

  18. Multi-loop Pipeline
      • Iterations of one loop depend on iterations of another loop
      • Each loop can be a stage of a pipeline
      • We profile loops and gather iteration dependence data

      Example code:
          for (. . .) // Loop x
              a[i] = foo(i);
          for (. . .) // Loop y
              b[i] = bar(a[i]);

      Results of profiled run:
          Variable | Iteration # of Loop x (I_x) | Iteration # of Loop y (I_y)
          a[0]     | 0                           | 0
          a[1]     | 1                           | 1
          …        | …                           | …
          a[n]     | n                           | n
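
A minimal sketch of how the iteration dependence table on slide 18 could be gathered, under stated assumptions: both loops have the same fixed trip count `N`, `foo` and `bar` are placeholder bodies, and the names `IterationProfile`, `profile_loops`, and `y_can_follow_x` are hypothetical. The final check encodes one plausible reading of the table (iteration j of loop y needs only iterations up to j of loop x); it is not presented as DiscoPoP's detection algorithm.

```c
#include <assert.h>
#include <stdbool.h>

#define N 8                 /* assumed trip count of both loops */
#define NOT_ACCESSED (-1)

typedef struct {
    int write_iter_x[N];    /* I_x: iteration of loop x that last wrote a[i] */
    int read_iter_y[N];     /* I_y: iteration of loop y that first read a[i] */
} IterationProfile;

static int foo(int i) { return 2 * i; }   /* placeholder loop body */
static int bar(int v) { return v + 1; }   /* placeholder loop body */

/* Runs both loops once and records, per element of a, which iteration of
   loop x wrote it and which iteration of loop y read it. */
static void profile_loops(IterationProfile *p, int a[N], int b[N])
{
    assert(p && a && b);
    for (int i = 0; i < N; ++i) {
        p->write_iter_x[i] = NOT_ACCESSED;
        p->read_iter_y[i] = NOT_ACCESSED;
    }
    for (int i = 0; i < N; ++i) {          /* Loop x */
        a[i] = foo(i);
        p->write_iter_x[i] = i;
    }
    for (int i = 0; i < N; ++i) {          /* Loop y */
        if (p->read_iter_y[i] == NOT_ACCESSED) {
            p->read_iter_y[i] = i;
        }
        b[i] = bar(a[i]);
    }
}

/* True if every element read by loop y was written by loop x in the same
   or an earlier iteration, so iteration j of y could start as soon as
   iteration j of x has finished. */
static bool y_can_follow_x(const IterationProfile *p)
{
    assert(p);
    for (int i = 0; i < N; ++i) {
        if (p->read_iter_y[i] == NOT_ACCESSED ||
            p->write_iter_x[i] == NOT_ACCESSED) {
            continue;                      /* no x-to-y dependence here */
        }
        if (p->write_iter_x[i] > p->read_iter_y[i]) {
            return false;
        }
    }
    return true;
}
```

For the example loops, the recorded table has I_x = I_y = i for every a[i], matching the profiled results on the slide, and `y_can_follow_x` returns true, which is consistent with treating each loop as a pipeline stage.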

  19. Fusion & Reduction
      • Loop fusion:
        • Fusion of loops x and y can occur if:
          • Both loops x and y are do-all loops
          • There are no loop-carried dependences
      • Reduction
        • State-of-the-art compilers may miss reductions due to pointer aliasing or array referencing
        • Dynamic analysis helps overcome limitations of static analysis
        • Detection approach same as multi-loop pipeline
        • Profile iterations of a single loop
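
As a worked example of the loop fusion condition, the sketch below takes loops x and y from slide 18 and compares the separate and fused forms. It assumes both loops are do-all loops with the same trip count and that iteration i of loop y reads only a[i]; `foo`, `bar`, `run_separate`, and `run_fused` are hypothetical names with placeholder bodies.

```c
#include <assert.h>
#include <stddef.h>

static int foo(size_t i) { return (int)(3 * i); }   /* placeholder */
static int bar(int v)    { return v - 1; }          /* placeholder */

/* Original form: loop x fills a, then loop y consumes it. */
static void run_separate(int *a, int *b, size_t n)
{
    assert(a && b);
    for (size_t i = 0; i < n; ++i) {   /* Loop x */
        a[i] = foo(i);
    }
    for (size_t i = 0; i < n; ++i) {   /* Loop y */
        b[i] = bar(a[i]);
    }
}

/* Fused form: valid here because iteration i of loop y depends only on
   iteration i of loop x and neither loop carries a dependence. */
static void run_fused(int *a, int *b, size_t n)
{
    assert(a && b);
    for (size_t i = 0; i < n; ++i) {
        a[i] = foo(i);
        b[i] = bar(a[i]);
    }
}
```

Both functions produce identical contents in a and b. The fused loop is still a do-all loop, so its iterations remain independent of each other.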

  20. Applications
      • SPMD/MIMD type task parallelism
        • Detection of independent sets of CUs that can run in parallel with each other
        • Detection of parallelism between different region levels
        • Detection of synchronization points between different parallel tasks

  21. Task Parallelism
      [Figure: program graph. Legend: Dependencies, Control-Flow, Children.]

  22. Task Parallelism
      • Using breadth-first search we classify the CUs into Fork, Worker and Barrier CUs
      • Two barriers can run in parallel if there is no directed path between them
      [Figure: Simplified CU Graph – Task Parallelism. Nodes 0 to 7. Legend: Unmarked task, Fork task, Worker task, Barrier task, Dependence, Current node.]
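
The classification on slide 22 can be sketched over a small dependence graph. The slide does not state the exact marking rules, so the rule used here is an assumption: the CU where the breadth-first search starts is the fork, a reachable CU with two or more incoming dependences is a barrier, and every other reachable CU is a worker. The parallel-barrier test follows the slide directly: two barriers can run in parallel if neither is reachable from the other. All names (`CuGraph`, `classify`, `can_run_in_parallel`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CUS 8

typedef enum { UNMARKED = 0, FORK, WORKER, BARRIER } TaskKind;

typedef struct {
    int  n;                        /* number of CUs in use */
    bool edge[MAX_CUS][MAX_CUS];   /* edge[u][v]: CU v depends on CU u */
} CuGraph;

static void add_dependence(CuGraph *g, int from, int to)
{
    assert(from >= 0 && from < g->n && to >= 0 && to < g->n);
    g->edge[from][to] = true;
}

static int in_degree(const CuGraph *g, int v)
{
    int d = 0;
    for (int u = 0; u < g->n; ++u) {
        d += g->edge[u][v] ? 1 : 0;
    }
    return d;
}

/* Breadth-first search: marks every CU reachable from start. */
static void bfs_reach(const CuGraph *g, int start, bool reached[MAX_CUS])
{
    int queue[MAX_CUS];
    int head = 0, tail = 0;

    memset(reached, 0, MAX_CUS * sizeof reached[0]);
    reached[start] = true;
    queue[tail++] = start;
    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < g->n; ++v) {
            if (g->edge[u][v] && !reached[v]) {
                reached[v] = true;
                queue[tail++] = v;
            }
        }
    }
}

/* Assumed rule: start is the fork, CUs with two or more incoming
   dependences are barriers, other reachable CUs are workers. */
static void classify(const CuGraph *g, int start, TaskKind kind[MAX_CUS])
{
    bool reached[MAX_CUS];

    bfs_reach(g, start, reached);
    for (int v = 0; v < MAX_CUS; ++v) {
        if (v >= g->n || !reached[v]) {
            kind[v] = UNMARKED;
        } else if (v == start) {
            kind[v] = FORK;
        } else if (in_degree(g, v) >= 2) {
            kind[v] = BARRIER;
        } else {
            kind[v] = WORKER;
        }
    }
}

/* Two CUs can run in parallel if no directed path connects them. */
static bool can_run_in_parallel(const CuGraph *g, int u, int v)
{
    bool from_u[MAX_CUS], from_v[MAX_CUS];

    bfs_reach(g, u, from_u);
    bfs_reach(g, v, from_v);
    return u != v && !from_u[v] && !from_v[u];
}
```

As an illustration, take a hypothetical eight-node graph (its edges are not taken from the slide's figure) in which CU 0 feeds CUs 1 to 4, CUs 1 and 2 feed CU 5, CUs 3 and 4 feed CU 6, and CUs 5 and 6 feed CU 7. Starting from CU 0, the sketch marks 0 as the fork, 1 to 4 as workers, and 5, 6, and 7 as barriers. Barriers 5 and 6 can run in parallel, while 5 and 7 cannot because 7 is reachable from 5.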

  23. CU Instantiation
          void cilksort(ELM *low, ELM *tmp, long size)
          {
              ...
              cilksort(A, tmpA, quarter);
              cilksort(B, tmpB, quarter);
              cilksort(C, tmpC, quarter);
              cilksort(D, tmpD, size - 3 * quarter);
              cilkmerge(A, A + quarter - 1, B, B + quarter - 1, tmpA);
              cilkmerge(C, C + quarter - 1, D, low + size - 1, tmpC);
              cilkmerge(tmpA, tmpC - 1, tmpC, tmpA + size - 1, A);
              ...
          }

          void cilkmerge(ELM *low1, ELM *high1,
                         ELM *low2, ELM *high2,
                         ELM *lowdest)
          {
              …
              cilkmerge(low1, split1 - 1, low2, split2, lowdest);
              cilkmerge(split1 + 1, high1, split2 + 1, high2, lowdest+lowsize+2);
              …
          }

  24. CU Instantiation - PET
      [Figure: PET whose nodes are cilksort and cilkmerge instances.]

  25. Applications
      • Energy efficient parallelism
        • Energy consumption per CU
        • Energy efficient pattern detection
        • Energy efficient task formation

  26. Energy Efficient Parallelism
      Energy optimization
      • Reduce memory accesses
      • Which OpenMP constructs to use?
      Energy efficient task formation
      • Considering CU attributes (data size, memory access frequency, etc.) to form tasks
      • Which OpenMP constructs to use?
