
DiscoPoP: A Profiling, Analysis, and Visualization Tool for Parallelism Discovery
10th International Parallel Tools Workshop
Rohit Atre, Zia Ul-Huda, Mohammad Norouzi, Arya Mazaheri, Zhen Li, Dr. Ali Jannesari, Prof. Felix Wolf
10/12/2016 | Department of Computer Science | Laboratory for Parallel Programming | Rohit Atre

Introduction
• A large number of legacy programs need to be parallelized
• Transforming an existing sequential program into a parallel one is not easy
• DiscoPoP (Discovery of Potential Parallelism) is a tool to detect parallelism in sequential applications
• Detects hotspots in sequential applications
• Gives hints to programmers, making the parallelization process easier

DiscoPoP Workflow
[Figure: workflow in three phases (Phase 1, Phase 2, Phase 3). Box labels: Source Code, Conversion to LLVM IR, Control-flow Analysis, Control Region Info, Data Dependency Analysis, Computational Unit (CU) Analysis, CU Graph, Parallelism Discovery, Ranking, Ranked Parallel Opportunities.]

Outline
• Profiler
• Computational Units & Program graph
• Applications
• Evaluation
• Future Work

Profiling with signatures
• A signature is usually implemented as a Bloom filter:
  − A fixed-size bit array
  − k different hash functions that together map an element to a number of array indices
• Two signatures: one for recording read operations and one for recording write operations
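The Bloom-filter idea above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the array size `SIG_BITS`, the hash count `SIG_K`, the mixing function in `sig_hash`, and all function names are hypothetical and are not taken from DiscoPoP.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIG_BITS 1024u  /* fixed size of the bit array (assumed value) */
#define SIG_K    3u     /* number of hash functions k (assumed value) */

typedef struct {
    uint8_t bits[SIG_BITS / 8];
} signature_t;

/* Hypothetical hash family: one 64-bit mixer, seeded differently for
   each of the k hash functions. Returns an index into the bit array. */
static uint32_t sig_hash(uintptr_t addr, uint32_t seed)
{
    uint64_t h = (uint64_t)addr + 0x9E3779B97F4A7C15ull * (seed + 1u);
    h ^= h >> 30; h *= 0xBF58476D1CE4E5B9ull;
    h ^= h >> 27; h *= 0x94D049BB133111EBull;
    h ^= h >> 31;
    return (uint32_t)(h % SIG_BITS);
}

static void sig_clear(signature_t *s)
{
    memset(s->bits, 0, sizeof s->bits);
}

/* Record an accessed address by setting the k bits it hashes to. */
static void sig_insert(signature_t *s, uintptr_t addr)
{
    for (uint32_t k = 0; k < SIG_K; ++k) {
        uint32_t i = sig_hash(addr, k);
        s->bits[i / 8] |= (uint8_t)(1u << (i % 8));
    }
}

/* Membership test: may report a false positive, never a false negative. */
static bool sig_maybe_contains(const signature_t *s, uintptr_t addr)
{
    for (uint32_t k = 0; k < SIG_K; ++k) {
        uint32_t i = sig_hash(addr, k);
        if (!(s->bits[i / 8] & (1u << (i % 8))))
            return false;
    }
    return true;
}
```

A profiler built this way would keep two `signature_t` values, one for reads and one for writes. Because membership is approximate, a hash collision can make an address look accessed when it was not.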

Profiling with signatures
[Figure: a write signature and a read signature updated by four accesses.]
    a. write 3
    b. read 2
    c. read 1
    d. write 2
• After all four accesses, the write signature holds a at slot 3 and d at slot 2; the read signature holds c at slot 1 and b at slot 2
• The write in line d finds the previous read at 2 in line b, so the dependence d WAR b|2 is reported
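The four-access example above can be replayed with a small sketch in which each signature slot stores the source line of the last access instead of a single bit. The slot count, the modulo slot mapping, the record layout, and every name below are assumptions made for illustration; whether the real profiler resets a slot after reporting is not stated in the slides and is left out here.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOTS    8u   /* tiny signature, for illustration only (assumed) */
#define MAX_DEPS 16u

typedef enum { DEP_RAW, DEP_WAR, DEP_WAW } dep_kind_t;

typedef struct {
    dep_kind_t kind;
    char sink;       /* line performing the current access */
    char source;     /* line of the earlier conflicting access */
    uintptr_t addr;
} dep_t;

typedef struct {
    char read_sig[SLOTS];   /* 0 = empty, otherwise line of last read */
    char write_sig[SLOTS];  /* 0 = empty, otherwise line of last write */
    dep_t deps[MAX_DEPS];
    size_t ndeps;
} profiler_t;

/* Single hypothetical hash: address modulo the number of slots. */
static size_t slot_of(uintptr_t addr) { return (size_t)(addr % SLOTS); }

static void record(profiler_t *p, dep_kind_t kind, char sink, char source,
                   uintptr_t addr)
{
    if (p->ndeps < MAX_DEPS)
        p->deps[p->ndeps++] = (dep_t){ kind, sink, source, addr };
}

/* A read after an earlier write to the same slot is a RAW dependence. */
static void on_read(profiler_t *p, char line, uintptr_t addr)
{
    size_t s = slot_of(addr);
    if (p->write_sig[s])
        record(p, DEP_RAW, line, p->write_sig[s], addr);
    p->read_sig[s] = line;
}

/* A write after an earlier read is WAR; after an earlier write, WAW. */
static void on_write(profiler_t *p, char line, uintptr_t addr)
{
    size_t s = slot_of(addr);
    if (p->read_sig[s])
        record(p, DEP_WAR, line, p->read_sig[s], addr);
    if (p->write_sig[s])
        record(p, DEP_WAW, line, p->write_sig[s], addr);
    p->write_sig[s] = line;
}
```

Calling `on_write(&p, 'a', 3)`, `on_read(&p, 'b', 2)`, `on_read(&p, 'c', 1)`, and `on_write(&p, 'd', 2)` on a zero-initialized `profiler_t` records exactly one dependence: a WAR with sink `d`, source `b`, and address 2, matching the slide. Two addresses that map to the same slot would produce a spurious dependence, which is the price of the fixed-size signature.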

Parallel data-dependence profiling
[Figure: the main thread ("producer") distributes work to worker threads ("consumers"), which fetch it. Each worker thread has its own read signature, write signature, and local dependence storage; the local dependence storages are merged into a global dependence storage.]

Profiling multithreaded programs
• Enables program analyses for multithreaded applications:
  − Communication pattern detection
  − Scheduling
  − Performance tuning

Computational Unit (CU)
• A collection of program statements
• Follows the read-compute-write pattern
• Logical units to make larger tasks
  − Could be merged together
  − Assigned to threads
• Building blocks for various patterns
  − Tasks in a taskpool
  − Stages of a pipeline

Example code:
    x = 3
    y = 4
    a = x + rand() / x
    b = x - rand() / x
    x = a + b
    a = y + rand() / y
    b = y - rand() / y
    y = a + b

Resulting CUs:
    CUx:
        x = 3
        a = x + rand() / x
        b = x - rand() / x
        x = a + b
    CUy:
        y = 4
        a = y + rand() / y
        b = y - rand() / y
        y = a + b

Computational Unit (CU) – Example
• Identify variables global to each region

    int x = 3;
    for (int i = 0; i < MAX_ITER; ++i) {  /* Region 1 */
        int a = x + rand() / x;           /* x is read: add the global variable x to the read-set */
        int b = x - rand() / x;           /* x is read again */
        x = a + b;                        /* x is written: add the global variable x to the write-set */
    }

• Every read on a global variable should happen before a corresponding write on it
• For every read instruction that violates this rule, a new CU is created
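The read-before-write rule above can be sketched as a single pass over the accesses to region-global variables. This is a simplified illustration, not DiscoPoP's implementation: the `access_t` record, the 32-variable bitmask, and the function `assign_cus` are hypothetical, and only the write-set is tracked.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char op;       /* 'R' for a read, 'W' for a write */
    unsigned var;  /* index of a region-global variable, 0..31 (assumed limit) */
} access_t;

/* Assigns each access to a CU and returns the number of CUs.
   A read of a variable that the current CU has already written
   violates read-before-write, so it opens a new CU. */
static size_t assign_cus(const access_t *acc, size_t n, size_t *cu_of)
{
    uint32_t write_set = 0;  /* globals written so far by the current CU */
    size_t cu = 0;

    for (size_t i = 0; i < n; ++i) {
        uint32_t bit = UINT32_C(1) << acc[i].var;
        if (acc[i].op == 'R' && (write_set & bit)) {
            ++cu;            /* violation: start a new CU */
            write_set = 0;
        }
        if (acc[i].op == 'W')
            write_set |= bit;
        cu_of[i] = cu;
    }
    return n ? cu + 1 : 0;
}
```

For the loop body in Region 1 the accesses to `x` are read, read, write, so the whole body falls into one CU. A sequence such as read, write, read, write on the same variable would be split into two CUs at the second read.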

Program Graph (Output of phase-1)
[Figure: program graph. Legend of edge types: Dependencies, Control-Flow, Children.]

Program Graph Visualization

Applications
• Detection of parallel design patterns
• Output of phase-1 is used for detecting:
  − Pipeline
  − Multiloop pipeline
  − Task parallelism
  − Geometric decomposition
  − Do-all loops
  − Reduction
• Different approaches are used to detect these patterns

Parallel Patterns

Pipeline
• Exists only if stages are executed many times
• Loops, recursions and functions with multiple loops
• Graph Matrix is computed from the CU Graph of the hotspot
• Pipeline Matrix is created based on the number of CUs in the Graph Matrix
• Pipeline Matrix has specific properties, such as chain dependences and forward dependence weights

Multi-loop Pipeline
• Iterations of one loop depend on iterations of another loop
• Each loop can be a stage of a pipeline
• We profile loops and gather iteration dependence data

Example code:
    for (. . .) // Loop x
        a[i] = foo(i);
    for (. . .) // Loop y
        b[i] = bar(a[i]);

Results of profiled run:
    Variable | Iteration # of Loop x (I_x) | Iteration # of Loop y (I_y)
    a[0]     | 0                           | 0
    a[1]     | 1                           | 1
    …        | …                           | …
    a[n]     | n                           | n
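One way to read the profiled table is as a list of (I_x, I_y) pairs, from which a tool can ask how far loop x must have progressed before a given iteration of loop y may start. The sketch below is an illustration under assumptions: the `iter_dep_t` record and both function names are hypothetical, and the slides do not say how DiscoPoP itself evaluates this data.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    long ix;  /* iteration of loop x that wrote the variable */
    long iy;  /* iteration of loop y that read it */
} iter_dep_t;

/* True if every iteration of loop y depends only on the loop-x
   iteration with the same index, as in the profiled table. */
static bool is_one_to_one(const iter_dep_t *deps, size_t n)
{
    for (size_t k = 0; k < n; ++k)
        if (deps[k].ix != deps[k].iy)
            return false;
    return true;
}

/* Last loop-x iteration that loop-y iteration iy must wait for,
   or -1 if iy does not depend on loop x at all. */
static long last_needed_x(const iter_dep_t *deps, size_t n, long iy)
{
    long last = -1;
    for (size_t k = 0; k < n; ++k)
        if (deps[k].iy == iy && deps[k].ix > last)
            last = deps[k].ix;
    return last;
}
```

With the data from the table, `is_one_to_one` holds and `last_needed_x` returns `i` for iteration `i`, so iteration `i` of loop y could run as a second pipeline stage as soon as iteration `i` of loop x has finished.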

Fusion & Reduction
• Loop fusion:
  − Fusion of loops x and y can occur if:
    − Both loops x and y are do-all loops
    − There are no loop-carried dependences
• Reduction
  − State-of-the-art compilers may miss reductions due to pointer aliasing or array referencing
  − Dynamic analysis helps overcome limitations of static analysis
  − Detection approach same as multi-loop pipeline
  − Profile iterations of a single loop
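To make the aliasing point concrete, the sketch below shows a reduction whose accumulator is reached through a pointer, followed by a version a programmer might write after a profile shows that the only loop-carried dependence is on the accumulator. The function names are hypothetical, and the OpenMP `reduction` clause is used here as one common way to express the result, not as DiscoPoP's output. A dynamic profile only covers the inputs that were run, so the parallel form assumes `acc` never points into `data`.

```c
#include <stddef.h>

/* Sequential form: because *acc is a pointer target, a compiler cannot
   easily prove that it does not alias data[i], and may therefore fail
   to recognize the loop as a reduction. */
static void sum_seq(const double *data, size_t n, double *acc)
{
    for (size_t i = 0; i < n; ++i)
        *acc += data[i];
}

/* Parallel form, assuming acc does not alias data. The accumulator is
   copied into a local variable so that it can appear in the reduction
   clause; without OpenMP enabled the pragma is ignored and the loop
   runs sequentially with the same result. */
static void sum_par(const double *data, size_t n, double *acc)
{
    double local = *acc;
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < (long)n; ++i)
        local += data[i];
    *acc = local;
}
```

Floating-point addition is not associative, so for general inputs the parallel sum can differ from the sequential one in the last bits.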

Applications
• SPMD/MIMD type task parallelism
• Detection of independent sets of CUs that can run in parallel with each other
• Detection of parallelism between different region levels
• Detection of synchronization points between different parallel tasks

Task Parallelism
[Figure: program graph. Legend of edge types: Dependencies, Control-Flow, Children.]

Task Parallelism
• Using breadth-first search, we classify the CUs into Fork, Worker and Barrier CUs
• Two barriers can run in parallel if there is no directed path between them
[Figure: Simplified CU Graph – Task Parallelism. Nodes 0 to 7. Legend: unmarked task, fork task, worker task, barrier task, dependence, current node.]
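The rule that two barriers can run in parallel when no directed path connects them amounts to a reachability test on the CU graph. Below is a minimal sketch using breadth-first search over an adjacency matrix. The graph representation, the size limit `MAX_CUS`, the edge direction, and the function names are assumptions for illustration; the Fork/Worker/Barrier classification itself is not reproduced, because the slides do not spell out its rules.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CUS 16  /* assumed upper bound on CUs in the sketch */

typedef struct {
    size_t n;                     /* number of CUs in use */
    bool edge[MAX_CUS][MAX_CUS];  /* edge[u][v]: directed edge u -> v */
} cu_graph_t;

/* Breadth-first search: is there a directed path of at least one edge
   from `from` to a different node `to`? */
static bool reachable(const cu_graph_t *g, size_t from, size_t to)
{
    bool seen[MAX_CUS] = { false };
    size_t queue[MAX_CUS];
    size_t head = 0, tail = 0;

    queue[tail++] = from;
    seen[from] = true;
    while (head < tail) {
        size_t u = queue[head++];
        for (size_t v = 0; v < g->n; ++v) {
            if (!g->edge[u][v] || seen[v])
                continue;
            if (v == to)
                return true;
            seen[v] = true;
            queue[tail++] = v;
        }
    }
    return false;
}

/* Two distinct CUs may run in parallel if neither reaches the other. */
static bool can_run_in_parallel(const cu_graph_t *g, size_t a, size_t b)
{
    return !reachable(g, a, b) && !reachable(g, b, a);
}
```

In a graph with edges 0→1, 0→2, and 1→3, nodes 1 and 2 can run in parallel, while nodes 1 and 3 cannot because 3 is reachable from 1.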

CU Instantiation

    void cilksort(ELM *low, ELM *tmp, long size)
    {
        ...
        cilksort(A, tmpA, quarter);
        cilksort(B, tmpB, quarter);
        cilksort(C, tmpC, quarter);
        cilksort(D, tmpD, size - 3 * quarter);

        cilkmerge(A, A + quarter - 1, B, B + quarter - 1, tmpA);
        cilkmerge(C, C + quarter - 1, D, low + size - 1, tmpC);
        cilkmerge(tmpA, tmpC - 1, tmpC, tmpA + size - 1, A);
        ...
    }

    void cilkmerge(ELM *low1, ELM *high1,
                   ELM *low2, ELM *high2,
                   ELM *lowdest)
    {
        …
        cilkmerge(low1, split1 - 1,
                  low2, split2,
                  lowdest);
        cilkmerge(split1 + 1, high1,
                  split2 + 1, high2,
                  lowdest + lowsize + 2);
        …
    }

CU Instantiation - PET
[Figure: PET whose nodes are instances of cilksort and cilkmerge.]

Applications
• Energy-efficient parallelism
• Energy consumption per CU
• Energy-efficient pattern detection
• Energy-efficient task formation

Energy Efficient Parallelism
Energy optimization
• Reduce memory accesses
• Which OpenMP constructs to use?
Energy-efficient task formation
• Considering CU attributes (data size, memory access frequency, etc.) to form tasks
• Which OpenMP constructs to use?
