DiscoPoP: A Profiling Tool to Identify Parallelization Opportunities Zhen Li, Rohit Atre, Zia Ul-Huda, Ali Jannesari, and Felix Wolf 02.10.2014
Outline • Background • Approach • Results 2 01.10.2014
Background
• Multicore CPUs dominate the desktop and server markets, but writing programs that exploit the hardware parallelism they offer remains a challenge.
• Today, software development is mostly the transformation of programs written by someone else rather than starting from scratch. [1]
• Parallelizing legacy sequential programs presents a huge economic challenge.
• Appropriate tool support is required.

[1] R. E. Johnson. Software development is program transformation. In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research, FoSER ’10, pages 177-180.

Related work
• Dynamic approaches
  Kremlin: “gprof in parallel age”
    o “available parallelism”
    o targets OpenMP-style loops
    o based on critical path analysis
  Alchemist
    o number of instructions / number of dependencies
    o control regions
    o counts data dependencies
- Previous dynamic approaches usually do not reveal the root causes that prevent parallelization, as the profiling overhead is too high.

Related work
• Static approaches
  Cetus
    o compiler infrastructure for source-to-source transformation
    o framework for writing automatic parallelization tools
  Parallware, Par4All, Polly, PLUTO, …
    o loop parallelism
    o automatic parallel code generation
    o mainly for scientific computing kernels
- Previous static approaches mainly focus on loop parallelism in the scientific computing area, since static dependence analysis is conservative and such kernels have more regular access patterns.

Our goal
• Discover potential parallelism in sequential programs
• Target parallelism:
    o DOALL loops
    o Pipeline
    o Tasking
• Reveal specific data dependences that prevent parallelization
• Efficient in time and space

Outline • Background • Approach • Results
Approach
• Work flow
[Figure: workflow in two phases. Phase 1 boxes: Source Code, Conversion to IR, Memory Access & Control-flow Instrumentation, Execution, Static Control-flow Analysis; outputs: Dependency Graph, Control Region Information. Phase 2 boxes: Parallelism Discovery, Ranking; output: Ranked Parallel Opportunities. Legend: static, dynamic.]

Approach
• Background
• Approach
    o Dependence profiling
    o Computational Unit and CU graph
    o Parallelism discovery
• Results

Dependence profiling
• Detailed data dependences with control-flow information

    1:60 BGN loop
    1:60 NOM {RAW 1:60|i} {WAR 1:60|i} {INIT *}
    1:63 NOM {RAW 1:59|temp1} {RAW 1:67|temp1}
    1:64 NOM {RAW 1:60|i}
    1:65 NOM {RAW 1:59|temp1} {RAW 1:67|temp1} {WAR 1:67|temp2} {INIT *}
    1:66 NOM {RAW 1:59|temp1} {RAW 1:65|temp2} {RAW 1:67|temp1} {INIT *}
    1:67 NOM {RAW 1:65|temp2} {WAR 1:66|temp1}
    1:70 NOM {RAW 1:67|temp1} {INIT *}
    1:74 NOM {RAW 1:41|block}
    1:74 END loop 1200

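Entries such as {RAW 1:65|temp2} appear to pair the line where a dependence ends (the sink) with the line where it starts (the source) and the variable involved. The sketch below shows one way such entries could be derived from a trace of memory accesses. It is an illustration under stated assumptions, not DiscoPoP's implementation: the `Access` record, the function `profileDependences`, the use of variable names in place of addresses, and the reading of INIT as "first write to a location" are all assumptions, and file IDs are omitted.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical record of one memory access (not DiscoPoP's data structure).
struct Access {
    int line;         // source line of the access, must be >= 0
    std::string var;  // variable name standing in for the memory address
    bool isWrite;     // true for a write, false for a read
};

// Returns entries of the form "<sink line> {TYPE <source line>|<var>}".
std::set<std::string> profileDependences(const std::vector<Access>& trace) {
    struct State {
        int lastWrite = -1;  // line of the most recent write, -1 if none yet
        int lastRead = -1;   // line of the most recent read since that write
    };
    std::map<std::string, State> shadow;  // per-variable access history
    std::set<std::string> deps;           // set semantics merge repeated entries

    for (const Access& a : trace) {
        assert(a.line >= 0);  // -1 is reserved as the "no access yet" marker
        State& s = shadow[a.var];
        const std::string sink = std::to_string(a.line);
        if (!a.isWrite) {
            // A read after a write is a read-after-write (RAW) dependence.
            if (s.lastWrite != -1)
                deps.insert(sink + " {RAW " + std::to_string(s.lastWrite) + "|" + a.var + "}");
            s.lastRead = a.line;
        } else {
            // Assumption: the first write to a location is reported as INIT.
            if (s.lastWrite == -1)
                deps.insert(sink + " {INIT *}");
            if (s.lastRead != -1)   // write after a pending read: WAR
                deps.insert(sink + " {WAR " + std::to_string(s.lastRead) + "|" + a.var + "}");
            else if (s.lastWrite != -1)  // write directly after a write: WAW
                deps.insert(sink + " {WAW " + std::to_string(s.lastWrite) + "|" + a.var + "}");
            s.lastWrite = a.line;
            s.lastRead = -1;  // earlier reads are no longer pending
        }
    }
    return deps;
}
```

Feeding it two loop iterations in which temp2 is written on line 65 and read on lines 66 and 67 yields an INIT and a WAR entry for line 65 and a RAW entry for each of lines 66 and 67, which mirrors the temp2 entries in the listing above.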
Dependence profiling
• Support for multithreaded programs

    4:58|2 NOM {WAR 4:77|2|iter}
    4:59|2 NOM {WAR 4:71|2|z_real}
    4:64|3 NOM {RAW 3:75|0|maxiter} {RAW 4:58|3|iter} {RAW 4:61|3|z_norm} {RAW 4:71|3|z_norm} {RAW 4:73|3|iter}
    4:69|3 NOM {RAW 4:57|3|c_real} {RAW 4:66|3|z2_real} {WAR 4:67|3|z_real}
    4:71|2 NOM {RAW 4:69|2|z_real} {RAW 4:70|2|z_imag} {WAR 4:64|2|z_norm}
    4:80|1 NOM {WAW 4:80|1|green} {INIT *}

- Discover more parallelism in parallel programs
- Support other analyses whose required information can be derived from dependences

Dependence profiling
• Parallel implementation, efficient in both time and space
• Implemented based on LLVM ¹
• Instrumentation applied to IR
• Instrumentation library integrated in Compiler-RT
• Interface integrated in Clang

¹ DiscoPoP on LLVM website: http://llvm.org/ProjectsWithLLVM/

Approach
• Background
• Approach
    o Dependence profiling
    o Computational Unit and CU graph
    o Parallelism discovery
• Results

Computational Unit (CU)
• A collection of instructions
• Follows the read-compute-write pattern: a program state is first read from memory, the new state is computed, and finally written back
• A small piece of code containing no parallelism or only ILP
• Building blocks of parallel tasks

Computational Unit (CU)

    //Region 0; Depth 0
    void netlist::get_random_pair(netlist_elem** a, netlist_elem** b, Rng* rng)
    {
        //get a random element
        long id_a = rng->rand(_chip_size);
        netlist_elem* elem_a = &(_elements[id_a]);

        //now do the same for b
        long id_b = rng->rand(_chip_size);
        netlist_elem* elem_b = &(_elements[id_b]);

        //Region 1; Depth 1;
        while (id_b == id_a) {
            id_b = rng->rand(_chip_size);
            elem_b = &(_elements[id_b]);
        }

        *a = elem_a;
        *b = elem_b;
        return;
    }

CU graph
• Two CUs can share common instructions (blue edges)
    Do two CUs refer to the same code section?
• A CU can depend on another via data dependence (red edges)
    Do two CUs tightly depend on each other?
• Should two CUs be merged?
[Figure: CU graph over nodes 53 to 59. Edge labels: No. of Common Instructions, No. of Dependences.]

Approach
• Background
• Approach
    o Dependence profiling
    o Computational Unit and CU graph
    o Parallelism discovery
• Results

Parallelism discovery
• DOALL loops
    o Looking for loop-carried dependences

    loopA {
        ……
        loopB {
            ……
        }
        ……
    }

[Figure: four builds of this loop nest, each annotated with a different dependence; the arrows are not recoverable from the extracted text. Labels shown on each build:]
    Build 1: loopA: no,  loopB: yes
    Build 2: loopA: yes, loopB: no
    Build 3: loopA: no,  loopB: yes
    Build 4: loopA: no,  loopB: yes

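To make "loop-carried" concrete for a two-level nest like loopA / loopB, the sketch below classifies recorded dependences by comparing the iteration numbers at their source and sink. It is a minimal illustration, not the tool's algorithm: the `DynDep` record, the `Verdict` type, the function `classifyLoopCarried`, and the convention that -1 means "outside loopB" are all assumptions made here.

```cpp
#include <vector>

// Hypothetical record of one dynamic RAW dependence in a loopA/loopB nest.
struct DynDep {
    int srcA, srcB;  // loopA / loopB iteration at the write (-1: outside loopB)
    int snkA, snkB;  // loopA / loopB iteration at the read  (-1: outside loopB)
};

struct Verdict {
    bool carriedByA;  // some dependence crosses iterations of loopA
    bool carriedByB;  // some dependence crosses iterations of loopB
};

Verdict classifyLoopCarried(const std::vector<DynDep>& deps) {
    Verdict v{false, false};
    for (const DynDep& d : deps) {
        if (d.srcA != d.snkA) {
            // Source and sink run in different loopA iterations.
            v.carriedByA = true;
        } else if (d.srcB >= 0 && d.snkB >= 0 && d.srcB != d.snkB) {
            // Same loopA iteration, both ends inside loopB, different
            // loopB iterations.
            v.carriedByB = true;
        }
        // Otherwise the dependence stays within one iteration of both loops.
    }
    return v;
}
```

Under this sketch, a loop with no dependence carried across its iterations is a DOALL candidate; how that maps onto the yes/no labels of the builds above is not stated on the slides.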
Parallelism discovery

    #pragma omp parallel for private(i, price, priceDelta)
    for (i=0; i<numOptions; i++) {
        /* Calling main function to calculate option value based on
         * Black & Scholes's equation.
         */
        price = BlkSchlsEqEuroNoDiv( sptprice[i], strike[i],
                                     rate[i], volatility[i], otime[i],
                                     otype[i], 0);
        prices[i] = price;

    #ifdef ERR_CHK
        priceDelta = data[i].DGrefval - price;
        if( fabs(priceDelta) >= 1e-4 ){
            printf("Error on %d. Computed=%.5f, Ref=%.5f, Delta=%.5f\n",
                   i, price, data[i].DGrefval, priceDelta);
            numError++;
        }
    #endif
    }

The main loop of Parsec.blackscholes.

Parallelism discovery
• Tasking
[Figure: a graph over nodes A to I shown in three stages, the original graph and the results of steps 1 and 2. Labels: SCC, chain.]

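The SCC label on this slide refers to strongly connected components: groups of nodes that can all reach one another along directed edges. As an illustration of how such groups can be found in a dependence graph, the sketch below uses Tarjan's algorithm. This is a standard technique chosen for the example; the slides do not say which algorithm the tool uses, and the class name `SccFinder` and the adjacency-list input are assumptions.

```cpp
#include <algorithm>
#include <vector>

// Labels every node of a directed graph with the id of its strongly
// connected component, using Tarjan's algorithm.
class SccFinder {
public:
    // adj[v] lists the successors of node v; nodes are 0 .. adj.size()-1.
    explicit SccFinder(const std::vector<std::vector<int>>& adj)
        : adj_(adj), index_(adj.size(), -1), low_(adj.size(), 0),
          onStack_(adj.size(), false), comp_(adj.size(), -1) {
        for (int v = 0; v < static_cast<int>(adj_.size()); ++v)
            if (index_[v] == -1) visit(v);
    }

    // component()[v] is the component id of node v.
    const std::vector<int>& component() const { return comp_; }
    int count() const { return numComps_; }

private:
    void visit(int v) {
        index_[v] = low_[v] = next_++;
        stack_.push_back(v);
        onStack_[v] = true;
        for (int w : adj_[v]) {
            if (index_[w] == -1) {          // tree edge: recurse first
                visit(w);
                low_[v] = std::min(low_[v], low_[w]);
            } else if (onStack_[w]) {       // edge back into the open component
                low_[v] = std::min(low_[v], index_[w]);
            }
        }
        if (low_[v] == index_[v]) {         // v is the root of a component
            int w;
            do {
                w = stack_.back();
                stack_.pop_back();
                onStack_[w] = false;
                comp_[w] = numComps_;
            } while (w != v);
            ++numComps_;
        }
    }

    std::vector<std::vector<int>> adj_;  // private copy of the graph
    std::vector<int> index_, low_;       // discovery index and low-link
    std::vector<bool> onStack_;
    std::vector<int> comp_, stack_;
    int next_ = 0;
    int numComps_ = 0;
};
```

For a graph with edges 0→1, 1→2, 2→1 and 2→3, nodes 1 and 2 end up in the same component while 0 and 3 each form their own, so a cycle of mutually dependent nodes collapses into a single unit.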
Parallelism discovery
• Tasking
[Figure: the CU graph over nodes 53 to 59 in three views. View 1: edges labeled with No. of Common Instructions and No. of Dependences. View 2: edges labeled with Affinity (values from 0.05 to 0.43). View 3: Min Cut of the affinity-weighted graph.]

Parallelism discovery

    #pragma omp parallel
    {
      #pragma omp sections
      {
        #pragma omp section
        {
          for (auto iter = _elem_names.begin(); iter != _elem_names.end(); ++iter){
            netlist_elem* elem = iter->second;
            for (int i = 0; i< elem->fanin.size(); ++i){
              location_t* fanin_loc = elem->fanin[i]->present_loc.Get();
              fanin_cost += fabs(elem->present_loc.Get()->x - fanin_loc->x);
              fanin_cost += fabs(elem->present_loc.Get()->y - fanin_loc->y);
            }
          }
        }

Parallelism discovery

        #pragma omp section
        {
          for (auto iter = _elem_names.begin(); iter != _elem_names.end(); ++iter){
            netlist_elem* elem = iter->second;
            for (int i = 0; i< elem->fanout.size(); ++i){
              location_t* fanout_loc = elem->fanout[i]->present_loc.Get();
              fanout_cost += fabs(elem->present_loc.Get()->x - fanout_loc->x);
              fanout_cost += fabs(elem->present_loc.Get()->y - fanout_loc->y);
            }
          }
        }
      }
    }

Parallelism discovery
• Pipeline
    o template matching
    o input: a CU graph mapped onto the execution tree of the program
[Figure: Execution Tree with a root, function and loop nodes, and leaf nodes. Legend: one edge type from x to y means y is called from x; the other edge type from x to y means y is data-dependent on x.]

Parallelism discovery
• Pipeline
• Cross-correlation between two vectors to determine similarity
• The vector of a program is derived from its CU graph
• The vector of a parallel pattern is built using specific properties of the pattern

$$\mathrm{CorrCoef} = \frac{\vec{p} \cdot \vec{g}}{\lVert \vec{p} \rVert \, \lVert \vec{g} \rVert}, \qquad \mathrm{CorrCoef} \in [0, 1]$$

• CorrCoef:
    - 0: pattern not detected
    - 1: pattern detected successfully
    - (0,1): pattern may exist but there are obstacles in implementing it

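A small sketch of the coefficient, assuming it is the dot product of the pattern vector p and the program vector g divided by the product of their lengths. The function name `corrCoef`, the use of `std::vector<double>`, and the handling of an all-zero vector are assumptions for illustration; how the two vectors are derived from the CU graph and the pattern is not shown here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized dot product of two equally long vectors. For vectors with
// non-negative entries the result lies in [0, 1].
double corrCoef(const std::vector<double>& p, const std::vector<double>& g) {
    assert(p.size() == g.size());  // both vectors must have the same dimension
    double dot = 0.0, pp = 0.0, gg = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        dot += p[i] * g[i];
        pp += p[i] * p[i];
        gg += g[i] * g[i];
    }
    // Assumption: an all-zero vector matches nothing, so report 0.
    if (pp == 0.0 || gg == 0.0) return 0.0;
    return dot / (std::sqrt(pp) * std::sqrt(gg));
}
```

Identical directions give a value of 1, vectors with no overlapping non-zero entries give 0, and partial overlap falls in between, which corresponds to the three cases listed above.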