Dependence-Based Automatic Parallelization using CnC
Bo Zhao, Ali Janessari
Technische Universität Darmstadt
bo.zhao@rwth-aachen.de, jannesari@cs.tu-darmstadt.de
CnC Workshop 2015, September 8, 2015

Overview
1 Introduction
    Motivation
    Objectives
2 Approach
    Overview Framework
    Program Analysis
    Task Parallelism Extraction
    Code Generation
3 Conclusion

Motivation
- Multicore architectures have become popular as a result of stagnating single-core performance
- Many software products are implemented sequentially and fail to tap the potential of parallel hardware
- Problem: the gap between parallel hardware and sequential software. Closing it should
    - take advantage of new hardware features
    - preserve the current software investment
    - save human resources
- Solution: automatically (or semi-automatically) transform sequential code into parallel code

Objectives
- Discover potential parallelism
    - Loop parallelism
    - Irregular task parallelism
- Detect data and control dependencies
- Generate parallel code using Concurrent Collections (CnC)

Overview Framework: Workflow
[Figure: three-phase workflow.
 Phase 1, Program Analysis: front end, DiscoPoP, static code analysis, code instrumentation, dynamic code analysis; outputs control information and the CU graph.
 Phase 2, Coarse-Grained Task Extraction and IR-to-IR Transformation: task graph generator, task graph; sequential IR is transformed into CnC-parallel IR at compile time.
 Phase 3, Code Generation: LLVM JIT compiler, runtime, parallel execution.
 Other labels: sequential source code, unit testing, correctness feedback.]

DiscoPoP (Discovery of Potential Parallelism)
- Phase 1: Static and dynamic analyses
    - Instruments the target program and identifies control and data dependencies
- Phases 2 and 3: Post-mortem analysis for parallelism discovery
    - Builds Computational Units (CUs) for the target program
    - Ranking
[Figure: DiscoPoP pipeline.
 Phase 1: conversion to IR, memory access and control-flow instrumentation, execution, data dependency analysis, variable lifetime analysis, static and dynamic control-flow analysis, control region information.
 Phase 2: CU graph, computational unit analysis, runtime dependency merging, parallelism detection, parallel pattern discovery.
 Phase 3: ranking, producing ranked parallel opportunities in the source code.
 Legend: static, dynamic.]

Dependence Profiling
Control dependence
    <FileID:LineID> <Contr.ID> <Label> <Exec.Time>
    1:60 BGN loop void
    1:74 END loop 1200
Data dependence
    <FileID:LineID> <Contr.ID> <Label> <Dep.> <FileID:LineID|VarName>
    1:63 NOM void RAW 1:59 | temp1
    1:70 NOM void WAR 1:67 | temp2
Data dependence (multi-threaded)
    <FileID:LineID|ThreadID> <Contr.ID> <Label> <Dep.> <FileID:LineID|VarName|ThreadID>
    4:59 | 2 NOM void WAR 4:71 | 2 | z real
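
The single-threaded data-dependence records above can be read into a small structure for later analysis. The following is a minimal sketch in Python, assuming the whitespace-separated layout shown on the slide; the field names (location, ctrl_id, label, kind, dep_location, var) and the function parse_data_dep are hypothetical and are not part of DiscoPoP.

```python
import re
from typing import NamedTuple


class DataDep(NamedTuple):
    """One single-threaded data-dependence record (field names are hypothetical)."""
    location: str      # first <FileID:LineID> field
    ctrl_id: str       # <Contr.ID>, e.g. NOM
    label: str         # <Label>, e.g. void
    kind: str          # <Dep.>, e.g. RAW or WAR
    dep_location: str  # second <FileID:LineID> field
    var: str           # <VarName>


# Assumed layout: "<File:Line> <Contr.ID> <Label> <Dep.> <File:Line> | <VarName>"
_RECORD = re.compile(
    r"^(\d+:\d+)\s+(\S+)\s+(\S+)\s+([A-Z]+)\s+(\d+:\d+)\s*\|\s*(\S+)$"
)


def parse_data_dep(line: str) -> DataDep:
    """Parse one record; raise ValueError if the line does not match the layout."""
    match = _RECORD.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized dependence record: {line!r}")
    return DataDep(*match.groups())


records = [
    parse_data_dep("1:63 NOM void RAW 1:59 | temp1"),
    parse_data_dep("1:70 NOM void WAR 1:67 | temp2"),
]
raw_only = [r for r in records if r.kind == "RAW"]
```

Filtering on kind, as in raw_only, is one way to keep only the true (RAW) dependences when building a dependence graph.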

Computational Unit (CU)
- A collection of instructions (LLVM IR instructions)
- Follows the read-compute-write pattern
    - A program state is first read from memory, the new state is computed, and finally written back
- A small piece of code containing no parallelism or only ILP
- Building blocks for forming parallel tasks
CU graph
- Dependences are mapped to CUs
- Exposes tightly connected CUs
Example
    1  x = 3
    2  y = 4
    3  a = x + rand() / x
    4  b = x - rand() / x
    5  x = a + b
    6  a = y + rand() / y
    7  b = y - rand() / y
    8  y = a + b
[Figure: lines 1-2 form the INIT node, lines 3-5 form CUx, and lines 6-8 form CUy.]
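
To see why CUx and CUy come out as separate units, it helps to compute the read-after-write (RAW) dependences of the eight statements. The sketch below is an illustration in Python, not DiscoPoP's implementation. It assumes that only RAW dependences are tracked and that a and b are temporaries that can be privatized per CU, so the WAR and WAW dependences on them are ignored. The read and write sets are written by hand.

```python
# Each entry: (line number, variables written, variables read).
statements = [
    (1, {"x"}, set()),        # x = 3
    (2, {"y"}, set()),        # y = 4
    (3, {"a"}, {"x"}),        # a = x + rand() / x
    (4, {"b"}, {"x"}),        # b = x - rand() / x
    (5, {"x"}, {"a", "b"}),   # x = a + b
    (6, {"a"}, {"y"}),        # a = y + rand() / y
    (7, {"b"}, {"y"}),        # b = y - rand() / y
    (8, {"y"}, {"a", "b"}),   # y = a + b
]


def raw_dependences(stmts):
    """Return the set of (writer_line, reader_line) RAW pairs in program order."""
    last_writer = {}  # variable -> line of its most recent write
    deps = set()
    for line, writes, reads in stmts:
        for var in reads:
            if var in last_writer:
                deps.add((last_writer[var], line))
        for var in writes:
            last_writer[var] = line
    return deps


deps = raw_dependences(statements)
cux, cuy = {3, 4, 5}, {6, 7, 8}
# RAW edges that connect a statement of CUx with a statement of CUy.
cross = {(w, r) for (w, r) in deps if (w in cux and r in cuy) or (w in cuy and r in cux)}
```

Under these assumptions cross is empty: statements 3-5 depend only on line 1 and on each other, and statements 6-8 depend only on line 2 and on each other.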

CU Graph
[Figure: CU graph for a pixel loop that reads R, G, and B from an input image and computes Y, U, and V. Each node lists its CU ID and source lines:
 CU-19 [146,152,153,154]: in = args->in_img; for all pixels, R = *in++; G = *in++; B = *in++
 CU-32 [152,153,154,156]: for all pixels, reads R, G, B and computes Y = round(c1*R + c2*G + c3*B)
 CU-21 [147,156,160]: pY = args->pY; for all pixels, computes Y and stores it with *pY++ = Y
 CU-33 [152,153,154,157]: for all pixels, reads R, G, B and computes U = round(c4*R + c5*G + c6*B)
 CU-24 [148,157,161]: pU = args->pU; for all pixels, computes U and stores it with *pU++ = U
 CU-34 [152,153,154,158]: for all pixels, reads R, G, B and computes V = round(c7*R + c8*G + c9*B)
 CU-27 [149,158,162]: pV = args->pV; for all pixels, computes V and stores it with *pV++ = V]

Program Execution Tree
- A call tree combined with loop information and basic blocks
- The CU graph is mapped onto the execution tree
[Figure: execution tree with nodes Program (lines 1-377), Basic Block (11-19), Loop (21-36), and Basic Block (22-27). Legend: tree node, CU, data dependency.]

Task Extraction
- Merge CUs contained in strongly connected components (SCCs) or in chains
- SCC FGH and chain CDE are two tasks
- Hide complex dependences inside SCCs, exposing parallelization opportunities outside
[Figure: a CU graph with nodes A to I shown in three stages: the original graph with the SCC (F, G, H) and the chain (C, D, E) marked, the graph after step 1 with F, G, H merged into FGH, and the graph after step 2 with C, D, E merged into CDE.]
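
The two merging steps can be sketched as follows in Python. This is an illustration under assumptions, not the authors' implementation: the edge list is hypothetical (the slide's exact edges are not recoverable from the text) and is chosen so that F, G, H form a cycle and C, D, E form a chain. SCCs are found with Tarjan's algorithm, and a chain link is assumed to be an edge whose source has exactly one successor and whose target has exactly one predecessor.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Tarjan's algorithm; returns a list of frozensets of nodes."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    index, low, stack, on_stack, result = {}, {}, [], set(), []

    def visit(u):
        index[u] = low[u] = len(index)
        stack.append(u)
        on_stack.add(u)
        for v in succ[u]:
            if v not in index:
                visit(v)
                low[u] = min(low[u], low[v])
            elif v in on_stack:
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:  # u is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == u:
                    break
            result.append(frozenset(comp))

    for n in nodes:
        if n not in index:
            visit(n)
    return result


def merge_tasks(nodes, edges):
    """Step 1: collapse SCCs. Step 2: merge chains in the resulting DAG."""
    sccs = strongly_connected_components(nodes, edges)
    owner = {n: comp for comp in sccs for n in comp}
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        if owner[u] != owner[v]:
            succ[owner[u]].add(owner[v])
            pred[owner[v]].add(owner[u])

    def linked(a, b):  # a -> b is a chain link
        return len(succ[a]) == 1 and len(pred[b]) == 1

    tasks = []
    for comp in sccs:
        if len(pred[comp]) == 1 and linked(next(iter(pred[comp])), comp):
            continue  # not a chain head; absorbed when its head is processed
        chain, cur = set(comp), comp
        while len(succ[cur]) == 1 and linked(cur, next(iter(succ[cur]))):
            cur = next(iter(succ[cur]))
            chain |= cur
        tasks.append(frozenset(chain))
    return tasks


# Hypothetical edges, not read from the slide.
edges = [("A", "B"), ("A", "C"), ("A", "F"), ("B", "F"), ("C", "D"),
         ("D", "E"), ("E", "I"), ("F", "G"), ("G", "H"), ("H", "F"), ("H", "I")]
tasks = merge_tasks("ABCDEFGHI", edges)
```

For this edge list the result contains five tasks: A, B, CDE, FGH, and I. Because the chain rule is applied to the condensed graph, an SCC that sits on a chain would also be merged with its neighbors; whether the original approach does this is not stated on the slide.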

Task Extraction
- Two CUs can share common instructions
[Figure: Demonstration of a CU graph and graph partitioning to form tasks.
 (a) CU graph with CUs as vertices and RAW dependences and common instructions as edges
 (b) CU graph with affinities between the CUs
 (c) CU graph with a minimum cut
 (d) CU graph partitioned to identify tasks
 Legend: number of common instructions, affinity, minimum cut, number of dependences.]
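
The partitioning step in panels (c) and (d) can be illustrated with a brute-force minimum cut over an affinity-weighted CU graph. The sketch below is in Python and rests on several assumptions: the CU names and affinity values are hypothetical and are not read from the figure, the slide does not say how affinity is computed from dependences and common instructions, and exhaustive search is only practical for very small graphs.

```python
from itertools import combinations


def min_cut(nodes, affinity):
    """Try every two-way split and return (cut_weight, side_a, side_b).

    affinity maps an undirected edge (u, v) to its weight. The first node is
    pinned to side_a so that each split is examined only once.
    """
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    for k in range(len(rest)):  # side_b always keeps at least one node
        for extra in combinations(rest, k):
            side_a = {first, *extra}
            weight = sum(
                w for (u, v), w in affinity.items()
                if (u in side_a) != (v in side_a)  # edge crosses the cut
            )
            if best is None or weight < best[0]:
                best = (weight, side_a, set(nodes) - side_a)
    return best


# Hypothetical CU graph: five CUs with made-up affinities.
cus = ["cu1", "cu2", "cu3", "cu4", "cu5"]
affinity = {
    ("cu1", "cu2"): 0.35,
    ("cu2", "cu3"): 0.28,
    ("cu1", "cu3"): 0.17,
    ("cu3", "cu4"): 0.10,
    ("cu4", "cu5"): 0.43,
}
weight, task_a, task_b = min_cut(cus, affinity)
```

Here the cheapest cut removes only the weak cu3-cu4 edge, so the tightly coupled CUs cu1, cu2, cu3 end up in one task and cu4, cu5 in the other.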

Task Graph
Task extraction
- Not limited to predefined language constructs
- Covers independent tasks and dependent tasks (coarse-grained tasks)
[Figure: task graph with control regions annotated as follows:
 function, lines 365-381, Parallelizable: true
 loop, lines 372-380, Parallelizable: false (INIT at 370, CU at 374-379)
 loop, lines 682-709, Parallelizable: true
 if-else, lines 667-678, Parallelizable: false (INIT at 666)
 if-else, line 719, Parallelizable: false
 Legend: CU (blue), RAW (yellow), control region (grey).]

Code Generation
Ongoing work
- Map the task graph to a CnC graph
- CnC defines two scheduling constraints in parallel execution
    - producer/consumer relationships
    - controller/controllee relationships
- A task (coarse-grained CU) is similar to a step collection
- Data dependencies among tasks are known from the task graph
- Detected control information is not sufficient
- Users specify the controller/controllee relationships
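
The mapping described above can be sketched as a plain data structure. The Python below is an illustration under assumptions, not the authors' IR template and not the Intel CnC API: the class CnCGraphSketch, the function map_task_graph, and the task names are hypothetical. Each task becomes a step collection, each RAW edge becomes an item collection with one producer and one consumer, and controller/controllee relationships are taken only from user input.

```python
from dataclasses import dataclass, field


@dataclass
class CnCGraphSketch:
    """Hypothetical container for a CnC-style graph description."""
    steps: set = field(default_factory=set)
    items: dict = field(default_factory=dict)          # item name -> (producer, consumer)
    prescribed_by: dict = field(default_factory=dict)  # step -> controller, or None


def map_task_graph(tasks, raw_edges, user_control):
    """Map tasks and RAW edges to steps and items; take control from the user."""
    graph = CnCGraphSketch()
    for task in tasks:
        graph.steps.add(task)
        # Controller/controllee relationships are not derived automatically.
        graph.prescribed_by[task] = user_control.get(task)
    for producer, consumer, var in raw_edges:
        graph.items[f"{var}:{producer}->{consumer}"] = (producer, consumer)
    return graph


# Hypothetical task graph with three tasks and two RAW edges.
tasks = ["read_rgb", "compute_y", "store_y"]
raw_edges = [("read_rgb", "compute_y", "rgb"), ("compute_y", "store_y", "y")]
user_control = {"read_rgb": "environment", "compute_y": "read_rgb"}

graph = map_task_graph(tasks, raw_edges, user_control)
# Steps for which the user still has to name a controller.
unspecified = sorted(s for s, c in graph.prescribed_by.items() if c is None)
```

Listing the steps without a controller, as in unspecified, makes explicit which relationships the user still needs to supply.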

Code Generation
- Propose a CnC-specific IR template
- Transform the original IR to CnC-specific IR using the task graph and the users' control information
- Generate binary code from the CnC-specific IR

Code Generation
Previous code transformation results
- Source-to-source transformation using Intel TBB flow graph (semi-automatic)
- FaceDetection (CnC sample application)
[Figure: (a) Logic of FaceDetection; (b) Flow graph.]

Code Generation
Speedups on a 2x8-core Intel Xeon E5-2650, 2 GHz
[Figure: speedup (y-axis, 0 to 20) versus number of threads (x-axis: 1, 2, 4, 8, 16, 32), comparing the official manual CnC parallelization with the semi-automatic TBB parallelization.]

Conclusion
- Profile data and control dependencies
    - DiscoPoP
    - Users' specification
- Extract coarse-grained task parallelism
    - CU graph
    - Program execution tree
    - Task graph
- Generate parallel code using CnC
    - Define CnC-specific IR
    - Code transformation at IR level
    - Employ CnC runtime library

Thanks!
Q & A