The Compiler Black Box — Behind the Curtain (of Clang) (7/14)

    #pragma omp target teams
    {
      foo();
      #pragma omp parallel
        work();   // <- Hotspot
      bar();
    }

Lowered to 1 master and N-1 worker teams; worker teams have M threads:
  • Masters execute foo concurrently, workers idle.
  • Masters delegate work for concurrent execution.
  • Masters execute bar concurrently, workers idle.

Problems:
  • a separate master team costs resources
  • synchronization has overhead
  • currently impossible to optimize
OpenMP Offload — Overview & Directions (8/14)

[Pipeline figure: OpenMP source is compiled by Clang into LLVM-IR, then by
LLVM into assembly. Clang splits the input into host code and device code
(plus RT calls and offloading logic); the device RT is linked in. Cross-module
IPO, device optimization, host optimization, and code generation produce a
fat binary.]

1. Offload-Specific Optimizations on Device Code
   [In the figure: the offloading logic moves out of the device code and into
   the device RT ("Device RT + Logic"), leaving "Device Code + few RT calls".]

2. Optimize Device and Host Code Together
   [In the figure: the separate device and host optimization stages become a
   combined "Device AND Host Optimization".]

Pending patches “fix” the motivating problem and allow for more to come!
Reviewers are needed! Interested? Take a look and contact me :)
Pending Patches — Target Region Interface (9/14)

• A straight-forward #pragma omp target front-end:
  ⋄ simplified implementation: CGOpenMPRuntimeNVPTX.cpp ~5.0k loc vs.
    CGOpenMPRuntimeTRegion.cpp ~0.5k loc
  ⋄ improved reusability (F18, ...)
• Interface exposes information and implementation choices:
  ⋄ “smartness” is moved into the compiler middle-end
  ⋄ simplifies analyses and transformations in LLVM
• Device RT interface & implementation are separated:
  ⋄ simplifies generated LLVM-IR
  ⋄ most LLVM & Clang parts become target agnostic
1. Offload-Specific Optimizations — “SPMD-zation” (10/14)

• use inter-procedural reasoning to place minimal guards/synchronization
• if legal, switch all boolean UseSPMDMode flags to true
• currently, no (unknown) global side-effects allowed outside parallel regions
1. Offload-Specific Optimizations — Custom State Machines (11/14)

• use optimized state machines when unavoidable
• reachability & post-dominance restrict the set of potential next parallel regions
• reuse already communicated/shared values if possible
• currently, a simple state machine is generated with explicit conditionals
  for all known parallel regions in the module to work on
2. Optimize Device and Host Together — Abstract Call Sites (12/14)

[Figure: IPO passes today operate on CallSite, a wrapper around CallInst and
InvokeInst. An AbstractCallSite wrapper additionally covers
TransitiveCallSites (callback calls through a broker), so the same passes
(IPOs) apply to both.]

Functional changes required for Inter-procedural Constant Propagation:
operate on AbstractCallSite instead of CallSite.
Abstract Call Sites — Performance Results (13/14)
Conclusion (14/14)
OpenMP-Aware Optimizations (see IWOMP’18)

I:   Attribute Propagation — Bidirectional Information Transfer
     (in LLVM: Attribute Deduction, an IPO): read/write-only, restrict/noalias, …
II:  Variable Privatization — Limit Variable Lifetimes
     (in LLVM: Argument Promotion, an IPO): shared(var) ⟶ firstprivate(var) ⟶ private(var)
III: Parallel Region Expansion — Maximize Parallel Contexts:
     ⟹ reduce start/stop overheads and expose barriers
IV:  Barrier Elimination — Eliminate Redundant Barriers
V:   Communication Optimization — Move Computations Around:
     seq. compute & result comm. vs. operand comm. & par. compute
OpenMP Input: Early Outlining

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      Out[i] = In[i] + In[i+N];

    // Parallel region replaced by a runtime call.
    omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);

    // Parallel region outlined in the front-end (clang)!
    static void body_fn(int tid, int *N, float **In, float **Out) {
      int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
      for (int i = lb; i < ub; i++)
        (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
    }
OpenMP Input: An Abstract Parallel IR

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      Out[i] = In[i] + In[i+N];

    // Parallel region replaced by an annotated loop.
    for /* parallel */ (int i = 0; i < N; i++)
      body_fn(i, &N, &In, &Out);

    // Parallel region outlined in the front-end (clang)!
    static void body_fn(int i, int *N, float **In, float **Out) {
      (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
    }
OpenMP Input: Early Outlined + Transitive Calls

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      Out[i] = In[i] + In[i+N];

    // Parallel region replaced by a runtime call.
    omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);
    // Model transitive call: body_fn(?, &N, &In, &Out);

    // Parallel region outlined in the front-end (clang)!
    static void body_fn(int tid, int *N, float **In, float **Out) {
      int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
      for (int i = lb; i < ub; i++)
        (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
    }

+ valid and executable IR
+ no unintended interactions
− integration cost per IPO
IPO in LLVM

[Figure: IPO passes operate on CallSite, a wrapper around CallInst and
InvokeInst; with AbstractCallSite they additionally cover TransitiveCallSites.]