Compiler Optimization for (OpenMP) Accelerator Offloading

EuroLLVM, April 8, 2019, Brussels, Belgium. Johannes Doerfert and Hal Finkel, Leadership Computing Facility, Argonne National Laboratory. https://www.alcf.anl.gov/


7/14 The Compiler Black Box — Behind the Curtain (of Clang)

    #pragma omp target teams
    {
      foo();
      #pragma omp parallel
      work(); // <- Hotspot
      bar();
    }

Execution with 1 master and N-1 worker teams (worker teams with M threads):
  • Masters execute foo concurrently, workers idle.
  • Masters delegate work for concurrent execution.
  • Masters execute bar concurrently, workers idle.

Problems:
  • a separate master team costs resources
  • synchronization has overhead
  • currently impossible to optimize
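To make the execution model concrete, here is a minimal, self-contained sketch that emulates the generic-mode pattern with plain host OpenMP: one thread plays the master and runs foo and bar while the others idle at barriers, and every thread joins in for the hotspot. This is an illustration of the scheme described above, not the code Clang actually generates for the device.

    #include <cstdio>
    #include <omp.h>

    static void foo()  { std::printf("foo  (master only)\n"); }
    static void bar()  { std::printf("bar  (master only)\n"); }
    static void work() { std::printf("work (thread %d)\n", omp_get_thread_num()); }

    int main() {
      // Emulate one team: thread 0 plays the master, the rest are workers.
      #pragma omp parallel num_threads(4)
      {
        if (omp_get_thread_num() == 0)
          foo();               // workers idle while the master runs foo()
        #pragma omp barrier
        work();                // the parallel hotspot: all threads participate
        #pragma omp barrier
        if (omp_get_thread_num() == 0)
          bar();               // workers idle again while the master runs bar()
      }
      return 0;
    }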

8/14 OpenMP Offload — Overview & Directions

[Pipeline diagram: the OpenMP input goes through Clang to LLVM-IR and through LLVM to assembly. Clang splits it into host code and device code (the device code carrying a few runtime calls plus offloading logic); the device code is combined with the device runtime, run through device optimizations and code generation, the host code through host optimizations, cross-module IPO connects the two, and everything is bundled into a fat binary.]

Directions:
  1. Offload-specific optimizations on device code
  2. Optimize device and host code together

Pending patches "fix" the motivating problem and allow for more to come. Reviewers are needed! Interested? Take a look and contact me :)
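For orientation, the listing below is the kind of translation unit that flows through this pipeline: the target region becomes device code plus a few runtime calls, the surrounding code stays on the host, and both end up in one fat binary. The build line in the comment is a typical upstream Clang invocation for NVIDIA offloading; the exact flags and target triple depend on your installation.

    // Build (example, adjust to your toolchain):
    //   clang++ -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda offload.cpp
    #include <cstdio>

    int main() {
      constexpr int N = 1 << 20;
      static float In[2 * N], Out[N];
      for (int i = 0; i < 2 * N; ++i) In[i] = 1.0f;

      // This region is extracted as device code; everything else stays host code.
      #pragma omp target teams distribute parallel for map(to: In) map(from: Out)
      for (int i = 0; i < N; ++i)
        Out[i] = In[i] + In[i + N];

      std::printf("Out[0] = %f\n", Out[0]);
      return 0;
    }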

9/14 Pending Patches — Target Region Interface

  • A straight-forward #pragma omp target front-end:
    ⋄ simplified implementation: CGOpenMPRuntimeNVPTX.cpp ~5.0k LOC → CGOpenMPRuntimeTRegion.cpp ~0.5k LOC
    ⋄ improved reusability (F18, ...)
  • Interface exposes information and implementation choices:
    ⋄ "smartness" is moved into the compiler middle-end
    ⋄ simplifies analyses and transformations in LLVM
  • Device RT interface & implementation are separated:
    ⋄ simplifies generated LLVM-IR
    ⋄ most LLVM & Clang parts become target agnostic
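The intent can be sketched as follows: instead of emitting a full master/worker state machine, the front end wraps the user code in a thin, target-agnostic frame of runtime calls, and the interface carries the information (for example, whether SPMD mode may be used) that the device runtime and, later, the middle end can act on. The names kernel_init/kernel_parallel/kernel_deinit below are illustrative stand-ins for the pending-patch interface, with trivial stubs so the sketch compiles; they are not the actual entry points.

    #include <cstdio>

    // Illustrative stand-ins for the target-region interface (not real names).
    // kernel_init reports whether this thread executes the sequential user code.
    static bool kernel_init(bool UseSPMDMode) { (void)UseSPMDMode; return true; }
    static void kernel_deinit() {}
    static void kernel_parallel(void (*WorkFn)(void *), void *Payload) {
      WorkFn(Payload); // stub: simply run the outlined parallel region
    }

    static void work_fn(void *) { std::printf("work()\n"); }

    // Rough shape of what the front end would emit for:
    //   #pragma omp target teams { foo(); #pragma omp parallel work(); bar(); }
    void target_region_kernel() {
      if (kernel_init(/*UseSPMDMode=*/false)) {
        std::printf("foo()\n");                        // foo();
        kernel_parallel(&work_fn, /*Payload=*/nullptr);
        std::printf("bar()\n");                        // bar();
      }
      kernel_deinit();
    }

    int main() { target_region_kernel(); return 0; }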

10/14 1. Offload-Specific Optimizations — "SPMD-zation"

  • use inter-procedural reasoning to place minimal guards/synchronization
  • if legal, switch all boolean UseSPMDMode flags to true
  • currently, no (unknown) global side-effects are allowed outside parallel regions
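As an illustration of the legality condition, the hand-written kernel below is a candidate for SPMD-zation: the only code outside the parallel construct computes a thread-independent local value and has no (unknown) global side effects, so every thread could execute it and the UseSPMDMode flags could be flipped to true. A store to global memory at that spot would currently block the transformation. This example is mine, not output of the pass.

    // Candidate for SPMD-zation: the code outside the parallel region has no
    // (unknown) global side effects, so it is legal to let all threads run it.
    void scale_kernel(float *Out, const float *In, float a, int n) {
      #pragma omp target teams num_teams(1) map(to: In[0:n]) map(from: Out[0:n])
      {
        float scale = a * 0.5f;        // thread-independent, side-effect free
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
          Out[i] = scale * In[i];      // the actual hotspot
      }
    }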

11/14 1. Offload-Specific Optimizations — Custom State Machines

  • use optimized state machines when unavoidable
  • reachability & post-dominance restrict the set of potential next parallel regions
  • reuse already communicated/shared values if possible
  • currently, a simple state machine is generated, with explicit conditionals for all known parallel regions in the module to work on
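The shape of such a worker loop can be sketched as follows: each worker waits for work and compares the requested work function against the parallel regions known to be reachable from the kernel, so the common case is a direct, inlinable call rather than an indirect one. The helper wait_for_work and the region functions are hypothetical stubs so the sketch compiles on the host; the real state machine lives in the generated device code and runtime.

    #include <cstdio>

    // Known (outlined) parallel regions of this kernel.
    static void parallel_region_0(void *) { std::printf("region 0\n"); }
    static void parallel_region_1(void *) { std::printf("region 1\n"); }

    using WorkFnTy = void (*)(void *);

    // Hypothetical runtime hook, stubbed so the sketch is self-contained:
    // hand out region 0 once, then signal that the kernel is done.
    static bool wait_for_work(WorkFnTy *Fn, void **Payload) {
      static int Calls = 0;
      *Fn = (Calls == 0) ? parallel_region_0 : nullptr;
      *Payload = nullptr;
      return Calls++ == 0;
    }

    // Custom state machine executed by one worker thread.
    static void worker_state_machine() {
      WorkFnTy Fn;
      void *Payload;
      while (wait_for_work(&Fn, &Payload)) {
        if (Fn == parallel_region_0)       // explicit checks for the known
          parallel_region_0(Payload);      // regions allow inlining them
        else if (Fn == parallel_region_1)
          parallel_region_1(Payload);
        else if (Fn)
          Fn(Payload);                     // fallback: unknown/external region
        // (a barrier with the master would follow here on the device)
      }
    }

    int main() { worker_state_machine(); return 0; }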

12/14 2. Optimize Device and Host Together — Abstract Call Sites

[Diagram: today, IPO passes see direct call sites (CallInst/InvokeInst via CallSite); with AbstractCallSite, transitive (callback) call sites feed the same passes.]

Functional changes required for Inter-procedural Constant Propagation: [diff shown on the original slide]
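For an IPO pass the practical difference is that it can walk the uses of a function and, through AbstractCallSite, treat a use that is passed as a callback argument to a broker call (for example, the outlined parallel body handed to the OpenMP runtime) like an ordinary call site, including the mapping from callee parameters to the broker call's arguments. The sketch below shows the iteration pattern against the LLVM C++ API as I understand it; header location and method names have shifted slightly across LLVM versions, so treat it as a sketch rather than copy-paste code.

    #include "llvm/IR/AbstractCallSite.h"
    #include "llvm/IR/Function.h"
    #include "llvm/Support/raw_ostream.h"

    using namespace llvm;

    // Visit all abstract call sites of F: direct/invoke calls as well as
    // callback calls where F is passed to a broker function.
    static void visitAbstractCallSites(Function &F) {
      for (const Use &U : F.uses()) {
        AbstractCallSite ACS(&U);
        if (!ACS)
          continue; // this use is not a call site we can reason about
        errs() << (ACS.isCallbackCall() ? "callback" : "direct")
               << " call of " << F.getName() << " in "
               << ACS.getInstruction()->getFunction()->getName() << "\n";
        // For inter-procedural constant propagation one would now map each
        // parameter of F to ACS.getCallArgOperand(ArgNo) and merge lattices.
      }
    }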

13/14 Abstract Call Sites — Performance Results

[Performance charts omitted.]

14/14 Conclusion

OpenMP-Aware Optimizations (see IWOMP'18)

  I: Attribute Propagation — Bidirectional Information Transfer: read/write-only, restrict/noalias, ...
     (in LLVM: Attribute Deduction, an IPO!)
  II: Variable Privatization — Limit Variable Lifetimes: shared(var) ⟶ firstprivate(var) ⟶ private(var)
     (in LLVM: Argument Promotion, an IPO!)
  III: Parallel Region Expansion — Maximize Parallel Contexts:
     ⟹ reduce start/stop overheads and expose barriers
  IV: Barrier Elimination — Eliminate Redundant Barriers
  V: Communication Optimization — Move Computations Around:
     sequential compute & result communication vs. operand communication & parallel compute
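A tiny example of the kind of rewrite that optimizations I and II describe: in the loop below x is only read inside the parallel region, so shared(x) can be strengthened to firstprivate(x) (and, if its value were recomputable per thread, to private), and read-only/noalias facts about the pointers can be propagated across the outlined region boundary. The before/after is hand-written for illustration; it is not compiler output.

    // Before: x is shared even though the parallel region only reads it.
    void axpy_shared(float *Out, const float *In, int n) {
      float x = 2.0f;
      #pragma omp parallel for shared(x)
      for (int i = 0; i < n; ++i)
        Out[i] = x * In[i];
    }

    // After variable privatization: every thread gets its own copy of x, so
    // its lifetime no longer crosses the parallel region's communication.
    void axpy_private(float *Out, const float *In, int n) {
      float x = 2.0f;
      #pragma omp parallel for firstprivate(x)
      for (int i = 0; i < n; ++i)
        Out[i] = x * In[i];
    }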

OpenMP Input:

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      Out[i] = In[i] + In[i+N];

Early Outlining:

    // Parallel region replaced by a runtime call.
    omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);

    // Parallel region outlined in the front-end (clang)!
    static void body_fn(int tid, int *N, float **In, float **Out) {
      int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
      for (int i = lb; i < ub; i++)
        (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
    }

An Abstract Parallel IR:

    // Parallel region replaced by an annotated loop.
    for /* parallel */ (int i = 0; i < N; i++)
      body_fn(i, &N, &In, &Out);

    // Parallel region outlined in the front-end (clang)!
    static void body_fn(int i, int *N, float **In, float **Out) {
      (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
    }

Early Outlined + Transitive Calls:

    // Parallel region replaced by a runtime call.
    omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);
    // Model transitive call: body_fn(?, &N, &In, &Out);

    // Parallel region outlined in the front-end (clang)!
    static void body_fn(int tid, int *N, float **In, float **Out) {
      int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
      for (int i = lb; i < ub; i++)
        (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
    }

  + valid and executable IR
  + no unintended interactions
  − integration cost per IPO

IPO in LLVM

[Diagram: CallInst/InvokeInst → CallSite → IPO passes; TransitiveCallSite → AbstractCallSite → IPO passes.]
