
MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures



  1. MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures
     Tobias Gysi, Tobias Grosser, and Torsten Hoefler
     spcl.inf.ethz.ch, @spcl_eth

  2. Stencil Programs
     Motivation:
     • COSMO is a regional climate model used by 7 national weather services
     • The real-world application COSMO contains hundreds of different stencils
     Analysis:
     • Stencil programs can be represented using directed acyclic graphs
     • Nodes and edges correspond to stencils and dependencies, respectively
     Figure: simplified horizontal diffusion example, a dependency graph over in, lap, fli, flj, and out whose edges are annotated with ⊕, where
     $a \oplus b = \{\, a' + b' \mid a' \in a,\ b' \in b \,\}$
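
The ⊕ operator on the slide above is the element-wise sum of all pairs of offsets from two access patterns. A minimal C++ sketch follows; the names `Offset`, `OffsetSet`, and `minkowski_sum` are illustrative assumptions and do not come from the slides.

```cpp
#include <set>
#include <utility>

// A stencil access pattern as a set of 2D (i, j) offsets (hypothetical type names).
using Offset = std::pair<int, int>;
using OffsetSet = std::set<Offset>;

// a (+) b = { a' + b' | a' in a, b' in b }: the sum of every pair of offsets.
OffsetSet minkowski_sum(const OffsetSet& a, const OffsetSet& b) {
    OffsetSet result;
    for (const Offset& x : a) {
        for (const Offset& y : b) {
            result.insert(Offset{x.first + y.first, x.second + y.second});
        }
    }
    return result;
}
```

For instance, combining the offsets {(0,0), (1,0)} with {(0,0), (-1,0)} yields the three offsets (-1,0), (0,0), and (1,0), because the pair sums (0,0) and (1,0) + (-1,0) coincide.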

  3. TODO: Quickly say what tiling and loop fusion are?

  4. How to Deal with Data Dependencies?
     • Consider the horizontal diffusion lap-fli-out dependency chain (i-dimension)
     Inter-tile data dependencies:
       • Perform halo exchange communication
       • Introduce redundant computation
     Figure: tile 0, tile 1, and tile 2 along the i-dimension, plotted against time

  5. How to Deal with Data Dependencies?
     • Consider the horizontal diffusion lap-fli-out dependency chain (i-dimension)
     Halo Exchange Parallel (hp):
       • Update tiles in parallel
       • Perform halo exchange communication
     Pros and Cons:
       • Avoid redundant computation
       • At the cost of additional synchronization
     Figure: tiles along the i-dimension, plotted against time

  6. How to Deal with Data Dependencies?
     • Consider the horizontal diffusion lap-fli-out dependency chain (i-dimension)
     Halo Exchange Sequential (hs):
       • Update tiles sequentially
       • E.g. the innermost loop updates tile-by-tile
     Pros and Cons:
       • Avoid redundant computation
       • At the cost of being sequential
     Figure: tiles along the i-dimension, plotted against time

  7. How to Deal with Data Dependencies?
     • Consider the horizontal diffusion lap-fli-out dependency chain (i-dimension)
     Computation on-the-fly (of):
       • Compute all dependencies on-the-fly
       • «Overlapped tiling»
     Pros and Cons:
       • Avoid synchronization
       • At the cost of redundant computation
     Figure: tiles along the i-dimension, plotted against time
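
To make the cost of computation on-the-fly concrete, the following C++ sketch counts the evaluations per stage for a single tile along the i-dimension. The `Extent` type and the function name are hypothetical, and the read extents used in the usage note (out reads fli at i-1 and i, fli reads lap at i and i+1) are assumptions chosen to match the IJRange declarations in the STELLA example, not values stated on these slides.

```cpp
#include <vector>

// Read extent of a consumer along the i-dimension (hypothetical helper type):
// the consumer at position i reads its producer at i + lo, ..., i + hi.
struct Extent {
    int lo;
    int hi;
};

// Evaluations per stage for one tile of `tile_width` points when every
// dependency is recomputed on-the-fly (overlapped tiling). `reads` lists the
// extents from the last stage back to the first; the result uses the same
// order, starting with the last stage, which needs no halo.
std::vector<int> overlapped_evaluations(int tile_width,
                                        const std::vector<Extent>& reads) {
    std::vector<int> counts;
    counts.push_back(tile_width);
    int lo = 0;  // accumulated halo on the low side (<= 0)
    int hi = 0;  // accumulated halo on the high side (>= 0)
    for (const Extent& e : reads) {
        lo += e.lo;
        hi += e.hi;
        counts.push_back(tile_width + (hi - lo));
    }
    return counts;
}
```

Under the assumed extents {-1, 0} and {0, 1}, a tile of width 4 evaluates out 4 times, fli 5 times, and lap 6 times, so 3 of the 15 evaluations are redundant compared with the 12 evaluations a halo exchange strategy would need.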

  8. Case Study: STELLA (STEncil Loop LAnguage)
     • STELLA is a stencil DSL used to refactor the dynamical core of COSMO
     • STELLA defines a virtual tiling hierarchy that facilitates platform independent code generation

         // define stencil functors using C++ template metaprogramming:
         struct Lap { ... };
         struct Fli { ... };
         ...

         // stencil assembly
         Stencil stencil;
         StencilCompiler::Build(
             stencil,
             pack_parameters( ... ),
             define_temporaries(
                 StencilBuffer<lap, double>(),
                 StencilBuffer<fli, double>(),
                 ...
             ),
             define_loops(
                 define_sweep(
                     StencilStage<Lap, IJRange<-1,1,-1,1> >(),
                     StencilStage<Fli, IJRange<-1,0,0,0> >(),
                     ...
         )));

         // stencil execution
         stencil.Apply();

  9. Tiling Hierarchy of STELLA's GPU Backend

     DSL      Tile Size    Strategy                  Memory     Communication
     sweep    1 x 1 x 1    halo exchange parallel    registers  scratchpad
     sweep    ∞ x ∞ x 1    halo exchange sequential  registers  registers
     loop     64 x 4 x 64  computation on-the-fly    GDDR       -
     stencil  ∞ x ∞ x ∞    computation on-the-fly    GDDR       -

     The rows list the levels of the tiling hierarchy.

  10. Stencil Program Algebra
      • Map stencils to the tiling hierarchy using a bracket expression, e.g. [[lap,fli,flj],[out]] or [lap,fli,flj,out] for the lap, fli, flj, out graph
      • Enumerate the stencil execution orders that respect the dependencies, e.g. lap, fli, flj, out
      • Enumerate implementation variants by adding/removing brackets: in ...,lap,fli ? flj,out,... the separator ? can be "]],[[", "],[", or ","
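
The enumeration of execution orders that respect the dependencies can be sketched as a backtracking search over topological orders. This is a minimal illustration, not the authors' implementation; the function names and the edge list format (producer first, consumer second) are assumptions.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Order = std::vector<std::string>;
using Edge = std::pair<std::string, std::string>;  // (producer, consumer)

// Recursively appends every stencil whose producers have all been placed.
void extend_orders(const std::vector<std::string>& stencils,
                   const std::vector<Edge>& deps, std::vector<bool>& placed,
                   Order& current, std::vector<Order>& out) {
    if (current.size() == stencils.size()) {
        out.push_back(current);
        return;
    }
    for (std::size_t k = 0; k < stencils.size(); ++k) {
        if (placed[k]) continue;
        bool ready = true;
        for (const Edge& d : deps) {
            if (d.second != stencils[k]) continue;
            for (std::size_t p = 0; p < stencils.size(); ++p) {
                if (stencils[p] == d.first && !placed[p]) ready = false;
            }
        }
        if (!ready) continue;
        placed[k] = true;
        current.push_back(stencils[k]);
        extend_orders(stencils, deps, placed, current, out);
        current.pop_back();  // backtrack and try the next candidate
        placed[k] = false;
    }
}

// Returns all execution orders of `stencils` that respect `deps`.
std::vector<Order> execution_orders(const std::vector<std::string>& stencils,
                                    const std::vector<Edge>& deps) {
    std::vector<Order> orders;
    std::vector<bool> placed(stencils.size(), false);
    Order current;
    extend_orders(stencils, deps, placed, current, orders);
    return orders;
}
```

For the graph with edges lap→fli, lap→flj, fli→out, and flj→out, this yields two orders: lap, fli, flj, out and lap, flj, fli, out. Each order can then be combined with the bracket variants described above.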

  11. Machine Model
      • Our model considers peak computation and communication throughputs
      Note: lateral and vertical communication refer to communication within one tiling hierarchy level and between different tiling hierarchy levels, respectively.

      Target machine                                    Machine model
      3 cores (30 Gflops each)                          C = 90 Gflops
      3 core-to-cache links (100 GB/s each)             V_1 = 300 GB/s
      2 cache-to-cache links (25 GB/s each)             L_1 = 50 GB/s
      3 caches (256 kB each)                            M_1 = 256 kB
      1 link to DDR (10 GB/s)                           V_0 = 10 GB/s
      no lateral link at the DDR level                  L_0 = 0 GB/s
      DDR (8 GB)                                        M_0 = 8 GB

  12. Performance Modeling
      Note: remember the performance model parameters $C, V_m, L_1, \ldots, L_m$.
      Figure: execution time along the tiling hierarchy for [[lap,fli,flj],[out]]
        per stencil: $t_{lap}, t_{fli}, t_{flj}, t_{out}$ and $v_{lap}/V_2, v_{fli}/V_2, v_{flj}/V_2, v_{out}/V_2$
        per group: $t_{[lap,fli,flj]}, t_{[out]}$ and $v_{[lap,fli,flj]}/V_1, v_{[out]}/V_1$
        whole program: $t_{[[lap,fli,flj],[out]]}$
      • Given a stencil $s$ and the amount of computation $c_s$:
        $t_s = c_s / C$
      • Given a group $g$ and the vertical and lateral communication $v_c$ and $l_c^1, \ldots, l_c^m$:
        $t_g = \sum_{c \in g.child} \max\left(t_c,\ v_c / V_m,\ l_c^1 / L_1,\ \ldots,\ l_c^m / L_m\right)$

  13. Affine Sets and Maps
      • The stencil program analysis is based on (quasi-)affine sets and maps:
        $S = \{\, i \mid i \in \mathbb{Z}^n \wedge (0, \ldots, 0) < i < (10, \ldots, 10) \,\}$
        $M = \{\, i \to j \mid i \in \mathbb{Z}^n,\ j \in \mathbb{Z}^n,\ j = 2 \cdot i \,\}$
      • For example, data dependencies are expressed using a named map:
        $D_{lap} = \{\, (lap, i) \to (in, i + j) \mid i \in \mathbb{Z}^2,\ j \in \{(0,0), (1,0), (-1,0), (0,1), (0,-1)\} \,\}$
      • The dependencies of the in, lap, fli, flj, out graph are the union of the per-stencil maps:
        $D = D_{lap} \cup D_{fli} \cup D_{flj} \cup D_{out}$
      • Apply the out origin vector to the transitive closure of all dependencies:
        $E = D^{+}(\{(out, \vec{0})\})$
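
The effect of applying the transitive closure of the dependencies to the out origin can be imitated with explicit point sets. The sketch below does not use an affine set library; it is a plain fixed-point iteration, and all names are hypothetical. Only the lap offsets come from the slide. The offsets for fli, flj, and out are assumptions chosen to agree with the IJRange declarations in the STELLA example.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Offset = std::pair<int, int>;
using OffsetSet = std::set<Offset>;

// `consumer` at point p reads `producer` at p + o for every o in `offsets`.
struct Dependency {
    std::string consumer;
    std::string producer;
    OffsetSet offsets;
};

// Dependencies of the simplified horizontal diffusion (partly assumed, see lead-in).
std::vector<Dependency> horizontal_diffusion_deps() {
    return {
        Dependency{"lap", "in", {{0, 0}, {1, 0}, {-1, 0}, {0, 1}, {0, -1}}},
        Dependency{"fli", "lap", {{0, 0}, {1, 0}}},   // assumed offsets
        Dependency{"flj", "lap", {{0, 0}, {0, 1}}},   // assumed offsets
        Dependency{"out", "fli", {{0, 0}, {-1, 0}}},  // assumed offsets
        Dependency{"out", "flj", {{0, 0}, {0, -1}}},  // assumed offsets
        Dependency{"out", "in", {{0, 0}}},            // assumed offsets
    };
}

// Points of every field needed to evaluate `root` at the origin. The result
// also contains the origin of `root` itself.
std::map<std::string, OffsetSet> footprint(const std::vector<Dependency>& deps,
                                           const std::string& root) {
    std::map<std::string, OffsetSet> needed;
    needed[root].insert(Offset{0, 0});
    bool changed = true;
    while (changed) {  // terminates because the dependency graph is acyclic
        changed = false;
        for (const Dependency& d : deps) {
            auto it = needed.find(d.consumer);
            if (it == needed.end()) continue;
            const OffsetSet points = it->second;  // copy: `needed` grows below
            for (const Offset& p : points) {
                for (const Offset& o : d.offsets) {
                    Offset q{p.first + o.first, p.second + o.second};
                    if (needed[d.producer].insert(q).second) changed = true;
                }
            }
        }
    }
    return needed;
}
```

Calling `footprint(horizontal_diffusion_deps(), "out")` gives 2 points each for fli and flj, the 5-point cross for lap, and a 13-point diamond for in. The bounding box of the lap points is [-1,1] x [-1,1], which matches the IJRange of the Lap stage.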

  14. Tiling Maps
      Figure: a 6 x 6 grid of out evaluations (axes i_0 and i_1, each from 0 to 5) partitioned into 2 x 2 tiles with tile ids (0,0) to (2,2); for example, out (2,5) maps to tile id (1,2)
      • Define a tiling using a map that associates stencil evaluations to tile ids:
        $T_{out} = \{\, (out, i_0, i_1) \to (\lfloor i_0 / 2 \rfloor, \lfloor i_1 / 2 \rfloor) \,\}$
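
The tiling map amounts to a floor division per dimension. A small C++ sketch follows; the function names and the default tile size of 2 x 2 are illustrative assumptions taken from the example above. Note that C++ integer division truncates toward zero, so halo points with negative indices need an explicit floor.

```cpp
#include <utility>

// Floor division that rounds toward negative infinity, unlike the built-in
// `/`, which truncates toward zero. Requires tile > 0.
int floor_div(int a, int tile) {
    int q = a / tile;
    if (a % tile != 0 && a < 0) --q;
    return q;
}

// Maps the evaluation (i0, i1) to its tile id (floor(i0 / tile0), floor(i1 / tile1)).
std::pair<int, int> tile_id(int i0, int i1, int tile0 = 2, int tile1 = 2) {
    return {floor_div(i0, tile0), floor_div(i1, tile1)};
}
```

`tile_id(2, 5)` returns (1, 2) as in the figure, and a halo point such as (-1, 0) lands in tile (-1, 0) rather than in tile (0, 0).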

  15. Count Stencil Evaluations
      Note: count the points in the filtered tiling map using the Barvinok algorithm.
      • Intersect the range of the tiling map with the base tiling hierarchy origin tile:
        $S = \{\, (0, 0, y_0, y_1) \mid y_j \in \mathbb{Z} \,\}$
      • Count the number of stencil evaluations associated to the remaining tiles:
        $c_{out} = |T_{out} \cap_{ran} S| \cdot \#flops$

  16. Stencil Program Optimization
      • Put it all together (stencil algebra, performance model, stencil analysis)
        1. Optimize the stencil execution order (brute force search)
        2. Optimize the stencil grouping (dynamic programming)
      $\min_{x \in I} t(x)$ subject to $m(x) \le M$
      Figure: the set I of implementation variants of the lap, fli, flj, out graph

  17. Dynamic Programming (simplified)
      Tiling hierarchy level 1 (the figure numbers the processing steps 1 to 4):

        lap ⟷ lap   -           -           -
        lap ⟷ fli   fli ⟷ fli   -           -
        lap ⟷ flj   fli ⟷ flj   flj ⟷ flj   -
        lap ⟷ out   fli ⟷ out   flj ⟷ out   out ⟷ out

      Tiling hierarchy level 0 (only partially recoverable):

        -           -           -           -
        [ ... ] :: ...   fli ⟷ out   flj ⟷ out   out ⟷ out

  18. Evaluation
      Figure (bar charts): no fusion, hand-tuned, and optimized variants of the hd, uv, div, and uv&div benchmarks, with bar values normalized to the no fusion variant; panels: CPU Experiments (i5-3330) and GPU Experiments (Tesla K20c)
      Figure (scatter plots): m = measured time [ms] over e = estimated time [ms] for the same two platforms, with the fits m ~ 1.5e and m ~ 1.6e

  19. Conclusions
      • We categorize data locality transformations for stencil programs
      • We enumerate stencil program implementation variants using an algebra
      • Our performance model estimates the stencil program execution time
      • Using MODESTO we can automatically tune STELLA stencil programs
        • 2.0-3.1x speedup against naive implementations
        • 1.0-1.8x speedup against expert tuned implementations
