spcl.inf.ethz.ch @spcl_eth
Tobias Gysi, Tobias Grosser, and Torsten Hoefler
Absinthe: Learning an Analytical Performance Model to Fuse and Tile Stencil Codes in One Shot
COSMO Atmospheric Model
• Regional atmospheric model used by 7 national weather services
• Implements many different stencil programs
Optimizing the Fastwaves Kernel from the COSMO Atmospheric Model
[Figure: model prediction [ms] and measured execution time [ms] for stencils 0 to 8 of the fastwaves kernel, comparing the unfused, tiled, absinthe, and auto-tuning variants. Tile sizes on the slide: 64x64x1, 64x4x1, 64x4x3, 64x4x4, 64x4x5. Time labels on the slide: 0.58, 0.62, 0.67, 0.73, 0.94, 1.08. Auto-tuning is annotated with -6.5%.]
Michael Baldauf, Axel Seifert, Jochen Förstner, Detlev Majewski, Matthias Raschendorfer, and Thorsten Reinhardt, Operational Convective-Scale Numerical Weather Prediction with the COSMO Model: Description and Sensitivities. 2011.
Stencil Programs Execute Multiple Stencils in Sequence
    for (int y = ybeg; y < yend; y++)
      for (int x = xbeg; x < xend; x++)
        A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);
    for (int y = ybeg; y < yend; y++)
      for (int x = xbeg; x < xend; x++)
        B(x,y) = A(x,y+1) + A(x,y);
• element-wise computation
• position independent access pattern
Annotations: 1 load / 1 store; 2 loads / 1 store
[Figure: x/y iteration domain spanning xbeg to xend and ybeg to yend.]
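The two loop nests on this slide can be turned into a small runnable sketch. The `Grid` type and `run_unfused` below are hypothetical helpers, not part of the slides; the sketch assumes a one-cell halo around each array and computes A for one extra row so that the read of `A(x, y+1)` in the second stencil is always defined.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 2D array with a one-cell halo, so x-1, x+1, and y+1 stay in bounds.
struct Grid {
  int nx, ny;
  std::vector<double> data;
  Grid(int nx_, int ny_) : nx(nx_), ny(ny_), data((nx_ + 2) * (ny_ + 2), 0.0) {}
  double& operator()(int x, int y) { return data[(y + 1) * (nx + 2) + (x + 1)]; }
};

// Executes both stencils in sequence over the full domain.
void run_unfused(Grid& I, Grid& A, Grid& B) {
  assert(I.nx == A.nx && I.ny == A.ny && I.nx == B.nx && I.ny == B.ny);
  // First stencil: A is written for rows 0..ny (one extra row for the second stencil).
  for (int y = 0; y < I.ny + 1; ++y)
    for (int x = 0; x < I.nx; ++x)
      A(x, y) = I(x, y) + I(x - 1, y) + I(x + 1, y);
  // Second stencil: consumes the complete array A produced above.
  for (int y = 0; y < I.ny; ++y)
    for (int x = 0; x < I.nx; ++x)
      B(x, y) = A(x, y + 1) + A(x, y);
}
```

In this form the whole array A travels through memory between the two loop nests, which is the traffic that fusion and tiling aim to avoid.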
Loop Tiling and Loop Fusion
    for (int idx = 0; idx < 4; ++idx) {
      int xbeg = tiles[idx].xbeg;
      int xend = tiles[idx].xend;
      int ybeg = tiles[idx].ybeg;
      int yend = tiles[idx].yend;
      Buffer A(xbeg, xend, ybeg, yend+1);
      for (int y = ybeg; y < yend+1; ++y)
        for (int x = xbeg; x < xend; ++x)
          A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);
      for (int y = ybeg; y < yend; y++)
        for (int x = xbeg; x < xend; x++)
          B(x,y) = A(x,y+1) + A(x,y);
    }
Annotations: 1 load / 0 store; 0 loads / 1 store
[Figure: x/y domain split into four tiles, idx = 0 to idx = 3, each spanning its own xbeg to xend and ybeg to yend.]
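A self-contained sketch of the fused and tiled loop nest follows. `Grid`, `Tile`, and `run_fused_tiled` are hypothetical names; a plain `std::vector` stands in for the slide's `Buffer`, and each tile computes one extra row of A, as the `yend+1` bound on the slide suggests.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 2D array with a one-cell halo, so x-1, x+1, and y+1 stay in bounds.
struct Grid {
  int nx, ny;
  std::vector<double> data;
  Grid(int nx_, int ny_) : nx(nx_), ny(ny_), data((nx_ + 2) * (ny_ + 2), 0.0) {}
  double& operator()(int x, int y) { return data[(y + 1) * (nx + 2) + (x + 1)]; }
};

struct Tile { int xbeg, xend, ybeg, yend; };

// Per tile: compute A into a small tile-local buffer, then consume it for B.
void run_fused_tiled(Grid& I, Grid& B, const std::vector<Tile>& tiles) {
  for (const Tile& t : tiles) {
    assert(t.xbeg < t.xend && t.ybeg < t.yend);
    const int w = t.xend - t.xbeg;
    const int h = t.yend - t.ybeg + 1;  // one extra row for the A(x, y+1) read
    std::vector<double> A(w * h);
    for (int y = t.ybeg; y < t.yend + 1; ++y)
      for (int x = t.xbeg; x < t.xend; ++x)
        A[(y - t.ybeg) * w + (x - t.xbeg)] = I(x, y) + I(x - 1, y) + I(x + 1, y);
    for (int y = t.ybeg; y < t.yend; ++y)
      for (int x = t.xbeg; x < t.xend; ++x)
        B(x, y) = A[(y + 1 - t.ybeg) * w + (x - t.xbeg)] + A[(y - t.ybeg) * w + (x - t.xbeg)];
  }
}
```

The extra row is recomputed by neighboring tiles, which is the redundant work that fusion trades for keeping A in fast memory.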
Architecture Overview
1 model learner: benchmarks the target system and learns the parameters of $t(p, b) = Pp + Bb$
2 optimizer: ILP solver that takes the learned parameters and selects the code transformations
3 code generator: applies the code transformations and emits fast code
Performance Model Ideas
• execution time of innermost loop: scalar peel loops, vectorized loop body
• memory accesses dominate the execution time: slow memory (L3 cache/DDR), fast memory (L1 cache)
Performance Model Design
• linear cost functions for peel and body cost: $t = Pp + Bb$
• slow and fast memory: $t = \max(P_1 p_1, P_2 p_2) + \max(B_1 b_1, B_2 b_2)$
• model the entire program: $t = \sum_{i=0}^{8} t_i$
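The cost functions on this slide can be evaluated directly. The sketch below is illustrative only: the struct and function names are hypothetical, and it assumes that every stencil provides a peel count and a body count for each of the two memory levels, with the slower level determining the peel cost and the body cost separately.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Learned cost per peel access (P) and per body access (B) of one memory level.
struct Params { double P, B; };
// Peel (p) and body (b) access counts of one stencil at one memory level.
struct Counts { double p, b; };
// Counts of one stencil at both memory levels (illustrative naming).
struct StencilCounts { Counts level1, level2; };

// Time of one stencil: max over the memory levels for peel and body, then the sum.
double stencil_time(const Params& m1, const Params& m2, const StencilCounts& s) {
  assert(s.level1.p >= 0 && s.level1.b >= 0 && s.level2.p >= 0 && s.level2.b >= 0);
  const double peel = std::max(m1.P * s.level1.p, m2.P * s.level2.p);
  const double body = std::max(m1.B * s.level1.b, m2.B * s.level2.b);
  return peel + body;
}

// Time of the entire program: sum of the per-stencil times.
double program_time(const Params& m1, const Params& m2,
                    const std::vector<StencilCounts>& stencils) {
  double total = 0.0;
  for (const StencilCounts& s : stencils) total += stencil_time(m1, m2, s);
  return total;
}
```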
Evaluating the Fast Memory Model
Example: $n^x = 2$, $n^y = 2$, 3 loads / 1 store
• # cache accesses, built up in three steps:
  1. $p_f = n^x D^y$ and $b_f = D^x D^y + 0$
  2. $p_f = n^x (D^y + e^y n^y)$ and $b_f = D^x D^y + D^x e^y n^y$
  3. $p_f = (3 + 1)\, n^x (D^y + e^y n^y)$ and $b_f = (3 + 1)(D^x D^y + D^x e^y n^y)$
• estimated execution time: $t = P_f p_f + B_f b_f$
• learn the model parameters $P_f$, $B_f$
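As a worked example, the final form of the access counts can be computed for concrete numbers. The sketch rests on several assumptions that the slide does not state: the n values count tiles per dimension, the D values are domain sizes, e is the number of extra elements per tile in y, and the factor 3 + 1 (three loads plus one store) multiplies the complete peel and body counts. The function name and the values in the usage note are hypothetical.

```cpp
#include <cassert>

struct AccessCounts { long peel, body; };

// accesses: arrays touched per element (3 loads + 1 store = 4 on the slide)
// nx, ny:   number of tiles per dimension (assumed meaning of n)
// Dx, Dy:   domain size per dimension (assumed meaning of D)
// ey:       extra elements per tile in y (assumed meaning of e)
AccessCounts fast_memory_counts(long accesses, long nx, long ny,
                                long Dx, long Dy, long ey) {
  assert(accesses > 0 && nx > 0 && ny > 0 && Dx > 0 && Dy > 0 && ey >= 0);
  const long peel = accesses * nx * (Dy + ey * ny);
  const long body = accesses * (Dx * Dy + Dx * ey * ny);
  return {peel, body};
}
```

For example, `fast_memory_counts(4, 2, 2, 64, 64, 1)` yields a peel count of 4 * 2 * (64 + 2) = 528 and a body count of 4 * (4096 + 128) = 16896.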
Learning the Fast Memory Model
[Figure: stencil diagrams that relate an output array element to the input array elements k-1, k, and k+1, next to a plot of fast memory execution time [ms] (0.00 to 0.10) over x (20 to 80) for p=12, p=16, and p=20.]
$(P_f, B_f) = \operatorname{argmin}_{(P,B) \in \mathbb{R}} \sum_{r \in [0,R]} \lvert P p_r + B b_r - t_r \rvert$
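To illustrate how two parameters can be learned from benchmark samples, the sketch below fits the linear model t = P p + B b to measured (p, b, t) triples. It uses ordinary least squares through the 2x2 normal equations as a stand-in; the slide does not confirm which error measure the authors minimize, and the names `Sample`, `Fit`, and `fit_least_squares` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Sample { double p, b, t; };  // peel count, body count, measured time
struct Fit { double P, B; };        // fitted cost per peel and per body access

// Solves the normal equations of min sum_r (P * p_r + B * b_r - t_r)^2.
Fit fit_least_squares(const std::vector<Sample>& samples) {
  double spp = 0, spb = 0, sbb = 0, spt = 0, sbt = 0;
  for (const Sample& s : samples) {
    spp += s.p * s.p;
    spb += s.p * s.b;
    sbb += s.b * s.b;
    spt += s.p * s.t;
    sbt += s.b * s.t;
  }
  const double det = spp * sbb - spb * spb;
  assert(std::fabs(det) > 1e-12);  // the samples must not all be proportional
  return {(spt * sbb - sbt * spb) / det, (sbt * spp - spt * spb) / det};
}
```

Varying both the peel and the body count across the benchmark runs keeps the system well conditioned; with proportional samples the two parameters cannot be separated.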
Linear Multiplication of Bounded Integer Variables
• the binary product $p = xb$ given the upper bound $X$
  limit range: $0 \le p \le x$
  force result: $p - Xb \le 0$ and $p - x - Xb \ge -X$
  [Figure: for $b = 0$ the result is bounded by $p \ge 0$ and $p \le 0$; for $b = 1$ it is bounded by $p \ge x$ and $p \le x$.]
• the integer product $p = xy$ given the upper bounds $X$ and $Y$
  binary representation: $y = \sum_{i=0}^{\lfloor \log_2(Y) \rfloor} 2^i y_i$
  sum binary products: $p = \sum_{i=0}^{\lfloor \log_2(Y) \rfloor} 2^i x y_i$
https://blog.adamfurmanek.pl/2015/09/26/ilp-part-6/
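The linearization can be checked by brute force. The sketch below tests, for a small bound X, that the three linear constraints 0 <= p <= x, p - X*b <= 0, and p - x - X*b >= -X admit exactly the value p = x*b, and it evaluates an integer product as a weighted sum of binary products. All function names are hypothetical, and no ILP solver is involved.

```cpp
#include <cassert>

// True if (p, x, b) satisfies the linear constraints for p = x * b,
// where 0 <= x <= X and b is binary.
bool satisfies_binary_product(int p, int x, int b, int X) {
  return 0 <= p && p <= x         // limit range
      && p - X * b <= 0           // forces p = 0 when b = 0
      && p - x - X * b >= -X;     // forces p >= x when b = 1
}

// Exhaustively checks that the constraints accept p if and only if p == x * b.
bool linearization_is_exact(int X) {
  assert(X >= 0);
  for (int x = 0; x <= X; ++x)
    for (int b = 0; b <= 1; ++b)
      for (int p = -1; p <= X + 1; ++p)
        if (satisfies_binary_product(p, x, b, X) != (p == x * b)) return false;
  return true;
}

// Integer product as a sum of binary products: p = sum_i 2^i * (x * y_i),
// where y_i is bit i of the non-negative integer y.
int product_via_bits(int x, int y) {
  assert(x >= 0 && y >= 0);
  int p = 0;
  for (int i = 0; (y >> i) != 0; ++i) {
    const int yi = (y >> i) & 1;  // binary variable y_i
    p += (1 << i) * (x * yi);     // each x * y_i is a binary product
  }
  return p;
}
```

In an ILP, every term `x * yi` would be replaced by an auxiliary variable that is constrained as in `satisfies_binary_product`.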
Comparison to Auto-tuning, Heuristics, Hand-tuned, and Random Variants
[Figure: three panels (fastwaves, diffusion, advection) that plot measured time [ms] over estimated time [ms], with markers for min, max, hand, absinthe, and auto-tuning. Annotations on the slide: auto-tuning (-6.5%), (-0.8%), (-3.4%); max (74.0%).]
Comparison to Halide and Polymage
[Figure: execution time [ms] (0 to 20) of Absinthe, Halide, and Polymage for fastwaves, advection, and diffusion. Bar labels on the slide: 1x, 1x, 1x, 1.06x, 1.29x, 1.4x, 1.66x, 2.03x, 3.7x.]
R. T. Mullapudi, A. Adams, D. Sharlet, J. Ragan-Kelley, and K. Fatahalian, Automatically scheduling Halide image processing pipelines. 2016.
A. Jangda and U. Bondhugula, An effective fusion and tile size model for optimizing image processing pipelines. 2018.
Conclusions
• loop fusion and loop tiling
• learned performance model
• integer linear programming
• close to auto-tuning
Backup Slides
Model the Space of Possible Code Transformations
• stencils: 0 1 2 3 4 5 6 7 8, in two groups labeled 64x4x3 and 64x4x5
• fusion choices: $g_0 = 0$, $g_1 = 0$, $g_2 = 0$, $g_3 = 0$, $g_4 = 0$, $g_5 = 0$, $g_6 = 1$, $g_7 = 1$, $g_8 = 1$
• $0 \le g_{i+1} - g_i \le 1 \quad \forall i \in [0,7]$
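The fusion constraint on this slide says that the group index never decreases and grows by at most one from one stencil to the next. A minimal check of that property, with a hypothetical function name and the group indices passed as a vector, could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Checks 0 <= g[i+1] - g[i] <= 1 for every pair of consecutive stencils.
bool valid_fusion_choice(const std::vector<int>& g) {
  assert(!g.empty());
  for (std::size_t i = 0; i + 1 < g.size(); ++i) {
    const int step = g[i + 1] - g[i];
    if (step < 0 || step > 1) return false;
  }
  return true;
}
```

The assignment shown on the slide, `{0, 0, 0, 0, 0, 0, 1, 1, 1}`, passes this check: stencils 0 to 5 share one group and stencils 6 to 8 the next.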
Model the Space of Possible Code Transformations
• stencils: 0 1 2 3 4 5 6 7 8, in two groups labeled 64x4x3 and 64x4x5
• tile sizes:
  stencil 0: $n_0^x = 1$, $n_0^y = 16$, $n_0^z = 20$
  stencil 5: $n_5^x = 1$, $n_5^y = 16$, $n_5^z = 20$
  stencil 6: $n_6^x = 1$, $n_6^y = 16$, $n_6^z = 12$
  stencil 8: $n_8^x = 1$, $n_8^y = 16$, $n_8^z = 12$
• equality constraints
• $1 \le n_i^x \le D^x$, $1 \le n_i^y \le D^y$, $1 \le n_i^z \le D^z \quad \forall i \in [0,8]$
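The values on this slide are consistent with the variables counting tiles per dimension rather than storing tile extents. The sketch below converts a tile count into a tile size under two assumptions that the slide does not state: the tile size is the domain size divided by the tile count, and the domain measures 64x64x60. The function name is hypothetical.

```cpp
#include <cassert>

// Tile size along one dimension when `domain` points are split into `tiles` tiles.
int tile_size(int domain, int tiles) {
  assert(domain >= 1 && tiles >= 1 && tiles <= domain);  // mirrors 1 <= n <= D
  return (domain + tiles - 1) / tiles;  // round up so the tiles cover the domain
}
```

Under these assumptions, tile counts of 1, 16, and 20 give the 64x4x3 tiles, and tile counts of 1, 16, and 12 give the 64x4x5 tiles.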
Limit the Cache Utilization
• stencils: 0 1 2, with $F_{02} = 6$, $F_{12} = 5$, $F_{22} = 4$
• $f_2 \ge F_{22}$
• $f_2 + F_{12}(g_2 - g_1) \ge F_{12}$
• $f_2 + F_{02}(g_2 - g_0) \ge F_{02}$
• $C n_2^x n_2^y n_2^z - f_2 \ge 0$