
spcl.inf.ethz.ch @spcl_eth
Tobias Gysi, Tobias Grosser, and Torsten Hoefler
Absinthe: Learning an Analytical Performance Model to Fuse and Tile Stencil Codes in One Shot

COSMO Atmospheric Model
• Regional atmospheric model used by 7 national weather services
• Implements many different stencil programs

Optimizing the Fastwaves Kernel from the COSMO Atmospheric Model
[Figure: model prediction [ms] and measured execution time [ms] for stencils 0 to 8 of the fastwaves kernel. Variants shown: unfused (64x64x1), tiled, absinthe, and auto-tuning, with tile sizes 64x4x1, 64x4x3, 64x4x4, and 64x4x5. Time labels: 1.08, 0.94, 0.73, 0.67, 0.62, 0.58. The auto-tuning variant is labeled -6.5%.]
Michael Baldauf, Axel Seifert, Jochen Förstner, Detlev Majewski, Matthias Raschendorfer, and Thorsten Reinhardt, Operational Convective-Scale Numerical Weather Prediction with the COSMO Model: Description and Sensitivities. 2011.

Stencil Programs Execute Multiple Stencils in Sequence
    for (int y = ybeg; y < yend; y++)
      for (int x = xbeg; x < xend; x++)
        A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);
    for (int y = ybeg; y < yend; y++)
      for (int x = xbeg; x < xend; x++)
        B(x,y) = A(x,y+1) + A(x,y);
First stencil: 1 load / 1 store. Second stencil: 2 loads / 1 store.
• element-wise computation
• position independent access pattern

Loop Tiling and Loop Fusion
    for (int idx = 0; idx < 4; ++idx) {
      int xbeg = tiles[idx].xbeg;
      int xend = tiles[idx].xend;
      int ybeg = tiles[idx].ybeg;
      int yend = tiles[idx].yend;
      Buffer A(xbeg, xend, ybeg, yend+1);
      for (int y = ybeg; y < yend+1; ++y)
        for (int x = xbeg; x < xend; ++x)
          A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);
      for (int y = ybeg; y < yend; y++)
        for (int x = xbeg; x < xend; x++)
          B(x,y) = A(x,y+1) + A(x,y);
    }
First stencil: 1 load / 0 store. Second stencil: 0 loads / 1 store.
[Figure: domain split into four tiles, idx = 0 to 3, each spanning xbeg..xend and ybeg..yend.]
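The fused and tiled loop nest can be checked against the unfused one with a small self-contained sketch. The `Field` and `Tile` types, the one-cell halo, and the function names are assumptions made for this illustration and do not come from the slides. The reference version computes A on one extra row so that the read of A(x,y+1) is defined at the upper domain boundary.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 2D field with a one-cell halo, so x-1, x+1 and y+1 stay in bounds.
struct Field {
  int nx, ny;
  std::vector<double> data;
  Field(int nx_, int ny_) : nx(nx_), ny(ny_), data((nx_ + 2) * (ny_ + 2), 0.0) {}
  double& operator()(int x, int y) { return data[(y + 1) * (nx + 2) + (x + 1)]; }
};

// Hypothetical tile descriptor with half-open ranges [xbeg, xend) x [ybeg, yend).
struct Tile { int xbeg, xend, ybeg, yend; };

// Reference: both stencils run in sequence over the full domain.
// A is computed up to row ny because the second stencil reads A(x, y+1).
void run_unfused(Field& I, Field& B) {
  Field A(I.nx, I.ny);
  for (int y = 0; y < I.ny + 1; ++y)
    for (int x = 0; x < I.nx; ++x)
      A(x, y) = I(x, y) + I(x - 1, y) + I(x + 1, y);
  for (int y = 0; y < I.ny; ++y)
    for (int x = 0; x < I.nx; ++x)
      B(x, y) = A(x, y + 1) + A(x, y);
}

// Fused and tiled: A lives in a small per-tile buffer with one redundant row.
void run_fused(Field& I, Field& B, const std::vector<Tile>& tiles) {
  for (const Tile& t : tiles) {
    const int w = t.xend - t.xbeg;
    const int h = t.yend - t.ybeg + 1;  // extra row yend is recomputed per tile
    std::vector<double> buf(w * h);
    auto A = [&](int x, int y) -> double& {
      return buf[(y - t.ybeg) * w + (x - t.xbeg)];
    };
    for (int y = t.ybeg; y < t.yend + 1; ++y)
      for (int x = t.xbeg; x < t.xend; ++x)
        A(x, y) = I(x, y) + I(x - 1, y) + I(x + 1, y);
    for (int y = t.ybeg; y < t.yend; ++y)
      for (int x = t.xbeg; x < t.xend; ++x)
        B(x, y) = A(x, y + 1) + A(x, y);
  }
}
```

Calling both functions on the same input with four tiles that cover the domain should give identical B fields. The price of fusion is the recomputed row of A at each tile boundary.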

Architecture Overview
1. model learner: benchmarks the target system to obtain the learned parameters of the model $t(p, b) = Pp + Bb$
2. optimizer: an ILP solver selects the code transformations
3. code generator: emits the fast code

Performance Model Ideas
• execution time of innermost loop: scalar peel loops and vectorized loop body
• memory accesses dominate the execution time: slow memory (L3 cache/DDR) and fast memory (L1 cache)

Performance Model Design
• linear cost functions for peel and body cost: $t = Pp + Bb$
• slow and fast memory: $t = \max(P_1 p_1, P_2 p_2) + \max(B_1 b_1, B_2 b_2)$
• model the entire program: $t = \sum_{i=0..8} t_i$
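The three bullets translate into a few lines of code. The sketch below assumes the model is evaluated per stencil from peel and body access counts on two memory levels, with one learned (P, B) pair per level. All type and function names are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Learned cost per peel access (P) and per body access (B) on one memory level.
struct LevelParams { double P, B; };

// Peel (p) and body (b) access counts of one stencil on one memory level.
struct LevelCounts { double p, b; };

// Access counts of one stencil on both memory levels.
struct StencilCounts { LevelCounts level1, level2; };

// t = max(P1*p1, P2*p2) + max(B1*b1, B2*b2): the slower level dominates
// the peel cost and the body cost separately.
double stencil_time(const LevelParams& m1, const LevelParams& m2,
                    const StencilCounts& s) {
  const double peel = std::max(m1.P * s.level1.p, m2.P * s.level2.p);
  const double body = std::max(m1.B * s.level1.b, m2.B * s.level2.b);
  return peel + body;
}

// Program time is the sum of the per-stencil estimates.
double program_time(const LevelParams& m1, const LevelParams& m2,
                    const std::vector<StencilCounts>& stencils) {
  double t = 0.0;
  for (const StencilCounts& s : stencils) t += stencil_time(m1, m2, s);
  return t;
}
```

Because the estimate is built only from sums, products with constants, and maxima, it can be expressed with linear constraints, which is what lets an ILP solver optimize it.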

Evaluating the Fast Memory Model
Example: stencil with 3 loads / 1 store, $n^x = 2$, $n^y = 2$
• # cache accesses
  $p_f = (3 + 1)\, n^x (D^y + e^y n^y)$
  $b_f = (3 + 1)(D^x D^y + D^x e^y n^y)$
• estimated execution time
  $t = P_f p_f + B_f b_f$
  learn the model parameters $P_f$, $B_f$
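A worked instance makes the access counts concrete. The values $n^x = n^y = 2$ and the 3 loads / 1 store come from the slide. The domain size $D^x = D^y = 64$ and the halo $e^y = 1$ are assumed here purely for illustration, as is the reading that the factor $(3 + 1)$ applies to the whole body count.

```latex
\begin{align*}
p_f &= (3 + 1)\, n^x (D^y + e^y n^y) = 4 \cdot 2 \cdot (64 + 1 \cdot 2) = 528 \\
b_f &= (3 + 1)(D^x D^y + D^x e^y n^y) = 4 \cdot (4096 + 128) = 16896 \\
t   &= 528\, P_f + 16896\, B_f
\end{align*}
```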

Learning the Fast Memory Model
[Figure: fast memory benchmark, execution time [ms] (0.00 to 0.10) over x (20 to 80) for p=12, p=16, and p=20. Each benchmark stencil sums input array elements at k-1, k, and k+1 into the output array at k.]
$(P_f, B_f) = \operatorname{argmin}_{P, B \in \mathbb{R}} \sum_{r \in [0, R]} \lVert P p_r + B b_r - t_r \rVert$
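The fit itself can be sketched as a two-parameter regression over the benchmark runs. The slide text does not preserve which norm is minimized, so ordinary least squares is an assumption here, swapped in because it has a closed form. The `Sample` and `Params` types are hypothetical as well.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One benchmark run r: peel count p_r, body count b_r, measured time t_r.
struct Sample { double p, b, t; };

// Learned model parameters.
struct Params { double P, B; };

// Fit t ~ P*p + B*b by ordinary least squares (an assumed loss), solving the
// 2x2 normal equations with Cramer's rule.
Params fit_least_squares(const std::vector<Sample>& runs) {
  double spp = 0, spb = 0, sbb = 0, spt = 0, sbt = 0;
  for (const Sample& r : runs) {
    spp += r.p * r.p;
    spb += r.p * r.b;
    sbb += r.b * r.b;
    spt += r.p * r.t;
    sbt += r.b * r.t;
  }
  const double det = spp * sbb - spb * spb;
  assert(std::fabs(det) > 1e-12);  // runs must not all be proportional
  return {(spt * sbb - spb * sbt) / det, (spp * sbt - spb * spt) / det};
}
```

The benchmark has to vary the peel and body counts independently (for example through different p and x settings), otherwise the normal equations become singular and P and B cannot be separated.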

Linear Multiplication of Bounded Integer Variables
• the binary product $p = xb$ given the upper bound $X$
  limit range: $0 \le p \le x$
  force result: $p - Xb \le 0$ and $p - x - Xb \ge -X$
  for $b = 0$ this leaves $p \ge 0$ and $p \le 0$; for $b = 1$ it leaves $p \ge x$ and $p \le x$
• the integer product $p = xy$ given the upper bounds $X$ and $Y$
  binary representation: $y = \sum_{i=0}^{\lfloor \log_2(Y) \rfloor} 2^i y_i$
  sum binary products: $p = \sum_{i=0}^{\lfloor \log_2(Y) \rfloor} 2^i x y_i$
https://blog.adamfurmanek.pl/2015/09/26/ilp-part-6/
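A brute-force check shows that the linear constraints admit exactly one value of p, namely the product. The sketch below enumerates candidates instead of calling an ILP solver. The function names are invented for this illustration.

```cpp
#include <cassert>

// True if (p, x, b) satisfies the linear constraints for p = x*b,
// with 0 <= x <= X and b in {0, 1}.
bool binary_product_feasible(int p, int x, int b, int X) {
  return 0 <= p && p <= x       // limit range
      && p - X * b <= 0         // forces p <= 0 when b = 0
      && p - x - X * b >= -X;   // forces p >= x when b = 1
}

// Returns the unique feasible p, or -1 if the feasible set is not a singleton.
int solve_binary_product(int x, int b, int X) {
  int found = -1, count = 0;
  for (int p = 0; p <= X; ++p)
    if (binary_product_feasible(p, x, b, X)) { found = p; ++count; }
  return count == 1 ? found : -1;
}

// Integer product p = x*y: split y into bits y_i and sum 2^i * (x * y_i),
// where each x * y_i is a binary product as above.
int integer_product(int x, int y, int X, int Y) {
  int p = 0;
  for (int i = 0; (1 << i) <= Y; ++i) {  // i = 0 .. floor(log2(Y))
    const int yi = (y >> i) & 1;
    p += (1 << i) * solve_binary_product(x, yi, X);
  }
  return p;
}
```

Each bit of y costs one auxiliary variable and a handful of constraints, so the encoding grows only logarithmically with the bound Y.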

Comparison to Auto-tuning, Heuristics, Hand-tuned, and Random Variants
[Figure: three panels (fastwaves, diffusion, advection) plotting measured time [ms] over estimated time [ms]. Marked points: absinthe, auto-tuning, hand, min, and max; one max point is labeled (74.0%). The auto-tuning points are labeled (-6.5%), (-0.8%), and (-3.4%).]

Comparison to Halide and Polymage
[Figure: bar chart of execution time [ms] for Absinthe, Halide, and Polymage on fastwaves, advection, and diffusion. Bars carry relative factors between 1x and 3.7x.]
R. T. Mullapudi, A. Adams, D. Sharlet, J. Ragan-Kelley, and K. Fatahalian, Automatically scheduling halide image processing pipelines. 2016.
A. Jangda and U. Bondhugula, An effective fusion and tile size model for optimizing image processing pipelines. 2018.

Conclusions
• loop fusion and loop tiling
• learned performance model
• integer linear programming
• close to auto-tuning

Backup Slides

Model the Space of Possible Code Transformations
stencils: 0 1 2 3 4 5 6 7 8
tile sizes: 64x4x3 (stencils 0 to 5), 64x4x5 (stencils 6 to 8)
fusion choices: $g_0 = g_1 = g_2 = g_3 = g_4 = g_5 = 0$, $g_6 = g_7 = g_8 = 1$
$0 \le g_{i+1} - g_i \le 1 \quad \forall i \in [0, 7]$
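The group variables are easy to decode: consecutive stencils with the same value form one fused group, and the constraint only allows the value to stay or increase by one. A minimal sketch, with function names invented for this illustration:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Checks 0 <= g[i+1] - g[i] <= 1 for all consecutive stencils.
bool valid_fusion(const std::vector<int>& g) {
  for (std::size_t i = 0; i + 1 < g.size(); ++i) {
    const int d = g[i + 1] - g[i];
    if (d < 0 || d > 1) return false;
  }
  return true;
}

// Decodes group variables into inclusive [first, last] stencil ranges.
std::vector<std::pair<int, int>> fusion_groups(const std::vector<int>& g) {
  std::vector<std::pair<int, int>> groups;
  for (int i = 0; i < static_cast<int>(g.size()); ++i) {
    if (i == 0 || g[i] != g[i - 1])
      groups.push_back({i, i});   // a new group starts at stencil i
    else
      groups.back().second = i;   // stencil i joins the current group
  }
  return groups;
}
```

For the assignment on the slide this yields two groups, stencils 0 to 5 and stencils 6 to 8, matching the two tile sizes shown.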

Model the Space of Possible Code Transformations
stencils: 0 1 2 3 4 5 6 7 8
tile sizes: 64x4x3 (stencils 0 to 5), 64x4x5 (stencils 6 to 8)
  stencil 0: $n_0^x = 1$, $n_0^y = 16$, $n_0^z = 20$
  ...
  stencil 5: $n_5^x = 1$, $n_5^y = 16$, $n_5^z = 20$
  stencil 6: $n_6^x = 1$, $n_6^y = 16$, $n_6^z = 12$
  ...
  stencil 8: $n_8^x = 1$, $n_8^y = 16$, $n_8^z = 12$
equality constraints
$1 \le n_i^x \le D^x, \quad 1 \le n_i^y \le D^y, \quad 1 \le n_i^z \le D^z \quad \forall i \in [0, 8]$

Limit the Cache Utilization
stencils: 0 1 2
$F_{02} = 6$, $F_{12} = 5$, $F_{22} = 4$
$f_2 \ge F_{22}$
$f_2 + F_{12}(g_2 - g_1) \ge F_{12}$
$f_2 + F_{02}(g_2 - g_0) \ge F_{02}$
$C n_2^x n_2^y n_2^z - f_2 \ge 0$
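One way to read these constraints: each lower bound on the footprint of stencil j is active only when stencil j shares a group with stencil i, because a group difference of one cancels the right-hand side. The sketch below assumes that reading and treats F[i][j] as the footprint of stencil j when the group reaches back to stencil i. It also assumes C is a capacity factor that is compared against the footprint after scaling by the three tile variables. Neither interpretation is spelled out in the slide text, and the function names are invented.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Smallest f_j satisfying f_j >= F[j][j] and
// f_j + F[i][j] * (g[j] - g[i]) >= F[i][j] for all i < j.
int min_footprint(const std::vector<std::vector<int>>& F,
                  const std::vector<int>& g, int j) {
  int f = F[j][j];
  for (int i = 0; i < j; ++i)
    f = std::max(f, F[i][j] - F[i][j] * (g[j] - g[i]));
  return f;
}

// Cache limit C * n^x * n^y * n^z - f >= 0 (C is an assumed capacity factor).
bool fits_cache(long long C, long long nx, long long ny, long long nz, long long f) {
  return C * nx * ny * nz - f >= 0;
}
```

With the values from the slide, stencil 2 needs a footprint of 6 when all three stencils are fused, 5 when it is fused only with stencil 1, and 4 when it runs alone.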
