
Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures
Tom Henretty (1), Kevin Stock (1), Louis-Noël Pouchet (1), Franz Franchetti (2), J. Ramanujam (3), P. Sadayappan (1)
(1) The Ohio State University
(2) Carnegie Mellon University
(3) Louisiana State University
March 29, 2011
ETAPS CC’11, Saarbrücken, Germany

Outline
1. Introduction
2. Vectorization of Stencils
3. Stream Alignment Conflict
4. Data Layout Transformation
5. Compiler Framework
6. Experimental Results
7. Conclusion
OSU / CMU / LSU

Introduction: Short-Vector SIMD
◮ Perform identical computation on small chunks of data
◮ Operations are independent
◮ Vector size: from 2 to 64
◮ Packing operations to form a vector (shuffle, extract, ...)
◮ Low latency, multiple SIMD units per CPU
◮ Maximal speedup equals the vector size
◮ Ubiquitous feature on modern processors
  ◮ x86 – SSE, AVX
  ◮ Power – VMX / VSX
  ◮ ARM – NEON
  ◮ Cell SPU

Introduction: A Brief on Stencil Computations
◮ Typically: iterative update of a structured (fixed) grid
  ◮ Compute a point from the values of its neighbor points
  ◮ Same grid / multiple grids
◮ Numerous application domains use stencils
  ◮ Finite difference methods for solving PDEs
  ◮ Image processing
  ◮ Computational electromagnetics, CFD, numerical relativity, ...
◮ Domain-Specific Languages for Stencils (FEniCS, RNPL, ...)

Vectorization of Stencils: Stencil Example
(a) 5-point stencil C code
    for (t = 0; t < TMAX; ++t)
      for (i = 1; i < N-1; ++i)
        for (j = 1; j < M-1; ++j)
          a[i][j] = b[i+1][j] + b[i][j-1] + b[i][j] + b[i][j+1] + b[i-1][j];
(b) Arrays a, b, and stencil detail (figure: N x M grid, rows indexed by i, columns by j)

Vectorization of Stencils: Vectorization of Stencil Computation
◮ Two “main” types of stencils
  ◮ Jacobi-like: the output array is distinct from the input array, so the updates within one sweep are independent
  ◮ Seidel-like: in-place update
◮ Loop transformations expose tiling possibilities, and at least one innermost parallel loop
◮ Auto-vectorization is successful (ICC, GCC)...
◮ ...but SIMD speedup is far from optimal!
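
The difference between the two stencil types can be made concrete with a small sketch. It borrows the 1D 3-point average used later in the deck; the function names `jacobi_sweep` and `seidel_sweep` are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Jacobi-like sweep: reads only `in`, writes only `out`.
   No iteration reads a value written by another iteration,
   so the loop is parallel and can be vectorized directly. */
void jacobi_sweep(const float *in, float *out, size_t n) {
    assert(in != out);  /* the two arrays must be distinct */
    for (size_t i = 1; i + 1 < n; ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

/* Seidel-like sweep: updates `a` in place.
   a[i-1] already holds the value written earlier in this sweep,
   which creates a loop-carried dependence. */
void seidel_sweep(float *a, size_t n) {
    for (size_t i = 1; i + 1 < n; ++i)
        a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0f;
}
```

Starting from the same data, the two sweeps generally produce different interior values, because the in-place version propagates freshly written values along the sweep direction.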

Vectorization of Stencils: Performance Consideration
(a) Stencil code
    for (t = 0; t < T; ++t) {
      for (i = 0; i < N; ++i)
        for (j = 1; j < N+1; ++j)
    S1:   C[i][j] = A[i][j] + A[i][j-1];
      for (i = 0; i < N; ++i)
        for (j = 1; j < N+1; ++j)
    S2:   A[i][j] = C[i][j] + C[i][j-1];
    }
    Performance: AMD Phenom 1.2 GFlop/s, Core2 3.5 GFlop/s, Core i7 4.1 GFlop/s
(b) Non-stencil code
    for (t = 0; t < T; ++t) {
      for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
    S3:   C[i][j] = A[i][j] + B[i][j];
      for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
    S4:   A[i][j] = C[i][j] + B[i][j];
    }
    Performance: AMD Phenom 1.9 GFlop/s, Core2 6.0 GFlop/s, Core i7 6.7 GFlop/s
Stencil code (a) has much lower performance than the non-stencil code (b) despite accessing 50% fewer data elements

Stream Alignment Conflict: Stream Alignment Conflict
C code:
    for (i = 0; i < H; i++)
      for (j = 0; j < W - 1; j++)
        A[i][j] = B[i][j] + B[i][j+1];
x86 assembly:
    movaps B(...), %xmm1
    movaps 16+B(...), %xmm2
    movaps %xmm2, %xmm3
    palignr $4, %xmm1, %xmm3
    ;; Register state here
    addps %xmm1, %xmm3
    movaps %xmm3, A(...)
Memory contents:
    A: A B C D E F G H ...
    B: I J K L M N O P ...
Vector registers (at the marked point):
    xmm1: I J K L
    xmm2: M N O P
    xmm3: J K L M
◮ Load and shuffle:
  ◮ Load [I,J,K,L] and [M,N,O,P]
  ◮ Shuffle to create [J,K,L,M]
◮ Multiple unaligned loads:
  ◮ Load [I,J,K,L] and [J,K,L,M]
  ◮ Not possible on architectures with alignment constraints
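
The load-and-shuffle sequence can be emulated in portable C to see where the extra work comes from. This is only a sketch: the `vec4` type and the helpers `load_aligned`, `shift_in_one`, `vec_add`, and `stencil_vector` are made-up names that model 4-float registers with plain arrays, and they stand in for, rather than reproduce, the SSE instructions shown on the slide.

```c
#include <assert.h>
#include <string.h>

#define VLEN 4  /* four floats per 128-bit register */

typedef struct { float lane[VLEN]; } vec4;

/* Aligned load: may only start at a multiple of VLEN elements (models movaps). */
vec4 load_aligned(const float *base, int vec_index) {
    assert(vec_index >= 0);
    vec4 v;
    memcpy(v.lane, base + vec_index * VLEN, sizeof v.lane);
    return v;
}

/* Models the effect of `palignr $4` on float lanes: drop lane 0 of `lo`
   and pull in lane 0 of `hi`, e.g. [I,J,K,L] and [M,N,O,P] give [J,K,L,M]. */
vec4 shift_in_one(vec4 lo, vec4 hi) {
    vec4 r;
    for (int k = 0; k < VLEN - 1; ++k)
        r.lane[k] = lo.lane[k + 1];
    r.lane[VLEN - 1] = hi.lane[0];
    return r;
}

/* Lane-wise addition (models addps). */
vec4 vec_add(vec4 a, vec4 b) {
    vec4 r;
    for (int k = 0; k < VLEN; ++k)
        r.lane[k] = a.lane[k] + b.lane[k];
    return r;
}

/* Computes one vector of A[j] = B[j] + B[j+1].
   The second operand is not aligned, so it costs an extra load and a shuffle. */
void stencil_vector(float *A, const float *B, int vec_index) {
    vec4 x1 = load_aligned(B, vec_index);      /* [I,J,K,L] */
    vec4 x2 = load_aligned(B, vec_index + 1);  /* [M,N,O,P] */
    vec4 x3 = shift_in_one(x1, x2);            /* [J,K,L,M] */
    vec4 sum = vec_add(x1, x3);
    memcpy(A + vec_index * VLEN, sum.lane, sizeof sum.lane);
}
```

Only one addition per vector is useful arithmetic; the second load and the shuffle exist purely to bring B[j] and B[j+1] into the same vector slot.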

Data Layout Transformation: Overview of the Solution
◮ Stream Alignment Conflict: adjacent elements in memory map to adjacent vector slots
◮ Key idea: break this property, so that both operands sit in the same vector slot
  ◮ Achieved through Data Layout Transformation
  ◮ No shuffle needed
  ◮ No extra unaligned load
◮ But not trivial to achieve!

Data Layout Transformation: Data Layout Transformation Example
(a) Original layout (N = 24 elements, vector slots 0 1 2 3 repeat for each vector)
    A B C D | E F G H | I J K L | M N O P | Q R S T | U V W X
(b) Dimension lifted (V rows of N/V elements)
    A B C D E F
    G H I J K L
    M N O P Q R
    S T U V W X
(c) Transposed (N/V rows of V elements)
    A G M S
    B H N T
    C I O U
    D J P V
    E K Q W
    F L R X
(d) Transformed layout (vector slots 0 1 2 3 repeat for each vector)
    A G M S | B H N T | C I O U | D J P V | E K Q W | F L R X
Stencil code:
    for (i = 1; i < 24; ++i)
      B[i] = (A[i-1] + A[i] + A[i+1]) / 3;
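
The dimension-lift-and-transpose step amounts to a simple index mapping, sketched below for a 1D array. The function names `dlt_index` and `dlt_transform` are illustrative, and the sketch assumes the array length is a multiple of the vector length.

```c
#include <assert.h>
#include <stddef.h>

/* Position in the transformed layout of the element at original index i.
   n is the array length and v the vector length; n must be a multiple of v. */
size_t dlt_index(size_t i, size_t n, size_t v) {
    assert(v > 0 && n % v == 0 && i < n);
    size_t cols = n / v;    /* row length after dimension lifting */
    size_t row = i / cols;  /* row in the lifted view: becomes the vector slot */
    size_t col = i % cols;  /* column in the lifted view: becomes the vector number */
    return col * v + row;   /* transposing swaps the two roles */
}

/* Copies src into dst using the transformed layout. */
void dlt_transform(const char *src, char *dst, size_t n, size_t v) {
    for (size_t i = 0; i < n; ++i)
        dst[dlt_index(i, n, v)] = src[i];
}
```

With n = 24 and v = 4, the letters A to X are rearranged into A G M S B H N T ... F L R X, matching panel (d). Elements that were adjacent in the original array, such as A and B, now sit v positions apart and therefore land in the same slot of consecutive vectors.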

Data Layout Transformation: Handling Boundaries
(figure: boundary handling on the transformed array Y = A G M S | B H N T | C I O U | D J P V | E K Q W | F L R X, producing array Z)
◮ Shuffle opposite boundaries of array Y: the last vector [F L R X] and the first vector [A G M S] are shifted by one slot, so that [F L R] lines up with [G M S]
◮ Compute boundaries of array Z (first and last vectors)
◮ Compute steady state of array Z (all vectors in between)
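
A scalar sketch of the three steps, applied to the 3-point average from the previous slide, may help. The function name `jacobi1d_dlt` is made up, the inner loop over `r` stands for a single vector operation, and the sketch updates only the interior points 1 to n-2 of the original array. Ghost cells, which the compiler framework uses, are left out here.

```c
#include <assert.h>
#include <stddef.h>

/* 3-point average on arrays stored in the transformed layout.
   a_t and b_t hold n elements as n/v vectors of v slots each;
   vector c, slot r holds original element r*(n/v) + c. */
void jacobi1d_dlt(const float *a_t, float *b_t, size_t n, size_t v) {
    assert(v > 0 && n % v == 0 && n / v >= 2);
    size_t rows = n / v;      /* number of vectors */
    size_t last = rows - 1;   /* index of the last vector */

    /* Steady state: both neighbors sit in the same slot of the
       previous and next vector, so no shuffle is needed. */
    for (size_t c = 1; c + 1 < rows; ++c)
        for (size_t r = 0; r < v; ++r)
            b_t[c * v + r] = (a_t[(c - 1) * v + r] + a_t[c * v + r]
                              + a_t[(c + 1) * v + r]) / 3.0f;

    /* Boundaries: the missing neighbor comes from the opposite
       boundary vector, shifted by one slot. */
    for (size_t r = 0; r < v; ++r) {
        if (r > 0)  /* slot 0 of the first vector is original element 0 */
            b_t[r] = (a_t[last * v + r - 1] + a_t[r] + a_t[v + r]) / 3.0f;
        if (r + 1 < v)  /* slot v-1 of the last vector is original element n-1 */
            b_t[last * v + r] = (a_t[(last - 1) * v + r] + a_t[last * v + r]
                                 + a_t[r + 1]) / 3.0f;
    }
}
```

The shuffle cost is paid once per boundary vector rather than once per vector, which is why the steady-state loop dominates for long rows.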

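Data Layout Transformation: Higher-dimensional Stencils
(figure: 5-point stencil neighbors n, w, c, e, s for four consecutive points 0 to 3, over grid rows 0, 1, 2)
(a) Original layout: n0 n1 n2 n3 in row 0; w0 c0 e0 w1 c1 e1 w2 c2 e2 w3 c3 e3 in row 1; s0 s1 s2 s3 in row 2
(b) Transformed layout: each of n0 n1 n2 n3, w0 w1 w2 w3, c0 c1 c2 c3, e0 e1 e2 e3, and s0 s1 s2 s3 forms one vector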

Compiler Framework: Overview of Code Generation Algorithm
1. Detect arrays/statements that suffer from a Stream Alignment Conflict (SAC)
2. Perform Dimension-Lift-and-Transpose of those arrays
3. Generate vector code for the inner loop
  ◮ Ghost cell copy-in and copy-out code
  ◮ Boundary code
  ◮ Steady state code

Compiler Framework: Detection of Stream Alignment Conflict
◮ Standard compiler framework operating on array subscript functions
◮ Main idea: detect cross-iteration reuse
  ◮ Robust to stream offset via iteration shifting
  ◮ Minimize the reuse distance
◮ Some alignment conflicts are artificial and fixed with stream realignment
◮ Requires the window of the stencil to be constant
  ◮ The window size is used to compute the number of ghost cells
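
A heavily simplified sketch of the reuse check is given below. It is not the framework's algorithm: it assumes a single statement whose reads have already been reduced to an array name plus a constant offset on the fastest-varying index, and it ignores writes, iteration shifting, and stream realignment. The `access_t` type and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One array read in the innermost loop: array name plus the constant offset
   on the fastest-varying index, e.g. A[i][j-1] becomes {"A", -1}. */
typedef struct {
    const char *array;
    int inner_offset;
} access_t;

/* Returns 1 if some array is read at two different inner offsets.
   Such a pair means the same element is reused across iterations
   in different vector slots. */
int has_stream_alignment_conflict(const access_t *reads, size_t n) {
    assert(reads != NULL || n == 0);
    for (size_t p = 0; p < n; ++p)
        for (size_t q = p + 1; q < n; ++q)
            if (strcmp(reads[p].array, reads[q].array) == 0 &&
                reads[p].inner_offset != reads[q].inner_offset)
                return 1;
    return 0;
}
```

Under this simplification, statement S1 of the earlier performance example (reads A[i][j] and A[i][j-1]) is flagged, while S3 (reads A[i][j] and B[i][j]) is not.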

Experimental Results: Experimental Setup
◮ Experiments run on 3 architectures (x86):
  ◮ Intel Core2 Quad (Kentsfield): SAC resolved with low-performance shuffles
  ◮ AMD Phenom (K10): SAC resolved with average-performance shuffles
  ◮ Intel Core i7 (Nehalem): SAC resolved with fast redundant loads
◮ Data is L1-resident
  ◮ Assumes tiling was performed beforehand if necessary
◮ Tested compiler: Intel ICC 11.1

Experimental Results: Three Code Variants Evaluated
1. Ref: reference code
  ◮ Straightforward C implementation
  ◮ Always auto-vectorized by the compiler
2. DLT: basic layout transformed
  ◮ Straightforward C implementation with DLT arrays
  ◮ Always auto-vectorized by the compiler
3. DLTi: intrinsics + layout transformed
  ◮ C implementation with DLT arrays and SSE vector intrinsics

Experimental Results: Single Precision Results
(figure: bar chart "Single Precision DLT Results, L1 Cache Resident". Y axis: GFlop/s, 0 to 16. Series: Ref., DLT, DLTi. X axis: benchmark / microarchitecture, with benchmarks J-1D, J-2D-5pt, J-2D-9pt, J-3D, Heatttut-3D, FDTD-2D, Rician-2D, each measured on Phenom, Core2Quad, and Core i7)

Experimental Results: Double Precision Results
(figure: bar chart "Double Precision DLT Results, L1 Cache Resident". Y axis: GFlop/s, 0 to 8. Series: Ref., DLT, DLTi. X axis: benchmark / microarchitecture, with benchmarks J-1D, J-2D-5pt, J-2D-9pt, J-3D, Heatttut-3D, FDTD-2D, Rician-2D, each measured on Phenom, Core2Quad, and Core i7)

Experimental Results: Summary of Experiments
◮ Performance improvement matches the shuffle/unaligned load costs
◮ Tested higher-dimensional stencils show less improvement:
  ◮ more intra-stencil dependences
  ◮ higher cache pressure
◮ Manual check of the ASM showed no shuffle, no redundant load instructions

Conclusion: Conclusion
◮ Stream Alignment Conflict is the performance bottleneck for auto-vectorized stencils
  ◮ Impact varies with micro-architecture characteristics, but is always significant
◮ A data layout transformation can solve this problem
  ◮ Strong performance improvement observed
  ◮ Manual vectorization still beats automatic vectorization
