Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures
Tom Henretty (1), Kevin Stock (1), Louis-Noël Pouchet (1), Franz Franchetti (2), J. Ramanujam (3), P. Sadayappan (1)
(1) The Ohio State University, (2) Carnegie Mellon University, (3) Louisiana State University
March 29, 2011, ETAPS CC’11, Saarbrücken, Germany

Outline
1. Introduction
2. Vectorization of Stencils
3. Stream Alignment Conflict
4. Data Layout Transformation
5. Compiler Framework
6. Experimental Results
7. Conclusion
OSU / CMU / LSU

Introduction: Short-Vector SIMD
◮ Perform identical computation on small chunks of data
  ◮ Operations are independent
  ◮ Vector size: from 2 to 64
  ◮ Packing operations to form a vector (shuffle, extract, ...)
◮ Low latency, multiple SIMD units per CPU
◮ Maximal speedup equals the vector size
◮ Ubiquitous feature on modern processors
  ◮ x86: SSE, AVX
  ◮ Power: VMX / VSX
  ◮ ARM: NEON
  ◮ Cell SPU

Introduction: A Brief on Stencil Computations
◮ Typically: iterative update of a structured (fixed) grid
  ◮ Compute each point from the values of its neighboring points
  ◮ Same grid / multiple grids
◮ Numerous application domains use stencils
  ◮ Finite difference methods for solving PDEs
  ◮ Image processing
  ◮ Computational electromagnetics, CFD, numerical relativity, ...
◮ Domain-Specific Languages for Stencils (FEniCS, RNPL, ...)

Vectorization of Stencils: Stencil Example
(a) 5-point stencil C code
    for (t = 0; t < TMAX; ++t)
      for (i = 1; i < N-1; ++i)
        for (j = 1; j < M-1; ++j)
          a[i][j] = b[i+1][j] + b[i][j-1] + b[i][j] + b[i][j+1] + b[i-1][j];
(b) Arrays a, b, and stencil detail
    [Figure: N x M grid with rows indexed by i and columns by j, showing the 5-point neighborhood]

Vectorization of Stencils: Vectorization of Stencil Computation
◮ Two “main” types of stencils
  ◮ Jacobi-like: the output array is distinct from the input array (out-of-place update)
  ◮ Seidel-like: in-place update
◮ Loop transformations expose tiling possibilities, and at least one inner-most parallel loop
◮ Auto-vectorization is successful (ICC, GCC)...
◮ ...but SIMD speedup is far from optimal!

Vectorization of Stencils: Performance Consideration
(a) Stencil code
    for (t = 0; t < T; ++t) {
      for (i = 0; i < N; ++i)
        for (j = 1; j < N+1; ++j)
    S1:   C[i][j] = A[i][j] + A[i][j-1];
      for (i = 0; i < N; ++i)
        for (j = 1; j < N+1; ++j)
    S2:   A[i][j] = C[i][j] + C[i][j-1];
    }
(b) Non-stencil code
    for (t = 0; t < T; ++t) {
      for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
    S3:   C[i][j] = A[i][j] + B[i][j];
      for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
    S4:   A[i][j] = C[i][j] + B[i][j];
    }
Performance    (a) Stencil code    (b) Non-stencil code
AMD Phenom     1.2 GFlop/s         1.9 GFlop/s
Core2          3.5 GFlop/s         6.0 GFlop/s
Core i7        4.1 GFlop/s         6.7 GFlop/s
Stencil code (a) has much lower performance than the non-stencil code (b) despite accessing 50% fewer data elements.

Stream Alignment Conflict: Stream Alignment Conflict
C code:
    for (i = 0; i < H; i++)
      for (j = 0; j < W - 1; j++)
        A[i][j] = B[i][j] + B[i][j+1];
x86 assembly:
    movaps B(...), %xmm1
    movaps 16+B(...), %xmm2
    movaps %xmm2, %xmm3
    palignr $4, %xmm1, %xmm3
    ;; Register state here
    addps %xmm1, %xmm3
    movaps %xmm3, A(...)
Vector registers at the marked point:
    xmm1: I J K L
    xmm2: M N O P
    xmm3: J K L M
Memory contents:
    A: A B C D E F G H ...
    B: I J K L M N O P ...
◮ Load and shuffle:
  ◮ Load [I,J,K,L] and [M,N,O,P]
  ◮ Shuffle to create [J,K,L,M]
◮ Multiple unaligned loads:
  ◮ Load [I,J,K,L] and [J,K,L,M]
  ◮ Not possible on architectures with alignment constraints

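The load-and-shuffle strategy can be modeled in portable scalar C to make the extra work per vector visible. This is only a sketch: `VLEN`, `shift_by_one`, and `add_right_neighbor` are hypothetical names, the vector length of four floats is an assumption matching 16-byte SSE registers, and the inner loops stand in for single vector instructions.

```c
#include <assert.h>
#include <stddef.h>

enum { VLEN = 4 }; /* assumed: four floats per 16-byte register */

/* Models the shuffle: combine two adjacent aligned vectors into the vector
   that starts one element later, e.g. [I,J,K,L],[M,N,O,P] -> [J,K,L,M]. */
static void shift_by_one(const float lo[VLEN], const float hi[VLEN],
                         float out[VLEN])
{
    for (int k = 0; k < VLEN - 1; ++k)
        out[k] = lo[k + 1];
    out[VLEN - 1] = hi[0];
}

/* Computes a[j] = b[j] + b[j+1] for j in [0, n), one vector at a time.
   Requires n % VLEN == 0 and b to hold at least n + 1 elements. */
static void add_right_neighbor(const float *b, float *a, size_t n)
{
    assert(n % VLEN == 0);
    for (size_t j = 0; j < n; j += VLEN) {
        float shifted[VLEN];
        /* Extra step needed only because b[j] and b[j+1] sit in
           different vector slots. */
        shift_by_one(&b[j], &b[j + VLEN], shifted);
        for (int k = 0; k < VLEN; ++k)
            a[j + k] = b[j + k] + shifted[k];
    }
}
```

Every output vector costs one additional realignment step on top of the add, which is the overhead the stream alignment conflict introduces.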
Data Layout Transformation: Overview of the Solution
◮ Stream Alignment Conflict: adjacent elements in memory map to adjacent vector slots
◮ Key idea: break this property, so that both operands are in the same vector slot
◮ Achieved through Data Layout Transformation
  ◮ No shuffle needed
  ◮ No extra unaligned load
  ◮ But not trivial to achieve!

Data Layout Transformation: Data Layout Transformation Example
Code:
    for (i = 1; i < 23; ++i)
      B[i] = (A[i-1] + A[i] + A[i+1]) / 3;
(a) Original Layout (N = 24 elements, vector slots 0 1 2 3 repeating)
    A B C D | E F G H | I J K L | M N O P | Q R S T | U V W X
(b) Dimension Lifted (V = 4 rows of N/V = 6 elements)
    A B C D E F
    G H I J K L
    M N O P Q R
    S T U V W X
(c) Transposed (N/V rows of V elements)
    A G M S
    B H N T
    C I O U
    D J P V
    E K Q W
    F L R X
(d) Transformed Layout (vector slots 0 1 2 3 repeating)
    A G M S | B H N T | C I O U | D J P V | E K Q W | F L R X

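The index mapping behind panels (a) to (d) can be sketched in a few lines of C. The function names `dlt_forward` and `dlt_backward` are hypothetical, and `char` elements are used only so that the letters of the example can be reproduced directly.

```c
#include <assert.h>
#include <stddef.h>

/* Dimension-lift-and-transpose of a 1D array of n elements.
   View src as vlen rows of m = n / vlen elements (dimension lift), then
   transpose: original element v * m + r moves to position r * vlen + v. */
static void dlt_forward(const char *src, char *dst, size_t n, size_t vlen)
{
    assert(vlen > 0 && n % vlen == 0);
    size_t m = n / vlen;
    for (size_t v = 0; v < vlen; ++v)
        for (size_t r = 0; r < m; ++r)
            dst[r * vlen + v] = src[v * m + r];
}

/* Inverse mapping: restores the original layout. */
static void dlt_backward(const char *src, char *dst, size_t n, size_t vlen)
{
    assert(vlen > 0 && n % vlen == 0);
    size_t m = n / vlen;
    for (size_t v = 0; v < vlen; ++v)
        for (size_t r = 0; r < m; ++r)
            dst[v * m + r] = src[r * vlen + v];
}
```

Applied to the 24 letters A to X with `vlen = 4`, `dlt_forward` yields AGMS BHNT CIOU DJPV EKQW FLRX, the transformed layout of panel (d). Elements that were adjacent in memory (A and B) now occupy the same slot of consecutive vectors.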
Data Layout Transformation: Handling Boundaries
[Figure: boundary handling on the transformed layout. Panels: Original Array Y (rows A G M S, B H N T, C I O U, D J P V, E K Q W, F L R X); Compute Steady State of Array Z; Shuffle Opposite Boundaries of Array Y; Compute Boundaries of Array Z.]

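The steady-state and boundary phases named in the figure can be sketched for the 1D three-point stencil as follows. This is a scalar model under assumptions: `jacobi1d_dlt` is a hypothetical name, the arrays are already in the transformed layout (position `r * vlen + v` holds original element `v * m + r`), and each loop over `v` stands in for one vector operation.

```c
#include <assert.h>
#include <stddef.h>

/* Three-point average on arrays stored in the transformed layout.
   at and bt hold m rows of vlen lanes. The first and last original
   elements (row 0 lane 0, row m-1 lane vlen-1) are left untouched. */
static void jacobi1d_dlt(const float *at, float *bt, size_t m, size_t vlen)
{
    assert(m >= 2 && vlen >= 1);

    /* Steady state: both neighbours are in the same lane of the rows
       directly above and below, so no shuffle is needed. */
    for (size_t r = 1; r + 1 < m; ++r)
        for (size_t v = 0; v < vlen; ++v)
            bt[r * vlen + v] = (at[(r - 1) * vlen + v] + at[r * vlen + v]
                                + at[(r + 1) * vlen + v]) / 3.0f;

    /* Row 0: left neighbours come from the last row, shifted one lane
       to the right (the "shuffle opposite boundary" step). */
    for (size_t v = 1; v < vlen; ++v)
        bt[v] = (at[(m - 1) * vlen + v - 1] + at[v] + at[vlen + v]) / 3.0f;

    /* Row m-1: right neighbours come from row 0, shifted one lane left. */
    for (size_t v = 0; v + 1 < vlen; ++v)
        bt[(m - 1) * vlen + v] = (at[(m - 2) * vlen + v]
                                  + at[(m - 1) * vlen + v]
                                  + at[v + 1]) / 3.0f;
}
```

Only the two boundary rows need lane shifts. All other rows are computed from whole vectors in matching slots, so the shuffle cost is paid once per array sweep rather than once per vector.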
Data Layout Transformation: Higher-dimensional Stencils
(a) Original Layout (each n_k directly above c_k, each s_k directly below)
    row 0:  .  n0 .  .  n1 .  .  n2 .  .  n3 .
    row 1:  w0 c0 e0 w1 c1 e1 w2 c2 e2 w3 c3 e3
    row 2:  .  s0 .  .  s1 .  .  s2 .  .  s3 .
(b) Transformed Layout
    row 0:  .  .  .  .  n0 n1 n2 n3 .  .  .  .
    row 1:  w0 w1 w2 w3 c0 c1 c2 c3 e0 e1 e2 e3
    row 2:  .  .  .  .  s0 s1 s2 s3 .  .  .  .

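For a 2D array, the figure suggests that the transformation is applied along each row, leaving the row index unchanged. A minimal sketch under that assumption follows, with hypothetical helper names `dlt_col` and `dlt_rows` and a row-major array whose row length is a multiple of the vector length.

```c
#include <assert.h>
#include <stddef.h>

/* Position of original column j inside a transformed row of cols elements:
   with m = cols / vlen, column v * m + r moves to r * vlen + v. */
static size_t dlt_col(size_t j, size_t cols, size_t vlen)
{
    size_t m = cols / vlen;
    return (j % m) * vlen + j / m;
}

/* Applies the 1D transformation independently to every row of a
   row-major rows x cols array. */
static void dlt_rows(const float *src, float *dst,
                     size_t rows, size_t cols, size_t vlen)
{
    assert(vlen > 0 && cols % vlen == 0);
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            dst[i * cols + dlt_col(j, cols, vlen)] = src[i * cols + j];
}
```

With `cols = 12` and `vlen = 4` this reproduces panel (b): the centers c0..c3 form one vector, the west and east neighbours form the vectors immediately before and after it in the same row, and the north and south neighbours sit at the same column position in the adjacent rows.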
Compiler Framework: Overview of Code Generation Algorithm
1. Detect arrays/statements that suffer from a Stream Alignment Conflict (SAC)
2. Perform Dimension-Lift-and-Transpose of those arrays
3. Generate vector code for the inner loop
  ◮ Ghost cell copy-in and copy-out code
  ◮ Boundary code
  ◮ Steady state code

Compiler Framework: Detection of Stream Alignment Conflict
◮ Standard compiler framework operating on array subscript functions
◮ Main idea: detect cross-iteration reuse
◮ Robust to stream offset via iteration shifting
◮ Minimize the reuse distance
◮ Some alignment conflicts are artificial and fixed with stream realignment
◮ Requires the window of the stencil to be constant
  ◮ The window size is used to compute the number of ghost cells

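A heavily simplified version of the detection idea can be written as a check on constant subscript offsets. This sketch is not the framework described on the slide: it looks at a single statement only, ignores iteration shifting and stream realignment, and assumes each reference to an array has a constant offset along the vectorized (innermost) dimension. The names `has_alignment_conflict` and `window_width` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* offsets[k] is the constant offset of the k-th reference to one array
   along the vectorized dimension, e.g. {0, 1} for B[i][j] + B[i][j+1].
   Two references conflict when their offsets differ by a non-multiple of
   the vector length, since their elements then fall in different slots. */
static int has_alignment_conflict(const long *offsets, size_t count, long vlen)
{
    assert(vlen > 0);
    for (size_t a = 0; a < count; ++a)
        for (size_t b = a + 1; b < count; ++b)
            if ((offsets[a] - offsets[b]) % vlen != 0)
                return 1;
    return 0;
}

/* Width of the stencil window along the vectorized dimension: the distance
   between the smallest and largest offset. Requires count >= 1. */
static long window_width(const long *offsets, size_t count)
{
    assert(count >= 1);
    long lo = offsets[0], hi = offsets[0];
    for (size_t k = 1; k < count; ++k) {
        if (offsets[k] < lo) lo = offsets[k];
        if (offsets[k] > hi) hi = offsets[k];
    }
    return hi - lo;
}
```

For statement S1 of the earlier performance example (`A[i][j] + A[i][j-1]`), the offsets {0, -1} are reported as a conflict, while S3 references each array once and is not. How the window width translates into a ghost cell count is not specified on the slide, so the sketch stops at computing the width.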
Experimental Results: Experimental Setup
◮ Experiments run on 3 architectures (x86):
  ◮ Intel Core2 Quad (Kentsfield): SAC resolved with low-performance shuffles
  ◮ AMD Phenom (K10): SAC resolved with average-performance shuffles
  ◮ Intel Core i7 (Nehalem): SAC resolved with fast redundant loads
◮ Data is L1-resident
  ◮ Assume tiling was performed beforehand if necessary
◮ Tested compiler: Intel ICC 11.1

Experimental Results: Three Code Variants Evaluated
1. Ref: reference code
  ◮ Straightforward C implementation
  ◮ Always auto-vectorized by the compiler
2. DLT: basic layout transformed
  ◮ Straightforward C implementation with DLT arrays
  ◮ Always auto-vectorized by the compiler
3. DLTi: intrinsics + layout transformed
  ◮ C implementation with DLT arrays and SSE vector intrinsics

Experimental Results: Single Precision Results
[Figure: bar chart "Single Precision DLT Results, L1 Cache Resident". Y axis: GFlop/s (0 to 16). Series: Ref., DLT, DLTi. X axis: Benchmark / Microarchitecture, with benchmarks J-1D, J-2D-5pt, J-2D-9pt, J-3D, Heatttut-3D, FDTD-2D, Rician-2D, each measured on Phenom, Core2Quad, and Core i7.]

Experimental Results: Double Precision Results
[Figure: bar chart "Double Precision DLT Results, L1 Cache Resident". Y axis: GFlop/s (0 to 8). Series: Ref., DLT, DLTi. X axis: Benchmark / Microarchitecture, with benchmarks J-1D, J-2D-5pt, J-2D-9pt, J-3D, Heatttut-3D, FDTD-2D, Rician-2D, each measured on Phenom, Core2Quad, and Core i7.]

Experimental Results: Summary of Experiments
◮ Performance improvement matches the shuffle/unaligned load costs
◮ Tested higher-dimensional stencils show less improvement:
  ◮ more intra-stencil dependences
  ◮ higher cache pressure
◮ Manual check of the ASM showed no shuffle and no redundant load instructions

Conclusion: Conclusion
◮ Stream Alignment Conflict is the performance bottleneck for auto-vectorized stencils
  ◮ Impact varies with micro-architecture characteristics, but is always significant
◮ A data layout transformation can solve this problem
  ◮ Strong performance improvement observed
◮ Manual vectorization still beats automatic vectorization