
Image courtesy: Southern California Earthquake Center

Matthias Christen, Cetus Users and Compiler Infrastructure Workshop


  1. Image courtesy: Southern California Earthquake Center. Matthias Christen, Cetus Users and Compiler Infrastructure Workshop

  2. Patus stencil specification (input to PATUS):

         stencil laplacian
         {
             operation(float grid u, float param alpha, float param beta)
             {
                 u[x, y, z; t+1] =
                     alpha * u[x, y, z; t] +
                     beta * (
                         u[x+1,y,z;t]+u[x-1,y,z;t]+
                         u[x,y+1,z;t]+u[x,y-1,z;t]+
                         u[x,y,z+1;t]+u[x,y,z-1;t]);
             }
         }

     CUDA code generated by PATUS (excerpt, as shown on the slide):

         /*
         (u[0][0][0][1][0]=((alpha*u[0][0][0][0][0])+(beta*((u[1][0][0][0][0]+(u[-1][0][0][0][0]+u[0][1][0][0][0]))+(u[0][-1][0][0][0]+(u[0][0][1][0][0]+u[0][0][-1][0][0]))))))
         */
         __global__ void laplacian(float * * u_0_1_out, float * u_0_0, float * u_0_1,
             float alpha, float beta, int x_max, int y_max, int z_max,
             int tbx, int tby, int tbz, int c)
         {
             float * const u__u_0[] = { u_0_0, u_0_1 } ;
             size_1_1=(y_max/blockDim.y);
             size_1_2=(z_max/blockDim.z);
             idx_1_2=(blockIdx.y/size_1_2);
             ...
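To make the update rule on this slide concrete, here is a minimal plain-Python sketch of one time step of the same 7-point stencil. It is not Patus output. The function name `laplacian_step`, the nested-list layout `u[z][y][x]`, and the choice to copy boundary cells unchanged are all assumptions made for illustration; the slide does not specify boundary handling.

```python
def laplacian_step(u, alpha, beta):
    """Apply one time step of the 7-point stencil to the interior of u.

    u is a nested list indexed as u[z][y][x]. Boundary cells are copied
    unchanged (an assumption; the slide does not specify boundary handling).
    """
    nz, ny, nx = len(u), len(u[0]), len(u[0][0])
    out = [[row[:] for row in plane] for plane in u]  # copy keeps the boundary
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbours = (u[z][y][x + 1] + u[z][y][x - 1]
                              + u[z][y + 1][x] + u[z][y - 1][x]
                              + u[z + 1][y][x] + u[z - 1][y][x])
                out[z][y][x] = alpha * u[z][y][x] + beta * neighbours
    return out


# Usage: a 3x3x3 grid of ones has exactly one interior point.
grid = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
result = laplacian_step(grid, alpha=0.5, beta=0.25)
print(result[1][1][1])  # 0.5 * 1 + 0.25 * 6 = 2.0
```

Writing into a separate output grid mirrors the `t` and `t+1` time indices of the specification: every read sees values from the old time step.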

  3. Gradient

  4. Arithmetic Intensity := Flops / Transferred Data
     - Low arithmetic intensity: memory bandwidth bound
     - High arithmetic intensity: processor bound
     - Kernel classes placed along the intensity axis: Stencils, Particle methods, Sparse Linear Algebra, FFT, Dense Linear Algebra (BLAS3), BLAS 1, BLAS 2, Lattice Methods
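As a worked example of the definition above, the following sketch computes the arithmetic intensity of the 7-point Laplacian stencil from slide 2 under two traffic models. The flop count (2 multiplications and 6 additions per point) follows from the stencil expression; the byte counts are assumptions chosen for illustration (4-byte floats, either perfect cache reuse or no reuse at all), not figures from the talk.

```python
def arithmetic_intensity(flops, bytes_transferred):
    """Arithmetic intensity := flops / transferred data (in Flop/Byte)."""
    return flops / bytes_transferred


# Per grid point: alpha*u (1 mul), sum of 6 neighbours (5 adds),
# beta*(...) (1 mul), and the final addition (1 add).
FLOPS_PER_POINT = 2 + 6

FLOAT_BYTES = 4  # single precision

# Assumed best case: each value is read from memory once and written once.
ideal_bytes = 2 * FLOAT_BYTES
# Assumed worst case: all 7 inputs are re-read from memory, plus 1 write.
no_reuse_bytes = 8 * FLOAT_BYTES

print(arithmetic_intensity(FLOPS_PER_POINT, ideal_bytes))     # 1.0
print(arithmetic_intensity(FLOPS_PER_POINT, no_reuse_bytes))  # 0.25
```

Both values are low, which is consistent with stencils sitting toward the memory-bandwidth-bound end of the axis.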


  6. Diagram labels: Cetus, Coco/R. Reference: M. Christen et al., "PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures," IPDPS 2011.

  7. stencil pmcl3d_uxx1
     {
         domainsize = (nxb .. nxe, nyb .. nye, nzb .. nze);
         t_max = 1;

         operation (
             const float grid d1(-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
             float grid u1(-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
             ...
             float param dth)
         {
             float d = 0.25*(d1[x,y,z]+d1[x,y-1,z]+d1[x,y,z-1]+d1[x,y-1,z-1]);
             u1[x,y,z; t+1] = u1[x,y,z; t] + (dth / d) * (
                 c1 * (xx[x,y,z] - xx[x-1,y,z] + xy[x,y,z] - xy[x,y-1,z] + xz[x,y,z] - xz[x,y,z-1]) +
                 c2 * (xx[x+1,y,z] - xx[x-2,y,z] + xy[x,y+1,z] - xy[x,y-2,z] + xz[x,y,z+1] - xz[x,y,z-2])
             );
         }
     }

  8. strategy cacheblocking (domain u, auto dim cb)
     {
         // iterate over time steps
         for t = 1 .. stencil.t_max
         {
             // iterate over subdomain
             for subdomain v(cb) in u(:; t) parallel
             {
                 for point p in v(:; t)
                     v[p; t+1] = stencil (v[p; t]);
             }
         }
     }
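The loop structure of this strategy can be sketched in plain Python as follows. This is an illustrative reading of the strategy, not code produced by Patus: the helper names `blocks` and `sweep_blocked` are hypothetical, the stencil is the 7-point Laplacian from slide 2, boundary cells are left unchanged by assumption, and the subdomains are visited sequentially rather than in parallel.

```python
from itertools import product


def blocks(shape, cb):
    """Yield per-dimension (start, stop) ranges that tile the grid interior."""
    ranges = []
    for n, b in zip(shape, cb):
        ranges.append([(s, min(s + b, n - 1)) for s in range(1, n - 1, b)])
    return product(*ranges)


def sweep_blocked(u, alpha, beta, cb, t_max=1):
    """Run t_max time steps, visiting the grid one cache block at a time."""
    nz, ny, nx = len(u), len(u[0]), len(u[0][0])
    for _ in range(t_max):                      # iterate over time steps
        out = [[row[:] for row in plane] for plane in u]
        # iterate over subdomains of size cb (sequential in this sketch)
        for (z0, z1), (y0, y1), (x0, x1) in blocks((nz, ny, nx), cb):
            for z in range(z0, z1):             # iterate over points in block
                for y in range(y0, y1):
                    for x in range(x0, x1):
                        out[z][y][x] = alpha * u[z][y][x] + beta * (
                            u[z][y][x + 1] + u[z][y][x - 1]
                            + u[z][y + 1][x] + u[z][y - 1][x]
                            + u[z + 1][y][x] + u[z - 1][y][x])
        u = out
    return u


# Usage: the block size changes the traversal order, not the result.
grid = [[[float(x + 2 * y + 3 * z) for x in range(8)]
         for y in range(7)] for z in range(6)]
blocked = sweep_blocked(grid, 0.5, 0.25, cb=(2, 3, 4))
single_block = sweep_blocked(grid, 0.5, 0.25, cb=(100, 100, 100))
print(blocked == single_block)
```

Because the block size affects only the order in which points are visited, it can be treated as a free tuning parameter, which is what the `auto dim cb` argument in the strategy signature suggests.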

  9. Figure: GFlop/s (axis ticks 5 to 20) as a function of the cache block sizes cb_x, cb_y, and cb_z. Labels on the slide: cb_z = 4, cb_x = 96.

  10. Auto-Tuning Process Duration. Chart for the single-precision wave stencil ("Wave | Upstream"): optimization duration [minutes] and single-precision GFlop/s versus the number of benchmark runs (2 to 1024), for the search methods DIRECT, GCE, Genetic, Greedy, Hooke-Jeeves, and Simplex Search.
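To illustrate what a search method does in this setting, here is a small sketch of a greedy, one-parameter-at-a-time search over block sizes that also counts benchmark runs. It is a simplified stand-in and is not claimed to match the Greedy method used in the talk. The function `synthetic_gflops` is a made-up performance model, not measured data, and the candidate values and starting point are hypothetical.

```python
def greedy_search(benchmark, candidates, start):
    """Tune one parameter at a time, keeping the others fixed.

    Repeats full passes over all parameters until no pass improves the score.
    Returns (best_parameters, best_score, number_of_benchmark_runs).
    """
    best = list(start)
    best_score = benchmark(tuple(best))
    runs = 1
    improved = True
    while improved:
        improved = False
        for dim, values in enumerate(candidates):
            for value in values:
                if value == best[dim]:
                    continue
                trial = best[:dim] + [value] + best[dim + 1:]
                score = benchmark(tuple(trial))
                runs += 1
                if score > best_score:
                    best, best_score, improved = trial, score, True
    return tuple(best), best_score, runs


def synthetic_gflops(cb):
    """Made-up model with its maximum at cb = (64, 8, 4); NOT measured data."""
    target = (64, 8, 4)
    return 20.0 - sum(abs(c - t) / t for c, t in zip(cb, target))


candidates = [[16, 32, 64, 128], [2, 4, 8, 16], [1, 2, 4, 8]]
best_cb, best_score, runs = greedy_search(synthetic_gflops, candidates, (16, 2, 1))
print(best_cb, best_score, runs)
```

In a real auto-tuner each call to `benchmark` would compile and time a kernel variant, so the number of runs translates directly into tuning time, which is the trade-off the chart on this slide plots.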

  11. - Vectorization
      - Loop unrolling
      - Benchmarking Harness
      - Time measurement & performance calculation
      - Result validation
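The harness duties listed on this slide (time measurement, performance calculation, result validation) can be sketched as below. This is a minimal illustration under stated assumptions, not the harness Patus generates: the names `run_benchmark`, `stencil_reference`, and `stencil_candidate` are hypothetical, the kernel is a 1D three-point stencil chosen for brevity, and the relative tolerance is arbitrary.

```python
import time


def stencil_reference(u):
    """Straightforward loop version, used as the trusted result."""
    out = u[:]
    for i in range(1, len(u) - 1):
        out[i] = 0.5 * u[i] + 0.25 * (u[i - 1] + u[i + 1])
    return out


def stencil_candidate(u):
    """Kernel under test: same arithmetic, written as a comprehension."""
    inner = [0.5 * u[i] + 0.25 * (u[i - 1] + u[i + 1])
             for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def run_benchmark(kernel, reference, data, flops_per_call, repeats=5, tol=1e-6):
    """Validate the kernel against the reference, then time it."""
    expected = reference(data)
    actual = kernel(data)
    valid = len(actual) == len(expected) and all(
        abs(a - e) <= tol * max(1.0, abs(e)) for a, e in zip(actual, expected))
    best = float("inf")
    for _ in range(repeats):                 # keep the fastest of several runs
        start = time.perf_counter()
        kernel(data)
        best = min(best, time.perf_counter() - start)
    gflops = flops_per_call / best / 1e9 if best > 0 else float("inf")
    return valid, best, gflops


data = [float(i % 7) for i in range(10_000)]
flops = 4 * (len(data) - 2)                  # 2 mul + 2 add per interior point
valid, seconds, gflops = run_benchmark(stencil_candidate, stencil_reference,
                                       data, flops)
print(f"valid={valid}  best time={seconds:.6f} s  {gflops:.4f} GFlop/s")
```

Validating before timing means a fast but wrong variant never contributes a score.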


  13. Image courtesy: Southern California Earthquake Center. Dynamic Coulomb failure stress changes in a ShakeOut simulation of an earthquake on the southern San Andreas Fault. HP2C Petaquake project.

  14. Kernel  Description                                Discretization  Flops/Stencil  Arith. Intens.
      uxx1    Velocity in one direction                  4th order       20 Flops       0.83 Flop/Byte
      xy1     Diagonal stress in one direction           4th order       16 Flops       0.80 Flop/Byte
      xyz1    Stresses parallel to axes                  4th order       90 Flops       2.04 Flop/Byte
      xyzq    Stresses parallel to axes in viscous mode  4th order       129 Flops      1.61 Flop/Byte

  15. Block diagram of the CPU. Labels: Core 0, Core 1, ..., Core 5, each with an Exec Unit, 64K L1I, 64K L1D, and 512K L2; 6M L3; Probe Filter / Directory; System Request Interface / Crossbar; Mem Ctrlr; Hypertransport (HT Links); DRAM; 53 GB/s (total); x2.
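Combining the arithmetic intensities from the kernel table on slide 14 with the bandwidth figure on this slide gives a rough memory-bound performance ceiling per kernel, computed as intensity times bandwidth. The sketch below assumes that "53GB/s (total)" is the aggregate DRAM bandwidth; the slides do not say so explicitly. It covers only the bandwidth term of a roofline-style estimate, since the processor's compute peak is not given on the slides.

```python
# "53GB/s (total)" from the slide; assumed here to be aggregate DRAM bandwidth.
DRAM_BANDWIDTH_GB_S = 53.0

# Arithmetic intensities in Flop/Byte, taken from the kernel table (slide 14).
ARITHMETIC_INTENSITY = {
    "uxx1": 0.83,
    "xy1": 0.80,
    "xyz1": 2.04,
    "xyzq": 1.61,
}


def bandwidth_ceiling(intensity, bandwidth_gb_s):
    """Upper bound in GFlop/s if memory traffic is the only limit."""
    return intensity * bandwidth_gb_s


ceilings = {name: bandwidth_ceiling(ai, DRAM_BANDWIDTH_GB_S)
            for name, ai in ARITHMETIC_INTENSITY.items()}

for name, value in ceilings.items():
    print(f"{name}: {value:.1f} GFlop/s")
```

Under this assumption the ceilings for uxx1 and xy1 come out at roughly 44 and 42 GFlop/s, which is the same range as the 0 to 45 GFlop/s axis of the CPU charts on the following slides.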

  16. Bar chart: single-precision GFlop/s (y-axis, 0 to 45) for the kernels uxx1, xy1, xyz1, and xyzq, each at x-axis values 1, 2, 4, 6, 12, 24. Series: Reference (Fortran).

  17. Bar chart: single-precision GFlop/s (y-axis, 0 to 45) for the kernels uxx1, xy1, xyz1, and xyzq, each at x-axis values 1, 2, 4, 6, 12, 24. Series: Peak, Reference (Fortran).

  18. Bar chart: single-precision GFlop/s (y-axis, 0 to 45) for the kernels uxx1, xy1, xyz1, and xyzq, each at x-axis values 1, 2, 4, 6, 12, 24. Series: Patus Basic Cache Blocking, Peak, Reference (Fortran).

  19. Bar chart: single-precision GFlop/s (y-axis, 0 to 45) for the kernels uxx1, xy1, xyz1, and xyzq, each at x-axis values 1, 2, 4, 6, 12, 24. Series: Patus Basic Cache Blocking, +SSE, Peak, Reference (Fortran).

  20. Bar chart: single-precision GFlop/s (y-axis, 0 to 45) for the kernels uxx1, xy1, xyz1, and xyzq, each at x-axis values 1, 2, 4, 6, 12, 24. Series: Patus Basic Cache Blocking, +SSE, +Loop Unrolling, Peak, Reference (Fortran).

  21. Block diagram of the GPU. Labels: GPC (x4) with Raster Engine and SMs, each SM with a PolyMorph Engine; load/store units (L/S), special function units (SFU), and 64K Shared Memory; 786K L2; GigaThread Engine; Host IF (PCIe); Mem Ctrlr (GDDR5).

  22. Bar chart: single-precision GFlop/s (y-axis, 0 to 80) for the kernels uxx1, xy1, xyz1, and xyzq, each at x-axis labels (1,1), (2,3), (3,3). Series: Default, +Blocking, +Loop Unrolling.

  23. - Code generation framework for stencil codes of arbitrary stencil shapes for different types of hardware
      - Experimentation environment for parallelization and blocking strategies
      - Auto-tuning as a methodology to achieve the best performance on a given hardware architecture for a given strategy
      - Performance improvement of up to ~4x for 3D AWP-ODC stencil kernels

