Lecture 5: HW1 Discussion, Intro to GPUs
G63.2011.002/G22.2945.001 · October 5, 2010
Outline
• Discuss HW1
• Intro to GPU Computing
Dense Matrix Multiply: Blocking vs Scalar
We provided a blocked example matrix multiplication code. Why is blocked matmul faster than un-blocked?
Key: Computational Intensity
Definition: flops per FPN (floating-point number) moved up the memory hierarchy.
Large intensity: good for deep memory hierarchies.
Computational Intensity for Scalar Matmul
Floating point operations: 2N³.
Assume: size(L1) ≪ N² FPNs.
FPN-size cache misses (neglecting cache lines, etc.):
• N² — read each row of A once
• N³ — read each column of B N times
• 2N² — read/write C
Total: N³ + 3N² FPN-size cache misses.
Computational intensity: 2N³ / (N³ + 3N²) ≈ 2.
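For concreteness, a minimal un-blocked matmul sketch that the counting above refers to. This is a sketch only: the function name and the column-major layout are assumptions for illustration, not the provided HW1 code.

/* Un-blocked ("scalar") matmul: A, B, C are N x N, column-major,
 * C zero-initialized by the caller.  2 flops per inner iteration -> 2 N^3 total. */
void matmul_scalar(int N, const double *A, const double *B, double *C)
{
  for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
      for (int k = 0; k < N; ++k)
        /* the k loop walks a row of A: stride-N accesses in column-major storage */
        C[i + j*N] += A[i + k*N] * B[k + j*N];
}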
Computational Intensity for Blocked Matmul
Floating point operations: still 2N³.
b: block size, n = ⌈N/b⌉.
FPN-size cache misses:
• b²n³ — read each of the n² blocks of A n times
• b²n³ — same for B
• 2N² — read/write C
Total: 2b²n³ + 2N² FPN-size cache misses.
Rewrite: b²n³ ≈ b² · N³/b³ = N³/b.
Computational intensity: 2N³ / (2N³/b + 2N²) ≈ 2N³ / (2N³/b) = b
→ incentive to choose b ≫ 2.
The power of assumptions: Can we choose b = N?
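A quick worked example with purely illustrative sizes (and assuming the b×b blocks really fit in cache): with N = 600 and b = 60, the misses are 2N³/b + 2N² = 7.2·10⁶ + 7.2·10⁵ ≈ 7.9·10⁶ FPNs against 2N³ = 4.3·10⁸ flops, i.e. an intensity of about 55 flops per FPN, versus roughly 2 for the scalar version.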
Hatching a Plan
Consider each level of the memory hierarchy. How do we exploit...
• ...L2: ignore; we're nearly L2-local at most sizes.
• ...L1: 32 KiB = 4096 FPNs. Key: memory layout.
• ...registers: 16 FP registers. Key: loop/operation ordering.
Optimizing for L1: Memory Layout
Memory layout of A: column-major. Only one entry of each cache line is used per fetch.
Better to store A in row-major order. Input is row-major.
If memory is available (not swap!), storing a transposed copy of A can be a good idea. (The copy takes O(N²) time.)
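A minimal sketch of the transposed-copy idea, assuming column-major storage and a caller-provided scratch buffer; the names are illustrative, not from the handout.

/* Make a transposed copy At = A^T (both column-major), so that walking a
 * row of A becomes a unit-stride walk down a column of At. */
void transpose_copy(int N, const double *A, double *At)
{
  for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
      At[k + i*N] = A[i + k*N];   /* O(N^2) work, done once up front */
}

The inner product then reads both operands with unit stride in k: C[i + j*N] += At[k + i*N] * B[k + j*N].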
Optimizing for L1: Reuse Pattern, Block Size
Question: Blocking is a good idea. What is the optimal b_L1?
Follow-up question: How much needs to fit in L1?
• One block of each of A, B, C.
• All of A, plus one column of B and C.
32 KiB: 8 b_L1² + 2 · 8 b_L1 ≤ 32768 → b_L1 ≤ 60.
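As a rough check (ignoring cache-line granularity and conflict misses): with b_L1 = 60 and 8-byte FPNs, 8·60² + 2·8·60 = 28800 + 960 = 29760 bytes, which fits in the 32768-byte L1; b_L1 = 64 would already need 32768 + 1024 = 33792 bytes.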
L1 Block Copy
Further concerns:
• Cache line boundaries
• SIMD
• Cache set conflicts
All solved by the small-block copy optimization: rather than copying all of A (as above), copy b_L1-sized blocks of A, B, and C, operate on those, then copy the output back.
L1 Block Copy: The Plan
Basic plan:
For each i:
  For each j:
    Load block C[i,j]
    For each k:
      Load block A[i,k]
      Load block B[k,j]
      ⌈b_L1/b_r⌉³ register kernels: C += A·B
    Store block C[i,j]
(Can be improved: many A, B loads.)
Aside: this also neatly deals with fringes.
So: how does this solve the problems above? Can you define "alignment"?
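A minimal C sketch of this loop structure, assuming column-major storage with leading dimension N, square blocks, and N divisible by B_L1; the explicit copy-in/copy-out of blocks, fringe handling, and the b_r × b_r register kernel are left out to keep the sketch short, and the names and the value of B_L1 are assumptions.

#define B_L1 60   /* assumed L1 block size; tune for the actual machine */

static void block_kernel(int N, const double *A, const double *B, double *C)
{
  /* C_blk += A_blk * B_blk for one B_L1 x B_L1 block triple; each pointer
   * addresses the top-left entry of its block, column stride N.
   * The real code tiles this further into b_r x b_r register kernels. */
  for (int j = 0; j < B_L1; ++j)
    for (int k = 0; k < B_L1; ++k)
      for (int i = 0; i < B_L1; ++i)
        C[i + j*N] += A[i + k*N] * B[k + j*N];
}

void matmul_blocked(int N, const double *A, const double *B, double *C)
{
  /* column-major; N assumed to be a multiple of B_L1 */
  for (int j = 0; j < N; j += B_L1)
    for (int i = 0; i < N; i += B_L1)
      for (int k = 0; k < N; k += B_L1)
        block_kernel(N, &A[i + k*N], &B[k + j*N], &C[i + j*N]);
}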
Alignment
A memory address a is n-byte aligned when n is a power of two and a is a multiple of n bytes.
(See also the IBM developerWorks article.)
Examples: cache-line-aligned, SIMD-aligned.

#include <stdlib.h>

/* dynamic allocation: posix_memalign returns 64-byte-aligned memory */
double *var;
int error = posix_memalign((void **) &var, 64, array_size);
if (error)
  abort();

/* static allocation: 64-byte-aligned array */
double __attribute__ ((aligned (64))) ary2[500];

Code generation in the non-aligned case?
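A small check corresponding to the definition above (a sketch; 64 stands for whichever alignment matters, e.g. cache line or SIMD width):

#include <stdint.h>

/* a pointer is 64-byte aligned iff its address is a multiple of 64 */
int is_aligned_64(const void *p)
{
  return ((uintptr_t) p % 64) == 0;
}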
Register Kernel
Choose block size b_r = 2^k, with b_L1 mod b_r = 0.

for (int j = 0; j < b_r; ++j)
  for (int k = 0; k < b_r; ++k)
    for (int i = 0; i < b_r; ++i)
      C[i + j*b_l1] += A[i + k*b_l1] * B[k + j*b_l1];

For each A·b matvec: perform b_r scalar·vector updates.
• Vectorizable
• Pipeline-friendly (minimal data dependencies)
• Access to A, C is unit-stride
• Access to B is inner-loop invariant
• Unrolling, software pipelining: compiler's job
Psychoanalyzing the Compiler
Flags for Intel:
-O3 -fno-alias -funroll-loops -std=c99 -D_XOPEN_SOURCE=500 -opt-streaming-stores auto -static -fast -xHost
Flags for GCC:
-O3 -funroll-loops -march=native -std=c99 -D_XOPEN_SOURCE=500 -ftree-vectorizer-verbose=2 -ffast-math
GCC 4.3 is sometimes better than GCC 4.4.
Self-study material:
• Compiler references: Intel, GNU
• C99 restrict keyword, aliasing
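As a small illustration of the restrict keyword mentioned above (a sketch, not HW1 code): promising the compiler that the arrays do not alias removes the need to re-load after every store, which helps vectorization.

/* With restrict, the compiler may assume writes through C never change A or B,
 * so it can keep values in registers and vectorize the loop. */
void madd_arrays(int n, double *restrict C,
                 const double *restrict A, const double *restrict B)
{
  for (int i = 0; i < n; ++i)
    C[i] += A[i] * B[i];
}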
Profiling
OProfile: a sampling profiler. Uses performance counters. Linux only, needs root.
Many event types countable:
• CPU_CLK_UNHALTED: clock cycles when not halted
• L2_RQSTS: number of L2 cache requests
• LLC_MISSES: L2 cache demand requests from this core that missed the L2
• FLOPS: number of FP computational micro-ops executed
• IDLE_DURING_DIV: cycles the divider is busy and all other execution units are idle
• L1D_ALL_REF: all references to the L1 data cache
• L1D_PEND_MISS: total number of outstanding L1 data cache misses at any cycle
• IFU_MEM_STALL: cycles the instruction fetch pipe is stalled
• INST_RETIRED: number of instructions retired
• UOPS_RETIRED: number of UOPs retired
• MACHINE_NUKES_SMC: number of pipeline-flushing events
• RAT_STALLS: partial register stall cycles
• BR_INST_DECODED: number of branch instructions decoded
[Example opannotate output: annotated assembly of the register kernel with per-instruction sample counts for the FLOPS and L1D_PEND_MISS events; most samples fall on the movsd loads/stores rather than on the mulsd/addsd arithmetic.]
Solution Performance
[Plot: MFlops/s (0 to 9000) versus matrix dimension N (0 to 800) for the basic, tuned, and BLAS versions.]
git clone ssh://git@forge.tiker.net:2234/hw1-solution.git
(Private; works if you signed up for an account.)