GPU tuning, part 1 (updated)
CSE 6230: HPC Tools & Apps — Fall 2014 — September 30 & October 2
vuduc.org/cse6230
Recall: [slide builds reviewing earlier material; the one recoverable datum is a bandwidth of ~6 GB/s.]
Performance engineering principles (See HPCA’10 tutorial)
Machine model (the von Neumann bottleneck): an xPU attached to a fast memory of total size $Z$, which in turn sits above a slow memory. Define $W \equiv$ # of (fl)ops performed by the xPU and $Q \equiv Q(Z) \equiv$ # of slow-memory operations (mops).

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …
Let $\tau_{\mathrm{flop}}$ = time per flop and $\tau_{\mathrm{mem}}$ = time per mop. Define the intensity $I \equiv W/Q$ (flop:mop) and the time balance $B_\tau \equiv \tau_{\mathrm{mem}}/\tau_{\mathrm{flop}}$ (flop:mop). Assuming computation and memory traffic overlap perfectly, the minimum time is

$$T = \max\left(W \tau_{\mathrm{flop}},\; Q \tau_{\mathrm{mem}}\right) = W \tau_{\mathrm{flop}} \cdot \max\left(1,\; \frac{Q \tau_{\mathrm{mem}}}{W \tau_{\mathrm{flop}}}\right) = W \tau_{\mathrm{flop}} \cdot \max\left(1,\; \frac{B_\tau}{I}\right).$$

Similarly, with $\epsilon_{\mathrm{flop}}$ = energy per flop, $\epsilon_{\mathrm{mem}}$ = energy per mop, and energy balance $B_\epsilon \equiv \epsilon_{\mathrm{mem}}/\epsilon_{\mathrm{flop}}$, the energy (which adds rather than overlaps) is

$$E = W \epsilon_{\mathrm{flop}} + Q \epsilon_{\mathrm{mem}} = W \epsilon_{\mathrm{flop}} \left(1 + \frac{B_\epsilon}{I}\right).$$

Consider the normalized quantities $\dfrac{T}{W \tau_{\mathrm{flop}}}$ and $\dfrac{E}{W \epsilon_{\mathrm{flop}}}$.

Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); …
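Dividing through by the compute-only time gives the bound that the roofline plot on the next slide draws (a one-line derivation using only the definitions above):

$$\frac{W \tau_{\mathrm{flop}}}{T} = \frac{1}{\max\left(1,\; B_\tau / I\right)} = \min\left(1,\; \frac{I}{B_\tau}\right),$$

i.e., performance relative to peak is at most $\min(1, I/B_\tau)$: memory-bandwidth bound when $I < B_\tau$, compute bound when $I \ge B_\tau$.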
“Roofline” — Williams et al. (Comm. ACM ’09)

[Figure: log-log roofline plot of relative performance (fraction of peak GFLOP/s, from 1/32 to 1) versus intensity (FLOP:byte, from 1/2 to 128). The balance point (flop:mop) at 3.6 flop:byte separates the memory- (bandwidth-) bound region on the left from the compute-bound region on the right. Example kernels, from low to high intensity: sparse matvec and stencils; FFTs; dense matrix multiply.]

Balance estimate for a high-end NVIDIA Fermi in double precision, according to Keckler et al., IEEE Micro (2011).
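To make the figure concrete (the stencil intensity here is an illustrative value, not taken from the slides): a double-precision stencil with $I \approx 0.5$ flop:byte lands well left of the 3.6 flop:byte balance point, so on this Fermi estimate it could reach at most

$$\min\left(1,\; \frac{0.5}{3.6}\right) \approx 0.14,$$

about 14% of peak, while dense matrix multiply with $I \gg 3.6$ can run compute bound, near 100% of peak.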
TLP vs. ILP (thread- vs. instruction-level parallelism)

See also:
https://bitbucket.org/rvuduc/volkov-gtc10
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
http://www.realworldtech.com/fermi/
Recall Little’s Law, which quantifies the degree of concurrency needed to hide latency: the number of operations that must be in flight equals latency [time] times throughput [ops/time].

[Figure: grid of in-flight operations, with latency on the vertical axis and throughput on the horizontal axis.]
Example: the NVIDIA M2090 implements a fused multiply-add (FMA) with a latency of ~20 cycles. It issues up to 32 FMAs per cycle. Concurrency ~ (20 cy) * (32 ops/cy), or 640 operations. So, a thread block size of 640 threads should fully hide the latency.
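In Little’s-Law notation, this is just a restatement of the slide’s arithmetic:

$$N_{\text{in flight}} = \text{latency} \times \text{throughput} = 20~\text{cy} \times 32~\frac{\text{FMAs}}{\text{cy}} = 640~\text{FMAs},$$

one per thread when each thread carries a single dependent chain of FMAs, as in the kernel below.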
#define N <constant-value>

__global__ void kernel (float *pa, float b, float c)
{
    float a = *pa;

    #pragma unroll 8
    for (int i = 0; i < N; ++i)
        a = a * b + c;  // each FMA depends on the previous one: no ILP
                        // within a thread, so latency must be hidden by TLP alone

    *pa = a;
}

https://bitbucket.org/rvuduc/volkov-gtc10
[Figure: fraction of peak (0 to 1) vs. threads per block (32 to 1024) for the kernel above. Performance climbs from about 5% of peak at the smallest block sizes, through ~63%, and levels off at full throughput. Plateau starts roughly where expected (~640 threads).]

https://bitbucket.org/rvuduc/volkov-gtc10
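For reference, a host-side driver for this microbenchmark might look like the sketch below. Everything beyond the kernel itself is an assumption: the iteration count N, the single-block launch, the coefficient values, and the GFLOP/s accounting are illustrative, not taken from the volkov-gtc10 code.

#include <cstdio>
#include <cuda_runtime.h>

#define N 65536   // hypothetical; the original leaves this as <constant-value>

__global__ void kernel (float *pa, float b, float c)
{
    float a = *pa;   // all threads touch the same word; fine for a timing microbenchmark
    #pragma unroll 8
    for (int i = 0; i < N; ++i)
        a = a * b + c;
    *pa = a;
}

int main ()
{
    float *d_a;
    cudaMalloc (&d_a, sizeof (float));

    // Sweep threads per block, as on the plot's horizontal axis.
    for (int threads = 32; threads <= 1024; threads += 32) {
        cudaEvent_t start, stop;
        cudaEventCreate (&start);
        cudaEventCreate (&stop);

        cudaEventRecord (start);
        kernel<<<1, threads>>> (d_a, 0.9f, 0.1f);  // one block isolates latency hiding on a single SM
        cudaEventRecord (stop);
        cudaEventSynchronize (stop);

        float ms = 0;
        cudaEventElapsedTime (&ms, start, stop);

        // 2 flops (multiply + add) per loop iteration per thread.
        double gflops = 2.0 * N * threads / (ms * 1.0e6);
        printf ("%4d threads: %8.2f GFLOP/s\n", threads, gflops);

        cudaEventDestroy (start);
        cudaEventDestroy (stop);
    }

    cudaFree (d_a);
    return 0;
}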
Variant with two accumulators:

#define N <constant-value>

__global__ void kernel (float *pa, float b, float c)
{
    float a[2] = {0, 0};

    #pragma unroll 8
    for (int i = 0; i < N; ++i) {
        a[0] = a[0] * b + c;   // the two FMA chains are mutually
        a[1] = a[1] * b + c;   // independent: 2-way ILP per thread
    }

    *pa += a[0] + a[1];
}

https://bitbucket.org/rvuduc/volkov-gtc10
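A back-of-the-envelope consequence (not stated on the slide): by Little’s Law, each thread now contributes two independent in-flight FMAs, so the ~640 operations of required concurrency on the M2090 should be reachable with roughly half as many threads:

$$\text{threads needed} \approx \frac{640~\text{ops}}{2~\text{ops/thread}} = 320~\text{threads per block}.$$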
[Figure: fraction of peak vs. threads per block, repeated for the two-accumulator kernel.]

https://bitbucket.org/rvuduc/volkov-gtc10