
GPU tuning, part 1 (updated). CSE 6230: HPC Tools & Apps, Fall 2014.



  1. GPU tuning, part 1 (updated). CSE 6230: HPC Tools & Apps, Fall 2014 — September 30 & October 2. vuduc.org/cse6230

  2.–10. Recall (a series of recap figures from earlier in the course; one cites a bandwidth of ~6 GB/s).

  11. vuduc.org/cse6230 Performance engineering principles (See HPCA’10 tutorial)

  12. Balance analysis — Kung (1986); Hockney & Curington (1989); Blelloch (1994); McCalpin (1995); Williams et al. (2009); Czechowski et al. (2011); … Model: an xPU operates out of a fast memory of total size $Z$, which is connected to a slow memory across the von Neumann bottleneck. Let $W \equiv$ # (fl)ops executed by the xPU and $Q \equiv$ # memory operations (mops) moved between slow and fast memory, so that $Q = Q(Z)$.

  13.–19. (Built up over several slides on top of the model above.) With $\tau_{\mathrm{flop}}$ = time per flop and $\tau_{\mathrm{mem}}$ = time per mop, the minimum time is

      \[ T \;=\; \max\!\left( W\,\tau_{\mathrm{flop}},\; Q\,\tau_{\mathrm{mem}} \right) \;=\; W\,\tau_{\mathrm{flop}} \cdot \max\!\left( 1,\; \frac{Q\,\tau_{\mathrm{mem}}}{W\,\tau_{\mathrm{flop}}} \right) \;=\; W\,\tau_{\mathrm{flop}} \cdot \max\!\left( 1,\; \frac{B_\tau}{I} \right), \]

      and with $\epsilon_{\mathrm{flop}}$, $\epsilon_{\mathrm{mem}}$ = energy per flop and per mop, the energy is

      \[ E \;=\; W\,\epsilon_{\mathrm{flop}} + Q\,\epsilon_{\mathrm{mem}} \;=\; W\,\epsilon_{\mathrm{flop}} \cdot \left( 1 + \frac{B_\epsilon}{I} \right), \]

      where $I \equiv W/Q$ is the intensity (flop:mop), and $B_\tau \equiv \tau_{\mathrm{mem}}/\tau_{\mathrm{flop}}$ and $B_\epsilon \equiv \epsilon_{\mathrm{mem}}/\epsilon_{\mathrm{flop}}$ are the time balance and energy balance (flop:mop). Consider $W\,\tau_{\mathrm{flop}}/T$ and $W\,\epsilon_{\mathrm{flop}}/E$.
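The next slides plot relative performance against intensity. A step left implicit on the slides (my own connection to the roofline form): dividing the ideal compute time $W\,\tau_{\mathrm{flop}}$ by $T$ gives

\[ \frac{W\,\tau_{\mathrm{flop}}}{T} \;=\; \frac{1}{\max\!\left(1,\; B_\tau / I\right)} \;=\; \min\!\left( 1,\; \frac{I}{B_\tau} \right), \]

which sits at 1 (peak) whenever $I \ge B_\tau$ and falls off linearly in $I$ below the balance point; that is the roofline shape plotted next.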

  20.–26. “Roofline” — Williams et al. (Comm. ACM ’09). (Figure, built up over several slides: relative performance, GFLOP/s as a fraction of peak, versus intensity in FLOP:byte on log scales; intensity ticks run from 1/2 to 128 and relative performance from 1/32 to 1.) The balance estimate for a high-end NVIDIA Fermi in double precision is about 3.6 flop:byte, according to Keckler et al., IEEE Micro (2011). Kernels to the right of the balance point are compute bound; kernels to the left are memory (bandwidth) bound. Examples placed on the plot: dense matrix multiply (compute bound); FFTs; sparse matvec and stencils (bandwidth bound).
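To make the roofline concrete, here is a small host-side sketch (not from the slides) that evaluates the bound min(peak, intensity × bandwidth). The peak and bandwidth numbers are illustrative placeholders chosen so that their ratio is roughly the 3.6 flop:byte balance quoted above, and the example intensities are placeholders as well; substitute measured values for a real machine and kernel.

    #include <cstdio>
    #include <algorithm>

    // Roofline bound: attainable GFLOP/s = min(peak, intensity * bandwidth).
    // Placeholder Fermi-class numbers (not official specs): ~665 GFLOP/s
    // double-precision peak and ~185 GB/s bandwidth, i.e., ~3.6 flop:byte.
    static double roofline_gflops (double intensity_flop_per_byte,
                                   double peak_gflops = 665.0,
                                   double bw_gb_per_s = 185.0)
    {
        return std::min(peak_gflops, intensity_flop_per_byte * bw_gb_per_s);
    }

    int main ()
    {
        const double balance = 665.0 / 185.0;             // ~3.6 flop:byte
        const double examples[] = { 0.25, 2.0, 32.0 };    // illustrative intensities
        for (double I : examples) {
            double g = roofline_gflops(I);
            std::printf("I = %6.2f flop/byte -> %7.1f GFLOP/s (%s bound)\n",
                        I, g, (I < balance) ? "bandwidth" : "compute");
        }
        return 0;
    }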

  27. TLP vs. ILP (thread- vs. instruction-level parallelism). See also: https://bitbucket.org/rvuduc/volkov-gtc10 , http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf , http://www.realworldtech.com/fermi/

  28. Recall Little’s Law, which quantifies the degree of concurrency needed to hide latency. (Figure: a grid of in-flight operations whose width is the throughput [ops/time] and whose depth is the latency [time].)
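The law itself is not written out on the slide; stated explicitly, the concurrency (number of operations in flight) needed to keep a pipeline busy is

\[ \text{concurrency} \;=\; \text{latency} \times \text{throughput}. \]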

  29. The NVIDIA M2090 implements a fused multiply-add (FMA) with a latency of ~20 cycles and issues up to 32 FMAs per cycle. Therefore the concurrency needed is ~(20 cy) × (32 ops/cy) = 640 operations, so a thread block size of 640 threads should fully hide the latency.

  30. A latency-hiding microbenchmark (https://bitbucket.org/rvuduc/volkov-gtc10): each thread executes a chain of dependent FMAs.

      #define N <constant-value>

      __global__ void kernel (float *pa, float b, float c)
      {
          float a = *pa;

          #pragma unroll 8
          for (int i = 0; i < N; ++i)
              a = a * b + c;   // each iteration depends on the previous one

          *pa = a;
      }
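The deck does not show the host side of this experiment. A minimal launch-and-timing sketch (my own; names such as d_pa and NBLOCKS are made up, and it assumes the kernel and #define N above are in the same file):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main ()
    {
        const int NBLOCKS = 1024;        // enough blocks to cover every SM
        float *d_pa = nullptr;
        cudaMalloc(&d_pa, sizeof(float));
        cudaMemset(d_pa, 0, sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        kernel<<<NBLOCKS, 1024>>>(d_pa, 0.5f, 0.5f);   // warm-up launch
        cudaDeviceSynchronize();

        for (int tpb = 32; tpb <= 1024; tpb += 32) {   // sweep threads per block
            cudaEventRecord(start);
            kernel<<<NBLOCKS, tpb>>>(d_pa, 0.5f, 0.5f);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            std::printf("%4d threads/block: %8.3f ms\n", tpb, ms);
        }

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_pa);
        return 0;
    }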

  31. (Plot: fraction of peak versus threads per block, for block sizes from 32 up to 1024.) Plateau starts roughly where expected (~640 threads). https://bitbucket.org/rvuduc/volkov-gtc10

  32. The same benchmark rewritten with two accumulators per thread (https://bitbucket.org/rvuduc/volkov-gtc10):

      #define N <constant-value>

      __global__ void kernel (float *pa, float b, float c)
      {
          float a[2] = {0, 0};

          #pragma unroll 8
          for (int i = 0; i < N; ++i) {
              a[0] = a[0] * b + c;
              a[1] = a[1] * b + c;
          }

          *pa += a[0] + a[1];
      }

  33. (Same kernel, with the annotation that the two updates a[0] = a[0]*b + c and a[1] = a[1]*b + c are mutually independent, so each thread now exposes two-way instruction-level parallelism; a four-chain variant is sketched after the plot below.)

  34. (Plot: fraction of peak versus threads per block.) Plateau starts roughly where expected (~640 threads). https://bitbucket.org/rvuduc/volkov-gtc10
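The same idea extends to more accumulator chains; a sketch of a four-chain variant (my own, not a slide from the deck), where each iteration issues four mutually independent FMAs:

    // Sketch (not from the deck): 4-way ILP via four independent chains.
    // Assumes the same #define N as above.
    __global__ void kernel_ilp4 (float *pa, float b, float c)
    {
        float a[4] = {0, 0, 0, 0};

        #pragma unroll 8
        for (int i = 0; i < N; ++i) {
            a[0] = a[0] * b + c;   // the four updates depend only on the
            a[1] = a[1] * b + c;   // previous iteration of their own chain,
            a[2] = a[2] * b + c;   // not on each other
            a[3] = a[3] * b + c;
        }

        *pa += a[0] + a[1] + a[2] + a[3];
    }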
