Analyzing Throughput of GPUs Exploiting Within-Die Core-to-Core Frequency Variation
Jungseob Lee, Paritosh Ajgaonkar, Nam Sung Kim
Apr 12, 2011
Department of Electrical and Computer Engineering, University of Wisconsin-Madison
Outline
- Introduction
- GPU architecture and impact of WID variations on GPUs
- Throughput improvement techniques
  - Allowing per-SM clocking (PSMC)
  - Disabling the slowest SMs (DSSM)
- Impact of main memory latency and bandwidth on throughput improvement
- Conclusion
Introduction
- Goal: improve the throughput of GPU applications.
- GPUs provide high throughput for general-purpose, data-intensive applications.
- Increasing WID core-to-core (C2C) frequency variation affects the Fmax of GPUs: the slowest core on the die limits the Fmax of the whole chip, even when faster cores are available.
- We propose two techniques, PSMC and DSSM, to mitigate the negative impact of WID C2C frequency variation on GPU throughput.
Outline
- Introduction
- GPU architecture and impact of WID variations on GPUs
- Throughput improvement techniques
  - Allowing per-SM clocking (PSMC)
  - Disabling the slowest SMs (DSSM)
- Impact of main memory latency and bandwidth on throughput improvement
- Conclusion
GPU architecture
- A GPU consists of streaming multiprocessors (SMs), off-chip DRAM, and an on-chip interconnection network.
- Each SM contains: 1) 8 to 32 streaming processors (SPs), 2) an instruction scheduler, 3) an instruction cache, 4) register files, 5) special function units (SFUs), and 6) shared memory/cache.
[Figure: Fermi architecture and the internal organization of a streaming multiprocessor (SM) [1]]
[1] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
D2D & WID variations
- Die-to-die (D2D) variations affect all transistors on a die identically.
- Within-die (WID) variations cause transistor characteristics to differ within a single die.
- With technology scaling (i.e., more cores per die), spatially correlated WID variations produce considerable C2C Fmax variation.
[Figure: taxonomy of process variations — die-to-die (D2D, wafer/die scale) vs. within-die (WID, systematic and random, feature scale). Courtesy: K. Bowman, Intel]
Impact of WID C2C Fmax variations
- C2C frequency variation limits the GPU's Fmax: in a GPU designed to operate all SMs at the same frequency (i.e., per-chip clocking), Fmax is bounded by the slowest SM.
- With more SMs per die, SM-to-SM frequency variation increases.
- Power inefficiency: faster SMs consume more leakage power.
WID C2C Fmax variations
[2] S. Herbert et al., “Characterizing chip-multiprocessor variability-tolerance,” in Proc. IEEE DAC, 2008.
[Figure: a WID Vth/Leff variation map for a 16-SM GPU (each variation map has 80x80 grid points) and the corresponding per-SM Fmax map [2]]
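For intuition about how such maps arise, the sketch below generates a spatially correlated Vth deviation map on an 80x80 grid and derives a per-SM Fmax map for a 16-SM die. This is a toy illustration: the Gaussian-filtered-noise model, the linear delay assumption, and every constant are placeholders of my own, not the variation model of Herbert et al. [2].

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_fmax_map(grid=80, sms=16, sigma=0.064, corr=20.0, seed=0):
    """Toy WID variation model: a spatially correlated (systematic) plus a
    random Vth deviation per grid point, mapped to a per-SM Fmax.
    All modeling choices and constants are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    # Systematic component: white noise smoothed over ~corr grid points,
    # rescaled to a relative standard deviation of sigma.
    smooth = gaussian_filter(rng.standard_normal((grid, grid)), sigma=corr)
    smooth *= sigma / smooth.std()
    random = rng.standard_normal((grid, grid)) * sigma   # uncorrelated component
    dvth = smooth + random                               # relative Vth deviation
    freq = 1.0 / (1.0 + dvth)      # assume delay grows linearly with the Vth shift
    # Tile the die into a square array of SMs; each SM is limited by its
    # slowest grid point (critical-path style).
    side = int(round(np.sqrt(sms)))
    step = grid // side
    fmax = np.array([[freq[r*step:(r+1)*step, c*step:(c+1)*step].min()
                      for c in range(side)] for r in range(side)])
    return dvth, fmax

dvth_map, fmax_map = make_fmax_map()
print(fmax_map.round(3))           # 4x4 per-SM Fmax map (relative units)
```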
Outline
- Introduction
- GPU architecture and impact of WID variations on GPUs
- Throughput improvement techniques
  - Allowing per-SM clocking (PSMC)
  - Disabling the slowest SMs (DSSM)
- Impact of main memory latency and bandwidth on throughput improvement
- Conclusion
Per-SM clocking (PSMC) Per-SM clocking (PSMC)
Each SM executing independent thread blocks
Enabling PSMC efficiently for GPUs w/ per-SM PLL
Many SP-to-SP communications through a shared memory in an SM
[Figure: 28 thread blocks dispatched from the queue to SM1-SM4 (per-SM Fmax of 2.5, 2.0, 1.5, and 1.0). Under per-chip clocking every SM runs at the slowest SM's Fmax and the relative execution time is 1; under PSMC the faster SMs run at their own Fmax and complete more blocks, reducing the relative execution time to 0.57.]
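A minimal dispatch model (my own simplification, not the authors' simulator) reproduces the relative execution times in the illustration; the per-SM Fmax values 2.5/2.0/1.5/1.0 and the 28 thread blocks come from the figure, everything else is an assumption.

```python
import heapq

def exec_time(fmax_per_sm, n_blocks, per_chip=True):
    """Greedy block dispatch onto SMs. Each block is one unit of work, so an
    SM at frequency f finishes a block in 1/f time units. Per-chip clocking
    runs every SM at the slowest SM's frequency; PSMC lets each SM use its
    own Fmax. Memory and interconnect effects are ignored (assumption)."""
    freqs = [min(fmax_per_sm)] * len(fmax_per_sm) if per_chip else list(fmax_per_sm)
    ready = [(0.0, i) for i in range(len(freqs))]  # (time SM becomes free, SM id)
    heapq.heapify(ready)
    finish = 0.0
    for _ in range(n_blocks):
        t, i = heapq.heappop(ready)   # the SM that frees up first gets the next block
        t += 1.0 / freqs[i]
        finish = max(finish, t)
        heapq.heappush(ready, (t, i))
    return finish

fmax = [2.5, 2.0, 1.5, 1.0]                   # per-SM Fmax from the illustration
base = exec_time(fmax, 28, per_chip=True)     # all SMs throttled to 1.0
psmc = exec_time(fmax, 28, per_chip=False)    # each SM at its own Fmax
print(psmc / base)                            # ~0.57, as in the figure
```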
GPGPU-Sim configuration and benchmarks
GPGPU-Sim parameters (entry-level / mid-range / high-end):
- Number of cores (SMs): 16 / 32 / 64
- Memory channels: 4 / 8 / 8
- Core (SM) frequency (GHz): 1.688 / 1.476 / 1.401
- Memory frequency (GHz): 1.100 / 1.242 / 1.848
- Interconnection frequency (GHz): 0.85 / 1.00 / 1.50
- Memory bandwidth (GB/s): 70.4 / 159 / 236.5
Common parameters:
- Warp size: 32; SIMD pipeline width: 8
- Threads / core: 1024; CTAs / core: 8; registers / core: 16384
- Shared memory / core: 16 KB; constant cache / core: 8 KB; texture cache / core: 8 KB
- Bandwidth / memory module: 4 bytes/cycle; memory controller: FR-FCFS
- Branch divergence method: immediate post-dominator; warp scheduling policy: round-robin
12 CUDA benchmarks [3, 4]: AES encryption (AES), Black-Scholes (BLK), gpuDG (DG), 3D Laplace solver (LPS), ray tracing (RAY), StoreGPU (STO), breadth-first search (BFS), LIBOR Monte Carlo (LIB), MUMmerGPU (MUM), neural network (NN), image denoising (IMG), sum of absolute differences (SAD)
[3] A. Bakhoda et al., "Analyzing CUDA workloads using a detailed GPU simulator," in Proc. IEEE ISPASS, 2009.
[4] "ERCBench: A Benchmark Suite for Embedded and Reconfigurable Computing," http://ercbench.ece.wisc.edu/index.php
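For scripting experiments around these configurations, the table can be captured as plain Python dictionaries. This only restates the values above; the key names and the entry/mid/high-end labels for the 16/32/64-SM columns follow the ordering used elsewhere in the deck and are otherwise my own.

```python
# Per-configuration parameters from the GPGPU-Sim table above.
GPU_CONFIGS = {
    "entry-level": {"sms": 16, "mem_channels": 4, "sm_freq_ghz": 1.688,
                    "mem_freq_ghz": 1.100, "icnt_freq_ghz": 0.85,
                    "mem_bw_gb_s": 70.4},
    "mid-range":   {"sms": 32, "mem_channels": 8, "sm_freq_ghz": 1.476,
                    "mem_freq_ghz": 1.242, "icnt_freq_ghz": 1.00,
                    "mem_bw_gb_s": 159.0},
    "high-end":    {"sms": 64, "mem_channels": 8, "sm_freq_ghz": 1.401,
                    "mem_freq_ghz": 1.848, "icnt_freq_ghz": 1.50,
                    "mem_bw_gb_s": 236.5},
}

# Parameters shared by all three configurations.
COMMON = {"warp_size": 32, "simd_width": 8, "threads_per_sm": 1024,
          "ctas_per_sm": 8, "regs_per_sm": 16384, "shared_mem_kb": 16,
          "const_cache_kb": 8, "tex_cache_kb": 8,
          "mem_controller": "FR-FCFS", "warp_scheduling": "round-robin",
          "branch_divergence": "immediate post-dominator"}
```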
Per-SM clocking (PSMC)
- Theoretical throughput improvement of PSMC over per-chip clocking:
  Speedup = (1/N) * Σ_{i=1}^{N} (F_max,i / F_max,slowest)
  where N is the number of SMs, F_max,i is the Fmax of SM i, and F_max,slowest is the Fmax of the slowest SM.
- On average, 10%, 14%, and 16% higher throughput for entry-level, mid-range, and high-end GPUs, respectively.
[Figure: theoretical speedup for σ_sys = 6.4% and σ_sys = 3.2%]
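As a sanity check with the purely illustrative per-SM Fmax values 2.5, 2.0, 1.5, and 1.0 from the earlier dispatch example (not measured data): N = 4, F_max,slowest = 1.0, so Speedup = (2.5 + 2.0 + 1.5 + 1.0) / (4 × 1.0) = 1.75, i.e., 75% higher theoretical throughput than per-chip clocking. The averages quoted above are far smaller because realistic SM-to-SM Fmax spreads are much narrower than this illustration.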
Outline
- Introduction
- GPU architecture and impact of WID variations on GPUs
- Throughput improvement techniques
  - Allowing per-SM clocking (PSMC)
  - Disabling the slowest SMs (DSSM)
- Impact of main memory latency and bandwidth on throughput improvement
- Conclusion
Disabling the slowest SM (DSSM)
- Problem-size-bounded applications: the problem size is small relative to the number of SMs, so throughput does not increase with more available SMs.
- Disabling the slowest SM(s) allows a higher Fmax for the rest of the GPU.
[Figure: 12 thread blocks on a GPU whose per-chip Fmax is limited to 1.0 by the slowest SM. With all four SMs enabled, the relative execution time is 1; disabling the slowest SM leaves three SMs running at Fmax = 1.5 and, since the number of execution rounds does not change, the relative execution time drops to 0.67.]
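The round-based model below is a minimal sketch of this effect (my own simplification, not the paper's simulator). The 8-CTAs-per-SM limit comes from the GPGPU-Sim table above, and the block count and Fmax values match the illustration; the rest is assumed.

```python
def dssm_time(fmax_per_sm, n_blocks, max_ctas_per_sm, n_disabled):
    """Execution time under per-chip clocking when the n_disabled slowest SMs
    are turned off. Simple round-based model (assumption): in each round,
    every enabled SM runs up to max_ctas_per_sm blocks concurrently, a round
    takes 1/Fmax time units, and Fmax is set by the slowest enabled SM."""
    enabled = sorted(fmax_per_sm)[n_disabled:]     # drop the slowest SMs
    fmax = min(enabled)                            # per-chip clock follows the slowest enabled SM
    blocks_per_round = len(enabled) * max_ctas_per_sm
    rounds = -(-n_blocks // blocks_per_round)      # ceiling division
    return rounds / fmax

fmax = [2.5, 2.0, 1.5, 1.0]                        # per-SM Fmax from the illustration
base = dssm_time(fmax, n_blocks=12, max_ctas_per_sm=8, n_disabled=0)
dssm = dssm_time(fmax, n_blocks=12, max_ctas_per_sm=8, n_disabled=1)
print(dssm / base)   # ~0.67: same number of rounds, but Fmax rises from 1.0 to 1.5
```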
Disabling the slowest SM (DSSM)
- The slowest SMs are disabled one by one; relative throughput of the 12 applications on the entry-level and mid-range GPUs:
- Problem-size-bounded applications benefit from DSSM: disabling SMs does not change the number of execution rounds, while the GPU's Fmax increases with more disabled SMs.
- Memory-bounded applications: fewer SMs issue fewer concurrent memory accesses (each served at a higher rate), while a higher Fmax issues memory accesses faster (each served at a lower rate); the two effects trade off.
- Compute-bounded applications benefit more from additional SMs than from a higher Fmax.
Disabling the slowest SM (DSSM)
- The slowest SMs are disabled one by one; relative throughput of the 12 applications on the mid-range and high-end GPUs:
- If an appropriate number of the slowest SMs is disabled (i.e., 2 to 6 out of 32 SMs and 4 to 32 out of 64 SMs), certain applications gain 1%-7% and 4%-19% in throughput, respectively.
Outline
- Introduction
- GPU architecture and impact of WID variations on GPUs
- Throughput improvement techniques
  - Allowing per-SM clocking (PSMC)
  - Disabling the slowest SMs (DSSM)
- Impact of main memory latency and bandwidth on throughput improvement
- Conclusion
Per-SM clocking (PSMC)
- Emerging memory technology: 32% lower latency and 6x higher bandwidth than the baseline memory.
- Relative throughput improvement of the applications under the PSMC scheme: 15%, 18%, and 24% higher throughput than the baselines for entry-level, mid-range, and high-end GPUs, on average.
Disabling the slowest SM (DSSM)
- The slowest SMs are disabled one by one; relative throughput of the 12 applications on the entry-level and mid-range GPUs:
- Problem-size-bounded applications still benefit from DSSM.
- With the lower-latency, higher-bandwidth memory, memory-bounded applications behave more like compute-bounded ones.
Disabling the slowest SM (DSSM)
- The slowest SMs are disabled one by one; relative throughput of the 12 applications on the high-end GPU:
- If an appropriate number of the slowest SMs is disabled, certain applications gain 7%-20% in throughput.
Conclusion
- Two throughput improvement techniques that exploit WID SM-to-SM frequency variation in GPUs:
  - PSMC: allowing each SM to operate at its own Fmax
  - DSSM: disabling the slowest SMs
- PSMC: 10%-16% throughput improvement on average across the applications.
- DSSM: up to 19% throughput improvement for certain applications.
- With emerging memory technology (lower latency and higher bandwidth), the improvements reach up to 24% for PSMC and 20% for DSSM.