Analyzing Throughput of GPUs Exploiting Within-Die Core-to-Core Frequency Variation
Jungseob Lee, Paritosh Ajgaonkar, Nam Sung Kim
Apr 12, 2011
Department of Electrical and Computer Engineering, University of Wisconsin - Madison
Outline
• Introduction
• GPU architecture and impact of WID variations on GPUs
• Throughput improvement techniques
  – Allowing per-SM clocking (PSMC)
  – Disabling the slowest SMs (DSSM)
• Impact of main memory latency and bandwidth on throughput improvement
• Conclusion
Introduction
• Goal: improve the throughput of GPU applications
  – GPUs can provide high throughput for general-purpose and data-intensive applications
• Increasing WID core-to-core (C2C) frequency variations affect the Fmax of GPUs: the slowest core limits Fmax
• PSMC & DSSM: two techniques for mitigating the negative impact of WID C2C frequency variations on the throughput of GPUs
[Figure: a die with its fastest and slowest cores, illustrating that the slowest core limits Fmax]
Outline
• Introduction
• GPU architecture and impact of WID variations on GPUs
• Throughput improvement techniques
  – Allowing per-SM clocking (PSMC)
  – Disabling the slowest SMs (DSSM)
• Impact of main memory latency and bandwidth on throughput improvement
• Conclusion
GPU architecture
• GPU architecture: streaming multiprocessors (SMs), off-chip DRAM, and an on-chip interconnection network
• Each SM: 1) 8 to 32 streaming processors (SPs), 2) an instruction scheduler, 3) an instruction cache, 4) register files, 5) special function units (SFUs), and 6) shared memory/cache
[Figure: a streaming multiprocessor (SM) with its SPs, from the Fermi architecture [1]]
[1] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
D2D & WID variations
• Die-to-die (D2D) variations: affect all transistors on a die identically
• Within-die (WID) variations: different transistor characteristics within a single die
• With technology scaling (i.e., more cores per die), considerable C2C FMAX variation arises due to spatially correlated WID variations
[Figure: die-to-die (D2D) vs. within-die (WID) variations, systematic and random, at feature, die, and wafer scale; courtesy of K. Bowman, Intel]
WID C2C FMAX variations
• Impact of WID C2C Fmax variations
[Figure: a WID Vth/Leff variation map (80x80 grid points, following [2]) and the corresponding Fmax map for a 16-SM GPU]
• C2C frequency variation affects the GPU's FMAX
  – Fmax is limited by the slowest core in a GPU designed to operate all SMs at the same frequency (i.e., per-chip clocking)
• More SMs in a die → SM-to-SM frequency variation increases
• Power inefficiency → faster SMs consume more leakage power
[2] S. Herbert et al., "Characterizing chip-multiprocessor variability-tolerance," in Proc. IEEE DAC, 2008.
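The slide's Fmax map is derived with the methodology of [2]; as a rough illustration of the idea only, the sketch below (with assumed grid size, correlation length, and sigma values, not the actual model from [2]) generates a spatially correlated variation map, partitions it into SM regions, and takes each SM's slowest grid point as that SM's Fmax.

```python
# Minimal sketch: spatially correlated WID variation -> per-SM Fmax.
# Grid size, correlation length, and sigma are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

GRID = 80          # 80x80 grid points per die (as on the slide)
N_SM = 16          # 16-SM GPU
SIGMA_SYS = 0.064  # assumed systematic (spatially correlated) sigma
CORR_LEN = 20      # assumed correlation length in grid points

# Spatially correlated component: smoothed white noise, rescaled to the target sigma.
noise = rng.standard_normal((GRID, GRID))
sys_var = gaussian_filter(noise, sigma=CORR_LEN)
sys_var *= SIGMA_SYS / sys_var.std()

# Simple linear model: relative Fmax of a grid point = 1 + variation.
fmax_map = 1.0 + sys_var

# Partition the die into a 4x4 array of SMs; each SM's Fmax is limited by
# its slowest grid point.
side = int(np.sqrt(N_SM))
step = GRID // side
sm_fmax = np.array([
    fmax_map[r*step:(r+1)*step, c*step:(c+1)*step].min()
    for r in range(side) for c in range(side)
])

print("per-SM Fmax (relative):", np.round(sm_fmax, 3))
print("per-chip Fmax (slowest SM):", round(sm_fmax.min(), 3))
```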
Outline
• Introduction
• GPU architecture and impact of WID variations on GPUs
• Throughput improvement techniques
  – Allowing per-SM clocking (PSMC)
  – Disabling the slowest SMs (DSSM)
• Impact of main memory latency and bandwidth on throughput improvement
• Conclusion
Per-SM clocking (PSMC)
• Each SM executes independent thread blocks; many SP-to-SP communications go through the shared memory within an SM
  – → PSMC can be enabled efficiently for GPUs with a per-SM PLL
[Figure: thread blocks 1 to 28 dispatched from a queue to SM1-SM4; with per-chip clocking at the slowest SM's Fmax (1.0) the relative execution time is 1, while with PSMC (per-SM Fmax of 2.5, 2.0, 1.5, and 1.0) it drops to 0.57]
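To make the 0.57 figure concrete, here is a small scheduling sketch (an illustrative model, not GPGPU-Sim): 28 equal-sized thread blocks are dispatched greedily to 4 SMs, once with every SM capped at the slowest SM's frequency (per-chip clocking) and once with each SM at its own Fmax, using the frequency values shown in the figure.

```python
# Greedy block-dispatch model of the PSMC example on this slide.
import heapq

def exec_time(num_blocks, sm_freqs, block_work=1.0):
    """Greedy dispatch: the next block goes to the SM that frees up first."""
    free_at = [(0.0, i) for i in range(len(sm_freqs))]  # (time SM is free, SM index)
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_blocks):
        t, i = heapq.heappop(free_at)
        t += block_work / sm_freqs[i]   # block runtime scales with 1/frequency
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish

per_chip = exec_time(28, [1.0, 1.0, 1.0, 1.0])   # all SMs at the slowest Fmax
psmc     = exec_time(28, [1.0, 1.5, 2.0, 2.5])   # each SM at its own Fmax
print("relative execution time:", round(psmc / per_chip, 2))  # ~0.57
```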
GPGPU-Sim config. and benchmarks
• GPGPU-Sim parameters (entry-level / mid-range / high-end):
  – Number of cores (SMs): 16 / 32 / 64
  – Memory channels: 4 / 8 / 8
  – Core (SM) frequency (GHz): 1.688 / 1.476 / 1.401
  – Memory frequency (GHz): 1.100 / 1.242 / 1.848
  – Interconnection frequency (GHz): 0.85 / 1.00 / 1.50
  – Memory bandwidth (GB/s): 70.4 / 159 / 236.5
  – Warp size: 32
  – Bandwidth / memory module: 4 bytes/cycle
  – SIMD pipeline width: 8
  – Memory controller: FR-FCFS
  – Number of threads / core: 1024
  – Branch divergence method: immediate post-dominator
  – Number of CTAs / core: 8
  – Warp scheduling policy: round robin
  – Number of registers / core: 16384
  – Constant cache size / core: 8 KB
  – Shared memory / core: 16 KB
  – Texture cache size / core: 8 KB
• 12 CUDA benchmarks [3, 4]: AES encryption (AES), Black-Scholes (BLK), gpuDG (DG), 3D Laplace Solver (LPS), Ray Tracing (RAY), StoreGPU (STO), Breadth-First Search (BFS), LIBOR Monte Carlo (LIB), MUMmerGPU (MUM), Neural Network (NN), Image Denoising (IMG), Sum of Absolute Differences (SAD)
[3] A. Bakhoda et al., "Analyzing CUDA workloads using a detailed GPU simulator," in Proc. IEEE ISPASS, 2009.
[4] "ERCBench: A Benchmark Suite for Embedded and Reconfigurable Computing," http://ercbench.ece.wisc.edu/index.php
Per-SM clocking (PSMC)
• Theoretical throughput improvement:
  Speedup = (1/N) · Σ_{i=1}^{N} (F_max,i / F_max,slowest)
[Figure: relative throughput of the PSMC scheme under systematic variation of σ_sys = 6.4% and σ_sys = 3.2%]
• 10%, 14%, and 16% higher throughput for entry-level, mid-range, and high-end GPUs on average
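The bound above can be checked directly; the snippet below evaluates it for the 4-SM example from the earlier PSMC slide (the frequencies 1.0, 1.5, 2.0, 2.5 are the illustrative values from that figure, not measured silicon).

```python
# Theoretical PSMC speedup bound: each SM contributes work in proportion to
# its own Fmax instead of being capped at the slowest SM's Fmax.
def psmc_speedup(sm_fmax):
    return sum(sm_fmax) / (len(sm_fmax) * min(sm_fmax))

print(psmc_speedup([1.0, 1.5, 2.0, 2.5]))  # 1.75, i.e. 1/0.57 of the baseline time
```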
Outline
• Introduction
• GPU architecture and impact of WID variations on GPUs
• Throughput improvement techniques
  – Allowing per-SM clocking (PSMC)
  – Disabling the slowest SMs (DSSM)
• Impact of main memory latency and bandwidth on throughput improvement
• Conclusion
Disabling the slowest SM (DSSM)
• Problem-size-bounded applications
  – Small problem size relative to the number of SMs → throughput does not increase with more available SMs
  – Disabling the slowest SM(s) → higher Fmax for the GPU
[Figure: a small number of thread blocks dispatched from a queue to the SMs; with all 4 SMs enabled (Fmax = 1.0) the relative execution time is 1, while disabling the slowest SM raises Fmax to 1.5 and reduces the relative execution time to 0.67]
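A rounds-based model makes the trade-off explicit: with per-chip clocking the GPU runs at the slowest enabled SM's Fmax, and a kernel with few blocks finishes in ceil(blocks / enabled SMs) rounds, so disabling slow SMs pays off only while the round count does not grow. The sketch below reproduces the 0.67 example using assumed per-SM Fmax values (1.0, 1.5, 2.0, 2.5) and 6 thread blocks; it is an illustration, not the paper's simulation.

```python
# Rounds-based model of the DSSM trade-off.
import math

def dssm_exec_time(num_blocks, sm_fmax, num_disabled):
    """Relative execution time with the num_disabled slowest SMs turned off."""
    enabled = sorted(sm_fmax)[num_disabled:]       # drop the slowest SMs
    rounds = math.ceil(num_blocks / len(enabled))  # one block per SM per round
    return rounds / min(enabled)                   # per-chip clock = slowest enabled SM

sm_fmax = [1.0, 1.5, 2.0, 2.5]   # assumed per-SM Fmax values from the example
baseline = dssm_exec_time(6, sm_fmax, 0)
for d in range(len(sm_fmax)):
    t = dssm_exec_time(6, sm_fmax, d)
    print(f"disable {d} slowest SM(s): rel. exec. time = {t / baseline:.2f}")
# disable 1 -> 0.67 (Fmax rises to 1.5, still 2 rounds); disabling more adds rounds
```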
Disabling the slowest SM (DSSM)
• The slowest SMs are disabled one by one
• Relative throughput of the 12 applications for the entry-level and mid-range GPUs
  – Problem-size-bounded applications benefit from DSSM: disabling SMs does not change the number of execution rounds, while the GPU's Fmax increases with more disabled SMs
  – Memory-bounded applications: fewer SMs issue fewer concurrent memory accesses (higher service rate), while a higher Fmax issues more memory accesses (lower service rate)
  – Compute-bounded applications benefit more from additional SMs than from a higher Fmax
Disabling the slowest SM (DSSM)
• The slowest SMs are disabled one by one
• Relative throughput of the 12 applications for the high-end GPU
• If an appropriate number of the slowest SMs is disabled (i.e., 2 to 6 out of 32 SMs and 4 to 32 out of 64 SMs), certain applications gain 1%~7% and 4%~19% in throughput
Outline
• Introduction
• GPU architecture and impact of WID variations on GPUs
• Throughput improvement techniques
  – Allowing per-SM clocking (PSMC)
  – Disabling the slowest SMs (DSSM)
• Impact of main memory latency and bandwidth on throughput improvement
• Conclusion
Per-SM clocking (PSMC)
• Emerging memory technology: 32% lower latency and 6x higher bandwidth
• Relative throughput improvement of applications adopting the PSMC scheme
  – 15%, 18%, and 24% higher throughput than the baselines for entry-level, mid-range, and high-end GPUs on average
Disabling the slowest SM (DSSM)
• The slowest SMs are disabled one by one
• Relative throughput of the 12 applications for the entry-level and mid-range GPUs
  – Problem-size-bounded applications still benefit from DSSM
  – Memory-bounded applications look more like compute-bounded ones
Disabling the slowest SM (DSSM)
• The slowest SMs are disabled one by one
• Relative throughput of the 12 applications for the high-end GPU
• If an appropriate number of the slowest SMs is disabled for the high-end GPU, certain applications gain 7%~20% in throughput
Conclusion
• Two throughput improvement techniques to exploit WID SM-to-SM frequency variations in GPUs:
  – Allowing each SM to operate at its own Fmax (PSMC)
  – Disabling the slowest SMs (DSSM)
• PSMC: 10%~16% throughput improvement on average
• DSSM: up to 19% throughput improvement for certain applications
• Impact of main memory latency and bandwidth on throughput
  – With emerging memory technology (lower latency and higher bandwidth), throughput improves by up to 24% (PSMC) and 20% (DSSM)