Analyzing Throughput of GPUs Exploiting Within-Die Core-to-Core Frequency Variation (PowerPoint PPT Presentation)


SLIDE 1

Analyzing Throughput of GPUs Exploiting Within-Die Core-to-Core Frequency Variation

Jungseob Lee, Paritosh Ajgaonkar, Nam Sung Kim
Apr 12, 2011
Department of Electrical and Computer Engineering, University of Wisconsin - Madison

SLIDE 2

Outline

• Introduction
• GPU architecture and impact of WID variations on GPUs
• Throughput improvement techniques
  • Allowing per-SM clocking (PSMC)
  • Disabling the slowest SMs (DSSM)
• Impact of main memory latency and bandwidth on throughput improvement
• Conclusion

SLIDE 3

Introduction

• GPUs can provide high throughput for general-purpose and data-intensive applications.
• Increasing WID core-to-core (C2C) frequency variations affect the Fmax of GPUs: the slowest core on a die limits the Fmax at which all cores can operate, wasting the frequency headroom of the faster cores.
• Goal: improve the throughput of GPU applications with two techniques, PSMC and DSSM, that mitigate the negative impact of WID C2C frequency variations.

SLIDE 4

Outline

• Introduction
• GPU architecture and impact of WID variations on GPUs
• Throughput improvement techniques
  • Allowing per-SM clocking (PSMC)
  • Disabling the slowest SMs (DSSM)
• Impact of main memory latency and bandwidth on throughput improvement
• Conclusion

SLIDE 5

GPU architecture

• A GPU consists of Streaming Multiprocessors (SMs), off-chip DRAM, and an on-chip interconnection network.
• Each SM contains: 1) 8 to 32 streaming processors (SPs), 2) an instruction scheduler, 3) an instruction cache, 4) register files, 5) special function units (SFUs), and 6) shared memory/cache.

[Figure: Fermi architecture, with one Streaming Multiprocessor (SM) and its SPs highlighted [1]]

[1] NVIDIA, "Fermi Compute Architecture Whitepaper," http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

SLIDE 6

D2D & WID variations

• Die-to-Die (D2D) variations affect all transistors on a die identically.
• Within-Die (WID) variations cause transistor characteristics to differ within a single die.
• With technology scaling (i.e., more cores per die), spatially correlated WID variations lead to considerable C2C Fmax variation.

[Figure: process variations broken down into Die-to-Die (D2D) and Within-Die (WID) components, the latter systematic and random, at feature, die, and wafer scales. Courtesy: K. Bowman, Intel]

SLIDE 7

Impact of WID C2C Fmax variations

• C2C frequency variation affects a GPU's Fmax: Fmax is limited by the slowest core in a GPU designed to operate all SMs at the same frequency (i.e., per-chip clocking).
• More SMs per die: SM-to-SM frequency variation increases.
• Power inefficiency: faster SMs consume more leakage power.

[Figure: a WID Vth/Leff variation map for a 16-SM GPU (80x80 grid points) and the corresponding per-SM Fmax map [2]]

[2] S. Herbert et al., "Characterizing chip-multiprocessor variability-tolerance," in Proc. IEEE DAC, 2008.
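To make the slide concrete, here is a minimal, hypothetical sketch of how such maps can be produced; it is not the authors' methodology. A random Vth map is smoothed to introduce spatial correlation, the die is tiled into 4x4 = 16 SMs, and each SM's Fmax is taken from its slowest grid point. The sigma value, delay model, and seed are all illustrative assumptions.

```python
# Illustrative model of a spatially correlated WID variation map and
# the resulting per-SM Fmax map (assumed constants, not from the paper).
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
GRID, SMS_PER_SIDE = 80, 4          # 80x80 grid points, 4x4 = 16 SMs

# Random per-grid-point Vth variation, smoothed to mimic the spatially
# correlated (systematic) WID component.
vth = gaussian_filter(rng.normal(0.0, 1.0, (GRID, GRID)), sigma=8)
vth *= 0.032 / vth.std()            # scale to sigma_sys = 3.2% (illustrative)

# Higher Vth -> slower critical paths; each SM's Fmax is set by its
# slowest grid point (a deliberately crude delay model).
fmax_grid = 1.0 / (1.0 + vth)
tile = GRID // SMS_PER_SIDE
sm_fmax = np.array([
    fmax_grid[r*tile:(r+1)*tile, c*tile:(c+1)*tile].min()
    for r in range(SMS_PER_SIDE) for c in range(SMS_PER_SIDE)
])

print("per-SM Fmax (normalized):", np.round(sm_fmax, 3))
print("per-chip Fmax (slowest SM):", round(sm_fmax.min(), 3))
```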

SLIDE 8

Outline

• Introduction
• GPU architecture and impact of WID variations on GPUs
• Throughput improvement techniques
  • Allowing per-SM clocking (PSMC)
  • Disabling the slowest SMs (DSSM)
• Impact of main memory latency and bandwidth on throughput improvement
• Conclusion

SLIDE 9

Per-SM clocking (PSMC)

• Each SM executes independent thread blocks; many SP-to-SP communications go through the shared memory within an SM.
• This makes it efficient to enable PSMC for GPUs with a per-SM PLL, letting each SM run at its own Fmax (a scheduling sketch follows the figure below).

[Figure: 28 thread blocks dispatched from a queue to SM1-SM4 with per-SM Fmax of 2.5/2.0/1.5/1.0. Per-chip clocking at the slowest SM's Fmax: Rel. Exec. Time = 1. PSMC, with each SM at its own Fmax: Rel. Exec. Time = 0.57.]
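A small simulation reproduces the figure's numbers. This is a sketch under an assumed greedy block scheduler (not GPGPU-Sim): each queued block is dispatched to whichever SM frees up first. Under per-chip clocking all SMs run at the slowest SM's Fmax; under PSMC each SM runs at its own.

```python
# Sketch of the slide's block-scheduling example under an assumed
# greedy dispatcher; frequencies mirror the figure.
import heapq

def exec_time(num_blocks, sm_freqs):
    """Greedy dispatch: each block takes 1/f time units on an SM at frequency f."""
    ready = [(0.0, f) for f in sm_freqs]    # (time the SM frees up, its freq)
    heapq.heapify(ready)
    finish = 0.0
    for _ in range(num_blocks):
        free_at, f = heapq.heappop(ready)   # earliest-available SM
        done = free_at + 1.0 / f
        finish = max(finish, done)
        heapq.heappush(ready, (done, f))
    return finish

freqs = [2.5, 2.0, 1.5, 1.0]                  # per-SM Fmax, as in the figure
per_chip = exec_time(28, [min(freqs)] * 4)    # per-chip clocking: slowest Fmax
psmc = exec_time(28, freqs)                   # PSMC: each SM at its own Fmax
print(f"Rel. Exec. Time under PSMC: {psmc / per_chip:.2f}")   # -> 0.57
```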

SLIDE 10

GPGPU-Sim config. and benchmarks

GPGPU-Sim parameters (entry-level / mid-range / high-end):
  Number of Cores (SMs): 16 / 32 / 64
  Memory Channels: 4 / 8 / 8
  Core (SM) Frequency (GHz): 1.688 / 1.476 / 1.401
  Memory Frequency (GHz): 1.100 / 1.242 / 1.848
  Interconnection Frequency (GHz): 0.85 / 1.00 / 1.50
  Memory Bandwidth (GB/s): 70.4 / 159 / 236.5
  Warp Size: 32
  SIMD Pipeline Width: 8
  Number of Threads / Core: 1024
  Number of CTAs / Core: 8
  Number of Regs / Core: 16384
  Shared Memory / Core: 16 KB
  Constant Cache Size / Core: 8 KB
  Texture Cache Size / Core: 8 KB
  Bandwidth / Memory Module: 4 Bytes/Cycle
  Memory Controller: FR-FCFS
  Branch Divergence Method: Immediate Post Dominator
  Warp Scheduling Policy: Round Robin

12 CUDA benchmarks [3, 4]: AES encryption (AES), Black-Scholes (BLK), gpuDG (DG), 3D Laplace Solver (LPS), Ray Tracing (RAY), StoreGPU (STO), Breadth-First Search (BFS), LIBOR Monte Carlo (LIB), MUMmerGPU (MUM), Neural Network (NN), Image Denoising (IMG), Sum of Absolute Differences (SAD)

[3] A. Bakhoda et al., "Analyzing CUDA workloads using a detailed GPU simulator," in Proc. IEEE ISPASS, 2009.
[4] "ERCBench: A Benchmark Suite for Embedded and Reconfigurable Computing," http://ercbench.ece.wisc.edu/index.php

SLIDE 11

Per-SM clocking (PSMC)

Theoretical throughput improvement:
• 10%, 14%, and 16% higher throughput for entry-level, mid-range, and high-end GPUs on average.

$$\mathrm{Speedup} = \frac{1}{N}\sum_{i=1}^{N}\frac{F_{\max,i}}{F_{\max,\mathrm{slowest}}}$$

[Chart: theoretical speedup for σ_sys = 6.4% and σ_sys = 3.2%]
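Read as code, the formula is just the mean of each SM's Fmax normalized to the slowest SM's Fmax. A quick sketch, reusing the illustrative frequencies from the slide-9 example (not measured data):

```python
# Direct evaluation of the slide's theoretical PSMC speedup formula.
import numpy as np

def psmc_speedup(fmax):
    """Speedup = (1/N) * sum_i Fmax_i / Fmax_slowest."""
    fmax = np.asarray(fmax, dtype=float)
    return float(np.mean(fmax / fmax.min()))

sm_fmax = [2.5, 2.0, 1.5, 1.0]   # per-SM Fmax from the slide-9 example
print(f"theoretical PSMC speedup: {psmc_speedup(sm_fmax):.2f}x")
# -> 1.75x, i.e., Rel. Exec. Time = 1/1.75 ≈ 0.57, matching slide 9
```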

SLIDE 12

Outline

• Introduction
• GPU architecture and impact of WID variations on GPUs
• Throughput improvement techniques
  • Allowing per-SM clocking (PSMC)
  • Disabling the slowest SMs (DSSM)
• Impact of main memory latency and bandwidth on throughput improvement
• Conclusion

SLIDE 13

Disabling the slowest SM (DSSM)

• Problem size-bounded applications: the problem size is small relative to the number of SMs, so throughput does not increase with more available SMs.
• Disabling the slowest SM(s) allows a higher Fmax for the GPU (a round-based model of this trade-off follows the figure below).

[Figure: DSSM example with a queue of thread blocks on SM1-SM4 (per-SM Fmax 2.5/2.0/1.5/1.0). With all four SMs clocked at the slowest SM's Fmax = 1.0, Rel. Exec. Time = 1. Disabling the slowest SM and clocking SM1-SM3 at Fmax = 1.5 keeps the number of execution rounds unchanged, giving Rel. Exec. Time = 0.67.]
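The trade-off can be captured with a tiny round-based model. This sketch uses assumed values (6 blocks and the slide's Fmax ladder), not the paper's simulator: disabling the k slowest SMs raises the chip Fmax to that of the slowest enabled SM, but may add execution rounds.

```python
# Round-based DSSM model under assumed block count and frequencies.
import math

def dssm_time(num_blocks, sm_fmax_ascending, disabled=0):
    """Execution time with the `disabled` slowest SMs off and the rest
    clocked at the slowest *enabled* SM's Fmax."""
    enabled = len(sm_fmax_ascending) - disabled
    chip_fmax = sm_fmax_ascending[disabled]   # next-slowest SM sets chip Fmax
    rounds = math.ceil(num_blocks / enabled)  # one block per SM per round
    return rounds / chip_fmax

fmax = [1.0, 1.5, 2.0, 2.5]                   # per-SM Fmax, ascending
base = dssm_time(6, fmax)                     # 2 rounds at Fmax = 1.0
for k in range(3):
    rel = dssm_time(6, fmax, disabled=k) / base
    print(f"{k} SM(s) disabled: rel. exec. time = {rel:.2f}")
# -> 1.00, 0.67, 0.75: disabling one SM keeps the round count (2) while
#    raising Fmax to 1.5; disabling two adds a round and loses ground.
```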

SLIDE 14

Disabling the slowest SM (DSSM)

The slowest SMs are disabled one by one; relative throughput of the 12 applications for the entry-level and mid-range GPUs.

• Problem size-bounded applications benefit from DSSM: disabling SMs does not change the number of execution rounds, while the GPU's Fmax increases with more disabled SMs.
• Memory-bounded applications: fewer SMs issue fewer concurrent memory accesses (higher service rate), whereas higher Fmax issues more memory accesses (lower service rate).
• Compute-bounded applications benefit more from more SMs than from higher Fmax.

SLIDE 15

Disabling the slowest SM (DSSM)

The slowest SMs are disabled one by one; relative throughput of the 12 applications for the high-end GPU.

• If an appropriate number of the slowest SMs is disabled (i.e., 2 to 6 out of 32 SMs and 4 to 32 out of 64 SMs), certain applications gain 1%~7% and 4%~19% in throughput, respectively.

SLIDE 16

Outline

• Introduction
• GPU architecture and impact of WID variations on GPUs
• Throughput improvement techniques
  • Allowing per-SM clocking (PSMC)
  • Disabling the slowest SMs (DSSM)
• Impact of main memory latency and bandwidth on throughput improvement
• Conclusion

SLIDE 17

Per-SM clocking (PSMC)

• Emerging memory technology: 32% lower latency and 6 times higher bandwidth.
• Relative throughput improvement of applications adopting the PSMC scheme: 15%, 18%, and 24% higher throughput than the baselines for entry-level, mid-range, and high-end GPUs on average.

SLIDE 18

Disabling the slowest SM (DSSM)

The slowest SMs are disabled one by one; relative throughput of the 12 applications for the entry-level and mid-range GPUs with the emerging memory technology.

• Problem size-bounded applications still benefit from DSSM.
• Memory-bounded applications behave more like compute-bounded ones.

SLIDE 19

Disabling the slowest SM (DSSM)

The slowest SMs are disabled one by one; relative throughput of the 12 applications for the high-end GPU.

• If an appropriate number of the slowest SMs is disabled for the high-end GPU, certain applications gain 7%~20% in throughput.

SLIDE 20

Conclusion

Two techniques exploit WID SM-to-SM frequency variations to improve GPU throughput:

• PSMC: allowing each SM to operate at its own Fmax yields 10%~16% average throughput improvement.
• DSSM: disabling the slowest SMs yields up to 19% throughput improvement for certain applications.
• Impact of main memory latency and bandwidth: with an emerging memory technology offering lower latency and higher bandwidth, PSMC and DSSM improve throughput by up to 24% and 20%, respectively.