Origami: Folding Warps for Energy Efficient GPUs
Mohammad Abdel-Majeed*, Daniel Wong†, Justin Huang‡ and Murali Annavaram*
* University of Southern California
† University of California, Riverside
‡ Stanford University
Outline
• GPU overview
• Motivation and related work
• Warp Folding
• Origami Scheduler
• Evaluation
GPGPU Overview (GTX480)
[Figure: block diagram of one SM. Components shown: instruction cache, fetch and decode, warp scheduler (2-level), 128KB register file, execution units (cores "C", each with operands, an INT unit, an FP unit and a result queue), SFUs, LD/ST units, and 64KB shared memory/L1 cache.]
GPGPU Power Break-Down
[Figure: pie chart of GPU power by component; EXE (20.1%) is highlighted.]
Component   Share of power
DRAM        0.178
EXE         0.201
L2          0.045
RF          0.134
MC          0.048
Other       0.072
Pipeline    0.114
NOC         0.095
Constant    0.112
Source: GPUWattch, ISCA 2013
GPU Scaling Trend
GPU               Fermi GTX 480   Kepler GTX 680   Maxwell GTX 980
Cores (SMs)       16              8                16
Execution Units   512             1536             2048
RF size           128KB/SM        256KB/SM         256KB/SM
#transistors      3 billion       3.5 billion      5.2 billion
Technology Scaling
• As technology scales, leakage power will increase
– Accounts for 50% of the execution units' power
• Power gating can be used to reduce leakage power
– Needs long idle periods to be effective
Warped Gates, MICRO 2013
Power Gating Challenges in GPGPUs
• Int. unit idle period length distribution for hotspot
– Assume a 5-cycle idle-detect window and a 14-cycle break-even time (BET)
[Figure: histogram of idle period lengths, divided into three regions: Lost Opportunity, Energy Loss or Neutral, and Energy Savings.]
• Need to increase idle period length
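The three regions of the histogram can be sketched as a small classifier. This is only an illustration under stated assumptions: both parameters are taken to be in cycles, a unit is assumed to be gated only once the idle-detect window has expired, and the function name and the exact handling of the boundary values are hypothetical rather than taken from the slides.

```python
def classify_idle_period(length, idle_detect=5, bet=14):
    """Classify one idle period (in cycles) into the three regions on the slide.

    Assumptions (not from the slides): the unit is gated only after
    `idle_detect` idle cycles, and gating pays off only if the unit then
    stays gated for more than `bet` cycles.
    """
    if length < 0:
        raise ValueError("idle period length must be non-negative")
    if length <= idle_detect:
        # Too short to be detected, so the unit is never gated.
        return "lost opportunity"
    gated_cycles = length - idle_detect
    if gated_cycles <= bet:
        # Gated, but woken at or before the break-even point.
        return "energy loss or neutral"
    return "energy savings"


if __name__ == "__main__":
    for cycles in (3, 10, 19, 20):
        print(cycles, classify_idle_period(cycles))
```

Under these assumptions, only idle periods longer than 19 cycles (5 + 14) land in the Energy Savings region, which is why lengthening idle periods matters.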
Warp Scheduler Effect on Power Gating
[Figure: ready warps issued to the INT and FP units over successive cycles, with INT and FP instructions interleaved so that each unit alternates between busy and idle.]
• Idle periods are interrupted by instructions that are greedily scheduled
• Need to coalesce warp issues by resource type
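The effect of issue order on idle period length can be illustrated with a toy model. The sketch below assumes one instruction is issued per cycle and that a unit is idle in any cycle where the issued instruction is of the other type; the issue sequence and the helper name `idle_periods` are hypothetical.

```python
def idle_periods(issue_order, unit):
    """Return the lengths of consecutive-cycle runs in which `unit` gets no instruction."""
    periods = []
    run = 0
    for issued in issue_order:
        if issued == unit:
            if run:
                periods.append(run)
            run = 0
        else:
            run += 1
    if run:
        periods.append(run)
    return periods


# Hypothetical greedy issue order with INT and FP instructions interleaved.
interleaved = ["INT", "FP", "INT", "FP", "INT", "FP", "FP", "INT"]

# Same instructions coalesced by resource type (stable sort, INT first).
grouped = sorted(interleaved, key=lambda kind: kind != "INT")

if __name__ == "__main__":
    print("interleaved INT idle periods:", idle_periods(interleaved, "INT"))
    print("grouped     INT idle periods:", idle_periods(grouped, "INT"))
```

In this toy sequence, the interleaved order leaves the INT unit with idle periods of 1, 1 and 2 cycles, while grouping by type yields a single 4-cycle idle period for each unit.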
Related Work/Warped-Gates*
• Schedule instructions based on their type
• Force power-gated units to stay in the power-gated state for at least the break-even time
[Figure: histogram of idle period length (x-axis, 0 to 25) against frequency (y-axis, 0% to 30%), annotated with 54.3%, 0.0% and 45.7%.]
*Warped-Gates, MICRO 2013
Fine grain idleness
• Temporal idleness
– Infrequent issues to the same pipeline
– Finely interspersed, leading to limited power gating opportunities
[Figure: scheduler issuing from a queue of INT and FP warps to SP0. Cycle X: active mask 1111 1111; cycle X+1: bubble; cycle X+2: active mask 1111 1111; cycle X+3: bubble.]
Fine grain idleness
• Spatial idleness
– Lanes have different activity
   • Branch divergence
   • Insufficient parallelism
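Spatial idleness can be made concrete by counting, per lane, the cycles in which that lane's bit in the active mask is 0. The sketch below is a minimal illustration; the 8-lane mask strings and the function name `lane_idle_cycles` are hypothetical and chosen only to mimic a divergent branch.

```python
def lane_idle_cycles(active_masks):
    """Count idle cycles per lane from a list of active-mask strings, one per cycle."""
    if not active_masks:
        return []
    width = len(active_masks[0])
    idle = [0] * width
    for mask in active_masks:
        if len(mask) != width:
            raise ValueError("all active masks must have the same width")
        for lane, bit in enumerate(mask):
            if bit == "0":
                idle[lane] += 1
    return idle


if __name__ == "__main__":
    # Hypothetical trace: one full warp, then two cycles where only lanes 0-3 are active.
    trace = ["11111111", "11110000", "11110000"]
    print(lane_idle_cycles(trace))
```

For this trace, lanes 0 to 3 are never idle while lanes 4 to 7 are idle for two cycles each, which is the kind of uneven lane activity the slide refers to.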
Warp Folding
➢ Improve the power gating potential by coalescing the pipeline bubbles
Warp Folding
[Figure: the scheduler takes a warp with active mask 1 1 1 1 1 1 1 1 from the ready warps queue, followed by a bubble, and issues it as two sub-warps that each use only the first four lanes.]
Sub_Warp0: 1 1 1 1 0 0 0 0
Sub_Warp1: 1 1 1 1 0 0 0 0
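The folding step shown in the figure can be sketched as splitting a warp's active mask into two halves and issuing each half on the lower lanes in consecutive cycles. This is a minimal sketch that reproduces the masks on the slide, not the authors' implementation; the function name `fold_warp` and the choice to always map the upper half onto the lower lanes are assumptions.

```python
def fold_warp(active_mask):
    """Fold one warp into two sub-warps that occupy only the lower half of the lanes.

    `active_mask` is a list of 0/1 values, one per lane. The returned masks
    describe lane occupancy in the two issue cycles (assumed mapping: the
    upper-half threads are executed on the lower-half lanes in the second cycle).
    """
    width = len(active_mask)
    if width == 0 or width % 2 != 0:
        raise ValueError("warp width must be a positive even number")
    half = width // 2
    unused = [0] * half
    sub_warp0 = list(active_mask[:half]) + unused  # lower-half threads, first cycle
    sub_warp1 = list(active_mask[half:]) + unused  # upper-half threads, second cycle
    return sub_warp0, sub_warp1


if __name__ == "__main__":
    sub0, sub1 = fold_warp([1, 1, 1, 1, 1, 1, 1, 1])
    print("Sub_Warp0:", sub0)
    print("Sub_Warp1:", sub1)
```

In this model, the unfolded case leaves every lane busy for one cycle and idle for the bubble cycle, whereas the folded case keeps lanes 0 to 3 busy for both cycles and leaves lanes 4 to 7 idle for two consecutive cycles, lengthening their idle period.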