
Origami: Folding Warps for Energy Efficient GPUs



  1. Origami: Folding Warps for Energy Efficient GPUs
 Mohammad Abdel-Majeed*, Daniel Wong†, Justin Huang‡ and Murali Annavaram*
 * University of Southern California
 † University of California, Riverside
 ‡ Stanford University

  2. Outline
 • GPU overview
 • Motivation and related work
 • Warp Folding
 • Origami Scheduler
 • Evaluation

  3. GPGPU Overview (GTX480)
 [Figure: SM block diagram showing the instruction cache, fetch and decode, 2-level warp scheduler, 128KB register file, operands, execution units (cores with INT and FP units, SFUs, LD/ST units), result queue, and 64KB shared memory/L1 cache]

  6. GPGPU Power Break-Down
 [Figure: power breakdown by component: DRAM 0.178, EXE 0.201, L2 0.045, RF 0.134, MC 0.048, Other 0.072, Pipeline 0.114, NOC 0.095, Constant 0.112]
 • EXE 20.1%
 GPUWattch, ISCA 2013

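  7. GPU Scaling Trend
 • Fermi GTX 480: 16 cores (SMs), 512 execution units, RF size 128KB/SM, 3 billion transistors
 • Kepler GTX 680: 8 cores (SMs), 1536 execution units, RF size 256KB/SM, 3.5 billion transistors
 • Maxwell GTX 980: 16 cores (SMs), 2048 execution units, RF size 256KB/SM, 5.2 billion transistors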

  12. Technology Scaling
 • As technology scales, leakage power will increase
 – Accounts for 50% of the execution units' power
 • Power gating can be used to reduce the leakage power
 – Needs long idle periods to be effective
 Warped Gates, MICRO 2013

  18. Power Gating Challenges in GPGPUs
 • Integer unit idle period length distribution for hotspot
 – Assume 5-cycle idle detect, 14-cycle break-even time (BET)
 [Figure: idle period length distribution with regions labeled Lost Opportunity, Energy Loss or Neutral, and Energy Savings]
 • Need to increase idle period length
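The three regions on this slide can be illustrated with a small classifier. This is only a sketch: the function name and the exact boundary conventions (shorter than or equal to idle detect, up to idle detect plus BET, and beyond) are assumptions made for illustration, not definitions taken from the slides.

```python
IDLE_DETECT = 5   # cycles of idleness before gating starts (from the slide)
BET = 14          # break-even time in cycles (from the slide)


def classify_idle_period(length, idle_detect=IDLE_DETECT, bet=BET):
    """Map an idle period length (in cycles) to one of the three slide regions.

    Boundary handling is an assumption of this sketch.
    """
    if length <= idle_detect:
        # The unit is never gated: the idle period ends before detection fires.
        return "lost_opportunity"
    if length <= idle_detect + bet:
        # The unit is gated but woken before (or exactly at) break-even.
        return "energy_loss_or_neutral"
    # The unit stays gated longer than the break-even time.
    return "energy_savings"


# Hypothetical idle period lengths, one per region.
regions = [classify_idle_period(n) for n in (3, 12, 30)]
```

Under these assumptions, only idle periods longer than 19 cycles save energy, which is why lengthening idle periods matters.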

  21. Warp Scheduler Effect on Power Gating
 [Figure: ready warps with interleaved INT and FP instructions issued to the INT and FP pipelines, with cycles marked Busy or Idle for each pipeline]
 • Idle periods interrupted by instructions that are greedily scheduled
 • Need to coalesce warp issues by resource type

  23. Related Work/Warped-Gates*
 • Schedule instructions based on their type
 • Force power gated units to stay in power gating state for at least the break-even time
 [Figure: histogram of idle period length (x-axis, 0 to 25) against frequency (y-axis, 0% to 30%), with regions labeled 54.3%, 0.0% and 45.7%]
 *Warped-Gates, MICRO 2013
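The second bullet, holding a gated unit off for at least the break-even time, can be sketched as a small cycle-level simulation. Everything here is an assumption for illustration: the function name, the `blackout` flag, the pending-request model, and the decision to ignore wake-up latency. It is not the Warped-Gates implementation.

```python
def gated_periods(demand, idle_detect=5, bet=14, blackout=True):
    """Return the lengths (in cycles) of each gated interval of one unit.

    demand: per-cycle booleans, True when an instruction wants the unit.
    A request that arrives while the unit is held off stays pending.
    Without blackout, a request right after gating can yield a 0-length period.
    """
    periods = []
    idle = 0            # consecutive idle cycles while powered on
    gated_for = None    # None means powered on, else cycles spent gated
    pending = False
    for want in demand:
        pending = pending or want
        if gated_for is None:
            if pending:
                pending = False      # request served immediately
                idle = 0
            else:
                idle += 1
                if idle >= idle_detect:
                    gated_for = 0    # gating begins after this cycle
        elif pending and (not blackout or gated_for >= bet):
            periods.append(gated_for)  # wake up and serve the request
            gated_for = None
            pending = False
            idle = 0
        else:
            gated_for += 1           # stay gated (held off or simply idle)
    if gated_for is not None:
        periods.append(gated_for)
    return periods


# Hypothetical trace: one issue, 8 idle cycles, one issue, 30 idle cycles.
trace = [True] + [False] * 8 + [True] + [False] * 30
without_rule = gated_periods(trace, blackout=False)
with_rule = gated_periods(trace, blackout=True)
```

With the rule enabled, no gated interval in this model is shorter than the break-even time, which is consistent with the 0.0% label on the middle region of the histogram; the cost is that the pending request waits.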

  32. Fine grain idleness
 • Temporal idleness
 – Infrequent issues to the same pipeline
 – Finely interspersed leading to limited power gating opportunities
 [Figure: scheduler issuing interleaved INT and FP warps; on SP0, full warps (active mask 1111 1111) issue at Cycle X and Cycle X+2, with bubbles at Cycle X+1 and Cycle X+3]

  33. Fine grain idleness
 • Spatial idleness
 – Lanes have different activity
   • Branch divergence
   • Insufficient parallelism
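Spatial idleness can be made concrete by measuring how often each lane is active across a set of issued warps. The helper name and the 8-lane masks below are hypothetical examples, chosen only to show uneven lane activity of the kind that branch divergence or insufficient parallelism would produce.

```python
def lane_utilization(active_masks):
    """Fraction of issued warps in which each lane is active.

    active_masks: list of equal-length lists of 0/1, one list per issued warp.
    """
    warps = len(active_masks)
    lanes = len(active_masks[0])
    return [sum(mask[lane] for mask in active_masks) / warps for lane in range(lanes)]


# Hypothetical active masks for four issued warps on an 8-lane pipeline.
masks = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # fully active warp
    [1, 1, 1, 1, 0, 0, 0, 0],  # half the lanes masked off
    [1, 1, 0, 0, 0, 0, 0, 0],  # only two active threads
    [1, 1, 1, 1, 1, 1, 0, 0],
]
utilization = lane_utilization(masks)
```

In this example the higher-numbered lanes are active far less often than the lower ones, so they are the better candidates for gating.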

  34. Warp Folding
 ➢ Improve the power gating potential by coalescing the pipeline bubbles

  39. Warp Folding
 [Figure, animation sequence: the scheduler moves a warp from the ready warps queue to the issued warps; the issued warp has active mask 1 1 1 1 1 1 1 1 and is followed by a bubble; the warp is split into Sub_Warp0 and Sub_Warp1 with four active threads each; the final issue slots are Sub_Warp0: 1 1 1 1 0 0 0 0 and Sub_Warp1: 1 1 1 1 0 0 0 0]
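The folding step shown in this animation can be sketched as a function on active masks. This is a minimal illustration under assumptions: the function name is hypothetical, the warp is split exactly in half, and the upper-half threads are remapped onto the lower lanes so that the upper lanes see no work in either issue slot. The slides do not confirm how the actual hardware performs the remapping.

```python
def fold_warp(active_mask):
    """Split one warp into two sub-warps that both issue on the lower lanes.

    Returns two lane-occupancy masks of the original width. The second
    sub-warp is assumed to fill the bubble cycle that follows the warp.
    """
    half = len(active_mask) // 2
    idle_lanes = [0] * (len(active_mask) - half)
    sub_warp0 = active_mask[:half] + idle_lanes   # lower-half threads
    sub_warp1 = active_mask[half:] + idle_lanes   # upper-half threads, remapped
    return sub_warp0, sub_warp1


# The fully active 8-thread warp from the slide.
sub_warp0, sub_warp1 = fold_warp([1, 1, 1, 1, 1, 1, 1, 1])
```

For the fully active warp on the slide, both sub-warps occupy lanes 0 to 3 only, so lanes 4 to 7 stay idle for two consecutive cycles instead of alternating between busy and bubble.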
