
Get Out of the Valley: Power-Efficient Address Mapping for GPUs



  1. Get Out of the Valley: Power-Efficient Address Mapping for GPUs
     The 45th International Symposium on Computer Architecture (ISCA), Monday, June 4th, 2018
     Yuxi Liu (Ghent & Peking), Xia Zhao (Ghent), Magnus Jahre (NTNU), Zhenlin Wang (MTU), Xiaolin Wang (Peking), Yingwei Luo (Peking), and Lieven Eeckhout (Ghent)

  2. GPU Memory Systems
     GPUs require high-bandwidth memory systems to support efficient execution of 100s to 1000s of concurrent threads.
     [Figure: Streaming Multiprocessors (SMs) connect through a Network on Chip (NoC) to four LLC slices, each backed by a DRAM channel (Channel 0 to Channel 3) containing several DRAM banks.]
     Achieving high bandwidth requires effectively utilizing the parallel units in the memory system.

  3. Entropy Valley
     Bank and channel bits must be highly variable to ensure even distribution of memory requests across LLC slices, channels and banks.
     Entropy is a measure of the information content of each address bit.
     [Figure: memory address fields from most significant bit to least significant bit: Row, Channel, Bank, Column, Block. Below it, entropy per memory address bit for CPUs and GPUs, with the entropy valley marked.]
     Entropy valleys create significant resource imbalance in GPU memory systems, leading to poor performance and low power-efficiency.

  4. Why Do Entropy Valleys Exist?
     [Figure: column-major 1D thread block (TB) allocation over a grid with X-dimension 0 to 7 and Y-dimension 0 to 7. Memory addresses and requests, with the channel bits shown as the last two bits, next to DRAM Channels 0 to 3.]
     [y,x]   address          request
     [0,0]   … 0000 00 …      Request [0,0]
     [1,0]   … 0010 00 …      Request [1,0]
     [2,0]   … 0100 00 …      Request [2,0]
     [3,0]   … 0110 00 …      Request [3,0]
     [4,0]   … 1000 00 …      Request [4,0]
     [5,0]   … 1010 00 …      Request [5,0]
     [6,0]   … 1100 00 …      Request [6,0]
     [7,0]   … 1110 00 …      Request [7,0]

  5. Why Do Entropy Valleys Exist?
     [Figure: the same column-major 1D thread block (TB) allocation and addresses ([0,0] … 0000 00 … through [7,0] … 1110 00 …), with Requests [0,0] to [7,0] all queued at Channel 0.]
     All requests end up in Channel 0.
     Entropy valleys are caused by dimension-related array indexing.
     Our solution: BIM-based address mapping.

  6. Getting Out of the Entropy Valley
     BIM-based address mapping: Output Addr. = Binary Invertible Matrix (BIM) x Input Addr.
     [Figure: column-major 1D thread block (TB) allocation (X-dimension 0 to 7, Y-dimension 0 to 7), memory addresses and requests, and DRAM Channels 0 to 3. The channel bits are the last two bits shown.]
     [y,x]   input address    output address   request
     [0,0]   … 0000 00 …      … 0000 00 …      Request [0,0]
     [1,0]   … 0010 00 …      … 0010 11 …      Request [1,0]
     [2,0]   … 0100 00 …      … 0100 01 …      Request [2,0]
     [3,0]   … 0110 00 …      … 0110 10 …      Request [3,0]
     [4,0]   … 1000 00 …      … 1000 11 …      Request [4,0]
     [5,0]   … 1010 00 …      … 1010 00 …      Request [5,0]
     [6,0]   … 1100 00 …      … 1100 10 …      Request [6,0]
     [7,0]   … 1110 00 …      … 1110 01 …      Request [7,0]

  7. Getting Out of the Entropy Valley
     [Figure: the same BIM-based address mapping (Output Addr. = BIM x Input Addr.) and input/output addresses as on the previous slide, with the requests now distributed over the DRAM channels.]
     Channel 0: Request [0,0], Request [5,0]
     Channel 1: Request [2,0], Request [7,0]
     Channel 2: Request [3,0], Request [6,0]
     Channel 3: Request [1,0], Request [4,0]
     Perfect channel utilization!
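
The channel assignments on slides 4 to 7 can be reproduced with a few lines of Python. This is only an illustrative sketch: the slides do not give the matrix that was used, so the two XOR equations in `xor_channel` are an assumption, chosen because they reproduce the output channel bits shown on slide 6. The 6-bit address layout (`y` in bits 3 to 5, channel in bits 0 to 1) and all function names are hypothetical.

```python
from collections import Counter

def baseline_channel(addr: int) -> int:
    """Baseline mapping: the channel is simply the two lowest address bits."""
    return addr & 0b11

def xor_channel(addr: int) -> int:
    """Hypothetical XOR mapping that matches the output addresses on slide 6.

    Assumed layout: bits 3..5 hold y, bit 2 is unused, bits 0..1 are channel bits.
    """
    c0, c1 = addr & 1, (addr >> 1) & 1
    y0, y1, y2 = (addr >> 3) & 1, (addr >> 4) & 1, (addr >> 5) & 1
    lo = c0 ^ y0 ^ y1 ^ y2   # new low channel bit
    hi = c1 ^ y0 ^ y2        # new high channel bit
    return (hi << 1) | lo

# Addresses of requests [0,0] .. [7,0] under column-major allocation.
addresses = [y << 3 for y in range(8)]

baseline = [baseline_channel(a) for a in addresses]
remapped = [xor_channel(a) for a in addresses]

print("baseline channels:", baseline)
print("remapped channels:", remapped)
print("requests per channel after remapping:", dict(Counter(remapped)))
```

With the baseline mapping every request lands in channel 0, while the XOR mapping spreads the eight requests two per channel, as on slide 7.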

  8. Outline
     1. Introduction
     2. Window-based memory address entropy
     3. Binary Invertible Matrix (BIM) address mapping
     4. Results
     5. Conclusion

  9. Window-based Entropy
     We need an entropy metric without memory request ordering assumptions.
     Intra-TB entropy: the Bit Value Ratio (BVR) is computed per address bit within each thread block (TB).
     [Figure: Thread Block (TB) 1 issues addresses … 1 0 0 …, … 0 0 1 …, … 1 0 1 …, … 0 0 0 … and has BVR 0; Thread Block (TB) 2 issues addresses … 1 1 0 …, … 0 1 1 …, … 1 1 1 …, … 0 1 0 … and has BVR 1.]
     Inter-TB entropy:
     [Figure: TB1, TB2, TB3, TB4 with BVR 0, 1, 0, 1.]
     Window: the TBs that are likely to issue requests that coexist in the memory system.
     Compute Shannon's entropy function over the BVR probabilities within each window.
     Overall entropy = mean of window entropies.
     With Greedy-Then-Oldest (GTO) warp scheduling, we heuristically set the window size to the number of Streaming Multiprocessors (SMs).
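
The steps on this slide can be sketched in Python. The slide does not state the exact formula, so this is one plausible reading rather than the authors' definition: the mean BVR of a window is treated as the probability that the bit is 1, the binary Shannon entropy of that probability is the window entropy, and windows slide over consecutive TBs. The function names and the sliding-window choice are assumptions.

```python
import math

def bit_value_ratio(addresses, bit):
    """Fraction of a thread block's addresses that have `bit` set to 1."""
    return sum((a >> bit) & 1 for a in addresses) / len(addresses)

def binary_entropy(p):
    """Shannon entropy (in bits) of a binary variable with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def window_entropy(tb_addresses, bit, window):
    """Mean entropy of `bit` over sliding windows of `window` consecutive TBs.

    ASSUMPTION: the window's mean BVR is used as the bit's probability;
    the slides only say entropy is computed over the BVR probabilities.
    """
    bvrs = [bit_value_ratio(addrs, bit) for addrs in tb_addresses]
    n_windows = max(1, len(bvrs) - window + 1)
    entropies = []
    for start in range(n_windows):
        chunk = bvrs[start:start + window]
        entropies.append(binary_entropy(sum(chunk) / len(chunk)))
    return sum(entropies) / len(entropies)

# Four TBs whose bit 0 has BVR 0, 1, 0, 1 (the inter-TB example on the slide).
tbs = [[0b0, 0b0], [0b1, 0b1], [0b0, 0b0], [0b1, 0b1]]
print(window_entropy(tbs, bit=0, window=2))  # adjacent TBs differ
print(window_entropy(tbs, bit=0, window=1))  # each TB alone is constant
```

In this reading a bit that is constant inside every TB still contributes entropy when TBs in the same window disagree, which is the inter-TB effect the slide illustrates.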

  10. Entropy Profile Examples
     [Figure: six entropy profiles (entropy from 0.0 to 1.0 per address bit, bits 29 down to 6) for the workloads MT, LU, GS, NW, LPS, and NN (no valley), with annotations marking two channel bits, three bank bits and one bank bit.]
     All workloads have low entropy bits, and their location is highly application-dependent.
     GPU address mapping schemes must harvest entropy across broad address bit ranges.

  11. Outline
     1. Introduction
     2. Window-based memory address entropy
     3. Binary Invertible Matrix (BIM) address mapping
     4. Results
     5. Conclusion

  12. The Binary Invertible Matrix (BIM)
     Output Addr. = Binary Invertible Matrix (BIM) x Input Addr.
     The BIM can represent all possible address mapping schemes that consist of AND and XOR operations
     • Matrix covers all possible transformations
     • Invertibility criterion ensures that all possible one-to-one relations are considered
     The BIM has low hardware overhead
     • Can be implemented with a tree of XOR-gates
     • Mapping can be performed in a single clock cycle
     [Figure: example matrices. Example Memory Map. Remap (RMP): single 1 per row. Permutation-based mapping (PM), Zhang et al. [MICRO'00]: two 1s in bank and channel rows.]
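
A minimal Python sketch of the matrix view follows. Each matrix row is stored as a bitmask: the AND selects the input bits named by the row and the XOR (parity) combines them into one output bit. The invertibility check uses Gaussian elimination over GF(2). The function names, the bitmask encoding, and the 6-bit `EXAMPLE_BIM` are assumptions for illustration; the example matrix is not taken from the slides, it is merely one matrix that reproduces the channel bits shown on slide 6.

```python
def parity(x: int) -> int:
    """XOR of all bits of x (1 if an odd number of bits are set)."""
    return bin(x).count("1") & 1

def apply_bim(rows, addr):
    """Multiply a binary matrix by an address over GF(2).

    rows[i] is a bitmask of the input bits XORed together to form output bit i.
    """
    out = 0
    for i, mask in enumerate(rows):
        out |= parity(mask & addr) << i
    return out

def is_invertible(rows):
    """Gauss-Jordan elimination over GF(2); True if the square matrix has full rank."""
    rows = list(rows)
    n = len(rows)
    rank = 0
    for bit in range(n):
        pivot = next((r for r in range(rank, n) if (rows[r] >> bit) & 1), None)
        if pivot is None:
            return False  # no remaining row covers this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(n):
            if r != rank and (rows[r] >> bit) & 1:
                rows[r] ^= rows[rank]
        rank += 1
    return True

# Hypothetical 6-bit layout: bits 0..1 = channel, bits 3..5 = y (as on slides 4-7).
EXAMPLE_BIM = [1 << i for i in range(6)]  # start from the identity matrix
EXAMPLE_BIM[0] = 0b111001                 # out0 = a0 ^ a3 ^ a4 ^ a5
EXAMPLE_BIM[1] = 0b101010                 # out1 = a1 ^ a3 ^ a5

channels = [apply_bim(EXAMPLE_BIM, y << 3) & 0b11 for y in range(8)]
print(channels, is_invertible(EXAMPLE_BIM))
```

In this encoding, a remap-style matrix has a single set bit per row, and a row with several set bits corresponds to one XOR tree in hardware.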

  13. Our Mapping Schemes
     Broad mapping strategy
     Entropy analysis shows that a GPU address mapping policy needs to harvest entropy across broad address bit ranges
     • We call this the broad mapping strategy
     • Covers many possible mapping schemes
     [Figure: a BIM with multiple 1s for each bank and channel row.]
     Broad sub-strategies
     We define three sub-strategies that differ in which memory address fields can be used as input and output in the BIM
     • Page Address Entropy (PAE)
     • Full Address Entropy (FAE)
     • All
     [Figure: the input address fields (Row, Channel, Bank, Column, Block) feed the Binary Invertible Matrix (BIM), which produces the output address fields (Row, Channel, Bank, Column, Block); labels PAE, FAE, and All mark which fields each sub-strategy uses.]
     We randomly generate BIMs that match the input and output restrictions of each sub-strategy.
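
Random generation under input and output restrictions could look like the sketch below. It uses rejection sampling: rows for the permitted output bits get random masks over the permitted input bits, every other row stays the identity, and candidates are redrawn until the matrix is invertible. The slides do not describe the generation procedure, so the rejection-sampling approach, the function names, and the bit positions in the example call are all assumptions.

```python
import random

def is_invertible(rows):
    """Gauss-Jordan elimination over GF(2); True if the square matrix has full rank."""
    rows = list(rows)
    n = len(rows)
    rank = 0
    for bit in range(n):
        pivot = next((r for r in range(rank, n) if (rows[r] >> bit) & 1), None)
        if pivot is None:
            return False
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(n):
            if r != rank and (rows[r] >> bit) & 1:
                rows[r] ^= rows[rank]
        rank += 1
    return True

def random_bim(n_bits, output_bits, input_bits, rng, max_tries=1000):
    """Draw a random invertible matrix (rows as bitmasks, see is_invertible).

    Only rows in `output_bits` are randomized, and they may only read
    bits listed in `input_bits`; all other rows pass their bit through.
    """
    allowed = sorted(input_bits)
    for _ in range(max_tries):
        rows = [1 << i for i in range(n_bits)]  # identity matrix
        for out in output_bits:
            mask = 0
            for b in allowed:
                if rng.random() < 0.5:
                    mask |= 1 << b
            rows[out] = mask
        if is_invertible(rows):
            return rows
    raise RuntimeError("no invertible matrix found within max_tries")

# Hypothetical 12-bit layout: bits 0..3 column/block, 4..6 channel and bank, 7..11 row.
rng = random.Random(0)
bim = random_bim(12, output_bits=[4, 5, 6], input_bits=range(4, 12), rng=rng)
for i, row in enumerate(bim):
    print(f"out bit {i:2d} <- {row:012b}")
```

The example call restricts inputs to the assumed channel, bank, and row bits, which is one reading of a PAE-style restriction; widening `input_bits` to include the column bits would correspond to a less restrictive sub-strategy.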

  14. Entropy Impact of Address Mapping Schemes for the MT Benchmark
     [Figure: six entropy profiles (entropy from 0.0 to 1.0 per address bit, bits 29 down to 6) for Baseline, Remap, PM, FAE, All, and PAE.]
     PAE, FAE, and All remove the entropy valleys; the other mapping schemes do not.

  15. Outline
     1. Introduction
     2. Window-based memory address entropy
     3. Binary Invertible Matrix (BIM) address mapping
     4. Results
     5. Conclusion

  16. Execution Time vs. DRAM Power
     [Figure: scatter plot of average execution time normalized to BASE (y-axis, 0 to 1.2) against average DRAM power consumption normalized to BASE (x-axis, 0.8 to 1.5) for BASE, PM, RMP, PAE, ALL, and FAE, with annotations -1.51X and +1.30X.]

  17. Performance
     [Figure: bar chart of speed-up relative to BASE (y-axis, 0 to 8) for BASE, PM, RMP, PAE, FAE, and ALL across MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS, SP, and HMEAN. Bar labels: +7.5X, +6.7X, +4.0X, +1.9X, +2.0X, +1.5X, +1.4X, +1.4X, +1.3X, +1.1X, +1.0X, +1.0X.]
     PAE improves performance by 1.31X on average compared to PM.

  18. Performance per Watt
     [Figure: bar chart of performance per Watt (y-axis, 0 to 4.5) for BASE, PM, RMP, PAE, FAE, and ALL across MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS, SP, and HMEAN. Bar labels: +3.9X, +1.4X.]
     PAE improves performance per Watt by 1.25X on average compared to PM.

  19. Why is PAE Most Power-Efficient?
     [Figure: stacked bar chart of the DRAM power breakdown in W (y-axis, 0 to 60) into background, activate, read, and write for BASE, PM, RMP, PAE, FAE, and ALL across MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS, SP, and AVG.]
     FAE and ALL tend to distribute requests with good DRAM page locality to different banks, which increases the number of DRAM page activations.
     PAE saves power by keeping these requests in the same bank.

  20. Outline
     1. Introduction
     2. Window-based memory address entropy
     3. Binary Invertible Matrix (BIM) address mapping
     4. Results
     5. Conclusion

  21. Conclusion
     Window-Based Entropy
     • A novel entropy metric tailored for the highly concurrent memory behavior of GPU compute workloads
     Binary Invertible Matrix (BIM) address mapping
     • A unified representation of address mapping schemes that use AND and XOR operations
     Page Address Entropy (PAE) address mapping
     • PAE improves performance by 1.31X and performance per Watt by 1.25X compared to the state-of-the-art permutation-based address mapping scheme

  22. Thank You!
