
Register Packing: Exploiting Narrow-Width Operands for Reducing Register File Pressure



  1. Register Packing: Exploiting Narrow-Width Operands for Reducing Register File Pressure
     Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev
     Department of Computer Science, State University of New York - Binghamton
     *currently with Intel Barcelona Research Center

  2. Outline
     - Introduction and motivations
     - Register Packing:
       - Conservative Packing
       - Speculative Packing
     - Results and discussions
     - Conclusion

  3. Introduction
     - Implications of larger instruction windows
       - Increases register pressure
       - Generally dealt with by using large register files
     - Large register files have:
       - Higher access time or require multi-cycle access
       - Higher energy dissipation
     - Need to decrease the register file pressure

  4. Motivations
     - Many generated results have a lot of leading zeros or ones
     - Fewer bits are needed to represent the value
     - Register files are thus not used efficiently

  5. "Narrow" Values
     - Prefixes of all 1s can be replaced with a single 1, and prefixes of all 0s can be replaced with a single 0.
       - 11111111 → 1 (width = 1)
       - 00000000 → 0 (width = 1)
       - 00000001 → 01 (width = 2)
       - 11111101 → 101 (width = 3)
       - 10101001 → 10101001 (width = 8)
     - Narrow-width operands do not use the full width of a register
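The prefix-collapsing rule on this slide can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not the authors' hardware; the function name `narrow_width` and the 8-bit default are assumptions chosen to match the slide's examples.

```python
def narrow_width(value: int, total_bits: int = 8) -> int:
    """Width left after collapsing the run of identical leading bits to one bit.

    `narrow_width` and `total_bits` are hypothetical names for this sketch.
    """
    # Render the value as a fixed-width two's complement bit string.
    bits = format(value & ((1 << total_bits) - 1), f"0{total_bits}b")
    sign = bits[0]
    # Drop the whole leading run of the sign bit, then count one copy back in.
    return len(bits.lstrip(sign)) + 1


for pattern in (0b11111111, 0b00000000, 0b00000001, 0b11111101, 0b10101001):
    print(format(pattern, "08b"), "-> width", narrow_width(pattern))
```

Running the loop reproduces the widths listed on the slide (1, 1, 2, 3 and 8).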

  6. Distribution of Widths
     [Figure: per-benchmark distribution (0% to 100%) of result widths in four classes (16, 32, 48 and 64 bits) for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim and wupwise, plus INT Average, FP Average and Total Average]

  7. Exploiting Narrow Values
     - Packing multiple results into a single physical register improves performance, as the effective number of physical registers goes up
     [Figure: example registers holding packed results of 16, 32, 48 and 64 bits]

  8. Main Challenges
     - Value widths are not known until the results are actually produced
     - Register allocation made to a result can change if the value turns out to be narrow
     - Consumers of the result have to be informed if it is reallocated to a different register based on its width
     - If multiple results are packed into a common register, some means must be provided to locate them unambiguously

  9. Detecting Value Widths
     - Have to quantize the widths to simplify implementation
       - Chunks of bytes or double bytes
     - Width detection logic is embedded into the final stages of an execution unit
     - Techniques for detecting widths are well known: Leading Zero Detectors in floating point units
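A rough software analogue of quantized width detection is sketched below. It assumes 64-bit registers split into 16-bit (double-byte) chunks, which matches the 16/32/48/64-bit classes in the width distribution chart; the name `partitions_needed` and its parameters are hypothetical, and real hardware would use leading-zero/one detection logic rather than arithmetic.

```python
def partitions_needed(value: int, reg_bits: int = 64, chunk_bits: int = 16) -> int:
    """Number of chunk-sized partitions a signed result occupies.

    Hypothetical helper: `value` is a Python int that must fit in a
    signed `reg_bits`-bit register.
    """
    assert -(1 << (reg_bits - 1)) <= value < (1 << (reg_bits - 1))
    # Minimal two's complement width: magnitude bits plus one sign bit.
    magnitude = value if value >= 0 else ~value
    width = magnitude.bit_length() + 1
    # Quantize: round the width up to a whole number of chunks.
    return -(-width // chunk_bits)


print(partitions_needed(100))        # fits in one 16-bit chunk
print(partitions_needed(-70000))     # needs two chunks (32 bits)
print(partitions_needed(2**63 - 1))  # needs the full register
```

Coarser chunks waste some bits per value but keep the number of partitions, and hence the mask and multiplexing logic, small.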

  10. Storing Narrow Values in Registers
     - Parts of a result do not need to be stored contiguously.
     [Figure: register P7 holding the upper and lower halves of narrow result A and the upper and lower halves of narrow result B in separate partitions]

  11. Addressing Narrow Values
     - Use a bit mask to specify the partitions holding components of the value, along with the register address
     - Address of A = P7, 1001
     [Figure: register P7 with the upper and lower halves of narrow result A]
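The (register, bit mask) addressing scheme can be modelled as a scatter on write and a gather plus sign extension on read. The sketch below assumes four 16-bit partitions per register and that the leftmost mask bit names partition 3; `write_packed` and `read_packed` are hypothetical names, and the model ignores the multiplexer structure of the real read path.

```python
CHUNK_BITS = 16  # assumed partition width
NUM_PARTS = 4    # assumed partitions per register


def selected(mask: int) -> list:
    """Partition indices named by the mask, lowest partition first."""
    return [i for i in range(NUM_PARTS) if (mask >> i) & 1]


def write_packed(reg: list, mask: int, value: int) -> None:
    """Scatter the low chunks of `value` into the partitions selected by `mask`."""
    for n, slot in enumerate(selected(mask)):
        reg[slot] = (value >> (n * CHUNK_BITS)) & ((1 << CHUNK_BITS) - 1)


def read_packed(reg: list, mask: int) -> int:
    """Gather the selected partitions and sign-extend the result."""
    slots = selected(mask)
    raw = 0
    for n, slot in enumerate(slots):
        raw |= reg[slot] << (n * CHUNK_BITS)
    width = len(slots) * CHUNK_BITS
    # The top bit of the highest selected partition is the sign bit.
    if (raw >> (width - 1)) & 1:
        raw -= 1 << width
    return raw


p7 = [0] * NUM_PARTS
write_packed(p7, 0b1001, -70000)   # result A in partitions 3 and 0
write_packed(p7, 0b0110, 123456)   # result B in partitions 2 and 1
print(read_packed(p7, 0b1001), read_packed(p7, 0b0110))
```

Both results share P7 here, and each is recovered from the same register using only its own mask.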

  12. Register Read Logic
     [Figure: bitcell array split into Partitions 3 to 0, each k bits wide, feeding a 4:1 sign-bit MUX (includes a 1-to-k expander) and 2:1, 3:1, 4:1 and 4:1 MUXes, then a 4k-bit sense amp array]

  13. Register Packing Alternatives
     - Conservative Packing
       - Assume the result will use the full width of a register at allocation time
     - Speculative Packing
       - Predict the result width at allocation time and allocate accordingly

  14. Conservative Packing
     - Initially allocate a full-width register
     - If the result turns out to be narrow:
       - Release the unneeded parts to the free pool
       - If there is a suitable partition: reallocate.

  15. Conservative Packing
     - Instruction I is dispatched:
     [Figure: registers P2 and P5, with free and allocated partitions marked]

  16. Conservative Packing
     - Instruction I is dispatched:
       - P2 is allocated
     [Figure: registers P2 and P5, with free and allocated partitions marked]

  17. Conservative Packing
     - Instruction I is dispatched:
       - Width of result = 2 slots
       - P5's upper half is allocated and P2 is released
     [Figure: registers P2 and P5, with free and allocated partitions marked]
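The dispatch-then-repack sequence on the conservative packing slides can be sketched as a small allocator. This is a simplified model under stated assumptions: four partitions per register, a first-fit search, and a preference for moving a narrow result into a partially used register so the original is freed whole. The class `PartitionedRegFile` and its methods are hypothetical, not the authors' design.

```python
class PartitionedRegFile:
    """Tracks which partitions of each physical register are allocated."""

    def __init__(self, num_regs: int, parts: int = 4):
        self.parts = parts
        self.full = (1 << parts) - 1
        self.used = [0] * num_regs  # per-register bit mask of allocated partitions

    def allocate_full(self):
        """At dispatch: conservatively claim a completely free register."""
        for r, used in enumerate(self.used):
            if used == 0:
                self.used[r] = self.full
                return r
        return None  # no free register; dispatch would stall

    def repack(self, reg: int, needed: int):
        """Once the width is known: return the final (register, mask) home."""
        if needed >= self.parts:
            return reg, self.full
        # Look for a partially used register with enough free partitions.
        for r, used in enumerate(self.used):
            if r == reg or used == 0:
                continue
            free = [i for i in range(self.parts) if not (used >> i) & 1]
            if len(free) >= needed:
                mask = sum(1 << i for i in free[-needed:])
                self.used[r] |= mask
                self.used[reg] = 0  # release the originally allocated register
                return r, mask
        # No suitable partition elsewhere: keep the low part, release the rest.
        mask = (1 << needed) - 1
        self.used[reg] = mask
        return reg, mask


rf = PartitionedRegFile(8)
for busy in (0, 1, 3, 4):
    rf.used[busy] = rf.full
rf.used[5] = 0b0011               # P5's lower half already holds a value
p = rf.allocate_full()            # instruction I gets P2
print(p, rf.repack(p, needed=2))  # result is 2 slots wide
```

With this setup the result ends up in P5's upper half (mask 1100) and P2 returns to the free pool, as in the slide sequence.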

  18. Taking Care of Reassignments
     - Two broadcasts are needed
       - First broadcast uses the old tag (= originally assigned register id) to inform dependents that the result will be available shortly
       - Second broadcast drives the old tag and the new tag (= newly-assigned register id + "parts" bits)
         - old tag is used to locate dependents
         - new tag is picked up by matching entries and used later to read out the source value from the register file

  19. Tag Broadcast for Wakeup
     [Figure: the producer function unit drives tag P2, 1111 on the tag bus; the consumer entry in the issue queue holds source tags P1, 1001 and P2, 1111]

  20. Tag Rebroadcast Example
     [Figure: the producer function unit broadcasts old tag P2, 1111 together with new tag P5, 1100; the consumer entry in the issue queue, with source tags P1, 1001 and P2, 1111, picks up P5, 1100]
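The two-broadcast protocol can be modelled with a toy issue queue, using the tag values from the rebroadcast example. The names `Source`, `broadcast_wakeup` and `broadcast_retag` are hypothetical, and the sketch ignores timing, bus arbitration and the comparator hardware.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """One source operand slot in an issue queue entry."""
    tag: tuple          # (register id, "parts" bit mask)
    ready: bool = False


def broadcast_wakeup(issue_queue, old_tag):
    """First broadcast: mark dependents of the originally assigned register ready."""
    for src in issue_queue:
        if src.tag == old_tag:
            src.ready = True


def broadcast_retag(issue_queue, old_tag, new_tag):
    """Second broadcast: entries matching the old tag pick up the new tag."""
    for src in issue_queue:
        if src.tag == old_tag:
            src.tag = new_tag


# Consumer with two sources, as in the wakeup and rebroadcast figures.
consumer = [Source(("P1", 0b1001)), Source(("P2", 0b1111))]
broadcast_wakeup(consumer, ("P2", 0b1111))
broadcast_retag(consumer, ("P2", 0b1111), ("P5", 0b1100))
print(consumer)
```

After both broadcasts the second source is ready and reads from P5 with mask 1100, while the unrelated P1 source is untouched.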

  21. IPCs for Conservative Packing
     [Figure: IPC (0 to 3) for bzip2, gap, gcc, gzip, mcf, parser, twolf, vpr, ammp, applu, apsi, art, equake, mesa, mgrid, swim and wupwise, plus INT Average, FP Average and Total Average, for four configurations: Base, 8 tag buses, 4 tag buses, and 1 cycle stall on tag re-broadcasts]

  22. Conservative Packing: Observations
     - Extra broadcast is needed for all results that don't use all of the partitions within a register
     - Performance is heavily constrained by the number of broadcast buses
       - 6% for 4 buses
       - 14% for 8 buses
       - -26% for 4 buses assuming an extra cycle delay for width estimation
