
Low-Complexity Reorder Buffer Architecture: Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose



  1. Low-Complexity Reorder Buffer Architecture*. Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose, Department of Computer Science, State University of New York, Binghamton, NY 13902-6000, http://www.cs.binghamton.edu/~lowpower. 16th Annual ACM International Conference on Supercomputing (ICS’02), June 24th, 2002. *Supported in part by DARPA through the PAC-C program and NSF.

  2. Outline: ROB complexities; motivation for the low-complexity ROB; low-complexity ROB design; results; concluding remarks.

  3. Pentium III-like Superscalar Datapath. [Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), the issue queue (IQ), function units FU1 to FUm, the load/store queue (LSQ), the D-cache, the ROB, the architectural register file (ARF), and the result/status forwarding buses.]

  4. ROB Port Requirements for a W-way CPU. Decode/Dispatch: W write ports to set up entries. Writeback: W write ports to write results. Dispatch/Issue: 2W read ports to read the source operands. Commit: W read ports for instruction commitment.
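
The port counts on this slide can be tallied with a few lines of Python. This is only a worked-arithmetic sketch: the function name `rob_ports` and the dictionary keys are hypothetical labels, and the only inputs taken from the slides are the four per-stage port counts for a W-way machine.

```python
def rob_ports(width: int) -> dict:
    """Tally ROB ports for a W-way machine, using the counts listed on the slide."""
    ports = {
        "dispatch_setup_write": width,     # W write ports to set up entries
        "writeback_result_write": width,   # W write ports to write results
        "source_operand_read": 2 * width,  # 2W read ports for source operands
        "commit_read": width,              # W read ports for commitment
    }
    # The total is computed before it is stored, so it does not count itself.
    ports["total"] = sum(ports.values())
    return ports


if __name__ == "__main__":
    four_way = rob_ports(4)
    print(four_way["source_operand_read"], "of", four_way["total"], "ports read source operands")
```

With these counts, the source-operand read ports make up 2W of the 5W ports, which is why they are the natural target for simplification.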

  5. Where are the Source Values Coming From? [Figure: the datapath of slide 3 with three possible sources of operand values marked 1, 2 and 3.]

  6. Where are the Source Values Coming From? Average share of source values by origin: forwarding 62%, ARF 32%, ROB 6%. Configuration: 96-entry ROB, 4-way processor, SPEC2K benchmarks. [Figure: per-benchmark stacked bars, 0% to 100%, with averages.]

  7. How Efficiently are the Ports Used? Decode/Dispatch: W write ports to set up entries. Writeback: W write ports to write results. Dispatch/Issue: 2W read ports to read the source operands (annotated with 6%). Commit: W read ports for instruction commitment.

  8. Approaches to Reducing ROB Complexity. Reduce the number of read ports for reading out the source operand values. More radical (and better): completely eliminate the read ports for reading source operand values!

  9. Reducing the Number of Read Ports. Average IPC drop: 3.5% with 1 read port, 1.0% with 2 read ports. [Figure: performance drop (%) per benchmark. Integer panel: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating-point panel: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]

  10. Problems with Retaining Fewer Source Read Ports on the ROB. Arbitration is needed for the small number of ports. Additional logic is needed to block the instructions that could not get a port. A switching network is needed to route the operands to the correct destinations. Multi-cycle access still remains in the critical path of the Dispatch/Issue logic.

  11. Our Solution: Elimination of Read Ports. [Figure: the datapath of slide 5 with the operand sources marked 1, 2 and 3.]

  12. Our Solution: Elimination of Read Ports. [Figure: same diagram as slide 11.]

  13. Our Solution: Elimination of Read Ports. [Figure: same diagram with only the sources marked 1 and 3 remaining.]

  14. Comparison of ROB Bitcells (0.18µ, TSMC). [Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell.] Area reduction: 71%. Shorter bitlines and wordlines.

  15. Our Solution: Elimination of Read Ports. [Figure: the datapath of slide 3.] Area reduction: 45%.

  16. Eliminating/Reducing the Number of Read Ports: Effects on Power Dissipation. Power is reduced because of: shorter bitlines and wordlines; lower capacitive loading; fewer decoders; fewer drivers and sense amps.

  17. Completely Eliminating the Source Read Ports on the ROB. The problem: issue of instructions that require a value stored in the ROB will stall. The solution: forward the value to the waiting instruction at the time the value is committed (LATE FORWARDING).

  18. Late Forwarding: Use the Normal Forwarding Buses! [Figure: the datapath of slide 3 with the result/status forwarding buses highlighted.]

  19. Late Forwarding: Use the Normal Forwarding Buses! [Figure: same diagram as slide 18.]

  20. Optimizing Late Forwarding. Problem: if Late Forwarding is done for every result that is committed, additional forwarding buses are needed to avoid degrading performance. Solution: Selective Late Forwarding (SLF). SLF requires one additional bit in each ROB entry. That bit is set by the dispatched instructions that require Late Forwarding. No additional forwarding buses are needed, since SLF traffic is very small.
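
The SLF bit can be pictured with the small Python sketch below. It is an illustration under stated assumptions, not the authors' design: `SimpleRob`, `RobEntry`, and `late_forward` are hypothetical names, ROB slots are never reused here, and the rule that a consumer sets the bit when its producer's result already sits in the ROB is one interpretation of "instructions that require Late Forwarding".

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    slot: int
    value: Optional[int] = None  # filled in at writeback
    late_forward: bool = False   # the one extra bit assumed for SLF


class SimpleRob:
    """Toy ROB: in-order dispatch and commit, no slot reuse (simplification)."""

    def __init__(self) -> None:
        self.fifo = deque()
        self.by_slot = {}
        self.next_slot = 0

    def dispatch(self, source_slots=()) -> int:
        # A consumer whose producer has already written its result missed the
        # normal forwarding, so it marks the producer for late forwarding.
        for s in source_slots:
            producer = self.by_slot.get(s)
            if producer is not None and producer.value is not None:
                producer.late_forward = True
        entry = RobEntry(self.next_slot)
        self.next_slot += 1
        self.fifo.append(entry)
        self.by_slot[entry.slot] = entry
        return entry.slot

    def writeback(self, slot: int, value: int) -> None:
        self.by_slot[slot].value = value

    def commit(self):
        """Retire the oldest entry; the flag says whether to drive a forwarding bus."""
        entry = self.fifo.popleft()
        del self.by_slot[entry.slot]
        return entry.slot, entry.value, entry.late_forward


if __name__ == "__main__":
    rob = SimpleRob()
    a = rob.dispatch()
    rob.writeback(a, 7)
    rob.dispatch(source_slots=[a])  # arrives after the result was written
    print(rob.commit())
```

In this sketch only results whose bit is set are rebroadcast at commit, which is the property that keeps the extra bus traffic small.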

  21. Late Forwarding: Use the Normal Forwarding Buses! [Figure: the datapath of slide 3.] Result/status forwarding buses: only 3.5% of the traffic is from SELECTIVE LATE FORWARDING.

  22. Performance Drop of Simplified ROB. Average IPC drop: 9.6% with no ROB read ports and SLF, 3.5% with 1 read port, 1.0% with 2 read ports. [Figure: performance drop (%) per benchmark. Integer panel: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., with one off-scale bar labeled 17%. Floating-point panel: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg., with one off-scale bar labeled 37%.]

  23. IPC Penalty: Source Value Not Accessible within the ROB. [Figure: timeline of the lifetime of a result value: result generation (forwarding), value within ROB, commitment (late forwarding), value within ARF.]

  24. Improving IPC with No Read Ports. Cache recently generated values in a set of RETENTION LATCHES (RL). Retention Latches are SMALL and FAST. Only 8 to 16 latches are needed in the set. The entire set has 1 or 2 read ports.

  25. Datapath with the Retention Latches. [Figure: the datapath of slide 3.]

  26. Datapath with the Retention Latches. [Figure: same diagram with the RETENTION LATCHES added.]

  27. The Structure of the Retention Latch Set. 8 or 16 latches, each with an L-ported CAM field (key = ROB_slot_id), a result value, and status. W write ports for writing up to W results in parallel. Lookup: L ROB slot addresses in (L = 1 or 2), L recently written results out (L = 1 or 2 works great).

  28. Retention Latch Management Strategies. FIFO: 42% hit rate with an 8-entry RL, 55% with a 16-entry RL. LRU: 56% hit rate with an 8-entry RL, 62% with a 16-entry RL. Random replacement: worse performance than FIFO.
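
The FIFO and LRU strategies compared above can be sketched as a small associative buffer keyed by ROB slot id. This is a minimal software model under stated assumptions, not the hardware described in the slides: the class `RetentionLatches` and its methods are hypothetical names, and the hit rates quoted on the slide are not reproduced by this code.

```python
from collections import OrderedDict


class RetentionLatches:
    """Toy model of a small latch set keyed by ROB slot id."""

    def __init__(self, size: int = 8, policy: str = "fifo") -> None:
        if policy not in ("fifo", "lru"):
            raise ValueError("policy must be 'fifo' or 'lru'")
        self.size = size
        self.policy = policy
        self.latches = OrderedDict()  # rob_slot_id -> value, eviction order first

    def write(self, rob_slot: int, value: int) -> None:
        if rob_slot in self.latches:
            del self.latches[rob_slot]        # rewrite moves the entry to the back
        elif len(self.latches) >= self.size:
            self.latches.popitem(last=False)  # evict the entry at the front
        self.latches[rob_slot] = value

    def read(self, rob_slot: int):
        """Return the cached value, or None on a miss."""
        if rob_slot not in self.latches:
            return None
        if self.policy == "lru":
            self.latches.move_to_end(rob_slot)  # a hit refreshes recency under LRU
        return self.latches[rob_slot]


if __name__ == "__main__":
    rl = RetentionLatches(size=2, policy="lru")
    rl.write(1, 10)
    rl.write(2, 20)
    rl.read(1)       # slot 1 becomes most recently used
    rl.write(3, 30)  # evicts slot 2 under LRU
    print(rl.read(1), rl.read(2))
```

Under FIFO the read path never reorders entries, so the same sequence would evict slot 1 instead of slot 2.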

  29. Hit Ratios to Retention Latches. Average hit ratio: FIFO 8 2: 42%; FIFO 16 2: 55%; LRU 8 2: 56%; LRU 16 2: 62%. [Figure: hit ratios (0 to 100) per benchmark. Integer panel: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating-point panel: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]

  30. Accessing Retention Latch Entries. The ROB index is used as a unique key in the Retention Latches to search for result values. Keys must remain unique even in the presence of: reuse of a ROB slot (not a problem for FIFO; for LRU, simply flush the RL entry at commit time); branch mispredictions.

  31. Handling Branch Mispredictions. Selective RL flushing: retention latch entries that are on the mispredicted path are flushed; uses branch tags; complicated implementation. Complete RL flushing: all retention latch entries are flushed; very simple implementation; performance drop is only 1.5% compared to selective flushing.
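
The difference between the two flushing schemes can be shown on a plain dictionary. This is a hedged sketch: the representation of the latch set as `{rob_slot: (value, branch_tag)}` and the function names are assumptions made for illustration, and the slides do not specify how branch tags are stored or compared.

```python
def flush_selective(latches: dict, squashed_tags: set) -> dict:
    """Drop only the entries tagged with a squashed (mispredicted) branch."""
    return {
        slot: (value, tag)
        for slot, (value, tag) in latches.items()
        if tag not in squashed_tags
    }


def flush_complete(latches: dict) -> dict:
    """Drop every entry, regardless of which path produced it."""
    return {}


if __name__ == "__main__":
    # Hypothetical contents: ROB slot -> (value, tag of the youngest older branch)
    latches = {3: (10, 0), 4: (20, 1), 5: (30, 1)}
    print(flush_selective(latches, {1}))
    print(flush_complete(latches))
```

Selective flushing keeps the correct-path entry for slot 3 at the cost of per-entry tag comparison, while complete flushing needs no tags at all.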

  32. Misprediction Handling: Performance. Average IPC drop of complete flushing relative to selective flushing: 1.5%. [Figure: IPC (0 to 3.5) per benchmark for selective and complete flushing: bzip, gap, gcc, gzip, mcf, pars, perl, twol, vort, vpr, appl, apsi, art, equ, mesa, mgrid, swim, wupw, Int. Avg., FP Avg.]

  33. Experimental Setup: the AccuPower (DATE’02). [Figure: tool flow with these blocks: compiled SPEC benchmarks; datapath specs; microarchitectural simulator (rooted in SimpleScalar); performance stats; transition counts and context information; VLSI layout data; SPICE deck; SPICE; SPICE measures of energy per transition; energy/power estimator; power/energy stats.]
