Filtered Runahead Execution with a Runahead Buffer, Milad Hashemi and Yale N. Patt, December 8, 2015


  1. Filtered Runahead Execution with a Runahead Buffer Milad Hashemi Yale N. Patt December 8, 2015

  2. Runahead Execution Overview • Runahead dynamically expands the instruction window when the pipeline is stalled [Mutlu et al., 2003] • The core checkpoints architectural state • The result of the memory operation that caused the stall is marked as poisoned in the physical register file • The core continues to fetch and execute instructions • Operations are discarded instead of retired
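The poison marking on slide 2 can be illustrated with a small sketch. This is a hypothetical model, not the authors' simulator: each operation is a `(destination, sources)` tuple, and the rule that poison propagates from any poisoned source to the destination is assumed here from the invalid-bit scheme of the runahead design cited on the slide. The names `Op` and `propagate_poison` are made up for illustration.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical encoding: (destination register, source registers)
Op = Tuple[str, Tuple[str, ...]]


def propagate_poison(ops: Iterable[Op], poisoned: Set[str]) -> List[bool]:
    """Return, for each op in program order, whether its result is poisoned."""
    poisoned = set(poisoned)  # copy so the caller's set is left untouched
    flags = []
    for dest, srcs in ops:
        is_poisoned = any(src in poisoned for src in srcs)
        if is_poisoned:
            poisoned.add(dest)
        else:
            poisoned.discard(dest)  # a valid write clears stale poison
        flags.append(is_poisoned)
    return flags


# Made-up example: R2 holds the result of the load that caused the stall.
ops = [
    ("R4", ("R2", "R1")),  # reads poisoned R2, so R4 becomes poisoned
    ("R5", ("R1", "R3")),  # independent of the miss, so R5 is valid
    ("R6", ("R4",)),       # reads poisoned R4, so R6 becomes poisoned
    ("R4", ("R5",)),       # valid overwrite clears the poison on R4
    ("R7", ("R4",)),       # reads the now-valid R4
]
flags = propagate_poison(ops, {"R2"})
```

In this toy trace only the second, fourth, and fifth operations produce usable values; a real core would track the bits per physical register rather than per name.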

  3. Core Stall Cycles [Chart: core stall cycles as % of total core cycles (y-axis 0 to 100) for the benchmarks calculix, povray, namd, gamess, perlbench, tonto, gromac, gobmk, dealII, sjeng, gcc, hmmer, h264, bzip2, astar, xalancbmk, zeusmp, cactus, wrf, GemsFDTD, leslie, omnetpp, milc, soplex, sphinx, bwaves, libquantum, lbm, mcf, and MI-Average]

  4. Core Stall Cycles [Chart: core stall cycles as % of total core cycles (y-axis 0 to 100) per benchmark, with a number annotated on each bar: calculix 3.0, povray 1.8, namd 2.2, gamess 2.4, perlbench 2.7, tonto 2.0, gromac 1.8, gobmk 1.6, dealII 2.3, sjeng 1.6, gcc 1.4, hmmer 1.7, h264 1.4, bzip2 2.1, astar 0.9, xalancbmk 1.4, zeusmp 1.4, cactus 1.3, wrf 1.5, GemsFDTD 0.9, leslie 1.2, omnetpp 0.7, milc 1.2, soplex 0.9, sphinx 0.8, bwaves 0.9, libquantum 0.4, lbm 0.7, mcf 0.3, MI-Average 0.9]

  5. Runahead Buffer Overview • Overview of Memory Dependence Chains • Traditional Runahead Observations • Runahead Buffer Proposal and Pipeline Modifications • Runahead Buffer System Configuration and Evaluation • Runahead Buffer Conclusions

  6. Background • Every load has a chain of operations that must be completed to generate the address of the memory access

  7. Example Dependence Chain • LD [R3] -> R5 • ADD R4, R5 -> R9 • ADD R9, R1 -> R6 • LD [R6] -> R8 (cache miss)

  8. Example Dependence Chain • LD [R3] -> R5 • ADD R4, R5 -> R9 • ADD R9, R1 -> R6 • LD [R6] -> R8 (cache miss) • These are the only operations that need to be completed before the cache miss can be executed

  9. Runahead Observations 1 [Chart: total operations executed during runahead (0% to 100%) per benchmark (calculix through mcf), split into Dependence Chain and Other Operation]

  10. Runahead Observations 1 • Traditional runahead executes many operations that are irrelevant to the dependence chain of a cache miss [Chart: total operations executed during runahead (0% to 100%) per benchmark (calculix through mcf), split into Dependence Chain and Other Operation]

  11. Runahead Observations 2 [Chart: total cache miss dependence chains (0% to 100%) per benchmark (calculix through mcf), split into Repeated Chain and Unique Chain]

  12. Runahead Observations 2 • Most dependence chains are repeated in traditional runahead [Chart: total cache miss dependence chains (0% to 100%) per benchmark (calculix through mcf), split into Repeated Chain and Unique Chain]

  13. Runahead Observations 3 [Chart: dependence chain length (y-axis 0 to 80) per benchmark (calculix through mcf), plus Average]

  14. Runahead Observations 3 • Most dependence chains are short [Chart: dependence chain length (y-axis 0 to 80) per benchmark (calculix through mcf), plus Average]

  15. Runahead Buffer • At a full window stall, dynamically identify the dependence chain to use during runahead from the reorder buffer • Once the chain is identified, we place it in a runahead buffer • The front-end is then clock-gated and the runahead buffer directly feeds decoded micro-ops into the back-end for runahead execution

  16. Runahead Buffer Pipeline Modifications [Diagram: pipeline stages Fetch, Decode, Rename, Select/Wakeup, Register Read, Execute, Commit, with additional blocks labeled RA-Buffer, RA-Cache, Poison Bits, Arch Checkpoint, and Pseudo-Wakeup]

  17. Runahead Buffer Chain Generation • Reorder buffer contents in program order (PC: architectural form, renamed form): • 0xA: LD [R0] -> R2, LD [P15] -> P2 • 0xD: LD [R3] -> R5, LD [P3] -> P5 • 0xE: ADD R4, R5 -> R7, ADD P4, P5 -> P9 • 0x7: ADD R7, R1 -> R6, ADD P9, P1 -> P6 • 0x8: MOV R6 -> R0, MOV P6 -> P7 • 0xA: LD [R0] -> R2, LD [P7] -> P8 • Source Register Search List as the search proceeds (the slide animates cycles 0 to 7): P7; P6; P9, P1; P1, P4, P5; P4, P5; P5; P3
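The chain generation animated on slide 17 can be approximated in software as a backward walk over the reorder buffer. The following is a minimal sketch under stated assumptions, not the hardware algorithm itself: `RobEntry` and `generate_chain` are hypothetical names, the blocking operation is taken to be the first ROB entry, the first younger entry with a matching PC starts the chain, and the search list is processed as a FIFO with one register examined per step.

```python
from collections import deque
from typing import List, NamedTuple, Tuple


class RobEntry(NamedTuple):
    pc: int
    text: str               # renamed form, for display only
    dest: str                # physical destination register
    srcs: Tuple[str, ...]    # physical source registers


def generate_chain(rob: List[RobEntry]) -> Tuple[List[RobEntry], List[List[str]]]:
    """Return (chain in program order, search-list contents after each step)."""
    head = rob[0]  # assumption: the blocking op sits at the head of the ROB
    # Assumption: use the first younger entry that shares the blocking op's PC.
    match = next((i for i in range(1, len(rob)) if rob[i].pc == head.pc), None)
    if match is None:
        return [], []
    in_chain = {match}
    search = deque(rob[match].srcs)
    history = [list(search)]
    while search:
        reg = search.popleft()
        # Look for the older entry that produces this physical register.
        for i in range(match - 1, -1, -1):
            if rob[i].dest == reg:
                if i not in in_chain:
                    in_chain.add(i)
                    search.extend(rob[i].srcs)
                break
        history.append(list(search))
    return [rob[i] for i in sorted(in_chain)], history


# Renamed instructions from the slide 17 example.
rob = [
    RobEntry(0xA, "LD [P15] -> P2", "P2", ("P15",)),
    RobEntry(0xD, "LD [P3] -> P5", "P5", ("P3",)),
    RobEntry(0xE, "ADD P4, P5 -> P9", "P9", ("P4", "P5")),
    RobEntry(0x7, "ADD P9, P1 -> P6", "P6", ("P9", "P1")),
    RobEntry(0x8, "MOV P6 -> P7", "P7", ("P6",)),
    RobEntry(0xA, "LD [P7] -> P8", "P8", ("P7",)),
]
chain, history = generate_chain(rob)
chain_text = [entry.text for entry in chain]
```

Registers such as P1, P4, and P3 have no producer in the ROB, so they simply drop out of the search list; the walk ends when the list is empty.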

  18. Runahead Buffer Optimizations • A small dependence chain cache (2-entries) improves performance • Hybrid Policy • The core begins traditional runahead execution instead of using the runahead buffer if: • An operation with the same PC as the operation that is blocking the ROB is not found in the ROB • The generated dependence chain is too long (more than 32 operations)
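The hybrid policy on slide 18 amounts to a two-condition check. A minimal sketch, assuming the decision can be expressed as a pure function; `choose_runahead_mode` and its parameters are hypothetical names, and only the 32-operation threshold comes from the slide.

```python
from typing import Sequence

MAX_CHAIN_LENGTH = 32  # threshold stated on the slide


def choose_runahead_mode(blocking_pc: int,
                         younger_rob_pcs: Sequence[int],
                         chain_length: int) -> str:
    """Return "buffer" or "traditional" following the two slide conditions."""
    # Condition 1: no other op in the ROB shares the blocking op's PC.
    if blocking_pc not in younger_rob_pcs:
        return "traditional"
    # Condition 2: the generated chain is longer than 32 operations.
    if chain_length > MAX_CHAIN_LENGTH:
        return "traditional"
    return "buffer"


mode_no_match = choose_runahead_mode(0xA, [0xD, 0xE], 0)
mode_short = choose_runahead_mode(0xA, [0xD, 0xA], 5)
mode_long = choose_runahead_mode(0xA, [0xA], 33)
```

In this sketch `chain_length` is only meaningful when a PC match exists, which is why the PC check comes first.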

  19. System Configuration • Single Core • 4-wide Issue • 192 Entry Reorder Buffer • Runahead Buffer: 32 Entry • Runahead Buffer Chain Cache: 2-Entries • Caches: 32 KB L1 I/D-Cache, 3-Cycle; 1MB Last Level Cache, 18-Cycle • Stream Prefetcher • Non-Uniform Access Latency DRAM System • 5 Configurations: Traditional Runahead; Runahead Buffer; Runahead Buffer + Chain Cache; Hybrid Policy; Traditional Runahead + Energy Optimizations

  20. Runahead Buffer Performance [Chart: % IPC difference over no-prefetching baseline (y-axis -5 to 40) per benchmark (calculix through mcf), plus GMean; series: Runahead, Runahead Buffer, Runahead Buffer + Chain Cache, Hybrid Policy]

  21. Runahead Buffer Performance [Chart: % IPC difference over no-prefetching baseline (y-axis -5 to 40); series: Runahead, Runahead Buffer, Runahead Buffer + Chain Cache, Hybrid Policy]

  22. Runahead Buffer MLP [Chart: cache misses per runahead interval (y-axis 0 to 18); series: Runahead, Runahead Buffer]

  23. Energy Analysis [Chart: % energy difference over no-PF baseline (y-axis ticks 0 to 2.5); series: Runahead, Runahead Enhancements, Runahead Buffer, Runahead Buffer + Chain Cache, Hybrid]

  24. Stall Cycles in Runahead Buffer Mode [Chart: % total cycles (y-axis 0% to 100%)]

  25. Stream Prefetching [Chart: % IPC difference over no-prefetching baseline (y-axis -20 to 160); series: Stream, Runahead + Stream, Runahead Buffer + Stream, Runahead Buffer + Chain Cache + Stream, Hybrid + Stream]

  26. Bandwidth Consumption [Chart: normalized bandwidth (y-axis 0.00 to 2.00); series: Stream, Runahead, Runahead Buffer, Runahead Buffer + Chain Cache, Hybrid]

  27. Energy Analysis [Chart: % energy difference over no-PF baseline (y-axis ticks 0 to 2.5); series: Baseline + Stream, Runahead + Stream, Runahead Enhancements + Stream, Runahead Buffer + Stream, Runahead Buffer + Chain Cache + Stream, Hybrid + Stream]

  28. Runahead Buffer Conclusions • Many of the operations that are executed in traditional runahead execution are unnecessary to generate cache misses • The runahead buffer uses filtered dependence chains that only contain the operations required for a cache miss • These chains are generally short • A chain is read into a buffer and speculatively executed as if it were in a loop when the core would otherwise be idle

  29. Runahead Buffer Conclusions • The runahead buffer enables the front-end to be idle for 47% of the total execution cycles of the medium and high memory intensity SPEC CPU2006 benchmarks • The runahead buffer generates over twice as much MLP on average as traditional runahead execution • The runahead buffer results in a 17.2% performance increase and 6.7% decrease in energy consumption over a system with no prefetching. Traditional runahead execution results in a 12.3% performance increase and 9.5% energy increase

  30. Filtered Runahead Execution with a Runahead Buffer Milad Hashemi Yale N. Patt December 8, 2015
