
Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching



  1. Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching
     Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, Konrad Lai
     peir@cise.ufl.edu
     Computer & Information Science and Engineering, University of Florida
     ICS’02 Peir

  2. It’s the Memory, Stupid
     • “3 out of 4 cycles are waiting for memory in Pentium-Pro and Alpha-21164, running TPC-C” (Richard Sites)
     • Cache performance: f(hit time, miss ratio, miss penalty)
     • Performance impact due to cache latency:
       – lw r1 <= 0(r2): issue, register, addgen, mem1, mem2, hit/miss, commit
       – add r3 <= r2, r1: issue, register, execute, commit (three stall cycles shown)
       – Minimum 3-cycle hit latency
       – Speculative issue for hit
       – Squashed / re-issued on miss (total 3 cycles)
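The slide leaves the function f unspecified. A common textbook instantiation, assumed here rather than taken from the talk, is the average memory access time (AMAT); the symbols t_hit, r_miss, and t_penalty are names chosen for this illustration.

```latex
\text{AMAT} = t_{\text{hit}} + r_{\text{miss}} \times t_{\text{penalty}}
```

With illustrative numbers (a 3-cycle hit time, a 5% miss ratio, and a 20-cycle miss penalty), this gives 3 + 0.05 × 20 = 4 cycles per access.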

  3. Outline
     • Introduction
     • Motivation and Related Work
     • Bloom Filtering Cache Misses
       – Partitioned-address BF
       – Partial-address BF
     • Pipeline Microarchitecture with BF
     • Performance Evaluation
     • Summary

  4. No Speculation vs. Perfect Scheduling
     [Figure: IPC (0 to 3) with no data speculation vs. a perfect scheduler, for Bzip, Gap, Gcc, Gzip, Mcf, Parser, Perl, Twolf, Vortex, Vpr, and Average]
     • No data speculation degrades IPC by 15-20% (SPECint2000)

  5. Related Work
     • Simple solution: assume loads always hit L1
     • Alpha 21264: 4-bit counter; +1 on hit, -2 on miss
       – Predict hit when counter >= 8; mini-recovery if miss
       – Predict miss when counter < 8; delay dependents if hit
     • Pentium-4: always hit, “replay” on miss
       – Aggressive way prediction, needs recovery
       – Replay for way prediction and load-dependent scheduling
     • Recovery buffer: free speculative instructions from the scheduling queue
     • Hybrid 2-level hit/miss predictor (like branch prediction)
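The counter rule quoted for the Alpha 21264 can be sketched in a few lines of Python. This is a minimal illustration of the rule as stated on the slide, not the actual hardware design: the class name `HitMissCounter`, its method names, and the initial value of 15 are assumptions made for the example.

```python
class HitMissCounter:
    """Saturating hit/miss counter: +1 on hit, -2 on miss.

    Predicts a hit when the value is >= 8 (for the 4-bit case).
    The initial value (saturated high) is an assumption, not from the slides.
    """

    def __init__(self, bits=4, initial=None):
        self.max_value = (1 << bits) - 1          # 15 for a 4-bit counter
        self.threshold = 1 << (bits - 1)          # 8 for a 4-bit counter
        self.value = self.max_value if initial is None else initial

    def predict_hit(self):
        return self.value >= self.threshold

    def update(self, was_hit):
        if was_hit:
            self.value = min(self.max_value, self.value + 1)
        else:
            self.value = max(0, self.value - 2)


counter = HitMissCounter()
for _ in range(4):            # four misses in a row: 15 -> 13 -> 11 -> 9 -> 7
    counter.update(was_hit=False)
print(counter.predict_hit())  # False: the counter has dropped below 8
```

Because a miss subtracts 2 while a hit adds only 1, the predictor switches to "miss" quickly and needs more hits to switch back.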

  6. Bloom Filter (BF) - Introduction
     • A probabilistic algorithm that tests membership in a large set by hashing into a bit array
     • A BF quickly filters out non-members without querying the large set
     • Filter cache misses
       – Accurate scheduling of load dependents
       – Overlap and reduce cache miss penalty
     • Filter other table accesses

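  7. Partitioned-Address BF
     • The requested line address is partitioned into fields A1, A2, A3, A4, one per counter array (BF1, BF2, BF3, BF4)
     • On a cache miss, increment the counters selected by the requested line address (A1 to A4)
     • On a cache miss, decrement the counters selected by the replaced line address (R1 to R4)
     • Lookup result: true means a guaranteed cache miss; false means a likely cache hit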
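A minimal Python sketch of the partitioned-address scheme described above. It assumes that a zero counter in any of the arrays is what signals the guaranteed miss; the class name `PartitionedAddressBF`, the four 6-bit field widths, and the unbounded integer counters are illustrative choices, not values from the talk.

```python
class PartitionedAddressBF:
    """Counting Bloom filter with one counter array per address field.

    Field widths are hypothetical; address bits above the listed fields
    are ignored in this sketch.
    """

    def __init__(self, field_widths=(6, 6, 6, 6)):
        self.field_widths = field_widths
        self.counters = [[0] * (1 << w) for w in field_widths]

    def _fields(self, line_addr):
        # Split the line address into consecutive bit fields, low bits first.
        fields = []
        for w in self.field_widths:
            fields.append(line_addr & ((1 << w) - 1))
            line_addr >>= w
        return fields

    def on_fill(self, requested, replaced=None):
        # Cache miss: count the incoming line in, count the victim out.
        for table, f in zip(self.counters, self._fields(requested)):
            table[f] += 1
        if replaced is not None:
            for table, f in zip(self.counters, self._fields(replaced)):
                table[f] -= 1

    def is_guaranteed_miss(self, line_addr):
        # Assumed rule: a zero counter in any array means no resident line
        # shares that field value, so the line cannot be in the cache.
        return any(table[f] == 0
                   for table, f in zip(self.counters, self._fields(line_addr)))


bf = PartitionedAddressBF()
bf.on_fill(0x123456)
print(bf.is_guaranteed_miss(0x123456))  # False: the line was just filled
print(bf.is_guaranteed_miss(0x123457))  # True: lowest field has a zero counter
```

A miss can escape the filter when each field of the requested address happens to match a different resident line, which is why a "false" result only means a likely hit.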

  8. Partial-Address BF
     • Requested line address: tag, index, offset; the partial address (p bits) indexes the BF array
     • Set bit on cache miss (requested line)
     • Reset bit on cache miss but no collision (partial address, p bits, of the replaced cache line)
     • Collision detector on the L1 cache tags: collision? (yes/no)
     • BF lookup: false means miss (guaranteed miss!); true means (may) hit
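The set/reset rule above can be sketched in Python as follows. This is a simplified model, not the hardware design: the cache contents are passed in as a plain collection of line addresses so that the collision check can be written as a scan, and the names `PartialAddressBF`, `on_fill`, and `resident_lines` are hypothetical.

```python
class PartialAddressBF:
    """One-bit-per-entry filter indexed by the low p bits of the line address."""

    def __init__(self, p_bits=12):
        self.mask = (1 << p_bits) - 1
        self.bits = [False] * (1 << p_bits)

    def partial(self, line_addr):
        return line_addr & self.mask

    def on_fill(self, requested, replaced, resident_lines):
        """Update the filter on a cache miss.

        resident_lines: line addresses in the cache after the fill
        (includes `requested`, excludes `replaced`).
        """
        if replaced is not None:
            p = self.partial(replaced)
            # Collision: another resident line shares the partial address,
            # so the bit must stay set.
            collision = any(self.partial(a) == p for a in resident_lines)
            if not collision:
                self.bits[p] = False
        self.bits[self.partial(requested)] = True

    def is_guaranteed_miss(self, line_addr):
        # A clear bit means no resident line has this partial address.
        return not self.bits[self.partial(line_addr)]


bf = PartialAddressBF(p_bits=4)
cache = {0x10}
bf.on_fill(0x10, None, cache)
print(bf.is_guaranteed_miss(0x11))  # True: bit 1 was never set
print(bf.is_guaranteed_miss(0x20))  # False: 0x20 aliases 0x10 in the low 4 bits
```

The second lookup shows the filter's only kind of error: an aliasing address is reported as "may hit" even though it is not cached, while a reported miss is always a real miss.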

  9. Virtual-Address BF
     • Benefit of cache miss filtering
       – Must be early, before dependent scheduling
       – Must be accurate
     • Filter cache misses using the virtual address
       – Must handle the address synonym problem
       – Collision detection needs special handling for physical caches
     • Partial-address (virtual) BFs
       – Separate collision detection from the cache tag path
       – Compare the replaced PA with all other PAs that have the same page offset
       – Reset the BF array bit only when no match is found

  10. Pipeline Execution with BF
      • lw r1 <= 0(r2): issue, register, addgen, mem1, mem2, hit/miss, commit (BF filtering runs alongside the load)
      • add r3 <= r2, r1: issue, register, execute, commit (three stall cycles shown)
      • The virtual-address BF filters cache misses 2 cycles earlier, but still one cycle too late for dependent scheduling
        – Delay one cycle for dependent scheduling
        – Always hit, precise recovery for single-cycle speculation

  11. Cache Miss Filtered by BF
      Load:        SCH REG AGN CA1 CA2 H/M, then L2 access (BF filters the miss and cancels dependents)
      Dependent:   SCH SCH REG EXE WRB CMT
      Independent: SCH REG EXE WRB CMT (no penalty)
      • Cache miss filtered by BF, one-cycle window
        – Precise recovery, reschedule only dependents (which have to wait for the miss anyway)
        – No penalty for independent instructions

  12. Cache Miss Not Filtered
      Load:        SCH REG AGN CA1 CA2 H/M, then L2 access (cache miss not filtered by BF; speculative window)
      Dependent:   SCH REG EXE Flush SCH REG EXE WRB CMT
      Dependent:   SCH Flush SCH REG EXE WRB CMT
      Independent: SCH REG EXE Flush SCH REG EXE WRB CMT (4-cycle penalty)
      Independent: SCH Flush SCH REG EXE WRB CMT (2-cycle penalty)

  13. Prefetching with BF
      Load:        SCH REG AGN CA1 CA2 H/M, then L2 access (BF filters the miss and cancels dependents)
      Dependent:   SCH SCH REG EXE WRB CMT
      Filtered miss triggers L2 access 2 cycles earlier: SCH REG EXE WRB CMT
      • A BF-filtered miss triggers the L2 access 2 cycles earlier
        – L1 miss is guaranteed
        – Applicable to other caches, TLB, branch prediction tables, etc.

  14. Predictors and Extra Hardware
      Prediction Method     Additional Table (Bits)
      Partitioned BF - 3    15360
      Partitioned BF - 4    4480
      Partial BF - 1x       512
      Partial BF - 4x       2048
      Partial BF - 16x      8192
      Partial BF - 64x      32768
      Always-Hit            0
      Counter - 1 (Alpha)   4
      Counter - 128         512
      Counter - 512         2048
      Counter - 2048        8192
      Counter - 8192        32768

  15. Cache Miss Filter Rate
      [Figure: cache miss filtering rate (%), 0 to 100, for Partition-3, Partial-1x, Partial-4x, Partial-16x, and Partial-64x, per benchmark and on average]
      • Partitioned BFs perform poorly; 97% of misses are filtered by Partial-16x

  16. Accuracy of Various Predictors
      [Figure: percentage correct/incorrect (50 to 100), split into correct hit/miss, incorrect-delay, and incorrect-cancel, for Always-hit, Counter-1, Counter-128, Counter-512, Counter-2048, Counter-8192, Partition-3, Partial-1x, Partial-4x, Partial-16x, and Partial-64x]
      • The Bloom Filter has NO incorrect-delay, i.e., a predicted miss is always a miss

  17. IPC Comparison
      [Figure: IPC (0 to 3) for No-speculation, Counter-1, Counter-2048, Always-hit, Partition-3, Partial-16x, Partial-16x-DP, Perfect-sch, and Perfect-sch-DP, across Bzip, Gap, Gcc, Gzip, Mcf, Parser, Perl, Twolf, Vortex, Vpr, and Average]

  18. Effect of Data Cache Size
      [Figure: IPC improvement (%), 8 to 20, for Perfect-sch-DP, Partial-16x-DP, Partial-16x, and Always-hit at cache sizes of 8KB, 16KB, 32KB, and 64KB]

  19. Impact of RUU
      [Figure: IPC improvement over Always-Hit (%), 0 to 10, for Partial-16x, Partial-16x-DP, Partial-16x-perfect, and Partial-16x-DP-perfect at ruu32, ruu64, and ruu128]

  20. Summary
      • Data speculation schedules load dependents without knowing the load latency
      • The Bloom Filter identifies 97% of the misses early using a small 1KB bit array
      • 19% IPC improvement over no-speculation, 6% IPC improvement over the always-hit method
      • Reaches 99.7% of the IPC of a perfect scheduler
