Trace Caches and optimizations therein CSE 240C - Rushi Chakrabarti

  1. Trace Caches and optimizations therein CSE 240C - Rushi Chakrabarti - Winter 2009

  2. Trace Caches [Rotenberg’96]

  3. Trace Caches For those not in the know: • an I$ that captures dynamic instruction sequences • a trace is at most • n instructions (cache line size) or • m basic blocks (branch predictor throughput) • + a starting address
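The limits above can be sketched as a small trace-construction loop. This is a minimal illustration, not the paper's mechanism: the block/instruction representation (lists of addresses) and the constants N_INSTR and M_BLOCKS are assumptions for demonstration.

```python
# Hypothetical sketch of trace construction: a trace is capped at
# N_INSTR instructions (cache line size) or M_BLOCKS basic blocks
# (branch predictor throughput), and is keyed by its starting address.
N_INSTR = 16   # assumed cache line size, in instructions
M_BLOCKS = 3   # assumed basic blocks per trace

def build_trace(blocks):
    """blocks: basic blocks (lists of instruction addresses) in the
    dynamic, predicted execution order."""
    trace = []
    for blk_count, blk in enumerate(blocks, start=1):
        if blk_count > M_BLOCKS:
            break                      # hit the basic-block limit
        if len(trace) + len(blk) > N_INSTR:
            break                      # would overflow the cache line
        trace.extend(blk)
    start_addr = trace[0] if trace else None
    return start_addr, trace

# Four 4-instruction blocks: the first three fit, the fourth is cut off.
blocks = [[0x100, 0x104, 0x108, 0x10C],
          [0x200, 0x204, 0x208, 0x20C],
          [0x300, 0x304, 0x308, 0x30C],
          [0x400, 0x404, 0x408, 0x40C]]
start, trace = build_trace(blocks)
```

Here the trace ends after three blocks (12 instructions), and 0x100 becomes the tag the fetch unit will match on.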

  4. Trace Caches • valid bit - is the trace valid? • tag - starting address • branch flags - predictor bits • mask - is the last instruction a branch? • fall thru - next address if the last branch is not taken • target - next address if the last branch is taken
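One way to picture those fields together is a toy trace-cache line with its hit test. The field names follow the slide; the encoding (boolean flag lists, a simple prefix match against predictions) is an illustrative assumption, not the paper's exact format.

```python
# Illustrative model of one trace cache line. A hit needs a valid
# line, a tag match on the fetch address, and branch predictions
# that agree with the flags stored when the trace was built.
class TraceLine:
    def __init__(self, tag, branch_flags, last_is_branch,
                 fall_thru, target, instrs):
        self.valid = True
        self.tag = tag                        # starting address
        self.branch_flags = branch_flags      # taken/not-taken per branch
        self.last_is_branch = last_is_branch  # the "mask" bit on the slide
        self.fall_thru = fall_thru            # next fetch addr, last branch not taken
        self.target = target                  # next fetch addr, last branch taken
        self.instrs = instrs

    def hit(self, fetch_addr, predictions):
        return (self.valid
                and self.tag == fetch_addr
                and predictions[:len(self.branch_flags)] == self.branch_flags)

line = TraceLine(tag=0x100, branch_flags=[True, False],
                 last_is_branch=True, fall_thru=0x140, target=0x200,
                 instrs=list(range(12)))
```

With this line installed, a fetch at 0x100 predicted taken-then-not-taken hits; any other address or prediction pattern falls back to the conventional I$.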

  5. Fill Units [Melvin’88] && [Franklin’94] • Originally proposed to take a stream of scalar instructions and compact them into VLIW-type instructions. • These instructions go in a shadow cache. • Sound familiar?

  6. Differences • Conceptually similar, but their aims differ. • Trace caches => high-bandwidth instruction fetching • Fill units => easing multiple-issue complexity

  7. The Fill Unit Today • Nowadays, papers refer to the fill unit as the mechanism that feeds trace caches

  8. Putting the Fill Unit to Work: Dynamic Optimizations for Trace Cache Microprocessors • Trace caches are more awesome than we thought, since they sit off the main fetch-issue pipeline. • This makes them latency tolerant. • So, we can introduce extra “logic” to optimize instructions as they are placed into the trace cache

  9. Optimization I • Register Moves • ADD Rx <- Ry + 0 is a pure move • Rename the output register to • the same physical register • the same operand tag as the source
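A minimal sketch of this move elimination, under assumptions of my own: traces are tuples of (op, dst, src1, src2), and a rename table stands in for the physical-register/operand-tag mapping the hardware would keep.

```python
# Hedged sketch of register-move elimination: "ADD Rx <- Ry + 0" is
# a pure move, so instead of keeping it, map Rx onto Ry and rewrite
# later readers of Rx to read Ry directly.
def eliminate_moves(trace, rename):
    """trace: list of (op, dst, src1, src2); rename maps an
    architectural register to the register that holds its value."""
    out = []
    for op, dst, src1, src2 in trace:
        src1 = rename.get(src1, src1)           # forward earlier renames
        src2 = rename.get(src2, src2) if src2 is not None else None
        if op == "ADD" and src2 == 0:
            rename[dst] = src1                  # alias dst to src1, drop the move
            continue
        rename.pop(dst, None)                   # dst redefined by a real op
        out.append((op, dst, src1, src2))
    return out

trace = [("ADD", "Rx", "Ry", 0),    # move: Rx <- Ry
         ("ADD", "Rz", "Rx", "Rw")] # consumer now reads Ry directly
optimized = eliminate_moves(trace, {})
```

The move disappears from the trace and its consumer is rewritten to source Ry, which is the effect of giving Rx the same operand tag.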

  10. Optimization I

  11. Optimization II • Reassociation • ADD Rx <- Ry + 4 • ADD Rz <- Rx + 4 => ADD Rz <- Ry + 8 • (Does so across control flow boundaries)
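The collapse of the add-immediate chain can be sketched as a constant-folding pass over the trace. The (dst, src, imm) tuple format and the base table are assumptions for illustration, not the paper's hardware structures.

```python
# Hedged sketch of reassociation: "ADD Rx <- Ry + 4; ADD Rz <- Rx + 4"
# collapses to "ADD Rz <- Ry + 8", so Rz no longer waits on Rx.
def reassociate(trace):
    """trace: list of (dst, src, imm) add-immediate instructions."""
    base = {}   # reg -> (root_reg, accumulated_constant)
    out = []
    for dst, src, imm in trace:
        root, acc = base.get(src, (src, 0))  # chase back to the chain root
        base[dst] = (root, acc + imm)        # fold constants along the chain
        out.append((dst, root, acc + imm))
    return out

trace = [("Rx", "Ry", 4),
         ("Rz", "Rx", 4)]
optimized = reassociate(trace)
```

Both instructions now depend only on Ry, so they can issue in the same cycle; the fill unit's latency tolerance is what makes this analysis affordable.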

  12. Optimization II

  13. Optimization III • Scaled Adds • SHIFT Rw <- Rx << 1 • ADD Ry <- Rw + Rz => • SCALEADD Ry <- (Rx << 1) + Rz • (Limit to 3-bit shifts)
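The shift/add fusion can be sketched as a peephole over adjacent trace entries. The tuple encodings are assumptions, and for simplicity this sketch assumes the shift's destination has no other consumers (real hardware would have to check that).

```python
# Hedged sketch of scaled-add fusion:
#   SHIFT Rw <- Rx << s  ;  ADD Ry <- Rw + Rz
# becomes SCALEADD Ry <- (Rx << s) + Rz, for shifts up to 3 bits.
def fuse_scaled_adds(trace):
    out = []
    i = 0
    while i < len(trace):
        ins = trace[i]                                  # ("SHIFT", dst, src, amt)
        nxt = trace[i + 1] if i + 1 < len(trace) else None
        if (ins[0] == "SHIFT" and ins[3] <= 3 and nxt
                and nxt[0] == "ADD" and ins[1] in (nxt[2], nxt[3])):
            other = nxt[3] if nxt[2] == ins[1] else nxt[2]
            # assumes ins[1] (the shift result) has no other consumers
            out.append(("SCALEADD", nxt[1], ins[2], ins[3], other))
            i += 2
            continue
        out.append(ins)
        i += 1
    return out

fused = fuse_scaled_adds([("SHIFT", "Rw", "Rx", 1),
                          ("ADD", "Ry", "Rw", "Rz")])
```

Two dependent instructions become one, shortening the critical path through the trace; the 3-bit cap keeps the fused ALU cheap.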

  14. Optimization III

  15. Optimization IV • Instruction Placement • Operand bypassing, etc., can be a burden • If we can place instructions in a better order to ease this, we can see some performance gains.

  16. Optimization IV

  17. Combined Results

  18. Instruction Path Coprocessors • Programmable on-chip coprocessor • Has its own ISA • Operates on core instructions to transform them into an efficient internal format

  19. What good are these? • Example: Intel P6 • Converts x86 into uops (CISC on RISC) • Since it operates on instructions and sits outside the main pipeline, it is perfect for...fill units

  20. OG I-COP [Chou’00] • All about dynamic code modification • No change to the ISA or to core HW necessary • However, the compiler-generated object code isn’t what actually runs

  21. OG I-COP • The original implementation was statically scheduled and exploited parallelism using VLIW • Each I-COP can have more than one VLIW engine (called a slice). This helps with ILP

  22. So what’s wrong? • Takes quite a bit of hardware: many replicated slices, each needing its own I-mem • Takes up a lot of area on the chip

  23. Enter PipeRench • Reconfigurable fabric for computation (originally aimed at stream/media applications) • This lets us map programs to hardware. • The key to PipeRench is that reconfiguration is fast.

  24. Reconfiguration • Reconfiguration is done using a “scrolling window”

  25. PipeRench • More Hardware = More Throughput

  26. Pipelined • Virtual stripes allow for efficient area usage

  27. Inside the Stripes • Using 0.18um, 1 stripe is 1.03 sq mm

  28. PipeRench Roadmap ‘97 • 28 stripes in .35 um tech • 32 PEs in each stripe • 512 stripes of configuration cache (18 configs) • Speed: 100MHz

  29. Performance Example • IDEA Encryption (Symmetric Encryption for PGP) • 232 virtual stripes, 64 bits wide • PipeRench: 940MB/sec • ASIC: 177 Mb/sec in 1993 • ASIC: 2GB/sec in 1997 • Pentium ~ 1Mb/sec • Using 232 rows => 7.8 GB/sec
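The last bullet appears to follow from linear scaling with stripe count. A quick hedged check, assuming the 940 MB/sec figure corresponds to the 28-stripe '97 roadmap fabric and that throughput scales proportionally with physical stripes:

```python
# Sanity check of the slide's scaling claim: throughput grows with
# physical stripes, so scaling the baseline figure to 232 rows
# should land near the quoted 7.8 GB/sec.
baseline_mb_s = 940      # measured IDEA throughput from the slide
baseline_stripes = 28    # assumption: the '97 roadmap fabric
rows = 232

scaled_gb_s = baseline_mb_s * (rows / baseline_stripes) / 1000
print(round(scaled_gb_s, 1))
```

That comes out at roughly 7.8 GB/sec, matching the slide, which suggests the 7.8 GB/sec line is indeed the 940 MB/sec result scaled linearly to 232 physical rows.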

  30. DIL • PipeRench configurations are written in Dataflow Intermediate Language • Output is a set of configuration bits (one set per virtual stripe).

  31. PipeRench advantages • Write DIL once, # of physical stripes doesn’t matter • Apply DIL code selectively at run-time

  32. PipeRench I-COP • Use PipeRench to implement I-COPs. • Compare to original VLIW I-COP implementation. • See where the best trade-off point is.

  33. Dynamic Code Modifications • Trace Cache Fill Unit => 11 V-stripes • Register Move (done for trace run +5x) • 22 V-stripes (plus 11) • Stride Data Prefetching => 14 V-stripes • LDS Prefetching => 9 V-Stripes

  34. VLIW Equivalents • Trace construction => 3 PL, 15 physical • Register Move => IPC 2.69 to 2.72 • Stride Prefetch => Reduce to only 9 physical stripes

  35. Area Evaluation • If we maintain the I-COP at .5 core speed • 33 physical stripes is ~ 34 sq mm. • 9 physical stripes ~ 9.27 sq mm

  36. ?
