Trace Caches and optimizations therein CSE 240C - Rushi Chakrabarti - Winter 2009
Trace Caches [Rotenberg’96]
Trace Caches
For those not in the know:
• An I$ that captures dynamic instruction sequences ("traces")
• A trace ends after n instructions (cache line size) or m basic blocks (branch predictor throughput)
• Each trace is identified by its starting address
Trace Caches
A trace cache line holds:
• valid bit - is the trace valid?
• tag - starting address
• branch flags - predictor bits for the embedded branches
• mask - is the last instruction a branch?
• fall thru - next address if the last branch is not taken
• target - next address if the last branch is taken
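The trace-selection rule above (stop at n instructions or m basic blocks, tag by starting address) can be sketched in Python; `Instr`, `N_MAX`, and `M_MAX` are illustrative names, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class Instr:
    pc: int
    is_branch: bool
    taken: bool = False  # dynamic outcome, known when the instruction retires

N_MAX = 16  # n: cache-line size limit, in instructions
M_MAX = 3   # m: basic-block limit (branch-predictor throughput)

def build_trace(stream):
    """Collect one trace line from a (non-empty) retired-instruction stream."""
    trace, branch_flags = [], []
    for instr in stream:
        trace.append(instr)
        if instr.is_branch:
            branch_flags.append(instr.taken)
        # terminate at n instructions or m basic blocks
        if len(trace) == N_MAX or len(branch_flags) == M_MAX:
            break
    return {
        "tag": trace[0].pc,            # starting address identifies the trace
        "instrs": trace,
        "branch_flags": branch_flags,  # embedded branch outcomes
    }
```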
Fill Units [Melvin’88] && [Franklin’94] • Originally proposed to take a stream of scalar instructions and compact them into VLIW-type instructions. • These instructions go into a shadow cache. • Sound familiar?
Differences • Conceptually similar, but their aims differ. • Trace caches => high-BW instruction fetching • Fill units => easing multiple-issue complexity
The Fill Unit Today • Nowadays, papers refer to the fill unit as the mechanism that feeds trace caches
Putting the Fill Unit to Work: Dynamic Optimizations for Trace Cache Microprocessors • Trace caches are more awesome than we thought, since they sit off the main fetch-issue pipeline. • This makes them latency tolerant. • So we can introduce extra “logic” to optimize instructions as they are placed into the trace cache
Optimization I • Register Moves • ADD Rx <- Ry + 0 • Rename output register to • same physical register • same operand tag
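A minimal sketch of how renaming can absorb such a move: the rename table simply aliases Rx to Ry's physical register, so no ALU operation is issued. The instruction encoding and table layout here are assumptions for illustration:

```python
def rename(instrs, num_arch_regs=8):
    """Rename (op, dst, src, imm) instructions; ADD-with-0 moves are dropped."""
    table = {r: r for r in range(num_arch_regs)}  # arch reg -> physical reg
    next_phys = num_arch_regs
    out = []
    for op, dst, src, imm in instrs:
        if op == "ADD" and imm == 0:
            # register move: alias dst to src's physical reg, issue nothing
            table[dst] = table[src]
            continue
        phys_src = table[src]
        table[dst] = next_phys  # allocate a fresh physical register
        out.append((op, next_phys, phys_src, imm))
        next_phys += 1
    return out
```

Running this on `ADD R1 <- R0 + 0` followed by `ADD R2 <- R1 + 4` issues only the second add, which now reads R0's physical register directly.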
Optimization II • Reassociation • ADD Rx <- Ry + 4 • ADD Rz <- Rx + 4 => ADD Rz <- Ry + 8 • (Does so across control flow boundaries)
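The constant-folding step can be sketched as below; this toy version assumes the base register is not redefined between the two adds, which a real fill unit would have to check:

```python
def reassociate(instrs):
    """instrs: list of add-immediates (dst, src, imm), in program order."""
    defs = {}  # dst -> (src, imm) for the add-immediate that produced it
    out = []
    for dst, src, imm in instrs:
        if src in defs:                # producer is itself an add-immediate
            base, k = defs[src]
            src, imm = base, imm + k   # e.g. ADD Rz <- Ry + (4 + 4)
        defs[dst] = (src, imm)
        out.append((dst, src, imm))
    return out
```

The rewritten second add depends on Ry rather than Rx, so the two adds can execute in parallel.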
Optimization III • Scaled Adds • SHIFT Rw <- Rx << 1 • ADD Ry <- Rw + Rz => • SCALEADD Ry <- (Rx << 1) + Rz • (Limit to 3-bit shifts)
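A hedged sketch of the shift+add fusion, keeping the 3-bit shift limit from the slide; opcode names and tuple encodings are illustrative:

```python
def fuse_scaled_adds(instrs):
    """Fuse SHIFT(dst, src, amount) feeding ADD(dst, src1, src2) into SCALEADD."""
    out = []
    for instr in instrs:
        if (instr[0] == "ADD" and out and out[-1][0] == "SHIFT"
                and out[-1][1] == instr[2]   # shift result feeds the add
                and out[-1][3] <= 3):        # limit to 3-bit shifts
            _, _, rx, sh = out.pop()         # drop the SHIFT
            _, ry, _, rz = instr
            # SCALEADD Ry <- (Rx << sh) + Rz
            out.append(("SCALEADD", ry, rx, sh, rz))
        else:
            out.append(instr)
    return out
```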
Optimization IV • Instruction Placement • Operand bypassing, etc., can be a burden • If we place instructions in a better order to ease this, we can gain some performance.
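One way to model dependence-aware placement is a greedy pass that tries to keep each consumer at least a few slots behind its producer, as a stand-in for limited bypass paths. This is a toy model under those assumptions, not the paper's actual heuristic:

```python
def place(instrs, gap=2):
    """instrs: list of (name, srcs, dst), single assignment, program order.

    Greedily emit any instruction whose in-trace producers are already
    placed at least `gap` slots earlier; fall back to program order when
    no instruction satisfies the gap (so dependences are never violated).
    """
    defined = {dst for _, _, dst in instrs}  # regs produced inside the trace
    slot_of = {}                             # reg -> slot that produced it
    placed, pending = [], list(instrs)
    while pending:
        for i, (name, srcs, dst) in enumerate(pending):
            if all(src not in defined or
                   (src in slot_of and len(placed) - slot_of[src] >= gap)
                   for src in srcs):
                slot_of[dst] = len(placed)
                placed.append(name)
                pending.pop(i)
                break
        else:
            # pending[0]'s producers are always already placed, so this is safe
            name, srcs, dst = pending.pop(0)
            slot_of[dst] = len(placed)
            placed.append(name)
    return placed
```

For example, two independent producer/consumer pairs get interleaved so each consumer sits two slots behind its producer.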
Combined Results
Instruction Path Coprocessors • Programmable on-chip coprocessor • Has its own ISA • Operates on core instructions, transforming them into an efficient internal format
What good are these? • Example: Intel P6 • Converts x86 into uops (CISC on RISC) • Since it operates on instructions and sits outside the main pipeline, it is a perfect fit for... fill units
OG I-COP [Chou’00] • All about dynamic code modification • No change to the ISA or HW necessary • However, the compiler-generated object code isn’t what is actually being run
OG I-COP • The original implementation was statically scheduled, exploiting parallelism via VLIW • Each I-COP can have more than one VLIW engine (called slices), which helps with ILP
So what’s wrong? • Takes quite a bit of hardware: many replicated slices, each needing its own I-mem • Takes up a lot of area on the chip
Enter PipeRench • A reconfigurable fabric for computation (originally aimed at stream/media applications) • This lets us map programs onto hardware. • The key claim of PipeRench is that reconfiguration is fast.
Reconfiguration • Reconfiguration is done using a “scrolling window”
PipeRench • More Hardware = More Throughput
Pipelined • Virtual stripes allow for efficient area usage
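The scrolling-window idea above can be sketched as a virtual-to-physical stripe mapping, assuming one stripe is (re)configured per cycle; this is a simplified model of PipeRench's scheduling, with illustrative function names:

```python
def stripe_schedule(V, P):
    """Map each of V virtual stripes to (physical stripe, load cycle),
    loading one configuration per cycle in round-robin order."""
    return {v: (v % P, v) for v in range(V)}

def occupancy(V, P, t):
    """Which virtual stripe each of the P physical stripes holds at cycle t
    (None if nothing has been loaded there yet). A virtual stripe stays
    resident until the scrolling window overwrites it with stripe v + P."""
    held = {}
    for p in range(P):
        loaded = [v for v in range(V) if v % P == p and v <= t]
        held[p] = max(loaded) if loaded else None
    return held
```

With more physical stripes than virtual stripes, the whole pipeline stays resident; with fewer, the window keeps cycling configurations through, trading throughput for area.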
Inside the Stripes • Using 0.18um, 1 stripe is 1.03 sq mm
PipeRench Roadmap ‘97 • 28 stripes in .35 um tech • 32 PEs in each stripe • 512 stripes of configuration cache (18 configs) • Speed: 100MHz
Performance Example
• IDEA Encryption (symmetric encryption used in PGP)
• 232 virtual stripes, 64 bits wide
• PipeRench: 940 MB/sec
• ASIC: 177 Mb/sec in 1993
• ASIC: 2 GB/sec in 1997
• Pentium ~ 1 Mb/sec
• Using 232 rows => 7.8 GB/sec
DIL • PipeRench configurations are written in Dataflow Intermediate Language • Output is a set of configuration bits (one set per virtual stripe).
PipeRench advantages • Write DIL once, # of physical stripes doesn’t matter • Apply DIL code selectively at run-time
PipeRench I-COP • Use PipeRench to implement I-COPs. • Compare to original VLIW I-COP implementation. • See where the best trade-off point is.
Dynamic Code Modifications
• Trace Cache Fill Unit => 11 V-stripes
• Register Move (done for trace run +5x) => 22 V-stripes (plus the 11)
• Stride Data Prefetching => 14 V-stripes
• LDS Prefetching => 9 V-stripes
VLIW Equivalents • Trace construction => 3 PL, 15 physical • Register Move => IPC 2.69 to 2.72 • Stride Prefetch => Reduce to only 9 physical stripes
Area Evaluation • If we run the I-COP at 0.5x core speed: • 33 physical stripes ~ 34 sq mm • 9 physical stripes ~ 9.27 sq mm
Questions?