Trace Caches and optimizations therein CSE 240C - Rushi Chakrabarti - Winter 2009
Trace Caches [Rotenberg’96]
Trace Caches
For those not in the know:
• An I$ that captures dynamic instruction sequences ("traces")
• A trace ends after n instructions (cache line size) or m basic blocks (branch predictor throughput)
• Each trace is identified by its starting address
Trace Caches
A trace cache line holds:
• valid bit - is the trace valid?
• tag - starting address
• branch flags - predictor bits for the embedded branches
• mask - is the last instruction a branch?
• fall thru - next address if the last branch is not taken
• target - next address if the last branch is taken
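The trace-selection rule above (stop at n instructions or m basic blocks, tag by starting address) can be sketched in Python; `Instr`, `N_MAX`, and `M_MAX` are illustrative names, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class Instr:
    pc: int
    is_branch: bool
    taken: bool = False  # dynamic outcome, known when the instruction retires

N_MAX = 16  # n: cache-line size limit, in instructions
M_MAX = 3   # m: basic-block limit (branch-predictor throughput)

def build_trace(stream):
    """Collect one trace line from a (non-empty) retired-instruction stream."""
    trace, branch_flags = [], []
    for instr in stream:
        trace.append(instr)
        if instr.is_branch:
            branch_flags.append(instr.taken)
        # terminate at n instructions or m basic blocks
        if len(trace) == N_MAX or len(branch_flags) == M_MAX:
            break
    return {
        "tag": trace[0].pc,            # starting address identifies the trace
        "instrs": trace,
        "branch_flags": branch_flags,  # embedded branch outcomes
    }
```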
Fill Units [Melvin’88] && [Franklin’94] • Originally proposed to take a stream of scalar instructions and compact them into VLIW-type instructions. • These instructions go into a shadow cache. • Sound familiar?
Differences • Conceptually similar, but their aims differ. • Trace caches => high-BW instruction fetching • Fill units => easing multiple-issue complexity
The Fill Unit Today • Nowadays, papers refer to the fill unit as the mechanism that feeds trace caches
Putting the Fill Unit to Work: Dynamic Optimizations for Trace Cache Microprocessors • Trace caches are more awesome than we thought, since they sit off the main fetch-issue pipeline. • This makes them latency tolerant. • So we can introduce extra “logic” to optimize instructions as they are placed into the trace cache
Optimization I • Register Moves • ADD Rx <- Ry + 0 • Rename output register to • same physical register • same operand tag
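A minimal sketch of how renaming can absorb such a move: the rename table simply aliases Rx to Ry's physical register, so no ALU operation is issued. The instruction encoding and table layout here are assumptions for illustration:

```python
def rename(instrs, num_arch_regs=8):
    """Rename (op, dst, src, imm) instructions; ADD-with-0 moves are dropped."""
    table = {r: r for r in range(num_arch_regs)}  # arch reg -> physical reg
    next_phys = num_arch_regs
    out = []
    for op, dst, src, imm in instrs:
        if op == "ADD" and imm == 0:
            # register move: alias dst to src's physical reg, issue nothing
            table[dst] = table[src]
            continue
        phys_src = table[src]
        table[dst] = next_phys  # allocate a fresh physical register
        out.append((op, next_phys, phys_src, imm))
        next_phys += 1
    return out
```

Running this on `ADD R1 <- R0 + 0` followed by `ADD R2 <- R1 + 4` issues only the second add, which now reads R0's physical register directly.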
Optimization II • Reassociation • ADD Rx <- Ry + 4 • ADD Rz <- Rx + 4 => ADD Rz <- Ry + 8 • (Does so across control flow boundaries)
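The constant-folding step can be sketched as below; this toy version assumes the base register is not redefined between the two adds, which a real fill unit would have to check:

```python
def reassociate(instrs):
    """instrs: list of add-immediates (dst, src, imm), in program order."""
    defs = {}  # dst -> (src, imm) for the add-immediate that produced it
    out = []
    for dst, src, imm in instrs:
        if src in defs:                # producer is itself an add-immediate
            base, k = defs[src]
            src, imm = base, imm + k   # e.g. ADD Rz <- Ry + (4 + 4)
        defs[dst] = (src, imm)
        out.append((dst, src, imm))
    return out
```

The rewritten second add depends on Ry rather than Rx, so the two adds can execute in parallel.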
Optimization III • Scaled Adds • SHIFT Rw <- Rx << 1 • ADD Ry <- Rw + Rz => • SCALEADD Ry <- (Rx << 1) + Rz • (Limit to 3-bit shifts)
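A hedged sketch of the shift+add fusion, keeping the 3-bit shift limit from the slide; opcode names and tuple encodings are illustrative:

```python
def fuse_scaled_adds(instrs):
    """Fuse SHIFT(dst, src, amount) feeding ADD(dst, src1, src2) into SCALEADD."""
    out = []
    for instr in instrs:
        if (instr[0] == "ADD" and out and out[-1][0] == "SHIFT"
                and out[-1][1] == instr[2]   # shift result feeds the add
                and out[-1][3] <= 3):        # limit to 3-bit shifts
            _, _, rx, sh = out.pop()         # drop the SHIFT
            _, ry, _, rz = instr
            # SCALEADD Ry <- (Rx << sh) + Rz
            out.append(("SCALEADD", ry, rx, sh, rz))
        else:
            out.append(instr)
    return out
```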
Optimization IV • Instruction Placement • Operand bypassing, etc., can be a burden • If we place instructions in a better order to ease this, we can gain some performance.
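One way to model dependence-aware placement is a greedy pass that tries to keep each consumer at least a few slots behind its producer, as a stand-in for limited bypass paths. This is a toy model under those assumptions, not the paper's actual heuristic:

```python
def place(instrs, gap=2):
    """instrs: list of (name, srcs, dst), single assignment, program order.

    Greedily emit any instruction whose in-trace producers are already
    placed at least `gap` slots earlier; fall back to program order when
    no instruction satisfies the gap (so dependences are never violated).
    """
    defined = {dst for _, _, dst in instrs}  # regs produced inside the trace
    slot_of = {}                             # reg -> slot that produced it
    placed, pending = [], list(instrs)
    while pending:
        for i, (name, srcs, dst) in enumerate(pending):
            if all(src not in defined or
                   (src in slot_of and len(placed) - slot_of[src] >= gap)
                   for src in srcs):
                slot_of[dst] = len(placed)
                placed.append(name)
                pending.pop(i)
                break
        else:
            # pending[0]'s producers are always already placed, so this is safe
            name, srcs, dst = pending.pop(0)
            slot_of[dst] = len(placed)
            placed.append(name)
    return placed
```

For example, two independent producer/consumer pairs get interleaved so each consumer sits two slots behind its producer.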
Combined Results
Instruction Path Coprocessors • Programmable on-chip coprocessor • Has its own ISA • Operates on core instructions, transforming them into an efficient internal format
What good are these? • Example: Intel P6 • Converts x86 into uops (CISC on RISC) • Since it operates on instructions and sits outside the main pipeline, it is a perfect fit for... fill units
OG I-COP [Chou’00] • All about dynamic code modification • No change to the ISA or HW necessary • However, the compiler-generated object code isn’t what is actually being run
OG I-COP • The original implementation was statically scheduled, exploiting parallelism via VLIW • Each I-COP can have more than one VLIW engine (called slices), which helps with ILP
So what’s wrong? • Takes quite a bit of hardware: many replicated slices, each needing its own I-mem • Takes up a lot of area on the chip
Enter PipeRench • A reconfigurable fabric for computation (originally aimed at stream/media applications) • This lets us map programs onto hardware. • The key claim of PipeRench is that reconfiguration is fast.
Reconfiguration • Reconfiguration is done using a “scrolling window”
PipeRench • More Hardware = More Throughput
Pipelined • Virtual stripes allow for efficient area usage
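The scrolling-window idea above can be sketched as a virtual-to-physical stripe mapping, assuming one stripe is (re)configured per cycle; this is a simplified model of PipeRench's scheduling, with illustrative function names:

```python
def stripe_schedule(V, P):
    """Map each of V virtual stripes to (physical stripe, load cycle),
    loading one configuration per cycle in round-robin order."""
    return {v: (v % P, v) for v in range(V)}

def occupancy(V, P, t):
    """Which virtual stripe each of the P physical stripes holds at cycle t
    (None if nothing has been loaded there yet). A virtual stripe stays
    resident until the scrolling window overwrites it with stripe v + P."""
    held = {}
    for p in range(P):
        loaded = [v for v in range(V) if v % P == p and v <= t]
        held[p] = max(loaded) if loaded else None
    return held
```

With more physical stripes than virtual stripes, the whole pipeline stays resident; with fewer, the window keeps cycling configurations through, trading throughput for area.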
Inside the Stripes • Using 0.18um, 1 stripe is 1.03 sq mm
PipeRench Roadmap ‘97 • 28 stripes in .35 um tech • 32 PEs in each stripe • 512 stripes of configuration cache (18 configs) • Speed: 100MHz
Performance Example
• IDEA Encryption (symmetric encryption used in PGP)
• 232 virtual stripes, 64 bits wide
• PipeRench: 940 MB/sec
• ASIC: 177 Mb/sec in 1993
• ASIC: 2 GB/sec in 1997
• Pentium ~ 1 Mb/sec
• Using 232 rows => 7.8 GB/sec
DIL • PipeRench configurations are written in Dataflow Intermediate Language • Output is a set of configuration bits (one set per virtual stripe).
PipeRench advantages • Write DIL once, # of physical stripes doesn’t matter • Apply DIL code selectively at run-time
PipeRench I-COP • Use PipeRench to implement I-COPs. • Compare to original VLIW I-COP implementation. • See where the best trade-off point is.
Dynamic Code Modifications
• Trace Cache Fill Unit => 11 V-stripes
• Register Move (done for trace run +5x) => 22 V-stripes (plus the 11)
• Stride Data Prefetching => 14 V-stripes
• LDS Prefetching => 9 V-stripes
VLIW Equivalents • Trace construction => 3 PL, 15 physical • Register Move => IPC 2.69 to 2.72 • Stride Prefetch => Reduce to only 9 physical stripes
Area Evaluation • If we run the I-COP at 0.5x core speed: • 33 physical stripes ~ 34 sq mm • 9 physical stripes ~ 9.27 sq mm
Questions?