Lecture 16: Reducing Cache Miss Penalty and Exploiting Memory Parallelism



  1. Lecture 16: Reducing Cache Miss Penalty and Exploiting Memory Parallelism. Critical word first, read priority over writes, merging write buffer, non-blocking cache, stream buffer, and software prefetching. Adapted from UC Berkeley CS252 S01

  2. Improving Cache Performance
     1. Reducing miss rates
        - Larger block size
        - Larger cache size
        - Higher associativity
        - Victim caches
        - Way prediction and pseudoassociativity
        - Compiler optimization
     2. Reducing miss penalty
        - Multilevel caches
        - Critical word first
        - Read miss first
        - Merging write buffers
     3. Reducing miss penalty or miss rate via parallelism
        - Non-blocking caches
        - Hardware prefetching
        - Compiler prefetching
     4. Reducing cache hit time
        - Small and simple caches
        - Avoiding address translation
        - Pipelined cache access
        - Trace caches

  3. Early Restart and Critical Word First
     Don't wait for the full block to be loaded before restarting the CPU:
     - Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
     - Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch or requested word first.
     Generally useful only for large blocks (relative to memory bandwidth).
     Good spatial locality may reduce the benefit of early restart, since the next sequential word may be needed anyway.
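To see the potential gain, here is a small back-of-the-envelope calculation in C; the block size, bus width, and latencies are assumed example values, not figures from the lecture.

```c
#include <stdio.h>

/* Illustrative parameters (assumptions, not from the lecture). */
#define BLOCK_BYTES   64   /* cache block size            */
#define BUS_BYTES      8   /* bytes transferred per beat  */
#define ACCESS_CYCLES 30   /* latency until the first beat */
#define BEAT_CYCLES    2   /* cycles per transfer beat     */

int main(void) {
    int beats = BLOCK_BYTES / BUS_BYTES;

    /* Without the optimization: the CPU restarts only after the full block arrives. */
    int full_block = ACCESS_CYCLES + beats * BEAT_CYCLES;

    /* Critical word first: memory returns the requested word in the first beat,
       so the CPU restarts after one beat; the rest of the block streams in
       while execution continues. */
    int critical_first = ACCESS_CYCLES + 1 * BEAT_CYCLES;

    printf("wait for full block : %d cycles\n", full_block);
    printf("critical word first : %d cycles\n", critical_first);
    return 0;
}
```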

  4. Read Priority over Write on Miss
     Write-through caches with write buffers create RAW conflicts between buffered writes and main-memory reads on cache misses.
     - Simply waiting for the write buffer to empty can increase the read miss penalty (by 50% on the old MIPS 1000).
     - Instead, check the write buffer contents before the read; if there is no conflict, let the memory access continue (see the sketch after this slide).
     - Usually used with no-write allocate and a write buffer.
     Write-back caches also want a buffer to hold displaced dirty blocks.
     - On a read miss that replaces a dirty block:
       - Normal: write the dirty block to memory, then do the read.
       - Instead: copy the dirty block to a write buffer, do the read, and then do the write.
     - The CPU stalls less since it restarts as soon as the read completes.
     - Usually used with write allocate and a write-back buffer.
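The write-buffer check described above can be sketched as follows; the WBEntry structure and field names are hypothetical, and a real controller performs this comparison associatively in hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical write-buffer entry: one buffered store awaiting memory. */
typedef struct {
    bool     valid;
    uint64_t block_addr;   /* block-aligned address of the buffered write */
} WBEntry;

#define WB_ENTRIES 4

/* On a read miss, service the read from memory immediately unless a buffered
   write targets the same block (a RAW hazard through memory). */
bool read_can_bypass_writes(const WBEntry wb[WB_ENTRIES], uint64_t read_block)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wb[i].valid && wb[i].block_addr == read_block)
            return false;   /* conflict: drain (or forward) this write first */
    return true;            /* no conflict: give the read miss priority */
}
```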

  5. Read Priority over Write on Miss
     [Diagram: the CPU sends stores into a write buffer; reads and buffer drains go to DRAM (or the next lower level of memory).]

  6. Merging Write Buffer
     Write merging: newly written data whose addresses fall into a block already held in the write buffer are merged into that existing entry instead of allocating a new one.
     - Reduces stalls caused by the write (write-back) buffer being full.
     - Improves memory efficiency.
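A minimal sketch of write merging, assuming a small buffer with per-byte valid bits; the structure and names are illustrative, and real hardware does the block-address comparison in parallel.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WB_ENTRIES  4
#define BLOCK_BYTES 64

/* Hypothetical entry: one block's worth of data plus per-byte valid bits. */
typedef struct {
    bool     valid;
    uint64_t block_addr;
    uint8_t  data[BLOCK_BYTES];
    bool     byte_valid[BLOCK_BYTES];
} WBEntry;

/* Insert a store of `len` bytes at `addr` (assumed not to cross a block).
   Returns false if the buffer is full and the CPU must stall. */
bool write_buffer_insert(WBEntry wb[WB_ENTRIES], uint64_t addr,
                         const uint8_t *src, int len)
{
    uint64_t block  = addr - (addr % BLOCK_BYTES);
    int      offset = (int)(addr % BLOCK_BYTES);

    /* 1. Write merging: fold the store into an existing entry for this block. */
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb[i].valid && wb[i].block_addr == block) {
            memcpy(&wb[i].data[offset], src, (size_t)len);
            for (int b = 0; b < len; b++)
                wb[i].byte_valid[offset + b] = true;
            return true;
        }
    }

    /* 2. No match: allocate a free entry for a new block. */
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (!wb[i].valid) {
            memset(&wb[i], 0, sizeof wb[i]);
            wb[i].valid = true;
            wb[i].block_addr = block;
            memcpy(&wb[i].data[offset], src, (size_t)len);
            for (int b = 0; b < len; b++)
                wb[i].byte_valid[offset + b] = true;
            return true;
        }
    }
    return false;   /* buffer full: stall until an entry drains to memory */
}
```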

  7. Reducing Miss Penalty Summary
     CPU time = IC × (CPI_execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time
     Four techniques:
     - Multilevel caches
     - Early restart and critical word first on a miss
     - Read priority over writes
     - Merging write buffer
     Can be applied recursively to multilevel caches.
     - The danger is that the time to reach DRAM grows with multiple cache levels in between.
     - First attempts at L2 caches can make things worse, since the increased worst-case miss penalty hurts.
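Plugging illustrative numbers into the formula above (all values are assumptions for the example, not measurements from the slides):

```c
#include <stdio.h>

int main(void) {
    /* Assumed example values, not data from the lecture. */
    double IC            = 1e9;    /* instructions                    */
    double CPI_exec      = 1.0;    /* base CPI with a perfect cache   */
    double mem_per_instr = 1.5;    /* memory accesses per instruction */
    double miss_rate     = 0.02;   /* 2% miss rate                    */
    double miss_penalty  = 100.0;  /* cycles per miss                 */
    double cycle_time    = 1e-9;   /* 1 GHz clock                     */

    double cpi      = CPI_exec + mem_per_instr * miss_rate * miss_penalty;
    double cpu_time = IC * cpi * cycle_time;

    printf("effective CPI = %.2f\n", cpi);         /* 1.0 + 3.0 = 4.0 */
    printf("CPU time      = %.3f s\n", cpu_time);  /* 4.0 seconds     */
    return 0;
}
```

Halving the miss penalty (for example with a well-behaved L2 cache) would cut the memory term from 3.0 to 1.5 cycles per instruction in this example, which is why the miss-penalty techniques above matter.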

  8. Improving Cache Performance
     1. Reducing miss rates
        - Larger block size
        - Larger cache size
        - Higher associativity
        - Victim caches
        - Way prediction and pseudoassociativity
        - Compiler optimization
     2. Reducing miss penalty
        - Multilevel caches
        - Critical word first
        - Read miss first
        - Merging write buffers
     3. Reducing miss penalty or miss rate via parallelism
        - Non-blocking caches
        - Hardware prefetching
        - Compiler prefetching
     4. Reducing cache hit time
        - Small and simple caches
        - Avoiding address translation
        - Pipelined cache access
        - Trace caches

  9. Non-blocking Caches to Reduce Stalls on Misses
     A non-blocking (lockup-free) cache allows the data cache to continue supplying hits during a miss.
     - Usually works with out-of-order execution.
     "Hit under miss" reduces the effective miss penalty by allowing one outstanding cache miss; the processor keeps running until another miss happens.
     - Sequential memory access is enough.
     - Relatively simple implementation.
     "Hit under multiple misses" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses (see the sketch after this slide).
     - Implies the memory system supports concurrency (parallel or pipelined accesses).
     - Significantly increases the complexity of the cache controller.
     - Requires multiple memory banks (otherwise concurrent misses cannot be serviced).
     - The Pentium Pro allows 4 outstanding memory misses.
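One common way to build this is with miss status holding registers (MSHRs) that track outstanding misses; the sketch below is a simplified model with hypothetical types and names, not the controller described in the lecture.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSHR_ENTRIES 4   /* e.g. four outstanding misses, as in the Pentium Pro */

/* Miss status holding register: one in-flight miss per entry. */
typedef struct {
    bool     valid;
    uint64_t block_addr;
} MSHR;

typedef enum { HIT, MISS_ISSUED, MISS_MERGED, STALL } AccessResult;

/* `tag_hit` is the result of the normal tag check for this access. */
AccessResult cache_access(MSHR mshr[MSHR_ENTRIES], uint64_t block_addr,
                          bool tag_hit)
{
    if (tag_hit)
        return HIT;                    /* "hit under miss": serviced at once */

    /* Secondary miss to a block whose fill is already in flight: merge it. */
    for (int i = 0; i < MSHR_ENTRIES; i++)
        if (mshr[i].valid && mshr[i].block_addr == block_addr)
            return MISS_MERGED;

    /* Primary miss: allocate an MSHR and send the request to memory. */
    for (int i = 0; i < MSHR_ENTRIES; i++) {
        if (!mshr[i].valid) {
            mshr[i].valid = true;
            mshr[i].block_addr = block_addr;
            return MISS_ISSUED;        /* "miss under miss" */
        }
    }
    return STALL;                      /* all MSHRs busy: the cache locks up */
}
```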

  10. Value of Hit Under Miss for SPEC
      [Chart: per-benchmark results for "hit under n misses" (n = 0, 1, 2, 64) across SPEC integer and floating-point programs; legend: Base, 0->1, 1->2, 2->64.]
      - FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
      - Integer programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
      - Configuration: 8 KB data cache, direct mapped, 32 B blocks, 16-cycle miss penalty

  11. Reducing Misses by Hardware Prefetching of Instructions & Data
      Example: instruction prefetching
      - The Alpha 21064 fetches 2 blocks on a miss.
      - The extra block is placed in a "stream buffer".
      - On a miss, the stream buffer is checked before going to memory.
      Works with data blocks too:
      - Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4 KB cache; 4 stream buffers caught 43%.
      - Palacharla & Kessler [1994]: for scientific programs, 8 stream buffers caught 50% to 70% of the misses from two 64 KB, 4-way set-associative caches.
      Prefetching relies on having extra memory bandwidth that can be used without penalty (a behavioral sketch of a stream buffer follows).
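A behavioral sketch of a single-way stream buffer in the spirit of Jouppi's scheme; the structure and names are illustrative, and entries are marked valid as soon as their prefetch is issued to keep the model simple.

```c
#include <stdbool.h>
#include <stdint.h>

#define STREAM_DEPTH 4   /* blocks per stream buffer, as in the diagram below */

typedef struct {
    bool     valid[STREAM_DEPTH];        /* entry 0 is the head of the queue */
    uint64_t block_addr[STREAM_DEPTH];   /* addresses here are block numbers */
    uint64_t next_prefetch;              /* next sequential block to request */
} StreamBuffer;

/* Called on a cache miss. Returns true if the head of the stream buffer holds
   the missing block (so it can be moved into the cache); the buffer then
   shifts toward the head and schedules a prefetch of the next sequential
   block. On a head mismatch the buffer is flushed and restarted at the miss. */
bool stream_buffer_probe(StreamBuffer *sb, uint64_t miss_block)
{
    if (sb->valid[0] && sb->block_addr[0] == miss_block) {
        /* Hit in the stream buffer: shift remaining entries toward the head. */
        for (int i = 0; i < STREAM_DEPTH - 1; i++) {
            sb->valid[i]      = sb->valid[i + 1];
            sb->block_addr[i] = sb->block_addr[i + 1];
        }
        /* Refill the tail with a prefetch of the next sequential block. */
        sb->valid[STREAM_DEPTH - 1]      = true;
        sb->block_addr[STREAM_DEPTH - 1] = sb->next_prefetch;
        sb->next_prefetch += 1;
        return true;
    }
    /* Miss in the stream buffer too: restart the stream at the new miss. */
    for (int i = 0; i < STREAM_DEPTH; i++)
        sb->valid[i] = false;
    sb->next_prefetch = miss_block + 1;
    return false;
}
```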

  12. Stream Buffer Diagram
      [Diagram: a direct-mapped cache (tags and data) between the processor and the next level of cache, backed by a stream buffer: a FIFO queue of entries, each a tag plus comparator and one cache block of data, with head and tail pointers and a +1 adder that generates the next sequential prefetch address. Source: Jouppi, ISCA 1990.]
      Shown with a single stream buffer (one way); multiple ways and a filter may be used.

  13. Victim Buffer Diagram
      [Diagram: a direct-mapped cache (tags and data) between the processor and the next level of cache, with a small fully associative victim cache: a few entries, each a tag plus comparator and one cache block of data. Proposed in the same paper: Jouppi, ISCA 1990.]

  14. Reducing Misses by Software Prefetching of Data
      Data prefetch
      - Register prefetch: load data into a register (HP PA-RISC loads).
      - Cache prefetch: load data into the cache (MIPS IV, PowerPC, SPARC v9).
      - Special prefetching instructions cannot cause faults; they are a form of speculative execution.
      Prefetching comes in two flavors:
      - Binding prefetch: the request loads directly into a register, so the address and register must be correct.
      - Non-binding prefetch: the data is loaded into the cache; the address may be wrong, which frees hardware/software to guess (see the example after this slide).
      Issuing prefetch instructions takes time:
      - Is the cost of issuing prefetches less than the savings in reduced misses?
      - Wider superscalar issue makes the extra prefetch issue bandwidth easier to absorb.
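As a concrete illustration of a non-binding cache prefetch, GCC and Clang expose __builtin_prefetch, which a compiler or programmer can insert a few iterations ahead in a loop; the prefetch distance of 16 elements below is an assumed tuning value, not a figure from the lecture.

```c
#include <stddef.h>

/* Sum an array, prefetching data a fixed distance ahead.  __builtin_prefetch
   is a non-binding cache prefetch: it cannot fault, and a wrong address only
   wastes bandwidth. */
double sum_with_prefetch(const double *a, size_t n)
{
    const size_t dist = 16;   /* assumed prefetch distance, in elements */
    double s = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], /*rw=*/0, /*locality=*/3);
        s += a[i];
    }
    return s;
}
```

Whether this pays off depends on the loop: the prefetch instructions consume issue slots and bandwidth, which is exactly the trade-off the last bullet above asks about.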
