Spring 2015 :: CSE 502 – Computer Architecture Caches Instructor: Nima Honarmand

Motivation
[Figure: processor vs. memory performance, log scale from 1 to 10000, years 1985 to 2010]
• Want memory to appear:
– As fast as CPU
– As large as required by all of the running applications

Storage Hierarchy
• Make common case fast:
– Common: temporal & spatial locality
– Fast: smaller, more expensive memory
• Levels, top to bottom:
– Registers
– Caches (SRAM)
– Memory (DRAM)
– [SSD? (Flash)]
– Disk (Magnetic Media)
• Toward the top: faster, more bandwidth
• Toward the bottom: larger, cheaper, bigger transfers
• Caches are controlled by hardware; memory and below are controlled by software (OS)
• What is S(tatic)RAM vs D(ynamic)RAM?

Caches
• An automatically managed hierarchy
• Break memory into blocks (several bytes) and transfer data to/from cache in blocks
– spatial locality
• Keep recently accessed blocks
– temporal locality
[Figure: Core, $ (cache), Memory]

Cache Terminology
• block (cache line): minimum unit that may be cached
• frame: cache storage location to hold one block
• hit: block is found in the cache
• miss: block is not found in the cache
• miss ratio: fraction of references that miss
• hit time: time to access the cache
• miss penalty: time to replace block on a miss

Cache Example
• Address sequence from core (assume 8-byte lines):
  0x10000  Miss
  0x10004  Hit
  0x10120  Miss
  0x10008  Miss
  0x10124  Hit
  0x10004  Hit
• Lines held in the cache at the end: 0x10000, 0x10008, 0x10120 (…data…)
• Final miss ratio is 50%
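The hit/miss labels in this example can be checked with a short replay. This is a minimal sketch that assumes a cache large enough to never evict anything; the function name `simulate` and the variable names are illustrative, not from the slides.

```python
def simulate(addresses, line_bytes=8):
    """Replay byte addresses against a cache that never evicts."""
    cached_lines = set()          # line numbers currently held
    outcomes = []
    for addr in addresses:
        line = addr // line_bytes  # line (block) containing this byte
        if line in cached_lines:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            cached_lines.add(line)
    return outcomes

trace = [0x10000, 0x10004, 0x10120, 0x10008, 0x10124, 0x10004]
outcomes = simulate(trace)
miss_ratio = outcomes.count("miss") / len(outcomes)
print(outcomes, miss_ratio)
```

Three of the six references touch a line for the first time, which is where the 50% comes from.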

Average Memory Access Time (1/2)
• Or AMAT
• Very powerful tool to estimate performance
• If …
– cache hit is 10 cycles (core to L1 and back)
– memory access is 100 cycles (core to mem and back)
• Then …
– at 50% miss ratio, avg. access: 0.5×10 + 0.5×100 = 55
– at 10% miss ratio, avg. access: 0.9×10 + 0.1×100 = 19
– at 1% miss ratio, avg. access: 0.99×10 + 0.01×100 ≈ 11
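The arithmetic on this slide can be written as a one-line function. A small sketch, using the slide's convention that both latencies are full round-trip times from the core; the name `amat` and its parameter names are chosen here for illustration.

```python
def amat(miss_ratio, hit_cycles=10, memory_cycles=100):
    """Average access time: hits cost hit_cycles, misses cost memory_cycles."""
    return (1 - miss_ratio) * hit_cycles + miss_ratio * memory_cycles

for ratio in (0.5, 0.1, 0.01):
    print(f"miss ratio {ratio:>4}: {amat(ratio):.1f} cycles")
```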

Average Memory Access Time (2/2)
• Generalizes nicely to any-depth hierarchy
• If …
– L1 cache hit is 5 cycles (core to L1 and back)
– L2 cache hit is 20 cycles (core to L2 and back)
– memory access is 100 cycles (core to mem and back)
• Then …
– at 20% miss ratio in L1 and 40% miss ratio in L2 …
– avg. access: 0.8×5 + 0.2×(0.6×20 + 0.4×100) ≈ 14
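The nesting in the two-level formula extends to any number of levels by folding from the outermost level inward. A minimal sketch under the slide's convention that each latency is a full round trip from the core; `amat_hierarchy` and the `(hit_cycles, miss_ratio)` tuple layout are assumptions made for this example.

```python
def amat_hierarchy(levels, memory_cycles):
    """levels: list of (hit_cycles, local_miss_ratio), ordered from L1 outward."""
    cost = memory_cycles                      # cost of missing in every cache
    for hit_cycles, miss_ratio in reversed(levels):
        # A miss at this level pays the cost computed for the levels below it.
        cost = (1 - miss_ratio) * hit_cycles + miss_ratio * cost
    return cost

two_level = amat_hierarchy([(5, 0.2), (20, 0.4)], memory_cycles=100)
print(f"{two_level:.1f} cycles")
```

For the numbers on the slide this gives 14.4 cycles, which the slide rounds to 14.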

Memory Organization (1/3)
• L1 is split (separate I$ and D$)
• L2 and L3 are unified
[Figure: Processor (Registers, I-TLB, L1 I-Cache, L1 D-Cache, D-TLB), then L2 Cache, then L3 Cache (LLC), then Main Memory (DRAM)]

Memory Organization (2/3)
• L1 and L2 are private
• L3 is shared
• Multi-core replicates the top of the hierarchy
[Figure: Core 0 and Core 1 each have Registers, I-TLB, L1 I-Cache, L1 D-Cache, D-TLB and an L2 Cache; both share the L3 Cache (LLC) and Main Memory (DRAM)]

Memory Organization (3/3)
• Intel Nehalem (3.3GHz, 4 cores, 2 threads per core)
[Figure: die photo with labels 32K L1-D, 32K L1-I, 256K L2]

SRAM Overview
[Figure: “6T SRAM” cell: two chained inverters (2T per inverter) plus 2 access gates, connected to bitline b and its complement]
• Chained inverters maintain a stable state
• Access gates provide access to the cell
• Writing to cell involves over-powering storage inverters

8-bit SRAM Array
[Figure: labels wordline, bitlines]

8 × 8-bit SRAM Array
[Figure: labels wordlines, bitlines]

Fully-Associative Cache
• Keep blocks in cache frames
– data
– state (e.g., valid)
– address tag
• Address fields (bits 63 to 0): tag[63:6], block offset[5:0]
[Figure: each frame holds (state, tag, data); every stored tag is compared (=) with the address tag, as in a Content Addressable Memory (CAM); a multiplexor selects the data and the comparisons produce hit?]
• What happens when the cache runs out of space?

The 3 C’s of Cache Misses
• Compulsory: Never accessed before
• Capacity: Accessed long ago and already replaced
• Conflict: Neither compulsory nor capacity (later today)
• Coherence: (To appear in multi-core lecture)

Cache Size
• Cache size is data capacity (don’t count tag and state)
– Bigger can exploit temporal locality better
– Not always better
• Too large a cache
– Smaller is faster ⇒ bigger is slower
– Access time may hurt critical path
• Too small a cache
– Limited temporal locality
– Useful data constantly replaced
[Figure: hit rate vs. capacity, with the working set size marked]

Block Size
• Block size is the data that is
– Associated with an address tag
– Not necessarily the unit of transfer between hierarchies
• Too small a block
– Don’t exploit spatial locality well
– Excessive tag overhead
• Too large a block
– Useless data transferred
– Too few total blocks
• Useful data frequently replaced
[Figure: hit rate vs. block size]

8 × 8-bit SRAM Array
[Figure: 1-of-8 decoder selects the wordline; labels wordline, bitlines]

64 × 1-bit SRAM Array
[Figure: 1-of-8 decoder selects the wordline; a column mux driven by a second 1-of-8 decoder selects among the bitlines]
• SRAM designers try to keep physical layout square (to avoid long wires)
• Logical layout of SRAM array may differ from physical layout

Direct-Mapped Cache
• Use middle bits as index
• Only one tag comparison
• Address fields: tag[63:16], index[15:6], block offset[5:0]
[Figure: a decoder selects one (state, tag, data) frame by index; the stored tag is compared with the address tag (tag match = hit?); a multiplexor selects the data]
• Why take index bits out of the middle?
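The field boundaries on this slide translate directly into shifts and masks. A minimal sketch; the constant names and the function `split_address` are illustrative, and the widths (6 offset bits, 10 index bits) are read off the slide's tag[63:16] / index[15:6] / offset[5:0] layout.

```python
OFFSET_BITS = 6    # block offset[5:0]  -> 64-byte blocks
INDEX_BITS = 10    # index[15:6]        -> 1024 frames

def split_address(addr):
    """Return (tag, index, offset) for a direct-mapped lookup."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), index, offset)
```

Because the index comes from the bits just above the offset, consecutive blocks in memory land in consecutive frames rather than competing for the same one.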

Cache Conflicts
• What if two blocks alias on a frame?
– Same index, but different tags
• Address sequence (bits read left to right as tag, index, block offset):
  0xDEADBEEF  11011110101011011011111011101111
  0xFEEDBEEF  11111110111011011011111011101111
  0xDEADBEEF  11011110101011011011111011101111
• The second access to 0xDEADBEEF experiences a Conflict miss
– Not Compulsory (seen it before)
– Not Capacity (lots of other indexes available in cache)
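The aliasing can be reproduced with a tiny direct-mapped model. This sketch assumes the field widths from the Direct-Mapped Cache slide (6 offset bits, 10 index bits); the slide's own figure does not state where its field boundaries fall. Labeling every repeat miss as a conflict is a simplification that only holds because this trace touches far fewer blocks than the cache has frames. The function name `direct_mapped_trace` is hypothetical.

```python
OFFSET_BITS, INDEX_BITS = 6, 10   # assumed widths, see lead-in

def direct_mapped_trace(addresses):
    frames = {}    # index -> tag currently stored in that frame
    seen = set()   # block numbers referenced so far
    results = []
    for addr in addresses:
        block = addr >> OFFSET_BITS
        index = block & ((1 << INDEX_BITS) - 1)
        tag = block >> INDEX_BITS
        if frames.get(index) == tag:
            results.append("hit")
        elif block in seen:
            # Seen before but evicted by an aliasing block (simplified rule).
            results.append("conflict miss")
        else:
            results.append("compulsory miss")
        frames[index] = tag
        seen.add(block)
    return results

results = direct_mapped_trace([0xDEADBEEF, 0xFEEDBEEF, 0xDEADBEEF])
print(results)
```

Both addresses end in 0xBEEF, so they share all 16 low bits and therefore the same index; only the tags (0xDEAD and 0xFEED) differ.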

Associativity (1/2)
• Where does block index 12 (b’1100) go in a cache with 8 frames (0 to 7)?
– Fully-associative (all frames in 1 set): block goes in any frame
– Set-associative (frames grouped in sets; here 4 sets of 2 frames): block goes in any frame in one set
– Direct-mapped (1 frame per set): block goes in exactly one frame
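The three placements differ only in how many frames share a set. A minimal sketch, assuming the set is chosen as block number modulo the number of sets and that set s owns frames s×ways through s×ways+ways−1, as in the slide's figure; `candidate_frames` is an illustrative name.

```python
def candidate_frames(block, num_frames, ways):
    """Frames where `block` may be placed in a cache with `ways` frames per set."""
    num_sets = num_frames // ways
    set_index = block % num_sets
    return [set_index * ways + w for w in range(ways)]

fully_associative = candidate_frames(12, num_frames=8, ways=8)
set_associative = candidate_frames(12, num_frames=8, ways=2)
direct_mapped = candidate_frames(12, num_frames=8, ways=1)
print(fully_associative, set_associative, direct_mapped)
```

For block 12 this yields all eight frames, the two frames of set 0, and frame 4 respectively.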

Associativity (2/2)
• Holding cache and block size constant:
• Larger associativity
– lower miss rate (fewer conflicts)
– higher power consumption
• Smaller associativity
– lower cost
– faster hit time
[Figure: hit rate vs. associativity, annotated “~5 for L1-D”]

N-Way Set-Associative Cache
• Address fields: tag[63:15], index[14:6], block offset[5:0]
[Figure: each way has its own decoder and array of (state, tag, data) frames; the index selects one set across the ways; each way’s tag is compared (=) with the address tag; multiplexors select the data and produce hit?]
• Note the additional bit(s) moved from index to tag
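The shift of a bit from index to tag follows from halving the number of sets. A minimal sketch that derives the field widths from the cache geometry; the slides do not state the capacity, so the 64 KiB used below is an assumption chosen because it reproduces the widths shown on the slides. Sizes are assumed to be powers of two, and `field_widths` is an illustrative name.

```python
def field_widths(cache_bytes, block_bytes, ways, address_bits=64):
    """Return (tag_bits, index_bits, offset_bits); sizes must be powers of two."""
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1   # log2(block_bytes)
    index_bits = num_sets.bit_length() - 1       # log2(num_sets)
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

CACHE_BYTES = 64 * 1024   # assumed capacity, not given on the slides
for ways in (1, 2, 1024):
    print(ways, field_widths(CACHE_BYTES, 64, ways))
```

With one way the result matches tag[63:16] / index[15:6]; with two ways it matches tag[63:15] / index[14:6]; with 1024 ways there is a single set, no index, and the 58-bit tag of the fully-associative slide.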

Associative Block Replacement
• Which block in a set to replace on a miss?
• Ideal replacement (Belady’s Algorithm)
– Replace block accessed farthest in the future
– Trick question: How do you implement it?
• Least Recently Used (LRU)
– Optimized for temporal locality (expensive for >2-way)
• Not Most Recently Used (NMRU)
– Track MRU, random select among the rest
– Same as LRU for 2-way sets
• Random
– Nearly as good as LRU, sometimes better (when?)
• Pseudo-LRU
– Used in caches with high associativity
– Examples: Tree-PLRU, Bit-PLRU
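LRU for a single set can be sketched in software as follows. This is only a behavioral model: it uses Python's `OrderedDict` for bookkeeping, whereas hardware tracks recency with a few state bits per set. The class name `LRUSet` is hypothetical.

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement (behavioral model only)."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (filling or replacing)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # now the most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)   # evict the least recently used
        self.blocks[tag] = None
        return False

two_way = LRUSet(ways=2)
hits = [two_way.access(t) for t in "ABACB"]
print(hits)
```

In the trace A B A C B, the second A hits and makes B the LRU block, so C evicts B and the final B misses.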

Victim Cache (1/2)
• Associativity is expensive
– Performance overhead from extra muxes
– Power overhead from reading and checking more tags and data
• Conflicts are expensive
– Performance overhead from extra misses
• Observation: Conflicts don’t occur in all sets

Victim Cache (2/2)
[Figure: a 4-way set-associative L1 cache compared with a 4-way set-associative L1 cache + fully-associative victim cache, under an access sequence that cycles through blocks A B C D E and J K L M N]
• Without the victim cache: every access is a miss! ABCDE and JKLMN do not “fit” in a 4-way set-associative cache
• Victim cache provides a “fifth way” so long as only four sets overflow into it at the same time
• Can even provide 6th or 7th … ways
• Provide “extra” associativity, but not for all sets
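The “fifth way” effect shows up in a small model of one overflowing set. This sketch assumes LRU replacement in the set, a FIFO-ordered victim buffer, and a swap on a victim hit; the slides do not specify these policies. It models a single set, not the two overflowing sets of the figure, and the function name `count_misses` is hypothetical.

```python
from collections import OrderedDict

def count_misses(trace, ways=4, victim_entries=0):
    """Misses that go to the next level for one set plus an optional victim buffer."""
    main = OrderedDict()     # the set, least recently used first
    victim = OrderedDict()   # fully-associative victim buffer, oldest first
    misses = 0
    for block in trace:
        if block in main:
            main.move_to_end(block)
            continue
        if block in victim:
            del victim[block]            # victim hit: block moves back to the set
        else:
            misses += 1                  # fetch from the next level
        if len(main) == ways:
            evicted, _ = main.popitem(last=False)
            if victim_entries:
                if len(victim) == victim_entries:
                    victim.popitem(last=False)
                victim[evicted] = None   # evicted block goes to the victim buffer
        main[block] = None
    return misses

trace = "ABCDE" * 4
without_victim = count_misses(trace)
with_victim = count_misses(trace, victim_entries=1)
print(without_victim, with_victim)
```

Cycling five blocks through a 4-way LRU set misses on all 20 accesses; with a single victim entry only the first five (compulsory) misses remain.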
