COSC 5351 Advanced Computer Architecture
Slides modified from Hennessy CS252 course slides

The Processor-Memory Performance Gap
[Figure: processor and DRAM performance (1/latency) vs. year]
◦ CPU performance: ~60% per year (2X in 1.5 years)
◦ DRAM performance: ~9% per year (2X in 10 years)
◦ The processor-DRAM gap grew about 50% per year (a short sketch of this compounding follows at the end of this section)
◦ For contrast, in the Apple ][ (1977) the CPU cycle time was 1000 ns while DRAM access took 400 ns — memory was faster than the processor. [Photo: Steve Jobs and Steve Wozniak]

Q. How do architects address this gap?
A. Put smaller, faster “cache” memories between the CPU and DRAM, creating a “memory hierarchy”.

Levels of the Memory Hierarchy (upper levels are smaller, faster, and costlier per bit; lower levels are larger, slower, and cheaper)

Level         | Capacity   | Access time / cost                              | Managed by       | Transfer unit to next level
CPU registers | 100s bytes | <10s ns                                         | program/compiler | instructions/operands, 1-8 bytes
Cache         | K bytes    | 10-100 ns, 1-0.1 cents/bit                      | cache controller | blocks, 8-128 bytes
Main memory   | M bytes    | 200-500 ns, 0.0001-0.00001 cents/bit            | OS               | pages, 512-4K bytes
Disk          | G bytes    | 10 ms (10,000,000 ns), 10^-5 - 10^-6 cents/bit  | user/operator    | files, Mbytes
Tape          | “infinite” | sec-min, 10^-8 cents/bit                        |                  |

Memory Hierarchy of the Apple iMac G5 (1.6 GHz)

Level            | Reg    | L1 Inst | L1 Data | L2     | DRAM  | Disk
Size             | 1K     | 64K     | 32K     | 512K   | 256M  | 80G
Latency (cycles) | 1      | 3       | 3       | 11     | 88    | ~10^7
Latency (time)   | 0.6 ns | 1.9 ns  | 1.9 ns  | 6.9 ns | 55 ns | 12 ms

◦ Managed by the compiler: registers
◦ Managed by hardware: L1 instruction cache, L1 data cache, L2 cache
◦ Managed by OS, hardware, and application: DRAM, disk
[Die photo labels: L1 (64K instruction), L1 (32K data), registers (1K), 512K L2]

Goal: the illusion of large, fast, cheap memory — let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
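The “gap grew ~50% per year” figure follows from compounding the two growth rates: 1.60 / 1.09 ≈ 1.47, i.e. roughly 47% per year. A minimal C sketch (mine, not from the slides) that compounds the slide's 60% / 9% rates to show how quickly the gap opens up:

```c
/* Compound the slide's growth rates to see the processor-DRAM gap grow.
 * 1.60 / 1.09 ~= 1.47, so the gap widens by roughly 50% per year. */
#include <stdio.h>

int main(void) {
    double cpu = 1.0, dram = 1.0;          /* normalized performance in year 0 */
    for (int year = 0; year <= 10; year++) {
        printf("year %2d: CPU %8.2f  DRAM %5.2f  gap %6.2fx\n",
               year, cpu, dram, cpu / dram);
        cpu  *= 1.60;                      /* ~60% per year (2X in ~1.5 years) */
        dram *= 1.09;                      /* ~9% per year  (2X in ~10 years)  */
    }
    return 0;
}
```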
The Principle of Locality
◦ Programs access a relatively small portion of the address space at any instant of time. (This is a bit like real life: we all have many friends, but at any given time most of us keep in touch with only a small group of them.)
◦ Two different types of locality:
  • Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
  • Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access). A small C sketch contrasting good and bad spatial locality follows at the end of this section.
◦ For the last 15 years, hardware has relied on locality for speed.
◦ Locality is a property of programs that is exploited in machine design.
[Figure: memory address vs. time, one dot per access, showing bands of temporal locality, clusters of spatial locality, and a region of bad locality behavior. Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]

Memory Hierarchy: Terminology
◦ Hit: the data appears in some block in the upper level (example: Block X).
  • Hit rate: the fraction of memory accesses found in the upper level.
  • Hit time: time to access the upper level, which consists of the RAM access time plus the time to determine hit/miss.
◦ Miss: the data must be retrieved from a block in the lower level (Block Y).
  • Miss rate = 1 - hit rate.
  • Miss penalty: time to replace a block in the upper level plus the time to deliver the block to the processor.
◦ Hit time << miss penalty.
[Figure: processor ↔ upper-level memory (Blk X) ↔ lower-level memory (Blk Y)]

Cache Measures
◦ Hit rate: fraction of accesses found in that level — usually so high that we talk about the miss rate instead.
◦ Miss-rate fallacy: miss rate is to average memory access time as MIPS is to CPU performance — a convenient number that can mislead on its own.
◦ Average memory access time = hit time + miss rate × miss penalty (in ns or clock cycles).
◦ Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU.
  • Access time: time to reach the lower level = f(latency to lower level).
  • Transfer time: time to transfer the block = f(bandwidth between upper and lower levels).

Effective Access Time
◦ T_e: effective memory access time of the cache memory system
◦ T_c: cache access time
◦ T_m: main memory access time
◦ h: hit ratio
◦ T_e = T_c + (1 - h) × T_m
◦ Example: T_c = 0.4 ns, T_m = 1.2 ns, h = 0.85 (85%):
  T_e = 0.4 + (1 - 0.85) × 1.2 = 0.58 ns
  (Both of these formulas are worked in a C sketch below.)

Four Questions for Memory Hierarchy Designers
◦ Q1: Where can a block be placed in the upper level? (Block placement)
◦ Q2: How is a block found if it is in the upper level? (Block identification)
◦ Q3: Which block should be replaced on a miss? (Block replacement)
◦ Q4: What happens on a write? (Write strategy)
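As a hypothetical illustration of the two kinds of locality (not from the slides): summing a matrix row by row touches consecutive addresses (spatial locality) and reuses the accumulator every iteration (temporal locality), while walking it column by column strides across whole rows and loses the spatial locality, which typically shows up as many more cache misses.

```c
/* Contrast good and bad spatial locality on a C (row-major) matrix. */
#include <stdio.h>

#define N 1024
static double a[N][N];            /* stored row-major in C */

double sum_row_major(void) {      /* good spatial locality */
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];       /* consecutive addresses */
    return sum;
}

double sum_col_major(void) {      /* poor spatial locality */
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];       /* stride of N * sizeof(double) bytes */
    return sum;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}
```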
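A minimal sketch of the two access-time formulas above; the function names are mine, and the numbers in main() are the slide's example (T_c = 0.4 ns, T_m = 1.2 ns, h = 0.85), which gives T_e = 0.58 ns.

```c
#include <stdio.h>

/* Average memory access time = hit time + miss rate * miss penalty */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

/* Effective access time: T_e = T_c + (1 - h) * T_m */
double effective_access_time(double t_cache, double t_mem, double hit_ratio) {
    return t_cache + (1.0 - hit_ratio) * t_mem;
}

int main(void) {
    printf("T_e  = %.2f ns\n", effective_access_time(0.4, 1.2, 0.85));
    printf("AMAT = %.2f ns\n", amat(0.4, 0.15, 1.2));   /* same example numbers */
    return 0;
}
```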
Q1: Where Can a Block Be Placed? (Block Placement)
◦ Block 12 placed in an 8-block cache: fully associative, direct mapped, or 2-way set associative.
◦ Set-associative mapping: set = block number modulo the number of sets.
  • Direct mapped: (12 mod 8) = 4 — only block frame 4.
  • 2-way set associative: (12 mod 4) = 0 — either frame in set 0.
  • Fully associative: any of frames 0-7.
  (These mappings are worked in a C sketch after this section.)
[Figure: cache block frames 0-7 under each organization; memory blocks 0-31.]

Q2: How Is a Block Found? (Block Identification)
◦ A tag on each block — no need to check the index or block offset.
◦ The address splits into the block address (tag | index) and the block offset.
◦ Increasing associativity shrinks the index and expands the tag.

Q3: Which Block Should Be Replaced on a Miss? (Block Replacement)
◦ Easy for direct mapped: there is only one candidate.
◦ For set-associative or fully associative caches:
  • Random: a randomly chosen block — easy to implement, but how well does it work?
  • LRU (Least Recently Used): appealing, but hard to implement for high associativity, so LRU approximations are also used.
  • Others: FIFO, MRU, LFU (Least Frequently Used), MFU.
◦ Miss rates for set-associative data caches, LRU vs. Random:

Size   | 2-way LRU | 2-way Random | 4-way LRU | 4-way Random | 8-way LRU | 8-way Random
16 KB  | 5.2%      | 5.7%         | 4.7%      | 5.3%         | 4.4%      | 5.0%
64 KB  | 1.9%      | 2.0%         | 1.5%      | 1.7%         | 1.4%      | 1.5%
256 KB | 1.15%     | 1.17%        | 1.13%     | 1.13%        | 1.12%     | 1.12%

Q4: What Happens on a Write? (Write Strategy)

Policy                                     | Write-Through                                                          | Write-Back
What happens on a write                    | Data written to the cache block is also written to lower-level memory | Data is written only to the cache; the lower level is updated when the block falls out of the cache
Debugging                                  | Easy                                                                   | Hard
Do read misses produce writes?             | No                                                                     | Yes
Do repeated writes make it to lower level? | Yes                                                                    | No

◦ Additional option: let writes to an un-cached address allocate a new cache line (“write-allocate”).

Write Buffers for Write-Through Caches
[Figure: processor → cache → lower-level memory, with a write buffer between the cache and lower-level memory.]
◦ A write buffer holds data awaiting write-through to lower-level memory.
◦ Q. Why a write buffer? A. So the CPU doesn’t stall on writes.
◦ Q. Why a buffer, why not just one register? A. Bursts of writes are common.
◦ Q. Are Read After Write (RAW) hazards an issue for the write buffer? A. Yes! Either drain the buffer before the next read, or send the read first after checking the write buffer. (A write-buffer sketch also follows below.)
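A minimal C sketch of the placement and identification rules above, under assumed parameters (8 block frames, 32-byte blocks, configurable associativity). The struct and function names are mine, but the set computations reproduce the slide's (12 mod 8) = 4 and (12 mod 4) = 0 example, and the tag/index split shows how a higher associativity shrinks the index and grows the tag.

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_FRAMES  8            /* 8 block frames in the cache (assumed) */
#define BLOCK_BYTES 32           /* 32-byte blocks (assumed)              */

typedef struct {
    uint32_t tag;                /* identifies the block within its set */
    uint32_t set_index;          /* which set the block must go in      */
    uint32_t block_offset;       /* byte within the block               */
} decoded_addr;

decoded_addr decode(uint32_t addr, int ways) {
    uint32_t num_sets     = NUM_FRAMES / ways;  /* fewer sets as associativity grows */
    uint32_t block_number = addr / BLOCK_BYTES; /* strip the block offset            */
    decoded_addr d;
    d.block_offset = addr % BLOCK_BYTES;
    d.set_index    = block_number % num_sets;   /* set = block number mod #sets      */
    d.tag          = block_number / num_sets;   /* remaining high-order bits         */
    return d;
}

int main(void) {
    uint32_t block12 = 12 * BLOCK_BYTES;        /* any address inside memory block 12 */
    printf("direct mapped: set %u\n", decode(block12, 1).set_index);  /* 12 mod 8 = 4 */
    printf("2-way        : set %u\n", decode(block12, 2).set_index);  /* 12 mod 4 = 0 */
    printf("fully assoc. : set %u\n", decode(block12, 8).set_index);  /* always set 0 */
    return 0;
}
```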
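And a sketch of the write-buffer idea for a write-through cache, under my own simplified model: a 4-entry buffer over a small word-addressed memory, with the RAW question handled by the "check the write buffer before sending the read" option. All names and sizes are illustrative, not from the slides.

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4
#define MEM_WORDS  (1u << 20)                 /* small stand-in for lower-level memory */
#define MEM_MASK   (MEM_WORDS - 1)

typedef struct { uint32_t addr; uint32_t data; bool valid; } wb_entry;

static wb_entry write_buffer[WB_ENTRIES];
static uint32_t memory[MEM_WORDS];

/* Write-through: the CPU drops the write into the buffer and keeps going. */
void cpu_write(uint32_t addr, uint32_t data) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (!write_buffer[i].valid) {
            write_buffer[i] = (wb_entry){ addr, data, true };
            return;                           /* no stall: memory is updated later */
        }
    }
    memory[addr & MEM_MASK] = data;           /* buffer full: a real design stalls here */
}

/* Drain buffered writes into lower-level memory. */
void drain_write_buffer(void) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid) {
            memory[write_buffer[i].addr & MEM_MASK] = write_buffer[i].data;
            write_buffer[i].valid = false;
        }
}

/* RAW check: a read first looks in the write buffer (with this fill-in-order
 * scheme the highest-index match is the most recent write), then goes to memory. */
uint32_t cpu_read(uint32_t addr) {
    for (int i = WB_ENTRIES - 1; i >= 0; i--)
        if (write_buffer[i].valid && write_buffer[i].addr == addr)
            return write_buffer[i].data;      /* forward the buffered value */
    return memory[addr & MEM_MASK];           /* not in the buffer: read memory */
}
```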
5 Basic Cache Optimizations

Reducing miss rate
1. Larger block size (reduces compulsory misses)
2. Larger cache size (reduces capacity misses)
3. Higher associativity (reduces conflict misses)

Reducing miss penalty
4. Multilevel caches

Reducing hit time
5. Giving reads priority over writes
   • E.g., a read completes before earlier writes still sitting in the write buffer.

Physical Addressing
[Figure: CPU ↔ memory; address lines A0-A31, data lines D0-D31.]
◦ “Physical addresses” of memory locations: all programs share one address space, the physical address space.
◦ Machine language programs must be aware of the machine organization.
◦ There is no way to prevent a program from accessing any machine resource.

Virtual Addressing
[Figure: CPU → address translation → memory; virtual addresses A0-A31 on the CPU side, physical addresses A0-A31 on the memory side, data D0-D31.]
◦ User programs run in a standardized virtual address space.
◦ Address translation hardware, managed by the operating system (OS), maps each virtual address to physical memory.
◦ Hardware supports “modern” OS features: protection, translation, sharing.

What Virtual Memory Provides
◦ Translation:
  • A program can be given a consistent view of memory, even though physical memory is scrambled.
  • Makes multithreading reasonable (now used a lot!).
  • Only the most important part of the program (the “working set”) must be in physical memory.
  • Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later.
◦ Protection:
  • Different threads (or processes) are protected from each other.
  • Different pages can be given special behavior (read only, invisible to user programs, etc.).
  • Kernel data is protected from user programs.
  • Very important for protection against malicious programs.
◦ Sharing:
  • The same physical page can be mapped to multiple users (“shared memory”).

Page Tables
◦ A virtual address space is divided into blocks of memory called pages; a machine usually supports pages of a few sizes (e.g., MIPS R4000).
◦ A virtual address = virtual page number | offset (a 12-bit offset for 4 KB pages).
◦ The page table, located in physical memory and reached through the page table base register, is indexed by the virtual page number. Each page table entry (“PTE”) holds a valid bit, access rights, and the physical frame address.
◦ A valid page table entry encodes the physical memory “frame” address for the page; the physical address = physical page (frame) number | offset. (A translation sketch in C follows below.)
◦ The OS manages the page table for each ASID (address space identifier).
◦ Virtual memory ⇒ treat main memory as a cache for the disk.
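A sketch of the translation the page-table slide describes, under assumed parameters: 32-bit virtual addresses, 4 KB pages (12-bit offset), and a single flat page table indexed by the virtual page number. The type and function names (pte_t, translate, ...) are mine; a real MIPS R4000 translates through a TLB backed by OS-managed, per-ASID page tables.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT   12                          /* 4 KB pages                 */
#define PAGE_OFFSET  ((1u << PAGE_SHIFT) - 1)
#define NUM_VPAGES   (1u << (32 - PAGE_SHIFT))   /* 2^20 virtual pages         */

typedef struct {
    uint32_t frame;        /* physical frame number (the "PA" field)           */
    bool     valid;        /* V bit: is the page present in physical memory?   */
    bool     writable;     /* one example of the access-rights bits            */
} pte_t;

static pte_t page_table[NUM_VPAGES];             /* reached via the page table base register */

/* Translate a virtual address; returns false on a page fault or protection
 * violation, in which case the OS would take over. */
bool translate(uint32_t vaddr, bool is_write, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;       /* virtual page number: page-table index */
    uint32_t offset = vaddr & PAGE_OFFSET;       /* offset passes through unchanged       */
    pte_t pte = page_table[vpn];

    if (!pte.valid)                return false; /* page fault                */
    if (is_write && !pte.writable) return false; /* protection fault          */

    *paddr = (pte.frame << PAGE_SHIFT) | offset; /* frame number | offset     */
    return true;
}
```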