Today
- HW3 extension (phew!)
- Lab 4?
- Finish up caches; exceptional control flow
Cache Associativity
For a cache with 8 blocks total:
  1-way: 8 sets, 1 block each   (direct mapped)
  2-way: 4 sets, 2 blocks each
  4-way: 2 sets, 4 blocks each
  8-way: 1 set,  8 blocks       (fully associative)
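Concretely, the set index comes from the middle bits of the address, and only the number of sets (not the associativity) enters that computation. A minimal sketch, assuming 64-byte blocks and a cache like the i7's 32 KB, 8-way L1 from the slides below (32768 / (8 * 64) = 64 sets); the names and the decompose helper are illustrative, not from the slides:

    #include <stdint.h>
    #include <stdio.h>

    /* Split an address into tag / set index / block offset.
       Assumes power-of-two block size and set count. */
    #define BLOCK_SIZE 64   /* bytes per block, as on the i7 */
    #define NUM_SETS   64   /* 32 KB, 8-way, 64 B blocks -> 32768/(8*64) = 64 sets */

    void decompose(uintptr_t addr) {
        uintptr_t offset = addr % BLOCK_SIZE;              /* byte within block */
        uintptr_t set    = (addr / BLOCK_SIZE) % NUM_SETS; /* which set */
        uintptr_t tag    = addr / ((uintptr_t)BLOCK_SIZE * NUM_SETS); /* the rest */
        printf("addr 0x%lx -> tag 0x%lx, set %lu, offset %lu\n",
               (unsigned long)addr, (unsigned long)tag,
               (unsigned long)set, (unsigned long)offset);
    }

    int main(void) {
        decompose(0x7ffd1234);   /* any address works; this one is arbitrary */
        return 0;
    }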
Types of Cache Misses
- Cold (compulsory) miss: occurs on the first access to a block.
- Conflict miss: most hardware caches limit blocks to a small subset (sometimes just one) of the available cache slots.
  - If one (e.g., block i must be placed in slot (i mod size)): direct-mapped.
  - If more than one: n-way set-associative (where n is a power of 2).
  - Conflict misses occur when the cache is large enough, but multiple data objects all map to the same slot; e.g., referencing blocks 0, 8, 0, 8, ... would miss every time.
- Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache (it just won't fit).
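The 0, 8, 0, 8, ... pattern can be seen in a toy direct-mapped simulation, a sketch assuming 8 one-block slots (everything here is illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy direct-mapped cache: 8 slots, each holding one block number.
       Block i maps to slot (i mod 8), as in the slide. */
    #define SLOTS 8

    int main(void) {
        long slot_block[SLOTS];
        bool slot_valid[SLOTS] = { false };
        long refs[] = { 0, 8, 0, 8, 0, 8 };   /* the pathological pattern */
        for (int i = 0; i < 6; i++) {
            long b = refs[i];
            int s = (int)(b % SLOTS);
            bool hit = slot_valid[s] && slot_block[s] == b;
            printf("block %ld -> slot %d: %s\n", b, s, hit ? "hit" : "miss");
            slot_valid[s] = true;
            slot_block[s] = b;   /* evict whatever was there */
        }
        return 0;
    }

Blocks 0 and 8 both map to slot 0, so every reference evicts the other block and every access misses, even though the other 7 slots sit empty.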
Intel Core i7 Cache Hierarchy
- Processor package: multiple cores (Core 0 ... Core 3), each with its own registers, private L1 d-cache and i-cache, and a private unified L2 cache; one L3 unified cache shared by all cores; then main memory.
- L1 i-cache and d-cache: 32 KB, 8-way; access: 4 cycles.
- L2 unified cache: 256 KB, 8-way; access: 11 cycles.
- L3 unified cache: 8 MB, 16-way; access: 30-40 cycles (shared by all cores).
- Block size: 64 bytes for all caches.
What about writes?
- Multiple copies of the data exist: L1, L2, possibly L3, main memory. The main problem: keeping those copies consistent.
- What to do on a write-hit?
  - Write-through: write immediately to memory.
  - Write-back: defer the write to memory until the line is evicted; needs a dirty bit to indicate whether the line differs from memory.
- What to do on a write-miss?
  - Write-allocate: load the block into the cache, update the line in cache. Good if more writes to the location follow.
  - No-write-allocate: just write immediately to memory.
- Typical caches:
  - Write-back + write-allocate, usually.
  - Write-through + no-write-allocate, occasionally.
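A minimal sketch of the usual combination, write-back + write-allocate, for a single cache line. The struct fields and helper names are assumptions for illustration, with a small array standing in for main memory:

    #include <stdbool.h>
    #include <string.h>

    typedef struct {
        bool valid;
        bool dirty;              /* does the line differ from memory? */
        unsigned long tag;
        unsigned char data[64];  /* one 64-byte block */
    } cache_line;

    static unsigned char main_mem[1 << 16];   /* toy "main memory" */

    static void mem_read_block(unsigned long tag, unsigned char *dst) {
        memcpy(dst, &main_mem[(tag * 64) % sizeof main_mem], 64);
    }
    static void mem_write_block(unsigned long tag, const unsigned char *src) {
        memcpy(&main_mem[(tag * 64) % sizeof main_mem], src, 64);
    }

    void cache_write(cache_line *line, unsigned long tag,
                     int offset, unsigned char byte) {
        if (!(line->valid && line->tag == tag)) {       /* write miss */
            if (line->valid && line->dirty)
                mem_write_block(line->tag, line->data); /* write back victim */
            mem_read_block(tag, line->data);            /* write-allocate */
            line->valid = true;
            line->tag = tag;
            line->dirty = false;
        }
        line->data[offset] = byte;   /* write hit (possibly after allocate) */
        line->dirty = true;          /* defer the memory update */
    }

    int main(void) {
        cache_line line = {0};
        cache_write(&line, 7, 3, 0xAB);   /* miss: allocate, then write */
        cache_write(&line, 7, 4, 0xCD);   /* hit: just write and mark dirty */
        return 0;
    }

Note how the dirty bit does the deferring: memory is only touched when a dirty line is evicted, not on every store.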
Where else is caching used?
Software Caches are More Flexible
- Examples: file system buffer caches, web browser caches, etc.
- Some design differences:
  - Almost always fully associative: no placement restrictions; index structures like hash tables are common (for placement).
  - Often use complex replacement policies: misses are very expensive when disk or network is involved, so it is worth thousands of cycles to avoid them.
  - Not necessarily constrained to single-"block" transfers: may fetch or write back in larger units, opportunistically.
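As a sketch of those differences, a fully associative software cache with LRU replacement might look like the following. All names are illustrative, and a real buffer cache would use a hash table for lookup rather than a linear scan:

    #include <stdio.h>
    #include <string.h>

    /* Toy fully associative software cache with LRU replacement. */
    #define ENTRIES 4

    typedef struct {
        int key;            /* e.g., a disk block number */
        char value[64];
        long last_used;     /* logical timestamp for LRU */
        int valid;
    } entry;

    static entry cache[ENTRIES];
    static long clock_tick = 0;

    char *lookup(int key) {
        for (int i = 0; i < ENTRIES; i++)      /* any entry may hold any key */
            if (cache[i].valid && cache[i].key == key) {
                cache[i].last_used = ++clock_tick;   /* refresh recency */
                return cache[i].value;
            }
        return NULL;                           /* miss: caller fetches from disk */
    }

    void insert(int key, const char *value) {
        int victim = 0;
        for (int i = 1; i < ENTRIES; i++)      /* prefer empty, else LRU slot */
            if (!cache[i].valid ||
                (cache[victim].valid &&
                 cache[i].last_used < cache[victim].last_used))
                victim = i;
        cache[victim].key = key;
        strncpy(cache[victim].value, value, sizeof cache[victim].value - 1);
        cache[victim].value[sizeof cache[victim].value - 1] = '\0';
        cache[victim].last_used = ++clock_tick;
        cache[victim].valid = 1;
    }

    int main(void) {
        insert(42, "block 42 contents");
        printf("%s\n", lookup(42) ? "hit" : "miss");   /* hit */
        printf("%s\n", lookup(17) ? "hit" : "miss");   /* miss */
        return 0;
    }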
Optimizations for the Memory Hierarchy
- Write code that has locality:
  - Spatial: access data contiguously.
  - Temporal: make sure accesses to the same data are not too far apart in time.
- How to achieve this?
  - Proper choice of algorithm.
  - Loop transformations (see the example below).
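A minimal C illustration of the spatial-locality point (not from the slides): C stores rows contiguously, so a row-major traversal uses every double in each cached block, while a column-major traversal typically touches a different block on every access.

    #include <stdio.h>
    #define N 1024

    static double a[N][N];

    /* Good spatial locality: stride-1 walk through memory. */
    double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Poor spatial locality: stride is N doubles (8N bytes). */
    double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }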
Example: Matrix Multiplication

    c = (double *) calloc(sizeof(double), n*n);

    /* Multiply n x n matrices a and b */
    void mmm(double *a, double *b, double *c, int n) {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
    }

(Schematic: c = a * b; row i of a and column j of b produce element (i, j) of c.)
Cache Miss Analysis
- Assume:
  - Matrix elements are doubles.
  - Cache block = 64 bytes = 8 doubles.
  - Cache size C << n (much smaller than n).
- First iteration:
  - n/8 + n = 9n/8 misses (omitting matrix c): the row of a is read sequentially, one miss per 8 doubles (n/8), while the column of b is read with stride n, one miss per element (n).
  - Afterwards in cache (schematic): the row of a and an 8-wide strip of b, since each miss on b brings in a block covering 8 consecutive columns.
- Other iterations:
  - Again n/8 + n = 9n/8 misses (omitting matrix c); because C << n, the cached strip of b is evicted before it can be reused.
- Total misses: 9n/8 * n^2 = (9/8) n^3.
Blocked Matrix Multiplication

    c = (double *) calloc(sizeof(double), n*n);

    /* Multiply n x n matrices a and b, in B x B blocks.
       B is a constant block size (e.g., #define B 8) that divides n;
       i1, j1, k1 added to the declarations. */
    void mmm(double *a, double *b, double *c, int n) {
        int i, j, k, i1, j1, k1;
        for (i = 0; i < n; i += B)
            for (j = 0; j < n; j += B)
                for (k = 0; k < n; k += B)
                    /* B x B mini matrix multiplications */
                    for (i1 = i; i1 < i+B; i1++)
                        for (j1 = j; j1 < j+B; j1++)
                            for (k1 = k; k1 < k+B; k1++)
                                c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
    }

(Schematic: block (i1, j1) of c accumulates the product of a block-row of a and a block-column of b; block size B x B.)
Cache Miss Analysis
- Assume:
  - Cache block = 64 bytes = 8 doubles.
  - Cache size C << n (much smaller than n).
  - Three blocks fit into the cache: 3B^2 < C.
- First (block) iteration:
  - B^2/8 misses for each B x B block loaded (n/B blocks per row or column).
  - 2n/B * B^2/8 = nB/4 misses (omitting matrix c).
  - Afterwards in cache (schematic): the current B x B blocks of a, b, and c.
- Other (block) iterations:
  - Same as the first iteration: 2n/B * B^2/8 = nB/4.
- Total misses: nB/4 * (n/B)^2 = n^3/(4B).
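Putting the two analyses side by side (a restatement of the slides' arithmetic):

\[
\text{no blocking: } \tfrac{9}{8}n^3 \text{ misses}, \qquad
\text{blocking: } \tfrac{n^3}{4B} \text{ misses}, \qquad
\text{ratio: } \frac{(9/8)\,n^3}{n^3/(4B)} = \tfrac{9}{2}B .
\]

So B = 8 gives a 36x reduction and B = 16 gives 72x, matching the summary below.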
Summary
- No blocking: (9/8) * n^3 misses.
- Blocking: 1/(4B) * n^3 misses.
  - If B = 8, the difference is 4 * 8 * 9/8 = 36x.
  - If B = 16, the difference is 4 * 16 * 9/8 = 72x.
- This suggests the largest possible block size B, subject to the limit 3B^2 < C.
- Reason for the dramatic difference:
  - Matrix multiplication has inherent temporal locality: input data is 3n^2, computation is 2n^3, so every array element is used O(n) times.
  - But the program has to be written properly to exploit it.
Cache-Friendly Code
- The programmer can optimize for cache performance:
  - How data structures are organized.
  - How data are accessed (nested loop structure; blocking is a general technique).
- All systems favor "cache-friendly code":
  - Getting absolute optimum performance is very platform-specific (cache sizes, line sizes, associativities, etc.).
  - You can get most of the advantage with generic code: keep the working set reasonably small (temporal locality), use small strides (spatial locality), and focus on the inner-loop code.
Intel Core i7: The Memory Mountain
- Measured machine: 32 KB L1 i-cache, 32 KB L1 d-cache, 256 KB unified L2 cache, 8 MB unified L3 cache; all caches on-chip.
- (Figure: read throughput in MB/s, 0 to 7000, plotted against working set size (2 KB to 64 MB) and stride (s1 to s32, in units of 8 bytes). Ridges correspond to the L1, L2, L3, and main-memory regimes.)
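The surface comes from a read kernel swept over (working set size, stride) pairs. A minimal sketch of such a kernel follows; the real CS:APP benchmark times repeated runs with cycle counters, and the names here are illustrative:

    #include <stddef.h>

    /* Read a working set of `elems` longs at the given stride (in elements).
       Sweeping (size, stride) over a grid and measuring read throughput
       for each pair yields the memory-mountain surface. */
    long data[1 << 23];   /* 8M longs = 64 MB, the largest working set plotted */

    long test(size_t elems, size_t stride) {
        long acc = 0;
        for (size_t i = 0; i < elems; i += stride)
            acc += data[i];  /* sequential reads at a fixed stride */
        return acc;          /* returned so the loop isn't optimized away */
    }

    int main(void) {
        /* e.g., a 1 MB working set read at stride 4 */
        long r = test((1 << 20) / sizeof(long), 4);
        return (int)(r & 1);   /* use the result */
    }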
Roadmap
- Course topics: data & addressing; integers & floats; machine code & C; x86 assembly programming; procedures & stacks; arrays & structs; memory & caches; exceptions & processes; virtual memory; memory allocation; Java vs. C.
- The running example at each level:

  Java:                         C:
    Car c = new Car();            car *c = malloc(sizeof(car));
    c.setMiles(100);              c->miles = 100;
    c.setGals(17);                c->gals = 17;
    float mpg = c.getMPG();       float mpg = get_mpg(c);
                                  free(c);

  Assembly language:
    get_mpg:
        pushq %rbp
        movq  %rsp, %rbp
        ...
        popq  %rbp
        ret

  Machine code:
    0111010000011000
    100011010000010000000010
    1000100111000010
    110000011111101000011111

  Computer system.
Control Flow
- So far, we've seen how the flow of control changes as a single program executes.
- A CPU executes more than one program at a time, though; we also need to understand how control flows across the many components of the system.
- Exceptional control flow is the basic mechanism used for:
  - Transferring control between processes and the OS.
  - Handling I/O and virtual memory within the OS.
  - Implementing multi-process applications like shells and web servers.
  - Implementing concurrency.