ENCM 501 W14 Slides for Lecture 23

Slides for Lecture 23
ENCM 501: Principles of Computer Architecture
Winter 2014 Term
Steve Norman, PhD, PEng
Electrical & Computer Engineering
Schulich School of Engineering
University of Calgary
3 April, 2014

slide 2/19
Previous Lecture

◮ introduction to cache coherency protocols

slide 3/19
Today’s Lecture

◮ MSI protocol for cache coherency
◮ snooping to support MSI and similar protocols
◮ race conditions at ILP and TLP levels
◮ use of locks to manage TLP races
◮ introduction to ISA support for locks

Related reading in Hennessy & Patterson: Sections 5.2, 5.5

slide 4/19
Quick review: Caches in multi-core systems

These points were made in the previous lecture:

◮ In a writeback cache in a uniprocessor (one-core, no SMT) system, possible states of a cache block are invalid, clean, or dirty.
◮ For that writeback cache in a uniprocessor system, block states usually change as a result of loads and stores processed within the single core. (Messy detail: Actions such as DMA by I/O devices can also change states of blocks.)
◮ In a multicore system, one of many possible choices for sets of possible cache block states is MSI: modified, shared, or invalid. State changes in a cache in one core may be triggered by actions in another core.

slide 5/19
Example quad-core system, with private L1 caches and a shared L2 cache

[Figure: four cores (core 0 to core 3), each with a private L1 I-cache and L1 D-cache, connected by a bus to a unified L2 cache, which connects through a DRAM controller to off-chip DRAM.]

slide 6/19
Example of propagating writes between cores

Scenario from previous lecture: Before Thread A writes to X, there is a copy of X in all 5 caches, with a status of shared.

    time   Thread A       Thread B       Thread C
           core 0         core 1         core 2
      |    writes X=42    .              .
      |    .              .              .
      |    .              .              .
      |    .              reads X        .
      v    .              .              reads X

In an MSI protocol, what effects do the write and reads have on states of cache blocks that contain X?
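One way to trace the slide 6 scenario is with a small state model. The following is a minimal sketch, not the full protocol of the textbook figures: it tracks only the MSI state of the block holding X in the four L1 D-caches, ignores the L2 cache and all data movement, and assumes Thread B's read happens before Thread C's. The names `msi_state`, `msi_write`, and `msi_read` are hypothetical, chosen for illustration.

```c
#include <assert.h>

/* MSI states for one cache block in one cache. */
typedef enum { INVALID, SHARED, MODIFIED } msi_state;

#define NUM_CORES 4   /* L1 D-caches of cores 0..3; the L2 cache is not modelled */

/* Core `writer` writes the block: its copy becomes MODIFIED and every
   other copy is invalidated (the effect of a broadcast invalidate). */
static void msi_write(msi_state st[], int n, int writer)
{
    for (int i = 0; i < n; i++)
        st[i] = (i == writer) ? MODIFIED : INVALID;
}

/* Core `reader` reads the block. A hit (SHARED or MODIFIED) changes
   nothing. On a miss, a MODIFIED copy elsewhere is downgraded to SHARED
   (in real hardware this involves a writeback), and the reader obtains a
   SHARED copy. */
static void msi_read(msi_state st[], int n, int reader)
{
    if (st[reader] != INVALID)
        return;
    for (int i = 0; i < n; i++)
        if (st[i] == MODIFIED)
            st[i] = SHARED;
    st[reader] = SHARED;
}
```

Starting from all copies SHARED, calling `msi_write(st, NUM_CORES, 0)`, then `msi_read(st, NUM_CORES, 1)`, then `msi_read(st, NUM_CORES, 2)` steps through the write by Thread A and the reads by Threads B and C. In this sketch the final states do not depend on which of the two reads comes first.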
slide 7/19
Example of write serialization

Scenario: Before Thread A writes to Y, there is a copy of Y in caches of cores 0 and 1, with a status of shared.

[Timeline across Thread A (core 0) and Thread B (core 1), in time order: writes Y=66, then writes Y=77, then reads Y.]

In an MSI protocol, what effects do the writes and read have on states of cache blocks that contain Y?

slide 8/19
Complete details of an MSI Protocol

See Figures 5.5, 5.6, and 5.7 in the textbook, and related discussion.

Note that this is probably the simplest possible cache coherency protocol, and its sets of state transitions and related actions are substantially more complicated than what is required for a uniprocessor cache system!

slide 9/19
Snooping

The example scenarios presented on slides 6 and 7 require all caches to snoop on a common bus.

Consider the L1 D-cache in core 0 of our quad-core example. In addition to doing all the things a uniprocessor cache would need to do, it must be capable of

◮ responding to write misses broadcast from other cores (this may require writeback from the core 0 cache);
◮ responding to invalidate commands broadcast from other cores (this may require invalidation within the core 0 cache).

The requirements are symmetric for the L1 D-caches in cores 1, 2 and 3.

slide 10/19

Snooping works reasonably well in a small SMP system, for example, two to four cores on a single chip. Scaling is a problem, as bus traffic increases at least proportionally to the number of cores.

There is potential for a lot of waste. For example, in an eight-core system, suppose threads in two cores, say cores 0 and 1, are frequently accessing shared memory. Then cores 2–7 are exposed to and slowed down by a stream of traffic that is really just a conversation between cores 0 and 1.

A more scalable alternative to snooping is use of a directory-based protocol. Time prevents us from looking at directory-based protocols in ENCM 501.

slide 11/19
Race conditions

A race condition is a situation in which the evolution of the state of some system depends critically on the exact order of nearly-simultaneous events in two or more subsystems.

Race conditions can exist as a pure hardware problem in the design of sequential logic circuits.

Race conditions can also arise at various levels of parallelism in software systems, including instruction-level parallelism and thread-level parallelism.

slide 12/19
ILP race condition example

Consider a Tomasulo-based microarchitecture for a uniprocessor system.

What is the potential race condition in the following code sequence?

    S.D  F0, 0(R8)
    S.D  F2, 24(R29)

How could hardware design eliminate the race condition?

Note: There are similar race condition problems related to out-of-order execution of stores and loads.
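As a hint for the slide 12 question: `0(R8)` and `24(R29)` are different address expressions, but they name the same location whenever R8 equals R29 + 24. The sketch below is a software model only, under stated assumptions: memory is a byte array, the two offsets stand in for the effective addresses, and the hypothetical function `final_value_at` performs the two 8-byte stores either in program order or swapped, then reports what ends up at the first address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of the two stores. addr1 stands for the effective address 0(R8),
   addr2 for 24(R29). If program_order is nonzero, the F0 store is done
   first and the F2 store second; otherwise the order is swapped.
   Returns the double found at addr1 afterwards. */
static double final_value_at(unsigned char *mem, size_t addr1, size_t addr2,
                             double f0, double f2, int program_order)
{
    double result;
    if (program_order) {
        memcpy(mem + addr1, &f0, sizeof f0);
        memcpy(mem + addr2, &f2, sizeof f2);
    } else {
        memcpy(mem + addr2, &f2, sizeof f2);
        memcpy(mem + addr1, &f0, sizeof f0);
    }
    memcpy(&result, mem + addr1, sizeof result);
    return result;
}
```

With distinct addresses the result is the same in either order. With equal addresses the final memory contents depend on which store is performed last, so the order of the stores is only harmless when the addresses are known to differ.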
slide 13/19
TLP race condition example

Two threads are running in the same loop, both reading and writing a global variable called counter . . .

    Thread A:                      Thread B:
    while ( condition ) {          while ( condition ) {
      do some work                   do some work
      counter++ ;                    counter++ ;
    }                              }

Before considering the race condition in a multicore system, let’s address this question: What is the potential failure if this program is running in a uniprocessor system?

slide 14/19
TLP race condition example, continued

The same defective program, now with the threads running simultaneously in two cores . . .

    Thread A:                      Thread B:
    while ( condition ) {          while ( condition ) {
      do some work                   do some work
      counter++ ;                    counter++ ;
    }                              }

Why does a cache coherency protocol such as MSI not prevent failure?

Does it help if the ISA has memory-register instructions, such as addq $1, (%rax) in x86-64? Doesn’t that do read-add-write in a single instruction?

slide 15/19
Making the two-thread program safe

    Thread A:                      Thread B:
    while ( condition ) {          while ( condition ) {
      do some work                   do some work
      acquire lock                   acquire lock
      counter++ ;                    counter++ ;
      release lock                   release lock
    }                              }

Setting up some kind of lock (often called a mutual exclusion, or mutex) prevents the failures seen in the past two slides.

Let’s make some notes about how this would work in a uniprocessor system.

slide 16/19

    Thread A:                      Thread B:
    while ( condition ) {          while ( condition ) {
      do some work                   do some work
      acquire lock                   acquire lock
      counter++ ;                    counter++ ;
      release lock                   release lock
    }                              }

What about the multicore case? The lock will prevent Thread B from reading counter while Thread A is doing a load-add-store update with counter.

Is that enough? Consider the MSI protocol again. What must happen either as a result of Thread A releasing the lock, or as a result of Thread B releasing the lock?
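The locked loop on slides 15 and 16 could be written with Pthreads roughly as follows. This is a minimal sketch under assumptions, not the course's own code: the loop condition is replaced by a fixed iteration count, "do some work" is left as a comment, return values of the Pthreads calls are not checked, and the names `counter_lock`, `worker`, `run_two_threads`, and `ITERATIONS` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define ITERATIONS 100000   /* stands in for "while ( condition )" */

static long counter = 0;    /* the shared global from the slides */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Body shared by Thread A and Thread B. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        /* do some work */
        pthread_mutex_lock(&counter_lock);     /* acquire lock */
        counter++;                             /* load-add-store, now exclusive */
        pthread_mutex_unlock(&counter_lock);   /* release lock */
    }
    return NULL;
}

/* Start both threads, wait for them, and return the final count. */
static long run_two_threads(void)
{
    pthread_t a, b;
    counter = 0;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;
}
```

With the lock in place, `run_two_threads()` should return exactly `2 * ITERATIONS`. Removing the lock and unlock calls gives the defective program of slides 13 and 14, where some increments can be lost. POSIX also specifies that mutex lock and unlock synchronize memory between threads, which bears on the question at the end of slide 16.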
slide 17/19
ISA and microarchitecture support for concurrency

Instruction sets have to provide special atomic instructions to allow software implementation of synchronization facilities such as mutexes and semaphores.

An atomic instruction (or a sequence of instructions that is intended to provide the same kind of behaviour, such as MIPS LL/SC) typically works like this:

◮ memory read is attempted at some location;
◮ some kind of write data is generated;
◮ memory write to the same location is attempted.

slide 18/19

The key aspects of atomic instructions are:

◮ the whole operation succeeds or the whole operation fails, in a clean way that can be checked after the attempt was made;
◮ if two or more threads attempt the operation, such that the attempts overlap in time, one thread will succeed, and all the other threads will fail.
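The read, generate, write pattern and its checkable success or failure can be seen from C through the C11 `<stdatomic.h>` interface, which a compiler maps onto whatever atomic instructions the ISA provides. The following is a minimal sketch of a spinlock built on compare-and-swap. It is an illustration under assumptions, not a production lock: the type `spinlock` and the functions `spin_try_acquire`, `spin_acquire`, and `spin_release` are hypothetical names, and the lock word convention (0 free, 1 held) is chosen here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Lock word: 0 means free, 1 means held. */
typedef struct { atomic_int held; } spinlock;

/* One attempt to take the lock. The compare-exchange atomically reads
   l->held and, only if it is 0, writes 1. It returns true on success and
   false on failure, so the outcome can be checked after the attempt. */
static bool spin_try_acquire(spinlock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->held, &expected, 1);
}

/* Keep attempting until one attempt succeeds. */
static void spin_acquire(spinlock *l)
{
    while (!spin_try_acquire(l))
        ;   /* busy-wait */
}

/* Release the lock by storing 0 to the lock word. */
static void spin_release(spinlock *l)
{
    atomic_store(&l->held, 0);
}
```

If two threads call `spin_try_acquire` on a free lock at overlapping times, exactly one call returns true. A lock would be set up with `atomic_init(&l.held, 0)` before first use.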
slide 19/19
Next (and last!) lecture, Thu Apr 10

◮ atomic instructions
◮ overview of parallel programming with Pthreads

Related reading in Hennessy & Patterson: Section 5.5