Slides for Lecture 23
ENCM 501: Principles of Computer Architecture
Winter 2014 Term

Steve Norman, PhD, PEng
Electrical & Computer Engineering
Schulich School of Engineering
University of Calgary

3 April, 2014
slide 2/19

Previous Lecture

◮ introduction to cache coherency protocols
slide 3/19

Today’s Lecture

◮ the MSI protocol for cache coherency
◮ snooping to support MSI and similar protocols
◮ race conditions at the ILP and TLP levels
◮ use of locks to manage TLP races
◮ introduction to ISA support for locks

Related reading in Hennessy & Patterson: Sections 5.2, 5.5
slide 4/19

Quick review: Caches in multi-core systems

These points were made in the previous lecture:

◮ In a writeback cache in a uniprocessor (one-core, no SMT) system, the possible states of a cache block are invalid, clean, or dirty.
◮ For that writeback cache in a uniprocessor system, block states usually change as a result of loads and stores processed within the single core. (Messy detail: actions such as DMA by I/O devices can also change the states of blocks.)
◮ In a multicore system, one of many possible choices for the set of cache block states is MSI: modified, shared, or invalid. State changes in a cache in one core may be triggered by actions in another core.
slide 5/19

Example quad-core system, with private L1 caches and a shared L2 cache

[Figure: cores 0 to 3, each with its own L1 I-cache and L1 D-cache, connected by a bus to a shared unified L2 cache and a DRAM controller, which leads to off-chip DRAM.]
slide 6/19

Example of propagating writes between cores

Scenario from the previous lecture: before Thread A writes to X, there is a copy of X in all 5 caches, with a status of shared.

    time   Thread A (core 0)   Thread B (core 1)   Thread C (core 2)
     |     writes X=42
     |                         reads X
     v                                             reads X

In an MSI protocol, what effects do the write and reads have on the states of cache blocks that contain X?
slide 7/19

Example of write serialization

Scenario: before Thread A writes to Y, there is a copy of Y in the caches of cores 0 and 1, with a status of shared.

    time   Thread A (core 0)   Thread B (core 1)
     |     writes Y=66
     |                         writes Y=77
     v                         reads Y

In an MSI protocol, what effects do the writes and read have on the states of cache blocks that contain Y?
slide 8/19

Complete details of an MSI Protocol

See Figures 5.5, 5.6, and 5.7 in the textbook, and the related discussion.

Note that although this is probably the simplest possible cache coherency protocol, its set of state transitions and related actions is substantially more complicated than what is required for a uniprocessor cache system!
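As a rough, hypothetical illustration of the idea (a sketch in C, not a transcription of the textbook's figures; the state and function names here are invented), each cache block carries an MSI state, and a snooping controller updates it on local writes and on bus traffic from other cores. The bus-side handlers below correspond to the snooping responsibilities described on the next slide.

    /* Hypothetical sketch of MSI cache-block states and transitions.
       All names are invented for this example. */

    enum msi_state { INVALID, SHARED, MODIFIED };

    /* The local core writes to the block. */
    enum msi_state on_cpu_write(enum msi_state s)
    {
        switch (s) {
        case INVALID:   /* write miss: broadcast a write miss on the bus */
        case SHARED:    /* broadcast an invalidate to the other caches   */
        case MODIFIED:  /* already exclusive and dirty: no bus traffic   */
            break;
        }
        return MODIFIED;   /* in every case the block ends up modified */
    }

    /* Another core's write miss or invalidate is seen on the bus. */
    enum msi_state on_bus_write(enum msi_state s)
    {
        if (s == MODIFIED) {
            /* must write the dirty block back before giving it up */
        }
        return INVALID;
    }

    /* Another core's read miss is seen on the bus. */
    enum msi_state on_bus_read(enum msi_state s)
    {
        if (s == MODIFIED) {
            /* write back, then keep a clean copy */
            return SHARED;
        }
        return s;   /* SHARED stays SHARED; INVALID stays INVALID */
    }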
slide 9/19

Snooping

The example scenarios presented on slides 6 and 7 require all caches to snoop on a common bus.

Consider the L1 D-cache in core 0 of our quad-core example. In addition to doing all the things a uniprocessor cache would need to do, it must be capable of

◮ responding to write misses broadcast from other cores—this may require writeback from the core 0 cache;
◮ responding to invalidate commands broadcast from other cores—this may require invalidation within the core 0 cache.

The requirements are symmetric for the L1 D-caches in cores 1, 2 and 3.
slide 10/19

Snooping works reasonably well in a small SMP system—for example, two to four cores on a single chip.

Scaling is a problem, as bus traffic increases at least proportionally to the number of cores. There is potential for a lot of waste. For example, in an eight-core system, suppose threads in two cores, say cores 0 and 1, are frequently accessing shared memory. Then cores 2–7 are exposed to and slowed down by a stream of traffic that is really just a conversation between cores 0 and 1.

A more scalable alternative to snooping is use of a directory-based protocol. Time prevents us from looking at directory-based protocols in ENCM 501.
slide 11/19

Race conditions

A race condition is a situation in which the evolution of the state of some system depends critically on the exact order of nearly-simultaneous events in two or more subsystems.

Race conditions can exist as a pure hardware problem in the design of sequential logic circuits. They can also arise at various levels of parallelism in software systems, including instruction-level parallelism (ILP) and thread-level parallelism (TLP).
slide 12/19

ILP race condition example

Consider a Tomasulo-based microarchitecture for a uniprocessor system. What is the potential race condition in the following code sequence?

    S.D  F0, 0(R8)
    S.D  F2, 24(R29)

How could hardware design eliminate the race condition?

Note: There are similar race condition problems related to out-of-order execution of stores and loads.
slide 13/19

TLP race condition example

Two threads are running in the same loop, both reading and writing a global variable called counter . . .

Thread A and Thread B both execute:

    while ( condition ) {
        do some work
        counter++;
    }

Before considering the race condition in a multicore system, let’s address this question: What is the potential failure if this program is running in a uniprocessor system?
slide 14/19

TLP race condition example, continued

The same defective program, now with the threads running simultaneously in two cores . . .

Thread A and Thread B both execute:

    while ( condition ) {
        do some work
        counter++;
    }

Why does a cache coherency protocol such as MSI not prevent failure?

Does it help if the ISA has memory-register instructions, such as addq $1, (%rax) in x86-64? Doesn’t that do read-add-write in a single instruction?
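To see why coherency alone is not enough, it helps to spell out what counter++ actually does. The following is a minimal sketch in C; the temporary variable is invented for illustration, since a real compiler would use a register.

    long counter = 0;

    /* What "counter++" typically becomes after compilation: three
       separate steps, with room for another thread to run (or for
       another core to access memory) between any two of them. */
    void increment(void)
    {
        long tmp = counter;   /* load  */
        tmp = tmp + 1;        /* add   */
        counter = tmp;        /* store */
    }

    /* One bad interleaving (a "lost update"), with counter == 5:
         A: load  counter -> 5
         B: load  counter -> 5
         A: store 6
         B: store 6           <- A's increment is lost
       MSI keeps each individual read and write coherent, but does
       nothing to keep the three steps together. On x86, even
       addq $1, (%rax) is not atomic across cores unless it carries
       a LOCK prefix: the read and the write are still separate
       cache transactions. */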
slide 15/19

Making the two-thread program safe

Thread A and Thread B both execute:

    while ( condition ) {
        do some work
        acquire lock
        counter++;
        release lock
    }

Setting up some kind of lock (often called a mutual exclusion lock, or mutex) prevents the failures seen in the past two slides. Let’s make some notes about how this would work in a uniprocessor system.
slide 16/19

Thread A and Thread B both execute:

    while ( condition ) {
        do some work
        acquire lock
        counter++;
        release lock
    }

What about the multicore case? The lock will prevent Thread B from reading counter while Thread A is doing a load-add-store update of counter. Is that enough?

Consider the MSI protocol again. What must happen either as a result of Thread A releasing the lock, or as a result of Thread B acquiring the lock?
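The next lecture introduces Pthreads, but as a preview, here is a minimal, hypothetical C sketch of the locked version of this program. The fixed iteration count stands in for the unspecified loop condition, and the "do some work" step is omitted.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERATIONS 1000000   /* stands in for "condition" */

    long counter = 0;
    pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERATIONS; i++) {
            pthread_mutex_lock(&counter_lock);     /* acquire lock */
            counter++;                             /* safe load-add-store */
            pthread_mutex_unlock(&counter_lock);   /* release lock */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* With the lock, this prints 2000000 every time; without
           the lock, the total would vary from run to run. */
        printf("counter = %ld\n", counter);
        return 0;
    }

Built with the -pthread compiler option, this runs on any POSIX system.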
slide 17/19

ISA and microarchitecture support for concurrency

Instruction sets have to provide special atomic instructions to allow software implementation of synchronization facilities such as mutexes and semaphores.

An atomic instruction (or a sequence of instructions that is intended to provide the same kind of behaviour, such as MIPS LL/SC) typically works like this:

◮ a memory read is attempted at some location;
◮ some kind of write data is generated;
◮ a memory write to the same location is attempted.
slide 18/19

The key aspects of atomic instructions are:

◮ the whole operation succeeds or the whole operation fails, in a clean way that can be checked after the attempt was made;
◮ if two or more threads attempt the operation, such that the attempts overlap in time, one thread will succeed, and all the other threads will fail.
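As a concrete illustration of this attempt-check-retry pattern, here is a hedged C11 sketch of an atomic increment built from compare-and-swap; the function name is invented. On MIPS, a compiler would typically expand the compare-exchange into an LL/SC pair.

    #include <stdatomic.h>

    /* Atomically add 1 to *p, retrying until our attempt wins. */
    void atomic_increment(atomic_long *p)
    {
        long old = atomic_load(p);
        /* Try to replace old with old+1. If another thread changed *p
           between our read and our write, the exchange fails cleanly,
           old is reloaded with the current value, and we try again. */
        while (!atomic_compare_exchange_weak(p, &old, old + 1)) {
            /* lost the race this time; retry with the fresh value */
        }
    }

When several threads attempt this at once, exactly one compare-exchange succeeds per round and the rest fail and retry, which is exactly the behaviour described in the two bullets above.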
slide 19/19

Next (and last!) lecture, Thu Apr 10

◮ atomic instructions
◮ overview of parallel programming with Pthreads

Related reading in Hennessy & Patterson: Section 5.5