ENCM 501: Principles of Computer Architecture
Slides for Lecture 21, Winter 2014 Term
Steve Norman, PhD, PEng
Electrical & Computer Engineering, Schulich School of Engineering, University of Calgary
27 March, 2014

slide 2/22: Previous Lecture

◮ more examples of Tomasulo's algorithm
◮ reorder buffers and speculation
◮ introduction to multiple issue of instructions

Related reading in Hennessy & Patterson: Sections 3.5–3.8

slide 3/22: Today's Lecture

◮ overview of multiple issue of instructions
◮ quick note about limitations of ILP
◮ processes and threads
◮ introduction to SMP architecture
◮ introduction to memory coherency

Related reading in Hennessy & Patterson: Sections 3.7, 3.8, 3.10, 5.1, 5.2

slide 4/22: Multiple issue

Multiple issue means issue of 2 or more instructions within a single processor core, in a single clock cycle. This is superscalar execution.

Many quite different processor organization schemes support multiple issue. Examples:

◮ In-order processing of instructions in two parallel pipelines (e.g., the ARM Cortex-A8 described in textbook Section 3.13). Obviously no more than two instructions issue per clock.
◮ In-order issue, out-of-order execution, in-order commit of up to six instructions per clock (current x86-64 microarchitectures, e.g., the Intel Core i7, also described in textbook Section 3.13).

slide 5/22: Multiple issue: Instruction unit requirements

The instruction unit needs to be able to fetch multiple instructions per clock. This has obvious implications for the design of L1 I-caches!

The instruction unit also has to look at pending instructions and decide, every clock cycle, how many instructions are safe to issue in parallel.

Let's look at an easy example for MIPS32, in a microarchitecture that can issue a maximum of two instructions per clock:

    LW   R8, 24(R9)
    ADDU R10, R11, R4

If the above instructions are the next two in line in program order, can they be issued in parallel? Why or why not?

slide 6/22: Multiple issue, continued

Here is another easy example. The next three instructions in line, in program order, are:

    LW   R8, 24(R9)
    ADDU R16, R16, R8
    SLT  R9, R8, R4

How will these instructions be managed?

In general, the problem of deciding how many instructions to issue in parallel is difficult. See textbook Section 3.8 for further detail. (My favourite sentence in that section is, "Analyze all the dependencies among the instructions in the issue bundle." That is only one of several things that have to happen every clock cycle!)
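To make the dependency check concrete, here is a toy sketch in C of the register comparison a dual-issue unit must perform. This is not how real hardware is specified (a real issue unit is combinational logic that also checks structural hazards, load/store port availability, and more), and the instruction record and field names below are hypothetical, invented only for illustration.

    /* Toy model of a dual-issue dependency check for simple
     * MIPS32-like register-register and load instructions. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        const char *mnemonic;
        int dest;   /* destination register number, -1 if none */
        int src1;   /* first source register, -1 if none */
        int src2;   /* second source register, -1 if none */
    } instr;

    /* Two instructions may issue in the same cycle only if the second
     * neither reads (RAW) nor writes (WAW) the first's destination.
     * A WAR hazard is harmless here: with in-order issue, both
     * instructions read their sources before either writes back. */
    bool can_dual_issue(instr a, instr b)
    {
        if (a.dest < 0)
            return true;                          /* a writes nothing */
        if (b.src1 == a.dest || b.src2 == a.dest)
            return false;                         /* RAW hazard */
        if (b.dest == a.dest)
            return false;                         /* WAW hazard */
        return true;
    }

    int main(void)
    {
        instr lw    = { "LW R8, 24(R9)",     8,  9, -1 };
        instr addu1 = { "ADDU R10, R11, R4", 10, 11,  4 };
        instr addu2 = { "ADDU R16, R16, R8", 16, 16,  8 };

        printf("%-18s + %-18s : %s\n", lw.mnemonic, addu1.mnemonic,
               can_dual_issue(lw, addu1) ? "dual-issue OK" : "second must wait");
        printf("%-18s + %-18s : %s\n", lw.mnemonic, addu2.mnemonic,
               can_dual_issue(lw, addu2) ? "dual-issue OK" : "second must wait");
        return 0;
    }

Run on the slides' two examples, this reports that LW and ADDU R10, R11, R4 can issue together (they touch disjoint registers), while ADDU R16, R16, R8 must wait: it reads R8, which the LW has not yet produced, a RAW dependence.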

slide 7/22: Limitations of IPC

A shortage of time in ENCM 501 requires brutal oversimplification, but . . .

Roughly speaking, computer architects have discovered that no matter how many transistors are thrown at the problem, it's very hard to get more than an average of about 2.5 instructions completed per cycle, due mostly to dependencies between instructions.

See textbook Figure 3.43 for real data: for 19 different SPEC CPU2006 benchmark programs run on a Core i7, the best CPI is 0.44 (1/0.44 ≈ 2.27 instructions per clock), and some programs have CPIs greater than 1.0!

slide 8/22: So why try to support issue of 6 instructions/cycle?

A few slides back: "In-order issue, out-of-order execution, in-order commit of up to six instructions per clock (current x86-64 microarchitectures, e.g., the Intel Core i7)."

It might pay to run two or more threads at the same time in the same core! That might get the average instruction issue rate close to 6 per cycle.

Intel's term for this idea is hyper-threading. (As of March 26, 2014, the Wikipedia article titled Hyper-threading provides a good introduction.)

A more generic term for the same idea is simultaneous multithreading (SMT); see textbook Section 3.12 for discussion.

slide 9/22: Processes and Threads

Suppose that you build an executable from a bunch of C source files. When you run the executable, it's almost certain that the program will run in a single thread, unless your source files have explicitly asked in some way for multiple threads.

(I use the word almost in case I am wrong about the state of the art for C compilers and program linkers.)

All the C code used in ENCM 501 up to Assignment 7 is single-threaded. (And Assignment 8 has no C code at all.)

slide 10/22: A one-thread process

[Diagram: a timeline from start to finish; a single thread does ". . . work, work, work . . ."]

Program results are generated as if instructions are processed in program order, as if each instruction finishes as the next instruction starts.

If a processor is able to find significant ILP in the program, in the best case, CPI significantly less than 1 can be achieved. In practice, due to (a) difficulty in finding ILP, (b) TLB misses, and (c) cache misses, CPI greater than 1 is much more likely than CPI less than 1.

Having many cores cannot improve the overall time spent on any one single one-thread process, but may help significantly with throughput of multiple one-thread processes.

slide 11/22: A process with five threads

[Diagram: a timeline from start to finish; four "worker" threads work while the main thread waits.]

Each thread has its own PC, its own stack, and its own sets of GPRs and FPRs. If there are at least four cores, all four worker threads can run at the same time.

Speedup relative to the one-thread version can be close to 4.0, but . . . watch out for Amdahl's law. (See the sketch and the worked formula after slide 12.)

slide 12/22: A process with five threads, continued

Important: All five threads share a common virtual address space. The OS kernel maintains a single page table for the process.

Shared memory provides an opportunity to the programmer, but also a perhaps surprisingly complex challenge.

Other resources, such as open files, are also shared by all threads belonging to a process.
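As a concrete illustration of the five-thread picture on slide 11, here is a minimal POSIX-threads sketch in C. It is my example, not from the slides; the worker body is a stand-in for real work.

    /* A main thread starts four workers, waits for them, then finishes.
     * Compile with: gcc -pthread five_threads.c */
    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 4

    /* Each worker gets its own stack and register state; all five
     * threads share the process's single virtual address space. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("worker %ld: work, work, work ...\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NWORKERS];

        for (long i = 0; i < NWORKERS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);

        /* ... main thread waits ... */
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(tid[i], NULL);

        return 0;
    }

On a machine with at least four cores, the four calls to worker really can run simultaneously, one per core.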
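The Amdahl's law caution on slide 11 can be stated as the usual textbook formula (standard material, not taken from these slides): if a fraction f of the original execution time can be spread evenly across N cores, then

    speedup = 1 / ((1 - f) + f/N)

For example, with f = 0.95 and N = 4, speedup = 1 / (0.05 + 0.95/4) = 1 / 0.2875 ≈ 3.48, noticeably below the ideal 4.0, and the serial fraction caps the speedup at 1/(1 - f) = 20 no matter how many cores are added.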

slide 13/22: SMP: Symmetric MultiProcessor architecture

SMP is currently the dominant architecture for processor circuits in smartphones, laptops, desktops, and small servers.

Hennessy and Patterson often use the term centralized shared-memory multiprocessor to mean the same thing as SMP.

The key feature of SMP architecture is a single main memory, to which all cores have equally fast access.

The Intel Core i7 is a well-known SMP chip . . .

slide 14/22: Intel Core i7 cache and DRAM arrangement

[Diagram: cores 0 through 3, each with private caches (L1 I, L1 D, and a unified L2); a shared L3 cache; a DRAM controller; and a bus to off-chip DRAM modules.]

The above diagram shows relationships between caches. See textbook page 29 for the physical layout of a Core i7 die.

slide 15/22: DSM/NUMA multiprocessor architecture

DSM and NUMA are two names for the same thing.

DSM: Distributed shared memory
NUMA: Nonuniform memory access

This kind of architecture has multiple main memories. Processors have relatively fast access to their local main memories, and relatively slow access to other main memories.

This kind of architecture works well for larger servers, with too many cores for effective SMP.

For the rest of ENCM 501, we will look at SMP only.

slide 16/22: "Private" caches are not totally private!

[Diagram: core 0 and core 1, each with private L1 I and L1 D caches; a shared, unified L2; a DRAM controller; and a connection to off-chip DRAM.]

Process P has two threads: T1 and T2. T1 is running in core 0 and T2 is running in core 1.

T1 and T2 both frequently read global variable G. (Remember, T1 and T2 share a common virtual address space!)

How many copies of G are there in the memory system?

slide 17/22: Cache coherency

Multiprocessor systems require a coherent memory system. The concept of coherence is defined and explained over the next three slides.

The material is a rephrasing of the discussion on textbook pages 352–353, in terms of actions within a multicore SMP chip.

slide 18/22: Cache coherency, continued

Let's continue. Suppose that T1 only reads G, but T2 makes frequent reads from and occasional writes to G.

Q1: Why is it unacceptable to allow T1 and T2 to proceed with different "versions" of G after T2 writes to G?

Q2a: What would be a perfect, but impossible solution to the problem?

Q2b: A pretty-good, but also impossible solution?

Q3: What are some practical solutions?

(A small C sketch of the T1/T2 scenario follows below.)
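To make the T1/T2 scenario of slides 16 and 18 concrete, here is a hypothetical C version of it; the names (G, t1_reader, t2_writer) are mine, not from the slides. At the C language level this is a data race, and portable code would use C11 atomics or a mutex; the point here is only what the hardware must do with the cached copies of G.

    #include <pthread.h>
    #include <stdio.h>

    /* One global in the shared virtual address space. "volatile" only
     * stops the compiler from keeping G in a register so each iteration
     * really loads from the memory system; it does NOT make the race
     * safe at the C level. */
    volatile int G = 0;

    static void *t1_reader(void *arg)        /* runs in, say, core 0 */
    {
        (void)arg;
        for (int i = 0; i < 5; i++)
            printf("T1 sees G = %d\n", G);   /* may hit in core 0's L1 D */
        return NULL;
    }

    static void *t2_writer(void *arg)        /* runs in, say, core 1 */
    {
        (void)arg;
        for (int i = 1; i <= 5; i++)
            G = i;   /* hardware must invalidate or update other copies */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, t1_reader, NULL);
        pthread_create(&t2, NULL, t2_writer, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

With a coherent memory system, every value T1 prints is one that T2 actually stored, and once T1 has seen a newer value it never goes back to an older one; without coherence, core 0 could keep serving a stale cached copy of G indefinitely after T2's writes.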
