Slides for Lecture 20 ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary 25 March, 2014
slide 2/16 ENCM 501 W14 Slides for Lecture 20 Previous Lecture
◮ more examples of Tomasulo’s algorithm
◮ reorder buffers and speculation
◮ multiple issue of instructions
◮ other ILP topics, if time permits
Related reading in Hennessy & Patterson: Sections 3.5–3.8
slide 3/16 ENCM 501 W14 Slides for Lecture 20 Today’s Lecture ◮ WHAT? Related reading in Hennessy & Patterson: Sections 3.5–3.6 AND WHAT ELSE?
slide 4/16 ENCM 501 W14 Slides for Lecture 20 Resolution of practical RAW, WAR and WAW hazards (repeat from lec. 19, with edits for clarity)
RAW: S.D depends on the MUL.D result, and ADD.D depends on the L.D result.
WAR: S.D must use the MUL.D result, not the L.D result.
WAW: ADD.D must use the L.D result, not the MUL.D result, and when all these instructions are done, F2 must contain the L.D result.
    MUL.D F2, F4, F6
    S.D   F2, 0(R8)
    SUB.D F0, F12, F14
    L.D   F2, 0(R9)
    ADD.D F8, F8, F2
Let’s trace how Tomasulo’s algorithm handles this sequence.
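The issue-time bookkeeping behind this trace can be sketched in a few lines of Python. This is a simplified model, not the full algorithm: it tracks only which instruction each FP register is waiting on (the role of the Qi field), it uses instruction names as stand-ins for reservation station tags, and it ignores the integer base registers R8 and R9. The names `program`, `rename`, `qi` and `waits_on` are made up for this illustration.

```python
# Simplified sketch of issue-time renaming (assumption: instruction names
# stand in for reservation station tags; integer registers are ignored).
program = [
    ("MUL.D", "F2", ["F4", "F6"]),
    ("S.D",   None, ["F2"]),          # a store has no register destination
    ("SUB.D", "F0", ["F12", "F14"]),
    ("L.D",   "F2", []),
    ("ADD.D", "F8", ["F8", "F2"]),
]

def rename(program):
    qi = {}        # FP register -> instruction that will produce its value
    waits_on = {}  # instruction -> {source register: producer, or None}
    for name, dest, sources in program:
        # Sources are looked up before the destination is renamed.
        waits_on[name] = {src: qi.get(src) for src in sources}
        if dest is not None:
            qi[dest] = name
    return qi, waits_on

qi, waits_on = rename(program)
print(waits_on["S.D"]["F2"])    # MUL.D
print(waits_on["ADD.D"]["F2"])  # L.D
print(qi["F2"])                 # L.D
```

Because S.D records its producer at issue, the later L.D cannot disturb it (WAR), and because the register status for F2 ends up naming L.D, a late MUL.D result is not written to F2 (WAW).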
slide 5/16 ENCM 501 W14 Slides for Lecture 20 Loop example
This is from page 179 of the textbook:
    Loop: L.D    F0, 0(R1)
          MUL.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDIU R1, R1, -8
          BNE    R1, R2, Loop
Let’s make some notes about the DADDIU and BNE instructions. Let’s assume that the loop starts with R1 = 0x600040 and R2 = 0x600000. Let’s trace how Tomasulo’s algorithm might handle the first two passes through the loop.
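As a quick check on the DADDIU and BNE behaviour, here is a small Python sketch that steps the integer registers through the loop with the starting values given above. The function name `trace_loop` is made up for this illustration, and the FP work is ignored.

```python
def trace_loop(r1, r2, step=-8):
    """Return the address used by L.D and S.D on each pass through the loop."""
    addresses = []
    while True:
        addresses.append(r1)  # L.D and S.D both use 0(R1)
        r1 += step            # DADDIU R1, R1, -8
        if r1 == r2:          # BNE falls through once R1 equals R2
            break
    return addresses

addrs = trace_loop(0x600040, 0x600000)
print(len(addrs))                    # 8
print([hex(a) for a in addrs[:2]])   # ['0x600040', '0x600038']
print(hex(addrs[-1]))                # 0x600008
```

With these values the loop makes 8 passes, and the doubleword at the address in R2 is never touched.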
slide 6/16 ENCM 501 W14 Slides for Lecture 20 History of Tomasulo’s algorithm Tomasulo developed the algorithm in the 1960’s. Note that microprocessors did not exist until the early 1970’s! Also, in the 60’s and 70’s, memory was fast enough relative to processors that caches were unnecessary. The algorithm was deployed in the IBM 360/91, a computer designed to crunch FP numbers as fast as FP numbers could possibly be crunched in the 1960’s. (Web search for ibm 360/91 yields many fantastic results.) Processor designs started to use Tomasulo’s algorithm again in the 1990’s, when it became clear that it was important to find ways to work around unpredictable delays caused by cache misses.
slide 7/16 ENCM 501 W14 Slides for Lecture 20 Costs of the CDB (common data bus) In a typical clock cycle, some reservation station will broadcast a result on the CDB, and other reservation stations and the register file will look at the result to see if it’s useful. Transmitting the result and receiving the result both have energy costs. A complex instruction unit, reservation stations, and related hardware require lots of transistors. If Moore’s law had not applied for so many decades, we would not see Tomasulo’s algorithm used as a basis for design of modestly priced processor chips. It’s possible, in some cycles, that two or more reservation stations will simultaneously try to broadcast their results. Why is this not a fatal defect in Tomasulo’s algorithm?
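One possible line of reasoning about the last question is that the CDB can be arbitrated: one requester wins the bus, and the others hold their results and retry in a later cycle. The Python sketch below illustrates that idea only. It assumes a single CDB and a fixed priority order (the order of the list), neither of which is stated on the slide, and the names `arbitrate_cdb` and `drain` are made up.

```python
def arbitrate_cdb(ready):
    """Pick one broadcaster for this cycle.

    ready: list of (tag, value) pairs that want the bus, in priority order
    (an assumed policy). Returns (winner, losers); losers retry next cycle.
    """
    if not ready:
        return None, []
    return ready[0], ready[1:]

def drain(ready):
    """Return the broadcasts in the order they appear on the CDB, one per cycle."""
    cycles = []
    while ready:
        winner, ready = arbitrate_cdb(ready)
        cycles.append(winner)
    return cycles

print(drain([("Add1", 3.0), ("Mult1", 8.0)]))
# [('Add1', 3.0), ('Mult1', 8.0)]
```

In this model a collision costs the loser one cycle of delay, but no result is lost.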
slide 8/16 ENCM 501 W14 Slides for Lecture 20 Data hazards in the memory system
Why is this a potential RAW hazard?
    S.D F0, 48(R8)
    L.D F2, 8(R9)
And why is this a potential WAR hazard?
    L.D F4, 72(R10)
    S.D F6, 0(R11)
Finally, why is this a potential WAW hazard?
    S.D F8, (R12)
    S.D F10, (R13)
All of these hazards are important problems in processors that may complete instructions out of order. Due to lack of time, we won’t look at solutions in detail, but be aware that processor designers must deal with these hazards correctly.
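The word "potential" matters here: whether a hazard exists depends on the run-time values of the base registers, not on the instruction text. The Python sketch below makes that concrete. It assumes aligned 8-byte accesses, so two accesses conflict only when their effective addresses are equal. The register values and the names `effective_address` and `memory_hazard` are made up for this illustration.

```python
def effective_address(offset, base):
    """Address used by L.D or S.D: offset plus the base register's value."""
    return offset + base

def memory_hazard(first, second):
    """Classify a pair of memory accesses given in program order.

    Each access is (kind, address), with kind "load" or "store".
    Assumes aligned 8-byte accesses, so a conflict needs equal addresses.
    """
    (kind1, addr1), (kind2, addr2) = first, second
    if addr1 != addr2:
        return None
    if kind1 == "store" and kind2 == "load":
        return "RAW"
    if kind1 == "load" and kind2 == "store":
        return "WAR"
    if kind1 == "store" and kind2 == "store":
        return "WAW"
    return None  # two loads never conflict

# Hypothetical register contents chosen so that the first pair collides.
R8, R9 = 0x1000, 0x1028
store = ("store", effective_address(48, R8))  # S.D F0, 48(R8)
load = ("load", effective_address(8, R9))     # L.D F2, 8(R9)
print(memory_hazard(store, load))             # RAW

# With a different base value the same two instructions are independent.
print(memory_hazard(store, ("load", effective_address(8, 0x2000))))  # None
```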
slide 9/16 ENCM 501 W14 Slides for Lecture 20 Tomasulo’s algorithm and branch prediction
Consider this code fragment:
    BEQ   R8, R0, L99
    S.D   F0, (R10)
    ADD.D F2, F2, F4
Suppose the branch is incorrectly predicted as not taken, and S.D and ADD.D get issued while BEQ waits for some earlier instruction to provide a value for R8. If Tomasulo’s algorithm does nothing beyond what has been presented so far in lectures, what will prevent S.D from making an incorrect update to memory, and what will prevent ADD.D from making an incorrect update to F2?
slide 10/16 ENCM 501 W14 Slides for Lecture 20 Tomasulo’s algorithm and exceptions
    MUL.D F2, F4, F6
    S.D   F2, 0(R8)
    SUB.D F0, F12, F14
    L.D   F2, 0(R9)
    ADD.D F8, F8, F2
Suppose MUL.D gets delayed because it has to wait until a result for F6 is ready. That will delay the execution of S.D. Meanwhile, Tomasulo’s algorithm may allow completion of SUB.D, L.D, and ADD.D. What kind of problem is created if S.D eventually results in a page fault exception?
slide 11/16 ENCM 501 W14 Slides for Lecture 20 Out-of-order execution, in-order completion The version of Tomasulo’s algorithm presented in textbook Section 3.5 has scalar issue (that is, at most one instruction issued per clock cycle), out-of-order execution, and out-of-order completion. Section 3.6 modifies the algorithm to include a circuit called a reorder buffer (ROB), which will enforce in-order completion. Use of a reorder buffer solves the branch prediction and exception problems described on slides 9 and 10.
slide 12/16 ENCM 501 W14 Slides for Lecture 20 In a processor with a reorder buffer, issue of an instruction sends information related to the instruction both to a reservation station and to the reorder buffer. A reservation station for a store is responsible for address computation only; it is not allowed to write to memory. The reorder buffer is a FIFO queue: instructions enter in program order, and leave in program order. When an instruction gets to the head of the ROB, it can be committed as soon as its results are known. Examples:
◮ An ADD.D can be committed if a reservation station has provided the sum to the reorder buffer.
◮ An S.D can be committed if both the data to be stored and the address to be used are ready.
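The FIFO commit rule can be sketched in Python as follows. This is a minimal model under stated assumptions: each entry carries only a name and a ready flag, and any number of ready entries may commit back to back, whereas real hardware limits commits per clock cycle. The class name `ReorderBuffer` and its methods are made up for this illustration.

```python
from collections import deque

class ReorderBuffer:
    """Minimal FIFO model of a ROB: entries commit only from the head."""

    def __init__(self):
        self.entries = deque()  # entries[0] is the head (oldest instruction)

    def issue(self, name):
        """Add an instruction at the tail, in program order."""
        entry = {"name": name, "ready": False}
        self.entries.append(entry)
        return entry

    def commit_ready(self):
        """Commit from the head while the head is ready; return committed names."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            committed.append(self.entries.popleft()["name"])
        return committed

rob = ReorderBuffer()
mul = rob.issue("MUL.D")
sd = rob.issue("S.D")
sub = rob.issue("SUB.D")

sub["ready"] = True           # SUB.D finishes execution first
print(rob.commit_ready())     # []  (MUL.D at the head is not ready)
mul["ready"] = True
print(rob.commit_ready())     # ['MUL.D']
sd["ready"] = True            # store data and address are both known
print(rob.commit_ready())     # ['S.D', 'SUB.D']
```

SUB.D executes out of order but has to wait in the buffer until everything ahead of it has committed.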
slide 13/16 ENCM 501 W14 Slides for Lecture 20 Register file changes:
◮ The Qi field for each register is replaced by a Busy flag and a Reorder # field. Busy = 0 means the register is up-to-date; Busy = 1 means the register is waiting for a result from whatever entry in the reorder buffer matches the Reorder #.
◮ The register file does not watch the CDB for results. The ROB must watch the CDB for results for all of the instructions within the ROB that don’t yet have results.
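A small Python sketch of how a source operand might be looked up at issue time with these fields is given below. The dictionary layout and the name `read_operand` are made up for this illustration. The middle case, where a result that is already in the ROB is read directly from it, is an assumption beyond what the slide states.

```python
def read_operand(reg, regfile, rob):
    """Return (value, tag) for a source register at issue time.

    tag is None when the value is available; otherwise tag is the ROB
    entry number that the reservation station must wait for.
    """
    status = regfile[reg]
    if not status["busy"]:
        return status["value"], None      # Busy = 0: register is up-to-date
    entry = rob[status["reorder"]]
    if entry["ready"]:
        return entry["value"], None       # assumed: result already in the ROB
    return None, status["reorder"]        # wait for a CDB broadcast

# Hypothetical register file and ROB contents.
regfile = {
    "F2": {"busy": True, "reorder": 3, "value": 0.0},
    "F4": {"busy": False, "reorder": None, "value": 1.5},
    "F6": {"busy": True, "reorder": 5, "value": 0.0},
}
rob = {3: {"ready": True, "value": 7.0}, 5: {"ready": False, "value": None}}

print(read_operand("F4", regfile, rob))  # (1.5, None)
print(read_operand("F2", regfile, rob))  # (7.0, None)
print(read_operand("F6", regfile, rob))  # (None, 5)
```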
slide 14/16 ENCM 501 W14 Slides for Lecture 20 The reservation stations and functional units work very much as before, except:
◮ the Qj and Qk fields hold ROB entry numbers instead of reservation station numbers;
◮ each reservation station has a Dest field to hold an ROB entry number;
◮ when a reservation station broadcasts its result on the CDB, it includes the Dest field value to help both the ROB and the other reservation stations.
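The effect of one tagged broadcast can be sketched in Python as below. The field names Qj, Qk, Vj, Vk and Dest follow the slides; the dictionary layout, the sample values and the function name `broadcast` are made up for this illustration.

```python
def broadcast(dest, value, rob, stations):
    """Model one CDB broadcast tagged with ROB entry number `dest`."""
    # The ROB records the result in the matching entry.
    rob[dest]["value"] = value
    rob[dest]["ready"] = True
    # Every waiting reservation station compares the tag with Qj and Qk.
    for rs in stations:
        if rs["Qj"] == dest:
            rs["Vj"], rs["Qj"] = value, None
        if rs["Qk"] == dest:
            rs["Vk"], rs["Qk"] = value, None

# Hypothetical state: the station that will write ROB entry 2 waits on entry 1.
rob = {1: {"ready": False, "value": None}, 2: {"ready": False, "value": None}}
stations = [{"Dest": 2, "Qj": 1, "Vj": None, "Qk": None, "Vk": 4.0}]

broadcast(1, 2.5, rob, stations)
print(rob[1])       # {'ready': True, 'value': 2.5}
print(stations[0])  # {'Dest': 2, 'Qj': None, 'Vj': 2.5, 'Qk': None, 'Vk': 4.0}
```

After the broadcast the waiting station has both operands and can start execution, while the register file itself has not been touched.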
slide 15/16 ENCM 501 W14 Slides for Lecture 20 The reorder buffer and safe speculation The key point about the ROB is that it can collect a large number of results without knowing whether those results should really be written to registers or memory. Consider a branch instruction that is mispredicted as taken.
◮ What happens to all the instructions that got into the ROB before the branch?
◮ What happens to the branch target instruction, the successor of the branch target instruction, etc., which got into the ROB after the branch?
The bad effect of the above scenario is a waste of time and energy. What are the important bad effects that were prevented?
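One way to picture the recovery is the Python sketch below, which discards every ROB entry younger than the mispredicted branch. It is a simplification: it does not model when the misprediction is detected or how fetch is redirected, and the entry names and the function name `squash_after` are made up for this illustration.

```python
from collections import deque

def squash_after(rob, branch_index):
    """Drop every entry younger than the branch; return the squashed names."""
    squashed = []
    while len(rob) > branch_index + 1:
        squashed.append(rob.pop()["name"])  # pop from the tail (youngest first)
    squashed.reverse()                      # report in program order
    return squashed

# Hypothetical ROB contents, oldest first.
rob = deque([
    {"name": "ADD.D"},     # issued before the branch
    {"name": "BEQ"},       # the mispredicted branch
    {"name": "target_0"},  # wrong-path instructions
    {"name": "target_1"},
])

print(squash_after(rob, branch_index=1))  # ['target_0', 'target_1']
print([e["name"] for e in rob])           # ['ADD.D', 'BEQ']
```

The squashed entries never reached the head of the buffer, so none of their results were written to registers or memory.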
slide 16/16 ENCM 501 W14 Slides for Lecture 20 More Topics for Today
As time permits . . .
◮ multiple issue of instructions
◮ limitations of ILP
slide 16/16 ENCM 501 W14 Slides for Lecture 20 Upcoming Topics
◮ Processes and threads.
◮ Multi-core processor circuits and their caches.
◮ Multi-core support for processes and threads.
Related reading in Hennessy & Patterson: Sections 5.1–5.2