Slides for Lecture 20



  1. ENCM 501 W14 Slides for Lecture 20

     Slides for Lecture 20
     ENCM 501: Principles of Computer Architecture
     Winter 2014 Term
     Steve Norman, PhD, PEng
     Electrical & Computer Engineering
     Schulich School of Engineering
     University of Calgary
     25 March, 2014

     slide 2/16
     Previous Lecture
     ◮ more examples of Tomasulo’s algorithm
     ◮ reorder buffers and speculation
     ◮ multiple issue of instructions
     ◮ other ILP topics, if time permits
     Related reading in Hennessy & Patterson: Sections 3.5–3.8

     slide 3/16
     Today’s Lecture
     ◮ WHAT?
     Related reading in Hennessy & Patterson: Sections 3.5–3.6
     AND WHAT ELSE?

     slide 4/16
     Resolution of practical RAW, WAR and WAW hazards (repeat from lec. 19, with edits for clarity)
     RAW: S.D depends on the MUL.D result, and ADD.D depends on the L.D result.
     WAR: S.D must use the MUL.D result, not the L.D result.
     WAW: ADD.D must use the L.D result, not the MUL.D result, and when all these instructions are done, F2 must contain the L.D result.

         MUL.D  F2, F4, F6
         S.D    F2, 0(R8)
         SUB.D  F0, F12, F14
         L.D    F2, 0(R9)
         ADD.D  F8, F8, F2

     Let’s trace how Tomasulo’s algorithm handles this sequence.

     slide 5/16
     Loop example
     This is from page 179 of the textbook:

         Loop: L.D     F0, 0(R1)
               MUL.D   F4, F0, F2
               S.D     F4, 0(R1)
               DADDIU  R1, R1, -8
               BNE     R1, R2, Loop

     Let’s make some notes about the DADDIU and BNE instructions.
     Let’s assume that the loop starts with R1 = 0x600040 and R2 = 0x600000.
     Let’s trace how Tomasulo’s algorithm might handle the first two passes through the loop.

     slide 6/16
     History of Tomasulo’s algorithm
     Tomasulo developed the algorithm in the 1960s. Note that microprocessors did not exist until the early 1970s! Also, in the 60s and 70s, memory was fast enough relative to processors that caches were unnecessary.
     The algorithm was deployed in the IBM 360/91, a computer designed to crunch FP numbers as fast as FP numbers could possibly be crunched in the 1960s. (A web search for ibm 360/91 yields many fantastic results.)
     Processor designs started to use Tomasulo’s algorithm again in the 1990s, when it became clear that it was important to find ways to work around unpredictable delays caused by cache misses.
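The register hazards in the five-instruction sequence on slide 4 can also be found mechanically. What follows is a minimal Python sketch, not something taken from the lecture or the textbook: the Instr tuple, the hand-written destination and source lists, and the find_hazards function are all assumptions made for illustration. It looks only at register names (no memory addresses, no timing) and pairs each register read with the most recent earlier write.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                # assembly text, for display only
    dest: Optional[str]      # register written, or None (a store writes memory)
    srcs: Tuple[str, ...]    # registers read


# Destination/source lists written out by hand for the slide-4 sequence.
PROGRAM = [
    Instr("MUL.D F2, F4, F6", "F2", ("F4", "F6")),
    Instr("S.D F2, 0(R8)", None, ("F2", "R8")),
    Instr("SUB.D F0, F12, F14", "F0", ("F12", "F14")),
    Instr("L.D F2, 0(R9)", "F2", ("R9",)),
    Instr("ADD.D F8, F8, F2", "F8", ("F8", "F2")),
]


def find_hazards(program):
    """Return (kind, earlier_index, later_index, register) tuples."""
    hazards = []
    for j, later in enumerate(program):
        # RAW: each source register depends on its most recent earlier writer.
        for reg in later.srcs:
            for i in range(j - 1, -1, -1):
                if program[i].dest == reg:
                    hazards.append(("RAW", i, j, reg))
                    break
        if later.dest is None:
            continue
        # WAR and WAW: scan backward, stopping at the previous writer of the
        # same register (older readers conflict with that writer instead).
        for i in range(j - 1, -1, -1):
            earlier = program[i]
            if later.dest in earlier.srcs:
                hazards.append(("WAR", i, j, later.dest))
            if earlier.dest == later.dest:
                hazards.append(("WAW", i, j, later.dest))
                break
    return hazards
```

Calling find_hazards(PROGRAM) reports, all on F2, a RAW from MUL.D to S.D, a WAR from S.D to L.D, a WAW from MUL.D to L.D, and a RAW from L.D to ADD.D, which matches the list on slide 4.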

  2. slide 7/16
     Costs of the CDB (common data bus)
     In a typical clock cycle, some reservation station will broadcast a result on the CDB, and other reservation stations and the register file will look at the result to see if it’s useful. Transmitting the result and receiving the result both have energy costs.
     A complex instruction unit, reservation stations, and related hardware require lots of transistors. If Moore’s law had not applied for so many decades, we would not see Tomasulo’s algorithm used as a basis for design of modestly priced processor chips.
     It’s possible, in some cycles, that two or more reservation stations will simultaneously try to broadcast their results. Why is this not a fatal defect in Tomasulo’s algorithm?

     slide 8/16
     Data hazards in the memory system
     Why is this a potential RAW hazard?

         S.D  F0, 48(R8)
         L.D  F2, 8(R9)

     And why is this a potential WAR hazard?

         L.D  F4, 72(R10)
         S.D  F6, 0(R11)

     Finally, why is this a potential WAW hazard?

         S.D  F8, (R12)
         S.D  F10, (R13)

     All of these hazards are important problems in processors that may complete instructions out of order. Due to lack of time, we won’t look at solutions in detail, but be aware that processor designers must deal with these hazards correctly.

     slide 9/16
     Tomasulo’s algorithm and branch prediction
     Consider this code fragment:

         BEQ    R8, R0, L99
         S.D    F0, (R10)
         ADD.D  F2, F2, F4

     Suppose the branch is incorrectly predicted as not taken, and S.D and ADD.D get issued while BEQ waits for some earlier instruction to provide a value for R8.
     If Tomasulo’s algorithm does nothing beyond what has been presented so far in lectures, what will prevent S.D from making an incorrect update to memory, and what will prevent ADD.D from making an incorrect update to F2?

     slide 10/16
     Tomasulo’s algorithm and exceptions

         MUL.D  F2, F4, F6
         S.D    F2, 0(R8)
         SUB.D  F0, F12, F14
         L.D    F2, 0(R9)
         ADD.D  F8, F8, F2

     Suppose MUL.D gets delayed because it has to wait until a result for F6 is ready. That will delay the execution of S.D. Meanwhile, Tomasulo’s algorithm may allow completion of SUB.D, L.D, and ADD.D.
     What kind of problem is created if S.D eventually results in a page fault exception?

     slide 11/16
     Out-of-order execution, in-order completion
     The version of Tomasulo’s algorithm presented in textbook Section 3.5 has scalar issue (that is, at most one instruction issued per clock cycle), out-of-order execution, and out-of-order completion.
     Section 3.6 modifies the algorithm to include a circuit called a reorder buffer (ROB), which will enforce in-order completion.
     Use of a reorder buffer solves the branch prediction and exception problems described on slides 9 and 10.

     slide 12/16
     In a processor with a reorder buffer, issue of an instruction sends information related to the instruction both to a reservation station and to the reorder buffer.
     A reservation station for a store is responsible for address computation only; it is not allowed to write to memory.
     The reorder buffer is a FIFO queue: instructions enter in program order, and leave in program order.
     When an instruction gets to the head of the ROB, it can be committed as soon as its results are known. Examples:
     ◮ An ADD.D can be committed if a reservation station has provided the sum to the reorder buffer.
     ◮ An S.D can be committed if both the data to be stored and the address to be used are ready.
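The FIFO behaviour of the reorder buffer described on slide 12 can be sketched in a few lines of Python. This is an illustrative model only, under assumptions that are not in the slides: the RobEntry and ReorderBuffer names, the single done flag standing in for "results are known", and the absence of any timing, CDB, or register state.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    text: str            # instruction text, for display only
    done: bool = False   # True once the result (or store data and address) is known


class ReorderBuffer:
    """FIFO of issued instructions; entries leave only from the head."""

    def __init__(self):
        self.entries = deque()

    def issue(self, text):
        # Instructions enter in program order.
        entry = RobEntry(text)
        self.entries.append(entry)
        return entry

    def commit_ready(self):
        # Commit from the head while the head is done; a not-yet-done head
        # blocks every younger instruction, even one that has finished.
        committed = []
        while self.entries and self.entries[0].done:
            committed.append(self.entries.popleft().text)
        return committed
```

With the slide-10 sequence issued in order, marking SUB.D, L.D, and ADD.D as done commits nothing, because MUL.D is still at the head. Once MUL.D is done it commits alone, and once S.D is done the remaining four commit together in program order. A page fault on S.D would therefore be detected before any younger instruction has changed registers or memory.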

  3. slide 13/16, slide 14/16
     The reservation stations and functional units work very much as before, except:
     ◮ the Qj and Qk fields hold ROB entry numbers instead of reservation station numbers;
     ◮ each reservation station has a Dest field to hold an ROB entry number;
     ◮ when a reservation station broadcasts its result on the CDB, it includes the Dest field value to help both the ROB and the other reservation stations.

     Register file changes:
     ◮ The Qi field for each register is replaced by a Busy flag and a Reorder # field. Busy = 0 means the register is up to date; Busy = 1 means the register is waiting for a result from whatever entry in the reorder buffer matches the Reorder #.
     ◮ The register file does not watch the CDB for results.
     The ROB must watch the CDB for results for all of the instructions within the ROB that don’t yet have results.

     slide 15/16
     The reorder buffer and safe speculation
     The key point about the ROB is that it can collect a large number of results without knowing whether those results should really be written to registers or memory.
     Consider a branch instruction that is mispredicted as taken.
     ◮ What happens to all the instructions that got into the ROB before the branch?
     ◮ What happens to the branch target instruction, the successor of the branch target instruction, etc., which got into the ROB after the branch?
     The bad effect of the above scenario is a waste of time and energy. What are the important bad effects that were prevented?

     slide 16/16
     More Topics for Today
     As time permits . . .
     ◮ multiple issue of instructions
     ◮ limitations of ILP

     slide 16/16
     Upcoming Topics
     ◮ Processes and threads.
     ◮ Multi-core processor circuits and their caches.
     ◮ Multi-core support for processes and threads.
     Related reading in Hennessy & Patterson: Sections 5.1–5.2
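One way to picture the misprediction scenario on slide 15 is the Python sketch below. It is a hedged illustration, not the textbook's mechanism: the list-of-dicts ROB, the squash_wrong_path and commit functions, and every register value are hypothetical. The point it tries to show is that results held in the ROB reach the register file only at commit, so wrong-path entries can be dropped without ever having changed a register.

```python
def squash_wrong_path(rob, branch_index):
    """Split the ROB at a mispredicted branch.

    Entries up to and including the branch are older in program order and
    stay; everything younger was fetched down the wrong path.
    """
    kept = rob[: branch_index + 1]
    squashed = rob[branch_index + 1:]
    return kept, squashed


def commit(entries, registers):
    """Write each entry's result to the register file, in program order."""
    for entry in entries:
        if entry["dest"] is not None:   # branches have no register result
            registers[entry["dest"]] = entry["value"]
    return registers


# Hypothetical ROB contents at the moment the misprediction is discovered.
ROB = [
    {"text": "older instruction", "dest": "F0", "value": 1.5},
    {"text": "branch (mispredicted as taken)", "dest": None, "value": None},
    {"text": "branch target instruction", "dest": "F2", "value": 9.0},
    {"text": "successor of branch target", "dest": "F4", "value": 7.0},
]
```

Starting from registers F0 = 0.0, F2 = 2.0, F4 = 4.0, squashing at index 1 and committing only the kept entries updates F0 to 1.5 and leaves F2 and F4 untouched. The work done to compute 9.0 and 7.0 is the wasted time and energy the slide mentions.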
