ENCM 501 W14 Slides for Lecture 19

Slides for Lecture 19
ENCM 501: Principles of Computer Architecture
Winter 2014 Term
Steve Norman, PhD, PEng
Electrical & Computer Engineering
Schulich School of Engineering
University of Calgary
20 March, 2014

slide 2/18
Today's Lecture

◮ Tomasulo's algorithm: key components, and examples of instruction processing.

Related reading in Hennessy & Patterson: Sections 3.4–3.5

slide 3/18
FP register file for Tomasulo examples

       64-bit FP data   Qi
F0
F2
F4
...    ...              ...
F28
F30

This register file plays an active role in managing data hazards. A nonzero Qi value indicates that a register is waiting for data from an instruction.

slide 4/18
Example of FP register file state

The Qi field is four bits wide for examples in textbook sections 3.4–3.6. (In the previous lecture, I mistakenly suggested that three bits were enough.)

       64-bit FP data   Qi
F0     2.25             0000
F2     0.375            0111
F4     42.0             1011
...    ...              ...

F0 has Qi = 0, so the value of 2.25 is up-to-date.
But the values in F2 and F4 are out-of-date. These registers are waiting for fresh results from reservation stations 7 and 11 (Load2 and Add1 in the numbering introduced on slide 5).

slide 5/18
Reservation stations: How many?

In the textbook example, there are fifteen of these:
◮ 5 store buffers
◮ 5 load buffers
◮ 3 stations for FP add or subtract instructions
◮ 2 stations for FP multiply or divide instructions

The textbook gives them names but not numbers, so let's do that to help with clarity in examples:
◮ Store1 to Store5: 0001 to 0101
◮ Load1 to Load5: 0110 to 1010
◮ Add1, Add2, Add3 (FP add/subtract): 1011, 1100, 1101
◮ Mult1, Mult2 (FP multiply/divide): 1110, 1111

slide 6/18
Reservation stations: What for?

The main possible states for a reservation station are:
◮ available: not currently in use
◮ busy: waiting for one or two operand data items
◮ busy: operation underway
◮ busy: result ready, station waiting to write result to CDB (common data bus)

Key point: The instruction unit can feed an instruction to an available reservation station, even if the instruction is not ready to start execution.
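The station numbering on slide 5 and the register file state on slide 4 can be tied together with a short sketch. This is only an illustration: the names `build_station_ids`, `STATION_IDS`, `STATION_NAMES`, and `describe_register` are hypothetical and do not come from the slides or the textbook.

```python
# Hypothetical sketch: map station names to the 4-bit numbers chosen on slide 5,
# then interpret the Qi field of a register as on slide 4.

def build_station_ids():
    """Number the fifteen stations 1..15 in the order used on slide 5."""
    names = ([f"Store{i}" for i in range(1, 6)]
             + [f"Load{i}" for i in range(1, 6)]
             + ["Add1", "Add2", "Add3", "Mult1", "Mult2"])
    return {name: number for number, name in enumerate(names, start=1)}

STATION_IDS = build_station_ids()
STATION_NAMES = {number: name for name, number in STATION_IDS.items()}

def describe_register(reg, value, qi):
    """Qi == 0 means the stored value is current; otherwise Qi names a station."""
    if qi == 0:
        return f"{reg} = {value} is up-to-date"
    return f"{reg} is waiting for a result from {STATION_NAMES[qi]} ({qi:04b})"

# Register file state from slide 4.
register_file = {"F0": (2.25, 0b0000), "F2": (0.375, 0b0111), "F4": (42.0, 0b1011)}
for reg, (value, qi) in register_file.items():
    print(describe_register(reg, value, qi))
```

Reserving 0000 for "no station" is what lets a single four-bit field both flag a pending result and identify its producer.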
slide 7/18
Seven fields in a reservation station

Busy | Op | Vj | Vk | Qj | Qk | A

This is not to scale! The Busy, Op, Qj, and Qk fields are really tiny compared to the 64-bit Vj, Vk, and A fields.

Busy and Op are the easiest to explain:
◮ Busy is 1 for busy, 0 for available.
◮ Op selects the operation; for example it distinguishes add from subtract or multiply from divide.

(The text isn't clear about why Op matters in a load buffer or store buffer. In a more extended example, Op might be needed in a load buffer to distinguish, say, a 64-bit load from a 32-bit load.)

slide 8/18
Vj, Vk, Qj, Qk for FP math reservation stations

Busy | Op | Vj | Vk | Qj | Qk | A

Qj = 0 implies that the FP value in Vj is ready.
Qj ≠ 0 implies that the value in Vj is not ready.
Qk = 0 implies that the FP value in Vk is ready.
Qk ≠ 0 implies that the value in Vk is not ready.

If Qj ≠ 0, beyond simply signifying that Vj is not ready, what does the specific nonzero value of Qj indicate?

Let's write out some examples of reservation station states.

(The A field is unnecessary in the FP math reservation stations.)

slide 9/18
Reservation stations are not the functional units that do FP math

Reservation stations control the entrances to the functional units that crunch numbers, and watch the exits of those units for results.

This is one of many possible arrangements:
◮ Reservation stations Add1, Add2, and Add3 all feed input into a single FP add/subtract pipeline.
◮ Each of Mult1 and Mult2 can feed input into either a pipelined FP multiplier or a non-pipelined FP divider.

slide 10/18
Vj, Vk, Qj, Qk, A for store buffers

Busy | Op | Vj | Vk | Qj | Qk | A

As with the FP math stations, Vj is ready if and only if Qj = 0, and the same applies for Vk and Qk.

Vk is used for the FP data to be written in an S.D instruction. So what does it mean if Qk ≠ 0?

Vj, Qj, and A have to do with memory address calculations. Let's not worry about the details for now.

slide 11/18
Vj, Vk, Qj, Qk, A for load buffers

Busy | Op | Vj | Vk | Qj | Qk | A

Again, Vj is ready if and only if Qj = 0, and the same applies for Vk and Qk.

Vj, Qj, and A have to do with memory address calculations. As with store buffers, let's not worry about the details for now.

Remark: The load buffers and store buffers provide an interface between the execution unit of the processor and the data caches. That's an interesting design problem we don't have time to study in this course.

slide 12/18
A queue of decoded instructions

[Figure: the Instruction Unit feeds decoded instructions into a queue; instructions leave from the head of the queue and go to the reservation stations.]

Our example system is scalar and does in-order instruction fetch. So in a typical clock cycle the Instruction Unit puts one decoded instruction into the queue.

Why is a queue required?

Is it possible for the queue to become empty? If so, why?
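Returning to the seven reservation station fields on slides 7 to 11, here is a minimal sketch of one station as a data structure. The class name `ReservationStation`, the method `state`, and the example field values are assumptions made for illustration. The sketch does not distinguish "operation underway" from "result ready", which would need extra bookkeeping that the slides do not specify.

```python
from dataclasses import dataclass

@dataclass
class ReservationStation:
    """Hypothetical model of the seven fields: Busy, Op, Vj, Vk, Qj, Qk, A."""
    busy: bool = False   # True for busy, False for available
    op: str = ""         # selects the operation, e.g. "ADD.D" versus "SUB.D"
    vj: float = 0.0      # first operand value, valid only when qj == 0
    vk: float = 0.0      # second operand value, valid only when qk == 0
    qj: int = 0          # 0, or the number of the station that will produce Vj
    qk: int = 0          # 0, or the number of the station that will produce Vk
    a: int = 0           # address information, used by load and store buffers

    def state(self):
        """Classify the station using the readiness rules from slide 8."""
        if not self.busy:
            return "available"
        if self.qj != 0 or self.qk != 0:
            return "busy, waiting for operands"
        return "busy, operands ready"

# Example: an add station whose second operand is still coming from station 0111.
rs = ReservationStation(busy=True, op="ADD.D", vj=2.25, qk=0b0111)
print(rs.state())
```

Keeping a value field and a tag field for each operand is what lets a station accept an instruction before both of its operands exist.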
slide 13/18
Assignment of instructions to reservation stations

Suppose these two instructions are first and second in the queue:

ADD.D F2, F2, F0
SUB.D F6, F4, F2   # Note: RAW hazard!

Suppose that the register file is in this state:

       64-bit FP data   Qi
F0     1.0              0000
F2     1.5              0000
F4     3.75             0000
F6     -1.0             0000
...    ...              ...

If stations Add1 (1011) and Add2 (1100) are both available, how do the instructions get moved out of the queue, and what happens in the Register File?

slide 14/18
Instruction completion and the CDB

We've discussed most of the key components: Instruction Unit, register file, reservation stations, and functional units for FP math.

The last key component is the Common Data Bus (CDB).

A busy reservation station watches for completion of the instruction. When the result is ready, the result goes on to the CDB along with the ID number of the reservation station.

The register file and all reservation stations with nonzero Qj or Qk are constantly watching the CDB for new results.

Let's trace how that works for completion of the example ADD.D and SUB.D instructions on the previous slide.

slide 15/18
Resolution of a silly WAW hazard example

MUL.D must not write to F2 after L.D writes to F2:

MUL.D F2, F0, F0
L.D   F2, (R4)

How is an incorrect write to F2 prevented?

slide 16/18
Resolution of practical RAW, WAR and WAW hazards

MUL.D F2, F4, F6
S.D   F2, 0(R8)
SUB.D F0, F12, F14
L.D   F2, 0(R9)
ADD.D F8, F8, F2

RAW: S.D needs the MUL.D result, and ADD.D needs the L.D result.
WAR: S.D needs the MUL.D result, not the later L.D result.
WAW: ADD.D needs the L.D result, and when all these instructions are done, F2 needs the L.D result.

Let's trace how Tomasulo's algorithm handles this sequence.
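The questions on slides 13 and 14 about the ADD.D / SUB.D pair can be explored with a small simulation. This is a sketch under simplifying assumptions, not the textbook's hardware description: it has no timing, no instruction queue, and only ADD.D and SUB.D. The helper names `make_state`, `issue`, `execute`, `broadcast`, and `trace` are hypothetical.

```python
ADD1, ADD2 = 0b1011, 0b1100  # station numbers from slide 5

def make_state():
    """Register file from slide 13: name -> [value, Qi]. No stations are busy."""
    regs = {"F0": [1.0, 0], "F2": [1.5, 0], "F4": [3.75, 0], "F6": [-1.0, 0]}
    return regs, {}

def issue(regs, stations, sid, op, dest, src_j, src_k):
    """Copy ready operands into Vj/Vk, or record the producing station in Qj/Qk."""
    entry = {"op": op, "vj": None, "vk": None, "qj": 0, "qk": 0}
    for v, q, src in (("vj", "qj", src_j), ("vk", "qk", src_k)):
        value, qi = regs[src]
        if qi == 0:
            entry[v] = value   # register is up-to-date: take the value
        else:
            entry[q] = qi      # register is stale: remember who will produce it
    stations[sid] = entry
    regs[dest][1] = sid        # sources were read first; now tag the destination

def execute(entry):
    """Compute the result once both operands are present."""
    if entry["op"] == "ADD.D":
        return entry["vj"] + entry["vk"]
    if entry["op"] == "SUB.D":
        return entry["vj"] - entry["vk"]
    raise ValueError(f"unsupported op {entry['op']}")

def broadcast(regs, stations, sid, result):
    """Put (sid, result) on the CDB: waiting stations and registers capture it."""
    del stations[sid]          # the producing station becomes available
    for entry in stations.values():
        if entry["qj"] == sid:
            entry["vj"], entry["qj"] = result, 0
        if entry["qk"] == sid:
            entry["vk"], entry["qk"] = result, 0
    for reg in regs.values():
        if reg[1] == sid:
            reg[0], reg[1] = result, 0

def trace():
    regs, stations = make_state()
    issue(regs, stations, ADD1, "ADD.D", "F2", "F2", "F0")
    issue(regs, stations, ADD2, "SUB.D", "F6", "F4", "F2")
    waiting_tag = stations[ADD2]["qk"]   # SUB.D waits on Add1 for F2
    broadcast(regs, stations, ADD1, execute(stations[ADD1]))  # F2 -> 2.5
    broadcast(regs, stations, ADD2, execute(stations[ADD2]))  # F6 -> 1.25
    return waiting_tag, regs

print(trace())
```

In this sketch `issue` overwrites the destination's Qi, so only the most recently issued writer of a register can update it from the CDB. That same rule is one way to see why the older MUL.D on slide 15 cannot clobber F2 after the L.D has been issued.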
slide 17/18
Loop example

This is from page 179 of the textbook:

Loop: L.D    F0, 0(R1)
      MUL.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, -8
      BNE    R1, R2, Loop

Let's make some notes about the DADDIU and BNE instructions.

Let's assume the loop starts with R1 = 0x600040 and R2 = 0x600000.

Let's trace how Tomasulo's algorithm might handle the first two passes through the loop.

slide 18/18
Upcoming Topics

◮ Continued discussion of Tomasulo's algorithm and related design issues.
◮ Concluding remarks on ILP.

Related reading in Hennessy & Patterson: Sections 3.4 to 3.6
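For the loop example on slide 17, one useful note about DADDIU and BNE is how many passes the loop makes and which addresses it touches. The sketch below only models the integer side of the loop (R1, R2, and the branch), assuming the starting values given on the slide. The function name `loop_addresses` is hypothetical.

```python
def loop_addresses(r1=0x600040, r2=0x600000, step=-8):
    """Return the address used by L.D and S.D on each pass through the loop."""
    addresses = []
    while True:
        addresses.append(r1)  # L.D F0, 0(R1) and S.D F4, 0(R1) use this address
        r1 += step            # DADDIU R1, R1, -8
        if r1 == r2:          # BNE R1, R2, Loop falls through when R1 == R2
            break
    return addresses

passes = loop_addresses()
print(len(passes), [hex(a) for a in passes[:2]])
```

With these starting values the first two passes use addresses 0x600040 and 0x600038, and the loop runs eight times before R1 reaches R2.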