
Slides for Lecture 19, ENCM 501: Principles of Computer Architecture



  1. Slides for Lecture 19, ENCM 501: Principles of Computer Architecture, Winter 2014 Term
     Steve Norman, PhD, PEng
     Electrical & Computer Engineering, Schulich School of Engineering, University of Calgary
     20 March, 2014

  2. Today’s Lecture
     ◮ Tomasulo’s algorithm: key components, and examples of instruction processing.
     Related reading in Hennessy & Patterson: Sections 3.4–3.5.

  3. FP register file for Tomasulo examples
     The register file has one row for each of F0, F2, F4, ..., F28, F30. Each row holds two fields: 64-bit FP data and a Qi field.
     The Qi field is four bits wide for the examples in textbook Sections 3.4–3.6. (In the previous lecture, I mistakenly suggested that three bits were enough.)
     This register file plays an active role in managing data hazards. A nonzero Qi value indicates that the register is waiting for data from an instruction, and Qi is the number of the reservation station holding that instruction.

  4. Example of FP register file state
       Register   64-bit FP data   Qi
       F0         2.25             0000
       F2         0.375            0111
       F4         42.0             1011
       ...        ...              ...
     F0 has Qi = 0, so its value of 2.25 is up-to-date. But the values in F2 and F4 are out-of-date: these registers are waiting for fresh results from reservation stations 7 (0111) and 11 (1011).
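The register-file behaviour described for slides 3 and 4 can be sketched in a few lines of Python. This is a minimal model, not the hardware: the names FPReg and is_ready are hypothetical, and a Python float stands in for the 64-bit FP data field.

```python
from dataclasses import dataclass

@dataclass
class FPReg:
    """One row of the FP register file (hypothetical model)."""
    value: float   # stands in for the 64-bit FP data field
    qi: int = 0    # 4-bit tag: 0 = up-to-date, else ID of the producing station

    def is_ready(self) -> bool:
        return self.qi == 0

# The example state from slide 4, with Qi written in binary.
regs = {
    "F0": FPReg(2.25, 0b0000),
    "F2": FPReg(0.375, 0b0111),
    "F4": FPReg(42.0, 0b1011),
}

for name, r in regs.items():
    status = "up-to-date" if r.is_ready() else f"waiting on station {r.qi}"
    print(f"{name}: {r.value} ({status})")
# Prints:
# F0: 2.25 (up-to-date)
# F2: 0.375 (waiting on station 7)
# F4: 42.0 (waiting on station 11)
```

Note that the stale values 0.375 and 42.0 are still physically present; only the nonzero Qi tells the hardware not to use them.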

  5. Reservation stations: How many?
     In the textbook example, there are fifteen of these:
     ◮ 5 store buffers
     ◮ 5 load buffers
     ◮ 3 stations for FP add or subtract instructions
     ◮ 2 stations for FP multiply or divide instructions
     The textbook gives them names but not numbers, so let’s number them to help with clarity in examples:
     ◮ Store1 to Store5: 0001 to 0101
     ◮ Load1 to Load5: 0110 to 1010
     ◮ Add1, Add2, Add3 (FP add/subtract): 1011, 1100, 1101
     ◮ Mult1, Mult2 (FP multiply/divide): 1110, 1111
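The numbering above can be written out as a lookup table, which also shows why Qi needs four bits: code 0 is reserved to mean "ready", so the fifteen stations use codes 1 to 15, and three bits would only give codes 0 to 7. The dictionary names below are hypothetical, chosen for this sketch.

```python
# Hypothetical encoding of the 15 station names as 4-bit IDs (0 = "no station").
STATION_IDS = {}
for i in range(5):
    STATION_IDS[f"Store{i + 1}"] = 0b0001 + i   # 0001 .. 0101
for i in range(5):
    STATION_IDS[f"Load{i + 1}"] = 0b0110 + i    # 0110 .. 1010
for i in range(3):
    STATION_IDS[f"Add{i + 1}"] = 0b1011 + i     # 1011 .. 1101
for i in range(2):
    STATION_IDS[f"Mult{i + 1}"] = 0b1110 + i    # 1110 .. 1111

# Reverse map: decode a nonzero Qi/Qj/Qk tag back to a station name.
STATION_NAMES = {v: k for k, v in STATION_IDS.items()}

print(STATION_NAMES[0b0111], STATION_NAMES[0b1011])   # Load2 Add1
```

With this table, the two waiting registers on slide 4 are waiting for Load2 and Add1.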

  6. Reservation stations: What for?
     The main possible states for a reservation station are:
     ◮ available: not currently in use
     ◮ busy: waiting for one or two operand data items
     ◮ busy: operation underway
     ◮ busy: result ready, station waiting to write result to CDB (common data bus)
     Key point: The Instruction Unit can feed an instruction to an available reservation station, even if the instruction is not ready to start execution.
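The key point above separates two questions: "may this station accept an instruction?" and "may this instruction start executing?" A minimal sketch, with hypothetical state and function names, makes the distinction explicit.

```python
from enum import Enum, auto

class StationState(Enum):
    AVAILABLE = auto()             # not currently in use
    WAITING_FOR_OPERANDS = auto()  # busy, one or both operands not yet ready
    EXECUTING = auto()             # busy, operation underway
    RESULT_READY = auto()          # busy, waiting to write its result to the CDB

def can_accept_instruction(state: StationState) -> bool:
    # Issue depends only on the station being available.
    # Operand readiness is deliberately NOT checked here.
    return state is StationState.AVAILABLE

def can_start_execution(state: StationState, qj: int, qk: int) -> bool:
    # Execution may begin only once both operand tags are 0 (values present).
    return state is StationState.WAITING_FOR_OPERANDS and qj == 0 and qk == 0
```

So an instruction whose operands are still being computed elsewhere can leave the queue and sit in a station. That is what frees the queue to keep moving.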

  7. Seven fields in a reservation station
     Busy | Op | Vj | Vk | Qj | Qk | A
     This is not to scale! The Busy, Op, Qj, and Qk fields are really tiny compared to the 64-bit Vj, Vk, and A fields.
     Busy and Op are the easiest to explain:
     ◮ Busy is 1 for busy, 0 for available.
     ◮ Op selects the operation; for example, it distinguishes add from subtract or multiply from divide.
     (The text isn’t clear about why Op matters in a load buffer or store buffer. In a more extended example, Op might be needed in a load buffer to distinguish, say, a 64-bit load from a 32-bit load.)
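The seven fields map naturally onto a record type. This sketch is illustrative only: the class name and the release helper are hypothetical, a string stands in for the small Op field, and Python floats/ints stand in for the 64-bit Vj, Vk, and A fields.

```python
from dataclasses import dataclass

@dataclass
class ReservationStation:
    busy: bool = False   # 1 = busy, 0 = available
    op: str = ""         # selects the operation, e.g. "ADD.D" vs "SUB.D"
    vj: float = 0.0      # first operand value (meaningful only when qj == 0)
    vk: float = 0.0      # second operand value (meaningful only when qk == 0)
    qj: int = 0          # 4-bit tag of the station producing Vj; 0 = ready
    qk: int = 0          # 4-bit tag of the station producing Vk; 0 = ready
    a: int = 0           # address information, used by load/store buffers

    def release(self) -> None:
        """Return the station to the available state (hypothetical helper)."""
        self.busy, self.op = False, ""
        self.vj = self.vk = 0.0
        self.qj = self.qk = self.a = 0
```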

  8. Vj, Vk, Qj, Qk for FP math reservation stations
     Busy | Op | Vj | Vk | Qj | Qk | A
     ◮ Qj = 0 implies that the FP value in Vj is ready. Qj ≠ 0 implies that the value in Vj is not ready.
     ◮ Qk = 0 implies that the FP value in Vk is ready. Qk ≠ 0 implies that the value in Vk is not ready.
     If Qj ≠ 0, beyond simply signifying that Vj is not ready, what does the specific nonzero value of Qj indicate?
     Let’s write out some examples of reservation station states. (The A field is unnecessary in the FP math reservation stations.)
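In Tomasulo's scheme a nonzero Qj (or Qk) is the ID of the reservation station whose result will eventually supply Vj (or Vk). The sketch below writes out three example FP-math station states under that rule. The class, method, and example values are hypothetical, and the tags follow the slide-5 numbering.

```python
from dataclasses import dataclass

ADD1, MULT1 = 0b1011, 0b1110   # tags from the slide-5 numbering

@dataclass
class FPMathStation:
    op: str
    vj: float = 0.0
    vk: float = 0.0
    qj: int = 0
    qk: int = 0

    def ready_to_execute(self) -> bool:
        # Both operand values are present only when both tags are zero.
        return self.qj == 0 and self.qk == 0

    def waiting_on(self) -> set:
        # Each nonzero tag names the station whose result will fill that operand.
        return {q for q in (self.qj, self.qk) if q != 0}

s1 = FPMathStation("MUL.D", vj=2.0, vk=0.5)     # both operands ready
s2 = FPMathStation("SUB.D", vj=3.75, qk=ADD1)   # Vk will come from Add1
s3 = FPMathStation("DIV.D", qj=MULT1, qk=ADD1)  # waiting for two results
```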

  9. Reservation stations are not the functional units that do FP math
     Reservation stations control the entrances to the functional units that crunch numbers, and watch the exits of those units for results. This is one of many possible arrangements:
     ◮ Reservation stations Add1, Add2, and Add3 all feed input into a single FP add/subtract pipeline.
     ◮ Each of Mult1 and Mult2 can feed input into either a pipelined FP multiplier or a non-pipelined FP divider.

  10. Vj, Vk, Qj, Qk, A for store buffers
      Busy | Op | Vj | Vk | Qj | Qk | A
      As with the FP math stations, Vj is ready if and only if Qj = 0, and the same applies for Vk and Qk.
      Vk is used for the FP data to be written by an S.D instruction. So what does it mean if Qk ≠ 0?
      Vj, Qj, and A have to do with memory address calculations. Let’s not worry about the details for now.
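Following the rule above, a store buffer with Qk ≠ 0 holds an S.D whose data has not been computed yet. Qk names the station that will compute it, and the store cannot write to memory until that result appears on the CDB. Here is a minimal sketch of just the data side, ignoring the address fields as the slide does. The class, method, and example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class StoreBuffer:
    vk: float = 0.0   # FP data to be written by S.D (meaningful only when qk == 0)
    qk: int = 0       # tag of the station that will produce the data; 0 = ready
    a: int = 0        # address information (details omitted)

    def data_ready(self) -> bool:
        return self.qk == 0

    def snoop_cdb(self, tag: int, value: float) -> None:
        # Capture the data only when the awaited station broadcasts its result.
        if self.qk != 0 and self.qk == tag:
            self.vk, self.qk = value, 0

MULT1 = 0b1110
sb = StoreBuffer(qk=MULT1, a=0x600040)   # S.D waiting for a MUL.D in Mult1
```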

  11. Vj, Vk, Qj, Qk, A for load buffers
      Busy | Op | Vj | Vk | Qj | Qk | A
      Again, Vj is ready if and only if Qj = 0, and the same applies for Vk and Qk.
      Vj, Qj, and A have to do with memory address calculations. As with store buffers, let’s not worry about the details for now.
      Remark: The load buffers and store buffers provide an interface between the execution unit of the processor and the data caches. That’s an interesting design problem we don’t have time to study in this course.

  12. A queue of decoded instructions
      [Figure: the Instruction Unit feeds a queue of decoded instructions; the head of the queue feeds the reservation stations.]
      Our example system is scalar and does in-order instruction fetch. So in a typical clock cycle the Instruction Unit puts one decoded instruction into the queue.
      Why is a queue required? Is it possible for the queue to become empty? If so, why?
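One situation that a queue handles is shown below. Issue is in order, so only the head of the queue may leave, and it can leave only if a suitable reservation station is available. If none is, the head stalls while the Instruction Unit keeps adding instructions behind it. This sketch is an assumption-laden illustration: the opcode-to-station-kind table, the function name issue_one, and the example queue contents are all hypothetical.

```python
from collections import deque
from typing import Optional

# Hypothetical mapping from opcode to the kind of station it needs.
STATION_KIND = {"L.D": "load", "S.D": "store", "ADD.D": "add",
                "SUB.D": "add", "MUL.D": "mult", "DIV.D": "mult"}

def issue_one(queue: deque, free: dict) -> Optional[str]:
    """Try to move the head instruction into an available station.

    free maps a station kind to a list of available station names.
    Returns the station used, or None if nothing was issued this cycle.
    """
    if not queue:
        return None                 # queue empty: nothing to issue
    opcode = queue[0].split()[0]
    kind = STATION_KIND[opcode]
    if not free[kind]:
        return None                 # no station of that kind: head stalls
    queue.popleft()                 # in-order issue: only the head may leave
    return free[kind].pop(0)

q = deque(["MUL.D F2, F4, F6", "MUL.D F8, F4, F6",
           "MUL.D F10, F4, F6", "ADD.D F0, F0, F0"])
free = {"load": ["Load1"], "store": ["Store1"], "add": ["Add1"],
        "mult": ["Mult1", "Mult2"]}
```

After two issues both Mult stations are busy, so the third MUL.D stalls at the head and the ADD.D behind it cannot pass it, even though Add1 is available.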

  13. Assignment of instructions to reservation stations
      Suppose these two instructions are first and second in the queue:
        ADD.D F2, F2, F0
        SUB.D F6, F4, F2    # Note: RAW hazard!
      Suppose that the register file is in this state:
        Register   64-bit FP data   Qi
        F0         1.0              0000
        F2         1.5              0000
        F4         3.75             0000
        F6         -1.0             0000
        ...        ...              ...
      If stations Add1 (1011) and Add2 (1100) are both available, how do the instructions get moved out of the queue, and what happens in the register file?
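One way to answer the question is to model issue in Python. For each source register, the station receives either the value (if Qi = 0) or the tag of the producing station. Then the destination register's Qi is set to the new station's ID. This is a sketch under simplifying assumptions: the names Reg, Station, read_operand, and issue are hypothetical, and only the four registers from the table are modelled.

```python
from dataclasses import dataclass

ADD1, ADD2 = 0b1011, 0b1100   # station IDs from the slide-5 numbering

@dataclass
class Reg:
    value: float
    qi: int = 0

@dataclass
class Station:
    op: str
    vj: float = 0.0
    vk: float = 0.0
    qj: int = 0
    qk: int = 0

regs = {"F0": Reg(1.0), "F2": Reg(1.5), "F4": Reg(3.75), "F6": Reg(-1.0)}

def read_operand(name):
    """Return (value, tag): the value if ready, otherwise the producer's tag."""
    r = regs[name]
    return (r.value, 0) if r.qi == 0 else (0.0, r.qi)

def issue(station_id, op, rd, rs, rt):
    # Read sources BEFORE renaming rd: ADD.D F2, F2, F0 must see the old F2.
    vj, qj = read_operand(rs)
    vk, qk = read_operand(rt)
    regs[rd].qi = station_id   # rd will now be written by this station
    return Station(op, vj, vk, qj, qk)

add1 = issue(ADD1, "ADD.D", "F2", "F2", "F0")   # Vj = 1.5, Vk = 1.0
add2 = issue(ADD2, "SUB.D", "F6", "F4", "F2")   # Vj = 3.75, Qk = Add1
```

The RAW hazard is captured by add2.qk == ADD1: SUB.D will take its second operand from Add1, not from the stale 1.5 still sitting in F2.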

  14. Instruction completion and the CDB
      We’ve discussed most of the key components: Instruction Unit, register file, reservation stations, and functional units for FP math. The last key component is the Common Data Bus (CDB).
      A busy reservation station watches for completion of its instruction. When the result is ready, the result goes onto the CDB along with the ID number of the reservation station.
      The register file and all reservation stations with nonzero Qj or Qk are constantly watching the CDB for new results. A register or operand field accepts a result only if its Qi, Qj, or Qk matches the station ID on the CDB.
      Let’s trace how that works for completion of the example ADD.D and SUB.D instructions on the previous slide.
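The trace can be sketched as a broadcast function. Every listener compares the tag on the bus with its own Qi, Qj, or Qk and grabs the value on a match. This sketch starts from the state right after the two instructions of slide 13 are issued. The names are hypothetical, and freeing the station immediately after broadcasting is a simplifying assumption.

```python
from dataclasses import dataclass

ADD1, ADD2 = 0b1011, 0b1100

@dataclass
class Reg:
    value: float
    qi: int = 0

@dataclass
class Station:
    op: str
    vj: float = 0.0
    vk: float = 0.0
    qj: int = 0
    qk: int = 0

# State right after ADD.D -> Add1 and SUB.D -> Add2 have been issued.
regs = {"F0": Reg(1.0), "F2": Reg(1.5, ADD1), "F4": Reg(3.75), "F6": Reg(-1.0, ADD2)}
stations = {ADD1: Station("ADD.D", 1.5, 1.0), ADD2: Station("SUB.D", vj=3.75, qk=ADD1)}

def execute(s):
    if s.op == "ADD.D":
        return s.vj + s.vk
    if s.op == "SUB.D":
        return s.vj - s.vk
    raise ValueError(f"unsupported op {s.op}")

def broadcast(tag, value):
    """Put (tag, value) on the CDB; each listener checks for a tag match."""
    for r in regs.values():
        if r.qi == tag:
            r.value, r.qi = value, 0
    for s in stations.values():
        if s.qj == tag:
            s.vj, s.qj = value, 0
        if s.qk == tag:
            s.vk, s.qk = value, 0
    del stations[tag]   # the producing station becomes available

broadcast(ADD1, execute(stations[ADD1]))   # ADD.D: 1.5 + 1.0 = 2.5 -> F2 and Add2.Vk
broadcast(ADD2, execute(stations[ADD2]))   # SUB.D: 3.75 - 2.5 = 1.25 -> F6
```

A single broadcast updates F2 and SUB.D's Vk in the same step. This is why SUB.D never has to read F2 from the register file.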

  15. Resolution of a silly WAW hazard example
      MUL.D must not write to F2 after L.D writes to F2:
        MUL.D F2, F0, F0
        L.D   F2, (R4)
      How is an incorrect write to F2 prevented?
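The tag-matching rule from the CDB discussion answers this. When L.D issues, it overwrites F2's Qi with its own station ID, so a later broadcast from MUL.D's station no longer matches. The sketch assumes MUL.D goes to Mult1 and L.D to Load1, with made-up result values. The function names are hypothetical. It checks both completion orders.

```python
MULT1, LOAD1 = 0b1110, 0b0110   # assumed station assignments

class Reg:
    def __init__(self, value):
        self.value, self.qi = value, 0

def cdb_write(reg, tag, value):
    # The register accepts a broadcast only if the tag matches its current Qi.
    if reg.qi != 0 and reg.qi == tag:
        reg.value, reg.qi = value, 0

def run(completion_order):
    f2 = Reg(0.0)
    f2.qi = MULT1   # issue MUL.D F2, F0, F0 -> Mult1 will produce F2
    f2.qi = LOAD1   # issue L.D F2, (R4)     -> Load1 now owns F2
    results = {MULT1: 9.0, LOAD1: 7.0}   # hypothetical result values
    for tag in completion_order:
        cdb_write(f2, tag, results[tag])
    return f2.value

print(run([MULT1, LOAD1]), run([LOAD1, MULT1]))   # 7.0 7.0
```

Either way F2 ends up with the L.D result. The MUL.D result would still reach any reservation station that had captured Mult1 as its Qj or Qk.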

  16. Resolution of practical RAW, WAR and WAW hazards
      ◮ RAW: S.D needs the MUL.D result, and ADD.D needs the L.D result.
      ◮ WAR: S.D reads F2 before L.D overwrites it, so S.D needs the MUL.D result, not the L.D result.
      ◮ WAW: MUL.D and L.D both write F2, so ADD.D needs the L.D result, and when all these instructions are done, F2 must hold the L.D result.
        MUL.D F2, F4, F6
        S.D   F2, 0(R8)
        SUB.D F0, F12, F14
        L.D   F2, 0(R9)
        ADD.D F8, F8, F2
      Let’s trace how Tomasulo’s algorithm handles this sequence.
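The issue stage alone already sorts out all three hazards, because each source operand captures whichever tag is in the register's Qi at issue time. The sketch below records only the Qi tags and ignores values and address operands. It assumes that each instruction gets the first free station of its kind in the slide-5 numbering. The helper name issue and the issued list are hypothetical.

```python
# Assumed assignments: first free station of each kind (slide-5 numbering).
STORE1, LOAD1, ADD1, ADD2, MULT1 = 0b0001, 0b0110, 0b1011, 0b1100, 0b1110

qi = {f"F{n}": 0 for n in range(0, 32, 2)}   # Qi of F0..F30, all ready at start
issued = []   # (instruction text, station tag, source tags captured at issue)

def issue(tag, text, dest=None, srcs=()):
    # Capture producer tags of the sources BEFORE renaming the destination.
    src_tags = [qi[s] for s in srcs]
    if dest is not None:
        qi[dest] = tag
    issued.append((text, tag, src_tags))

issue(MULT1, "MUL.D F2, F4, F6", dest="F2", srcs=("F4", "F6"))
issue(STORE1, "S.D F2, 0(R8)", srcs=("F2",))      # store data only
issue(ADD1, "SUB.D F0, F12, F14", dest="F0", srcs=("F12", "F14"))
issue(LOAD1, "L.D F2, 0(R9)", dest="F2")          # address operand ignored
issue(ADD2, "ADD.D F8, F8, F2", dest="F8", srcs=("F8", "F2"))

for text, tag, src_tags in issued:
    print(f"{text:<20} station {tag:04b}  source tags {[f'{t:04b}' for t in src_tags]}")
```

S.D captures Mult1's tag (RAW), and keeps it after L.D renames F2 (WAR). ADD.D captures Load1's tag, and F2's final Qi is Load1, so only the L.D result can land in F2 (WAW).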

  17. Loop example
      This is from page 179 of the textbook:
        Loop: L.D    F0, 0(R1)
              MUL.D  F4, F0, F2
              S.D    F4, 0(R1)
              DADDIU R1, R1, -8
              BNE    R1, R2, Loop
      Let’s make some notes about the DADDIU and BNE instructions.
      Let’s assume the loop starts with R1 = 0x600040 and R2 = 0x600000.
      Let’s trace how Tomasulo’s algorithm might handle the first two passes through the loop.
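Before tracing, it helps to work out the integer side of the loop. DADDIU steps R1 down by 8 bytes (one doubleword) per pass. BNE branches back while R1 ≠ R2, and the test comes after the decrement. This sketch computes the address used by L.D and S.D on each pass. The function name is hypothetical.

```python
R1_START, R2 = 0x600040, 0x600000

def loop_addresses(r1, r2):
    """Return the R1 value used by L.D/S.D on each pass (do-while structure)."""
    if r1 <= r2 or (r1 - r2) % 8 != 0:
        raise ValueError("loop would not terminate at R1 == R2")
    addrs = []
    while True:
        addrs.append(r1)   # L.D F0, 0(R1) and S.D F4, 0(R1) use this address
        r1 -= 8            # DADDIU R1, R1, -8
        if r1 == r2:       # BNE R1, R2, Loop falls through when R1 == R2
            break
    return addrs

addrs = loop_addresses(R1_START, R2)
print(len(addrs), [hex(a) for a in addrs[:2]])   # 8 ['0x600040', '0x600038']
```

So the first two passes touch 0x600040 and 0x600038, and the loop makes 8 passes in total, with the last pass using 0x600008.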

  18. Upcoming Topics
      ◮ Continued discussion of Tomasulo’s algorithm and related design issues.
      ◮ Concluding remarks on ILP.
      Related reading in Hennessy & Patterson: Sections 3.4 to 3.6.
