Slides for Lecture 16
ENCM 501: Principles of Computer Architecture
Winter 2014 Term
Steve Norman, PhD, PEng
Electrical & Computer Engineering
Schulich School of Engineering
University of Calgary
11 March, 2014

slide 2/26 ENCM 501 W14 Slides for Lecture 16
Previous Lecture
◮ context switches and effects on memory latency
◮ memory system summary
◮ introduction to ILP (instruction-level parallelism)
◮ review of simple pipelining

Today’s Lecture
◮ pipeline hazards
◮ solutions to pipeline hazards
Related reading in Hennessy & Patterson: Sections C.2–C.3

A rough sketch of the 5-stage pipeline
This sketch was presented at the end of the previous lecture:
[Figure: sketch of the 5-stage pipeline. Stages IF, ID, EX, MEM, WB are separated by the clocked pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Labelled components: PC, add, I-mem (instr.), decode, GPRs, ALU, D-mem; one block is marked "?".]

Pipeline Hazards
If a certain sequence of instructions prevents the usual throughput of one instruction per clock cycle in a simple pipeline, the situation is called a pipeline hazard.
Hazards can be categorized into three main types: structural hazards, data hazards, and control hazards.

Structural hazards
These occur when two instructions “want” to use the same physical resource at the same time, in incompatible ways.
For example, if the simple 5-stage pipeline had a single memory unit, instead of split instruction and data memories, MEM of an LW or SW instruction would interfere with IF of a later instruction.
Why is access to three GPRs by two different instructions, one in WB and a later one in ID, not a structural hazard?
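
The single-memory example can be made concrete with a small model. This is a minimal sketch, not part of the original slides: the function name and the list-of-opcodes input are hypothetical, and it assumes instruction i (counting from 0) does IF in cycle i + 1 and, with no stalls, MEM in cycle i + 4. It ignores the fact that resolving one conflict with a stall would shift every later instruction.

```python
def unified_memory_conflicts(ops):
    """Return (mem_index, fetch_index) pairs that contend for a single memory.

    Assumption: instruction i (0-based) is fetched in cycle i + 1 and, with no
    stalls, reaches MEM in cycle i + 4 of the 5-stage pipeline.
    """
    conflicts = []
    for i, op in enumerate(ops):
        if op in ("LW", "SW"):          # only loads and stores use memory in MEM
            fetch_index = i + 3         # instruction whose IF falls in the same cycle
            if fetch_index < len(ops):
                conflicts.append((i, fetch_index))
    return conflicts


# Hypothetical opcode sequence, chosen only for illustration.
program = ["LW", "ADD", "SW", "SUB", "OR", "AND"]
print(unified_memory_conflicts(program))
```

Under these assumptions, MEM of each LW or SW collides with IF of the instruction three places later, which is why the simple pipeline uses separate instruction and data memories.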

Structural hazards: solutions
The best solution is to design hardware to avoid structural hazards wherever possible. For example:
◮ in the simple, 5-stage pipeline, use separate instruction and data memories;
◮ in real pipelines, have separate I-TLBs and D-TLBs, and separate L1 I-caches and D-caches.
For complex pipelines, it may be practically impossible to avoid all structural hazards, so stalls may be required: if two instructions are contending for a resource, one or the other will be delayed one or more clock cycles.

Data hazards
(We’ll use MIPS32 instructions as examples, because instructions like ADD and SUB are easier to deal with than DADD and DSUB.)
The most common kind of data hazard is called a RAW hazard: RAW stands for Read-After-Write.
    ADD R8, R9, R10
    SUB R11, R12, R8
For correct processing, SUB must work as if R8 is read by SUB after R8 is written by ADD. (This is where the term RAW comes from.)
Let’s draw a “pipeline diagram” to get a precise understanding of the problem.
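
One way to produce such a pipeline diagram is sketched below. The helper names are hypothetical, and the sketch assumes an ideal pipeline with no stalls and no forwarding, where the first instruction does IF in cycle 1 and each later instruction starts one cycle after its predecessor.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_cycles(issue_index):
    """Map each stage to its clock cycle (1-based), assuming no stalls."""
    return {stage: issue_index + 1 + k for k, stage in enumerate(STAGES)}


def diagram(names):
    """Build one text row per instruction, one column per clock cycle."""
    total = len(names) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(names):
        cells = ["."] * total
        for stage, cycle in stage_cycles(i).items():
            cells[cycle - 1] = stage
        rows.append(f"{name:<18}" + " ".join(f"{c:<3}" for c in cells).rstrip())
    return rows


for row in diagram(["ADD R8, R9, R10", "SUB R11, R12, R8"]):
    print(row)

add, sub = stage_cycles(0), stage_cycles(1)
print("ADD writes R8 in cycle", add["WB"], "; SUB reads R8 in cycle", sub["ID"])
```

In this model SUB reads the GPRs in ID two cycles before ADD writes R8 in WB, so without some remedy SUB would use a stale value of R8.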

More examples of RAW hazards
For the simple 5-stage pipeline, let’s find all the RAW hazards in this sequence . . .
    LW R8, 0(R4)
    AND R9, R8, R5
    OR R10, R6, R8
    SLT R11, R8, R7
Remark: The deeper a pipeline is (the more stages it has), the greater will be the number and complexity of potential RAW hazards.
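
The search for RAW hazards can be sketched as a scan over nearby instruction pairs. This is an illustration under stated assumptions, not the analysis done in class: the tuple encoding and the function name are hypothetical, there is no forwarding, and the default window of 2 assumes the GPRs are written in the first half of a clock cycle and read in the second half, so that an instruction in ID sees a value written by an instruction in WB in the same cycle.

```python
def raw_hazards(program, window=2):
    """List (producer, consumer, register) RAW hazards when there is no forwarding.

    Each instruction is (text, dest, sources). `window` is how many following
    instructions could read a stale value; 2 assumes write-then-read GPR access
    within one cycle, 3 assumes no such overlap.
    """
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None or dest == "R0":   # R0 is hard-wired to zero in MIPS
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            _, later_dest, sources = program[j]
            if dest in sources:
                hazards.append((i, j, dest))
            if later_dest == dest:         # register overwritten: dependence chain ends
                break
    return hazards


program = [
    ("LW R8, 0(R4)", "R8", ["R4"]),
    ("AND R9, R8, R5", "R9", ["R8", "R5"]),
    ("OR R10, R6, R8", "R10", ["R6", "R8"]),
    ("SLT R11, R8, R7", "R11", ["R8", "R7"]),
]
print(raw_hazards(program))             # pairs that need attention with window=2
print(raw_hazards(program, window=3))   # pairs if WB and ID cannot share a cycle
```

With the default window the sketch flags AND and OR as reading R8 too early; whether SLT is also a problem depends on the GPR timing assumption, which is why the window is a parameter.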

Forwarding
Forwarding is the name given to a technique that can often solve RAW data hazards without loss of clock cycles to stalls. (Another name for forwarding is bypassing.)
The essential idea is that if Instruction B depends on the result of Instruction A, Instruction B should not wait for Instruction A to write that result to its destination, but instead grab that result as soon as it is available.
Let’s look at how forwarding helps with this sequence . . .
    LW R8, 0(R4)
    AND R9, R8, R5
    OR R10, R6, R8
    SLT R11, R8, R7
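
Sketch of forwarding hardware for 5-stage MIPS32
Here is an incomplete schematic for the EX stage . . .
[Figure: EX-stage schematic. A clocked “forward control” block drives two 2-bit select signals, FwdA and FwdB, for multiplexers on ALU inputs A and B. Each multiplexer has inputs labelled 00, 01, and 10, and chooses among the GPR value from the ID/EX pipeline register, the ALU result from the EX/MEM register, and the LW or ALU result from the MEM/WB register. A further two-input multiplexer (0/1) on the B side handles the LW/SW offset, and the forwarded B value also serves as data for SW.]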

Q1: What should the values of the “forward control” outputs be in the case where no forwarding is needed?
Consider this sequence:
    LW R8, 0(R4)
    AND R9, R10, R11
    SUB R12, R8, R9
Q2: What should the values of the “forward control” outputs be when SUB is in the EX stage?
Q3: What are the inputs to “forward control” and how does the forwarding logic work? (We’ll give an example or two, not completely specify the logic!)
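
The kind of decision the “forward control” block makes can be sketched in a few lines. This is a hedged illustration, not the logic specified in the lecture: the function name is hypothetical, it returns symbolic source names instead of the 2-bit FwdA/FwdB codes (the slide text does not fix which code selects which input), and it assumes any load-use stall has already been inserted, so a load is never in MEM while its consumer is in EX.

```python
def forward_select(src_reg, ex_mem_dest, mem_wb_dest):
    """Choose the source for one ALU operand while an instruction is in EX.

    ex_mem_dest / mem_wb_dest are the destination GPRs of the instructions now
    in MEM and WB (None if that instruction writes no GPR).
    """
    if src_reg == "R0":
        return "GPR"            # R0 always reads as zero; never forward to it
    if ex_mem_dest == src_reg:
        return "EX/MEM"         # newest result wins when both stages match
    if mem_wb_dest == src_reg:
        return "MEM/WB"
    return "GPR"                # no hazard: use the value read in ID


# SUB R12, R8, R9 is in EX; AND (writes R9) is in MEM; LW (writes R8) is in WB.
fwd_a = forward_select("R8", ex_mem_dest="R9", mem_wb_dest="R8")
fwd_b = forward_select("R9", ex_mem_dest="R9", mem_wb_dest="R8")
print(fwd_a, fwd_b)
```

In this sketch the inputs to the control logic are the source register numbers of the instruction in EX and the destination register numbers held in the EX/MEM and MEM/WB pipeline registers. Checking EX/MEM before MEM/WB matters when two earlier instructions write the same register.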

Can forwarding solve all RAW hazards?
Consider this sequence:
    LW R15, 0(R14)
    ADD R16, R17, R15
Is it possible to solve the hazard by forwarding? If not, what is the most time-efficient way to solve the hazard?
Let’s make some general remarks about optimal solutions of RAW data hazards.
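
A loaded value is not available until the end of the MEM stage, which is too late for an instruction that enters EX in that same cycle, so one stall cycle is needed before forwarding can deliver the value. The sketch below counts such stalls. It is an illustration with hypothetical names and a hypothetical tuple encoding; it assumes full forwarding into EX and treats every source register as being needed in EX, which is pessimistic for the data register of an SW.

```python
def load_use_stalls(program):
    """Count stall cycles, assuming full forwarding into the EX stage.

    Each instruction is (opcode, dest, sources). With forwarding, the only RAW
    hazard that still costs a cycle in the 5-stage pipeline is a load followed
    immediately by an instruction that needs the loaded value in EX.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "LW" and prev_dest != "R0" and prev_dest in curr_sources:
            stalls += 1
    return stalls


load_then_use = [("LW", "R15", ["R14"]), ("ADD", "R16", ["R17", "R15"])]
load_with_gap = [
    ("LW", "R8", ["R4"]),
    ("AND", "R9", ["R10", "R11"]),
    ("SUB", "R12", ["R8", "R9"]),
]
print(load_use_stalls(load_then_use), load_use_stalls(load_with_gap))
```

The second sequence shows why instruction scheduling helps: with an independent instruction between the load and its consumer, forwarding alone is enough.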

Control hazards: Introduction
In a simple pipeline, a control hazard is a difficulty in determining the address to use for the next Instruction Fetch.
Look at this example, and assume a version of MIPS32 in which the delay slot instruction is not supposed to be completed if the branch is taken:
L1:   LW R9, 0(R5)
      instructions in loop body
      BEQ R8, R0, L1
      OR R16, R10, R0
In the clock cycle after IF for the BEQ instruction, why is doing IF difficult? (There is more than one reason.)

Control hazards: Not just for conditional branches!
In a conditional branch, there is an obvious motivation to wait for the decision about whether or not to take the branch. But consider the following unconditional updates to the PC:
◮ jump within a procedure;
◮ procedure call;
◮ procedure return.
Why do these kinds of instructions generate control hazards? How many cycles might be lost due to such a hazard in a 5-stage pipeline like the one we’ve been looking at?

“Old school” solutions to control hazards (1)
Stall as long as necessary to ensure that instruction results are correct.
This obviously makes CPI worse (higher) if programs have lots of conditional branches and unconditional jumps.
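
The effect on CPI can be estimated with the usual stall-cycle accounting: CPI is the base CPI plus the fraction of instructions that are branches or jumps, times the stall cycles each one costs. The sketch below is only that arithmetic; the function name is hypothetical and the numbers in the example are made up for illustration, not measurements.

```python
def cpi_with_control_stalls(base_cpi, transfer_fraction, stall_cycles):
    """Average CPI when every branch or jump stalls the pipeline.

    transfer_fraction: fraction of executed instructions that are branches/jumps.
    stall_cycles: cycles lost per branch or jump (assumed constant).
    """
    return base_cpi + transfer_fraction * stall_cycles


# Hypothetical workload: 20% branches and jumps, 2 stall cycles each.
cpi = cpi_with_control_stalls(base_cpi=1.0, transfer_fraction=0.20, stall_cycles=2)
print(f"CPI = {cpi:.2f}, slowdown = {cpi / 1.0:.2f}x")
```

With these made-up numbers the CPI rises from 1.0 to about 1.4, which shows why stalling on every branch and jump is expensive.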

“Old school” solutions to control hazards (2)
Delayed jumps and branches. Because it is very difficult to do IF properly in the cycle immediately following a jump or a taken branch, many ISA designs decreed that the successor to a jump or branch would always be completed before the jump or branch target instruction . . .
      BEQ R12, R0, L99
      ADD R13, R14, R15    # successor
      more instructions
L99:  SUB R8, R9, R10      # branch target
      OR R16, R8, R0
Real MIPS ISAs (as opposed to some hypothetical MIPS-like ISAs in textbooks and lecture slides) have delayed branches and jumps.

Dynamic branch prediction
Dynamic branch prediction is the most important current technology for management of control hazards.
A branch prediction circuit is a memory array comparable in size to an L1 I-cache, and somewhat more complex.
A branch prediction circuit records the locations of thousands of recently-encountered branches and jumps, along with the addresses of their targets.
For each conditional branch, a branch prediction circuit maintains a few bits of information that can be used to predict whether the branch will be taken or untaken.
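
The slide says only that “a few bits” are kept per branch. One common choice, used here purely as an assumption, is a 2-bit saturating counter per branch address. The sketch below is a software model with hypothetical names; it uses an unbounded dictionary in place of a fixed-size memory array and does not model the stored target addresses.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address (0..3; >= 2 means 'taken')."""

    def __init__(self):
        self.counters = {}   # branch address -> counter; stands in for the memory array

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2    # unseen branches start 'weakly untaken'

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)


def count_mispredictions(predictor, pc, outcomes):
    """Predict each outcome in turn, then train the predictor on the real result."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


predictor = TwoBitPredictor()
loop_branch = [True] * 9 + [False]      # a loop-closing branch: taken 9 times, then exit
first = count_mispredictions(predictor, 0x400, loop_branch)
second = count_mispredictions(predictor, 0x400, loop_branch)
print(first, second)
```

In this model a single untaken outcome at loop exit does not flip the prediction, so when the same loop runs again only the exit itself is mispredicted.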

Branch prediction code example
p and past_last are of type int*. count is an int.
    do {
        if (*p < 0)
            count++;
        p++;
    } while (p != past_last);
p walks through an array of int elements, and count records how many of those elements are negative.

Branch prediction code example, continued
Assembly language for a MIPS32-like ISA that does not have delayed branch . . .
L1:   LW R8, 0(R4)
      SLT R9, R8, R0       # R9 = (*p < 0)
      BEQ R9, R0, L2       # branch if !(*p < 0)
      ADDIU R25, R25, 1    # count++
L2:   ADDIU R4, R4, 4      # p++
      BNE R4, R24, L1      # branch if p != past_last
Let’s suppose that there are a lot of array elements, and most of them are negative. As the processor runs the loop, what predictions will it learn to make about the BEQ and BNE instructions?
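
To see what a predictor has to learn, it helps to write out the taken/untaken history of each branch. The sketch below derives those histories from the loop's meaning rather than by simulating the assembly; the function names and the sample array are hypothetical, and the array must be non-empty because the do-while body runs at least once.

```python
def branch_outcomes(values):
    """Return (beq_taken, bne_taken) outcome lists for the loop over `values`.

    BEQ is taken when the element is NOT negative (count++ is skipped);
    BNE is taken on every iteration except the last.
    """
    beq = [not (v < 0) for v in values]
    bne = [i != len(values) - 1 for i in range(len(values))]
    return beq, bne


def taken_fraction(outcomes):
    return sum(outcomes) / len(outcomes)


data = [-3, -1, 4, -7, -2, -9, 0, -5, -8, -6]   # hypothetical array, mostly negative
beq, bne = branch_outcomes(data)
print("BEQ taken fraction:", taken_fraction(beq))
print("BNE taken fraction:", taken_fraction(bne))
```

For an array that is mostly negative, BEQ is rarely taken and BNE is taken on all but the final iteration, so a history-based predictor would be expected to settle on "untaken" for BEQ and "taken" for BNE.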

Scalar versus Superscalar
It seems like the right moment to introduce these terms.
A scalar processor core starts no more than one instruction per clock cycle. In some cycles it can’t start an instruction, due to a stall caused by a pipeline hazard. All of the pipeline examples so far have been for scalar cores.
A superscalar processor core tries to start two or more instructions per clock cycle. When I start talking about superscalar cores, I will let you know.

A 5-stage pipeline with dynamic branch prediction
Let’s review our previous sketch of the 5-stage pipeline, then show how it would be modified to support dynamic branch prediction.
An instruction fetch unit encapsulates a PC, an L1 I-cache, and a branch prediction circuit.
Both sketches are for scalar systems.