Superscalar Processors
Raul Queiroz Feitosa
Parts of these slides are from the support material provided by W. Stallings.

Objective
To provide an overview of the superscalar approach and the key design issues associated with its implementation.

Outline
- Parallelism Concepts
- Superscalar × Superpipelining
- Limitations to Parallelism
- Instruction Issue Policies
- Register Renaming and Dynamic Scheduling

Two Parallelism Concepts
Instruction Level Parallelism (ILP) exists when instructions in a sequence are independent and thus can be executed in parallel, e.g.,
    ...
    ADD EAX,ECX
    MOV EBX,ESI
    ...
can be executed simultaneously while keeping the same result as in a sequential execution.
Machine Parallelism is a measure of the ability of the processor to take advantage of ILP.

Superpipelining Approach
In a conventional pipeline, the most time-consuming stage determines the clock rate.
    Stage latencies: Ifetch (t), Decode (t), Execute (2t), Write (t); clock period = 2t
A superpipelined machine runs at a higher clock rate by splitting the most time-consuming stages into smaller stages.
    More stages, each of latency t; clock period = t
Example: Pentium IV (20 stages)

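The relation between stage latencies and clock period can be sketched in a few lines of Python. This is only an illustration: the function names `clock_period` and `superpipeline` are hypothetical, latencies are expressed in units of t, and the sketch assumes that the work of a stage can be split into equal pieces, which real hardware does not guarantee.

```python
import math

# Stage latencies of the conventional pipeline on the slide, in units of t:
# Ifetch, Decode, Execute, Write.
STAGES = [1, 1, 2, 1]


def clock_period(stage_latencies):
    """The slowest stage dictates the clock period."""
    return max(stage_latencies)


def superpipeline(stage_latencies, target):
    """Split every stage slower than `target` into equal substages.

    Assumption (not from the slides): a stage divides evenly.
    """
    result = []
    for latency in stage_latencies:
        pieces = math.ceil(latency / target)
        result.extend([latency / pieces] * pieces)
    return result


split = superpipeline(STAGES, target=1)
print(clock_period(STAGES), len(STAGES))  # conventional: period and stage count
print(clock_period(split), len(split))    # superpipelined: period and stage count
```

With the latencies above, the conventional pipeline has four stages and a period of 2t, while the split pipeline has five stages and a period of t.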
Superpipelining Execution
[Figure: time diagrams over time units 0 to 9 for a conventional pipeline (Ifetch, Decode, Execute, Write) and for a superpipeline]

Superscalar Approach
A superscalar machine is able to execute multiple instructions independently and concurrently in multiple pipelines.
[Figure: General Superscalar Organization, showing the integer register file, the floating-point register file, the pipelined functional units, and memory]

Superscalar Execution
[Figure: time diagrams over time units 0 to 9 for a conventional pipeline and for a superscalar pipeline, each with stages Ifetch, Decode, Execute, Write]

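To make the comparison between the approaches concrete, the ideal completion times can be estimated as below. This is a rough sketch under stated assumptions: no stalls of any kind, a four-stage base pipeline, six instructions, and a degree of 2 for both the superscalar and the superpipelined machine. The function names are hypothetical, and times are measured in base clock cycles.

```python
import math


def base_pipeline_time(n, stages):
    """Ideal scalar pipeline: one instruction enters per cycle."""
    return stages + n - 1


def superscalar_time(n, stages, degree):
    """Ideal superscalar: `degree` instructions enter per cycle."""
    return stages + math.ceil(n / degree) - 1


def superpipelined_time(n, stages, degree):
    """Ideal superpipeline: one instruction enters every 1/degree base cycle."""
    return stages + (n - 1) / degree


n, stages, degree = 6, 4, 2
print(base_pipeline_time(n, stages))           # 9
print(superscalar_time(n, stages, degree))     # 6
print(superpipelined_time(n, stages, degree))  # 6.5
```

Both alternatives approach the same steady-state throughput; the superpipelined machine lags slightly at the start because the first instructions are still issued one at a time.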
Limitations to Parallelism
1. Resource Conflicts: competition of two or more instructions for the same resource at the same time, e.g., consecutive arithmetic instructions → a possible solution is adding a second ALU.
2. Procedural Dependency: conditional branches (already seen); in superscalar processors more is lost if the prediction fails.
3. Data Dependencies

Data Dependencies
1. True Data Dependency or Read-after-Write (RAW)
    ...
    ADD EAX,ECX
    MOV EBX,EAX
    ...
2. Output Dependency or Write-after-Write (WAW)
    ...
    ADD EAX,ECX
    MOV EAX,EBX
    ...
3. Antidependency or Write-after-Read (WAR)
    ...
    ADD EBX,EAX
    MOV EAX,ECX
    ...

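The three kinds of dependency can be detected by comparing the registers each instruction reads and writes. The following is a minimal sketch, not a model of any real decoder: the helpers `add`, `mov`, and `classify` are hypothetical, and the status flags written by ADD are ignored. Because an x86 ADD also reads its destination, the second pair reports a WAR hazard on EAX alongside the output dependency that the slide highlights.

```python
def add(dst, src):
    """x86 ADD dst,src reads both operands and writes dst (flags ignored)."""
    return {"reads": {dst, src}, "writes": {dst}}


def mov(dst, src):
    """x86 MOV dst,src reads src and writes dst."""
    return {"reads": {src}, "writes": {dst}}


def classify(first, second):
    """Return the dependency kinds that `second` has on the earlier `first`."""
    kinds = set()
    if first["writes"] & second["reads"]:
        kinds.add("RAW")  # true data dependency
    if first["writes"] & second["writes"]:
        kinds.add("WAW")  # output dependency
    if first["reads"] & second["writes"]:
        kinds.add("WAR")  # antidependency
    return kinds


print(classify(add("EAX", "ECX"), mov("EBX", "EAX")))  # RAW only
print(classify(add("EAX", "ECX"), mov("EAX", "EBX")))  # WAW and WAR
print(classify(add("EBX", "EAX"), mov("EAX", "ECX")))  # WAR only
print(classify(add("EAX", "ECX"), mov("EBX", "ESI")))  # empty: independent
```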
Instruction Issue Policy
It refers to:
1. Instruction fetch: the order in which instructions are fetched.
2. Instruction execution: the order in which instructions are delivered to a functional unit to execute the operation.
3. Instruction commit: the order in which instruction results are stored in registers and memory.

In-order issue with in-order completion
Instruction issuing is stalled by resource conflicts, procedural dependencies, or any data dependency.
Example:
1. Up to two instructions may be fetched, issued, and written back at a time.
2. The fetch of the next two instructions waits until the decode buffer is cleared.
3. Three functional units: * (2 clocks), / (2 clocks), (+,-) (1 clock).
4. A data dependency stalls instruction issuing until the execution of the earlier instruction is completed.
5. For RAW, the later instruction may be issued only after the earlier instruction has written its result.

In-order issue and completion
Program:
1. R3=R0*R1
2. R4=R0+R2
3. R5=R0/R1
4. R6=R1+R4
5. R7=R1*R2
6. R1=R0-R2
7. R3=R3*R1
8. R1=R4+R4
Timing (entries are instruction numbers, CY is the clock cycle):
decode | /  | *  | +/- | write | CY
1 2    |    |    |     |       | 1
3 4    |    | 1  | 2   |       | 2
4      | 3  | 1  |     |       | 3
4      | 3  |    |     | 1 2   | 4
5 6    |    |    | 4   | 3     | 5
6      |    | 5  |     | 4     | 6
6      |    | 5  |     |       | 7
7 8    |    |    | 6   | 5     | 8
7 8    |    |    |     | 6     | 9
8      |    | 7  |     |       | 10
8      |    | 7  |     |       | 11
       |    |    | 8   | 7     | 12
       |    |    |     | 8     | 13

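The in-order schedule of the eight-instruction example can be recomputed with a short script. This is a sketch of one reading of the stated rules, not the author's own tool: the name `schedule_in_order` and the tuple encoding of the program are hypothetical, and it assumes that an instruction issues no earlier than the cycle after it is decoded and writes no earlier than the cycle after it finishes executing.

```python
LATENCY = {"*": 2, "/": 2, "+": 1, "-": 1}               # clocks per operation
UNIT = {"*": "mul", "/": "div", "+": "add", "-": "add"}  # three functional units

PROGRAM = [  # (destination, operator, source 1, source 2)
    ("R3", "*", "R0", "R1"),
    ("R4", "+", "R0", "R2"),
    ("R5", "/", "R0", "R1"),
    ("R6", "+", "R1", "R4"),
    ("R7", "*", "R1", "R2"),
    ("R1", "-", "R0", "R2"),
    ("R3", "*", "R3", "R1"),
    ("R1", "+", "R4", "R4"),
]


def schedule_in_order(program, width=2):
    """Return (issue, write) cycle lists for in-order issue and completion."""
    decode, issue, finish, write = [], [], [], []
    unit_free = {}        # functional unit -> first cycle it is free again
    writes_in_cycle = {}  # cycle -> number of results written in that cycle
    for i, (dest, op, a, b) in enumerate(program):
        # Decode: a new group enters only once the previous group has issued.
        if i == 0:
            d = 1
        elif i % width == 0:
            d = max(issue[i - width:i])
        else:
            d = decode[i - 1]
        s = d + 1                                 # issue after decode
        if i:
            s = max(s, issue[i - 1])              # in-order issue
        s = max(s, unit_free.get(UNIT[op], 0))    # resource conflict
        for j, (jdest, _, ja, jb) in enumerate(program[:i]):
            if jdest in (a, b):                   # RAW: wait for the write
                s = max(s, write[j] + 1)
            if dest == jdest or dest in (ja, jb):  # WAW or WAR: wait for execution
                s = max(s, finish[j] + 1)
        f = s + LATENCY[op] - 1
        unit_free[UNIT[op]] = f + 1
        w = f + 1
        if i:
            w = max(w, write[i - 1])              # in-order completion
        while writes_in_cycle.get(w, 0) >= width:  # at most `width` writes per cycle
            w += 1
        writes_in_cycle[w] = writes_in_cycle.get(w, 0) + 1
        decode.append(d)
        issue.append(s)
        finish.append(f)
        write.append(w)
    return issue, write


issue, write = schedule_in_order(PROGRAM)
print(issue)  # first execution cycle of each instruction
print(write)  # write-back cycle of each instruction
```

Under these assumptions the last instruction writes back in cycle 13.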
Out-of-order issue and completion
Instruction window
- A buffer where decoded instructions are stored while waiting for execution.
- It decouples the decode stages from the execution stages.
- The processor can continue to fetch and decode until this window is full.
- When a functional unit becomes available, an instruction can be executed.
- Since instructions have been decoded, the processor can look ahead.

Out-of-order issue and completion
Instruction issuing is stalled by resource conflicts, procedural dependencies, or TRUE data dependencies.
Example:
1. Up to two instructions may be fetched, issued, and written back at a time.
2. Three functional units: * (2 clocks), / (2 clocks), (+,-) (1 clock).
3. Data dependencies other than true (RAW) dependencies do not stall instruction issuing.
4. For RAW, the later instruction may be issued only after the earlier instruction has written its result.

Out-of-order issue and completion
Program (with register renaming annotations):
1. R3=R0*R1
2. R4=R0+R2
3. R5=R0/R1
4. R6=R1+R4
5. R7=R1*R2
6. R1=R0-R2   (register renaming: S1)
7. R3=R3*R1   (register renaming: S1)
8. R1=R4+R4   (register renaming: S2)
Timing (entries are instruction numbers, CY is the clock cycle):
decode | /  | *  | +/- | write | CY
1 2    |    |    |     |       | 1
3 4    |    | 1  | 2   |       | 2
5 6    | 3  | 1  |     | 2     | 3
7 8    | 3  | 5  | 4   | 1     | 4
       |    | 5  | 6   | 3 4   | 5
       |    |    | 8   | 5 6   | 6
       |    | 7  |     | 8     | 7
       |    | 7  |     |       | 8
       |    |    |     | 7     | 9

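The out-of-order schedule of the same example can be approximated with a cycle-by-cycle simulation. This is a sketch under several assumptions that the slides do not state explicitly: the instruction window never fills, register renaming removes every WAR and WAW dependency so only true dependencies are tracked, and the oldest ready instruction is selected first. The name `schedule_out_of_order` and the tuple encoding are hypothetical.

```python
LATENCY = {"*": 2, "/": 2, "+": 1, "-": 1}               # clocks per operation
UNIT = {"*": "mul", "/": "div", "+": "add", "-": "add"}  # three functional units

PROGRAM = [  # (destination, operator, source 1, source 2)
    ("R3", "*", "R0", "R1"),
    ("R4", "+", "R0", "R2"),
    ("R5", "/", "R0", "R1"),
    ("R6", "+", "R1", "R4"),
    ("R7", "*", "R1", "R2"),
    ("R1", "-", "R0", "R2"),
    ("R3", "*", "R3", "R1"),
    ("R1", "+", "R4", "R4"),
]


def schedule_out_of_order(program, width=2):
    """Return (issue, write) cycle lists for out-of-order issue and completion."""
    n = len(program)
    decode = [1 + i // width for i in range(n)]  # `width` instructions decoded per cycle
    # True dependencies only: the most recent earlier writer of each source.
    producers, last_writer = [], {}
    for i, (dest, _, a, b) in enumerate(program):
        producers.append({last_writer[r] for r in (a, b) if r in last_writer})
        last_writer[dest] = i
    issue, finish, write = [None] * n, [None] * n, [None] * n
    unit_busy_until = {}
    cycle = 1
    while any(w is None for w in write):
        cycle += 1
        # Write back up to `width` results that finished in an earlier cycle.
        done = [i for i in range(n)
                if issue[i] is not None and write[i] is None and finish[i] < cycle]
        for i in done[:width]:
            write[i] = cycle
        # Issue up to `width` ready instructions, oldest first.
        issued = 0
        for i in range(n):
            if issued == width:
                break
            if issue[i] is not None or decode[i] >= cycle:
                continue  # already issued, or not yet in the window
            op = program[i][1]
            if unit_busy_until.get(UNIT[op], 0) >= cycle:
                continue  # resource conflict
            if any(write[p] is None or write[p] >= cycle for p in producers[i]):
                continue  # RAW: the producer has not written its result yet
            issue[i] = cycle
            finish[i] = cycle + LATENCY[op] - 1
            unit_busy_until[UNIT[op]] = finish[i]
            issued += 1
    return issue, write


issue, write = schedule_out_of_order(PROGRAM)
print(issue)  # first execution cycle of each instruction
print(write)  # write-back cycle of each instruction
```

Under these assumptions the program completes in 9 cycles, with instruction 8 issuing before instruction 7.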
Out-of-order issue with in-order completion
Exercise: How would it be if out-of-order issue is allowed but in-order completion is required?
Program:
1. R3=R0*R1
2. R4=R0+R2
3. R5=R0/R1
4. R6=R1+R4
5. R7=R1*R2
6. R1=R0-R2
7. R3=R3*R1
8. R1=R4+R4
Timing (entries are instruction numbers, CY is the clock cycle):
decode | /  | *  | +/- | write | CY
1 2    |    |    |     |       | 1
3 4    |    | 1  | 2   |       | 2
5 6    | 3  | 1  |     |       | 3
7 8    | 3  |    |     | 1 2   | 4
       |    | 5  | 4   | 3     | 5
       |    | 5  | 6   | 4     | 6
       |    |    | 8   | 5 6   | 7
       |    | 7  |     |       | 8
       |    | 7  |     |       | 9
       |    |    |     | 7 8   | 10

Exercises
Exercise 1: Complete the timing table (columns decode, /, *, +/-, write, CY; cycles 1 to 15) under the same assumptions as the previous examples, for the program fragment below and for in-order issue and completion.
1. R3=R0*R1
2. R4=R0*R2
3. R5=R0/R1
4. R6=R5+R4
5. R5=R1-R2
6. R1=R0-R2
7. R3=R3*R1

Exercises
Exercise 2: Complete the timing table (columns decode, /, *, +/-, write, CY; cycles 1 to 15) under the same assumptions as the previous examples, for the program fragment below and for out-of-order issue and completion.
1. R3=R0*R1
2. R4=R0*R2
3. R5=R0/R1
4. R6=R5+R4
5. R5=R1-R2
6. R1=R0-R2
7. R3=R3*R1

Exercises
Exercise 3: Complete the timing table (columns decode, /, *, +/-, write, CY; cycles 1 to 15) under the same assumptions as the previous examples, for the program fragment below and for in-order issue and completion.
1. R3=R0-R1
2. R4=R0+R3
3. R3=R0/R1
4. R6=R5*R4
5. R5=R1-R2
6. R1=R0*R2
7. R3=R3*R5

Exercises
Exercise 4: Complete the timing table (columns decode, /, *, +/-, write, CY; cycles 1 to 15) under the same assumptions as the previous examples, for the program fragment below and for out-of-order issue and completion.
1. R3=R0-R1
2. R4=R0+R3
3. R3=R0/R1
4. R6=R5*R4
5. R5=R1-R2
6. R1=R0*R2
7. R3=R3*R5

Register Renaming
Logical registers contain pointers to hidden registers, which actually contain the data.
Logical registers (contain pointers to hidden registers): R0 = 3, R1 = 4, R2 = 7, R3 = 5
Hidden registers (contain data): S0 to S7
Hardware keeps track of non-committed hidden registers.

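The pointer table can be illustrated with a small renaming pass over the eight-instruction example used earlier. This is a simplified sketch, not the scheme of any particular processor: the function `rename` is hypothetical, every destination receives a brand-new hidden register from an unbounded supply, and hidden registers are never recycled, whereas real hardware frees them once results are committed. The hidden register numbers it produces are therefore illustrative and differ from those in the figure.

```python
from itertools import count

PROGRAM = [  # (destination, operator, source 1, source 2)
    ("R3", "*", "R0", "R1"),
    ("R4", "+", "R0", "R2"),
    ("R5", "/", "R0", "R1"),
    ("R6", "+", "R1", "R4"),
    ("R7", "*", "R1", "R2"),
    ("R1", "-", "R0", "R2"),
    ("R3", "*", "R3", "R1"),
    ("R1", "+", "R4", "R4"),
]


def rename(program, logical_regs):
    """Rewrite a program so that each result goes to a fresh hidden register."""
    fresh = (f"S{n}" for n in count())                  # unbounded supply (assumption)
    table = {reg: next(fresh) for reg in logical_regs}  # logical -> hidden pointer
    renamed = []
    for dest, op, a, b in program:
        src_a, src_b = table[a], table[b]  # look up sources before remapping dest
        table[dest] = next(fresh)          # allocate a new hidden register
        renamed.append((table[dest], op, src_a, src_b))
    return renamed, table


renamed, table = rename(PROGRAM, [f"R{i}" for i in range(8)])
for dest, op, a, b in renamed:
    print(f"{dest} = {a} {op} {b}")
```

After renaming, no two instructions write the same hidden register and no instruction overwrites a register that an earlier one still reads, so only the true (RAW) dependencies remain.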