
Superscalar Processors, Raul Queiroz Feitosa



  1. Superscalar Processors. Raul Queiroz Feitosa. Parts of these slides are from the support material provided by W. Stallings.

  2. Objective: To provide an overview of the superscalar approach and the key design issues associated with its implementation.

  3. Outline
     - Parallelism Concepts
     - Superscalar vs. Superpipelining
     - Limitations to Parallelism
     - Instruction Issue Policies
     - Register Renaming and Dynamic Scheduling

  4. Two Parallelism Concepts
     - Instruction Level Parallelism (ILP) exists when instructions in a sequence are independent and thus can be executed in parallel. For example, ADD EAX,ECX and MOV EBX,ESI can be executed simultaneously, keeping the same result as in a sequential execution.
     - Machine Parallelism is a measure of the ability of the processor to take advantage of ILP.

  6. Superpipelining Approach
     - In a conventional pipeline, the most time-consuming stage determines the clock rate. With the stages Ifetch, Decode, Execute and Write taking t, t, 2t and t, the clock period is 2t.
     - A superpipelined machine runs at a higher clock rate by splitting the most time-consuming stages into smaller stages. This gives more stages and a clock period of t.
     - Example: Pentium IV (20 stages).
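
The clock-period argument on slide 6 can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original slides: it assumes an ideal pipeline with no stalls, and it assumes that the 2t Execute stage is split into two stages of t each. The function names are illustrative.

```python
def clock_period(stage_delays):
    """The slowest stage sets the clock period of the pipeline."""
    return max(stage_delays)


def total_time(stage_delays, n_instructions):
    """Time to run n instructions through an ideal pipeline (no stalls)."""
    cycles = len(stage_delays) + n_instructions - 1
    return cycles * clock_period(stage_delays)


t = 1  # arbitrary time unit

# Ifetch, Decode, Execute, Write as on slide 6
conventional = [t, t, 2 * t, t]
# Assumption: Execute is split into two stages of t each
superpipelined = [t, t, t, t, t]

print(clock_period(conventional))       # 2
print(clock_period(superpipelined))     # 1
print(total_time(conventional, 10))     # (4 + 9) * 2 = 26
print(total_time(superpipelined, 10))   # (5 + 9) * 1 = 14
```

The superpipelined version needs one more cycle to fill the pipeline, but every cycle is half as long, so the total time drops for any realistic instruction count.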

  7. Superpipelining Execution. Figure: time diagram of a conventional pipeline (stages Ifetch, Decode, Execute, Write; time axis 0 to 9) compared with the time diagram of a superpipeline (time axis 0 to 9).

  8. Superscalar Approach. A superscalar machine is able to execute multiple instructions independently and concurrently in multiple pipelines. Figure: General Superscalar Organization (integer register file, floating-point register file, pipelined functional units, memory).

  9. Superscalar Execution. Figure: time diagram of a conventional pipeline (stages Ifetch, Decode, Execute, Write; time axis 0 to 9) compared with the time diagram of a superscalar pipeline (same stages; time axis 0 to 9).

  11. Limitations to Parallelism
      1. Resource Conflicts: competition of two or more instructions for the same resource at the same time, e.g., consecutive arithmetic instructions. A possible solution is adding a second ALU.
      2. Procedural Dependency: conditional branches (already seen). In superscalar processors, more is lost if the prediction fails.
      3. Data Dependencies.

  12. Data Dependencies
      1. True Data Dependency, or Read-after-Write (RAW): ADD EAX,ECX followed by MOV EBX,EAX.
      2. Output Dependency, or Write-after-Write (WAW): ADD EAX,ECX followed by MOV EAX,EBX.
      3. Antidependency, or Write-after-Read (WAR): ADD EBX,EAX followed by MOV EAX,ECX.
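
The three dependency types on slide 12 can be detected by comparing the registers each instruction reads and writes. The sketch below is illustrative only: it handles just the two opcodes used on the slides, and it assumes the usual two-operand semantics (ADD dst,src reads both operands and writes dst; MOV dst,src reads src and writes dst). The function names are not from the slides.

```python
def reads_writes(instr):
    """Return (registers read, registers written) for 'ADD dst,src' or 'MOV dst,src'."""
    opcode, operands = instr.split(None, 1)
    dst, src = [r.strip() for r in operands.split(",")]
    if opcode.upper() == "MOV":
        return {src}, {dst}
    if opcode.upper() == "ADD":
        return {dst, src}, {dst}
    raise ValueError(f"unsupported opcode: {opcode}")


def dependencies(first, second):
    """List the dependencies of `second` on the earlier instruction `first`."""
    r1, w1 = reads_writes(first)
    r2, w2 = reads_writes(second)
    found = []
    if w1 & r2:
        found.append("RAW")  # second reads what first writes
    if w1 & w2:
        found.append("WAW")  # both write the same register
    if r1 & w2:
        found.append("WAR")  # second overwrites what first reads
    return found


print(dependencies("ADD EAX,ECX", "MOV EBX,ESI"))  # []
print(dependencies("ADD EAX,ECX", "MOV EBX,EAX"))  # ['RAW']
print(dependencies("ADD EAX,ECX", "MOV EAX,EBX"))  # ['WAW', 'WAR']
print(dependencies("ADD EBX,EAX", "MOV EAX,ECX"))  # ['WAR']
```

Under this model the output-dependency example also reports WAR, because ADD reads its own destination register EAX before the MOV overwrites it.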

  14. Instruction Issue Policy. It refers to:
      1. Instruction fetch: the order in which instructions are fetched.
      2. Instruction execution: the order in which instructions are delivered to a functional unit to execute the operation.
      3. Instruction commit: the order in which instruction results are stored in registers and memory.

  15. In-order issue with in-order completion. Instruction issuing is stalled by resource conflicts, procedural dependencies, or any data dependencies. Example:
      1. Up to two instructions may be fetched, issued and written back at a time.
      2. The fetch of the next two instructions waits until the decode buffer is cleared.
      3. There are 3 functional units: * (2 clocks), / (2 clocks), and +/- (1 clock).
      4. A data dependency stalls instruction issuing until the execution of the earlier instruction is completed.
      5. With a RAW dependency, the later instruction may be issued only after the earlier instruction has written the result.

  16. In-order issue and completion

      | CY | decode | / | * | +/- | write |
      | 1  | 1 2    |   |   |     |       |
      | 2  | 3 4    |   | 1 | 2   |       |
      | 3  | 4      | 3 | 1 |     |       |
      | 4  | 4      | 3 |   |     | 1 2   |
      | 5  | 5 6    |   |   | 4   | 3     |
      | 6  | 6      |   | 5 |     | 4     |
      | 7  | 6      |   | 5 |     |       |
      | 8  | 7 8    |   |   | 6   | 5     |
      | 9  | 7 8    |   |   |     | 6     |
      | 10 | 8      |   | 7 |     |       |
      | 11 | 8      |   | 7 |     |       |
      | 12 |        |   |   | 8   | 7     |
      | 13 |        |   |   |     | 8     |

      Program:
      1. R3=R0*R1
      2. R4=R0+R2
      3. R5=R0/R1
      4. R6=R1+R4
      5. R7=R1*R2
      6. R1=R0-R2
      7. R3=R3*R1
      8. R1=R4+R4
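
The in-order example of slides 15 and 16 can be traced with a small simulation. This is a sketch under one reading of the rules on slide 15, not the author's own method: instructions enter the decode buffer in pairs once the previous pair has fully issued, a RAW dependency waits for the earlier write, and WAW/WAR dependencies wait for the earlier execution to finish. All function and variable names are illustrative.

```python
LATENCY = {"*": 2, "/": 2, "+": 1, "-": 1}  # clocks per operation (slide 15)
UNIT = {"*": "mul", "/": "div", "+": "add", "-": "add"}


def parse(text):
    """Split 'R3=R0*R1' into (dest, op, [src1, src2])."""
    dest, expr = text.split("=")
    for op in LATENCY:
        if op in expr:
            return dest, op, expr.split(op)
    raise ValueError(text)


def simulate_in_order(program):
    """Return (issue cycles, write cycles) for in-order issue and completion."""
    instrs = [parse(p) for p in program]
    issue, done, write = [], [], []
    unit_free = {"mul": 1, "div": 1, "add": 1}  # first cycle each unit is free
    for i, (dest, op, srcs) in enumerate(instrs):
        # A pair enters the decode buffer when the previous pair has fully issued.
        enter = 1 if i < 2 else issue[2 * (i // 2) - 1]
        cycle = enter + 1
        if i > 0:
            cycle = max(cycle, issue[i - 1])       # in-order issue
        cycle = max(cycle, unit_free[UNIT[op]])    # resource conflict
        for j in range(i):
            jdest, _, jsrcs = instrs[j]
            if jdest in srcs:                      # RAW: wait for the write
                cycle = max(cycle, write[j] + 1)
            if dest == jdest or dest in jsrcs:     # WAW / WAR: wait for execution
                cycle = max(cycle, done[j] + 1)
        issue.append(cycle)
        done.append(cycle + LATENCY[op] - 1)
        unit_free[UNIT[op]] = done[i] + 1
        w = done[i] + 1
        if i > 0:
            w = max(w, write[i - 1])               # in-order completion
        if i > 1 and write[i - 2] == w:            # at most two writes per cycle
            w += 1
        write.append(w)
    return issue, write


program = ["R3=R0*R1", "R4=R0+R2", "R5=R0/R1", "R6=R1+R4",
           "R7=R1*R2", "R1=R0-R2", "R3=R3*R1", "R1=R4+R4"]
issue, write = simulate_in_order(program)
print(issue)      # [2, 2, 3, 5, 6, 8, 10, 12]
print(write)      # [4, 4, 5, 6, 8, 9, 12, 13]
print(write[-1])  # 13 cycles in total
```

Under these assumptions the last instruction writes its result in cycle 13, which matches the number of cycles used in the table on slide 16.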

  17. Out-of-order issue and completion: the instruction window
      - A buffer where decoded instructions are stored while waiting for execution.
      - It decouples the decode stages from the execution stages.
      - The processor can continue to fetch and decode until this window is full.
      - When a functional unit becomes available, an instruction can be executed.
      - Since the instructions have already been decoded, the processor can look ahead.
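
The look-ahead behaviour of the instruction window can be illustrated with a small selection routine. This is a hypothetical sketch, not a description of any real processor: each window entry is a tuple (dest, unit, sources), `pending_writes` holds registers whose results have not been written yet, and output and antidependencies are assumed to have been removed already (for instance by register renaming), so only RAW dependencies are checked.

```python
def select_ready(window, pending_writes, free_units, max_issue=2):
    """Pick up to max_issue window entries whose sources are ready and whose unit is free."""
    issued = []
    free = set(free_units)
    for instr in window:
        if len(issued) == max_issue:
            break
        dest, unit, srcs = instr
        # RAW check only: a source still waiting for its write blocks the instruction.
        if unit in free and not (set(srcs) & set(pending_writes)):
            issued.append(instr)
            free.discard(unit)  # one instruction per functional unit per cycle
    return issued


window = [
    ("R6", "add", ("R1", "R4")),  # waits for R4
    ("R7", "mul", ("R1", "R2")),  # ready
    ("R1", "add", ("R0", "R2")),  # ready
]
ready = select_ready(window, pending_writes={"R4"}, free_units={"add", "mul", "div"})
print(ready)  # [('R7', 'mul', ('R1', 'R2')), ('R1', 'add', ('R0', 'R2'))]
```

The first entry stays in the window because R4 is still pending, while the two later entries are issued ahead of it. With in-order issue, both would have been blocked behind the first one.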

  18. Out-of-order issue and completion. Instruction issuing is stalled by resource conflicts, procedural dependencies, or TRUE data dependencies. Example:
      1. Up to two instructions may be fetched, issued and written back at a time.
      2. There are 3 functional units: * (2 clocks), / (2 clocks), and +/- (1 clock).
      3. Data dependency does not stall instruction issuing.
      4. With a RAW dependency, the later instruction may be issued only after the earlier instruction has written the result.

  19. Out-of-order issue and completion

      | CY | decode | / | * | +/- | write |
      | 1  | 1 2    |   |   |     |       |
      | 2  | 3 4    |   | 1 | 2   |       |
      | 3  | 5 6    | 3 | 1 |     | 2     |
      | 4  | 7 8    | 3 | 5 | 4   | 1     |
      | 5  |        |   | 5 | 6   | 3 4   |
      | 6  |        |   |   | 8   | 5 6   |
      | 7  |        |   | 7 |     | 8     |
      | 8  |        |   | 7 |     |       |
      | 9  |        |   |   |     | 7     |

      Program:
      1. R3=R0*R1
      2. R4=R0+R2
      3. R5=R0/R1
      4. R6=R1+R4
      5. R7=R1*R2
      6. R1=R0-R2
      7. R3=R3*R1
      8. R1=R4+R4

      Register renaming annotations on the slide: instructions 6 and 7 are marked S1, instruction 8 is marked S2.

  20. Out-of-order issue with in-order completion. Exercise: How would it be if out-of-order issue is allowed but in-order completion is required?

      | CY | decode | / | * | +/- | write |
      | 1  | 1 2    |   |   |     |       |
      | 2  | 3 4    |   | 1 | 2   |       |
      | 3  | 5 6    | 3 | 1 |     |       |
      | 4  | 7 8    | 3 |   |     | 1 2   |
      | 5  |        |   | 5 | 4   | 3     |
      | 6  |        |   | 5 | 6   | 4     |
      | 7  |        |   |   | 8   | 5 6   |
      | 8  |        |   | 7 |     |       |
      | 9  |        |   | 7 |     |       |
      | 10 |        |   |   |     | 7 8   |

      Program:
      1. R3=R0*R1
      2. R4=R0+R2
      3. R5=R0/R1
      4. R6=R1+R4
      5. R7=R1*R2
      6. R1=R0-R2
      7. R3=R3*R1
      8. R1=R4+R4

  21. Exercises. Exercise 1: Complete the table (columns decode, /, *, +/-, write, CY; cycles 1 to 15) under the same assumptions as the previous examples, for the program fragment below and for in-order issue and completion.
      1. R3=R0*R1
      2. R4=R0*R2
      3. R5=R0/R1
      4. R6=R5+R4
      5. R5=R1-R2
      6. R1=R0-R2
      7. R3=R3*R1

  22. Exercises. Exercise 2: Complete the table (columns decode, /, *, +/-, write, CY; cycles 1 to 15) under the same assumptions as the previous examples, for the program fragment below and for out-of-order issue and completion.
      1. R3=R0*R1
      2. R4=R0*R2
      3. R5=R0/R1
      4. R6=R5+R4
      5. R5=R1-R2
      6. R1=R0-R2
      7. R3=R3*R1

  23. Exercises. Exercise 3: Complete the table (columns decode, /, *, +/-, write, CY; cycles 1 to 15) under the same assumptions as the previous examples, for the program fragment below and for in-order issue and completion.
      1. R3=R0-R1
      2. R4=R0+R3
      3. R3=R0/R1
      4. R6=R5*R4
      5. R5=R1-R2
      6. R1=R0*R2
      7. R3=R3*R5

  24. Exercises. Exercise 4: Complete the table (columns decode, /, *, +/-, write, CY; cycles 1 to 15) under the same assumptions as the previous examples, for the program fragment below and for out-of-order issue and completion.
      1. R3=R0-R1
      2. R4=R0+R3
      3. R3=R0/R1
      4. R6=R5*R4
      5. R5=R1-R2
      6. R1=R0*R2
      7. R3=R3*R5

  26. Register Renaming. Logical registers contain pointers to hidden registers, which actually contain the data. Figure: the logical registers R0, R1, R2 and R3 hold the pointers 3, 4, 7 and 5 into the hidden registers S0 to S7, which contain the data. The hardware keeps track of non-committed hidden registers.
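
The pointer scheme on slide 26 can be sketched as a mapping table plus a list of free hidden registers. The code below is a simplified illustration, not the mechanism of any specific processor: the initial mapping and the free list are hypothetical, every destination simply receives the next free hidden register, and registers are never released (the slides only state that the hardware tracks non-committed hidden registers).

```python
def rename(program, mapping, free):
    """Rewrite (dest, op, src1, src2) tuples so each destination gets a fresh hidden register."""
    mapping = dict(mapping)  # logical register -> current hidden register
    free = list(free)        # hidden registers not yet in use
    renamed = []
    for dest, op, a, b in program:
        a_h, b_h = mapping[a], mapping[b]  # sources read the current mapping
        d_h = free.pop(0)                  # destination gets a fresh hidden register
        mapping[dest] = d_h
        renamed.append((d_h, op, a_h, b_h))
    return renamed, mapping


# Last three instructions of the example program (slides 16 and 19)
program = [
    ("R1", "-", "R0", "R2"),  # 6. R1=R0-R2
    ("R3", "*", "R3", "R1"),  # 7. R3=R3*R1
    ("R1", "+", "R4", "R4"),  # 8. R1=R4+R4
]
# Hypothetical starting state: R0..R4 live in S0..S4, S5..S7 are free
initial = {"R0": "S0", "R1": "S1", "R2": "S2", "R3": "S3", "R4": "S4"}
renamed, final = rename(program, initial, ["S5", "S6", "S7"])
for instr in renamed:
    print(instr)
# ('S5', '-', 'S0', 'S2')
# ('S6', '*', 'S3', 'S5')
# ('S7', '+', 'S4', 'S4')
```

After renaming, instruction 7 reads S5 while instruction 8 writes S7, so the WAR and WAW dependencies on R1 disappear and only the true dependency between instructions 6 and 7 remains.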
