limits of superscalar architecture
play

Limits of Superscalar Architecture Virendra Singh Associate - PowerPoint PPT Presentation

Limits of Superscalar Architecture Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/ E-mail:


  1. Limits of Superscalar Architecture Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/ E-mail: viren@ee.iitb.ac.in CS-683: Advanced Computer Architecture Lecture 21 (09 Oct 2013) CADSL

  2. VLIW Processors • Package multiple operations into one instruction • Example VLIW processor: – One integer instruction (or branch) – Two independent floating-point operations – Two independent memory references  Must be enough parallelism in code to fill the available slots CADSL 09 Oct 2013 CS-683@IITB 2

  3. VLIW Processors • Disadvantages: – Statically finding parallelism – Code size – No hazard detection hardware – Binary code compatibility CADSL 09 Oct 2013 CS-683@IITB 3

  4. Summary: Multiple Issue CADSL 09 Oct 2013 CS-683@IITB 4

  5. Recap: Advanced Superscalars • Even simple branch prediction can be quite effective • Path-based predictors can achieve >95% accuracy • BTB redirects control flow early in pipe, BHT cheaper per entry but must wait for instruction decode • Branch mispredict recovery requires snapshots of pipeline state to reduce penalty • Unified physical register file design, avoids reading data from multiple locations (ROB+arch regfile) • Superscalars can rename multiple dependent instructions in one clock cycle • Need speculative store buffer to avoid waiting for stores to commit CADSL 09 Oct 2013 5 CS-683@IITB

  6. Recap: Branch Prediction and Speculative Execution Update predictors kill Branch Branch kill Resolution Prediction kill kill Decode & PC Fetch Reorder Buffer Commit Rename Reg. File Branch Store ALU MEM D$ Unit Buffer CADSL Execute 09 Oct 2013 6 CS-683@IITB

  7. Little ’ s Law Parallelism = Throughput * Latency or = × N T L Throughput per Cycle One Operation Latency in Cycles CADSL 09 Oct 2013 7 CS-683@IITB

  8. Example Pipelined ILP Machine Max Throughput, Six Instructions per Cycle One Pipeline Stage Two Integer Units, Latency in Single Cycle Latency Cycles Two Load/Store Units, Three Cycle Latency Two Floating-Point Units, Four Cycle Latency • How much instruction-level parallelism (ILP) required to keep machine pipelines busy? + + (2x1 2x3 2x4) 2 2 T = = = = × = 6 L 2 N 6 2 1 6 6 3 3 CADSL 09 Oct 2013 8 CS-683@IITB

  9. Superscalar Control Logic Scaling Issue Width W Issue Group Previously Issued Lifetime L Instructions • Each issued instruction must check against W*L instructions, i.e., growth in hardware ∝ W*(W*L) • For in-order machines, L is related to pipeline latencies • For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB) • As W increases, larger instruction window is needed to find enough parallelism to keep machine busy => greater L => Out-of-order control logic grows faster than W2 (~W3) CADSL 09 Oct 2013 9 CS-683@IITB

  10. Superscalar Scenario • Interest in multiple-issue because wanted to improve performance without affecting uniprocessor programming model • Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice • Conservative in ideas, just faster clock and bigger • Processors of last 10 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple-issue processors announced in 1995 – Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load- store units  performance 8 to 16X • Peak v. delivered performance gap increasing CADSL 09 Oct 2013 CS-683@IITB 10

  11. Out-of-Order Control Complexity: MIPS R10000 Control Logic [ SGI/MIPS Technologies Inc., 1995 ] CADSL 09 Oct 2013 11 CS-683@IITB

  12. Sequential ISA Bottleneck Sequential Sequential Superscalar compiler source code machine code a = foo(b); for (i=0, i< Find independent Schedule operations operations Superscalar processor Schedule Check instruction execution dependencies CADSL 09 Oct 2013 12 CS-683@IITB

  13. For most apps, most execution units lie idle For an 8-way superscalar. From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism, ISCA 1995. CADSL 09 Oct 2013 CS-683@IITB 13

  14. Limits to ILP • Conflicting studies of amount Benchmarks (vectorized Fortran FP vs. integer C programs) – Hardware sophistication – Compiler sophistication – • How much ILP is available using existing mechanisms with increasing HW budgets? • Do we need to invent new HW/SW mechanisms to keep on processor performance curve? – Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock – Motorola AltaVec: 128 bit ints and FPs – Supersparc Multimedia ops, etc. – CADSL 09 Oct 2013 CS-683@IITB 14

  15. Overcoming Limits • Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies • However, unlikely such advances when coupled with realistic hardware will overcome these limits in near future CADSL 09 Oct 2013 CS-683@IITB 15

  16. Limits to ILP Initial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start: 1. Register renaming – infinite virtual registers  all register WAW & WAR hazards are avoided 2. Branch prediction – perfect; no mispredictions 3. Jump prediction – all jumps perfectly predicted (returns, case statements) 2 & 3  no control dependencies; perfect speculation & an unbounded buffer of instructions available 4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal; 1&4 eliminates all but RAW Also: perfect caches; 1 cycle latency for all instructions (FP *,/); unlimited instructions issued/clock cycle; CADSL 09 Oct 2013 CS-683@IITB 16

  17. Limits to ILP HW Model comparison Model Power 5 Instructions Issued Infinite 4 per clock Instruction Window Infinite 200 Size Renaming Infinite 48 integer + Registers 40 Fl. Pt. Branch Prediction Perfect 2% to 6% misprediction (Tournament Branch Predictor) Cache Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Perfect ?? Analysis CADSL 17 09 Oct 2013 CS-683@IITB

  18. Upper Limit to ILP: Ideal Machin e FP: 75 - 150 Instructions Per Clock Integer: 18 - 60 CADSL 09 Oct 2013 CS-683@IITB 18

  19. Limits to ILP HW Model comparison New Model Power 5 Model Instructions Infinite Infinite 4 Issued per clock Instruction Infinite, 2K, Infinite 200 Window Size 512, 128, 32 Renaming Infinite Infinite 48 integer + Registers 40 Fl. Pt. Branch Perfect Perfect 2% to 6% Prediction misprediction (Tournament Branch Predictor) Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Perfect Perfect ?? CADSL 19 09 Oct 2013 CS-683@IITB

  20. More Realistic HW: Window Impact Change from Infinite window FP: 9 - 150 2048, 512, 128, 32 Click to edit the outline text format 160 ● 150 Second Outline Level – 140 Third Outline Level 119 ● Instructions Per Cl 120 – Fourth Outline Level Integer: 8 - 63 Fifth Outline ● 100 Level 75 Sixth Outline IPC ● 80 Level 63 61 60 59 55 • Seventh Outline LevelClick to edit 60 49 45 Master text styles 41 36 35 34 – Second level 40 18 16 15 15 15 14 14 ● Third level 13 12 20 11 10 10 9 9 8 8 – Fourth level 0 gcc espresso li fpppp ● Fifth level doduc tomcatv Infinite 2048 512 128 32 20 09 Oct 2013 CS-683@IITB CADSL

  21. Limits to ILP HW Model comparison New Model Power 5 Model Instructions 64 Infinite 4 Issued per clock Instruction 2048 Infinite 200 Window Size Renaming Infinite Infinite 48 integer + Registers 40 Fl. Pt. Branch Perfect vs. 8K Perfect 2% to 6% Prediction Tournament misprediction vs. 512 2-bit (Tournament Branch vs. profile vs. Predictor) none Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Perfect Perfect ?? CADSL 21 09 Oct 2013 CS-683@IITB

  22. More Realistic HW: Branch Impact Change from Infinite window to examine to 2048 and maximum issue of 64 instructions per clock cycle FP: 15 - 45 Integer: 6 - 12 IPC BHT (512) CADSL 09 Oct 2013 CS-683@IITB 22 Profile

  23. Misprediction Rates 35% 30% 30% Misprediction Rate 23% 25% 18% 18% 20% 16% 14% 14% 15% 12% 12% 10% 6% 5% 4% 3% 5% 2% 2% 1% 1% 0% 0% tomcatv doduc fpppp li espresso gcc Profile-based 2-bit counter Tournament CADSL 09 Oct 2013 CS-683@IITB 23

  24. Limits to ILP HW Model comparison New Model Model Power 5 Instructions 64 Infinite 4 Issued per clock Instruction 2048 Infinite 200 Window Size Renaming Infinite v. 256, Infinite 48 integer + Registers 128, 64, 32, none 40 Fl. Pt. Branch 8K 2-bit Perfect Tournament Branch Prediction Predictor Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Perfect Perfect Perfect Alias CADSL 24 09 Oct 2013 CS-683@IITB

  25. More Realistic HW: Renaming Register Impact (N int + N fp) FP: 11 - 45 Change 2048 instr window, 64 instr issue, 8K 2 level Prediction Integer: 5 - 15 IPC CADSL 09 Oct 2013 CS-683@IITB 25

Recommend


More recommend