
CS654 Advanced Computer Architecture Lec 9 Limits to ILP and Simultaneous Multithreading



  1. CS654 Advanced Computer Architecture
     Lec 9 – Limits to ILP and Simultaneous Multithreading
     Peter Kemper
     Adapted from the slides of EECS 252 by Prof. David Patterson, Electrical Engineering and Computer Sciences, University of California, Berkeley

  2. Review from Last Time
     • Interest in multiple-issue because we wanted to improve performance without affecting the uniprocessor programming model
     • Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice
     • Conservative in ideas, just faster clock and bigger
     • Processors of the last 5 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the first dynamically scheduled, multiple-issue processors announced in 1995
       – Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load-store units ⇒ performance 8 to 16X
     • Peak v. delivered performance gap increasing
     3/4/09, CS252 S06 Lec9 Limits and SMT

  3. Outline
     • Review
     • Limits to ILP (another perspective)
     • Thread Level Parallelism
     • Multithreading
     • Simultaneous Multithreading
     • Power 4 vs. Power 5
     • Head to Head: VLIW vs. Superscalar vs. SMT
     • Commentary
     • Conclusion

  4. Limits to ILP
     • Conflicting studies of the amount of ILP, depending on:
       – Benchmarks (vectorized Fortran FP vs. integer C programs)
       – Hardware sophistication
       – Compiler sophistication
     • How much ILP is available using existing mechanisms with increasing HW budgets?
     • Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?
       – Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
       – Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
       – Motorola AltiVec: 128 bit ints and FPs
       – SuperSPARC multimedia ops, etc.

  5. Overcoming Limits
     • Advances in compiler technology + significantly new and different hardware techniques may be able to overcome the limitations assumed in studies
     • However, it is unlikely that such advances, when coupled with realistic hardware, will overcome these limits in the near future

  6. Limits to ILP
     Initial HW model here; MIPS compilers.
     Assumptions for ideal/perfect machine to start:
     1. Register renaming – infinite virtual registers ⇒ all register WAW & WAR hazards are avoided
     2. Branch prediction – perfect; no mispredictions
     3. Jump prediction – all jumps perfectly predicted (returns, case statements)
        2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available
     4. Memory-address alias analysis – addresses known & a load can be moved before a store provided the addresses are not equal
        1 & 4 eliminate all but RAW
     Also: perfect caches; 1-cycle latency for all instructions (FP *, /); unlimited instructions issued per clock cycle
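
Under the ideal-machine assumptions above, only RAW dependences constrain when an instruction can execute, so the achievable IPC is the instruction count divided by the length of the longest dependence chain. The following is a minimal sketch of that calculation, not the methodology of the studies behind the slides. The trace format (a list of `(dest, srcs)` register-name tuples) and the function name `ideal_ipc` are assumptions made for illustration.

```python
def ideal_ipc(trace):
    """Estimate IPC on an idealized machine (illustrative sketch only).

    trace: list of (dest, srcs) tuples in program order, where dest is a
    register name (or None) and srcs is a list of register names.
    Assumed model: unlimited issue width and window, perfect renaming and
    prediction, 1-cycle latency, so only RAW dependences delay execution.
    """
    if not trace:
        return 0.0
    ready = {}       # register -> cycle in which its latest value is produced
    last_cycle = 0   # cycle in which the final instruction executes
    for dest, srcs in trace:
        # Execute one cycle after the latest producer of any source operand.
        cycle = max((ready.get(s, 0) for s in srcs), default=0) + 1
        if dest is not None:
            # Overwriting models renaming: WAW and WAR never cause a delay.
            ready[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(trace) / last_cycle


# Hypothetical 5-instruction trace whose longest RAW chain has length 3.
example = [("r1", []), ("r2", []), ("r3", ["r1", "r2"]), ("r4", ["r3"]), ("r5", [])]
print(ideal_ipc(example))  # 5 instructions over 3 cycles
```

A real limit study would feed in a dynamic instruction trace and also track memory dependences; this sketch follows register dependences only.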

  7. Limits to ILP HW Model comparison
     Parameter | Model | Power 5 (IBM)
     Instructions issued per clock | Infinite | 4
     Instruction window size | Infinite | 200
     Renaming registers | Infinite | 48 integer + 40 Fl. Pt.
     Branch prediction | Perfect | 2% to 6% misprediction (Tournament Branch Predictor)
     Cache | Perfect | 64KI, 32KD, 1.92MB L2, 36 MB L3
     Memory alias analysis | Perfect | ??

  8. Upper Limit to ILP: Ideal Machine (Figure 3.1)
     [Bar chart: instruction issues per cycle (IPC) for the programs gcc, espresso, li, fpppp, doduc, tomcatv. Bar values range from 17.9 to 150.1. Integer: 18 - 60; FP: 75 - 150.]

  9. Limits to ILP HW Model comparison
     Parameter | New Model | Model | Power 5
     Instructions issued per clock | Infinite | Infinite | 4
     Instruction window size | Infinite, 2K, 512, 128, 32 | Infinite | 200
     Renaming registers | Infinite | Infinite | 48 integer + 40 Fl. Pt.
     Branch prediction | Perfect | Perfect | 2% to 6% misprediction (Tournament Branch Predictor)
     Cache | Perfect | Perfect | 64KI, 32KD, 1.92MB L2, 36 MB L3
     Memory alias | Perfect | Perfect | ??

  10. More Realistic HW: Window Impact (Figure 3.2)
     Change from infinite window to 2048, 512, 128, 32.
     [Bar chart: IPC for gcc, espresso, li, fpppp, doduc, tomcatv at window sizes Infinite, 2048, 512, 128, 32. Integer: 8 - 63; FP: 9 - 150.]
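
To see why a smaller window lowers IPC, the sketch below restricts issue to the oldest `window` unissued instructions each cycle, while keeping the other ideal assumptions (unlimited issue width, 1-cycle latency, RAW dependences only). This is a simplified illustration rather than the simulator used for the figure. The name `windowed_ipc` and the `(dest, srcs)` trace format are assumptions.

```python
def windowed_ipc(trace, window):
    """Estimate IPC when only the oldest `window` unissued instructions
    are candidates for issue in each cycle (illustrative sketch only)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(trace)
    if n == 0:
        return 0.0

    # Record, for each instruction, the indices of its RAW producers.
    last_writer = {}
    deps = []
    for i, (dest, srcs) in enumerate(trace):
        deps.append([last_writer[s] for s in srcs if s in last_writer])
        if dest is not None:
            last_writer[dest] = i

    issue_cycle = [None] * n
    pending = list(range(n))   # unissued instructions, in program order
    cycle = 0
    while pending:
        cycle += 1
        candidates = pending[:window]
        # An instruction issues once all producers issued in an earlier cycle.
        issued = [i for i in candidates
                  if all(issue_cycle[d] is not None and issue_cycle[d] < cycle
                         for d in deps[i])]
        for i in issued:
            issue_cycle[i] = cycle
        issued_set = set(issued)
        pending = [i for i in pending if i not in issued_set]
    return n / cycle


# Hypothetical trace: a 4-long dependence chain followed by 4 independent ops.
chain_then_independent = [("a", []), ("a", ["a"]), ("a", ["a"]), ("a", ["a"]),
                          ("b", []), ("c", []), ("d", []), ("e", [])]
for w in (8, 2, 1):
    print(w, windowed_ipc(chain_then_independent, w))
```

With a window of 8 the independent instructions overlap with the chain. With a window of 1 the machine issues one instruction per cycle in program order.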

  11. Limits to ILP HW Model comparison
     Parameter | New Model | Model | Power 5
     Instructions issued per clock | 64 | Infinite | 4
     Instruction window size | 2048 | Infinite | 200
     Renaming registers | Infinite | Infinite | 48 integer + 40 Fl. Pt.
     Branch prediction | Perfect vs. 8K Tournament vs. 512 2-bit vs. profile vs. none | Perfect | 2% to 6% misprediction (Tournament Branch Predictor)
     Cache | Perfect | Perfect | 64KI, 32KD, 1.92MB L2, 36 MB L3
     Memory alias | Perfect | Perfect | ??

  12. More Realistic HW: Branch Impact (Figure 3.3)
     Change from infinite window to a window of 2048 instructions to examine and a maximum issue of 64 instructions per clock cycle.
     [Bar chart: IPC for gcc, espresso, li, fpppp, doduc, tomcatv under five branch prediction schemes: Perfect; Tournament (selective predictor); BHT (512) (standard 2-bit); Profile (static); No prediction. Integer: 6 - 12; FP: 15 - 45.]

  13. Misprediction Rates
     [Bar chart: misprediction rate (axis 0% to 35%, highest bar 30%) for tomcatv, doduc, fpppp, li, espresso, gcc under three predictors: Profile-based, 2-bit counter, Tournament.]
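
As a reference point for the "2-bit counter" scheme compared above, here is a minimal sketch of a table of 2-bit saturating counters indexed by branch address, together with a helper that measures its misprediction rate on a branch trace. The class name, the indexing function, the initial counter state, and the trace format (`(pc, taken)` pairs) are all assumptions for illustration. The 512-entry default mirrors the "512 2-bit" configuration named in the comparison tables.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (illustrative sketch only).

    Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
    """

    def __init__(self, entries=512):
        self.entries = entries
        self.counters = [1] * entries  # assumed start: weakly not taken

    def _index(self, pc):
        # Assumed indexing: drop the 2 low bits of a word-aligned address.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def misprediction_rate(predictor, branch_trace):
    """Fraction of (pc, taken) outcomes the predictor gets wrong."""
    if not branch_trace:
        return 0.0
    misses = 0
    for pc, taken in branch_trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses / len(branch_trace)


# Hypothetical loop branch: taken 9 times, then not taken, repeated 10 times.
loop_trace = ([(0x400, True)] * 9 + [(0x400, False)]) * 10
print(misprediction_rate(TwoBitPredictor(), loop_trace))
```

On this loop pattern the predictor misses once at warm-up and once at each loop exit. The counter stays on the taken side after an exit, so the first iteration of the next pass is predicted correctly.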

  14. Limits to ILP HW Model comparison
     Parameter | New Model | Model | Power 5
     Instructions issued per clock | 64 | Infinite | 4
     Instruction window size | 2048 | Infinite | 200
     Renaming registers | Infinite v. 256, 128, 64, 32, none | Infinite | 48 integer + 40 Fl. Pt.
     Branch prediction | 8K entries total, 2 levels | Perfect | Tournament Branch Predictor
     Cache | Perfect | Perfect | 64KI, 32KD, 1.92MB L2, 36 MB L3
     Memory alias | Perfect | Perfect | Perfect

  15. More Realistic HW: Renaming Register Impact (N int + N fp) (Figure 3.5)
     Change: 2048 instr window, 64 instr issue, 8K 2 level prediction.
     [Bar chart: IPC for gcc, espresso, li, fpppp, doduc, tomcatv with Infinite, 256, 128, 64, 32, and no renaming registers. Integer: 5 - 15; FP: 11 - 49.]
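
The effect being limited here is that renaming removes WAW and WAR hazards by giving every result its own register. The sketch below shows the idea with an unbounded supply of physical registers, as in the ideal model. The function name `rename`, the `p0, p1, ...` naming, and the `(dest, srcs)` trace format are assumptions for illustration. A machine with a finite register pool would additionally have to stall when no free register is available, which this sketch does not model.

```python
def rename(trace):
    """Rename destinations to fresh physical registers (illustrative sketch).

    trace: list of (dest, srcs) tuples in program order. Returns a new
    trace in which every write gets a unique register, so only RAW
    dependences remain.
    """
    mapping = {}     # architectural register -> current physical register
    renamed = []
    next_phys = 0
    for dest, srcs in trace:
        # Sources read the most recent mapping; unmapped names are live-ins.
        new_srcs = [mapping.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            new_dest = f"p{next_phys}"
            next_phys += 1
            mapping[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed


# Hypothetical trace with a WAR hazard on r2 and a WAW hazard on r1.
before = [("r1", ["r2"]), ("r2", ["r3"]), ("r1", ["r2"])]
for instr in rename(before):
    print(instr)
```

After renaming, the second instruction writes `p1` instead of `r2`, so it no longer conflicts with the first instruction's read of `r2`. The two writes to `r1` become writes to `p0` and `p2`.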

  16. Limits to ILP HW Model comparison
     Parameter | New Model | Model | Power 5
     Instructions issued per clock | 64 | Infinite | 4
     Instruction window size | 2048 | Infinite | 200
     Renaming registers | 256 Int + 256 FP | Infinite | 48 integer + 40 Fl. Pt.
     Branch prediction | 8K 2 level | Perfect | Tournament
     Cache | Perfect | Perfect | 64KI, 32KD, 1.92MB L2, 36 MB L3
     Memory alias | Perfect v. Stack v. Inspect v. none | Perfect | Perfect

  17. More Realistic HW: Memory Address Alias Impact (Figure 3.6)
     Change: 2048 instr window, 64 instr issue, 8K 2 level prediction, 256 renaming registers.
     [Bar chart: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doduc, tomcatv under four alias analysis schemes: Perfect; Global/stack perfect, heap conflicts; Inspection (Assem.); None. Integer: 4 - 9; FP: 4 - 45 (Fortran, no heap).]

  18. Limits to ILP HW Model comparison
     Parameter | New Model | Model | Power 5
     Instructions issued per clock | 64 (no restrictions) | Infinite | 4
     Instruction window size | Infinite vs. 256, 128, 64, 32 | Infinite | 200
     Renaming registers | 64 Int + 64 FP | Infinite | 48 integer + 40 Fl. Pt.
     Branch prediction | 1K 2-bit | Perfect | Tournament
     Cache | Perfect | Perfect | 64KI, 32KD, 1.92MB L2, 36 MB L3
     Memory alias | HW disambiguation | Perfect | Perfect

  19. Realistic HW: Window Impact (Figure 3.7)
     Perfect disambiguation (HW), 1K selective prediction, 16 entry return, 64 registers, issue as many as window.
     [Bar chart: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doduc, tomcatv at window sizes Infinite, 256, 128, 64, 32, 16, 8, 4. Integer: 6 - 12; FP: 8 - 45.]

  20. Outline
     • Review
     • Limits to ILP (another perspective)
     • Thread Level Parallelism
     • Multithreading
     • Simultaneous Multithreading
     • Power 4 vs. Power 5
     • Head to Head: VLIW vs. Superscalar vs. SMT
     • Commentary
     • Conclusion
