1 • Trends in Microprocessor Architecture
R05 Chip Multiprocessors (ACS MPhil)
Robert Mullins

Overview
• Computer architecture
• Scaling performance and CMOS
  – Where have performance gains come from?
  – Modern superscalar processors
  – The limits of superscalar processors
• Going parallel
• This course

Computer architecture

“Computer architecture is the science and art of selecting and interconnecting hardware components to create computers that meet functional, performance and cost goals”

“Computer architecture is a science of trade-offs”
Yale Patt

“Computer architecture is the interface between what technology can provide and what the marketplace demands”
Mark Hill

“Computer architecture forms the bridge between application need and the capabilities of the underlying technology”
Tilak Agerwala and Siddhartha Chatterjee
Computer architecture
• We cannot architect a new computer without defining performance, power and cost goals. The design process is all about understanding and making trade-offs
• What is our target market and what applications will we be running?
• The “best” architecture is a moving target
  – The needs of the marketplace change
  – Shifting fabrication technology characteristics
  – New technologies
    • memory, packaging, compiler, languages, ...

“Computer architects often err by preparing for yesterday's computations”
Bill Dally
(Easy to make the same error during a PhD!)
Tomorrow's applications and technologies are not easy to predict!

Historic performance gains
Burger's “end of the road” paper suggested performance growth would be limited to 12.5%/annum
• Predicted, 1997-2014: 7.4x
• Actual: ~36x
• If at historical rate: 1720x
[Figure: reproduced from “Computer Architecture: A Quantitative Approach”, Hennessy/Patterson]
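The growth figures quoted above are compound-interest arithmetic, and a short sketch makes them easy to check. The helper names `compound_gain` and `implied_annual_rate` are illustrative and not from the slides; the only inputs are the 12.5%/annum rate, the 1997-2014 window and the ~36x and 1720x totals quoted on the slide.

```python
def compound_gain(annual_rate: float, years: int) -> float:
    """Total improvement factor after compounding annual_rate for `years` years."""
    return (1.0 + annual_rate) ** years


def implied_annual_rate(total_gain: float, years: int) -> float:
    """Annual growth rate that would produce total_gain over `years` years."""
    return total_gain ** (1.0 / years) - 1.0


YEARS = 2014 - 1997  # 17 years of compounding

# Predicted gain at 12.5%/annum: about 7.4x
print(f"predicted: {compound_gain(0.125, YEARS):.1f}x")

# Annual rates implied by the other two totals on the slide
print(f"actual ~36x     -> {implied_annual_rate(36.0, YEARS):.1%} per year")
print(f"historic 1720x  -> {implied_annual_rate(1720.0, YEARS):.1%} per year")
```

Under these assumptions the ~36x actual gain works out to a little over 23% per year, and the 1720x figure corresponds to roughly 55% per year sustained over the same 17 years.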
Microprocessor trends
[Figure: microprocessor trend data, from https://github.com/karlrupp/microprocessor-trend-data]

Historic performance gains
• Microprocessor performance increased at a rate of ~52%/year between 1986 and 2002
  – ~800X improvement over 16 years
  – How was such an improvement in performance achieved?
  – Is this a reasonable rate of performance growth given the advances in fabrication technology?

Exe. time = Instr. count x CPI x Clock Period

• Technology scaling
  – 7 process generations
  – Scaling provides ~1.4x transistor performance improvement per generation
  – 10.5X
  – (careful, this doesn't automatically translate directly into performance gains)
• Gates per clock
  – Less logic between pipeline registers
  – Reduction from ~100 to 10 gate delays
  – 10X
• How?
  – Pipelining
    • 5 to 20 stages (~4X)
  – Circuit-level advances
    • e.g. new logic families
    • ~2.5X
[Figures reproduced with kind permission of Mark Horowitz]
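The factors above can be multiplied together as a rough sanity check on the ~800X figure. This is a sketch under a strong assumption that the slides do not state: that the gains are independent and simply multiply, following Exe. time = Instr. count x CPI x Clock Period. The variable names are illustrative.

```python
def execution_time(instr_count: float, cpi: float, clock_period_s: float) -> float:
    """Execution time in seconds: instruction count x CPI x clock period."""
    return instr_count * cpi * clock_period_s


# Clock period improvements quoted on the slides
tech_scaling = 1.4 ** 7           # 7 generations at ~1.4x each, about 10.5X
gates_per_clock = 4.0 * 2.5       # pipelining (~4X) x circuit advances (~2.5X) = 10X
clock_gain = tech_scaling * gates_per_clock   # about 105X

# Overall gain from ~52%/year compounded over 16 years, about 800X
overall_gain = 1.52 ** 16

# What is left over must come from CPI and instruction count (assumption:
# the factors multiply independently).
implied_ipc_gain = overall_gain / clock_gain

print(f"technology scaling : {tech_scaling:.1f}X")
print(f"clock period total : {clock_gain:.0f}X")
print(f"overall            : {overall_gain:.0f}X")
print(f"implied IPC/instr. : {implied_ipc_gain:.1f}X")

# Example use of the equation itself: halving the clock period at fixed
# instruction count and CPI halves the execution time.
base = execution_time(1e9, 1.5, 1e-9)
faster = execution_time(1e9, 1.5, 0.5e-9)
print(f"speedup from halving the clock period: {base / faster:.1f}x")
```

On these numbers the leftover factor falls inside the ~5-8X SPECint/MHz improvement quoted on the following slide, and the clock-period product is close to the “~105X” annotation that appears later in the deck. That agreement is suggestive rather than a derivation the slides make explicitly.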
Historic performance gains
• IPC & instr. count
  – ~5-8X improvement in SPECint/MHz
  – This is despite clock frequency improvements
  – Includes advances in compiler technology and impact of increased bus widths
[Figure: Improvement in SPECint95/MHz over time. Reproduced with kind permission of Mark Horowitz]
[Figure annotated “~105X”. Reproduced from “CMOS VLSI Design”, Weste/Harris (2005)]

• How was it possible to maintain and even decrease CPI (improve IPC)?
  – Moore's law!
  – How were the additional transistors exploited?
• Intel 386 to Pentium 4
  – 386: 275K transistors (die size = 43 mm²)
  – P4: 42M transistors (die size = 217 mm²)
    • 5X from increased die size
    • 27X from technology scaling
• Today's (2017) largest chips contain > 10 billion transistors
[Figure: reproduced from “CMOS VLSI Design”, Weste and Harris (2005)]
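The split of the 386-to-Pentium 4 transistor growth into a die-size part and a technology-scaling part can be reproduced by straight division. This is a minimal sketch using only the four figures on the slide. It treats “technology scaling” as transistors per unit area, which is an assumption about how the slide's 27X was obtained.

```python
# Figures quoted on the slide
transistors_386 = 275e3
transistors_p4 = 42e6
area_386_mm2 = 43.0
area_p4_mm2 = 217.0

total_gain = transistors_p4 / transistors_386   # growth in transistor count
area_gain = area_p4_mm2 / area_386_mm2          # growth in die area
density_gain = total_gain / area_gain           # growth in transistors per mm^2

print(f"transistor count: {total_gain:.0f}X")
print(f"die area        : {area_gain:.1f}X")
print(f"density         : {density_gain:.0f}X")
```

The area ratio comes out at about 5X, matching the slide. The density ratio from these rounded inputs is about 30X, a little above the slide's 27X, so the slide's figure presumably rests on slightly different source numbers or rounding.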
The future of Moore's Law: 2D to 3D
• Beyond 2021 it won't be economically desirable to shrink transistor dimensions
• Recently introduced vertical transistors (e.g. dual-gate and tri-gate)
• Monolithic 3D predicted by 2024
• Roadmap to consider applications in future (more of an end-to-end view vs. bottom-up)

Moore's Law
The latest ITRS Roadmap (2015) predicts that physical gate length will not shrink beyond 2021. Earlier predictions (2013) were more optimistic.

Modern superscalar processors
• Revision (See Hennessy/Patterson)
  – Significant hardware support for Instruction Level Parallelism (ILP) in most commercial microprocessors
    • Multiple-issue architectures
    • Deep pipelines, branch prediction, speculative execution
    • Large on-chip caches (L1/L2/L3)
    • Out-of-order execution, register renaming
    • Dynamic memory address disambiguation
    • SIMD instructions
    • ...
Limits of superscalar processors
• Cost and complexity of extracting ILP
  – Diminishing returns
  – Increased complexity limits ability to optimise design
• The underlying fabrication technology characteristics are becoming more challenging too
  – Increases verification complexity and time
  – Increases time-to-market

• Pipeline depth limits
  – Interruptions to the pipeline (branches)
  – Performance of the memory system
  – Clocking overheads (registers/clock skew)
  – Need to balance stages and maintain the atomicity of some operations
  – Limited ILP
  – Power cost
(See also “Optimal Pipeline Depth” link on Seminar 1 wiki page)

• Interconnect versus transistor scaling
  – Smaller transistors = faster/lower power
  – Wires don't scale in the same way ☹
  – Centralised structures don't scale well
  – Pressure to decentralise
  – Consider bypass network between FUs
    • Clustered implementations
[Figure: from “Coming challenges in microarchitecture and architecture”, Ronen et al, 2001]
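Returning to the pipeline depth limits listed above, a toy model shows why deeper pipelines stop paying off. It is an illustrative assumption, not the model from the “Optimal Pipeline Depth” reading: the clock period is the logic delay divided across the stages plus a fixed per-stage clocking overhead, and CPI grows linearly with depth because each pipeline interruption costs roughly one pipeline's worth of cycles. All parameter values are made up for illustration.

```python
def time_per_instruction(stages: int,
                         logic_delay: float = 100.0,   # total logic, in gate delays (assumed)
                         overhead: float = 4.0,        # register/skew cost per stage (assumed)
                         hazard_rate: float = 0.04) -> float:  # interruptions per instruction (assumed)
    """Average time per instruction, in gate delays, for a toy pipeline model."""
    clock_period = logic_delay / stages + overhead
    cpi = 1.0 + hazard_rate * stages
    return clock_period * cpi


# Brute-force search over candidate depths for the best time per instruction
best_depth = min(range(1, 101), key=time_per_instruction)

for depth in (1, 5, 10, best_depth, 50, 100):
    print(f"{depth:3d} stages: {time_per_instruction(depth):6.2f} gate delays/instr.")
```

In this model the time per instruction first falls as stages are added and then rises again once clocking overhead and interruption penalties dominate. The location of the minimum depends entirely on the assumed parameters, so only the shape of the curve should be read into it.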
Limits of superscalar processors
• Voltage scaling and power limits
  – Voltage scaling has slowed
    • 5V to 1V - gave us 25X power savings
    • 1V to 0.7V (limit at end of CMOS around 2020)
    • Only 2X power savings left from voltage scaling!
  – Sensible power limits already reached
  – Pressure to reduce power consumption
• Process variation complications
  – Fault tolerance requirements in the longer term

Going parallel
• Accept we can make little progress with single-thread performance
• Look towards thread-level parallelism
  – Achieve our performance gains in a new way:
  – Rapidly increase the number of cores
    • 2X-3X per generation
  – Don't scale the clock frequency
    • Create simpler, more power-efficient cores instead

Going parallel
• Going parallel is simple?
  – Replicate existing processor designs to ease design process
  – Many applications already exist where thread-level parallelism is plentiful
  – We've had 30+ years of experience writing parallel programs

[Figure: core-count projection, Pawlowski (Intel) 2007]
It is now 2018... The number of cores has scaled less aggressively than this.
In 2017 @ 14nm, high-end server part:
• 28-core Xeon (Skylake), 56 threads
• Clock frequency 2.5GHz (max turbo freq. 3.8GHz)
• TDP (power) = 205 W
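The voltage-scaling savings quoted earlier (25X from 5V to 1V, about 2X from 1V to 0.7V) are what the usual dynamic power relation gives, where switching power is proportional to C x V² x f. The sketch below assumes capacitance and frequency are held fixed and ignores leakage, so only the V² term changes. The function name is illustrative.

```python
def dynamic_power_saving(v_from: float, v_to: float) -> float:
    """Factor by which dynamic power falls when the supply drops from v_from to v_to.

    Assumes dynamic power ~ C * V^2 * f with C and f unchanged, and no leakage.
    """
    return (v_from / v_to) ** 2


print(f"5V -> 1V  : {dynamic_power_saving(5.0, 1.0):.0f}X")   # 25X
print(f"1V -> 0.7V: {dynamic_power_saving(1.0, 0.7):.2f}X")   # about 2X
```

Because the saving is quadratic in the voltage ratio, the small remaining headroom between 1V and 0.7V buys only about a factor of two, which is the point the slide makes.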
Going parallel
• Many new challenges:
  – On-chip and off-chip communication
  – Simpler cores and Amdahl's law
  – Power constrained design
  – Support for the shared-memory paradigm?
  – Synchronization and thread-scheduling support?
  – Everyone must now write scalable and correct parallel programs!

Going parallel
• Power is a first order design constraint
  – Power consumption is already at a sensible limit (for many applications we would like to reduce it)
  – We are going to increase the number of cores by 2-3X per generation
• Power savings?
  – Core shrink (<1.4X)
  – Simpler cores (1.4-2X?)
  – Some VDD savings
  – Need to add “uncore” logic too!
  – Techniques for adaptive EPI?

Going parallel
• Beyond homogeneous multicore
  – Power consumption is a limiting factor in the design of multicore processors
  – For many designs this has prompted the integration of many specialized accelerators
    • An ASIC implementation of an algorithm may be 10-1000X more energy efficient than a software implementation
    • e.g. Apple A8 SoC:
      – ~50% custom accelerators
      – ~25% CPUs (2)
      – ~25% GPU

Future of multicore?
• “NAVIGO”, [Hempstead, Wei and Brooks, 2011]
• Examined throughput-orientated workloads
• Suggests gains limited to 35% per year due to power constraints
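The “Simpler cores and Amdahl's law” challenge listed earlier can be illustrated with a few lines of arithmetic. This is a sketch with assumed numbers: a 95% parallel fraction and simpler cores that each run at 0.7 of a big core's speed are illustrative values, and the 28-core count is borrowed from the Skylake example only as a convenient baseline.

```python
def amdahl_speedup(parallel_fraction: float, cores: int,
                   core_speed: float = 1.0) -> float:
    """Speedup over one full-speed core for a partly parallel workload.

    The serial part runs on a single core and the parallel part is spread
    evenly over `cores` cores; every core runs at `core_speed` relative to
    the baseline core.
    """
    serial_time = (1.0 - parallel_fraction) / core_speed
    parallel_time = parallel_fraction / (cores * core_speed)
    return 1.0 / (serial_time + parallel_time)


P = 0.95  # assumed parallel fraction

big = amdahl_speedup(P, cores=28)                       # 28 full-speed cores
simple = amdahl_speedup(P, cores=56, core_speed=0.7)    # twice as many, simpler cores

print(f"28 big cores    : {big:.1f}x")
print(f"56 simpler cores: {simple:.1f}x")
```

With these assumptions, doubling the core count while simplifying each core gives a lower speedup than the 28 big cores, because the serial 5% now runs on a slower core. A more parallel workload would shift the balance the other way.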