

On-Chip Interconnection Networks
Chip Multiprocessors (ACS MPhil)
Robert Mullins

Introduction
• Vast transistor budgets, but ....
• Poor interconnect scaling
  – Pressure to decentralise designs
• Need to manage complexity and power
• Need for flexible/fault-tolerant designs
• Parallel architectures
  – Keep core complexity constant or simplify
  – The result is a need to interconnect lots of cores, memories and other IP cores.

Introduction
• On-chip communication requirements:
  – High performance
    • Latency and bandwidth
  – Flexibility
    • Move away from fixed application-specific wiring
    • Ability to share global wiring resources between different flows
  – Fault tolerance (in the long term)
    • The existence of multiple communication paths between module pairs

Introduction
• On-chip communication requirements (continued):
  – Simplicity (ease of design and verification)
    • Structured, modular and regular
    • Optimize channel and router once
  – Efficiency
  – Scalability
    • Number of modules is rapidly increasing
  – Support for different traffic types and QoS

Introduction
• Don't we already know how to design interconnection networks?
  – Many existing network topologies, router designs and much theory have already been developed for high-end supercomputers and telecom switches
  – Yes, and we'll cover some of this material, but the trade-offs on-chip lead to very different designs.
• "integrated microarchitectural networks"

Introduction
• The design of the on-chip network is not an isolated design decision (or an afterthought)
  – e.g. consider its impact on the cache coherency protocol
  – What is the correct balance of resources (wires and transistors, silicon area, power etc.) between the on-chip network and the computational resources?
  – Where does the on-chip network stop and the design of a module or core start?
  – Does the network simply blindly allow modules to communicate, or does it have additional functionality?

On-chip vs. Off-chip
• Compare availability of pins and wiring tracks on-chip to the cost of pins/connectors and cables off-chip
• Compare communication latencies on- and off-chip
  – What is the impact on router and network design?
• Applications and workloads
• Amount of memory available on-chip
  – What is the impact on router design/flow control?
• Power budgets on- and off-chip
• Need to map the network to a planar chip (or, perhaps more recently, a 3D stack of dies)

On-chip interconnect
• Typical interconnect at the 45nm node (9Cu+1Al process, Fujitsu 2007):
  – 10-14 metal layers
  – Local interconnect (M1)
    • 65nm metal width, 65nm spacing
    • ~7700 metal tracks/mm
  – Global (e.g. M10)
    • 400nm metal width, 400nm spacing
    • 1250 metal tracks/mm
• Remember global interconnects scale poorly when compared to transistors
• (A rough check of these track densities is sketched below.)
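As a quick sanity check on the track densities quoted above: the number of parallel wiring tracks per millimetre is simply 1 mm divided by the wire pitch (metal width plus spacing). A minimal sketch in Python, using only the figures from the slide:

```python
# Rough wiring-track density check for the 45nm figures quoted above.
# tracks/mm = 1 mm / (metal width + spacing), i.e. 1 mm / pitch.

def tracks_per_mm(width_nm: float, spacing_nm: float) -> float:
    """Return the number of parallel wiring tracks that fit in 1 mm."""
    pitch_nm = width_nm + spacing_nm
    return 1_000_000 / pitch_nm  # 1 mm = 1,000,000 nm

print(f"M1  (local):  {tracks_per_mm(65, 65):.0f} tracks/mm")    # ~7692, i.e. ~7700
print(f"M10 (global): {tracks_per_mm(400, 400):.0f} tracks/mm")  # 1250
```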

Bus-based interconnects
• Bus-based interconnects
  – A central arbiter provides access to the bus
  – Logically the bus is simply viewed as a set of wires shared by all processors

Bus-based interconnects
• Real bus implementations are typically switch based
  – Multiplexers and unidirectional interconnects with repeaters
  – Tri-states are rarely used now
  – The interconnect itself may be pipelined
• A bus-based CMP usually exploits multiple unidirectional buses
  – e.g. address bus, response bus and data bus

Bus-based interconnects for multicore?
• Metal/wiring is cheap on-chip!
• Avoid complexity of packet-switched networks
• Keep cache-coherency simple
• Performance issues
  – Centralised arbitration
  – Low clock frequency (pipeline?)
  – Power?
  – Scalability?
(Figure: repeated bus vs. global interconnect, from Shekhar Borkar, OCIN'06)

Bus-based interconnects for multicore?
• Optimising bus-based solutions:
  – Arbitrate for the next cycle on the current clock cycle (a simple arbiter is sketched below)
  – Use wide, low-swing interconnects
  – Limit broadcast to a subset of processors?
    • Segment the bus and filter redundant broadcasts to some segments by maintaining some knowledge of cache contents: so-called "Filtered Segmented Buses"
  – Employ multiple buses
  – Move from electrical to on-chip optical solutions?
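"Arbitrate for the next cycle on the current clock cycle" just means the arbiter's decision is pipelined ahead of the data transfer it controls. The sketch below shows one plausible round-robin policy for this; the class name and interface are invented for illustration and do not correspond to any particular bus design.

```python
class RoundRobinArbiter:
    """Illustrative round-robin bus arbiter.

    Each cycle it is given the set of requesting processors and returns the
    grant to be used on the *next* cycle, so arbitration is overlapped with
    the current bus transfer (as suggested on the slide above).
    """

    def __init__(self, num_requesters: int):
        self.num_requesters = num_requesters
        self.last_grant = num_requesters - 1  # so requester 0 has priority first

    def arbitrate(self, requests: set[int]) -> int | None:
        """Return the requester granted the bus for the next cycle, if any."""
        for offset in range(1, self.num_requesters + 1):
            candidate = (self.last_grant + offset) % self.num_requesters
            if candidate in requests:
                self.last_grant = candidate
                return candidate
        return None  # no requests: the bus idles next cycle


# Example: four cores contending for a single shared bus.
arbiter = RoundRobinArbiter(4)
print(arbiter.arbitrate({0, 2}))  # grants 0
print(arbiter.arbitrate({0, 2}))  # grants 2 (round-robin fairness)
```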

Filtered Segmented Bus
• Filter broadcasts to segments with a Bloom filter (see the sketch after this page)
• Energy savings possible vs. mesh and flattened butterfly networks (for 16, 32 and 64 cores) because routers can be removed
• For large numbers of cores, multiple (address-interleaved) buses are required to avoid a significant performance penalty due to contention
"Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks", Udipi et al., HPCA 2010

Bus-based interconnects for multicore?
• Exploiting multiple buses (or rings):
  – Multiple address-interleaved buses
    • e.g. Sun Wildfire/Starfire
  – Use different buses for different message types
  – Subspace snooping [Huh/Burger06]
    • Associate (dynamic) address ranges with each bus. Each subspace is a region of data shared by a stable subset of the processors.
    • This technique tackles snoop bandwidth limitations, as all processors are not required to snoop all buses
  – Exploit buses at the lowest level of a hierarchical network (e.g. a mesh interconnecting tiles, where each tile is a group of cores connected by a bus)

Sun Starfire (UE10000)
• Up to 64-way SMP using a bus-based snooping protocol
• 4 processors + memory module per system board
• Uses 4 interleaved address buses to scale the snooping protocol (address interleaving is sketched after this page)
• Separate data transfer over a high-bandwidth 16x16 data crossbar
(Slide from Krste Asanovic, Berkeley)

Ring Networks
• k-node ring (or k-ary 1-cube)
• Exploit short point-to-point interconnects
• Can support many concurrent data transfers
• Can keep the coherence protocol simple and avoid the need for directory-based schemes
  – We may still broadcast transactions
• Modest area requirements
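To make the "filter broadcasts with a Bloom filter" idea concrete, here is a minimal sketch of how each bus segment might track which cache blocks could be present in its caches; a broadcast is only forwarded to segments whose filter reports a possible hit. The hash scheme and sizes are illustrative assumptions, not the design used by Udipi et al.

```python
import hashlib

class SegmentSnoopFilter:
    """Illustrative Bloom filter tracking addresses cached within one bus segment.

    A set bit means "an address hashing here *may* be cached in this segment".
    False positives only cause extra (redundant) broadcasts; false negatives
    cannot occur for inserted addresses. Real designs must also handle evictions.
    """

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.bits = [False] * num_bits
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, block_addr: int):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{block_addr}".encode(), digest_size=8)
            yield int.from_bytes(digest.digest(), "little") % self.num_bits

    def insert(self, block_addr: int) -> None:
        for p in self._positions(block_addr):
            self.bits[p] = True

    def may_contain(self, block_addr: int) -> bool:
        return all(self.bits[p] for p in self._positions(block_addr))


# A broadcast for block 0x4000 is only forwarded to segments that may cache it.
segments = [SegmentSnoopFilter() for _ in range(4)]
segments[1].insert(0x4000)
forward_to = [i for i, s in enumerate(segments) if s.may_contain(0x4000)]
print(forward_to)  # [1]
```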

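The address interleaving used by Starfire's four address buses (and suggested above for larger core counts) amounts to one line of logic: a few block-address bits select which bus a request is issued on, spreading snoop traffic across the buses. A minimal sketch, assuming 64-byte cache blocks (the block size is an assumption for illustration):

```python
NUM_BUSES = 4        # Starfire uses 4 interleaved address buses
BLOCK_BYTES = 64     # assumed cache-block size, for illustration only

def bus_for_address(addr: int) -> int:
    """Select the address bus for a request: low block-address bits pick the bus."""
    block_addr = addr // BLOCK_BYTES
    return block_addr % NUM_BUSES

# Consecutive cache blocks are spread round-robin across the four buses,
# so no single bus has to carry (or snoop) all coherence traffic.
print([bus_for_address(b * BLOCK_BYTES) for b in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```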
Ring Networks
• Control
  – May be distributed
    • Need to be a little careful to avoid the possibility of deadlock (more later!)
  – Or a centralised arbiter/scheduler may be used
    • e.g. IBM Cell BE and Larrabee both appear to use a centralised scheduler
    • Try to schedule as many concurrent (non-overlapping) transfers on each available ring as possible
• Trivial routers at each node
  – Simple routers are attractive as they don't introduce significant latency, power and area overheads (a routing sketch follows this page)

Ring Networks: Examples
• IBM
  – Power4, Power5
• IBM/Sony/Toshiba
  – Cell BE (PS3, HDTV, Cell blades, ...)
• Intel
  – Larrabee (graphics), 8-core Xeon processor
• Kendall Square Research (1990s)
  – Massively parallel supercomputer design
  – Ring of rings (hierarchical or multi-ring) topology
    • Cluster = 32 nodes connected in a ring
    • Up to 34 clusters connected by a higher-level ring

Ring Networks: Example IBM Cell BE
• Cell Broadband Engine
  – Message-passing style (no $ coherence)
  – Element Interconnect Bus (EIB)
    • 2 rings are provided in each direction
    • A crossbar solution was deemed too large
    • Routing decisions are made before injecting messages

Ring Networks: Example Larrabee
• Cache coherent
• Bi-directional ring network, 512-bit wide links
  – Short linked rings proposed for >16 processors
• The clockwise ring delivers on even clock cycles and the anticlockwise ring on odd clock cycles
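The "trivial routers" on a ring really can be trivial: at injection time the source simply picks the ring direction with fewer hops (the kind of decision the Cell EIB makes before injecting a message), and every intermediate router either forwards the message to its neighbour or ejects it locally. A minimal sketch, assuming a k-node bidirectional ring; the function name is illustrative:

```python
def ring_route(src: int, dst: int, k: int):
    """Pick a direction and hop count on a k-node bidirectional ring.

    Returns ("CW" or "CCW", hops). Choosing the shorter way bounds the
    distance to at most k // 2 hops.
    """
    cw_hops = (dst - src) % k      # hops travelling clockwise
    ccw_hops = (src - dst) % k     # hops travelling anticlockwise
    return ("CW", cw_hops) if cw_hops <= ccw_hops else ("CCW", ccw_hops)

# On an 8-node ring, node 1 -> node 6 is shorter anticlockwise (3 hops vs. 5).
print(ring_route(1, 6, 8))  # ('CCW', 3)
```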

Crossbar Networks
• A crossbar switch is able to directly connect any input to any output without any intermediate stages
  – It is an example of a strictly non-blocking network
    • It can connect any input to any output, incrementally, without the need to rearrange any of the circuits currently set up.
  – The main limitation of a crossbar is its cost. Although very useful in small configurations, n x n crossbars can quickly become prohibitively expensive as their cost increases as n²
(Dally/Towles book, Chapter 6)

Crossbar Networks
• A 4x3 crossbar can be implemented using three 4:1 multiplexers (a sketch follows this page)
• Each multiplexer selects a particular input to be connected to the corresponding output

Crossbar Networks: Example Niagara
• A crossbar switch interconnects 8 processors to a banked on-chip L2 cache
  – A crossbar is actually provided in each direction: Forward and Return
• Simple cache coherence protocol
  – See earlier seminar
(Figure reproduced from IEEE Micro, Mar'05)

Crossbar Networks: Example Cyclops
• IBM, US Dept. of Energy/Defense, Academia
• Full system 1M+ processors, 80 cores per chip
• Interconnect: centralised 96x96 buffered crossbar switch with a 7-stage pipeline
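A behavioural sketch of the 4x3 multiplexer-based crossbar described above: each of the three outputs has its own 4:1 multiplexer, and a per-output select value chooses which input drives it, so several outputs may read the same input but each output has exactly one source. The function name is illustrative.

```python
def crossbar_4x3(inputs, selects):
    """Model a 4x3 crossbar as three 4:1 multiplexers.

    inputs:  list of 4 input values
    selects: list of 3 select values, one per output, each in range(4)
    returns: list of 3 output values
    """
    assert len(inputs) == 4 and len(selects) == 3
    return [inputs[sel] for sel in selects]  # output j = 4:1 mux controlled by selects[j]

# Connect input 2 -> output 0, input 0 -> output 1, input 2 -> output 2.
print(crossbar_4x3(["a", "b", "c", "d"], [2, 0, 2]))  # ['c', 'a', 'c']
```

The n² cost noted on the slide is visible here: an n x n crossbar needs n multiplexers, each selecting among n inputs, i.e. n² crosspoints in total plus per-output selection logic.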
