On-Chip Interconnection Networks
Chip Multiprocessors (ACS MPhil)
Robert Mullins

Introduction
• Vast transistor budgets, but....
• Poor interconnect scaling
  – Pressure to decentralise designs
• Need to manage complexity and power
• Need for flexible/fault-tolerant designs
• Parallel architectures
  – Keep core complexity constant or simplify
  – The result is a need to interconnect lots of cores, memories and other IP cores

Introduction
• On-chip communication requirements:
  – High performance
    • Latency and bandwidth
  – Flexibility
    • Move away from fixed application-specific wiring
  – Scalability
    • Number of modules is rapidly increasing
  – Simplicity (ease of design and verification)
    • Structured, modular and regular
    • Optimize channel and router once
  – Efficiency
    • Ability to share global wiring resources between different flows
  – Fault tolerance (in the long term)
    • The existence of multiple communication paths between module pairs
  – Support for different traffic types and QoS
Introduction
• Don't we already know how to design interconnection networks?
  – Many existing network topologies, router designs and much theory have already been developed for high-end supercomputers and telecom switches
  – Yes, and we'll cover some of this material, but the trade-offs on-chip lead to very different designs
    • "integrated microarchitectural networks"

Introduction
• The design of the on-chip network is not an isolated design decision (or afterthought)
  – e.g. consider impact on the cache coherency protocol
  – What is the correct balance of resources (wires and transistors, silicon area, power etc.) between the on-chip network and computational resources?
  – Where does the on-chip network stop and the design of a module or core start?
  – Does the network simply blindly allow modules to communicate or does it have additional functionality?

On-chip vs. Off-chip
• Compare availability of pins and wiring tracks on-chip to cost of pins/connectors and cables off-chip
• Compare communication latencies on- and off-chip
  – What is the impact on router and network design?
• Applications and workloads
• Amount of memory available on-chip
  – What is the impact on router design/flow control?
• Power budgets on- and off-chip
• Need to map network to planar chip (or perhaps more recently a 3D stack of dies)

On-chip interconnect
• Typical interconnect at 45nm node:
  – 10-14 metal layers
  – Local interconnect (M1)
    • 65nm metal width, 65nm spacing
    • 7700 metal tracks/mm
  – Global (e.g. M10)
    • 400nm metal width, 400nm spacing
    • 1250 metal tracks/mm
• Remember global interconnects scale poorly when compared to transistors
[Figure: 9Cu+1Al process cross-section (Fujitsu 2007)]
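The track-density figures above follow directly from the metal pitch (width plus spacing). A minimal sketch using only the 45nm numbers from the slide; the helper name is our own, not from the lecture:

```python
# Minimal sketch: wiring-track density from metal width and spacing.
# Uses only the 45nm figures quoted on the slide (65nm/65nm local, 400nm/400nm global).

def tracks_per_mm(width_nm: float, spacing_nm: float) -> float:
    """Parallel wiring tracks per mm of die edge; pitch = width + spacing."""
    pitch_nm = width_nm + spacing_nm
    return 1e6 / pitch_nm  # 1 mm = 1e6 nm

print(round(tracks_per_mm(65, 65)))    # ~7692, i.e. the "7700 tracks/mm" local (M1) figure
print(round(tracks_per_mm(400, 400)))  # 1250, the global (M10) figure
```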
Bus-based interconnects
• Bus-based interconnects
  – Central arbiter provides access to bus
  – Logically the bus is simply viewed as a set of wires shared by all processors

Bus-based interconnects
• Real bus implementations are typically switch based
  – Multiplexers and unidirectional interconnects with repeaters
  – Tri-states are rarely used now
  – Interconnect itself may be pipelined
• A bus-based CMP usually exploits multiple unidirectional buses
  – e.g. address bus, response bus and data bus

Bus-based interconnects for multicore?
• Metal/wiring is cheap on-chip!
• Avoid complexity of packet-switched networks
• Keep cache-coherency simple
• Performance issues
  – Centralised arbitration
  – Low clock frequency (pipeline?)
  – Power?
  – Scalability?
[Figure: repeated bus vs. global interconnect, Shekhar Borkar (OCIN'06)]

Bus-based interconnects for multicore?
• Optimising bus-based solutions:
  – Arbitrate for next cycle on current clock cycle
  – Use wide, low-swing interconnects
  – Limit broadcast to subset of processors?
    • Segment bus and filter redundant broadcasts to segments by maintaining some knowledge of cache contents (so-called "filtered segmented buses")
  – Employ multiple buses
  – Move from electrical to on-chip optical solutions?
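The slides above leave centralised bus arbitration abstract. As a hedged illustration (not the scheme of any particular CMP), here is a minimal round-robin arbiter model; the function name and interface are assumptions made for the sketch:

```python
# Minimal sketch (ours, not from the slides): a centralised round-robin bus arbiter.
# One requester is granted per cycle; the grant pointer rotates so no core is starved.

def round_robin_arbiter(requests, last_grant):
    """Return the index of the granted requester, or None if nobody is requesting."""
    n = len(requests)
    for offset in range(1, n + 1):          # start searching just after the last winner
        candidate = (last_grant + offset) % n
        if requests[candidate]:
            return candidate
    return None

# Example: 4 cores, cores 1 and 3 request the bus; core 3 won last time.
print(round_robin_arbiter([False, True, False, True], last_grant=3))  # -> 1
```

Arbitrating for the next cycle during the current one, as the optimisation slide suggests, amounts to evaluating a function like this one cycle ahead of the granted transfer.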
Filtered Segmented Bus
• Filter broadcasts to segments with a Bloom filter (sketch below)
• Energy savings possible vs. mesh and flattened butterfly networks (for 16, 32 and 64 cores) because routers can be removed
• For large numbers of cores, multiple (address-interleaved) buses are required to avoid a significant performance penalty due to contention
"Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks", Udipi et al., HPCA 2010

Bus-based interconnects for multicore?
• Exploiting multiple buses (or rings):
  – Multiple address-interleaved buses
    • e.g. Sun Wildfire/Starfire
  – Use different buses for different message types
  – Subspace snooping [Huh/Burger06]
    • Associate (dynamic) address ranges with each bus. Each subspace is a region of data shared by a stable subset of the processors.
    • This technique tackles snoop bandwidth limitations, as not all processors are required to snoop all buses
  – Exploit buses at the lowest level of a hierarchical network (e.g. mesh interconnecting tiles, where each tile is a group of cores connected by a bus)

Sun Starfire (UE10000)
• Up to 64-way SMP using bus-based snooping protocol
• 4 processors + memory module per system board
• Uses 4 interleaved address buses to scale the snooping protocol
• Separate data transfer over a high-bandwidth 16x16 data crossbar
[Slide from Krste Asanovic (Berkeley); figure shows processors with caches, board interconnects and the 16x16 data crossbar]

Ring Networks
• k-node ring (or k-ary 1-cube)
• Exploit short point-to-point interconnects
• Can support many concurrent data transfers
• Can keep coherence protocol simple and avoid need for directory-based schemes
  – We may still broadcast transactions
• Modest area requirements
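To make the filtered segmented bus idea from the slide above concrete, here is a minimal sketch of a per-segment Bloom filter that tracks which cache lines a segment might hold, so snoop broadcasts can skip segments that definitely hold no copy. This is our own illustration, not the design from Udipi et al.; the class name, hash choice and sizes are assumptions.

```python
# Minimal sketch (ours): a Bloom filter summarising the cache lines held in one bus segment.
import hashlib

class SegmentFilter:
    def __init__(self, bits=1024, hashes=2):
        self.bits = bits
        self.hashes = hashes
        self.bitmap = [False] * bits

    def _positions(self, addr):
        # Derive `hashes` bit positions from the block address.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{addr}-{i}".encode()).digest()
            yield int.from_bytes(digest[:4], "little") % self.bits

    def insert(self, addr):
        # Called when a line is cached somewhere in this segment.
        for p in self._positions(addr):
            self.bitmap[p] = True

    def may_contain(self, addr):
        # False means the segment definitely has no copy: the broadcast can be filtered.
        return all(self.bitmap[p] for p in self._positions(addr))

seg = SegmentFilter()
seg.insert(0x1000)
print(seg.may_contain(0x1000))  # True: broadcast must reach this segment
print(seg.may_contain(0x2000))  # almost certainly False: broadcast can be skipped
```

With insertions only, the filter can produce false positives (extra broadcasts) but never false negatives, which is what keeps the filtering safe for coherence.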
Ring Networks
• Control
  – May be distributed
    • Need to be a little careful to avoid possibility of deadlock (more later!)
  – Or a centralised arbiter/scheduler may be used
    • e.g. IBM Cell BE and Larrabee both appear to use a centralised scheduler
    • Try and schedule as many concurrent (non-overlapping) transfers on each available ring as possible
• Trivial routers at each node
  – Simple routers are attractive as they don't introduce significant latency, power and area overheads

Ring Networks: Examples
• IBM
  – Power4, Power5
• IBM/Sony/Toshiba
  – Cell BE (PS3, HDTV, Cell blades, ...)
• Intel
  – Larrabee (graphics), 8-core Xeon processor
• Kendall Square Research (1990s)
  – Massively parallel supercomputer design
  – Ring of rings (hierarchical or multi-ring) topology
    • Cluster = 32 nodes connected in a ring
    • Up to 34 clusters connected by a higher-level ring

Ring Networks: Example IBM Cell BE
• Cell Broadband Engine
  – Message-passing style (no $ coherence)
  – Element Interconnect Bus (EIB)
    • 2 rings are provided in each direction
    • Crossbar solution was deemed too large

Ring Networks: Example Larrabee
• Cache coherent
• Bi-directional ring network, 512-bit wide links
  – Short linked rings proposed for >16 processors
  – Routing decisions are made before injecting messages
    • The clockwise ring delivers on even clock cycles and the anticlockwise ring on odd clock cycles
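The routing decision a node makes before injecting onto a bi-directional ring, of the kind the Larrabee slide describes, can be as simple as picking the direction with fewer hops. A minimal sketch; the node numbering and function name are our own assumptions:

```python
# Minimal sketch (ours): minimal-distance direction choice on a k-node bi-directional ring.

def ring_route(src: int, dst: int, k: int):
    """Pick clockwise or anticlockwise and return (direction, hop count)."""
    cw_hops = (dst - src) % k
    acw_hops = (src - dst) % k
    if cw_hops <= acw_hops:
        return ("clockwise", cw_hops)
    return ("anticlockwise", acw_hops)

print(ring_route(src=1, dst=6, k=8))  # ('anticlockwise', 3)
print(ring_route(src=0, dst=3, k=8))  # ('clockwise', 3)
```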
Crossbar Networks
• A crossbar switch is able to directly connect any input to any output without any intermediate stages
  – It is an example of a strictly non-blocking network
    • It can connect any input to any output, incrementally, without the need to rearrange any of the circuits currently set up
  – The main limitation of a crossbar is its cost. Although very useful in small configurations, n x n crossbars can quickly become prohibitively expensive as their cost increases as n²
(Dally/Towles book, Chapter 6)

Crossbar Networks
• A 4x3 crossbar implemented using three 4:1 multiplexers (see the sketch below)
• Each multiplexer selects a particular input to be connected to the corresponding output

Crossbar Networks: Example Niagara
• Crossbar switch interconnects 8 processors to banked on-chip L2 cache
  – A crossbar is actually provided in each direction: forward and return
• Simple cache coherence protocol
  – See earlier seminar
[Figure reproduced from IEEE Micro, Mar '05]

Crossbar Networks: Example Cyclops
• IBM, US Dept. of Energy/Defense, Academia
• Full system: 1M+ processors, 80 cores per chip
• Interconnect: centralised 96x96 buffered crossbar switch with a 7-stage pipeline
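A minimal sketch of the 4x3 multiplexer-based crossbar described above: each output is simply a 4:1 multiplexer whose select chooses the driving input, which also makes the n² cost of an n x n crossbar explicit (n selects over n inputs each). The function name and encoding are ours.

```python
# Minimal sketch (ours): a 4x3 crossbar modelled as three 4:1 multiplexers.
# selects[j] names which input drives output j.

def crossbar(inputs, selects):
    """Each output j is just inputs[selects[j]]: one multiplexer per output."""
    return [inputs[s] for s in selects]

# 4 inputs, 3 outputs: output0 <- input2, output1 <- input0, output2 <- input3
print(crossbar([10, 11, 12, 13], [2, 0, 3]))  # [12, 10, 13]
```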