7 • On-Chip Interconnection Networks
Chip Multiprocessors (ACS MPhil)
Robert Mullins
Introduction
• Vast transistor budgets, but...
• Poor interconnect scaling
  – Pressure to decentralise designs
• Need to manage complexity and power
• Need for flexible/fault-tolerant designs
• Parallel architectures
  – Keep core complexity constant or simplify
  – The result is a need to interconnect large numbers of cores, memories and other IP cores.
Introduction
• On-chip communication requirements:
  – High performance
    • Latency and bandwidth
  – Flexibility
    • Move away from fixed application-specific wiring
  – Scalability
    • Number of modules is rapidly increasing
Introduction
• On-chip communication requirements:
  – Simplicity (ease of design and verification)
    • Structured, modular and regular
    • Optimize channel and router once
  – Efficiency
    • Ability to share global wiring resources between different flows
  – Fault tolerance (in the long term)
    • The existence of multiple communication paths between module pairs
  – Support for different traffic types and QoS
Introduction
• The design of the on-chip network is not an isolated design decision (or an afterthought)
  – e.g. consider its impact on the cache coherency protocol
  – What is the correct balance of resources (wires and transistors, silicon area, power, etc.) between the on-chip network and the computational resources?
  – Where does the on-chip network stop and the design of a module or core start?
    • "Integrated microarchitectural networks"
  – Does the network simply blindly allow modules to communicate, or does it have additional functionality?
Introduction
• Don't we already know how to design interconnection networks?
  – Many existing network topologies, router designs and much of the theory have already been developed for high-end supercomputers and telecom switches
  – Yes, and we'll cover some of this material, but the trade-offs on-chip lead to very different designs.
On-chip vs. Off-chip
• Compare the availability of pins and wiring tracks on-chip to the cost of pins/connectors and cables off-chip
• Compare communication latencies on- and off-chip
  – What is the impact on router and network design?
• Applications and workloads
• Amount of memory available on-chip
  – What is the impact on router design/flow control?
• Power budgets on- and off-chip
• Need to map the network to a planar chip (or perhaps more recently a 3D stack of dies)
On-chip interconnect
• Typical interconnect at the 45nm node:
  – 10-14 metal layers
  – Local interconnect (M1)
    • 65nm metal width, 65nm spacing
    • ~7700 metal tracks/mm
  – Global (e.g. M10)
    • 400nm metal width, 400nm spacing
    • 1250 metal tracks/mm
  – (the track-density arithmetic is sketched below)
• Remember global interconnects scale poorly when compared to transistors
(Figure: 9Cu+1Al interconnect process, Fujitsu 2007)
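The track densities quoted above follow directly from the wire pitch (width plus spacing). A minimal sketch of the arithmetic, using the 45nm-node figures from this slide:

```python
# Metal track density from wire pitch (pitch = width + spacing).
# Figures are the 45nm-node examples quoted above; real design rules vary.

def tracks_per_mm(width_nm: float, spacing_nm: float) -> float:
    pitch_nm = width_nm + spacing_nm   # one track occupies one pitch
    return 1_000_000 / pitch_nm        # 1 mm = 1,000,000 nm

print(tracks_per_mm(65, 65))     # M1 (local):  ~7692 tracks/mm (slide rounds to 7700)
print(tracks_per_mm(400, 400))   # M10 (global): 1250 tracks/mm
```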
Bus-based interconnects
• Bus-based interconnects
  – Central arbiter provides access to the bus
  – Logically the bus is simply viewed as a set of wires shared by all processors
Bus-based interconnects
• Real bus implementations are typically switch based (a simple model is sketched below)
  – Multiplexers and unidirectional interconnects with repeaters
  – Tri-states are rarely used now
  – The interconnect itself may be pipelined
• A bus-based CMP usually exploits multiple unidirectional buses
  – e.g. address bus, response bus and data bus
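To illustrate the "switch-based bus" point, the sketch below models a logical bus as a central arbiter plus a multiplexer that forwards the granted agent's data to all agents. The round-robin policy and the class names are illustrative assumptions, not a description of any particular implementation:

```python
# Illustrative model of a switch-based "bus": a central arbiter grants one
# requester per cycle and a multiplexer forwards that requester's data to
# every agent. Round-robin arbitration is an assumption for the example.

class MuxBus:
    def __init__(self, n_agents: int):
        self.n = n_agents
        self.last_grant = -1

    def arbitrate(self, requests: list) -> int:
        # Round-robin: search starting from the agent after the last grant.
        for i in range(1, self.n + 1):
            agent = (self.last_grant + i) % self.n
            if requests[agent]:
                self.last_grant = agent
                return agent
        return -1   # no requester this cycle

    def cycle(self, requests: list, data: list):
        granted = self.arbitrate(requests)
        if granted < 0:
            return None
        return data[granted]   # multiplexer output, broadcast to all agents

bus = MuxBus(4)
print(bus.cycle([False, True, True, False], ["a", "b", "c", "d"]))  # -> "b"
print(bus.cycle([False, True, True, False], ["a", "b", "c", "d"]))  # -> "c"
```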
Bus-based interconnects for multicore?
• Metal/wiring is cheap on-chip!
• Avoid complexity of packet-switched networks
• Keep cache-coherency simple
• Performance issues
  – Centralised arbitration
  – Low clock frequency (pipeline?)
  – Power?
  – Scalability?
(Figure: repeated bus as a global interconnect, Shekhar Borkar, OCIN'06)
Bus-based interconnects for multicore?
• Optimising bus-based solutions:
  – Arbitrate for the next cycle on the current clock cycle
  – Use wide, low-swing interconnects
  – Limit broadcasts to a subset of processors?
    • Segment the bus and filter redundant broadcasts to some segments by maintaining some knowledge of cache contents: so-called "filtered segmented buses"
  – Employ multiple buses
  – Move from electrical to on-chip optical solutions?
Filtered Segmented Bus
• Filter broadcasts to segments with a Bloom filter (sketched below)
• Energy savings possible vs. mesh and flattened-butterfly networks (for 16, 32 and 64 cores) because routers can be removed
• For large numbers of cores, multiple (address-interleaved) buses are required to avoid a significant performance penalty due to contention
"Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks", Udipi et al., HPCA 2010
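As a rough sketch of the filtering idea: each bus segment keeps a Bloom filter summarising the addresses that may be cached in that segment, and a broadcast is propagated only into segments whose filter hits. The hash functions and filter size below are placeholder assumptions, not the design used by Udipi et al.:

```python
# Sketch of per-segment broadcast filtering with a Bloom filter.
# Hash choice and filter size are illustrative assumptions only.
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1024, n_hashes: int = 3):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = [False] * size_bits

    def _indices(self, addr: int):
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(h[:4], "little") % self.size

    def insert(self, addr: int):
        for idx in self._indices(addr):
            self.bits[idx] = True

    def may_contain(self, addr: int) -> bool:
        # False positives are possible, false negatives are not.
        return all(self.bits[idx] for idx in self._indices(addr))

# One filter per bus segment; broadcast only into segments that may cache the line.
segments = [BloomFilter() for _ in range(4)]
segments[2].insert(0x80_0040)   # segment 2 caches this line

def segments_to_snoop(addr: int) -> list:
    return [i for i, f in enumerate(segments) if f.may_contain(addr)]

print(segments_to_snoop(0x80_0040))   # [2] (plus any false-positive segments)
```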
Bus-based interconnects for multicore?
• Exploiting multiple buses (or rings):
  – Multiple address-interleaved buses (a minimal interleaving sketch follows this slide)
    • e.g. Sun Wildfire/Starfire
  – Use different buses for different message types
  – Subspace snooping [Huh/Burger06]
    • Associate (dynamic) address ranges with each bus. Each subspace is a region of data shared by a stable subset of the processors.
    • This technique tackles snoop bandwidth limitations, as not all processors are required to snoop all buses
  – Exploit buses at the lowest level of a hierarchical network (e.g. a mesh interconnecting tiles, where each tile is a group of cores connected by a bus)
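Address interleaving across multiple buses is typically just a function of a few address bits. A minimal sketch, assuming four buses interleaved on 64-byte cache-line addresses (both figures are assumptions for illustration):

```python
# Select which of several address-interleaved buses carries a request.
# 64-byte cache lines and low-order line-address bits are assumed here.

N_BUSES = 4
LINE_BYTES = 64

def bus_for_address(addr: int) -> int:
    line = addr // LINE_BYTES   # cache-line address
    return line % N_BUSES       # interleave on the low line-address bits

# Consecutive cache lines are spread across the four address buses:
for a in range(0, 4 * LINE_BYTES, LINE_BYTES):
    print(hex(a), "->", bus_for_address(a))   # buses 0, 1, 2, 3
```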
Sun Starfire (UE10000)
• Up to 64-way SMP using a bus-based snooping protocol
• 4 processors + memory module per system board
• Uses 4 interleaved address buses to scale the snooping protocol
• Separate data transfer over a high-bandwidth 16x16 data crossbar
(Figure: system boards of processors and caches, board interconnect and memory modules; slide from Krste Asanovic, Berkeley)
Ring Networks
• k-node ring (or k-ary 1-cube)
• Exploit short point-to-point interconnects
• Can support many concurrent data transfers (routing on a ring is sketched below)
• Can keep the coherence protocol simple and avoid the need for directory-based schemes
  – We may still broadcast transactions
• Modest area requirements
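A message on a k-node ring is normally injected in whichever direction gives the shorter hop count. A minimal sketch of that decision (the tie-break towards clockwise is an arbitrary assumption):

```python
# Choose direction and hop count for a message on a k-node ring.
# Ties (distance exactly k/2) are broken towards clockwise as an assumption.

def ring_route(src: int, dst: int, k: int):
    cw = (dst - src) % k    # hops travelling clockwise
    ccw = (src - dst) % k   # hops travelling anticlockwise
    if cw <= ccw:
        return ("clockwise", cw)
    return ("anticlockwise", ccw)

print(ring_route(1, 6, 8))   # ('anticlockwise', 3)
print(ring_route(0, 3, 8))   # ('clockwise', 3)
```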
Ring Networks
• Control
  – May be distributed
    • Need to be a little careful to avoid the possibility of deadlock (more later!)
  – Or a centralised arbiter/scheduler may be used
    • e.g. IBM Cell BE and Larrabee both appear to use a centralised scheduler
    • Try to schedule as many concurrent (non-overlapping) transfers on each available ring as possible (a greedy sketch follows this slide)
• Trivial routers at each node
  – Simple routers are attractive as they don't introduce significant latency, power and area overheads
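One way to picture the centralised scheduler's job is greedy admission of transfers whose ring segments do not overlap. The sketch below assumes a single unidirectional ring and a fixed request order, which is a simplification rather than a description of what Cell's EIB arbiter actually does:

```python
# Sketch of a centralised ring scheduler: greedily admit transfers whose
# ring segments (arcs of links) do not overlap with already-admitted ones.
# A single unidirectional ring and greedy order are simplifying assumptions.

def arc(src: int, dst: int, k: int) -> set:
    """Links traversed travelling clockwise from src to dst on a k-node ring."""
    hops = (dst - src) % k
    return {(src + i) % k for i in range(hops)}   # link i connects node i -> i+1

def schedule(requests: list, k: int) -> list:
    used_links = set()
    admitted = []
    for src, dst in requests:
        links = arc(src, dst, k)
        if links and not (links & used_links):   # no shared links -> concurrent
            used_links |= links
            admitted.append((src, dst))
    return admitted

# Transfers 0->2 and 4->7 share no links and can proceed together; 1->5 overlaps.
print(schedule([(0, 2), (1, 5), (4, 7)], 8))   # [(0, 2), (4, 7)]
```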
Ring Networks: Examples
• IBM
  – Power4, Power5
• IBM/Sony/Toshiba
  – Cell BE (PS3, HDTV, Cell blades, ...)
• Intel
  – Larrabee (graphics), 8-core Xeon processor
• Kendall Square Research (1990s)
  – Massively parallel supercomputer design
  – Ring of rings (hierarchical or multi-ring) topology
    • Cluster = 32 nodes connected in a ring
    • Up to 34 clusters connected by a higher-level ring
Ring Networks: Example IBM Cell BE
• Cell Broadband Engine
  – Message-passing style (no $ coherence)
  – Element Interconnect Bus (EIB)
    • Two rings are provided in each direction
    • A crossbar solution was deemed too large
Ring Networks: Example Larrabee
• Cache coherent
• Bi-directional ring network, 512-bit wide links
  – Short linked rings proposed for >16 processors
  – Routing decisions are made before injecting messages
    • The clockwise ring delivers on even clock cycles and the anticlockwise ring on odd clock cycles
Crossbar Networks
• A crossbar switch is able to directly connect any input to any output without any intermediate stages
  – It is an example of a strictly non-blocking network
    • It can connect any input to any output, incrementally, without the need to rearrange any of the circuits currently set up.
  – The main limitation of a crossbar is its cost. Although very useful in small configurations, n x n crossbars can quickly become prohibitively expensive as their cost increases as n²
Crossbar Networks
• A 4x3 crossbar implemented using three 4:1 multiplexers (modelled in the sketch below)
• Each multiplexer selects a particular input to be connected to the corresponding output
(Dally/Towles book, Chapter 6)
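Following the multiplexer-based description above, the sketch below models the 4x3 crossbar as three 4:1 multiplexers, one per output; the select values are illustrative:

```python
# A 4x3 crossbar modelled as three 4:1 multiplexers (one per output).
# Each output's select signal picks which input is connected to it.

N_INPUTS, N_OUTPUTS = 4, 3

def crossbar(inputs: list, selects: list) -> list:
    assert len(inputs) == N_INPUTS and len(selects) == N_OUTPUTS
    # Output j is simply the input chosen by selects[j]: a 4:1 mux per output.
    return [inputs[sel] for sel in selects]

inputs = ["in0", "in1", "in2", "in3"]
print(crossbar(inputs, [2, 0, 2]))   # ['in2', 'in0', 'in2']
# One input may fan out to several outputs, but each output has exactly one
# input: the structure is strictly non-blocking, at O(n^2) crosspoint cost.
```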
Crossbar Networks: Example Niagara
• A crossbar switch interconnects 8 processors to a banked on-chip L2 cache
  – A crossbar is actually provided in each direction:
    • Forward and return
• Simple cache coherence protocol
  – See earlier seminar
(Figure reproduced from IEEE Micro, Mar'05)
Crossbar Networks: Example Cyclops
• IBM, US Dept. of Energy/Defense, academia
• Full system: 1M+ processors, 80 cores per chip
• Interconnect: centralised 96x96 buffered crossbar switch with a 7-stage pipeline