Fall 2015 :: CSE 610 – Parallel Computer Architectures

Tori and Meshes
• n-cubes can have different radices in different dimensions
  – Example: 2 in Y, 3 in Z, and 4 in X (a 2,3,4-ary 3-cube)
• Very regular: can construct an (n+1)-dim cube by taking k n-dim cubes, arranging them in an array, and connecting the corresponding nodes of neighbors
[Figure: k k-ary n-cubes joined by k^n channels to form a k-ary (n+1)-cube]
Tori and Meshes
• Famous topologies in this family
  – Ring: k-ary 1-cube
  – 2D and 3D grids
  – Hypercube: 2-ary (binary) n-cube
• 1D or 2D map well to a planar substrate for on-chip networks
• 3D is easy to build in 3D space (e.g., a supercomputer)
• Tori are edge symmetric
  ✓ Good for load balancing
• Removing the wrap-around links for a mesh loses edge symmetry
  – More traffic concentrated on center channels
• Good path diversity
• Exploit locality for near-neighbor traffic
  – Important for many scientific computations
Tree
• Diameter and average distance logarithmic
  – k-ary tree, height = log_k N
  – Address specified as a d-vector of radix-k coordinates describing the path down from the root
• Route up to the common ancestor, then down
• Bisection BW?
Fat Tree
• Bandwidth remains constant at each level
  – Bisection BW scales with the number of terminals
  – Unlike a tree, in which bandwidth decreases closer to the root
• Fat links can be implemented by increasing the channel BW (uncommon) or the number of channels (more common)
Butterfly (1/3)
• Indirect network
• k-ary n-fly: k^n terminals
  – k: input/output degree of each switch
  – n: number of stages
  – Each stage has k^(n-1) k-by-k switches
• Example: routing from 000 to 010
  – Dest address used to directly route the packet
  – j-th bit used to select the output port at stage j
[Figure: a 2-ary 3-fly — 2-port switches, 3 stages, terminals 0–7, switches 00–23]
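The destination-tag routing above is simple enough to sketch in code. A minimal sketch (the function name and digit ordering are assumptions, not from the slides):

```python
def butterfly_route(dest, k, n):
    """Destination-tag routing in a k-ary n-fly: the j-th radix-k digit
    of the destination (most significant first) selects the output port
    of the switch at stage j, regardless of the source."""
    ports = []
    for j in range(n):
        digit = (dest // k ** (n - 1 - j)) % k  # j-th digit of dest
        ports.append(digit)
    return ports

# Slide's example in the 2-ary 3-fly: routing to 010 (terminal 2)
# takes port 0, then port 1, then port 0 at the three stages.
```

Note the source never appears in the computation — which is exactly why the hop count is the same regardless of location.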
Butterfly (2/3)
• No path diversity: |R_xy| = 1
• Can add extra stages for diversity
  – Increases network diameter
[Figure: a 2-ary 3-fly with an extra stage (switches x0–x3) added for path diversity]
Butterfly (3/3)
• Hop count = log_k N + 1
• Does not exploit locality
  – Hop count is the same regardless of location
• Switch degree = 2k
• Requires long wires to implement
Clos Network (1/2)
• 3-stage Clos
  – Input switches
  – Output switches
  – Middle switches
• Parameters
  – m: # of middle switches
  – n: in/out degree of edge switches
  – r: # of input/output switches
[Figure: a 3-stage Clos network with m = 5, n = 3, r = 4]
Clos Network (2/2)
• Provides path diversity
  – |R_xy| = m (number of middle switches)
  – One path through every middle switch
• Can increase # of stages (and diversity) by replacing the middle stage with another Clos network
[Figure: a (2,2,2) Clos network]
Folding Clos Networks
• Can fold the network along the middle stage to share input/output switches
• The folded network is a fat tree
  – Alternative implementation with more links instead of high-BW links
And Other Topologies…
• Many other topologies with different properties discussed in the literature
  – Omega networks
  – Benes networks
  – Bitonic networks
  – Flattened butterfly
  – Dragonfly
  – Cube-connected cycles
  – HyperX
  – …
• However, these are typically special purpose and not used in general-purpose hardware
Irregular Topologies
• Common in MPSoC (Multiprocessor System-on-Chip) designs
• MPSoC design leverages a wide variety of IP blocks
  – Regular topologies may not be appropriate given the heterogeneity
  – Customized topologies often are more power efficient and deliver better performance
• Customize based on traffic characterization
  – Often synthesized using automatic tools
Irregular Topology Example
[Figure: an MPEG-4 decoder MPSoC with a customized NoC — IP blocks (VLD, run-length decoder, inverse scan, AC/DC predict, iDCT, iQuant, VOP reconstruction, up-sampling, padding, stripe memory, VOP memory, ARM core) connected via routers]
Flow Control
Flow Control Overview
• Flow control: determines the allocation of resources to messages as they traverse the network
  – Buffers and links
  – Significant impact on throughput and latency of the network
• Flow control units:
  – Message: composed of one or more packets
    • If message size ≤ maximum packet size, only one packet is created
  – Packet: composed of one or more flits
  – Flit: flow control digit
  – Phit: physical digit
    • Subdivides a flit into chunks equal to the link width
Flow Control Overview
[Figure: protocol view — a message is divided into packets, each carrying a route, sequence number, and payload; flow-control view — a packet is divided into a head flit, body flits, and a tail flit, each carrying a flit type (head/body/tail) and VC ID, and subdivided into phits]
• Packet contains destination/route information
  – Flits may not, so all flits of a packet must take the same route
Switching
• Different flow control techniques based on granularity
  – Message-based: allocation made at message granularity (circuit switching)
  – Packet-based: allocation made to whole packets
  – Flit-based: allocation made on a flit-by-flit basis
Message-Based Flow Control
• Coarsest granularity
• Circuit switching
  – Pre-allocates resources across multiple hops
    • Source to destination
    • Resources = links (buffers not necessary)
  – Probe sent into the network to reserve resources
  – Message does not need per-hop routing or allocation once the probe sets up the circuit
• Good for transferring large amounts of data
• No other message can use the resources until the transfer is complete
  – Throughput can suffer due to setup and hold time for circuits
  – Links are idle until setup is complete
Time-Space Diagram: Circuit Switching
[Figure: time-space diagram of a circuit from node 0 to node 8 — the setup flit (S) propagates hop by hop, the ack (A) returns, data flits (D) then stream over the reserved circuit, and the tail (T) tears it down; a second setup from node 2 to node 8 is blocked until the circuit is released]
Packet-Based Flow Control
• Break messages into packets
• Interleave packets on links
  – Better utilization
• Requires per-node buffering to store in-flight packets
• Two types of packet-based techniques
  – Store & forward
  – Virtual cut-through
Store & Forward (S&F)
• Links and buffers are allocated to the entire packet
• Head flit waits at the router until the entire packet is received (store) before being forwarded to the next hop (forward)
• Not suitable for on-chip
  – Requires buffering at each router to hold the entire packet
    • Packet cannot traverse a link until buffering is allocated for the entire packet
  – Incurs high per-hop latency (pays serialization latency at each hop)
Time-Space Diagram: S&F
[Figure: a 5-flit packet (H B B B T) traverses routers 0, 1, 2, 5, 8; each hop waits to receive the whole packet before forwarding, so serialization latency is paid at every hop (~25 cycles total)]
Virtual Cut-Through (VCT)
• Links and buffers are allocated to entire packets
• Flits can proceed to the next hop before the tail flit has been received by the current router
  – Only if the next router has enough buffer space for the entire packet
• Reduces latency significantly compared to store & forward
• Still requires large buffers
  – Unsuitable for on-chip
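The latency difference between the two schemes follows from where serialization is paid. A back-of-the-envelope sketch (unit flit time, router pipeline delay folded into t_router; all parameter names are assumptions):

```python
def sf_latency(hops, t_router, flits):
    # Store & forward: the full packet is serialized at every hop.
    return hops * (t_router + flits)

def vct_latency(hops, t_router, flits):
    # Virtual cut-through: serialization is paid only once; the head
    # flit cuts through each router as soon as buffers are reserved.
    return hops * t_router + flits
```

For the 5-flit packet and 4 hops of the time-space diagrams, store & forward costs roughly 4 × (1 + 5) = 24 cycles while cut-through costs 4 × 1 + 5 = 9, matching the qualitative gap the diagrams show.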
Time-Space Diagram: VCT
[Figure: the same 5-flit packet pipelines through routers 0, 1, 2, 5, 8; each flit advances one hop per cycle, so serialization latency is paid only once (~9 cycles total)]
Time-Space Diagram: VCT (2)
[Figure: with only 2 flit buffers available at the next router, the packet cannot proceed — VCT requires buffer space for the entire packet, so the flits wait until sufficient buffers free up]
Flit-Level Flow Control
• A flit can proceed to the next router when there is buffer space available for that flit
  – Improves over S&F and VCT by allocating buffers on a flit-by-flit basis
  – Helps routers meet tight area/power constraints
• Called wormhole flow control
  ✓ More efficient buffer utilization (good for on-chip)
  ✓ Low latency
• Poor link utilization: if the head flit becomes blocked, all links spanning the length of the packet are idle
  – Cannot be re-allocated to a different packet
  – Suffers from head-of-line (HOL) blocking
Wormhole Example
[Figure: 6-flit buffers per input port; two 4-flit packets, red and blue. Red is blocked by other packets, the buffer fills, and blue cannot proceed behind it; red still holds its output channel, which stays idle until red advances]
Time-Space Diagram: Wormhole
[Figure: the packet pipelines through routers 0, 1, 2, 5, 8 as in VCT, but contention at router 2 stalls the flits in place across multiple routers until the output is granted]
Virtual Channel Flow Control
• Virtual channels: multiple flit queues per input port
  – Share the same physical link (channel)
• Used to combat HOL blocking in wormhole
  – Flits on different VCs can pass a blocked packet
  – Link utilization improved
• VCs were first proposed for deadlock avoidance
  – We'll come back to this
• Can be applied to any flow control technique
  – First proposed with wormhole
VC Flow Control – Example 1
[Figure: packets A (AH A1–A5 AT) and B (BH B1–B5 BT) arrive on two input VCs and are interleaved flit-by-flit onto the shared output link (AH BH A1 B1 … AT BT), then separated again at the outputs]
VC Flow Control – Example 2
[Figure: the wormhole example revisited with VCs — 6-flit buffers per input port, 3 flit buffers per VC; the blocked packet no longer prevents other packets from using the physical channel]
Summary of Techniques

  Technique           | Links    | Buffers           | Comments
  --------------------|----------|-------------------|--------------------------------------
  Circuit switching   | Messages | N/A (buffer-less) | Setup & ack
  Store & forward     | Packet   | Packet            | Head flit waits for tail
  Virtual cut-through | Packet   | Packet            | Head can proceed
  Wormhole            | Packet   | Flit              | HOL blocking
  Virtual channel     | Flit     | Flit              | Interleave flits of different packets
Buffer Backpressure
Buffer Backpressure
• Need a mechanism to prevent buffer overflow
  – Avoid dropping packets
  – Upstream routers need to know buffer availability at downstream routers
• Significant impact on the throughput achieved by flow control
• Two common mechanisms
  – Credits
  – On-off
Credit-Based Flow Control
• Upstream router stores a credit count for each downstream VC
• Upstream router
  – When a flit is forwarded: decrement the credit count
  – Count == 0 → buffer full → stop sending
• Downstream router
  – When a flit is forwarded and a buffer freed: send a credit to the upstream router
  – Upstream increments its credit count
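The credit bookkeeping above can be sketched as a tiny state machine on the upstream side (class and method names are hypothetical):

```python
class CreditCounter:
    """Upstream router's view of one downstream VC's buffer."""

    def __init__(self, buffer_depth):
        self.credits = buffer_depth      # one credit per free flit slot

    def can_send(self):
        return self.credits > 0          # count == 0 means buffer full

    def send_flit(self):
        assert self.can_send(), "downstream buffer full: stall"
        self.credits -= 1                # decrement on every flit sent

    def receive_credit(self):
        self.credits += 1                # downstream freed a buffer slot
```

The downstream router's only duty is to return one credit per flit it forwards; the counters on both ends then always agree up to flits and credits still in flight.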
Credit Timeline
[Figure: node 1 and node 2 exchange a flit and its credit — the flit departs at t1, the credit is sent back and processed (t2–t4), and the next flit can reuse the buffer at t5; this loop is the credit round-trip delay]
• Round-trip credit delay:
  – Time between when a buffer empties and when the next flit can be processed from that buffer entry
• A single-entry buffer would result in significant throughput degradation
  – Important to size buffers to tolerate the credit turnaround
Buffer Sizing
• Prevent backpressure from limiting throughput
  – Buffers must hold a # of flits ≥ turnaround time
• Assume:
  – 1-cycle propagation delay for data and credits
  – 1-cycle credit processing delay
  – 3-cycle router pipeline
• At least 6 flit buffers needed
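The 6-flit minimum is just the sum of the delays around the credit loop; a sketch of that arithmetic (the function name is an assumption):

```python
def min_buffers_per_vc(flit_pipeline, credit_pipeline, wire_delay):
    """Buffers needed to cover the credit turnaround: credit propagation
    + credit processing + flit propagation + flit pipeline."""
    return wire_delay + credit_pipeline + wire_delay + flit_pipeline

# Slide's assumptions: 1-cycle wires, 1-cycle credit processing,
# 3-cycle router pipeline -> 1 + 1 + 1 + 3 = 6 flit buffers.
```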
Actual Buffer Usage & Turnaround Delay
[Figure: turnaround timeline totaling 6 cycles — flit leaves node 1 and a credit is sent to node 0 (1-cycle credit propagation); node 0 processes the credit and reallocates the freed buffer to a new flit (1-cycle credit pipeline); the new flit leaves node 0 for node 1 (1-cycle flit propagation); the new flit arrives at node 1 and reuses the buffer after the 3-cycle flit pipeline]
On-Off Flow Control
• Credits require upstream signaling for every flit
• On-off: decreases upstream signaling
  – Off signal: sent when the number of free buffers falls below threshold F_off
  – On signal: sent when the number of free buffers rises above threshold F_on
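A minimal sketch of the downstream side's on/off signaling (class, method, and threshold names are assumptions; in a real design F_off and F_on must also cover the round-trip wire delay, as the timeline slide shows):

```python
class OnOffReceiver:
    """Downstream buffer that raises off/on signals at thresholds."""

    def __init__(self, depth, f_off, f_on):
        assert 0 < f_off < f_on <= depth
        self.free = depth
        self.f_off, self.f_on = f_off, f_on
        self.upstream_on = True          # upstream allowed to send

    def accept_flit(self):
        self.free -= 1
        if self.free < self.f_off:       # fell below F_off
            self.upstream_on = False     # send "off" upstream

    def drain_flit(self):
        self.free += 1
        if self.free > self.f_on:        # rose above F_on
            self.upstream_on = True      # send "on" upstream
```

Only threshold crossings generate upstream signals, which is the whole point: far fewer wires toggling than one credit per flit.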
On-Off Timeline
[Figure: node 2's free-buffer count crosses F_off at t1 and an off signal is sent; F_off is set so flits already in flight (arriving before t4) do not overflow the buffer. Draining later crosses F_on at t5 and an on signal is sent; F_on is set so node 2 does not run out of flits between t5 and t8]
• Less signaling but more buffering
  – On-chip, buffers are more expensive than wires
Flow Control Summary
• On-chip networks require techniques with lower buffering requirements
  – Wormhole or virtual channel flow control
• Avoid dropping packets in the on-chip environment
  – Requires a buffer backpressure mechanism
• Complexity of flow control impacts router microarchitecture
Routing
Routing Overview
• Discussion of topologies assumed ideal routing
• In practice…
  – Routing algorithms are not ideal
• Goal: distribute traffic evenly among paths
  – Avoid hot spots and contention
  – The more balanced the load, the closer throughput is to ideal
• Keep complexity in mind
  – Routing delay can become significant with complex routing mechanisms
Classifications of Routing Algorithms
• Adaptivity: does it take network state (e.g., congestion) into account?
  – Oblivious
    • Deterministic vs. non-deterministic
  – Adaptive
• Hop count: are all allowed routes minimal?
  – Minimal
  – Non-minimal
• Routing decision: where is it made?
  – Source routing
  – Per-hop routing
• Implementation
  – Table
  – Circuit
Routing Deadlock
[Figure: four packets A, B, C, D, each holding one link while waiting for the next, forming a cycle]
• Each packet is occupying a link and waiting for a link
• Without routing restrictions, a resource cycle can occur
  – Leads to deadlock
• Two general ways to avoid it
  – Deadlock-free routing: limit the set of turns the routing algorithm allows
  – Deadlock-free flow control: use virtual channels wisely
    • E.g., use escape VCs
Dimension-Order Routing
[Figure: allowed turns in X-Y routing vs. Y-X routing]
• Traverse the network dimension by dimension
  – X-Y routing: can only turn to the Y dimension after finishing X
  – Y-X routing: can only turn to the X dimension after finishing Y
• Deterministic and minimal
  – Being deterministic implies being oblivious, but DOR is not often called that (the term oblivious is reserved for non-deterministic routing)
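X-Y routing is easy to write down concretely. A sketch on a 2D mesh with (x, y) node coordinates (function and variable names are assumptions):

```python
def xy_route(src, dst):
    """Dimension-order routing: finish the X dimension, then Y.
    Returns the sequence of nodes visited after src."""
    x, y = src
    dx, dy = dst
    path = []
    while x != dx:                       # X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                       # then Y; never turn back to X
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# Minimal by construction: path length equals the Manhattan distance.
```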
Valiant's Oblivious Routing Algorithm
[Figure: routing from s to d via a randomly chosen intermediate node d′]
• An oblivious algorithm
• To route from s to d
  – Randomly choose an intermediate node d′
  – Route from s to d′ and from d′ to d
• Randomizes any traffic pattern
  – All patterns appear uniform random
  – Balances network load
• Non-minimal
• Destroys locality
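The two-phase idea can be sketched on a 2D mesh, using X-Y routing within each phase (all names here are assumptions, not from the slides):

```python
import random

def xy_route(src, dst):
    # Dimension-order helper: X first, then Y.
    (x, y), (dx, dy) = src, dst
    path = []
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def valiant_route(src, dst, width, height, rng=random):
    """Phase 1: route to a uniformly random intermediate node d'.
    Phase 2: route from d' on to the destination."""
    mid = (rng.randrange(width), rng.randrange(height))
    return xy_route(src, mid) + xy_route(mid, dst)
```

Because d′ is drawn uniformly, any traffic pattern looks uniform random to the network, at the cost of a longer (non-minimal) path.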
Minimal Oblivious
[Figure: routing from s to d via an intermediate d′ restricted to the minimal quadrant]
• Valiant's: load balancing, but a significant increase in hop count
• Minimal oblivious: some load balancing, but use shortest paths
  – d′ must lie within the minimal quadrant
  – Example: 6 options for d′, but only 3 different paths
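Restricting d′ to the minimal quadrant is a one-line change: sample it from the rectangle spanned by source and destination, so both routing phases move only in productive directions (a sketch; names are assumptions):

```python
import random

def minimal_oblivious_intermediate(src, dst, rng=random):
    """Choose d' inside the minimal quadrant: the rectangle spanned by
    src and dst. Two X-Y phases through such a d' stay monotone in
    each dimension, so the total path length remains minimal."""
    (sx, sy), (dx, dy) = src, dst
    ix = rng.randint(min(sx, dx), max(sx, dx))
    iy = rng.randint(min(sy, dy), max(sy, dy))
    return (ix, iy)
```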
Oblivious Routing
• Valiant's and minimal oblivious
  – Deadlock free when used in conjunction with X-Y routing
• What if we randomly choose between X-Y and Y-X routes?
  – Oblivious but not deadlock free!
  – How to make it deadlock free? Need 2 virtual channels
• Either version can be generalized to more than two phases
  – Choose more than one intermediate point
Adaptive
• Exploits path diversity
• Uses network state to make routing decisions
  – Buffer occupancies often used
  – Relies on flow control mechanisms, especially backpressure
• Local information readily available
  – Global information more costly to obtain
  – Network state can change rapidly
    • Use of local information can lead to non-optimal choices
• Can be minimal or non-minimal
Minimal Adaptive Routing
[Figure: routing from s to d, choosing among minimal outputs based on local congestion]
• Local info can result in sub-optimal choices
Non-Minimal Adaptive
• Fully adaptive
• Not restricted to taking the shortest path
• Misrouting: directing a packet along a non-productive channel
  – Priority given to productive outputs
  – Some algorithms forbid U-turns
• Livelock potential: traversing the network without ever reaching the destination
  – Need a mechanism to guarantee forward progress
    • E.g., limit the number of misroutings
Non-Minimal Routing Example
[Figure: left — a longer path from s to d with potentially lower latency; right — livelock: the packet keeps routing in a cycle and never reaches d]
Adaptive Routing Example
[Figure: an 8-node ring, nodes 0–7]
• Should 3 route clockwise or counterclockwise to 7?
  – 5 is using all the capacity of link 5→6
• The queue at node 5 will sense the contention, but the one at node 3 will not
• Backpressure allows nodes to indirectly sense congestion
  – When the queue in one node fills up, it stops receiving flits
  – The previous queue then fills up, and so on
• If each queue holds 4 packets
  – 3 will send 8 packets before sensing the congestion
Adaptive Routing: Turn Model
• Successful adaptive routing requires path diversity
• Removing too many turns limits flexibility in routing
  – E.g., DOR eliminates 4 turns
    • N to E, N to W, S to E, S to W
• Question: how to ensure deadlock freedom while removing a minimum set of turns?
• Examples of valid turn models: north-last, west-first, negative-first
Turn Model Routing Deadlock
• What about eliminating the NW and WN turns?
• Not a valid turn elimination
  – A resource cycle results
→ Not all 2-removals result in valid turn models
Deadlock Avoidance Using VCs
• Deadlock-free flow control can guarantee deadlock freedom while allowing more flexible routing
  – VCs can break the resource cycle even if the routing is not deadlock free
• Each VC is time-multiplexed onto the physical link
  – Holding a VC = holding the VC's buffer queue, not the physical link
• We'll consider two options:
  – VC ordering
  – Escape VCs
• Here, we are using VCs to deal with routing deadlocks. Using separate VCs for different message types (e.g., requests and responses in coherence protocols) to avoid protocol-level deadlocks is a different story.
Option 1: VC Ordering
[Figure: a ring carrying packets A, B, C, D; each physical channel has two VCs (0 and 1), and a dateline is drawn across one link]
• All messages are sent on VC 0 until they cross the dateline
• After the dateline, they are assigned to VC 1
  – Cannot be allocated to VC 0 again
Option 2: Escape VCs
• Enforcing an order lowers VC utilization
  – Previous example: VC 1 underutilized
• Escape VCs
  – Have one VC that uses deadlock-free routing
  – Example: VC 0 uses DOR, other VCs use an arbitrary routing function
  – Access to VCs arbitrated fairly: a packet always has a chance of landing on the escape VC
Routing Algorithm Implementation
• Source tables
  – Entire route specified at the source
  – Avoids per-hop routing latency
  – Unable to adapt dynamically to network conditions
  – Support reconfiguration (not specific to topology)
  – Can specify multiple possible routes per destination
    • Select randomly or adaptively
• Node tables
  – Store only the next direction at each node
  – Smaller tables than source routing
  – Adds per-hop routing latency
  – Can specify multiple possible output ports per destination
• Combinational circuits
  – Simple (e.g., DOR): low router overhead
  – Specific to one topology and one routing algorithm
    • Limits fault tolerance
Router Microarchitecture
Router Microarchitecture Overview
• Focus on the microarchitecture of a virtual channel router
• Router complexity increases with bandwidth demands
  – Simple routers built when high throughput is not needed
    • Wormhole flow control, no virtual channels, DOR routing, unpipelined, …
Virtual Channel Router
[Figure: a 5-port VC router — per-input buffers organized as VCs 1–4, route computation logic, VC allocator, switch allocator, a crossbar switch connecting inputs 1–5 to outputs 1–5, and credit signals in/out]
Router Components
• Input buffers, route computation logic, virtual channel allocator, switch allocator, crossbar switch
• Most NoC routers are input-buffered
  – Allows using single-ported memories
• Buffers store flits for their duration in the router
Baseline Router Pipeline
BW → RC → VA → SA → ST → LT
• Canonical logical router pipeline
  – Fit into physical stages based on target frequency and stage delays
  – BW (Buffer Write): decode input VC and write to buffer
  – RC (Route Computation): determine output port
  – VA (VC Allocation): determine the VC to use on the output port
  – SA (Switch Allocation): arbitrate for crossbar input and output ports
  – ST (Switch Traversal): once granted the output port, traverse the switch
  – LT (Link Traversal): bon voyage!
Baseline Router Pipeline (2)
[Figure: pipeline diagram over cycles 1–9 — head flit: BW RC VA SA ST LT; body 1, body 2, and tail: BW SA ST LT, each starting one cycle behind the previous flit]
• Head flit goes through all 6 stages
• Body and tail flits skip RC and VA
  – Route computation and VC allocation are done only once per packet
  – Body and tail flits inherit this info from the head flit
• Tail flit de-allocates the VC
Modules and Dependencies in a Router
• Wormhole router: decode + routing → switch arbitration → crossbar traversal
• Virtual channel router: decode + routing → VC allocation → switch arbitration → crossbar traversal
• Speculative VC router: decode + routing → VC allocation and speculative switch arbitration in parallel → crossbar traversal
• Dependences between the output of one module and the input of another
  – Determine the critical path through the router
  – E.g., cannot bid for a switch port until routing is performed
Router Pipeline Performance
• Baseline (zero-load) delay: hops × (t_router + t_link) + t_serialization, with 5-cycle routers in the example
• Incurs routing delay, adding to message delay
  – Ideally, only pay link delay
• Also increases buffer turnaround time
  – Necessitates more buffers
• Affects clock cycle time
→ Techniques to reduce pipeline stages
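That zero-load delay can be sketched as a function of the router pipeline depth and link delay (parameter names and the exact accounting of serialization are assumptions):

```python
def zero_load_latency(hops, router_cycles, link_cycles, packet_flits):
    """Head flit pays the router pipeline and link traversal at every
    hop; the rest of the packet then serializes behind it."""
    return hops * (router_cycles + link_cycles) + packet_flits
```

With the slide's 5-cycle routers, a 4-flit packet over 3 hops pays 3 × (5 + 1) + 4 = 22 cycles; pipeline-reduction techniques attack the router_cycles term.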
Optimizations: Lookahead Routing
• At the current router, perform the routing computation for the next router
  – Overlap with buffer write (BW)
  – Precomputing the route allows flits to compete for VCs immediately after BW
• Pipeline: head flit — BW/RC | VA | SA | ST | LT; body/tail — BW | SA | ST | LT
Pipeline Optimizations: Speculation
• Assume VC allocation will be successful
  – Valid under low to moderate loads
• Do VA and SA in parallel
• If VA is unsuccessful (no virtual channel returned)
  – Must repeat VA/SA in the next cycle
• Prioritize non-speculative requests
  – Body/tail flits already have VC info, so they are not speculative
• Pipeline: BW/RC | VA+SA | ST | LT
Pipeline Optimizations: Bypassing
• When there are no flits in the input buffer
  – Speculatively enter ST
  – On a port conflict, speculation is aborted
  – In the first stage (setup):
    • Do lookahead routing → just decode the head flit
    • Do SA and VA in parallel
    • Skip BW: do not write to the buffer unless the speculation fails
• Pipeline: Setup | ST | LT
Pipeline Bypassing
[Figure: flit A arrives at a router with no buffered flits — (1a) lookahead routing computation and (1b) VC allocation proceed in parallel, then (2) A traverses the crossbar directly without being buffered]
Speculation
[Figure: flits A and B contend — (1) both arrive and do lookahead routing; A succeeds in VC allocation (2a) but the switch allocator (2b) detects a port conflict (3), so A fails in SA and retries SA in the next cycle (4)]
Buffer Organization
[Figure: a single queue per physical channel vs. multiple virtual-channel queues sharing one physical channel]
• Single buffer per input, or multiple fixed-length queues per physical channel
Buffer Organization
[Figure: VC 0 and VC 1 as variable-length queues with head/tail pointers sharing one large buffer]
• Multiple variable-length queues
  – Multiple VCs share a large buffer
  – Each VC must have a minimum of 1 flit buffer
    • Prevents deadlock
  – More complex circuitry
Buffer Organization
• Many shallow VCs or few deep VCs?
• More VCs ease HOL blocking
  – But need a more complex VC allocator
• Light traffic
  – Many shallow VCs → underutilized
• Heavy traffic
  – Few deep VCs → less efficient; packets blocked due to lack of VCs
Crossbar Organization
• Heart of the data path
  – Switches bits from input to output
• High-frequency crossbar designs are challenging
• Crossbar composed of many multiplexers
  – Common in low-frequency router designs
[Figure: a 5×5 multiplexer-based crossbar — inputs i00–i40, select signals sel0–sel4, outputs o0–o4]
Crossbar Organization: Crosspoint
[Figure: a crosspoint crossbar with w columns per input port (Inject, N, S, E, W) and w rows per output port (Eject, N, S, E, W)]
• Area and power scale as O((pw)^2)
  – p: number of ports (a function of topology)
  – w: port width in bits (determines phit/flit size; impacts packet energy and delay)
Crossbar Speedup
[Figure: 10:5, 5:10, and 10:10 crossbars — input speedup, output speedup, or both]
• Increase internal switch bandwidth
• Simplifies allocation, or gives better performance with a simple allocator
  – More inputs to select from → higher probability each output port is matched (used) each cycle
• Output speedup requires output buffers
  – Multiplexed onto the physical link
Arbiters and Allocators
• Allocator: matches N requests to M resources
• Arbiter: matches N requests to 1 resource
• Resources
  – VCs (for virtual channel routers)
  – Crossbar switch ports