Reducing the Interconnection Network Cost of Chip Multiprocessors Pablo Abad , Valentín Puente and José Ángel Gregorio.
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 2 Outline • Motivation • Reactive traffic • End-to-end deadlock • Rotary solution • In order delivery • Evaluation • Conclusions
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 3 Motivation: NOCs for CMPs • CMP systems usually assume the presence of cache coherency mechanisms. • Cache coherence requirements for the communication subsystem: – Handle of reactive traffics (end-to-end deadlock). – In-order message delivery. • Solutions for these requirements should have a minimal impact on NoC technological boundaries.
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 4 Outline • Motivation • Reactive traffic • End-to-end deadlock • Rotary solution • In order delivery • Evaluation • Conclusions
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 5 Reactive Traffic Messages involved in a memory transaction depend one upon the other • Minimal 2 messages: CPU CPU CPU L1 L1 L1 – CPU-A requests a L2 L2 L2 cache line. – CPU-B L2 provides the block. CPU CPU CPU L1 L1 L1 MAIN L2 L2 L2 • Longer Dependencies: MEM – CPU-A requests a cache line. CPU CPU CPU L1 L1 L1 – The line is not in CPU-B L2, to memory. L2 L2 L2 – Memory provides the block. CHIP
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 6 Reactive Traffic This kind of communication can cause message-dependent deadlocks. Network Interface REP-OUT REP-IN REQ-OUT REQ-IN REP-OUT REP-IN REQ-OUT REQ-IN Crossbar Router B Router A 1 Router A and Router B flood the network with REQUEST messages 2 REQUEST messages are only attended if a REPLY can be generated The hole leaved by an attended REQUEST is occupied by another REQUEST 3 DEADLOCK: No more REQUESTS can be attended and REPLIES cannot reach destination.
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 7 Reactive Traffic A widely utilized solution to avoid this problem is buffer replication. REQ and REP travel through different buffering resources (virtual networks). REP-OUT REP-IN REQ-OUT REQ-IN REP-OUT REP-IN REQ-OUT REQ-IN Crossbar Router
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 8 Reactive Traffic Path replication solves end-to-end deadlock problem, but can seriously affect other relevant design aspects, such as area, complexity, power. Consumer Injector Crossbar Buffer Rtg. & Arb. 1 Message Type 4 Message Types Alpha 21364 router: 7 message types
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 9 Previous work: The Rotary Router [REF] P. Abad, V. Puente, P. Prieto, J.A. Gregorio, “Rotary Router: An Efficient Architecture for CMP Interconnection Networks”, International Symposium on Computer Architecture (ISCA), 2007.
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 10 Rotary Router Sketch Input Stage Packet Pre-Routing. Ring Selection. N Injector Free? Buffering Segment Stage No! Consumer Packet movement. Output arbitration. Buffer Output Stage E W Flow control. Packets storage. S
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 11 Rotary Router Advantages • Head of Line Blocking Avoidance. • Improved Buffering utilization. • Adaptive routing without virtual channels. • Centralized structures avoidance (Xbar, Arbiter). • Topology agnostic Deadlock avoidance Mechanism. 120 Normalized EDP (%) 100 80 60 40 20 Classic Rotary 0 LU JAVA FT HTTP
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 12 Reactive Traffic Continuous movement of packets inside the router rings allows the Rotary Router to implement a solution to end-to-end Deadlock without requiring path replication. COMMUNICATION PATTERN M-1 M-2 M-3 M-4 Lim M-3 Lim M-2 Lim M-1 Empty
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 13 Reactive Traffic Continuous movement of packets inside the router rings allows the Rotary Router to implement a solution to end-to-end Deadlock without requiring path replication. COMMUNICATION PATTERN M-1 Lim M-3 Lim M-3 M-2 Lim M-3 Lim M-3 Lim M-3 M-3 Lim M-2 Lim M-2 Lim M-2 Lim M-2 M-4 Lim M-2 Lim M-1 Lim M-1 Lim M-1 Lim M-1 Lim M-1 Empty Empty Empty Empty Inject Transit
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 14 Outline • Motivation • Reactive traffic • End-to-end deadlock • Rotary solution • In order delivery • Evaluation • Conclusions
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 15 In-Order Delivery • This requirement is imposed by some memory coherence protocols (v.gr. Token coherence protocol) or maintenance tasks. • In these cases, only specific transactions need to be ordered (v.gr. Persistent request deactivation) • Ordered messages represent only a small portion of total network traffic ( ~5% of total traffic).
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 16 In-Order Delivery Fulfilling this requirement is extremely simple for input buffered routers. It becomes a challenge for the Rotary Router: -Adaptive routing allows inter-router packet overtaking. -Internal router rings allow intra-router overtaking. N 2 Injector 1 Consumer Buffer 2 1 E W S
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 17 In-Order Delivery Inter-router overtaking is avoided through specific Routing decisions for in-order messages: -wraparound links will be avoided (Mesh) -Adaptive routing will not be allowed (DOR).
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 18 In-Order Delivery Intra-router overtaking needs a special mechanisms to be avoided. 0 IN IN IN OUT OUT 1 2 0 1 0 1 0 1
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 19 Outline • Motivation • Reactive traffic • End-to-end deadlock • Rotary solution • In order delivery • Evaluation • Conclusions
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 20 Performance Evaluation • Compared to three different routers Adaptive Bubble Router Deterministic Router Low Latency Router Latch DOR DOR DOR Adaptive
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 21 Performance Evaluation • Synthetic Traffic Patterns • Real Workloads – GEMS + SICOSYS. 16 4GB, 260 cycles, 320 GB/s Number of cores Main Memory Private, 32KB, 2-way, 64Bytes 16 bytes L1 I/D cache Command size block, 1-cycle SNUCA, 16x16 banks, 4 per router 8x8 Torus L2 cache Network Topology 128KB, 16-way, 3-cycles, Pseudo 128 bits / 1 cycle latency L2 cache bank Network Link LRU, 64 Bytes block
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 22 Performance Evaluation • Synthetic Traffic Patterns – 5 message types. – 32.000 messages of each type delivered. – Low-lat topology: Mesh. 500 ROTARY BADA BDOR LOW-LAT 400 300 200 100 0 RAND BIT-REV. MAT-TR PERM
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 23 Performance Evaluation • Real Workloads – Transactional & Scientific applications. ROTARY BADA 200 BDOR LOW-LAT 150 100 50 0 IS LU FT OLTP Java HTTP1 HTTP2
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 24 Outline • Motivation • Reactive traffic • End-to-end deadlock • Rotary solution • In order delivery • Evaluation • Conclusions
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 25 Conclusions • The Rotary Router has been the base to implement a mechanism able to deal with end- to-end deadlocks. • This mechanism does not require path replication. • We solve in-order delivery with a simple method which requires few extra hardware. • Flexible buffer utilization allows our router to obtain better performance results.
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 26 Questions?
Recommend
More recommend