reducing the interconnection
play

Reducing the Interconnection Network Cost of Chip Multiprocessors - PowerPoint PPT Presentation

Reducing the Interconnection Network Cost of Chip Multiprocessors Pablo Abad , Valentn Puente and Jos ngel Gregorio. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 2 Outline Motivation Reactive traffic


  1. Reducing the Interconnection Network Cost of Chip Multiprocessors Pablo Abad , Valentín Puente and José Ángel Gregorio.

  2. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 2 Outline • Motivation • Reactive traffic • End-to-end deadlock • Rotary solution • In order delivery • Evaluation • Conclusions

  3. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 3 Motivation: NOCs for CMPs • CMP systems usually assume the presence of cache coherency mechanisms. • Cache coherence requirements for the communication subsystem: – Handle of reactive traffics (end-to-end deadlock). – In-order message delivery. • Solutions for these requirements should have a minimal impact on NoC technological boundaries.

  4. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 4 Outline • Motivation • Reactive traffic • End-to-end deadlock • Rotary solution • In order delivery • Evaluation • Conclusions

  5. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 5 Reactive Traffic Messages involved in a memory transaction depend one upon the other • Minimal 2 messages: CPU CPU CPU L1 L1 L1 – CPU-A requests a L2 L2 L2 cache line. – CPU-B L2 provides the block. CPU CPU CPU L1 L1 L1 MAIN L2 L2 L2 • Longer Dependencies: MEM – CPU-A requests a cache line. CPU CPU CPU L1 L1 L1 – The line is not in CPU-B L2, to memory. L2 L2 L2 – Memory provides the block. CHIP

  6. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 6 Reactive Traffic This kind of communication can cause message-dependent deadlocks. Network Interface REP-OUT REP-IN REQ-OUT REQ-IN REP-OUT REP-IN REQ-OUT REQ-IN Crossbar Router B Router A 1 Router A and Router B flood the network with REQUEST messages 2 REQUEST messages are only attended if a REPLY can be generated The hole leaved by an attended REQUEST is occupied by another REQUEST 3 DEADLOCK: No more REQUESTS can be attended and REPLIES cannot reach destination.

  7. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 7 Reactive Traffic A widely utilized solution to avoid this problem is buffer replication. REQ and REP travel through different buffering resources (virtual networks). REP-OUT REP-IN REQ-OUT REQ-IN REP-OUT REP-IN REQ-OUT REQ-IN Crossbar Router

  8. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 8 Reactive Traffic Path replication solves end-to-end deadlock problem, but can seriously affect other relevant design aspects, such as area, complexity, power. Consumer Injector Crossbar Buffer Rtg. & Arb. 1 Message Type 4 Message Types Alpha 21364 router: 7 message types

  9. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 9 Previous work: The Rotary Router [REF] P. Abad, V. Puente, P. Prieto, J.A. Gregorio, “Rotary Router: An Efficient Architecture for CMP Interconnection Networks”, International Symposium on Computer Architecture (ISCA), 2007.

  10. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 10 Rotary Router Sketch Input Stage  Packet Pre-Routing.  Ring Selection. N Injector Free? Buffering Segment Stage No! Consumer  Packet movement.  Output arbitration. Buffer Output Stage E W  Flow control.  Packets storage. S

  11. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 11 Rotary Router Advantages • Head of Line Blocking Avoidance. • Improved Buffering utilization. • Adaptive routing without virtual channels. • Centralized structures avoidance (Xbar, Arbiter). • Topology agnostic Deadlock avoidance Mechanism. 120 Normalized EDP (%) 100 80 60 40 20 Classic Rotary 0 LU JAVA FT HTTP

  12. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 12 Reactive Traffic Continuous movement of packets inside the router rings allows the Rotary Router to implement a solution to end-to-end Deadlock without requiring path replication. COMMUNICATION PATTERN M-1 M-2 M-3 M-4 Lim M-3 Lim M-2 Lim M-1 Empty

  13. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 13 Reactive Traffic Continuous movement of packets inside the router rings allows the Rotary Router to implement a solution to end-to-end Deadlock without requiring path replication. COMMUNICATION PATTERN M-1 Lim M-3 Lim M-3 M-2 Lim M-3 Lim M-3 Lim M-3 M-3 Lim M-2 Lim M-2 Lim M-2 Lim M-2 M-4 Lim M-2 Lim M-1 Lim M-1 Lim M-1 Lim M-1 Lim M-1 Empty Empty Empty Empty Inject Transit

  14. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 14 Outline • Motivation • Reactive traffic • End-to-end deadlock • Rotary solution • In order delivery • Evaluation • Conclusions

  15. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 15 In-Order Delivery • This requirement is imposed by some memory coherence protocols (v.gr. Token coherence protocol) or maintenance tasks. • In these cases, only specific transactions need to be ordered (v.gr. Persistent request deactivation) • Ordered messages represent only a small portion of total network traffic ( ~5% of total traffic).

  16. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 16 In-Order Delivery Fulfilling this requirement is extremely simple for input buffered routers. It becomes a challenge for the Rotary Router: -Adaptive routing allows inter-router packet overtaking. -Internal router rings allow intra-router overtaking. N 2 Injector 1 Consumer Buffer 2 1 E W S

  17. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 17 In-Order Delivery Inter-router overtaking is avoided through specific Routing decisions for in-order messages: -wraparound links will be avoided (Mesh) -Adaptive routing will not be allowed (DOR).

  18. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 18 In-Order Delivery Intra-router overtaking needs a special mechanisms to be avoided. 0 IN IN IN OUT OUT 1 2 0 1 0 1 0 1

  19. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 19 Outline • Motivation • Reactive traffic • End-to-end deadlock • Rotary solution • In order delivery • Evaluation • Conclusions

  20. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 20 Performance Evaluation • Compared to three different routers Adaptive Bubble Router Deterministic Router Low Latency Router Latch DOR DOR DOR Adaptive

  21. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 21 Performance Evaluation • Synthetic Traffic Patterns • Real Workloads – GEMS + SICOSYS. 16 4GB, 260 cycles, 320 GB/s Number of cores Main Memory Private, 32KB, 2-way, 64Bytes 16 bytes L1 I/D cache Command size block, 1-cycle SNUCA, 16x16 banks, 4 per router 8x8 Torus L2 cache Network Topology 128KB, 16-way, 3-cycles, Pseudo 128 bits / 1 cycle latency L2 cache bank Network Link LRU, 64 Bytes block

  22. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 22 Performance Evaluation • Synthetic Traffic Patterns – 5 message types. – 32.000 messages of each type delivered. – Low-lat topology: Mesh. 500 ROTARY BADA BDOR LOW-LAT 400 300 200 100 0 RAND BIT-REV. MAT-TR PERM

  23. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 23 Performance Evaluation • Real Workloads – Transactional & Scientific applications. ROTARY BADA 200 BDOR LOW-LAT 150 100 50 0 IS LU FT OLTP Java HTTP1 HTTP2

  24. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 24 Outline • Motivation • Reactive traffic • End-to-end deadlock • Rotary solution • In order delivery • Evaluation • Conclusions

  25. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 25 Conclusions • The Rotary Router has been the base to implement a mechanism able to deal with end- to-end deadlocks. • This mechanism does not require path replication. • We solve in-order delivery with a simple method which requires few extra hardware. • Flexible buffer utilization allows our router to obtain better performance results.

  26. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 26 Questions?

Recommend


More recommend