Reducing the Interconnection Network Cost of Chip Multiprocessors - PowerPoint PPT Presentation

Reducing the Interconnection Network Cost of Chip Multiprocessors Pablo Abad , Valentín Puente and José Ángel Gregorio.

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 2 Outline • Motivation • Reactive traffic • End-to-end deadlock • Rotary solution • In order delivery • Evaluation • Conclusions

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 3 Motivation: NOCs for CMPs • CMP systems usually assume the presence of cache coherency mechanisms. • Cache coherence requirements for the communication subsystem: – Handle of reactive traffics (end-to-end deadlock). – In-order message delivery. • Solutions for these requirements should have a minimal impact on NoC technological boundaries.

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 5 Reactive Traffic Messages involved in a memory transaction depend one upon the other • Minimal 2 messages: CPU CPU CPU L1 L1 L1 – CPU-A requests a L2 L2 L2 cache line. – CPU-B L2 provides the block. CPU CPU CPU L1 L1 L1 MAIN L2 L2 L2 • Longer Dependencies: MEM – CPU-A requests a cache line. CPU CPU CPU L1 L1 L1 – The line is not in CPU-B L2, to memory. L2 L2 L2 – Memory provides the block. CHIP

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 6 Reactive Traffic This kind of communication can cause message-dependent deadlocks. Network Interface REP-OUT REP-IN REQ-OUT REQ-IN REP-OUT REP-IN REQ-OUT REQ-IN Crossbar Router B Router A 1 Router A and Router B flood the network with REQUEST messages 2 REQUEST messages are only attended if a REPLY can be generated The hole leaved by an attended REQUEST is occupied by another REQUEST 3 DEADLOCK: No more REQUESTS can be attended and REPLIES cannot reach destination.

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 7 Reactive Traffic A widely utilized solution to avoid this problem is buffer replication. REQ and REP travel through different buffering resources (virtual networks). REP-OUT REP-IN REQ-OUT REQ-IN REP-OUT REP-IN REQ-OUT REQ-IN Crossbar Router

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 8 Reactive Traffic Path replication solves end-to-end deadlock problem, but can seriously affect other relevant design aspects, such as area, complexity, power. Consumer Injector Crossbar Buffer Rtg. & Arb. 1 Message Type 4 Message Types Alpha 21364 router: 7 message types

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 9 Previous work: The Rotary Router [REF] P. Abad, V. Puente, P. Prieto, J.A. Gregorio, “Rotary Router: An Efficient Architecture for CMP Interconnection Networks”, International Symposium on Computer Architecture (ISCA), 2007.

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 10 Rotary Router Sketch Input Stage  Packet Pre-Routing.  Ring Selection. N Injector Free? Buffering Segment Stage No! Consumer  Packet movement.  Output arbitration. Buffer Output Stage E W  Flow control.  Packets storage. S

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 11 Rotary Router Advantages • Head of Line Blocking Avoidance. • Improved Buffering utilization. • Adaptive routing without virtual channels. • Centralized structures avoidance (Xbar, Arbiter). • Topology agnostic Deadlock avoidance Mechanism. 120 Normalized EDP (%) 100 80 60 40 20 Classic Rotary 0 LU JAVA FT HTTP

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 12 Reactive Traffic Continuous movement of packets inside the router rings allows the Rotary Router to implement a solution to end-to-end Deadlock without requiring path replication. COMMUNICATION PATTERN M-1 M-2 M-3 M-4 Lim M-3 Lim M-2 Lim M-1 Empty

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 13 Reactive Traffic Continuous movement of packets inside the router rings allows the Rotary Router to implement a solution to end-to-end Deadlock without requiring path replication. COMMUNICATION PATTERN M-1 Lim M-3 Lim M-3 M-2 Lim M-3 Lim M-3 Lim M-3 M-3 Lim M-2 Lim M-2 Lim M-2 Lim M-2 M-4 Lim M-2 Lim M-1 Lim M-1 Lim M-1 Lim M-1 Lim M-1 Empty Empty Empty Empty Inject Transit

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 15 In-Order Delivery • This requirement is imposed by some memory coherence protocols (v.gr. Token coherence protocol) or maintenance tasks. • In these cases, only specific transactions need to be ordered (v.gr. Persistent request deactivation) • Ordered messages represent only a small portion of total network traffic ( ~5% of total traffic).

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 16 In-Order Delivery Fulfilling this requirement is extremely simple for input buffered routers. It becomes a challenge for the Rotary Router: -Adaptive routing allows inter-router packet overtaking. -Internal router rings allow intra-router overtaking. N 2 Injector 1 Consumer Buffer 2 1 E W S

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 17 In-Order Delivery Inter-router overtaking is avoided through specific Routing decisions for in-order messages: -wraparound links will be avoided (Mesh) -Adaptive routing will not be allowed (DOR).

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 18 In-Order Delivery Intra-router overtaking needs a special mechanisms to be avoided. 0 IN IN IN OUT OUT 1 2 0 1 0 1 0 1

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 20 Performance Evaluation • Compared to three different routers Adaptive Bubble Router Deterministic Router Low Latency Router Latch DOR DOR DOR Adaptive

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 21 Performance Evaluation • Synthetic Traffic Patterns • Real Workloads – GEMS + SICOSYS. 16 4GB, 260 cycles, 320 GB/s Number of cores Main Memory Private, 32KB, 2-way, 64Bytes 16 bytes L1 I/D cache Command size block, 1-cycle SNUCA, 16x16 banks, 4 per router 8x8 Torus L2 cache Network Topology 128KB, 16-way, 3-cycles, Pseudo 128 bits / 1 cycle latency L2 cache bank Network Link LRU, 64 Bytes block

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 22 Performance Evaluation • Synthetic Traffic Patterns – 5 message types. – 32.000 messages of each type delivered. – Low-lat topology: Mesh. 500 ROTARY BADA BDOR LOW-LAT 400 300 200 100 0 RAND BIT-REV. MAT-TR PERM

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 23 Performance Evaluation • Real Workloads – Transactional & Scientific applications. ROTARY BADA 200 BDOR LOW-LAT 150 100 50 0 IS LU FT OLTP Java HTTP1 HTTP2

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 25 Conclusions • The Rotary Router has been the base to implement a mechanism able to deal with end- to-end deadlocks. • This mechanism does not require path replication. • We solve in-order delivery with a simple method which requires few extra hardware. • Flexible buffer utilization allows our router to obtain better performance results.

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 26 Questions?

Reducing the Interconnection Network Cost of Chip Multiprocessors - PowerPoint PPT Presentation

Reducing the Interconnection Network Cost of Chip Multiprocessors Pablo Abad , Valentn Puente and Jos ngel Gregorio. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 2 Outline Motivation Reactive traffic

Interconnection, Peering IXPs What and How Interconnection 2 Interconnection The Internet is

Interconnection Application Options and Process Jason Foster, Sr. Interconnection Specialist

Case 2: Reducing Cardiovascular Risk Type 2 Diabetes Management Case 1: Reducing Hypoglycemic

ISO generator interconnection queue Bob Emmert Sr. Manager, Interconnection Resources Board of

Decision on interconnection process enhancements for independent study and fast track processes

synchronous networks an interconnection structure with an interconnection structure with

Decision on Interconnection Process Enhancements Track 4 Robert Emmert Manager,

Evolution of Interconnection Joseph Lorenzo Hall March 11, 2015 Princeton CITP Global Conference

Low Power and Reliable Interconnection with Low Power and Reliable Interconnection with

Interconnection Networks for Parallel Computers Interconnection networks carry data between

SoC Design Lecture 10: On-Chip Interconnection Networks Lecture 10: On Chip Interconnection

Endogenizing Interconnection Measurement and Economics Christopher S. Yoo University of

Interconnection Structures Patrick Happ Raul Queiroz Feitosa Objective To present key issues

Pricing Interconnection: one regulatory economists perspective William Lehr MIT

Dave Mark Intrinsic Algorithm Reducing the world to mathematical equations! Reducing

EDISON, INC. ESG: Committed to Reducing Emissions ESG: Committed to Reducing Methane Emissions

5194.01: Introduction to High-Performance Deep Learning Mesh-TensorFlow & SparkNet Shen Wang

Sidecars and Service Meshes Next level scaling for microservices User Interface(s): HTML, CSS,

Parallel I/O and Parallel Refinement Chris Richardson Garth Wells DCSE project NAG

Mesh-Free Applications for Static and Dynamically Changing Node Configurations Natasha Flyer

Contrasting Topologies for Regular Interconnection Networks under the Constraints of Nanoscale

Open Cloud Mesh Peter Szegedi GANT Amsterdam 17th TF-Storage, Pisa, Italy 14/10/2015 Networks

Compact Behavioural Modelling Behavioural Modelling Compact of Electromagnetic Effects of

One of the major challenges in the design and verification of manycore systems is cache coherency.

Sambuz

Useful Links

Newsletter

Mail Us

Reducing the Interconnection Network Cost of Chip Multiprocessors - PowerPoint PPT Presentation

Reducing the Interconnection Network Cost of Chip Multiprocessors Pablo Abad , Valentn Puente and Jos ngel Gregorio. NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 2 Outline Motivation Reactive traffic

Interconnection, Peering IXPs What and How Interconnection 2 Interconnection The Internet is

Interconnection Application Options and Process Jason Foster, Sr. Interconnection Specialist

Case 2: Reducing Cardiovascular Risk Type 2 Diabetes Management Case 1: Reducing Hypoglycemic

ISO generator interconnection queue Bob Emmert Sr. Manager, Interconnection Resources Board of

Decision on interconnection process enhancements for independent study and fast track processes

synchronous networks an interconnection structure with an interconnection structure with

Decision on Interconnection Process Enhancements Track 4 Robert Emmert Manager,

Evolution of Interconnection Joseph Lorenzo Hall March 11, 2015 Princeton CITP Global Conference

Low Power and Reliable Interconnection with Low Power and Reliable Interconnection with

Interconnection Networks for Parallel Computers Interconnection networks carry data between

SoC Design Lecture 10: On-Chip Interconnection Networks Lecture 10: On Chip Interconnection

Endogenizing Interconnection Measurement and Economics Christopher S. Yoo University of

Interconnection Structures Patrick Happ Raul Queiroz Feitosa Objective To present key issues

Pricing Interconnection: one regulatory economists perspective William Lehr MIT

Dave Mark Intrinsic Algorithm Reducing the world to mathematical equations! Reducing

EDISON, INC. ESG: Committed to Reducing Emissions ESG: Committed to Reducing Methane Emissions

5194.01: Introduction to High-Performance Deep Learning Mesh-TensorFlow &amp; SparkNet Shen Wang

Sidecars and Service Meshes Next level scaling for microservices User Interface(s): HTML, CSS,

Parallel I/O and Parallel Refinement Chris Richardson Garth Wells DCSE project NAG

Mesh-Free Applications for Static and Dynamically Changing Node Configurations Natasha Flyer

Contrasting Topologies for Regular Interconnection Networks under the Constraints of Nanoscale

Open Cloud Mesh Peter Szegedi GANT Amsterdam 17th TF-Storage, Pisa, Italy 14/10/2015 Networks

Compact Behavioural Modelling Behavioural Modelling Compact of Electromagnetic Effects of

One of the major challenges in the design and verification of manycore systems is cache coherency.

Sambuz

Useful Links

Newsletter

Mail Us

5194.01: Introduction to High-Performance Deep Learning Mesh-TensorFlow & SparkNet Shen Wang