
A Variable-pipeline On-chip Router Optimized to Traffic Pattern


  1. A Variable-pipeline On-chip Router Optimized to Traffic Pattern
     Yuto Hirata (Keio University), Hiroki Matsutani (University of Tokyo), Michihiro Koibuchi (National Institute of Informatics), Hideharu Amano (Keio University), Japan

  2. Outline
     • NoC is the heart of a many-core processor
     • Router pipeline affects performance and power
       – Various existing pipeline structures
     • Trade-off between latency, throughput, and power
     • We propose a variable-pipeline router (1-, 2-, and 3-cycle modes)
       – 1cycle mode has the lowest latency
       – 2cycle mode is better in throughput and power
       – 3cycle mode is used to avoid hotspots

  3. Our target region
     [Figure: number of PEs (caches are not included; axis 2 to 256) versus year (2002 to 2010?). Plotted chips: picoChip PC102, picoChip PC205, ClearSpeed CSX600, ClearSpeed CSX700, Intel 80-core, TILERA TILE64, MIT RAW, UT TRIPS (OPN), STI Cell BE, Sun T1, Sun T2, Intel Core, IBM Power7, AMD Opteron. Two groups are labeled "Hundreds of simple PEs" and "Chip multi-processor (CMP)", and the target region is marked.]

  4. Our target: NoC for future CMPs
     • 8-CPU CMP example
       – 8 CPUs (each has a private L1 cache)
       – Shared L2 cache (divided into 64 banks) [Beckmann, MICRO’04]
     [Figure: chip floorplan showing UltraSPARC cores, L1 caches (I & D, 16 kB), and L2 cache banks (256 kB, 4-way)]

  5. Table of Contents • Trade-off Problem of Router Structures • Solution: Variable-pipeline Router • Evaluation • Related Work • Conclusions

  6. Conventional On-chip Router
     • Module
       – Input channel
       – Crossbar switch
       – Output channel
     • Pipeline stage
       – Routing computation (RC)
       – VC allocation (VA)
       – Switch allocation (SA)
       – Switch traversal (ST)
     [Figure: router block diagram with input channels, arbiter, crossbar (X-bar), and output channels with output buffers on the North, East, West, South, and Core ports; and a timing chart over cycles 1 to 7 in which the head flit passes through RC, VA, SA, ST while body flit 1, body flit 2, and the tail flit pass through SA, ST only]
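
The timing chart on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming one stage per cycle, no stalls, and that each following flit enters SA one cycle after the flit ahead of it; the function name `schedule` is illustrative and not from the slides.

```python
# Stage lists follow the slide: only the head flit needs RC and VA.
STAGES_HEAD = ["RC", "VA", "SA", "ST"]
STAGES_REST = ["SA", "ST"]  # body and tail flits reuse the head's route and VC


def schedule(num_flits):
    """Return {flit_index: {stage: cycle}} for one packet (cycles start at 1).

    Assumption: no contention, so each flit does SA exactly one cycle
    after the previous flit's SA.
    """
    timeline = {}
    prev_sa = 0
    for i in range(num_flits):
        if i == 0:
            stages, start = STAGES_HEAD, 1
        else:
            stages, start = STAGES_REST, prev_sa + 1
        slots = {stage: start + k for k, stage in enumerate(stages)}
        prev_sa = slots["SA"]
        timeline[i] = slots
    return timeline


if __name__ == "__main__":
    names = ["head", "body 1", "body 2", "tail"]
    for i, slots in schedule(4).items():
        print(names[i], slots)
```

With four flits the tail flit finishes ST in cycle 7, which matches the 7-cycle span of the chart.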

  7. 1cycle Router Pipeline
     • Trade-off between router pipeline structures
     • 1cycle pipeline structure
       – Good: 1-cycle transfer → Lowest communication latency
       – Weak: Sequential execution of NRC/VSA and ST stages → Lowest frequency and throughput
     [Figure: 1cycle router pipeline between links, with NRC, VA, SA, and ST stages]

  8. 2cycle Router Pipeline
     • 2cycle pipeline
       – Good: NRC and VSA (VA+SA) are executed in parallel → Highest frequency
       – Modest: 2-cycle transfer → Shorter communication latency
     [Figure: 2cycle router pipeline between links, with NRC, VA, SA, and ST stages]

  9. 3cycle Router Pipeline
     • 3cycle pipeline
       – Good: Adaptive routing (Duato’s protocol) → Avoids hotspots
       – Modest: Medium frequency
       – Weak: 3-cycle transfer → Large communication latency in cycles
     [Figure: 3cycle router pipeline between links, with RC, VA1/VA2, SA1/SA2, and ST stages]

  10. Trade-off of Pipeline Structures
      • Pros and cons of each pipeline structure
        Pipeline depth                   Operating freq.   Throughput   Latency
        1cycle (deterministic routing)   Low               Low          Lowest
        2cycle (deterministic routing)   High              High         Lower
        3cycle (adaptive routing)        Mid               Mid          High
      • An optimal pipeline depends on the traffic requirement
        → Our solution: switch the pipeline structures dynamically

  11. Table of Contents • Trade-off Problem of Router Structures • Solution: Variable-pipeline Router • Evaluation • Related Work • Conclusions

  12. Variable-pipeline (VP) Router
      • Using DVFS
        – High throughput by increasing freq. and voltage
          • Increasing the number of pipeline stages
        – Low latency by decreasing freq. and voltage
          • Decreasing the number of pipeline stages
      • Local processor changes the router pipeline structure for each application
        Mode     Routing                       Purpose
        3cycle   Duato’s protocol (adaptive)   Avoiding hotspots
        2cycle   DOR (deterministic)           High throughput and low power
        1cycle   DOR (deterministic)           Low latency
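
The mode table on this slide maps naturally onto a small lookup structure. The sketch below is only an illustration: the slides say the local processor picks a mode per application but do not give the policy, so `pick_mode` and its two boolean inputs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    cycles: int   # pipeline depth of the router in this mode
    routing: str  # routing algorithm listed on the slide
    purpose: str  # purpose listed on the slide


# Contents taken from the slide's mode table.
MODES = {
    1: Mode(1, "DOR (deterministic)", "low latency"),
    2: Mode(2, "DOR (deterministic)", "high throughput and low power"),
    3: Mode(3, "Duato's protocol (adaptive)", "avoiding hotspots"),
}


def pick_mode(hotspot_expected: bool, latency_critical: bool) -> Mode:
    """Hypothetical per-application policy (not specified in the slides)."""
    if hotspot_expected:
        return MODES[3]  # adaptive routing can steer around the hotspot
    if latency_critical:
        return MODES[1]  # fewest cycles per hop
    return MODES[2]      # default: throughput and power


if __name__ == "__main__":
    print(pick_mode(hotspot_expected=False, latency_critical=True))
```

Giving hotspot avoidance priority over latency is a design choice of this sketch; a real policy would depend on the application's measured traffic.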

  13. Design of Variable-pipeline Router
      • Each mode uses a different path
      [Figure: router datapath with input channel, NRC, RC, VA/SA, and output channel]

  14. Design of Variable-pipeline Router
      • 1cycle mode
      [Figure: path used in 1cycle mode through the datapath (input channel, NRC, RC, VA/SA, output channel)]

  15. Design of Variable-pipeline Router
      • 2cycle mode
      [Figure: path used in 2cycle mode through the datapath (input channel, NRC, RC, VA/SA, output channel)]

  16. Design of Variable-pipeline Router
      • 3cycle mode
      [Figure: path used in 3cycle mode through the datapath (input channel, NRC, RC, VA/SA, output channel)]

  17. Design of Variable-pipeline Router
      • 5 ports (NEWS + Core) for 2-D Mesh/Torus
      • Flit width: 66 bits (64-bit data + 2-bit flit type)
        – Packet size is variable
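
The 66-bit flit layout can be illustrated by packing and unpacking integers. This is a sketch under assumptions: the slides give only the field widths, so placing the type in the top two bits and the specific type codes (including a single-flit `HEAD_TAIL` code) are hypothetical.

```python
from enum import IntEnum

DATA_BITS = 64
TYPE_BITS = 2
DATA_MASK = (1 << DATA_BITS) - 1


class FlitType(IntEnum):
    # Hypothetical encoding; the slides only state that the type field is 2 bits.
    HEAD = 0
    BODY = 1
    TAIL = 2
    HEAD_TAIL = 3  # single-flit packet


def pack_flit(ftype, data):
    """Build a 66-bit flit: [type (2 bits) | data (64 bits)]."""
    if not 0 <= data <= DATA_MASK:
        raise ValueError("data must fit in 64 bits")
    return (int(ftype) << DATA_BITS) | data


def unpack_flit(flit):
    """Split a 66-bit flit back into (type, data)."""
    if not 0 <= flit < (1 << (DATA_BITS + TYPE_BITS)):
        raise ValueError("flit must fit in 66 bits")
    return FlitType(flit >> DATA_BITS), flit & DATA_MASK


def packetize(words):
    """Turn a variable-length list of 64-bit words into one packet of flits."""
    if not words:
        raise ValueError("a packet needs at least one word")
    if len(words) == 1:
        return [pack_flit(FlitType.HEAD_TAIL, words[0])]
    flits = [pack_flit(FlitType.HEAD, words[0])]
    flits += [pack_flit(FlitType.BODY, w) for w in words[1:-1]]
    flits.append(pack_flit(FlitType.TAIL, words[-1]))
    return flits


if __name__ == "__main__":
    for flit in packetize([0xA, 0xB, 0xC]):
        print(unpack_flit(flit))
```

Because the type travels with every flit, the packet length does not need to be known in advance, which is consistent with the variable packet size noted on the slide.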

  18. Design of Variable-pipeline Router • Reconfiguration takes only a single cycle – when no packets arrive
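
One way to read this slide is that a requested mode change is deferred until a cycle in which the router is idle. The toy model below sketches that reading; the class `VPRouter`, its fields, and the exact idle condition (empty buffers and no arriving flits) are assumptions, not the authors' design.

```python
class VPRouter:
    """Toy model: a pending mode change is applied only in an idle cycle."""

    def __init__(self, mode=2):
        self.mode = mode     # current pipeline depth: 1, 2, or 3
        self.pending = None  # mode requested by the local processor
        self.buffered = 0    # flits currently held in the input buffers

    def request_mode(self, mode):
        if mode not in (1, 2, 3):
            raise ValueError("mode must be 1, 2, or 3")
        self.pending = mode

    def tick(self, arriving=0, departing=0):
        """Advance one cycle; return True if the mode switched in this cycle."""
        if departing > self.buffered + arriving:
            raise ValueError("cannot send more flits than are available")
        # Assumed idle condition: nothing buffered and nothing arriving.
        idle = self.buffered == 0 and arriving == 0
        self.buffered += arriving - departing
        if idle and self.pending is not None:
            self.mode, self.pending = self.pending, None
            return True
        return False


if __name__ == "__main__":
    r = VPRouter(mode=2)
    r.request_mode(1)
    print(r.tick(arriving=1), r.mode)   # busy: stays in 2cycle mode
    print(r.tick(departing=1), r.mode)  # still draining
    print(r.tick(), r.mode)             # idle: switches in this single cycle
```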

  19. • Most modules are shared by the different modes
      • Shared by 1-, 2-, and 3-cycle modes
        – Input buffers, NRC (RC), and VA/SA modules
        – Input buffer dominates the router area
      • Shared by 2- and 3-cycle modes
        – Output latch in the output port
      [Figure: stacked bar of area (kGates, axis 0 to 70) for the 3cycle router; legend entries include other, crossbar, and input channel]

  20. Fault Tolerance for RC module
      • When routers A and B are running in 1- or 2-cycle mode with look-ahead routing (NRC),
        – if the NRC of router B fails, router C executes both RC and NRC
      [Figure: routers A, B, and C in a row, each with RC and NRC units. The packet sent from A includes the NRC information; the packet sent from B has no NRC information; router C executes both RC and NRC]
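
The fallback described on this slide can be sketched in Python. This is a minimal model, assuming XY dimension-order routing on a 2-D mesh with +y as north and a packet field that carries the precomputed output port; `dor_port`, `route_hop`, and the `nrc_ok` flag are illustrative names, not taken from the slides.

```python
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def dor_port(cur, dst):
    """XY dimension-order routing: output port at router `cur` toward `dst`."""
    (x, y), (dx, dy) = cur, dst
    if dx > x:
        return "E"
    if dx < x:
        return "W"
    if dy > y:
        return "N"
    if dy < y:
        return "S"
    return "CORE"


def neighbor(cur, port):
    ox, oy = STEP[port]
    return (cur[0] + ox, cur[1] + oy)


def route_hop(cur, dst, carried_port, nrc_ok=True):
    """Route one hop; return (out_port, port_carried_onward, units_used).

    `carried_port` is the look-ahead result from the previous router, or
    None if that router's NRC failed. In that case this router runs RC itself.
    """
    used = []
    if carried_port is None:
        out = dor_port(cur, dst)  # fallback: local routing computation
        used.append("RC")
    else:
        out = carried_port
    onward = None
    if out != "CORE" and nrc_ok:
        onward = dor_port(neighbor(cur, out), dst)  # look-ahead for next router
        used.append("NRC")
    return out, onward, used


if __name__ == "__main__":
    dst = (3, 0)
    out_a, to_b, _ = route_hop((0, 0), dst, "E")                 # router A: NRC works
    out_b, to_c, _ = route_hop((1, 0), dst, to_b, nrc_ok=False)  # router B: NRC fails
    print(route_hop((2, 0), dst, to_c))                          # router C: RC and NRC
```

In the usage example, router C receives a packet without look-ahead information and therefore runs both RC and NRC, as on the slide.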

  21. Table of Contents • Trade-off Problem of Router Structures • Solution: Variable-pipeline Router • Evaluation • Related Work • Conclusions

  22. Evaluation Items
      • RTL simulation
        – NC-Verilog 8.1
      • Design synthesis
        – Design Compiler 2007.12-SP3
        – Nangate 45nm library, typical (1.2V, 25℃)
      • Network simulation
        – GEMS/Simics (full system simulator)
        – Flit-level network simulator
      • Application
        – 9 benchmarks from SPLASH-2

  23. Target 8-core Processor
      • Sun Solaris 9, Sun Studio 12
      • Routers are connected by a 2-dimensional mesh
      [Figure: chip floorplan showing UltraSPARC cores, L1 caches (I & D, 16 kB), L2 cache banks (256 kB, 4-way), and on-chip routers]

  24. 1. Hardware synthesis results
         – Area (kGates), frequency (MHz), power (bit/J)
      2. Full system CMP simulation results (GEMS/Simics simulator; SPLASH-2 benchmark)
         – Application execution time
         – Collect the packet trace
      3. The average hop count for each trace
         – Get the zero-load latency data
      4. Network simulation results using packet traces
         – Maximum throughput and power consumption
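
Step 3 of this flow, going from a packet trace to a zero-load latency, can be sketched as follows. The trace format, the use of Manhattan distance as hop count, and the simplification that each hop costs exactly the pipeline depth in cycles are all assumptions of this sketch; the frequencies in the usage example are placeholders, not the synthesis results.

```python
def average_hops(trace):
    """Mean hop count of a trace given as [((sx, sy), (dx, dy)), ...].

    Assumption: minimal routing on a 2-D mesh, so hops = Manhattan distance.
    """
    if not trace:
        raise ValueError("trace is empty")
    total = sum(abs(dx - sx) + abs(dy - sy) for (sx, sy), (dx, dy) in trace)
    return total / len(trace)


def zero_load_latency_ns(avg_hops, cycles_per_hop, freq_mhz):
    """Zero-load latency in ns, ignoring link and serialization delay (assumed)."""
    if freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    cycle_ns = 1000.0 / freq_mhz
    return avg_hops * cycles_per_hop * cycle_ns


if __name__ == "__main__":
    hops = average_hops([((0, 0), (3, 0)), ((0, 0), (1, 1))])
    # Placeholder frequencies, chosen only to show the arithmetic.
    print(zero_load_latency_ns(hops, cycles_per_hop=1, freq_mhz=250.0))
    print(zero_load_latency_ns(hops, cycles_per_hop=2, freq_mhz=500.0))
```

Under this model a 2cycle router needs twice the clock frequency of a 1cycle router just to match its zero-load latency, which is one way to see why the shallow pipeline wins on latency.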

  25. Area Overhead
      • Area of Variable-pipeline router
        – Increased by 13.3%
        – Input buffer is dominant in routers
      [Figure: stacked bars of area (kGates, axis 0 to 80) for the 1cycle, 2cycle, 3cycle, and VP routers, broken down into input channel, crossbar, output channel, and other; the 13.3% increase is annotated]

  26. Operating Frequency
      • Frequency of each pipeline stage
      • Supply voltage: 0.6V to 1.2V
        – As supply voltage increases, frequency increases
      • VP router has 12% frequency overhead
      [Figure: frequency (MHz, axis 0 to 700) versus supply voltage (V, 0.6 to 1.2) for the 1cycle router, 2cycle router, 3cycle router, and VP router (1cycle mode); the 12% overhead is annotated]

  27. Application Execution Time
      • Execute SPLASH-2 benchmarks on the 1-, 2-, and 3-cycle routers
        – Lower execution time is better
        – 2cycle is best
      [Figure: bar chart of normalized execution time (axis 0 to 1) for the 1cycle, 2cycle, and 3cycle routers]

  28. • Flight time without packet conflicts (zero-load latency)
        – Strongly affects performance
        – Lower latency is better
      • 1-cycle mode is best
      [Figure: bar chart of zero-load latency (nsec, axis 0 to 50) for the 1cycle, 2cycle, and 3cycle routers]

  29. • 2-cycle router achieves the highest throughput
      • Overhead of the adaptive 3-cycle router is a bottleneck
      [Figure: bar chart of normalized maximum throughput (axis 0 to 1) for the 1cycle, 2cycle, and 3cycle routers]

  30. Power Consumption
      • 2cycle mode is best
      [Figure: power consumption (mW, axis 0 to 18) versus throughput (M flit/sec, 0.3 to 1.3) for the 1cycle, 2cycle, and 3cycle routers; SPLASH-2 radiosity benchmark]

  31. Table of Contents • Trade-off Problem of Router Structures • Solution: Variable-pipeline Router • Evaluation • Related Work • Conclusions

  32. Related Work
      1. Pipeline integration of processors (Shimada, 2007)
         – Multiple pipeline stages are integrated into one stage when the frequency decreases
           • Using DVFS
           • Power efficiency improves
      2. Router micro-architecture optimizing pipelines
         – Speculative router: VA and SA in parallel (Peh, HPCA00)
         – Prediction Router (Matsutani, HPCA09)
         – Look-ahead (LA) router (Galles, HOTI’96)
           • NRC and VSA can be executed in parallel
      → We integrated different pipeline stages on an on-chip router

  33. • On-chip router is the heart of NoCs
        – Various existing pipeline structures
      • Trade-off between latency, throughput, and power
      • We designed a variable-pipeline router
        – Switching among 1-, 2-, and adaptive 3-cycle pipelines
      A variable-pipeline router micro-architecture
