iCAST 2012 Seoul, Korea July 21-24 2012 ONoC-SPL: Customized Network-on-Chip (NoC) Architecture and Prototyping for Data-intensive Computation Applications Akram Ben Ahmed, Kenichi Mori, Abderazek Ben Abdallah The University of Aizu School of Computer Science and Engineering, Adaptive Systems Laboratory, Aizu-Wakamatsu, Japan. Email:d8141104@u-aizu.ac.jp The University of Aizu Adaptive systems lab 1
Outline
• Background
• ONoC-SPL architecture
  – OASIS2-NoC overview
  – SPL Insertion Algorithm
• Evaluation
• Conclusion
Background: Bus-based system vs. NoC
[Figure: bus-based system in which Core1, Core2, and Core3 share one bus with two memories and I/O; two of the cores wait while one transfers data]
• Parallelism problem
• High latency
Background: Bus-based system vs. NoC
[Figure: NoC-based system, with labels for input buffer, processing element, router, unidirectional link, and network interface] [Carloni2009, Ben2006]
Background: NoC Challenges
- Routing [Sulivan1977, Seo2005]: path selection has an impact on system performance
Background: NoC Challenges
- Flow control [Agarwal2009, Pullini2005]: efficient flow control is crucial
Background: NoC Challenges
- Topology: Mesh [Zhang2011]
  – Uniform connection
  – Large hop count: long distances affect latency, throughput, and power
Background: NoC Challenges
- Topology: Torus [Dally1986]
  – Connects the network extremities to reduce the inter-node distance
  – Drawbacks: increased complexity, different wire lengths, clock skew
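The hop-count difference between the two topologies can be illustrated with a minimal sketch. It assumes a square network with minimal routing; the function names and the (x, y) coordinate convention are hypothetical and not taken from the slides.

```python
def mesh_hops(src, dst):
    """Minimal hop count between two nodes of a 2D mesh (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, size):
    """Minimal hop count on a size x size 2D torus, where rows and columns wrap around."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    # On each axis, take the shorter of the direct path and the wrap-around path.
    return min(dx, size - dx) + min(dy, size - dy)


corner_a, corner_b = (0, 0), (3, 3)
print(mesh_hops(corner_a, corner_b))      # 6
print(torus_hops(corner_a, corner_b, 4))  # 2
```

In a 4x4 network the corner-to-corner distance drops from 6 hops to 2, at the cost of the wrap-around wires listed as drawbacks above.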
Background: NoC Challenges
- Topology: Customized [Bolotin2004]
  – Designed specifically for one application
  – Drawbacks: long design time, difficult to implement
Background: OASIS2-NoC
• 4x4 mesh topology
• Wormhole-like switching
• Stall-and-Go flow control
• 20-bit flit
[Figure: OASIS2-NoC 4x4 network system [*]]
[*] K. Mori, A. Esch, A. Ben Abdallah, K. Kuroda, "Advanced Design Issue for OASIS Network-on-Chip Architecture", IEEE International Conference on BWCCA, pp. 74-79, 2010.
Background: Motivation
• In OASIS2-NoC, PEs are connected uniformly, so the network suffers from a large hop count between (source, destination) pairs
  – This significantly degrades overall performance, especially for data-intensive applications
• Using synthetic traffic in high-level simulation does not reveal the real system performance
  – Not enough to evaluate the effects and trade-offs of the NoC router's parameters (flow control, buffer size, and routing)
  – Hardware and performance evaluation is not accurate
Background: Contributions
• Proposal of an optimized version of OASIS2-NoC, named ONoC-SPL, customized with a Short-Pass-Link (SPL)
  – To reduce the communication latency of long-range, high-frequency communication
• Prototyping ONoC-SPL on FPGA with synthetic and real applications
  – To accurately evaluate power consumption, area utilization, and performance
Outline
• Background
• ONoC-SPL architecture
  – OASIS2-NoC architecture
  – SPL Insertion Algorithm
• Evaluation
• Conclusion
OASIS2-NoC: Router architecture
[Figure: router block diagram with the stage labels BW, RC/SA, and CT]
OASIS2-NoC: Router architecture
Input module: incoming data enters these modules
- Input buffer (BW)
- Look-Ahead-XY routing (RC)
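A minimal sketch of look-ahead XY routing, for illustration only. It assumes (x, y) coordinates and the port names EAST, WEST, NORTH, SOUTH, and LOCAL, none of which come from the slides. In look-ahead routing, the port used at the current router was already computed upstream, so each router computes the port the flit will request at the next router.

```python
# Assumed port names and coordinate offsets (hypothetical, not from the slides).
STEP = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1), "LOCAL": (0, 0)}


def xy_port(cur, dst):
    """Plain XY routing: correct the X offset first, then Y; LOCAL on arrival."""
    if dst[0] > cur[0]:
        return "EAST"
    if dst[0] < cur[0]:
        return "WEST"
    if dst[1] > cur[1]:
        return "NORTH"
    if dst[1] < cur[1]:
        return "SOUTH"
    return "LOCAL"


def look_ahead_xy(cur, dst, port_here):
    """Return the port the flit will request at the NEXT router,
    given the port already selected for the current router."""
    sx, sy = STEP[port_here]
    next_router = (cur[0] + sx, cur[1] + sy)
    return xy_port(next_router, dst)


def trace(src, dst):
    """List the output ports used along the path from src to dst."""
    cur, port = src, xy_port(src, dst)  # the first port is computed at injection
    ports = [port]
    while port != "LOCAL":
        next_port = look_ahead_xy(cur, dst, port)
        cur = (cur[0] + STEP[port][0], cur[1] + STEP[port][1])
        port = next_port
        ports.append(port)
    return ports


print(trace((0, 0), (2, 1)))  # ['EAST', 'EAST', 'NORTH', 'LOCAL']
```

Computing the next-hop port one router early takes the routing decision out of the critical path at the router where it is used, which is the usual motivation for look-ahead schemes.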
OASIS2-NoC: Router architecture
Arbiter and flow control
• Arbiter: handles the arbitration between the requests of the different input ports (SA)
• Stall/Go: includes the flow control module
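The Stall/Go idea can be sketched as a small cycle-based model. This is an illustration under assumptions, not the OASIS2-NoC implementation: the buffer depth, the stall threshold, the drain rate, and all names are hypothetical. The receiver lowers its go signal one slot before the buffer is full, which leaves room for a flit that is already on its way.

```python
from collections import deque


class StallGoBuffer:
    """Receiver-side input buffer that raises 'go' only while it has spare room."""

    def __init__(self, depth=4, stall_threshold=3):
        self.depth = depth                      # assumed physical buffer depth
        self.stall_threshold = stall_threshold  # assumed occupancy at which 'stall' is raised
        self.fifo = deque()

    @property
    def go(self):
        return len(self.fifo) < self.stall_threshold

    def push(self, flit):
        if len(self.fifo) >= self.depth:
            raise OverflowError("flit sent into a full buffer")
        self.fifo.append(flit)

    def pop(self):
        return self.fifo.popleft() if self.fifo else None


def send(flits, buf, drain_every=2):
    """Sender offers one flit per cycle; receiver drains one flit every drain_every cycles.
    Returns the flits in arrival order and the number of cycles the sender was stalled."""
    pending, received = deque(flits), []
    cycle = stalled = 0
    while pending or buf.fifo:
        go = buf.go  # the sender samples the signal at the start of the cycle
        if cycle % drain_every == drain_every - 1:
            flit = buf.pop()
            if flit is not None:
                received.append(flit)
        if pending:
            if go:
                buf.push(pending.popleft())
            else:
                stalled += 1
        cycle += 1
    return received, stalled


received, stalled = send(list(range(8)), StallGoBuffer())
print(received, stalled)
```

Because the sender only transmits while go is high, push never hits a full buffer in this model even though the sender is faster than the receiver.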
OASIS2-NoC: Router architecture
Crossbar: handles the transfer of flits to their appropriate channels, depending on the information received from the arbiter (CT)
OASIS2-NoC: Arbitration & flow control
• Flow control mechanism: Stall/Go is the method used to avoid buffer overflow
• Arbitration mechanism: matrix arbiter. When the priority of i is higher than that of j, P(i,j) becomes 1 and P(j,i) becomes 0
[Figures (a) and (b): illustrations of the two mechanisms]
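A matrix arbiter of this kind can be sketched as follows. The priority matrix follows the slide's convention (P(i,j) = 1 when i has priority over j). The update rule, where the winner drops to the lowest priority after being granted, is the common textbook policy and is an assumption here, as are the class and method names.

```python
class MatrixArbiter:
    def __init__(self, n):
        self.n = n
        # P[i][j] == 1 means requester i currently has priority over requester j.
        # Assumed initial order: 0 > 1 > ... > n-1.
        self.P = [[1 if i < j else 0 for j in range(n)] for i in range(n)]

    def grant(self, requests):
        """Return the index of the winning requester, or None if nobody requests."""
        for i in range(self.n):
            if not requests[i]:
                continue
            # i wins if no other active requester has priority over it.
            beaten = any(requests[j] and self.P[j][i] for j in range(self.n) if j != i)
            if not beaten:
                # Assumed update policy: the winner becomes the lowest priority.
                for j in range(self.n):
                    if j != i:
                        self.P[i][j] = 0
                        self.P[j][i] = 1
                return i
        return None


arbiter = MatrixArbiter(3)
print([arbiter.grant([True, True, True]) for _ in range(4)])  # [0, 1, 2, 0]
```

With all ports requesting continuously, grants rotate through the requesters, so the least recently served port always holds the highest priority.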
Outline
• Background
• ONoC-SPL architecture
  – OASIS2-NoC architecture
  – Short-Pass-Link (SPL) Customization
• Evaluation
• Conclusion
Short-Pass-Link (SPL) Customization
[Figure: mesh network with an SPL]
• ONoC-SPL employs a mesh topology with Short-Pass-Links (SPLs)
  – To reduce the latency caused by the high number of hops
SPL insertion process: Algorithm
[Figure: flowchart of the insertion process, with these steps]
• Decide the number of SPLs
• Select the communications for insertion
• Simulation and insertion
SPL insertion process: Example

Dimension reversal with SPL
  Communication      Frequency   Distance
  (3,0) -> (0,3)     0.125       6
  (0,3) -> (3,0)     0.125       6
  2 SPLs inserted: (3,0) -> (0,3) and (0,3) -> (3,0)

Hotspot with SPL
  Communication      Frequency   Distance
  (0,3) -> (1,0)     0.294       4
  (3,3) -> (1,1)     0.235       4
  (2,0) -> (2,3)     0.235       (not listed)
  2 SPLs inserted: (0,3) -> (1,0) and (3,3) -> (1,1)
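The selection step of the example can be sketched as below. The slides do not state the exact ranking criterion, so this sketch assumes that communications are ranked by frequency first and by Manhattan distance second. The function names are hypothetical, and the simulation step of the process is not modeled.

```python
def manhattan(src, dst):
    """Hop distance between two nodes of a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def select_spl(comms, num_spl):
    """Pick the num_spl communications that receive a Short-Pass-Link.

    comms maps (src, dst) pairs to their communication frequency.
    Assumed ranking: highest frequency first, longest distance as tie-breaker.
    """
    ranked = sorted(
        comms,
        key=lambda pair: (comms[pair], manhattan(*pair)),
        reverse=True,
    )
    return ranked[:num_spl]


# Hotspot frequencies from the example slide.
hotspot = {
    ((0, 3), (1, 0)): 0.294,
    ((3, 3), (1, 1)): 0.235,
    ((2, 0), (2, 3)): 0.235,
}
print(select_spl(hotspot, 2))  # [((0, 3), (1, 0)), ((3, 3), (1, 1))]
```

Under this assumed criterion the sketch selects the same two hotspot communications as the slide. The distance of 3 for (2,0) -> (2,3) is computed by the sketch, not read from the slide.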
Evaluation: Evaluation methodology
• Design tools
  – Language: Verilog-HDL
  – Software: Quartus II 11.0
  – Simulation tool: ModelSim-Altera 6.6
  – Device: Stratix III FPGA board
• Target applications
  – Dimension Reversal
  – Hotspot
  – JPEG encoder
[Figure: evaluation flow. The inputs (Dimension Reversal, Hotspot, and JPEG workloads; network size information; NoC partitioning parameter; behavior model) feed the Verilog-HDL RTL code, which is compiled, synthesized with Quartus II, and run on the Stratix III FPGA to obtain execution time and hardware complexity. The JPEG input is an RGB bitstream of 24-bit words such as 24'b001101100101001101101110.]
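For readers unfamiliar with the literal format in the RGB bitstream, the sketch below splits one 24-bit Verilog binary literal into three 8-bit values. The channel order (R in the most significant byte, then G, then B) is an assumption, and parse_rgb_word is a hypothetical helper that is not part of the presented tool flow.

```python
def parse_rgb_word(literal):
    """Split a Verilog-style 24'b... literal into three 8-bit channel values.

    Assumes the most significant byte is R, followed by G and B.
    """
    width, bits = literal.rstrip(";").split("'b")
    if int(width) != 24 or len(bits) != 24:
        raise ValueError("expected a 24-bit binary literal")
    value = int(bits, 2)
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


print(parse_rgb_word("24'b001101100101001101101110;"))  # (54, 83, 110)
```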
Evaluation: Simulation Configuration
Evaluation: Hardware complexity
• Extra area less than 5%
• 6.5% speed reduction
• Slight 1% power overhead
Evaluation: Performance (Execution time)
• ONoC-SPL execution time decreased by 30.1% on average
[Figure: bar chart of execution time for OASIS, ONoC-SPL1, ONoC-SPL2, and ONoC-SPL3. Series: Dimension Reversal (μs), Hotspot (μs), JPEG (x10^1 ms). Change labels on the bars: -16.9, -16.1, +7.3, +11.3, -29.7, -31.0, -43.7]
Evaluation: Performance (Throughput)
• ONoC-SPL throughput improved by 32.3% on average
[Figure: bar chart of throughput (flits/cycle). Change labels on the bars: +49.6, +0.01, +24.8, +24.8, +11.3, +22.6, 0.0]
Conclusion
• An optimized 2D-NoC named ONoC-SPL was proposed
• An SPL insertion algorithm was proposed to reduce the latency of high-frequency communication
• The design was prototyped on FPGA for accurate performance and hardware complexity evaluation, using synthetic traffic and a real workload
Conclusion
• Compared with the previous systems, the execution time decreased by 30.1% and the throughput improved by 32.3% on average
• The performance gain came with under 5% extra hardware and a slight 0.49% average power consumption overhead