Channel Slicing: a Way to Build Fast Routers for Asynchronous NoCs
Wei Song and Doug Edwards
The University of Manchester
15/09/2009
Advanced Processor Technology Group 2014/5/13 The School of Computer Science

Content
• Asynchronous NoCs
• Channel Slicing
  – Motivation
  – Sliced sub-channels
  – Flow control
• An asynchronous wormhole router
  – Implementation details
  – Performance

Network-on-Chip (NoC)
[Figure: 4x4 mesh of nodes labelled (0,0) to (3,3), with PE, NI and RT blocks]
PE: Processor Element
NI: Network Interface
RT: Router

Synchronous/Asynchronous
• Synchronous
  – Fast
    • Intel 80-tile 4GHz 65nm
    • DSPIN 408MHz 130nm
  – Small
    • DSPIN 0.161mm²
  – Power consuming
    • 10.39mW (250MHz)
  – Sensitive to variation
  – Complex clock tree
• Asynchronous
  – Slow !!
    • ASPIN 714MHz 90nm
    • ANoC 220MHz 130nm
  – Large
    • ANoC 0.211mm²
  – Power efficient
    • 3.69mW (160MHz)
  – Tolerant to variation
  – No clock tree

Asynchronous Pipelines
• CHAIN (Bainbridge'02)
  – 4-phase 1-of-4 pipelines
• QoS NoC (Felicijan'04)
  – 8-bit, four 4-phase 1-of-4 pipelines
• ANoC (Beigne'05)
  – 32-bit, sixteen 4-phase 1-of-4 pipelines
• SpiNNaker (Plana'07)
  – Several 1-of-4/2-of-7 pipelines
• ASPIN (Sheibanyrad'08)
  – 32-bit, sixteen dual-rail pipelines / bundled-data
• MANGO (Bjerregaard'05) & QNoC (Dobkin'09)
  – Bundled-data

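
As an illustration of why a 32-bit channel needs sixteen 1-of-4 pipelines, the sketch below splits a word into 2-bit pairs and encodes each pair on four wires with exactly one wire high. The function names and the choice of which wire stands for which value are assumptions made for this example, not taken from any of the designs listed.

```python
def encode_1of4(two_bits: int) -> tuple:
    """Encode a 2-bit value as a 1-of-4 symbol: exactly one of four wires is high."""
    if not 0 <= two_bits <= 3:
        raise ValueError("a 1-of-4 symbol carries exactly two bits")
    # Assumed mapping: wire k high means the pair holds the value k.
    return tuple(1 if wire == two_bits else 0 for wire in range(4))


def slice_word(word: int, width: int = 32) -> list:
    """Split a word into width/2 1-of-4 symbols, least significant pair first."""
    return [encode_1of4((word >> shift) & 0b11) for shift in range(0, width, 2)]


def decode_word(symbols) -> int:
    """Rebuild the word from its 1-of-4 symbols."""
    word = 0
    for position, wires in enumerate(symbols):
        if sum(wires) != 1:
            raise ValueError("not a valid 1-of-4 symbol")
        word |= wires.index(1) << (2 * position)
    return word


symbols = slice_word(0xDEADBEEF)
print(len(symbols))                 # number of sub-channels for a 32-bit word
print(hex(decode_word(symbols)))    # round trip back to the original word
```

Each symbol here corresponds to one 2-bit sub-channel in the slides that follow.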
Completion Detection
[Figure: a 16-bit pipeline stage built from 2-bit sub-channels with C-elements (C); completion detectors (CD) combine the acks of the sub-channels into a single ack. Signals: d_i, d_o, ack_i, ack_o, ack]
Advantages: data on all sub-channels are synchronized, which eases time division multiple access (TDMA) techniques, such as virtual channels and TDMA
Drawbacks: low speed (66% on CD)

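
A behavioural sketch of the shared completion detection may help here. It assumes that a 1-of-4 sub-channel counts as holding data when any of its wires is high, and it models the CD as a single many-input C-element, whereas real hardware would use a tree of smaller gates. The class and function names are made up for this example.

```python
class CElement:
    """Muller C-element: the output changes only when all inputs agree."""

    def __init__(self) -> None:
        self.out = 0

    def update(self, inputs) -> int:
        inputs = list(inputs)
        if all(inputs):
            self.out = 1          # every input high: set
        elif not any(inputs):
            self.out = 0          # every input low: clear
        # Mixed inputs: hold the previous output.
        return self.out


def sub_channel_valid(wires) -> int:
    """A 1-of-4 sub-channel holds a symbol when any of its wires is high."""
    return int(any(wires))


class CompletionDetector:
    """Single ack for the whole channel, as in the synchronized pipeline."""

    def __init__(self) -> None:
        self._c = CElement()

    def ack(self, sub_channels) -> int:
        return self._c.update(sub_channel_valid(w) for w in sub_channels)


EMPTY, DATA = (0, 0, 0, 0), (0, 1, 0, 0)
cd = CompletionDetector()
print(cd.ack([DATA] * 15 + [EMPTY]))   # one sub-channel is late: ack stays low
print(cd.ack([DATA] * 16))             # all sub-channels arrived: ack rises
print(cd.ack([EMPTY] * 15 + [DATA]))   # return-to-zero not finished: ack holds
print(cd.ack([EMPTY] * 16))            # all sub-channels empty: ack falls
```

The ack waits for the slowest sub-channel in both phases of the handshake, which is the synchronization the slide lists as an advantage and the delay it lists as a drawback.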
ChSlice: implementation
[Figure: the synchronized 16-bit stage (shared CD, single ack_i/ack_o) next to the sliced version, in which sixteen 2-bit sub-channels each have their own C-elements, data wires d_i 0 to d_i 15 / d_o 0 to d_o 15, and acks ack_i 0 to ack_i 15 / ack_o 0 to ack_o 15]

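
The sketch below models one sliced sub-channel stage with its own ack. It assumes the usual 4-phase arrangement of one C-element per wire, gated by the inverted downstream ack, with the upstream ack taken as the OR of the four latched wires. The slide does not spell out the gate-level design, so treat the class and its behaviour as a hypothetical illustration.

```python
class SlicedStage:
    """One pipeline stage of a single 1-of-4 sub-channel with a private ack."""

    def __init__(self) -> None:
        self.d_o = [0, 0, 0, 0]    # latched wires (C-element outputs)

    def step(self, d_i, ack_i: int) -> int:
        """Apply input wires d_i and downstream ack ack_i, return ack_o."""
        enable = 1 - ack_i         # downstream is ready while its ack is low
        for k in range(4):
            # Two-input C-element: set when both high, clear when both low,
            # otherwise hold the latched value.
            if d_i[k] and enable:
                self.d_o[k] = 1
            elif not d_i[k] and not enable:
                self.d_o[k] = 0
        # The ack depends on this sub-channel only, not on its neighbours.
        return int(any(self.d_o))


EMPTY, DATA = (0, 0, 0, 0), (0, 0, 1, 0)
stages = [SlicedStage() for _ in range(16)]

# Symbols have arrived on sub-channels 0..7 only; the rest are still empty.
acks = [stage.step(DATA if k < 8 else EMPTY, ack_i=0)
        for k, stage in enumerate(stages)]
print(acks)
```

Sub-channels that have received a symbol are acknowledged at once, without waiting for the slower ones, which is the point of removing the shared completion detection.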
How to do it in a router?
[Figure: two router diagrams, each showing arbiters, crossbars, connections to other ports, the data-path and the ack path]

Flow control
[Figure: two timing diagrams of the frame H D D D D D D T. The first shows a single frame over time, with the head routing phase followed by the data phase. The second shows the same frame repeated on each sub-channel. Labels: time, head, routing, data, sub-channels]

Router: structure
[Figure: router with 5 input ports and 5 output ports. Each input port k (0 to 4) has data d_i_k (80) and ack ack_i_k (16); each output port has data d_o_k (80) and ack ack_o_k (16). Arbiter and ctl blocks sit between the ports]

Router: data path
[Figure: data path of one sub-channel through input buffer, crossbar and output buffer, with wires 0 to 3 plus eof at each stage. Data signals: ip_d, ib_d, ic_d, oc_d, ob_d, op_d. Ack signals: ip_a, ib_a, ib_pa, ic_a, oc_a, ob_a, op_a. Control signals: gnt, rt_err, acken]

Re-Synchronization
[Figure: the router data path (input buffer, crossbar, output buffer) with the re-synchronization control added. Signals: ip_d, ib_d, ic_d, oc_d, ob_d, op_d, ip_a, ib_a, ib_pa, ic_a, oc_a, ob_a, op_a, eof, gnt, rt_dec, rt_err, ch_fin, acken]
[Figure: signal transition graph with a normal-frame branch and a faulty-frame branch, over the transitions rt_dec+, rt_err+, rt_err-, ch_fin+/1, ch_fin+/2, ch_fin-/1, ch_fin-/2, eof+/1, eof+/2, eof-/1, eof-/2, acken+/1, acken+/2, acken-/1, acken-/2, ic_a+, ic_a-]

Routing Decision
[Figure: routing logic of the east input port. The four head sub-channels ib_d 0 to ib_d 3 (bits [0..3], with acks ib_a 0 to ib_a 3) supply target_x and target_y. Two 8-bit (4-digit 1-of-4) comparators compare target_x with local_x and target_y with local_y, each producing >, < and = outputs. The results form the requests ir_n, ir_e, ir_w, ir_l to the arbiter, which also receives gnts from other ports. Other signals: gnt_w, gnt_s, gnt_n, gnt_l, or_s, or_w, or_n, or_l, rt_en, rt_dec, rt_err, ch_fin 0 to ch_fin 15, ch_fin_a]
[Figure: signal transition graph with a normal-frame branch and a faulty-frame branch, over the transitions rt_dec+, rt_dec-, rt_err+, rt_err-, rt_en+, rt_en-/1, rt_en-/2, ch_fin_a+/1, ch_fin_a+/2, ch_fin_a-]

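
The comparator outputs above map onto a dimension-ordered decision: the X offset is resolved first, then the Y offset, as in the XY routing algorithm named on the layout slide. The sketch below is a software analogue only. The orientation (x growing towards the east, y growing towards the south) and the names `Port` and `xy_route` are assumptions, and the rt_err (faulty frame) case is not modelled.

```python
from enum import Enum


class Port(Enum):
    NORTH = "n"
    EAST = "e"
    SOUTH = "s"
    WEST = "w"
    LOCAL = "l"


def xy_route(local, target) -> Port:
    """Pick the output port for a frame at router `local` heading to `target`.

    Both arguments are (x, y) pairs. Assumed orientation: x grows towards
    the east and y grows towards the south.
    """
    local_x, local_y = local
    target_x, target_y = target
    # X comparator first: >, < or =.
    if target_x > local_x:
        return Port.EAST
    if target_x < local_x:
        return Port.WEST
    # X matches, so the Y comparator decides.
    if target_y > local_y:
        return Port.SOUTH
    if target_y < local_y:
        return Port.NORTH
    return Port.LOCAL              # both equal: deliver to the local port


print(xy_route((1, 1), (3, 0)))    # X differs, so the frame moves along X first
print(xy_route((1, 1), (1, 1)))    # already at the destination
```

Because only the two coordinates are compared, the decision needs nothing beyond the head sub-channels shown in the figure.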
Router: layout
• Faraday 130nm Technology
• 32-bit, 5 ports, XY routing algorithm
• 0.3x0.3mm (12.6K gates, 0.050mm²)
• Typical corner (25°C, 1.2V)
• Cycle period 2.2 ns (1.82GByte/s per port)
• Equivalent to 450MHz

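
The throughput and frequency figures follow from the cycle period and the data width. A quick check, assuming one 32-bit word is transferred per cycle (the function names are made up for this example):

```python
def equivalent_frequency_mhz(period_ns: float) -> float:
    """Cycles per second, in MHz, for a given cycle period in ns."""
    return 1e3 / period_ns


def throughput_gbyte_per_s(period_ns: float, width_bits: int) -> float:
    """Bytes per nanosecond, which equals GByte/s, assuming one word per cycle."""
    return (width_bits / 8) / period_ns


print(f"{equivalent_frequency_mhz(2.2):.1f} MHz")        # 454.5 MHz
print(f"{throughput_gbyte_per_s(2.2, 32):.2f} GByte/s")  # 1.82 GByte/s
```

The 450MHz on the slide is therefore the rounded reciprocal of the 2.2 ns period.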
Compare with other routers
Router                 Tech (nm)  Period (ns)  Frequency (Hz)  Pipeline style            Other
Sliced Wormhole        130        2.2          450M            4-phase 1-of-4            Standard cell
Synchronized Wormhole  130        2.8          360M            4-phase 1-of-4            Standard cell
ANoC                   130        4.0          250M            4-phase 1-of-4            Customized cell lib
ASPIN                  90         0.88         1.13G           Dual-rail / bundled-data  Customized FIFO
QNoC                   180        4.8          208M            Bundled-data              Delay line
MANGO                  120        1.26         790M            Bundled-data              Delay line
DSPIN                  130        2.45         408M            Synchronous circuit

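
To read the table relative to the sliced router, the periods can be divided by its 2.2 ns. The sketch below only rearranges the numbers in the table. It ignores the technology differences between rows, so the ratios are not a like-for-like comparison, and the names used are chosen for this example.

```python
# Cycle periods in ns, copied from the comparison table.
PERIOD_NS = {
    "Sliced Wormhole": 2.2,
    "Synchronized Wormhole": 2.8,
    "ANoC": 4.0,
    "ASPIN": 0.88,
    "QNoC": 4.8,
    "MANGO": 1.26,
    "DSPIN": 2.45,
}


def relative_period(name: str, baseline: str = "Sliced Wormhole") -> float:
    """Cycle period of `name` divided by the period of `baseline`."""
    return PERIOD_NS[name] / PERIOD_NS[baseline]


for name in PERIOD_NS:
    # A ratio above 1 means a longer cycle than the sliced router.
    print(f"{name:22s} {relative_period(name):.2f}x")
```

For the two rows built in the same technology and cell library, the synchronized wormhole router has a cycle about 1.27 times as long as the sliced one.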
Speed vs. Data Width
[Figure: plot comparing Sliced Wormhole and QNoC]

Speed and Area
[Figure]

Questions?