


  1. BREAKING MULTICAST DEADLOCK BY VIRTUAL CHANNEL ADDRESS/DATA FIFO DECOUPLING Ka-Ming Keung, Akhilesh Tyagi Iowa State University

  2. On-Chip System with On-Chip Network • Many tiles on a chip • Communication among Tiles is supported by 2D Mesh Network

  3. Adaptive Routing • Allows packets to be routed through less congested channels.

  4. Native Multicast Support • Avoids redundant unicast packets • Decreases network load • Reduces packet latency

  5. Adaptive Routing + Native Multicast Support • Allows multicast packets to diverge at dynamically chosen points • Decreases network load

  6. Path Based Adaptive Routing

  7. Valid Path Choice
     No.  Route  Stop 0  Stop 1  Stop 2  Stop 3  OE Viol  Valid
     1    WWNN   (2,2)   (1,2)   (0,2)   (0,3)   Free     Yes
     2    WNWN   (2,2)   (1,2)   (1,3)   (0,3)   Even     No
     3    WNNW   (2,2)   (1,2)   (1,3)   (1,4)   Even     No
     4    NWWN   (2,2)   (2,3)   (1,3)   (0,3)   Odd      Yes
     5    NWNW   (2,2)   (2,3)   (1,3)   (1,4)   Both     No
     6    NNWW   (2,2)   (2,3)   (2,4)   (1,4)   Odd      Yes
     The Odd-Even Turn Model (Chiu et al.) is used to ensure the network is deadlock free. Only routes 1, 4, and 6 are valid; routes 2, 3, and 5 violate the odd-even routing rule.
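
The six candidate routes in the table are the distinct orderings of two west hops and two north hops from (2,2). A minimal Python sketch that regenerates the Route and Stop columns is shown here; the direction convention (W decreases x, N increases y) is read off the table, and the helper names `minimal_routes` and `stops` are hypothetical. It does not evaluate the odd-even rule itself.

```python
from itertools import permutations

# Assumed direction convention, inferred from the table:
# a W hop decreases x, an N hop increases y.
STEP = {"W": (-1, 0), "N": (0, 1)}

def minimal_routes(moves):
    """Return the distinct orderings of the hop letters, W-first order."""
    return sorted({"".join(p) for p in permutations(moves)}, reverse=True)

def stops(source, route):
    """Return the router visited before each hop (Stop 0 .. Stop n-1)."""
    x, y = source
    visited = []
    for hop in route:
        visited.append((x, y))
        dx, dy = STEP[hop]
        x, y = x + dx, y + dy
    return visited

routes = minimal_routes("WWNN")
for number, route in enumerate(routes, start=1):
    print(number, route, stops((2, 2), route))
```

Each printed row should line up with the Route and Stop 0 to Stop 3 columns of the table.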

  8. Path Based Adaptive Routing

  9. Path Selection • Channel congestion ($CC_{x,y,j}$) is measured by the total channel demand ($CD_{x,y,i,j}$) from all router input buffers: $CC_{x,y,j} = CD_{x,y,\mathrm{north},j} + CD_{x,y,\mathrm{east},j} + CD_{x,y,\mathrm{west},j} + CD_{x,y,\mathrm{south},j} + CD_{x,y,\mathrm{local},j}$ • Path congestion ($PC_i$) is the sum of the channel congestion along the path: $PC_i = \sum_{(x,y,j) \in \text{path } i} CC_{x,y,j}$ • Pick the valid path $i$ with the lowest $PC_i$
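
The selection rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the demand values are made up, the dictionary layout keyed by (x, y, input port, output direction) is hypothetical, and each path is represented as the list of (x, y, output direction) channels it uses.

```python
PORTS = ("north", "east", "west", "south", "local")

def channel_congestion(cd, x, y, j):
    """CC_{x,y,j}: sum over the five input ports i of CD_{x,y,i,j}."""
    return sum(cd.get((x, y, i, j), 0) for i in PORTS)

def path_congestion(cd, path):
    """PC_i: sum of CC over the (x, y, j) channels along the path."""
    return sum(channel_congestion(cd, x, y, j) for (x, y, j) in path)

def pick_path(cd, valid_paths):
    """Return the name of the valid path with the lowest PC."""
    return min(valid_paths, key=lambda name: path_congestion(cd, valid_paths[name]))

# Hypothetical channel demands: (x, y, input port, output direction) -> demand.
cd = {
    (2, 2, "local", "west"): 3,
    (1, 2, "east", "west"): 2,
    (2, 2, "local", "north"): 1,
    (2, 3, "south", "west"): 1,
}

# Channels used by the three valid routes (1, 4, 6) of the earlier table.
valid_paths = {
    "WWNN": [(2, 2, "west"), (1, 2, "west"), (0, 2, "north"), (0, 3, "north")],
    "NWWN": [(2, 2, "north"), (2, 3, "west"), (1, 3, "west"), (0, 3, "north")],
    "NNWW": [(2, 2, "north"), (2, 3, "north"), (2, 4, "west"), (1, 4, "west")],
}

best = pick_path(cd, valid_paths)
print(best, {name: path_congestion(cd, p) for name, p in valid_paths.items()})
```

With these made-up demands the path costs are 5, 2, and 1, so the least congested valid path is NNWW.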

  10. Observation Range • Intuition: A bigger observation range leads to better network performance. • A bigger observation range requires: – More congestion status wires from remote routers – A longer cost computation path (potentially affects router clock frequency) – More adders and comparators (higher area cost)

  11. Observation Range, Uniform Traffic Test: • Low-load latency: stays the same • Throughput: – 5x5 is 29% higher than 3x3 – 7x7 is 6% higher than 5x5 – 9x9 is 5.5% higher than 7x7 • Route computation (RC) path: – 5x5 is 219 ps longer than 3x3 – 7x7 is 439 ps longer than 5x5 – 9x9 is 453 ps longer than 7x7 • We pick 5x5 to keep the RC stage from becoming the critical stage [Figure: RC path delay in picoseconds for OR3x3, OR5x5, OR7x7, and OR9x9.]

  12. Virtual Destinations • Not all destinations lie within the observation range • Such destinations are assumed to lie on the observation range boundary
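
One way to picture a virtual destination is to clamp each coordinate of the real destination to the observation range. The sketch below assumes a square range centered on the current router (radius 2 for a 5x5 range) and per-coordinate clamping; the slides do not state the exact mapping, so both are assumptions, and `virtual_destination` is a hypothetical name.

```python
def virtual_destination(current, dest, radius):
    """Clamp dest onto a square observation range centered on current.

    Assumption: the range spans [c - radius, c + radius] in both x and y.
    Destinations already inside the range are returned unchanged.
    """
    cx, cy = current
    dx, dy = dest
    vx = max(cx - radius, min(cx + radius, dx))
    vy = max(cy - radius, min(cy + radius, dy))
    return (vx, vy)

# 5x5 observation range around router (10, 10).
print(virtual_destination((10, 10), (11, 9), 2))   # inside the range
print(virtual_destination((10, 10), (3, 12), 2))   # clamped in x only
print(virtual_destination((10, 10), (17, 1), 2))   # clamped to a corner
```

Under this assumption, a destination that is out of range in both x and y lands on a corner of the range.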

  13. Multicast Adaptive Routing • Objective: Reduce the number of buffer writes by diverging the packet as late as possible

  14. Multicast Adaptive Routing Rule 1 (XY Destinations): • If the packet has destinations to the north, east, west, or south, the packet is routed in the corresponding direction.

  15. Multicast Adaptive Routing Rule 2 (Quadrant Destinations): • In minimal routing, destinations in the quadrants can be routed horizontally (D_h) or vertically (D_v). If the packet has a destination on either D_h or D_v, quadrant destinations are routed in that direction.

  16. Multicast Adaptive Routing Rule 3 (Quadrant Destinations): • Group the destinations that can’t be routed by Rule 2 into a single routing direction.

  17. Multicast Adaptive Routing Rule 4 (Quadrant Destinations): • Destinations that can’t be routed by Rule 3 are routed using unicast adaptive routing to the virtual destination at the corner of the observation range.
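
Rules 1 and 2 can be sketched as follows. This is one reading of the slides, not the authors' implementation: E increases x and N increases y, Rule 2 is taken to mean that a quadrant destination follows a direction already chosen by Rule 1, and ties are broken horizontal-first. All three are assumptions, as are the function names. Rules 3 and 4 are not modeled; destinations they would handle are returned as leftovers.

```python
def classify(current, dest):
    """Return the productive directions toward dest.

    One direction means an XY destination (same row or column);
    two directions (D_h, D_v) mean a quadrant destination.
    """
    cx, cy = current
    dx, dy = dest
    horizontal = "E" if dx > cx else "W" if dx < cx else None
    vertical = "N" if dy > cy else "S" if dy < cy else None
    return tuple(d for d in (horizontal, vertical) if d)

def apply_rules_1_and_2(current, dests):
    """Assign output directions by Rules 1 and 2; return (assignment, leftover)."""
    assignment = {}
    # Rule 1: XY destinations go in their only productive direction.
    for dest in dests:
        dirs = classify(current, dest)
        if len(dirs) == 1:
            assignment[dest] = dirs[0]
    used = set(assignment.values())
    leftover = []
    # Rule 2: a quadrant destination reuses D_h or D_v if Rule 1 already uses it.
    for dest in dests:
        dirs = classify(current, dest)
        if len(dirs) == 2:
            shared = [d for d in dirs if d in used]
            if shared:
                assignment[dest] = shared[0]  # horizontal-first tie-break (assumption)
            else:
                leftover.append(dest)  # left for Rules 3 and 4
    return assignment, leftover

assignment, leftover = apply_rules_1_and_2((2, 2), [(0, 2), (1, 4), (3, 0)])
print(assignment, leftover)
```

Here (0,2) is due west, so (1,4) joins it on the west output instead of forcing an extra copy to the north, while (3,0) shares no direction with an XY destination and is left for the later rules.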

  18. Unicast Deadlock • Lock caused by channel dependence • XY routing is free from unicast deadlock • Previous solutions: 1. Ordered nodes and virtual channels (Dally et al.) 2. West-First, North-Last, and Negative-First (Glass et al.) 3. Odd-Even routing (Chiu)

  19. Multicast Deadlock • Lock caused by channel dependence • Even XY routing can suffer from multicast deadlock • Example: – Tile (1,1) sends multicast packet 55 to Tiles (0,1) and (3,1) – Tile (2,1) sends multicast packet 77 to Tiles (0,1) and (3,1) – Packet 55 does not release (0,1) E until it gets (3,1) W – Packet 77 does not release (3,1) W until it gets (0,1) E
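
The circular wait in this example can be checked mechanically with a wait-for graph. The sketch below is a simplified model written for illustration: each packet holds a set of channels and waits for at most one channel, and the channel labels are plain strings taken from the slide. `find_cycle` is a hypothetical helper.

```python
def find_cycle(holds, waits_for):
    """Return a list of packets forming a wait-for cycle, or None.

    holds:     packet -> set of channels it currently owns
    waits_for: packet -> the single channel it is waiting to acquire
    """
    owner = {ch: pkt for pkt, chans in holds.items() for ch in chans}
    for start in holds:
        seen = [start]
        current = start
        while True:
            wanted = waits_for.get(current)
            blocker = owner.get(wanted)
            if blocker is None:
                break  # the wanted channel is free, or nothing is wanted
            if blocker in seen:
                return seen[seen.index(blocker):]
            seen.append(blocker)
            current = blocker
    return None

# State from the slide: each packet holds the channel the other one needs.
holds = {55: {"(0,1)E"}, 77: {"(3,1)W"}}
waits_for = {55: "(3,1)W", 77: "(0,1)E"}
cycle = find_cycle(holds, waits_for)
print(cycle)

# If packet 77 holds nothing, packet 55 can proceed and no cycle exists.
no_cycle = find_cycle({55: {"(0,1)E"}, 77: set()}, {55: "(3,1)W"})
print(no_cycle)
```

The first call reports the cycle between packets 55 and 77; the second shows that the cycle disappears once one packet gives up its channel.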

  20. Multicast Deadlock Previous Solution 1: • Send four packets to regions (X+,Y+), (X+,Y-), (X-,Y+), and (X-,Y-) separately (Lin et al.)

  21. Multicast Deadlock Previous Solution 2: • Hamiltonian Path: pre-compute a deadlock-free path and store it in the packet header. Routers route the packet along the stored path (Lin et al.)

  22. Multicast Deadlock Previous Solution 3: • Planar Network (Chien et al.): use two sub-networks, X+ and X-. The X+ sub-network carries packets with increasing X coordinate; the X- sub-network carries packets with non-increasing X coordinate.

  23. Multicast Deadlock Simple Solution: • Use virtual cut-through routing instead of wormhole routing. • Router (0,1) East and (2,1) West can store the whole packet 55

  24. Multicast Deadlock • The (0,1) East channel and the (3,1) West channel are empty when the deadlock occurs. • (1,1) out has no new flit for (0,1) East (packet 55) • (2,1) out has no new flit for (3,1) West (packet 77) • The deadlock is broken if packet 55 releases the (0,1) East channel and packet 77 releases the (3,1) West channel

  25. Address-Data FIFO Decoupling

  26. Example: Each virtual channel can store 2 address flits + 2 data flits [Figure: packet 77]

  27. Example: Each virtual channel can store 2 address flits + 2 data flits
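
The storage split in this example can be modeled as a virtual channel with two independent FIFOs. The sketch below covers only the capacities stated on the slide (2 address flits, 2 data flits); the class and method names are hypothetical, and the policy for when a packet releases the channel is not spelled out in the transcript, so it is not modeled.

```python
from collections import deque

class DecoupledVirtualChannel:
    """A virtual channel with separate address and data FIFOs (storage only)."""

    def __init__(self, addr_depth=2, data_depth=2):
        self.depth = {"addr": addr_depth, "data": data_depth}
        self.fifo = {"addr": deque(), "data": deque()}

    def push(self, kind, flit):
        """Store a flit in the FIFO of its kind; return False if that FIFO is full."""
        if len(self.fifo[kind]) >= self.depth[kind]:
            return False
        self.fifo[kind].append(flit)
        return True

    def pop(self, kind):
        """Remove and return the oldest flit of the given kind, or None if empty."""
        return self.fifo[kind].popleft() if self.fifo[kind] else None

vc = DecoupledVirtualChannel()
results = [vc.push("addr", "a0"), vc.push("addr", "a1"), vc.push("addr", "a2")]
print(results)                 # the third address flit does not fit
print(vc.push("data", "d0"))   # the data FIFO is unaffected by the full address FIFO
```

The point of the illustration is that a full address FIFO does not block data flits, and vice versa.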

  28. Packet 77 Received

  29. Synthetic Traffic • Four types of synthetic traffic: – Uniform Traffic – Transpose Traffic: (x,y) → (N-1-y,N-1-x) – Transpose2 Traffic: (x,y) → (y,x) – Tornado Traffic • Multicast Group Size: 10 • Multicast Probability: 5%
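
The two transpose mappings are simple enough to write out directly. In this sketch `n` stands for the mesh dimension N (20 in the experiments); the function names are hypothetical. Uniform and Tornado traffic are omitted because the slides do not give their formulas.

```python
def transpose(x, y, n):
    """Transpose traffic: (x, y) -> (N-1-y, N-1-x)."""
    return (n - 1 - y, n - 1 - x)

def transpose2(x, y, n):
    """Transpose2 traffic: (x, y) -> (y, x). n is unused but kept for a uniform signature."""
    return (y, x)

n = 20  # mesh size used in the experimental setup
print(transpose(3, 5, n))    # (14, 16)
print(transpose2(3, 5, n))   # (5, 3)
```

Both mappings are their own inverses: applying either one twice returns the original tile.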

  30. Experimental Setup • Mesh Size: 20x20 • Flit Size: 128-bit • Simulation Cycle: 30000 • Packet Length: 10 flits • #Virtual Channel: – 3 unicast channels (unicast router) – 2 unicast + 1 multicast channel (multicast router) • Virtual Channel Depth: – 14 (Virtual Cut-Through) – 9 (Address-Data FIFO decoupling)

  31. Throughput [Figure: four charts, Throughput (Uniform), (Transpose), (Transpose2), and (Tornado); y-axis: # flits arrived; x-axis: Unicast, Multicast; series: xy_vc14, xy_vc9, padap_vc14, padap_vc9.]

  32. Low Congestion Latency [Figure: four charts, Low Congestion Latency (Uniform), (Transpose), (Transpose2), and (Tornado); y-axis: cycles; x-axis: Unicast, Multicast; series: xy_vc14, xy_vc9, padap_vc14, padap_vc9.]

  33. Energy Consumption [Figure: four charts, Energy Consumption (Uniform), (Transpose), (Transpose2), and (Tornado); y-axis: pJ per flit arrived; x-axis: Unicast, Multicast; series: xy_vc14, xy_vc9, padap_vc14, padap_vc9.]

  34. Energy Consumption [Figure: four charts, Energy Consumption (Uniform), (Transpose), (Transpose2), and (Tornado); y-axis: pJ per flit arrived; x-axis: # flits arrived; series: xy_vc9_unicast, xy_vc9_multicast, padap_vc9_unicast, padap_vc9_multicast.]

  35. FPGA Traffic • CPU controls the application jobs scheduling and placement • Each tile contains its own configuration bitstream controller

  36. Applications (a) MPEG4 Encoder (b) MPEG4 Decoder (c) MPEG2 Decoder (d) MPEG2 Encoder

  37. Experimental Setup • Mesh Size: 20x20 • Flit Size: 128-bit • Simulation Cycle: 200,000,000 • Virtual Channel Depth: 14 • Max Packet Length: – 10 (Virtual Cut-Through) – 20 (Address-Data FIFO decoupling) • #Virtual Channel: – 3 unicast channels (unicast router) – 2 unicast + 1 multicast channel (multicast router)

  38. Average Tile Configuration Time • Adaptive routing can reduce the configuration time by at most 10% • With address-data decoupling, configuration time can be reduced by at most 25% • Multicast support reduces configuration time by at most 40% [Figure: average tile configuration time in cycles.]

  39. Average Application Runtime • Adaptive routing can reduce the application runtime by at most 6% • With address-data decoupling, application runtime can be reduced by at most 10% • Multicast support reduces application runtime by at most 4% [Figure: average application runtime in cycles.]
