GLOBALLY-SYNCHRONIZED FRAMES FOR GUARANTEED QUALITY-OF-SERVICE IN ON-CHIP NETWORKS


  1. GLOBALLY-SYNCHRONIZED FRAMES FOR GUARANTEED QUALITY-OF-SERVICE IN ON-CHIP NETWORKS. Jae W. Lee (MIT), Man Cheuk Ng (MIT), Krste Asanovic (UC Berkeley). June 23rd, 2008, ISCA-35, Beijing, China.

  2. Resource sharing increases performance variation
  • Resource sharing (+) reduces hardware cost but (-) increases performance variation.
  [Figure: 16 processors (P) sharing L2$ banks and memory controllers over a multi-hop on-chip network]
  • This performance variation grows larger and larger as the number of sharers (cores) increases.
  Jae W. Lee (2 / 33)

  3. Desired quality-of-service from shared resources
  • Performance isolation (fairness)
  [Figure: 16 processors sharing L2$ banks and memory controllers over a multi-hop on-chip network, with a hotspot; bar chart of accepted throughput [MB/s] per processor ID (0-F) against a minimum guaranteed bandwidth line]

  4. Desired quality-of-service from shared resources
  • Performance isolation (fairness)
  • Differentiated services (flexibility)
  [Figure: accepted throughput [MB/s] per processor ID (0-F), shown for both minimum guaranteed bandwidth and differentiated allocation]

  5. Resources with centralized arbitration are well investigated
  • Resources with centralized arbitration: SDRAM controllers, L2 cache banks
  • They have a single entry point for all requests → QoS is relatively easier and well investigated. [MICRO '06] [HPCA '02] [PACT '07] [ICS '04] [USENIX Sec. '07] [ISCA '07] [IBM '07] [MICRO '07] [ISCA '08] ...

  6. QoS from on-chip networks is a challenge
  • Resources with distributed arbitration: multi-hop on-chip networks (on-chip routers)
  • They have distributed arbitration points → QoS is more difficult.
  • Off-chip solutions cannot be directly applied because of resource constraints.

  7. We guarantee QoS for flows
  • Flow: a sequence of packets between a unique pair of end nodes (source and destination)
  • Physical links are shared by flows; each packet goes through multiple stages of arbitration.
  [Figure: 4x4 router mesh; one physical link shared by 3 flows converging on a hotspot resource]
  • We provide guaranteed QoS to each flow: minimum bandwidth guarantees and bounded maximum delay.

  8. Locally fair ≠ globally fair arbitration
  [Figure: sources A-D feeding one destination through a chain of arbitration points 1-3; channel rate = C Gb/s]
  With locally fair round-robin (RR) arbitration:
  • Throughput(Flow A) = (0.5) C
  • Throughput(Flow B) = (0.5)^2 C
  • Throughput(Flow C) = Throughput(Flow D) = (0.5)^3 C
  → The throughput of a flow decreases exponentially as its distance to the destination (hotspot) increases.
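
The exponential throughput collapse above can be reproduced with a few lines of arithmetic. This is a sketch of the slide's example topology (the function name and structure are illustrative, not from the paper): each 2-input round-robin arbiter gives half of its downstream rate to its local flow and forwards the other half upstream.

```python
def rr_chain(channel_rate=1.0):
    """Per-flow throughput in the slide's chain of round-robin arbiters.

    Flow A enters at the arbiter nearest the destination, B at the next
    arbiter upstream, and flows C and D share the farthest arbiter.
    """
    share = channel_rate
    rates = {}
    for flow in ("A", "B"):
        share /= 2                # RR: half the rate to the local flow...
        rates[flow] = share       # ...and half forwarded to the upstream arbiter
    rates["C"] = rates["D"] = share / 2   # farthest arbiter splits the remainder
    return rates
```

Running `rr_chain()` reproduces the slide's 0.5, 0.25, 0.125, 0.125 split of the channel rate C, showing that local fairness at every arbiter still starves distant sources.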

  9. Motivational simulation
  • In an 8x8 2D mesh network with RR arbitration (hotspot at (8, 8)):
  [Figure: accepted throughput [flits/cycle/node] over node index (X, Y), with minimal-adaptive and dimension-ordered routing; throughput falls off sharply with distance from the hotspot]
  → Locally fair round-robin scheduling yields globally unfair bandwidth usage.

  10. Desired bandwidth allocation: an example
  • Taken from simulation results with GSF:
  [Figure: accepted throughput [flits/cycle/node] over node index (X, Y), showing fair allocation and differentiated allocation]

  11. Globally Synchronized Frames (GSF) provide guaranteed QoS, with minimum bandwidth and maximum delay guarantees, to each flow in multi-hop on-chip networks:
  • with high network utilization, comparable to a best-effort virtual-channel router
  • with minimal area/energy overhead, by avoiding per-flow queues/structures in on-chip routers → scalable in the number of concurrent flows

  12. Outline of this talk
  • Motivation
  • Globally-Synchronized Frames: a step-by-step development of the mechanism
  • Implementation of the GSF router
  • Evaluation
  • Related work
  • Conclusion

  13. GSF takes a frame-based approach
  [Figure: a shared physical link carrying flits tagged with frame numbers 0-4 over time]
  • A frame is a coarse quantization of time; the network can transport a finite number of flits during each frame interval.
  • We constrain each flow source to inject at most a certain number of flits per frame.
  • Shorter frames → coarser bandwidth control but lower maximum delay.
  • Typically 1-100s of Kflits per frame (over all flows) in an 8x8 mesh network.
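
The source-side injection constraint can be sketched as a per-flow counter that is replenished at each frame boundary. This is a minimal illustration of the idea, not the paper's hardware design; the class and method names are assumptions.

```python
class FrameRegulator:
    """Source-side injection control for one flow (illustrative sketch)."""

    def __init__(self, flits_per_frame):
        self.quota = flits_per_frame   # reserved injection slots per frame
        self.used = 0                  # flits injected into the current frame

    def try_inject(self):
        """Return True if the flow may inject one more flit this frame."""
        if self.used < self.quota:
            self.used += 1
            return True
        return False                   # quota exhausted: wait for the next frame

    def frame_rollover(self):
        """Called at each frame boundary: the quota is replenished."""
        self.used = 0
```

A flow allocated 2 flits/frame gets exactly two successful `try_inject()` calls per frame, which is the long-term rate GSF guarantees.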

  14. Admission control of flows
  • Admission control: reject a new flow if admitting it would make the network unable to transport all the injected flits within a frame interval.
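
The admission test reduces to a budget check against the frame capacity. A minimal sketch (a real controller would check capacity per link along the new flow's route, not a single aggregate budget; the function name is illustrative):

```python
def admit_flow(allocated_flits, requested_flits, frame_capacity):
    """Accept a new flow only if the total per-frame injection quota,
    including the new request, still fits within one frame interval.

    allocated_flits: per-frame quotas already granted to existing flows
    requested_flits: per-frame quota requested by the new flow
    frame_capacity:  flits the network can transport per frame interval
    """
    return sum(allocated_flits) + requested_flits <= frame_capacity
```

For example, with 300 flits/frame already allocated out of a 400-flit frame, a request for 50 flits/frame is admitted, while a request for 150 is rejected.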

  15. A single frame does not service bursty traffic well
  [Figure: a regulated source and a bursty source injecting over frames 0-5]
  • Both traffic sources have the same long-term rate: 2 flits/frame.
  • Allocating exactly 2 flits/frame penalizes the bursty source.

  16. Overlapping multiple frames to help bursty traffic
  [Figure: a window of 3 concurrently active frames (the head frame plus 2 future frames) sliding over time]
  • Overlapping multiple frames multiplies injection slots: sources can inject flits into future frames (with separate per-frame buffers).
  • Older frames have higher priority for contended channels, so the drain time of the head frame does not change.
  • Future frames can use bandwidth left unclaimed by older frames.
  • Maximum network delay < 3 × (frame interval).
  • Best-effort traffic always has the lowest priority (throughput ↑).
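
The "older frames win" priority rule for a contended channel can be sketched as follows. This is an illustration of the arbitration policy only, not the router microarchitecture; the request representation is an assumption.

```python
def arbitrate(requests, head_frame):
    """Pick the winning request for one contended channel.

    requests: list of (frame, flit) pairs; frame is an absolute frame
    number, or None for best-effort traffic.
    The oldest frame wins; best-effort loses to any framed request,
    so it only drains bandwidth that framed traffic leaves unclaimed.
    """
    def priority(req):
        frame, _ = req
        # Smaller value = higher priority; best-effort is always last.
        return float("inf") if frame is None else frame - head_frame
    return min(requests, key=priority)
```

With `head_frame=3`, a head-frame flit beats a frame-5 flit, and both beat best-effort traffic, which keeps the head frame's drain time unchanged regardless of future-frame load.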

  17. Reclamation of frame buffers
  [Figure: the frame window shifting one frame per epoch; per-frame buffers VC0-VC2 rotate through frames 0-7]
  • Per-frame buffers (at each node) = virtual channels.
  • At every frame window shift, the frame buffers (VCs) associated with the earliest frame in the previous epoch are reclaimed for the new futuremost frame.
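
The VC rotation at a window shift can be sketched as rebinding the buffer that held the drained head frame to the new futuremost frame. A minimal sketch, assuming a simple VC-to-frame mapping (the representation is illustrative, not the paper's hardware):

```python
def shift_window(vc_to_frame, new_frame):
    """On a frame window shift, reclaim the VC that buffered the
    (now drained) oldest frame and rebind it to the new futuremost frame.

    vc_to_frame: maps each virtual channel to its current frame number.
    """
    head_vc = min(vc_to_frame, key=vc_to_frame.get)  # VC holding the oldest frame
    vc_to_frame[head_vc] = new_frame
    return vc_to_frame
```

Successive shifts thus rotate a fixed set of VCs through an unbounded sequence of frame numbers, which is why the router needs only as many per-frame buffers as the window is deep.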

  18. Early reclamation improves network throughput
  [Figure: with early reclamation, the frame window shifts as soon as the head frame drains, rather than at fixed epoch boundaries]
  • Observation: the head frame usually drains much earlier than the frame interval → low buffer utilization.
  • Terminate the head frame early once it is empty: use a global barrier network to confirm that no pending packet in any router or source queue belongs to the head frame.
  • Empty buffers are reclaimed much faster and overall throughput increases (by >30% for the hotspot traffic pattern).
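
The condition the global barrier network establishes is simply "no flit tagged with the head frame remains anywhere". A sketch of that check, assuming flits are (frame, payload) pairs held in router buffers and source queues (the data layout is illustrative):

```python
def head_frame_empty(routers, source_queues, head_frame):
    """Early-termination check: the head frame may be retired only when
    no router buffer and no source queue still holds one of its flits.

    routers, source_queues: lists of buffers, each a list of (frame, payload).
    """
    pending = (flit for node in routers + source_queues for flit in node)
    return all(frame != head_frame for frame, _ in pending)
```

In hardware this is a distributed AND over all nodes (a barrier), not a central scan; the point is that one global bit suffices to trigger the early window shift.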

  19. GSF in action
  • A two-router network example with 3 VCs, carrying flows A-D.
  [Figure: flits of flows A-D spread across VC0 (frame 0), VC1 (frame 1), and VC2 (frame 2); the active frame window covers frames 0-2 of frames 0-5]
