RoGUE: RDMA over Generic Unconverged Ethernet


  1. RoGUE: RDMA over Generic Unconverged Ethernet Yanfang Le with Brent Stephens, Arjun Singhvi, Aditya Akella, Mike Swift

  4. RDMA Overview (diagram: application buffers in user space, kernel, hardware) • Zero copy • Kernel bypass • Protocol offload • Low latency, high throughput, low CPU utilization • RoCE: a protocol that provides RDMA over a lossless Ethernet network

  6. Priority Flow Control (diagram: a pause frame sent between a server/switch and a switch/server) • RoCE assumes the Ethernet network is lossless, which is achieved by enabling Priority Flow Control (PFC).

  12. Motivation (diagrams: HOL blocking, unfairness) • Data center providers are reluctant to enable PFC; instead, they isolate RDMA traffic from TCP traffic • RDMA has not seen the uptake it deserves

  18. Can we run RDMA over a generic Ethernet network without any reliance on PFC, yet retain low latency and low CPU utilization? • RoCE + PFC: congestion control, no packet drop • RoGUE: congestion control, retransmission

  22. RoCE Overview (diagram: the RDMA app posts a verb to a queue pair (QP), made up of a send queue and a receive queue, on the RNIC; the RNIC signals the CPU through the completion queue)

  23. Where to fix: HW or SW? • Software: ✅ Easy to implement ❌ Packet-level congestion signals are unavailable ❌ High CPU utilization if per-packet operations are needed • Hardware: ✅ Low CPU utilization, low latency ❌ Requires working with the NIC vendor ❌ Heterogeneous network hardware with non-standard protocol implementations ❌ Complicates network evolution

  28. RoGUE Overview • Congestion control on the CPU: congestion control loop, CPU-efficient segmenting • Congestion control on the RNIC: hardware timestamp to measure RTT, hardware rate limiter to pace packets • Loss recovery on the CPU: shadow queue pair • Loss recovery on the RNIC: hardware retransmission

  33. Congestion Signal (diagram: sender, switch, receiver; packets from different flows share the switch; the sender measures RTT from ACKs) • If RTT is high, the queue is building up: reduce the sending rate • If RTT is low, the network is idle: increase the sending rate
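
The RTT rule on this slide can be sketched as a small rate-update function. This is a minimal illustration, not the authors' algorithm: the thresholds `low_us` and `high_us`, the additive step, and the multiplicative backoff factor are all hypothetical values chosen for the example.

```python
def adjust_rate(rate_gbps, rtt_us, low_us=50.0, high_us=200.0,
                step_gbps=0.5, backoff=0.8, line_rate_gbps=10.0):
    """Return a new sending rate from one RTT sample.

    All thresholds and step sizes are assumptions for illustration.
    """
    if rtt_us > high_us:
        # High RTT: the switch queue is building up, so slow down.
        return max(rate_gbps * backoff, 0.01)
    if rtt_us < low_us:
        # Low RTT: the network is idle, so speed up (capped at line rate).
        return min(rate_gbps + step_gbps, line_rate_gbps)
    # RTT between the two thresholds: hold the current rate.
    return rate_gbps


if __name__ == "__main__":
    rate = 5.0
    for sample_us in (20.0, 30.0, 250.0, 120.0):
        rate = adjust_rate(rate, sample_us)
        print(f"rtt={sample_us:6.1f} us -> rate={rate:.2f} Gbps")
```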

  36. CPU Efficient Segmenting (diagram: the host posts Verbs 1 to 6 to the RNIC and receives Signals 1 to 3) • Two key questions: How large a verb should RoGUE send? How often should the RNIC be signaled? • Small verb (< 64KB): signal every 64KB; CPU utilization < 20% • Large verb (>= 64KB): chunk, and signal every 64KB; CPU utilization < 10%
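
One way to read the segmenting rule is as a planner that turns a list of verb sizes into work requests, each marked as signaled or not. The sketch below is an assumption-laden illustration: the function name, the tuple layout, and the choice to signal the final short chunk of a large verb are hypothetical and not taken from the slides.

```python
CHUNK = 64 * 1024  # 64KB, the segment and signaling size named on the slide


def plan_work_requests(verb_sizes, chunk=CHUNK):
    """Return a list of (size_bytes, signaled) work requests.

    Large verbs (>= chunk) are split into chunk-sized pieces, each signaled.
    Small verbs are posted whole, and a completion signal is requested once
    the unsignaled bytes posted so far add up to at least one chunk.
    """
    plan = []
    unsignaled = 0
    for size in verb_sizes:
        if size >= chunk:
            remaining = size
            while remaining > 0:
                piece = min(chunk, remaining)
                plan.append((piece, True))  # assumption: tail piece is signaled too
                remaining -= piece
            unsignaled = 0
        else:
            unsignaled += size
            signaled = unsignaled >= chunk
            plan.append((size, signaled))
            if signaled:
                unsignaled = 0
    return plan


if __name__ == "__main__":
    # Four 16KB verbs: only the fourth asks for a completion signal.
    print(plan_work_requests([16 * 1024] * 4))
    # One 128KB verb: two 64KB chunks, each signaled.
    print(plan_work_requests([128 * 1024]))
```

Signaling per 64KB rather than per verb or per packet is what keeps the number of completions the CPU must process low.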

  42. RTT measurement (diagram: the host enqueues Verb 1 at $T_{enc\_s1}$ and Verb 2 at $T_{enc\_s2}$; the RNIC sends each verb's packets, Send Ack 1 and Send Ack 2 come back, and Signals 1 and 2 arrive at $T_{comp\_s1}$ and $T_{comp\_s2}$) • $T_{start\_si} = \max(\text{time Verb } i \text{ is enqueued},\ \text{time the last packet of Verb } i-1 \text{ goes out of the NIC})$ • $RTT_i = T_{comp\_si} - T_{start\_si} - \text{bytes}/\text{rate\_limit}$ • RTT is measured by hardware timestamp.
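
The two formulas on this slide can be worked through numerically. The sketch below assumes all timestamps are in seconds and the rate limit is in bytes per second; the function and parameter names are hypothetical, and how the departure time of the previous verb's last packet is obtained is left outside the example.

```python
def rtt_sample(t_enqueue, t_prev_last_packet_out, t_completion,
               nbytes, rate_limit_bytes_per_s):
    """Compute one RTT sample for verb i.

    T_start = max(enqueue time of verb i,
                  time the last packet of verb i-1 left the NIC)
    RTT     = T_comp - T_start - bytes / rate_limit
    """
    t_start = max(t_enqueue, t_prev_last_packet_out)
    # Time the rate limiter needs to pace the verb's bytes onto the wire.
    pacing_time = nbytes / rate_limit_bytes_per_s
    return t_completion - t_start - pacing_time


if __name__ == "__main__":
    # A 64KB verb paced at 10 Gbps (1.25e9 bytes/s), example timestamps only.
    rtt = rtt_sample(t_enqueue=0.0,
                     t_prev_last_packet_out=10e-6,
                     t_completion=100e-6,
                     nbytes=64 * 1024,
                     rate_limit_bytes_per_s=1.25e9)
    print(f"RTT sample: {rtt * 1e6:.2f} us")
```

Subtracting `bytes / rate_limit` removes the time spent pacing the verb itself, so what remains reflects propagation and queueing delay rather than the verb's size.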

  47. Congestion Response • Similar to TCP Vegas and Timely • If congestion window >= 64KB: window-based + rate limiter • If congestion window < 64KB: rate limiter only • Rate limiter is offloaded to the RNIC
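
The switch between the two enforcement modes can be sketched as follows. This is an illustration under stated assumptions: deriving the pacing rate as `cwnd / RTT` is a common convention but is not stated on the slide, and the function name and returned dictionary are hypothetical.

```python
WINDOW_THRESHOLD = 64 * 1024  # 64KB, the boundary named on the slide


def enforcement(cwnd_bytes, rtt_s):
    """Pick how a congestion window is enforced.

    Assumption: the rate handed to the RNIC rate limiter is cwnd / RTT.
    """
    rate = cwnd_bytes / rtt_s  # bytes per second
    if cwnd_bytes >= WINDOW_THRESHOLD:
        # Large window: limit bytes in flight and also pace with the limiter.
        return {"mode": "window+rate",
                "window_bytes": cwnd_bytes,
                "rate_bytes_per_s": rate}
    # Small window: pace with the rate limiter only, with no window limit.
    return {"mode": "rate-only",
            "window_bytes": None,
            "rate_bytes_per_s": rate}


if __name__ == "__main__":
    print(enforcement(128 * 1024, 100e-6))
    print(enforcement(32 * 1024, 100e-6))
```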

  48. Evaluation • Mellanox ConnectX-3 Pro 10Gbps RNICs, DCQCN • Baselines: DCTCP, DCQCN

  49. Evaluation-Cluster Experiments • Each of 16 hosts generates 1MB RPCs to random destinations and sends a 1KB RPC once every ten 1MB RPCs [Figure: (a) Large RPCs (1MB), median flow completion time (ms) vs. network load (%); (b) Small RPCs (1KB), 90th percentile flow completion time (us) vs. network load (%); series: RoGUE, RoCE (w/ DCQCN), DCTCP]

  50. Evaluation-Congestion Response [Figure: throughput (Gbps) vs. time (s) for flows 0 to 4]

  51. Evaluation-CPU Utilization [Figure: CPU utilization (%) on client and server for DCTCP, RoCE (READ RC), and RoGUE (READ RC)]

  52. Summary • It is possible to support RoCE without relying on PFC • A judicious division of labor between SW and HW handles congestion control and retransmission while retaining low CPU utilization • RoGUE's CC supports the RC and UC transport types • Evaluation results show that RoGUE performs competitively with native RoCE
