Elastic RSS: Co-Scheduling Packets and Cores Using Programmable NICs - PowerPoint PPT Presentation



  1. Elastic RSS: Co-Scheduling Packets and Cores Using Programmable NICs. Alexander Rucker, Tushar Swamy, Muhammad Shahbaz, and Kunle Olukotun. Stanford University. August 17, 2019.

  2. How do we meet tail latency constraints?

  3. Existing systems have several limitations. Random Hashing: • Load imbalance • Over-provisioned. Centralized Scheduling: • Dedicated core • Limited throughput. [Diagram: a NIC hashing packets directly across host cores]

  4. Existing systems have several limitations. Random Hashing: • Load imbalance • Over-provisioned. Centralized Scheduling: • Dedicated core • Limited throughput. [Diagram: adds a dedicated scheduler core (NIC Sched) behind the NIC]

  5. How do we scalably & CPU-efficiently meet tail latency constraints?

  6. eRSS uses all cores for useful work and runs at line rate. [Diagram: eRSS on the NIC distributing packets across all cores]

  7. Design

  8. eRSS’s packet processing maps to a PISA NIC with map-reduce extensions. [Diagram: Programmable NIC with Parser → Match-Action Pipeline → Map-Reduce Block → Match-Action Pipeline → Deparser operating on the PHV, plus an on-chip core (ARM or PowerPC); packets exit to the host CPUs]

  9. Step 1: Assign each packet to an application. • For example, use the IP address or port number. [Diagram: as in slide 8]
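
This step behaves like an exact-match rule table. A minimal Python sketch follows; the table contents, the application names, and the function name are hypothetical, since the slide only says that the IP address or port number can be used:

    # Hypothetical exact-match rules, standing in for a match-action table
    # keyed on packet header fields (here, destination IP and port).
    APP_TABLE = {
        ("10.0.0.1", 11211): "memcached",
        ("10.0.0.2", 6379): "redis",
    }

    def assign_application(dst_ip, dst_port):
        # Unmatched packets fall through to a default application.
        return APP_TABLE.get((dst_ip, dst_port), "default")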

  10. Step 2: Estimate the per-packet workload. • Can use any set of packet header fields (currently, only packet size). • Model is periodically trained by the CPU. [Diagram: adds a Workload Estimation (per application) stage to the pipeline]
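
The slide does not state the model form; a linear model in packet size is one plausible reading, sketched below in Python. The coefficients are placeholders that the CPU’s periodic training would overwrite:

    class WorkloadModel:
        # Estimated cycles = fixed cost + per-byte cost * packet size.
        def __init__(self, fixed_cycles=500.0, cycles_per_byte=2.0):
            # Both coefficients would be refreshed by periodic CPU training.
            self.fixed_cycles = fixed_cycles
            self.cycles_per_byte = cycles_per_byte

        def estimate(self, packet_size):
            return self.fixed_cycles + self.cycles_per_byte * packet_size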

  11. Step 3: Determine the core count for the application. • Compare allocated cores to an exponential moving average of the workload. • Use heuristics and hysteresis to avoid ringing. [Diagram: adds a Core Allocation (per application) stage]
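
A Python sketch of this control loop, under assumed constants: the EWMA factor, the thresholds, and the per-core capacity below are illustrative, and the talk does not specify its heuristics beyond EWMA plus hysteresis:

    ALPHA = 0.1        # EWMA smoothing factor (illustrative)
    UP_THRESH = 0.9    # add a core above this utilization
    DOWN_THRESH = 0.5  # remove a core below this utilization

    class CoreAllocator:
        def __init__(self, core_capacity):
            self.core_capacity = core_capacity  # work one core absorbs per interval
            self.ewma = 0.0
            self.cores = 1

        def update(self, interval_workload):
            # Exponential moving average of the estimated workload.
            self.ewma = ALPHA * interval_workload + (1 - ALPHA) * self.ewma
            utilization = self.ewma / (self.cores * self.core_capacity)
            # Hysteresis: the add and remove thresholds are far apart, so
            # small fluctuations around either one do not cause ringing.
            if utilization > UP_THRESH:
                self.cores += 1
            elif utilization < DOWN_THRESH and self.cores > 1:
                self.cores -= 1
            return self.cores

The wide gap between the two thresholds also means cores are released more slowly than they are added, consistent with the slow deallocation shown in slide 26.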

  12. Step 4: Select a virtual core. • Virtual cores within each application are allocated densely, starting at 0. • Packets are hashed & the best allocated core is chosen. [Diagram: adds a Consistent Hashing with Weights (per application’s virtual core) stage]
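
The talk does not name the consistent-hashing variant; the Python sketch below uses weighted rendezvous (highest-random-weight) hashing, one standard scheme with the property this step needs: each flow hash deterministically picks its highest-scoring allocated core, and reweighting one core only moves flows involving that core:

    import hashlib
    import math

    def pick_virtual_core(flow_hash, weights):
        # weights[i] is the hashing weight of virtual core i; the cores are
        # the dense range 0..len(weights)-1, matching the dense allocation.
        best_core, best_score = 0, float("-inf")
        for core, weight in enumerate(weights):
            digest = hashlib.sha1(b"%d:%d" % (flow_hash, core)).digest()
            u = (int.from_bytes(digest[:8], "big") + 1) / 2.0**64  # in (0, 1]
            # Weighted rendezvous score: the highest score wins, and a larger
            # weight makes a core proportionally more likely to win.
            score = float("inf") if u >= 1.0 else -weight / math.log(u)
            if score > best_score:
                best_core, best_score = core, score
        return best_core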

  13. Step 5: Estimate queue depths. • Queues are estimated per virtual core. • Estimates are used to adjust the consistent-hashing weights (weights updated within 10 µs). [Diagram: adds a Queue-Depth Estimation (per application’s virtual core) stage that feeds weight updates back to the hashing stage]
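
A Python sketch of one way to maintain these estimates: each packet charges its estimated workload to its queue, a periodic update on the slide’s ~10 µs cadence decays each queue by an assumed drain rate, and weights fall off with depth. The decay model and the weight formula are assumptions, not the talk’s exact design:

    class QueueEstimator:
        def __init__(self, n_virtual_cores, drain_per_update):
            self.depths = [0.0] * n_virtual_cores
            self.drain_per_update = drain_per_update  # est. service per 10 µs

        def on_packet(self, vcore, estimated_work):
            # Charge the packet's estimated workload to its queue.
            self.depths[vcore] += estimated_work

        def update_weights(self):
            # Runs on the ~10 µs cadence: decay each queue by its estimated
            # drain, then weight cores inversely to the remaining depth so
            # that deeper queues receive fewer new packets.
            weights = []
            for i, depth in enumerate(self.depths):
                self.depths[i] = max(0.0, depth - self.drain_per_update)
                weights.append(1.0 / (1.0 + self.depths[i]))
            return weights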

  14. Step 6: Map the virtual core to a physical core. • CPU assigns each physical core to an application as an active/slack core. • Look up ⟨Application, Virtual Core⟩ → Physical Core in a match-action table. [Diagram: adds a V2P Core Mapping (per application) stage]
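
In Python, this lookup is just a keyed table; the dictionary below stands in for the NIC’s match-action table, and its entries are hypothetical. The CPU would install one entry per ⟨application, virtual core⟩ pair as it hands out physical cores:

    V2P_TABLE = {
        ("memcached", 0): 4,  # hypothetical entries installed by the CPU
        ("memcached", 1): 5,
        ("redis", 0): 6,
    }

    def to_physical_core(app, virtual_core):
        return V2P_TABLE[(app, virtual_core)]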

  15. Step 1: An application needs additional headroom. [Timeline diagram: host CPUs move between Run: Batch, Sleep: Server, and Run: Server, driven by Linux scheduler ticks, NIC software core allocation, NIC interrupts, and NIC deallocation; the eRSS Manager coordinates App1’s cores]

  16. Step 2: The core is initially running a batch job. [Timeline diagram: as in slide 15]

  17. Step 3: The software manager starts and pins a sleeping thread to the core. [Timeline diagram: as in slide 15]

  18. Step 4: When the NIC allocates a core, it wakes up the resident thread with an interrupt. [Timeline diagram: as in slide 15]

  19. Step 5: Cores can run any server software, including distributed work stealing or preemption. [Timeline diagram: as in slide 15]

  20. Step 6: Upon deallocation, the packet thread sleeps and the OS schedules a batch job. [Timeline diagram: as in slide 15]
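
Steps 1-6 above describe the life of one resident thread. A minimal Python sketch of that lifecycle, using a threading.Event as a stand-in for the NIC interrupt and caller-supplied callbacks for the allocation check and the server’s poll loop (all names here are illustrative):

    import threading
    from typing import Callable

    def resident_thread(core_id: int,
                        wake: threading.Event,
                        is_allocated: Callable[[int], bool],
                        serve_one: Callable[[], None]) -> None:
        # Pinned to its core by the eRSS manager. While it sleeps, the OS is
        # free to schedule batch jobs on the same core.
        while True:
            wake.wait()                # NIC interrupt: Sleep: Server -> Run: Server
            while is_allocated(core_id):
                serve_one()            # poll the NIC and process packets
            wake.clear()               # NIC dealloc.: back to sleep; the OS
                                       # resumes batch work on this core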

  21. Preliminary Evaluation

  22. We simulate eRSS’s performance on a synthetic model. • Packets have Poisson-distributed inter-arrival times. • Packet sizes are representative of Internet traffic. • Packet processing time correlates with size, plus added noise.
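
A Python sketch of this synthetic workload; the size mix and the service-time coefficients below are illustrative stand-ins, not the parameters used in the evaluation:

    import random

    def synthetic_packets(n_packets, rate_pps):
        # Poisson arrivals: exponential inter-arrival times at rate_pps.
        t = 0.0
        for _ in range(n_packets):
            t += random.expovariate(rate_pps)
            # Crude Internet-like size mix (bytes).
            size = random.choices([64, 576, 1500], weights=[5, 3, 2])[0]
            # Service time grows with size, with Gaussian noise (µs).
            service_us = max(0.0, 0.5 + 0.002 * size + random.gauss(0.0, 0.1))
            yield t, size, service_us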

  23. eRSS responds quickly to load variations. [Plot: cores allocated over time (ms) for RSS, with requested traffic (Gbps) overlaid]

  24. eRSS responds quickly to load variations. [Plot: adds eRSS-a (90% load) alongside RSS]

  25. eRSS responds quickly to load variations. [Plot: adds eRSS-c (75% load) alongside eRSS-a (90% load) and RSS]

  26. eRSS deallocates slowly to ensure queues are drained. [Plot: cores allocated over time (ms) for RSS, eRSS-a (90% load), and eRSS-c (75% load), with requested traffic (Gbps)]

  27. eRSS adds controllable tail latency. [Plot: latency CDF (µs, log scale) for RSS, eRSS-a (90% load), and eRSS-c (75% load), with the SLO marked]

  28. Future Work & Summary

  29. eRSS will be extended with ML. • Workload estimation: use packet header fields and deep packet inspection to gather statistics; efficient core scheduling requires accurate workload estimates. • Core scheduling with Reinforcement Learning (RL): replace the heuristics for adding/removing an application’s cores, and replace consistent hashing for distributing packets between cores.

  30. eRSS meets tail latency constraints while saving cores. • Parameters control the trade-off between core use and tail latency. • eRSS runs at line rate using slight extensions to existing NICs. • eRSS is compatible with a variety of software solutions. • eRSS can be extended with ML for automatic operation.

  31. eRSS scalably & CPU-efficiently meets tail latency constraints. Questions?

  32. eRSS adds a controllable amount of additional queue depth. [Plot: deepest queue (kiB) over time (ms) for RSS, eRSS-a (90% load), and eRSS-c (75% load), with requested traffic (Gbps)]

  33. eRSS minimizes breaking flows. [Plot: CDF of break counts for eRSS-a (90% load) and eRSS-c (75% load)]
