Elastic RSS: Co-Scheduling Packets and Cores Using Programmable NICs
Alexander Rucker, Tushar Swamy, Muhammad Shahbaz, and Kunle Olukotun
Stanford University
August 17, 2019
How do we meet tail latency constraints?
Existing systems have several limitations.
Random Hashing:
• Load imbalance
• Over-provisioned
Centralized Scheduling:
• Dedicated core
• Limited throughput
[Figure: a NIC hashing packets directly to cores vs. a NIC feeding a dedicated scheduler core.]
How do we scalably & CPU-efficiently meet tail latency constraints?
eRSS uses all cores for useful work and runs at line rate.
Design
eRSS’s packet processing maps to a PISA NIC with map-reduce extensions.
[Figure: Programmable NIC: Parser → PHV → Match-Action Pipeline → Map-Reduce Block → Match-Action Pipeline → Deparser → Host CPUs, with an on-chip core (ARM or PowerPC).]
1. Assign each packet to an application.
• For example, use the IP address or port number.
[Figure: NIC pipeline as above.]
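A minimal sketch of this classification step, assuming a hypothetical exact-match table keyed on destination address and port; the table contents and the `classify` helper are illustrative, not from the talk:

```python
# Hypothetical exact-match table, populated by the CPU:
# (dst_ip, dst_port) -> application ID.
APP_TABLE = {
    ("10.0.0.1", 80): 0,     # e.g., a web server
    ("10.0.0.2", 11211): 1,  # e.g., a memcached instance
}
DEFAULT_APP = 0

def classify(dst_ip: str, dst_port: int) -> int:
    """Assign a packet to an application by destination IP and port."""
    return APP_TABLE.get((dst_ip, dst_port), DEFAULT_APP)
```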
2. Estimate the per-packet workload.
• Can use any set of packet header fields (currently, only packet size).
• Model is periodically trained by the CPU.
[Figure: adds a per-application Workload Estimation block to the pipeline.]
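A sketch of one plausible per-application estimator, assuming a linear model on packet size whose coefficients the CPU refits periodically; the talk states only that packet size is used and that the model is CPU-trained, so the linear form is an assumption:

```python
class WorkloadModel:
    """Per-application estimate of a packet's processing cost."""

    def __init__(self, cost_per_byte: float, fixed_cost: float):
        # Coefficients are periodically refit by the CPU from observed
        # service times and pushed down to the NIC.
        self.cost_per_byte = cost_per_byte
        self.fixed_cost = fixed_cost

    def estimate(self, pkt_size: int) -> float:
        # On the NIC, this reduces to a multiply-add per packet.
        return self.cost_per_byte * pkt_size + self.fixed_cost
```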
3. Determine core count for the application.
• Compare allocated cores to an exponential moving average of workload.
• Use heuristics and hysteresis to avoid ringing.
[Figure: adds a per-application Core Allocation block.]
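A sketch of EMA-plus-hysteresis allocation under stated assumptions: hysteresis is modeled as two utilization thresholds separated by a dead band, and all constants are illustrative rather than the talk's tuned values:

```python
class CoreAllocator:
    """EMA-based core allocation with hysteresis to avoid ringing."""

    def __init__(self, core_capacity: float, alpha: float = 0.05,
                 up_thresh: float = 0.9, down_thresh: float = 0.6):
        self.core_capacity = core_capacity  # work one core absorbs per window
        self.alpha = alpha                  # EMA smoothing factor
        self.up_thresh = up_thresh          # grow when utilization exceeds this
        self.down_thresh = down_thresh      # shrink only when it falls below this
        self.ema = 0.0
        self.cores = 1

    def update(self, work_this_window: float) -> int:
        self.ema = (1 - self.alpha) * self.ema + self.alpha * work_this_window
        utilization = self.ema / (self.cores * self.core_capacity)
        if utilization > self.up_thresh:
            self.cores += 1      # allocate quickly under rising load
        elif utilization < self.down_thresh and self.cores > 1:
            self.cores -= 1      # deallocate slowly, letting queues drain
        return self.cores
```

The dead band between the two thresholds is what prevents ringing: small oscillations in load land inside the band and change nothing.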
4. Select a virtual core.
• Virtual cores within each application are allocated densely, starting at 0.
• Packets are hashed & the best allocated core is chosen.
[Figure: adds a per-virtual-core Consistent Hashing with Weights block.]
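One concrete realization of weighted consistent hashing over the densely numbered virtual cores, sketched as weighted rendezvous (highest-random-weight) hashing; the talk does not name the exact scheme, so treat this as an assumption:

```python
import hashlib
import math

def select_vcore(flow_key: bytes, weights: list[float]) -> int:
    """Pick a virtual core via weighted rendezvous hashing.

    weights[i] is the hashing weight of virtual core i (densely numbered
    from 0); a larger weight attracts more flows.
    """
    best_vcore, best_score = 0, float("-inf")
    for vcore, w in enumerate(weights):
        if w <= 0:
            continue
        h = hashlib.sha256(flow_key + vcore.to_bytes(2, "big")).digest()
        u = (int.from_bytes(h[:8], "big") + 0.5) / 2**64  # uniform in (0, 1)
        score = w / -math.log(u)  # weighted rendezvous score
        if score > best_score:
            best_vcore, best_score = vcore, score
    return best_vcore
```

Rendezvous hashing keeps most flows pinned when weights shift, which is the consistency property that minimizes broken flows.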
5. Estimate queue depths.
• Queues are estimated per virtual core.
• Estimates are used to adjust consistent hashing weights.
[Figure: adds a per-virtual-core Queue-Depth Estimation block; weights are updated within ~10 µs.]
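A sketch of the queue-depth estimate and the weight update it drives, assuming enqueued cost accumulates per virtual core and drains at an estimated service rate, with weights falling off inversely with depth; both the drain model and the weight rule are assumptions:

```python
class QueueEstimator:
    """Tracks an estimated queue depth per virtual core on the NIC."""

    def __init__(self, n_vcores: int, service_rate: float):
        self.depth = [0.0] * n_vcores
        self.service_rate = service_rate  # estimated work drained per µs

    def enqueue(self, vcore: int, est_cost: float) -> None:
        self.depth[vcore] += est_cost

    def drain(self, elapsed_us: float) -> None:
        for i, d in enumerate(self.depth):
            self.depth[i] = max(0.0, d - self.service_rate * elapsed_us)

    def weights(self) -> list[float]:
        # Recomputed every ~10 µs: shallower queues get larger hashing
        # weights, steering new flows toward them.
        return [1.0 / (1.0 + d) for d in self.depth]
```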
6. Map the virtual core to a physical core.
• CPU assigns each physical core to an application as an active/slack core.
• Look up ⟨Application, Virtual Core⟩ → Physical Core in a match-action table.
[Figure: adds a per-application V2P Core Mapping block.]
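The mapping itself is a single exact-match lookup; a sketch assuming a CPU-maintained table with illustrative entries (slack cores are pre-installed so the NIC can grow an application without waiting for the CPU):

```python
# CPU-maintained match-action table:
# (application, virtual core) -> physical core.
V2P_TABLE = {
    (0, 0): 2,  # app 0, vcore 0 -> physical core 2 (active)
    (0, 1): 5,  # app 0, vcore 1 -> physical core 5 (slack, wakes on first use)
    (1, 0): 7,
}

def v2p(app: int, vcore: int) -> int:
    """Look up <Application, Virtual Core> -> Physical Core."""
    return V2P_TABLE[(app, vcore)]
```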
On the host, eRSS hands cores between batch and server work:
1. An application needs additional headroom.
2. The core is initially running a batch job.
3. The software manager starts and pins a sleeping thread to the core.
4. When the NIC allocates a core, it wakes up the resident thread.
5. Cores can run any server software, including distributed work stealing or preemption.
6. Upon deallocation, the packet thread sleeps and the OS schedules a batch job.
[Figure: per-core timeline: Run: Batch → Sleep: Server → NIC interrupt → Run: Server → NIC dealloc → Run: Batch, with the Linux scheduler tick, NIC SW core allocation, and the eRSS Manager shown alongside the host CPUs.]
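A minimal host-side sketch of steps 3, 4, and 6, assuming a Linux-style manager with one pinned worker thread per manageable core; the `CoreWorker` structure, the event-based wakeup, and the `_serve_one` placeholder are assumptions, not the talk's implementation:

```python
import os
import threading

class CoreWorker:
    """One pinned thread per core: sleeps until the NIC allocates the core."""

    def __init__(self, core_id: int):
        self.core_id = core_id
        self.allocated = threading.Event()  # set by the NIC-interrupt handler
        self.thread = threading.Thread(target=self._loop, daemon=True)

    def start(self) -> None:
        self.thread.start()

    def _loop(self) -> None:
        os.sched_setaffinity(0, {self.core_id})  # pin to the managed core
        while True:
            self.allocated.wait()    # step 3: sleep; the OS runs batch jobs
            while self.allocated.is_set():
                self._serve_one()    # step 5: run any server software
            # step 6: event cleared on NIC dealloc; fall back to sleeping,
            # returning the core to the Linux scheduler's batch work.

    def _serve_one(self) -> None:
        pass  # placeholder: poll the NIC queue and process one packet

def on_nic_interrupt(worker: CoreWorker) -> None:
    """Step 4: the NIC's allocation interrupt wakes the resident thread."""
    worker.allocated.set()
```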
Preliminary Evaluation
We simulate eRSS’s performance on a synthetic model.
• Packets have Poisson-distributed inter-arrival times.
• Packet sizes are representative of Internet traffic.
• Packet processing time correlates with size, with added noise.
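A sketch of a workload generator matching that description; the mean rate, the bimodal size distribution, and the noise level are illustrative assumptions, not the talk's parameters:

```python
import random

def synthetic_packets(n: int, rate_pps: float = 1e6, seed: int = 0):
    """Yield (arrival_time_us, size_bytes, service_time_us) tuples."""
    rng = random.Random(seed)
    t = 0.0
    for _ in range(n):
        t += rng.expovariate(rate_pps / 1e6)        # Poisson arrivals (µs)
        # Bimodal sizes as a stand-in for Internet traffic: many small
        # packets, some MTU-sized ones.
        size = 64 if rng.random() < 0.6 else 1500
        service = 0.002 * size + rng.gauss(0, 0.1)  # size-correlated + noise
        yield t, size, max(service, 0.05)
```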
eRSS responds quickly to load variations.
[Figure: cores allocated vs. time (ms) under varying requested traffic (Gbps), comparing RSS, eRSS-a (90% load), and eRSS-c (75% load).]
eRSS deallocates slowly to ensure queues are drained.
[Figure: same comparison as above.]
eRSS adds controllable tail latency.
[Figure: latency CDF (µs, log scale) for RSS, eRSS-a (90% load), and eRSS-c (75% load), with the SLO marked.]
Future Work & Summary
eRSS will be extended with ML.
• Workload estimation
  • Use packet header fields and deep packet inspection to gather statistics.
  • Efficient core scheduling requires accurate workload estimates.
• Core scheduling with Reinforcement Learning (RL)
  • Replace heuristics for adding/removing cores to an application.
  • Replace consistent hashing for distributing packets between cores.
eRSS meets tail latency constraints while saving cores.
• eRSS runs at line rate using slight extensions to existing NICs.
• Parameters control the trade-off between core use and tail latency.
• eRSS is compatible with a variety of software solutions.
• eRSS can be extended with ML for automatic operation.
eRSS scalably & CPU-efficiently meets tail latency constraints. Questions?
eRSS adds a controllable amount of additional queue depth.
[Figure: deepest queue (KiB) vs. time (ms) under varying requested traffic (Gbps), comparing RSS, eRSS-a (90% load), and eRSS-c (75% load).]
eRSS minimizes breaking flows.
[Figure: CDF of per-flow break counts for eRSS-a (90% load) and eRSS-c (75% load).]