Software Data Planes: You Can’t Always Spin to Win
Hossein Golestani, Amirhossein Mirhosseini, Thomas F. Wenisch
University of Michigan
ACM Symposium on Cloud Computing (SoCC), November 22, 2019
adacenter.org, @ADA_Center
This work is supported by the Semiconductor Research Corporation (SRC) and DARPA.
What’s Up in the Cloud?
• Virtual μs-scale computing era
  • I/O virtualization (VM #1 … VM #n)
  • Network function virtualization (firewall, address translation, routing, load balancing)
  • Microservices (Server #1, Server #2)
• Service objectives
  • High throughput
  • Low average/tail latency
[Figure: high-speed I/O devices. Image credits: Mellanox, Intel]
Software Stacks: Under Revision
• Then vs. now
[Figure: arrangement of user apps, kernel, CPUs, and I/O devices, then vs. now]
• Kernel-bypass architectures (just a handful)
  • Andromeda [NSDI’18]
  • Arrakis [OSDI’14]
  • IX [OSDI’14]
  • mTCP [NSDI’14]
  • ReFlex [ASPLOS’17]
  • Shenango [NSDI’19]
  • Shinjuku [NSDI’19]
  • Snap [SOSP’19]
  • ZygOS [SOSP’17]
Software Data Planes
• Key mechanisms
  • User-level shared queues
  • Spin-polling cores
  • Fast notification by cache coherence write signals
• Widely adopted in industry
[Logo: SPDK, Storage Performance Development Kit]
Spin-polling: Not a Panacea
• An easy-to-use and fast model for communication and signaling
• But far from ideal, especially when scaled
• We show that spin-based data planes:
  • Perform more work when there is less
  • Are not scalable to many cores
  • Are not scalable to many queues
  • Are not well-suited for shared queues
Outline
• Introduction to Software Data Planes
• Methodology
• Characterization of Software Data Plane Challenges
• Solution Directions
• Conclusion
Methodology
• Setup
  • DPDK-based applications
  • Skylake cores
  • 100GbE Mellanox NIC
• Experiments
  1. Inefficiencies of spin-polling
  2. Lack of queue scalability
  3. Impracticality of queue sharing
Inefficiencies of Spin-polling
• Polling “tax”
  • Body of poll loop
  • Useless polling on idle queues (possibly causing cache misses)
  • Affects throughput scalability with cores
• Poll loop of the routing core (LPM: Longest Prefix Match):
    While forever:
      For each RX queue:
        Read packets from RX queue;
        If there are any packets:
          Route packets using LPM;
          Send packets to TX queue(s);
[Figure: one core polling the RX queues of NIC Port 1 and NIC Port 2]
Polling tax can be 20-28% of total CPU cycles even at 100% load.
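The poll loop above can be sketched in Python to make the "useless polling" term concrete. This is only an illustration, not the DPDK code measured in the talk: `spin_poll`, `lpm_route`, the `(prefix, prefix_len, port)` table layout, and the bounded `iterations` argument (standing in for "while forever") are all assumptions made for the example.

```python
from collections import deque

def lpm_route(dst, table):
    """Longest-prefix match for a 32-bit address.

    `table` holds (prefix, prefix_len, port) tuples; this layout is an
    assumption for the sketch, not DPDK's LPM library.
    """
    best_len, best_port = -1, None
    for prefix, plen, port in table:
        mask = ((1 << plen) - 1) << (32 - plen) if plen else 0
        if (dst & mask) == prefix and plen > best_len:
            best_len, best_port = plen, port
    return best_port

def spin_poll(rx_queues, tx_queues, table, iterations, burst=32):
    """Run the poll loop for a fixed number of iterations and count polls."""
    stats = {"polls": 0, "empty_polls": 0, "forwarded": 0}
    for _ in range(iterations):          # stands in for "while forever"
        for rxq in rx_queues:            # every queue is polled, busy or idle
            stats["polls"] += 1
            pkts = [rxq.popleft() for _ in range(min(burst, len(rxq)))]
            if not pkts:
                stats["empty_polls"] += 1  # useless poll on an idle queue
                continue
            for dst in pkts:
                port = lpm_route(dst, table)
                if port is not None:
                    tx_queues[port].append(dst)
                    stats["forwarded"] += 1
    return stats
```

With four RX queues of which only one ever holds packets, almost every poll in this sketch is an empty one, which is the effect the slide calls useless polling on idle queues.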
IPC != Useful Work
• IPC (Instructions Per Cycle) of routing core at varying loads
[Figure: IPC of routing core (0.00 to 2.75) vs. routing throughput (0 to 30 Mpps), for 1, 4, and 8 queues]
IPC decreases as load increases, so the core executes the most instructions when there is the least work. This results in energy inefficiency, fast aging, and severe co-runner interference.
Effect on SMT Co-runner
• More (useless) instructions executed in lighter traffic
• Co-running:
  • Matrix mult
  • Spin-based routing (0-100% load)
• Executed on:
  • SMT cores of a physical CPU
  • Different physical CPUs
[Figure: IPC of matrix mult (0.0 to 2.5) for three cases: not collocated, collocated with Routing-0%, collocated with Routing-100%. Bar labels as extracted: 2.24, 1.56, 1.54]
Useless spinning wastes execution resources of an SMT co-runner.
Lack of Queue Scalability
• Traffic flows spread among multiple queues
• Limited size of CPU caches: a performance antagonist
• Experiment
  • Forwarding packets by a single core
  • Scaling up the number of queues
[Figure: one core serving the queues of NIC Port 1 and NIC Port 2]
Effect on Latency
• Round-trip latency of packet forwarding
• Light traffic (minimal queuing delay)
[Figure: average latency (0 to 25 μs) vs. number of queues (0 to 512)]
Latency is severely affected as queue heads fall out of L1/L2 caches.
Effect on Peak Throughput
• Balanced traffic: passing through all queues
• Unbalanced traffic: passing through only one queue
[Figure: throughput (0 to 40 Mpps) vs. total number of queues (0 to 512), for balanced and unbalanced traffic]
Cache misses not interleaved with transmits severely hurt peak throughput in unbalanced traffic.
Scale-up Queuing Is Impractical
• (a) Scale-out vs. (b) scale-up queuing (shared queue)
[Figure: (a) scale-out and (b) scale-up queuing for Core 1 … Core n]
• Scale-up queuing
  • Strong theoretical merits
  • Synchronization disadvantage
Scale-out vs. Scale-up
• Processing hiccups cause head-of-line (HoL) blocking in scale-out
• Round-trip latency with 10 parallel cores
  • (a) No hiccups
  • (b) 1 μs processing hiccup with 1% probability
[Figure: panels (a) and (b), average latency (0 to 400 μs) vs. throughput (0 to 60 Mpps), for scale-out and scale-up]
Although effective in avoiding HoL blocking, spin-polling in scale-up queuing saturates at lower loads.
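The head-of-line blocking argument can be illustrated with a small queueing model. This is a sketch under simplifying assumptions, not the experiment from the talk: `scale_out` and `scale_up` are hypothetical helper names, packets are steered round-robin in the scale-out case, arrivals are given in time order, and the synchronization cost that makes real scale-up queuing saturate earlier is deliberately not modeled.

```python
import heapq

def scale_out(arrivals, services, n_cores):
    """Per-core queues: packet i is statically steered to core i % n_cores."""
    free_at = [0.0] * n_cores
    latencies = []
    for i, (t, s) in enumerate(zip(arrivals, services)):
        core = i % n_cores
        start = max(t, free_at[core])    # wait behind earlier packets on this core
        free_at[core] = start + s
        latencies.append(free_at[core] - t)
    return latencies

def scale_up(arrivals, services, n_cores):
    """One shared FIFO queue: the next packet goes to the first core to free up."""
    free_at = [0.0] * n_cores
    heapq.heapify(free_at)
    latencies = []
    for t, s in zip(arrivals, services):
        start = max(t, heapq.heappop(free_at))  # earliest-available core
        heapq.heappush(free_at, start + s)
        latencies.append(start + s - t)
    return latencies
```

For example, with two cores, arrivals at times 0, 1, 2, 3 and service times 10, 1, 1, 1 (the first packet hits a hiccup), the third packet waits behind the slow one in the scale-out model but is picked up by the idle core in the scale-up model.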
Future Data Planes
Solution Direction(s)
• QWAIT, a multi-address monitoring scheme
  • Inspired by x86 MWAIT
  • Avoids polling tax, useless polling, and disruption to SMT co-runners
  • Needs hardware support
• Programming model similar to select-case in Go:
    QWAIT (queue_set):
      case queue_1: process_queue_1();
      …
      case queue_n: process_queue_n();
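The QWAIT programming model can be approximated in software to show what the caller sees. The sketch below is an emulation built on a condition variable, not the proposed hardware mechanism: the `QueueSet` class, its `push` and `qwait` methods, and the `serve` helper are hypothetical names, and the lowest-index-first choice among ready queues is an arbitrary policy picked for the example.

```python
import threading
from collections import deque

class QueueSet:
    """Software stand-in for a set of monitored queues (hypothetical API)."""

    def __init__(self, n):
        self._queues = [deque() for _ in range(n)]
        self._cond = threading.Condition()

    def push(self, idx, item):
        """Producer side: enqueue and wake a waiting consumer."""
        with self._cond:
            self._queues[idx].append(item)
            self._cond.notify()

    def qwait(self, timeout=None):
        """Block (no spinning) until some queue is non-empty.

        Returns (queue_index, item), or None if the timeout expires.
        """
        with self._cond:
            if not self._cond.wait_for(lambda: any(self._queues), timeout):
                return None
            for idx, q in enumerate(self._queues):
                if q:                      # lowest-index ready queue wins
                    return idx, q.popleft()

def serve(queue_set, handlers, count):
    """Dispatch `count` items, one handler per queue (the 'case' arms)."""
    for _ in range(count):
        idx, item = queue_set.qwait()
        handlers[idx](item)
```

The consumer sleeps while every queue is empty and only runs a handler when a producer signals new work, which mirrors the select-case shape on the slide. The hardware proposal would presumably deliver the wake-up without a lock or a kernel-level wait; that part is outside what this emulation shows.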
Conclusion
• Key mechanisms of software data planes
  • User-level shared queues
  • Spin-polling cores
• Although easy-to-use and low-latency, software data planes have deficiencies, especially when scaled
• Using DPDK, we quantified these deficiencies:
  • Incurring polling overhead and useless work
  • Not scalable to many cores/queues
  • Not well-suited for scale-up queuing
Q & A
Thank you!