Protocol stacks and multicore scalability
The evolving hardware-software interface, or: Why we love and hate offload
MSN 2010
Robert N. M. Watson, University of Cambridge
Portions of this work supported by Juniper Networks, Inc.
Idealised network for an OS developer
[Diagram: two end hosts, each with a NIC and "network stack goodness", joined by magic: the somebody else's problem cloud.]
Things are getting a bit sticky at the end host*
* … and at end host-like middle nodes: proxies, application firewalls, anti-spam, anti-virus, …
• Packets-per-second (PPS) scales with bandwidth, but per-core limits have been reached
  ➮ Transition to multicore
• Even today’s bandwidth is achieved only with protocol offload to the NIC
  ➮ But only for specific protocols and workloads
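To get a rough sense of scale, a back-of-the-envelope sketch (link speeds and frame sizes are illustrative, assuming standard Ethernet framing overhead): at a 1500-byte MTU, 10 Gb/s is on the order of 800 thousand packets per second, and more than ten times that at minimum frame size.

```c
/* Back-of-the-envelope line-rate PPS: a sketch, not a measurement.
 * Assumes standard Ethernet overhead per frame: 14-byte header,
 * 4-byte FCS, 8-byte preamble, 12-byte inter-frame gap.
 */
#include <stdio.h>

static double line_rate_pps(double link_bps, unsigned payload_bytes)
{
    unsigned wire_bytes = payload_bytes + 14 + 4 + 8 + 12;
    return link_bps / (wire_bytes * 8.0);
}

int main(void)
{
    /* ~812 kpps at a 1500-byte MTU, ~14.9 Mpps at minimum-size frames */
    printf("10 Gb/s, 1500-byte payload: %.0f pps\n", line_rate_pps(10e9, 1500));
    printf("10 Gb/s,   46-byte payload: %.0f pps\n", line_rate_pps(10e9, 46));
    return 0;
}
```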
Contemporary network stack scalability themes
• Counting instructions ➞ cache misses
• Lock contention ➞ cache line contention
• Locking ➞ finding parallelism opportunities
• Work ordering, classification, distribution
• NIC offload of even more protocol layers
• Vertically integrated work distribution/affinity
Why we love offload
Better performance, no protocol changes*
* It sounds good, so it must be true!
[Diagram: NIC offload features accumulating with link speed]
• 100 Mb/s ➞ 1 Gb/s: PIO ➞ DMA rings; interrupt moderation; checksum offload, VLAN en/decap
• 1 Gb/s ➞ 10 Gb/s: IP fragmentation/TSO/LRO; MultiQ: RSS, CAMs, MIPS, …; full TCP, iSCSI, RDMA, … offload
Reducing effective PPS with offload
TCP Segmentation Offload (TSO)
[Diagram: transmit path from userspace to hardware — the application writes a data stream to the socket (user thread); the kernel copies the data into mbufs + clusters (2k, 4k, 9k, 16k); TCP and IP encapsulate large segments; the device driver inserts them in the descriptor ring; the NIC performs TCP segmentation to MSS, checksum, Ethernet frame encapsulation, and transmit (ithread).]
Move TCP segmentation from the TCP layer to hardware
Reduce effective PPS to improve OS performance
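A minimal sketch of the effect (the constants are hypothetical, and this is not a driver interface): without TSO the stack constructs one packet per MSS; with TSO it hands the NIC one large segment, so the number of packets traversing the host stack for a bulk transmit drops by more than an order of magnitude.

```c
/* Sketch of why TSO reduces host-visible packet work (hypothetical
 * constants, not a real driver interface). Without TSO the stack builds
 * and enqueues one packet per MSS; with TSO it enqueues one large
 * segment and lets the NIC slice it into MSS-sized frames on the wire.
 */
#include <stdio.h>

#define MSS      1448    /* typical MSS with TCP timestamps on a 1500-byte MTU */
#define TSO_MAX  65536   /* largest segment handed to the NIC in one go */

/* Packets the host stack must construct/enqueue to send 'bytes' of data. */
static unsigned host_packets(unsigned bytes, int tso_enabled)
{
    unsigned unit = tso_enabled ? TSO_MAX : MSS;
    return (bytes + unit - 1) / unit;     /* round up */
}

int main(void)
{
    unsigned bytes = 1 << 20;  /* a 1 MB socket write */
    printf("no TSO: %u packets through the stack\n", host_packets(bytes, 0));
    printf("   TSO: %u descriptors; NIC emits ~%u wire frames\n",
           host_packets(bytes, 1), (bytes + MSS - 1) / MSS);
    return 0;
}
```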
Large Receive Offload (LRO)*
[Diagram: receive path from hardware to userspace — the link layer + device driver receives frames, validates Ethernet/IP/TCP checksums, strips the link-layer header, and reassembles consecutive segments; IP and TCP strip their headers and look up the socket; data is delivered to the socket, and the kernel copies mbufs + clusters out to the application as a data stream (ithread ➞ user thread).]
Move TCP segment reassembly from the network protocol to the device driver
* Interestingly, LRO is often done in software
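Since LRO is often done in software, a minimal sketch of the idea (the structure and function names are illustrative, not any real driver's LRO): consecutive, in-order segments of one flow are coalesced into a single larger segment before the stack sees them, so per-packet work is paid once per aggregate.

```c
/* Minimal software-LRO sketch: coalesce consecutive, in-order TCP
 * segments of one flow into a single larger segment before handing it
 * to the network stack. Illustrative only.
 */
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

struct lro_entry {
    uint32_t saddr, daddr;      /* flow identity: IPv4 addresses */
    uint16_t sport, dport;      /*                TCP ports      */
    uint32_t next_seq;          /* sequence number expected next */
    uint8_t  payload[65535];    /* accumulated payload           */
    uint32_t len;               /* bytes accumulated so far      */
    bool     active;
};

/* Try to append one received segment to the aggregation buffer.
 * Returns true if merged; false means the caller should flush the entry
 * and deliver the segment separately (different flow, out of order,
 * buffer full, ...). */
static bool lro_merge(struct lro_entry *e,
                      uint32_t saddr, uint32_t daddr,
                      uint16_t sport, uint16_t dport,
                      uint32_t seq, const uint8_t *data, uint32_t len)
{
    if (!e->active) {
        e->saddr = saddr; e->daddr = daddr;
        e->sport = sport; e->dport = dport;
        e->next_seq = seq;
        e->len = 0;
        e->active = true;
    }
    if (e->saddr != saddr || e->daddr != daddr ||
        e->sport != sport || e->dport != dport ||
        e->next_seq != seq || e->len + len > sizeof(e->payload))
        return false;

    memcpy(e->payload + e->len, data, len);
    e->len += len;
    e->next_seq += len;
    return true;                /* one larger "segment" reaches the stack */
}
```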
Varying TSO and LRO − bandwidth
[Graph: net bandwidth (Gb/s) vs. number of processes (1-8) for three configurations: LRO+TSO, LRO only, and vanilla (TSO and LRO off — used from now on).]
What about the wire protocol?
• Packet format remains the same
• Transmit/receive code essentially identical
• Just shifted segmentation/reassembly
• Effective ACK behaviour has changed!
  • ACK every 6-8 segments instead of every 2 segments!
Managing contention and the search for parallelism*
* Again, try not to change the protocol…
Lock contention
[Graph: percentage breakdown of lock contention by lock as the workload scales from 1 to 4 processes.]
Varying locking strategy − bandwidth
[Graph: net bandwidth (Gb/s) vs. number of processes (1-8) for four configurations: single queue exclusive locking, single queue read locking, multi queue exclusive locking, and multi queue read locking.]
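The lock-granularity extremes being compared, as a pthreads sketch (illustrative only; the experiment also varies NIC queue count, and the in-kernel locks and data structures differ): a single exclusive lock serialises input from every queue, while per-queue locks let queues proceed in parallel and a read lock lets lookups of shared state overlap.

```c
/* Pthreads sketch of two locking strategies (illustrative only):
 * (a) one exclusive lock serialising input from every queue;
 * (b) a read/write lock per receive queue.
 */
#include <pthread.h>

#define NQUEUES 8

static pthread_mutex_t  stack_lock = PTHREAD_MUTEX_INITIALIZER;  /* (a) */
static pthread_rwlock_t queue_lock[NQUEUES];                     /* (b) */

static void locks_init(void)
{
    for (int i = 0; i < NQUEUES; i++)
        pthread_rwlock_init(&queue_lock[i], NULL);
}

/* (a) single queue, exclusive locking: all input contends on one lock */
static void input_single_exclusive(int queue, void *pkt)
{
    (void)queue;
    pthread_mutex_lock(&stack_lock);
    /* ... protocol processing of pkt ... */
    (void)pkt;
    pthread_mutex_unlock(&stack_lock);
}

/* (b) multi queue, read locking: only same-queue writers block us */
static void input_multiqueue_read(int queue, void *pkt)
{
    pthread_rwlock_rdlock(&queue_lock[queue]);
    /* ... read-mostly protocol processing of pkt; connection state
     *     changes would take the write lock instead ... */
    (void)pkt;
    pthread_rwlock_unlock(&queue_lock[queue]);
}
```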
TCP input path: potential dispatch points
[Diagram: receive path — link layer + device driver: receive, validate checksum, strip link-layer header (ithread); IP: interpret and validate checksum, strip IP header (netisr software ithread); TCP: validate checksum, reassemble segments, strip TCP header, look up socket; socket: deliver to socket; application: kernel copies mbufs + clusters out as a data stream (user thread). Work can be handed between threads at each of these boundaries.]
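One of those dispatch choices, as a hedged sketch (hypothetical data structures, not the actual netisr implementation): the driver ithread classifies each packet by flow hash and appends it to a per-CPU protocol queue, so all packets of a flow are processed on one CPU and stay ordered.

```c
/* Deferred-dispatch sketch (illustrative; per-queue locking and thread
 * wakeup are elided): classify by flow hash in the driver ithread and
 * hand the packet to the protocol thread bound to the chosen CPU.
 */
#include <stdint.h>
#include <stddef.h>

#define NCPUS 8

struct pkt {
    uint32_t    flow_hash;   /* e.g., taken from the NIC's RSS hash */
    struct pkt *next;
    /* ... headers, payload ... */
};

static struct cpu_queue {
    struct pkt *head, *tail;
} protoq[NCPUS];

/* Driver ithread: pick a CPU from the flow hash and enqueue the packet. */
static void dispatch_deferred(struct pkt *p)
{
    struct cpu_queue *q = &protoq[p->flow_hash % NCPUS];

    p->next = NULL;
    if (q->tail != NULL)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    /* ... wake the per-CPU protocol thread ... */
}
```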
Work distribution
• Parallelism implies work distribution
• Must keep work ordered
• Establish flow-CPU affinity
• Microsoft Receive-Side Scaling (RSS)
• More fine-grained solutions (CAMs, etc.)
⚠ MTCP watch out!
⚠ The Toeplitz catastrophe
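For reference, RSS hashes the connection 4-tuple with a Toeplitz hash over a secret key and uses low-order bits of the result to pick a receive queue (and so, indirectly, a CPU). A minimal sketch follows; the key is the widely published example key from the RSS verification suite, and the 8-entry indirection is illustrative.

```c
/* Toeplitz hash as used by RSS (sketch). The NIC hashes
 * (src IP, dst IP, src port, dst port) and uses low-order bits of the
 * result to select a receive queue via the indirection table.
 */
#include <stdint.h>
#include <stddef.h>

static const uint8_t rss_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

static uint32_t toeplitz_hash(const uint8_t *data, size_t len)
{
    /* 32-bit window sliding over the key, one bit per input bit */
    uint32_t hash = 0;
    uint32_t window = ((uint32_t)rss_key[0] << 24) |
                      ((uint32_t)rss_key[1] << 16) |
                      ((uint32_t)rss_key[2] << 8)  |
                       (uint32_t)rss_key[3];
    size_t key_bit = 32;

    for (size_t i = 0; i < len; i++) {
        for (int b = 7; b >= 0; b--) {
            if (data[i] & (1u << b))
                hash ^= window;
            window <<= 1;                       /* shift in next key bit */
            if (key_bit < sizeof(rss_key) * 8 &&
                (rss_key[key_bit / 8] & (0x80 >> (key_bit % 8))))
                window |= 1;
            key_bit++;
        }
    }
    return hash;
}

/* Example: map a TCP/IPv4 4-tuple (12 bytes, network byte order) to one
 * of 8 receive queues, as an 8-entry indirection table would. */
static unsigned rss_queue(const uint8_t tuple[12])
{
    return toeplitz_hash(tuple, 12) & 0x7;
}
```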
Varying dispatch strategy − bandwidth
[Graph: net bandwidth (Gb/s) vs. number of processes (1-8) for three dispatch configurations: single, single_link_proto, and multi.]
Why we hate offload
“Layering violations” are not invisible
• Hardware bugs harder to work around
• Instrumentation below the socket layer affected
  • BPF, firewalls, traffic management, etc.
• Interface migration more difficult
• All your protocols were not created equal
  • Not all TOEs equal: SYN, TIMEWAIT, etc.
Protocol implications
• Unsupported protocols and workloads see:
  • Internet-wide PMTU applied to PCI
  • Limited or no checksum offload
  • Ineffectual NIC-side load balancing
• Another nail in the “deploy a new protocol” coffin? (e.g., SCTP, even multi-path TCP)
• Ideas about improving protocol design?
Structural problems
• Replicated implementation and maintenance responsibility
• Difficult field upgrade
• Host vs. NIC interop problems
• Composability problem for virtualisation
• Encodes flow affinity policies in hardware
The vertical affinity problem
Hardware-only RSS
[Diagram: NIC queues 0-7 feed ithreads 0-7 on cores 0-7; below the network stack goodness, distribution to sockets 0-7 and application threads 0-7 on cores 0-7 is awkwardly random — the NIC's choice of core and the application's need not match.]
OS-aligned RSS
[Diagram: the same structure, but NIC queues, ithreads, the network stack, sockets, and application threads are all aligned on the same cores 0-7.]
Is this better?
• Applications can express execution affinity
• How to align with network stack and network interface affinity?
• Sockets API inadequate; easy to imagine simple extensions (see the sketch below), but are they sufficient?
• How to deal with hardware vs. software policy mismatches?
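One such easily imagined extension, sketched purely for illustration: a socket option reporting which CPU the stack/NIC will process a connection's flow on, so the application can pin its worker thread there. SO_FLOW_CPU and its semantics are invented for this sketch and do not exist in any real stack; the affinity call is Linux's pthread_setaffinity_np, used only for concreteness.

```c
/* Hypothetical sockets API extension for aligning application affinity
 * with NIC/stack flow affinity. SO_FLOW_CPU is invented for this sketch.
 */
#define _GNU_SOURCE
#include <sys/socket.h>
#include <pthread.h>
#include <sched.h>

#define SO_FLOW_CPU 0x4000   /* invented option: CPU handling this flow */

/* Pin the calling worker thread to the CPU that the (hypothetical)
 * stack reports as handling this connection's receive processing. */
static int pin_to_flow_cpu(int sock)
{
    int cpu;
    socklen_t len = sizeof(cpu);
    cpu_set_t set;

    if (getsockopt(sock, SOL_SOCKET, SO_FLOW_CPU, &cpu, &len) == -1)
        return -1;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

A server would call something like this once per accepted connection before handing the socket to a worker thread; whether such per-connection hints compose with hardware indirection tables and scheduler policy is exactly the mismatch question above.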
Reality for an OS developer
[Diagram: quite a lot less magic — at each end, application goodness, network stack, NIC, and switches/routers; only the middle remains the somebody else's problem cloud.]
Key research areas
• Explore programmability, debuggability, and traceability of the heterogeneous network stack
• Security implications of intelligent devices, diverse/new execution substrates, and a single intermediate format
• Protocol impact: “end-to-end” endpoints shifting even further
Q&A