PASTE: A Network Programming Interface for Non-Volatile Main Memory
Michio Honda (NEC Laboratories Europe), Giuseppe Lettieri (Università di Pisa), Lars Eggert and Douglas Santry (NetApp)
USENIX NSDI 2018
Review: Memory Hierarchy
Slow, block-oriented persistence
● CPU caches: 5-50 ns, byte access w/ load/store
● Main memory: 70 ns, byte access w/ load/store
● HDD / SSD: 100-1000s µs, block access w/ system calls
Review: Memory Hierarchy
Fast, byte-addressable persistence
● CPU caches: 5-50 ns, byte access w/ load/store
● Main memory / NVMM: 70 ns to 1000s of ns, byte access w/ load/store
● HDD / SSD: 100-1000s µs, block access w/ system calls
Networking is faster than disks/SSDs
● 1.2 KB durable write over TCP/HTTP
● Network path, client to server (cables, NICs, TCP/IP, socket API): 23 µs
● SSD write (syscall, PCIe bus, physical media): 1300 µs
Networking is slower than NVMM
● 1.2 KB durable write over TCP/HTTP
● Network path, client to server (cables, NICs, TCP/IP, socket API): 23 µs
● NVMM write (memcpy, memory bus, physical media): 2 µs
Networking is slower than NVMM
● 1.2 KB durable write over TCP/HTTP, with many clients per server
● The server's conventional event loop:
    nevts = epoll_wait(fds);
    for (i = 0; i < nevts; i++) {
        read(fds[i], buf);
        ...
        memcpy(nvmm, buf);
        ...
        write(fds[i], reply);
    }
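For concreteness, here is a self-contained sketch of that conventional loop, with the two copies (kernel socket buffer to user buffer, user buffer to NVMM) and the flush marked. The DAX mapping, sizes, framing and reply handling are hypothetical and not taken from the talk.

    #include <sys/epoll.h>
    #include <unistd.h>
    #include <string.h>
    #include <immintrin.h>   /* _mm_clflush, _mm_sfence */

    #define MAX_EVENTS 64
    #define REQ_SIZE   1280

    /* Flush [p, p+len) from the CPU cache so the data is durable on the NVDIMM. */
    static void persist(const char *p, size_t len)
    {
            for (size_t off = 0; off < len; off += 64)
                    _mm_clflush(p + off);
            _mm_sfence();
    }

    /* Conventional epoll loop: every request is copied twice before it is durable.
     * nvmm is assumed to be an mmap()ed DAX file (e.g., under /mnt/pm); offset
     * management and error handling are omitted. */
    void serve(int epfd, char *nvmm)
    {
            struct epoll_event evs[MAX_EVENTS];
            char buf[REQ_SIZE];
            const char reply[] = "OK";
            size_t nvmm_off = 0;

            for (;;) {
                    int nevts = epoll_wait(epfd, evs, MAX_EVENTS, -1);
                    for (int i = 0; i < nevts; i++) {
                            int fd = evs[i].data.fd;
                            ssize_t n = read(fd, buf, sizeof(buf)); /* copy 1: socket buffer -> user buffer */
                            if (n <= 0)
                                    continue;
                            memcpy(nvmm + nvmm_off, buf, (size_t)n); /* copy 2: user buffer -> NVMM */
                            persist(nvmm + nvmm_off, (size_t)n);
                            nvmm_off += (size_t)n;
                            write(fd, reply, sizeof(reply));         /* send the response */
                    }
            }
    }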
Innovations at both stacks
● Network stack: MegaPipe [OSDI'12], Seastar, mTCP [NSDI'14], IX [OSDI'14], StackMap [ATC'16]
● Storage stack: NVTree [FAST'15], NVWAL [ASPLOS'16], NOVA [FAST'16], Decibel [NSDI'17], LSNVMM [ATC'17]
Stacks are isolated
● Moving data between the network stack and the storage stack is costly
Bridging the gap
● PASTE sits between the two stacks and bridges them
PASTE Design Goals
● Durable zero copy
○ DMA to NVMM
● Selective persistence
○ Exploit modern NICs' DMA to the L3 cache
● Persistent data structures
○ Indexed, named packet buffers backed by a file
● Generality and safety
○ TCP/IP in the kernel and the netmap API
● Best practices from modern network stacks
○ Run-to-completion, blocking, busy-polling, batching, etc.
PASTE in Action
[Figure: the app thread runs in user space; TCP/IP and the NIC are in the kernel. A Ppool (shared memory backed by the file /mnt/pm/pp on the /mnt/pm file system) holds Pbufs and a Pring of slots with a cur pointer; the app keeps a Plog (/mnt/pm/plog) of (pbuf, len, off) entries. Received data crosses the user/kernel boundary with zero copy.]
PASTE in Action
● poll() system call
1. Run NIC I/O and TCP/IP
○ Got 6 in-order TCP segments
○ They are set to Pring slots
● Return from poll()
2. Read data on Pring
3. Flush Pbuf(s)
○ Flush Pbuf data from the CPU cache to the DIMM
○ clflush(opt) instruction (see the sketch below)
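A minimal sketch of step 3, assuming a 64-byte cache line and a CPU with CLFLUSHOPT support (fall back to _mm_clflush otherwise); the helper name is illustrative, not part of the PASTE API.

    #include <immintrin.h>   /* _mm_clflushopt, _mm_sfence */
    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 64

    /* Flush every cache line covering [buf, buf + len) so the Pbuf contents
     * reach the NVDIMM, then fence so the Plog entry written in step 4 is
     * ordered after the data it points to. */
    static inline void flush_pbuf(const void *buf, size_t len)
    {
            uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
            uintptr_t end = (uintptr_t)buf + len;

            for (; p < end; p += CACHE_LINE)
                    _mm_clflushopt((const void *)p);
            _mm_sfence();
    }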
4. Flush Plog entry(ies)
○ A Plog entry (pbuf, len, off), e.g., (1, 96, 120), is a persistent representation of the data
○ The Ppool's base address is static, i.e., it is a file (/mnt/pm/pp)
○ Buffers can therefore be recovered after reboot (see the sketch below)
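A hedged illustration of why recovery works: because the Ppool is the named file /mnt/pm/pp with a fixed layout, a (pbuf, len, off) entry can be resolved back to a data pointer after remapping the file. The struct and function names below are hypothetical, not PASTE's actual on-NVMM format.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical Plog record: which Pbuf holds the data, its length, and the
     * offset of the payload within that Pbuf, e.g., (1, 96, 120). */
    struct plog_entry {
            uint32_t pbuf;   /* Pbuf index within the Ppool */
            uint32_t len;    /* payload length in bytes */
            uint32_t off;    /* payload offset within the Pbuf */
    };

    /* After a reboot, remap /mnt/pm/pp and turn an entry back into a pointer.
     * pbuf_size is the fixed per-Pbuf size chosen when the Ppool was created. */
    static inline void *plog_resolve(void *ppool_base, size_t pbuf_size,
                                     const struct plog_entry *e)
    {
            return (char *)ppool_base + (size_t)e->pbuf * pbuf_size + e->off;
    }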
5. Swap out Pbuf(s)
○ Prevent the kernel from recycling the buffer: the Pring slot gets a fresh Pbuf (8) in place of the logged one (1)
● Same for Pbufs 2 and 6
○ The Plog now holds (1, 96, 120), (2, 96, 768), (6, 96, 987)
6. Update Pring
○ Advance cur
○ Buffers in slots 0-6 are returned to the kernel at the next poll()
PASTE in Action: Write-Ahead Logs
● The Plog of (pbuf, len, off) entries built in steps 1-6 serves as a write-ahead log
PASTE in Action: B+tree
● We can organize various data structures in the Plog, e.g., a B+tree whose leaves hold entries such as (1, 96, 120), (2, 96, 987) and (6, 96, 512)
● A consolidated sketch of the whole receive path follows below
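Putting steps 1-6 together, one iteration of the receive path looks roughly like the C-style pseudocode below. This is a sketch only: ring, pbuf_data, plog_append, pbuf_alloc and next are hypothetical stand-ins, not the actual netmap/PASTE identifiers; flush_pbuf is the flush helper sketched at step 3.

    /* C-style pseudocode for steps 1-6; all identifiers except poll() are
     * hypothetical stand-ins for the PASTE/netmap API. */
    for (;;) {
            poll(&pfd, 1, -1);                    /* 1. kernel runs NIC I/O and TCP/IP */

            while (ring->cur != ring->tail) {     /* walk newly filled Pring slots */
                    struct slot *s = &ring->slot[ring->cur];
                    char *data = pbuf_data(s->pbuf) + s->off;    /* 2. read request data */

                    flush_pbuf(data, s->len);     /* 3. clflushopt + fence (sketch above) */
                    plog_append(plog, s->pbuf,    /* 4. persist the (pbuf, len, off) entry */
                                s->len, s->off);
                    s->pbuf = pbuf_alloc(ppool);  /* 5. swap out: keep the logged Pbuf, */
                                                  /*    give the slot a fresh one */
                    ring->cur = next(ring->cur);  /* 6. advance cur; consumed slots return */
                                                  /*    to the kernel at the next poll() */
            }
    }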
Evaluation
1. How does PASTE outperform existing systems?
2. Is PASTE applicable to existing applications?
3. Is PASTE useful for systems other than file/DB storage?
How does PASTE outperform existing systems?
[Figure: WAL (all writes) throughput for 64 B and 1280 B messages.]
● What if we use more complex data structures?
[Figure: B+tree (all writes) throughput for 64 B and 1280 B messages.]
Is PASTE applicable to existing applications?
● Redis
[Figure: YCSB throughput, read-mostly and update-heavy workloads.]
Is PASTE useful for systems other than DB/file storage?
● Packet logging prior to forwarding
○ Fault-tolerant middlebox [SIGCOMM'15]
○ Traffic recording
● Extend mSwitch [SOSR'15]
○ Scalable NFV backend switch
Conclusion
● PASTE is a network programming interface that:
○ Enables durable zero copy to NVMM
○ Helps apps organize persistent data structures on NVMM
○ Lets apps use TCP/IP and be protected
○ Offers a high-performance network stack even w/o NVMM
https://github.com/luigirizzo/netmap/tree/paste
micchie@sfc.wide.ad.jp or @michioh
Multicore Scalability
● WAL throughput
[Figure: WAL throughput as the number of cores increases.]
Further Opportunity with Co-designed Stacks
● What if we use NVMM with higher access latency?
○ e.g., 3D XPoint
● Overlap flushes and request processing with clflushopt, issuing mfence only before the system call (which triggers packet I/O)
○ See the paper for results
[Figure: timeline alternating "examine request" and clflushopt, then mfence and the system call; new requests are received and responses sent around the system call while the flushes complete.]
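A minimal sketch of that overlap pattern, assuming x86 with CLFLUSHOPT support; struct req, examine_request and the batch driver are hypothetical, only the intrinsics are real.

    #include <immintrin.h>   /* _mm_clflushopt, _mm_mfence */
    #include <stddef.h>
    #include <stdint.h>

    struct req { const char *data; size_t len; };  /* hypothetical request descriptor */
    void examine_request(struct req *r);           /* application logic (hypothetical) */

    /* Issue flushes for one request without waiting, so the CPU can keep
     * examining the next request while the cache lines drain to the NVDIMM. */
    static void flush_async(const char *buf, size_t len)
    {
            for (uintptr_t p = (uintptr_t)buf & ~(uintptr_t)63;
                 p < (uintptr_t)buf + len; p += 64)
                    _mm_clflushopt((const void *)p);
            /* deliberately no fence here */
    }

    /* One batch: overlap flushing with processing, fence once before the
     * system call that triggers packet I/O and sends the responses. */
    void process_batch(struct req *reqs, int n)
    {
            for (int i = 0; i < n; i++) {
                    examine_request(&reqs[i]);
                    flush_async(reqs[i].data, reqs[i].len);
            }
            _mm_mfence();   /* wait for all outstanding flushes */
            /* poll()/system call here: send responses, receive new requests */
    }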
Experiment Setup
● Intel Xeon E5-2640v4 (2.4 GHz)
● HPE 8 GB NVDIMM (NVDIMM-N)
● Intel X540 10 GbE NIC
● Comparison
○ Linux and StackMap [ATC'16] (current state of the art)
○ Fair comparison: both use the same kernel TCP/IP implementation